feature(notizen): add notes from l2 afternoon
+108
-1
@@ -7,7 +7,7 @@
## recap

> [!NOTE] Definition
> An algorithm learns from experience E to solve some tasks T with performance P if P improves with E.

- Model
  - represents the solution to the tasks T

@@ -253,3 +253,110 @@ Bewertung:
- conference series
- research articles
- data collecting companies and public administrations

## Unsupervised Learning

> [!NOTE] Definition
> An algorithm learns from experience E to solve some tasks T with performance P if P improves with E.

- K-Means Clustering
  - each sample is represented as a feature vector
  - example: a large document collection lying unsorted in a single folder
    - the goal is to bring it into a meaningful directory structure
    - clustering could help

## Clustering

- Partition a data set into groups; each group is called a cluster
  - the Iris data set can be divided into clusters
  - news articles on Google News
  - customers into customer groups
- All of this is **unsupervised classification**
- Based on some similarity measure
  - all instances within a cluster should be similar
    - similar feature values (numeric values)
  - and instances in different clusters should be dissimilar
- Step one: form the clusters
- Step two: identify the individual clusters: what exactly characterizes each cluster?
  - often a human has to do this, because the algorithms cannot
- Discovering clusters
  - for example via the Euclidean distance
    - similarity between two vectors (feature vectors)

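
The Euclidean distance mentioned above is d(x, y) = sqrt(Σ (x_i − y_i)²) over all features i. A minimal sketch in plain Python; the function name `euclidean_distance` and the sample values are illustrative and not taken from the lecture:

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    # Square the per-feature differences, sum them, take the square root.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


# Two hypothetical 2-dimensional samples.
print(euclidean_distance((0.0, 0.0), (3.0, 4.0)))  # 5.0
```

A smaller distance means the two samples are more similar. Since Python 3.8, the standard library's `math.dist` computes the same value.
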
## K-Means

- Idea
  - creates K clusters
  - interprets each sample x as a real-valued vector
    - data preparation: numeric data only
  - the assignment of x to a cluster is based on its distance to the cluster centroids
    - every cluster has a center point (centroid)
    - a data point goes into the cluster whose centroid is closest
  - the centroid is not known in advance
    - and it is not static either: it shifts
  - Problem: a cluster's centroid is computed from the data in the set
    - but at the start no cluster is known yet
    - so this is solved with an iterative process

```
# select K random samples {c1, c2, ..., cK} as initial approximations of the centroids
until termination condition:
    for each sample xi:
        assign xi to the cluster Cj such that dist(xi, cj) is minimal
    for each cluster Cj:
        # update the approximation of the centroid: mean of the samples in Cj
        cj = μ(Cj)
```

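
The pseudocode above can be sketched as runnable Python. This is one possible reading, not the lecture's reference implementation. The name `k_means`, the parameters `max_iter` and `seed`, the termination condition (stop when the assignments no longer change) and the handling of empty clusters (keep the old centroid) are all assumptions made for illustration.

```python
import math
import random


def k_means(samples, k, max_iter=100, seed=0):
    """Cluster `samples` (a list of equal-length numeric tuples) into k groups.

    Returns (labels, centroids); labels[i] is the cluster index of samples[i].
    """
    rng = random.Random(seed)
    # Select K random samples as initial approximations of the centroids.
    centroids = [tuple(c) for c in rng.sample(samples, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each sample goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(x, centroids[j]))
            for x in samples
        ]
        # Assumed termination condition: assignments did not change.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its cluster.
        for j in range(k):
            members = [x for x, lab in zip(samples, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return labels, centroids


data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = k_means(data, k=2)
print(labels, centroids)
```

Because the initial centroids are drawn at random, a different `seed` can lead to a different result.
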
### How many clusters K

1. Number of clusters K is given
   - partition n samples into a predetermined number of clusters
2. Finding the "right" number of clusters is part of the problem
   - partition n samples into an appropriate number of clusters
   - often trial and error
3. Use an algorithm to determine K automatically
   - define a function to assess the "quality" of all clusters
     - e.g. pairwise distance of all samples within a cluster to measure how homogeneous the cluster is
   - increase K until there is no further quality improvement

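
Option 3 can be sketched as follows. The quality function here is the sum of squared distances to the centroids, which is a swapped-in choice: the notes suggest pairwise distances within a cluster. That sum can always be lowered by adding clusters, so "no further improvement" is read as "relative improvement below a threshold". The threshold `min_improvement=0.2`, the deterministic farthest-first seeding and all function names are assumptions for this illustration.

```python
import math


def farthest_first(samples, k):
    """Deterministic seeding: start with the first sample, then repeatedly
    add the sample that is farthest from all centroids chosen so far."""
    centroids = [samples[0]]
    while len(centroids) < k:
        centroids.append(
            max(samples, key=lambda x: min(math.dist(x, c) for c in centroids))
        )
    return centroids


def k_means_sse(samples, k, max_iter=100):
    """Run K-Means and return the sum of squared distances to the centroids."""
    centroids = farthest_first(samples, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: math.dist(x, centroids[j]))
            for x in samples
        ]
        if new_labels == labels:  # assignments are stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [x for x, lab in zip(samples, labels) if lab == j]
            if members:
                centroids[j] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return sum(math.dist(x, centroids[lab]) ** 2 for x, lab in zip(samples, labels))


def choose_k(samples, k_max, min_improvement=0.2):
    """Increase K until the relative quality gain drops below the threshold."""
    prev = k_means_sse(samples, 1)
    for k in range(2, k_max + 1):
        if prev == 0:  # already a perfect fit, more clusters cannot help
            return k - 1
        cur = k_means_sse(samples, k)
        if (prev - cur) / prev < min_improvement:
            return k - 1
        prev = cur
    return k_max


# Three well-separated groups of four points each (made-up data).
data = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 10), (10, 11), (11, 10), (11, 11),
        (20, 0), (20, 1), (21, 0), (21, 1)]
print(choose_k(data, k_max=6))  # 3
```

The threshold is a tuning knob: a lower value tends to accept more clusters, a higher value stops earlier.
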
### Discussion K-Means

- Advantages
  - easy to implement and understand (white box)
- Disadvantages
  - assumes that clusters are sphere-shaped
  - number of iterations and resulting clusters depend on the seed choice
    - use a heuristic rather than random picks
  - algorithm may converge to a local minimum
    - re-run with different seeds
    - post-process the resulting clusters
      - split the n "worst" clusters into 2 or more subclusters
      - merge 2 close clusters (whose centroids are close) into one
  - slow
    - updating the centroid after each new sample assignment may speed up the process

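
The last point, updating a centroid right after each assignment, does not require recomputing the mean over the whole cluster. A minimal sketch using the running-mean identity; the helper names `add_to_centroid` and `remove_from_centroid` are hypothetical:

```python
def add_to_centroid(centroid, count, x):
    """Return (new_centroid, new_count) after sample x joins the cluster.

    Uses the running-mean identity: mean_new = mean + (x - mean) / n_new.
    """
    new_count = count + 1
    new_centroid = tuple(c + (xi - c) / new_count for c, xi in zip(centroid, x))
    return new_centroid, new_count


def remove_from_centroid(centroid, count, x):
    """Return (new_centroid, new_count) after sample x leaves the cluster."""
    if count <= 1:
        raise ValueError("cannot remove the last sample of a cluster")
    new_count = count - 1
    new_centroid = tuple(c - (xi - c) / new_count for c, xi in zip(centroid, x))
    return new_centroid, new_count


# A cluster that starts with the single sample (0, 0).
centroid, n = (0.0, 0.0), 1
centroid, n = add_to_centroid(centroid, n, (2.0, 2.0))
centroid, n = add_to_centroid(centroid, n, (4.0, 1.0))
print(centroid, n)  # (2.0, 1.0) 3
```

Each update costs time proportional to the number of features, independent of the cluster size.
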
### Cluster Evaluation Metrics

1. in case we have a classified data set (gold standard)
   - homogeneity score
     - between 0 and 1, where 1 means that each computed cluster contains only samples of one gold standard cluster
   - completeness score
     - between 0 and 1, where 1 means that all samples from a gold standard cluster are assigned to the same computed cluster
   - Adjusted Rand index (ARI)
     - the overlap (intersection) between the computed clusters and the gold standard clusters is measured
     - overlap = number of common items
     - values up to 1, where 1 means the two clusterings are identical; values around 0 correspond to a random assignment
2. in case the gold standard is not known
   - the clustering can then only be judged geometrically
   - SSE: sum of squared errors
     - sum of squared distances of each sample to the centroid of its assigned cluster
   - silhouette coefficient
     - how well is a point separated from the neighboring cluster
     - s = (b - a) / max(a, b), where a is the average distance of the sample to all other points in the same cluster and b is the average distance of the sample to all points in the nearest other cluster
     - between -1 and 1, where values near 1 mean dense, well-separated clusters

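
The two metrics that work without a gold standard can be sketched in plain Python. The function names and the example data are illustrative. Scoring a single-sample cluster as 0 is a common convention assumed here, not something stated in the notes.

```python
import math


def sse(samples, labels, centroids):
    """Sum of squared distances of each sample to its cluster's centroid."""
    return sum(
        math.dist(x, centroids[lab]) ** 2 for x, lab in zip(samples, labels)
    )


def silhouette(samples, labels):
    """Mean silhouette coefficient s = (b - a) / max(a, b) over all samples."""
    clusters = {}
    for x, lab in zip(samples, labels):
        clusters.setdefault(lab, []).append(x)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for x, lab in zip(samples, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance from x to itself is 0, so it adds nothing to the sum).
        a = sum(math.dist(x, y) for y in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(math.dist(x, y) for y in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = [0, 0, 1, 1]  # left points together, right points together
bad = [0, 1, 0, 1]   # each cluster mixes left and right points
print(sse(data, good, [(0.0, 1.0), (10.0, 1.0)]))  # 4.0
print(silhouette(data, good), silhouette(data, bad))
```

On this toy data the sensible grouping gets a clearly positive silhouette value and the mixed grouping a negative one.
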