feature(notizen): add notes from l2 afternoon
+108
-1
@@ -7,7 +7,7 @@
## recap

> [!NOTE] Definition
> algorithm that learns from experience E to solve some tasks T with performance P and P improves with E

- Model
  - represents the solution to the tasks T
@@ -253,3 +253,110 @@ Bewertung:
- conference series
- research articles
- data collecting companies and public administrations

## Unsupervised Learning

> [!NOTE] Definition
> an algorithm learns from experience E to solve some tasks T with performance P if P improves with E

- K-Means Clustering
  - each sample is represented as a feature vector
  - example: a large document collection sitting in one messy folder
    - goal: sort it into a meaningful directory structure
    - clustering could help

## Clustering

- partition a data set into groups; each group is called a cluster
  - the Iris data set can be divided into clusters
  - news articles on Google News
  - customers into customer groups
- all of this is **unsupervised classification**
- based on some similarity measure
  - all instances within a cluster should be similar
    - similar feature values (numeric values)
  - and instances in different clusters should be dissimilar
- step 1: form the clusters
- step 2: identify the individual clusters: what exactly characterizes each cluster?
  - this often has to be done by a human, because the algorithms cannot do it
- how clusters are discovered
  - for example via the Euclidean distance
    - measures the similarity between two feature vectors: the smaller the distance, the more similar they are

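As a small illustration of the similarity measure mentioned above, here is a minimal sketch of the Euclidean distance between two feature vectors. The function name `euclidean_distance` and the three sample vectors are made up for this example and do not come from the lecture.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two numeric feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


# Hypothetical feature vectors: p and q are close together, r is far away.
p = [1.0, 2.0]
q = [1.5, 2.5]
r = [8.0, 9.0]

d_pq = euclidean_distance(p, q)
d_pr = euclidean_distance(p, r)
# A smaller distance means "more similar", so p and q are candidates for the same cluster.
print(d_pq < d_pr)
```
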
## K-Means

- Idea
  - creates K clusters
  - interpret each sample x as a real-valued vector
    - data preparation: numeric data only
  - assignment of x to a cluster is based on its distance to the cluster centroids
    - every cluster has a center point (its centroid)
    - a data point goes into the cluster whose centroid is closest
  - the centroids are not known in advance
    - and they are not static: they shift during the process
  - problem: the centroid of a cluster is computed from the data points in that cluster
    - but at the start no clusters are known yet
    - so this is solved with an iterative process

```
# select K random samples {c1, c2, ..., cK} as initial approximation of the centroids
until termination condition:
    for each sample xi:
        assign xi to the cluster Cj such that dist(xi, cj) is minimal
    for each cluster Cj:
        # update the approximation of the centroid to the mean of the cluster
        cj = μ(Cj)
```

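The pseudocode above can be turned into a short runnable sketch. This is one possible reading of it, not the lecture's reference implementation: the termination condition ("no sample changes its cluster"), the `max_iter` cap, the `seed` parameter, and the handling of empty clusters are all assumptions, and the sample data is made up.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two numeric vectors of equal length."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def k_means(samples, k, max_iter=100, seed=0):
    """Return (centroids, assignment); assignment[i] is the cluster index of samples[i]."""
    rng = random.Random(seed)
    # Select K random samples as the initial approximation of the centroids.
    centroids = [list(c) for c in rng.sample(samples, k)]
    assignment = None
    for _ in range(max_iter):
        # Assign each sample xi to the cluster Cj whose centroid cj is nearest.
        new_assignment = [
            min(range(k), key=lambda j: dist(x, centroids[j])) for x in samples
        ]
        # Termination condition (an assumption): stop once no sample changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Update each centroid to the mean of the samples assigned to it.
        for j in range(k):
            members = [x for x, a in zip(samples, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


# Hypothetical 2-D samples forming two loose groups.
samples = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.1], [8.0, 8.0], [8.2, 7.9], [7.9, 8.3]]
centroids, assignment = k_means(samples, k=2)
print(centroids, assignment)
```

Which group ends up with index 0 or 1 depends on the random initial picks, so only the grouping itself is meaningful, not the index values.
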
### How many clusters K

1. Number of clusters K is given
   - partition n samples into a predetermined number of clusters
2. Finding the "right" number of clusters is part of the problem
   - partition n samples into an appropriate number of clusters
   - often trial and error
3. Use an algorithm to determine K automatically
   - define a function to assess the "quality" of all clusters
     - e.g. pairwise distance of all samples within a cluster to measure how homogeneous the cluster is
   - increase K until there is no further quality improvement

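Option 3 can be sketched as follows. This is only an illustration under several assumptions: the quality function is the mean pairwise distance within clusters (lower is better), "no further improvement" is read as "relative gain below `min_gain`", and each K is tried with several random seeds. The names `choose_k`, `min_gain`, and `restarts`, the threshold value, and the sample data are all hypothetical.

```python
import math
import random


def dist(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def k_means(samples, k, seed, max_iter=100):
    """Plain K-means; returns one list of member samples per cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(samples, k)
    assignment = None
    for _ in range(max_iter):
        new = [min(range(k), key=lambda j: dist(x, centroids[j])) for x in samples]
        if new == assignment:
            break
        assignment = new
        for j in range(k):
            members = [x for x, a in zip(samples, assignment) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return [[x for x, a in zip(samples, assignment) if a == j] for j in range(k)]


def mean_pairwise_distance(clusters):
    """Quality function (lower is better): mean distance over all pairs sharing a cluster."""
    total, pairs = 0.0, 0
    for members in clusters:
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                total += dist(members[i], members[j])
                pairs += 1
    return total / pairs if pairs else 0.0


def choose_k(samples, max_k, min_gain=0.2, restarts=25):
    """Increase K until the relative quality gain drops below min_gain (assumed threshold)."""
    best_k, best_q = 1, mean_pairwise_distance([samples])
    for k in range(2, max_k + 1):
        # Several seeds per K, because a single run may end in a poor local minimum.
        q = min(mean_pairwise_distance(k_means(samples, k, s)) for s in range(restarts))
        if best_q - q < min_gain * best_q:
            break
        best_k, best_q = k, q
    return best_k


# Hypothetical data: three well-separated groups of three points each.
samples = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
           [10.0, 0.0], [10.0, 1.0], [11.0, 0.0],
           [5.0, 9.0], [5.0, 10.0], [6.0, 9.0]]
print(choose_k(samples, max_k=6))
```

A plain "stop when quality no longer improves" rule would rarely trigger, since splitting clusters further almost always shrinks within-cluster distances a little. That is why the sketch uses a relative threshold; the right value depends on the data.
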
### Discussion K-Means

- Advantages
  - easy to implement and understand (white box)
- Disadvantages
  - assumes that clusters are sphere-shaped
  - number of iterations and resulting clusters depend on seed choice
    - use a heuristic rather than random picks
  - algorithm may converge to a local minimum
    - re-run with different seeds
    - post-process resulting clusters
      - split the n "worst" clusters into 2 or more subclusters
      - merge 2 close clusters (where centroids are close) into one
  - slow
    - updating the centroid after each new sample assignment may speed up the process

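The notes do not say which seeding heuristic is meant. As one possible example, the sketch below uses a farthest-first strategy: start with the first sample, then repeatedly pick the sample that is farthest from all seeds chosen so far. The function name `farthest_first_seeds` and the data are hypothetical, and this is a swapped-in technique for illustration, not necessarily the heuristic from the lecture.

```python
import math


def dist(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def farthest_first_seeds(samples, k):
    """Pick k initial centroids that are spread out over the data set."""
    seeds = [samples[0]]  # deterministic start: the first sample
    while len(seeds) < k:
        # Choose the sample whose distance to its nearest existing seed is largest.
        next_seed = max(samples, key=lambda x: min(dist(x, s) for s in seeds))
        seeds.append(next_seed)
    return seeds


# Hypothetical data: three well-separated groups of three points each.
samples = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
           [10.0, 0.0], [10.0, 1.0], [11.0, 0.0],
           [5.0, 9.0], [5.0, 10.0], [6.0, 9.0]]
seeds = farthest_first_seeds(samples, 3)
print(seeds)
```

On this toy data the three seeds come from three different groups, which avoids the bad start where two initial centroids fall into the same group. A known weakness of this simple variant is that it tends to pick outliers as seeds.
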
### Cluster Evaluation Metrics

1. in case we have a classified data set (gold standard)
   - homogeneity score
     - between 0 and 1, where 1 means that each computed cluster contains only samples of one gold standard cluster
   - completeness score
     - between 0 and 1, where 1 means that all samples from a gold standard cluster are assigned to the same computed cluster
   - adjusted Rand index (ARI)
     - measures the overlap (intersection) between the computed clusters and the gold standard clusters
     - overlap = number of common items
     - between -1 and 1, where 1 means the two clusterings are identical and values around 0 correspond to random assignment
2. in case the gold standard is not known
   - the clustering can then only be assessed geometrically
   - SSE: sum of squared errors
     - sum of squared distances of each sample to the centroid of its assigned cluster
   - silhouette coefficient
     - how well is a point separated from the neighboring cluster
     - per sample: (b - a) / max(a, b), where a = average distance to all other points in the same cluster and b = average distance to all points in the next nearest cluster
     - between -1 and 1, where 1 means dense, well-separated clusters

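To make three of these metrics concrete, here is a minimal pure-Python sketch of SSE, the mean silhouette coefficient, and the adjusted Rand index. The function names, the convention that a single-sample cluster gets silhouette 0, and the toy data are assumptions for this illustration. In practice a library implementation would normally be used instead.

```python
import math
from collections import Counter


def dist(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def sse(samples, labels):
    """Sum of squared distances of each sample to the centroid of its cluster."""
    total = 0.0
    for label in set(labels):
        members = [x for x, l in zip(samples, labels) if l == label]
        centroid = [sum(col) / len(members) for col in zip(*members)]
        total += sum(dist(x, centroid) ** 2 for x in members)
    return total


def silhouette(samples, labels):
    """Mean silhouette coefficient over all samples (needs at least 2 clusters)."""
    scores = []
    for i, (x, own) in enumerate(zip(samples, labels)):
        by_cluster = {}  # cluster label -> distances from x to that cluster's other points
        for j, (y, l) in enumerate(zip(samples, labels)):
            if j != i:
                by_cluster.setdefault(l, []).append(dist(x, y))
        if own not in by_cluster:
            scores.append(0.0)  # assumed convention for a single-sample cluster
            continue
        a = sum(by_cluster[own]) / len(by_cluster[own])
        b = min(sum(d) / len(d) for l, d in by_cluster.items() if l != own)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def adjusted_rand_index(gold, computed):
    """ARI from pair counts; needs at least 2 samples."""
    def pairs(n):
        return n * (n - 1) // 2

    same_both = sum(pairs(c) for c in Counter(zip(gold, computed)).values())
    same_gold = sum(pairs(c) for c in Counter(gold).values())
    same_comp = sum(pairs(c) for c in Counter(computed).values())
    expected = same_gold * same_comp / pairs(len(gold))
    max_index = (same_gold + same_comp) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both clusterings put everything in one cluster
    return (same_both - expected) / (max_index - expected)


# Hypothetical data: two tight pairs of points far apart.
samples = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
good = [0, 0, 1, 1]  # matches the geometry
bad = [0, 1, 0, 1]   # mixes the two pairs
print(sse(samples, good), silhouette(samples, good), adjusted_rand_index(good, bad))
```

The ARI only compares which samples are grouped together, so relabeling the clusters (for example swapping 0 and 1) does not change the score.
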