feature(notizen): add notes from l2 afternoon

2026-04-30 14:07:33 +02:00
parent 1c35aa1f79
commit 45b154fee6
+108 -1
@@ -7,7 +7,7 @@
## recap
> [!NOTE] Definition
algorithm that learns from experience E to solve some tasks T with performance P and P improves with E
> algorithm that learns from experience E to solve some tasks T with performance P and P improves with E
- Model
- represents the solution to the tasks T
@@ -253,3 +253,110 @@ Bewertung:
- conference series
- research articles
- data collecting companies and public administrations
## Unsupervised Learning
> [!NOTE] Definition
> an algorithm learns from experience E to solve some tasks T with performance P if P improves with E
- K-Means Clustering
- Each sample is represented as a feature vector
- Example: a large document collection, all jumbled together in a single folder
- Goal: bring it into a sensible directory structure
- Clustering could help
## Clustering
- Partition a data set into groups; each group is called a cluster
- The Iris data set can be divided into clusters
- News articles on Google News
- Customers into customer groups
- All of this is **unsupervised classification**
- Based on some similarity measure
- all instances within a cluster should be similar
- similar feature values (numeric values)
- and instances in different clusters should be dissimilar
- First step: form the clusters
- Second step: identify the individual clusters: what exactly characterizes each cluster?
- Often a human has to do this, because the algorithms cannot
- Clusters are discovered
- e.g. via the Euclidean distance
- similarity between two vectors (feature vectors)
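A minimal sketch of the Euclidean distance as a (dis)similarity measure between two feature vectors. The function name `euclidean_distance` and the example values are assumptions made for illustration; they do not come from the lecture.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# Two hypothetical feature vectors (values chosen for illustration only)
x = [1.0, 2.0]
y = [4.0, 6.0]
print(euclidean_distance(x, y))  # 5.0
```

A small distance means the two samples are similar; a large distance means they are dissimilar.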
## K-Means
- Idea
- creates K clusters
- interpret samples x as real-valued vectors $\vec{x}$
- data preparation: numeric data only
- assignment of x to a cluster is based on its distance to the cluster centroids
- every cluster has a center point (centroid)
- a data point is put into the cluster whose centroid is closest
- the centroid is not known in advance
- and it is not static either; it moves
- Problem: the centroid of a cluster is computed from the data in that cluster
- At the beginning, however, no cluster is known yet
- Therefore this is solved with an iterative process
```
# select K random samples {c1, c2, ..., cK} as initial approximations of the centroids
until termination condition:
    for each sample xi:
        assign xi to the cluster Cj such that dist(xi, cj) is minimal
    for each cluster Cj:
        # update the approximation of its centroid
        cj = μ(Cj)   # mean of all samples in Cj
```
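The pseudocode above can be sketched in plain Python as follows. This is one possible reading, not the lecture's reference implementation: the termination condition ("assignments no longer change"), the `max_iter` and `seed` parameters, the handling of empty clusters, and the toy data are all assumptions.

```python
import math
import random

def mean_vector(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def kmeans(samples, k, max_iter=100, seed=0):
    """Return (centroids, labels) for a list of numeric feature vectors."""
    rng = random.Random(seed)
    # Select K random samples as initial approximations of the centroids.
    centroids = [list(c) for c in rng.sample(samples, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each sample goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(x, centroids[j]))
            for x in samples
        ]
        # Termination condition (an assumption): assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its cluster.
        for j in range(k):
            members = [x for x, lab in zip(samples, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ran empty
                centroids[j] = mean_vector(members)
    return centroids, labels

# Toy data for illustration only: two visually separated groups.
samples = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
           [8.0, 8.0], [8.2, 7.9], [7.9, 8.3]]
centroids, labels = kmeans(samples, k=2)
print(labels)
print(centroids)
```

Which cluster gets index 0 or 1 depends on the random initial picks, so only the grouping is meaningful, not the label values.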
### How many clusters K
1. Number of clusters K is given
- partition n samples into predetermined number of clusters
2. Finding the "right" number of clusters is part of the problem
- partition n samples into appropriate number of clusters
- often trial and error
3. Use an algorithm to determine K automatically
- define a function to assess the "quality" of all clusters
- e.g. pairwise distance of all samples within a cluster to measure how homogeneous the cluster is
- increase K until no further quality improvement
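A sketch of option 3, under assumptions: `mean_pairwise_distance` is one way to express the pairwise-distance quality measure (lower means more homogeneous), and `choose_k` is a hypothetical stopping rule that returns the first K after which the relative improvement falls below a threshold `min_gain`. The scores passed to `choose_k` are made-up toy values, not measurements.

```python
import math
from itertools import combinations

def mean_pairwise_distance(cluster):
    """Average distance over all pairs of samples in one cluster.

    Returns 0.0 for clusters with fewer than two samples.
    """
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

def choose_k(quality_by_k, min_gain=0.1):
    """Return the smallest K after which the score stops improving noticeably.

    quality_by_k maps K -> score, where lower is better.
    min_gain is a hypothetical threshold on the relative improvement.
    """
    ks = sorted(quality_by_k)
    for k, next_k in zip(ks, ks[1:]):
        current, nxt = quality_by_k[k], quality_by_k[next_k]
        if current == 0 or (current - nxt) / current < min_gain:
            return k
    return ks[-1]

# A spread-out two-sample cluster (toy values).
print(mean_pairwise_distance([[0.0, 0.0], [3.0, 4.0]]))  # 5.0

# Made-up quality scores per K, for illustration only.
toy_scores = {1: 10.0, 2: 4.0, 3: 1.5, 4: 1.45, 5: 1.44}
print(choose_k(toy_scores))  # 3
```

In practice the scores would come from running the clustering once per K and averaging the quality measure over all resulting clusters.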
### Discussion K-Means
- Advantages
- easy to implement and understand (white box)
- Disadvantages
- assumes that clusters are sphere-shaped
- the number of iterations and the resulting clusters depend on the choice of initial seeds
- use a heuristic rather than random picks
- algorithm may converge to a local minimum
- re-run with different seeds
- post-process resulting clusters
- split the n "worst" clusters into 2 or more subclusters
- merge 2 close clusters (where centroids are close) into one
- can be slow on large data sets
- updating the centroid after each new sample assignment may speed up the process
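The last point, updating a centroid right after a sample is assigned, can be done without re-summing the whole cluster by using the running-mean identity c_new = c + (x - c) / (n + 1). The helper `add_to_centroid` below is a hypothetical name and only covers adding a sample; a full variant would also have to remove the sample from its previous cluster.

```python
def add_to_centroid(centroid, count, x):
    """Return the updated (centroid, count) after adding sample x to a cluster.

    Uses the running mean c_new = c + (x - c) / (n + 1), so the existing
    cluster members do not have to be summed again.
    """
    new_count = count + 1
    new_centroid = [c + (xi - c) / new_count for c, xi in zip(centroid, x)]
    return new_centroid, new_count

# Toy example: a cluster that starts with the single sample [1.0, 1.0].
centroid, count = [1.0, 1.0], 1
centroid, count = add_to_centroid(centroid, count, [3.0, 5.0])
print(centroid, count)  # [2.0, 3.0] 2
centroid, count = add_to_centroid(centroid, count, [2.0, 0.0])
print(centroid, count)  # [2.0, 2.0] 3
```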
### Cluster Evaluation Metrics
1. in case we have a classified data set (gold standard)
- homogeneity score
- between 1 and 0, where 1 means that each computed cluster contains only samples of one gold standard cluster
- completeness score
- between 1 and 0 where 1 means that all samples from a gold standard cluster are assigned to the same computed cluster
- Adjusted Rand index (ARI)
- the overlap between the computed clusters and the gold standard clusters is calculated (intersection)
- overlap = number of common items
- between -1 and 1, where 1 means the two clusterings are identical
2. in case the gold standard is not known
- then the clustering can only be evaluated geometrically
- SSE: sum of squared errors
- sum of the squared distances of each sample to the centroid of its assigned cluster
- silhouette coefficient
- how well is a point separated from the neighboring cluster
- (b - a) / max(a, b), where a = average distance of the sample to all other points in the same cluster and b = average distance of the sample to all points in the nearest other cluster
- between -1 and 1, where 1 means dense, well-separated clusters
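A sketch of the two geometric measures for the case without a gold standard. It uses the usual silhouette definition s = (b - a) / max(a, b) per sample, with a the mean distance to the other points of the same cluster and b the mean distance to the points of the nearest other cluster, averaged over all samples. The function names, the score of 0.0 for single-sample clusters, and the toy data are assumptions for illustration.

```python
import math

def sse(samples, labels, centroids):
    """Sum of squared distances of each sample to its assigned centroid."""
    return sum(math.dist(x, centroids[lab]) ** 2
               for x, lab in zip(samples, labels))

def silhouette(samples, labels):
    """Mean silhouette coefficient over all samples (needs >= 2 clusters)."""
    clusters = {}
    for x, lab in zip(samples, labels):
        clusters.setdefault(lab, []).append(x)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for x, lab in zip(samples, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for single-sample clusters
            continue
        # a: mean distance to the other samples in the same cluster
        # (the distance of x to itself is 0, so it does not affect the sum)
        a = sum(math.dist(x, y) for y in own) / (len(own) - 1)
        # b: mean distance to the samples of the nearest other cluster
        b = min(
            sum(math.dist(x, y) for y in other) / len(other)
            for other_lab, other in clusters.items() if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Toy data for illustration only: two tight groups far apart.
samples = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
good_labels = [0, 0, 1, 1]
bad_labels = [0, 1, 0, 1]
centroids = [[0.0, 0.5], [10.0, 0.5]]

print(sse(samples, good_labels, centroids))  # 1.0
print(silhouette(samples, good_labels))      # close to 1
print(silhouette(samples, bad_labels))       # negative
```

A lower SSE and a silhouette closer to 1 both point to a better clustering, but SSE always shrinks as K grows, so it is only comparable between runs with the same K.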