feature(notizen): add notes from l2 afternoon

2026-04-30 14:07:33 +02:00
parent 1c35aa1f79
commit 45b154fee6
1 changed files with 108 additions and 1 deletions
@@ -7,7 +7,7 @@
 ## recap

 > [!NOTE] Definition
-algorithm that learns from experience E to solve some tasks T with performance P and P improves with E
+> algorithm that learns from experience E to solve some tasks T with performance P and P improves with E

 - Model
    - represents the solution to the tasks T
@@ -253,3 +253,110 @@ Bewertung:
    - conference series 
    - research articles
    - data collecting companies and public administrations
+
+## Unsupervised Learning
+
+> [!NOTE] Definition
+> an algorithm learns from experience E to solve some tasks T with performance P if P improves with E
+
+- K-Means Clustering
+    - each sample is represented as a feature vector
+    - example: a large, disorganized collection of documents in a single folder
+    - the goal is to sort it into a meaningful directory structure
+        - clustering could help
+
+## Clustering
+
+- Partition a data set into groups; each group is called a cluster
+    - the Iris data set can be divided into clusters
+    - news articles on Google News
+    - customers into customer segments
+- All of this is **unsupervised classification**
+- Based on some similarity measure
+- all instances within a cluster should be similar
+    - similar feature values (numeric values)
+- and instances in different clusters should be dissimilar
+- Step one: form the clusters
+- Step two: identify the individual clusters: what exactly characterizes each cluster?
+    - often a human has to do this, because the algorithms cannot
+- How clusters are discovered
+    - for example via the Euclidean distance
+        - measures the similarity of two feature vectors (the smaller the distance, the more similar)
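
The Euclidean distance mentioned above can be sketched in a few lines. This is only an illustration: the function name `euclidean_distance` and the three feature vectors are made up and do not come from the lecture.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length numeric feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical feature vectors (made-up values for illustration).
p = [1.0, 2.0]
q = [4.0, 6.0]
r = [1.5, 2.5]

d_pq = euclidean_distance(p, q)  # differences 3 and 4, so the distance is 5.0
d_pr = euclidean_distance(p, r)  # r lies much closer to p than q does
```

A smaller distance means the two samples are more similar, so `p` would rather share a cluster with `r` than with `q`.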
+
+
+## K-Means
+
+- Idea
+    - creates K clusters
+    - interpret each sample x as a real-valued vector
+        - data preparation: numeric data only
+    - assignment of x to a cluster is based on its distance to the cluster centroids
+        - every cluster has a center point (its centroid)
+        - a data point joins the cluster whose centroid is at the smallest distance
+        - the centroid is not known in advance
+            - and it is not static either; it shifts
+            - problem: a cluster's centroid is computed from the data points assigned to it
+            - but at the start no clusters are known yet
+            - this is therefore solved with an iterative process
+
+```
+# select K random samples {c1, c2, ..., cK} as initial approximations of the centroids
+until termination condition:
+    for each sample xi:
+        assign xi to the cluster Cj such that dist(xi, cj) is minimal
+    for each cluster Cj, update the approximation of its centroid:
+        cj = μ(Cj)    # mean of all samples currently assigned to Cj
+```
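
The pseudocode can be turned into a small runnable sketch using only the Python standard library. Several details are assumptions that the notes leave open: the termination condition (stop when assignments no longer change, or after `max_iter` rounds), the handling of empty clusters (keep the old centroid), and the names `k_means`, `mean_vector`, `max_iter`, and `seed`. The data is made up.

```python
import math
import random


def mean_vector(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def k_means(samples, k, max_iter=100, seed=0):
    """Cluster numeric vectors into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Pick K distinct random samples as the initial centroid approximations.
    centroids = [list(c) for c in rng.sample(samples, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each sample goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(x, centroids[j]))
            for x in samples
        ]
        # Termination condition (an assumption): assignments stopped changing.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its cluster.
        for j in range(k):
            members = [x for x, lab in zip(samples, labels) if lab == j]
            if members:  # keep the old centroid if a cluster went empty
                centroids[j] = mean_vector(members)
    return centroids, labels


# Made-up data: two groups far apart along the first axis.
data = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]]
centroids, labels = k_means(data, k=2)
```

For this toy data every choice of two starting samples leads to the same two groups. That is not true in general, which is why the seed choice matters.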
+
+### How many clusters K
+
+1. Number of clusters K is given
+    - partition n samples into predetermined number of clusters
+2. Finding the "right" number of clusters is part of the problem
+    - partition n samples into an appropriate number of clusters
+    - often trial and error
+3. Use an algorithm to determine K automatically
+    - define a function to assess the "quality" of all clusters
+        - e.g. pairwise distance of all samples within a cluster to measure how homogeneous the cluster is
+    - increase K until no further quality improvement
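
One possible reading of option 3 is sketched below. `mean_pairwise_distance` is one way to score how homogeneous a single cluster is, and `pick_k` stops increasing K once the relative improvement falls below a threshold. Both function names, the threshold `min_relative_gain`, and the SSE numbers are invented for illustration; in practice the scores would come from actually running K-Means once per K.

```python
import math
from itertools import combinations


def mean_pairwise_distance(cluster):
    """Average distance over all pairs of samples in one cluster (lower = more homogeneous)."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def pick_k(score_by_k, min_relative_gain=0.1):
    """Return the first K after which one more cluster improves the score by less than the threshold.

    score_by_k maps K to a quality score where lower is better.
    """
    ks = sorted(score_by_k)
    for k, next_k in zip(ks, ks[1:]):
        current = score_by_k[k]
        if current == 0:
            return k  # already perfect, nothing left to gain
        gain = (current - score_by_k[next_k]) / current
        if gain < min_relative_gain:
            return k
    return ks[-1]


# Made-up scores, as if K-Means had been run once for each K.
sse_by_k = {1: 400.0, 2: 120.0, 3: 30.0, 4: 28.5, 5: 27.9}
best_k = pick_k(sse_by_k)  # going from K=3 to K=4 gains only 5 percent
```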
+
+### Discussion K-Means
+
+- Advantages 
+    - easy to implement and understand (white box)
+- Disadvantages
+    - assumes that clusters are sphere-shaped
+    - the number of iterations and the resulting clusters depend on the choice of seeds
+        - use a heuristic rather than random picks
+    - the algorithm may converge to a local minimum
+        - re-run with different seeds
+        - post-process resulting clusters
+            - split the n "worst" clusters into 2 or more subclusters
+            - merge 2 close clusters (where centroids are close) into one
+    - slow
+        - updating centroid after each new sample assignment may speed up the process
+
+### Cluster Evaluation Metrics
+
+1. in case we have a classified data set (gold standard)
+    - homogeneity score
+        - between 0 and 1, where 1 means that each computed cluster contains only samples of one gold standard cluster
+    - completeness score
+        - between 0 and 1, where 1 means that all samples from a gold standard cluster are assigned to the same computed cluster
+    - Adjusted Rand index (ARI)
+        - the overlap (intersection) between the computed clusters and the gold standard clusters is computed
+            - overlap = number of common items
+        - between -1 and 1, where 1 means the two clusterings are identical
+2. in case the gold standard is not known
+    - then the clustering can only be assessed geometrically
+    - SSE: sum of squared errors
+        - sum of the squared distances of each sample to the centroid of its assigned cluster
+    - silhouette coefficient
+        - how well is a point separated from the neighboring cluster
+        - a = average distance of the sample to all other points in the same cluster
+        - b = average distance of the sample to all points in the nearest other cluster
+        - silhouette of the sample = (b - a) / max(a, b)
+        - between -1 and 1, where 1 means dense, well-separated clusters
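
The two geometric metrics can be sketched with the standard library alone. This assumes Euclidean distance and scores singleton clusters as 0, which is a common convention rather than something stated in the notes. The function names and the toy data are made up.

```python
import math


def sse(samples, labels, centroids):
    """Sum of squared distances of each sample to the centroid of its assigned cluster."""
    return sum(math.dist(x, centroids[lab]) ** 2 for x, lab in zip(samples, labels))


def silhouette(samples, labels):
    """Mean silhouette coefficient over all samples (needs at least two clusters)."""
    clusters = {}
    for x, lab in zip(samples, labels):
        clusters.setdefault(lab, []).append(x)
    scores = []
    for x, lab in zip(samples, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: mean distance to the other points of the same cluster
        # (the distance of x to itself is 0, so it adds nothing to the sum)
        a = sum(math.dist(x, y) for y in own) / (len(own) - 1)
        # b: smallest mean distance to the points of any other cluster
        b = min(
            sum(math.dist(x, y) for y in pts) / len(pts)
            for other, pts in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up data with one sensible and one deliberately poor labelling.
data = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]]
good = silhouette(data, [0, 0, 1, 1])
bad = silhouette(data, [0, 1, 0, 1])
total_error = sse(data, [0, 0, 1, 1], [[0.5, 0.0], [10.5, 0.0]])
```

The sensible labelling scores close to 1 and the poor one scores below 0, matching the interpretation of the range given above.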
+
+
+
+
+