feature(notizen): add notes from l2 afternoon

2026-04-30 14:07:33 +02:00
parent 1c35aa1f79
commit 45b154fee6
1 changed files with 108 additions and 1 deletions
@@ -7,7 +7,7 @@
 ## recap
 > [!NOTE] Definition
-algorithm that learns from experience E to solve some tasks T with performance P and P improves with E
+> algorithm that learns from experience E to solve some tasks T with performance P and P improves with E
 - Model
    - represents the solution to the tasks T
@@ -253,3 +253,110 @@ Bewertung:
    - conference series 
    - research articles
    - data collecting companies and public administrations
 ## Unsupervised Learning
 > [!NOTE] Definition
 > an algorithm learns from experience E to solve some tasks T with performance P if P improves with E
 - K-Means Clustering
    - each sample is represented as a feature vector
    - Example: a large, disorganized collection of documents in a single folder
    - Goal: bring it into a meaningful directory structure
        - clustering could help
 ## Clustering
 - Partition a data set into groups; each group is called a cluster
    - the Iris data set can be divided into clusters
    - news articles on Google News
    - customers into customer groups
 - All of this is **unsupervised classification**
 - Based on some similarity measure
 - all instances within a cluster should be similar
    - similar feature values (numeric values)
    - and instances in different clusters should be dissimilar
 - First step: form the clusters
 - Second step: identify the individual clusters, i.e. what exactly characterizes each cluster?
    - often a human has to do this, because the algorithms cannot
 - Discovering clusters
    - for example by means of the Euclidean distance
        - measures how similar two feature vectors are (a smaller distance means more similar)
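
As an illustration of the Euclidean distance used as a similarity measure, here is a minimal sketch in plain Python. The function name `euclidean_distance` and the two sample vectors are assumptions made for this example; they do not come from the lecture.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length numeric feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


# Hypothetical 2-D feature vectors, chosen only for illustration.
p = [1.0, 2.0]
q = [4.0, 6.0]
print(euclidean_distance(p, q))  # 5.0, since sqrt(3^2 + 4^2) = 5
```

A smaller result means the two samples are more similar under this measure.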
 ## K-Means
 - Idea
    - creates K clusters
    - interpret samples x as real-valued vectors $\vec{x}$
        - data preparation: numeric data only
    - assignment of x to a cluster is based on its distance to the cluster centroids
        - each cluster has a center point (centroid)
        - a data point is assigned to the cluster whose centroid is closest
        - the centroid is not known in advance
            - and it is not static either; it shifts as assignments change
            - Problem: the centroid of a cluster is computed from the data points assigned to it
            - but at the start no clusters are known yet
            - therefore this is solved with an iterative process
 ```
 # select K random samples {c1, c2, ..., cK} as initial approximations of the centroids
 until termination condition:
    for each sample xi:
        assign xi to the cluster Cj such that dist(xi, cj) is minimal
    for each cluster Cj:
        # update the centroid approximation to the mean of the samples in Cj
        cj = μ(Cj)
 ```
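
The iterative process can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not a reference implementation: the names `kmeans` and `mean`, the `max_iter` and `seed` parameters, and the termination condition (assignments no longer change) are choices made for this example.

```python
import math
import random


def mean(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def kmeans(samples, k, max_iter=100, seed=0):
    """Return (centroids, labels) for a list of numeric feature vectors."""
    rng = random.Random(seed)
    # Select K random samples as the initial centroid approximations.
    centroids = [list(c) for c in rng.sample(samples, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each sample joins the cluster of its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(x, centroids[j]))
            for x in samples
        ]
        # Termination condition (an assumption): assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its cluster.
        for j in range(k):
            members = [x for x, lab in zip(samples, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean(members)
    return centroids, labels


# Hypothetical toy data: two visually separate groups of 2-D points.
data = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.1], [8.0, 8.0], [8.2, 7.9], [7.9, 8.3]]
centroids, labels = kmeans(data, k=2)
print(labels)
```

Which cluster receives index 0 or 1 depends on the random initial picks, so only the grouping, not the numbering, is meaningful.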
 ### How many clusters K
 1. Number of clusters K is given
    - partition n samples into predetermined number of clusters
 2. Finding the "right" number of clusters is part of the problem
    - partition n samples into appropriate number of clusters
    - often trial and error
 3. Use an algorithm to determine K automatically
    - define a function to assess the "quality" of all clusters
        - e.g. pairwise distance of all samples within a cluster to measure how homogeneous the cluster is
    - increase K until no further quality improvement
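
Option 3 can be sketched as follows. Everything specific here is an assumption for illustration: the quality function (sum of pairwise distances within clusters, lower is better), the stopping rule (stop once the gain drops below `min_gain` of the K=1 score) and the deterministic farthest-first seeding, which is swapped in for random picks so that the result is reproducible.

```python
import itertools
import math


def farthest_first(samples, k):
    """Deterministic seeding: repeatedly pick the sample farthest from all chosen seeds."""
    centroids = [samples[0]]
    while len(centroids) < k:
        centroids.append(
            max(samples, key=lambda x: min(math.dist(x, c) for c in centroids))
        )
    return centroids


def kmeans_labels(samples, k, max_iter=100):
    """Compact K-Means; returns one cluster label per sample."""
    centroids = farthest_first(samples, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: math.dist(x, centroids[j]))
            for x in samples
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [x for x, lab in zip(samples, labels) if lab == j]
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels


def within_cluster_distance(samples, labels):
    """Sum of pairwise distances between samples that share a cluster."""
    return sum(
        math.dist(samples[a], samples[b])
        for a, b in itertools.combinations(range(len(samples)), 2)
        if labels[a] == labels[b]
    )


def choose_k(samples, max_k, min_gain=0.1):
    """Increase K until the gain falls below min_gain times the K=1 score."""
    baseline = within_cluster_distance(samples, kmeans_labels(samples, 1))
    best_k, best_score = 1, baseline
    for k in range(2, max_k + 1):
        score = within_cluster_distance(samples, kmeans_labels(samples, k))
        if baseline == 0 or (best_score - score) / baseline < min_gain:
            break
        best_k, best_score = k, score
    return best_k


# Hypothetical data: three well-separated groups of three points each.
data = [[0, 0], [0, 1], [1, 0], [10, 0], [10, 1], [11, 0], [0, 10], [0, 11], [1, 10]]
print(choose_k(data, max_k=6))  # 3
```

The threshold `min_gain` is a tuning knob with no universally right value; on less clearly separated data the stopping point is a judgment call, which is why this step often ends up as trial and error.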
 ### Discussion K-Means
 - Advantages 
    - easy to implement and understand (white box)
 - Disadvantages
    - assumes that clusters are sphere-shaped
    - number of iterations and resulting clusters depend on the choice of seeds (initial centroids)
        - use a heuristic rather than random picks
    - algorithm may converge to a local minimum
        - re-run with different seeds
        - post-process resulting clusters
            - split the n "worst" clusters into 2 or more subclusters
            - merge 2 close clusters (where centroids are close) into one
    - slow
        - updating centroid after each new sample assignment may speed up the process
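
The "re-run with different seeds" remedy can be sketched as below. It is a minimal sketch with assumed names (`kmeans_sse`, `best_of_restarts`) and an assumed selection criterion: the run with the lowest sum of squared distances to the assigned centroids is kept.

```python
import math
import random


def kmeans_sse(samples, k, seed, max_iter=100):
    """One K-Means run from a seeded random start; returns (sse, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(samples, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: math.dist(x, centroids[j]))
            for x in samples
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [x for x, lab in zip(samples, labels) if lab == j]
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    sse = sum(math.dist(x, centroids[lab]) ** 2 for x, lab in zip(samples, labels))
    return sse, labels


def best_of_restarts(samples, k, seeds):
    """Run K-Means once per seed and keep the run with the lowest SSE."""
    return min((kmeans_sse(samples, k, s) for s in seeds), key=lambda run: run[0])


# Hypothetical toy data and seeds, chosen only for illustration.
data = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.1], [8.0, 8.0], [8.2, 7.9], [7.9, 8.3]]
sse, labels = best_of_restarts(data, k=2, seeds=range(5))
print(sse, labels)
```

Restarts cannot guarantee the global minimum; they only make a poor local minimum less likely to be the final answer.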
 ### Cluster Evaluation Metrics
 1. in case we have a classified data set (gold standard)
    - homogeneity score
        - between 1 and 0, where 1 means that each computed cluster contains only samples of one gold standard cluster
    - completeness score
        - between 1 and 0 where 1 means that all samples from a gold standard cluster are assigned to the same computed cluster
    - Adjusted Rand index (ARI)
        - measures the overlap (intersection) between the computed clusters and the gold standard clusters
            - overlap = number of common items
        - at most 1, where 1 means the two clusterings are identical; values near 0 indicate a random assignment, and negative values are possible
 2. in case the gold standard is not known
    - then the clustering can only be evaluated geometrically
    - SSE: sum of squared errors
        - sum of the squared distances of each sample to the centroid of its assigned cluster
    - silhouette coefficient
        - how well is a point separated from the neighbouring cluster
        - per sample: (b - a) / max(a, b), where a is the average distance to all other points in the same cluster and b is the average distance to all points in the nearest other cluster
        - between -1 and 1, where 1 means dense, well-separated clusters
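
The two geometric measures can be sketched in plain Python as follows. The function names and the toy data are assumptions for illustration. The silhouette uses the common normalized form (b - a) / max(a, b), and scoring a singleton cluster as 0 is a convention adopted here.

```python
import math


def sse(samples, labels, centroids):
    """Sum of squared distances of each sample to its assigned centroid."""
    return sum(math.dist(x, centroids[lab]) ** 2 for x, lab in zip(samples, labels))


def silhouette(samples, labels):
    """Mean silhouette coefficient over all samples (needs at least 2 clusters)."""
    clusters = {}
    for x, lab in zip(samples, labels):
        clusters.setdefault(lab, []).append(x)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for x, lab in zip(samples, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance of x to itself is 0, so it does not affect the sum).
        a = sum(math.dist(x, y) for y in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(math.dist(x, y) for y in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical 1-D example with two obvious clusters.
data = [[0.0], [1.0], [10.0], [11.0]]
labels = [0, 0, 1, 1]
centroids = [[0.5], [10.5]]
print(sse(data, labels, centroids))  # 1.0
print(silhouette(data, labels))      # roughly 0.9
```

SSE always shrinks as K grows, so it is mainly useful for comparing runs with the same K, whereas the silhouette coefficient can also be compared across different values of K.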