diff --git a/ML/notizen/L2_Notizen.md b/ML/notizen/L2_Notizen.md
index a504c7c..4c3e786 100644
--- a/ML/notizen/L2_Notizen.md
+++ b/ML/notizen/L2_Notizen.md
@@ -7,7 +7,7 @@
 ## recap
 
 > [!NOTE] Definition
-algorithm that learns from experience E to solve some tasks T with performance P and P improves with E
+> algorithm that learns from experience E to solve some tasks T with performance P and P improves with E
 
 - Model
     - represents the solution to the tasks T
@@ -253,3 +253,110 @@ Bewertung:
 - conference series
 - research articles
 - data collecting companies and public administrations
+
+## Unsupervised Learning
+
+> [!NOTE] Definition
+> an algorithm learns from experience E to solve some tasks T with performance P if P improves with E
+
+- K-Means Clustering
+    - each sample is represented as a feature vector
+    - example: a large, disorganized collection of documents in a single folder
+    - goal: bring it into a meaningful directory structure
+    - clustering could help
+
+## Clustering
+
+- partition a data set into groups; each group is called a cluster
+    - the Iris data set can be divided into clusters
+    - news articles on Google News
+    - customers into customer groups
+- all of this is **unsupervised classification**
+- based on some similarity measure
+- all instances within a cluster should be similar
+    - similar feature values (numeric values)
+    - and instances in different clusters should be dissimilar
+- first step: form the clusters
+- second step: identify the individual clusters: what exactly characterizes each cluster?
+    - this often has to be done by a human, because the algorithms cannot do it
+- clusters are discovered
+    - for example via the Euclidean distance
+    - similarity between two feature vectors
+
+
+## K-Means
+
+- Idea
+    - creates K clusters
+    - interpret each sample x as a real-valued vector
+        - data preparation: numeric data only
+    - assignment of x to a cluster is based on its distance to the cluster centroids
+        - each cluster has a center (centroid)
+        - a data point is assigned to the cluster whose centroid is closest
+        - the centroid is not known in advance
+            - and it is not static; it shifts
+    - problem: a cluster's centroid is computed from the samples in that cluster
+        - but at the start no clusters are known yet
+        - so this is solved with an iterative process
+
+```
+# select K random samples {c1, c2, ..., cK} as initial approximations of the centroids
+until termination condition:
+    for each sample xi:
+        assign xi to the cluster Cj such that dist(xi, cj) is minimal
+    for each cluster Cj, update the approximation of its centroid:
+        cj = μ(Cj)  # mean of the samples in Cj
+```
+
+### How many clusters K
+
+1. Number of clusters K is given
+    - partition n samples into a predetermined number of clusters
+2. Finding the "right" number of clusters is part of the problem
+    - partition n samples into an appropriate number of clusters
+    - often trial and error
+3. Use an algorithm to determine K automatically
+    - define a function to assess the "quality" of all clusters
+        - e.g. pairwise distance of all samples within a cluster to measure how homogeneous the cluster is
+    - increase K until there is no further quality improvement
+
+### Discussion K-Means
+
+- Advantages
+    - easy to implement and understand (white box)
+- Disadvantages
+    - assumes that clusters are sphere-shaped
+    - number of iterations and resulting clusters depend on seed choice
+        - use a heuristic rather than random picks
+    - algorithm may converge to a local minimum
+        - re-run with different seeds
+        - post-process resulting clusters
+            - split the n "worst" clusters into 2 or more subclusters
+            - merge 2 close clusters (where centroids are close) into one
+    - slow
+        - updating the centroid after each new sample assignment may speed up the process
+
+### Cluster Evaluation Metrics
+
+1. in case we have a classified data set (gold standard)
+    - homogeneity score
+        - between 0 and 1, where 1 means that each computed cluster contains only samples of one gold standard cluster
+    - completeness score
+        - between 0 and 1, where 1 means that all samples from a gold standard cluster are assigned to the same computed cluster
+    - adjusted Rand index (ARI)
+        - measures the overlap (intersection) between computed clusters and gold standard clusters
+        - overlap = number of common items
+        - between -1 and 1, where 1 means the two clusterings are identical
+2. in case the gold standard is not known
+    - then the clustering can only be evaluated geometrically
+    - SSE: sum of squared errors
+      sum of the squared distances of each sample to the centroid of its assigned cluster
+    - silhouette coefficient
+        - how well a point is separated from the neighboring cluster
+        - per sample: (b - a) / max(a, b), where a = average distance to all other points in the same cluster and b = average distance to all points in the nearest other cluster
+        - between -1 and 1, where 1 means dense, well-separated clusters
+
+
+
+
+
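The iterative K-Means loop and the SSE metric from these notes can be sketched in NumPy as follows. This is a minimal illustration, not the lecture's reference implementation: the function name `k_means`, the parameters `max_iter` and `seed`, the termination condition (assignments stop changing), the handling of empty clusters, and the toy data are all assumptions made for the example.

```python
import numpy as np


def k_means(samples, k, max_iter=100, seed=0):
    """Illustrative K-Means loop; returns (centroids, labels, sse)."""
    rng = np.random.default_rng(seed)
    # Select K random samples as the initial approximation of the centroids.
    start = rng.choice(len(samples), size=k, replace=False)
    centroids = samples[start].astype(float)
    labels = np.full(len(samples), -1)
    for _ in range(max_iter):
        # Euclidean distance of every sample to every centroid, shape (n, k).
        diffs = samples[:, None, :] - centroids[None, :, :]
        dists = np.linalg.norm(diffs, axis=2)
        # Assign each sample to the cluster with the closest centroid.
        new_labels = dists.argmin(axis=1)
        # Assumed termination condition: assignments no longer change.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update each centroid to the mean of the samples assigned to it.
        for j in range(k):
            members = samples[labels == j]
            # Assumption: an empty cluster keeps its previous centroid.
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    # SSE: squared distance of each sample to its assigned centroid.
    sse = float(((samples - centroids[labels]) ** 2).sum())
    return centroids, labels, sse


# Two well-separated toy groups (made-up data for illustration).
data = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                 [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels, sse = k_means(data, k=2)
print(labels, round(sse, 3))
```

Re-running with different `seed` values and keeping the run with the lowest SSE is one simple way to apply the "re-run with different seeds" remedy from the discussion above.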