cas-pml/ML/notizen/L2_Notizen.md

# Notizen Lektion 2

>Thema: Einführung Practical Machine Learning 2
>Datum: 22.04.2026
>Dozent: Jürgen Vogel

## recap

> [!NOTE] Definition
> algorithm that learns from experience E to solve some tasks T with performance P and P improves with E

- Model
    - represents the solution to the tasks T
    - is learnt and adapted based on E
    - can be evaluated with respect to P
- Features
    - are the relevant part of the data E for creating the model
    - may have to be designed explicitly depending on the ML algorithm
- Categorization with respect to
    - experience E: supervised vs. unsupervised vs. reinforcement leanring
    - tasks T: clustering vs. classification vs. regresseion
    - human-readable model: white box vs. black box
- Project
    - agile/iterative development (CRISP-DM)
- Key Challenges
    - definition of T that is both solvable and generates value
    - large amounts of high quality data E
    - feature engineering
    - dealing with 95% models

## Evaluation

### How good is the machine learning system?

- returned result is good if it solves the problem at hand
    - may be qualitative or quantitative
    - may be subjective (user need, context, and preferences)
    - may change over time
    - also depends on factors such as credibility, specificity, exhaustivitiy, recency, clarity, interpretability... of the result

- Beispiel Suchmaschine: Eine Reihe von Keywords werden in eine Suchmaschine eingegeben
    - Wann ist die Antwort der Suchmaschine "gut"?
        - Schwirig zu beantworten, da es sich von Nutzer zu Nutzer unterscheided
    - Casual User: Frage aus generellem Context -> generellere Antwort okay
        - "Wo ist in Laufdistanz ein Restaurant, das offen ist"
            - Man will nicht das bestmögliche Setting finden und alle Restaurants finde
        - Schnelles Ergebnis und gut genug
    - Expert User: Recherchiert sehr detailierte Informationen
        - Umfangreiche Analyse machen
        - Was gibts alles für wiss. Literatur zum Thema?
        - Was sind die besten Verfahren?
        - Informationsbedürfnis sehr hoch

- thus, the ML system needs to be assessed in "real-life" situations
    - often with user involvement
    - similar methods as with user requirements research
        - usability tests, interviews, field studies, log analysis
    - but this takes time and is costly

### Metrics SR/ER

- Wichtig:
    - Success Rate
    - Error Rate

- Success
    - Result is correct -> ein einzelnes Sample ist richtig klassifiziert worden
    - success rate -> durschnitt über grössere Menge samples
        - nennt man auch accuracy
- Error
    - Result is incorrect -> ein einzelnes Sample ist falsch
    - error rate -> durschnitt über grössere Menge samples

- Beides ist eine 1/0 Betrachtung -> Entweder falsch oder richtig

- Bsp: Wie viele Personen sind auf Bild
    - Modell sagt 3 Personen
    - Auf Bild sind 5 Personen
    - Wie bewertet man das?
        - falsch? -> 100% error
        - ein bisschen richtig? 3/5 erkannt 2/5 fehler

- Generalisieren wir die Erfolgsrate erhält man
    - our ML system takes some test data D as input and produces some results
        - D -> {r'1, ... r'n}
        - e.g. if r'i are from a list of predefined labels , we call this classification
    - the test data also includes the expected result "gold standard"
        - D -> {r1, ..., rn}
    - for the test setting, we define some comparison functions
        - c(r, r') = 1 if r = r', 0 else # vergleichsfunktion
    - then we can calculate the success rate SR as
        - SR = (1/n)*sum(i=1, n, c(ri, r'i))

### Precision and Recall for Binary Classification

- Bsp. Suchmaschine -> Man will evaluieren ob das Modell gut funktioniert
    - Für eine Suchanfrage wurde ein Test Set zusammengestellt
    - Manuell bewertet (Gold Standard):
        - Man weiss für jeden Eintrag: Website passt oder passt nicht

Bewertung:

|                     | positive gold        | negative gold       |
| ------------------- | -------------------  | ------------------- |
| positive classified | true positive (TP)   | false positive (FP) |
| negative classified | false negatives (FN) | true negative (TN)  |

- True Positives: Classifier bewertet positiv, Goldstandard sagt positiv
- True Negatives: Classifier sagt negativ und das stimmt auch
- False Negatives: Classifier sagt nicht negativ, Goldstandard sagt aber positiv
    - das ist ein Fehler
    - Bsp. Suchmaschine: Die Suchmaschine liefert ein Suchresultat nicht zurück obwohl es relevant wäre
- False Positives: Classifier sagt positive, das stimmt aber nicht
    - das ist ein weiterer Fehler
    - Bsp. Suchmaschine: Die Suchmaschine liefert ein nichtrelevantes Suchresultat zurück

- Daraus abgeleitete Metriken:
    - **Precision**
        - Menge der TP in Bezug auf die insgesamt positiven Samples (gemäss Gold Standard)
        - Wenn mein Algorithmus keinen Fehler macht dann hat man 100% precision
        - P = TP / (Class p Classified)
        - Bsp.: Wieviele der angezeigten Webseiten, sind gemäss Gold Standard wirklich relevant?
    - **Recall**
        - Wie hoch ist der Anteil der False Negatives gemäss Gold Standard
        - R = TP / (Class p Gold)
        - Bsp. Welche Seiten die der Mensch (Gold Standard) als relevant klassifiziert hat, werden tatsächlich angezeigt?
            - Perfekt wenn all relevanten Seiten angezeigt wurden
            - Schlecht wenn keine relevanten Seiten gefunden wurden
- Erweiterte Metrik: Confusion Matrix


- Precision vs Recall
    - There is often a trafe-off between Precision and Recall
    - improving the algorithm towards one weakens the other
        - Will ich das Modell in richtung Precision verbessern, wird der Recall schlechter und umgekehrt
        - Entweder das eine oder andere kann optimiert werden
        - Bspw. Suchmaschine: Einfach alles anzeigen, dann gibts keine False Negatives weil das Gesuchte immer gefunden wird
            - Die Precision wird dabei aber sehr sehr schlecht, weil ganz viele False Positives dabei sind
            - 100% Recall 0% Precision
        - Oft muss ein Kompromiss getroffen werden zwischen Precision und Recall
            - Die Entscheidung was optimiert werden soll, muss vom Entwicklungsteam getroffen werden
        - Precision-oriented users
            - Web Surfers
        - Recall-oriented users
            - Professional searches, legal, etc

- Dafür gibt es aber folgendes Hilfsmittel: **F-measure**
    - Das gewichtete, harmonische Mittel zwischen Precision und Recall
        - Formel: Skript Seite 7
        - F = 1/( alpha* 1/P + [1-alpha] * 1/R) = (beta^2 + 1)PR / (beta^2P+R) = beta^2 = 1 - alpha / alpha
    - Ist parametrisierbar
        - Beta < emphasize precision
        - Beta > emphasize recall

### Other metrics

- the generalization of our binary classifier result matrix (classification result vs. gold standard) is called a confusion matrix
    - many different metrics can be derived from this
        - https.//en.wikipedia.org/wiki/Confusion_matrix
    - other widely used metrics include ROC, K-S, gail/lift, ...
- for specific ML problems and algorithms many additional metrics exists

## Automated Evaluation Workflow

- How can we automatie evaluation
    1. define a controlled test set (benchmarks)
        - collection of data (labeled)
        - one or more tasks to be solved by the ML system
        - expected results
            - created by (typically) human domain experts
    2. execute ML system for test set
    3. compare computed results against expected results

### Evaluation Goals

- Compare a solution with ...
    - different configuration options
        - bspw. feingranulare parametrisierung in einem decision tree (gini parameter bspw)
    - alternative solutions
    - a basic solution ("baseline")
        - abschätzung nach unten
    - the industry and/or academic leader ("state-of-the-art")
        - abschätzung nach oben
    - human performance ("gold standard")
        - auch eine abschätzung nach oben, welche man machen sollte
        - Mensch macht auch 1-2% Fehler
    - itself over time

## Using Data for Training and Testing

- ML Methods usually require fine-tuning for good quality
- Trainingsdaten dürfen nicht gleichzeitig zum Testen verwendet werden, es muss aufgeteilt werden
- Wie splitte ich in Trainingsdaten und Testdaten?
    - Modell wird besser wenn mehr trainingsdaten gegeben werden
    - aber man will auch möglichst viele Daten fürs Testen, damit Metrik verbessert wird
    - Konflikt
- Einfache Methode: 80% Training 20% Testing
    - Verteilung wichtig! Einfach die vorderen 80% fürs Training nehmen und die hinteren 20% als Test ist nicht gut
        - Daten sind oft sortiert
        - Beide Datenmengen müssen repräsentativ sein!
    - Quick fix: Random Number generator verwenden (**rand-split** in scikit-learn)
        - wenn die gesamtdatenmenge gross genug ist, geht das relativ gut auf
        - im Mittel hat man eine gute Verteilung
        - Problematisch wird das wenn das Klassifikationsproblem stark unbalanciert ist
            - Websuche: Datenmenge 100k Webseiten, davon sind 20 relevant -> winzige relevante Menge (p class) und eine grosse (n class)

    - Besser als Random-Split: K-Fold Cross Validation


### K-Fold Cross Validation

- Wie geht man möglichst effizient mit den gelabelten (gold standard) Daten um?
    - Es geht nicht gleichzeitig aber nacheinander
        - erst trainieren, dann testen
- How to split gold standard data into test and trainin set such that
    - we have enough training data
    - our test results are not biased
- k-fold cross validation
    - split data into k folds (Aufteilung in k Teile)
    - use (k-1) for training, 1 for testing
    - repeat k times
- Siehe Grafik Skript Seite 14

- Aufteilung auf 3 Folds
    - Jedes Sample wurde jeweils einmal zum training und zum testen verwendet
    - Man hat ein Problem, wenn sich die Metriken (Fehlerquoten) beim Testen von Fold zu Fold sich stark unterscheiden
    - Weiteres Problem: Modell sehr sensibel auf Trainingsdaten
        - Zu wenig Trainingsdaten vorhanden
        - Unterschiede von Fold zu Fold sehr gross -> Könnte heissen, dass das Modell nicht stabil ist
    - Kommen von Fold zu Fold aber gleich gute Resultate zurück (Varianz und Standardabweichung gleich)
        - Gutes Zeichen für das Modell

- In der Praxis arbeitet man nicht mit 3 sonder mit 10 Folds
    - 10 fold cross validation
        - 90/10
    - Wird statistisch besser
    - Setzt voraus, das man genug grosse Datenmengen hat

## Dataset Challenges

- Potential Problems: is the dataset...
    - correct?
    - large enough?
    - representative?
    - cause overfitting? -> Zu viele eintönige Daten, und das Modell lernt eine Niche
- for many application domains, large datasets are available
    - not all free but still cost saving
    - allows to compare approaches in a larger community
- where to search
    - wikipedia
    - kaggle
    - research groups at universities
    - conference series
    - research articles
    - data collecting companies and public administrations

## Unsupervised Learning

> [!NOTE] Definition
> an algorithm learns from experience E to solve some tasks T with performance P if P improves with E

- K-Means Clustering
    - Sample Vektor repräsentiert als feature vektor
    - Grosse Dokumentensammlung mit grossem Durcheinander in einem Folder
    - Man möchte das in eine sinnvolle Verzeichnisstruktur bringen
        - Clustering könnte helfen

## Clustering

- Datensatz in Gruppen einteilen, jede Gruppe nennt man Cluster
    - Iris Datenset kann in Clusters unterteilt werden
    - News Artikel auf Google News
    - Kunden in Kundengruppen
- Das alles ist **unsupervised Classification**
- Based on some similarity measure
- all instances within a cluster should be similar
    - ähnliche feature Werte (numerische Werte)
    - and instances in different clusters should be dissimilar
- Erster Punkt: Cluster Bilden
- Zweiter Punkt: Einzelne Cluster identifizieren, was genau ist das Merkmal von jedem Cluster?
    - Häufig muss dies der Mensch machen, da die Algorithmen das nicht können
- Cluster discovered
    - Beispielsweise durch die Euklidsche Distanz
        - ähnlichkeit zwischen zwei Vektoren (Feature Vektor)


## K-Means

- Idea
    - creates K clusters
    - interpret samples x as real-valued vectors x-> (vector)
        - data preparation: numeric data only
    - assignement of x to a cluster is based on its distance to the cluster centroids
        - jeder Cluster hat einen Mittelpunkt (centroids)
        - der Datenpunkt kommt in den Cluster rein wo die Distanz am kleinsten ist
        - Mittelpunkt (centroid) ist im Vorhinein nicht bekannt
            - und ist auch nicht statisch, der verschiebt sich
            - Problem: Der Mittelpunkt des Clusters wird berechnet aufgrund der Daten im Set
            - Am Anfang ist aber noch kein Cluster bekannt
            - Daher löst man das mit einem iterativen Prozess

```
# select K random samples {c1, c2, ... , ck} as approximatrion of centroids
until termination condition
    for each sample xi:
        assign xi to the cluster Cj such that dist(xj,cj) is minimal
    for each cluster Cj update the approximations of centroids
        cj = u(Cj)
```

### How many clusters K

1. Number of clusters K is given
    - partition n samples into predetermined number of clusters
2. Finding the "right" number of cluster is part of the problem
    - partition n samples into appropriate number of clusters
    - often try and error
3. Use an algorithm to determine K automatically
    - define a function to assess the "qualits" of all clusters
        - e.g. pairwise distance of all samples within a cluster to measure how homogenous the cluster is
    - increase K until no further quality improvement

### Discussion K-Means

- Advantages
    - easy to implement and understand (white box)
- Disadvantages
    - assumes that clusters are sphere-shaped
    - number of iterations and resulting clusters results depend on seed choice
        - use heuristic rather than random picks
    - algorihm may converge on local minima
        - re-run with different seeds
        - post-process resulting clusters
            - split the n "worst" clusters into 2 or more subclusters
            - merge 2 close clusters (where centroids are close) into one
    - slow
        - updating centroid after each new sample assignment may speed up the process

### Cluster Evaluation Metrics

1. in case we have a classified data set (gold standard)
    - homogeneity score
        - between 1 and 0, where 1 means that each computed cluster contains only samples of one gold standard cluster
    - completeness score
        - between 1 and 0 where 1 means that all samples from a gold standard cluster are assigned to the same computed cluster
    - Adjusted rand index (ARI)
        - Überlappung zwischen berechneten Cluster und gold standard cluster wird berechnet (schnittmenge)
            - overlap = number of common items
        - between -1 or 1, where 1 means equality
2. in case the gold standard is not known
    - dann kann man es nur noch geometrisch bewerten
    - SSE: sum of squared error (Fehlerquadratsumme)
        sum of squared distance of each sample to the centroid of its assigned cluster
    - silhouette coefficient
        - wie gut ist der punkt abgegrenzt vom Nachbarcluster
        - average distance of sample to all other points in the same cluster - average distance of sample to all other points in the next nearest cluster
    - between -1 and 1 where 1 means dense clusters

## Naive Bayes

- Supervised learning

### Bayes' Theorem

- mathematical law about conditional probabilities
- given two events A and B, then the conditional probability P(B|A) relates to the probability that event B occurs after A has occured
   - applies e.g. when we are blindl drawing samples from a bag containing red an black alls without returning the balls
   - assume we start wit 2 red and 2 black balls
        - then P(red) = 2/4 = P(black) for the first draw
   - if we draw a red ball first, then we know upfront for the 2nd draw
        - P(red|red) = 1/3 and P(black|red) = 2/3
- probability that two conditionally-related events P(A) and P(B|A) occur one after another: P(A^B) = P(A) * P(B|A)
    - e.g probability to first draw red and then red again: P(red^red) = P(red)*P(red|red) = 2/4 * 1/3 = 1/6
- P(A|B), P(B|A) may be calculated as
    - P(A|B) = P(A^B) / P(B) and P(B|A) = P(A^B) / P(A)
    - in the drawing example: P(red|red) = 2/4 * 1/3 / 2/4 = 1/3
- from this we can derive Bayes' theorem
    - P(A|B) = P(B|A) * P(A) / P(B)

### Using Bayes' Theorem for Classification

- Our events are A = class c, B = sample x.
- We calculate P(c|x)
    - x -> Ich bekomme ein sample
    - P(c|x) -> wie hoch ist die Wahrscheinlichkeit dass das sample x in der klasse c ist?
- given x, our classifier thus
    1. calculates P(c|x) for all c
    2. selects c with the highes P(c|x)
- P(c|x) = P(x|c) * P(c) / P(x)
    - P(c) -> probability of a class c -> kann aus den Samples ausgelesen werden -> Zählen wieviele Samples hab ich in einer Klasse
        - es gilt: n = n_samples within c / n_samples
    - P(x) -> wie häufig hab ich ein bestimmtes sample in meinen Trainingsdaten? -> Meist nur ein einziges mal, manchmal ein identisches sample mehrmals
        - wenn nur einmal gilt: n = 1 / n_samples
    - P(x|c) -> wie häufig hab ich einen Feature Vektor in einer bestimmten Klasse -> Optimalerweise nur einmal -> Bei einer klasse wahrscheinlichkeit = 1 und bei allen anderen 0
        - man schaut sich die einzelnen Features an und schaut welche kombinationen von Features sind einen Hinweis auf eine Klasse
        - P(f1, f2, ..., fn|c) where fi are the sample's features
        - we assume that fi are independent from each other given a class c
            - thus we can calculate P(f1,f2,...,fn|c) = P(f1|c) * P(f2|c) * ... * P(fn|c)
            - 3 different NB variants for calculating P(fi|c)
                - Gaussian: fi are continuous and P(fi|c) are normally distributed
                - multinomial: P(fi|c) = n_fi in c / n_fi occures overall
                - bernoulli: binary features
        - in general this independence assumption is wrong
            - within c certain features are often correlated
            - thus the name: naive bayes classifier (NB)
            - but the NB classifier performs well in practice despite this "naive" assumption
- Advantages
    - very fast learning and testing
    - performed super auf grossen Datensätzen
    - low storage requirements
    - optimal if the independence assumptions hold
        - typischerweise hat man aber keine unabhängigen features
    - very good in domains with many equally important features
        - robust gegen schlechtes feature engineering
        - bspw.: Um zu erkennen ob das ein Brief ist schau ich ob ein Datum drauf ist -> Das ist Mist
            - für decicion tree fatal
            - für naive bayes hat man halt nur einen neuen Skalierungsfaktor
        - sehr fehlertolerant
- Disadvantages
    - Often not the best classifier