bfh/cas-pml

Fork 0

Files

T

aaron e9ac770dd6

feature(notizen): add notes from l2 afternoon

2026-04-30 16:12:24 +02:00

19 KiB

Raw Blame History

Notizen Lektion 2

Thema: Einführung Practical Machine Learning 2 Datum: 22.04.2026 Dozent: Jürgen Vogel

recap

[!NOTE] Definition algorithm that learns from experience E to solve some tasks T with performance P and P improves with E

Model
- represents the solution to the tasks T
- is learnt and adapted based on E
- can be evaluated with respect to P
Features
- are the relevant part of the data E for creating the model
- may have to be designed explicitly depending on the ML algorithm
Categorization with respect to
- experience E: supervised vs. unsupervised vs. reinforcement leanring
- tasks T: clustering vs. classification vs. regresseion
- human-readable model: white box vs. black box
Project
- agile/iterative development (CRISP-DM)
Key Challenges
- definition of T that is both solvable and generates value
- large amounts of high quality data E
- feature engineering
- dealing with 95% models

Evaluation

How good is the machine learning system?

returned result is good if it solves the problem at hand
- may be qualitative or quantitative
- may be subjective (user need, context, and preferences)
- may change over time
- also depends on factors such as credibility, specificity, exhaustivitiy, recency, clarity, interpretability... of the result
Beispiel Suchmaschine: Eine Reihe von Keywords werden in eine Suchmaschine eingegeben
- Wann ist die Antwort der Suchmaschine "gut"?
  - Schwirig zu beantworten, da es sich von Nutzer zu Nutzer unterscheided
- Casual User: Frage aus generellem Context -> generellere Antwort okay
  - "Wo ist in Laufdistanz ein Restaurant, das offen ist"
    - Man will nicht das bestmögliche Setting finden und alle Restaurants finde
  - Schnelles Ergebnis und gut genug
- Expert User: Recherchiert sehr detailierte Informationen
  - Umfangreiche Analyse machen
  - Was gibts alles für wiss. Literatur zum Thema?
  - Was sind die besten Verfahren?
  - Informationsbedürfnis sehr hoch
thus, the ML system needs to be assessed in "real-life" situations
- often with user involvement
- similar methods as with user requirements research
  - usability tests, interviews, field studies, log analysis
- but this takes time and is costly

Metrics SR/ER

Wichtig:
- Success Rate
- Error Rate
Success
- Result is correct -> ein einzelnes Sample ist richtig klassifiziert worden
- success rate -> durschnitt über grössere Menge samples
  - nennt man auch accuracy
Error
- Result is incorrect -> ein einzelnes Sample ist falsch
- error rate -> durschnitt über grössere Menge samples
Beides ist eine 1/0 Betrachtung -> Entweder falsch oder richtig
Bsp: Wie viele Personen sind auf Bild
- Modell sagt 3 Personen
- Auf Bild sind 5 Personen
- Wie bewertet man das?
  - falsch? -> 100% error
  - ein bisschen richtig? 3/5 erkannt 2/5 fehler
Generalisieren wir die Erfolgsrate erhält man
- our ML system takes some test data D as input and produces some results
  - D -> {r'1, ... r'n}
  - e.g. if r'i are from a list of predefined labels , we call this classification
- the test data also includes the expected result "gold standard"
  - D -> {r1, ..., rn}
- for the test setting, we define some comparison functions
  - c(r, r') = 1 if r = r', 0 else # vergleichsfunktion
- then we can calculate the success rate SR as
  - SR = (1/n)*sum(i=1, n, c(ri, r'i))

Precision and Recall for Binary Classification

Bsp. Suchmaschine -> Man will evaluieren ob das Modell gut funktioniert
- Für eine Suchanfrage wurde ein Test Set zusammengestellt
- Manuell bewertet (Gold Standard):
  - Man weiss für jeden Eintrag: Website passt oder passt nicht

Bewertung:

	positive gold	negative gold
positive classified	true positive (TP)	false positive (FP)
negative classified	false negatives (FN)	true negative (TN)

True Positives: Classifier bewertet positiv, Goldstandard sagt positiv
True Negatives: Classifier sagt negativ und das stimmt auch
False Negatives: Classifier sagt nicht negativ, Goldstandard sagt aber positiv
- das ist ein Fehler
- Bsp. Suchmaschine: Die Suchmaschine liefert ein Suchresultat nicht zurück obwohl es relevant wäre
False Positives: Classifier sagt positive, das stimmt aber nicht
- das ist ein weiterer Fehler
- Bsp. Suchmaschine: Die Suchmaschine liefert ein nichtrelevantes Suchresultat zurück
Daraus abgeleitete Metriken:
- Precision
  - Menge der TP in Bezug auf die insgesamt positiven Samples (gemäss Gold Standard)
  - Wenn mein Algorithmus keinen Fehler macht dann hat man 100% precision
  - P = TP / (Class p Classified)
  - Bsp.: Wieviele der angezeigten Webseiten, sind gemäss Gold Standard wirklich relevant?
- Recall
  - Wie hoch ist der Anteil der False Negatives gemäss Gold Standard
  - R = TP / (Class p Gold)
  - Bsp. Welche Seiten die der Mensch (Gold Standard) als relevant klassifiziert hat, werden tatsächlich angezeigt?
    - Perfekt wenn all relevanten Seiten angezeigt wurden
    - Schlecht wenn keine relevanten Seiten gefunden wurden
Erweiterte Metrik: Confusion Matrix
Precision vs Recall
- There is often a trafe-off between Precision and Recall
- improving the algorithm towards one weakens the other
  - Will ich das Modell in richtung Precision verbessern, wird der Recall schlechter und umgekehrt
  - Entweder das eine oder andere kann optimiert werden
  - Bspw. Suchmaschine: Einfach alles anzeigen, dann gibts keine False Negatives weil das Gesuchte immer gefunden wird
    - Die Precision wird dabei aber sehr sehr schlecht, weil ganz viele False Positives dabei sind
    - 100% Recall 0% Precision
  - Oft muss ein Kompromiss getroffen werden zwischen Precision und Recall
    - Die Entscheidung was optimiert werden soll, muss vom Entwicklungsteam getroffen werden
  - Precision-oriented users
    - Web Surfers
  - Recall-oriented users
    - Professional searches, legal, etc
Dafür gibt es aber folgendes Hilfsmittel: F-measure
- Das gewichtete, harmonische Mittel zwischen Precision und Recall
  - Formel: Skript Seite 7
  - F = 1/( alpha* 1/P + [1-alpha] * 1/R) = (beta^2 + 1)PR / (beta^2P+R) = beta^2 = 1 - alpha / alpha
- Ist parametrisierbar
  - Beta < emphasize precision
  - Beta > emphasize recall

Other metrics

the generalization of our binary classifier result matrix (classification result vs. gold standard) is called a confusion matrix
- many different metrics can be derived from this
  - https.//en.wikipedia.org/wiki/Confusion_matrix
- other widely used metrics include ROC, K-S, gail/lift, ...
for specific ML problems and algorithms many additional metrics exists

Automated Evaluation Workflow

How can we automatie evaluation
1. define a controlled test set (benchmarks)
  - collection of data (labeled)
  - one or more tasks to be solved by the ML system
  - expected results
    - created by (typically) human domain experts
2. execute ML system for test set
3. compare computed results against expected results

Evaluation Goals

Compare a solution with ...
- different configuration options
  - bspw. feingranulare parametrisierung in einem decision tree (gini parameter bspw)
- alternative solutions
- a basic solution ("baseline")
  - abschätzung nach unten
- the industry and/or academic leader ("state-of-the-art")
  - abschätzung nach oben
- human performance ("gold standard")
  - auch eine abschätzung nach oben, welche man machen sollte
  - Mensch macht auch 1-2% Fehler
- itself over time

Using Data for Training and Testing

ML Methods usually require fine-tuning for good quality
Trainingsdaten dürfen nicht gleichzeitig zum Testen verwendet werden, es muss aufgeteilt werden
Wie splitte ich in Trainingsdaten und Testdaten?
- Modell wird besser wenn mehr trainingsdaten gegeben werden
- aber man will auch möglichst viele Daten fürs Testen, damit Metrik verbessert wird
- Konflikt
Einfache Methode: 80% Training 20% Testing
- Verteilung wichtig! Einfach die vorderen 80% fürs Training nehmen und die hinteren 20% als Test ist nicht gut
  - Daten sind oft sortiert
  - Beide Datenmengen müssen repräsentativ sein!
- Quick fix: Random Number generator verwenden (rand-split in scikit-learn)
  - wenn die gesamtdatenmenge gross genug ist, geht das relativ gut auf
  - im Mittel hat man eine gute Verteilung
  - Problematisch wird das wenn das Klassifikationsproblem stark unbalanciert ist
    - Websuche: Datenmenge 100k Webseiten, davon sind 20 relevant -> winzige relevante Menge (p class) und eine grosse (n class)
- Besser als Random-Split: K-Fold Cross Validation

K-Fold Cross Validation

Wie geht man möglichst effizient mit den gelabelten (gold standard) Daten um?
- Es geht nicht gleichzeitig aber nacheinander
  - erst trainieren, dann testen
How to split gold standard data into test and trainin set such that
- we have enough training data
- our test results are not biased
k-fold cross validation
- split data into k folds (Aufteilung in k Teile)
- use (k-1) for training, 1 for testing
- repeat k times
Siehe Grafik Skript Seite 14
Aufteilung auf 3 Folds
- Jedes Sample wurde jeweils einmal zum training und zum testen verwendet
- Man hat ein Problem, wenn sich die Metriken (Fehlerquoten) beim Testen von Fold zu Fold sich stark unterscheiden
- Weiteres Problem: Modell sehr sensibel auf Trainingsdaten
  - Zu wenig Trainingsdaten vorhanden
  - Unterschiede von Fold zu Fold sehr gross -> Könnte heissen, dass das Modell nicht stabil ist
- Kommen von Fold zu Fold aber gleich gute Resultate zurück (Varianz und Standardabweichung gleich)
  - Gutes Zeichen für das Modell
In der Praxis arbeitet man nicht mit 3 sonder mit 10 Folds
- 10 fold cross validation
  - 90/10
- Wird statistisch besser
- Setzt voraus, das man genug grosse Datenmengen hat

Dataset Challenges

Potential Problems: is the dataset...
- correct?
- large enough?
- representative?
- cause overfitting? -> Zu viele eintönige Daten, und das Modell lernt eine Niche
for many application domains, large datasets are available
- not all free but still cost saving
- allows to compare approaches in a larger community
where to search
- wikipedia
- kaggle
- research groups at universities
- conference series
- research articles
- data collecting companies and public administrations

Unsupervised Learning

[!NOTE] Definition an algorithm learns from experience E to solve some tasks T with performance P if P improves with E

K-Means Clustering
- Sample Vektor repräsentiert als feature vektor
- Grosse Dokumentensammlung mit grossem Durcheinander in einem Folder
- Man möchte das in eine sinnvolle Verzeichnisstruktur bringen
  - Clustering könnte helfen

Clustering

Datensatz in Gruppen einteilen, jede Gruppe nennt man Cluster
- Iris Datenset kann in Clusters unterteilt werden
- News Artikel auf Google News
- Kunden in Kundengruppen
Das alles ist unsupervised Classification
Based on some similarity measure
all instances within a cluster should be similar
- ähnliche feature Werte (numerische Werte)
- and instances in different clusters should be dissimilar
Erster Punkt: Cluster Bilden
Zweiter Punkt: Einzelne Cluster identifizieren, was genau ist das Merkmal von jedem Cluster?
- Häufig muss dies der Mensch machen, da die Algorithmen das nicht können
Cluster discovered
- Beispielsweise durch die Euklidsche Distanz
  - ähnlichkeit zwischen zwei Vektoren (Feature Vektor)

K-Means

Idea
- creates K clusters
- interpret samples x as real-valued vectors x-> (vector)
  - data preparation: numeric data only
- assignement of x to a cluster is based on its distance to the cluster centroids
  - jeder Cluster hat einen Mittelpunkt (centroids)
  - der Datenpunkt kommt in den Cluster rein wo die Distanz am kleinsten ist
  - Mittelpunkt (centroid) ist im Vorhinein nicht bekannt
    - und ist auch nicht statisch, der verschiebt sich
    - Problem: Der Mittelpunkt des Clusters wird berechnet aufgrund der Daten im Set
    - Am Anfang ist aber noch kein Cluster bekannt
    - Daher löst man das mit einem iterativen Prozess

# select K random samples {c1, c2, ... , ck} as approximatrion of centroids
until termination condition
    for each sample xi:
        assign xi to the cluster Cj such that dist(xj,cj) is minimal
    for each cluster Cj update the approximations of centroids
        cj = u(Cj)

How many clusters K

Number of clusters K is given
- partition n samples into predetermined number of clusters
Finding the "right" number of cluster is part of the problem
- partition n samples into appropriate number of clusters
- often try and error
Use an algorithm to determine K automatically
- define a function to assess the "qualits" of all clusters
  - e.g. pairwise distance of all samples within a cluster to measure how homogenous the cluster is
- increase K until no further quality improvement

Discussion K-Means

Advantages
- easy to implement and understand (white box)
Disadvantages
- assumes that clusters are sphere-shaped
- number of iterations and resulting clusters results depend on seed choice
  - use heuristic rather than random picks
- algorihm may converge on local minima
  - re-run with different seeds
  - post-process resulting clusters
    - split the n "worst" clusters into 2 or more subclusters
    - merge 2 close clusters (where centroids are close) into one
- slow
  - updating centroid after each new sample assignment may speed up the process

Cluster Evaluation Metrics

in case we have a classified data set (gold standard)
- homogeneity score
  - between 1 and 0, where 1 means that each computed cluster contains only samples of one gold standard cluster
- completeness score
  - between 1 and 0 where 1 means that all samples from a gold standard cluster are assigned to the same computed cluster
- Adjusted rand index (ARI)
  - Überlappung zwischen berechneten Cluster und gold standard cluster wird berechnet (schnittmenge)
    - overlap = number of common items
  - between -1 or 1, where 1 means equality
in case the gold standard is not known
- dann kann man es nur noch geometrisch bewerten
- SSE: sum of squared error (Fehlerquadratsumme) sum of squared distance of each sample to the centroid of its assigned cluster
- silhouette coefficient
  - wie gut ist der punkt abgegrenzt vom Nachbarcluster
  - average distance of sample to all other points in the same cluster - average distance of sample to all other points in the next nearest cluster
- between -1 and 1 where 1 means dense clusters

Naive Bayes

Supervised learning

Bayes' Theorem

mathematical law about conditional probabilities
given two events A and B, then the conditional probability P(B|A) relates to the probability that event B occurs after A has occured
- applies e.g. when we are blindl drawing samples from a bag containing red an black alls without returning the balls
- assume we start wit 2 red and 2 black balls
  - then P(red) = 2/4 = P(black) for the first draw
- if we draw a red ball first, then we know upfront for the 2nd draw
  - P(red|red) = 1/3 and P(black|red) = 2/3
probability that two conditionally-related events P(A) and P(B|A) occur one after another: P(A^B) = P(A) * P(B|A)
- e.g probability to first draw red and then red again: P(red^red) = P(red)*P(red|red) = 2/4 * 1/3 = 1/6
P(A|B), P(B|A) may be calculated as
- P(A|B) = P(A^B) / P(B) and P(B|A) = P(A^B) / P(A)
- in the drawing example: P(red|red) = 2/4 * 1/3 / 2/4 = 1/3
from this we can derive Bayes' theorem
- P(A|B) = P(B|A) * P(A) / P(B)

Using Bayes' Theorem for Classification

Our events are A = class c, B = sample x.
We calculate P(c|x)
- x -> Ich bekomme ein sample
- P(c|x) -> wie hoch ist die Wahrscheinlichkeit dass das sample x in der klasse c ist?
given x, our classifier thus
1. calculates P(c|x) for all c
2. selects c with the highes P(c|x)
P(c|x) = P(x|c) * P(c) / P(x)
- P(c) -> probability of a class c -> kann aus den Samples ausgelesen werden -> Zählen wieviele Samples hab ich in einer Klasse
  - es gilt: n = n_samples within c / n_samples
- P(x) -> wie häufig hab ich ein bestimmtes sample in meinen Trainingsdaten? -> Meist nur ein einziges mal, manchmal ein identisches sample mehrmals
  - wenn nur einmal gilt: n = 1 / n_samples
- P(x|c) -> wie häufig hab ich einen Feature Vektor in einer bestimmten Klasse -> Optimalerweise nur einmal -> Bei einer klasse wahrscheinlichkeit = 1 und bei allen anderen 0
  - man schaut sich die einzelnen Features an und schaut welche kombinationen von Features sind einen Hinweis auf eine Klasse
  - P(f1, f2, ..., fn|c) where fi are the sample's features
  - we assume that fi are independent from each other given a class c
    - thus we can calculate P(f1,f2,...,fn|c) = P(f1|c) * P(f2|c) * ... * P(fn|c)
    - 3 different NB variants for calculating P(fi|c)
      - Gaussian: fi are continuous and P(fi|c) are normally distributed
      - multinomial: P(fi|c) = n_fi in c / n_fi occures overall
      - bernoulli: binary features
  - in general this independence assumption is wrong
    - within c certain features are often correlated
    - thus the name: naive bayes classifier (NB)
    - but the NB classifier performs well in practice despite this "naive" assumption
Advantages
- very fast learning and testing
- performed super auf grossen Datensätzen
- low storage requirements
- optimal if the independence assumptions hold
  - typischerweise hat man aber keine unabhängigen features
- very good in domains with many equally important features
  - robust gegen schlechtes feature engineering
  - bspw.: Um zu erkennen ob das ein Brief ist schau ich ob ein Datum drauf ist -> Das ist Mist
    - für decicion tree fatal
    - für naive bayes hat man halt nur einen neuen Skalierungsfaktor
  - sehr fehlertolerant
Disadvantages
- Often not the best classifier

19 KiB Raw Blame History

Notizen Lektion 2

recap

Evaluation

How good is the machine learning system?

Metrics SR/ER

Precision and Recall for Binary Classification

Other metrics

Automated Evaluation Workflow

Evaluation Goals

Using Data for Training and Testing

K-Fold Cross Validation

Dataset Challenges

Unsupervised Learning

Clustering

K-Means

How many clusters K

Discussion K-Means

Cluster Evaluation Metrics

Naive Bayes

Bayes' Theorem

Using Bayes' Theorem for Classification

19 KiB

Raw Blame History