Notes Lesson 2

Topic: Introduction to Practical Machine Learning 2 Date: 22.04.2026 Lecturer: Jürgen Vogel

recap

[!NOTE] Definition (machine learning): an algorithm that learns from experience E to solve some tasks T with performance P, where P improves with E

Model
- represents the solution to the tasks T
- is learnt and adapted based on E
- can be evaluated with respect to P
Features
- are the relevant part of the data E for creating the model
- may have to be designed explicitly depending on the ML algorithm
Categorization with respect to
- experience E: supervised vs. unsupervised vs. reinforcement learning
- tasks T: clustering vs. classification vs. regression
- human-readable model: white box vs. black box
Project
- agile/iterative development (CRISP-DM)
Key Challenges
- definition of T that is both solvable and generates value
- large amounts of high quality data E
- feature engineering
- dealing with 95% models

Evaluation

How good is the machine learning system?

returned result is good if it solves the problem at hand
- may be qualitative or quantitative
- may be subjective (user need, context, and preferences)
- may change over time
- also depends on factors such as credibility, specificity, exhaustiveness, recency, clarity, interpretability... of the result
Example search engine: a set of keywords is entered into a search engine
- When is the search engine's answer "good"?
  - Hard to answer, because it differs from user to user
- Casual user: question from a general context -> a more general answer is okay
  - "Where is a restaurant within walking distance that is open?"
    - The user does not want the best possible result or a list of all restaurants
  - A fast result that is good enough
- Expert user: researches very detailed information
  - Wants to carry out an extensive analysis
  - What scientific literature exists on the topic?
  - What are the best methods?
  - Very high information need
thus, the ML system needs to be assessed in "real-life" situations
- often with user involvement
- similar methods as with user requirements research
  - usability tests, interviews, field studies, log analysis
- but this takes time and is costly

Metrics SR/ER

Important:
- Success Rate
- Error Rate
Success
- Result is correct -> a single sample was classified correctly
- success rate -> average over a larger set of samples
  - also called accuracy
Error
- Result is incorrect -> a single sample was classified incorrectly
- error rate -> average over a larger set of samples
Both are 1/0 views -> a result is either wrong or right
Example: how many people are in the picture?
- Model says 3 people
- The picture shows 5 people
- How should this be scored?
  - wrong? -> 100% error
  - partially right? 3/5 detected, 2/5 error
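
The two readings of the counting example can be written as two comparison functions. This is only a sketch under assumptions: the names `exact_match` and `count_ratio` are hypothetical, and dividing the smaller count by the larger one is just one possible way to give partial credit, not a definition from the lecture.

```python
def exact_match(expected: int, predicted: int) -> float:
    """Strict 1/0 view: the count is either right or wrong."""
    return 1.0 if expected == predicted else 0.0


def count_ratio(expected: int, predicted: int) -> float:
    """Graded view (assumed): smaller count divided by larger count."""
    if expected == predicted:
        return 1.0  # also covers the 0 vs. 0 case
    return min(expected, predicted) / max(expected, predicted)


# Picture with 5 people, model says 3.
strict_score = exact_match(5, 3)  # strict view: counted as a full error
graded_score = count_ratio(5, 3)  # graded view: 3 of 5 people detected
```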
Generalizing the success rate gives
- our ML system takes some test data D as input and produces some results
  - D -> {r'1, ..., r'n}
  - e.g. if the r'i are from a list of predefined labels, we call this classification
- the test data also includes the expected results ("gold standard")
  - D -> {r1, ..., rn}
- for the test setting, we define a comparison function
  - c(r, r') = 1 if r = r', 0 else # comparison function
- then we can calculate the success rate SR as
  - SR = (1/n)*sum(i=1, n, c(ri, r'i))
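
The success rate formula above can be sketched in a few lines of Python. The function names and the spam/ham labels are made up for illustration. The error rate is computed as 1 - SR, which holds for the strict 1/0 comparison.

```python
from typing import Callable, Sequence


def exact(expected: object, predicted: object) -> int:
    """Comparison function c(r, r'): 1 if both results are equal, else 0."""
    return 1 if expected == predicted else 0


def success_rate(
    gold: Sequence[object],
    predicted: Sequence[object],
    compare: Callable[[object, object], float] = exact,
) -> float:
    """SR = (1/n) * sum of c(r_i, r'_i) over all n test samples."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("at least one test sample is required")
    return sum(compare(r, rp) for r, rp in zip(gold, predicted)) / len(gold)


# Hypothetical test set with four samples; three predictions match.
gold = ["spam", "ham", "spam", "ham"]
predicted = ["spam", "ham", "ham", "ham"]

sr = success_rate(gold, predicted)  # 3 of 4 correct
er = 1 - sr                         # error rate for the 1/0 comparison
```

Passing a different `compare` function is one way to plug in a graded comparison instead of the strict 1/0 one.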

Precision and Recall for Binary Classification

Example search engine -> we want to evaluate whether the model works well
- A test set was compiled for a search query
- Manually rated (gold standard):
  - For every entry it is known whether the website matches or does not match

Rating:

| | positive gold | negative gold |
|---|---|---|
| positive classified | true positive (TP) | false positive (FP) |
| negative classified | false negative (FN) | true negative (TN) |

True Positives: the classifier says positive and the gold standard says positive
True Negatives: the classifier says negative and that is correct
False Negatives: the classifier says negative, but the gold standard says positive
- this is an error
- Example search engine: the search engine does not return a result although it would be relevant
False Positives: the classifier says positive, but that is not correct
- this is the other kind of error
- Example search engine: the search engine returns a non-relevant result
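
The four cases can be counted by comparing the classifier output with the gold standard sample by sample. This is a minimal sketch. The function name and the eight-page test set are hypothetical, with `True` meaning "relevant" in the gold list and "returned" in the classified list.

```python
from typing import Dict, Sequence


def confusion_counts(gold: Sequence[bool], classified: Sequence[bool]) -> Dict[str, int]:
    """Count TP, FP, FN and TN for a binary classifier."""
    if len(gold) != len(classified):
        raise ValueError("gold and classified must have the same length")
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for is_positive, says_positive in zip(gold, classified):
        if says_positive and is_positive:
            counts["TP"] += 1  # returned and relevant
        elif says_positive:
            counts["FP"] += 1  # returned but not relevant
        elif is_positive:
            counts["FN"] += 1  # relevant but not returned
        else:
            counts["TN"] += 1  # not relevant and not returned
    return counts


# Hypothetical test set of eight web pages.
gold = [True, True, True, False, False, False, False, True]
classified = [True, True, False, True, False, False, False, False]
counts = confusion_counts(gold, classified)
```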
Metrics derived from this:
- Precision
  - Share of TP among all samples classified as positive
  - If the algorithm produces no false positives, precision is 100%
  - P = TP / (TP + FP), i.e. TP / (all samples classified as positive)
  - Example: how many of the displayed web pages are really relevant according to the gold standard?
- Recall
  - Share of the samples that are positive according to the gold standard that the classifier finds; it drops as the number of false negatives grows
  - R = TP / (TP + FN), i.e. TP / (all samples positive in the gold standard)
  - Example: which of the pages that the human (gold standard) rated as relevant are actually displayed?
    - Perfect if all relevant pages are displayed
    - Poor if no relevant pages are found
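
Precision and recall follow directly from the TP, FP and FN counts. In the sketch below the numbers are invented for illustration. Returning 0.0 when the denominator is zero is an assumed convention, not something the lecture specifies.

```python
def precision(tp: int, fp: int) -> float:
    """TP divided by all samples classified as positive."""
    classified_positive = tp + fp
    if classified_positive == 0:
        return 0.0  # assumed convention: nothing was classified as positive
    return tp / classified_positive


def recall(tp: int, fn: int) -> float:
    """TP divided by all samples that are positive in the gold standard."""
    gold_positive = tp + fn
    if gold_positive == 0:
        return 0.0  # assumed convention: the gold standard has no positives
    return tp / gold_positive


# Hypothetical search: 20 pages shown, 15 of them relevant,
# while 45 other relevant pages were not shown.
p = precision(tp=15, fp=5)  # 15 of 20 shown pages are relevant
r = recall(tp=15, fn=45)    # 15 of 60 relevant pages were shown
```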
Extended metric: confusion matrix
Precision vs Recall
- There is often a trade-off between precision and recall
- improving the algorithm towards one weakens the other
  - Tuning the model towards precision makes recall worse, and vice versa
  - Usually only one of the two can be optimized
  - Example search engine: simply return everything, then there are no false negatives because the item searched for is always found
    - Precision then becomes very poor, because the result contains a great many false positives
    - 100% recall, precision close to 0%
  - Often a compromise between precision and recall has to be made
    - The development team has to decide which one to optimize
  - Precision-oriented users
    - Web surfers
  - Recall-oriented users
    - Professional searches, legal, etc.
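
The trade-off can be illustrated with a search engine that assigns each page a relevance score and returns every page at or above a threshold. The scores, the threshold mechanism and the function name are assumptions made for this sketch. The notes only describe the "return everything" extreme.

```python
from typing import Sequence, Tuple


def precision_recall_at(
    scores: Sequence[float], gold: Sequence[bool], threshold: float
) -> Tuple[float, float]:
    """Return (precision, recall) when all pages with score >= threshold are shown."""
    tp = fp = fn = 0
    for score, relevant in zip(scores, gold):
        returned = score >= threshold
        if returned and relevant:
            tp += 1
        elif returned:
            fp += 1
        elif relevant:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical test set: ten pages, only two of them relevant.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
gold = [True, False, False, True, False, False, False, False, False, False]

picky = precision_recall_at(scores, gold, threshold=0.85)      # shows one page
everything = precision_recall_at(scores, gold, threshold=0.0)  # shows all pages
```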
A helpful tool for this is the F-measure
- The weighted harmonic mean of precision and recall
  - Formula: script, page 7
  - $F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} = \frac{(\beta^2 + 1) P R}{\beta^2 P + R}$, where $\beta^2 = \frac{1 - \alpha}{\alpha}$
- It is parameterizable
  - beta < 1: emphasizes precision
  - beta > 1: emphasizes recall
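
A small sketch of the F-measure in its beta form, using the precision and recall values 0.75 and 0.25 as made-up inputs. The function name and the choice to return 0.0 for a zero denominator are assumptions for illustration.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean: (beta^2 + 1) * P * R / (beta^2 * P + R)."""
    if beta < 0:
        raise ValueError("beta must not be negative")
    beta_sq = beta * beta
    denominator = beta_sq * precision + recall
    if denominator == 0:
        return 0.0  # assumed convention when the formula is undefined
    return (beta_sq + 1) * precision * recall / denominator


p, r = 0.75, 0.25  # hypothetical classifier with high precision, low recall

f1 = f_measure(p, r)                # beta = 1: balanced
f_half = f_measure(p, r, beta=0.5)  # beta < 1: precision counts more
f2 = f_measure(p, r, beta=2.0)      # beta > 1: recall counts more
```

Because precision is the stronger value in this example, the score rises when beta is below 1 and falls when beta is above 1.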

Other metrics

the generalization of our binary classifier result matrix (classification result vs. gold standard) is called a confusion matrix
- many different metrics can be derived from this
  - https://en.wikipedia.org/wiki/Confusion_matrix
- other widely used metrics include ROC, K-S, gain/lift, ...
for specific ML problems and algorithms many additional metrics exist
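
For more than two classes, a confusion matrix can be sketched as a counter over (classified, gold) label pairs, in the same orientation as the binary table. The animal labels and the helper names are hypothetical.

```python
from collections import Counter
from typing import Sequence


def confusion_matrix(gold: Sequence[str], classified: Sequence[str]) -> Counter:
    """Count how often each (classified label, gold label) pair occurs."""
    if len(gold) != len(classified):
        raise ValueError("gold and classified must have the same length")
    return Counter(zip(classified, gold))


def class_precision(matrix: Counter, label: str) -> float:
    """Correct hits for a label divided by all samples classified as that label."""
    classified_as_label = sum(n for (c, _), n in matrix.items() if c == label)
    return matrix[(label, label)] / classified_as_label if classified_as_label else 0.0


def class_recall(matrix: Counter, label: str) -> float:
    """Correct hits for a label divided by all gold samples with that label."""
    gold_with_label = sum(n for (_, g), n in matrix.items() if g == label)
    return matrix[(label, label)] / gold_with_label if gold_with_label else 0.0


# Hypothetical three-class test set.
gold = ["cat", "cat", "dog", "dog", "bird", "bird"]
classified = ["cat", "dog", "dog", "dog", "bird", "cat"]
matrix = confusion_matrix(gold, classified)
```

Here `matrix[("dog", "cat")]` is the number of cats that were classified as dogs, and the diagonal entries such as `matrix[("dog", "dog")]` are the correct results per class.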
