feature(workshop): add workshop13 scaffold, solution will be added later
@@ -0,0 +1,68 @@

# Workshop 13: Stability Comparison of Classifiers (Cross-Validation)

> CAS Practical Machine Learning · Supervised Learning · Lesson 5 (slide deck 14, slide 12)
> Time: 30 min

## Task

Compare **all classifiers covered so far** with respect to their **stability**
under cross-validation.

- use the **default parameterization** for each classifier
- use `sklearn.model_selection.cross_val_score` for the cross-validation

## Core idea: what does "stability" mean?

`cross_val_score` returns one score **per fold**. From these values:

- **mean** → average performance
- **std** → **stability**: how much the performance varies across the folds (i.e. across
  different data splits). **Smaller std = more stable.**

Goal of the comparison: which classifier delivers results that are not only good but also
**reliable** (low spread)?
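
As a small illustration of why the mean alone is not enough, the sketch below compares two sets of fold scores. The score values and the names `scores_a`, `scores_b` and `summarize` are made up for illustration; they are not results from the workshop dataset. `statistics.pstdev` is used because it matches the default of `np.std` (population standard deviation, `ddof=0`).

```python
import statistics

# Hypothetical fold scores (illustration only, not measured results).
scores_a = [0.80, 0.81, 0.79, 0.80, 0.80]  # tight spread
scores_b = [0.70, 0.90, 0.75, 0.85, 0.80]  # wide spread


def summarize(scores):
    """Return (mean, population std) of a list of fold scores."""
    return statistics.fmean(scores), statistics.pstdev(scores)


mean_a, std_a = summarize(scores_a)
mean_b, std_b = summarize(scores_b)

print(f"A: mean={mean_a:.3f} std={std_a:.3f}")
print(f"B: mean={mean_b:.3f} std={std_b:.3f}")
```

Both hypothetical classifiers reach the same mean of 0.80, but the scores of B spread roughly eleven times as much, so A would count as the more stable one.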

## Dataset

Classification dataset from the previous workshops (as in WS6, loaded via
`bfh_cas_pml.prep_data` from the prepared CSV in `data/`).

> Note: `cross_val_score` does the splitting itself → **no** manual
> train-test split is needed. It is enough to pass `X` and `y` (e.g. the `X_train`, `y_train` returned by `prep_data`).
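
The sketch below illustrates what "does the splitting itself" means in practice. It uses a synthetic dataset from `make_classification` as a stand-in, since the workshop CSV is not shown here; the dataset, `cv=5` and the choice of `KNeighborsClassifier` are assumptions for illustration. For a classifier with a binary or multiclass target, an integer `cv` makes scikit-learn use an unshuffled `StratifiedKFold`, so passing that splitter explicitly should give the same scores.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the workshop data (assumption, not the course CSV).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

clf = KNeighborsClassifier()  # deterministic, default parameters

# Integer cv: scikit-learn builds the folds internally.
scores_int = cross_val_score(clf, X, y, cv=5)

# Explicit splitter: the same folds, written out.
scores_explicit = cross_val_score(clf, X, y, cv=StratifiedKFold(n_splits=5))

print(scores_int)
print(np.allclose(scores_int, scores_explicit))
```

Passing a splitter object instead of an integer becomes useful once the folds should be shuffled or seeded; for the default comparison in this workshop the integer form is sufficient.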

## Folder structure

```
workshop13
├── data
│   └── <classification_data>.csv   # from course material
├── devenv.lock
├── devenv.nix
├── README.md
├── stability_boxplot.png           # output
└── src
    ├── bfh_cas_pml.py              # from course material
    └── crossvalidation.py          # solution
```

## Procedure

1. Load the data (`X`, `y`).
2. Collect all known classifiers with **default parameters** in a `dict`.
3. For each classifier, compute `cross_val_score(clf, X, y, cv=kfold)`.
4. Compare `mean` and `std` per classifier (sorted by `std`).
5. Draw a boxplot of all classifiers side by side → makes the spread visible.
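
The five steps can be sketched end to end as follows. This is not the workshop solution: it runs on synthetic stand-in data from `make_classification` and compares only two classifiers, both of which are assumptions chosen to keep the example short. The variable names `scores_wide`, `scores_long` and `summary` are hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend: render to a file, not a window
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Step 1: synthetic stand-in data (assumption, replace with the course data).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Step 2: classifiers with default parameters (seed only where stochastic).
classifiers = {
    "DecisionTree": DecisionTreeClassifier(random_state=1234),
    "KNeighbors": KNeighborsClassifier(),
}

# Step 3: one array of fold scores per classifier.
results = {
    name: cross_val_score(clf, X, y, cv=10) for name, clf in classifiers.items()
}

# Step 4: mean and std per classifier, most stable (smallest std) first.
scores_wide = pd.DataFrame(results)  # rows = folds, columns = classifiers
summary = scores_wide.agg(["mean", "std"]).T.sort_values("std")
print(summary)

# Step 5: long format (columns: classifier, score) for a side-by-side boxplot.
scores_long = scores_wide.melt(var_name="classifier", value_name="score")
ax = sns.boxplot(data=scores_long, x="classifier", y="score")
ax.set_title("Fold scores per classifier")
plt.savefig("stability_boxplot.png")
plt.close()
```

Note that `DataFrame.std` computes the sample standard deviation (`ddof=1`), so its values are slightly larger than those of `np.std`. The ranking of the classifiers is unaffected as long as all of them use the same number of folds.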

## Findings / open points

> record your own observations here

- Most stable classifier (smallest std):
- Best mean:
- Trade-off mean vs. std:
- Which classifiers need `random_state`, and which do not?
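
One way to approach the last question empirically is to run the same cross-validation twice and check whether the scores repeat. The following is a minimal sketch on synthetic stand-in data (an assumption, not the course dataset). The helper name `repeatable` is hypothetical, and `n_estimators=20` deviates from the default only to keep the run short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)


def repeatable(make_clf):
    """Run 5-fold CV twice with fresh estimators; True if the scores match."""
    first = cross_val_score(make_clf(), X, y, cv=5)
    second = cross_val_score(make_clf(), X, y, cv=5)
    return bool(np.array_equal(first, second))


# Deterministic algorithm: there is no random_state to set.
knn_repeatable = repeatable(lambda: KNeighborsClassifier())

# Stochastic algorithm with a fixed seed: the runs are reproducible.
rf_seeded_repeatable = repeatable(
    lambda: RandomForestClassifier(n_estimators=20, random_state=1234)
)

# Stochastic algorithm without a seed: the runs may or may not differ.
rf_unseeded_repeatable = repeatable(lambda: RandomForestClassifier(n_estimators=20))

print(knn_repeatable, rf_seeded_repeatable, rf_unseeded_repeatable)
```

Only the first two flags are guaranteed to be `True`. The unseeded forest can happen to produce identical scores on an easy dataset, so a `True` there does not prove determinism.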

## Sources

- Slide deck 14 (Validation), V. Vogel, TI BFH, slides 10–12
- Notes: `../../L5_Notizen.md` (section "Praxis: Kreuzvalidierung")
File diff suppressed because it is too large
@@ -0,0 +1,65 @@
{
  "nodes": {
    "devenv": {
      "locked": {
        "dir": "src/modules",
        "lastModified": 1781147004,
        "narHash": "sha256-/s2Fk3BDmdIIwSWZc04fLrCK86chpxpeMRgHXGjzquk=",
        "owner": "cachix",
        "repo": "devenv",
        "rev": "15f44b869b9c99b0bb104b7d5a04d9faba540a5e",
        "type": "github"
      },
      "original": {
        "dir": "src/modules",
        "owner": "cachix",
        "repo": "devenv",
        "type": "github"
      }
    },
    "nixpkgs": {
      "inputs": {
        "nixpkgs-src": "nixpkgs-src"
      },
      "locked": {
        "lastModified": 1778507786,
        "narHash": "sha256-HzSQCKMsMr8r55LwM1JuzIOB+8bzk0FEv6sItKvsfoY=",
        "owner": "cachix",
        "repo": "devenv-nixpkgs",
        "rev": "8f24a228a782e24576b155d1e39f0d914b380691",
        "type": "github"
      },
      "original": {
        "owner": "cachix",
        "ref": "rolling",
        "repo": "devenv-nixpkgs",
        "type": "github"
      }
    },
    "nixpkgs-src": {
      "flake": false,
      "locked": {
        "lastModified": 1778274207,
        "narHash": "sha256-I4puXmX1iovcCHZlRmztO3vW0mAbbRvq4F8wgIMQ1MM=",
        "owner": "NixOS",
        "repo": "nixpkgs",
        "rev": "b3da656039dc7a6240f27b2ef8cc6a3ef3bccae7",
        "type": "github"
      },
      "original": {
        "owner": "NixOS",
        "ref": "nixpkgs-unstable",
        "repo": "nixpkgs",
        "type": "github"
      }
    },
    "root": {
      "inputs": {
        "devenv": "devenv",
        "nixpkgs": "nixpkgs"
      }
    }
  },
  "root": "root",
  "version": 7
}
@@ -0,0 +1,31 @@
{ pkgs, ... }:

{
  # Native libs that the pip-wheel-installed numpy/scipy/matplotlib stack
  # dlopen()s at runtime. zlib was already needed in W3/W4 (libz.so.1);
  # stdenv.cc.cc.lib provides libstdc++ for the scipy/sklearn wheels.
  packages = [
    pkgs.zlib
    pkgs.stdenv.cc.cc.lib
  ];

  languages.python = {
    enable = true;
    venv.enable = true;
    venv.requirements = ''
      pandas
      numpy
      scikit-learn
      matplotlib
      seaborn
    '';
  };

  # Loader path for the native libs above. If an
  # "ImportError: libXYZ.so.N" still shows up on import, add the providing
  # pkgs.<package> to packages AND here (same pattern as the W3 fix).
  env.LD_LIBRARY_PATH = pkgs.lib.makeLibraryPath [
    pkgs.zlib
    pkgs.stdenv.cc.cc.lib
  ];
}
@@ -0,0 +1,193 @@
"""
Useful functions for example notebooks and workshop solutions
of the course Practical Machine Learning - Supervised Learning
Bern University of Applied Sciences (BFH)
"""


# ========== Packages ==========

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


# ========== Functions ==========

def prep_data(dataset, target, train_ratio = 2 / 3, seed = None, sep = ','):
    """ read and prepare real data from the current directory

    performs
        read data
        features - target - split
        train - test - split

    Parameters
    ----------
    dataset: name of dataset in csv format
    target: name of target column
    train_ratio (2 / 3): share of rows used for the train set (optional)
    seed (None): random seed for split (optional)
    sep (,): separator of csv file (optional)

    Returns
    -------
    X_train: feature matrix of train set
    X_test: feature matrix of test set
    y_train: target vector of train set
    y_test: target vector of test set
    """

    ## load data
    data = pd.read_csv(dataset, sep = sep)

    ## features - target - split
    X = data.drop(target, axis=1)
    y = data[target]

    ## train - test - split
    from sklearn.model_selection import train_test_split
    return train_test_split(
        X,
        y,
        train_size=train_ratio,
        random_state=seed)


def prep_demo_data(dataset, target):
    """ read demo data from the current directory

    performs
        read data
        features - target - split

    Parameters
    ----------
    dataset: name of dataset in csv format, ',' separated
    target: name of target column

    Returns
    -------
    X: feature matrix
    y: target vector
    """

    ## load data
    data = pd.read_csv(dataset)

    ## features - target - split
    X = data.drop(target, axis=1)
    y = data[target]

    return X, y


def inspect_decision_tree_model(model_def, features, target, figsize=(6, 6)):
    """ train a DecisionTreeClassifier and visualize the tree

    prints some model attributes from within the function

    Parameters
    ----------
    model_def: DecisionTreeClassifier object with set parameters
    features: feature matrix
    target: target vector
    figsize: size of image, optional, default = (6, 6)

    Returns
    -------
    visualization of the trained tree
    prints model attributes
    """

    from sklearn.tree import plot_tree

    model = model_def
    model.fit(features, target)

    print('TREE DIAGNOSTICS:')
    print('depth :', model.get_depth())
    print('leaves :', model.get_n_leaves())
    print('score :', model.score(features, target))

    plt.figure(figsize=figsize)
    # plot_tree expects names as strings; convert so that integer class
    # labels and pandas column indexes are handled as well
    plot_tree(model,
              feature_names=[str(c) for c in features.columns],
              class_names=[str(c) for c in model.classes_],
              filled=True)


def test_regression_model(model, X_train, y_train, X_test, y_test, show_plot=True):

    """ train a regression model and evaluate it on test data

    performs
    - training on train data
    - prediction on test data
    - calculate performance measures

    Parameters
    ----------
    model: a parametrized regression model
    X_train, y_train: train data
    X_test, y_test: test data
    show_plot: show scatterplot of pred vs true, optional, default=True

    Returns
    -------
    the fitted model
    shows a scatterplot of y_test vs y_pred with a diagonal line, indicating identity
    prints r2_score and mean_squared_error
    """

    from sklearn.metrics import r2_score
    from sklearn.metrics import mean_squared_error

    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print('R2 = %0.4f' %(r2_score(y_test, y_pred)))
    # the docstring promises the MSE as well, so print it alongside R2
    print('MSE = %0.4f' %(mean_squared_error(y_test, y_pred)))

    if show_plot:
        plt.figure(figsize=(6,6))
        ax = sns.scatterplot(x=y_test, y=y_pred)
        ax.set(xlabel='y_test', ylabel='y_pred')
        ls = np.linspace(min(y_test), max(y_test), 100)
        plt.plot(ls, ls, color='black', linestyle='dashed')
        ax.set_title(model.__class__.__name__)
        plt.show()

    return model


def show_pred_on_synth(model, X, y, X_synth, param_str):
    """ shows behaviour of univariate ML regression on synthetic dataset

    Parameters
    ----------
    model: a parametrized regression model
    X, y: data for univariate regression (X is a DataFrame with a column 'X')
    X_synth: synthetic feature (2D array, first column is plotted)
    param_str: parameter description for title

    Returns
    -------
    a scatterplot of X, y, with the prediction values for X_synth
    """

    model.fit(X.to_numpy(), y)
    y_pred = model.predict(X_synth)

    ax = sns.scatterplot(x=X['X'], y=y)
    ax = sns.lineplot(x=X_synth[:,0], y=y_pred, color='orange')
    ax.set_title(model.__class__.__name__ + ' : ' + param_str)
    ax.set(xlabel='X', ylabel='y')
    plt.show()
@@ -0,0 +1,72 @@
"""
Workshop 13: stability comparison of classifiers using cross-validation.

Task (slide 12): compare all classifiers covered so far with respect to their
stability under cross-validation.
- default parameterization
- sklearn.model_selection.cross_val_score
"""

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib

matplotlib.use("Agg")  # headless: write the plot to a file instead of a window
import matplotlib.pyplot as plt

from sklearn.model_selection import cross_val_score

# Classifiers. TODO: adapt to "all classifiers covered so far" (those treated in the course)
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
# from sklearn.linear_model import LogisticRegression
# from sklearn.neighbors import KNeighborsClassifier
# from sklearn.naive_bayes import GaussianNB
# from sklearn.svm import SVC
# ...


# --- Load data -----------------------------------------------------------
# as in the previous classification workshops (e.g. bfh_cas_pml.prep_data
# on the classification dataset in data/).
# Note: cross_val_score does the splitting itself -> NO manual train-test split.
# Pass X, y (e.g. X_train, y_train from prep_data).
# TODO: provide X, y


# --- Collect classifiers -------------------------------------------------
# dict {name: estimator} -> easy to iterate, all with default parameters.
# Question: which estimators are stochastic (need random_state for
# reproducibility) and which are deterministic? -> set it only for the former.
SEED = 1234
classifiers = {
    "DecisionTree": DecisionTreeClassifier(random_state=SEED),
    "RandomForest": RandomForestClassifier(random_state=SEED),
    # TODO: add the remaining known classifiers (default parameters!)
}


# --- Cross-validation per classifier -------------------------------------
# The default would be 5 folds; more folds give more scores and larger
# training sets, but smaller (noisier) test folds and more compute time.
KFOLD = 10
results = {}  # name -> array of scores (one score per fold)
# TODO: for each (name, clf) in classifiers:
#     scores = cross_val_score(clf, X, y, cv=KFOLD)
#     results[name] = scores


# --- Evaluation: mean & std ----------------------------------------------
# Stability = spread of the fold scores. Small std => stable (see notes).
# TODO: compute mean and std per classifier,
#       e.g. as a DataFrame sorted ascending by std (most stable first).


# --- Visualization -------------------------------------------------------
# One boxplot per classifier, side by side -> spread directly comparable.
# Tip: bring results into "long format" (columns: classifier, score),
# then sns.boxplot(data=df, x="classifier", y="score").
# TODO: create the boxplot and save it with plt.savefig("stability_boxplot.png").


if __name__ == "__main__":
    pass  # TODO: call the workflow / print the results