feature(workshop): add workshop5 decision tree tuning
@@ -0,0 +1,103 @@
# Workshop 05: Decision Tree Classifier, `min_impurity_decrease` Tuning

## Task

Investigate how different values of `min_impurity_decrease` affect the
**test accuracy** that a `DecisionTreeClassifier` can reach.

- Narrow down the value range **step by step** (coarse → fine).
- Present the results:
  - graphically as a **line plot** (parameter vs. accuracy),
  - in the **console**: best score plus the corresponding parameter value.

Hint from the slides: `range()` yields integers only, whereas `np.arange()`
also accepts a float step and then yields floats. `min_impurity_decrease` is
a float, so use `np.arange`.
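
To illustrate the hint, here is a minimal sketch. The range and step are only example values; the point is that `range()` rejects a float step while `np.arange()` builds a float grid:

```python
import numpy as np

# range() only works with integers; a float step raises a TypeError.
try:
    range(0, 1, 0.1)
    range_error = None
except TypeError as err:
    range_error = err
    print("range() failed:", err)

# np.arange() accepts a float step and returns a float array.
params = np.arange(0, 0.02, 0.001)
print(params.dtype, params[:3])
```

Because the grid is built with floating-point arithmetic, the values are only approximately the decimal fractions you typed. Rounding them for display (for example `round(p, 4)`) keeps the console output readable.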

## Setup

```bash
devenv shell   # python + venv (pandas, numpy, sklearn, matplotlib, seaborn)
```

The data is the prepared bank dataset from W4. It is loaded via the
course-internal module (which must be on the `PYTHONPATH` or in the project folder):

```python
from bfh_cas_pml import prep_data
X_train, X_test, y_train, y_test = prep_data('bank_data_prep.csv', 'y', seed=1234)
```

Baseline for comparison (fully grown tree):

```python
DecisionTreeClassifier(random_state=1234)   # train=1.0, test≈0.8296 → overfit
```

> If `bfh_cas_pml` / `bank_data_prep.csv` are not at hand: as a fallback, any
> `train_test_split` on a sklearn dataset will do; the mechanics of the sweep
> stay identical.
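
A minimal sketch of that fallback, assuming scikit-learn's built-in breast-cancer dataset as a stand-in. It is not the bank data, so the scores will differ from the baseline numbers quoted above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): any labelled sklearn dataset works here.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Same train ratio (2/3) and seed that prep_data(..., seed=1234) would use.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=2 / 3, random_state=1234
)

# Fully grown baseline tree, no pruning parameters set.
baseline = DecisionTreeClassifier(random_state=1234).fit(X_train, y_train)
train_acc = baseline.score(X_train, y_train)
test_acc = baseline.score(X_test, y_test)
print("train:", train_acc)
print("test :", test_acc)
```

The gap between the two printed scores is the overfitting that the sweep is meant to reduce.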

## Theory: what `min_impurity_decrease` does

A split is carried out **only** if it lowers the impurity (weighted and
normalised to the whole tree) by at least this value:

```
ΔI_norm = (N_node / N_total) * ( I_parent
                                 - (N_left/N_node)  * I_left
                                 - (N_right/N_node) * I_right )
```

Key points for understanding:

- It is a **pre-pruning** threshold: weak splits are not made in the first
  place → the tree stays smaller → less overfitting.
- The value is weighted by the **share of observations in the node**
  (`N_node / N_total`). Deep nodes cover few samples → their ΔI_norm is
  tiny. That is why sensible thresholds lie in the range **~1e-4 to ~1e-2**, not
  at 0.1+. (Cf. the worked example in the slides: a *good* split near the root
  yielded 0.0421.)
- `=0` → no pruning → baseline tree (overfit).
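
To make the weighting tangible, here is a small worked sketch of the formula above using Gini impurity. The class counts are made up for illustration (they are not the slide example), and `gini` and `weighted_decrease` are hypothetical helper names:

```python
def gini(counts):
    """Gini impurity of a node, given its per-class sample counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def weighted_decrease(parent, left, right, n_total):
    """ΔI_norm as in the formula above, computed from per-class counts."""
    n_node, n_left, n_right = sum(parent), sum(left), sum(right)
    return (n_node / n_total) * (
        gini(parent)
        - (n_left / n_node) * gini(left)
        - (n_right / n_node) * gini(right)
    )


N_TOTAL = 1000  # samples in the whole training set (made-up number)

# Identical class proportions in both cases, but very different node sizes.
near_root = weighted_decrease([500, 500], [400, 100], [100, 400], N_TOTAL)
deep_node = weighted_decrease([10, 10], [8, 2], [2, 8], N_TOTAL)

print(f"near root: {near_root:.4f}")  # 0.1800
print(f"deep node: {deep_node:.4f}")  # 0.0036
```

With a threshold of, say, 0.005 the split at the root would be kept while the equally clean split in the 20-sample node would be rejected. The only difference is the `N_node / N_total` factor.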

## Procedure: narrowing down (the actual learning content)

1. **Coarse**: wide range, coarse steps, to locate the region of the
   maximum, e.g. `np.arange(0, 0.02, 0.001)`.
2. **Fine**: zoom in around the maximum found, e.g.
   `np.arange(0, 0.004, 0.0002)`.
3. Repeat until the position and value of the peak are stable.

Each iteration is its own sweep (same loop, different `np.arange`).
Document the three (or so) ranges and their respective peaks in the
README/notes; that *is* the required "step-by-step narrowing".
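
The coarse-to-fine steps above could be sketched as follows. This is one possible way to derive the fine range from the coarse peak, not the prescribed solution; the breast-cancer dataset is a stand-in assumption and `sweep` is a simplified helper for this example:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption); replace with the prep_data(...) split.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=2 / 3, random_state=1234
)


def sweep(params):
    """Fit one tree per parameter value and return the test accuracies."""
    scores = []
    for p in params:
        model = DecisionTreeClassifier(random_state=1234, min_impurity_decrease=p)
        scores.append(model.fit(X_train, y_train).score(X_test, y_test))
    return np.array(scores)


# Stage 1 (coarse): wide range, large step.
coarse_step = 0.001
coarse = np.arange(0, 0.02, coarse_step)
coarse_scores = sweep(coarse)
coarse_best = coarse[coarse_scores.argmax()]

# Stage 2 (fine): one coarse step either side of the coarse peak, 10x finer.
low = max(0.0, coarse_best - coarse_step)
fine = np.arange(low, coarse_best + coarse_step, coarse_step / 10)
fine_scores = sweep(fine)
fine_best = fine[fine_scores.argmax()]

print(f"coarse: best={coarse_scores.max():.4f} at {coarse_best:.4f}")
print(f"fine  : best={fine_scores.max():.4f} at {fine_best:.5f}")
```

`argmax()` returns the first maximum, so on a plateau the smallest parameter value with the best score is reported.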

## Expected behaviour (sanity check)

- At `0` you start on the baseline (~0.83).
- As the value increases, first a plateau / slight bump (noise splits are
  removed), then a **drop-off** as soon as useful splits are lost.
- In the extreme the tree degenerates to a stump (ultimately a single root
  leaf) → accuracy equals the majority-class share.
  The bank dataset was resampled to be roughly balanced in W4 → the floor is
  close to ~0.5. If your curve crashes down to that level, that is correct,
  not a bug.
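
The floor can be checked directly. The sketch below again assumes the breast-cancer stand-in dataset, which is not balanced, so its floor is not 0.5. A threshold of 1.0 can never be reached because Gini impurity itself stays below 1, so the tree cannot split at all:

```python
from collections import Counter

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption); replace with the prep_data(...) split.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=2 / 3, random_state=1234
)

# Unreachable threshold: no split can lower the impurity by 1.0.
tree = DecisionTreeClassifier(random_state=1234, min_impurity_decrease=1.0)
tree.fit(X_train, y_train)

# A root-only tree always predicts the majority class of the training set.
majority = Counter(y_train).most_common(1)[0][0]
floor = (y_test == majority).mean()
extreme_acc = tree.score(X_test, y_test)

print("leaves:", tree.get_n_leaves(), "depth:", tree.get_depth())
print("test accuracy:", extreme_acc, "majority floor:", floor)
```

If the right-hand end of your sweep lands on this value, the curve is behaving as expected.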

## Deliverables

- [ ] Loop over an `np.arange` range: `set_params(min_impurity_decrease=p)`,
      `fit`, `score(X_test, y_test)`, collect the scores.
- [ ] `sns.lineplot(x=params, y=scores)` plus a scatter marker on `max(scores)`,
      axes labelled (`min_impurity_decrease` / `accuracy`).
- [ ] Console: `best score` plus the corresponding parameter value.
- [ ] At least 2 narrowing stages (coarse + fine) documented.

## Caveats / further reading (optional)

- **Optimistic bias** (as in W4): here `min_impurity_decrease` is tuned directly
  against `X_test`, the same trap as the manual grid search without CV.
  The reported "best" score is therefore optimistically biased. The slides
  do it this way, so adopt it deliberately for the deliverable but note it as
  a deviation. The clean approach would be tuning on a validation split or
  with `GridSearchCV`.
- **Related thread, if time/interest permits**: `ccp_alpha` (minimal
  cost-complexity pruning) is sklearn's "proper" pruning knob and, via
  `cost_complexity_pruning_path()`, directly yields a sensible candidate list
  instead of manual `np.arange` guesswork. This is only a pointer, not part of
  the task.
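
For the first caveat, a minimal sketch of the cleaner variant with `GridSearchCV`. The dataset is again a stand-in assumption, and the grid simply reuses the coarse range from the procedure:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption); replace with the prep_data(...) split.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=2 / 3, random_state=1234
)

# Candidates are compared by 5-fold cross-validation on the training data only.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=1234),
    param_grid={"min_impurity_decrease": np.arange(0, 0.02, 0.001)},
    scoring="accuracy",
    cv=5,
)
search.fit(X_train, y_train)
best_p = search.best_params_["min_impurity_decrease"]

print("best param   :", best_p)
print("CV accuracy  :", search.best_score_)
# The test set is used exactly once, after the parameter has been fixed.
print("test accuracy:", search.score(X_test, y_test))
```

Because the test set plays no part in choosing the parameter, the last printed number is an honest estimate rather than the maximum of a curve fitted to the test data.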
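
As a pointer for the optional `ccp_alpha` thread, this sketch shows what `cost_complexity_pruning_path()` returns and how the candidate alphas shrink the tree. The dataset is a stand-in assumption, and the clipping step is only a precaution against tiny negative values caused by floating-point rounding:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption); replace with the prep_data(...) split.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=2 / 3, random_state=1234
)

# The path lists the alpha values at which the pruned tree actually changes.
path = DecisionTreeClassifier(random_state=1234).cost_complexity_pruning_path(
    X_train, y_train
)
alphas = np.clip(path.ccp_alphas, 0.0, None)  # ccp_alpha must be >= 0

leaf_counts = []
for alpha in alphas:
    tree = DecisionTreeClassifier(random_state=1234, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    leaf_counts.append(tree.get_n_leaves())
    print(f"alpha={alpha:.5f} leaves={leaf_counts[-1]}")
```

Choosing among these alphas has the same caveat as above: compare them on a validation split or with cross-validation, not on the test set.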
File diff suppressed because it is too large
@@ -0,0 +1,65 @@
{
  "nodes": {
    "devenv": {
      "locked": {
        "dir": "src/modules",
        "lastModified": 1780543372,
        "narHash": "sha256-FCGxk82Lc4koWcFw5xgr+W5vbwLVFLCnSMwm2gQOgr0=",
        "owner": "cachix",
        "repo": "devenv",
        "rev": "f693b472c731e7dda69402daa88c06369d54fd3a",
        "type": "github"
      },
      "original": {
        "dir": "src/modules",
        "owner": "cachix",
        "repo": "devenv",
        "type": "github"
      }
    },
    "nixpkgs": {
      "inputs": {
        "nixpkgs-src": "nixpkgs-src"
      },
      "locked": {
        "lastModified": 1778507786,
        "narHash": "sha256-HzSQCKMsMr8r55LwM1JuzIOB+8bzk0FEv6sItKvsfoY=",
        "owner": "cachix",
        "repo": "devenv-nixpkgs",
        "rev": "8f24a228a782e24576b155d1e39f0d914b380691",
        "type": "github"
      },
      "original": {
        "owner": "cachix",
        "ref": "rolling",
        "repo": "devenv-nixpkgs",
        "type": "github"
      }
    },
    "nixpkgs-src": {
      "flake": false,
      "locked": {
        "lastModified": 1778274207,
        "narHash": "sha256-I4puXmX1iovcCHZlRmztO3vW0mAbbRvq4F8wgIMQ1MM=",
        "owner": "NixOS",
        "repo": "nixpkgs",
        "rev": "b3da656039dc7a6240f27b2ef8cc6a3ef3bccae7",
        "type": "github"
      },
      "original": {
        "owner": "NixOS",
        "ref": "nixpkgs-unstable",
        "repo": "nixpkgs",
        "type": "github"
      }
    },
    "root": {
      "inputs": {
        "devenv": "devenv",
        "nixpkgs": "nixpkgs"
      }
    }
  },
  "root": "root",
  "version": 7
}
@@ -0,0 +1,31 @@
{ pkgs, ... }:

{
  # Native libs that the pip-wheel-installed numpy/scipy/matplotlib stack
  # dlopen()s at runtime. zlib was already needed in W3/W4 (libz.so.1);
  # stdenv.cc.cc.lib provides libstdc++ for the scipy/sklearn wheels.
  packages = [
    pkgs.zlib
    pkgs.stdenv.cc.cc.lib
  ];

  languages.python = {
    enable = true;
    venv.enable = true;
    venv.requirements = ''
      pandas
      numpy
      scikit-learn
      matplotlib
      seaborn
    '';
  };

  # Loader path for the native libs above. If an
  # "ImportError: libXYZ.so.N" still shows up on import, add the providing
  # pkgs.<package> to packages AND here; same pattern as the W3 fix.
  env.LD_LIBRARY_PATH = pkgs.lib.makeLibraryPath [
    pkgs.zlib
    pkgs.stdenv.cc.cc.lib
  ];
}
Binary file not shown. (Size: 28 KiB)
Binary file not shown. (Size: 28 KiB)
Binary file not shown.
@@ -0,0 +1,193 @@
"""
Useful functions for example notebooks and workshop solutions
of the course Practical Machine Learning - Supervised Learning
Bern University of Applied Sciences (BFH)
"""


# ========== Packages ==========

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


# ========== Functions ==========

def prep_data(dataset, target, train_ratio = 2 / 3, seed = None, sep = ','):
    """ read and prepare real data from the current directory

    performs
        read data
        features - target - split
        train - test - split

    Parameters
    ----------
    dataset: name of dataset in csv format
    target: name of target column
    train_ratio (2 / 3): share of rows that goes into the train set (optional)
    seed (None): random seed for split (optional)
    sep (,): separator of csv file (optional)

    Returns
    -------
    X_train: feature matrix of train set
    X_test: feature matrix of test set
    y_train: target vector of train set
    y_test: target vector of test set
    """

    ## load data
    data = pd.read_csv(dataset, sep = sep)

    ## features - target - split
    X = data.drop(target, axis=1)
    y = data[target]

    ## train - test - split
    from sklearn.model_selection import train_test_split
    return train_test_split(
        X,
        y,
        train_size=train_ratio,
        random_state=seed)


def prep_demo_data(dataset, target):
    """ read demo data from the current directory

    performs
        read data
        features - target - split

    Parameters
    ----------
    dataset: name of dataset in csv format, ',' separated
    target: name of target column

    Returns
    -------
    X: feature matrix
    y: target vector
    """

    ## load data
    data = pd.read_csv(dataset)

    ## features - target - split
    X = data.drop(target, axis=1)
    y = data[target]

    return X, y


def inspect_decision_tree_model(model_def, features, target, figsize=(6, 6)):
    """ train a DecisionTreeClassifier and visualize the tree

    prints some model attributes from within the function

    Parameters
    ----------
    model_def: DecisionTreeClassifier object with set parameters
    features: feature matrix (DataFrame, its column names label the plot)
    target: target vector
    figsize: size of image, optional, default = (6, 6)

    Returns
    -------
    None; draws the trained tree and prints model attributes
    """

    from sklearn.tree import plot_tree

    model = model_def
    model.fit(features, target)

    print('TREE DIAGNOSTICS:')
    print('depth :', model.get_depth())
    print('leaves :', model.get_n_leaves())
    print('score :', model.score(features, target))

    plt.figure(figsize=figsize)
    # pass plain lists of strings, which plot_tree accepts for both name arguments
    plot_tree(model,
              feature_names=list(features.columns),
              class_names=[str(c) for c in model.classes_],
              filled=True)


def test_regression_model(model, X_train, y_train, X_test, y_test, show_plot=True):
    """ shows behaviour of an ML regression model on a train / test split

    performs
    - training on train data
    - prediction on test data
    - calculate performance measures

    Parameters
    ----------
    model: a parametrized regression model
    X_train, y_train: train data
    X_test, y_test: test data
    show_plot: show scatterplot of pred vs true, optional, default=True

    Returns
    -------
    the fitted model
    shows a scatterplot of y_test vs y_pred with a diagonal line, indicating identity
    prints r2_score and mean_squared_error
    """

    from sklearn.metrics import r2_score
    from sklearn.metrics import mean_squared_error

    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print('R2 = %0.4f' % r2_score(y_test, y_pred))
    # the docstring promises both measures, so report the MSE as well
    print('MSE = %0.4f' % mean_squared_error(y_test, y_pred))

    if show_plot:
        plt.figure(figsize=(6, 6))
        ax = sns.scatterplot(x=y_test, y=y_pred)
        ax.set(xlabel='y_test', ylabel='y_pred')
        ls = np.linspace(min(y_test), max(y_test), 100)
        plt.plot(ls, ls, color='black', linestyle='dashed')
        ax.set_title(model.__class__.__name__)
        plt.show()

    return model


def show_pred_on_synth(model, X, y, X_synth, param_str):
    """ shows behaviour of univariate ML regression on synthetic dataset

    Parameters
    ----------
    model: a parametrized regression model
    X, y: data for univariate regression (X: DataFrame with a column 'X')
    X_synth: synthetic feature (2-D array, first column is plotted)
    param_str: parameter description for title

    Returns
    -------
    None; shows a scatterplot of X, y with the prediction values for X_synth
    """

    model.fit(X.to_numpy(), y)
    y_pred = model.predict(X_synth)

    ax = sns.scatterplot(x=X['X'], y=y)
    ax = sns.lineplot(x=X_synth[:, 0], y=y_pred, color='orange')
    ax.set_title(model.__class__.__name__ + ' : ' + param_str)
    ax.set(xlabel='X', ylabel='y')
    plt.show()
@@ -0,0 +1,58 @@
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

from bfh_cas_pml import prep_data
from pathlib import Path
from sklearn.tree import DecisionTreeClassifier


DATA = Path(__file__).resolve().parent.parent / "data" / "bank_data_prep.csv"


def sweep(X_train, y_train, X_test, y_test, params):
    """One sweep over a min_impurity_decrease range → list of test scores."""
    model = DecisionTreeClassifier(random_state=1234)
    scores = []
    for p in params:
        model.set_params(min_impurity_decrease=p)
        model.fit(X_train, y_train)
        scores.append(model.score(X_test, y_test))
    return scores


def report(params, scores, name):
    """Print best score + parameter to the console, plot curve + peak marker."""
    # find best score and best param (first maximum wins on ties)
    best_s = max(scores)
    best_p = params[scores.index(best_s)]
    print(f"best_score: {name} -> {best_s}")
    print(f"best_param: {name} -> {best_p}")

    # plot
    plt.figure()  # separate figure per sweep
    sns.lineplot(x=params, y=scores)
    plt.scatter(best_p, best_s, color="black", zorder=5)  # mark the peak
    plt.xlabel("min_impurity_decrease")
    plt.ylabel("accuracy")
    plt.title(name)
    plt.savefig(f"{name}.png", dpi=120, bbox_inches="tight")
    plt.close()  # release the figure once it has been saved


def main():
    X_train, X_test, y_train, y_test = prep_data(str(DATA), "y", seed=1234)

    sweeps = [
        # 1) coarse
        ("dtc_tuning_coarse", np.arange(0, 0.02, 0.001)),
        # 2) fine (adjust the range to the coarse findings)
        ("dtc_tuning_fine", np.arange(0, 0.004, 0.0002)),
    ]

    for name, params in sweeps:
        scores = sweep(X_train, y_train, X_test, y_test, params)
        report(params, scores, name)


if __name__ == "__main__":
    main()