feature(workshops): add workshop 6
@@ -0,0 +1,102 @@
# Workshop 06: RandomForestClassifier Parameter Tuning

## Task

Investigate how three tuning parameters of the `RandomForestClassifier` affect
the achieved accuracy (prepared bank dataset, identical to WS5), each over a
prescribed range of values:

| Parameter               | Value range               | Type  |
|-------------------------|---------------------------|-------|
| `n_estimators`          | `range(100, 500, 50)`     | int   |
| `max_features`          | `range(1, 11)`            | int   |
| `min_impurity_decrease` | `np.arange(0, 0.1, 0.01)` | float |

In addition: assess the effect of `random_state` and (bonus question) determine
which of the remaining parameters are not tuning parameters.

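The ranges in the table exclude their stop value, which is easy to overlook when reading the results. A minimal sketch that materializes the three grids (the name `grids` is only illustrative):

```python
import numpy as np

# The three parameter grids from the task table, materialized as lists.
grids = {
    "n_estimators": list(range(100, 500, 50)),
    "max_features": list(range(1, 11)),
    "min_impurity_decrease": np.arange(0, 0.1, 0.01).tolist(),
}

for name, values in grids.items():
    # range() and np.arange() both exclude the stop value.
    print(f"{name}: {len(values)} values from {values[0]} to {values[-1]}")
```

So the largest forest in the sweep has 450 trees, and the largest `min_impurity_decrease` is roughly 0.09 rather than 0.1.
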
## Approach

The sweep and plot scaffolding from WS5 is reused. The only change was to
generalize the sweep function to an arbitrary parameter name (`set_params` via
dict unpacking), so that one function covers all three parameters. Each sweep
keeps the other parameters at their defaults and varies only the one under
investigation.

Per sweep: a line plot (parameter vs. accuracy) with the maximum marked, plus
console output of the best score and the corresponding parameter value.

`n_jobs=-1` is set: with up to 450 trees per fit and several fits per sweep,
parallelizing across all cores is practically mandatory here, not optional.
`n_jobs` is purely a performance switch and has no influence on the result.

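The claim that `n_jobs` does not influence the result can be checked empirically. The following is a minimal sketch under assumptions: it uses synthetic data from `make_classification` instead of the bank dataset, a small forest to keep it fast, and a hypothetical helper `fit_forest`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: not the workshop's bank dataset).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)


def fit_forest(n_jobs):
    """Fit a small forest with a fixed seed and the given degree of parallelism."""
    model = RandomForestClassifier(n_estimators=20, random_state=1234, n_jobs=n_jobs)
    return model.fit(X, y)


serial = fit_forest(n_jobs=1)
parallel = fit_forest(n_jobs=2)

# Compare the two forests tree by tree (node counts) and by their predictions.
same_structure = all(
    a.tree_.node_count == b.tree_.node_count
    for a, b in zip(serial.estimators_, parallel.estimators_)
)
same_predictions = bool((serial.predict(X) == parallel.predict(X)).all())
print("same tree sizes:", same_structure)
print("same predictions:", same_predictions)
```

With a fixed `random_state`, both fits should produce the same trees; only the wall-clock time differs.
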
## Results

| Parameter               | Best score | At value |
|-------------------------|------------|----------|
| `n_estimators`          | 0.8792     | 400      |
| `max_features`          | 0.8780     | 6        |
| `min_impurity_decrease` | 0.8750     | 0.0      |

Interpretation of the individual curves:

**`n_estimators`**: over the entire range the curve only moves between 0.875
and 0.879. This fluctuation is within the range of seed noise and is not a
genuine optimum. The defensible statement is: plateau from ~150 trees onward,
after which more trees only cost compute time without added value. The nominal
"peak" at 400 should not be interpreted as an optimal value.

**`max_features`**: here there is real signal: a rise from ~0.857 (at 1) to a
broad maximum from ~4 features onward. The best value (6) is close to the
sklearn default `sqrt(n_features)` for classification, which supports the
default. Values that are too small make the individual trees too random
(underfit); values that are too large increase the correlation between the
trees and reduce the ensemble benefit.

**`min_impurity_decrease`**: best result at 0 (no pruning), followed by a
monotonic drop to a plateau at ~0.757. This is the most instructive result of
the workshop and the direct contrast to WS5: for the single DecisionTree,
pre-pruning reduced overfitting and improved test accuracy. In the random
forest, bagging already provides this variance control at the ensemble level.
Pruning the individual trees removes the variance they are meant to have and
can therefore practically only lower the accuracy.

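One way to report a plateau instead of a noisy argmax is to pick the smallest parameter value whose score lies within a tolerance of the best score. The sketch below is only an illustration: the helper `smallest_within_tolerance`, the tolerance, and the toy numbers are assumptions made for this example and are not the workshop's measurements.

```python
def smallest_within_tolerance(params, scores, tol=0.005):
    """Return the first parameter value whose score is within `tol` of the best."""
    best = max(scores)
    for p, s in zip(params, scores):
        if best - s <= tol:
            return p


# Toy numbers, made up for this sketch (NOT results from the workshop).
toy_params = [10, 20, 30, 40, 50]
toy_scores = [0.800, 0.849, 0.851, 0.850, 0.852]

argmax_value = toy_params[toy_scores.index(max(toy_scores))]
plateau_value = smallest_within_tolerance(toy_params, toy_scores)
print("argmax:", argmax_value)
print("smallest value within tolerance:", plateau_value)
```

The tolerance would ideally be derived from the observed seed-to-seed spread rather than chosen by hand.
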
## Effect of `random_state`

The random forest has two sources of randomness: bootstrap sampling (which rows
each tree sees) and random feature selection (which features are candidates at
each split). `random_state` seeds both and makes the run reproducible.

The effect of the specific seed decreases as `n_estimators` grows: with few
trees the accuracy varies noticeably from seed to seed, with many trees this
averages out. Part of the jaggedness in the `n_estimators` curve is exactly
this seed variation and not real signal (see the interpretation above).

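The seed-to-seed spread can be measured directly by refitting with several seeds at a fixed forest size. This is a minimal sketch under assumptions: synthetic data instead of the bank dataset, only five seeds, and a hypothetical helper `seed_spread`. The printed numbers will differ from the workshop's.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the workshop's bank dataset).
X, y = make_classification(n_samples=600, n_features=12, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=2 / 3, random_state=0)


def seed_spread(n_estimators, seeds=range(5)):
    """Test accuracy for several seeds at a fixed forest size."""
    scores = []
    for seed in seeds:
        model = RandomForestClassifier(n_estimators=n_estimators, random_state=seed)
        scores.append(model.fit(X_train, y_train).score(X_test, y_test))
    return np.array(scores)


# Compare a very small forest with a larger one.
spreads = {n: seed_spread(n) for n in (5, 100)}
for n, s in spreads.items():
    print(f"n_estimators={n}: mean={s.mean():.4f}, std={s.std():.4f}")
```

Comparing the standard deviation at both sizes gives a rough noise band against which differences in a sweep can be judged.
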
## Bonus question: non-tuning parameters

Criterion: a tuning parameter shifts the bias-variance tradeoff or the capacity
of the model. Parameters that only control infrastructure, reproducibility,
logging, workflow, or reporting are not tuning parameters. From
`model.get_params()` this applies to:

- `random_state`: seed (reproducibility)
- `n_jobs`: parallelization (performance)
- `verbose`: logging output
- `warm_start`: workflow switch for incremental fitting
- `oob_score`: only switches on the out-of-bag estimate as reporting, does not
  change the fitted model

Borderline cases such as `bootstrap`, `class_weight`, or `max_samples`, on the
other hand, do affect the model and therefore count as tuning parameters.

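The classification above can be applied to the output of `get_params()` programmatically. A minimal sketch, where the set name `NON_TUNING` is only illustrative and the exact list of remaining parameters depends on the installed scikit-learn version:

```python
from sklearn.ensemble import RandomForestClassifier

# Parameters classified as non-tuning in the list above.
NON_TUNING = {"random_state", "n_jobs", "verbose", "warm_start", "oob_score"}

all_params = RandomForestClassifier().get_params()
tuning_candidates = sorted(set(all_params) - NON_TUNING)

print("non-tuning:", sorted(NON_TUNING & set(all_params)))
print("tuning candidates:", tuning_candidates)
```
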
## Caveats / Deviations

- **One-at-a-time tuning**: each parameter was varied on its own, with the
  others at their defaults. Interactions between parameters are therefore not
  captured; the joint optimum can differ from the combination of the three
  individual best values. A joint search (`GridSearchCV`, cf. WS4) would be
  the right tool for that.
- **Optimistic bias**: as in WS4/WS5, tuning is done directly against the test
  set. The reported best values are therefore optimistically biased; a
  validation split or cross-validation would be the clean approach.
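
Both caveats can be addressed at once with a joint, cross-validated search. The following is a minimal sketch under assumptions: synthetic data instead of the bank dataset and a deliberately tiny grid that is not the workshop's value ranges, chosen only so the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption: not the workshop's bank dataset).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=2 / 3, random_state=0)

# Deliberately tiny grid so the sketch runs quickly; not the workshop's ranges.
param_grid = {
    "n_estimators": [20, 40],
    "max_features": [2, 4],
    "min_impurity_decrease": [0.0, 0.01],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=1234),
    param_grid,
    cv=3,  # cross-validation on the training data only
    scoring="accuracy",
)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("cv accuracy:", round(search.best_score_, 4))
# The test set is used exactly once, after the search has finished.
test_accuracy = search.score(X_test, y_test)
print("test accuracy:", round(test_accuracy, 4))
```

Because all combinations are evaluated, interactions between the three parameters are covered, and the test accuracy is no longer used for model selection.
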
File diff suppressed because it is too large
@@ -0,0 +1,65 @@
{
  "nodes": {
    "devenv": {
      "locked": {
        "dir": "src/modules",
        "lastModified": 1780543372,
        "narHash": "sha256-FCGxk82Lc4koWcFw5xgr+W5vbwLVFLCnSMwm2gQOgr0=",
        "owner": "cachix",
        "repo": "devenv",
        "rev": "f693b472c731e7dda69402daa88c06369d54fd3a",
        "type": "github"
      },
      "original": {
        "dir": "src/modules",
        "owner": "cachix",
        "repo": "devenv",
        "type": "github"
      }
    },
    "nixpkgs": {
      "inputs": {
        "nixpkgs-src": "nixpkgs-src"
      },
      "locked": {
        "lastModified": 1778507786,
        "narHash": "sha256-HzSQCKMsMr8r55LwM1JuzIOB+8bzk0FEv6sItKvsfoY=",
        "owner": "cachix",
        "repo": "devenv-nixpkgs",
        "rev": "8f24a228a782e24576b155d1e39f0d914b380691",
        "type": "github"
      },
      "original": {
        "owner": "cachix",
        "ref": "rolling",
        "repo": "devenv-nixpkgs",
        "type": "github"
      }
    },
    "nixpkgs-src": {
      "flake": false,
      "locked": {
        "lastModified": 1778274207,
        "narHash": "sha256-I4puXmX1iovcCHZlRmztO3vW0mAbbRvq4F8wgIMQ1MM=",
        "owner": "NixOS",
        "repo": "nixpkgs",
        "rev": "b3da656039dc7a6240f27b2ef8cc6a3ef3bccae7",
        "type": "github"
      },
      "original": {
        "owner": "NixOS",
        "ref": "nixpkgs-unstable",
        "repo": "nixpkgs",
        "type": "github"
      }
    },
    "root": {
      "inputs": {
        "devenv": "devenv",
        "nixpkgs": "nixpkgs"
      }
    }
  },
  "root": "root",
  "version": 7
}
@@ -0,0 +1,31 @@
{ pkgs, ... }:

{
  # Native libs that the pip-wheel-installed numpy/scipy/matplotlib stack
  # dlopen()s at runtime. zlib was already needed in W3/W4 (libz.so.1);
  # stdenv.cc.cc.lib provides libstdc++ for the scipy/sklearn wheels.
  packages = [
    pkgs.zlib
    pkgs.stdenv.cc.cc.lib
  ];

  languages.python = {
    enable = true;
    venv.enable = true;
    venv.requirements = ''
      pandas
      numpy
      scikit-learn
      matplotlib
      seaborn
    '';
  };

  # Loader path for the native libs above. If an import still fails with
  # "ImportError: libXYZ.so.N", add the providing pkgs.<package> to packages
  # AND here; same pattern as the W3 fix.
  env.LD_LIBRARY_PATH = pkgs.lib.makeLibraryPath [
    pkgs.zlib
    pkgs.stdenv.cc.cc.lib
  ];
}
Binary file not shown (image, 32 KiB).
Binary file not shown (image, 31 KiB).
Binary file not shown (image, 35 KiB).
Binary file not shown.
@@ -0,0 +1,193 @@
"""
Useful functions for example notebooks and workshop solutions
of the course Practical Machine Learning - Supervised Learning
Bern University of Applied Sciences (BFH)
"""


# ========== Packages ==========

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


# ========== Functions ==========

def prep_data(dataset, target, train_ratio=2 / 3, seed=None, sep=','):
    """ read and prepare real data from the current directory
    performs
        read data
        features - target - split
        train - test - split

    Parameters
    ----------
    dataset: name of dataset in csv format
    target: name of target column
    train_ratio (2 / 3): share of rows used for the train set (optional)
    seed (None): random seed for split (optional)
    sep (,): separator of csv file (optional)

    Returns
    -------
    X_train: feature matrix of train set
    X_test: feature matrix of test set
    y_train: target vector of train set
    y_test: target vector of test set
    """

    ## load data
    data = pd.read_csv(dataset, sep=sep)

    ## features - target - split
    X = data.drop(target, axis=1)
    y = data[target]

    ## train - test - split
    from sklearn.model_selection import train_test_split
    return train_test_split(
        X,
        y,
        train_size=train_ratio,
        random_state=seed)


def prep_demo_data(dataset, target):
    """ read demo data from the current directory
    performs
        read data
        features - target - split

    Parameters
    ----------
    dataset: name of dataset in csv format, ',' separated
    target: name of target column

    Returns
    -------
    X: feature matrix
    y: target vector
    """

    ## load data
    data = pd.read_csv(dataset)

    ## features - target - split
    X = data.drop(target, axis=1)
    y = data[target]

    return X, y


def inspect_decision_tree_model(model_def, features, target, figsize=(6, 6)):
    """ train a DecisionTreeClassifier and visualize the tree

    prints some model attributes from within the function

    Parameters
    ----------
    model_def: DecisionTreeClassifier object with set parameters
    features: feature matrix
    target: target vector
    figsize: size of image, optional, default = (6, 6)

    Returns
    -------
    visualization of the trained tree
    prints model attributes
    """

    from sklearn.tree import plot_tree

    model = model_def
    model.fit(features, target)

    # note: the score is computed on the training data passed in
    print('TREE DIAGNOSTICS:')
    print('depth  :', model.get_depth())
    print('leaves :', model.get_n_leaves())
    print('score  :', model.score(features, target))

    plt.figure(figsize=figsize)
    plot_tree(model,
              feature_names=list(features.columns),
              # class labels as strings, so non-string classes can be drawn too
              class_names=[str(c) for c in model.classes_],
              filled=True)


def test_regression_model(model, X_train, y_train, X_test, y_test, show_plot=True):
    """ trains a regression model and evaluates it on test data

    performs
    - training on train data
    - prediction on test data
    - calculate performance measures

    Parameters
    ----------
    model: a parametrized regression model
    X_train, y_train: train data
    X_test, y_test: test data
    show_plot: show scatterplot of pred vs true, optional, default=True

    Returns
    -------
    the fitted model
    shows a scatterplot of y_test vs y_pred with a diagonal line, indicating identity
    prints r2_score
    """

    from sklearn.metrics import r2_score

    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print('R2 = %0.4f' % (r2_score(y_test, y_pred)))

    if show_plot:
        plt.figure(figsize=(6, 6))
        ax = sns.scatterplot(x=y_test, y=y_pred)
        ax.set(xlabel='y_test', ylabel='y_pred')
        # dashed identity line: perfect predictions would lie on it
        ls = np.linspace(min(y_test), max(y_test), 100)
        plt.plot(ls, ls, color='black', linestyle='dashed')
        ax.set_title(model.__class__.__name__)
        plt.show()

    return model


def show_pred_on_synth(model, X, y, X_synth, param_str):
    """ shows behaviour of univariate ML regression on synthetic dataset

    Parameters
    ----------
    model: a parametrized regression model
    X, y: data for univariate regression (X needs a column named 'X')
    X_synth: synthetic feature
    param_str: parameter description for title

    Returns
    -------
    a scatterplot of X, y, with the prediction values for X_synth
    """

    model.fit(X.to_numpy(), y)
    y_pred = model.predict(X_synth)

    ax = sns.scatterplot(x=X['X'], y=y)
    ax = sns.lineplot(x=X_synth[:, 0], y=y_pred, color='orange')
    ax.set_title(model.__class__.__name__ + ' : ' + param_str)
    ax.set(xlabel='X', ylabel='y')
    plt.show()
@@ -0,0 +1,57 @@
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

from bfh_cas_pml import prep_data
from pathlib import Path
from sklearn.ensemble import RandomForestClassifier


DATA = Path(__file__).resolve().parent.parent / "data" / "bank_data_prep.csv"


def sweep(param_name, params, X_train, y_train, X_test, y_test):
    """Sweep one parameter over a range of values; return the list of test scores."""
    model = RandomForestClassifier(random_state=1234, n_jobs=-1)
    scores = []
    for p in params:
        model.set_params(**{param_name: p})  # dict unpacking instead of a fixed keyword
        model.fit(X_train, y_train)
        scores.append(model.score(X_test, y_test))
    return scores


def report(params, scores, name):
    """Print the best score and parameter to the console; plot the curve with a peak marker."""
    # find best score and best param
    best_p = params[scores.index(max(scores))]
    best_s = max(scores)
    print(f"best_score: {name} -> {best_s}")
    print(f"best_param: {name} -> {best_p}")

    # plot
    plt.figure()  # separate figure per sweep
    sns.lineplot(x=params, y=scores)
    plt.scatter(best_p, best_s, color="black", zorder=5)  # mark the peak
    plt.xlabel(name)
    plt.ylabel("accuracy")
    plt.title("random forest parameter tuning")
    plt.savefig(f"{name}.png", dpi=120, bbox_inches="tight")


def main():
    X_train, X_test, y_train, y_test = prep_data(str(DATA), "y", seed=1234)

    sweeps = [
        ("n_estimators", range(100, 500, 50)),
        ("max_features", range(1, 11)),
        ("min_impurity_decrease", np.arange(0, 0.1, 0.01)),
    ]

    for name, params in sweeps:
        scores = sweep(name, params, X_train, y_train, X_test, y_test)
        report(params, scores, name)


if __name__ == "__main__":
    main()