A new beginning

Varyngoth
2025-11-24 23:15:00 -04:00
commit 59139bab3f
10 changed files with 1070 additions and 0 deletions

ACENET_HPC_Guide.md Normal file

@@ -0,0 +1,135 @@
# Using Compute Canada / ACENET HPC
This guide explains how to connect to the Digital Research Alliance of Canada (Compute Canada) or ACENET clusters, create a working directory in scratch, transfer files with Globus, and submit jobs using SLURM.
## 1. Connect to the HPC via SSH
1. Determine which cluster to use (examples):
- Graham: `graham.computecanada.ca`
- Cedar: `cedar.computecanada.ca`
- Beluga: `beluga.computecanada.ca`
- Niagara: `niagara.scinet.utoronto.ca`
- ACENET (Siku): `siku.ace-net.ca`
2. Open a terminal and connect via SSH:
```bash
ssh username@graham.computecanada.ca
```
3. When prompted, confirm the host fingerprint and enter your password.
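Optionally, you can store the connection details in `~/.ssh/config` so a short alias is enough (the alias name and username below are placeholders):

```
Host graham
    HostName graham.computecanada.ca
    User username
```

After this, `ssh graham` connects directly.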
---
## 2. Create a Folder in Scratch
Your `$SCRATCH` directory is a temporary workspace for large data and computations. Files that have not been accessed for 60 days are subject to automatic purging.
After logging in:
```bash
cd $SCRATCH
mkdir my_project
cd my_project
```
Confirm your path:
```bash
pwd
# Example output: /scratch/username/my_project
```
---
## 3. Install and Use Globus for File Transfers
Globus is a fast, reliable tool for large file transfers. It requires a small local agent called **Globus Connect Personal**.
### Install Globus Connect Personal
- **Linux:**
```bash
wget https://downloads.globus.org/globus-connect-personal/linux/stable/globusconnectpersonal-latest.tgz
tar xzf globusconnectpersonal-latest.tgz
cd globusconnectpersonal*
./globusconnectpersonal -setup
```
- **macOS:**
Download and install from: [https://www.globus.org/globus-connect-personal](https://www.globus.org/globus-connect-personal)
- **Windows:**
Download the installer from the same link and follow the setup wizard.
After installation, your local computer will appear as a **Globus endpoint**.
### Transfer Files
1. Visit [https://app.globus.org](https://app.globus.org) and log in using **Compute Canada credentials**.
2. In the web app, choose two endpoints:
- **Source:** Your local computer or institutional storage.
- **Destination:** Your HPC endpoint (for example, *Compute Canada Graham Scratch*).
3. Navigate to your target scratch folder (`/scratch/username/my_project`).
4. Select files and click **Start Transfer**.
Globus will handle transfers asynchronously and resume interrupted transfers automatically.
---
## 4. Submit Jobs to ACENET with SLURM
Job submissions use the SLURM scheduler. Create a batch file describing your job resources and commands.
### Example job script (`job.slurm`)
```bash
#!/bin/bash
#SBATCH --job-name=my_analysis
#SBATCH --account=def-yourprof
#SBATCH --time=2:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --mem=8G
#SBATCH --output=output_%j.log
module load python/3.11
source ~/myenv/bin/activate
python my_script.py
```
### Submit and Monitor Jobs
```bash
sbatch job.slurm # Submit job
squeue -u username # Check status
scancel job_id # Cancel job
```
### View Results
After completion, check output logs:
```bash
less output_<jobid>.log
```
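For parameter sweeps, SLURM job arrays let one script launch many related tasks. A sketch using the same placeholder account as the example above (`--task-id` is a hypothetical argument of your own script):

```bash
#!/bin/bash
#SBATCH --job-name=sweep
#SBATCH --account=def-yourprof
#SBATCH --time=1:00:00
#SBATCH --mem=4G
#SBATCH --array=0-9
#SBATCH --output=sweep_%A_%a.log

# SLURM sets $SLURM_ARRAY_TASK_ID to 0..9, one value per array task
python my_script.py --task-id "$SLURM_ARRAY_TASK_ID"
```

Submit it once with `sbatch`; the scheduler expands it into ten tasks.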
---
## 5. Useful Commands
```bash
module avail # List available software modules
module load python/3.11 # Load a module
df -h $SCRATCH # Check scratch usage
quota -s # Check your disk quota
```
---
## 6. References
- Alliance Docs: [https://docs.alliancecan.ca/wiki/Technical_documentation](https://docs.alliancecan.ca/wiki/Technical_documentation)
- ACENET Training: [https://www.ace-net.ca/training/](https://www.ace-net.ca/training/)
- Globus Setup: [https://www.globus.org/globus-connect-personal](https://www.globus.org/globus-connect-personal)

README.md Normal file

@@ -0,0 +1,62 @@
# Explanation-Aware Optimization and AutoML (DEAP + SHAP Stability)
This project implements an **AutoML framework** that uses **DEAP's NSGA-II** for multi-objective optimization, balancing **model accuracy** and **SHAP-based stability**.
It supports both **classification** and **regression** datasets via OpenML and sklearn.
All results are tracked with **MLflow**.
---
## 1. Environment Setup (macOS / Linux)
### Create and activate a virtual environment
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip wheel setuptools
pip install \
numpy==1.26.4 \
pandas==1.5.3 \
scikit-learn==1.3.2 \
shap==0.45.0 \
deap==1.4.1 \
openml==0.14.2 \
mlflow==2.11.3 \
matplotlib==3.7.5
```
## 2. Running Experiments
### Classification: Adult Dataset
```bash
python run_deap.py \
--dataset adult \
--generations 5 \
--pop-size 24 \
--cv-folds 3
```
### Regression: California Housing Dataset
```bash
python run_deap.py \
--dataset cal_housing \
--generations 5 \
--pop-size 24 \
--cv-folds 3
```
Results are saved under:
```bash
runs/<dataset>/pareto_front.csv
```
## 3. Viewing Results in MLflow
```bash
mlflow ui --backend-store-uri ./mlruns --host 0.0.0.0 --port 5000
```
Then open http://localhost:5000 in your browser.
You can visualize:
- MSE-like score (lower is better)
- SHAP stability (higher is better)
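The stability axis is computed as `1 / (1 + agg_std)`, where `agg_std` is the mean standard deviation of SHAP values across CV folds (see `src/stability.py`), so it lives in (0, 1] and equals 1.0 only when explanations are identical across folds:

```python
def stability_score(agg_std: float) -> float:
    # maps aggregate SHAP disagreement into (0, 1]; zero disagreement -> 1.0
    return 1.0 / (1.0 + agg_std)

print(stability_score(0.0))  # 1.0
print(stability_score(1.0))  # 0.5
```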

requirements.txt Normal file

@@ -0,0 +1,8 @@
scikit-learn
openml
deap
mlflow
shap
numpy
pandas
matplotlib

run_deap.py Normal file

@@ -0,0 +1,170 @@
import argparse
import random
import pickle
from pathlib import Path
import numpy as np
import mlflow
from deap import algorithms
from deap.tools.emo import sortNondominated
import pandas as pd
from src.data_openml import load_dataset
from src.search.nsga_deap import build_toolbox, decode
from src.preprocessing import build_preprocessor
from src.models import make_model
from src.stability import compute_shap_matrix
def save_checkpoint(path, gen, pop, seed):
state = {
"gen": gen,
"pop": pop,
"py_random_state": random.getstate(),
"np_random_state": np.random.get_state(),
"seed": seed,
}
with open(path, "wb") as f:
pickle.dump(state, f)
def load_checkpoint(path):
with open(path, "rb") as f:
state = pickle.load(f)
random.setstate(state["py_random_state"])
np.random.set_state(state["np_random_state"])
return state["gen"], state["pop"], state["seed"]
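`save_checkpoint`/`load_checkpoint` above persist both Python's and NumPy's RNG state so a resumed run continues the same random sequence. A minimal round-trip of that idea using only the standard library (the file name is illustrative):

```python
import os
import pickle
import random
import tempfile

path = os.path.join(tempfile.mkdtemp(), "rng_state.pkl")

random.seed(42)
with open(path, "wb") as f:
    pickle.dump(random.getstate(), f)   # snapshot the generator

a = [random.random() for _ in range(3)]

with open(path, "rb") as f:
    random.setstate(pickle.load(f))     # rewind to the snapshot

b = [random.random() for _ in range(3)]
assert a == b  # the "resumed" run replays identical draws
```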
def main():
ap = argparse.ArgumentParser()
ap.add_argument("--dataset", required=True, choices=["adult", "cal_housing"])
ap.add_argument("--generations", type=int, default=10)
ap.add_argument("--pop-size", type=int, default=24)
ap.add_argument("--seed", type=int, default=42)
ap.add_argument("--cv-folds", type=int, default=3)
ap.add_argument("--experiment", default="deap_nsga_shap")
ap.add_argument("--checkpoint-every", type=int, default=5)
ap.add_argument(
"--shap-pf-eval-rows",
type=int,
default=512,
help="Number of rows from the dataset to use when saving SHAP for Pareto models",
)
args = ap.parse_args()
# data and experiment
X, y, task = load_dataset(args.dataset, random_state=args.seed)
mlflow.set_experiment(args.experiment)
outdir = Path("runs") / args.dataset
outdir.mkdir(parents=True, exist_ok=True)
ckpt_path = outdir / "checkpoint.pkl"
# seed RNGs
random.seed(args.seed)
np.random.seed(args.seed)
# toolbox for this run
toolbox = build_toolbox(
X,
y,
task,
seed=args.seed,
cv_folds=args.cv_folds,
mlflow_experiment=args.experiment,
)
# initial population or resume from checkpoint
if ckpt_path.exists():
start_gen, pop, loaded_seed = load_checkpoint(ckpt_path)
if loaded_seed != args.seed:
print(
f"Warning: checkpoint seed {loaded_seed} differs from current seed {args.seed}"
)
print(f"Resuming from checkpoint at generation {start_gen}")
else:
pop = toolbox.population(n=args.pop_size)
fits = list(map(toolbox.evaluate, pop))
for ind, fit in zip(pop, fits):
ind.fitness.values = fit
start_gen = 0
save_checkpoint(ckpt_path, start_gen, pop, args.seed)
print(f"Initial checkpoint saved at generation {start_gen}")
# GA loop
for gen in range(start_gen, args.generations):
offspring = algorithms.varAnd(pop, toolbox, cxpb=0.7, mutpb=0.2)
fits = list(map(toolbox.evaluate, offspring))
for ind, fit in zip(offspring, fits):
ind.fitness.values = fit
pop = toolbox.select(pop + offspring, k=args.pop_size)
if (gen + 1) % args.checkpoint_every == 0:
save_checkpoint(ckpt_path, gen + 1, pop, args.seed)
print(f"Checkpoint saved at generation {gen + 1}")
# final Pareto front
pf = sortNondominated(pop, len(pop), first_front_only=True)[0]
rows = []
for ind in pf:
algo, model_params, pre_cfg = decode(ind)
rows.append(
{
"algo": algo,
"mse_like": ind.fitness.values[0],
"stability": ind.fitness.values[1],
**{f"m_{k}": v for k, v in model_params.items()},
**{f"p_{k}": v for k, v in pre_cfg.items()},
}
)
pareto_path = outdir / "pareto_front.csv"
pd.DataFrame(rows).to_csv(pareto_path, index=False)
print(f"Saved Pareto front to {pareto_path}")
shap_dir = outdir / "shap"
shap_dir.mkdir(exist_ok=True)
eval_rows = min(args.shap_pf_eval_rows, len(X))
rng = np.random.RandomState(args.seed)
eval_idx = rng.choice(len(X), size=eval_rows, replace=False)
X_eval_shap = X.iloc[eval_idx]
y_full = y
for i, ind in enumerate(pf):
algo, model_params, pre_cfg = decode(ind)
fixed_poly_degree = pre_cfg.get("poly_degree", 1)
fixed_k = pre_cfg.get("select_k", None)
preproc = build_preprocessor(
X,
task,
pre_cfg,
fixed_k=fixed_k,
fixed_poly_degree=fixed_poly_degree,
)
model = make_model(task, algo, model_params, random_state=args.seed)
from sklearn.pipeline import Pipeline as SkPipeline
pipe = SkPipeline([("pre", preproc), ("model", model)])
shap_vals, t_fit, t_shap, feat_names = compute_shap_matrix(
pipe,
X_fit=X,
y_fit=y_full,
X_eval=X_eval_shap,
task_type=task,
bg_size=128,
max_eval_rows=eval_rows,
rng_seed=args.seed,
)
np.save(shap_dir / f"pf_{i}_shap_vals.npy", shap_vals)
np.save(shap_dir / f"pf_{i}_feat_names.npy", np.asarray(feat_names))
print(f"Saved SHAP arrays for {len(pf)} Pareto models under {shap_dir}")
if __name__ == "__main__":
main()

src/data_openml.py Normal file

@@ -0,0 +1,16 @@
from sklearn.datasets import fetch_california_housing, fetch_openml
def load_dataset(name: str, random_state: int = 42):
name = name.lower()
if name == "cal_housing":
ds = fetch_california_housing(as_frame=True)
X = ds.data
y = ds.target
return X, y, "regression"
elif name == "adult":
ds = fetch_openml(data_id=1590, as_frame=True) # Adult
X = ds.data
y = (ds.target == ">50K").astype(int)
return X, y, "classification"
else:
raise ValueError("dataset must be adult or cal_housing")
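`load_dataset` turns the Adult target, which OpenML delivers as strings, into 0/1 labels with a comparison-then-cast. The same idiom on a toy Series (pandas is already a project dependency):

```python
import pandas as pd

target = pd.Series(["<=50K", ">50K", ">50K", "<=50K"])
y = (target == ">50K").astype(int)  # elementwise comparison, then bool -> int
print(y.tolist())  # [0, 1, 1, 0]
```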

src/models.py Normal file

@@ -0,0 +1,55 @@
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier, GradientBoostingRegressor, GradientBoostingClassifier
from sklearn.neural_network import MLPRegressor, MLPClassifier
def make_model(task, algo, params, random_state=0):
if task == "regression":
if algo == "rf":
return RandomForestRegressor(
n_estimators=int(params["n_estimators"]),
max_depth=int(params["max_depth"]),
max_features=params["max_features"],
random_state=random_state,
n_jobs=1
)
elif algo == "gbt":
return GradientBoostingRegressor(
n_estimators=int(params["n_estimators"]),
learning_rate=float(params["learning_rate"]),
max_depth=int(params["max_depth"]),
random_state=random_state
)
elif algo == "mlp":
return MLPRegressor(
hidden_layer_sizes=tuple(params["hidden_layers"]),
activation=params["activation"],
alpha=float(params["alpha"]),
learning_rate_init=float(params["lr_init"]),
max_iter=int(params.get("max_iter", 200)),
random_state=random_state
)
else:
if algo == "rf":
return RandomForestClassifier(
n_estimators=int(params["n_estimators"]),
max_depth=int(params["max_depth"]),
max_features=params["max_features"],
random_state=random_state,
n_jobs=1
)
elif algo == "gbt":
return GradientBoostingClassifier(
n_estimators=int(params["n_estimators"]),
learning_rate=float(params["learning_rate"]),
max_depth=int(params["max_depth"]),
random_state=random_state
)
elif algo == "mlp":
return MLPClassifier(
hidden_layer_sizes=tuple(params["hidden_layers"]),
activation=params["activation"],
alpha=float(params["alpha"]),
learning_rate_init=float(params["lr_init"]),
max_iter=int(params.get("max_iter", 200)),
random_state=random_state
)
raise ValueError("Unknown algo")

src/objectives.py Normal file

@@ -0,0 +1,66 @@
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error, brier_score_loss
from .preprocessing import build_preprocessor
from .models import make_model
from .stability import compute_shap_matrix, shap_stability_from_matrices
def evaluate_config(X, y, task, algo, model_params, preproc_cfg, cv_folds=3, seed=42):
kf = KFold(n_splits=cv_folds, shuffle=True, random_state=seed)
losses = []
# this will store full SHAP matrices and feature names for stability
shap_mats_with_names = []
# pick a fixed evaluation pool for SHAP, same rows and order for all folds
rng = np.random.RandomState(seed)
max_eval_rows = 1024
eval_size = min(max_eval_rows, len(X))
eval_idx = rng.choice(len(X), size=eval_size, replace=False)
X_eval_fixed = X.iloc[eval_idx]
# probe preprocessor to compute a safe cap for k
fixed_poly_degree = preproc_cfg.get("fixed_poly_degree", preproc_cfg.get("poly_degree", 1))
probe_pre = build_preprocessor(X, task, preproc_cfg, fixed_k=None, fixed_poly_degree=fixed_poly_degree)
Xp = probe_pre.fit_transform(X, y)
n_after_prep = Xp.shape[1]
desired_k = preproc_cfg.get("select_k", None)
k_cap = None if desired_k is None else int(min(max(1, desired_k), n_after_prep))
for fold_idx, (tr, te) in enumerate(kf.split(X)):
preproc = build_preprocessor(X, task, preproc_cfg, fixed_k=k_cap, fixed_poly_degree=fixed_poly_degree)
model = make_model(task, algo, model_params, random_state=seed + fold_idx)
pipe = Pipeline([("pre", preproc), ("model", model)])
# 1) SHAP stability: always use the same X_eval_fixed for all folds
shap_vals, t_fit, t_shap, feat_names = compute_shap_matrix(
pipe,
X_fit=X.iloc[tr],
y_fit=y.iloc[tr],
X_eval=X_eval_fixed,
task_type=task,
)
shap_mats_with_names.append((shap_vals, feat_names))
# 2) Loss: still use standard CV split (te) for generalization
if task == "regression":
y_pred = pipe.predict(X.iloc[te])
loss = float(mean_squared_error(y.iloc[te], y_pred))
else:
if hasattr(pipe.named_steps["model"], "predict_proba"):
y_prob = pipe.predict_proba(X.iloc[te])[:, 1]
else:
scores = pipe.decision_function(X.iloc[te])
scores = (scores - scores.min()) / (scores.max() - scores.min() + 1e-8)
y_prob = scores
loss = float(brier_score_loss(y.iloc[te], y_prob))
losses.append(loss)
# instance level SHAP stability across folds
agg_std, stability, per_feat_std, per_inst_std = shap_stability_from_matrices(shap_mats_with_names)
mse_like = float(np.mean(losses))
return mse_like, float(stability), per_feat_std
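For classifiers without `predict_proba`, `evaluate_config` min-max rescales `decision_function` scores into [0, 1] before the Brier score (the epsilon guards against a zero range). The rescaling in isolation; note these are monotone squashed scores, not calibrated probabilities:

```python
import numpy as np

scores = np.array([-2.0, 0.0, 2.0])
probs = (scores - scores.min()) / (scores.max() - scores.min() + 1e-8)
# lowest score maps to 0, highest to (almost) 1, ordering is preserved
print(probs)
```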

src/preprocessing.py Normal file

@@ -0,0 +1,138 @@
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, RobustScaler, MinMaxScaler, PowerTransformer, PolynomialFeatures
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_classif, f_regression
from sklearn.feature_selection import VarianceThreshold
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np
class SafeSelectK(BaseEstimator, TransformerMixin):
def __init__(self, task: str, k=None):
self.task = task
self.k = k
self.selector_ = None
self.k_effective_ = None
self.support_mask_ = None
self.feature_names_in_ = None
self.feature_names_out_ = None
def fit(self, X, y=None):
if self.k is None:
self.selector_ = "passthrough"
self.feature_names_out_ = self.feature_names_in_
return self
n_feats = X.shape[1]
k_eff = int(min(max(1, self.k), n_feats))
score_func = f_classif if self.task == "classification" else f_regression
sel = SelectKBest(score_func=score_func, k=k_eff).fit(X, y)
self.selector_ = sel
self.k_effective_ = k_eff
mask = np.zeros(n_feats, dtype=bool)
mask[sel.get_support(indices=True)] = True
self.support_mask_ = mask
if self.feature_names_in_ is not None:
self.feature_names_out_ = self.feature_names_in_[mask]
return self
def set_feature_names_in(self, names):
self.feature_names_in_ = np.asarray(names)
def transform(self, X):
if self.selector_ == "passthrough":
return X
return self.selector_.transform(X)
def get_feature_names_out(self, input_features=None):
if getattr(self, "feature_names_out_", None) is not None:
return self.feature_names_out_
if getattr(self, "support_mask_", None) is not None and input_features is not None:
input_features = np.asarray(input_features)
return input_features[self.support_mask_]
return None
class ConstantFilter(BaseEstimator, TransformerMixin):
def __init__(self, eps=0.0):
self.eps = eps
self.mask_ = None
self.feature_names_in_ = None
self.feature_names_out_ = None
def fit(self, X, y=None):
X = np.asarray(X)
var = X.var(axis=0)
self.mask_ = var > self.eps
if self.feature_names_in_ is not None:
self.feature_names_out_ = np.asarray(self.feature_names_in_)[self.mask_]
return self
def set_feature_names_in(self, names):
self.feature_names_in_ = np.asarray(names)
def get_feature_names_out(self):
if self.feature_names_out_ is not None:
return self.feature_names_out_
# fallback when names were not set
return np.array([f"f{i}" for i, keep in enumerate(self.mask_) if keep])
def transform(self, X):
X = np.asarray(X)
return X[:, self.mask_]
def build_preprocessor(X_full, task, cfg, fixed_k=None, fixed_poly_degree=None):
cat_cols = X_full.select_dtypes(include=["object", "category", "bool"]).columns.tolist()
num_cols = [c for c in X_full.columns if c not in cat_cols]
num_imputer = SimpleImputer(strategy=cfg.get("num_impute_strategy", "median"))
cat_imputer = SimpleImputer(strategy=cfg.get("cat_impute_strategy", "most_frequent"))
scaler_name = cfg.get("scaler", "standard")
if scaler_name == "standard":
num_scaler = StandardScaler(with_mean=True, with_std=True)
elif scaler_name == "robust":
num_scaler = RobustScaler()
elif scaler_name == "minmax":
num_scaler = MinMaxScaler()
elif scaler_name == "power":
num_scaler = PowerTransformer(method="yeo-johnson")
else:
num_scaler = "passthrough"
poly_degree = fixed_poly_degree if fixed_poly_degree is not None else cfg.get("poly_degree", 1)
poly = PolynomialFeatures(degree=poly_degree, include_bias=False) if poly_degree > 1 else "passthrough"
# always fix categories from the full dataset
fixed_categories = None
if len(cat_cols) > 0:
fixed_categories = {c: sorted(X_full[c].dropna().astype(str).unique()) for c in cat_cols}
    ohe_kwargs = dict(handle_unknown="ignore")
    if fixed_categories is not None:
        ohe_kwargs["categories"] = [fixed_categories[c] for c in cat_cols]
    # sklearn >= 1.2 renamed `sparse` to `sparse_output`; assigning a dict key
    # never raises TypeError, so probe the constructor itself instead
    try:
        cat_encoder = OneHotEncoder(sparse_output=False, **ohe_kwargs)
    except TypeError:
        cat_encoder = OneHotEncoder(sparse=False, **ohe_kwargs)
num_steps = [("impute", num_imputer), ("scale", num_scaler), ("poly", poly)]
if int(cfg.get("use_vt", 0)):
num_steps.append(("vt", VarianceThreshold(threshold=float(cfg.get("vt_thr", 0.0)))))
ct = ColumnTransformer([
("num", Pipeline(steps=num_steps), num_cols),
("cat", Pipeline(steps=[("impute", cat_imputer), ("oh", cat_encoder)]), cat_cols),
])
select_k = fixed_k if fixed_k is not None else cfg.get("select_k", None)
selector = SafeSelectK(task=task, k=select_k)
pre = Pipeline([
("prep", ct),
("drop_const", ConstantFilter(eps=0.0)),
("select", selector),
])
return pre
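`ConstantFilter` exists because a one-hot column can be constant within a training fold (a category that never appears there), which breaks downstream selectors. The variance mask it builds, shown on a toy array:

```python
import numpy as np

X = np.array([[1.0, 0.0, 2.0],
              [3.0, 0.0, 2.0]])
mask = X.var(axis=0) > 0.0   # True only for non-constant columns
print(mask.tolist())         # [True, False, False]
print(X[:, mask])            # only the first column survives
```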

src/search/nsga_deap.py Normal file

@@ -0,0 +1,147 @@
import mlflow
from deap import base, creator, tools
from sklearn.utils import check_random_state
from src.objectives import evaluate_config
SCALERS = ["standard", "robust", "minmax", "power", "none"]
NUM_IMPUTE = ["median", "mean"]
CAT_IMPUTE = ["most_frequent"]
ALGOS = ["rf", "gbt", "mlp"]
def decode(ind):
i = 0
algo = ALGOS[int(ind[i]) % len(ALGOS)]
i += 1
scaler = SCALERS[int(ind[i]) % len(SCALERS)]
i += 1
num_imp = NUM_IMPUTE[int(ind[i]) % len(NUM_IMPUTE)]
i += 1
cat_imp = CAT_IMPUTE[int(ind[i]) % len(CAT_IMPUTE)]
i += 1
poly_degree = 1 + int(ind[i]) % 2
i += 1
use_selectk = int(ind[i]) % 2
i += 1
select_k = [None, 16, 32, 64, 128][int(ind[i]) % 5]
i += 1
if not use_selectk:
select_k = None
pre_cfg = {
"num_impute_strategy": num_imp,
"cat_impute_strategy": cat_imp,
"scaler": scaler,
"poly_degree": poly_degree,
"select_k": select_k,
}
if algo == "rf":
n_estimators = [100, 200, 300, 400, 500][int(ind[i]) % 5]
i += 1
max_depth = [2, 4, 6, 8, 10, 12][int(ind[i]) % 6]
i += 1
max_features = ["sqrt", "log2", None][int(ind[i]) % 3]
i += 1
params = {
"n_estimators": n_estimators,
"max_depth": max_depth,
"max_features": max_features,
}
elif algo == "gbt":
n_estimators = [100, 200, 300, 400, 500][int(ind[i]) % 5]
i += 1
max_depth = [2, 3, 4, 5][int(ind[i]) % 4]
i += 1
lr = [0.01, 0.02, 0.05, 0.1, 0.2][int(ind[i]) % 5]
i += 1
params = {
"n_estimators": n_estimators,
"max_depth": max_depth,
"learning_rate": lr,
}
else:
n_layers = [1, 2, 3][int(ind[i]) % 3]
i += 1
h = []
for _ in range(n_layers):
h.append([16, 32, 64, 128, 256][int(ind[i]) % 5])
i += 1
# skip unused gene slots if fewer than 3 layers
i += max(0, 3 - n_layers)
alpha = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2][int(ind[i]) % 5]
i += 1
lr_init = [1e-4, 5e-4, 1e-3, 5e-3, 1e-2][int(ind[i]) % 5]
i += 1
params = {
"hidden_layers": tuple(h),
"activation": "relu",
"alpha": alpha,
"lr_init": lr_init,
"max_iter": 200,
}
return algo, params, pre_cfg
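`decode` reads each gene as an unbounded integer and folds it onto a finite option list with a modulo, so crossover and mutation can never produce an invalid configuration. The idiom on its own (option list copied from above):

```python
SCALERS = ["standard", "robust", "minmax", "power", "none"]

def pick(options, gene):
    # any non-negative integer gene selects some valid option
    return options[int(gene) % len(options)]

print(pick(SCALERS, 7))        # 7 % 5 == 2 -> "minmax"
print(pick(SCALERS, 999_983))  # large genes wrap around the same way
```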
def build_toolbox(X, y, task, seed, cv_folds, mlflow_experiment):
rng = check_random_state(seed)
# guard against duplicate class creation in the same process
if not hasattr(creator, "FitnessMSEStab"):
creator.create("FitnessMSEStab", base.Fitness, weights=(-1.0, 1.0))
if not hasattr(creator, "Individual"):
creator.create("Individual", list, fitness=creator.FitnessMSEStab)
toolbox = base.Toolbox()
toolbox.register("gene", rng.randint, 0, 1000000)
toolbox.register(
"individual",
tools.initRepeat,
creator.Individual,
toolbox.gene,
n=16,
)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
def eval_ind(individual):
algo, model_params, pre_cfg = decode(individual)
# one run per individual; the outer script sets the experiment
with mlflow.start_run(run_name=f"{algo}", nested=True):
for gi, g in enumerate(individual):
mlflow.log_param(f"g{gi}", int(g))
mlflow.log_param("algo", algo)
for k, v in model_params.items():
mlflow.log_param(f"m_{k}", v)
for k, v in pre_cfg.items():
mlflow.log_param(f"p_{k}", v)
mse_like, stability, _ = evaluate_config(
X,
y,
task,
algo,
model_params,
pre_cfg,
cv_folds=cv_folds,
seed=seed,
)
mlflow.log_metric("mse_like", mse_like)
mlflow.log_metric("stability", stability)
return mse_like, stability
toolbox.register("evaluate", eval_ind)
toolbox.register("mate", tools.cxTwoPoint)
toolbox.register(
"mutate",
tools.mutUniformInt,
low=0,
up=1000000,
indpb=0.2,
)
toolbox.register("select", tools.selNSGA2)
return toolbox

src/stability.py Normal file

@@ -0,0 +1,273 @@
import numpy as np
import pandas as pd
import shap
import time
def compute_shap_matrix(
pipe,
X_fit,
y_fit,
X_eval,
task_type,
bg_size=128,
max_eval_rows=1024,
rng_seed=0,
):
"""
Fit the pipeline on (X_fit, y_fit), then compute SHAP values on X_eval.
Important: for stability, X_eval should be the same rows and order
across all folds or retrains that you want to compare.
"""
t0 = time.time()
pipe.fit(X_fit, y_fit)
t_fit = time.time() - t0
pre = pipe.named_steps["pre"]
model = pipe.named_steps["model"]
prep = pre.named_steps["prep"]
# derive names after prep using TRAIN data, not eval
X_probe = prep.transform(X_fit[:1])
names_after_prep = getattr(
prep,
"get_feature_names_out",
lambda: np.array([f"f{i}" for i in range(X_probe.shape[1])]),
)()
# thread names through constant-dropper
if "drop_const" in pre.named_steps:
dropper = pre.named_steps["drop_const"]
if hasattr(dropper, "set_feature_names_in"):
dropper.set_feature_names_in(names_after_prep)
names_into_select = dropper.get_feature_names_out()
else:
names_into_select = names_after_prep
# thread names into selector
selector = pre.named_steps["select"]
if hasattr(selector, "set_feature_names_in"):
selector.set_feature_names_in(names_into_select)
# preprocess eval and train splits
X_eval_proc = pre.transform(X_eval)
X_train_proc = pre.transform(X_fit)
# cap eval rows for speed, keep deterministic subsample
n_eval = X_eval_proc.shape[0]
if n_eval > max_eval_rows:
rng = np.random.RandomState(rng_seed)
idx = rng.choice(n_eval, size=max_eval_rows, replace=False)
X_eval_proc = X_eval_proc[idx]
n_cols = X_eval_proc.shape[1]
# resolve final feature names after selection
feat_names = None
if hasattr(selector, "get_feature_names_out"):
feat_names = selector.get_feature_names_out(input_features=names_into_select)
if feat_names is None:
supp = getattr(selector, "support_mask_", None)
if (
supp is not None
and len(names_into_select) == supp.shape[0]
and supp.sum() == n_cols
):
feat_names = np.asarray(names_into_select)[supp]
else:
feat_names = np.array([f"f{i}" for i in range(n_cols)])
else:
feat_names = np.asarray(feat_names)
if len(feat_names) != n_cols:
feat_names = np.array([f"f{i}" for i in range(n_cols)])
# build background from TRAIN split to avoid leakage
rng = np.random.RandomState(rng_seed)
n_bg_pool = X_train_proc.shape[0]
bg_n = min(bg_size, n_bg_pool)
bg_idx = rng.choice(n_bg_pool, size=bg_n, replace=False)
background = X_train_proc[bg_idx]
# choose explainer by model type
def _is_tree_model(m):
# sklearn trees and ensembles
if hasattr(m, "tree_") or hasattr(m, "estimators_"):
return True
# xgboost and lightgbm wrappers
try:
import xgboost as _xgb
if isinstance(m, _xgb.XGBModel):
return True
except Exception:
pass
try:
import lightgbm as _lgb
if isinstance(getattr(m, "booster_", None), _lgb.basic.Booster):
return True
except Exception:
pass
return False
# compute SHAP values
t1 = time.time()
vals = None
if _is_tree_model(model):
# tree specific, interventional mode with background data
explainer = shap.TreeExplainer(
model,
data=background,
feature_perturbation="interventional",
)
# disable strict additivity check to avoid ExplainerError
sv = explainer.shap_values(X_eval_proc, check_additivity=False)
# shap API can return list for classification. pick positive class if so.
if isinstance(sv, list):
cls_idx = 1 if len(sv) > 1 else 0
vals = np.asarray(sv[cls_idx])
else:
vals = np.asarray(sv)
elif hasattr(model, "coef_"):
# linear models
explainer = shap.LinearExplainer(model, background)
vals = np.asarray(explainer.shap_values(X_eval_proc))
else:
# fallback generic with training background masker
masker = shap.maskers.Independent(background)
if task_type == "classification" and hasattr(model, "predict_proba"):
f = lambda M: model.predict_proba(M)[:, 1]
else:
f = lambda M: model.predict(M)
explainer = shap.Explainer(f, masker)
out = explainer(X_eval_proc)
vals = np.asarray(getattr(out, "values", out))
t_shap = time.time() - t1
# normalize shapes to (n_rows, n_features)
vals = np.asarray(vals)
vals = np.squeeze(vals)
    if vals.ndim == 3 and vals.shape[2] == 2 and vals.shape[1] == len(feat_names):
vals = vals[..., -1]
if vals.ndim == 3 and vals.shape[-1] == len(feat_names):
vals = vals.reshape(-1, vals.shape[-1])
if vals.ndim == 2 and vals.shape[0] == len(feat_names):
vals = vals.T
return vals, t_fit, t_shap, feat_names
def mean_abs_shap(shap_matrix, feature_names):
"""
Global mean absolute SHAP per feature.
Still useful for descriptive plots, but not used for instance level stability.
"""
return pd.Series(
np.abs(shap_matrix).mean(axis=0),
index=np.asarray(feature_names),
)
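`mean_abs_shap` collapses a (rows × features) SHAP matrix into one global importance per feature by averaging absolute values down the columns; on an invented 2×2 matrix:

```python
import numpy as np
import pandas as pd

shap_matrix = np.array([[ 0.5, -1.0],
                        [-0.5,  3.0]])
# mean of |values| per column, labelled with illustrative feature names
imp = pd.Series(np.abs(shap_matrix).mean(axis=0), index=["f_a", "f_b"])
print(imp.tolist())  # [0.5, 2.0]
```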
def _align_matrices_to_union(mats_with_names):
"""
mats_with_names: list of (shap_matrix, feature_names)
Each shap_matrix has shape (n_instances, n_features_k).
feature_names is array-like of length n_features_k.
Returns:
all_feats: list[str]
T: np.ndarray of shape (n_models, n_instances, n_all_feats)
"""
if not mats_with_names:
raise ValueError("No SHAP matrices provided")
# check that all matrices have the same number of instances
n_instances = mats_with_names[0][0].shape[0]
for M, names in mats_with_names:
if M.shape[0] != n_instances:
raise ValueError(
f"All SHAP matrices must have same number of rows. "
f"Expected {n_instances}, got {M.shape[0]}"
)
# union of all feature names
all_feats = sorted(
set().union(*[set(np.asarray(names)) for _, names in mats_with_names])
)
n_models = len(mats_with_names)
n_feats = len(all_feats)
T = np.zeros((n_models, n_instances, n_feats), dtype=float)
feat_index = {f: j for j, f in enumerate(all_feats)}
for m_idx, (M, names) in enumerate(mats_with_names):
names = np.asarray(names)
col_map = {name: c for c, name in enumerate(names)}
for fname, j_global in feat_index.items():
if fname in col_map:
j_local = col_map[fname]
T[m_idx, :, j_global] = M[:, j_local]
# if fname not present, leave zeros
return all_feats, T
def shap_stability_from_matrices(mats_with_names):
"""
mats_with_names: list of tuples (shap_matrix, feature_names)
shap_matrix: np.ndarray of shape (n_instances, n_features_k)
feature_names: list or array of length n_features_k
This measures instance level stability:
For each instance i and feature j we look at the SHAP values
across models and compute their standard deviation.
Steps:
1) Align all matrices on the union of feature names.
2) Build tensor T of shape (n_models, n_instances, n_features_union).
3) Compute std across models: per_inst_feat_std = T.std(axis=0).
4) Aggregate:
agg_std = mean of per_inst_feat_std over instances and features.
stability_score = 1 / (1 + agg_std).
Returns:
agg_std: float
stability_score: float
per_feat_std: pd.Series with mean std per feature over instances
per_inst_std: np.ndarray with mean std per instance over features
"""
if not mats_with_names:
raise ValueError("No SHAP matrices provided")
if len(mats_with_names) < 2:
raise ValueError(
f"Need at least 2 models to estimate stability, got {len(mats_with_names)}"
)
feat_names_union, T = _align_matrices_to_union(mats_with_names)
# std across models for each instance and feature
per_inst_feat_std = T.std(axis=0) # shape (n_instances, n_features)
# aggregate
agg_std = float(per_inst_feat_std.mean())
stability_score = 1.0 / (1.0 + agg_std)
# per feature: average std over instances
per_feat_std = per_inst_feat_std.mean(axis=0)
per_feat_std_series = pd.Series(per_feat_std, index=feat_names_union)
# per instance: average std over features
per_inst_std = per_inst_feat_std.mean(axis=1)
return agg_std, stability_score, per_feat_std_series, per_inst_std
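End to end, the stability computation amounts to stacking the per-fold SHAP matrices and measuring cross-fold disagreement per cell. A worked example with two invented folds that agree on the first feature and disagree on the second:

```python
import numpy as np

# SHAP matrices from two folds: same instances, same row order
A = np.array([[1.0, 0.0],
              [1.0, 2.0]])
B = np.array([[1.0, 2.0],
              [1.0, 4.0]])

T = np.stack([A, B])               # (n_models, n_instances, n_features)
per_inst_feat_std = T.std(axis=0)  # 0 where folds agree, 1 where they differ
agg_std = float(per_inst_feat_std.mean())
stability = 1.0 / (1.0 + agg_std)
print(agg_std, stability)  # agg_std == 0.5 -> stability == 2/3
```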