A new beginning

ACENET_HPC_Guide.md (new file, 135 lines)
# Using Compute Canada / ACENET HPC

This guide explains how to connect to the Digital Research Alliance of Canada (formerly Compute Canada) or ACENET clusters, create a working directory in scratch, transfer files with Globus, and submit jobs using SLURM.

## 1. Connect to the HPC via SSH

1. Determine which cluster to use (examples):

   - Graham: `graham.computecanada.ca`
   - Cedar: `cedar.computecanada.ca`
   - Beluga: `beluga.computecanada.ca`
   - Niagara: `niagara.scinet.utoronto.ca`
   - ACENET: `login1.acenet.ca`

2. Open a terminal and connect via SSH:

   ```bash
   ssh username@graham.computecanada.ca
   ```

3. When prompted, confirm the host fingerprint and enter your password.
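
To avoid retyping the full hostname, you can add an alias to your local `~/.ssh/config` (a sketch; replace `username` with your own account name, and follow your cluster's documentation if it requires SSH keys or multifactor authentication):

```
Host graham
    HostName graham.computecanada.ca
    User username
```

After this, `ssh graham` is enough.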

---

## 2. Create a Folder in Scratch

Your `$SCRATCH` directory is a temporary workspace for large data and computations. Files are purged after 60 days of inactivity, so copy anything you want to keep to your home or project space.

After logging in:

```bash
cd $SCRATCH
mkdir my_project
cd my_project
```

Confirm your path:

```bash
pwd
# Example output: /scratch/username/my_project
```
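
Because of the purge policy, it is worth periodically checking for files approaching the age limit. A minimal sketch with GNU `find`, demonstrated here on a throwaway directory (on the cluster you would point it at `$SCRATCH`; the exact purge criteria vary by cluster, so treat this as a rough check):

```shell
# Create a demo directory with one old and one new file.
mkdir -p /tmp/scratch_demo
touch -d "90 days ago" /tmp/scratch_demo/old_results.dat
touch /tmp/scratch_demo/new_results.dat

# List files whose modification time is older than 60 days.
find /tmp/scratch_demo -type f -mtime +60
# prints /tmp/scratch_demo/old_results.dat
```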

---

## 3. Install and Use Globus for File Transfers

Globus is a fast, reliable tool for large file transfers. It requires a small local agent called **Globus Connect Personal**.

### Install Globus Connect Personal

- **Linux:**

  ```bash
  wget https://downloads.globus.org/globus-connect-personal/linux/stable/globusconnectpersonal-latest.tgz
  tar xzf globusconnectpersonal-latest.tgz
  cd globusconnectpersonal*
  ./globusconnectpersonal -setup
  ```

- **macOS:**

  Download and install from: [https://www.globus.org/globus-connect-personal](https://www.globus.org/globus-connect-personal)

- **Windows:**

  Download the installer from the same link and follow the setup wizard.

After installation, your local computer will appear as a **Globus endpoint**.

### Transfer Files

1. Visit [https://app.globus.org](https://app.globus.org) and log in using **Compute Canada credentials**.
2. In the web app, choose two endpoints:
   - **Source:** your local computer or institutional storage.
   - **Destination:** your HPC endpoint (for example, *Compute Canada Graham Scratch*).
3. Navigate to your target scratch folder (`/scratch/username/my_project`).
4. Select files and click **Start Transfer**.

Globus will handle transfers asynchronously and resume interrupted transfers automatically.
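
If you prefer the command line, the same transfers can be scripted with the official `globus-cli`. A sketch only: the endpoint UUIDs and paths below are placeholders you would look up yourself, and `globus login` requires an interactive browser authentication:

```bash
pip install globus-cli
globus login                       # opens a browser to authenticate
globus endpoint search "Graham"    # find the endpoint UUID
globus transfer <local-endpoint-uuid>:/data/input.csv \
    <graham-endpoint-uuid>:/scratch/username/my_project/input.csv \
    --label "input upload"
```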

---

## 4. Submit Jobs to ACENET with SLURM

Job submissions use the SLURM scheduler. Create a batch file describing your job's resources and commands.

### Example job script (`job.slurm`)

```bash
#!/bin/bash
#SBATCH --job-name=my_analysis
#SBATCH --account=def-yourprof
#SBATCH --time=2:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --mem=8G
#SBATCH --output=output_%j.log

module load python/3.11
source ~/myenv/bin/activate

python my_script.py
```

### Submit and Monitor Jobs

```bash
sbatch job.slurm       # Submit job
squeue -u username     # Check status
scancel job_id         # Cancel job
```

### View Results

After completion, check the output log:

```bash
less output_<jobid>.log
```

---

## 5. Useful Commands

```bash
module avail             # List available software modules
module load python/3.11  # Load a module
df -h $SCRATCH           # Check scratch usage
quota -s                 # Check your disk quota
```

---

## 6. References

- Alliance Docs: [https://docs.alliancecan.ca/wiki/Technical_documentation](https://docs.alliancecan.ca/wiki/Technical_documentation)
- ACENET Training: [https://www.ace-net.ca/training/](https://www.ace-net.ca/training/)
- Globus Setup: [https://www.globus.org/globus-connect-personal](https://www.globus.org/globus-connect-personal)

README.md (new file, 62 lines)

# Explanation-Aware Optimization and AutoML (DEAP + SHAP Stability)

This project implements an **AutoML framework** that uses **DEAP's NSGA-II** for multi-objective optimization, balancing **model accuracy** and **SHAP-based stability**.
It supports both **classification** and **regression** datasets via OpenML and scikit-learn.
All results are tracked with **MLflow**.

---

## 1. Environment Setup (macOS / Linux)

### Create and activate a virtual environment

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip wheel setuptools

pip install \
    numpy==1.26.4 \
    pandas==1.5.3 \
    scikit-learn==1.3.2 \
    shap==0.45.0 \
    deap==1.4.1 \
    openml==0.14.2 \
    mlflow==2.11.3 \
    matplotlib==3.7.5
```

## 2. Running Experiments

### Classification: Adult dataset

```bash
python run_deap.py \
    --dataset adult \
    --generations 5 \
    --pop-size 24 \
    --cv-folds 3
```

### Regression: California Housing dataset

```bash
python run_deap.py \
    --dataset cal_housing \
    --generations 5 \
    --pop-size 24 \
    --cv-folds 3
```

Results are saved under:

```bash
runs/<dataset>/pareto_front.csv
```
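
The Pareto front CSV can be inspected directly with pandas. A minimal sketch: the column names `mse_like` and `stability` match what `run_deap.py` writes, but the inline data below is made up for illustration, and the "distance to ideal" heuristic is just one way to pick a compromise model:

```python
import io
import pandas as pd

# Stand-in for runs/<dataset>/pareto_front.csv (illustrative values only).
csv_text = """algo,mse_like,stability
rf,0.30,0.80
gbt,0.25,0.70
mlp,0.40,0.90
"""
pf = pd.read_csv(io.StringIO(csv_text))

# Lower mse_like is better, higher stability is better. One simple
# heuristic: min-max scale both objectives and pick the row closest
# to the ideal corner (mse_like = 0, stability = 1).
obj = pf[["mse_like", "stability"]]
norm = (obj - obj.min()) / (obj.max() - obj.min())
pf["dist_to_ideal"] = (norm["mse_like"] ** 2 + (1 - norm["stability"]) ** 2) ** 0.5
best = pf.sort_values("dist_to_ideal").iloc[0]
print(best["algo"])  # prints: rf
```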

## 3. Viewing Results in MLflow

```bash
mlflow ui --backend-store-uri ./mlruns --host 0.0.0.0 --port 5000
```

Then open: http://localhost:5000

You can visualize:

- MSE-like score (lower is better)
- SHAP stability (higher is better)
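
The stability objective rewards models whose SHAP attributions agree across cross-validation folds. The exact formula lives in `src/stability.py`; a hedged sketch of the idea, using made-up SHAP matrices (the repository's own aggregation may differ in detail):

```python
import numpy as np

# Three hypothetical SHAP matrices (rows = evaluation instances,
# columns = features), one per CV fold, for the SAME rows in the
# same order -- that alignment is what makes them comparable.
rng = np.random.RandomState(0)
base = rng.normal(size=(5, 4))
shap_mats = [base + rng.normal(scale=0.01, size=base.shape) for _ in range(3)]

# Per-cell standard deviation across folds, averaged to one number.
stacked = np.stack(shap_mats)          # shape (folds, rows, features)
agg_std = stacked.std(axis=0).mean()   # small when attributions agree

# Map dispersion to a "higher is better" score in (0, 1].
stability = 1.0 / (1.0 + agg_std)
```

With near-identical matrices, as here, `stability` comes out close to 1; unstable attributions push it toward 0.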

requirements.txt (new file, 8 lines)

scikit-learn
openml
deap
mlflow
shap
numpy
pandas
matplotlib
run_deap.py (new file, 170 lines)

import argparse
import random
import pickle
from pathlib import Path

import numpy as np
import mlflow
import pandas as pd
from deap import algorithms
from deap.tools.emo import sortNondominated
from sklearn.pipeline import Pipeline as SkPipeline

from src.data_openml import load_dataset
from src.search.nsga_deap import build_toolbox, decode
from src.preprocessing import build_preprocessor
from src.models import make_model
from src.stability import compute_shap_matrix


def save_checkpoint(path, gen, pop, seed):
    state = {
        "gen": gen,
        "pop": pop,
        "py_random_state": random.getstate(),
        "np_random_state": np.random.get_state(),
        "seed": seed,
    }
    with open(path, "wb") as f:
        pickle.dump(state, f)


def load_checkpoint(path):
    with open(path, "rb") as f:
        state = pickle.load(f)
    random.setstate(state["py_random_state"])
    np.random.set_state(state["np_random_state"])
    return state["gen"], state["pop"], state["seed"]


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--dataset", required=True, choices=["adult", "cal_housing"])
    ap.add_argument("--generations", type=int, default=10)
    ap.add_argument("--pop-size", type=int, default=24)
    ap.add_argument("--seed", type=int, default=42)
    ap.add_argument("--cv-folds", type=int, default=3)
    ap.add_argument("--experiment", default="deap_nsga_shap")
    ap.add_argument("--checkpoint-every", type=int, default=5)
    ap.add_argument(
        "--shap-pf-eval-rows",
        type=int,
        default=512,
        help="Number of rows from the dataset to use when saving SHAP for Pareto models",
    )
    args = ap.parse_args()

    # data and experiment
    X, y, task = load_dataset(args.dataset, random_state=args.seed)
    mlflow.set_experiment(args.experiment)

    outdir = Path("runs") / args.dataset
    outdir.mkdir(parents=True, exist_ok=True)
    ckpt_path = outdir / "checkpoint.pkl"

    # seed RNGs
    random.seed(args.seed)
    np.random.seed(args.seed)

    # toolbox for this run
    toolbox = build_toolbox(
        X,
        y,
        task,
        seed=args.seed,
        cv_folds=args.cv_folds,
        mlflow_experiment=args.experiment,
    )

    # initial population or resume from checkpoint
    if ckpt_path.exists():
        start_gen, pop, loaded_seed = load_checkpoint(ckpt_path)
        if loaded_seed != args.seed:
            print(
                f"Warning: checkpoint seed {loaded_seed} differs from current seed {args.seed}"
            )
        print(f"Resuming from checkpoint at generation {start_gen}")
    else:
        pop = toolbox.population(n=args.pop_size)
        fits = list(map(toolbox.evaluate, pop))
        for ind, fit in zip(pop, fits):
            ind.fitness.values = fit
        start_gen = 0
        save_checkpoint(ckpt_path, start_gen, pop, args.seed)
        print(f"Initial checkpoint saved at generation {start_gen}")

    # GA loop
    for gen in range(start_gen, args.generations):
        offspring = algorithms.varAnd(pop, toolbox, cxpb=0.7, mutpb=0.2)
        fits = list(map(toolbox.evaluate, offspring))
        for ind, fit in zip(offspring, fits):
            ind.fitness.values = fit
        pop = toolbox.select(pop + offspring, k=args.pop_size)

        if (gen + 1) % args.checkpoint_every == 0:
            save_checkpoint(ckpt_path, gen + 1, pop, args.seed)
            print(f"Checkpoint saved at generation {gen + 1}")

    # final Pareto front
    pf = sortNondominated(pop, len(pop), first_front_only=True)[0]
    rows = []
    for ind in pf:
        algo, model_params, pre_cfg = decode(ind)
        rows.append(
            {
                "algo": algo,
                "mse_like": ind.fitness.values[0],
                "stability": ind.fitness.values[1],
                **{f"m_{k}": v for k, v in model_params.items()},
                **{f"p_{k}": v for k, v in pre_cfg.items()},
            }
        )

    pareto_path = outdir / "pareto_front.csv"
    pd.DataFrame(rows).to_csv(pareto_path, index=False)
    print(f"Saved Pareto front to {pareto_path}")

    shap_dir = outdir / "shap"
    shap_dir.mkdir(exist_ok=True)

    eval_rows = min(args.shap_pf_eval_rows, len(X))
    rng = np.random.RandomState(args.seed)
    eval_idx = rng.choice(len(X), size=eval_rows, replace=False)
    X_eval_shap = X.iloc[eval_idx]
    y_full = y

    for i, ind in enumerate(pf):
        algo, model_params, pre_cfg = decode(ind)

        fixed_poly_degree = pre_cfg.get("poly_degree", 1)
        fixed_k = pre_cfg.get("select_k", None)

        preproc = build_preprocessor(
            X,
            task,
            pre_cfg,
            fixed_k=fixed_k,
            fixed_poly_degree=fixed_poly_degree,
        )
        model = make_model(task, algo, model_params, random_state=args.seed)
        pipe = SkPipeline([("pre", preproc), ("model", model)])

        shap_vals, t_fit, t_shap, feat_names = compute_shap_matrix(
            pipe,
            X_fit=X,
            y_fit=y_full,
            X_eval=X_eval_shap,
            task_type=task,
            bg_size=128,
            max_eval_rows=eval_rows,
            rng_seed=args.seed,
        )

        np.save(shap_dir / f"pf_{i}_shap_vals.npy", shap_vals)
        np.save(shap_dir / f"pf_{i}_feat_names.npy", np.asarray(feat_names))

    print(f"Saved SHAP arrays for {len(pf)} Pareto models under {shap_dir}")


if __name__ == "__main__":
    main()
src/data_openml.py (new file, 16 lines)

from sklearn.datasets import fetch_california_housing, fetch_openml


def load_dataset(name: str, random_state: int = 42):
    name = name.lower()
    if name == "cal_housing":
        ds = fetch_california_housing(as_frame=True)
        X = ds.data
        y = ds.target
        return X, y, "regression"
    elif name == "adult":
        ds = fetch_openml(data_id=1590, as_frame=True)  # Adult
        X = ds.data
        y = (ds.target == ">50K").astype(int)
        return X, y, "classification"
    else:
        raise ValueError("dataset must be adult or cal_housing")
src/models.py (new file, 55 lines)

from sklearn.ensemble import (
    RandomForestRegressor,
    RandomForestClassifier,
    GradientBoostingRegressor,
    GradientBoostingClassifier,
)
from sklearn.neural_network import MLPRegressor, MLPClassifier


def make_model(task, algo, params, random_state=0):
    if task == "regression":
        if algo == "rf":
            return RandomForestRegressor(
                n_estimators=int(params["n_estimators"]),
                max_depth=int(params["max_depth"]),
                max_features=params["max_features"],
                random_state=random_state,
                n_jobs=1,
            )
        elif algo == "gbt":
            return GradientBoostingRegressor(
                n_estimators=int(params["n_estimators"]),
                learning_rate=float(params["learning_rate"]),
                max_depth=int(params["max_depth"]),
                random_state=random_state,
            )
        elif algo == "mlp":
            return MLPRegressor(
                hidden_layer_sizes=tuple(params["hidden_layers"]),
                activation=params["activation"],
                alpha=float(params["alpha"]),
                learning_rate_init=float(params["lr_init"]),
                max_iter=int(params.get("max_iter", 200)),
                random_state=random_state,
            )
    else:
        if algo == "rf":
            return RandomForestClassifier(
                n_estimators=int(params["n_estimators"]),
                max_depth=int(params["max_depth"]),
                max_features=params["max_features"],
                random_state=random_state,
                n_jobs=1,
            )
        elif algo == "gbt":
            return GradientBoostingClassifier(
                n_estimators=int(params["n_estimators"]),
                learning_rate=float(params["learning_rate"]),
                max_depth=int(params["max_depth"]),
                random_state=random_state,
            )
        elif algo == "mlp":
            return MLPClassifier(
                hidden_layer_sizes=tuple(params["hidden_layers"]),
                activation=params["activation"],
                alpha=float(params["alpha"]),
                learning_rate_init=float(params["lr_init"]),
                max_iter=int(params.get("max_iter", 200)),
                random_state=random_state,
            )
    raise ValueError("Unknown algo")
src/objectives.py (new file, 66 lines)

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error, brier_score_loss

from .preprocessing import build_preprocessor
from .models import make_model
from .stability import compute_shap_matrix, shap_stability_from_matrices


def evaluate_config(X, y, task, algo, model_params, preproc_cfg, cv_folds=3, seed=42):
    kf = KFold(n_splits=cv_folds, shuffle=True, random_state=seed)
    losses = []

    # this will store full SHAP matrices and feature names for stability
    shap_mats_with_names = []

    # pick a fixed evaluation pool for SHAP, same rows and order for all folds
    rng = np.random.RandomState(seed)
    max_eval_rows = 1024
    eval_size = min(max_eval_rows, len(X))
    eval_idx = rng.choice(len(X), size=eval_size, replace=False)
    X_eval_fixed = X.iloc[eval_idx]

    # probe preprocessor to compute a safe cap for k
    fixed_poly_degree = preproc_cfg.get("fixed_poly_degree", preproc_cfg.get("poly_degree", 1))
    probe_pre = build_preprocessor(X, task, preproc_cfg, fixed_k=None, fixed_poly_degree=fixed_poly_degree)
    Xp = probe_pre.fit_transform(X, y)
    n_after_prep = Xp.shape[1]
    desired_k = preproc_cfg.get("select_k", None)
    k_cap = None if desired_k is None else int(min(max(1, desired_k), n_after_prep))

    for fold_idx, (tr, te) in enumerate(kf.split(X)):
        preproc = build_preprocessor(X, task, preproc_cfg, fixed_k=k_cap, fixed_poly_degree=fixed_poly_degree)
        model = make_model(task, algo, model_params, random_state=seed + fold_idx)
        pipe = Pipeline([("pre", preproc), ("model", model)])

        # 1) SHAP stability: always use the same X_eval_fixed for all folds
        shap_vals, t_fit, t_shap, feat_names = compute_shap_matrix(
            pipe,
            X_fit=X.iloc[tr],
            y_fit=y.iloc[tr],
            X_eval=X_eval_fixed,
            task_type=task,
        )
        shap_mats_with_names.append((shap_vals, feat_names))

        # 2) Loss: still use standard CV split (te) for generalization
        if task == "regression":
            y_pred = pipe.predict(X.iloc[te])
            loss = float(mean_squared_error(y.iloc[te], y_pred))
        else:
            if hasattr(pipe.named_steps["model"], "predict_proba"):
                y_prob = pipe.predict_proba(X.iloc[te])[:, 1]
            else:
                scores = pipe.decision_function(X.iloc[te])
                scores = (scores - scores.min()) / (scores.max() - scores.min() + 1e-8)
                y_prob = scores
            loss = float(brier_score_loss(y.iloc[te], y_prob))
        losses.append(loss)

    # instance level SHAP stability across folds
    agg_std, stability, per_feat_std, per_inst_std = shap_stability_from_matrices(shap_mats_with_names)

    mse_like = float(np.mean(losses))
    return mse_like, float(stability), per_feat_std
src/preprocessing.py (new file, 138 lines)

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, RobustScaler, MinMaxScaler, PowerTransformer, PolynomialFeatures
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_classif, f_regression
from sklearn.feature_selection import VarianceThreshold
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np


class SafeSelectK(BaseEstimator, TransformerMixin):
    def __init__(self, task: str, k=None):
        self.task = task
        self.k = k
        self.selector_ = None
        self.k_effective_ = None
        self.support_mask_ = None
        self.feature_names_in_ = None
        self.feature_names_out_ = None

    def fit(self, X, y=None):
        if self.k is None:
            self.selector_ = "passthrough"
            self.feature_names_out_ = self.feature_names_in_
            return self
        n_feats = X.shape[1]
        k_eff = int(min(max(1, self.k), n_feats))
        score_func = f_classif if self.task == "classification" else f_regression
        sel = SelectKBest(score_func=score_func, k=k_eff).fit(X, y)
        self.selector_ = sel
        self.k_effective_ = k_eff
        mask = np.zeros(n_feats, dtype=bool)
        mask[sel.get_support(indices=True)] = True
        self.support_mask_ = mask
        if self.feature_names_in_ is not None:
            self.feature_names_out_ = self.feature_names_in_[mask]
        return self

    def set_feature_names_in(self, names):
        self.feature_names_in_ = np.asarray(names)

    def transform(self, X):
        if self.selector_ == "passthrough":
            return X
        return self.selector_.transform(X)

    def get_feature_names_out(self, input_features=None):
        if getattr(self, "feature_names_out_", None) is not None:
            return self.feature_names_out_
        if getattr(self, "support_mask_", None) is not None and input_features is not None:
            input_features = np.asarray(input_features)
            return input_features[self.support_mask_]
        return None


class ConstantFilter(BaseEstimator, TransformerMixin):
    def __init__(self, eps=0.0):
        self.eps = eps
        self.mask_ = None
        self.feature_names_in_ = None
        self.feature_names_out_ = None

    def fit(self, X, y=None):
        X = np.asarray(X)
        var = X.var(axis=0)
        self.mask_ = var > self.eps
        if self.feature_names_in_ is not None:
            self.feature_names_out_ = np.asarray(self.feature_names_in_)[self.mask_]
        return self

    def set_feature_names_in(self, names):
        self.feature_names_in_ = np.asarray(names)

    def get_feature_names_out(self):
        if self.feature_names_out_ is not None:
            return self.feature_names_out_
        # fallback when names were not set
        return np.array([f"f{i}" for i, keep in enumerate(self.mask_) if keep])

    def transform(self, X):
        X = np.asarray(X)
        return X[:, self.mask_]


def build_preprocessor(X_full, task, cfg, fixed_k=None, fixed_poly_degree=None):
    cat_cols = X_full.select_dtypes(include=["object", "category", "bool"]).columns.tolist()
    num_cols = [c for c in X_full.columns if c not in cat_cols]

    num_imputer = SimpleImputer(strategy=cfg.get("num_impute_strategy", "median"))
    cat_imputer = SimpleImputer(strategy=cfg.get("cat_impute_strategy", "most_frequent"))

    scaler_name = cfg.get("scaler", "standard")
    if scaler_name == "standard":
        num_scaler = StandardScaler(with_mean=True, with_std=True)
    elif scaler_name == "robust":
        num_scaler = RobustScaler()
    elif scaler_name == "minmax":
        num_scaler = MinMaxScaler()
    elif scaler_name == "power":
        num_scaler = PowerTransformer(method="yeo-johnson")
    else:
        num_scaler = "passthrough"

    poly_degree = fixed_poly_degree if fixed_poly_degree is not None else cfg.get("poly_degree", 1)
    poly = PolynomialFeatures(degree=poly_degree, include_bias=False) if poly_degree > 1 else "passthrough"

    # always fix categories from the full dataset
    fixed_categories = None
    if len(cat_cols) > 0:
        fixed_categories = {c: sorted(X_full[c].dropna().astype(str).unique()) for c in cat_cols}

    ohe_kwargs = dict(handle_unknown="ignore")
    if fixed_categories is not None:
        ohe_kwargs["categories"] = [fixed_categories[c] for c in cat_cols]
    try:
        # scikit-learn >= 1.2 renamed sparse -> sparse_output
        cat_encoder = OneHotEncoder(sparse_output=False, **ohe_kwargs)
    except TypeError:
        cat_encoder = OneHotEncoder(sparse=False, **ohe_kwargs)

    num_steps = [("impute", num_imputer), ("scale", num_scaler), ("poly", poly)]
    if int(cfg.get("use_vt", 0)):
        num_steps.append(("vt", VarianceThreshold(threshold=float(cfg.get("vt_thr", 0.0)))))

    ct = ColumnTransformer([
        ("num", Pipeline(steps=num_steps), num_cols),
        ("cat", Pipeline(steps=[("impute", cat_imputer), ("oh", cat_encoder)]), cat_cols),
    ])

    select_k = fixed_k if fixed_k is not None else cfg.get("select_k", None)
    selector = SafeSelectK(task=task, k=select_k)

    pre = Pipeline([
        ("prep", ct),
        ("drop_const", ConstantFilter(eps=0.0)),
        ("select", selector),
    ])
    return pre
src/search/nsga_deap.py (new file, 147 lines)

import mlflow
from deap import base, creator, tools
from sklearn.utils import check_random_state

from src.objectives import evaluate_config

SCALERS = ["standard", "robust", "minmax", "power", "none"]
NUM_IMPUTE = ["median", "mean"]
CAT_IMPUTE = ["most_frequent"]
ALGOS = ["rf", "gbt", "mlp"]


def decode(ind):
    i = 0
    algo = ALGOS[int(ind[i]) % len(ALGOS)]
    i += 1
    scaler = SCALERS[int(ind[i]) % len(SCALERS)]
    i += 1
    num_imp = NUM_IMPUTE[int(ind[i]) % len(NUM_IMPUTE)]
    i += 1
    cat_imp = CAT_IMPUTE[int(ind[i]) % len(CAT_IMPUTE)]
    i += 1
    poly_degree = 1 + int(ind[i]) % 2
    i += 1
    use_selectk = int(ind[i]) % 2
    i += 1
    select_k = [None, 16, 32, 64, 128][int(ind[i]) % 5]
    i += 1
    if not use_selectk:
        select_k = None

    pre_cfg = {
        "num_impute_strategy": num_imp,
        "cat_impute_strategy": cat_imp,
        "scaler": scaler,
        "poly_degree": poly_degree,
        "select_k": select_k,
    }

    if algo == "rf":
        n_estimators = [100, 200, 300, 400, 500][int(ind[i]) % 5]
        i += 1
        max_depth = [2, 4, 6, 8, 10, 12][int(ind[i]) % 6]
        i += 1
        max_features = ["sqrt", "log2", None][int(ind[i]) % 3]
        i += 1
        params = {
            "n_estimators": n_estimators,
            "max_depth": max_depth,
            "max_features": max_features,
        }
    elif algo == "gbt":
        n_estimators = [100, 200, 300, 400, 500][int(ind[i]) % 5]
        i += 1
        max_depth = [2, 3, 4, 5][int(ind[i]) % 4]
        i += 1
        lr = [0.01, 0.02, 0.05, 0.1, 0.2][int(ind[i]) % 5]
        i += 1
        params = {
            "n_estimators": n_estimators,
            "max_depth": max_depth,
            "learning_rate": lr,
        }
    else:
        n_layers = [1, 2, 3][int(ind[i]) % 3]
        i += 1
        h = []
        for _ in range(n_layers):
            h.append([16, 32, 64, 128, 256][int(ind[i]) % 5])
            i += 1
        # skip unused gene slots if fewer than 3 layers
        i += max(0, 3 - n_layers)
        alpha = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2][int(ind[i]) % 5]
        i += 1
        lr_init = [1e-4, 5e-4, 1e-3, 5e-3, 1e-2][int(ind[i]) % 5]
        i += 1
        params = {
            "hidden_layers": tuple(h),
            "activation": "relu",
            "alpha": alpha,
            "lr_init": lr_init,
            "max_iter": 200,
        }

    return algo, params, pre_cfg


def build_toolbox(X, y, task, seed, cv_folds, mlflow_experiment):
    rng = check_random_state(seed)

    # guard against duplicate class creation in the same process
    if not hasattr(creator, "FitnessMSEStab"):
        creator.create("FitnessMSEStab", base.Fitness, weights=(-1.0, 1.0))
    if not hasattr(creator, "Individual"):
        creator.create("Individual", list, fitness=creator.FitnessMSEStab)

    toolbox = base.Toolbox()
    toolbox.register("gene", rng.randint, 0, 1000000)
    toolbox.register(
        "individual",
        tools.initRepeat,
        creator.Individual,
        toolbox.gene,
        n=16,
    )
    toolbox.register("population", tools.initRepeat, list, toolbox.individual)

    def eval_ind(individual):
        algo, model_params, pre_cfg = decode(individual)

        # one run per individual; the outer script sets the experiment
        with mlflow.start_run(run_name=f"{algo}", nested=True):
            for gi, g in enumerate(individual):
                mlflow.log_param(f"g{gi}", int(g))
            mlflow.log_param("algo", algo)
            for k, v in model_params.items():
                mlflow.log_param(f"m_{k}", v)
            for k, v in pre_cfg.items():
                mlflow.log_param(f"p_{k}", v)

            mse_like, stability, _ = evaluate_config(
                X,
                y,
                task,
                algo,
                model_params,
                pre_cfg,
                cv_folds=cv_folds,
                seed=seed,
            )
            mlflow.log_metric("mse_like", mse_like)
            mlflow.log_metric("stability", stability)

        return mse_like, stability

    toolbox.register("evaluate", eval_ind)
    toolbox.register("mate", tools.cxTwoPoint)
    toolbox.register(
        "mutate",
        tools.mutUniformInt,
        low=0,
        up=1000000,
        indpb=0.2,
    )
    toolbox.register("select", tools.selNSGA2)

    return toolbox
273
src/stability.py
Normal file
273
src/stability.py
Normal file
@@ -0,0 +1,273 @@
import time

import numpy as np
import pandas as pd
import shap


def compute_shap_matrix(
    pipe,
    X_fit,
    y_fit,
    X_eval,
    task_type,
    bg_size=128,
    max_eval_rows=1024,
    rng_seed=0,
):
    """
    Fit the pipeline on (X_fit, y_fit), then compute SHAP values on X_eval.

    Important: for stability, X_eval should be the same rows and order
    across all folds or retrains that you want to compare.
    """
    t0 = time.time()
    pipe.fit(X_fit, y_fit)
    t_fit = time.time() - t0

    pre = pipe.named_steps["pre"]
    model = pipe.named_steps["model"]
    prep = pre.named_steps["prep"]

    # derive names after prep using TRAIN data, not eval
    X_probe = prep.transform(X_fit[:1])
    names_after_prep = getattr(
        prep,
        "get_feature_names_out",
        lambda: np.array([f"f{i}" for i in range(X_probe.shape[1])]),
    )()

    # thread names through constant-dropper
    if "drop_const" in pre.named_steps:
        dropper = pre.named_steps["drop_const"]
        if hasattr(dropper, "set_feature_names_in"):
            dropper.set_feature_names_in(names_after_prep)
        names_into_select = dropper.get_feature_names_out()
    else:
        names_into_select = names_after_prep

    # thread names into selector
    selector = pre.named_steps["select"]
    if hasattr(selector, "set_feature_names_in"):
        selector.set_feature_names_in(names_into_select)

    # preprocess eval and train splits
    X_eval_proc = pre.transform(X_eval)
    X_train_proc = pre.transform(X_fit)

    # cap eval rows for speed, keep deterministic subsample
    n_eval = X_eval_proc.shape[0]
    if n_eval > max_eval_rows:
        rng = np.random.RandomState(rng_seed)
        idx = rng.choice(n_eval, size=max_eval_rows, replace=False)
        X_eval_proc = X_eval_proc[idx]

    n_cols = X_eval_proc.shape[1]

    # resolve final feature names after selection
    feat_names = None
    if hasattr(selector, "get_feature_names_out"):
        feat_names = selector.get_feature_names_out(input_features=names_into_select)

    if feat_names is None:
        supp = getattr(selector, "support_mask_", None)
        if (
            supp is not None
            and len(names_into_select) == supp.shape[0]
            and supp.sum() == n_cols
        ):
            feat_names = np.asarray(names_into_select)[supp]
        else:
            feat_names = np.array([f"f{i}" for i in range(n_cols)])
    else:
        feat_names = np.asarray(feat_names)
        if len(feat_names) != n_cols:
            feat_names = np.array([f"f{i}" for i in range(n_cols)])

    # build background from TRAIN split to avoid leakage
    rng = np.random.RandomState(rng_seed)
    n_bg_pool = X_train_proc.shape[0]
    bg_n = min(bg_size, n_bg_pool)
    bg_idx = rng.choice(n_bg_pool, size=bg_n, replace=False)
    background = X_train_proc[bg_idx]

    # choose explainer by model type
    def _is_tree_model(m):
        # sklearn trees and ensembles
        if hasattr(m, "tree_") or hasattr(m, "estimators_"):
            return True
        # xgboost and lightgbm wrappers
        try:
            import xgboost as _xgb

            if isinstance(m, _xgb.XGBModel):
                return True
        except Exception:
            pass
        try:
            import lightgbm as _lgb

            if isinstance(getattr(m, "booster_", None), _lgb.basic.Booster):
                return True
        except Exception:
            pass
        return False

    # compute SHAP values
    t1 = time.time()
    vals = None

    if _is_tree_model(model):
        # tree specific, interventional mode with background data
        explainer = shap.TreeExplainer(
            model,
            data=background,
            feature_perturbation="interventional",
        )
        # disable strict additivity check to avoid ExplainerError
        sv = explainer.shap_values(X_eval_proc, check_additivity=False)
        # shap API can return a list for classification; pick the positive class if so
        if isinstance(sv, list):
            cls_idx = 1 if len(sv) > 1 else 0
            vals = np.asarray(sv[cls_idx])
        else:
            vals = np.asarray(sv)

    elif hasattr(model, "coef_"):
        # linear models
        explainer = shap.LinearExplainer(model, background)
        vals = np.asarray(explainer.shap_values(X_eval_proc))

    else:
        # fallback generic with training background masker
        masker = shap.maskers.Independent(background)
        if task_type == "classification" and hasattr(model, "predict_proba"):
            f = lambda M: model.predict_proba(M)[:, 1]
        else:
            f = lambda M: model.predict(M)
        explainer = shap.Explainer(f, masker)
        out = explainer(X_eval_proc)
        vals = np.asarray(getattr(out, "values", out))

    t_shap = time.time() - t1

    # normalize shapes to (n_rows, n_features)
    vals = np.asarray(vals)
    vals = np.squeeze(vals)

    if vals.ndim == 3 and vals.shape[2] == 2 and vals.shape[1] == len(feat_names):
        vals = vals[..., -1]
    if vals.ndim == 3 and vals.shape[-1] == len(feat_names):
        vals = vals.reshape(-1, vals.shape[-1])
    if vals.ndim == 2 and vals.shape[0] == len(feat_names):
        vals = vals.T

    return vals, t_fit, t_shap, feat_names
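The shape normalization at the end exists because SHAP explainers return different array layouts depending on model type and SHAP version. A self-contained numpy sketch mirroring those checks, with toy zero arrays standing in for real SHAP output:

```python
import numpy as np


def normalize_shap(vals, n_feats):
    # reduce whatever layout came back to (n_rows, n_feats),
    # mirroring the checks in compute_shap_matrix above
    vals = np.squeeze(np.asarray(vals))
    if vals.ndim == 3 and vals.shape[2] == 2 and vals.shape[1] == n_feats:
        vals = vals[..., -1]  # (rows, feats, classes) -> positive class slice
    if vals.ndim == 3 and vals.shape[-1] == n_feats:
        vals = vals.reshape(-1, vals.shape[-1])
    if vals.ndim == 2 and vals.shape[0] == n_feats:
        vals = vals.T  # (feats, rows) -> (rows, feats)
    return vals


print(normalize_shap(np.zeros((5, 3, 2)), 3).shape)  # (5, 3)
print(normalize_shap(np.zeros((3, 7)), 3).shape)     # (7, 3)
```

Note the last rule is a heuristic: when a matrix happens to be square (n_rows == n_feats), it cannot distinguish the transposed layout from the correct one.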


def mean_abs_shap(shap_matrix, feature_names):
    """
    Global mean absolute SHAP per feature.
    Still useful for descriptive plots, but not used for instance-level stability.
    """
    return pd.Series(
        np.abs(shap_matrix).mean(axis=0),
        index=np.asarray(feature_names),
    )
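A quick self-contained check of that aggregation, using toy values rather than real SHAP output (numpy and pandas only):

```python
import numpy as np
import pandas as pd


def mean_abs_shap(shap_matrix, feature_names):
    # mean of |SHAP| over rows, one value per feature
    return pd.Series(
        np.abs(shap_matrix).mean(axis=0),
        index=np.asarray(feature_names),
    )


# 2 instances, 2 features; signs cancel in a plain mean but not here
M = np.array([[1.0, -2.0], [-3.0, 2.0]])
s = mean_abs_shap(M, ["a", "b"])
print(s["a"], s["b"])  # 2.0 2.0
```

Taking the absolute value before averaging is the point: feature "a" has a plain mean of -1.0 across instances, but its typical contribution magnitude is 2.0.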


def _align_matrices_to_union(mats_with_names):
    """
    mats_with_names: list of (shap_matrix, feature_names)

    Each shap_matrix has shape (n_instances, n_features_k).
    feature_names is array-like of length n_features_k.

    Returns:
        all_feats: list[str]
        T: np.ndarray of shape (n_models, n_instances, n_all_feats)
    """
    if not mats_with_names:
        raise ValueError("No SHAP matrices provided")

    # check that all matrices have the same number of instances
    n_instances = mats_with_names[0][0].shape[0]
    for M, names in mats_with_names:
        if M.shape[0] != n_instances:
            raise ValueError(
                f"All SHAP matrices must have same number of rows. "
                f"Expected {n_instances}, got {M.shape[0]}"
            )

    # union of all feature names
    all_feats = sorted(
        set().union(*[set(np.asarray(names)) for _, names in mats_with_names])
    )
    n_models = len(mats_with_names)
    n_feats = len(all_feats)

    T = np.zeros((n_models, n_instances, n_feats), dtype=float)
    feat_index = {f: j for j, f in enumerate(all_feats)}

    for m_idx, (M, names) in enumerate(mats_with_names):
        names = np.asarray(names)
        col_map = {name: c for c, name in enumerate(names)}
        for fname, j_global in feat_index.items():
            if fname in col_map:
                j_local = col_map[fname]
                T[m_idx, :, j_global] = M[:, j_local]
            # if fname not present, leave zeros

    return all_feats, T
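A toy illustration of the union alignment, as a condensed self-contained version of `_align_matrices_to_union` (two single-instance matrices with partially overlapping feature sets):

```python
import numpy as np


def align_to_union(mats_with_names):
    # sorted union of feature names gives a deterministic column order
    all_feats = sorted(set().union(*[set(names) for _, names in mats_with_names]))
    n_inst = mats_with_names[0][0].shape[0]
    T = np.zeros((len(mats_with_names), n_inst, len(all_feats)))
    for m_idx, (M, names) in enumerate(mats_with_names):
        col = {n: c for c, n in enumerate(names)}
        for j, f in enumerate(all_feats):
            if f in col:
                T[m_idx, :, j] = M[:, col[f]]  # features absent from a model stay zero
    return all_feats, T


A = (np.array([[1.0, 2.0]]), ["a", "b"])  # model 1 saw features a, b
B = (np.array([[3.0, 4.0]]), ["b", "c"])  # model 2 saw features b, c
feats, T = align_to_union([A, B])
print(feats)             # ['a', 'b', 'c']
print(T[0, 0], T[1, 0])  # [1. 2. 0.] [0. 3. 4.]
```

Zero-filling absent features is a modeling choice: a feature that a selector dropped contributes nothing for that model, which means feature-selection disagreements between models show up directly as SHAP disagreement in the stability score.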


def shap_stability_from_matrices(mats_with_names):
    """
    mats_with_names: list of tuples (shap_matrix, feature_names)
        shap_matrix: np.ndarray of shape (n_instances, n_features_k)
        feature_names: list or array of length n_features_k

    This measures instance-level stability:
    for each instance i and feature j we look at the SHAP values
    across models and compute their standard deviation.

    Steps:
        1) Align all matrices on the union of feature names.
        2) Build tensor T of shape (n_models, n_instances, n_features_union).
        3) Compute std across models: per_inst_feat_std = T.std(axis=0).
        4) Aggregate:
            agg_std = mean of per_inst_feat_std over instances and features.
            stability_score = 1 / (1 + agg_std).

    Returns:
        agg_std: float
        stability_score: float
        per_feat_std: pd.Series with mean std per feature over instances
        per_inst_std: np.ndarray with mean std per instance over features
    """
    if not mats_with_names:
        raise ValueError("No SHAP matrices provided")

    if len(mats_with_names) < 2:
        raise ValueError(
            f"Need at least 2 models to estimate stability, got {len(mats_with_names)}"
        )

    feat_names_union, T = _align_matrices_to_union(mats_with_names)

    # std across models for each instance and feature
    per_inst_feat_std = T.std(axis=0)  # shape (n_instances, n_features)

    # aggregate
    agg_std = float(per_inst_feat_std.mean())
    stability_score = 1.0 / (1.0 + agg_std)

    # per feature: average std over instances
    per_feat_std = per_inst_feat_std.mean(axis=0)
    per_feat_std_series = pd.Series(per_feat_std, index=feat_names_union)

    # per instance: average std over features
    per_inst_std = per_inst_feat_std.mean(axis=1)

    return agg_std, stability_score, per_feat_std_series, per_inst_std
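The aggregation reduces to a few numpy lines. A self-contained worked check with two toy "models" that disagree on one feature (identical matrices would give `agg_std = 0` and a score of exactly 1.0):

```python
import numpy as np

# two models, one instance, two features, already aligned to the same columns
T = np.array([
    [[1.0, 0.0]],  # model 1
    [[3.0, 0.0]],  # model 2
])

per_inst_feat_std = T.std(axis=0)          # population std across models (ddof=0)
agg_std = float(per_inst_feat_std.mean())  # mean over instances and features
stability_score = 1.0 / (1.0 + agg_std)

# feature 0: std of [1, 3] is 1.0; feature 1: std of [0, 0] is 0.0; mean = 0.5
print(agg_std)          # 0.5
print(stability_score)  # 0.666...
```

Note `np.std` defaults to the population standard deviation (ddof=0), so with two models the std of a feature is half the absolute difference between its two SHAP values. The `1 / (1 + agg_std)` transform maps disagreement onto (0, 1], where 1.0 means perfectly reproducible attributions.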