Compare commits

5 Commits
dev ... main

12 changed files with 284 additions and 546 deletions

.gitignore (new file)

@@ -0,0 +1,4 @@
+./src/test/
+./src/cal_housing.csv
+./src/__pycache__
+./src/results_nsga


@@ -1,5 +1,9 @@
Codebase:
https://gitlab.com/university-of-prince-edward-isalnd/explanation-aware-optimization-and-automl/-/tree/main/src?ref_type=heads
Previous Analysis:
https://gitlab.com/agri-food-canada/potato-yield-predictions-by-postal-code-ml
Operation:
Specify working directory (local repo location), cache directory (dataset download location), and
@@ -12,15 +16,84 @@ Code File Structure
Shell scripts
h20_batch.sh ->
nsga_batch.sh ->
grid_search_batch.sh ->
h20_batch.sh -> h20_autoML.py
nsga_batch.sh -> nsga_exp.py
grid_search_batch.sh -> grid_search_exp.py
grid_search_batch calls both algorithms and combine_datasets
Run order should be
datasets -> algorithms -> combine_datasets -> 3 .sh files -> shap_values_computation.py
############################################################################################################################################################
Objective:
Current code is built to perform ML analysis on a potato yield dataset as shown in Potato Yield Predictions by Postal Code ML
The code will need to be modified to work with other datasets
1. Modify code to work with California Housing Price dataset found in datasets.py
(cal_housing, regression dataset)
2. Modify code to work with some other classification-focused dataset
(datasets.py contains cal_housing for regression and three classification datasets)
3. Compare the performance of the models in both situations to establish a baseline of regression vs. classification.
The table should include key performance indicators for both datasets as well as the number of instances in each dataset
4. (Ideally) Make models as easy as possible to migrate between datasets through a user prompt.
Also cache files for easy reference and to make sure the data can be analysed properly later
Files that need changing
dataset = YES
algorithms = NO
nsga_exp = YES
shap_values_computation = NO(?)
############################################################################################################################################################
Scripting Tasks:
datasets -> algorithms -> combine datasets -> nsga_exp.py -> shap_values_computation
1. Make datasets generalizable
2. Make combine datasets reference generalizable headers / infer from input
3. Make nsga_exp.py reference the combine_dataset headers
4. Make output folders specified by user at runtime / in the slurm bash script
Operation Tasks:
1. Run nsga_exp.py using the California Housing Dataset (regression)
2. Run the nsga_exp.py script using a separate, classification dataset
3. Compare results
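For step 3, one way to sketch the comparison table (column names and the instance/feature counts are assumptions to be replaced by the real run outputs):

```python
import pandas as pd

# Hypothetical layout for the comparison table -- the KPI choices and the
# dataset statistics are assumptions; real scores come from the run outputs.
rows = [
    {"dataset": "cal_housing", "task": "regression",
     "n_instances": 20640, "n_features": 8, "kpi": "MSE", "score": None},
    {"dataset": "adult", "task": "classification",
     "n_instances": 48842, "n_features": 14, "kpi": "accuracy", "score": None},
]
comparison = pd.DataFrame(rows)
print(comparison.to_string(index=False))
```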
############################################################################################################################################################
Code Changes:
nsga_exp.py
- Lines 24 & 26 reference yield_t/ha. This should be a parameter
- Lines 33-36 reference relative paths to previous soil.csv files
- Lines 112 and 116 reference a set value of k (k=25). It might be better to set this dynamically based on the size of the dataset
- Lines 141 - 143 reference models_space, pipelines, and k_value range. Should be generalized for other datasets and features
- Line 134 references an nsga output directory. This could be parameterized for other datasets
- Lines 183, 190, and 195 reference specific output path csv files. This will cause overwriting on subsequent runs. Change to store based on run
- Lines 124 - 129 reference models and functions from algorithms.py. This could be generalized to allow any model dictionary but not likely beneficial for this study
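A minimal sketch of the parameterizations listed above (the flag names, the dynamic-k rule, and the run-stamped output folder are assumptions, not the final interface):

```python
import argparse
import os
import time

def parse_args():
    # Hypothetical flags replacing the hard-coded values listed in the notes.
    p = argparse.ArgumentParser()
    p.add_argument("--target", default="yield_t/ha",
                   help="target column (was hard-coded at lines 24 & 26)")
    p.add_argument("--output-base", default="./results_nsga",
                   help="output directory (was hard-coded at line 134)")
    return p.parse_args()

def dynamic_k(n_features, cap=25):
    # Replace the fixed k=25 with a value bounded by the dataset width.
    return min(cap, n_features)

def run_output_dir(base):
    # Timestamp each run so repeated runs do not overwrite earlier CSVs.
    return os.path.join(base, time.strftime("run_%Y%m%d_%H%M%S"))
```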
datasets.py
- A user prompt was added to allow users to choose one of the four datasets and report its task type
- A user prompt was added to choose a target feature and features to exclude
- A user prompt was added for a save location of the processed CSV output of the dataset
############################################################################################################################################################
Code Optimizations:
- SHAP KernelExplainer is slow because it approximates attributions by sampling;
use shap.TreeExplainer on tree-based models instead for exact, faster values


@@ -11,4 +11,4 @@ joblib
# Data acquisition
openml
deap


@@ -1,12 +1,36 @@
 #!/bin/bash
-sudo apt install nfs-common -y
+dnf install -y nfs-utils kernel-modules-extra
 mkdir /mnt/data
-mount 192.168.2.69:/mnt/user/ml_datasets0 /mnt/data
-$WORK_DIR=/mnt/data
+mount -t nfs -o vers=3,proto=tcp 192.168.2.69:/mnt/user/ml_datasets0 /mnt/data
+WORK_DIR=/mnt/data
+mkdir -p /mnt/data/cache  # ensure directory exists
 export OPENML_CACHE_DIR=/mnt/data/cache
 apt install python3-venv
 python3 -m venv <environment_name>
 source <environment_name>/bin/activate
 pip install -r requirements.txt
+chmod -R 750 /root/automl_datasets
+adduser mlly
+passwd mlly
+# mlly
+chown -R mlly:mlly /mnt/data
+chmod -R 750 /mnt/data
+su - mlly
+sudo usermod -aG wheel mlly
+sudo chmod -R u+rwx /mnt/data/automl_datasets/nsga/


@@ -1,92 +1,52 @@
 import pandas as pd
-from loading_climate_data import load_and_process_climate_data
-from loading_crop_data import engineer_crop_features
-from loading_ndvi_data import process_ndvi_data
-from loading_soil_data import get_soil_data
-
-def combine_datasets(crop_data_file,
-                     soil_data_file,
-                     climate_data_file,
-                     ndvi_data_file,
-                     ndvi=False):
-    dataset_loaders = {
-        'soil': lambda: get_soil_data(soil_data_file),
-        'climate': lambda: load_and_process_climate_data(climate_data_file),
-        'ndvi': lambda: process_ndvi_data(ndvi_data_file),
-        'crop': lambda: engineer_crop_features(crop_data_file)
-    }
-    merge_keys = {
-        'soil': ['PostalCode'],
-        'climate': ['PostalCode', 'Year'],
-        'ndvi': ['PostalCode', 'Year'],
-        'crop': ['PostalCode', 'Year']
-    }
-    combined_data = dataset_loaders['crop']()
-    # Merge climate data
-    climate_data = dataset_loaders['climate']()
-    combined_data = pd.merge(combined_data, climate_data, on=merge_keys['climate'], how='left')
-    # Merge NDVI data if required
-    if ndvi:
-        ndvi_data = dataset_loaders['ndvi']()
-        combined_data = combined_data[combined_data['Year'] >= 2000]
-        combined_data = pd.merge(combined_data, ndvi_data, on=merge_keys['ndvi'], how='left')
-    # Merge soil data
-    soil_data = dataset_loaders['soil']()
-    combined_data = pd.merge(combined_data, soil_data, on=merge_keys['soil'], how='left')
-    postal_codes = combined_data['PostalCode']
-    years = combined_data['Year']
-    # Drop irrelevant or redundant columns
-    features_to_exclude = ['PostalCode', 'Year', 'SoilID']
-    combined_data = combined_data.drop(columns=features_to_exclude, errors='ignore')
-    return combined_data
-
-def combine_dataset_pc(crop_data_file,
-                       soil_data_file,
-                       climate_data_file,
-                       ndvi_data_file,
-                       ndvi=False):
-    dataset_loaders = {
-        'soil': lambda: get_soil_data(soil_data_file),
-        'climate': lambda: load_and_process_climate_data(climate_data_file),
-        'ndvi': lambda: process_ndvi_data(ndvi_data_file),
-        'crop': lambda: engineer_crop_features(crop_data_file)
-    }
-    merge_keys = {
-        'soil': ['PostalCode'],
-        'climate': ['PostalCode', 'Year'],
-        'ndvi': ['PostalCode', 'Year'],
-        'crop': ['PostalCode', 'Year']
-    }
-    combined_data = dataset_loaders['crop']()
-    # Merge climate data
-    climate_data = dataset_loaders['climate']()
-    combined_data = pd.merge(combined_data, climate_data, on=merge_keys['climate'], how='left')
-    # Merge NDVI data if required
-    if ndvi:
-        ndvi_data = dataset_loaders['ndvi']()
-        combined_data = combined_data[combined_data['Year'] >= 2000]
-        combined_data = pd.merge(combined_data, ndvi_data, on=merge_keys['ndvi'], how='left')
-    # Merge soil data
-    soil_data = dataset_loaders['soil']()
-    combined_data = pd.merge(combined_data, soil_data, on=merge_keys['soil'], how='left')
-    # Drop irrelevant or redundant columns
-    features_to_exclude = ['Year', 'SoilID']
-    combined_data = combined_data.drop(columns=features_to_exclude, errors='ignore')
-    return combined_data
+
+def combine_datasets_single_csv(
+        csv_file: str,
+        target_column: str = None,
+        exclude_features: list = None
+):
+    """
+    Load and process a single CSV file into a DataFrame suitable for modeling.
+    Mimics the original combine_datasets() structure, returning a single DataFrame.
+
+    Parameters
+    ----------
+    csv_file : str
+        Path to the CSV file to load.
+    target_column : str, optional
+        Column to use as target. If None, assumes last column is the target.
+    exclude_features : list of str, optional
+        List of features to exclude from the DataFrame.
+
+    Returns
+    -------
+    combined_data : pd.DataFrame
+        Processed DataFrame, target column removed if specified.
+        The format is compatible with previous combine_datasets outputs.
+    """
+    # Load CSV
+    combined_data = pd.read_csv(csv_file)
+
+    # Determine target
+    if target_column is None:
+        target_column = combined_data.columns[-1]  # default last column
+
+    if target_column not in combined_data.columns:
+        raise ValueError(f"Target column '{target_column}' not found in CSV.")
+
+    # Separate target column internally (optional)
+    target_series = combined_data[target_column]
+
+    # Drop the target from features
+    combined_data = combined_data.drop(columns=[target_column])
+
+    # Drop additional user-specified features
+    if exclude_features:
+        for col in exclude_features:
+            if col in combined_data.columns:
+                combined_data = combined_data.drop(columns=[col])
+            else:
+                print(f"Warning: '{col}' not found in CSV and cannot be excluded.")
+
+    # Mimic original structure: return combined_data as a DataFrame
+    return combined_data
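A self-contained usage sketch of the new loader (the function is condensed here so the demo runs on its own; the CSV path and column names are illustrative):

```python
import pandas as pd

def combine_datasets_single_csv(csv_file, target_column=None, exclude_features=None):
    # Condensed copy of the new loader above, reproduced for a standalone demo.
    df = pd.read_csv(csv_file)
    if target_column is None:
        target_column = df.columns[-1]  # default to the last column
    if target_column not in df.columns:
        raise ValueError(f"Target column '{target_column}' not found in CSV.")
    df = df.drop(columns=[target_column])
    for col in (exclude_features or []):
        if col in df.columns:
            df = df.drop(columns=[col])
        else:
            print(f"Warning: '{col}' not found in CSV and cannot be excluded.")
    return df

# Build a tiny illustrative CSV, then load it with the single-CSV loader.
pd.DataFrame({"feat_a": [1.0, 2.0], "feat_b": [3.0, 4.0],
              "target": [0.5, 0.7]}).to_csv("example.csv", index=False)
features = combine_datasets_single_csv("example.csv", exclude_features=["feat_b"])
print(features.columns.tolist())  # prints ['feat_a']
```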


@@ -3,48 +3,113 @@ import pandas as pd
 import openml

 # --- CACHE SETUP ---
 # Change this path to your preferred local cache directory
-#CACHE_DIR = os.path.expanduser("~/openml_cache")
-#os.makedirs(CACHE_DIR, exist_ok=True)
-#openml.config.cache_directory = CACHE_DIR
+CACHE_DIR = os.path.expanduser("~/openml_cache")
+os.makedirs(CACHE_DIR, exist_ok=True)
+openml.config.cache_directory = CACHE_DIR

-# OpenML CC18 classification tasks (task ids)
+# --- Dataset IDs ---
 TASKS = {
-    "adult": 7592,      # Adult Income classification
-    "spambase": 43,     # Spambase classification
-    "optdigits": 28,    # Optdigits classification
+    "adult": 7592,
+    "spambase": 43,
+    "optdigits": 28,
 }

 # Regression dataset (dataset id)
 DATASETS = {
     "cal_housing": 44025
 }

+# --- Load functions ---
 def _load_task_dataframe(task_id: int):
     task = openml.tasks.get_task(task_id)
     dataset_id = task.dataset_id
     dataset = openml.datasets.get_dataset(dataset_id)
-    X, y, categorical_indicator, _ = dataset.get_data(
-        dataset_format="dataframe",
-        target=task.target_name
-    )
-    # drop rows with NA target if any
-    if isinstance(y, pd.Series):
-        mask = ~y.isna()
-        X, y = X.loc[mask], y.loc[mask]
-    return X, y
+    X, y, _, _ = dataset.get_data(dataset_format="dataframe", target=task.target_name)
+    mask = ~y.isna()
+    return X.loc[mask], y.loc[mask]

 def load_dataset(name: str):
     if name in TASKS:
         X, y = _load_task_dataframe(TASKS[name])
-        return X, y, "classification"
+        task_type = "classification"
     elif name in DATASETS:
         ds_id = DATASETS[name]
         ds = openml.datasets.get_dataset(ds_id)
-        X, y, categorical_indicator, _ = ds.get_data(
-            dataset_format="dataframe", target=ds.default_target_attribute
-        )
-        mask = ~y.isna()
-        return X.loc[mask], y.loc[mask], "regression"
+        target_col = ds.default_target_attribute
+        X, y, _, _ = ds.get_data(dataset_format="dataframe", target=None)
+        mask = ~X[target_col].isna()
+        X = X.loc[mask]
+        y = X[target_col].loc[mask]
+        task_type = "regression"
     else:
         raise ValueError(f"Unknown dataset {name}")
+    return X, y, task_type

+# --- Interactive main ---
+def main():
+    print("Available datasets:")
+    all_datasets = list(TASKS.keys()) + list(DATASETS.keys())
+    for i, name in enumerate(all_datasets):
+        print(f"{i+1}. {name}")
+    selection = input("Enter the dataset name: ").strip()
+    if selection not in all_datasets:
+        raise ValueError(f"Dataset '{selection}' not recognized.")
+    X, y, task_type = load_dataset(selection)
+
+    # --- Identify default target ---
+    default_target = y.name
+    print(f"\nDefault target column: {default_target}")
+
+    # --- Print all features (without explanations) ---
+    print("\nFeatures in the dataset:")
+    for col in X.columns.unique():
+        print(f"- {col} ({X[col].dtype})")
+
+    # --- Target selection ---
+    target = input("\nEnter the target feature (or press Enter to use default): ").strip()
+    if target:
+        if target not in X.columns:
+            raise ValueError(f"Target feature '{target}' not found.")
+        y = X[target]
+        X = X.drop(columns=[target], errors="ignore")
+    else:
+        target = default_target
+        X = X.drop(columns=[target], errors="ignore")
+
+    # --- Feature exclusion ---
+    exclude_input = input("\nEnter features to exclude (comma-separated), or press Enter to skip: ").strip()
+    if exclude_input:
+        exclude_cols = [col.strip() for col in exclude_input.split(",")]
+        for col in exclude_cols:
+            if col in X.columns:
+                X = X.drop(columns=[col])
+            else:
+                print(f"Warning: '{col}' not found in dataset and cannot be excluded.")
+
+    # --- Show preview ---
+    print("\nFinal dataset preview (first 5 rows):")
+    print(X.head())
+    print("\nTarget preview (first 5 rows):")
+    print(y.head())
+    print(f"\nTask type: {task_type}")
+    print(f"Target column: {target}")
+    print(f"Number of features: {len(X.columns)}")
+
+    # --- Export to CSV ---
+    output_file = input("\nEnter filename to save dataset as CSV (e.g., dataset.csv): ").strip()
+    if output_file:
+        df_export = X.copy()
+        df_export[target] = y  # append target at the end
+        df_export.to_csv(output_file, index=False)
+        print(f"Dataset saved to {output_file} (target column: '{target}')")
+
+        # Save the CSV path to a temporary text file in the current directory
+        temp_path_file = "last_csv_path.txt"
+        full_path = os.path.abspath(output_file)
+        with open(temp_path_file, "w") as f:
+            f.write(full_path)
+        print(f"CSV path written to {temp_path_file}")
+
+if __name__ == "__main__":
+    main()


@@ -1,23 +0,0 @@
#!/bin/bash
#SBATCH --account=def-xander # Replace with your account
#SBATCH --mem=10G # Memory allocation
#SBATCH --time=21:00:00 # Total run time limit (21 hours)
#SBATCH --cpus-per-task=4 # Number of CPU cores per task
#SBATCH --job-name=grid_search_%A # Job name with job ID appended
#SBATCH --output=%x-%j.out # Standard output and error log
#SBATCH --error=%x-%j.err # Separate error log
# Load necessary modules
#module load python/3.8
# Activate your virtual environment
source /env0/bin/activate
# Parameters
TIME=$1
# Run the Python script with the specified time parameter
srun python $WORK_DIR/src/grid_search_exp.py --time $TIME
# Deactivate the virtual environment
deactivate


@@ -1,186 +0,0 @@
import os
import time
import numpy as np
import pandas as pd
import warnings
import shap
from sklearn.model_selection import KFold, ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, RobustScaler, PowerTransformer
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import mean_squared_error
from sklearn.cluster import KMeans
from sklearn.exceptions import ConvergenceWarning
from joblib import dump
from combine_datasets import combine_datasets
from algorithms import lasso, random_forest, gradient_boosting, decision_tree_regressor, ridge_regressor, stacking_lasso
warnings.filterwarnings("ignore", category=ConvergenceWarning)
# Preprocess the data and separate features and target
def preprocess_data(input_data, pipeline):
X = input_data.drop(columns=['yield_t/ha'], errors='ignore')
y = input_data['yield_t/ha']
# Fit-transform the pipeline steps without `y` first
for name, step in pipeline.steps:
if name != 'feature_selection':
X = step.fit_transform(X)
# Apply `SelectKBest` separately, which requires `y`
if 'feature_selection' in [name for name, _ in pipeline.steps]:
X = pipeline.named_steps['feature_selection'].fit_transform(X, y)
return X, y
# Function to compute SHAP values using KMeans clustering
def compute_shap_values_with_kmeans(model, X_train, X_test, n_clusters=10):
# Fit KMeans to the training data with explicit n_init parameter
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit(X_train)
cluster_centers = kmeans.cluster_centers_
# Compute SHAP values on cluster centers
explainer = shap.KernelExplainer(model.predict, cluster_centers)
shap_values = explainer.shap_values(X_test)
return shap_values
# Iterative grid search with time monitoring
def iterative_grid_search(model, param_grid, X, y, cv, time_limit):
best_model = None
best_score = -np.inf
start_time = time.time()
# Manually iterate over parameter combinations
for params in ParameterGrid(param_grid):
elapsed_time = time.time() - start_time
if elapsed_time > time_limit:
print("Time limit exceeded. Stopping search.")
break
model.set_params(**params)
scores = []
for train_idx, test_idx in cv.split(X, y):
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
model.fit(X_train, y_train)
predictions = model.predict(X_test)
score = -mean_squared_error(y_test, predictions)
scores.append(score)
mean_score = np.mean(scores)
if mean_score > best_score:
best_score = mean_score
best_model = model
return best_model
# Main experiment function
def run_experiment_with_dynamic_grid(total_hours=10.0, output_base_directory="./results"):
time_per_algorithm = total_hours * 3600 / 5
# Load and combine datasets
data = combine_datasets(
"./data/potatoes_dataset.csv",
"./data/updated_soil_data_with_awc.csv",
"./data/final_augmented_climate_data.csv",
"./data/NDVI.csv",
ndvi=True
)
pipelines = [
Pipeline([
('normalize', StandardScaler()),
('imputer', SimpleImputer(strategy='mean')),
('poly_features', PolynomialFeatures(degree=2)),
('feature_selection', SelectKBest(score_func=f_regression, k=25))
]),
Pipeline([
('robust', RobustScaler()),
('feature_selection', SelectKBest(score_func=f_regression, k=25)),
('power_transformation', PowerTransformer(method='yeo-johnson'))
])
]
models = {
'lasso': lasso,
'ridge': ridge_regressor,
'random_forest': random_forest,
'gradient_boosting': gradient_boosting,
'decision_tree': decision_tree_regressor,
'stacking_ensemble': stacking_lasso
}
output_directory = os.path.join(output_base_directory, f"grid_search_results_{total_hours}h")
os.makedirs(output_directory, exist_ok=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for model_name, model_func in models.items():
best_pipeline_result = None
for pipeline_idx, pipeline in enumerate(pipelines):
X, y = preprocess_data(data, pipeline)
model, params = model_func()
best_model = iterative_grid_search(
model, params, X, y, cv=kf, time_limit=time_per_algorithm
)
shap_stabilities = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X, y)):
X_train_fold, X_test_fold = X[train_idx], X[test_idx]
y_train_fold = y[train_idx]
best_model.fit(X_train_fold, y_train_fold)
shap_values = compute_shap_values_with_kmeans(best_model, X_train_fold, X_test_fold)
fold_shap_stability = np.std(shap_values, axis=0).mean()
shap_stabilities.append(fold_shap_stability)
shap_stability = np.mean(shap_stabilities)
predictions = best_model.predict(X)
mse = mean_squared_error(y, predictions)
if best_pipeline_result is None or (
mse < best_pipeline_result['mse'] and shap_stability < best_pipeline_result['shap_stability']):
best_pipeline_result = {
'model': best_model,
'pipeline_idx': pipeline_idx,
'mse': mse,
'shap_stability': shap_stability
}
if best_pipeline_result:
model_output_dir = os.path.join(output_directory, model_name)
os.makedirs(model_output_dir, exist_ok=True)
model_file_path = os.path.join(model_output_dir, f"{model_name}_best_model.joblib")
dump(best_pipeline_result['model'], model_file_path)
metrics_df = pd.DataFrame({
'Model': [model_name],
'Pipeline': [f"Pipeline_{best_pipeline_result['pipeline_idx'] + 1}"],
'MSE': [best_pipeline_result['mse']],
'SHAP_Stability': [best_pipeline_result['shap_stability']]
})
metrics_csv_path = os.path.join(output_directory, "metrics_summary_GS.csv")
metrics_df.to_csv(metrics_csv_path, index=False, mode='a', header=not os.path.exists(metrics_csv_path))
print(f"Saved best results for {model_name} from Pipeline {best_pipeline_result['pipeline_idx'] + 1}")
print(f"All results saved to {output_directory}")
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser(description='Run Grid Search for different models.')
parser.add_argument('--time', type=float, required=True, help='Time for the model to run in hours')
args = parser.parse_args()
run_experiment_with_dynamic_grid(total_hours=args.time, output_base_directory="./results_grid_search")


@@ -1,166 +0,0 @@
import os
import shap
import numpy as np
import pandas as pd
import h2o
from h2o.automl import H2OAutoML
from sklearn.preprocessing import StandardScaler, RobustScaler, PolynomialFeatures, PowerTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import mean_squared_error
from sklearn.cluster import KMeans
from combine_datasets import combine_datasets
# Initialize H2O
h2o.init()
# Preprocess the data and separate features and target
def preprocess_data(input_data, pipeline):
X = input_data.drop(columns=['yield_t/ha'], errors='ignore')
y = input_data['yield_t/ha']
# Fit-transform the pipeline steps without `y` first
for name, step in pipeline.steps:
if name != 'feature_selection':
X = step.fit_transform(X)
# Apply `SelectKBest` separately, which requires `y`
if 'feature_selection' in [name for name, _ in pipeline.steps]:
X = pipeline.named_steps['feature_selection'].fit_transform(X, y)
return X, y
# Function to compute SHAP values using KMeans clustering
def compute_shap_values_with_kmeans(model, X_train, X_test, n_clusters=10):
# Fit KMeans to the training data with explicit n_init parameter
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit(X_train)
cluster_centers = kmeans.cluster_centers_
# Convert cluster centers to H2OFrame
cluster_centers_h2o = h2o.H2OFrame(cluster_centers)
# Convert X_test to H2OFrame
X_test_h2o = h2o.H2OFrame(X_test)
# Compute SHAP values on cluster centers
explainer = shap.KernelExplainer(lambda x: model.predict(h2o.H2OFrame(x)).as_data_frame().values.flatten(),
cluster_centers_h2o.as_data_frame().values)
shap_values = explainer.shap_values(X_test_h2o.as_data_frame().values)
return shap_values
# Function to load and combine datasets
def load_and_combine_datasets(use_ndvi=True):
data = combine_datasets(
"./data/potatoes_dataset.csv",
"./data/updated_soil_data_with_awc.csv",
"./data/final_augmented_climate_data.csv",
"./data/NDVI.csv",
ndvi=use_ndvi
)
return data
# Main experiment function with H2O AutoML
def run_h2o_experiment(use_ndvi=True, automl_time=10, output_base_directory="./results_h2o"):
data = load_and_combine_datasets(use_ndvi=use_ndvi)
pipelines = [
Pipeline([
('normalize', StandardScaler()),
('imputer', SimpleImputer(strategy='mean')),
('poly_features', PolynomialFeatures(degree=2)),
('feature_selection', SelectKBest(score_func=f_regression, k=25))
]),
Pipeline([
('robust', RobustScaler()),
('feature_selection', SelectKBest(score_func=f_regression, k=25)),
('power_transformation', PowerTransformer(method='yeo-johnson'))
])
]
output_directory = os.path.join(output_base_directory, f"h2o_automl_results_{automl_time}h")
os.makedirs(output_directory, exist_ok=True)
metrics_df = pd.DataFrame(columns=['Model', 'Pipeline', 'MSE', 'SHAP_Stability'])
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for pipeline_idx, pipeline in enumerate(pipelines):
X_processed, y = preprocess_data(data, pipeline)
X_h2o = h2o.H2OFrame(X_processed)
y_h2o = h2o.H2OFrame(y.to_frame())
X_h2o['target'] = y_h2o
shap_stabilities = []
all_shap_values = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X_processed)):
train = X_h2o[train_idx.tolist(), :]
test = X_h2o[test_idx.tolist(), :]
aml = H2OAutoML(max_runtime_secs=int(automl_time * 3600 / kf.get_n_splits()), max_models=5, seed=42,
sort_metric='MSE')
aml.train(x=list(X_h2o.columns[:-1]), y='target', training_frame=train)
best_model = aml.leader
leaderboard = aml.leaderboard.as_data_frame()
leaderboard_csv_path = os.path.join(output_directory, f"leaderboard_pipeline_{pipeline_idx + 1}_fold_{fold + 1}.csv")
leaderboard.to_csv(leaderboard_csv_path, index=False)
model_output_dir = os.path.join(output_directory, f"pipeline_{pipeline_idx + 1}")
os.makedirs(model_output_dir, exist_ok=True)
model_file_path = os.path.join(model_output_dir, f"h2o_best_model_{pipeline_idx + 1}_fold_{fold + 1}.zip")
h2o.save_model(best_model, path=model_file_path)
predictions = best_model.predict(test).as_data_frame()['predict']
mse = mean_squared_error(test['target'].as_data_frame().values, predictions.values)
shap_values = compute_shap_values_with_kmeans(best_model, train.as_data_frame().values,
test.as_data_frame().values)
shap_stability = np.std(shap_values, axis=0).mean()
shap_stabilities.append(shap_stability)
all_shap_values.append(shap_values)
print(f"Completed H2O AutoML and SHAP computation for Pipeline {pipeline_idx + 1}, Fold {fold + 1}")
mean_shap_stability = np.mean(shap_stabilities)
for fold_idx, shap_values in enumerate(all_shap_values):
shap_file_name = f"shap_values_pipeline_{pipeline_idx + 1}_fold_{fold_idx + 1}.npy"
shap_file_path = os.path.join(model_output_dir, shap_file_name)
np.save(shap_file_path, shap_values)
metrics_df = pd.concat([metrics_df, pd.DataFrame([{
'Model': best_model.model_id,
'Pipeline': f"Pipeline_{pipeline_idx + 1}",
'MSE': mse,
'SHAP_Stability': mean_shap_stability
}])], ignore_index=True)
metrics_csv_path = os.path.join(output_directory, "metrics_summary.csv")
metrics_df.to_csv(metrics_csv_path, index=False)
print(f"All results saved to {output_directory}")
h2o.shutdown(prompt=False)
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser(description='Run H2O AutoML for different models.')
parser.add_argument('--time', type=float, required=True, help='Time in hours for the AutoML process to run.')
args = parser.parse_args()
# Run the H2O AutoML experiment
run_h2o_experiment(use_ndvi=True, automl_time=args.time, output_base_directory="./results_h2o")


@@ -1,23 +0,0 @@
#!/bin/bash
#SBATCH --account=def-xander # Replace with your account
#SBATCH --mem=10G # Memory allocation
#SBATCH --time=11:00:00 # Total run time limit (11 hours)
#SBATCH --cpus-per-task=4 # Number of CPU cores per task
#SBATCH --job-name=h20_%A # Job name with job ID appended
#SBATCH --output=%x-%j.out # Standard output and error log
#SBATCH --error=%x-%j.err # Separate error log
# Load necessary modules
module load python/3.8
# Activate your virtual environment
source ~/envs/workdir/bin/activate
# Parameters
TIME=$1
# Run the Python script with the specified time parameter
srun python /home/dvera/scratch/Framework_EXP/h20_autoML.py --time $TIME
# Deactivate the virtual environment
deactivate


@@ -1,9 +1,8 @@
#!/bin/bash
#SBATCH --account=def-xander # Replace with your account
#SBATCH --mem=10G # Memory allocation
#SBATCH --mem=30G # Memory allocation
#SBATCH --time=21:00:00 # Total run time limit (11 hours)
#SBATCH --cpus-per-task=4 # Number of CPU cores per task
#SBATCH --job-name=nsga_%A # Job name with job ID appended
#SBATCH --cpus-per-task=8 # Number of CPU cores per task
#SBATCH --job-name=nsga # Job name with job ID appended
#SBATCH --output=%x-%j.out # Standard output and error log
#SBATCH --error=%x-%j.err # Separate error log
@@ -11,10 +10,10 @@
#module load python/3.8
# Activate your virtual environment
source /env0/bin/activate
source /nsga/bin/activate
# Run the Python script with the specified time parameter
srun python $WORK_DIR/src/nsga_exp.py
srun python /root/automl_datasets/src/nsga_exp.py
# Deactivate the virtual environment
deactivate


@@ -13,31 +13,39 @@ import shap
 from deap import base, creator, tools, algorithms
 from algorithms import lasso, random_forest, gradient_boosting, decision_tree_regressor, ridge_regressor, stacking_lasso
-from combine_datasets import combine_datasets
+from multiprocessing import Pool, cpu_count
+import argparse

 creator.create("FitnessMin", base.Fitness, weights=(-1.0, -1.0))  # Minimize both objectives
 creator.create("Individual", list, fitness=creator.FitnessMin)

+def load_dataset():
+    # Read the CSV path from the temporary file
+    with open("last_csv_path.txt", "r") as f:
+        csv_path = f.read().strip()
+    if not os.path.exists(csv_path):
+        raise FileNotFoundError(f"CSV file not found: {csv_path}")
+    # Load the dataset
+    data = pd.read_csv(csv_path)
+    return data
+
 # Preprocess the data and separate features and target
+#
+# preprocess_data changed to not use k_features
+# feature_selection changed to go by the number of features in the dataset
+# This is antithetical to the larger study but I am not smart enough to make it work properly
+#
 def preprocess_data(input_data, pipeline, k_value):
-    pipeline.named_steps['feature_selection'].set_params(k=k_value)
-    X = input_data.drop(columns=['yield_t/ha'], errors='ignore')
-    print(X.columns)
-    y = input_data['yield_t/ha']
+    X = input_data.iloc[:, :-1]
+    y = input_data.iloc[:, -1]
+    k_value = X.shape[1]
+    pipeline.named_steps['feature_selection'].set_params(k=k_value)
     X = pipeline.fit_transform(X, y)
     return X, y

-# Load and combine datasets
-def load_and_combine_datasets(use_ndvi=True):
-    data = combine_datasets(
-        "./data/potatoes_dataset.csv",
-        "./data/updated_soil_data_with_awc.csv",
-        "./data/final_augmented_climate_data.csv",
-        "./data/NDVI.csv",
-        ndvi=use_ndvi
-    )
-    return data
-
 # Function to compute SHAP values using KMeans clustering
 def compute_shap_values_with_kmeans(model, X_train, X_test, n_clusters=10):
     kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit(X_train)
@@ -94,13 +102,12 @@ def evaluate_individual(individual, data, kf):
     return np.mean(mse_scores), np.mean(shap_stabilities), all_shap_values

 # Run NSGA-II experiment with a time limit
-def run_nsga_experiment(use_ndvi=True,
-                        output_base_directory="./results_nsga",
+def run_nsga_experiment(output_base_directory="./results_nsga",
                         population_size=30,
                         n_generations=50,
                         time_limit=36000):
-    # Load and combine dataset
-    data = load_and_combine_datasets(use_ndvi=use_ndvi)
+    # Load the dataset
+    data = load_dataset()

     # Pipelines for preprocessing
     global pipelines
@@ -156,28 +163,33 @@ def run_nsga_experiment(use_ndvi=True,
     start_time = time.time()
     pop = toolbox.population(n=population_size)

-    # Evaluate the initial population
-    fits = toolbox.map(toolbox.evaluate, pop)
-    for fit, ind in zip(fits, pop):
-        ind.fitness.values = fit[:2]
-        ind.shap_values = fit[2]
-
-    # Use the number of parents and offspring in the evolution process
-    for gen in range(n_generations):
-        if time.time() - start_time > time_limit:
-            print("Time limit exceeded, stopping evolution.")
-            break
-        offspring = algorithms.varAnd(pop, toolbox, cxpb=0.7, mutpb=0.2)
-        fits = toolbox.map(toolbox.evaluate, offspring)
-        for fit, ind in zip(fits, offspring):
-            ind.fitness.values = fit[:2]  # Ensure only the first two values are assigned to fitness
-            ind.shap_values = fit[2]
-        # Select the next generation of parents from the combined pool of parents and offspring
-        pop = toolbox.select(pop + offspring, k=population_size)
+    # Use a Pool to parallelize evaluations
+    with Pool(processes=cpu_count()) as pool:
+        # Evaluate the initial population in parallel
+        fits = list(pool.map(toolbox.evaluate, pop))
+        for fit, ind in zip(fits, pop):
+            ind.fitness.values = fit[:2]
+            ind.shap_values = fit[2]
+
+        # Evolution loop
+        for gen in range(n_generations):
+            if time.time() - start_time > time_limit:
+                print("Time limit exceeded, stopping evolution.")
+                break
+            offspring = algorithms.varAnd(pop, toolbox, cxpb=0.7, mutpb=0.2)
+            # Evaluate offspring in parallel
+            fits = list(pool.map(toolbox.evaluate, offspring))
+            for fit, ind in zip(fits, offspring):
+                ind.fitness.values = fit[:2]
+                ind.shap_values = fit[2]
+            # Select next generation
+            pop = toolbox.select(pop + offspring, k=population_size)

     for ind in pop:
         for fold_idx, shap_values in enumerate(ind.shap_values):
             shap_output_path = os.path.join(output_directory, f"shap_values_{int(ind[0])}_fold_{fold_idx + 1}.npy")
@@ -199,12 +211,11 @@ def run_nsga_experiment(use_ndvi=True,
 if __name__ == "__main__":
     run_nsga_experiment(
-        use_ndvi=True,
         output_base_directory="./results_nsga",
         population_size=80,      # Larger population size for a comprehensive search
         n_generations=100,       # Increased number of generations
-        num_parents=30,          # Increased number of parents
-        num_offspring=50,        # Increased number of offspring
+        # num_parents=30,        # Increased number of parents
+        # num_offspring=50,      # Increased number of offspring
        time_limit=108000         # 20 hours (20 * 3600 seconds)
     )