Compare commits

5 commits

| Author | SHA1 | Date |
|---|---|---|
| | fd0cb67812 | |
| | c16c545bc8 | |
| | 376bc2a8c5 | |
| | 5775534e22 | |
| | 0a081e48f1 | |
.gitignore (vendored, new file, +4)

@@ -0,0 +1,4 @@
./src/test/
./src/cal_housing.csv
./src/__pycache__
./src/results_nsga
(unnamed notes file, modified)

@@ -1,5 +1,9 @@
 Codebase:
 https://gitlab.com/university-of-prince-edward-isalnd/explanation-aware-optimization-and-automl/-/tree/main/src?ref_type=heads
+
+Previous Analysis:
+https://gitlab.com/agri-food-canada/potato-yield-predictions-by-postal-code-ml
+
 Operation:
 Specify working directory (local repo location), cache directory (dataset download location), and

@@ -12,15 +16,84 @@ Code File Structure

 Shell scripts

-h20_batch.sh ->
-nsga_batch.sh ->
-grid_search_batch.sh ->
+h20_batch.sh -> h20_autoML.py
+nsga_batch.sh -> nsga_exp.py
+grid_search_batch.sh -> grid_search_exp.py
+
+grid_search_batch calls both algorithms and combine_datasets
+
+Run order should be
+datasets -> algorithms -> combine_datasets -> 3 .sh files -> shap_values_computation.py
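The run order above can be sketched as a small dry-run driver. The file names come from the notes themselves; the `plan()` helper and the choice to print rather than execute are illustrative assumptions:

```python
# Hypothetical dry-run of the documented pipeline order. The script names are
# taken from the notes above; actually invoking them is deliberately left out.
PIPELINE = [
    "dataset.py",
    "algorithms.py",
    "combine_datasets.py",
    "h20_batch.sh",
    "nsga_batch.sh",
    "grid_search_batch.sh",
    "shap_values_computation.py",
]

def plan(steps=PIPELINE):
    """Return the commands that would run, in order."""
    return [f"run {step}" for step in steps]

for command in plan():
    print(command)
```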
+############################################################
+Objective:
+
+The current code performs an ML analysis on a potato yield dataset, as shown in Potato Yield Predictions by Postal Code ML.
+The code will need to be modified to work with other datasets.
+
+1. Modify the code to work with the California Housing Price dataset found in datasets.py
+   (cal_housing, a regression dataset)
+2. Modify the code to work with some other classification-focused dataset
+   (the dataset.py code contains cal_housing for regression and three classification datasets)
+3. Compare the performance of the model in both situations to get a regression vs. classification baseline.
+   The table should include key performance indicators for both datasets as well as the number of objects in each dataset.
+4. (Ideally) Make the models as easy as possible to migrate between datasets through a user prompt.
+   Also cache files for easy referencing and to make sure that the data can be analysed properly later.
+
+Files that need changing:
+dataset = YES
+algorithms = NO
+nsga_exp = YES
+shap_values_computation = NO(?)
+
+############################################################
+Scripting Tasks:
+datasets -> algorithms -> combine_datasets -> nsga_exp.py -> shap_values_computation
+
+1. Make datasets generalizable
+2. Make combine_datasets reference generalizable headers / infer them from the input
+3. Make nsga_exp.py reference the combine_datasets headers
+4. Make output folders specified by the user at runtime / in the Slurm bash script
+
+Operation Tasks:
+1. Run nsga_exp.py using the California Housing dataset (regression)
+2. Run nsga_exp.py using a separate, classification dataset
+3. Compare results
+
+############################################################
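Scripting Task 4 (user-specified output folders) could look roughly like this; the `--output-dir` flag, its default, and the parser name are assumptions for illustration, not the repo's actual interface:

```python
import argparse
from pathlib import Path

# Sketch only: flag names and defaults are assumed, not taken from nsga_exp.py.
def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="NSGA experiment runner (sketch)")
    parser.add_argument("--time", type=float, default=1.0,
                        help="run-time budget in hours")
    parser.add_argument("--output-dir", type=Path, default=Path("./results_nsga"),
                        help="directory where result files are written")
    return parser.parse_args(argv)

args = parse_args(["--time", "2", "--output-dir", "/tmp/nsga_run"])
print(args.time, args.output_dir)
```

A Slurm wrapper could then pass something like `--output-dir "$WORK_DIR/results_$SLURM_JOB_ID"` so each job writes to its own folder.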
+Code Changes:
+
+nsga_exp.py
+- Lines 24 & 26 reference yield_t/ha. This should be a parameter.
+- Lines 33-36 reference relative paths to previous soil.csv files.
+- Lines 112 and 116 use a fixed value of k (k=25). It might be better to set this dynamically based on the size of the dataset.
+- Lines 141-143 reference models_space, pipelines, and the k_value range. These should be generalized for other datasets and features.
+- Line 134 references an nsga output directory. This could be parameterized for other datasets.
+- Lines 183, 190, and 195 reference specific output .csv paths. This will cause overwriting on subsequent runs. Change to store results per run.
+- Lines 124-129 reference models and functions from algorithms.py. This could be generalized to allow any model dictionary, but that is not likely beneficial for this study.
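Three of the bullets above (the hard-coded yield_t/ha target, the fixed k=25, and the overwriting output paths) could be addressed along these lines. The helper names (`dynamic_k`, `run_output_dir`, `split_features`) are illustrative, not functions from nsga_exp.py:

```python
import os
import time

def dynamic_k(n_features: int, cap: int = 25) -> int:
    # Pick SelectKBest's k from the dataset width instead of hard-coding k=25;
    # the half-width heuristic is an assumption, not the repo's rule.
    return min(cap, max(1, n_features // 2))

def run_output_dir(base: str = "./results_nsga") -> str:
    # Time-stamped per-run folder, so repeated runs stop overwriting each
    # other's CSVs.
    return os.path.join(base, time.strftime("run_%Y%m%d_%H%M%S"))

def split_features(df, target_column: str):
    # Target name becomes a parameter rather than the hard-coded 'yield_t/ha'.
    X = df.drop(columns=[target_column], errors="ignore")
    y = df[target_column]
    return X, y
```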
+datasets.py
+- A user prompt was added to let users choose one of the four datasets and list its type.
+- A user prompt was added to choose a target feature and features to exclude.
+- A user prompt was added for a save location for the processed .csv of the dataset output.
 ############################################################
-Code Changes:
+Code Optimizations:

 - SHAP KernelExplainer
   Use shap.TreeExplainer on tree-based models instead
(requirements file, modified)

@@ -11,4 +11,4 @@ joblib
 # Data acquisition
 openml
+deap
setup.sh (modified, 30 lines changed)

@@ -1,12 +1,36 @@
 #!/bin/bash

-sudo apt install nfs-common -y
+dnf install -y nfs-utils kernel-modules-extra

 mkdir /mnt/data

-mount 192.168.2.69:/mnt/user/ml_datasets0 /mnt/data
+mount -t nfs -o vers=3,proto=tcp 192.168.2.69:/mnt/user/ml_datasets0 /mnt/data

-$WORK_DIR=/mnt/data
+WORK_DIR=/mnt/data

 mkdir -p /mnt/data/cache  # ensure directory exists
 export OPENML_CACHE_DIR=/mnt/data/cache

+apt install python3-venv
+
+python3 -m venv <environment_name>
+source <environment_name>/bin/activate
+
+pip install -r requirements.txt
+
+chmod -R 750 /root/automl_datasets
+
+adduser mlly
+passwd mlly
+# mlly
+
+chown -R mlly:mlly /mnt/data
+chmod -R 750 /mnt/data
+
+su - mlly
+
+sudo usermod -aG wheel mlly
+
+sudo chmod -R u+rwx /mnt/data/automl_datasets/nsga/
(combine_datasets file, modified)

@@ -1,92 +1,52 @@
 import pandas as pd
-from loading_climate_data import load_and_process_climate_data
-from loading_crop_data import engineer_crop_features
-from loading_ndvi_data import process_ndvi_data
-from loading_soil_data import get_soil_data
-
-
-def combine_datasets(crop_data_file,
-                     soil_data_file,
-                     climate_data_file,
-                     ndvi_data_file,
-                     ndvi=False):
-    dataset_loaders = {
-        'soil': lambda: get_soil_data(soil_data_file),
-        'climate': lambda: load_and_process_climate_data(climate_data_file),
-        'ndvi': lambda: process_ndvi_data(ndvi_data_file),
-        'crop': lambda: engineer_crop_features(crop_data_file)
-    }
-
-    merge_keys = {
-        'soil': ['PostalCode'],
-        'climate': ['PostalCode', 'Year'],
-        'ndvi': ['PostalCode', 'Year'],
-        'crop': ['PostalCode', 'Year']
-    }
-
-    combined_data = dataset_loaders['crop']()
-
-    # Merge climate data
-    climate_data = dataset_loaders['climate']()
-    combined_data = pd.merge(combined_data, climate_data, on=merge_keys['climate'], how='left')
-
-    # Merge NDVI data if required
-    if ndvi:
-        ndvi_data = dataset_loaders['ndvi']()
-        combined_data = combined_data[combined_data['Year'] >= 2000]
-        combined_data = pd.merge(combined_data, ndvi_data, on=merge_keys['ndvi'], how='left')
-
-    # Merge soil data
-    soil_data = dataset_loaders['soil']()
-    combined_data = pd.merge(combined_data, soil_data, on=merge_keys['soil'], how='left')
-
-    postal_codes = combined_data['PostalCode']
-    years = combined_data['Year']
-
-    # Drop irrelevant or redundant columns
-    features_to_exclude = ['PostalCode', 'Year', 'SoilID']
-    combined_data = combined_data.drop(columns=features_to_exclude, errors='ignore')
-
-    return combined_data
-
-
-def combine_dataset_pc(crop_data_file,
-                       soil_data_file,
-                       climate_data_file,
-                       ndvi_data_file,
-                       ndvi=False):
-    dataset_loaders = {
-        'soil': lambda: get_soil_data(soil_data_file),
-        'climate': lambda: load_and_process_climate_data(climate_data_file),
-        'ndvi': lambda: process_ndvi_data(ndvi_data_file),
-        'crop': lambda: engineer_crop_features(crop_data_file)
-    }
-
-    merge_keys = {
-        'soil': ['PostalCode'],
-        'climate': ['PostalCode', 'Year'],
-        'ndvi': ['PostalCode', 'Year'],
-        'crop': ['PostalCode', 'Year']
-    }
-
-    combined_data = dataset_loaders['crop']()
-
-    # Merge climate data
-    climate_data = dataset_loaders['climate']()
-    combined_data = pd.merge(combined_data, climate_data, on=merge_keys['climate'], how='left')
-
-    # Merge NDVI data if required
-    if ndvi:
-        ndvi_data = dataset_loaders['ndvi']()
-        combined_data = combined_data[combined_data['Year'] >= 2000]
-        combined_data = pd.merge(combined_data, ndvi_data, on=merge_keys['ndvi'], how='left')
-
-    # Merge soil data
-    soil_data = dataset_loaders['soil']()
-    combined_data = pd.merge(combined_data, soil_data, on=merge_keys['soil'], how='left')
-
-    # Drop irrelevant or redundant columns
-    features_to_exclude = ['Year', 'SoilID']
-    combined_data = combined_data.drop(columns=features_to_exclude, errors='ignore')
-
-    return combined_data
+
+
+def combine_datasets_single_csv(
+        csv_file: str,
+        target_column: str = None,
+        exclude_features: list = None
+):
+    """
+    Load and process a single CSV file into a DataFrame suitable for modeling.
+    Mimics the original combine_datasets() structure, returning a single DataFrame.
+
+    Parameters
+    ----------
+    csv_file : str
+        Path to the CSV file to load.
+    target_column : str, optional
+        Column to use as target. If None, assumes the last column is the target.
+    exclude_features : list of str, optional
+        List of features to exclude from the DataFrame.
+
+    Returns
+    -------
+    combined_data : pd.DataFrame
+        Processed DataFrame, target column removed if specified.
+        The format is compatible with previous combine_datasets outputs.
+    """
+    # Load CSV
+    combined_data = pd.read_csv(csv_file)
+
+    # Determine target
+    if target_column is None:
+        target_column = combined_data.columns[-1]  # default: last column
+
+    if target_column not in combined_data.columns:
+        raise ValueError(f"Target column '{target_column}' not found in CSV.")
+
+    # Separate target column internally (optional)
+    target_series = combined_data[target_column]
+
+    # Drop the target from features
+    combined_data = combined_data.drop(columns=[target_column])
+
+    # Drop additional user-specified features
+    if exclude_features:
+        for col in exclude_features:
+            if col in combined_data.columns:
+                combined_data = combined_data.drop(columns=[col])
+            else:
+                print(f"Warning: '{col}' not found in CSV and cannot be excluded.")
+
+    # Mimic original structure: return combined_data as a DataFrame
+    return combined_data
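A usage sketch of the new single-CSV loader. The function body here is a trimmed re-statement so the example is self-contained; in the repo you would import `combine_datasets_single_csv` from its module instead:

```python
import io
import pandas as pd

# Trimmed re-statement of the loader above, for illustration only.
def combine_datasets_single_csv(csv_file, target_column=None, exclude_features=None):
    data = pd.read_csv(csv_file)
    if target_column is None:
        target_column = data.columns[-1]          # default: last column is target
    if target_column not in data.columns:
        raise ValueError(f"Target column '{target_column}' not found in CSV.")
    data = data.drop(columns=[target_column])     # target removed from features
    for col in exclude_features or []:
        data = data.drop(columns=[col], errors="ignore")
    return data

# Tiny in-memory CSV standing in for a real file on disk.
csv = io.StringIO("rooms,age,price\n3,10,250\n4,5,410\n")
features = combine_datasets_single_csv(csv, exclude_features=["age"])
print(list(features.columns))  # target 'price' and excluded 'age' are dropped
```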
src/dataset.py (modified, 113 lines changed)

@@ -3,48 +3,113 @@ import pandas as pd
 import openml

 # --- CACHE SETUP ---
-# Change this path to your preferred local cache directory
-CACHE_DIR = os.path.expanduser("~/openml_cache")
-os.makedirs(CACHE_DIR, exist_ok=True)
-openml.config.cache_directory = CACHE_DIR
+#CACHE_DIR = os.path.expanduser("~/openml_cache")
+#os.makedirs(CACHE_DIR, exist_ok=True)
+#openml.config.cache_directory = CACHE_DIR

-# OpenML CC18 classification tasks (task ids)
+# --- Dataset IDs ---
 TASKS = {
-    "adult": 7592,      # Adult Income classification
-    "spambase": 43,     # Spambase classification
-    "optdigits": 28,    # Optdigits classification
+    "adult": 7592,
+    "spambase": 43,
+    "optdigits": 28,
 }

-# Regression dataset (dataset id)
 DATASETS = {
     "cal_housing": 44025
 }

+# --- Load functions ---
 def _load_task_dataframe(task_id: int):
     task = openml.tasks.get_task(task_id)
     dataset_id = task.dataset_id
     dataset = openml.datasets.get_dataset(dataset_id)
-    X, y, categorical_indicator, _ = dataset.get_data(
-        dataset_format="dataframe",
-        target=task.target_name
-    )
-    # drop rows with NA target if any
-    if isinstance(y, pd.Series):
-        mask = ~y.isna()
-        X, y = X.loc[mask], y.loc[mask]
-    return X, y
+    X, y, _, _ = dataset.get_data(dataset_format="dataframe", target=task.target_name)
+    mask = ~y.isna()
+    return X.loc[mask], y.loc[mask]


 def load_dataset(name: str):
     if name in TASKS:
         X, y = _load_task_dataframe(TASKS[name])
-        return X, y, "classification"
+        task_type = "classification"
     elif name in DATASETS:
         ds_id = DATASETS[name]
         ds = openml.datasets.get_dataset(ds_id)
-        X, y, categorical_indicator, _ = ds.get_data(
-            dataset_format="dataframe", target=ds.default_target_attribute
-        )
-        mask = ~y.isna()
-        return X.loc[mask], y.loc[mask], "regression"
+        target_col = ds.default_target_attribute
+        X, y, _, _ = ds.get_data(dataset_format="dataframe", target=None)
+        mask = ~X[target_col].isna()
+        X = X.loc[mask]
+        y = X[target_col].loc[mask]
+        task_type = "regression"
     else:
         raise ValueError(f"Unknown dataset {name}")
+    return X, y, task_type
+
+
+# --- Interactive main ---
+def main():
+    print("Available datasets:")
+    all_datasets = list(TASKS.keys()) + list(DATASETS.keys())
+    for i, name in enumerate(all_datasets):
+        print(f"{i+1}. {name}")
+
+    selection = input("Enter the dataset name: ").strip()
+    if selection not in all_datasets:
+        raise ValueError(f"Dataset '{selection}' not recognized.")
+
+    X, y, task_type = load_dataset(selection)
+
+    # --- Identify default target ---
+    default_target = y.name
+    print(f"\nDefault target column: {default_target}")
+
+    # --- Print all features (without explanations) ---
+    print("\nFeatures in the dataset:")
+    for col in X.columns.unique():
+        print(f"- {col} ({X[col].dtype})")
+
+    # --- Target selection ---
+    target = input("\nEnter the target feature (or press Enter to use default): ").strip()
+    if target:
+        if target not in X.columns:
+            raise ValueError(f"Target feature '{target}' not found.")
+        y = X[target]
+        X = X.drop(columns=[target], errors="ignore")
+    else:
+        target = default_target
+        X = X.drop(columns=[target], errors="ignore")
+
+    # --- Feature exclusion ---
+    exclude_input = input("\nEnter features to exclude (comma-separated), or press Enter to skip: ").strip()
+    if exclude_input:
+        exclude_cols = [col.strip() for col in exclude_input.split(",")]
+        for col in exclude_cols:
+            if col in X.columns:
+                X = X.drop(columns=[col])
+            else:
+                print(f"Warning: '{col}' not found in dataset and cannot be excluded.")
+
+    # --- Show preview ---
+    print("\nFinal dataset preview (first 5 rows):")
+    print(X.head())
+    print("\nTarget preview (first 5 rows):")
+    print(y.head())
+    print(f"\nTask type: {task_type}")
+    print(f"Target column: {target}")
+    print(f"Number of features: {len(X.columns)}")
+
+    # --- Export to CSV ---
+    output_file = input("\nEnter filename to save dataset as CSV (e.g., dataset.csv): ").strip()
+    if output_file:
+        df_export = X.copy()
+        df_export[target] = y  # append target at the end
+        df_export.to_csv(output_file, index=False)
+        print(f"Dataset saved to {output_file} (target column: '{target}')")
+
+        # Save the CSV path to a temporary text file in the current directory
+        temp_path_file = "last_csv_path.txt"
+        full_path = os.path.abspath(output_file)
+        with open(temp_path_file, "w") as f:
+            f.write(full_path)
+        print(f"CSV path written to {temp_path_file}")
+
+
+if __name__ == "__main__":
+    main()
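The NA-masking that `load_dataset()` applies to the target, followed by the target/feature split that `main()` performs, boils down to this pandas pattern (toy column names assumed; the real frame comes from OpenML):

```python
import pandas as pd

# Assumed toy frame; in dataset.py the frame comes from openml and the target
# name from ds.default_target_attribute.
df = pd.DataFrame({
    "median_income": [8.3, 7.2, 5.6, 3.8],
    "house_value": [452600.0, 358500.0, None, 341300.0],
})
target_col = "house_value"

mask = ~df[target_col].isna()             # keep only rows with a labelled target
y = df.loc[mask, target_col]
X = df.loc[mask].drop(columns=[target_col])
print(X.shape, y.shape)
```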
(deleted file — the Slurm batch script, 23 lines removed)

@@ -1,23 +0,0 @@
#!/bin/bash
#SBATCH --account=def-xander       # Replace with your account
#SBATCH --mem=10G                  # Memory allocation
#SBATCH --time=21:00:00            # Total run time limit
#SBATCH --cpus-per-task=4          # Number of CPU cores per task
#SBATCH --job-name=grid_search_%A  # Job name with job ID appended
#SBATCH --output=%x-%j.out         # Standard output and error log
#SBATCH --error=%x-%j.err          # Separate error log

# Load necessary modules
#module load python/3.8

# Activate your virtual environment
source /env0/bin/activate

# Parameters
TIME=$1

# Run the Python script with the specified time parameter
srun python $WORK_DIR/src/grid_search_exp.py --time $TIME

# Deactivate the virtual environment
deactivate
(deleted file — the grid-search experiment script, 186 lines removed)

@@ -1,186 +0,0 @@
import os
import time
import numpy as np
import pandas as pd
import warnings
import shap
from sklearn.model_selection import KFold, ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, RobustScaler, PowerTransformer
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import mean_squared_error
from sklearn.cluster import KMeans
from sklearn.exceptions import ConvergenceWarning
from joblib import dump

from combine_datasets import combine_datasets
from algorithms import lasso, random_forest, gradient_boosting, decision_tree_regressor, ridge_regressor, stacking_lasso

warnings.filterwarnings("ignore", category=ConvergenceWarning)


# Preprocess the data and separate features and target
def preprocess_data(input_data, pipeline):
    X = input_data.drop(columns=['yield_t/ha'], errors='ignore')
    y = input_data['yield_t/ha']

    # Fit-transform the pipeline steps without `y` first
    for name, step in pipeline.steps:
        if name != 'feature_selection':
            X = step.fit_transform(X)

    # Apply `SelectKBest` separately, which requires `y`
    if 'feature_selection' in [name for name, _ in pipeline.steps]:
        X = pipeline.named_steps['feature_selection'].fit_transform(X, y)

    return X, y


# Function to compute SHAP values using KMeans clustering
def compute_shap_values_with_kmeans(model, X_train, X_test, n_clusters=10):
    # Fit KMeans to the training data with explicit n_init parameter
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit(X_train)
    cluster_centers = kmeans.cluster_centers_

    # Compute SHAP values on cluster centers
    explainer = shap.KernelExplainer(model.predict, cluster_centers)
    shap_values = explainer.shap_values(X_test)

    return shap_values


# Iterative grid search with time monitoring
def iterative_grid_search(model, param_grid, X, y, cv, time_limit):
    best_model = None
    best_score = -np.inf
    start_time = time.time()

    # Manually iterate over parameter combinations
    for params in ParameterGrid(param_grid):
        elapsed_time = time.time() - start_time
        if elapsed_time > time_limit:
            print("Time limit exceeded. Stopping search.")
            break

        model.set_params(**params)
        scores = []

        for train_idx, test_idx in cv.split(X, y):
            X_train, X_test = X[train_idx], X[test_idx]
            y_train, y_test = y[train_idx], y[test_idx]
            model.fit(X_train, y_train)
            predictions = model.predict(X_test)
            score = -mean_squared_error(y_test, predictions)
            scores.append(score)

        mean_score = np.mean(scores)
        if mean_score > best_score:
            best_score = mean_score
            best_model = model

    return best_model


# Main experiment function
def run_experiment_with_dynamic_grid(total_hours=10.0, output_base_directory="./results"):
    time_per_algorithm = total_hours * 3600 / 5

    # Load and combine datasets
    data = combine_datasets(
        "./data/potatoes_dataset.csv",
        "./data/updated_soil_data_with_awc.csv",
        "./data/final_augmented_climate_data.csv",
        "./data/NDVI.csv",
        ndvi=True
    )

    pipelines = [
        Pipeline([
            ('normalize', StandardScaler()),
            ('imputer', SimpleImputer(strategy='mean')),
            ('poly_features', PolynomialFeatures(degree=2)),
            ('feature_selection', SelectKBest(score_func=f_regression, k=25))
        ]),
        Pipeline([
            ('robust', RobustScaler()),
            ('feature_selection', SelectKBest(score_func=f_regression, k=25)),
            ('power_transformation', PowerTransformer(method='yeo-johnson'))
        ])
    ]

    models = {
        'lasso': lasso,
        'ridge': ridge_regressor,
        'random_forest': random_forest,
        'gradient_boosting': gradient_boosting,
        'decision_tree': decision_tree_regressor,
        'stacking_ensemble': stacking_lasso
    }

    output_directory = os.path.join(output_base_directory, f"grid_search_results_{total_hours}h")
    os.makedirs(output_directory, exist_ok=True)

    kf = KFold(n_splits=5, shuffle=True, random_state=42)

    for model_name, model_func in models.items():
        best_pipeline_result = None

        for pipeline_idx, pipeline in enumerate(pipelines):
            X, y = preprocess_data(data, pipeline)
            model, params = model_func()

            best_model = iterative_grid_search(
                model, params, X, y, cv=kf, time_limit=time_per_algorithm
            )

            shap_stabilities = []
            for fold, (train_idx, test_idx) in enumerate(kf.split(X, y)):
                X_train_fold, X_test_fold = X[train_idx], X[test_idx]
                y_train_fold = y[train_idx]

                best_model.fit(X_train_fold, y_train_fold)
                shap_values = compute_shap_values_with_kmeans(best_model, X_train_fold, X_test_fold)
                fold_shap_stability = np.std(shap_values, axis=0).mean()
                shap_stabilities.append(fold_shap_stability)

            shap_stability = np.mean(shap_stabilities)
            predictions = best_model.predict(X)
            mse = mean_squared_error(y, predictions)

            if best_pipeline_result is None or (
                    mse < best_pipeline_result['mse'] and shap_stability < best_pipeline_result['shap_stability']):
                best_pipeline_result = {
                    'model': best_model,
                    'pipeline_idx': pipeline_idx,
                    'mse': mse,
                    'shap_stability': shap_stability
                }

        if best_pipeline_result:
            model_output_dir = os.path.join(output_directory, model_name)
            os.makedirs(model_output_dir, exist_ok=True)

            model_file_path = os.path.join(model_output_dir, f"{model_name}_best_model.joblib")
            dump(best_pipeline_result['model'], model_file_path)

            metrics_df = pd.DataFrame({
                'Model': [model_name],
                'Pipeline': [f"Pipeline_{best_pipeline_result['pipeline_idx'] + 1}"],
                'MSE': [best_pipeline_result['mse']],
                'SHAP_Stability': [best_pipeline_result['shap_stability']]
            })
            metrics_csv_path = os.path.join(output_directory, "metrics_summary_GS.csv")
            metrics_df.to_csv(metrics_csv_path, index=False, mode='a', header=not os.path.exists(metrics_csv_path))

            print(f"Saved best results for {model_name} from Pipeline {best_pipeline_result['pipeline_idx'] + 1}")

    print(f"All results saved to {output_directory}")


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description='Run Grid Search for different models.')
    parser.add_argument('--time', type=float, required=True, help='Time for the model to run in hours')
    args = parser.parse_args()
    run_experiment_with_dynamic_grid(total_hours=args.time, output_base_directory="./results_grid_search")
(deleted file — the H2O AutoML experiment script, 166 lines removed)

@@ -1,166 +0,0 @@
import os
import shap
import numpy as np
import pandas as pd
import h2o
from h2o.automl import H2OAutoML
from sklearn.preprocessing import StandardScaler, RobustScaler, PolynomialFeatures, PowerTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import mean_squared_error
from sklearn.cluster import KMeans

from combine_datasets import combine_datasets

# Initialize H2O
h2o.init()


# Preprocess the data and separate features and target
def preprocess_data(input_data, pipeline):
    X = input_data.drop(columns=['yield_t/ha'], errors='ignore')
    y = input_data['yield_t/ha']

    # Fit-transform the pipeline steps without `y` first
    for name, step in pipeline.steps:
        if name != 'feature_selection':
            X = step.fit_transform(X)

    # Apply `SelectKBest` separately, which requires `y`
    if 'feature_selection' in [name for name, _ in pipeline.steps]:
        X = pipeline.named_steps['feature_selection'].fit_transform(X, y)

    return X, y


# Function to compute SHAP values using KMeans clustering
def compute_shap_values_with_kmeans(model, X_train, X_test, n_clusters=10):
    # Fit KMeans to the training data with explicit n_init parameter
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit(X_train)
    cluster_centers = kmeans.cluster_centers_

    # Convert cluster centers to H2OFrame
    cluster_centers_h2o = h2o.H2OFrame(cluster_centers)

    # Convert X_test to H2OFrame
    X_test_h2o = h2o.H2OFrame(X_test)

    # Compute SHAP values on cluster centers
    explainer = shap.KernelExplainer(lambda x: model.predict(h2o.H2OFrame(x)).as_data_frame().values.flatten(),
                                     cluster_centers_h2o.as_data_frame().values)
    shap_values = explainer.shap_values(X_test_h2o.as_data_frame().values)

    return shap_values


# Function to load and combine datasets
def load_and_combine_datasets(use_ndvi=True):
    data = combine_datasets(
        "./data/potatoes_dataset.csv",
        "./data/updated_soil_data_with_awc.csv",
        "./data/final_augmented_climate_data.csv",
        "./data/NDVI.csv",
        ndvi=use_ndvi
    )
    return data


# Main experiment function with H2O AutoML
def run_h2o_experiment(use_ndvi=True, automl_time=10, output_base_directory="./results_h2o"):
    data = load_and_combine_datasets(use_ndvi=use_ndvi)
    pipelines = [
        Pipeline([
            ('normalize', StandardScaler()),
            ('imputer', SimpleImputer(strategy='mean')),
            ('poly_features', PolynomialFeatures(degree=2)),
            ('feature_selection', SelectKBest(score_func=f_regression, k=25))
        ]),
        Pipeline([
            ('robust', RobustScaler()),
            ('feature_selection', SelectKBest(score_func=f_regression, k=25)),
            ('power_transformation', PowerTransformer(method='yeo-johnson'))
        ])
    ]

    output_directory = os.path.join(output_base_directory, f"h2o_automl_results_{automl_time}h")
    os.makedirs(output_directory, exist_ok=True)

    metrics_df = pd.DataFrame(columns=['Model', 'Pipeline', 'MSE', 'SHAP_Stability'])

    kf = KFold(n_splits=5, shuffle=True, random_state=42)

    for pipeline_idx, pipeline in enumerate(pipelines):
        X_processed, y = preprocess_data(data, pipeline)

        X_h2o = h2o.H2OFrame(X_processed)
        y_h2o = h2o.H2OFrame(y.to_frame())
        X_h2o['target'] = y_h2o

        shap_stabilities = []
        all_shap_values = []

        for fold, (train_idx, test_idx) in enumerate(kf.split(X_processed)):
            train = X_h2o[train_idx.tolist(), :]
            test = X_h2o[test_idx.tolist(), :]

            aml = H2OAutoML(max_runtime_secs=int(automl_time * 3600 / kf.get_n_splits()), max_models=5, seed=42,
                            sort_metric='MSE')
            aml.train(x=list(X_h2o.columns[:-1]), y='target', training_frame=train)

            best_model = aml.leader

            leaderboard = aml.leaderboard.as_data_frame()
            leaderboard_csv_path = os.path.join(output_directory, f"leaderboard_pipeline_{pipeline_idx + 1}_fold_{fold + 1}.csv")
            leaderboard.to_csv(leaderboard_csv_path, index=False)

            model_output_dir = os.path.join(output_directory, f"pipeline_{pipeline_idx + 1}")
            os.makedirs(model_output_dir, exist_ok=True)
            model_file_path = os.path.join(model_output_dir, f"h2o_best_model_{pipeline_idx + 1}_fold_{fold + 1}.zip")
            h2o.save_model(best_model, path=model_file_path)

            predictions = best_model.predict(test).as_data_frame()['predict']

            mse = mean_squared_error(test['target'].as_data_frame().values, predictions.values)

            shap_values = compute_shap_values_with_kmeans(best_model, train.as_data_frame().values,
                                                          test.as_data_frame().values)
            shap_stability = np.std(shap_values, axis=0).mean()
            shap_stabilities.append(shap_stability)
            all_shap_values.append(shap_values)

            print(f"Completed H2O AutoML and SHAP computation for Pipeline {pipeline_idx + 1}, Fold {fold + 1}")

(capture ends here; the remainder of the removed file was not included in the page extract)
mean_shap_stability = np.mean(shap_stabilities)
|
|
||||||
for fold_idx, shap_values in enumerate(all_shap_values):
|
|
||||||
shap_file_name = f"shap_values_pipeline_{pipeline_idx + 1}_fold_{fold_idx + 1}.npy"
|
|
||||||
shap_file_path = os.path.join(model_output_dir, shap_file_name)
|
|
||||||
np.save(shap_file_path, shap_values)
|
|
||||||
|
|
||||||
metrics_df = pd.concat([metrics_df, pd.DataFrame([{
|
|
||||||
'Model': best_model.model_id,
|
|
||||||
'Pipeline': f"Pipeline_{pipeline_idx + 1}",
|
|
||||||
'MSE': mse,
|
|
||||||
'SHAP_Stability': mean_shap_stability
|
|
||||||
}])], ignore_index=True)
|
|
||||||
|
|
||||||
metrics_csv_path = os.path.join(output_directory, "metrics_summary.csv")
|
|
||||||
metrics_df.to_csv(metrics_csv_path, index=False)
|
|
||||||
|
|
||||||
print(f"All results saved to {output_directory}")
|
|
||||||
|
|
||||||
h2o.shutdown(prompt=False)
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
import argparse
|
|
||||||
|
|
||||||
parser = argparse.ArgumentParser(description='Run H2O AutoML for different models.')
|
|
||||||
parser.add_argument('--time', type=float, required=True, help='Time in hours for the AutoML process to run.')
|
|
||||||
args = parser.parse_args()
|
|
||||||
|
|
||||||
# Run the H2O AutoML experiment
|
|
||||||
run_h2o_experiment(use_ndvi=True, automl_time=args.time, output_base_directory="./results_h2o")
|
|
||||||
|
|
||||||
@@ -1,23 +0,0 @@
-#!/bin/bash
-#SBATCH --account=def-xander   # Replace with your account
-#SBATCH --mem=10G              # Memory allocation
-#SBATCH --time=11:00:00        # Total run time limit (11 hours)
-#SBATCH --cpus-per-task=4      # Number of CPU cores per task
-#SBATCH --job-name=h20_%A      # Job name with job ID appended
-#SBATCH --output=%x-%j.out     # Standard output and error log
-#SBATCH --error=%x-%j.err      # Separate error log
-
-# Load necessary modules
-module load python/3.8
-
-# Activate your virtual environment
-source ~/envs/workdir/bin/activate
-
-# Parameters
-TIME=$1
-
-# Run the Python script with the specified time parameter
-srun python /home/dvera/scratch/Framework_EXP/h20_autoML.py --time $TIME
-
-# Deactivate the virtual environment
-deactivate
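Before its removal here, `h20_batch.sh` took the AutoML time budget as its sole positional argument (`TIME=$1`) and forwarded it as `--time`. A hedged sketch of the submission pattern (the `10` is an illustrative budget, not a value taken from the repo):

```shell
# Submit under Slurm; $1 becomes TIME inside the script.
sbatch h20_batch.sh 10

# Equivalent direct invocation without the scheduler:
python h20_autoML.py --time 10
```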
@@ -1,9 +1,8 @@
 #!/bin/bash
-#SBATCH --account=def-xander   # Replace with your account
-#SBATCH --mem=10G              # Memory allocation
+#SBATCH --mem=30G              # Memory allocation
 #SBATCH --time=21:00:00        # Total run time limit (21 hours)
-#SBATCH --cpus-per-task=4      # Number of CPU cores per task
+#SBATCH --cpus-per-task=8      # Number of CPU cores per task
-#SBATCH --job-name=nsga_%A     # Job name with job ID appended
+#SBATCH --job-name=nsga        # Job name with job ID appended
 #SBATCH --output=%x-%j.out     # Standard output and error log
 #SBATCH --error=%x-%j.err      # Separate error log
 
@@ -11,10 +10,10 @@
 #module load python/3.8
 
 # Activate your virtual environment
-source /env0/bin/activate
+source /nsga/bin/activate
 
 # Run the Python script with the specified time parameter
-srun python $WORK_DIR/src/nsga_exp.py
+srun python /root/automl_datasets/src/nsga_exp.py
 
 # Deactivate the virtual environment
 deactivate
@@ -13,31 +13,39 @@ import shap
 from deap import base, creator, tools, algorithms
 from algorithms import lasso, random_forest, gradient_boosting, decision_tree_regressor, ridge_regressor, stacking_lasso
 
-from combine_datasets import combine_datasets
+from multiprocessing import Pool, cpu_count
+import argparse
 
 creator.create("FitnessMin", base.Fitness, weights=(-1.0, -1.0))  # Minimize both objectives
 creator.create("Individual", list, fitness=creator.FitnessMin)
 
+def load_dataset():
+    # Read the CSV path from the temporary file
+    with open("last_csv_path.txt", "r") as f:
+        csv_path = f.read().strip()
+
+    if not os.path.exists(csv_path):
+        raise FileNotFoundError(f"CSV file not found: {csv_path}")
+
+    # Load the dataset
+    data = pd.read_csv(csv_path)
+    return data
+
 # Preprocess the data and separate features and target
+#
+# preprocess_data changed to not use k_features
+# feature_selection changed to go based on the number of features in the dataset
+# This is antithetical to the larger study but I am not smart enough to make it work properly
+#
 def preprocess_data(input_data, pipeline, k_value):
+    X = input_data.iloc[:, :-1]
+    y = input_data.iloc[:, -1]
+    k_value = X.shape[1]
     pipeline.named_steps['feature_selection'].set_params(k=k_value)
-    X = input_data.drop(columns=['yield_t/ha'], errors='ignore')
-    print(X.columns)
-    y = input_data['yield_t/ha']
     X = pipeline.fit_transform(X, y)
     return X, y
 
-# Load and combine datasets
-def load_and_combine_datasets(use_ndvi=True):
-    data = combine_datasets(
-        "./data/potatoes_dataset.csv",
-        "./data/updated_soil_data_with_awc.csv",
-        "./data/final_augmented_climate_data.csv",
-        "./data/NDVI.csv",
-        ndvi=use_ndvi
-    )
-    return data
-
 # Function to compute SHAP values using KMeans clustering
 def compute_shap_values_with_kmeans(model, X_train, X_test, n_clusters=10):
     kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit(X_train)
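The revised `preprocess_data` in this hunk overrides `k_value` with the dataset's actual feature count, which turns `SelectKBest` into a pass-through — every column survives regardless of the input's width, which is what makes the pipeline dataset-agnostic. A minimal sketch of that pattern (toy data and column names are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy frame shaped like load_dataset()'s output: features first, target last.
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(50, 4)), columns=["f1", "f2", "f3", "target"])

pipeline = Pipeline([
    ("normalize", StandardScaler()),
    ("feature_selection", SelectKBest(score_func=f_regression, k=1)),  # k will be overridden
])

X, y = df.iloc[:, :-1], df.iloc[:, -1]
# Mirror the revised preprocess_data: set k to the actual feature count,
# so SelectKBest keeps every column no matter how wide the dataset is.
pipeline.named_steps["feature_selection"].set_params(k=X.shape[1])
X_t = pipeline.fit_transform(X, y)
print(X_t.shape)  # (50, 3) -- all features retained
```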
@@ -94,13 +102,12 @@ def evaluate_individual(individual, data, kf):
     return np.mean(mse_scores), np.mean(shap_stabilities), all_shap_values
 
 # Run NSGA-II experiment with a time limit
-def run_nsga_experiment(use_ndvi=True,
-                        output_base_directory="./results_nsga",
+def run_nsga_experiment(output_base_directory="./results_nsga",
                         population_size=30,
                         n_generations=50,
                         time_limit=36000):
-    # Load and combine dataset
-    data = load_and_combine_datasets(use_ndvi=use_ndvi)
+    # Load the dataset
+    data = load_dataset()
 
     # Pipelines for preprocessing
     global pipelines
@@ -156,28 +163,33 @@ def run_nsga_experiment(use_ndvi=True,
     start_time = time.time()
     pop = toolbox.population(n=population_size)
 
-    # Evaluate the initial population
-    fits = toolbox.map(toolbox.evaluate, pop)
+    # Use a Pool to parallelize evaluations
+    with Pool(processes=cpu_count()) as pool:
+        # Evaluate the initial population in parallel
+        fits = list(pool.map(toolbox.evaluate, pop))
         for fit, ind in zip(fits, pop):
             ind.fitness.values = fit[:2]
             ind.shap_values = fit[2]
 
-        # Use the number of parents and offspring in the evolution process
+        # Evolution loop
         for gen in range(n_generations):
             if time.time() - start_time > time_limit:
                 print("Time limit exceeded, stopping evolution.")
                 break
 
             offspring = algorithms.varAnd(pop, toolbox, cxpb=0.7, mutpb=0.2)
-            fits = toolbox.map(toolbox.evaluate, offspring)
+            # Evaluate offspring in parallel
+            fits = list(pool.map(toolbox.evaluate, offspring))
             for fit, ind in zip(fits, offspring):
-                ind.fitness.values = fit[:2]  # Ensure only the first two values are assigned to fitness
+                ind.fitness.values = fit[:2]
                 ind.shap_values = fit[2]
 
-            # Select the next generation of parents from the combined pool of parents and offspring
+            # Select next generation
             pop = toolbox.select(pop + offspring, k=population_size)
 
     for ind in pop:
         for fold_idx, shap_values in enumerate(ind.shap_values):
             shap_output_path = os.path.join(output_directory, f"shap_values_{int(ind[0])}_fold_{fold_idx + 1}.npy")
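The `Pool` pattern this hunk introduces can be exercised without DEAP. A minimal sketch — using the thread-backed `multiprocessing.dummy.Pool`, which shares the `map` API, so it runs anywhere; the script itself uses the process-based `Pool`, which additionally requires the evaluation function to be picklable. Here `evaluate` is a hypothetical stand-in mimicking `evaluate_individual`'s 3-tuple return:

```python
from multiprocessing.dummy import Pool  # thread-backed Pool with the same map API

def evaluate(individual):
    # Stand-in for toolbox.evaluate: (mse, shap_stability, per-fold SHAP payload)
    x = individual[0]
    return (x * x, abs(x), None)

population = [[-2.0], [0.5], [3.0]]
with Pool(processes=2) as pool:
    fits = list(pool.map(evaluate, population))

# As in the loop above, only the first two values become the NSGA-II fitness;
# the third rides along as ind.shap_values.
fitnesses = [fit[:2] for fit in fits]
print(fitnesses)  # [(4.0, 2.0), (0.25, 0.5), (9.0, 3.0)]
```

Keeping the fitness tuple separate from the SHAP payload is what lets `fit[:2]` feed DEAP's two-objective `FitnessMin` while the raw SHAP arrays are still saved per fold afterwards.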
@@ -199,12 +211,11 @@ def run_nsga_experiment(use_ndvi=True,
 
 if __name__ == "__main__":
     run_nsga_experiment(
-        use_ndvi=True,
         output_base_directory="./results_nsga",
         population_size=80,    # Larger population size for a comprehensive search
         n_generations=100,     # Increased number of generations
-        num_parents=30,        # Increased number of parents
-        num_offspring=50,      # Increased number of offspring
+        # num_parents=30,      # Increased number of parents
+        # num_offspring=50,    # Increased number of offspring
         time_limit=108000      # 30 hours (30 * 3600 seconds)
     )