Compare commits

5 Commits
dev ... main

12 changed files with 284 additions and 546 deletions

.gitignore (new file)

@@ -0,0 +1,4 @@
+./src/test/
+./src/cal_housing.csv
+./src/__pycache__
+./src/results_nsga


@@ -1,5 +1,9 @@
Codebase:
https://gitlab.com/university-of-prince-edward-isalnd/explanation-aware-optimization-and-automl/-/tree/main/src?ref_type=heads
Previous Analysis:
https://gitlab.com/agri-food-canada/potato-yield-predictions-by-postal-code-ml
Operation:
Specify working directory (local repo location), cache directory (dataset download location), and
@@ -12,15 +16,84 @@ Code File Structure
Shell scripts
h20_batch.sh ->
nsga_batch.sh ->
grid_search_batch.sh ->
h20_batch.sh -> h20_autoML.py
nsga_batch.sh -> nsga_exp.py
grid_search_batch.sh -> grid_search_exp.py
grid_search_batch calls both algorithms and combine_datasets
Run order should be
datasets -> algorithms -> combine_datasets -> 3 .sh files -> shap_values_computation.py
############################################################################################################################################################
Objective:
Current code is built to perform ML analysis on a potato yield dataset as shown in Potato Yield Predictions by Postal Code ML
The code will need to be modified to work with other datasets
1. Modify code to work with California Housing Price dataset found in datasets.py
(cal_housing, regression dataset)
2. Modify code to work with some other classification-focused dataset
(datasets.py contains cal_housing for regression and three classification datasets)
3. Compare the performance of the models in both situations to establish a baseline of regression vs. classification.
The table should include key performance indicators for both datasets as well as the number of instances in each dataset
4. (Ideally) Make models as easy as possible to migrate between datasets through a user prompt.
Also cache files for easy reference and to make sure the data can be analysed properly later
Files that need changing
dataset = YES
algorithms = NO
nsga_exp = YES
shap_values_computation = NO(?)
############################################################################################################################################################
Scripting Tasks:
datasets -> algorithms -> combine datasets -> nsga_exp.py -> shap_values_computation
1. Make datasets generalizable
2. Make combine datasets reference generalizable headers / infer from input
3. Make nsga_exp.py reference the combine_dataset headers
4. Make output folders specified by user at runtime / in the slurm bash script
Operation Tasks:
1. Run nsga_exp.py using the California Housing Dataset (regression)
2. Run the nsga_exp.py script using a separate, classification dataset
3. Compare results
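For step 3, one way to sketch the comparison table (column names and the instance/feature counts are assumptions to be replaced by the real run outputs):

```python
import pandas as pd

# Hypothetical layout for the comparison table -- the KPI choices and the
# dataset statistics are assumptions; real scores come from the run outputs.
rows = [
    {"dataset": "cal_housing", "task": "regression",
     "n_instances": 20640, "n_features": 8, "kpi": "MSE", "score": None},
    {"dataset": "adult", "task": "classification",
     "n_instances": 48842, "n_features": 14, "kpi": "accuracy", "score": None},
]
comparison = pd.DataFrame(rows)
print(comparison.to_string(index=False))
```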
############################################################################################################################################################
Code Changes:
nsga_exp.py
- Lines 24 & 26 reference yield_t/ha. This should be a parameter
- Lines 33-36 reference relative paths to previous soil.csv files
- Lines 112 and 116 reference a set value of k (k=25). It might be better to set this dynamically based on the size of the dataset
- Lines 141 - 143 reference models_space, pipelines, and k_value range. Should be generalized for other datasets and features
- Line 134 references an nsga output directory. This could be parameterized for other datasets
- Lines 183, 190, and 195 reference specific output path csv files. This will cause overwriting on subsequent runs. Change to store based on run
- Lines 124 - 129 reference models and functions from algorithms.py. This could be generalized to allow any model dictionary but not likely beneficial for this study
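A minimal sketch of the parameterizations listed above (the flag names, the dynamic-k rule, and the run-stamped output folder are assumptions, not the final interface):

```python
import argparse
import os
import time

def parse_args():
    # Hypothetical flags replacing the hard-coded values listed in the notes.
    p = argparse.ArgumentParser()
    p.add_argument("--target", default="yield_t/ha",
                   help="target column (was hard-coded at lines 24 & 26)")
    p.add_argument("--output-base", default="./results_nsga",
                   help="output directory (was hard-coded at line 134)")
    return p.parse_args()

def dynamic_k(n_features, cap=25):
    # Replace the fixed k=25 with a value bounded by the dataset width.
    return min(cap, n_features)

def run_output_dir(base):
    # Timestamp each run so repeated runs do not overwrite earlier CSVs.
    return os.path.join(base, time.strftime("run_%Y%m%d_%H%M%S"))
```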
datasets.py
- A user prompt was added to allow users to choose one of the four datasets and report its task type
- A user prompt was added to choose a target feature and features to exclude
- A user prompt was added for a save location of the processed CSV output of the dataset
############################################################################################################################################################
Code Optimizations:
- SHAP KernelExplainer is slow because it approximates attributions by sampling;
use shap.TreeExplainer on tree-based models instead for exact, faster values


@@ -11,4 +11,4 @@ joblib
# Data acquisition
openml
deap


@@ -1,12 +1,36 @@
 #!/bin/bash
-sudo apt install nfs-common -y
+dnf install -y nfs-utils kernel-modules-extra
 mkdir /mnt/data
-mount 192.168.2.69:/mnt/user/ml_datasets0 /mnt/data
-$WORK_DIR=/mnt/data
+mount -t nfs -o vers=3,proto=tcp 192.168.2.69:/mnt/user/ml_datasets0 /mnt/data
+WORK_DIR=/mnt/data
+mkdir -p /mnt/data/cache  # ensure directory exists
 export OPENML_CACHE_DIR=/mnt/data/cache
 apt install python3-venv
 python3 -m venv <environment_name>
 source <environment_name>/bin/activate
 pip install -r requirements.txt
+chmod -R 750 /root/automl_datasets
+adduser mlly
+passwd mlly
+# mlly
+chown -R mlly:mlly /mnt/data
+chmod -R 750 /mnt/data
+su - mlly
+sudo usermod -aG wheel mlly
+sudo chmod -R u+rwx /mnt/data/automl_datasets/nsga/


@@ -1,92 +1,52 @@
 import pandas as pd
-from loading_climate_data import load_and_process_climate_data
-from loading_crop_data import engineer_crop_features
-from loading_ndvi_data import process_ndvi_data
-from loading_soil_data import get_soil_data
-
-def combine_datasets(crop_data_file,
-                     soil_data_file,
-                     climate_data_file,
-                     ndvi_data_file,
-                     ndvi=False):
-    dataset_loaders = {
-        'soil': lambda: get_soil_data(soil_data_file),
-        'climate': lambda: load_and_process_climate_data(climate_data_file),
-        'ndvi': lambda: process_ndvi_data(ndvi_data_file),
-        'crop': lambda: engineer_crop_features(crop_data_file)
-    }
-    merge_keys = {
-        'soil': ['PostalCode'],
-        'climate': ['PostalCode', 'Year'],
-        'ndvi': ['PostalCode', 'Year'],
-        'crop': ['PostalCode', 'Year']
-    }
-    combined_data = dataset_loaders['crop']()
-    # Merge climate data
-    climate_data = dataset_loaders['climate']()
-    combined_data = pd.merge(combined_data, climate_data, on=merge_keys['climate'], how='left')
-    # Merge NDVI data if required
-    if ndvi:
-        ndvi_data = dataset_loaders['ndvi']()
-        combined_data = combined_data[combined_data['Year'] >= 2000]
-        combined_data = pd.merge(combined_data, ndvi_data, on=merge_keys['ndvi'], how='left')
-    # Merge soil data
-    soil_data = dataset_loaders['soil']()
-    combined_data = pd.merge(combined_data, soil_data, on=merge_keys['soil'], how='left')
-    postal_codes = combined_data['PostalCode']
-    years = combined_data['Year']
-    # Drop irrelevant or redundant columns
-    features_to_exclude = ['PostalCode', 'Year', 'SoilID']
-    combined_data = combined_data.drop(columns=features_to_exclude, errors='ignore')
-    return combined_data
-
-def combine_dataset_pc(crop_data_file,
-                       soil_data_file,
-                       climate_data_file,
-                       ndvi_data_file,
-                       ndvi=False):
-    dataset_loaders = {
-        'soil': lambda: get_soil_data(soil_data_file),
-        'climate': lambda: load_and_process_climate_data(climate_data_file),
-        'ndvi': lambda: process_ndvi_data(ndvi_data_file),
-        'crop': lambda: engineer_crop_features(crop_data_file)
-    }
-    merge_keys = {
-        'soil': ['PostalCode'],
-        'climate': ['PostalCode', 'Year'],
-        'ndvi': ['PostalCode', 'Year'],
-        'crop': ['PostalCode', 'Year']
-    }
-    combined_data = dataset_loaders['crop']()
-    # Merge climate data
-    climate_data = dataset_loaders['climate']()
-    combined_data = pd.merge(combined_data, climate_data, on=merge_keys['climate'], how='left')
-    # Merge NDVI data if required
-    if ndvi:
-        ndvi_data = dataset_loaders['ndvi']()
-        combined_data = combined_data[combined_data['Year'] >= 2000]
-        combined_data = pd.merge(combined_data, ndvi_data, on=merge_keys['ndvi'], how='left')
-    # Merge soil data
-    soil_data = dataset_loaders['soil']()
-    combined_data = pd.merge(combined_data, soil_data, on=merge_keys['soil'], how='left')
-    # Drop irrelevant or redundant columns
-    features_to_exclude = ['Year', 'SoilID']
-    combined_data = combined_data.drop(columns=features_to_exclude, errors='ignore')
-    return combined_data
+
+def combine_datasets_single_csv(
+        csv_file: str,
+        target_column: str = None,
+        exclude_features: list = None
+):
+    """
+    Load and process a single CSV file into a DataFrame suitable for modeling.
+    Mimics the original combine_datasets() structure, returning a single DataFrame.
+
+    Parameters
+    ----------
+    csv_file : str
+        Path to the CSV file to load.
+    target_column : str, optional
+        Column to use as target. If None, assumes last column is the target.
+    exclude_features : list of str, optional
+        List of features to exclude from the DataFrame.
+
+    Returns
+    -------
+    combined_data : pd.DataFrame
+        Processed DataFrame, target column removed if specified.
+        The format is compatible with previous combine_datasets outputs.
+    """
+    # Load CSV
+    combined_data = pd.read_csv(csv_file)
+
+    # Determine target
+    if target_column is None:
+        target_column = combined_data.columns[-1]  # default last column
+
+    if target_column not in combined_data.columns:
+        raise ValueError(f"Target column '{target_column}' not found in CSV.")
+
+    # Separate target column internally (optional)
+    target_series = combined_data[target_column]
+
+    # Drop the target from features
+    combined_data = combined_data.drop(columns=[target_column])
+
+    # Drop additional user-specified features
+    if exclude_features:
+        for col in exclude_features:
+            if col in combined_data.columns:
+                combined_data = combined_data.drop(columns=[col])
+            else:
+                print(f"Warning: '{col}' not found in CSV and cannot be excluded.")
+
+    # Mimic original structure: return combined_data as a DataFrame
+    return combined_data
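A self-contained usage sketch of the new loader (the function is condensed here so the demo runs on its own; the CSV path and column names are illustrative):

```python
import pandas as pd

def combine_datasets_single_csv(csv_file, target_column=None, exclude_features=None):
    # Condensed copy of the new loader above, reproduced for a standalone demo.
    df = pd.read_csv(csv_file)
    if target_column is None:
        target_column = df.columns[-1]  # default to the last column
    if target_column not in df.columns:
        raise ValueError(f"Target column '{target_column}' not found in CSV.")
    df = df.drop(columns=[target_column])
    for col in (exclude_features or []):
        if col in df.columns:
            df = df.drop(columns=[col])
        else:
            print(f"Warning: '{col}' not found in CSV and cannot be excluded.")
    return df

# Build a tiny illustrative CSV, then load it with the single-CSV loader.
pd.DataFrame({"feat_a": [1.0, 2.0], "feat_b": [3.0, 4.0],
              "target": [0.5, 0.7]}).to_csv("example.csv", index=False)
features = combine_datasets_single_csv("example.csv", exclude_features=["feat_b"])
print(features.columns.tolist())  # prints ['feat_a']
```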


@@ -3,48 +3,113 @@ import pandas as pd
 import openml

 # --- CACHE SETUP ---
 # Change this path to your preferred local cache directory
-#CACHE_DIR = os.path.expanduser("~/openml_cache")
-#os.makedirs(CACHE_DIR, exist_ok=True)
-#openml.config.cache_directory = CACHE_DIR
+CACHE_DIR = os.path.expanduser("~/openml_cache")
+os.makedirs(CACHE_DIR, exist_ok=True)
+openml.config.cache_directory = CACHE_DIR

-# OpenML CC18 classification tasks (task ids)
+# --- Dataset IDs ---
 TASKS = {
-    "adult": 7592,      # Adult Income classification
-    "spambase": 43,     # Spambase classification
-    "optdigits": 28,    # Optdigits classification
+    "adult": 7592,
+    "spambase": 43,
+    "optdigits": 28,
 }

 # Regression dataset (dataset id)
 DATASETS = {
     "cal_housing": 44025
 }

+# --- Load functions ---
 def _load_task_dataframe(task_id: int):
     task = openml.tasks.get_task(task_id)
     dataset_id = task.dataset_id
     dataset = openml.datasets.get_dataset(dataset_id)
-    X, y, categorical_indicator, _ = dataset.get_data(
-        dataset_format="dataframe",
-        target=task.target_name
-    )
-    # drop rows with NA target if any
-    if isinstance(y, pd.Series):
-        mask = ~y.isna()
-        X, y = X.loc[mask], y.loc[mask]
-    return X, y
+    X, y, _, _ = dataset.get_data(dataset_format="dataframe", target=task.target_name)
+    mask = ~y.isna()
+    return X.loc[mask], y.loc[mask]

 def load_dataset(name: str):
     if name in TASKS:
         X, y = _load_task_dataframe(TASKS[name])
-        return X, y, "classification"
+        task_type = "classification"
     elif name in DATASETS:
         ds_id = DATASETS[name]
         ds = openml.datasets.get_dataset(ds_id)
-        X, y, categorical_indicator, _ = ds.get_data(
-            dataset_format="dataframe", target=ds.default_target_attribute
-        )
-        mask = ~y.isna()
-        return X.loc[mask], y.loc[mask], "regression"
+        target_col = ds.default_target_attribute
+        X, y, _, _ = ds.get_data(dataset_format="dataframe", target=None)
+        mask = ~X[target_col].isna()
+        X = X.loc[mask]
+        y = X[target_col].loc[mask]
+        task_type = "regression"
     else:
         raise ValueError(f"Unknown dataset {name}")
+    return X, y, task_type

+# --- Interactive main ---
+def main():
+    print("Available datasets:")
+    all_datasets = list(TASKS.keys()) + list(DATASETS.keys())
+    for i, name in enumerate(all_datasets):
+        print(f"{i+1}. {name}")
+    selection = input("Enter the dataset name: ").strip()
+    if selection not in all_datasets:
+        raise ValueError(f"Dataset '{selection}' not recognized.")
+    X, y, task_type = load_dataset(selection)
+
+    # --- Identify default target ---
+    default_target = y.name
+    print(f"\nDefault target column: {default_target}")
+
+    # --- Print all features (without explanations) ---
+    print("\nFeatures in the dataset:")
+    for col in X.columns.unique():
+        print(f"- {col} ({X[col].dtype})")
+
+    # --- Target selection ---
+    target = input("\nEnter the target feature (or press Enter to use default): ").strip()
+    if target:
+        if target not in X.columns:
+            raise ValueError(f"Target feature '{target}' not found.")
+        y = X[target]
+        X = X.drop(columns=[target], errors="ignore")
+    else:
+        target = default_target
+        X = X.drop(columns=[target], errors="ignore")
+
+    # --- Feature exclusion ---
+    exclude_input = input("\nEnter features to exclude (comma-separated), or press Enter to skip: ").strip()
+    if exclude_input:
+        exclude_cols = [col.strip() for col in exclude_input.split(",")]
+        for col in exclude_cols:
+            if col in X.columns:
+                X = X.drop(columns=[col])
+            else:
+                print(f"Warning: '{col}' not found in dataset and cannot be excluded.")
+
+    # --- Show preview ---
+    print("\nFinal dataset preview (first 5 rows):")
+    print(X.head())
+    print("\nTarget preview (first 5 rows):")
+    print(y.head())
+    print(f"\nTask type: {task_type}")
+    print(f"Target column: {target}")
+    print(f"Number of features: {len(X.columns)}")
+
+    # --- Export to CSV ---
+    output_file = input("\nEnter filename to save dataset as CSV (e.g., dataset.csv): ").strip()
+    if output_file:
+        df_export = X.copy()
+        df_export[target] = y  # append target at the end
+        df_export.to_csv(output_file, index=False)
+        print(f"Dataset saved to {output_file} (target column: '{target}')")
+
+        # Save the CSV path to a temporary text file in the current directory
+        temp_path_file = "last_csv_path.txt"
+        full_path = os.path.abspath(output_file)
+        with open(temp_path_file, "w") as f:
+            f.write(full_path)
+        print(f"CSV path written to {temp_path_file}")
+
+if __name__ == "__main__":
+    main()


@@ -1,23 +0,0 @@
#!/bin/bash
#SBATCH --account=def-xander # Replace with your account
#SBATCH --mem=10G # Memory allocation
#SBATCH --time=21:00:00 # Total run time limit (21 hours)
#SBATCH --cpus-per-task=4 # Number of CPU cores per task
#SBATCH --job-name=grid_search_%A # Job name with job ID appended
#SBATCH --output=%x-%j.out # Standard output and error log
#SBATCH --error=%x-%j.err # Separate error log
# Load necessary modules
#module load python/3.8
# Activate your virtual environment
source /env0/bin/activate
# Parameters
TIME=$1
# Run the Python script with the specified time parameter
srun python $WORK_DIR/src/grid_search_exp.py --time $TIME
# Deactivate the virtual environment
deactivate


@@ -1,186 +0,0 @@
import os
import time
import numpy as np
import pandas as pd
import warnings
import shap
from sklearn.model_selection import KFold, ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, RobustScaler, PowerTransformer
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import mean_squared_error
from sklearn.cluster import KMeans
from sklearn.exceptions import ConvergenceWarning
from joblib import dump
from combine_datasets import combine_datasets
from algorithms import lasso, random_forest, gradient_boosting, decision_tree_regressor, ridge_regressor, stacking_lasso
warnings.filterwarnings("ignore", category=ConvergenceWarning)
# Preprocess the data and separate features and target
def preprocess_data(input_data, pipeline):
X = input_data.drop(columns=['yield_t/ha'], errors='ignore')
y = input_data['yield_t/ha']
# Fit-transform the pipeline steps without `y` first
for name, step in pipeline.steps:
if name != 'feature_selection':
X = step.fit_transform(X)
# Apply `SelectKBest` separately, which requires `y`
if 'feature_selection' in [name for name, _ in pipeline.steps]:
X = pipeline.named_steps['feature_selection'].fit_transform(X, y)
return X, y
# Function to compute SHAP values using KMeans clustering
def compute_shap_values_with_kmeans(model, X_train, X_test, n_clusters=10):
# Fit KMeans to the training data with explicit n_init parameter
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit(X_train)
cluster_centers = kmeans.cluster_centers_
# Compute SHAP values on cluster centers
explainer = shap.KernelExplainer(model.predict, cluster_centers)
shap_values = explainer.shap_values(X_test)
return shap_values
# Iterative grid search with time monitoring
def iterative_grid_search(model, param_grid, X, y, cv, time_limit):
best_model = None
best_score = -np.inf
start_time = time.time()
# Manually iterate over parameter combinations
for params in ParameterGrid(param_grid):
elapsed_time = time.time() - start_time
if elapsed_time > time_limit:
print("Time limit exceeded. Stopping search.")
break
model.set_params(**params)
scores = []
for train_idx, test_idx in cv.split(X, y):
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
model.fit(X_train, y_train)
predictions = model.predict(X_test)
score = -mean_squared_error(y_test, predictions)
scores.append(score)
mean_score = np.mean(scores)
if mean_score > best_score:
best_score = mean_score
best_model = model
return best_model
# Main experiment function
def run_experiment_with_dynamic_grid(total_hours=10.0, output_base_directory="./results"):
time_per_algorithm = total_hours * 3600 / 5
# Load and combine datasets
data = combine_datasets(
"./data/potatoes_dataset.csv",
"./data/updated_soil_data_with_awc.csv",
"./data/final_augmented_climate_data.csv",
"./data/NDVI.csv",
ndvi=True
)
pipelines = [
Pipeline([
('normalize', StandardScaler()),
('imputer', SimpleImputer(strategy='mean')),
('poly_features', PolynomialFeatures(degree=2)),
('feature_selection', SelectKBest(score_func=f_regression, k=25))
]),
Pipeline([
('robust', RobustScaler()),
('feature_selection', SelectKBest(score_func=f_regression, k=25)),
('power_transformation', PowerTransformer(method='yeo-johnson'))
])
]
models = {
'lasso': lasso,
'ridge': ridge_regressor,
'random_forest': random_forest,
'gradient_boosting': gradient_boosting,
'decision_tree': decision_tree_regressor,
'stacking_ensemble': stacking_lasso
}
output_directory = os.path.join(output_base_directory, f"grid_search_results_{total_hours}h")
os.makedirs(output_directory, exist_ok=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for model_name, model_func in models.items():
best_pipeline_result = None
for pipeline_idx, pipeline in enumerate(pipelines):
X, y = preprocess_data(data, pipeline)
model, params = model_func()
best_model = iterative_grid_search(
model, params, X, y, cv=kf, time_limit=time_per_algorithm
)
shap_stabilities = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X, y)):
X_train_fold, X_test_fold = X[train_idx], X[test_idx]
y_train_fold = y[train_idx]
best_model.fit(X_train_fold, y_train_fold)
shap_values = compute_shap_values_with_kmeans(best_model, X_train_fold, X_test_fold)
fold_shap_stability = np.std(shap_values, axis=0).mean()
shap_stabilities.append(fold_shap_stability)
shap_stability = np.mean(shap_stabilities)
predictions = best_model.predict(X)
mse = mean_squared_error(y, predictions)
if best_pipeline_result is None or (
mse < best_pipeline_result['mse'] and shap_stability < best_pipeline_result['shap_stability']):
best_pipeline_result = {
'model': best_model,
'pipeline_idx': pipeline_idx,
'mse': mse,
'shap_stability': shap_stability
}
if best_pipeline_result:
model_output_dir = os.path.join(output_directory, model_name)
os.makedirs(model_output_dir, exist_ok=True)
model_file_path = os.path.join(model_output_dir, f"{model_name}_best_model.joblib")
dump(best_pipeline_result['model'], model_file_path)
metrics_df = pd.DataFrame({
'Model': [model_name],
'Pipeline': [f"Pipeline_{best_pipeline_result['pipeline_idx'] + 1}"],
'MSE': [best_pipeline_result['mse']],
'SHAP_Stability': [best_pipeline_result['shap_stability']]
})
metrics_csv_path = os.path.join(output_directory, "metrics_summary_GS.csv")
metrics_df.to_csv(metrics_csv_path, index=False, mode='a', header=not os.path.exists(metrics_csv_path))
print(f"Saved best results for {model_name} from Pipeline {best_pipeline_result['pipeline_idx'] + 1}")
print(f"All results saved to {output_directory}")
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser(description='Run Grid Search for different models.')
parser.add_argument('--time', type=float, required=True, help='Time for the model to run in hours')
args = parser.parse_args()
run_experiment_with_dynamic_grid(total_hours=args.time, output_base_directory="./results_grid_search")


@@ -1,166 +0,0 @@
import os
import shap
import numpy as np
import pandas as pd
import h2o
from h2o.automl import H2OAutoML
from sklearn.preprocessing import StandardScaler, RobustScaler, PolynomialFeatures, PowerTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import mean_squared_error
from sklearn.cluster import KMeans
from combine_datasets import combine_datasets
# Initialize H2O
h2o.init()
# Preprocess the data and separate features and target
def preprocess_data(input_data, pipeline):
X = input_data.drop(columns=['yield_t/ha'], errors='ignore')
y = input_data['yield_t/ha']
# Fit-transform the pipeline steps without `y` first
for name, step in pipeline.steps:
if name != 'feature_selection':
X = step.fit_transform(X)
# Apply `SelectKBest` separately, which requires `y`
if 'feature_selection' in [name for name, _ in pipeline.steps]:
X = pipeline.named_steps['feature_selection'].fit_transform(X, y)
return X, y
# Function to compute SHAP values using KMeans clustering
def compute_shap_values_with_kmeans(model, X_train, X_test, n_clusters=10):
# Fit KMeans to the training data with explicit n_init parameter
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit(X_train)
cluster_centers = kmeans.cluster_centers_
# Convert cluster centers to H2OFrame
cluster_centers_h2o = h2o.H2OFrame(cluster_centers)
# Convert X_test to H2OFrame
X_test_h2o = h2o.H2OFrame(X_test)
# Compute SHAP values on cluster centers
explainer = shap.KernelExplainer(lambda x: model.predict(h2o.H2OFrame(x)).as_data_frame().values.flatten(),
cluster_centers_h2o.as_data_frame().values)
shap_values = explainer.shap_values(X_test_h2o.as_data_frame().values)
return shap_values
# Function to load and combine datasets
def load_and_combine_datasets(use_ndvi=True):
data = combine_datasets(
"./data/potatoes_dataset.csv",
"./data/updated_soil_data_with_awc.csv",
"./data/final_augmented_climate_data.csv",
"./data/NDVI.csv",
ndvi=use_ndvi
)
return data
# Main experiment function with H2O AutoML
def run_h2o_experiment(use_ndvi=True, automl_time=10, output_base_directory="./results_h2o"):
data = load_and_combine_datasets(use_ndvi=use_ndvi)
pipelines = [
Pipeline([
('normalize', StandardScaler()),
('imputer', SimpleImputer(strategy='mean')),
('poly_features', PolynomialFeatures(degree=2)),
('feature_selection', SelectKBest(score_func=f_regression, k=25))
]),
Pipeline([
('robust', RobustScaler()),
('feature_selection', SelectKBest(score_func=f_regression, k=25)),
('power_transformation', PowerTransformer(method='yeo-johnson'))
])
]
output_directory = os.path.join(output_base_directory, f"h2o_automl_results_{automl_time}h")
os.makedirs(output_directory, exist_ok=True)
metrics_df = pd.DataFrame(columns=['Model', 'Pipeline', 'MSE', 'SHAP_Stability'])
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for pipeline_idx, pipeline in enumerate(pipelines):
X_processed, y = preprocess_data(data, pipeline)
X_h2o = h2o.H2OFrame(X_processed)
y_h2o = h2o.H2OFrame(y.to_frame())
X_h2o['target'] = y_h2o
shap_stabilities = []
all_shap_values = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X_processed)):
train = X_h2o[train_idx.tolist(), :]
test = X_h2o[test_idx.tolist(), :]
aml = H2OAutoML(max_runtime_secs=int(automl_time * 3600 / kf.get_n_splits()), max_models=5, seed=42,
sort_metric='MSE')
aml.train(x=list(X_h2o.columns[:-1]), y='target', training_frame=train)
best_model = aml.leader
leaderboard = aml.leaderboard.as_data_frame()
leaderboard_csv_path = os.path.join(output_directory, f"leaderboard_pipeline_{pipeline_idx + 1}_fold_{fold + 1}.csv")
leaderboard.to_csv(leaderboard_csv_path, index=False)
model_output_dir = os.path.join(output_directory, f"pipeline_{pipeline_idx + 1}")
os.makedirs(model_output_dir, exist_ok=True)
model_file_path = os.path.join(model_output_dir, f"h2o_best_model_{pipeline_idx + 1}_fold_{fold + 1}.zip")
h2o.save_model(best_model, path=model_file_path)
predictions = best_model.predict(test).as_data_frame()['predict']
mse = mean_squared_error(test['target'].as_data_frame().values, predictions.values)
shap_values = compute_shap_values_with_kmeans(best_model, train.as_data_frame().values,
test.as_data_frame().values)
shap_stability = np.std(shap_values, axis=0).mean()
shap_stabilities.append(shap_stability)
all_shap_values.append(shap_values)
print(f"Completed H2O AutoML and SHAP computation for Pipeline {pipeline_idx + 1}, Fold {fold + 1}")
mean_shap_stability = np.mean(shap_stabilities)
for fold_idx, shap_values in enumerate(all_shap_values):
shap_file_name = f"shap_values_pipeline_{pipeline_idx + 1}_fold_{fold_idx + 1}.npy"
shap_file_path = os.path.join(model_output_dir, shap_file_name)
np.save(shap_file_path, shap_values)
metrics_df = pd.concat([metrics_df, pd.DataFrame([{
'Model': best_model.model_id,
'Pipeline': f"Pipeline_{pipeline_idx + 1}",
'MSE': mse,
'SHAP_Stability': mean_shap_stability
}])], ignore_index=True)
metrics_csv_path = os.path.join(output_directory, "metrics_summary.csv")
metrics_df.to_csv(metrics_csv_path, index=False)
print(f"All results saved to {output_directory}")
h2o.shutdown(prompt=False)
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser(description='Run H2O AutoML for different models.')
parser.add_argument('--time', type=float, required=True, help='Time in hours for the AutoML process to run.')
args = parser.parse_args()
# Run the H2O AutoML experiment
run_h2o_experiment(use_ndvi=True, automl_time=args.time, output_base_directory="./results_h2o")


@@ -1,23 +0,0 @@
#!/bin/bash
#SBATCH --account=def-xander # Replace with your account
#SBATCH --mem=10G # Memory allocation
#SBATCH --time=11:00:00 # Total run time limit (11 hours)
#SBATCH --cpus-per-task=4 # Number of CPU cores per task
#SBATCH --job-name=h20_%A # Job name with job ID appended
#SBATCH --output=%x-%j.out # Standard output and error log
#SBATCH --error=%x-%j.err # Separate error log
# Load necessary modules
module load python/3.8
# Activate your virtual environment
source ~/envs/workdir/bin/activate
# Parameters
TIME=$1
# Run the Python script with the specified time parameter
srun python /home/dvera/scratch/Framework_EXP/h20_autoML.py --time $TIME
# Deactivate the virtual environment
deactivate


@@ -1,9 +1,8 @@
#!/bin/bash
#SBATCH --account=def-xander # Replace with your account
#SBATCH --mem=10G # Memory allocation
#SBATCH --mem=30G # Memory allocation
#SBATCH --time=21:00:00 # Total run time limit (11 hours)
#SBATCH --cpus-per-task=4 # Number of CPU cores per task
#SBATCH --job-name=nsga_%A # Job name with job ID appended
#SBATCH --cpus-per-task=8 # Number of CPU cores per task
#SBATCH --job-name=nsga # Job name with job ID appended
#SBATCH --output=%x-%j.out # Standard output and error log
#SBATCH --error=%x-%j.err # Separate error log
@@ -11,10 +10,10 @@
#module load python/3.8
# Activate your virtual environment
source /env0/bin/activate
source /nsga/bin/activate
# Run the Python script with the specified time parameter
srun python $WORK_DIR/src/nsga_exp.py
srun python /root/automl_datasets/src/nsga_exp.py
# Deactivate the virtual environment
deactivate


@@ -13,31 +13,39 @@ import shap
 from deap import base, creator, tools, algorithms
 from algorithms import lasso, random_forest, gradient_boosting, decision_tree_regressor, ridge_regressor, stacking_lasso
-from combine_datasets import combine_datasets
+from multiprocessing import Pool, cpu_count
+import argparse

 creator.create("FitnessMin", base.Fitness, weights=(-1.0, -1.0))  # Minimize both objectives
 creator.create("Individual", list, fitness=creator.FitnessMin)

+def load_dataset():
+    # Read the CSV path from the temporary file
+    with open("last_csv_path.txt", "r") as f:
+        csv_path = f.read().strip()
+    if not os.path.exists(csv_path):
+        raise FileNotFoundError(f"CSV file not found: {csv_path}")
+    # Load the dataset
+    data = pd.read_csv(csv_path)
+    return data
+
 # Preprocess the data and separate features and target
+#
+# preprocess_data changed to not use k_features
+# feature_selection changed to go by the number of features in the dataset
+# This is antithetical to the larger study but I am not smart enough to make it work properly
+#
 def preprocess_data(input_data, pipeline, k_value):
-    pipeline.named_steps['feature_selection'].set_params(k=k_value)
-    X = input_data.drop(columns=['yield_t/ha'], errors='ignore')
-    print(X.columns)
-    y = input_data['yield_t/ha']
+    X = input_data.iloc[:, :-1]
+    y = input_data.iloc[:, -1]
+    k_value = X.shape[1]
+    pipeline.named_steps['feature_selection'].set_params(k=k_value)
     X = pipeline.fit_transform(X, y)
     return X, y

-# Load and combine datasets
-def load_and_combine_datasets(use_ndvi=True):
-    data = combine_datasets(
-        "./data/potatoes_dataset.csv",
-        "./data/updated_soil_data_with_awc.csv",
-        "./data/final_augmented_climate_data.csv",
-        "./data/NDVI.csv",
-        ndvi=use_ndvi
-    )
-    return data
-
 # Function to compute SHAP values using KMeans clustering
 def compute_shap_values_with_kmeans(model, X_train, X_test, n_clusters=10):
     kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit(X_train)
@@ -94,13 +102,12 @@ def evaluate_individual(individual, data, kf):
     return np.mean(mse_scores), np.mean(shap_stabilities), all_shap_values

 # Run NSGA-II experiment with a time limit
-def run_nsga_experiment(use_ndvi=True,
-                        output_base_directory="./results_nsga",
+def run_nsga_experiment(output_base_directory="./results_nsga",
                         population_size=30,
                         n_generations=50,
                         time_limit=36000):
-    # Load and combine dataset
-    data = load_and_combine_datasets(use_ndvi=use_ndvi)
+    # Load the dataset
+    data = load_dataset()

     # Pipelines for preprocessing
     global pipelines
@@ -156,28 +163,33 @@ def run_nsga_experiment(use_ndvi=True,
     start_time = time.time()
     pop = toolbox.population(n=population_size)

-    # Evaluate the initial population
-    fits = toolbox.map(toolbox.evaluate, pop)
-    for fit, ind in zip(fits, pop):
-        ind.fitness.values = fit[:2]
-        ind.shap_values = fit[2]
-
-    # Use the number of parents and offspring in the evolution process
-    for gen in range(n_generations):
-        if time.time() - start_time > time_limit:
-            print("Time limit exceeded, stopping evolution.")
-            break
-        offspring = algorithms.varAnd(pop, toolbox, cxpb=0.7, mutpb=0.2)
-        fits = toolbox.map(toolbox.evaluate, offspring)
-        for fit, ind in zip(fits, offspring):
-            ind.fitness.values = fit[:2]  # Ensure only the first two values are assigned to fitness
-            ind.shap_values = fit[2]
-        # Select the next generation of parents from the combined pool of parents and offspring
-        pop = toolbox.select(pop + offspring, k=population_size)
+    # Use a Pool to parallelize evaluations
+    with Pool(processes=cpu_count()) as pool:
+        # Evaluate the initial population in parallel
+        fits = list(pool.map(toolbox.evaluate, pop))
+        for fit, ind in zip(fits, pop):
+            ind.fitness.values = fit[:2]
+            ind.shap_values = fit[2]
+
+        # Evolution loop
+        for gen in range(n_generations):
+            if time.time() - start_time > time_limit:
+                print("Time limit exceeded, stopping evolution.")
+                break
+            offspring = algorithms.varAnd(pop, toolbox, cxpb=0.7, mutpb=0.2)
+            # Evaluate offspring in parallel
+            fits = list(pool.map(toolbox.evaluate, offspring))
+            for fit, ind in zip(fits, offspring):
+                ind.fitness.values = fit[:2]
+                ind.shap_values = fit[2]
+            # Select next generation
+            pop = toolbox.select(pop + offspring, k=population_size)

     for ind in pop:
         for fold_idx, shap_values in enumerate(ind.shap_values):
             shap_output_path = os.path.join(output_directory, f"shap_values_{int(ind[0])}_fold_{fold_idx + 1}.npy")
@@ -199,12 +211,11 @@ def run_nsga_experiment(use_ndvi=True,
 if __name__ == "__main__":
     run_nsga_experiment(
-        use_ndvi=True,
         output_base_directory="./results_nsga",
         population_size=80,      # Larger population size for a comprehensive search
         n_generations=100,       # Increased number of generations
-        num_parents=30,          # Increased number of parents
-        num_offspring=50,        # Increased number of offspring
+        # num_parents=30,        # Increased number of parents
+        # num_offspring=50,      # Increased number of offspring
        time_limit=108000         # 20 hours (20 * 3600 seconds)
     )