Compare commits

5 commits

| Author | SHA1 | Date |
|---|---|---|
| | fd0cb67812 | |
| | c16c545bc8 | |
| | 376bc2a8c5 | |
| | 5775534e22 | |
| | 0a081e48f1 | |
.gitignore (vendored, new file, +4)

@@ -0,0 +1,4 @@
./src/test/
./src/cal_housing.csv
./src/__pycache__
./src/results_nsga
(unnamed notes file, modified)

@@ -1,5 +1,9 @@
 Codebase:
 https://gitlab.com/university-of-prince-edward-isalnd/explanation-aware-optimization-and-automl/-/tree/main/src?ref_type=heads
+
+Previous Analysis:
+https://gitlab.com/agri-food-canada/potato-yield-predictions-by-postal-code-ml
+
 Operation:
 Specify working directory (local repo location), cache directory (dataset download location), and

@@ -12,15 +16,84 @@ Code File Structure

 Shell scripts

-h20_batch.sh ->
-nsga_batch.sh ->
-grid_search_batch.sh ->
+h20_batch.sh -> h20_autoML.py
+nsga_batch.sh -> nsga_exp.py
+grid_search_batch.sh -> grid_search_exp.py
+
+grid_search_batch calls both algorithms and combine_datasets
+
+Run order should be
+datasets -> algorithms -> combine_datasets -> 3 .sh files -> shap_values_computation.py
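The run order above can be sketched as a small dry-run driver. The file names come from the notes themselves; the `plan()` helper and the choice to print rather than execute are illustrative assumptions:

```python
# Hypothetical dry-run of the documented pipeline order. The script names are
# taken from the notes above; actually invoking them is deliberately left out.
PIPELINE = [
    "dataset.py",
    "algorithms.py",
    "combine_datasets.py",
    "h20_batch.sh",
    "nsga_batch.sh",
    "grid_search_batch.sh",
    "shap_values_computation.py",
]

def plan(steps=PIPELINE):
    """Return the commands that would run, in order."""
    return [f"run {step}" for step in steps]

for command in plan():
    print(command)
```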
+############################################################
+Objective:
+
+The current code performs an ML analysis on a potato yield dataset, as shown in Potato Yield Predictions by Postal Code ML.
+The code will need to be modified to work with other datasets.
+
+1. Modify the code to work with the California Housing Price dataset found in datasets.py
+   (cal_housing, a regression dataset)
+2. Modify the code to work with some other classification-focused dataset
+   (the dataset.py code contains cal_housing for regression and three classification datasets)
+3. Compare the performance of the model in both situations to get a regression vs. classification baseline.
+   The table should include key performance indicators for both datasets as well as the number of objects in each dataset.
+4. (Ideally) Make the models as easy as possible to migrate between datasets through a user prompt.
+   Also cache files for easy referencing and to make sure that the data can be analysed properly later.
+
+Files that need changing:
+dataset = YES
+algorithms = NO
+nsga_exp = YES
+shap_values_computation = NO(?)
+
+############################################################
+Scripting Tasks:
+datasets -> algorithms -> combine_datasets -> nsga_exp.py -> shap_values_computation
+
+1. Make datasets generalizable
+2. Make combine_datasets reference generalizable headers / infer them from the input
+3. Make nsga_exp.py reference the combine_datasets headers
+4. Make output folders specified by the user at runtime / in the Slurm bash script
+
+Operation Tasks:
+1. Run nsga_exp.py using the California Housing dataset (regression)
+2. Run nsga_exp.py using a separate, classification dataset
+3. Compare results
+
+############################################################
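Scripting Task 4 (user-specified output folders) could look roughly like this; the `--output-dir` flag, its default, and the parser name are assumptions for illustration, not the repo's actual interface:

```python
import argparse
from pathlib import Path

# Sketch only: flag names and defaults are assumed, not taken from nsga_exp.py.
def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="NSGA experiment runner (sketch)")
    parser.add_argument("--time", type=float, default=1.0,
                        help="run-time budget in hours")
    parser.add_argument("--output-dir", type=Path, default=Path("./results_nsga"),
                        help="directory where result files are written")
    return parser.parse_args(argv)

args = parse_args(["--time", "2", "--output-dir", "/tmp/nsga_run"])
print(args.time, args.output_dir)
```

A Slurm wrapper could then pass something like `--output-dir "$WORK_DIR/results_$SLURM_JOB_ID"` so each job writes to its own folder.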
+Code Changes:
+
+nsga_exp.py
+- Lines 24 & 26 reference yield_t/ha. This should be a parameter.
+- Lines 33-36 reference relative paths to previous soil.csv files.
+- Lines 112 and 116 use a fixed value of k (k=25). It might be better to set this dynamically based on the size of the dataset.
+- Lines 141-143 reference models_space, pipelines, and the k_value range. These should be generalized for other datasets and features.
+- Line 134 references an nsga output directory. This could be parameterized for other datasets.
+- Lines 183, 190, and 195 reference specific output .csv paths. This will cause overwriting on subsequent runs. Change to store results per run.
+- Lines 124-129 reference models and functions from algorithms.py. This could be generalized to allow any model dictionary, but that is not likely beneficial for this study.
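Three of the bullets above (the hard-coded yield_t/ha target, the fixed k=25, and the overwriting output paths) could be addressed along these lines. The helper names (`dynamic_k`, `run_output_dir`, `split_features`) are illustrative, not functions from nsga_exp.py:

```python
import os
import time

def dynamic_k(n_features: int, cap: int = 25) -> int:
    # Pick SelectKBest's k from the dataset width instead of hard-coding k=25;
    # the half-width heuristic is an assumption, not the repo's rule.
    return min(cap, max(1, n_features // 2))

def run_output_dir(base: str = "./results_nsga") -> str:
    # Time-stamped per-run folder, so repeated runs stop overwriting each
    # other's CSVs.
    return os.path.join(base, time.strftime("run_%Y%m%d_%H%M%S"))

def split_features(df, target_column: str):
    # Target name becomes a parameter rather than the hard-coded 'yield_t/ha'.
    X = df.drop(columns=[target_column], errors="ignore")
    y = df[target_column]
    return X, y
```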
+datasets.py
+- A user prompt was added to let users choose one of the four datasets and list its type.
+- A user prompt was added to choose a target feature and features to exclude.
+- A user prompt was added for a save location for the processed .csv of the dataset output.
 ############################################################
-Code Changes:
+Code Optimizations:

 - SHAP KernelExplainer
   Use shap.TreeExplainer on tree-based models instead
(requirements file, modified)

@@ -11,4 +11,4 @@ joblib
 # Data acquisition
 openml
+deap
setup.sh (modified, 30 lines changed)

@@ -1,12 +1,36 @@
 #!/bin/bash

-sudo apt install nfs-common -y
+dnf install -y nfs-utils kernel-modules-extra

 mkdir /mnt/data

-mount 192.168.2.69:/mnt/user/ml_datasets0 /mnt/data
+mount -t nfs -o vers=3,proto=tcp 192.168.2.69:/mnt/user/ml_datasets0 /mnt/data

-$WORK_DIR=/mnt/data
+WORK_DIR=/mnt/data

 mkdir -p /mnt/data/cache  # ensure directory exists
 export OPENML_CACHE_DIR=/mnt/data/cache

+apt install python3-venv
+
+python3 -m venv <environment_name>
+source <environment_name>/bin/activate
+
+pip install -r requirements.txt
+
+chmod -R 750 /root/automl_datasets
+
+adduser mlly
+passwd mlly
+# mlly
+
+chown -R mlly:mlly /mnt/data
+chmod -R 750 /mnt/data
+
+su - mlly
+
+sudo usermod -aG wheel mlly
+
+sudo chmod -R u+rwx /mnt/data/automl_datasets/nsga/
(combine_datasets file, modified)

@@ -1,92 +1,52 @@
 import pandas as pd
-from loading_climate_data import load_and_process_climate_data
-from loading_crop_data import engineer_crop_features
-from loading_ndvi_data import process_ndvi_data
-from loading_soil_data import get_soil_data
-
-
-def combine_datasets(crop_data_file,
-                     soil_data_file,
-                     climate_data_file,
-                     ndvi_data_file,
-                     ndvi=False):
-    dataset_loaders = {
-        'soil': lambda: get_soil_data(soil_data_file),
-        'climate': lambda: load_and_process_climate_data(climate_data_file),
-        'ndvi': lambda: process_ndvi_data(ndvi_data_file),
-        'crop': lambda: engineer_crop_features(crop_data_file)
-    }
-
-    merge_keys = {
-        'soil': ['PostalCode'],
-        'climate': ['PostalCode', 'Year'],
-        'ndvi': ['PostalCode', 'Year'],
-        'crop': ['PostalCode', 'Year']
-    }
-
-    combined_data = dataset_loaders['crop']()
-
-    # Merge climate data
-    climate_data = dataset_loaders['climate']()
-    combined_data = pd.merge(combined_data, climate_data, on=merge_keys['climate'], how='left')
-
-    # Merge NDVI data if required
-    if ndvi:
-        ndvi_data = dataset_loaders['ndvi']()
-        combined_data = combined_data[combined_data['Year'] >= 2000]
-        combined_data = pd.merge(combined_data, ndvi_data, on=merge_keys['ndvi'], how='left')
-
-    # Merge soil data
-    soil_data = dataset_loaders['soil']()
-    combined_data = pd.merge(combined_data, soil_data, on=merge_keys['soil'], how='left')
-
-    postal_codes = combined_data['PostalCode']
-    years = combined_data['Year']
-
-    # Drop irrelevant or redundant columns
-    features_to_exclude = ['PostalCode', 'Year', 'SoilID']
-    combined_data = combined_data.drop(columns=features_to_exclude, errors='ignore')
-
-    return combined_data
-
-
-def combine_dataset_pc(crop_data_file,
-                       soil_data_file,
-                       climate_data_file,
-                       ndvi_data_file,
-                       ndvi=False):
-    dataset_loaders = {
-        'soil': lambda: get_soil_data(soil_data_file),
-        'climate': lambda: load_and_process_climate_data(climate_data_file),
-        'ndvi': lambda: process_ndvi_data(ndvi_data_file),
-        'crop': lambda: engineer_crop_features(crop_data_file)
-    }
-
-    merge_keys = {
-        'soil': ['PostalCode'],
-        'climate': ['PostalCode', 'Year'],
-        'ndvi': ['PostalCode', 'Year'],
-        'crop': ['PostalCode', 'Year']
-    }
-
-    combined_data = dataset_loaders['crop']()
-
-    # Merge climate data
-    climate_data = dataset_loaders['climate']()
-    combined_data = pd.merge(combined_data, climate_data, on=merge_keys['climate'], how='left')
-
-    # Merge NDVI data if required
-    if ndvi:
-        ndvi_data = dataset_loaders['ndvi']()
-        combined_data = combined_data[combined_data['Year'] >= 2000]
-        combined_data = pd.merge(combined_data, ndvi_data, on=merge_keys['ndvi'], how='left')
-
-    # Merge soil data
-    soil_data = dataset_loaders['soil']()
-    combined_data = pd.merge(combined_data, soil_data, on=merge_keys['soil'], how='left')
-
-    # Drop irrelevant or redundant columns
-    features_to_exclude = ['Year', 'SoilID']
-    combined_data = combined_data.drop(columns=features_to_exclude, errors='ignore')
-
-    return combined_data
+
+
+def combine_datasets_single_csv(
+        csv_file: str,
+        target_column: str = None,
+        exclude_features: list = None
+):
+    """
+    Load and process a single CSV file into a DataFrame suitable for modeling.
+    Mimics the original combine_datasets() structure, returning a single DataFrame.
+
+    Parameters
+    ----------
+    csv_file : str
+        Path to the CSV file to load.
+    target_column : str, optional
+        Column to use as target. If None, assumes the last column is the target.
+    exclude_features : list of str, optional
+        List of features to exclude from the DataFrame.
+
+    Returns
+    -------
+    combined_data : pd.DataFrame
+        Processed DataFrame, target column removed if specified.
+        The format is compatible with previous combine_datasets outputs.
+    """
+    # Load CSV
+    combined_data = pd.read_csv(csv_file)
+
+    # Determine target
+    if target_column is None:
+        target_column = combined_data.columns[-1]  # default: last column
+
+    if target_column not in combined_data.columns:
+        raise ValueError(f"Target column '{target_column}' not found in CSV.")
+
+    # Separate target column internally (optional)
+    target_series = combined_data[target_column]
+
+    # Drop the target from features
+    combined_data = combined_data.drop(columns=[target_column])
+
+    # Drop additional user-specified features
+    if exclude_features:
+        for col in exclude_features:
+            if col in combined_data.columns:
+                combined_data = combined_data.drop(columns=[col])
+            else:
+                print(f"Warning: '{col}' not found in CSV and cannot be excluded.")
+
+    # Mimic original structure: return combined_data as a DataFrame
+    return combined_data
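A usage sketch of the new single-CSV loader. The function body here is a trimmed re-statement so the example is self-contained; in the repo you would import `combine_datasets_single_csv` from its module instead:

```python
import io
import pandas as pd

# Trimmed re-statement of the loader above, for illustration only.
def combine_datasets_single_csv(csv_file, target_column=None, exclude_features=None):
    data = pd.read_csv(csv_file)
    if target_column is None:
        target_column = data.columns[-1]          # default: last column is target
    if target_column not in data.columns:
        raise ValueError(f"Target column '{target_column}' not found in CSV.")
    data = data.drop(columns=[target_column])     # target removed from features
    for col in exclude_features or []:
        data = data.drop(columns=[col], errors="ignore")
    return data

# Tiny in-memory CSV standing in for a real file on disk.
csv = io.StringIO("rooms,age,price\n3,10,250\n4,5,410\n")
features = combine_datasets_single_csv(csv, exclude_features=["age"])
print(list(features.columns))  # target 'price' and excluded 'age' are dropped
```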
src/dataset.py (modified, 113 lines changed)

@@ -3,48 +3,113 @@ import pandas as pd
 import openml

 # --- CACHE SETUP ---
-# Change this path to your preferred local cache directory
-CACHE_DIR = os.path.expanduser("~/openml_cache")
-os.makedirs(CACHE_DIR, exist_ok=True)
-openml.config.cache_directory = CACHE_DIR
+#CACHE_DIR = os.path.expanduser("~/openml_cache")
+#os.makedirs(CACHE_DIR, exist_ok=True)
+#openml.config.cache_directory = CACHE_DIR

-# OpenML CC18 classification tasks (task ids)
+# --- Dataset IDs ---
 TASKS = {
-    "adult": 7592,      # Adult Income classification
-    "spambase": 43,     # Spambase classification
-    "optdigits": 28,    # Optdigits classification
+    "adult": 7592,
+    "spambase": 43,
+    "optdigits": 28,
 }

-# Regression dataset (dataset id)
 DATASETS = {
     "cal_housing": 44025
 }

+# --- Load functions ---
 def _load_task_dataframe(task_id: int):
     task = openml.tasks.get_task(task_id)
     dataset_id = task.dataset_id
     dataset = openml.datasets.get_dataset(dataset_id)
-    X, y, categorical_indicator, _ = dataset.get_data(
-        dataset_format="dataframe",
-        target=task.target_name
-    )
-    # drop rows with NA target if any
-    if isinstance(y, pd.Series):
-        mask = ~y.isna()
-        X, y = X.loc[mask], y.loc[mask]
-    return X, y
+    X, y, _, _ = dataset.get_data(dataset_format="dataframe", target=task.target_name)
+    mask = ~y.isna()
+    return X.loc[mask], y.loc[mask]


 def load_dataset(name: str):
     if name in TASKS:
         X, y = _load_task_dataframe(TASKS[name])
-        return X, y, "classification"
+        task_type = "classification"
     elif name in DATASETS:
         ds_id = DATASETS[name]
         ds = openml.datasets.get_dataset(ds_id)
-        X, y, categorical_indicator, _ = ds.get_data(
-            dataset_format="dataframe", target=ds.default_target_attribute
-        )
-        mask = ~y.isna()
-        return X.loc[mask], y.loc[mask], "regression"
+        target_col = ds.default_target_attribute
+        X, y, _, _ = ds.get_data(dataset_format="dataframe", target=None)
+        mask = ~X[target_col].isna()
+        X = X.loc[mask]
+        y = X[target_col].loc[mask]
+        task_type = "regression"
     else:
         raise ValueError(f"Unknown dataset {name}")
+    return X, y, task_type
+
+
+# --- Interactive main ---
+def main():
+    print("Available datasets:")
+    all_datasets = list(TASKS.keys()) + list(DATASETS.keys())
+    for i, name in enumerate(all_datasets):
+        print(f"{i+1}. {name}")
+
+    selection = input("Enter the dataset name: ").strip()
+    if selection not in all_datasets:
+        raise ValueError(f"Dataset '{selection}' not recognized.")
+
+    X, y, task_type = load_dataset(selection)
+
+    # --- Identify default target ---
+    default_target = y.name
+    print(f"\nDefault target column: {default_target}")
+
+    # --- Print all features (without explanations) ---
+    print("\nFeatures in the dataset:")
+    for col in X.columns.unique():
+        print(f"- {col} ({X[col].dtype})")
+
+    # --- Target selection ---
+    target = input("\nEnter the target feature (or press Enter to use default): ").strip()
+    if target:
+        if target not in X.columns:
+            raise ValueError(f"Target feature '{target}' not found.")
+        y = X[target]
+        X = X.drop(columns=[target], errors="ignore")
+    else:
+        target = default_target
+        X = X.drop(columns=[target], errors="ignore")
+
+    # --- Feature exclusion ---
+    exclude_input = input("\nEnter features to exclude (comma-separated), or press Enter to skip: ").strip()
+    if exclude_input:
+        exclude_cols = [col.strip() for col in exclude_input.split(",")]
+        for col in exclude_cols:
+            if col in X.columns:
+                X = X.drop(columns=[col])
+            else:
+                print(f"Warning: '{col}' not found in dataset and cannot be excluded.")
+
+    # --- Show preview ---
+    print("\nFinal dataset preview (first 5 rows):")
+    print(X.head())
+    print("\nTarget preview (first 5 rows):")
+    print(y.head())
+    print(f"\nTask type: {task_type}")
+    print(f"Target column: {target}")
+    print(f"Number of features: {len(X.columns)}")
+
+    # --- Export to CSV ---
+    output_file = input("\nEnter filename to save dataset as CSV (e.g., dataset.csv): ").strip()
+    if output_file:
+        df_export = X.copy()
+        df_export[target] = y  # append target at the end
+        df_export.to_csv(output_file, index=False)
+        print(f"Dataset saved to {output_file} (target column: '{target}')")
+
+        # Save the CSV path to a temporary text file in the current directory
+        temp_path_file = "last_csv_path.txt"
+        full_path = os.path.abspath(output_file)
+        with open(temp_path_file, "w") as f:
+            f.write(full_path)
+        print(f"CSV path written to {temp_path_file}")
+
+
+if __name__ == "__main__":
+    main()
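The NA-masking that `load_dataset()` applies to the target, followed by the target/feature split that `main()` performs, boils down to this pandas pattern (toy column names assumed; the real frame comes from OpenML):

```python
import pandas as pd

# Assumed toy frame; in dataset.py the frame comes from openml and the target
# name from ds.default_target_attribute.
df = pd.DataFrame({
    "median_income": [8.3, 7.2, 5.6, 3.8],
    "house_value": [452600.0, 358500.0, None, 341300.0],
})
target_col = "house_value"

mask = ~df[target_col].isna()             # keep only rows with a labelled target
y = df.loc[mask, target_col]
X = df.loc[mask].drop(columns=[target_col])
print(X.shape, y.shape)
```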
(deleted file — the Slurm batch script, 23 lines removed)

@@ -1,23 +0,0 @@
#!/bin/bash
#SBATCH --account=def-xander       # Replace with your account
#SBATCH --mem=10G                  # Memory allocation
#SBATCH --time=21:00:00            # Total run time limit
#SBATCH --cpus-per-task=4          # Number of CPU cores per task
#SBATCH --job-name=grid_search_%A  # Job name with job ID appended
#SBATCH --output=%x-%j.out         # Standard output and error log
#SBATCH --error=%x-%j.err          # Separate error log

# Load necessary modules
#module load python/3.8

# Activate your virtual environment
source /env0/bin/activate

# Parameters
TIME=$1

# Run the Python script with the specified time parameter
srun python $WORK_DIR/src/grid_search_exp.py --time $TIME

# Deactivate the virtual environment
deactivate
(deleted file — the grid-search experiment script, 186 lines removed)

@@ -1,186 +0,0 @@
import os
import time
import numpy as np
import pandas as pd
import warnings
import shap
from sklearn.model_selection import KFold, ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, RobustScaler, PowerTransformer
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import mean_squared_error
from sklearn.cluster import KMeans
from sklearn.exceptions import ConvergenceWarning
from joblib import dump

from combine_datasets import combine_datasets
from algorithms import lasso, random_forest, gradient_boosting, decision_tree_regressor, ridge_regressor, stacking_lasso

warnings.filterwarnings("ignore", category=ConvergenceWarning)


# Preprocess the data and separate features and target
def preprocess_data(input_data, pipeline):
    X = input_data.drop(columns=['yield_t/ha'], errors='ignore')
    y = input_data['yield_t/ha']

    # Fit-transform the pipeline steps without `y` first
    for name, step in pipeline.steps:
        if name != 'feature_selection':
            X = step.fit_transform(X)

    # Apply `SelectKBest` separately, which requires `y`
    if 'feature_selection' in [name for name, _ in pipeline.steps]:
        X = pipeline.named_steps['feature_selection'].fit_transform(X, y)

    return X, y


# Function to compute SHAP values using KMeans clustering
def compute_shap_values_with_kmeans(model, X_train, X_test, n_clusters=10):
    # Fit KMeans to the training data with explicit n_init parameter
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit(X_train)
    cluster_centers = kmeans.cluster_centers_

    # Compute SHAP values on cluster centers
    explainer = shap.KernelExplainer(model.predict, cluster_centers)
    shap_values = explainer.shap_values(X_test)

    return shap_values


# Iterative grid search with time monitoring
def iterative_grid_search(model, param_grid, X, y, cv, time_limit):
    best_model = None
    best_score = -np.inf
    start_time = time.time()

    # Manually iterate over parameter combinations
    for params in ParameterGrid(param_grid):
        elapsed_time = time.time() - start_time
        if elapsed_time > time_limit:
            print("Time limit exceeded. Stopping search.")
            break

        model.set_params(**params)
        scores = []

        for train_idx, test_idx in cv.split(X, y):
            X_train, X_test = X[train_idx], X[test_idx]
            y_train, y_test = y[train_idx], y[test_idx]
            model.fit(X_train, y_train)
            predictions = model.predict(X_test)
            score = -mean_squared_error(y_test, predictions)
            scores.append(score)

        mean_score = np.mean(scores)
        if mean_score > best_score:
            best_score = mean_score
            best_model = model

    return best_model


# Main experiment function
def run_experiment_with_dynamic_grid(total_hours=10.0, output_base_directory="./results"):
    time_per_algorithm = total_hours * 3600 / 5

    # Load and combine datasets
    data = combine_datasets(
        "./data/potatoes_dataset.csv",
        "./data/updated_soil_data_with_awc.csv",
        "./data/final_augmented_climate_data.csv",
        "./data/NDVI.csv",
        ndvi=True
    )

    pipelines = [
        Pipeline([
            ('normalize', StandardScaler()),
            ('imputer', SimpleImputer(strategy='mean')),
            ('poly_features', PolynomialFeatures(degree=2)),
            ('feature_selection', SelectKBest(score_func=f_regression, k=25))
        ]),
        Pipeline([
            ('robust', RobustScaler()),
            ('feature_selection', SelectKBest(score_func=f_regression, k=25)),
            ('power_transformation', PowerTransformer(method='yeo-johnson'))
        ])
    ]

    models = {
        'lasso': lasso,
        'ridge': ridge_regressor,
        'random_forest': random_forest,
        'gradient_boosting': gradient_boosting,
        'decision_tree': decision_tree_regressor,
        'stacking_ensemble': stacking_lasso
    }

    output_directory = os.path.join(output_base_directory, f"grid_search_results_{total_hours}h")
    os.makedirs(output_directory, exist_ok=True)

    kf = KFold(n_splits=5, shuffle=True, random_state=42)

    for model_name, model_func in models.items():
        best_pipeline_result = None

        for pipeline_idx, pipeline in enumerate(pipelines):
            X, y = preprocess_data(data, pipeline)
            model, params = model_func()

            best_model = iterative_grid_search(
                model, params, X, y, cv=kf, time_limit=time_per_algorithm
            )

            shap_stabilities = []
            for fold, (train_idx, test_idx) in enumerate(kf.split(X, y)):
                X_train_fold, X_test_fold = X[train_idx], X[test_idx]
                y_train_fold = y[train_idx]

                best_model.fit(X_train_fold, y_train_fold)
                shap_values = compute_shap_values_with_kmeans(best_model, X_train_fold, X_test_fold)
                fold_shap_stability = np.std(shap_values, axis=0).mean()
                shap_stabilities.append(fold_shap_stability)

            shap_stability = np.mean(shap_stabilities)
            predictions = best_model.predict(X)
            mse = mean_squared_error(y, predictions)

            if best_pipeline_result is None or (
                    mse < best_pipeline_result['mse'] and shap_stability < best_pipeline_result['shap_stability']):
                best_pipeline_result = {
                    'model': best_model,
                    'pipeline_idx': pipeline_idx,
                    'mse': mse,
                    'shap_stability': shap_stability
                }

        if best_pipeline_result:
            model_output_dir = os.path.join(output_directory, model_name)
            os.makedirs(model_output_dir, exist_ok=True)

            model_file_path = os.path.join(model_output_dir, f"{model_name}_best_model.joblib")
            dump(best_pipeline_result['model'], model_file_path)

            metrics_df = pd.DataFrame({
                'Model': [model_name],
                'Pipeline': [f"Pipeline_{best_pipeline_result['pipeline_idx'] + 1}"],
                'MSE': [best_pipeline_result['mse']],
                'SHAP_Stability': [best_pipeline_result['shap_stability']]
            })
            metrics_csv_path = os.path.join(output_directory, "metrics_summary_GS.csv")
            metrics_df.to_csv(metrics_csv_path, index=False, mode='a', header=not os.path.exists(metrics_csv_path))

            print(f"Saved best results for {model_name} from Pipeline {best_pipeline_result['pipeline_idx'] + 1}")

    print(f"All results saved to {output_directory}")


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description='Run Grid Search for different models.')
    parser.add_argument('--time', type=float, required=True, help='Time for the model to run in hours')
    args = parser.parse_args()
    run_experiment_with_dynamic_grid(total_hours=args.time, output_base_directory="./results_grid_search")
(deleted file — the H2O AutoML experiment script, 166 lines removed)

@@ -1,166 +0,0 @@
import os
import shap
import numpy as np
import pandas as pd
import h2o
from h2o.automl import H2OAutoML
from sklearn.preprocessing import StandardScaler, RobustScaler, PolynomialFeatures, PowerTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import mean_squared_error
from sklearn.cluster import KMeans

from combine_datasets import combine_datasets

# Initialize H2O
h2o.init()


# Preprocess the data and separate features and target
def preprocess_data(input_data, pipeline):
    X = input_data.drop(columns=['yield_t/ha'], errors='ignore')
    y = input_data['yield_t/ha']

    # Fit-transform the pipeline steps without `y` first
    for name, step in pipeline.steps:
        if name != 'feature_selection':
            X = step.fit_transform(X)

    # Apply `SelectKBest` separately, which requires `y`
    if 'feature_selection' in [name for name, _ in pipeline.steps]:
        X = pipeline.named_steps['feature_selection'].fit_transform(X, y)

    return X, y


# Function to compute SHAP values using KMeans clustering
def compute_shap_values_with_kmeans(model, X_train, X_test, n_clusters=10):
    # Fit KMeans to the training data with explicit n_init parameter
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit(X_train)
    cluster_centers = kmeans.cluster_centers_

    # Convert cluster centers to H2OFrame
    cluster_centers_h2o = h2o.H2OFrame(cluster_centers)

    # Convert X_test to H2OFrame
    X_test_h2o = h2o.H2OFrame(X_test)

    # Compute SHAP values on cluster centers
    explainer = shap.KernelExplainer(lambda x: model.predict(h2o.H2OFrame(x)).as_data_frame().values.flatten(),
                                     cluster_centers_h2o.as_data_frame().values)
    shap_values = explainer.shap_values(X_test_h2o.as_data_frame().values)

    return shap_values


# Function to load and combine datasets
def load_and_combine_datasets(use_ndvi=True):
    data = combine_datasets(
        "./data/potatoes_dataset.csv",
        "./data/updated_soil_data_with_awc.csv",
        "./data/final_augmented_climate_data.csv",
        "./data/NDVI.csv",
        ndvi=use_ndvi
    )
    return data


# Main experiment function with H2O AutoML
def run_h2o_experiment(use_ndvi=True, automl_time=10, output_base_directory="./results_h2o"):
    data = load_and_combine_datasets(use_ndvi=use_ndvi)
    pipelines = [
        Pipeline([
            ('normalize', StandardScaler()),
            ('imputer', SimpleImputer(strategy='mean')),
            ('poly_features', PolynomialFeatures(degree=2)),
            ('feature_selection', SelectKBest(score_func=f_regression, k=25))
        ]),
        Pipeline([
            ('robust', RobustScaler()),
            ('feature_selection', SelectKBest(score_func=f_regression, k=25)),
            ('power_transformation', PowerTransformer(method='yeo-johnson'))
        ])
    ]

    output_directory = os.path.join(output_base_directory, f"h2o_automl_results_{automl_time}h")
    os.makedirs(output_directory, exist_ok=True)

    metrics_df = pd.DataFrame(columns=['Model', 'Pipeline', 'MSE', 'SHAP_Stability'])

    kf = KFold(n_splits=5, shuffle=True, random_state=42)

    for pipeline_idx, pipeline in enumerate(pipelines):
        X_processed, y = preprocess_data(data, pipeline)

        X_h2o = h2o.H2OFrame(X_processed)
        y_h2o = h2o.H2OFrame(y.to_frame())
        X_h2o['target'] = y_h2o

        shap_stabilities = []
        all_shap_values = []

        for fold, (train_idx, test_idx) in enumerate(kf.split(X_processed)):
            train = X_h2o[train_idx.tolist(), :]
            test = X_h2o[test_idx.tolist(), :]

            aml = H2OAutoML(max_runtime_secs=int(automl_time * 3600 / kf.get_n_splits()), max_models=5, seed=42,
                            sort_metric='MSE')
            aml.train(x=list(X_h2o.columns[:-1]), y='target', training_frame=train)

            best_model = aml.leader

            leaderboard = aml.leaderboard.as_data_frame()
            leaderboard_csv_path = os.path.join(output_directory, f"leaderboard_pipeline_{pipeline_idx + 1}_fold_{fold + 1}.csv")
            leaderboard.to_csv(leaderboard_csv_path, index=False)

            model_output_dir = os.path.join(output_directory, f"pipeline_{pipeline_idx + 1}")
            os.makedirs(model_output_dir, exist_ok=True)
            model_file_path = os.path.join(model_output_dir, f"h2o_best_model_{pipeline_idx + 1}_fold_{fold + 1}.zip")
            h2o.save_model(best_model, path=model_file_path)

            predictions = best_model.predict(test).as_data_frame()['predict']

            mse = mean_squared_error(test['target'].as_data_frame().values, predictions.values)

            shap_values = compute_shap_values_with_kmeans(best_model, train.as_data_frame().values,
                                                          test.as_data_frame().values)
            shap_stability = np.std(shap_values, axis=0).mean()
            shap_stabilities.append(shap_stability)
            all_shap_values.append(shap_values)

            print(f"Completed H2O AutoML and SHAP computation for Pipeline {pipeline_idx + 1}, Fold {fold + 1}")

(capture ends here; the remainder of the removed file was not included in the page extract)
mean_shap_stability = np.mean(shap_stabilities)
|
|
||||||
for fold_idx, shap_values in enumerate(all_shap_values):
|
|
||||||
shap_file_name = f"shap_values_pipeline_{pipeline_idx + 1}_fold_{fold_idx + 1}.npy"
|
|
||||||
shap_file_path = os.path.join(model_output_dir, shap_file_name)
|
|
||||||
np.save(shap_file_path, shap_values)
|
|
||||||
|
|
||||||
metrics_df = pd.concat([metrics_df, pd.DataFrame([{
|
|
||||||
'Model': best_model.model_id,
|
|
||||||
'Pipeline': f"Pipeline_{pipeline_idx + 1}",
|
|
||||||
'MSE': mse,
|
|
||||||
'SHAP_Stability': mean_shap_stability
|
|
||||||
}])], ignore_index=True)
|
|
||||||
|
|
||||||
metrics_csv_path = os.path.join(output_directory, "metrics_summary.csv")
|
|
||||||
metrics_df.to_csv(metrics_csv_path, index=False)
|
|
||||||
|
|
||||||
print(f"All results saved to {output_directory}")
|
|
||||||
|
|
||||||
h2o.shutdown(prompt=False)
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
import argparse
|
|
||||||
|
|
||||||
parser = argparse.ArgumentParser(description='Run H2O AutoML for different models.')
|
|
||||||
parser.add_argument('--time', type=float, required=True, help='Time in hours for the AutoML process to run.')
|
|
||||||
args = parser.parse_args()
|
|
||||||
|
|
||||||
# Run the H2O AutoML experiment
|
|
||||||
run_h2o_experiment(use_ndvi=True, automl_time=args.time, output_base_directory="./results_h2o")
|
|
||||||
|
|
||||||
@@ -1,23 +0,0 @@
-#!/bin/bash
-#SBATCH --account=def-xander   # Replace with your account
-#SBATCH --mem=10G              # Memory allocation
-#SBATCH --time=11:00:00        # Total run time limit (11 hours)
-#SBATCH --cpus-per-task=4      # Number of CPU cores per task
-#SBATCH --job-name=h20_%A      # Job name with job ID appended
-#SBATCH --output=%x-%j.out     # Standard output and error log
-#SBATCH --error=%x-%j.err      # Separate error log
-
-# Load necessary modules
-module load python/3.8
-
-# Activate your virtual environment
-source ~/envs/workdir/bin/activate
-
-# Parameters
-TIME=$1
-
-# Run the Python script with the specified time parameter
-srun python /home/dvera/scratch/Framework_EXP/h20_autoML.py --time $TIME
-
-# Deactivate the virtual environment
-deactivate
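Before its removal here, `h20_batch.sh` took the AutoML time budget as its sole positional argument (`TIME=$1`) and forwarded it as `--time`. A hedged sketch of the submission pattern (the `10` is an illustrative budget, not a value taken from the repo):

```shell
# Submit under Slurm; $1 becomes TIME inside the script.
sbatch h20_batch.sh 10

# Equivalent direct invocation without the scheduler:
python h20_autoML.py --time 10
```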
@@ -1,9 +1,8 @@
 #!/bin/bash
-#SBATCH --account=def-xander   # Replace with your account
-#SBATCH --mem=10G              # Memory allocation
+#SBATCH --mem=30G              # Memory allocation
 #SBATCH --time=21:00:00        # Total run time limit (21 hours)
-#SBATCH --cpus-per-task=4      # Number of CPU cores per task
+#SBATCH --cpus-per-task=8      # Number of CPU cores per task
-#SBATCH --job-name=nsga_%A     # Job name with job ID appended
+#SBATCH --job-name=nsga        # Job name with job ID appended
 #SBATCH --output=%x-%j.out     # Standard output and error log
 #SBATCH --error=%x-%j.err      # Separate error log
 
@@ -11,10 +10,10 @@
 #module load python/3.8
 
 # Activate your virtual environment
-source /env0/bin/activate
+source /nsga/bin/activate
 
 # Run the Python script with the specified time parameter
-srun python $WORK_DIR/src/nsga_exp.py
+srun python /root/automl_datasets/src/nsga_exp.py
 
 # Deactivate the virtual environment
 deactivate
@@ -13,31 +13,39 @@ import shap
 from deap import base, creator, tools, algorithms
 from algorithms import lasso, random_forest, gradient_boosting, decision_tree_regressor, ridge_regressor, stacking_lasso
 
-from combine_datasets import combine_datasets
+from multiprocessing import Pool, cpu_count
+import argparse
 
 creator.create("FitnessMin", base.Fitness, weights=(-1.0, -1.0))  # Minimize both objectives
 creator.create("Individual", list, fitness=creator.FitnessMin)
 
+def load_dataset():
+    # Read the CSV path from the temporary file
+    with open("last_csv_path.txt", "r") as f:
+        csv_path = f.read().strip()
+
+    if not os.path.exists(csv_path):
+        raise FileNotFoundError(f"CSV file not found: {csv_path}")
+
+    # Load the dataset
+    data = pd.read_csv(csv_path)
+    return data
+
 # Preprocess the data and separate features and target
+#
+# preprocess_data changed to not use k_features
+# feature_selection changed to go based on the number of features in the dataset
+# This is antithetical to the larger study but I am not smart enough to make it work properly
+#
 def preprocess_data(input_data, pipeline, k_value):
+    X = input_data.iloc[:, :-1]
+    y = input_data.iloc[:, -1]
+    k_value = X.shape[1]
     pipeline.named_steps['feature_selection'].set_params(k=k_value)
-    X = input_data.drop(columns=['yield_t/ha'], errors='ignore')
-    print(X.columns)
-    y = input_data['yield_t/ha']
     X = pipeline.fit_transform(X, y)
     return X, y
 
-# Load and combine datasets
-def load_and_combine_datasets(use_ndvi=True):
-    data = combine_datasets(
-        "./data/potatoes_dataset.csv",
-        "./data/updated_soil_data_with_awc.csv",
-        "./data/final_augmented_climate_data.csv",
-        "./data/NDVI.csv",
-        ndvi=use_ndvi
-    )
-    return data
-
 # Function to compute SHAP values using KMeans clustering
 def compute_shap_values_with_kmeans(model, X_train, X_test, n_clusters=10):
     kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit(X_train)
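The revised `preprocess_data` in this hunk overrides `k_value` with the dataset's actual feature count, which turns `SelectKBest` into a pass-through — every column survives regardless of the input's width, which is what makes the pipeline dataset-agnostic. A minimal sketch of that pattern (toy data and column names are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy frame shaped like load_dataset()'s output: features first, target last.
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(50, 4)), columns=["f1", "f2", "f3", "target"])

pipeline = Pipeline([
    ("normalize", StandardScaler()),
    ("feature_selection", SelectKBest(score_func=f_regression, k=1)),  # k will be overridden
])

X, y = df.iloc[:, :-1], df.iloc[:, -1]
# Mirror the revised preprocess_data: set k to the actual feature count,
# so SelectKBest keeps every column no matter how wide the dataset is.
pipeline.named_steps["feature_selection"].set_params(k=X.shape[1])
X_t = pipeline.fit_transform(X, y)
print(X_t.shape)  # (50, 3) -- all features retained
```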
@@ -94,13 +102,12 @@ def evaluate_individual(individual, data, kf):
     return np.mean(mse_scores), np.mean(shap_stabilities), all_shap_values
 
 # Run NSGA-II experiment with a time limit
-def run_nsga_experiment(use_ndvi=True,
-                        output_base_directory="./results_nsga",
+def run_nsga_experiment(output_base_directory="./results_nsga",
                         population_size=30,
                         n_generations=50,
                         time_limit=36000):
-    # Load and combine dataset
-    data = load_and_combine_datasets(use_ndvi=use_ndvi)
+    # Load the dataset
+    data = load_dataset()
 
     # Pipelines for preprocessing
     global pipelines
@@ -156,28 +163,33 @@ def run_nsga_experiment(use_ndvi=True,
     start_time = time.time()
     pop = toolbox.population(n=population_size)
 
-    # Evaluate the initial population
-    fits = toolbox.map(toolbox.evaluate, pop)
+    # Use a Pool to parallelize evaluations
+    with Pool(processes=cpu_count()) as pool:
+        # Evaluate the initial population in parallel
+        fits = list(pool.map(toolbox.evaluate, pop))
         for fit, ind in zip(fits, pop):
             ind.fitness.values = fit[:2]
             ind.shap_values = fit[2]
 
-        # Use the number of parents and offspring in the evolution process
+        # Evolution loop
         for gen in range(n_generations):
             if time.time() - start_time > time_limit:
                 print("Time limit exceeded, stopping evolution.")
                 break
 
             offspring = algorithms.varAnd(pop, toolbox, cxpb=0.7, mutpb=0.2)
-            fits = toolbox.map(toolbox.evaluate, offspring)
+            # Evaluate offspring in parallel
+            fits = list(pool.map(toolbox.evaluate, offspring))
             for fit, ind in zip(fits, offspring):
-                ind.fitness.values = fit[:2]  # Ensure only the first two values are assigned to fitness
+                ind.fitness.values = fit[:2]
                 ind.shap_values = fit[2]
 
-            # Select the next generation of parents from the combined pool of parents and offspring
+            # Select next generation
             pop = toolbox.select(pop + offspring, k=population_size)
 
     for ind in pop:
         for fold_idx, shap_values in enumerate(ind.shap_values):
             shap_output_path = os.path.join(output_directory, f"shap_values_{int(ind[0])}_fold_{fold_idx + 1}.npy")
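The `Pool` pattern this hunk introduces can be exercised without DEAP. A minimal sketch — using the thread-backed `multiprocessing.dummy.Pool`, which shares the `map` API, so it runs anywhere; the script itself uses the process-based `Pool`, which additionally requires the evaluation function to be picklable. Here `evaluate` is a hypothetical stand-in mimicking `evaluate_individual`'s 3-tuple return:

```python
from multiprocessing.dummy import Pool  # thread-backed Pool with the same map API

def evaluate(individual):
    # Stand-in for toolbox.evaluate: (mse, shap_stability, per-fold SHAP payload)
    x = individual[0]
    return (x * x, abs(x), None)

population = [[-2.0], [0.5], [3.0]]
with Pool(processes=2) as pool:
    fits = list(pool.map(evaluate, population))

# As in the loop above, only the first two values become the NSGA-II fitness;
# the third rides along as ind.shap_values.
fitnesses = [fit[:2] for fit in fits]
print(fitnesses)  # [(4.0, 2.0), (0.25, 0.5), (9.0, 3.0)]
```

Keeping the fitness tuple separate from the SHAP payload is what lets `fit[:2]` feed DEAP's two-objective `FitnessMin` while the raw SHAP arrays are still saved per fold afterwards.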
@@ -199,12 +211,11 @@ def run_nsga_experiment(use_ndvi=True,
 
 if __name__ == "__main__":
     run_nsga_experiment(
-        use_ndvi=True,
         output_base_directory="./results_nsga",
         population_size=80,    # Larger population size for a comprehensive search
         n_generations=100,     # Increased number of generations
-        num_parents=30,        # Increased number of parents
-        num_offspring=50,      # Increased number of offspring
+        # num_parents=30,      # Increased number of parents
+        # num_offspring=50,    # Increased number of offspring
         time_limit=108000      # 30 hours (30 * 3600 seconds)
     )