Codebase:

https://gitlab.com/university-of-prince-edward-isalnd/explanation-aware-optimization-and-automl/-/tree/main/src?ref_type=heads

Previous Analysis:

https://gitlab.com/agri-food-canada/potato-yield-predictions-by-postal-code-ml

Operation:

Specify working directory (local repo location), cache directory (dataset download location), and

$WORK_DIR=

############################################################################################################################################################

Code File Structure

Shell scripts:

    h20_batch.sh          -> h20_autoML.py
    nsga_batch.sh         -> nsga_exp.py
    grid_search_batch.sh  -> grid_search_exp.py

grid_search_batch calls both algorithms and combine_datasets.

Run order should be:

datasets -> algorithms -> combine_datasets -> 3 .sh files -> shap_values_computation.py
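
The run order above can be sketched as a small Python driver. The script names come from the file-structure notes; the interpreter, paths, and the `dry_run` flag are assumptions for illustration, not part of the current codebase.

```python
# Sketch of a driver that enforces the run order above. Script names are
# taken from the file-structure notes; paths and interpreters may differ.
import subprocess

PIPELINE = [
    "datasets.py",
    "algorithms.py",
    "combine_datasets.py",
    "h20_autoML.py",
    "nsga_exp.py",
    "grid_search_exp.py",
    "shap_values_computation.py",
]

def run_pipeline(dry_run=True):
    """Run each stage in order, stopping on the first failure."""
    executed = []
    for script in PIPELINE:
        if not dry_run:
            # check=True aborts the pipeline if any stage exits non-zero
            subprocess.run(["python", script], check=True)
        executed.append(script)
    return executed
```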
############################################################################################################################################################

Objective:

Current code is built to perform ML analysis on a potato yield dataset, as shown in Potato Yield Predictions by Postal Code ML.
The code will need to be modified to work with other datasets.

1. Modify the code to work with the California Housing Price dataset found in datasets.py
   (cal_housing, a regression dataset)

2. Modify the code to work with some other classification-focused dataset
   (the datasets.py code contains cal_housing for regression and three classification datasets)

3. Compare the performance of the model in both situations to establish a regression vs. classification baseline.
   The table should include key performance indicators for both datasets as well as the number of records in each dataset.

4. (Ideally) Make the models as easy as possible to migrate between datasets through a user prompt.
   Also cache files for easy referencing and to make sure the data can be analysed properly later.
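
The comparison table from step 3 could be built from small per-dataset rows like the sketch below. The metric choices (RMSE/R2 for regression, accuracy/macro-F1 for classification) are assumptions; substitute whatever KPIs the experiment scripts actually report.

```python
# Sketch of the step-3 comparison table: key performance indicators plus
# record count per dataset. Metric choices are illustrative assumptions.
from sklearn.metrics import accuracy_score, f1_score, mean_squared_error, r2_score

def regression_row(name, y_true, y_pred):
    """One table row for a regression run (e.g. cal_housing)."""
    return {"dataset": name, "task": "regression",
            "n_records": len(y_true),
            "rmse": mean_squared_error(y_true, y_pred) ** 0.5,
            "r2": r2_score(y_true, y_pred)}

def classification_row(name, y_true, y_pred):
    """One table row for a classification run."""
    return {"dataset": name, "task": "classification",
            "n_records": len(y_true),
            "accuracy": accuracy_score(y_true, y_pred),
            "f1_macro": f1_score(y_true, y_pred, average="macro")}
```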

Files that need changing:

    datasets                = YES
    algorithms              = NO
    nsga_exp                = YES
    shap_values_computation = NO (?)

############################################################################################################################################################

Scripting Tasks:

datasets -> algorithms -> combine_datasets -> nsga_exp.py -> shap_values_computation

1. Make datasets generalizable

2. Make combine_datasets reference generalizable headers / infer them from the input

3. Make nsga_exp.py reference the combine_datasets headers

4. Make output folders specified by the user at runtime / in the Slurm bash script
Operation Tasks:

1. Run nsga_exp.py using the California Housing dataset (regression)

2. Run the nsga_exp.py script using a separate, classification dataset

3. Compare results

############################################################################################################################################################
Code Changes:

nsga_exp.py

- Lines 24 & 26 reference yield_t/ha. This should be a parameter.

- Lines 33-36 reference relative paths to previous soil.csv files.

- Lines 112 and 116 use a fixed value of k (k=25). It might be better to set this dynamically based on the size of the dataset.

- Lines 141-143 reference models_space, pipelines, and the k_value range. These should be generalized for other datasets and features.

- Line 134 references an nsga output directory. This could be parameterized for other datasets.

- Lines 183, 190, and 195 reference specific output-path CSV files. This will cause overwriting on subsequent runs. Change to store based on run.

- Lines 124-129 reference models and functions from algorithms.py. This could be generalized to allow any model dictionary, but that is not likely beneficial for this study.
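
The changes above (target column as a parameter, a size-dependent k, and per-run output folders that avoid overwriting) could be sketched as below. The flag names and the sqrt heuristic for k are assumptions, not the script's current interface.

```python
# Sketch of parameterizing nsga_exp.py as described above. Flag names and
# the sqrt heuristic for k are illustrative assumptions.
import argparse
import time
from pathlib import Path

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="NSGA-II experiment runner")
    parser.add_argument("--target", default="yield_t/ha",
                        help="target column (replaces the hard-coded yield_t/ha)")
    parser.add_argument("--output-dir", default="nsga_output",
                        help="base output directory (replaces the fixed path)")
    parser.add_argument("--k", type=int, default=None,
                        help="k value; None = derive from dataset size")
    return parser.parse_args(argv)

def choose_k(n_samples, k=None):
    """Replace the fixed k=25 with a size-dependent default (sqrt heuristic)."""
    return k if k is not None else max(5, round(n_samples ** 0.5))

def run_output_dir(base):
    """Per-run, timestamped subfolder so repeated runs do not overwrite CSVs."""
    out = Path(base) / f"run-{time.strftime('%Y%m%d-%H%M%S')}"
    out.mkdir(parents=True, exist_ok=True)
    return out
```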
datasets.py

- A user prompt was added to allow users to choose one of the four datasets and list its type.

- A user prompt was added to choose a target feature and features to exclude.

- A user prompt was added for a save location for the processed CSV of the dataset output.
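
The dataset-choice prompt could be structured as below. Only cal_housing is named in these notes, so the registry lists it alone with a placeholder comment; the prompt wording and `reader` parameter (injected for testability) are illustrative assumptions.

```python
# Minimal sketch of the interactive dataset selection described above.
# Only cal_housing is named in the notes; the rest is a placeholder.
DATASETS = {
    "cal_housing": "regression",
    # the three classification datasets from datasets.py would be listed here
}

def choose_dataset(reader=input):
    """Prompt until the user names a known dataset; return (name, task type)."""
    while True:
        name = reader(f"Choose a dataset {sorted(DATASETS)}: ").strip()
        if name in DATASETS:
            return name, DATASETS[name]
        print(f"Unknown dataset: {name!r}")
```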
############################################################################################################################################################
Code Optimizations:

- SHAP KernelExplainer
  Use shap.TreeExplainer on tree-based models instead.

- AutoML search size
  Reduce max_models or max_runtime_secs per fold, or pre-select algorithms.

- Data transformations
  Cache intermediate NumPy arrays to skip repeated fit_transform calls in each fold.

- Parallel folds
  If the CPU has many cores, parallelize the K-fold loop with joblib.Parallel to fully use a higher core count.
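
The parallel-folds idea can be sketched with joblib as below. The fold-evaluation body is a stand-in for the real training/evaluation code, which these notes do not spell out.

```python
# Sketch of parallelizing the K-fold loop with joblib, as suggested above.
# evaluate_fold is a placeholder for the real train/score logic.
import numpy as np
from joblib import Parallel, delayed
from sklearn.model_selection import KFold

def evaluate_fold(X, y, train_idx, test_idx):
    # placeholder: fit the model on train_idx, score on test_idx
    return float(np.mean(y[test_idx]))

def parallel_kfold(X, y, n_splits=5, n_jobs=-1):
    """Run the folds concurrently; n_jobs=-1 uses all available cores."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    return Parallel(n_jobs=n_jobs)(
        delayed(evaluate_fold)(X, y, tr, te) for tr, te in kf.split(X)
    )
```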
############################################################################################################################################################
Notes

- The Slurm headers indicate that the programs should be run on a system with 4 cores per task and 10 GB of RAM.
  This is quite conservative; the jobs would not need to be directed to a cloud-computing environment to run.

- The three jobs run with a time limit of 11 hours. Considering average Compute Canada / AceNet servers (approx. 2.5 GHz CPUs),
  allocate a time limit of at least 5 hours when running on a 13600KF system (assuming no hyperthreading and no E-core processing).

- H2O AutoML supports GPU compute through CUDA libraries. A CUDA-capable GPU may see performance gains for this computation.