Codebase:

https://gitlab.com/university-of-prince-edward-isalnd/explanation-aware-optimization-and-automl/-/tree/main/src?ref_type=heads

Previous Analysis:

https://gitlab.com/agri-food-canada/potato-yield-predictions-by-postal-code-ml

Operation:

Specify working directory (local repo location), cache directory (dataset download location), and

$WORK_DIR=

############################################################################################################################################################

Code File Structure

Shell scripts:

    h20_batch.sh          -> h20_autoML.py
    nsga_batch.sh         -> nsga_exp.py
    grid_search_batch.sh  -> grid_search_exp.py

grid_search_batch calls both algorithms and combine_datasets.

Run order should be:

datasets -> algorithms -> combine_datasets -> 3 .sh files -> shap_values_computation.py
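
The run order above can be sketched as a small Python driver. The script names come from the file-structure notes; the interpreter, paths, and the `dry_run` flag are assumptions for illustration, not part of the current codebase.

```python
# Sketch of a driver that enforces the run order above. Script names are
# taken from the file-structure notes; paths and interpreters may differ.
import subprocess

PIPELINE = [
    "datasets.py",
    "algorithms.py",
    "combine_datasets.py",
    "h20_autoML.py",
    "nsga_exp.py",
    "grid_search_exp.py",
    "shap_values_computation.py",
]

def run_pipeline(dry_run=True):
    """Run each stage in order, stopping on the first failure."""
    executed = []
    for script in PIPELINE:
        if not dry_run:
            # check=True aborts the pipeline if any stage exits non-zero
            subprocess.run(["python", script], check=True)
        executed.append(script)
    return executed
```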
############################################################################################################################################################

Objective:

Current code is built to perform ML analysis on a potato yield dataset, as shown in Potato Yield Predictions by Postal Code ML.
The code will need to be modified to work with other datasets.

1. Modify the code to work with the California Housing Price dataset found in datasets.py
   (cal_housing, a regression dataset)

2. Modify the code to work with some other classification-focused dataset
   (the datasets.py code contains cal_housing for regression and three classification datasets)

3. Compare the performance of the model in both situations to establish a regression vs. classification baseline.
   The table should include key performance indicators for both datasets as well as the number of records in each dataset.

4. (Ideally) Make the models as easy as possible to migrate between datasets through a user prompt.
   Also cache files for easy referencing and to make sure the data can be analysed properly later.
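
The comparison table from step 3 could be built from small per-dataset rows like the sketch below. The metric choices (RMSE/R2 for regression, accuracy/macro-F1 for classification) are assumptions; substitute whatever KPIs the experiment scripts actually report.

```python
# Sketch of the step-3 comparison table: key performance indicators plus
# record count per dataset. Metric choices are illustrative assumptions.
from sklearn.metrics import accuracy_score, f1_score, mean_squared_error, r2_score

def regression_row(name, y_true, y_pred):
    """One table row for a regression run (e.g. cal_housing)."""
    return {"dataset": name, "task": "regression",
            "n_records": len(y_true),
            "rmse": mean_squared_error(y_true, y_pred) ** 0.5,
            "r2": r2_score(y_true, y_pred)}

def classification_row(name, y_true, y_pred):
    """One table row for a classification run."""
    return {"dataset": name, "task": "classification",
            "n_records": len(y_true),
            "accuracy": accuracy_score(y_true, y_pred),
            "f1_macro": f1_score(y_true, y_pred, average="macro")}
```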

Files that need changing:

    datasets                = YES
    algorithms              = NO
    nsga_exp                = YES
    shap_values_computation = NO (?)

############################################################################################################################################################

Scripting Tasks:

datasets -> algorithms -> combine_datasets -> nsga_exp.py -> shap_values_computation

1. Make datasets generalizable

2. Make combine_datasets reference generalizable headers / infer them from the input

3. Make nsga_exp.py reference the combine_datasets headers

4. Make output folders specified by the user at runtime / in the Slurm bash script
Operation Tasks:

1. Run nsga_exp.py using the California Housing dataset (regression)

2. Run the nsga_exp.py script using a separate, classification dataset

3. Compare results

############################################################################################################################################################
Code Changes:

nsga_exp.py

- Lines 24 & 26 reference yield_t/ha. This should be a parameter.

- Lines 33-36 reference relative paths to previous soil.csv files.

- Lines 112 and 116 use a fixed value of k (k=25). It might be better to set this dynamically based on the size of the dataset.

- Lines 141-143 reference models_space, pipelines, and the k_value range. These should be generalized for other datasets and features.

- Line 134 references an nsga output directory. This could be parameterized for other datasets.

- Lines 183, 190, and 195 reference specific output-path CSV files. This will cause overwriting on subsequent runs. Change to store based on run.

- Lines 124-129 reference models and functions from algorithms.py. This could be generalized to allow any model dictionary, but that is not likely beneficial for this study.
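
The changes above (target column as a parameter, a size-dependent k, and per-run output folders that avoid overwriting) could be sketched as below. The flag names and the sqrt heuristic for k are assumptions, not the script's current interface.

```python
# Sketch of parameterizing nsga_exp.py as described above. Flag names and
# the sqrt heuristic for k are illustrative assumptions.
import argparse
import time
from pathlib import Path

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="NSGA-II experiment runner")
    parser.add_argument("--target", default="yield_t/ha",
                        help="target column (replaces the hard-coded yield_t/ha)")
    parser.add_argument("--output-dir", default="nsga_output",
                        help="base output directory (replaces the fixed path)")
    parser.add_argument("--k", type=int, default=None,
                        help="k value; None = derive from dataset size")
    return parser.parse_args(argv)

def choose_k(n_samples, k=None):
    """Replace the fixed k=25 with a size-dependent default (sqrt heuristic)."""
    return k if k is not None else max(5, round(n_samples ** 0.5))

def run_output_dir(base):
    """Per-run, timestamped subfolder so repeated runs do not overwrite CSVs."""
    out = Path(base) / f"run-{time.strftime('%Y%m%d-%H%M%S')}"
    out.mkdir(parents=True, exist_ok=True)
    return out
```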
datasets.py

- A user prompt was added to allow users to choose one of the four datasets and list its type.

- A user prompt was added to choose a target feature and features to exclude.

- A user prompt was added for a save location for the processed CSV of the dataset output.
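
The dataset-choice prompt could be structured as below. Only cal_housing is named in these notes, so the registry lists it alone with a placeholder comment; the prompt wording and `reader` parameter (injected for testability) are illustrative assumptions.

```python
# Minimal sketch of the interactive dataset selection described above.
# Only cal_housing is named in the notes; the rest is a placeholder.
DATASETS = {
    "cal_housing": "regression",
    # the three classification datasets from datasets.py would be listed here
}

def choose_dataset(reader=input):
    """Prompt until the user names a known dataset; return (name, task type)."""
    while True:
        name = reader(f"Choose a dataset {sorted(DATASETS)}: ").strip()
        if name in DATASETS:
            return name, DATASETS[name]
        print(f"Unknown dataset: {name!r}")
```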
############################################################################################################################################################
Code Optimizations:

- SHAP KernelExplainer
  Use shap.TreeExplainer on tree-based models instead.

- AutoML search size
  Reduce max_models or max_runtime_secs per fold, or pre-select algorithms.

- Data transformations
  Cache intermediate NumPy arrays to skip repeated fit_transform calls in each fold.

- Parallel folds
  If the CPU has many cores, parallelize the K-fold loop with joblib.Parallel to fully use a higher core count.
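
The parallel-folds idea can be sketched with joblib as below. The fold-evaluation body is a stand-in for the real training/evaluation code, which these notes do not spell out.

```python
# Sketch of parallelizing the K-fold loop with joblib, as suggested above.
# evaluate_fold is a placeholder for the real train/score logic.
import numpy as np
from joblib import Parallel, delayed
from sklearn.model_selection import KFold

def evaluate_fold(X, y, train_idx, test_idx):
    # placeholder: fit the model on train_idx, score on test_idx
    return float(np.mean(y[test_idx]))

def parallel_kfold(X, y, n_splits=5, n_jobs=-1):
    """Run the folds concurrently; n_jobs=-1 uses all available cores."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    return Parallel(n_jobs=n_jobs)(
        delayed(evaluate_fold)(X, y, tr, te) for tr, te in kf.split(X)
    )
```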
############################################################################################################################################################
Notes

- The Slurm headers indicate that the programs should be run on a system with 4 cores per task and 10 GB of RAM.
  This is quite conservative; the jobs would not need to be directed to a cloud-computing environment to run.

- The three jobs run with a time limit of 11 hours. Considering average Compute Canada / AceNet servers (approx. 2.5 GHz CPUs),
  allocate a time limit of at least 5 hours when running on a 13600KF system (assuming no hyperthreading and no E-core processing).

- H2O AutoML supports GPU compute through CUDA libraries. A CUDA-capable GPU may see performance gains for this computation.