Major changes for real this time

This commit is contained in:
Varyngoth
2025-11-11 20:24:36 -04:00
parent 0a081e48f1
commit 5775534e22
6 changed files with 196 additions and 59 deletions


@@ -1,5 +1,9 @@
Codebase:
https://gitlab.com/university-of-prince-edward-isalnd/explanation-aware-optimization-and-automl/-/tree/main/src?ref_type=heads
Previous Analysis:
https://gitlab.com/agri-food-canada/potato-yield-predictions-by-postal-code-ml
Operation:
Specify working directory (local repo location), cache directory (dataset download location), and
@@ -12,15 +16,84 @@ Code File Structure
Shell scripts
h20_batch.sh -> h20_autoML.py
nsga_batch.sh -> nsga_exp.py
grid_search_batch.sh -> grid_search_exp.py
grid_search_batch calls both algorithms and combine_datasets
Run order should be
datasets -> algorithms -> combine_datasets -> 3 .sh files -> shap_values_computation.py
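The run order above could be sketched as a small Python driver (a sketch only; the script names are taken from the list above, and the three .sh batch scripts are assumed to be submitted separately on the cluster, e.g. via sbatch):

```python
import subprocess

# Pipeline order taken from the notes above; the .sh batch stages are
# omitted here because they are SLURM submissions, not local scripts.
PIPELINE = [
    "datasets.py",
    "algorithms.py",
    "combine_datasets.py",
    "shap_values_computation.py",
]

def run_pipeline(steps=PIPELINE, dry_run=False):
    """Run each stage in order; dry_run just returns the planned order."""
    if dry_run:
        return list(steps)
    for script in steps:
        subprocess.run(["python", script], check=True)
    return list(steps)
```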
############################################################################################################################################################
Objective:
Current code is built to perform ML analysis on a potato yield dataset as shown in Potato Yield Predictions by Postal Code ML
The code will need to be modified to work with other datasets
1. Modify code to work with California Housing Price dataset found in datasets.py
(cal_housing, regression dataset)
2. Modify code to work with some other classification-focused dataset
(datasets.py contains cal_housing for regression and three classification datasets)
3. Compare model performance on both datasets to establish a baseline of regression vs. classification.
The comparison table should include key performance indicators for both datasets as well as the number of objects in each dataset
4. (Ideally) Make models as easy as possible to migrate between datasets through a user prompt.
Also cache files for easy referencing and to make sure that data can be analysed properly later
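For objective 1, the California Housing dataset ships with scikit-learn, so a minimal loader (assuming datasets.py wraps something like this; the function name and cache parameter are illustrative, not the repo's actual API) could look like:

```python
from sklearn.datasets import fetch_california_housing

def load_cal_housing(cache_dir=None):
    # as_frame=True returns a pandas DataFrame; data_home controls the
    # download/cache directory, matching the cache-for-later-analysis
    # goal in the notes. The frame is (20640, 9): 8 features plus the
    # "MedHouseVal" regression target.
    bunch = fetch_california_housing(data_home=cache_dir, as_frame=True)
    return bunch.frame
```

(The call is left to the caller because the first invocation downloads the data into the cache directory.)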
Files that need changing
datasets = YES
algorithms = NO
nsga_exp = YES
shap_values_computation = NO(?)
############################################################################################################################################################
Scripting Tasks:
datasets -> algorithms -> combine_datasets -> nsga_exp.py -> shap_values_computation
1. Make datasets generalizable
2. Make combine_datasets reference generalizable headers / infer them from the input
3. Make nsga_exp.py reference the combine_datasets headers
4. Make output folders specified by the user at runtime / in the slurm bash script
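Tasks 1-4 mostly amount to replacing hard-coded names with runtime parameters. A hedged sketch of the kind of CLI wrapper this implies (the flag names and dataset choices are assumptions, not the repo's actual arguments):

```python
import argparse
from pathlib import Path

def parse_args(argv=None):
    # Hypothetical flags covering the tasks above: dataset choice,
    # target column (task 3: taken from the combined headers), and a
    # user-specified output folder for the SLURM script (task 4).
    p = argparse.ArgumentParser(description="generalized nsga_exp runner")
    p.add_argument("--dataset", choices=["cal_housing", "potato_yield"],
                   default="cal_housing")
    p.add_argument("--target", default="MedHouseVal",
                   help="name of the target column in the combined csv")
    p.add_argument("--out-dir", type=Path, default=Path("results"),
                   help="output folder, settable at runtime")
    return p.parse_args(argv)

args = parse_args(["--dataset", "cal_housing", "--out-dir", "runs/exp1"])
```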
Operation Tasks:
1. Run nsga_exp.py using the California Housing Dataset (regression)
2. Run the nsga_exp.py script using a separate, classification dataset
3. Compare results
############################################################################################################################################################
Code Changes:
nsga_exp.py
- Lines 24 & 26 reference yield_t/ha. This should be a parameter
- Lines 33-36 reference relative paths to previous soil.csv files
- Lines 112 and 116 reference a set value of k (k=25). It might be better to set this dynamically based on the size of the dataset
- Lines 141 - 143 reference models_space, pipelines, and k_value range. Should be generalized for other datasets and features
- Line 134 references an nsga output directory. This could be parameterized for other datasets
- Lines 183, 190, and 195 reference specific output path csv files. This will cause overwriting on subsequent runs. Change to store based on run
- Lines 124 - 129 reference models and functions from algorithms.py. This could be generalized to allow any model dictionary but not likely beneficial for this study
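Two of the concrete fixes above, dynamic k instead of the fixed k=25 and run-stamped output paths to stop subsequent runs overwriting the csv files, could be sketched like this (the sqrt heuristic, cap, and helper names are assumptions, not values from nsga_exp.py):

```python
import time
from pathlib import Path

def dynamic_k(n_samples, cap=25):
    # Scale k with dataset size instead of hard-coding k=25; keep the
    # old value as an upper cap and 2 as a floor for tiny datasets.
    return max(2, min(cap, int(n_samples ** 0.5)))

def run_output_path(base_dir, filename):
    # Timestamped per-run subfolder so repeated runs never overwrite
    # the output csv files written near lines 183/190/195.
    run_dir = Path(base_dir) / time.strftime("run_%Y%m%d_%H%M%S")
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir / filename
```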
datasets.py
- User prompt was added to allow users to choose a dataset of the four and list its Type
- User prompt was added to choose a target feature and features to exclude
- User prompt was added for a save location for the processed csv of the dataset output
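The three prompts described above could take roughly this shape (a sketch; the classification dataset names are placeholders, since the notes do not name them, and the real datasets.py prompts may differ):

```python
DATASETS = {
    "cal_housing": "regression",
    # The three classification datasets in datasets.py are referenced
    # in the notes but not named; these keys are placeholders.
    "dataset_b": "classification",
    "dataset_c": "classification",
    "dataset_d": "classification",
}

def prompt_user(input_fn=input):
    # Mirrors the three prompts described above: dataset + its type,
    # target/excluded features, and a save location for the csv.
    name = input_fn(f"Choose a dataset {list(DATASETS)}: ").strip()
    target = input_fn("Target feature: ").strip()
    raw = input_fn("Features to exclude (comma-separated): ")
    exclude = [c.strip() for c in raw.split(",") if c.strip()]
    save_path = input_fn("Save location for processed csv: ").strip()
    return {"dataset": name, "type": DATASETS.get(name),
            "target": target, "exclude": exclude, "save_path": save_path}
```

Injecting `input_fn` keeps the prompts testable without patching builtins.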
############################################################################################################################################################
Code Optimizations:
- SHAP KernelExplainer is sampling-based and slow on tree ensembles
Use shap.TreeExplainer on tree-based models instead