Major changes for real this time

This commit is contained in:
Varyngoth
2025-11-11 20:24:36 -04:00
parent 0a081e48f1
commit 5775534e22
6 changed files with 196 additions and 59 deletions


@@ -1,5 +1,9 @@
Codebase:
https://gitlab.com/university-of-prince-edward-isalnd/explanation-aware-optimization-and-automl/-/tree/main/src?ref_type=heads
Previous Analysis:
https://gitlab.com/agri-food-canada/potato-yield-predictions-by-postal-code-ml
Operation:
Specify working directory (local repo location), cache directory (dataset download location), and
@@ -12,15 +16,84 @@ Code File Structure
Shell scripts
h20_batch.sh -> h20_autoML.py
nsga_batch.sh -> nsga_exp.py
grid_search_batch.sh -> grid_search_exp.py
grid_search_batch calls both algorithms and combine_datasets
Run order should be
datasets -> algorithms -> combine_datasets -> 3 .sh files -> shap_values_computation.py
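The run order above could be sketched as a small Python driver (a sketch only; the script names are taken from the list above, and the three .sh batch scripts are assumed to be submitted separately on the cluster, e.g. via sbatch):

```python
import subprocess

# Pipeline order taken from the notes above; the .sh batch stages are
# omitted here because they are SLURM submissions, not local scripts.
PIPELINE = [
    "datasets.py",
    "algorithms.py",
    "combine_datasets.py",
    "shap_values_computation.py",
]

def run_pipeline(steps=PIPELINE, dry_run=False):
    """Run each stage in order; dry_run just returns the planned order."""
    if dry_run:
        return list(steps)
    for script in steps:
        subprocess.run(["python", script], check=True)
    return list(steps)
```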
############################################################################################################################################################
Objective:
Current code is built to perform ML analysis on a potato yield dataset as shown in Potato Yield Predictions by Postal Code ML
The code will need to be modified to work with other datasets
1. Modify code to work with California Housing Price dataset found in datasets.py
(cal_housing, regression dataset)
2. Modify code to work with some other classification-focused dataset
(datasets.py contains cal_housing for regression and three classification datasets)
3. Compare model performance on both datasets to establish a baseline of regression vs. classification.
The comparison table should include key performance indicators for both datasets as well as the number of objects in each dataset
4. (Ideally) Make models as easy as possible to migrate between datasets through a user prompt.
Also cache files for easy referencing and to make sure that data can be analysed properly later
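For objective 1, the California Housing dataset ships with scikit-learn, so a minimal loader (assuming datasets.py wraps something like this; the function name and cache parameter are illustrative, not the repo's actual API) could look like:

```python
from sklearn.datasets import fetch_california_housing

def load_cal_housing(cache_dir=None):
    # as_frame=True returns a pandas DataFrame; data_home controls the
    # download/cache directory, matching the cache-for-later-analysis
    # goal in the notes. The frame is (20640, 9): 8 features plus the
    # "MedHouseVal" regression target.
    bunch = fetch_california_housing(data_home=cache_dir, as_frame=True)
    return bunch.frame
```

(The call is left to the caller because the first invocation downloads the data into the cache directory.)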
Files that need changing
datasets = YES
algorithms = NO
nsga_exp = YES
shap_values_computation = NO(?)
############################################################################################################################################################
Scripting Tasks:
datasets -> algorithms -> combine_datasets -> nsga_exp.py -> shap_values_computation
1. Make datasets generalizable
2. Make combine_datasets reference generalizable headers / infer them from the input
3. Make nsga_exp.py reference the combine_datasets headers
4. Make output folders specified by the user at runtime / in the slurm bash script
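Tasks 1-4 mostly amount to replacing hard-coded names with runtime parameters. A hedged sketch of the kind of CLI wrapper this implies (the flag names and dataset choices are assumptions, not the repo's actual arguments):

```python
import argparse
from pathlib import Path

def parse_args(argv=None):
    # Hypothetical flags covering the tasks above: dataset choice,
    # target column (task 3: taken from the combined headers), and a
    # user-specified output folder for the SLURM script (task 4).
    p = argparse.ArgumentParser(description="generalized nsga_exp runner")
    p.add_argument("--dataset", choices=["cal_housing", "potato_yield"],
                   default="cal_housing")
    p.add_argument("--target", default="MedHouseVal",
                   help="name of the target column in the combined csv")
    p.add_argument("--out-dir", type=Path, default=Path("results"),
                   help="output folder, settable at runtime")
    return p.parse_args(argv)

args = parse_args(["--dataset", "cal_housing", "--out-dir", "runs/exp1"])
```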
Operation Tasks:
1. Run nsga_exp.py using the California Housing Dataset (regression)
2. Run the nsga_exp.py script using a separate, classification dataset
3. Compare results
############################################################################################################################################################
Code Changes:
nsga_exp.py
- Lines 24 & 26 reference yield_t/ha. This should be a parameter
- Lines 33-36 reference relative paths to previous soil.csv files
- Lines 112 and 116 reference a set value of k (k=25). It might be better to set this dynamically based on the size of the dataset
- Lines 141 - 143 reference models_space, pipelines, and k_value range. Should be generalized for other datasets and features
- Line 134 references an nsga output directory. This could be parameterized for other datasets
- Lines 183, 190, and 195 reference specific output path csv files. This will cause overwriting on subsequent runs. Change to store based on run
- Lines 124 - 129 reference models and functions from algorithms.py. This could be generalized to allow any model dictionary but not likely beneficial for this study
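Two of the concrete fixes above, dynamic k instead of the fixed k=25 and run-stamped output paths to stop subsequent runs overwriting the csv files, could be sketched like this (the sqrt heuristic, cap, and helper names are assumptions, not values from nsga_exp.py):

```python
import time
from pathlib import Path

def dynamic_k(n_samples, cap=25):
    # Scale k with dataset size instead of hard-coding k=25; keep the
    # old value as an upper cap and 2 as a floor for tiny datasets.
    return max(2, min(cap, int(n_samples ** 0.5)))

def run_output_path(base_dir, filename):
    # Timestamped per-run subfolder so repeated runs never overwrite
    # the output csv files written near lines 183/190/195.
    run_dir = Path(base_dir) / time.strftime("run_%Y%m%d_%H%M%S")
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir / filename
```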
datasets.py
- User prompt was added to allow users to choose a dataset of the four and list its Type
- User prompt was added to choose a target feature and features to exclude
- User prompt was added for a save location for the processed csv of the dataset output
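The three prompts described above could take roughly this shape (a sketch; the classification dataset names are placeholders, since the notes do not name them, and the real datasets.py prompts may differ):

```python
DATASETS = {
    "cal_housing": "regression",
    # The three classification datasets in datasets.py are referenced
    # in the notes but not named; these keys are placeholders.
    "dataset_b": "classification",
    "dataset_c": "classification",
    "dataset_d": "classification",
}

def prompt_user(input_fn=input):
    # Mirrors the three prompts described above: dataset + its type,
    # target/excluded features, and a save location for the csv.
    name = input_fn(f"Choose a dataset {list(DATASETS)}: ").strip()
    target = input_fn("Target feature: ").strip()
    raw = input_fn("Features to exclude (comma-separated): ")
    exclude = [c.strip() for c in raw.split(",") if c.strip()]
    save_path = input_fn("Save location for processed csv: ").strip()
    return {"dataset": name, "type": DATASETS.get(name),
            "target": target, "exclude": exclude, "save_path": save_path}
```

Injecting `input_fn` keeps the prompts testable without patching builtins.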
############################################################################################################################################################
Code Optimizations:
- SHAP KernelExplainer is sampling-based and slow on tree ensembles
Use shap.TreeExplainer on tree-based models instead