Major changes for real this time
Codebase:
https://gitlab.com/university-of-prince-edward-isalnd/explanation-aware-optimization-and-automl/-/tree/main/src?ref_type=heads

Previous Analysis:
https://gitlab.com/agri-food-canada/potato-yield-predictions-by-postal-code-ml

Operation:
Specify the working directory (local repo location), the cache directory (dataset download location), and
Code File Structure

Shell scripts
h20_batch.sh -> h20_autoML.py
nsga_batch.sh -> nsga_exp.py
grid_search_batch.sh -> grid_search_exp.py

grid_search_batch.sh calls both algorithms and combine_datasets
Run order should be:

datasets -> algorithms -> combine_datasets -> 3 .sh files -> shap_values_computation.py
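The run order above can be sketched as a small driver script. The stage names come from these notes; whether each stage is invoked via python or bash, and the exact file names (e.g. combine_datasets.py), are assumptions to adjust to the actual repo layout.

```python
# Hedged sketch of the run order above. Stage names come from these notes;
# the interpreter used for each stage is an assumption.
import subprocess
import sys

# datasets -> algorithms -> combine_datasets -> 3 .sh files -> shap_values_computation.py
PIPELINE = [
    ["python", "datasets.py"],
    ["python", "algorithms.py"],
    ["python", "combine_datasets.py"],
    ["bash", "h20_batch.sh"],
    ["bash", "nsga_batch.sh"],
    ["bash", "grid_search_batch.sh"],
    ["python", "shap_values_computation.py"],
]

def run_pipeline(commands=PIPELINE):
    """Run each stage in order, aborting on the first failure."""
    for cmd in commands:
        if subprocess.run(cmd).returncode != 0:
            sys.exit(f"stage failed: {' '.join(cmd)}")
```

Calling run_pipeline() executes the full sequence; on a SLURM cluster the three .sh stages would normally be submitted as batch jobs instead.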
############################################################################################################################################################
Objective:
The current code performs ML analysis on a potato yield dataset, as shown in Potato Yield Predictions by Postal Code ML.
The code will need to be modified to work with other datasets.

1. Modify the code to work with the California Housing Price dataset found in datasets.py
(cal_housing, regression dataset)

2. Modify the code to work with another classification-focused dataset
(the datasets.py code contains cal_housing for regression and three classification datasets)

3. Compare the performance of the model in both situations to establish a baseline comparison of regression vs. classification.
The table should include key performance indicators for both datasets as well as the number of observations in each dataset.

4. (Ideally) Make the models as easy as possible to migrate between datasets through a user prompt.
Also cache files for easy referencing and to make sure the data can be analysed properly later.
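The caching idea in task 4 could look like the following sketch: process a dataset once, save the processed CSV under a cache directory, and reuse it on later runs. The names (load_or_process, process_fn, cache_dir) are illustrative, not taken from the repo.

```python
# Hypothetical caching helper for task 4: cache the processed CSV so later
# analysis runs can reference it without re-processing.
from pathlib import Path

import pandas as pd

def load_or_process(name, process_fn, cache_dir="cache"):
    """Return the processed DataFrame for `name`, using a cached CSV if present."""
    cache_path = Path(cache_dir) / f"{name}_processed.csv"
    if cache_path.exists():
        return pd.read_csv(cache_path)        # cache hit: skip processing
    df = process_fn()                          # expensive load/clean step
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(cache_path, index=False)         # persist for later analysis
    return df
```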
Files that need changing:

datasets = YES
algorithms = NO
nsga_exp = YES
shap_values_computation = NO(?)
############################################################################################################################################################
Scripting Tasks:

datasets -> algorithms -> combine_datasets -> nsga_exp.py -> shap_values_computation

1. Make datasets generalizable

2. Make combine_datasets reference generalizable headers / infer them from the input

3. Make nsga_exp.py reference the combine_datasets headers

4. Make output folders specified by the user at runtime / in the slurm bash script
Operation Tasks:

1. Run nsga_exp.py using the California Housing dataset (regression)

2. Run nsga_exp.py using a separate classification dataset

3. Compare the results
############################################################################################################################################################
Code Changes:

nsga_exp.py
- Lines 24 & 26 reference yield_t/ha. This should be a parameter

- Lines 33-36 reference relative paths to previous soil.csv files

- Lines 112 and 116 use a fixed value of k (k=25). It might be better to set this dynamically based on the size of the dataset

- Lines 141-143 reference models_space, pipelines, and the k_value range. These should be generalized for other datasets and features

- Line 134 references an nsga output directory. This could be parameterized for other datasets

- Lines 183, 190, and 195 reference specific output .csv paths. This will cause overwriting on subsequent runs. Change to store output per run

- Lines 124-129 reference models and functions from algorithms.py. This could be generalized to allow any model dictionary, but that is not likely beneficial for this study
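One way to replace the hard-coded k=25 is to scale k with dataset size but clamp it to a sensible range. The sqrt heuristic and the bounds below are assumptions, not taken from nsga_exp.py.

```python
# Sketch of a dynamic k: grow with the number of samples, clamped to
# [k_min, k_max]. Heuristic and bounds are assumptions for illustration.
import math

def dynamic_k(n_samples, k_min=5, k_max=25):
    """Choose k from the sample count, clamped to [k_min, k_max]."""
    return max(k_min, min(k_max, int(math.sqrt(n_samples))))
```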
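The overwriting issue on lines 183, 190, and 195 can be avoided by tagging each run's output directory with a timestamp or the SLURM job id. The helper below is a sketch; the directory layout is an assumption.

```python
# Sketch of per-run output directories to avoid overwriting results on
# subsequent runs. The base name and layout are illustrative assumptions.
from datetime import datetime
from pathlib import Path

def run_output_dir(base="nsga_output", run_id=None):
    """Create and return a per-run output directory, e.g. nsga_output/run_20240101_120000."""
    if run_id is None:
        run_id = datetime.now().strftime("run_%Y%m%d_%H%M%S")
    out = Path(base) / run_id
    out.mkdir(parents=True, exist_ok=True)  # safe to call repeatedly
    return out
```

On SLURM, os.environ.get("SLURM_JOB_ID") would be a natural run_id.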
datasets.py

- A user prompt was added to allow users to choose one of the four datasets and list its type
- A user prompt was added to choose a target feature and features to exclude
- A user prompt was added for a save location for the processed .csv of the dataset output
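The prompt-driven selection above reduces to a small generic step: given a DataFrame, a target feature, and features to exclude, produce X and y. The function and parameter names below are illustrative, not taken from datasets.py.

```python
# Minimal sketch of the generalized dataset preparation described above.
# Names (prepare_dataset, target, exclude) are hypothetical.
import pandas as pd

def prepare_dataset(df, target, exclude=()):
    """Split df into features X and target y, dropping excluded columns."""
    if target not in df.columns:
        raise ValueError(f"unknown target feature: {target}")
    drop_cols = [target, *[c for c in exclude if c in df.columns]]
    X = df.drop(columns=drop_cols)
    y = df[target]
    return X, y
```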
############################################################################################################################################################
Code Optimizations:

- SHAP KernelExplainer: use shap.TreeExplainer on tree-based models instead; TreeExplainer is exact and much faster than the model-agnostic KernelExplainer
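The swap can be sketched as a small dispatch: use TreeExplainer when the model is tree-based, otherwise fall back to KernelExplainer. The list of tree model classes is an assumption about what algorithms.py might use; extend it as needed.

```python
# Sketch of the optimization above: pick shap.TreeExplainer for tree-based
# models, shap.KernelExplainer otherwise. The TREE_MODELS tuple is an
# assumption about the model space, not taken from algorithms.py.
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

TREE_MODELS = (DecisionTreeRegressor, RandomForestRegressor, GradientBoostingRegressor)

def pick_explainer_name(model):
    """Return which SHAP explainer class to build for this model."""
    return "TreeExplainer" if isinstance(model, TREE_MODELS) else "KernelExplainer"

def make_explainer(model, background):
    """Build the SHAP explainer (requires the shap package)."""
    import shap  # imported lazily so the dispatch logic works without shap installed
    if pick_explainer_name(model) == "TreeExplainer":
        return shap.TreeExplainer(model)
    return shap.KernelExplainer(model.predict, background)
```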