Training CatBoost (Mid-Latitudes)

This notebook demonstrates training a CatBoost model with hyperparameter optimization, followed by feature importance visualization using SHAP. CatBoost is a machine learning algorithm that uses gradient boosting on decision trees. The notebook uses the deepfuel-ML/src/models/catboost_module.py script for model training.

import os
import pandas as pd
import numpy as np
from joblib import dump, load
import shap

Data directory

# The training, validation and test sets required for model training are placed in data/midlats/
! tree ../data/midlats

../data/midlats
├── midlats_test.csv
├── midlats_train.csv
└── midlats_val.csv

0 directories, 3 files

Input Features

Latitude: latitude
Longitude: longitude
Leaf Area Index: LAI
Fire Weather Index: fwinx
Drought Code: drtcode
Fire Danger Severity Rating: fdsrte
Fraction of Burnable Area: fraction_of_burnable_area
2m Dewpoint Temperature: d2m
Evaporation Rate: erate
10m Wind Gust: fg10
10m Wind Speed: si10
Volumetric Soil Water Level 1: swvl1
2m Temperature: t2m
Total Precipitation Rate: tprate
Climatic Region: climatic_region
Slope: slor
Month: month
Fuel Load: actual_load (target variable)

# Check that the header of the training set matches the input features
! head -n 1 ../data/midlats/midlats_train.csv

latitude,longitude,LAI,fwinx,drtcode,fdsrte,fraction_of_burnable_area,d2m,erate,fg10,si10,swvl1,t2m,tprate,climatic_region,slor,actual_load,month
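The same header check can be done programmatically instead of by eye. The following is a minimal sketch using only the standard library; the helper name `check_header` and the `EXPECTED_COLUMNS` set are illustrative and not part of the deepfuel-ML code. The column names are copied from the feature list and header shown above.

```python
import csv
import io

# Column names taken from the feature list above; actual_load is the target.
EXPECTED_COLUMNS = frozenset({
    "latitude", "longitude", "LAI", "fwinx", "drtcode", "fdsrte",
    "fraction_of_burnable_area", "d2m", "erate", "fg10", "si10",
    "swvl1", "t2m", "tprate", "climatic_region", "slor", "month",
    "actual_load",
})


def check_header(header_line, expected=EXPECTED_COLUMNS):
    """Return (missing, unexpected) column names for one CSV header line."""
    columns = next(csv.reader(io.StringIO(header_line)))
    found = {name.strip() for name in columns}
    return sorted(expected - found), sorted(found - expected)


# Header line as printed by the `head` command above
header = ("latitude,longitude,LAI,fwinx,drtcode,fdsrte,"
          "fraction_of_burnable_area,d2m,erate,fg10,si10,swvl1,t2m,"
          "tprate,climatic_region,slor,actual_load,month")
missing, unexpected = check_header(header)
print("missing:", missing, "unexpected:", unexpected)
```

Comparing sets rather than lists makes the check independent of column order, which matters here because actual_load appears before month in the file.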

Model Training

!python '../src/train.py'  --model_name 'CatBoost' --data_path '../data/midlats/' --exp_name 'CatBoost_exp'

Link for the created Neptune experiment--------
Info (NVML): NVML Shared Library Not Found. GPU usage metrics may not be reported. For more information, see https://docs.neptune.ai/logging-and-managing-experiment-results/logging-experiment-data.html#hardware-consumption
https://ui.neptune.ai/shared/step-by-step-monitoring-experiments-live/e/STEP-163
---------------------------------------
0:  learn: 0.9193915        test: 0.9374665 best: 0.9374665 (0)     total: 78.9ms   remaining: 1m 18s
1:  learn: 0.8572337        test: 0.8807920 best: 0.8807920 (1)     total: 90.4ms   remaining: 45.1s
2:  learn: 0.8133704        test: 0.8472102 best: 0.8472102 (2)     total: 105ms    remaining: 34.9s
3:  learn: 0.7751241        test: 0.8134123 best: 0.8134123 (3)     total: 119ms    remaining: 29.6s
4:  learn: 0.7455154        test: 0.7849642 best: 0.7849642 (4)     total: 134ms    remaining: 26.7s
5:  learn: 0.7227938        test: 0.7619387 best: 0.7619387 (5)     total: 147ms    remaining: 24.3s
. . .
315:        learn: 0.4626048        test: 0.6321089 best: 0.6305450 (299)   total: 4.34s    remaining: 9.4s
316:        learn: 0.4623857        test: 0.6320927 best: 0.6305450 (299)   total: 4.36s    remaining: 9.39s
317:        learn: 0.4622255        test: 0.6321369 best: 0.6305450 (299)   total: 4.37s    remaining: 9.37s
318:        learn: 0.4618840        test: 0.6320915 best: 0.6305450 (299)   total: 4.38s    remaining: 9.36s
319:        learn: 0.4616440        test: 0.6317562 best: 0.6305450 (299)   total: 4.4s     remaining: 9.35s
Stopped by overfitting detector  (20 iterations wait)

bestTest = 0.6305450201
bestIteration = 299

Shrink model to first 300 iterations.
RMSE  : 0.6305450197747737
-----------------------------------------------------------------
Inference results

Training error:  2039187852.5081983
Validation error:  2854273450.7074313
Test error:  2231005975.951971
Model file save at ['/Users/rbiswas/VSCodeProjects/deepfuel-ML/src/results/pre-trained_models/CatBoost.joblib']

The training logs can be viewed live online at the following link: https://ui.neptune.ai/shared/step-by-step-monitoring-experiments-live/e/STEP-158
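The log above ends with "Stopped by overfitting detector (20 iterations wait)": the best test loss was reached at iteration 299, and training halted at iteration 319 after 20 further iterations without improvement. The sketch below is a simplified, pure-Python model of such a patience-based stopping rule, not CatBoost's own implementation; the function name `early_stop` and the synthetic loss curve are made up for illustration.

```python
def early_stop(test_losses, patience):
    """Return (best_iteration, stop_iteration) for a patience-based detector.

    Training stops once `patience` iterations have passed since the last
    new best test loss. If that never happens, stop_iteration is the
    index of the final loss.
    """
    best_iter, best_loss = 0, float("inf")
    for i, loss in enumerate(test_losses):
        if loss < best_loss:
            best_iter, best_loss = i, loss
        elif i - best_iter >= patience:
            return best_iter, i
    return best_iter, len(test_losses) - 1


# Synthetic curve (not the real training data): the loss improves up to
# iteration 299 and then stays at a slightly worse value.
losses = [1.0 - 0.001 * i for i in range(300)] + [0.71] * 50
best_iter, stop_iter = early_stop(losses, patience=20)
print(best_iter, stop_iter)
```

With this synthetic curve the rule reproduces the pattern in the log: the best iteration is 299 and the stop comes 20 iterations later. Keeping iterations 0 through 299 gives the 300 iterations that the log reports after shrinking the model.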

Loading the trained model

model = load('../src/results/pre-trained_models/CatBoost.joblib')
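For readers unfamiliar with joblib, the following sketch shows the save and load round trip on a stand-in object. The dictionary and the file name are placeholders; in this notebook the saved object is the fitted CatBoost model. Note that `dump` returns a list of written file paths, which is why the training output above prints the model path inside square brackets.

```python
import os
import tempfile

from joblib import dump, load

# Stand-in object; in the notebook this would be the fitted CatBoost model.
placeholder_model = {"name": "CatBoost", "best_iteration": 299}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "placeholder.joblib")  # hypothetical file name
    written = dump(placeholder_model, path)  # list of file paths written
    restored = load(path)

print(written)
print(restored == placeholder_model)
```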

Feature importance using SHAP

SHAP (SHapley Additive exPlanations) is used to explain the output of the trained machine learning model.
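To make the idea behind Shapley values concrete, the sketch below computes them exactly for a tiny hypothetical model by enumerating every feature coalition. This is an illustration of the underlying quantity only: `shap.TreeExplainer` uses a tree-specific algorithm rather than brute force, and the single-point baseline used here is a simplifying assumption.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to a baseline point.

    Features outside a coalition are replaced by their baseline value.
    Cost grows as 2**n, so this is only for tiny illustrative cases.
    """
    n = len(x)

    def value(coalition):
        point = [x[j] if j in coalition else baseline[j] for j in range(n)]
        return predict(point)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(p):
    # Hypothetical model: one additive term and one interaction term
    return 2.0 * p[0] + p[1] * p[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi, sum(phi), toy_model(x) - toy_model(baseline))
```

The contributions sum to the difference between the prediction at `x` and the prediction at the baseline, which is the additive property the method is named for. The interaction term is shared equally between the two features involved.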

midlat_train = pd.read_csv('../data/midlats/midlats_train.csv')

# Separate the features from the target once and reuse them
X_train = midlat_train.drop(columns=['actual_load'])
shap_values = shap.TreeExplainer(model).shap_values(X_train)
shap.summary_plot(shap_values, X_train)

The y-axis lists the features in order of importance, from top to bottom. On the x-axis (impact on model output), the horizontal position of each point shows whether that feature value is associated with a higher or lower prediction. The colour gradient indicates the feature value.
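The ordering on the y-axis can be reproduced by ranking features by the magnitude of their SHAP values across all samples. The sketch below does this with the mean absolute value per feature, on a small made-up matrix; the numbers are not results from this model, and `rank_features` is an illustrative helper rather than part of the shap library.

```python
def rank_features(shap_matrix, feature_names):
    """Order features by mean absolute SHAP value, largest first."""
    n_rows = len(shap_matrix)
    scores = []
    for col, name in enumerate(feature_names):
        mean_abs = sum(abs(row[col]) for row in shap_matrix) / n_rows
        scores.append((name, mean_abs))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Made-up SHAP values for three samples and three features (not real results)
toy_shap = [
    [0.10, -0.50, 0.02],
    [-0.20, 0.40, 0.01],
    [0.15, -0.60, -0.03],
]
names = ["t2m", "LAI", "slor"]
ranking = rank_features(toy_shap, names)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Taking absolute values before averaging matters: a feature that pushes some predictions up and others down would otherwise cancel out and appear unimportant.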