Azure ML Diabetes Classification
Skills Used: XGBoost, Docker, Azure ML, Azure DevOps Pipelines, Hyperopt, Upsampling Data, GitHub, DVC
Summary
This is a project to train and deploy an XGBoost model that predicts Diabetes based on non-invasive metrics (BMI, Sex, etc.).
Check out the final web app.
When the web app is run, it sends a request to my API that runs the latest deployed model.
Classification report for the latest deployed model
The model is the result of a job that kicks off every time a change is made to the model training repository. After the job completes, I can inspect the metrics and decide if I want to use the new model in my API.
There are a few repositories associated with this project, in case you want to check out the code. Please be gentle, as I wrote this very quickly and I am currently cleaning up the code.
Repos:
Check out below for more details!
Details
Description: The CDC's Behavioral Risk Factor Surveillance System (BRFSS) conducts telephone surveys, and the responses related to diabetes can be found here on Kaggle.
Data: I start out by doing some EDA. A notable characteristic is the class imbalance in this classification problem. You can check out some of the details in the EDA notebook. There is a visualization for each column showing the distribution of values the column takes on in the data and how those values relate to diabetes.
I version the data and define pipelines that build the training and test datasets using Data Version Control (DVC), which would keep the data consistent among data scientists if others were working on this project. You can find the data pipeline definitions in dvc.yaml in the root of the project, and the Python files referenced in the pipeline definitions are contained in the folder diabetes_data_code. I store these data in Azure Blob Storage for easy access between machines, and the stored data are automatically updated via CI/CD pipelines on changes to the code/data.
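To give a feel for what such a pipeline definition looks like, here is a minimal sketch of a dvc.yaml with this shape. The stage names, script names, and file paths are hypothetical placeholders, not the actual contents of the repo.

```yaml
# Hypothetical sketch of a dvc.yaml pipeline definition.
# Stage, script, and file names are placeholders, not the repo's real ones.
stages:
  split:
    cmd: python diabetes_data_code/split_data.py
    deps:
      - diabetes_data_code/split_data.py
      - data/raw/diabetes_survey.csv
    outs:
      - data/processed/train.csv
      - data/processed/test.csv
  upsample:
    cmd: python diabetes_data_code/upsample_smote.py
    deps:
      - diabetes_data_code/upsample_smote.py
      - data/processed/train.csv
    outs:
      - data/processed/train_upsampled.csv
```

With a definition like this, `dvc repro` rebuilds only the stages whose dependencies changed, which is what keeps the datasets consistent across machines.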
One of the challenges in these data is the imbalance, as only 14% of the records are in the diabetes=1 class. To account for this, a data pipeline is built that upsamples the data using SMOTE. Essentially, SMOTE creates new synthetic minority-class samples by interpolating between an existing minority-class sample and one of its nearest minority-class neighbors. You can see the effect of upsampling toward the end of the data exploration notebook.
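As a minimal sketch of this upsampling step, assuming the imbalanced-learn implementation of SMOTE (the file paths and target column name below are illustrative placeholders, not necessarily what the repo uses):

```python
# Sketch of the SMOTE upsampling step (assumes imbalanced-learn;
# file paths and the target column name are illustrative placeholders).
import pandas as pd
from imblearn.over_sampling import SMOTE

train = pd.read_csv("data/processed/train.csv")
X = train.drop(columns=["Diabetes_binary"])
y = train["Diabetes_binary"]

# SMOTE synthesizes new minority-class rows by interpolating between a
# minority sample and one of its nearest minority-class neighbors.
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

upsampled = X_resampled.assign(Diabetes_binary=y_resampled)
upsampled.to_csv("data/processed/train_upsampled.csv", index=False)
```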
The end result of the data repo is upsampled cross-validation data and training data for model building, which are uploaded to blob storage. When there is an update to the repo, the data pipelines kick off again and refresh the storage with the latest data.
Model Building: Here in the model training repo you can find the principal training script.
In this project I choose to use XGBoost; given more time, it would be nice to experiment with other models. Model building is logged with MLflow, making it easy to track metrics, parameters, and experiments. I tune the hyperparameters using hyperopt on the cross-validation sets that were made in the data pipelines, then log the trained models with the best average metrics. To start, I define ranges for the hyperparameters, then run a notebook that uses hyperopt to train many models and record their metrics for many hyperparameter combinations. I then parallel-plot those hyperparameters against the F1 metric. For example, here is one iteration of such parallel plots:
Here I have highlighted the models built with the best metrics, and I can see, for example, that the best performing models have a lower max_depth than much of the range I'm searching over (2 to 20, in this case). Thus, I can narrow the search space for that parameter to perhaps improve the model. There is a similar theme for eta, subsample, n_estimators, and lambda. After searching over a more limited space, the parallel plot looks as follows:
We can see here that the best model eked out a few more points in this run. After finding reasonable ranges for the hyperparameters, I can run a longer training job in the cloud to do a more extensive search. A rough sketch of the tuning loop is shown below.
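The sketch below shows the general shape of a hyperopt + XGBoost + MLflow tuning loop; it is illustrative only, not the repo's actual training script. The search ranges, data loading, and cross-validation setup are assumptions.

```python
# Illustrative sketch of hyperparameter tuning with hyperopt, XGBoost, and MLflow.
# Paths, ranges, and CV setup are assumptions, not the project's actual script.
import pandas as pd
import mlflow
import xgboost as xgb
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK
from sklearn.model_selection import cross_val_score

# Placeholder path for the upsampled data produced by the data pipelines.
train = pd.read_csv("data/processed/train_upsampled.csv")
X_train = train.drop(columns=["Diabetes_binary"])
y_train = train["Diabetes_binary"]

space = {
    "max_depth": hp.quniform("max_depth", 2, 20, 1),
    "eta": hp.loguniform("eta", -5, 0),
    "subsample": hp.uniform("subsample", 0.5, 1.0),
    "n_estimators": hp.quniform("n_estimators", 50, 500, 50),
    "reg_lambda": hp.loguniform("reg_lambda", -3, 3),
}

def objective(params):
    params["max_depth"] = int(params["max_depth"])
    params["n_estimators"] = int(params["n_estimators"])
    model = xgb.XGBClassifier(**params, eval_metric="logloss")
    # Average F1 over the CV folds; this is the metric plotted against the
    # hyperparameters in the parallel plots.
    f1 = cross_val_score(model, X_train, y_train, scoring="f1", cv=5).mean()
    with mlflow.start_run():
        mlflow.log_params(params)
        mlflow.log_metric("f1_cv_mean", f1)
    return {"loss": -f1, "status": STATUS_OK}

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=100, trials=trials)
```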
The model-building process is wired into a CI/CD pipeline, so when there is a change to the code base, new models are built and I can examine whether I would like to deploy a new version of the model.
After a job is promoted to a model, I can do a bit of model analysis to interpret what the model is telling us. One way to do this is to use SHAP values. I do this in a notebook, where I download a version of the model from Azure ML. I can then generate individual SHAP waterfall plots, like the following
Among other things, this plot is saying that a BMI equal to 21 decreased the log odds of diabetes by 0.31, where log odds relates to probability as log_odds(p) = ln(p / (1 - p)).
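A rough sketch of how these plots can be produced with the shap package is below; the model-loading step, file paths, and column names are placeholders for whatever is actually pulled down from Azure ML.

```python
# Sketch of generating SHAP plots for the trained XGBoost model.
# Model path, test-data path, and column names are illustrative placeholders.
import pandas as pd
import shap
import xgboost as xgb

model = xgb.XGBClassifier()
model.load_model("downloaded_model/model.json")  # placeholder for the model downloaded from Azure ML

X_test = pd.read_csv("data/processed/test.csv").drop(columns=["Diabetes_binary"])

# TreeExplainer returns SHAP values in the model's raw output space,
# which for a binary XGBoost classifier is log odds.
explainer = shap.TreeExplainer(model)
shap_values = explainer(X_test)

# Waterfall plot for a single prediction...
shap.plots.waterfall(shap_values[0])

# ...and a beeswarm plot summarizing the whole test set.
shap.plots.beeswarm(shap_values)
```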
We can see a visualization for the entire test dataset via a beeswarm plot:
To interpret this, let's take Age as an example. The further a dot sits to the left, the more negative its SHAP value. In the Age row, the dots to the left are blue, meaning low ages produce negative SHAP values; that is, being young decreases the predicted probability of diabetes. Toward the right of the Age row the dots are red, meaning high ages produce positive SHAP values, so older age increases the predicted probability of diabetes.
Everything in the beeswarm plot seems to make sense to me, save for the Smoker row. According to the model, low values (blue) of Smoker (meaning smoker = 0, since the feature is binary, and thus the person is not a smoker) have positive SHAP values, increasing the probability of diabetes. You can see this as well if you play with the Streamlit app: if you switch from smoker to nonsmoker, the predicted probability of diabetes tends to increase!
Finally, I'll mention that I am also working on another way to address the imbalance in the problem: instead of upsampling the data, using XGBoost's scale_pos_weight parameter (see the XGBoost parameter documentation). Initial tests suggest this works marginally better than upsampling the data.
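As a rough sketch of that alternative (the data loading and column name are placeholders, not the repo's code), scale_pos_weight is typically set to the ratio of negative to positive examples in the training data:

```python
# Sketch of using scale_pos_weight instead of SMOTE upsampling.
# Data loading and the target column name are illustrative placeholders.
import pandas as pd
import xgboost as xgb

train = pd.read_csv("data/processed/train.csv")
X = train.drop(columns=["Diabetes_binary"])
y = train["Diabetes_binary"]

# Weight positive examples by the class ratio: with ~14% positives this
# comes out to roughly 86 / 14 ≈ 6.
ratio = (y == 0).sum() / (y == 1).sum()
model = xgb.XGBClassifier(scale_pos_weight=ratio, eval_metric="logloss")
model.fit(X, y)
```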
Model Deployment: Here you can find the code for the deployed model's API. If I decide that I would like to deploy a new model, I first promote the result of an MLflow training job to a model in my Azure ML workspace. Then, I update the version of my model in my API config.
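One way this promotion step can be done is from the MLflow client itself. The following is a hedged sketch using mlflow.register_model with a placeholder run ID, model name, and tracking URI; the project may instead register the model through the Azure ML studio UI or the Azure ML SDK.

```python
# Sketch of promoting an MLflow run's model to a registered model.
# The tracking URI, run ID, and model name are placeholders.
import mlflow

mlflow.set_tracking_uri("azureml://<workspace-tracking-uri>")  # placeholder URI
result = mlflow.register_model(
    model_uri="runs:/<run_id>/model",  # placeholder run ID
    name="diabetes-xgboost",           # hypothetical registered model name
)
print(result.name, result.version)  # the version to reference in the API config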
Streamlit Web App: The Streamlit app allows you to easily tinker with the inputs for calling the deployed model's API. I copied many of the BRFSS questions that correspond to features in the model. You can check out the code if you like. Essentially, I create JSON data for all the questions for easy loading into the application.
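As a sketch of how such an app might call the API (the endpoint URL, feature names, and question text are placeholders, not the actual app code):

```python
# Minimal sketch of a Streamlit front end that calls the model API.
# The endpoint URL and feature names are illustrative placeholders.
import requests
import streamlit as st

st.title("Diabetes Risk Estimator")

bmi = st.number_input("What is your BMI?", min_value=10.0, max_value=80.0, value=25.0)
smoker = st.selectbox("Have you smoked at least 100 cigarettes in your life?", ["No", "Yes"])

if st.button("Predict"):
    payload = {"BMI": bmi, "Smoker": 1 if smoker == "Yes" else 0}
    response = requests.post("https://<my-endpoint>/score", json=payload)  # placeholder URL
    st.write("Predicted probability of diabetes:", response.json())
```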
Note: This is somewhat of a redesign of a previous version of this project.
Questions? Comments? Let me know what you think! Reach out to me on LinkedIn.