Description
DrivenData is a machine learning competition website similar to Kaggle, where data scientists from around the world come together to compete in building predictive models. DrivenData differs in that the problems don't come from companies looking to improve their operations or gain insight; instead, the objectives centre on solving problems that help others and, in their words, "save the world". They find real-world questions where data science can have a positive social impact, then run online modelling competitions for data scientists to develop the best models to solve them. I think it's a really admirable purpose, and it has opened my eyes to the power we, as data scientists, have to change the world and to the many applications the field has. Here I start small with DrivenData's starter dataset, predicting blood donations.
This is the smallest, least complex dataset on DrivenData. That makes it a great place to dive into the world of data science competitions. The dataset is from a mobile blood donation vehicle in Taiwan. The Blood Transfusion Service Centre drives to different universities and collects blood as part of a blood drive. We want to predict whether or not a donor will give blood the next time the vehicle comes to campus.
Imports & Admin
Imports
First off, let's run all the imports as usual. The imports are pretty standard as per previous posts and are used for similar, if not the same, purposes. Let's dive into the data!
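For reference, here's a minimal sketch of the kind of imports this post relies on; the exact set in the original notebook may differ slightly.

```python
# Core data handling and plotting, plus the scikit-learn and XGBoost pieces
# used later in the post. Assumes pandas, seaborn, scikit-learn and xgboost
# are installed.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import (KFold, GridSearchCV, cross_val_score,
                                     cross_val_predict)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import roc_curve, auc
from xgboost import XGBClassifier
```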
Read Data
Here I print the dimensions of the data, the columns and the first few rows so I can get a feel for the data as usual. I also check whether there are any null values that we will have to deal with, as I prefer to deal with those first in order to get a cleaner view of the data. We can see that the training set has 576 rows and 6 columns and the test set has 200 rows and 5 columns. The first thing to note is that it's generally best to have a training set bigger than the test set, and roughly three times bigger is not too bad. The data is all numeric, which is great and easy to deal with! Each row is an individual, and there is a column to identify them, which we will have to remove from the modelling process during the build. Not much cleaning to do at all!
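A sketch of those checks, assuming the competition files have been downloaded and saved locally as train.csv and test.csv:

```python
# Load the DrivenData training and test files (file names assumed).
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Dimensions, columns and a peek at the first few rows.
print(train.shape, test.shape)      # expecting (576, 6) and (200, 5)
print(train.columns.tolist())
print(train.head())

# Check for missing values in both sets.
print(train.isnull().sum())
print(test.isnull().sum())
```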
Describe
Here we have a look at how we can describe the data. Descriptive statistics are a very good way to better understand the dataset and see some quick metrics. Focussing on the training dataset, we can see that "Months since Last Donation" ranges from 0 to 74 with a mean of 9.439. With such a low mean compared to the maximum, that maximum may be an anomaly; this, like other findings, will become clearer further into our analysis. We have a similar case with the "Number of Donations" column, which has a mean of 5.427 and ranges from 1 to 50. The "Total Volume Donated (c.c.)" and "Months since First Donation" columns also have similarly skewed distributions, ranging from 250 to 12,500 with a mean of 1,356.771 and from 2 to 98 with a mean of 34.050 respectively.
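The descriptive statistics quoted above come straight out of pandas:

```python
# Summary statistics (count, mean, std, min, quartiles, max) for each column.
print(train.describe())
print(test.describe())
```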
Pivot Table
Descriptive stats can be great for getting a feel for the data, but here I dive just a little bit deeper and create a pivot table to see the means for those who donated blood and those who didn't. This way I can begin to hypothesise about what differentiates the two sets of people within the dataset. We find that people who donated were more recent first-time donors, with the mean "Months since First Donation" being about a month lower than for those who did not donate. I will not speculate as to why this is the case, as it could be driven by a number of reasons and we don't have enough data to say. The pivot also shows that, on average, people who donated had fewer months since their last donation than those who did not, and that those who donated in March 2007 had given more donations in total than those who did not. The total volume donated follows the same trend as the number of donations. Let's explore that concept a bit more: how much each of these variables moves together.
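A sketch of that pivot table, assuming the column names match those quoted above:

```python
# Mean of each feature, split by whether the person donated in March 2007.
pivot = train.pivot_table(
    index="Made Donation in March 2007",
    values=[
        "Months since Last Donation",
        "Number of Donations",
        "Total Volume Donated (c.c.)",
        "Months since First Donation",
    ],
    aggfunc="mean",
)
print(pivot)
```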
Correlation
This correlation table and diagram show the relationship between the movements of pairs of variables. Straight away, it's noticeable that there is a perfect correlation between the "Total Volume Donated (c.c.)" column and the "Number of Donations" column. With this in mind, it's pretty safe to say that every donation is of the same amount, meaning donors are not able to give any more or less blood. Interesting! I guess you can't give so much blood that you compromise your health, and you can't give less or you'd be wasting the rest of the space in the bag that another potential donor could have filled. This way donations are consistent and the service can better plan and allocate its resources. Moving on, the next highest correlation is between "Months since First Donation" and "Number of Donations", at 0.622. This means that as the number of months since the first donation increases, so does the number of donations, which makes sense: longer-standing donors have simply had more time in which to donate.
Let's look at how the variables correlate with the dependent variable, "Made Donation in March 2007". The correlation table shows that the number of donations has a modest positive correlation with the target (0.2206), meaning people with more past donations were somewhat more likely to donate in March 2007; the same applies to total volume donated, since the two independent variables are perfectly correlated. It also shows a modest negative correlation (-0.2612) for months since last donation, so the longer it had been since someone's last donation, the less likely they were to donate. "Months since First Donation" has very little correlation with the dependent variable.
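The correlation table and heatmap can be produced along these lines:

```python
# Pairwise correlations between the numeric columns, plus a quick heatmap.
corr = train.corr()
print(corr)

sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.title("Correlation matrix (training set)")
plt.show()
```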
Here is a pairplot to visualise the relationships between the variables. There appears to be a fairly clear linear relationship in the majority of the plots, with the exception of those involving the "Months since Last Donation" variable.
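The pairplot is a one-liner with seaborn, colouring each point by the target:

```python
# Scatter plots of every pair of variables, coloured by the dependent variable.
sns.pairplot(train, hue="Made Donation in March 2007")
plt.show()
```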
Test Data
Now let's have a look at our training set in relation to our test set. If we are building a model on the training set to predict outcomes for the test set, we'd hope the two are similar, so we can gauge how much our model needs to generalise to accommodate the test data. Looking at the descriptive statistics and correlations, I would say they are very similar, which is great!
Train Vs Test
Let's have a look at the distributions of both the training set's and the test set's variables.
It would appear that the test set is very similar to the training set. The distributions also show outliers sitting far from the main cluster of results. Let's deal with these outliers by taking the log of the variables and reducing their ranges. Perhaps the distributions of the transformed variables will be better suited to our next steps in model building. Let's have a look!
Variable Transformation
The log-transformed variables look much better distributed than the originals. Let's have a look at the descriptive stats and correlations of the log values. In the instances where the minimum value in the original dataset is 0, lambda x: np.log(x + 1) was used, as you cannot take the log of 0. This works nicely because log(1) equals 0.
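A sketch of the transformation, assuming all four independent variables are logged (the exact choice of columns in the original may differ):

```python
# Log-transform the skewed columns; log(x + 1) handles the columns whose
# minimum is 0.
log_cols = [
    "Months since Last Donation",
    "Number of Donations",
    "Total Volume Donated (c.c.)",
    "Months since First Donation",
]

train_log = train.copy()
test_log = test.copy()
for col in log_cols:
    train_log[col] = train_log[col].apply(lambda x: np.log(x + 1))
    test_log[col] = test_log[col].apply(lambda x: np.log(x + 1))

# Descriptive stats and correlations of the logged values.
print(train_log.describe())
print(train_log.corr())
```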
Log Correlation
KFold Cross Validation
Before we start modelling, let's create a KFold object that we can use to split our data and cross-validate our models. Cross-validation is a popular technique for validating models: it allows us to train on a subset of the data and use the remaining data to test the accuracy of the model. This is great for testing how the model would cope with out-of-sample data, provided it comes from the same domain as the training set. Let's try out a bunch of models and see how they fare against each other! The submission requires that we return the probability of each test row donating blood, so we will be working with models that allow us to return probabilities.
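A minimal sketch of that setup; the position of the ID column and the random seed are assumptions on my part:

```python
# Assume the first column is the donor ID: drop it from the features, and
# separate out the target.
X = train.iloc[:, 1:].drop(columns=["Made Donation in March 2007"])
y = train["Made Donation in March 2007"]
X_test = test.iloc[:, 1:]

# 10-fold splitter reused across all the models below.
kf = KFold(n_splits=10, shuffle=True, random_state=42)
```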
Models
The approach I took with these models was to first run the model and look at the cross-validated scores, assessing the range between the accuracy scores. I then took their mean in order to compare models against each other. After running the models, I ran a grid search to find the optimal parameters for each model, re-built the model and took the average score as a metric of its accuracy. I then plotted the ROC curve of the model alongside those of the other models in order to make a comparison. After that, I iterate on the process, which is explained further below. First, let's have a look at the results of the models.
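As a sketch of that per-model workflow, here it is for the Logistic Regression; the parameter grid is purely illustrative, not the grid used in the original search:

```python
# 1. Baseline cross-validated accuracy: look at the spread and the mean.
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=kf, scoring="accuracy")
print(scores.min(), scores.max(), scores.mean())

# 2. Grid search for better parameters (illustrative grid).
param_grid = {"C": [0.01, 0.1, 1, 10]}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                    cv=kf, scoring="accuracy")
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)

# 3. Re-score the tuned model and keep its mean accuracy for comparison.
tuned_scores = cross_val_score(grid.best_estimator_, X, y, cv=kf,
                               scoring="accuracy")
print(tuned_scores.mean())
```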
Logistic Regression
Support Vector Machines
ADA Boost
XGBoost
Random Forest
Looking at the results above, it looks like the models tend to perform in a similar way; however, some perform marginally better than others. Let's iterate and review our results again.
Although we already have cross-validated scores, here I write a function to assess how consistently the models perform and to estimate each model's most likely mean score. If a model is run several times, different results can occur, so I assess the range of possibilities by observing the mean of 500 runs of each model. This gives me an idea of which models perform better and which are less likely to deviate from their mean. Below, I visualise these differences.
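A sketch of that function, assuming each run reshuffles the folds and records the mean of the 10 fold accuracies (the original implementation may differ):

```python
def repeated_cv_means(model, X, y, n_runs=500, n_splits=10):
    """Run n_runs rounds of K-fold cross-validation, each with a fresh shuffle,
    and return the mean accuracy of each round."""
    means = []
    for run in range(n_runs):
        folds = KFold(n_splits=n_splits, shuffle=True, random_state=run)
        scores = cross_val_score(model, X, y, cv=folds, scoring="accuracy")
        means.append(scores.mean())
    return np.array(means)

# Example usage (variable names assumed; 500 runs per model takes a while):
# lr_means = repeated_cv_means(LogisticRegression(max_iter=1000), X, y)
# rf_means = repeated_cv_means(RandomForestClassifier(), X, y)
# sns.kdeplot(lr_means); sns.kdeplot(rf_means); plt.show()
```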
Here we can quickly identify a leptokurtic-type curve for both the Logistic Regression and the Support Vector Machine, meaning the two models tend to be quite consistent in their accuracy. The ADABoost, XGBoost and Random Forest have better accuracies, though they tend to vary much more than the other models. Looking at the top graph, we can see that the XGBoost and ADABoost could potentially perform worse than the Logistic Regression and Support Vector Machine. Let's see the chances of that happening with some basic probabilities. These averages have been taken over 500 runs each, which should be enough to get close to the true mean.
Here we can see that there is less than a 2% chance of the Logistic Regression or Support Vector Machine performing better than the other models on average. It is important to remember that these are the average scores of multiple models run with 10-fold cross-validation: a model is run, KFold cross-validated with K=10, and the average of those 10 scores is taken.
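That basic probability can be estimated directly from the 500 run means, for example by comparing every pair of runs; a small helper along these lines (array names assumed from the sketch above):

```python
def prob_a_beats_b(means_a, means_b):
    """Fraction of (run A, run B) pairs in which model A's mean accuracy
    exceeded model B's."""
    return float((means_a[:, None] > means_b[None, :]).mean())

# e.g. prob_a_beats_b(lr_means, rf_means) using the arrays from the sketch above.
```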
This shows that the models do have a similar distribution on a model-by-model basis; however, the same trend applies here: the Random Forest, XGBoost and ADABoost perform better than the Logistic Regression and the Support Vector Machine more often in general, and the Random Forest performs the best of all the models, as suggested earlier. Let's have a look at the models' ROC curves to further assess their performance.
Model ROC Curves
The ROC curve is a plot of the True Positive Rate against the False Positive Rate at different thresholds. The important aspect of the plot is the Area Under the Curve, referred to as the AUC score. This area gives the probability that the model will rank a randomly chosen positive outcome above a randomly chosen negative outcome. In a perfect world, we would build a model that covers the whole area; we don't live in such a world, so the thresholds and models that the model builder chooses will involve a trade-off between the TPR and FPR. This is set based on your business requirements and is a matter of judgement and expert knowledge. Here we can see that the Random Forest model is consistently better than the other models.
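One way to reproduce this kind of ROC comparison is with out-of-fold predicted probabilities; a sketch (default model settings, not necessarily the tuned ones used above):

```python
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Machine": SVC(probability=True),
    "ADABoost": AdaBoostClassifier(),
    "XGBoost": XGBClassifier(),
    "Random Forest": RandomForestClassifier(),
}

for name, model in models.items():
    # Out-of-fold probability of the positive class for every training row.
    probas = cross_val_predict(model, X, y, cv=kf, method="predict_proba")[:, 1]
    fpr, tpr, _ = roc_curve(y, probas)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.3f})")

plt.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```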
Model Choice
After rigorously assessing our models, it's clear that the Random Forest is the best option for our current problem, so let's press forward with it and go on to predict on our test set. There are a number of ways we could approach this, and we could run through them all in order to get to the best submission; however, in the interest of not overloading this post, we'll stick to one.
Now that it's clear that the Random Forest is the best choice of the models we've looked at, let's take a modular approach, engage in some functional programming and create a function that gives us what we're looking for: the predicted probability for each of the 200 records in the test set, which is what the competition scores us against.
What this function does is calculate the cross-validated score of the model on the training set with a KFold of K=10 and append the average score to a list. With each model built in the K-fold, it predicts the probability we're looking to submit to the competition, the probability that the individual donated blood, giving us 10 sets of predictions at this point. The average of these 10 predictions is then taken and stored in a DataFrame. This process is repeated for as many iterations as the user specifies; to maintain consistency, the number of iterations matches our analysis above, 500. The function then returns the average and individual scores and probabilities across the 500 iterations.
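A sketch of such a function, under the assumptions above (10 folds per iteration, fold models averaged, then iterations averaged into a "final_avg" column); the original may differ in its details:

```python
def kfold_predict_proba(model, X, y, X_test, n_iter=500, n_splits=10):
    """For each iteration: run K-fold CV on the training set, average the fold
    models' test-set probabilities, and record the mean fold accuracy.
    Returns the overall mean score and a DataFrame of per-iteration predictions."""
    scores = []
    preds = pd.DataFrame(index=X_test.index)
    for i in range(n_iter):
        folds = KFold(n_splits=n_splits, shuffle=True, random_state=i)
        fold_scores, fold_preds = [], []
        for train_idx, val_idx in folds.split(X):
            model.fit(X.iloc[train_idx], y.iloc[train_idx])
            fold_scores.append(model.score(X.iloc[val_idx], y.iloc[val_idx]))
            fold_preds.append(model.predict_proba(X_test)[:, 1])
        scores.append(np.mean(fold_scores))
        preds[f"iter_{i}"] = np.mean(fold_preds, axis=0)
    preds["final_avg"] = preds.mean(axis=1)
    return np.mean(scores), preds

# 500 iterations of 10 folds means 5,000 model fits, so this takes a while.
avg_score, predictions = kfold_predict_proba(RandomForestClassifier(), X, y,
                                             X_test, n_iter=500)
print(avg_score)
print(predictions["final_avg"].head())
```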
Let's have a look at the score and predictions!
Finally, we'll take the "final_avg" column and submit that as the answer. We can also set thresholds to improve our score. When you submit your probabilities, the scoring compares them against whether or not each row actually donated blood. Therefore, we could submit a set of results adjusted according to how certain we are of each one. For example, if we predict that there's a 90% chance of a person donating blood, we could round that up to 100%: we're fairly sure they did donate, and we gain the additional points we would otherwise have missed by holding back that 10%. The trade-off is that if we are wrong, we lose those points entirely! So we could set this depending on how confident we are in our model. For our first submission, we will just submit the "final_avg" scores as they are.
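A sketch of building the submission file from "final_avg", with the thresholding idea left as commented-out lines; the ID handling and file format are assumptions based on the competition's sample submission:

```python
# Index the predictions by the donor IDs from the test file (assumed to be
# its first column) and keep the required column name.
submission = pd.DataFrame(
    {"Made Donation in March 2007": predictions["final_avg"].values},
    index=test.iloc[:, 0],
)

# Optional thresholding: push very confident predictions towards 0 or 1.
# col = "Made Donation in March 2007"
# submission.loc[submission[col] > 0.9, col] = 1.0
# submission.loc[submission[col] < 0.1, col] = 0.0

submission.to_csv("submission.csv")
```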
Conclusion and Results
This competition has been running since December 2014. On the first run, without tweaking the model in the ways mentioned above, I managed to come 342nd out of 1,704, which is not bad at all for a first submission. The evaluation metric is shown below as well. Within competitions, people tend to tweak their model incrementally and monitor their score in order to get the best score possible. When there's $5,000 to $15,000 up for grabs, that's quite the incentive to run through plenty of iterations!
Improvements
There are a number of ways in which this work could be improved. One is to run more iterations in order to get a more robust answer. Another idea would be to run variations of the variables through the model, i.e. every combination of logged and non-logged variables, to see which performs best during the training and cross-validation stage, and then predict the submission variable with that combination.
Optimising the model by evaluating the changes made at each submission could result in a better competition score, though not necessarily the best generalised model, as you would effectively be fitting the model to the test data.
I could also look into other modelling algorithms that could potentially perform better than the Random Forest.