Can an Algorithm Detect a Brain Tumor?

Lucy Schafer
May 8, 2021


source: https://www.eurekalert.org/pub_releases/2020-06/ku-aie060820.php

Abstract:

Through the application and cross-validation of multiple machine learning models, I was able to develop an algorithm that predicts with 98.67% accuracy whether or not a brain tumor is present, using 13 key features from an MRI scan. The dataset I used for training and testing was found in the Kaggle dataset library; it includes roughly 3,800 entries across 15 columns. By exploring different models, such as the Decision Tree Classifier, Random Forest Classifier, and AdaBoost Classifier, I was able to determine the most strongly correlated features and accurately and reliably predict the target.

Guiding Questions:

Once I found the dataset to explore, I had some questions that helped guide my programming process:

  1. Which features of the MRI data will prove most important in predicting the presence or absence of a brain tumor?
  2. Which Machine Learning model will achieve the highest accuracy when comparing the test data with the predicted data?

Model Exploration and Comparison:

Step 1: Train, Test, Baseline

I began by using scikit-learn's model_selection module to divide my data with train_test_split into two sets: training (70%) and testing (30%). Next, I wanted to establish the baseline accuracy a model could achieve, using DummyClassifier from scikit-learn's dummy module with the "most_frequent" strategy. My baseline accuracy came out to roughly 55%, so anything meaningfully higher than that would be a good model.
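
For anyone following along, here is roughly what this step looks like in code. The file name and the target column name ("Class") are placeholders I'm assuming for illustration; the real names come from the Kaggle CSV.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier

# Assumed file and target-column names; adjust to match the Kaggle CSV.
df = pd.read_csv("brain_tumor.csv")
X = df.drop(columns=["Class"])   # the MRI-derived features
y = df["Class"]                  # assumed encoding: 1 = tumor, 0 = no tumor

# 70/30 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Baseline: always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
print("Baseline accuracy:", baseline.score(X_test, y_test))  # ~0.55 for this data
```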

Step 2: Explore Other Models

Decision Tree Classifier

I like to start with the Decision Tree Classifier from scikit-learn's tree module, because I find it gives reliably high accuracy on most datasets and it lets you visualize the fitted tree with sklearn.tree.plot_tree(). The accuracy of my decision tree model came out to 97.8%. I was delighted to see this outcome, as it is much higher than my baseline. I also visualized the results in a confusion matrix, which confirmed the high accuracy: only 24 out of 3800 predictions did not match the true value. The decision tree below displays the deciding features of the tree, with max_depth set to 5 to avoid overfitting.

The one issue with the decision tree model is its feature importance. The first-order features of the dataset are Kurtosis, Variance, Skewness, Mean, and Standard Deviation; however, of these five only Kurtosis appeared within the second branch of the tree. The model was accurate enough that I could have stopped here, but I wanted to see whether I could achieve high accuracy while having the feature importance match the first-order attributes. So I continued testing more models.
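
Here is a sketch of this step, reusing the split from the first snippet (max_depth=5 as mentioned above; the class-label names are my assumption about how the target is encoded):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, confusion_matrix

# max_depth=5 to keep the tree from overfitting
tree = DecisionTreeClassifier(max_depth=5, random_state=42)
tree.fit(X_train, y_train)

y_pred = tree.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))

# Visualize the fitted tree
plt.figure(figsize=(20, 10))
plot_tree(tree, feature_names=X.columns, class_names=["No Tumor", "Tumor"], filled=True)
plt.show()

# Which features the tree actually relied on
importances = pd.Series(tree.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head())
```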

Bagging Classifier

Next, I moved to the Bagging Classifier from scikit-learn's ensemble module. Boom! 98.4% accuracy. See the confusion matrix below: only 18 predictions did not match the true value.

I expected higher accuracy than the decision tree even before I started this model, because the Bagging Classifier is an averaging method: it builds several models independently and averages their predictions. I like this model for that reason, but another averaging method that came to mind after the bagging success was the Random Forest Classifier.
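
For reference, a minimal sketch of the bagging step (reusing the earlier split; the number of estimators here is just a reasonable choice on my part, and scikit-learn's default base estimator is a decision tree):

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Each base tree is trained on a bootstrap sample; predictions are combined by voting.
bagging = BaggingClassifier(n_estimators=100, random_state=42)
bagging.fit(X_train, y_train)

y_pred = bagging.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
```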

Random Forest Classifier

Random Forest (RF) is essentially a modified bagging method that averages a number of de-correlated trees. Random Forest models typically have lower variance because of the de-correlation introduced at every split of each tree. My RF model built 100 base models whose predictions were then averaged. Below is a snippet of the last 7 base-model recall scores along with summary statistics from the base models that make up the RF model.

As seen above, my RF model had a recall score of 97.2% and an accuracy score of 98.67%. This is excellent. The Random Forest model officially took the lead as the highest-accuracy machine learning algorithm for predicting the presence of a brain tumor. With such a success, I wanted to see whether my model could improve further with boosting.
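
Before moving on, here is a minimal sketch of the Random Forest step (100 trees as described; every other hyperparameter assumed to be the default):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score

# 100 de-correlated trees, averaged
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Recall:  ", recall_score(y_test, y_pred))
```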

AdaBoost Classifier

Boosting is different from bagging: instead of attempting to reduce variance, boosting models strive for low bias. It is a sequential method in which each new model tries to correct the previous models' mistakes. With the base decision trees capped at a max_depth of 4, my AdaBoost model achieved an accuracy of 98.14%: not an improvement over RF, but certainly not much of a deterioration either. Let's try one more model…
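
A sketch of the AdaBoost step, again reusing the earlier split (the base trees are capped at max_depth=4 as above; the number of estimators is my assumption):

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Boost shallow decision trees; on scikit-learn < 1.2 the keyword is
# base_estimator rather than estimator.
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=4),
    n_estimators=100,  # assumed; not stated above
    random_state=42,
)
ada.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, ada.predict(X_test)))
```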

XGBoost

Renowned as one of the best machine learning algorithms, XGBoost was certainly saved for last :). My XGB model had a root mean square error of only 0.11. Even more exciting than the accuracy are the feature weights used by the XGB model. Among the first 6 features of importance in this model, 4 were first-order attributes in the training set. Not only has the XGBoost model accurately predicted the absence or presence of a brain tumor with an RMSE of only 0.11, it has done so by weighing the features reliably.

The XGBoost model is far superior to the other models because its feature-importance ranking actually reflects the first-order attributes of the dataset.
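
For reference, a minimal sketch of the XGBoost step and the feature-importance check (hyperparameters assumed to be defaults; RMSE computed on the 0/1 predictions):

```python
import numpy as np
import pandas as pd
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, mean_squared_error

xgb = XGBClassifier(random_state=42)
xgb.fit(X_train, y_train)

y_pred = xgb.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("RMSE:    ", np.sqrt(mean_squared_error(y_test, y_pred)))

# Which features the booster leaned on most heavily
importances = pd.Series(xgb.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(6))
```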

Discussion and Conclusion

After analyzing the data and building various models to find one that could accurately predict brain tumors, I found that my most accurate models came down to the Random Forest and XGBoost models. Each achieved roughly 99% accuracy in identifying whether or not a brain tumor was present. However, I must admit that the XGBoost model takes the 1st place prize in predicting brain tumors because of its feature-importance ranking. The RF model lists Energy, Entropy, ASM, and Homogeneity as its top features; however, each of those is a second-order attribute in the dataset. The XGBoost model's ranking of feature importance aligns far better with the first-order attributes, and thus it is a more reliable model in the long term.
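
One way to make that comparison concrete, assuming the rf and xgb models from the sketches above, is to line up the two importance rankings side by side:

```python
import pandas as pd

# Side-by-side view of the two models' feature importances
comparison = pd.DataFrame({
    "random_forest": pd.Series(rf.feature_importances_, index=X.columns),
    "xgboost": pd.Series(xgb.feature_importances_, index=X.columns),
})
print(comparison.sort_values("xgboost", ascending=False))
```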

Click here for the Dataset Source:

Github Repository: click here

As always, thank you for reading! I appreciate it :)
