Apache Spark MLlib Tutorial
In this part of the series, we will put together everything we have learned to train a classification model. The objective is to learn how to build a complete classification workflow from the beginning to the end.
The problem we are going to solve is the infamous Titanic Survival Problem. We are asked to build a machine learning model that takes passenger information and predicts whether the passenger survived.
Preparing the Development Environment
You should be familiar with this step by now. We will open a new Jupyter notebook, import and initialize findspark, create a Spark session, and finally load the data.
Here is an example of how someone might select and update features after examining the data:
- It does not make sense to include some features such as: PassengerID, Name and Ticket → we will drop them
- Cabin has a lot of null values → we will drop it as well
- Maybe the Embarked column has nothing to do with the survival → let us remove it
- We are missing 177 values from the Age column → Age is important, we need to find a way to deal with the missing values
- Gender has nominal values → need to encode them.
We will deal with the transformations one by one. In a future article, I will discuss how to improve the process using pipelines. But let us do it the boring way first.
Calculating Age Missing Values
Age is an important feature; it is not wise to drop it because of some missing values. What we could do is to fill missing values with the help of existing ones. This process is called Data Imputation. There are many available strategies, but we will follow a simple one that fills missing values with the mean value calculated from the sample.
MLlib makes the job easy with the Imputer class. First, we define the estimator, then we fit it to the data to obtain a model, and finally we apply that model (a transformer) to the data.
No more missing values! Let us continue to the next step…
Encoding Gender Values
We learned that machine learning algorithms cannot work directly with categorical features stored as strings. So, we need to index the Gender values, mapping each category to a number:
Creating the Features Vector
We learned previously that MLlib expects data to be represented in two columns: a features vector and a label column. We have the label column ready (Survived), so let us prepare the features vector.
Training the Model
We will use a Random Forest Classifier for this problem. You are free to choose any other classifier you see fit.
- Create an estimator
- Specify the name of the features column and the label column
- Fit the model
We need to calculate some metrics to get the overall performance of the model. Evaluation time…
We will use a BinaryClassificationEvaluator to evaluate our model. It needs to know the name of the label column and the metric name.
Given that we did nothing to tune the hyperparameters, the initial results are promising. I know that I did not evaluate the model on a test set, but I trust you can do it.
Model Evaluation with scikit-learn
If you want to generate other evaluations such as a confusion matrix or a classification report, you could always use the scikit-learn library.
You only need to extract y_true and y_pred from your DataFrame.
Congrats! You have successfully completed another tutorial. You should be more confident with your MLlib skills now. In future tutorials, we are going to improve the preprocessing phase by using pipelines, and I will show you more exciting MLlib features. Stay tuned…