This is the third project in the Udacity Predictive Analytics for Business course.
In this section, we learn to solve classification problems using classification models. There are two types of classification problem:
- Binary – a yes/no answer to the problem, e.g. "Should the student be accepted into this university?"
- Non-binary – multiple possible answers, e.g. "What color shirt looks best on me?"
We learn to use four types of model in Alteryx to predict an outcome:
- Logistic Regression
- Decision Tree
- Forest Model
- Boosted Model
You'll see how each of these models differs in the project below.
As always in a Udacity project, we work through the project in four steps:
Step 1: Business and Data Understanding
What decisions need to be made?
Of the 500 loan applications, who should be approved for a loan?
What data is needed to inform those decisions?
Data from past applications can help inform these decisions. See the data on all past applications.
Looking at the above linked sheet, we could use the customer's current length of employment, income, credit score, etc. to help inform these decisions.
What kind of model (Continuous, Binary, Non-Binary, Time-Series) do we need to use to help make these decisions?
This is a binary classification model, as we are trying to determine whether each loan application is approved or denied.
Step 2: Building the Training Set
During my data cleanup process, I removed the Duration-in-Current-Address field because it had a lot of missing data. I also removed the Concurrent-Credits and Occupation fields because all their values were the same (completely uniform). Other fields removed were Foreign-Workers, No-of-Dependents, Guarantors, and Purpose (due to low variability). Age had a few missing values, which I imputed with the median of 33, since age seems likely to play a part in whether an application gets approved. You can view the clean data here.
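The cleanup rules above (drop uniform fields, impute Age with the median) can be sketched in pure Python. This is a minimal illustration; the field names match the writeup but the records are made up, not the real dataset:

```python
# Hedged sketch of the cleanup rules: drop a completely uniform field,
# impute missing Age with the median. Records are illustrative only.
from statistics import median

records = [
    {"Age": 35, "Occupation": 1},
    {"Age": None, "Occupation": 1},
    {"Age": 31, "Occupation": 1},
]

# Drop the Occupation field if every record holds the same value
if len({r["Occupation"] for r in records}) == 1:
    for r in records:
        r.pop("Occupation", None)

# Impute missing Age values with the median of the known ages
known_ages = [r["Age"] for r in records if r["Age"] is not None]
age_median = median(known_ages)
for r in records:
    if r["Age"] is None:
        r["Age"] = age_median
```

In Alteryx this is done with the Select and Imputation tools; the sketch just shows the same two rules as plain code.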
Step 3: Train your Classification Models
For this project, we will use 70% of the data to create the estimation set and 30% to create the validation set, with the random seed set to 1 in the Create Samples tool. Here is how our workflow looks in Alteryx:
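The 70/30 split with a fixed seed can be sketched with the standard library. This is only an illustration of the idea behind the Create Samples tool, not its actual implementation (Alteryx's sampling algorithm may differ):

```python
# Minimal sketch of a seeded 70/30 estimation/validation split,
# mirroring the Create Samples tool with random seed = 1.
import random

records = list(range(500))  # stand-in for the 500 cleaned applications
random.seed(1)              # fixed seed makes the split reproducible
random.shuffle(records)

cutoff = int(len(records) * 0.70)
estimation, validation = records[:cutoff], records[cutoff:]
print(len(estimation), len(validation))  # 350 150
```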
We connect the Create Samples tool's estimation output to the four predictive tools (Logistic Regression, Decision Tree, Forest Model & Boosted Model). Our target variable is the credit application result, and our predictor variables are everything else in the cleaned data sheet. To determine which of the four models to use, we join the output of each model to a Model Comparison tool (which can be downloaded from the Alteryx Gallery here).
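The same four-model comparison can be sketched with scikit-learn standing in for the Alteryx predictive tools. This is an assumption-laden illustration: the dataset here is synthetic (via `make_classification`), not the loan data, and the scikit-learn estimators are analogues of, not identical to, the Alteryx tools:

```python
# Hedged sketch: train four classifier types and compare validation
# accuracy, analogous to the Alteryx Model Comparison workflow.
# Synthetic data only; not the project's real loan applications.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, n_features=8, random_state=1)
X_est, X_val, y_est, y_val = train_test_split(
    X, y, test_size=0.30, random_state=1  # 70/30 split, seed 1
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=1),
    "Forest": RandomForestClassifier(random_state=1),
    "Boosted": GradientBoostingClassifier(random_state=1),
}
scores = {}
for name, model in models.items():
    model.fit(X_est, y_est)
    scores[name] = accuracy_score(y_val, model.predict(X_val))
    print(f"{name}: {scores[name]:.2f}")
```

Picking the model with the highest validation accuracy in `scores` mirrors what the Model Comparison tool's report lets you do by eye.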
Let’s take a look at the variable importance plot for each of these models:
In the decision tree model, Account Balance, Value Savings Stocks & Duration of Credit Month are the most significant:
In the forest model, Credit Amount, Age, and Duration of Credit Month are more important:
In the boosted model, we can see that Credit Amount and Account Balance are more important:
And in the logistic regression model, we can see the variables with the lowest p-values in the table below:
Here's a side-by-side comparison of all four models:
A comparison of the confusion matrices (you can learn more about these here) is below:
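The figures in those matrices reduce to four counts, from which the accuracy metrics in the comparison report are derived. A minimal sketch, using made-up predictions rather than the project's actual validation output:

```python
# Sketch of how confusion-matrix counts and the per-class accuracy
# figures are computed. Labels here are illustrative, not real output.
def confusion_counts(actual, predicted, positive="Creditworthy"):
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    return tp, fn, fp, tn

actual    = ["Creditworthy", "Creditworthy", "Non-Creditworthy", "Non-Creditworthy"]
predicted = ["Creditworthy", "Non-Creditworthy", "Creditworthy", "Non-Creditworthy"]
tp, fn, fp, tn = confusion_counts(actual, predicted)

accuracy = (tp + tn) / (tp + fn + fp + tn)   # overall accuracy
acc_creditworthy = tp / (tp + fn)             # accuracy on the Creditworthy class
acc_non_creditworthy = tn / (tn + fp)         # accuracy on the Non-Creditworthy class
```

The "Accuracy_Creditworthy" figure in the Alteryx report corresponds to the per-class calculation shown here: correct predictions for that class divided by all actual members of the class.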
Step 4: Writeup
After comparing the four models (Logistic Regression, Decision Tree, Forest & Boosted), I chose the Forest model, as it had the highest accuracy against my validation set at 0.79. It also had the highest accuracy on the Creditworthy segment at 0.95, and an accuracy of 0.42 on the Non-Creditworthy segment. You can view the model comparison report generated by Alteryx here.
The ROC curve below compares all the models, with the Forest model producing the best result:
Scoring the applicants
Since we are using the Forest model, we connect an Output Data tool to the model-object output of the Forest Model tool (ModObjForest.yxdb). To get a score for each applicant, we use the Score tool in Alteryx; see the workflow below:
Note: ensure you convert the fields in customers-to-score to the correct data types. Once you run the workflow, the output file will have a score for each applicant. Any applicant with a score of 0.5 or higher is approved. In this case, we approved 408 of the 500 applicants. You can see the sheet with the score for each applicant here.
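The approval rule applied to the scored output is a simple threshold. A minimal sketch, with made-up scores rather than the real applicant scores:

```python
# Sketch of the >= 0.5 approval rule applied to the Score tool's output.
# Scores here are illustrative, not the project's actual results.
scores = [0.82, 0.47, 0.91, 0.50, 0.33]
approved = [s for s in scores if s >= 0.5]
print(f"Approved {len(approved)} of {len(scores)} applicants")
```

Applying the same rule to all 500 scored applications is what yields the 408 approvals reported above.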