This is the first project in the Udacity Predictive Analytics for Business course, not including the mini refresher project.
First of all, here is a summary of what we learnt in this section:
The Analytical Problem
The two key analytical concepts we need to know to understand any business situation:
- Cross Industry Standard Process for Data Mining (CRISP-DM)
- Predictive Methodology Map
The CRISP-DM (which we’ll call the problem solving framework) consist of 6 steps:
And the Methodology Map, which is a guide to determine the appropriate analytical technique to solve a particular business question or problem. We’ll be going through this throughout the course. There are two main scenarios for a business problem:
- Data analysis – Where you have the data and you want to answer questions base on this data, eg. “On average, how many people order coffee and a donut per transaction in my store in any given week?”.
- Predictive analysis – Where you want to predict future behaviour base on existing data, eg. “Given the average coffee order, how much coffee can I expect to sell next week if I were to add a new brand of coffee?”.
Back to the project, where we’ll be creating linear regression models to predict sales data. We’ll need to use the Alteryx tool. The two data files are: p1-customers & p1-mailinglist. (note: this is fake user data, obviously).
We need to predict the expected profit from the 250 new customers in the mailing list file. Unless we can expect a profit of >$10,000, we will go ahead with printing the catalogs to send to these new customers. So because we are data rich, and trying to solve a numerical continuous problem, we will be using the linear regression model.
- The costs of printing and distributing is $6.50 per catalog.
- The average gross margin (price – cost) on all products sold through the catalog is 50%.
We need to make sure to multiply our revenue by the gross margin first before subtracting out the $6.50 cost when calculating profit.
Step 1: Business and Data Understanding
What decisions needs to be made?
Should we send these new catalogs to the 250 new customers in the mailing list? If yes, how much profit can we expect to earn?
What data is needed to inform those decisions?
We need to calculate the average number of sale amount per customer. We can then get expected revenue by multiplying the average number of sale with the score value. Once we have this, we can find profit by multiplying average gross margin (0.5) and then subtracting 6.50 (cost of printing and distributing per catalog).
Step 2: Analysis, Modeling, and Validation
How and why did you select the predictor variables in your model? You must explain how your continuous predictor variables you’ve chosen have a linear relationship with the target variable.
I have selected the average number of products and customer segment as the predictor variables. Here is a scatterplot between Average sales amount vs average number of products:
As we can see, there is some correlation between Average sales amount and average number of products purchased here. The correlation calculated to be 0.86 (to calculate correlation in Excel, we can just use the =CORREL(Col1,Col2) formula). The other predictor variable is the customer segment, which is a categorical variable and is separated by 4 categories, we can predict better depending if a customer is a loyalty club or in the store mailing list.
Explain why you believe your linear model is a good model. You must justify your reasoning using the statistical results that your regression model created. For each variable you selected, please justify how each variable is a good fit for your model by using the p-values and R-squared values that your model produced.
This model is a good model as with these predictor variables, we get a multiple R-squared of 0.8369 and an adjusted R-Squared of 0.8366 which is a high number (>0.7). The P value generated for both these variables is < 0.05 (it is < 2.2e^16). This report is generated on Alteryx, see here:
You will want to use the linear regression tool, choose your target variable (in this case, avg_sale_amount as we are trying to find the sale amount) and your predictor variable, in this case, customer segment and avg_num_products_purchased.
What is the best linear regression equation based on the available data? Each coefficient should have no more than 2 digits after the decimal (ex: 1.28)
Note: To use categorical variables in regression is to use what are called dummy variables. A dummy variable can only take on two values, generally zero or one. You would add one dummy variable for one less than the number of unique values in the categorical variable. So if the variable is binary, you’d add one dummy. If there are four categories, you’d add three dummy variables.
Y = 303.46 + (66.98*Average Number of Products) + (-149.36 * Loyalty Club) + (281.84 * Loyalty Club and Credit Card) + (-245.42 * Store Mailing List) + 0 (if Credit Card)For example if we look at the Mailing list data for user Amanda Donahoe (who is a credit card only customer), using this model, her expected sale amount is:
303.46 + (66.98*7)+(-149.36*0)+(281.84*0)+(-245.42*0)+0 = $772.32
Step 3: Presentation/Visualization
What is your recommendation? Should the company send the catalog to these 250 customers?
After conducting an analysis on the data, I would recommend that the company send out the catalog to these 250 customers as there is a greater than $10,000 profit. You can see how profit is calculated in this sheet.
How did you come up with your recommendation?
I have created a linear regression model, using the average sales amount as the target variable and using 2 predictors variable. The average number of products bought per customer and customer segment (this is a categorical variable). After running this in Alteryx, I can use the coefficients provided to predict expected revenue (multiply revenue by the score rate) from each customer in the mailing list, and then I can use this to predict profit (subtracting 6.50 from expected revenue). You can see how expected revenue is calculated in this sheet.
What is the expected profit from the new catalog (assuming the catalog is sent to these 250 customers)?
I have calculated that the expected profit will be $21987.96. Again, you can see how profit is calculated in this sheet.