Combining Predictive Techniques



This is the final project in the Udacity Predictive Analytics Nanodegree. This project is separated into 3 sections. The first section is on clustering, the second section is on classification model, and the third section is on time series forecasting. There is also a whole section on time series forecasting that comes along with a practise project in this Nanodegree.


Problem: The company currently has 85 stores and is planning to open 10 new stores at the beginning of the year. Currently, all stores are treated similarly, shipping same product to each store. Our task is to provide analytical support to make decisions about store formats and inventory planning.

We need to determine the optimal number of store formats based on sales data. In order to do this, we will need to:

  1. Sum sales data by StoreID and Year
  2. Use percentage sales per category per store for clustering (category sales as a percentage of total store sales).
  3. Use only 2015 sales data.
  4. Use a K-means clustering model.

First, we want to prep our data. We will use these 2 files. The store information file, found here & the store sales data file, found here. Using Alteryx, we can get a clean file with the percentage sales per category per store data for the year 2015.

You can see the pct_sales_category_per_store data here.

Next, we can use the cluster tools on Alteryx to get our cluster analysis (sort stores into clusters)

For both the K-Centroid Diagnostics and Analysis tool, we’ll tick all the percentage fields and use z-score. Our clustering method will be K-Means. We will set it at 3 clusters and number of starting seeds at 10. You can find the clusteredanalysis file here.

Task 1: Determine Store Formats for Existing Stores


What is the optimal number of store formats? How did you arrive at that number?

The optimal number of store formats is 3. This is because it has high median values within both the AR and CH index and smaller spread, showing compactness. We can get this through using the K-Centroids diagnosis tool on Alteryx.

How many stores fall into each store format?


Cluster 1 has 23 store, Cluster 2 has 29 stores and Cluster 3 has 33 stores. (See sheet here, column U)

Based on the results of the clustering model, what is one way that the clusters differ from one another?


Stores that fall under cluster 2 for example would want an increase in Dairy inventory as compared to stores that fall under cluster 1 & 3.


Please provide a Tableau visualization (saved as a Tableau Public file) that shows the location of the stores, uses color to show cluster, and size to show total sales.


Tableau Visualization Link here:!/vizhome/GroceryStoreClusters_15552236106380/Sheet1?publish=yes

To set this up use the clustered analysis sheet generated, see screenshot below for settings on Tableau:


Task 2: Formats for New Stores


What methodology did you use to predict the best store format for the new stores? Why did you choose that methodology? (Remember to Use a 20% validation sample with Random Seed = 3 to test differences in models.)

This task is a good refresher for classification modelling. Remember back to project default risk, the steps here are similar. For this, we will use the clustered analysis file here, and the store demographic file here. The Alteryx workflow is similar to project default risk:

To figure out which model to use, we will union the 3 models and use a model comparison tool to generate a report for us.

We can see here that the Boosted Model has the highest F1 score and highest average accuracy rate across Accuracy_1, 2 and 3, so that is what we will use to score our 10 new scores and assign it to either cluster 1,2 or 3. Create an output on the boosted model tool (ModObjBoosted.yxdb) and let’s move onto the next step.


Note: We can also use the forest model.


What format do each of the 10 new stores fall into? Please fill in the table below.

The Alteryx workflow to score the store is similar to what we did with the predicting default risk project. We will use the ModObjBoosted.yxdb file created from previous steps, and the stores 86 – 95 demographic data, you can find this here, highlighted in yellow at the bottom (just copy 86-95 and paste it into a separate file).

Once you run this workflow, you will get 3 additional columns on a file similar to storestoscore.csv, you can see here, for more details. The columns at the end (in yellow, is what Alteryx output after running the scoring tool, and the ones highlighted in red is what segment/cluster we will add the new stores into as it has the highest score.


Store NumberSegment

Task 3: Predicting Produce Sales


Because fresh produce has a short life span, and due to increasing costs, we want to have an accurate monthly sales forecast. We need to prepare a monthly forecast for produce sales for the full year of 2016 for both existing and new stores.


We will use a 6 month holdout sample for the TS Compare tool (due to lack of data, we won’t use 12 months). There are 3 steps to go about this task.


Step 1: To forecast produce sales for existing stores we will aggregate produce sales across all stores by month and create a forecast.


Step 2: To forecast produce sales for new stores:


  • Forecast produce sales (not total sales) for the average store (rather than the aggregate) for each segment.
  • Multiply the average store produce sales forecast by the number of new stores in that segment.
  • For example, if the forecasted average store produce sales for segment 1 for March is 10,000, and there are 4 new stores in segment 1, the forecast for the new stores in segment 1 would be 40,000.
  • Sum the new stores produce sales forecasts for each of the segments to get the forecast for all new stores.

Step 3: Sum the forecasts of the existing and new stores together for the total produce sales forecast.


What type of ETS or ARIMA model did you use for each forecast? Use ETS(a,m,n) or ARIMA(ar, i, ma) notation. How did you come to that decision?

To get this answer, we want to compare the ETS and ARIMA model on Alteryx. We will need to use the store sales data for this workflow.

Remember, in the summary tool, we GroupBy year and month and sum produce. We create a record tool, so we can filter out record < = 40 (from the 46 record that we will get after running the summarise tool). Our Target field is Sum of Produce and frequency monthly. Once running the workflow, we will take the interactive report generated by the time series plot tool.

I have chosen to use ETS (M,N,M) model for each of my forecast. We can see in the decomposition plot above there is no trend, seasonal is multiplicative and error is multiplicative. After comparing the results against the holdout sample, the ETS performs better against the ARIMA model. Still confused by this? Check out this free course by Udacity here.

Please provide a table of your forecasts for existing and new stores. Also, provide visualization of your forecasts that includes historical data, existing stores forecasts, and new stores forecasts.


MonthNew StoresExisting Stores
Jan 162,587,45121,539,936
Feb 162,477,35320,413,771
Mar 162,913,18524,325,953
Apr 162,775,74622,993,466
May 163,150,86726,691,951
Jun 163,188,92226,989,964
July 163,214,74626,948,631
Aug 162,866,34924,091,579
Sep 162,538,72720,523,492
Oct 162,488,14820,011,749
Nov 162,595,27021,177,435
Dec 162,573,39720,855,799


To find the existing stores numbers on Alteryx, we can use the store sales data file and utilise the ETS and TS Forecast tool. This is pretty straightforward. You can see the workflow below:

We have a summarise tool where we GroupBy Year & Month and have the Sum of Proudce. Our target field in the ETS model is Sum Produce with a monthly frequency. For our TS Forecast tool, we have a 95% large and 80% small confidence interval (default settings) and 12 periods into the future (as requested by the project). The basic table tool is to neatly summarise our results. Once you run this workflow, you will able to match the existing stores data in the table above.


Ok, what if we want to forecast the new stores sales data? This is where it gets a bit confusing, and I had to seek help from mentors on Udacity but here is the Alteryx workflow:

We need the clustered analysis file generated from task 1. This is because we are predicting/forecasting sales data. The first summarise tool, we need to GroupBy Store, Cluster, Year, Month and have Sum Produce. Then the next summarise tool we will GroupBy Cluster, Year, Month and Average Sum Produce. Now we can filter out by cluster 1,2 &3. Since we are using the ETS model, we will link the ETS model to each of the filter and select target field as Average Sum Produce, with a frequency of month. Our TS forecast will use the default setting, with 12 number of periods. Once we add our forecast formula, we then union the 3 sets and summarise it again, Grouping by Period and Sub Period and Sum of forecast. The Basic Table is to neatly summarise our results. Once you run this workflow, you will get the above new store data in the table above.


Visualisation here:!/vizhome/producesale_15554118631260/Sheet1?publish=yes


To set this up, you’ll need to create a sheet with 3 columns, date, sales value and type of sales (either historical, existing or new). See sheet here for example. You can then use Tableau and set it up like this:

For a quick crash course on Tableau and Data Visualization, check out this free course by Udacity here.

If you would like my Alteryx workflow or have questions, feel free to contact me. This was a fun and challenging project. Now onto some Python!

Feel free to message me if you need tech help, have work opportunities, or if you want restaurant recommendation in Sydney! Contact Icon