For this project in the Udacity Predictive Analytics Nanodegree, we need to create a clean analytical dataset to support a predictive model that selects the best location for a new pet store (Pawdacity) in Wyoming. It is fairly straightforward, and you can get through it without using Alteryx.
Datasets provided are:
- Monthly Sales for all Pawdacity stores for the year 2010.
- NAICS data on competitor stores (12 months' worth)
- Population numbers
- Demographic data
As always, we first want to understand the business decision we are trying to make:
Step 1: Business and Data Understanding
What decisions need to be made?
Based on predicted yearly sales, in which Wyoming city should Pawdacity open its new store?
What data is needed to inform those decisions?
Monthly sales data of each Pawdacity store and demographic data for each city and county in the state of Wyoming.
Step 2: Building the Training Set
We want to build a dataset that will help us fit a regression model to predict sales for the new pet store location. From the four provided datasets, we will build a clean training set with the following columns:
- 2010 Census Population
- Total Pawdacity Sales
- Households with under 18
- Land Area
- Population Density
- Total Families
There are two ways to build this: we can manually pull the numbers from the provided datasets (definitely not ideal for larger datasets), or we can use a tool like Alteryx. If you want to learn more on how to do this with Alteryx, check out this course here, for free.
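If you would rather script it than do it by hand or in Alteryx, the core of the job is to sum the monthly sales per city and attach the demographic columns. Here is a minimal sketch in plain Python. The field names (`city`, `sales`, `census_population`, `land_area`), the placeholder cities, and all the numbers are made up for illustration; the real files use their own column names and values.

```python
from collections import defaultdict

# Hypothetical rows standing in for the monthly sales file (not real data).
monthly_sales = [
    {"city": "City A", "month": "January", "sales": 100},
    {"city": "City A", "month": "February", "sales": 200},
    {"city": "City B", "month": "January", "sales": 50},
    {"city": "City B", "month": "February", "sales": 70},
]

# Hypothetical demographic records keyed by city (not real data).
demographics = {
    "City A": {"census_population": 1000, "land_area": 50.0},
    "City B": {"census_population": 400, "land_area": 20.0},
}


def build_training_set(sales_rows, demo_by_city):
    """Sum monthly sales per city and attach that city's demographic fields."""
    totals = defaultdict(int)
    for row in sales_rows:
        totals[row["city"]] += row["sales"]

    training = []
    for city, total in sorted(totals.items()):
        demo = demo_by_city.get(city)
        if demo is None:
            # No demographic record for this city, so it cannot be a complete row.
            continue
        training.append({"city": city, "total_sales": total, **demo})
    return training


training_set = build_training_set(monthly_sales, demographics)
for row in training_set:
    print(row)
```

Cities that appear in the sales data but have no demographic record are skipped here. On a real dataset you would probably want to log them instead, since a mismatch in city spelling is a common reason for a failed join.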
You can see my clean dataset that I will use for my regression model, here.
Step 3: Dealing with outliers
Are there any cities that are outliers in the training set? Which outlier have you chosen to remove or impute? Because this dataset is a small data set (11 cities), you should only remove or impute one outlier. Please explain your reasoning.
You can find my dataset here
Two cities have outlying values in this training set. The total Pawdacity sales for Cheyenne and Gillette are both above the upper fence, which makes them outliers, but given the high population of both cities, the sales totals make sense, so I will not remove either value. The population density of Cheyenne, however, is 20.34, which is above the upper fence and well above the average. This seems quite high, so I will impute it, replacing the original 20.34 with the average population density across the 11 cities, which is 5.71.
Note: To impute means to replace a value with a substitute, such as the mean.
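The fence check and the mean imputation described above can be sketched in a few lines of Python. This is only an illustration: the numbers are toy values, not the Wyoming data, and the helper names `fences` and `impute_outliers_with_mean` are my own. The sketch assumes the replacement value is the mean of all values in the column, and it uses the default quartile method of `statistics.quantiles`, which can give slightly different quartiles than Excel or Alteryx.

```python
from statistics import mean, quantiles


def fences(values):
    """Return (lower_fence, upper_fence) using the 1.5 * IQR rule."""
    q1, _, q3 = quantiles(values, n=4)  # default "exclusive" quartile method
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def impute_outliers_with_mean(values):
    """Replace any value outside the fences with the mean of all values."""
    lower, upper = fences(values)
    average = mean(values)  # assumption: the mean includes the outlier itself
    return [average if (v < lower or v > upper) else v for v in values]


# Toy values for 11 hypothetical cities (not the real population densities).
densities = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 40]

lower, upper = fences(densities)
outliers = [v for v in densities if v < lower or v > upper]
cleaned = impute_outliers_with_mean(densities)

print("fences:", lower, upper)
print("outliers:", outliers)
print("cleaned:", cleaned)
```

Note that the function imputes every value outside the fences. Since the project asks for only one outlier to be handled, in practice you would look at the flagged values first and decide which single one to change.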
To learn how to find the IQR, visit: https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/interquartile-range/
To learn about upper and lower fences, visit: https://www.statisticshowto.datasciencecentral.com/upper-and-lower-fences/