Rossmann Store Sales Analysis & Forecasting

Posted by Christopher Mertin on July 24, 2017 in Project • 16 min read

Rossmann operates over 3,000 drug stores in 7 European countries. They released some of their data online for free in hopes of people building a statistical/machine learning model to help with sales forecasting. Forecasting sales is important since it allows store managers to determine how much staff to hire on a given day. Rossmann stores provided four files.

  1. train.csv: Historical data including sales
  2. test.csv: Historical data excluding sales
  3. sample_submission.csv: A sample submission file in the correct format
  4. store.csv: Supplemental information about the stores

Since the competition had already closed, I used only train.csv, from which I created my own training and test sets (to validate my models), and store.csv, which supplies additional features for building better predictive models.

Feature creation

The first and most important step was feature creation. store.csv contains information about each of the stores, which is useful for adding features, since more features usually lead to better models as long as you guard against overfitting.

First of all, reading in the data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv("train.csv", low_memory=False)
store = pd.read_csv("store.csv")

# Consider only open stores for training
df = data[data["Open"] != 0]
# Use only sales > 0
df = df[df["Sales"] > 0]

In building the models, we only look at stores that were "Open" and that had more than 0 sales on a given day. This helps reduce the MSE, and there is no point in attempting to predict sales when a store is closed.

For creating the features, I used the following function

def BuildFeatures(data):
    # Fill missing "Open" values before the global fillna so they become 1, not 0
    data.loc[data.Open.isnull(), "Open"] = 1
    # Remove remaining NaNs
    data.fillna(0, inplace=True)

    data["Date"] = pd.to_datetime(data["Date"], errors="coerce")

    # Label encode some features
    mappings = {'0':0, 'a':1, 'b':2, 'c':3, 'd':4}
    data.StoreType.replace(mappings, inplace=True)
    data.Assortment.replace(mappings, inplace=True)
    data.StateHoliday.replace(mappings, inplace=True)

    data["Year"] = data.Date.dt.year
    data["Month"] = data.Date.dt.month
    data["Day"] = data.Date.dt.day
    data["DayOfWeek"] = data.Date.dt.dayofweek
    data["WeekOfYear"] = data.Date.dt.weekofyear

    # Month abbreviations as used in the PromoInterval strings (e.g. "Jan,Apr,Jul,Oct")
    month2str = {1:'Jan', 2:'Feb', 3:'Mar', 4:'Apr', 5:'May', 6:'Jun',
                 7:'Jul', 8:'Aug', 9:'Sept', 10:'Oct', 11:'Nov', 12:'Dec'}
    data["monthStr"] = data.Month.map(month2str)

    # CompetionOpen and PromoOpen from https://www.kaggle.com/ananya77041/rossmann-store-sales/randomforestpython/code
    # Competition open time in months
    data["CompetitionOpen"] = 12 * (data.Year - data.CompetitionOpenSinceYear) + (data.Month - data.CompetitionOpenSinceMonth)
    # Promo open time in months
    data["PromoOpen"] = 12 * (data.Year - data.Promo2SinceYear) + (data.WeekOfYear - data.Promo2SinceWeek) / 4.0
    data["PromoOpen"] = data.PromoOpen.apply(lambda x: x if x > 0 else 0)
    data.loc[data.Promo2SinceYear == 0, "PromoOpen"] = 0

    # Indicate that sales on that day are in the promo interval
    data.loc[data.PromoInterval == 0, "PromoInterval"] = ''
    data["IsPromoMonth"] = 0
    for interval in data.PromoInterval.unique():
        if interval != '':
            for month in interval.split(','):
                data.loc[(data.monthStr == month) & (data.PromoInterval == interval), "IsPromoMonth"] = 1

    return data

The above function engineers features from the raw data so they are easier for the learning algorithms to use. For example, it splits the YYYY-MM-DD date into separate features such as Year, Month, and Day. It also calculates the time (in months) that the nearest competitor has been open, converts categorical features to numerical ones, and creates a boolean indicating whether the store is currently running a promotion.

In order to use this data, first we have to merge the store data and then we can build the relevant features.

df = pd.merge(df, store, on="Store")
df = BuildFeatures(df)

The above code merges the two dataframes df and store on the "Store" variable, and then builds the new features with the function BuildFeatures.

Exploratory Analysis

Our data frame data holds the “vanilla” dataset, without the closed-store and zero-sales rows removed, while df holds the engineered features that we will use to predict sales. data is kept around for the exploratory analysis.

First, looking at how many stores are open versus closed for each day of the week, we get

fig, (ax1) = plt.subplots(1,1,figsize=(15,4))
sns.countplot(x="Open", hue="DayOfWeek", edgecolor="black", data=data, ax=ax1)
plt.savefig("Sales_By_Day.png", bbox_inches="tight")

Open Closed

This shows that the majority of stores are open for the first six days of the week, while on the seventh day most stores are closed. This most likely means that there are essentially no sales on Sunday.

We can also look at how the sales changed over the course of the data, showing

# Create Year, Month, and Day columns
data["Year"]  = data["Date"].apply(lambda x: int(str(x)[:4]))
data["Month"] = data["Date"].apply(lambda x: int(str(x)[5:7]))
#data["Day"] = data["Date"].apply(lambda x: int(str(x)[8:]))

# Assign Date column to Date (Year-Month) instead of (Year-Month-Day)
data["Date"] = data["Date"].apply(lambda x: (str(x)[:7]))

# Group by date and get average sales, and percent change
average_sales    = data.groupby("Date")["Sales"].mean()
pct_change_sales = data.groupby("Date")["Sales"].sum().pct_change()

fig, (axis1, axis2) = plt.subplots(2,1,sharex=True, figsize=(15,8))

# Plot average sales over time(year-month)
ax1 = average_sales.plot(legend=True,ax=axis1,marker='o', title="Average Sales")
ax1.set_xticks(range(len(average_sales)))
ax1.set_xticklabels(average_sales.index.tolist(), rotation=60)

# Plot percent change for sales over time (year-month)
ax2 = pct_change_sales.plot(legend=True, ax=axis2, marker='o', rot=60, colormap="summer", title="Sales Percent Change")
vals = ax2.get_yticks()
ax2.set_yticklabels(["{:3.1f}%".format(x*100) for x in vals])

plt.savefig("Sales_Chart.png", bbox_inches="tight")

Sales History

The above chart shows that the overall trend of sales is increasing over the years. The percentage change is relative to the prior month, and it shows that the most common change is a positive gain rather than a negative one. To make this easier to see visually, we can look at the mean sales and customers per year, giving

# Plot average sales & customers for every year
fig, (axis1, axis2) = plt.subplots(1, 2, figsize=(15,4))

sns.barplot(x="Year", y="Sales", data=data, edgecolor="black", ax=axis1)
axis1.set_ylabel("Mean Sales")
sns.barplot(x="Year", y="Customers", data=data, edgecolor="black", ax=axis2)
axis2.set_ylabel("Mean Customers")
plt.savefig("Mean_Sales_Cutomers.png", bbox_inches="tight")

Mean Number of Customers

These charts show the mean sales and the mean number of customers for each year. Mean sales increased across all three years, though growth stagnated slightly, which is likely due to the mean number of customers decreasing from 2014 to 2015.

We can get more insight into the customer trend by looking at the same kind of historical plot we made for sales.

fig, (axis1, axis2, axis3) = plt.subplots(3, 1, figsize=(15,12), sharex=False)

# Plot max, min values, & 2nd, 3rd quartile
sns.boxplot([data["Customers"]], whis=np.inf, ax=axis1)


# Group by date and get average customers, and percent change
average_customers = data.groupby("Date")["Customers"].mean()
pct_change_customers = data.groupby("Date")["Customers"].sum().pct_change()

# Plot average customers over the time
ax = average_customers.plot(legend=True,marker='o', ax=axis2, sharex=axis3)
ax.set_xticks(range(len(average_customers)))
ax.set_title("Mean Number of Customers Per Month")

ax = pct_change_customers.plot(color='g', legend=True, marker='o', ax=axis3)
ax.set_xticks(range(len(average_customers)))
vals = ax.get_yticks()
ax.set_yticklabels(["{:3.1f}%".format(x*100) for x in vals])
ax.set_xticklabels(pct_change_customers.index.tolist(), rotation=60)
ax.set_title("Percent Change of Customers Per Month")

# Force create the axis for the box plot
axis1.xaxis.set_visible(True)
minx = min(data["Customers"])
maxx = max(data["Customers"])
range_step = int((maxx - minx)/8)
ticklab = [0] + list(range(minx, maxx, range_step))
axis1.set_xticklabels(ticklab, visible=True)

plt.savefig("Customer_Change.png", bbox_inches="tight")

Change in customers over time

The above shows the distribution of the number of customers per month, as well as their growth over the years. While it is correlated with sales, the gain is slightly smaller. There appear to be recurring peaks in March, July, and December. December is believable because of Christmas, but I could not come to a proper conclusion for March and July. As Rossmann is a German chain, there do not seem to be any major German holidays that fall in those months.

We can also look at how sales vary by day of the week to see how much of an influence that has

# Includes days where stores are both open and closed

fig, (axis1, axis2) = plt.subplots(1, 2, figsize=(15,4))

sns.barplot(x="DayOfWeek", y="Sales", data=data, edgecolor="black", order=[1,2,3,4,5,6,7], ax=axis1)
axis1.set_ylabel("Mean Sales")
sns.barplot(x="DayOfWeek", y="Customers", data=data, edgecolor="black", order=[1,2,3,4,5,6,7], ax=axis2)
axis2.set_ylabel("Mean Customers")

plt.savefig("Sales_Customers_Day.png", bbox_inches="tight")

Sales and Customers per day

The above shows that the mean sales and the mean number of customers for each day of the week are correlated with one another. For example, Monday (1) has both the highest mean sales and the highest mean number of customers. You would also expect that if many people shop on Monday, the following days would see fewer customers and sales until people need to restock. Monday is likely the largest because stores are closed on Sunday.

Looking at how running a promotion affects sales, we can see

# Plot average sales & customers with/without promo
fig, (axis1, axis2) = plt.subplots(1, 2, figsize=(15,4))

sns.barplot(x="Promo", y="Sales", data=data, edgecolor="black", ax=axis1)
axis1.set_ylabel("Mean Sales")
sns.barplot(x="Promo", y="Customers", data=data, edgecolor="black", ax=axis2)
axis2.set_ylabel("Mean Customers")

plt.savefig("Sales_For_Promo.png", bbox_inches="tight")

promo sales

As expected, if a store is having a promotion, the number of sales and customers increases.

The files also list whether a given day was a StateHoliday, where 0 denotes no holiday, a is a “public holiday,” b is Easter, and c is Christmas. We can see that this has a large impact on the number of sales

sns.countplot(x="StateHoliday", edgecolor="black", data=data)

# Mean sales before applying the holiday mask
fig, (axis1,axis2) = plt.subplots(1, 2, figsize=(15,4))

sns.barplot(x="StateHoliday", y="Sales", data=data, edgecolor="black", ax=axis1)
axis1.set_ylabel("Mean Sales")

# Create a mask for *only* stores during a state holiday
mask = (data["StateHoliday"] != "0") & (data["Sales"] > 0)
sns.barplot(x="StateHoliday", y="Sales", data=data[mask], edgecolor="black", ax=axis2)
axis2.set_ylabel("Mean Sales")
plt.savefig("State_Holiday_Sales.png", bbox_inches="tight")

state holiday sales

The above chart is to be expected, as the number of non-holiday days far outweighs the number of actual holidays. The top bar chart shows the total number of instances of each holiday type, while the bottom two focus on mean sales. It shows that there are more sales on Easter and Christmas than on public holidays, so which holiday it is seems to be a driving factor as well.

Schools also have holidays, so we can look at how those affect sales as well.

sns.countplot(x="SchoolHoliday", data=data)

fig, (axis1, axis2) = plt.subplots(1, 2, figsize=(15,4))

sns.barplot(x="SchoolHoliday", y="Sales", data=data, edgecolor="black", ax=axis1)
axis1.set_ylabel("Mean Sales")
sns.barplot(x="SchoolHoliday", y="Customers", data=data, edgecolor="black", ax=axis2)
axis2.set_ylabel("Mean Customers")
plt.savefig("School_Holiday_Sales.png", bbox_inches="tight")

school holiday sales

The above figure shows much the same as the last one, and is perhaps a little more expected. There is an increase in mean sales on school holidays, which makes sense, as kids on holiday are likely to be out and about with their friends.

Lastly, we can look at the distribution of the daily sales (for all stores combined). For this plot, I used the df dataframe so as to exclude the instances where sales were 0, as those are uninteresting and mostly occur on Sundays when the stores are closed.

fig, (axis1, axis2) = plt.subplots(2, 1, figsize=(15,8), sharex=True)

# Plot max, min values, & 2nd, 3rd quartile
sns.boxplot([df["Sales"]], whis=np.inf, ax=axis1)

# Plot sales values
df["Sales"].plot(kind='hist', edgecolor="black", bins=75, xlim=(0,15000), ax=axis2)
plt.xlabel("Daily Sales")
plt.ylabel("Count")
plt.savefig("Sales_Frequency_noclose.png", bbox_inches="tight")

sales frequency

In the above figure, the box plot and the histogram share the same x-axis. Therefore, the median line and quartile ranges in the box plot line up with the bins of the histogram below it. We can see that the interquartile range spans approximately 5,000 to 8,000 sales per day.
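As a quick check of those quartiles (my own addition; it simply reads the values off the Sales column directly):

# Quartiles of daily sales for open stores with nonzero sales
print(df["Sales"].quantile([0.25, 0.50, 0.75]))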

Models

In this section, I apply different regression models to approximate the number of sales for each store on a given day. I look at the errors of each model and their distributions to determine the best-fitting model.

Altering the data

First, we need to alter the data. We need a function that creates the training and validation sets, which is done by the function below

from copy import deepcopy
from sklearn.model_selection import train_test_split

def Make_X_y(data, col="y"):
    y = data[col].as_matrix()
    X = deepcopy(data)
    del X[col]
    X = X.as_matrix()

    X, X_v, y, y_v = train_test_split(X, y, test_size=0.2, random_state=8)

    return X, y, X_v, y_v

We also need to drop the Customers column, as it allows us to “view into the future” which would invalidate any model that we create.

1
df = df.drop("Customers", axis=1)

Finally, before splitting into training and validation sets, we need to convert the categorical columns to numerical ones. Using df.dtypes, we can see that the only non-numerical column is PromoInterval. We can encode it with the following code.

# First make it into a "category" type since it's an "object" (do for all "object")
df["PromoInterval"] = df["PromoInterval"].astype("category")

# Get the columns that are category
cat_columns = df.select_dtypes(["category"]).columns

# Create mapping
df[cat_columns] = df[cat_columns].apply(lambda x: x.cat.codes)

# Convert columns to `int64` since some are `int8` (for consistency with the other columns)
df[cat_columns] = df[cat_columns].astype("int64")

# Get the list of the features
features_list = df.drop("Sales", axis=1).columns.values

# Finally, split the data into training and validation sets
# 'col' is the target column we are trying to predict
X, y, X_v, y_v = Make_X_y(df, col="Sales")

Now that we have the above code set up, we can apply our regressors to the data to attempt to predict.

Random Forest Regression

Random forests are fairly resistant to unbalanced data sets. They can also report which features are most important in determining the number of sales.

The random forest regressor is one of the easier Scikit-learn models to use, as the default options are usually sensible and little cross-validation is needed. The documentation can be found here. Because the random forest determines an importance for each feature, the max_features parameter does not need to be set; it defaults to the total number of features. max_depth defaults to None, meaning no maximum depth is imposed on the trees. It is possible that n_estimators (the number of subtrees) could matter, but in testing various values with cross-validation I did not see any incremental gains. Other parameters, such as min_samples_split, match up logically with how you would train a random forest and should not cause much variation in the results.
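For completeness, here is a sketch of how one could cross-validate n_estimators; the candidate values below are illustrative, not the exact ones I originally tried.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Compare cross-validated MSE for a few candidate numbers of subtrees
for n in [10, 50, 100]:
    reg = RandomForestRegressor(n_estimators=n, n_jobs=-1, random_state=8)
    scores = cross_val_score(reg, X, y, cv=3, scoring="neg_mean_squared_error")
    print(n, -scores.mean())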

Our model can be trained in the following way

from pandas import DataFrame
from sklearn.ensemble import RandomForestRegressor

rf_reg = RandomForestRegressor(n_jobs=-1, random_state=8)
rf_reg.fit(X, y)

features = rf_reg.feature_importances_

feature_importance = DataFrame(list(reversed(sorted(zip(features, features_list)))))

This fits the model to our training set and stores the feature importances, sorted from most to least important, in a data frame. We can then visualize them, giving

fig, ax = plt.subplots(1, 1, figsize=(15,8))
sns.barplot(x=0, y=1, color="salmon", data=feature_importance, edgecolor="black")
plt.xlabel("Relative Feature Importance")
plt.ylabel("Features")
plt.title("Feature Importance")
plt.savefig("Feature_Importance.png", bbox_inches="tight")

Feature Importance

Feature Importance
CompetitionDistance 0.194650
Store 0.174714
Promo 0.135008
DayOfWeek 0.071729
CompetitionOpenSinceYear 0.058781
CompetitionOpenSinceMonth 0.058551
CompetitionOpen 0.049446
Day 0.043992
WeekOfYear 0.037136
StoreType 0.036059
Promo2SinceYear 0.029928
Assortment 0.029330
Promo2SinceWeek 0.024404
PromoInterval 0.014829
PromoOpen 0.014714
Month 0.011000
SchoolHoliday 0.005532
Year 0.004418
Promo2 0.002683
StateHoliday 0.001611
IsPromoMonth 0.001484

The most important features are listed at the top of the table and figure. The results make sense, as CompetitionDistance is likely to be a major factor in how many sales a store has on a given day, followed by the store itself and whether a promotion is running.

Before we can predict the number of sales for a store on a given day, we need another function which we will use for all of our models in order to compare the results.

def Error_Rate(y_true, y_pred):
    data = []
    for i in range(len(y_true)):
        data.append(abs(y_true[i] - y_pred[i]))

    return np.asarray(data)

The above function takes the true y values (sales) and the predicted y values and returns the absolute error for each prediction. We can use it to get the distribution of the errors, which tells us how a model is performing and lets us compare the different models.

To predict on the validation set, we can use

from sklearn.metrics import mean_squared_error, r2_score, median_absolute_error

# Calculate the prediction
y_pred = rf_reg.predict(X_v)

# Calculate the statistics on the error rate
mse = mean_squared_error(y_v, y_pred)
r2 = r2_score(y_v, y_pred)
mae = median_absolute_error(y_v, y_pred)

# Calculate the absolute error for each prediction in `y_v`
pred_dist = Error_Rate(y_v, y_pred)

# Get the statistics on `pred_dist`
mean = np.mean(pred_dist)
variance = np.var(pred_dist)
sigma = np.sqrt(variance)

# Plot the distribution of the absolute error
fig, ax = plt.subplots(1, 1, figsize=(15,8))
plt.hist(pred_dist, bins=1000)
plt.xlim([0,5000])
plt.xlabel("Absolute Error")
plt.ylabel("Count")
plt.title("Distribution of the Absolute Error")
plt.savefig("Abs_Error.png", bbox_inches="tight")

random forest mae dist

The above results show that the absolute errors roughly follow a half-normal distribution (unscaled), which is good, as it makes the results easy to interpret.

Statistical Factor Value
Mean Squared Error 901595.90
\(R^{2}\) Error 0.91
Median Absolute Error 416.60
\(\mu\) 614.21
\(\sigma^{2}\) 524337.73
\(\sigma\) 724.11

The table shows that the mean absolute error is 614.21, though the median absolute error of 416.60 is more relevant since the error distribution is skewed. Treating the errors as half-normal, roughly 68% of the predicted values fall at or below an absolute error of 724.11 sales for a given day. That is roughly 10% error with respect to the mean daily sales.
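As a quick sanity check on that 68% figure (my own addition, reusing pred_dist and sigma from the block above):

# Empirical fraction of absolute errors at or below one standard deviation
print("Fraction of errors <= sigma: {:.2f}".format(np.mean(pred_dist <= sigma)))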

Stochastic Gradient Descent Regression

SGD regression is interesting because it approximates ordinary linear regression: it learns the regression “plane” via stochastic gradient descent. In testing with cross-validation, the optimal parameters were found to be the following

from sklearn.linear_model import SGDRegressor

sgd = SGDRegressor(loss="squared_loss", learning_rate="invscaling", penalty="elasticnet", eta0=0.01, alpha=1e-8, n_iter=10, fit_intercept=True)
sgd.fit(X, y)

SGD is a very fast algorithm. We can predict with this model and look at the absolute error distribution, just as we did for the random forest. In doing so, we get

pred_y = sgd.predict(X_v)
pred_dist = Error_Rate(y_v, pred_y)

# Plot Distribution
fig, ax = plt.subplots(1,1,figsize=(15,8))
plt.hist(pred_dist, bins=1000)
plt.xlabel("Absolute Error")
plt.ylabel("Count")
plt.title("SGD Distribution of the Absolute Error")
plt.savefig("sgd_Abs_Error_Large.png", bbox_inches="tight")

# Get statistical properties
mse = mean_squared_error(y_v, pred_y)
r2 = r2_score(y_v, pred_y)
mae = median_absolute_error(y_v, pred_y)

mean = np.mean(pred_dist)
variance = np.var(pred_dist)
sigma = np.sqrt(variance)

sgd mae dist

Statistical Factor Value
Mean Squared Error 1.04e+33
\(R^{2}\) Error -1.95e+26
Median Absolute Error 3.10e+15
\(\mu\) 2.62e+16
\(\sigma^{2}\) 1.21e+33
\(\sigma\) 3.48e+16

It is important to note that a negative \(R^{2}\) value, especially one of this magnitude, means the model fits the data very poorly (worse than simply predicting the mean) and that it was not able to settle on a sensible intercept. Even the distribution of the absolute error shows how poor the fit is, though it does avoid the very largest of the errors.
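One likely culprit is feature scaling: SGD with an inverse-scaling learning rate is very sensitive to the scale of its inputs. As an aside (this was not part of the original analysis), a scaled pipeline might look like the following sketch, with illustrative SGD parameters.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor

# Standardize the features before fitting SGD
scaled_sgd = make_pipeline(StandardScaler(), SGDRegressor(loss="squared_loss", penalty="elasticnet", eta0=0.01, n_iter=10))
scaled_sgd.fit(X, y)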

K-Nearest Neighbors Regression

KNN regression uses K-Nearest Neighbors to find the \(k\) closest data points to a given query and then takes their average (or distance-weighted average) as the predicted value.
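For reference, in the distance-weighted case the prediction for a query point \(x\) is the standard weighted mean of its neighbors' targets,

\[
\hat{y}(x) = \frac{\sum_{i=1}^{k} w_{i}\, y_{i}}{\sum_{i=1}^{k} w_{i}}, \qquad w_{i} = \frac{1}{d(x, x_{i})},
\]

which reduces to a simple mean when all of the weights are equal.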

The best way to choose \(k\) is to try different values and keep testing the Mean Squared Error/Mean Absolute Error until a minimum is found. However, as I am testing on a laptop, this was taking too long even on the set \(k=\{3,5,7,9,11,13,15,17\}\), and there was no visible point where the MSE bottomed out at a minimum.
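That search might look like the sketch below (using the validation split from earlier; this exact loop is my reconstruction, not the original code).

from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Fit a KNN regressor for each candidate k and compare the validation MSE
for k in [3, 5, 7, 9, 11, 13, 15, 17]:
    reg = KNeighborsRegressor(n_neighbors=k, n_jobs=-1)
    reg.fit(X, y)
    print(k, mean_squared_error(y_v, reg.predict(X_v)))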

In an ideal world, it would be best to iterate over many values of \(k\), but due to my limited computing resources I simply played around with various values of \(k\) until I found a good one. The value I settled on was \(k = 113\), which seemed to be a good fit, as values somewhat below it increased the MSE, as did values of \(k > 113\) beyond a certain point. This model was fit and predicted with the following code

# Predict the values and get the Absolute Error
knn_reg = KNeighborsRegressor(n_neighbors=113, n_jobs=-1)
knn_reg.fit(X, y)
y_pred = knn_reg.predict(X_v)
pred_dist = Error_Rate(y_v, y_pred)

# Plot the distribution
fig, ax = plt.subplots(1, 1, figsize=(15,8))
plt.hist(pred_dist, bins=1000)
plt.xlabel("Absolute Error")
plt.ylabel("Count")
plt.xlim([0,5000])
plt.title("KNN Distribution of the Absolute Error")
plt.savefig("knn_Abs_Error.png", bbox_inches="tight")

# Get the statistical values
mean = np.mean(pred_dist)
variance = np.var(pred_dist)
sigma = np.sqrt(variance)

knn mae dist

Statistical Factor Value
\(\mu\) 1393.69
\(\sigma^{2}\) 1644992.39
\(\sigma\) 1282.57

From the figure and the table above, you can read off the parameters of the half-normal distribution. Even from the magnitudes alone, you can see that the KNN regressor is roughly twice as bad as the random forest regressor.

Lasso

Lasso is a nice linear regression tool, as it shrinks the coefficients and removes features that it deems unimportant for reducing the value of the cost function.
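Concretely, scikit-learn's Lasso minimizes the objective

\[
\min_{w}\;\frac{1}{2n}\,\lVert y - Xw\rVert_{2}^{2} + \alpha\,\lVert w\rVert_{1},
\]

where \(n\) is the number of samples, and it is the \(\ell_{1}\) penalty that drives the coefficients of unimportant features toward exactly zero.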

However, for Lasso to converge quickly, the data needs to be normalized, especially with a data set as large as this one (over one million entries). Thankfully, Scikit-Learn's Lasso implementation has a parameter to do this automatically.

In order to build our model, we can choose

from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.001, fit_intercept=True, normalize=True, precompute=True)
lasso.fit(X, y)

# Create Data Frame of the "important features"
lasso_feature_importance = DataFrame(list(reversed(sorted(zip(lasso.coef_, features_list)))))

# Plot a barchart of the feature values
fig, ax = plt.subplots(1, 1, figsize=(15,8))
sns.barplot(x=0, y=1, color="salmon", data=lasso_feature_importance, edgecolor="black")
plt.xlabel("Coefficient Values of Features")
plt.ylabel("Features")
plt.title("Lasso Coefficients")
plt.savefig("Lasso_Feature_Importance.png", bbox_inches="tight")

lasso feature importance

Feature Coefficient
Promo 2144.90
StateHoliday 1298.46
Assortment 418.29
Year 171.16
Month 73.52
SchoolHoliday 66.43
Promo2SinceWeek 19.68
PromoOpen 5.13
CompetitionOpenSinceYear 0.19
Store 0.03
WeekOfYear 0.00
CompetitionOpen -0.00
CompetitionDistance -0.03
Promo2SinceYear -0.23
Day -2.62
CompetitionOpenSinceMonth -43.51
StoreType -60.58
IsPromoMonth -80.66
PromoInterval -120.22
DayOfWeek -136.75
Promo2 -699.73

The most important factors can be determined by the absolute value of their coefficients. For example, the top three features are Promo, StateHoliday, and Promo2, since \(|-699.73| > 418.29\). However, it is important to note that these are not relative importances as with the random forest; for that, you would divide each coefficient's absolute value by the sum of the absolute values of all the coefficients.
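A minimal sketch of that normalization, assuming the fitted lasso object from the code above:

# Scale the absolute coefficients so they sum to 1, comparable to the random forest importances
coef_abs = np.abs(lasso.coef_)
relative_importance = coef_abs / coef_abs.sum()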

As before, we can look at the statistics of the absolute error distribution for the predicted values, giving

lasso abs error dist

Statistical Factor Value
\(\mu\) 1345.48
\(\sigma^{2}\) 1641285.72
\(\sigma\) 1266.98

This shows that Lasso performs quite similarly to the KNN regressor, meaning the KNN regressor is a good approximation of a linear regression technique here. This is most likely due to using the \(\ell_{2}\) distance in the KNN model.

Conclusion

In conclusion, the best regression model wound up being the random forest, which was able to predict a given store's daily sales to within roughly 10% of the mean. The KNN regressor and Lasso were quite similar in their results, while the SGD regression performed very poorly.

One change I would make to this project would be to also predict the number of customers visiting on a given day, as that might be a better indicator of how many people to staff, and it might be possible to build a better predictor for it.

I would also have liked to explore an ensemble method to potentially lower the error further, as well as treating this as a classification problem, using a technique such as XGBoost and/or a Random Forest Classifier to predict a class (Sales/Customers treated as a string) and then converting the predicted class back into an int to see how that affected the error rate. However, this is a bit unintuitive, so I would expect it to perform worse than the regression approach.