library(tidyverse)
## Warning: package 'dplyr' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
po <- read.csv("toy_sales_PO.csv")
head(po)
## store weeks_to_xmas avg_week_sales is_on_sale y0 y1
## 1 1 3 12.98 1 162.08 212.08
## 2 1 2 12.98 1 136.12 186.12
## 3 1 1 12.98 1 110.16 160.16
## 4 1 0 12.98 0 84.20 134.20
## 5 2 3 19.92 0 98.85 148.85
## 6 2 2 19.92 0 59.01 109.01
## weekly_amount_sold
## 1 212.08
## 2 186.12
## 3 160.16
## 4 84.20
## 5 98.85
## 6 59.01
#Yi1 = potential outcome if treated Yi0 = PO if not treated
#columns needed, is_on_sale: 1 for yes 0 no, y1: outcome if treated y0: untreated
po <- po %>%
mutate(difference = y1 - y0)
ATE <- po %>%
summarize(ATE = mean(difference, na.rm = TRUE))
print(ATE)
## ATE
## 1 45.90093
Interpretation: The positive impact of 45.9 sugests that implementing the sale has a beneficial impact on the sales of 45.9 units compared to not having a sale.
observed <- read.csv("toy_sales_data_observed.csv")
head(observed)
## store weeks_to_xmas avg_week_sales is_on_sale weekly_amount_sold
## 1 1 3 12.98 1 212.08
## 2 1 2 12.98 1 186.12
## 3 1 1 12.98 1 160.16
## 4 1 0 12.98 0 84.20
## 5 2 3 19.92 0 98.85
## 6 2 2 19.92 0 59.01
observed_means <- observed %>%
group_by(is_on_sale)%>%
summarise(
mean_sales = mean(weekly_amount_sold, na.rm = TRUE)
)
observed_means
## # A tibble: 2 × 2
## is_on_sale mean_sales
## <int> <dbl>
## 1 0 62.8
## 2 1 141.
# Calculate the Estimated ATE as the difference between means
estimated_ate <- observed_means %>%
spread(is_on_sale, mean_sales) %>%
mutate(Estimated_ATE = `1` - `0`) %>%
select(Estimated_ATE)
# Display the Estimated ATE
estimated_ate
## # A tibble: 1 × 1
## Estimated_ATE
## <dbl>
## 1 77.9
Interpretation: This shows a positive impact of 77.94 for the estimated ATE.
# ate - estimated_ate
ATE - estimated_ate
## ATE
## 1 -32.04115
Interpretation: This negative bias indicates that the estimated ATE is actualy higher than the true ATE. This shows that there is probably an overwestimation due to confounding variables or a selection bias for the purchase of toys. Maybe the compounding factor is that the lower the price on tows might lower the expectation of quality. Or maybe there is a selection bias of customers that might spend less money because they are bargain hunting and don’t have as much to spend.
colnames(observed)
## [1] "store" "weeks_to_xmas" "avg_week_sales"
## [4] "is_on_sale" "weekly_amount_sold"
balance_table <- observed %>%
group_by(is_on_sale) %>%
summarise(
mean_size = mean(avg_week_sales, na.rm = TRUE),
sd_size = sd(avg_week_sales, na.rm = TRUE),
n= n()
)
balance_table
## # A tibble: 2 × 4
## is_on_sale mean_size sd_size n
## <int> <dbl> <dbl> <int>
## 1 0 18.8 3.72 976
## 2 1 21.6 4.99 1024
Interpretation: There is a difference in the average size between the treatment and the control groups. Speicifically the treatement has a higher mean size than the control. This sugggests that the stores that are in the treatment group have higher average sales than the control. This difference might hurt the case for exchangability.
The standard deviation in the treatment group is also largetr than the control group, indicating that there is more variability in the zide of stores that had a sale. This suggessts that stores with sales have more diverse characteristics which might hurt exchangability.
Report an adjusted ATE. How does it compare to the true ATE? Explain how the adjustment is working.
model <- lm(weekly_amount_sold ~ is_on_sale + avg_week_sales, data = observed)
summary(model)
##
## Call:
## lm(formula = weekly_amount_sold ~ is_on_sale + avg_week_sales,
## data = observed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -162.997 -46.180 -4.093 43.293 242.178
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.8972 6.2117 0.466 0.641
## is_on_sale 68.9867 2.9049 23.749 <2e-16 ***
## avg_week_sales 3.1902 0.3135 10.175 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 61.88 on 1997 degrees of freedom
## Multiple R-squared: 0.3098, Adjusted R-squared: 0.3091
## F-statistic: 448.1 on 2 and 1997 DF, p-value: < 2.2e-16
ate_value <- ATE$ATE[1]
cat("The true ATE is", ate_value, "/n")
## The true ATE is 45.90093 /n
print("The adjusted ATE is 68.9867")
## [1] "The adjusted ATE is 68.9867"
Interpretation: The model is adjusting for variability in weekly_amounts_sold that is attrituable to the difference in the avg_weekly_sales. This is isolateing is_on_sale for weekly_amount_sold. Without this adjustment the effect might have been biased due to the confounding variable of size of the store.
Use toy_sales_PO.csv for the calculation. You should report 4 numbers and make a comment.
# Calculate CATE for each level of weeks_to_xmas
cate_table <- po %>%
group_by(weeks_to_xmas) %>%
summarise(
CATE = mean(y1 - y0, na.rm = TRUE)
)
cate_table
## # A tibble: 4 × 2
## weeks_to_xmas CATE
## <int> <dbl>
## 1 0 38.7
## 2 1 46.1
## 3 2 48.9
## 4 3 49.8
Interpretation: The CATE decreases as the weeks get closer to Christmas. This suggests that the total impact of the sale is the strongest three weeks out then decreases in effectiveness the closer you get to Christmas week. This can be due to customer behaviour or even market saturation. Able.com should then focus their sales in early december to make the biggest impact.
Use toy_sales.csv for the estimate. Again, you should report 4 numbers and make a comment addressing whether estimated CATE is higher or lower than the true CATE you calculated for the previous question. Simply outputting your regression results will not suffice! Challenge: create a plot that illustrates the statistical result.
colnames(observed)
## [1] "store" "weeks_to_xmas" "avg_week_sales"
## [4] "is_on_sale" "weekly_amount_sold"
#im model with is_on_sale and weeks_to_xmas
int_model <- lm(weekly_amount_sold ~ is_on_sale * weeks_to_xmas + avg_week_sales, data= observed)
summary(int_model)
##
## Call:
## lm(formula = weekly_amount_sold ~ is_on_sale * weeks_to_xmas +
## avg_week_sales, data = observed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -138.151 -30.776 -2.707 31.088 181.080
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -50.8416 4.9599 -10.251 < 2e-16 ***
## is_on_sale 37.8046 3.6365 10.396 < 2e-16 ***
## weeks_to_xmas 32.2646 1.3437 24.012 < 2e-16 ***
## avg_week_sales 3.8356 0.2361 16.243 < 2e-16 ***
## is_on_sale:weeks_to_xmas 9.4897 1.8933 5.012 5.85e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 46.36 on 1995 degrees of freedom
## Multiple R-squared: 0.6129, Adjusted R-squared: 0.6122
## F-statistic: 789.8 on 4 and 1995 DF, p-value: < 2.2e-16
Interpretation: The interaction model supports that the effectiveness of the sales campaign increases as the number of weeks to Christmas decreases, with a significant interaction effect (estimate = 9.4897). The adjusted CATE is higher closer to Christmas, indicating that the sales campaign is more impactful as the holiday approaches.
#extract coefficients
coef_summary <- summary(int_model)$coefficients
# Extract main effect and interaction
main_effect <- coef_summary["is_on_sale", "Estimate"]
interaction_effect <- coef_summary["is_on_sale:weeks_to_xmas", "Estimate"]
# Calculate adjusted CATE for each level of weeks_to_xmas
adjusted_CATE <- tibble(
weeks_to_xmas = 0:3,
adjusted_CATE = main_effect + interaction_effect * (0:3)
)
# Display the adjusted CATE
adjusted_CATE
## # A tibble: 4 × 2
## weeks_to_xmas adjusted_CATE
## <int> <dbl>
## 1 0 37.8
## 2 1 47.3
## 3 2 56.8
## 4 3 66.3
Interpretation: This actually stays consistent in the theme of as the weeks to Christmas get smaller so does the effect of the sale.
# CATE From before
true_CATE <- c(38.74824, 46.13646, 48.93506, 49.78396)
# Combine adjusted CATE with the true CATE values
cate_comparison <- tibble(
weeks_to_xmas = 0:3,
true_CATE = true_CATE,
adjusted_CATE = main_effect + interaction_effect * (0:3)
)
# Display the comparison
cate_comparison
## # A tibble: 4 × 3
## weeks_to_xmas true_CATE adjusted_CATE
## <int> <dbl> <dbl>
## 1 0 38.7 37.8
## 2 1 46.1 47.3
## 3 2 48.9 56.8
## 4 3 49.8 66.3
Again, as the trend of getting closer to Christmas lowers the effect of the sale, in the adjusted CATED there is even more of a difference between the week 3 and week 0 (adjusted week 3 being ~ 17 units higher and week 2 being ~ 8 units higher). This is further supporting the strategy of focusing on sales for week 3 and 2.
# Plot the comparison between true CATE and adjusted CATE
ggplot(cate_comparison, aes(x = weeks_to_xmas)) +
geom_line(aes(y = true_CATE, color = "True CATE"), size = 1) +
geom_line(aes(y = adjusted_CATE, color = "Adjusted CATE"), size = 1, linetype = "dashed") +
labs(title = "Comparison of True CATE vs Adjusted CATE",
x = "Weeks to Christmas",
y = "CATE",
color = "CATE Type") +
theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
This shows that the interpretation is correct. The focus should be on weeks 2 and 3 with a pretty large increase in units sold.