Able.com Case

packages
Data read in and viewing
1) What is the true ATE of having a sale?
2) What is the estimated ATE of having a sale using the observed data, toy_sales.csv?
3)What is the bias involved in using the observed data to estimate ATE?
4) A balance table is used to check whether the treatment and control groups in a study are exchangeable based on pre-treatment characteristics. Create a balance table using the observed data (toy_sales.csv) that shows how company size varies by treatment status. Comment on the exchangeability of the groups.
5) Company size confounds the relationship between weekly_amount_sold and is_on_sale in the observed data (toy_sales.csv). Create a statistical model to adjust for the confounding. Hint: use linear regression.
6) The conditional average treatment effect or CATE is defined as the average treatment effect for a subgroup of the data. Think of weeks_to_xmas as defining subgroups with levels equal to 3, 2, 1, 0. What is the true ATE of a sales campaign conditional on weeks to Christmas?
7) Use a statistical model to estimate the adjusted ATE of a sales campaign conditional on weeks to Christmas. Hint: add an interaction term to your earlier regression model.

packages

library(tidyverse)

## Warning: package 'dplyr' was built under R version 4.3.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Data read in and viewing

po <- read.csv("toy_sales_PO.csv")
head(po)

##   store weeks_to_xmas avg_week_sales is_on_sale     y0     y1
## 1     1             3          12.98          1 162.08 212.08
## 2     1             2          12.98          1 136.12 186.12
## 3     1             1          12.98          1 110.16 160.16
## 4     1             0          12.98          0  84.20 134.20
## 5     2             3          19.92          0  98.85 148.85
## 6     2             2          19.92          0  59.01 109.01
##   weekly_amount_sold
## 1             212.08
## 2             186.12
## 3             160.16
## 4              84.20
## 5              98.85
## 6              59.01

1) What is the true ATE of having a sale?

#Yi1 = potential outcome if treated Yi0  = PO if not treated 
#columns needed, is_on_sale: 1 for yes 0 no, y1: outcome if treated y0: untreated 

po <- po %>% 
  mutate(difference = y1 - y0)
ATE <- po %>% 
  summarize(ATE = mean(difference, na.rm = TRUE))

print(ATE)

##        ATE
## 1 45.90093

Interpretation: The positive impact of 45.9 sugests that implementing the sale has a beneficial impact on the sales of 45.9 units compared to not having a sale.

2) What is the estimated ATE of having a sale using the observed data, toy_sales.csv?

observed <- read.csv("toy_sales_data_observed.csv")

head(observed)

##   store weeks_to_xmas avg_week_sales is_on_sale weekly_amount_sold
## 1     1             3          12.98          1             212.08
## 2     1             2          12.98          1             186.12
## 3     1             1          12.98          1             160.16
## 4     1             0          12.98          0              84.20
## 5     2             3          19.92          0              98.85
## 6     2             2          19.92          0              59.01

observed_means <- observed %>% 
  group_by(is_on_sale)%>% 
  summarise(
    mean_sales = mean(weekly_amount_sold, na.rm = TRUE)
  )

observed_means

## # A tibble: 2 × 2
##   is_on_sale mean_sales
##        <int>      <dbl>
## 1          0       62.8
## 2          1      141.

# Calculate the Estimated ATE as the difference between means
estimated_ate <- observed_means %>%
  spread(is_on_sale, mean_sales) %>%
  mutate(Estimated_ATE = `1` - `0`) %>%
  select(Estimated_ATE)

# Display the Estimated ATE
estimated_ate

## # A tibble: 1 × 1
##   Estimated_ATE
##           <dbl>
## 1          77.9

Interpretation: This shows a positive impact of 77.94 for the estimated ATE.

3)What is the bias involved in using the observed data to estimate ATE?

# ate - estimated_ate 
ATE - estimated_ate

##         ATE
## 1 -32.04115

Interpretation: This negative bias indicates that the estimated ATE is actualy higher than the true ATE. This shows that there is probably an overwestimation due to confounding variables or a selection bias for the purchase of toys. Maybe the compounding factor is that the lower the price on tows might lower the expectation of quality. Or maybe there is a selection bias of customers that might spend less money because they are bargain hunting and don’t have as much to spend.

4) A balance table is used to check whether the treatment and control groups in a study are exchangeable based on pre-treatment characteristics. Create a balance table using the observed data (toy_sales.csv) that shows how company size varies by treatment status. Comment on the exchangeability of the groups.

colnames(observed)

## [1] "store"              "weeks_to_xmas"      "avg_week_sales"    
## [4] "is_on_sale"         "weekly_amount_sold"

balance_table <- observed %>% 
  group_by(is_on_sale) %>%
  summarise(
    mean_size = mean(avg_week_sales, na.rm = TRUE), 
    sd_size = sd(avg_week_sales, na.rm = TRUE), 
    n= n()
  )
balance_table

## # A tibble: 2 × 4
##   is_on_sale mean_size sd_size     n
##        <int>     <dbl>   <dbl> <int>
## 1          0      18.8    3.72   976
## 2          1      21.6    4.99  1024

Interpretation: There is a difference in the average size between the treatment and the control groups. Speicifically the treatement has a higher mean size than the control. This sugggests that the stores that are in the treatment group have higher average sales than the control. This difference might hurt the case for exchangability.

The standard deviation in the treatment group is also largetr than the control group, indicating that there is more variability in the zide of stores that had a sale. This suggessts that stores with sales have more diverse characteristics which might hurt exchangability.

5) Company size confounds the relationship between weekly_amount_sold and is_on_sale in the observed data (toy_sales.csv). Create a statistical model to adjust for the confounding. Hint: use linear regression.

Report an adjusted ATE. How does it compare to the true ATE? Explain how the adjustment is working.

model <- lm(weekly_amount_sold ~ is_on_sale + avg_week_sales, data = observed) 

summary(model)

## 
## Call:
## lm(formula = weekly_amount_sold ~ is_on_sale + avg_week_sales, 
##     data = observed)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -162.997  -46.180   -4.093   43.293  242.178 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2.8972     6.2117   0.466    0.641    
## is_on_sale      68.9867     2.9049  23.749   <2e-16 ***
## avg_week_sales   3.1902     0.3135  10.175   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 61.88 on 1997 degrees of freedom
## Multiple R-squared:  0.3098, Adjusted R-squared:  0.3091 
## F-statistic: 448.1 on 2 and 1997 DF,  p-value: < 2.2e-16

ate_value <- ATE$ATE[1]
cat("The true ATE is", ate_value, "/n")

## The true ATE is 45.90093 /n

print("The adjusted ATE is 68.9867")

## [1] "The adjusted ATE is 68.9867"

Interpretation: The model is adjusting for variability in weekly_amounts_sold that is attrituable to the difference in the avg_weekly_sales. This is isolateing is_on_sale for weekly_amount_sold. Without this adjustment the effect might have been biased due to the confounding variable of size of the store.

6) The conditional average treatment effect or CATE is defined as the average treatment effect for a subgroup of the data. Think of weeks_to_xmas as defining subgroups with levels equal to 3, 2, 1, 0. What is the true ATE of a sales campaign conditional on weeks to Christmas?

Use toy_sales_PO.csv for the calculation. You should report 4 numbers and make a comment.

# Calculate CATE for each level of weeks_to_xmas
cate_table <- po %>%
  group_by(weeks_to_xmas) %>%
  summarise(
    CATE = mean(y1 - y0, na.rm = TRUE)
  )


cate_table

## # A tibble: 4 × 2
##   weeks_to_xmas  CATE
##           <int> <dbl>
## 1             0  38.7
## 2             1  46.1
## 3             2  48.9
## 4             3  49.8

Interpretation: The CATE decreases as the weeks get closer to Christmas. This suggests that the total impact of the sale is the strongest three weeks out then decreases in effectiveness the closer you get to Christmas week. This can be due to customer behaviour or even market saturation. Able.com should then focus their sales in early december to make the biggest impact.

7) Use a statistical model to estimate the adjusted ATE of a sales campaign conditional on weeks to Christmas. Hint: add an interaction term to your earlier regression model.

Use toy_sales.csv for the estimate. Again, you should report 4 numbers and make a comment addressing whether estimated CATE is higher or lower than the true CATE you calculated for the previous question. Simply outputting your regression results will not suffice! Challenge: create a plot that illustrates the statistical result.

Step 1: Interaction model

colnames(observed)

## [1] "store"              "weeks_to_xmas"      "avg_week_sales"    
## [4] "is_on_sale"         "weekly_amount_sold"

#im model with is_on_sale and weeks_to_xmas 
int_model <- lm(weekly_amount_sold ~ is_on_sale * weeks_to_xmas + avg_week_sales, data= observed)

summary(int_model)

## 
## Call:
## lm(formula = weekly_amount_sold ~ is_on_sale * weeks_to_xmas + 
##     avg_week_sales, data = observed)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -138.151  -30.776   -2.707   31.088  181.080 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -50.8416     4.9599 -10.251  < 2e-16 ***
## is_on_sale                37.8046     3.6365  10.396  < 2e-16 ***
## weeks_to_xmas             32.2646     1.3437  24.012  < 2e-16 ***
## avg_week_sales             3.8356     0.2361  16.243  < 2e-16 ***
## is_on_sale:weeks_to_xmas   9.4897     1.8933   5.012 5.85e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 46.36 on 1995 degrees of freedom
## Multiple R-squared:  0.6129, Adjusted R-squared:  0.6122 
## F-statistic: 789.8 on 4 and 1995 DF,  p-value: < 2.2e-16

Interpretation: The interaction model supports that the effectiveness of the sales campaign increases as the number of weeks to Christmas decreases, with a significant interaction effect (estimate = 9.4897). The adjusted CATE is higher closer to Christmas, indicating that the sales campaign is more impactful as the holiday approaches.

Step 2: Estimate Adjusted CATE

#extract coefficients 
coef_summary <- summary(int_model)$coefficients
# Extract main effect and interaction 
main_effect <- coef_summary["is_on_sale", "Estimate"]
interaction_effect <- coef_summary["is_on_sale:weeks_to_xmas", "Estimate"]

# Calculate adjusted CATE for each level of weeks_to_xmas
adjusted_CATE <- tibble(
  weeks_to_xmas = 0:3,
  adjusted_CATE = main_effect + interaction_effect * (0:3)
)

# Display the adjusted CATE
adjusted_CATE

## # A tibble: 4 × 2
##   weeks_to_xmas adjusted_CATE
##           <int>         <dbl>
## 1             0          37.8
## 2             1          47.3
## 3             2          56.8
## 4             3          66.3

Interpretation: This actually stays consistent in the theme of as the weeks to Christmas get smaller so does the effect of the sale.

step 3 Compare Adjusted and True CATE VALUES

# CATE From before 
true_CATE <- c(38.74824, 46.13646, 48.93506, 49.78396)  

# Combine adjusted CATE with the true CATE values
cate_comparison <- tibble(
  weeks_to_xmas = 0:3,
  true_CATE = true_CATE,
  adjusted_CATE = main_effect + interaction_effect * (0:3)
)

# Display the comparison
cate_comparison

## # A tibble: 4 × 3
##   weeks_to_xmas true_CATE adjusted_CATE
##           <int>     <dbl>         <dbl>
## 1             0      38.7          37.8
## 2             1      46.1          47.3
## 3             2      48.9          56.8
## 4             3      49.8          66.3

Again, as the trend of getting closer to Christmas lowers the effect of the sale, in the adjusted CATED there is even more of a difference between the week 3 and week 0 (adjusted week 3 being ~ 17 units higher and week 2 being ~ 8 units higher). This is further supporting the strategy of focusing on sales for week 3 and 2.

Visualize

# Plot the comparison between true CATE and adjusted CATE
ggplot(cate_comparison, aes(x = weeks_to_xmas)) +
  geom_line(aes(y = true_CATE, color = "True CATE"), size = 1) +
  geom_line(aes(y = adjusted_CATE, color = "Adjusted CATE"), size = 1, linetype = "dashed") +
  labs(title = "Comparison of True CATE vs Adjusted CATE",
       x = "Weeks to Christmas",
       y = "CATE",
       color = "CATE Type") +
  theme_minimal()

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

This shows that the interpretation is correct. The focus should be on weeks 2 and 3 with a pretty large increase in units sold.