# Load the packages used in this analysis
# (dplyr for data wrangling; caret with the randomForest/ranger engines for the
#  machine-learning learners; grf for the causal forest)
library(dplyr)
library(caret)
library(grf)

# Load the data
experimental_data <- read.csv("https://raw.githubusercontent.com/jefftwebb/data/main/experimental_email_data.csv")
observational_data <- read.csv("https://raw.githubusercontent.com/jefftwebb/data/main/observed_email_data.csv")
# Inspect the first few rows and the structure of the data
head(experimental_data)
## mkt_email next_mnth_pv age tenure vehicle food beverage art baby
## 1 0 244.26 61 1 0 2 2 0 2
## 2 0 29.67 36 1 0 2 0 2 0
## 3 0 11.73 64 0 0 1 0 0 0
## 4 0 41.41 74 0 0 1 0 0 3
## 5 0 447.89 59 0 0 1 1 3 2
## 6 1 1178.33 39 0 0 0 0 0 0
## personal_care toys clothing decor cell_phones construction home_appliances
## 1 0 0 2 0 3 1 1
## 2 0 1 2 2 3 1 1
## 3 3 1 3 0 2 1 1
## 4 1 0 0 2 6 3 1
## 5 2 2 3 0 2 0 1
## 6 1 0 3 1 0 4 1
## electronics sports tools games industry pc jewel books music_books_movies
## 1 1 0 0 3 1 0 1 0 0
## 2 1 1 1 2 1 2 1 0 2
## 3 2 0 0 3 0 1 0 1 0
## 4 1 0 2 2 1 1 0 4 1
## 5 5 0 0 1 0 0 1 1 2
## 6 2 0 1 3 1 0 0 0 2
## health
## 1 2
## 2 2
## 3 1
## 4 0
## 5 1
## 6 4
str(experimental_data)
## 'data.frame': 10000 obs. of 26 variables:
## $ mkt_email : int 0 0 0 0 0 1 1 1 1 0 ...
## $ next_mnth_pv : num 244.3 29.7 11.7 41.4 447.9 ...
## $ age : int 61 36 64 74 59 39 23 52 31 33 ...
## $ tenure : int 1 1 0 0 0 0 1 1 0 0 ...
## $ vehicle : int 0 0 0 0 0 0 0 0 0 0 ...
## $ food : int 2 2 1 1 1 0 1 0 0 0 ...
## $ beverage : int 2 0 0 0 1 0 2 1 2 2 ...
## $ art : int 0 2 0 0 3 0 0 1 1 0 ...
## $ baby : int 2 0 0 3 2 0 0 2 0 1 ...
## $ personal_care : int 0 0 3 1 2 1 1 1 1 1 ...
## $ toys : int 0 1 1 0 2 0 2 1 2 1 ...
## $ clothing : int 2 2 3 0 3 3 4 3 3 1 ...
## $ decor : int 0 2 0 2 0 1 1 1 2 2 ...
## $ cell_phones : int 3 3 2 6 2 0 1 2 6 4 ...
## $ construction : int 1 1 1 3 0 4 1 0 0 1 ...
## $ home_appliances : int 1 1 1 1 1 1 4 1 1 2 ...
## $ electronics : int 1 1 2 1 5 2 4 3 3 2 ...
## $ sports : int 0 1 0 0 0 0 3 1 1 1 ...
## $ tools : int 0 1 0 2 0 1 0 0 1 2 ...
## $ games : int 3 2 3 2 1 3 1 4 2 3 ...
## $ industry : int 1 1 0 1 0 1 0 0 0 2 ...
## $ pc : int 0 2 1 1 0 0 1 2 3 2 ...
## $ jewel : int 1 1 0 0 1 0 0 0 1 0 ...
## $ books : int 0 0 1 4 1 0 0 2 1 1 ...
## $ music_books_movies: int 0 2 0 1 2 2 2 1 0 0 ...
## $ health : int 2 2 1 0 1 4 1 0 4 2 ...
head(observational_data)
## mkt_email next_mnth_pv age tenure vehicle food beverage art baby
## 1 1 1136.96 38 0 0 1 1 1 0
## 2 1 1347.23 39 0 0 1 0 0 3
## 3 0 45.09 45 0 0 2 1 2 2
## 4 0 449.29 34 2 0 2 3 1 1
## 5 1 1780.92 42 1 0 0 3 2 1
## 6 1 1410.41 43 0 0 1 3 3 2
## personal_care toys clothing decor cell_phones construction home_appliances
## 1 1 1 1 1 1 2 0
## 2 0 0 2 1 1 1 2
## 3 1 0 2 1 1 0 3
## 4 1 1 1 2 6 0 2
## 5 0 1 1 1 2 2 2
## 6 1 2 2 2 1 0 1
## electronics sports tools games industry pc jewel books music_books_movies
## 1 5 2 3 2 0 2 1 0 1
## 2 2 2 2 2 0 2 2 0 3
## 3 2 2 1 2 0 4 2 0 1
## 4 1 0 0 3 0 3 0 2 1
## 5 2 0 1 6 3 4 2 1 0
## 6 5 0 0 0 0 1 1 0 0
## health
## 1 1
## 2 0
## 3 3
## 4 1
## 5 2
## 6 2
str(observational_data)
## 'data.frame': 15000 obs. of 26 variables:
## $ mkt_email : int 1 1 0 0 1 1 0 0 0 0 ...
## $ next_mnth_pv : num 1137 1347.2 45.1 449.3 1780.9 ...
## $ age : int 38 39 45 34 42 43 22 25 46 49 ...
## $ tenure : int 0 0 0 2 1 0 1 0 0 0 ...
## $ vehicle : int 0 0 0 0 0 0 0 0 1 0 ...
## $ food : int 1 1 2 2 0 1 3 0 1 0 ...
## $ beverage : int 1 0 1 3 3 3 1 2 1 1 ...
## $ art : int 1 0 2 1 2 3 1 1 2 1 ...
## $ baby : int 0 3 2 1 1 2 0 0 3 0 ...
## $ personal_care : int 1 0 1 1 0 1 0 2 1 0 ...
## $ toys : int 1 0 0 1 1 2 1 1 2 2 ...
## $ clothing : int 1 2 2 1 1 2 2 7 2 1 ...
## $ decor : int 1 1 1 2 1 2 1 1 0 1 ...
## $ cell_phones : int 1 1 1 6 2 1 1 3 4 6 ...
## $ construction : int 2 1 0 0 2 0 1 2 2 2 ...
## $ home_appliances : int 0 2 3 2 2 1 1 1 1 1 ...
## $ electronics : int 5 2 2 1 2 5 4 2 2 2 ...
## $ sports : int 2 2 2 0 0 0 1 1 1 1 ...
## $ tools : int 3 2 1 0 1 0 2 1 0 3 ...
## $ games : int 2 2 2 3 6 0 4 3 3 1 ...
## $ industry : int 0 0 0 0 3 0 0 1 0 1 ...
## $ pc : int 2 2 4 3 4 1 1 1 1 3 ...
## $ jewel : int 1 2 2 0 2 1 2 0 1 0 ...
## $ books : int 0 0 0 2 1 0 0 2 1 0 ...
## $ music_books_movies: int 1 3 1 1 0 0 2 0 0 1 ...
## $ health : int 1 0 3 1 2 2 1 2 3 3 ...
# Calculate the means for the treatment and control groups
treatment_group <- experimental_data %>%
filter(mkt_email == 1) %>%
pull(next_mnth_pv)
control_group <- experimental_data %>%
filter(mkt_email == 0) %>%
pull(next_mnth_pv)
# Calculate ATE (difference in means)
ate <- mean(treatment_group, na.rm = TRUE) - mean(control_group, na.rm = TRUE)
# Calculate the standard error of the difference in means
se_diff <- sqrt(var(treatment_group, na.rm = TRUE) / length(treatment_group) +
var(control_group, na.rm = TRUE) / length(control_group))
# Calculate the 95% confidence interval (t critical value with n1 + n2 - 2
# degrees of freedom; with samples this large it is essentially 1.96)
conf_int <- ate + c(-1, 1) *
  qt(0.975, df = length(treatment_group) + length(control_group) - 2) * se_diff
# Print the results
cat("ATE:", round(ate,2), "\n")
## ATE: 1382.09
cat("95% Confidence Interval:", round(conf_int,2), "\n")
## 95% Confidence Interval: 1325.1 1439.08
The Average Treatment Effect (ATE) from the experimental data is $1382.09, meaning that, on average, customers who received the marketing email spent $1382.09 more in the following month than customers who did not.
The 95% confidence interval for this effect runs from $1325.10 to $1439.08. Because the interval lies well above zero and treatment was randomized, this indicates a substantial, statistically significant positive effect of the marketing email on next-month purchase volume.
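As a quick cross-check, a Welch two-sample t-test should give essentially the same estimate and interval; a minimal sketch using the experimental data loaded above (output not shown here):
# Welch two-sample t-test as a cross-check. With a 0/1 grouping variable,
# t.test() reports mean(group 0) - mean(group 1), so signs are flipped
# relative to the ATE defined above.
welch_check <- t.test(next_mnth_pv ~ mkt_email, data = experimental_data)
diff(welch_check$estimate)   # treatment mean minus control mean (the ATE)
rev(-welch_check$conf.int)   # 95% confidence interval for the ATE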
# Multiple Regression
model <- lm(next_mnth_pv ~ mkt_email + age + tenure + vehicle + food + beverage + art + baby + personal_care + toys +
clothing + decor + cell_phones + construction + home_appliances + electronics + sports + tools + games +
industry + pc + jewel + books + music_books_movies + health,
data = observational_data)
# Summarize the fitted model
summary(model)
##
## Call:
## lm(formula = next_mnth_pv ~ mkt_email + age + tenure + vehicle +
## food + beverage + art + baby + personal_care + toys + clothing +
## decor + cell_phones + construction + home_appliances + electronics +
## sports + tools + games + industry + pc + jewel + books +
## music_books_movies + health, data = observational_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -739 -267 -115 41 71230
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -513.5877 90.0372 -5.704 1.19e-08 ***
## mkt_email 1522.7032 27.9441 54.491 < 2e-16 ***
## age -1.1128 1.1755 -0.947 0.343839
## tenure -11.9557 22.7529 -0.525 0.599274
## vehicle 17.7873 43.5775 0.408 0.683149
## food 19.0189 13.8721 1.371 0.170392
## beverage 45.6923 14.0140 3.260 0.001115 **
## art 33.3120 14.0138 2.377 0.017463 *
## baby 57.3377 13.9096 4.122 3.77e-05 ***
## personal_care 5.8778 14.1221 0.416 0.677261
## toys 26.0558 14.0774 1.851 0.064203 .
## clothing 6.5244 9.7860 0.667 0.504969
## decor 30.7879 13.9700 2.204 0.027549 *
## cell_phones 34.9540 8.1331 4.298 1.74e-05 ***
## construction 51.9751 13.8640 3.749 0.000178 ***
## home_appliances -2.8597 13.9352 -0.205 0.837406
## electronics 5.7373 9.9586 0.576 0.564545
## sports 0.2147 13.8492 0.016 0.987629
## tools 27.5371 13.9414 1.975 0.048263 *
## games -12.3773 9.8619 -1.255 0.209477
## industry 26.2965 14.0210 1.876 0.060742 .
## pc 17.6071 9.8799 1.782 0.074750 .
## jewel 44.0374 13.9889 3.148 0.001647 **
## books 24.4922 13.8514 1.768 0.077045 .
## music_books_movies 43.1118 13.9961 3.080 0.002072 **
## health 38.9534 9.9177 3.928 8.62e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1710 on 14974 degrees of freedom
## Multiple R-squared: 0.1715, Adjusted R-squared: 0.1701
## F-statistic: 124 on 25 and 14974 DF, p-value: < 2.2e-16
# Extract the coefficient for mkt_email
ate_observed <- coef(model)["mkt_email"]
cat("Estimated ATE from observational data:", ate_observed, "\n")
## Estimated ATE from observational data: 1522.703
The multiple regression results show that receiving a marketing email (mkt_email) is associated with an estimated increase of $1522.70 in next month's purchase volume, on average, holding the other covariates fixed. This effect is highly statistically significant (p < 2e-16), suggesting that customers who receive the email tend to spend substantially more than those who don't, after accounting for variables like age, tenure, and the product-category counts.
Several product categories, including beverage, baby, cell_phones, construction, jewel, music_books_movies, and health, also show statistically significant positive associations with purchase volume. The model's adjusted R-squared is 0.1701, meaning the model explains about 17% of the variation in purchase volume. So while the email has a strong estimated effect, most of the variation in customer spending is driven by factors outside the model.
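A confidence interval for this regression-adjusted estimate can be pulled directly from the fitted model; a minimal sketch (output not shown here):
# 95% confidence interval for the mkt_email coefficient,
# i.e. the regression-adjusted ATE
confint(model, "mkt_email", level = 0.95)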
# Define the outcome (Y), treatment (W), and covariates (X)
Y <- observational_data$next_mnth_pv
W <- observational_data$mkt_email
X <- observational_data %>% select(age, tenure, vehicle, food, beverage, art, baby, personal_care,
toys, clothing, decor, cell_phones, construction, home_appliances,
electronics, sports, tools, games, industry, pc, jewel, books,
music_books_movies, health)
# Combine treatment and covariates into a single dataframe
cd <- data.frame(Y = Y, Treatment = W, X)
# Fit the linear regression model (S-learner)
s_learner <- lm(Y ~ ., data = cd)
# Predict individual outcomes for treated (Treatment = 1) and control (Treatment = 0) using the single model
Y1 <- predict(s_learner, newdata = cd %>% mutate(Treatment = 1))
Y0 <- predict(s_learner, newdata = cd %>% mutate(Treatment = 0))
# Calculate Individual Treatment Effect (ITE) for the first 10 observations
ITE_first_10 <- Y1[1:10] - Y0[1:10]
ITE_first_10 # Display the first 10 ITE estimates
## 1 2 3 4 5 6 7 8
## 1522.703 1522.703 1522.703 1522.703 1522.703 1522.703 1522.703 1522.703
## 9 10
## 1522.703 1522.703
# Calculate the Average Treatment Effect (ATE)
ATE <- mean(Y1 - Y0)
cat("Estimated ATE from the linear S-learner:", ATE, "\n")
## Estimated ATE from the linear S-learner: 1522.703
The estimated CATE in the linear S-learner model is constant because a linear model without interaction terms assumes the treatment effect is the same for everyone. The effect of receiving the email, captured by the coefficient on Treatment (mkt_email), does not change with individual characteristics, so the difference between the predicted outcomes with and without the email is identical for every observation. The CATE therefore equals the ATE of $1522.70, which is exactly the Treatment coefficient in the model.
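This can be verified directly: every individual effect from the linear S-learner collapses to the Treatment coefficient. A minimal check using the objects created above (output not shown here):
# All individual effects from the linear S-learner are identical...
range(Y1 - Y0)                        # min and max are the same value
unname(coef(s_learner)["Treatment"])  # ...and equal to the Treatment coefficient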
set.seed(123)
# Set up 5-fold cross-validation using the caret package
train_control <- trainControl(method = "cv", number = 5)
# Fit a Random Forest model with 5-fold cross-validation as the base learner (S-learner)
s_learner_rf_cv <- train(next_mnth_pv ~ mkt_email + age + tenure + vehicle + food + beverage + art +
baby + personal_care + toys + clothing + decor + cell_phones + construction +
home_appliances + electronics + sports + tools + games + industry + pc +
jewel + books + music_books_movies + health,
data = observational_data,
method = "rf",
trControl = train_control,
tuneGrid = data.frame(mtry = 2),
ntree = 100)
# Step 1: Predict potential outcomes
observational_data$Y_hat_1_rf <- predict(s_learner_rf_cv, newdata = mutate(observational_data, mkt_email = 1))
# Predict Y(0)
observational_data$Y_hat_0_rf <- predict(s_learner_rf_cv, newdata = mutate(observational_data, mkt_email = 0))
# Step 2: Calculate CATE for each individual
observational_data$CATE_rf <- observational_data$Y_hat_1_rf - observational_data$Y_hat_0_rf
# Step 3: Calculate the ATE from the random forest model
ATE_rf_cv <- mean(observational_data$CATE_rf)
# Print the estimated ATE from the cross-validated random forest model
cat("Estimated ATE from S-learner with 5-Fold CV Random Forest:", ATE_rf_cv, "\n")
## Estimated ATE from S-learner with 5-Fold CV Random Forest: 1081.277
The estimated ATE from the random forest S-learner model with 5-fold cross-validation is $1081.28, meaning that, on average, receiving the marketing email increases next month’s purchase volume by this amount. To identify persuadables—customers who are more likely to make a purchase if they receive the email—you can use the CATE estimates. Targeting individuals with high, positive CATE values will make the campaign more effective by focusing on those most influenced by the email.
When comparing the ATE from the random forest model to the other estimates, it is lower than both the multiple linear regression ATE ($1522.70) and the experimental ATE ($1382.09). Part of the gap may reflect the random forest capturing heterogeneous responses, but with mtry = 2 the treatment indicator is eligible at only a small fraction of splits, which can also attenuate the estimated effect. The experimental ATE remains the most reliable benchmark because it comes from a randomized controlled trial.
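To act on the persuadables idea, the random forest CATE estimates can be summarized and the highest-CATE customers flagged; a minimal sketch, where the 80th-percentile cutoff is an illustrative choice rather than part of the original analysis (output not shown here):
# Spread of the S-learner random forest CATE estimates
summary(observational_data$CATE_rf)

# Flag the top 20% of customers by estimated CATE as candidate persuadables
# (the 0.80 quantile cutoff is illustrative only)
cate_cutoff <- quantile(observational_data$CATE_rf, 0.80)
persuadables_rf <- filter(observational_data, CATE_rf >= cate_cutoff)
nrow(persuadables_rf)  # number of flagged customers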
# Define the outcome (Y), treatment (W), and covariates (X)
Y <- observational_data$next_mnth_pv
W <- observational_data$mkt_email
X <- observational_data %>% select(age, tenure, vehicle, food, beverage, art, baby, personal_care,
toys, clothing, decor, cell_phones, construction, home_appliances,
electronics, sports, tools, games, industry, pc, jewel, books,
music_books_movies, health)
# Combine into a single dataframe for ease of use
cd <- data.frame(Y = Y, Treatment = W, X)
# Subset the data into treated and control groups
cd0 <- filter(cd, Treatment == 0) %>% select(-Treatment) # Control group
cd1 <- filter(cd, Treatment == 1) %>% select(-Treatment) # Treated group
# Set up cross-validation (5-fold)
train_control <- trainControl(method = "cv", number = 5)
# Random Forest model for the control group
rf_mod0 <- train(Y ~ ., data = cd0, method = "ranger", trControl = train_control)
# Random Forest model for the treated group
rf_mod1 <- train(Y ~ ., data = cd1, method = "ranger", trControl = train_control)
# Predict potential outcomes for all individuals
yhat1 <- predict(rf_mod1, newdata = cd) # Predicted outcomes if treated
yhat0 <- predict(rf_mod0, newdata = cd) # Predicted outcomes if not treated
# Calculate Individual Treatment Effects (ITE) for the first 10 observations
ITE_first_10 <- yhat1[1:10] - yhat0[1:10]
print("First 10 Individual Treatment Effects (CATE estimates):")
## [1] "First 10 Individual Treatment Effects (CATE estimates):"
print(ITE_first_10)
## [1] 1387.648 1399.198 1559.262 1478.976 1616.894 1469.764 1574.595 1432.042
## [9] 1624.590 2684.385
# Calculate the Average Treatment Effect (ATE)
ATE_t_learner <- mean(yhat1 - yhat0)
cat("Estimated ATE from the T-learner with Random Forest:", ATE_t_learner, "\n")
## Estimated ATE from the T-learner with Random Forest: 1533.61
Using a T-learner with random forests as the base learners, we estimated individual treatment effects (CATE) and the overall ATE from the observational data. The first 10 unsorted CATE estimates are 1387.648, 1399.198, 1559.262, 1478.976, 1616.894, 1469.764, 1574.595, 1432.042, 1624.590, and 2684.385. These estimates demonstrate the variability in how customers respond to receiving the marketing email. The Average Treatment Effect (ATE) from this model is $1533.61, meaning that, on average, the marketing email is predicted to increase next month's purchase volume by this amount. Compared to the multiple linear regression ATE ($1522.70) and the experimental ATE ($1382.09), the T-learner yields a slightly higher estimate, likely reflecting its ability to fit separate response surfaces for the treated and control groups. This flexibility makes the T-learner a useful tool for identifying persuadable customers.
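The heterogeneity the T-learner picks up can be summarized over all customers rather than just the first ten; a minimal sketch using the predictions above (output not shown here):
# Distribution of the T-learner individual treatment effects across all customers
ITE_t_learner <- yhat1 - yhat0
summary(ITE_t_learner)
sd(ITE_t_learner)  # how much the estimated effects vary across customers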
set.seed(123)
# Outcome variable
Y <- observational_data$next_mnth_pv
# Treatment variable
W <- observational_data$mkt_email
# Covariates
X <- observational_data %>% select(age, tenure, vehicle, food, beverage, art, baby, personal_care, toys, clothing, decor, cell_phones, construction, home_appliances, electronics, sports, tools, games, industry, pc, jewel, books, music_books_movies, health)
# Train the causal forest model
causal_forest_model <- causal_forest(X, Y, W)
# Estimate CATE for each individual
CATE_estimates <- predict(causal_forest_model)$predictions
# Add CATE estimates to the original data
observational_data$CATE_causal_forest <- CATE_estimates
# List the first 10 unsorted CATE estimates
head(observational_data$CATE_causal_forest, 10)
## [1] 1249.939 1602.216 1629.403 1582.689 1578.596 1540.016 1520.754 1471.285
## [9] 1671.832 2074.175
# Estimate the ATE from the causal forest model
ATE_causal_forest <- average_treatment_effect(causal_forest_model)
# Print the ATE estimate
cat("Estimated ATE from the causal forest model:", ATE_causal_forest[1], "\n")
## Estimated ATE from the causal forest model: 1520.674
The causal forest model estimated an Average Treatment Effect (ATE) of $1520.67, meaning that, on average, the marketing email is predicted to increase next month’s purchase volume by this amount. The first 10 CATE estimates (representing individual treatment effects) range from $1249.94 to $2074.18, showing variability in how different customers respond to the email. This variation reflects the model’s ability to capture heterogeneous treatment effects, indicating that some customers are more influenced by the email than others.
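Since average_treatment_effect() returns both an estimate and a standard error, a normal-approximation 95% confidence interval for the causal forest ATE is easy to add; a minimal sketch (output not shown here):
# average_treatment_effect() returns the estimate followed by its standard error;
# use a normal approximation for a 95% confidence interval around the ATE
ATE_cf_ci <- ATE_causal_forest[1] + c(-1, 1) * qnorm(0.975) * ATE_causal_forest[2]
ATE_cf_ci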
Summary of Results
In this case, ShopHub is looking to optimize its email marketing strategy by identifying “persuadable” customers—those who are more likely to make a purchase as a result of receiving a promotional email. The objective is to estimate the treatment effect of the email on customer purchases using both observational data and experimental data and to determine which method best identifies customers who are influenced by the email.
We estimated the Average Treatment Effect (ATE) in several ways: the experimental difference in means, multiple linear regression, S-learner and T-learner models, and a causal forest.
The causal forest model estimated an ATE of $1520.67. This method captures heterogeneous treatment effects, allowing for variation in customer responses. The first 10 CATE estimates (ranging from $1249.94 to $2074.18) show that some customers are more responsive to the email than others, making this model well-suited for targeting specific customer groups.
The multiple linear regression model estimated an ATE of $1522.70, which assumes a constant treatment effect across all customers. While this result is similar to the causal forest estimate, it doesn't account for individual differences in customer behavior.
The experimental data yielded an ATE of $1382.09, which serves as the benchmark because treatment was randomly assigned, making this the most credible, unbiased estimate.
Which Method is Closest to the Experimental Data?
Among the observational methods, the causal forest ($1520.67) and multiple linear regression ($1522.70) estimates come closest to the experimental ATE ($1382.09), and they are nearly identical to each other. The regression, however, assumes the treatment effect is the same for every customer, so it cannot reveal heterogeneity in customer behavior.
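For reference, the estimates discussed above can be lined up in a single table; a minimal sketch using the objects computed earlier (output not shown here):
# Gather the ATE estimates for a side-by-side comparison with the experiment
ate_comparison <- data.frame(
  method = c("Experiment (difference in means)",
             "Multiple regression",
             "S-learner (random forest)",
             "T-learner (random forest)",
             "Causal forest"),
  ATE = c(ate, ate_observed, ATE_rf_cv, ATE_t_learner, ATE_causal_forest[1])
)
ate_comparison$gap_vs_experiment <- ate_comparison$ATE - ate_comparison$ATE[1]
ate_comparison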
Which Method Should Be Used to Identify Persuadable Customers?
The causal forest model is the best method for identifying persuadable customers because it provides individual-level CATE estimates, which capture the heterogeneous treatment effects across the customer base. These estimates allow ShopHub to target customers who are most likely to be influenced by the email, making the marketing campaign more efficient and impactful. By focusing on customers with higher CATE values, ShopHub can allocate marketing resources more effectively, increasing the likelihood of higher returns from the email campaign.
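As a final step, the target list itself can be built from the causal forest CATE estimates already attached to the data; a minimal sketch, where the top-quartile cutoff is an illustrative choice (output not shown here):
# Keep an illustrative top quartile of customers by causal forest CATE
# as the target list for the next email campaign
target_list <- observational_data %>%
  filter(CATE_causal_forest >= quantile(CATE_causal_forest, 0.75)) %>%
  arrange(desc(CATE_causal_forest))
nrow(target_list)                        # size of the targeted segment
summary(target_list$CATE_causal_forest)  # expected lift among targeted customers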