
Which Stats Are Actually Important¶

A Penalized Regression to Determine Which Hitting Stats Fans Should Pay Attention To¶

Created by Scott Silverstein¶

Questions¶

Question 1: From what perspective are you conducting the analysis? (Who are you? / Who are you working for?)

  • I am conducting this analysis from the perspective of a baseball blogger. As an avid baseball fan, I am creating an analytical baseball blog for both my portfolio and personal interest. My goal is to delve deeper into quantifiable player evaluation, exploring which statistics truly matter in assessing player performance and providing insights that resonate with the broader baseball fan community.

Question 2: What is your question?

  • The question I aim to answer is: Which offensive statistics, such as hits, home runs, strikeouts, and others, best predict a player's wRC+ (Weighted Runs Created Plus)? wRC+ is widely regarded as one of the most reliable metrics for evaluating a player's offensive contribution, which is why it serves as the foundation of this analysis. Through this study, I hope to bridge the gap between the 'baseball progressives,' who embrace advanced metrics, and the 'baseball conservatives,' who prefer traditional statistics, by identifying which stats truly matter in assessing player quality.

Question 3: Describe your dataset(s) including URL (if available).

  • The dataset used for this analysis comes from the FanGraphs Leaderboards, which contain offensive statistics for MLB players. It includes metrics such as hits, home runs, strikeouts, batting average, and wRC+, among many others. This data will be used to perform penalized regression techniques such as Lasso and Ridge regression.

Question 4: What is(are) your independent variable(s) and dependent variable(s)? Include variable type (binary, categorical, numeric).

  • Independent variables: offensive metrics such as Hits, Home Runs, Walks, Strikeouts, Batting Average, and On-Base Percentage (all numeric).
  • Dependent variable: wRC+ (numeric).

Question 5: How are your variables suitable for your analysis method?

  • The independent variables are numeric and likely to be correlated, which makes penalized regression methods like Lasso and Ridge ideal for addressing multicollinearity. These techniques help identify the most important variables in predicting wRC+ while controlling for overfitting. Since multicollinearity is expected (a player's hit total and batting average, to name just one example, are strongly correlated), penalized regression is well suited to this project, keeping the analysis robust and interpretable. A small synthetic sketch of this behavior follows below.
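To see concretely why this matters, here is a minimal sketch on synthetic data (hypothetical, not the FanGraphs dataset): given two nearly identical predictors, Lasso tends to zero one of them out while Ridge shrinks the weight and splits it between both.

In [ ]:
# Minimal sketch on synthetic data (not the FanGraphs dataset): two
# near-duplicate predictors illustrate how Lasso and Ridge respond
# differently to multicollinearity.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.01, size=500)         # near-duplicate of x1
X_demo = np.column_stack([x1, x2])
y_demo = 3 * x1 + rng.normal(scale=0.1, size=500)

# Lasso typically zeroes one of the duplicates; Ridge splits the weight
print("Lasso coefficients:", Lasso(alpha=0.1).fit(X_demo, y_demo).coef_)
print("Ridge coefficients:", Ridge(alpha=1.0).fit(X_demo, y_demo).coef_)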

Question 6: What are your conclusions (include references to one or two CLEARLY INDICATED AND IMPORTANT graphs or tables in your output)?

  • The results from both Lasso and Ridge regression show that modern sabermetric statistics, such as SLG+ and OBP+, are highly predictive of a player's wRC+. These advanced metrics are consistently selected as the most important features in the models, highlighting the relevance of modern baseball statistics in player evaluation.

  • Lasso Regression effectively selected a few key features, confirming that SLG+ and OBP+ are the most important factors in predicting a player's offensive success. This demonstrates Lasso's ability to simplify the model by eliminating less important features.

  • Ridge Regression retained more features but still produced a low Mean Squared Error (MSE) of 6.0645 and a high R² Score of 0.9895, showing that it also performs well in predicting wRC+ while handling multicollinearity effectively.

Key Graphs/Tables:

  • Top Features from Lasso Regression: The table of top features selected by Lasso clearly indicates that modern sabermetric metrics are the most important predictors of wRC+.
  • Model Performance (MSE and R²): The performance of Ridge regression with a low MSE and high R² score further validates the strength of these metrics in predicting player success.

These results confirm that modern sabermetric statistics are essential for understanding player performance and provide a strong foundation for future analysis.

Question 7: What are your assumptions and limitations? Did you include any robustness checks?

  • The analysis assumes that the dataset is a reliable representation of player performance and that wRC+ is a valid indicator of offensive contribution. It also assumes that the penalized regression methods (Lasso and Ridge) will effectively handle multicollinearity and improve the robustness of the model. One limitation is that the model may not capture all potential interactions between variables. As a robustness check, the results of Lasso and Ridge regression are compared (see the coefficient comparison after the Ridge analysis).

Packages¶

In [1]:
%%capture
# Suppress cell output (a cell magic must be the first line of its cell)

# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge, Lasso, LassoCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor
In [2]:
import warnings
warnings.filterwarnings('ignore')

EDA¶

Initial Observations:

  • The dataset includes columns such as PA (plate appearances), HR (home runs), R (runs), RBI (runs batted in), SB (stolen bases), and many others.
  • There are also multiple columns related to Fantasy Baseball statistics (e.g., FPTS, SPTS).
  • The total number of columns is quite large (1749), so we will need to focus on relevant columns for our analysis, such as those directly related to player performance and wRC+.

Missing Values¶

In [3]:
# Load the dataset
file_path = '/content/fangraphs-leaderboards.csv'
data = pd.read_csv(file_path, encoding='ISO-8859-1')
In [4]:
# Check for missing values in all columns
missing_values_all = data.isnull().sum()

# Filter out columns that have missing values (greater than 0)
missing_values_filtered = missing_values_all[missing_values_all > 0]

# Create a table showing the missing values
missing_values_table = pd.DataFrame({
    'Column': missing_values_filtered.index,
    'Missing Values': missing_values_filtered.values
})

# Sort the table by missing values in descending order and display the top 15
missing_values_table_sorted = missing_values_table.sort_values(by='Missing Values', ascending=False)

# Display the top 15 rows
print(missing_values_table_sorted.head(15))
            Column  Missing Values
750   wFT/C (sc).2            4450
1475        xSLG.4            4450
300           SPTS            4450
100       vFT (sc)            4450
98        UN% (sc)            4450
876          xBA.2            4450
877         xSLG.2            4450
878        xwOBA.3            4450
86        FT% (sc)            4450
451   wFT/C (sc).1            4450
895            $.2            4450
896         FPTS.2            4450
897       FPTS/G.2            4450
898         SPTS.2            4450
899       SPTS/G.2            4450

For the purposes of this project, dropping columns with more than 20% missing data is a reasonable approach. Since the goal is to conduct a definitive analysis of which offensive statistics best predict wRC+, having a large amount of missing data in key columns could obscure the true relationships between variables. Imputation strategies, while useful in certain cases, may introduce bias or uncertainty, which could compromise the accuracy and clarity of the results.

Given that we are trying to determine the most impactful statistics in a clean and interpretable way, it’s essential to minimize noise and ambiguity. By removing columns with significant amounts of missing data, we ensure that the remaining variables are representative of the dataset without overly relying on estimated or imputed values. This allows the analysis to remain focused on solid, complete data and gives a clearer picture of the relationships between offensive stats and wRC+.

In [5]:
# Calculate the percentage of missing data for each column
missing_percentage = data.isnull().mean()

# Set the threshold for missing data
threshold = 0.20

# Drop columns that have more than 20% missing data
data_cleaned = data.loc[:, missing_percentage <= threshold]

# Check how many columns remain after dropping
print(f"Remaining columns after dropping those with more than 10% missing data: {data_cleaned.shape[1]}")

# Show the first few rows of the cleaned data
data_cleaned.head()
Remaining columns after dropping those with more than 20% missing data: 328
Out[5]:
Name G PA HR R RBI SB BB% K% ISO ... Events.4 L-WAR.4 wOppTeamV.4 wTeamV.4 wNetBatV.4 TG.4 Bats.4 NameASCII PlayerId MLBAMID
0 Trea Turner 1125 4967 171 776 572 279.0 0.068452 0.185424 0.185201 ... 3670 39.062987 0.179598 0.000000 0.179598 1527 R Trea Turner 16252 607208.0
1 Kevin Kiermaier 1159 4040 95 499 378 132.0 0.068564 0.224752 0.156165 ... 2547 25.809922 0.000000 0.000000 0.000000 1851 L Kevin Kiermaier 11038 595281.0
2 Billy Hamilton 951 3285 24 454 189 326.0 0.070015 0.215525 0.086680 ... 1868 11.724086 0.000000 0.000000 0.000000 1622 B Billy Hamilton 10199 571740.0
3 Christian Yelich 1466 6383 204 945 748 205.0 0.120633 0.216356 0.178500 ... 3562 43.777566 0.229549 -0.041572 0.187977 1842 L Christian Yelich 11477 592885.0
4 José Ramírez 1451 6086 255 898 864 243.0 0.098587 0.119619 0.224847 ... 4494 52.777648 0.181303 -0.187837 -0.006534 1839 B Jose Ramirez 13510 608070.0

5 rows × 328 columns

We are handling missing values in the dataset to ensure that our analysis is based on complete data. Missing values can distort or obscure relationships between variables, leading to inaccurate or biased results. By imputing the median for numerical columns and the mode for categorical columns, we maintain the dataset's integrity while minimizing the impact of missing data. Imputation helps retain valuable information without the need for extensive data removal, ensuring that we can conduct a thorough and robust analysis.

In [6]:
# select numeric and categorical columns
numeric_columns = data_cleaned.select_dtypes(include=['float64', 'int64']).columns
categorical_columns = data_cleaned.select_dtypes(include=['object']).columns

# imputation
data_cleaned.loc[:, numeric_columns] = data_cleaned[numeric_columns].fillna(data_cleaned[numeric_columns].median())
data_cleaned.loc[:, categorical_columns] = data_cleaned[categorical_columns].fillna(data_cleaned[categorical_columns].mode().iloc[0])

# Check if there are any remaining missing values
remaining_missing_values = data_cleaned.isnull().sum().sum()
print(f"Total remaining missing values after imputation: {remaining_missing_values}")
Total remaining missing values after imputation: 0

Correlation Matrix¶

I am ignoring the player names because they are not relevant to the correlations. To avoid an unreadable, huge heatmap, I am going to focus on pairs with high correlation (absolute value above 0.95).

In [7]:
# Exclude the 'MLBAMID' column and select only numeric columns for the correlation matrix
numeric_columns = data_cleaned.drop(columns=['MLBAMID']).select_dtypes(include=['float64', 'int64'])

# Calculate the correlation matrix
correlation_matrix = numeric_columns.corr()

# Set a threshold for high correlation
threshold_2 = 0.95

# Select correlations above the threshold or below the negative threshold
high_corr = correlation_matrix[(correlation_matrix >= threshold_2) | (correlation_matrix <= -threshold_2)]

# Plot the heatmap of high correlations only
plt.figure(figsize=(10, 8))
sns.heatmap(high_corr, cmap='coolwarm', mask=high_corr.isnull())  # Mask NaN values
plt.title('Heatmap of High Correlations (|correlation| > 0.95)')
plt.show()

I know this output is not very readable; the fact that it remains dense even after applying a 0.95 threshold shows how strongly correlated the variables are and why this is a great dataset for this project. I will use a clustermap to make the structure a bit clearer.

Cluster Map¶

In [8]:
# Cluster the variables (clustermap builds its own figure, so the title goes on that figure)
cluster_grid = sns.clustermap(correlation_matrix, cmap='coolwarm', annot=False, figsize=(10, 8))
cluster_grid.fig.suptitle('Clustered Heatmap of Correlations')
plt.show()

The clustered heatmap above shows the correlation structure of various baseball statistics, with variables grouped based on their similarity. Dark red squares indicate strong positive correlations (close to 1), while dark blue squares represent strong negative correlations (close to -1). We can see that certain offensive metrics, such as home runs (HR), wRC+, and RBI, are tightly clustered, indicating they are highly correlated with one another. This suggests that these variables might exhibit multicollinearity, which could affect a standard regression model. By identifying these clusters, we can either address multicollinearity through penalized regression techniques like Ridge and Lasso, or potentially combine or drop some of the redundant variables to simplify the model.

VIF¶

In [9]:
# Drop 'MLBAMID' and 'Name' columns first
data_cleaned_no_id = data_cleaned.drop(columns=['MLBAMID', 'Name'])

# Select only numeric columns to ensure there are no strings
data_cleaned_numeric = data_cleaned_no_id.select_dtypes(include=['float64', 'int64'])

# Calculate the correlation matrix for numeric columns only
corr_matrix_full = data_cleaned_numeric.corr().abs()

# Select upper triangle of correlation matrix
upper_full = corr_matrix_full.where(np.triu(np.ones(corr_matrix_full.shape), k=1).astype(bool))

# Find features with correlation greater than 0.9
to_drop_full = [column for column in upper_full.columns if any(upper_full[column] > 0.9)]

# Drop these columns from the full dataset before calculating VIF
data_cleaned_reduced = data_cleaned_numeric.drop(columns=to_drop_full)

# Helper: compute the variance inflation factor (VIF) for each column
def calculate_vif(df):
    vif = pd.DataFrame()
    vif['Variable'] = df.columns
    vif['VIF'] = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]
    return vif

# Calculate VIF on the reduced cleaned dataset
vif_data_reduced = calculate_vif(data_cleaned_reduced)

# Display the reduced VIF values
print(vif_data_reduced)
     Variable           VIF
0           G  2.068593e+02
1          HR  3.800921e+01
2          SB  1.654689e+01
3         BB%  7.572503e+02
4          K%  7.572423e+01
5         ISO  5.505771e+01
6       BABIP  3.398217e+02
7         AVG  8.792624e+03
8         OBP  1.571361e+04
9         BsR  4.415162e+00
10        Off  1.169821e+02
11        Def           inf
12        WAR  3.554028e+02
13     Season  6.727782e+02
14        Age  1.658026e+02
15         3B  1.476968e+01
16         BB  4.995195e+01
17         SO  4.014212e+01
18        HBP  8.130748e+00
19         SH  4.244382e+00
20         CS  6.500553e+00
21       BB/K  1.000479e+01
22        Fld           inf
23        Pos           inf
24        Spd  3.193048e+01
25        wSB  7.014029e+00
26         Lg  1.481648e+01
27       TTO%  8.607953e+01
28       BB%+  6.922229e+01
29        K%+  2.325885e+01
30     Events  2.596641e+00
31  wOppTeamV  2.251800e+15
32     wTeamV  7.505999e+14
33   wNetBatV  9.007199e+14
34         TG  4.180905e+01
35   PlayerId  9.417390e+00
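
Even after dropping one of every pair with correlation above 0.9, several variables still show infinite or astronomically large VIFs: Def, Fld, and Pos are infinite (Def is, in FanGraphs' accounting, essentially Fld plus the positional adjustment), and the team-value columns reach values on the order of 10^15. Simple correlation filtering clearly does not eliminate the multicollinearity, which is precisely the situation Lasso and Ridge are designed to handle.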

Modeling¶

Preparing for modeling with Train test¶

In [10]:
# Drop any columns related to 'wRC+' (including variations like 'wRC+.1', 'wRC+.2', etc.)
data_cleaned_no_id = data_cleaned.drop(columns=[col for col in data_cleaned.columns if 'wRC+' in col])

# Ensure only numeric columns are used for X
X = data_cleaned_no_id.select_dtypes(include=['float64', 'int64'])  # Keep only numeric columns
y = data_cleaned['wRC+']  # Target variable remains the same

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
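
One caveat worth flagging: Lasso and Ridge penalize coefficients by their magnitude, so features on larger raw scales are effectively penalized less. The models below are fit on the raw features; as a hedged sketch, the already-imported StandardScaler could be slotted in like this (fit on the training split only, to avoid leakage):

In [ ]:
# Optional refinement (not applied in the fits below): standardize the
# features so the penalty treats every coefficient on an equal footing.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics on the test set
# X_train_scaled / X_test_scaled could then replace X_train / X_test below.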

Lasso¶

We will set alpha to 0.1.

In [11]:
# Initialize the Lasso model with alpha=0.1
lasso = Lasso(alpha=0.1)

# Fit the model on the training data
lasso.fit(X_train, y_train)

# Make predictions on the test data
y_pred_lasso = lasso.predict(X_test)

# Evaluate the model
mse_lasso = mean_squared_error(y_test, y_pred_lasso)
r2_lasso = r2_score(y_test, y_pred_lasso)

# Display the results
print(f"Lasso (alpha=0.1) - Mean Squared Error: {round(mse_lasso, 4)}, R² Score: {round(r2_lasso, 4)}")
Lasso (alpha=0.1) - Mean Squared Error: 7.5003, R² Score: 0.987

I chose an alpha value of 0.1 because it provides a stronger regularization effect, which helps to balance model complexity and performance. A higher alpha shrinks more coefficients and eliminates less important features, reducing overfitting and improving generalization to unseen data. A smaller alpha such as 0.05 might score slightly better on the test set, but it risks retaining too many features, which can lead to overfitting in the presence of multicollinearity. By opting for a moderate alpha value of 0.1, I allow Lasso to perform effective feature selection while maintaining a high degree of predictive accuracy, as evidenced by the R² score of 0.987 and the realistic mean squared error (MSE) of 7.5003.

Mean Squared Error (MSE): 7.5003 corresponds to a typical prediction error of roughly 2.7 wRC+ points (the square root of the MSE), which is small for a statistic centered at 100. A small, non-zero error is a healthier sign than an unrealistic result like an MSE of 0.0, which would suggest overfitting.

R² Score: 0.987 indicates that the model explains about 98.7% of the variance in the target variable (wRC+), which is very high. This suggests that the model generalizes well to the test data while leaving room for a small degree of unexplained variance, indicating a balance between accuracy and regularization.
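
Rather than fixing alpha by hand, the LassoCV import above could choose it by cross-validation. A minimal sketch (the alpha grid and fold count here are illustrative assumptions, not tuned values):

In [ ]:
# Sketch: let 5-fold cross-validation pick alpha instead of fixing it at 0.1.
# The alpha grid is an illustrative choice, not tuned for this dataset.
lasso_cv = LassoCV(alphas=np.logspace(-3, 1, 50), cv=5, random_state=42)
lasso_cv.fit(X_train, y_train)
print(f"CV-selected alpha: {lasso_cv.alpha_:.4f}")
print(f"Test R²: {r2_score(y_test, lasso_cv.predict(X_test)):.4f}")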

Top Features¶

In [12]:
# Get the coefficients from the Lasso model
coef = pd.Series(lasso.coef_, index=X_train.columns)

# Filter out features with zero coefficients (which Lasso has eliminated)
non_zero_coef = coef[coef != 0]

# Sort the non-zero coefficients by their absolute values in descending order
top_features = non_zero_coef.abs().sort_values(ascending=False).head(15)

# Display the top features
print("Top Features Selected by Lasso (alpha=0.1):")
print(top_features)
Top Features Selected by Lasso (alpha=0.1):
OBP+      0.725933
SLG+.1    0.268764
SLG+.2    0.241695
OBP+.1    0.239472
SLG+      0.219312
SLG+.3    0.210515
SLG+.4    0.184540
AVG+.4    0.130021
WAR.3     0.116953
WAR.2     0.116650
WAR.4     0.112014
AVG+.3    0.108920
WAR.5     0.107852
OBP+.2    0.092020
WAR.1     0.074377
dtype: float64

Analysis

The results from the Lasso regression (with alpha = 0.1) highlight some interesting insights into which stats are most predictive of a player's wRC+. Notably, the top features selected by the model are modern sabermetric metrics such as SLG+ and OBP+, rather than more traditional baseball statistics. (The .1, .2, etc. suffixes are simply pandas' automatic renaming of duplicate column names in the FanGraphs export; they refer to the same underlying stats.)

This is a strong indication that the evolution of sabermetric thinking—focusing on advanced metrics that account for various contextual factors—is crucial for understanding player performance in today’s game. The prominence of stats like SLG+ and OBP+ demonstrates that these advanced measurements, which better capture a player’s overall offensive contributions, are far more predictive of success than simpler, more conventional stats like batting average or raw hit totals.

These results emphasize how baseball has evolved in terms of player evaluation. As the game has grown more analytical, modern sabermetric statistics have proven to be more reliable in explaining and predicting player success, validating the shift towards using advanced metrics in the modern game. This reinforces the idea that the future of baseball analysis lies in building and understanding these nuanced, context-adjusted stats.

Ridge¶

In [13]:
# Initialize Ridge regression with an alpha value
ridge = Ridge(alpha=0.1)

# Fit the model on the training data
ridge.fit(X_train, y_train)

# Make predictions on the test data
y_pred_ridge = ridge.predict(X_test)

# Evaluate the model
mse_ridge = mean_squared_error(y_test, y_pred_ridge)
r2_ridge = r2_score(y_test, y_pred_ridge)

# Display the results
print(f"Ridge (alpha=0.1) - Mean Squared Error: {round(mse_ridge, 4)}, R² Score: {round(r2_ridge, 4)}")
Ridge (alpha=0.1) - Mean Squared Error: 6.0645, R² Score: 0.9895

Analysis:

  • Mean Squared Error (MSE): 6.0645 corresponds to a typical prediction error of about 2.5 wRC+ points on a statistic centered at 100, which means the model makes only small errors when predicting the target variable (wRC+) on the test set. This indicates that the model is performing very well and accurately captures the relationship between the input features and the target.

  • R² Score: 0.9895 means that the Ridge regression model explains 98.95% of the variance in the target variable (wRC+). This is an excellent score, suggesting that the model has very strong predictive ability on the test data.
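
As the robustness check described in Question 7, the two models' coefficients can be compared directly; broad agreement on the top-ranked features would reinforce the conclusions above. A minimal sketch using the already-fitted models:

In [ ]:
# Robustness check sketch: compare where Lasso and Ridge place their weight.
comparison = pd.DataFrame({
    'Lasso': pd.Series(lasso.coef_, index=X_train.columns),
    'Ridge': pd.Series(ridge.coef_, index=X_train.columns),
})
# Rank by the magnitude of the Lasso coefficient and show the Ridge counterpart
top10 = comparison['Lasso'].abs().sort_values(ascending=False).head(10).index
print(comparison.loc[top10])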

In [ ]:
!cp "/content/drive/MyDrive/Colab Notebooks/silverstein_penalized_regression.ipynb" ./
!jupyter nbconvert --to html "silverstein_penalized_regression.ipynb"