Perspective: From what perspective are you conducting the analysis?
Question: What is your question?
Can we use ARIMA models to predict a pitcher's future performance across several key metrics—such as wOBA allowed, strikeout percentage (K%), and walk percentage (BB%)—based on their historical stats?
Dataset: Describe your dataset(s) including URL (if available).
Data Format:
Independent and Dependent Variables: What is(are) your independent variable(s) and dependent variable(s)? Include variable type (binary, categorical, numeric). If you have many variables, you can list the most important and summarize the rest (e.g. important variables are... also, 12 other binary, 5 categorical...).
Suitability of Variables for ARIMA: How are your variables suitable for your analysis method?
Each of these metrics (wOBA, K%, BB%) is time-dependent and fluctuates based on player performance over different seasons. They are suitable for ARIMA because:
Conclusions: What are your conclusions (include references to one or two CLEARLY INDICATED AND IMPORTANT graphs or tables in your output)?
From the ARIMA models applied to Whiff%, wOBA, K%, and BB%, we can draw the following conclusions:
Whiff%: The model forecasts a slight decline in whiff percentage over the next five years. The forecast (see "Whiff% Forecast" graph) levels off around 24.5%, suggesting that pitchers' ability to generate swings and misses is likely to plateau after a period of growth.
wOBA: The wOBA forecast (see "wOBA Forecast" graph) predicts a flat trend, stabilizing around 0.3105. This indicates no significant increase or decrease in offensive effectiveness, reflecting stability in the league's overall offensive performance.
K%: The forecast for K% (see "K% Forecast" graph) indicates a slight decline from 22.5% to around 22.1%. This suggests that the rapid rise in strikeouts observed in the past is likely to slow down, with K% stabilizing in the coming years.
BB%: The BB% forecast (see "BB% Forecast" graph) predicts a flat trend around 7.43%. Like the other metrics, this suggests stability, with no major fluctuations in walk rates expected in the near future.
In summary, the forecasts across all metrics show a trend toward stabilization, with no major increases or decreases expected in the next five years. This suggests that many key performance metrics in baseball are entering a period of relative consistency.
Assumptions: What are your assumptions and limitations? What robustness checks did you perform or would you perform?
%%capture
%pip install pmdarima
# Suppress warnings and cell output
%%capture
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
from statsmodels.tsa.arima.model import ARIMA
from pmdarima import auto_arima
from statsmodels.tsa.stattools import adfuller
# Silence all warnings for cleaner notebook output (this single filter covers every category)
warnings.filterwarnings('ignore')
data = pd.read_csv("/content/stats.csv")
print(data.describe())
print(data.shape)
           player_id         year           pa    k_percent   bb_percent  \
count    1106.000000  1106.000000  1106.000000  1106.000000  1106.000000
mean   562075.849910  2019.299277   643.334539    22.185986     7.434448
std     86158.967831     2.897329   160.593994     4.951807     1.962328
min    112526.000000  2015.000000   189.000000    10.400000     2.000000
25%    502171.000000  2017.000000   563.250000    18.800000     6.100000
50%    579328.000000  2019.000000   669.000000    21.500000     7.300000
75%    622608.000000  2022.000000   758.000000    25.100000     8.700000
max    694973.000000  2024.000000   951.000000    41.100000    17.900000

              woba        bacon  z_swing_percent  z_swing_miss_percent  \
count  1106.000000  1106.000000      1106.000000           1106.000000
mean      0.310686     0.325090        66.860036             16.652080
std       0.031227     0.026412         3.088641              3.642283
min       0.204000     0.242000        55.800000              6.600000
25%       0.290000     0.307000        64.800000             14.125000
50%       0.312000     0.325000        66.900000             16.550000
75%       0.332000     0.344000        68.900000             18.900000
max       0.417000     0.398000        76.100000             31.700000

       oz_swing_percent  ...  flyballs_percent  n_ff_formatted  ff_avg_speed  \
count       1106.000000  ...       1106.000000     1069.000000   1069.000000
mean          28.857414  ...         23.410127       33.876239     92.871656
std            3.004760  ...          4.879343       16.366793      2.354689
min           19.700000  ...          9.800000        0.000000     82.300000
25%           26.900000  ...         20.000000       22.600000     91.500000
50%           28.800000  ...         23.400000       35.600000     92.900000
75%           30.700000  ...         26.800000       46.300000     94.400000
max           39.800000  ...         38.800000       72.100000     99.100000

       ff_avg_spin  ff_avg_break_x  ff_avg_break_z  offspeed_avg_break_z  \
count  1069.000000     1069.000000     1069.000000           1086.000000
mean   2247.242283       -3.038728      -16.380449            -30.807919
std     145.896179        7.532372        3.024717              4.034539
min    1792.000000      -15.700000      -31.300000            -43.300000
25%    2150.000000       -8.500000      -17.900000            -33.575000
50%    2248.000000       -5.800000      -16.100000            -30.800000
75%    2348.000000        3.000000      -14.300000            -28.100000
max    2779.000000       18.100000       -9.200000            -15.400000

       offspeed_avg_break_z_induced  offspeed_avg_break  offspeed_range_speed
count                   1086.000000         1086.000000           1083.000000
mean                       7.318508           15.816298              1.583564
std                        3.744186            2.712864              0.429232
min                       -5.600000            4.000000              0.900000
25%                        4.725000           14.300000              1.300000
50%                        7.400000           16.000000              1.500000
75%                        9.600000           17.600000              1.700000
max                       22.300000           25.000000              6.300000

[8 rows x 36 columns]
(1106, 37)
print(data.isnull().sum())
last_name, first_name            0
player_id                        0
year                             0
pa                               0
k_percent                        0
bb_percent                       0
woba                             0
bacon                            0
z_swing_percent                  0
z_swing_miss_percent             0
oz_swing_percent                 0
oz_swing_miss_percent            0
oz_contact_percent               0
out_zone_swing_miss              0
out_zone_swing                   0
out_zone_percent                 0
out_zone                         0
meatball_swing_percent           0
meatball_percent                 0
pitch_count_offspeed             0
whiff_percent                    0
swing_percent                    0
straightaway_percent             0
batted_ball                      0
f_strike_percent                 0
groundballs_percent              0
groundballs                      0
flyballs_percent                 0
n_ff_formatted                  37
ff_avg_speed                    37
ff_avg_spin                     37
ff_avg_break_x                  37
ff_avg_break_z                  37
offspeed_avg_break_z            20
offspeed_avg_break_z_induced    20
offspeed_avg_break              20
offspeed_range_speed            23
dtype: int64
# Columns with missing values
columns_with_missing = ['n_ff_formatted', 'ff_avg_speed', 'ff_avg_spin',
'ff_avg_break_x', 'ff_avg_break_z', 'offspeed_avg_break_z',
'offspeed_avg_break_z_induced', 'offspeed_avg_break', 'offspeed_range_speed']
# Plot histograms for columns with missing values
plt.figure(figsize=(15,10))
for i, col in enumerate(columns_with_missing, 1):
    plt.subplot(3, 3, i)
    sns.histplot(data[col], kde=True, bins=20)
    plt.title(f'{col} Distribution')
plt.tight_layout()
plt.show()
# Impute using the mean for normally distributed columns
columns_mean = ['ff_avg_speed', 'ff_avg_spin', 'offspeed_avg_break']
for col in columns_mean:
    # Assign the result back; chained inplace fillna is deprecated in recent pandas
    data[col] = data[col].fillna(data[col].mean())
# Impute using the median for skewed columns
columns_median = ['n_ff_formatted', 'ff_avg_break_x', 'ff_avg_break_z',
                  'offspeed_avg_break_z', 'offspeed_avg_break_z_induced', 'offspeed_range_speed']
for col in columns_median:
    data[col] = data[col].fillna(data[col].median())
# Confirm that no missing values remain
print(data.isnull().sum())
last_name, first_name           0
player_id                       0
year                            0
pa                              0
k_percent                       0
bb_percent                      0
woba                            0
bacon                           0
z_swing_percent                 0
z_swing_miss_percent            0
oz_swing_percent                0
oz_swing_miss_percent           0
oz_contact_percent              0
out_zone_swing_miss             0
out_zone_swing                  0
out_zone_percent                0
out_zone                        0
meatball_swing_percent          0
meatball_percent                0
pitch_count_offspeed            0
whiff_percent                   0
swing_percent                   0
straightaway_percent            0
batted_ball                     0
f_strike_percent                0
groundballs_percent             0
groundballs                     0
flyballs_percent                0
n_ff_formatted                  0
ff_avg_speed                    0
ff_avg_spin                     0
ff_avg_break_x                  0
ff_avg_break_z                  0
offspeed_avg_break_z            0
offspeed_avg_break_z_induced    0
offspeed_avg_break              0
offspeed_range_speed            0
dtype: int64
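The mean-versus-median choice above was made by reading the histograms. As a complement, the same decision can be driven by a numeric skewness check. The following is a minimal sketch under stated assumptions: the helper `impute_by_skew`, the 0.5 threshold, and the toy DataFrame are all hypothetical and are not part of the original analysis.

```python
import numpy as np
import pandas as pd

def impute_by_skew(df, columns, skew_threshold=0.5):
    """Fill NaNs with the mean for roughly symmetric columns, else the median.

    skew_threshold is an arbitrary illustrative cutoff, not a standard value.
    """
    out = df.copy()
    choices = {}
    for col in columns:
        skew = out[col].skew()  # sample skewness; NaNs are ignored
        use_median = abs(skew) > skew_threshold
        fill_value = out[col].median() if use_median else out[col].mean()
        out[col] = out[col].fillna(fill_value)
        choices[col] = "median" if use_median else "mean"
    return out, choices

# Hypothetical toy data: one symmetric column, one strongly right-skewed column
toy = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0, np.nan],
    "skewed": [1.0, 1.0, 1.0, 2.0, 50.0, np.nan],
})
filled, choices = impute_by_skew(toy, ["symmetric", "skewed"])
print(choices)
print(filled)
```

Applied to this notebook, the call would take `data` and `columns_with_missing`; whether the resulting split matches the histogram-based one is worth checking rather than assuming.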
wOBA: measures how effectively a pitcher limits offensive production. It weights each outcome allowed (walks, hit-by-pitches, singles, extra-base hits, home runs) by its run value, providing a comprehensive view of a pitcher's ability to suppress scoring.
plt.subplot(1, 3, 1)
sns.histplot(data['woba'], kde=True, bins=20)
plt.title('wOBA Distribution')
Text(0.5, 1.0, 'wOBA Distribution')
The distribution of wOBA allowed is approximately normal and centered near 0.31, indicating that most pitchers allow hitters to produce at roughly an average offensive level. The spread is modest: a few pitchers hold hitters well below that level (lower wOBA, better for the pitcher), while a few allow well above it (higher wOBA, worse for the pitcher).
K%: represents the percentage of a pitcher's total plate appearances that result in a strikeout, serving as a measure of how often a pitcher is able to retire batters via strikeout.
# Strikeout Percentage (K%) distribution
plt.subplot(1, 3, 2)
sns.histplot(data['k_percent'], kde=True, bins=20)
plt.title('K% Distribution')
Text(0.5, 1.0, 'K% Distribution')
The distribution of K% (Strikeout Percentage) is right-skewed, with most pitchers having a strikeout rate between 15% and 25%, peaking around 20%. A smaller number of pitchers achieve higher strikeout rates, with a few exceeding 30%, indicating they are elite in striking out batters.
BB%: represents the percentage of a pitcher's total plate appearances that result in a walk, measuring how frequently a pitcher allows hitters to reach base via walks.
# Walk Percentage (BB%) distribution
plt.subplot(1, 3, 3)
sns.histplot(data['bb_percent'], kde=True, bins=20)
plt.title('BB% Distribution')
plt.tight_layout()
plt.show()
The distribution of BB% (Walk Percentage) is tightly clustered around 5% to 10%, with most pitchers falling within this range, indicating that the majority allow walks at a fairly moderate rate. There is a small tail toward higher walk rates above 10%, showing that a few pitchers struggle more with control, issuing walks more frequently.
Whiff %: represents the percentage of swings that result in a miss. It is a key measure of how often a pitcher can make batters swing and miss, indicating a pitcher's ability to dominate hitters and generate strikeouts.
# Plot the distribution of Whiff Percentage (whiff_percent)
plt.figure(figsize=(5,5))
sns.histplot(data['whiff_percent'], kde=True, bins=20)
plt.title('Whiff% Distribution')
plt.xlabel('Whiff Percentage')
plt.ylabel('Count')
plt.show()
The distribution of Whiff Percentage (Whiff%) is approximately normal, with most pitchers recording a whiff rate between 20% and 30%, peaking around 25%. There are fewer pitchers with extremely high or low whiff rates, indicating that the majority of pitchers induce swings and misses at a moderate rate, with a few elite pitchers exceeding 35%.
# Plot the yearly average of each metric in a 4-row, 1-column grid
fig, axes = plt.subplots(4, 1, figsize=(10, 16))
panels = [
    ('whiff_percent', 'Whiff% Over Time', 'Whiff%'),
    ('woba', 'wOBA Over Time', 'wOBA'),
    ('k_percent', 'K% (Strikeout Percentage) Over Time', 'K%'),
    ('bb_percent', 'BB% (Walk Percentage) Over Time', 'BB%'),
]
for ax, (column, title, ylabel) in zip(axes, panels):
    # errorbar=None replaces the deprecated ci=None (seaborn >= 0.12)
    sns.lineplot(data=data, x='year', y=column, ax=ax, errorbar=None)
    ax.set_title(title)
    ax.set_xlabel('Year')
    ax.set_ylabel(ylabel)
# Adjust layout
plt.tight_layout()
plt.show()
Interpretation
wOBA (weighted on-base average)
K% (strikeout percentage)
BB% (walk percentage)
What We Can Learn for Modeling:
Modeling Whiff% and K%: Since both Whiff% and K% are showing a general upward trend, incorporating these variables into time series models (like ARIMA or exponential smoothing) may help predict future trends in pitcher dominance. The close relationship between these metrics suggests that Whiff% could be a strong predictor of future K% performance.
wOBA Modeling: The downward trend in wOBA suggests that pitchers are becoming more effective at limiting offensive production. Predictive models for wOBA should consider factors like Whiff% and K%, as they seem correlated with limiting offensive success.
BB% Variability: The fluctuating BB% suggests more unpredictability in pitchers' control. A more complex model might be required to predict BB%, such as using external variables (e.g., pitch type, pitch velocity) to explain its variability over time.
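Before treating Whiff% as a predictor of K%, the claimed relationship can be quantified on the yearly averages. The sketch below computes a same-year correlation and a one-year-lagged correlation. The rows are hypothetical placeholder values, not figures from stats.csv; only the column names (`year`, `whiff_percent`, `k_percent`) come from the dataset.

```python
import pandas as pd

# Hypothetical pitcher-season rows, two pitchers per year, for illustration only
toy = pd.DataFrame({
    "year": [2015, 2015, 2016, 2016, 2017, 2017, 2018, 2018],
    "whiff_percent": [21.0, 23.0, 22.0, 24.0, 23.0, 25.0, 24.0, 26.0],
    "k_percent": [19.0, 21.0, 20.0, 22.0, 21.5, 23.5, 22.0, 24.0],
})

# Collapse to one row per year, mirroring the aggregation used for forecasting
yearly = toy.groupby("year")[["whiff_percent", "k_percent"]].mean()

# Same-year Pearson correlation between the two yearly series
same_year_corr = yearly["whiff_percent"].corr(yearly["k_percent"])

# Correlation between last year's Whiff% and this year's K%
lagged_corr = yearly["whiff_percent"].shift(1).corr(yearly["k_percent"])

print(yearly)
print(f"same-year correlation: {same_year_corr:.3f}")
print(f"lag-1 correlation: {lagged_corr:.3f}")
```

With only about ten yearly points in the real data, either correlation will be noisy, so a high value is suggestive rather than conclusive. Two series that both trend upward will also correlate even without a direct link.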
Key Insights for Modeling:
def forecast_metric(data, metric_column, metric_name, forecast_years=5):
    # Step 1: Aggregate the metric data by year
    metric_by_year = data.groupby('year')[metric_column].mean()
    # Step 2: Plot the original time series for the metric
    plt.figure(figsize=(10, 5))
    plt.plot(metric_by_year.index, metric_by_year.values, marker='o')
    plt.title(f'{metric_name} Over Time')
    plt.xlabel('Year')
    plt.ylabel(metric_name)
    plt.grid(True)
    plt.show()
    # Step 3: Use Auto ARIMA to determine the best p, d, q parameters
    auto_model = auto_arima(metric_by_year, seasonal=False, trace=True, suppress_warnings=True)
    # Step 4: Fit the ARIMA model based on the recommended (p, d, q) values
    best_pdq = auto_model.order
    print(f"Best ARIMA order for {metric_name}: {best_pdq}")
    # Fit the model with the best order found by auto_arima
    model = ARIMA(metric_by_year, order=best_pdq)
    model_fit = model.fit()
    # Step 5: Summary of the model
    print(model_fit.summary())
    # Step 6: Forecast future metric values (next 5 years by default)
    forecast = model_fit.forecast(steps=forecast_years)
    # Step 7: Create a future index for years to forecast
    future_years = list(range(metric_by_year.index[-1] + 1, metric_by_year.index[-1] + 1 + forecast_years))
    # Step 8: Plot the forecast
    plt.figure(figsize=(10, 5))
    plt.plot(metric_by_year.index, metric_by_year.values, label=f'Historical {metric_name}')
    plt.plot(future_years, forecast, label=f'Forecasted {metric_name}', marker='o', linestyle='--')
    plt.title(f'{metric_name} Forecast')
    plt.xlabel('Year')
    plt.ylabel(metric_name)
    plt.legend()
    plt.grid(True)
    plt.show()
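Step 1 of the function gives every pitcher-season equal weight in the yearly average. An alternative, shown here only as a sketch of a possible variation, weights each pitcher by batters faced (the `pa` column) so that the yearly figure is closer to a league-wide rate. The helper name `weighted_yearly_mean` and the toy rows are hypothetical.

```python
import pandas as pd

def weighted_yearly_mean(df, metric_column, weight_column="pa"):
    """Yearly mean of metric_column, weighting each row by weight_column."""
    weighted_values = df[metric_column] * df[weight_column]
    yearly_weighted_sum = weighted_values.groupby(df["year"]).sum()
    yearly_weight_total = df[weight_column].groupby(df["year"]).sum()
    return yearly_weighted_sum / yearly_weight_total

# Hypothetical rows for illustration only
toy = pd.DataFrame({
    "year": [2023, 2023, 2024, 2024],
    "pa": [100, 300, 200, 200],
    "k_percent": [20.0, 30.0, 22.0, 24.0],
})

simple = toy.groupby("year")["k_percent"].mean()
weighted = weighted_yearly_mean(toy, "k_percent")
print(pd.DataFrame({"simple": simple, "pa_weighted": weighted}))
```

The two averages coincide when workloads are equal (2024 in the toy data) and diverge when a high-workload pitcher differs from the rest (2023). Which one is appropriate depends on whether the question concerns the typical pitcher or the league as a whole.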
forecast_metric(data, 'whiff_percent', 'Whiff%', forecast_years=5)
Historical Whiff% Trend:
The Whiff% has generally been increasing from 2015 to 2020, peaking around 26%. However, after 2020, we see some decline and fluctuations in the percentage, indicating a potential plateau or slight decrease in pitchers' ability to induce swings and misses after reaching a peak.
ARIMA Model Summary:
The ARIMA(1,0,0) model was selected as the best fit for the data. The model includes one autoregressive term (AR1), which means the current Whiff% is modeled as a function of the previous year's Whiff%. The AR1 coefficient (0.8618) indicates a strong positive relationship: the previous year's value heavily influences the current year's value. The AIC (28.969) and BIC (29.877) have no absolute scale, so they do not by themselves indicate a good fit; they matter only in comparison with other candidate orders, and auto_arima chose this order because it scored lowest on AIC among the models it tried. With only ten annual observations (2015–2024), the coefficient estimate is also imprecise.
Whiff% Forecast:
The forecast for Whiff% from 2025 to 2029 shows a slight, steady decline rather than a sharp drop, suggesting that Whiff% may continue to drift downward slowly from its recent levels.
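The gradual shape of this forecast follows from how an AR(1) model with a constant produces point forecasts: each step moves the prediction a fixed fraction of the way back toward the series' long-run mean, ŷ(T+h) = μ + φ^h · (y(T) − μ). A minimal sketch of that recursion uses the AR1 coefficient reported in the summary (0.8618). The long-run mean and the last observed value below are illustrative assumptions, not numbers read from the fitted model.

```python
def ar1_forecast(last_value, long_run_mean, phi, steps):
    """h-step-ahead point forecasts for an AR(1) process with a constant.

    Implements y_hat[T+h] = mu + phi**h * (y[T] - mu) for h = 1..steps.
    """
    gap = last_value - long_run_mean
    return [long_run_mean + (phi ** h) * gap for h in range(1, steps + 1)]

PHI = 0.8618          # AR1 coefficient from the model summary
ASSUMED_MEAN = 24.5   # hypothetical long-run mean, for illustration only
ASSUMED_LAST = 25.0   # hypothetical last observed Whiff%, for illustration only

forecasts = ar1_forecast(ASSUMED_LAST, ASSUMED_MEAN, PHI, steps=5)
for h, value in enumerate(forecasts, start=1):
    print(f"step {h}: {value:.3f}")
```

Because 0 < φ < 1, the gap to the mean shrinks by the same factor every year. A forecast that starts above the mean therefore declines at a decreasing rate and never crosses it, which is the pattern described above.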
Conclusion:
After peaking in 2020, Whiff% is expected to gradually decline based on the model, reflecting a possible stabilization or slight weakening of pitchers' ability to induce swings and misses in the near future.
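With so few annual observations, one way to gauge how much weight a conclusion like this deserves is a rolling-origin check against simple baselines. The sketch below illustrates how such a check could look; it was not run as part of this analysis. The series values are hypothetical placeholders rather than the actual yearly Whiff% means, and the two baseline forecasters stand in for whatever model is being evaluated.

```python
def rolling_origin_mae(series, forecaster, min_train=5):
    """Mean absolute error of one-step-ahead forecasts over expanding windows.

    For each split point, forecaster sees series[:split] and predicts series[split].
    """
    errors = []
    for split in range(min_train, len(series)):
        train, actual = series[:split], series[split]
        errors.append(abs(forecaster(train) - actual))
    return sum(errors) / len(errors)

def naive_last_value(train):
    # Predict that next year equals the most recent observation
    return train[-1]

def historical_mean(train):
    # Predict that next year equals the average of all past observations
    return sum(train) / len(train)

# Hypothetical yearly averages, for illustration only
toy_series = [21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 25.5, 25.0, 24.8, 24.6]

naive_mae = rolling_origin_mae(toy_series, naive_last_value)
mean_mae = rolling_origin_mae(toy_series, historical_mean)
print(f"naive last-value MAE: {naive_mae:.3f}")
print(f"historical-mean MAE:  {mean_mae:.3f}")
```

An ARIMA fit could be passed in as a third `forecaster` that refits on each training window. If it cannot beat the naive last-value baseline under this scheme, the forecast adds little beyond "next year looks like this year".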
forecast_metric(data, 'woba', 'wOBA', forecast_years=5)
Historical wOBA (2015–2024):
ARIMA Model Summary:
Forecast (2025–2029):
Model Diagnostics:
Conclusion:
forecast_metric(data, 'k_percent', 'K%', forecast_years=5)
Historical K% (2015–2024):
ARIMA Model Selection:
Model Summary:
K% Forecast (2025–2029):
Model Diagnostics:
Conclusion:
forecast_metric(data, 'bb_percent', 'BB%', forecast_years=5)
Historical BB% (2015–2024):
ARIMA Model Selection:
Model Summary:
BB% Forecast (2025–2029):
Model Diagnostics:
Conclusion:
!cp "/content/drive/MyDrive/Colab Notebooks/silverstein_time_series.ipynb" ./
!jupyter nbconvert --to html "silverstein_time_series.ipynb"