Time Series Analysis Predicting Future Pitching Performances¶

Created by Scott Silverstein¶

Questions¶

Perspective: From what perspective are you conducting the analysis?

  • For this analysis, we assume the perspective of a baseball blogger who is passionate about advanced baseball analytics. The aim is to show readers how a pitcher's future performance can be predicted from a handful of key stats, and to give data-driven predictions about which pitchers may improve or decline over the next few seasons.

Question: What is your question?

Can we use ARIMA models to predict a pitcher's future performance across several key metrics—such as wOBA allowed, strikeout percentage (K%), and walk percentage (BB%)—based on their historical stats?

Dataset: Describe your dataset(s) including URL (if available).

  • Dataset Source: The dataset comes from the Baseball Savant Leaderboard (baseballsavant.com). It contains detailed pitching statistics for MLB players across multiple seasons, including wOBA, strikeout percentage (K%), walk percentage (BB%), and other metrics used for forecasting.

Data Format:

  • CSV file containing columns for year, player_id, and multiple performance metrics such as wOBA, K%, BB%, and others.
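
Because the file holds one row per player per season, a single pitcher's history can be pulled out as a yearly series. The sketch below is a minimal illustration on a made-up frame (the values are not from the dataset); `player_series` is a hypothetical helper, and only the column names `player_id`, `year`, and `k_percent` come from the CSV described here.

```python
import pandas as pd

def player_series(frame, player_id, metric):
    """Return one player's metric indexed by year, sorted chronologically."""
    rows = frame.loc[frame["player_id"] == player_id, ["year", metric]]
    return rows.sort_values("year").set_index("year")[metric]

# Made-up rows for two players (illustrative only)
toy = pd.DataFrame({
    "player_id": [1, 1, 2, 1, 2],
    "year": [2016, 2015, 2015, 2017, 2016],
    "k_percent": [21.0, 20.0, 25.0, 22.5, 26.0],
})
series = player_series(toy, player_id=1, metric="k_percent")
print(series)
```

A series like this is what a per-pitcher ARIMA model would take as input. A single career supplies very few yearly points, so any such fit should be read with caution.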

Independent and Dependent Variables: What is (are) your independent variable(s) and dependent variable(s)? Include variable type (binary, categorical, numeric). If you have many variables, you can list the most important and summarize the rest (e.g. important variables are... also, 12 other binary, 5 categorical...).

  • Independent Variable (Time Component): year (numeric). This serves as the time component for ARIMA, capturing the yearly trend in player performance.
  • Dependent Variables (Stats for Forecasting): each of these numeric stats will be modeled separately using ARIMA to forecast future performance.
    • wOBA (Weighted On-Base Average allowed): Measures how well hitters perform against a pitcher, considering the quality of hits.
    • Strikeout Percentage (K%): Percentage of total plate appearances that end in a strikeout.
    • Walk Percentage (BB%): Percentage of total plate appearances that end in a walk.

Suitability of Variables for ARIMA: How are your variables suitable for your analysis method?

  • Each of these metrics (wOBA, K%, BB%) is time-dependent and fluctuates based on player performance over different seasons. They are suitable for ARIMA because:

    • These stats typically exhibit trends over time (e.g., gradual improvement or decline in strikeouts or walks).
    • ARIMA can capture such trends and project them into the future.
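
As a rough illustration of how ARIMA deals with a trend: the "I" (integrated) part differences the series d times before any AR or MA terms are fit. The sketch below applies a first difference (d=1) to made-up yearly values that are not taken from the dataset; `first_difference` is a hypothetical helper.

```python
def first_difference(values):
    """Return the year-over-year changes of a sequence (d=1 differencing)."""
    return [later - earlier for earlier, later in zip(values, values[1:])]

# Made-up K% values with a steady upward drift (illustrative only)
toy_k_percent = [20.0, 20.5, 21.0, 21.5, 22.0]
changes = first_difference(toy_k_percent)
print(changes)  # the steady drift becomes a constant year-over-year change
```

After differencing, a steady trend turns into a roughly constant series, which is the kind of input the AR and MA terms are designed for.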

Conclusions: What are your conclusions (include references to one or two CLEARLY INDICATED AND IMPORTANT graphs or tables in your output)?

From the ARIMA models applied to the yearly averages of Whiff%, wOBA, K%, and BB% across pitchers, we can draw the following conclusions:

  1. Whiff%: The model suggests a slight decline in whiff percentage over the next five years. The forecasted Whiff% (see "Whiff% Forecast" graph) indicates a stabilization around 24.5%, showing that swing-and-miss ability is likely to level off after a period of growth.

  2. wOBA: The wOBA forecast (see "wOBA Forecast" graph) predicts a flat trend, stabilizing around 0.3105. This indicates no significant increase or decrease in offensive effectiveness, reflecting stability in the league's overall offensive performance.

  3. K%: The forecast for K% (see "K% Forecast" graph) indicates a slight decline from 22.5% to around 22.1%. This suggests that the rapid rise in strikeouts observed in the past is likely to slow down, with K% stabilizing in the coming years.

  4. BB%: The BB% forecast (see "BB% Forecast" graph) predicts a flat trend around 7.43%. Like the other metrics, this suggests stability, with no major fluctuations in walk rates expected in the near future.

In summary, the forecasts across all metrics show a trend toward stabilization, with no major increases or decreases expected in the next five years. This suggests that many key performance metrics in baseball are entering a period of relative consistency.

Assumptions: What are your assumptions and limitations? What robustness checks did you perform or would you perform?

  • Assumptions:
    • The player's performance in terms of wOBA, K%, and BB% will follow similar trends to what has been observed in the past.
    • There are no significant external factors (like injuries or major role changes) that drastically alter performance beyond what the ARIMA model can capture.
  • Limitations:
    • ARIMA models do not account for external changes such as player trades, injuries, or significant changes in training and strategy.
    • Projections are based solely on historical trends and may not account for sudden improvements or declines due to coaching changes or team adjustments.
  • Robustness Check:
    • Conduct stationarity tests (e.g., Augmented Dickey-Fuller) to ensure the data is suitable for ARIMA. Apply differencing to transform the data if needed.
In [1]:
%%capture
pip install pmdarima
In [2]:
%%capture
# Suppress cell output (a cell magic such as %%capture must be the first line of the cell)

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
from statsmodels.tsa.arima.model import ARIMA
from pmdarima import auto_arima
from statsmodels.tsa.stattools import adfuller
In [3]:
# Silence all warnings; a bare 'ignore' filter already covers UserWarning, FutureWarning, and every other category
warnings.filterwarnings('ignore')

EDA¶

Summary Stats¶

In [4]:
data = pd.read_csv("/content/stats.csv")
print(data.describe())
print(data.shape)
           player_id         year           pa    k_percent   bb_percent  \
count    1106.000000  1106.000000  1106.000000  1106.000000  1106.000000   
mean   562075.849910  2019.299277   643.334539    22.185986     7.434448   
std     86158.967831     2.897329   160.593994     4.951807     1.962328   
min    112526.000000  2015.000000   189.000000    10.400000     2.000000   
25%    502171.000000  2017.000000   563.250000    18.800000     6.100000   
50%    579328.000000  2019.000000   669.000000    21.500000     7.300000   
75%    622608.000000  2022.000000   758.000000    25.100000     8.700000   
max    694973.000000  2024.000000   951.000000    41.100000    17.900000   

              woba        bacon  z_swing_percent  z_swing_miss_percent  \
count  1106.000000  1106.000000      1106.000000           1106.000000   
mean      0.310686     0.325090        66.860036             16.652080   
std       0.031227     0.026412         3.088641              3.642283   
min       0.204000     0.242000        55.800000              6.600000   
25%       0.290000     0.307000        64.800000             14.125000   
50%       0.312000     0.325000        66.900000             16.550000   
75%       0.332000     0.344000        68.900000             18.900000   
max       0.417000     0.398000        76.100000             31.700000   

       oz_swing_percent  ...  flyballs_percent  n_ff_formatted  ff_avg_speed  \
count       1106.000000  ...       1106.000000     1069.000000   1069.000000   
mean          28.857414  ...         23.410127       33.876239     92.871656   
std            3.004760  ...          4.879343       16.366793      2.354689   
min           19.700000  ...          9.800000        0.000000     82.300000   
25%           26.900000  ...         20.000000       22.600000     91.500000   
50%           28.800000  ...         23.400000       35.600000     92.900000   
75%           30.700000  ...         26.800000       46.300000     94.400000   
max           39.800000  ...         38.800000       72.100000     99.100000   

       ff_avg_spin  ff_avg_break_x  ff_avg_break_z  offspeed_avg_break_z  \
count  1069.000000     1069.000000     1069.000000           1086.000000   
mean   2247.242283       -3.038728      -16.380449            -30.807919   
std     145.896179        7.532372        3.024717              4.034539   
min    1792.000000      -15.700000      -31.300000            -43.300000   
25%    2150.000000       -8.500000      -17.900000            -33.575000   
50%    2248.000000       -5.800000      -16.100000            -30.800000   
75%    2348.000000        3.000000      -14.300000            -28.100000   
max    2779.000000       18.100000       -9.200000            -15.400000   

       offspeed_avg_break_z_induced  offspeed_avg_break  offspeed_range_speed  
count                   1086.000000         1086.000000           1083.000000  
mean                       7.318508           15.816298              1.583564  
std                        3.744186            2.712864              0.429232  
min                       -5.600000            4.000000              0.900000  
25%                        4.725000           14.300000              1.300000  
50%                        7.400000           16.000000              1.500000  
75%                        9.600000           17.600000              1.700000  
max                       22.300000           25.000000              6.300000  

[8 rows x 36 columns]
(1106, 37)

Missing Values¶

In [5]:
print(data.isnull().sum())
last_name, first_name            0
player_id                        0
year                             0
pa                               0
k_percent                        0
bb_percent                       0
woba                             0
bacon                            0
z_swing_percent                  0
z_swing_miss_percent             0
oz_swing_percent                 0
oz_swing_miss_percent            0
oz_contact_percent               0
out_zone_swing_miss              0
out_zone_swing                   0
out_zone_percent                 0
out_zone                         0
meatball_swing_percent           0
meatball_percent                 0
pitch_count_offspeed             0
whiff_percent                    0
swing_percent                    0
straightaway_percent             0
batted_ball                      0
f_strike_percent                 0
groundballs_percent              0
groundballs                      0
flyballs_percent                 0
n_ff_formatted                  37
ff_avg_speed                    37
ff_avg_spin                     37
ff_avg_break_x                  37
ff_avg_break_z                  37
offspeed_avg_break_z            20
offspeed_avg_break_z_induced    20
offspeed_avg_break              20
offspeed_range_speed            23
dtype: int64
In [6]:
# Columns with missing values
columns_with_missing = ['n_ff_formatted', 'ff_avg_speed', 'ff_avg_spin',
                        'ff_avg_break_x', 'ff_avg_break_z', 'offspeed_avg_break_z',
                        'offspeed_avg_break_z_induced', 'offspeed_avg_break', 'offspeed_range_speed']

# Plot histograms for columns with missing values
plt.figure(figsize=(15,10))

for i, col in enumerate(columns_with_missing, 1):
    plt.subplot(3, 3, i)
    sns.histplot(data[col], kde=True, bins=20)
    plt.title(f'{col} Distribution')

plt.tight_layout()
plt.show()
In [7]:
# Impute using the mean for normally distributed columns
# (assign the result back; calling fillna(inplace=True) on a column selection is deprecated in recent pandas)
columns_mean = ['ff_avg_speed', 'ff_avg_spin', 'offspeed_avg_break']
for col in columns_mean:
    data[col] = data[col].fillna(data[col].mean())

# Impute using the median for skewed columns
columns_median = ['n_ff_formatted', 'ff_avg_break_x', 'ff_avg_break_z',
                  'offspeed_avg_break_z', 'offspeed_avg_break_z_induced', 'offspeed_range_speed']
for col in columns_median:
    data[col] = data[col].fillna(data[col].median())

# Confirm that no missing values remain
print(data.isnull().sum())
last_name, first_name           0
player_id                       0
year                            0
pa                              0
k_percent                       0
bb_percent                      0
woba                            0
bacon                           0
z_swing_percent                 0
z_swing_miss_percent            0
oz_swing_percent                0
oz_swing_miss_percent           0
oz_contact_percent              0
out_zone_swing_miss             0
out_zone_swing                  0
out_zone_percent                0
out_zone                        0
meatball_swing_percent          0
meatball_percent                0
pitch_count_offspeed            0
whiff_percent                   0
swing_percent                   0
straightaway_percent            0
batted_ball                     0
f_strike_percent                0
groundballs_percent             0
groundballs                     0
flyballs_percent                0
n_ff_formatted                  0
ff_avg_speed                    0
ff_avg_spin                     0
ff_avg_break_x                  0
ff_avg_break_z                  0
offspeed_avg_break_z            0
offspeed_avg_break_z_induced    0
offspeed_avg_break              0
offspeed_range_speed            0
dtype: int64

Distributions of Key Statistics¶

wOBA: measures how effectively a pitcher limits offensive production, accounting for the quality of contact and the overall run impact of hits and walks allowed, providing a comprehensive view of a pitcher's ability to suppress scoring.

In [8]:
plt.subplot(1, 3, 1)
sns.histplot(data['woba'], kde=True, bins=20)
plt.title('wOBA Distribution')
Out[8]:
Text(0.5, 1.0, 'wOBA Distribution')

The distribution of wOBA for pitchers is approximately normal, with most values centered around 0.31, indicating that the majority of pitchers allow roughly league-average offensive production. There is a modest spread: some pitchers hold hitters to a much lower wOBA (better for the pitcher), while others allow a much higher one (worse for the pitcher).

K%: represents the percentage of a pitcher's total plate appearances that result in a strikeout, serving as a measure of how often a pitcher is able to retire batters via strikeout.

In [ ]:
# Strikeout Percentage (K%) distribution
plt.subplot(1, 3, 2)
sns.histplot(data['k_percent'], kde=True, bins=20)
plt.title('K% Distribution')
Out[ ]:
Text(0.5, 1.0, 'K% Distribution')

The distribution of K% (Strikeout Percentage) is right-skewed, with most pitchers having a strikeout rate between 15% and 25%, peaking around 20%. A smaller number of pitchers achieve higher strikeout rates, with a few exceeding 30%, indicating they are elite in striking out batters.

BB%: represents the percentage of a pitcher's total plate appearances that result in a walk, measuring how frequently a pitcher allows hitters to reach base via walks.

In [ ]:
# Walk Percentage (BB%) distribution
plt.subplot(1, 3, 3)
sns.histplot(data['bb_percent'], kde=True, bins=20)
plt.title('BB% Distribution')

plt.tight_layout()
plt.show()

The distribution of BB% (Walk Percentage) is tightly clustered around 5% to 10%, with most pitchers falling within this range, indicating that the majority allow walks at a fairly moderate rate. There is a small tail toward higher walk rates above 10%, showing that a few pitchers struggle more with control, issuing walks more frequently.

Whiff %: represents the percentage of swings that result in a miss. It is a key measure of how often a pitcher can make batters swing and miss, indicating a pitcher's ability to dominate hitters and generate strikeouts.
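
The three rate statistics defined in this section are simple ratios, as the sketch below shows. The counts are made up for illustration and do not describe any real pitcher; `rate_percent` is a hypothetical helper.

```python
def rate_percent(events, opportunities):
    """Return events as a percentage of opportunities."""
    if opportunities <= 0:
        raise ValueError("opportunities must be positive")
    return 100.0 * events / opportunities

# Made-up season line for one pitcher (illustrative only)
batters_faced = 700
strikeouts = 161
walks = 49
swings = 1200
swinging_misses = 300

k_percent = rate_percent(strikeouts, batters_faced)    # strikeouts per plate appearance
bb_percent = rate_percent(walks, batters_faced)        # walks per plate appearance
whiff_percent = rate_percent(swinging_misses, swings)  # misses per swing
print(round(k_percent, 1), round(bb_percent, 1), round(whiff_percent, 1))
```

Note the different denominators: K% and BB% are measured per plate appearance, while Whiff% is measured per swing.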

In [ ]:
# Plot the distribution of Whiff Percentage (whiff_percent)
plt.figure(figsize=(5,5))
sns.histplot(data['whiff_percent'], kde=True, bins=20)
plt.title('Whiff% Distribution')
plt.xlabel('Whiff Percentage')
plt.ylabel('Count')
plt.show()

The distribution of Whiff Percentage (Whiff%) is approximately normal, with most pitchers recording a whiff rate between 20% and 30%, peaking around 25%. There are fewer pitchers with extremely high or low whiff rates, indicating that the majority of pitchers induce swings and misses at a moderate rate, with a few elite pitchers exceeding 35%.

Trends¶

In [ ]:
# One subplot per metric, stacked vertically (4 rows, 1 column)
metrics = [
    ('whiff_percent', 'Whiff% Over Time', 'Whiff%'),
    ('woba', 'wOBA Over Time', 'wOBA'),
    ('k_percent', 'K% (Strikeout Percentage) Over Time', 'K%'),
    ('bb_percent', 'BB% (Walk Percentage) Over Time', 'BB%'),
]
fig, axes = plt.subplots(len(metrics), 1, figsize=(10, 16))

# lineplot averages each metric across pitchers within a year;
# errorbar=None (seaborn >= 0.12) replaces the deprecated ci=None
for ax, (col, title, ylabel) in zip(axes, metrics):
    sns.lineplot(data=data, x='year', y=col, ax=ax, errorbar=None)
    ax.set_title(title)
    ax.set_xlabel('Year')
    ax.set_ylabel(ylabel)

# Adjust layout
plt.tight_layout()
plt.show()

Interpretation

  • Whiff% (swing and miss percentage)
    • Shows a clear upward trend from 2015 to 2020, suggesting that pitchers have become increasingly effective at inducing swings and misses. After peaking in 2020, it shows a slight decline but remains higher than earlier years.
    • This indicates that pitchers are improving their ability to generate whiffs, possibly due to factors like increased velocity, improved pitch design, or better pitch sequencing.
  • wOBA (weighted on-base average)
    • Shows more fluctuation over the years, with a peak around 2017 and a noticeable decline afterward. This suggests that after 2017, hitters' overall offensive production against pitchers decreased, aligning with the rise in Whiff% and K%.
    • The fluctuations could be due to changes in league conditions, like adjustments to the ball, hitting strategies, or pitcher dominance.
  • K% (strikeout percentage)
    • Also follows an upward trend, peaking around 2020, similar to the Whiff%. The trend highlights the increased reliance on strikeouts by pitchers, with more hitters being retired via strikeouts each year.
    • The correlation between K% and Whiff% is expected, as pitchers who induce more whiffs are likely to generate more strikeouts.
  • BB% (walk percentage)
    • Fluctuates more than the other metrics, peaking around 2018 and then declining before bouncing back. This suggests variability in pitchers' control over the years, with no clear trend of improvement or decline.
    • The fluctuation may indicate that while pitchers are improving in generating strikeouts, they might still struggle with control, causing walks to rise in certain periods.
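
The expected relationship between Whiff% and K% can be checked with a correlation coefficient. The sketch below computes Pearson's r by hand on made-up numbers (not values from the dataset); `pearson_r` is a hypothetical helper.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up pitcher-level values (illustrative only)
toy_whiff = [22.0, 24.0, 26.0, 28.0, 30.0]
toy_k = [18.0, 21.0, 22.0, 25.0, 28.0]
r = pearson_r(toy_whiff, toy_k)
print(round(r, 3))
```

On the real frame, `data['whiff_percent'].corr(data['k_percent'])` would give the same statistic using pandas.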

What We Can Learn for Modeling:

  • Modeling Whiff% and K%: Since both Whiff% and K% are showing a general upward trend, incorporating these variables into time series models (like ARIMA or exponential smoothing) may help predict future trends in pitcher dominance. The close relationship between these metrics suggests that Whiff% could be a strong predictor of future K% performance.

  • wOBA Modeling: The downward trend in wOBA suggests that pitchers are becoming more effective at limiting offensive production. Predictive models for wOBA should consider factors like Whiff% and K%, as they seem correlated with limiting offensive success.

  • BB% Variability: The fluctuating BB% suggests more unpredictability in pitchers' control. A more complex model might be required to predict BB%, such as using external variables (e.g., pitch type, pitch velocity) to explain its variability over time.

Key Insights for Modeling:

  • Whiff% and K%: Use time series models or regression models to predict future strikeout and whiff rates. These variables are closely related and follow clear trends.
  • wOBA: May benefit from being modeled alongside Whiff% and K%, as they likely influence a pitcher's ability to limit runs.
  • BB%: Requires careful modeling due to its fluctuations; including other features (e.g., pitch count, fatigue) could improve predictions.

Modeling¶

WHIFF %¶

In [ ]:
def forecast_metric(data, metric_column, metric_name, forecast_years=5):
    # Step 1: Aggregate the metric data by year
    metric_by_year = data.groupby('year')[metric_column].mean()

    # Step 2: Plot the original time series for the metric
    plt.figure(figsize=(10, 5))
    plt.plot(metric_by_year.index, metric_by_year.values, marker='o')
    plt.title(f'{metric_name} Over Time')
    plt.xlabel('Year')
    plt.ylabel(metric_name)
    plt.grid(True)
    plt.show()

    # Step 3: Use Auto ARIMA to determine the best p, d, q parameters
    auto_model = auto_arima(metric_by_year, seasonal=False, trace=True, suppress_warnings=True)

    # Step 4: Fit the ARIMA model based on the recommended (p, d, q) values
    best_pdq = auto_model.order
    print(f"Best ARIMA order for {metric_name}: {best_pdq}")

    # Fit the model with the best order found by auto_arima
    model = ARIMA(metric_by_year, order=best_pdq)
    model_fit = model.fit()

    # Step 5: Summary of the model
    print(model_fit.summary())

    # Step 6: Forecast future metric values (next 5 years by default)
    forecast = model_fit.forecast(steps=forecast_years)

    # Step 7: Create a future index for years to forecast
    future_years = list(range(metric_by_year.index[-1] + 1, metric_by_year.index[-1] + 1 + forecast_years))

    # Step 8: Plot the forecast
    plt.figure(figsize=(10, 5))
    plt.plot(metric_by_year.index, metric_by_year.values, label=f'Historical {metric_name}')
    plt.plot(future_years, forecast, label=f'Forecasted {metric_name}', marker='o', linestyle='--')
    plt.title(f'{metric_name} Forecast')
    plt.xlabel('Year')
    plt.ylabel(metric_name)
    plt.legend()
    plt.grid(True)
    plt.show()
In [ ]:
forecast_metric(data, 'whiff_percent', 'Whiff%', forecast_years=5)

Historical Whiff% Trend:

The Whiff% has generally been increasing from 2015 to 2020, peaking around 26%. However, after 2020, we see some decline and fluctuations in the percentage, indicating a potential plateau or slight decrease in pitchers' ability to induce swings and misses after reaching a peak.

ARIMA Model Summary:

The ARIMA(1,0,0) model was selected as the best fit for the data. It includes one autoregressive term (AR1), so the current Whiff% is modeled as depending on the previous year's Whiff%. The AR1 coefficient (0.8618) indicates a strong positive relationship: the previous year's value heavily influences the current year's value. The AIC (28.969) and BIC (29.877) have no absolute scale; they are only meaningful when compared with other candidate models fit to the same series, and this order had the lowest AIC among those auto_arima tried.
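
For reference, an AR(1) model with a constant can be written in mean form as below. This is the standard textbook form and is assumed here to match how the fitted model is parameterized: $\mu$ is the long-run mean, $\phi$ is the AR1 coefficient (0.8618 for Whiff%), $\varepsilon_t$ is white noise, and $T$ is the last observed year.

```latex
y_t - \mu = \phi\,(y_{t-1} - \mu) + \varepsilon_t,
\qquad
\hat{y}_{T+h} = \mu + \phi^{h}\,(y_T - \mu)
```

Since $0 < \phi < 1$, the $h$-step forecast moves geometrically from the last observed value toward $\mu$, which is why an AR(1) forecast is a smooth curve rather than a continuation of recent ups and downs.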

Whiff% Forecast:

The forecast for Whiff% from 2025 to 2029 shows a slow, steady decline rather than a sharp drop. This suggests that Whiff% may drift slightly downward over the coming years.

Conclusion:

After peaking in 2020, Whiff% is expected to gradually decline based on the model, reflecting a possible stabilization or slight weakening of pitchers' ability to induce swings and misses in the near future.

wOBA¶

In [ ]:
forecast_metric(data, 'woba', 'wOBA', forecast_years=5)

Historical wOBA (2015–2024):

  • The historical wOBA shows fluctuations over the years, with peaks in 2017 and 2022 and valleys in 2018 and 2024.
  • This trend suggests variability in performance or environmental factors (e.g., league-wide conditions) that influence wOBA over time.

ARIMA Model Summary:

  • The ARIMA(0, 0, 0) model with intercept was chosen as the best-fitting model based on the lowest AIC score.
  • This ARIMA configuration essentially represents a constant model because the p, d, and q values are all set to 0.
  • The model has a constant term (intercept) of approximately 0.3105, meaning the model assumes wOBA will stay close to this value in the forecast.
  • The flat forecast in this case is expected, as the model found no significant trend or autoregressive/moving average patterns in the data.
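
To make the "constant model" point concrete: an ARIMA(0,0,0) with an intercept is y_t = c + noise, and the best-fitting constant is essentially the sample mean, so every forecast step is the same number. The sketch below uses made-up values and hypothetical helper names.

```python
def constant_model_forecast(values, steps):
    """Forecast from a constant-only model: every step equals the sample mean."""
    mean = sum(values) / len(values)
    return [mean] * steps

def sum_squared_error(values, constant):
    """Squared error of describing every observation by a single constant."""
    return sum((v - constant) ** 2 for v in values)

# Made-up yearly wOBA averages (illustrative only)
toy_woba = [0.310, 0.320, 0.300, 0.315, 0.305]
forecast = constant_model_forecast(toy_woba, steps=5)
print([round(f, 4) for f in forecast])

# The mean fits at least as well as any other constant we try
best = sum_squared_error(toy_woba, forecast[0])
print(all(best <= sum_squared_error(toy_woba, c) for c in (0.300, 0.305, 0.315, 0.320)))
```

This is why the wOBA forecast is a horizontal line: the model carries no information beyond the historical average.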

Forecast (2025–2029):

  • The forecasted wOBA for the next 5 years is flat, hovering around 0.3105.
  • This flat prediction indicates that the model does not detect any strong upward or downward trends, and it forecasts that wOBA will remain relatively stable.
  • Given the lack of significant trends in the historical data, the model assumes that future values will stay close to the mean of past observations.

Model Diagnostics:

  • The model's AIC (Akaike Information Criterion) is -74.772. AIC has no absolute scale, so this value only indicates that the constant model scored lowest among the candidates auto_arima compared; it is a very simple model.
  • The Ljung-Box test (Q) suggests that the residuals are uncorrelated (p-value > 0.05), meaning the model residuals behave like white noise, which is a good sign of model fit.
  • The Heteroskedasticity test (H) and Jarque-Bera test (JB) show that there are no major issues with heteroskedasticity or non-normality in the residuals. With at most ten yearly observations, however, these tests have little power to detect problems.

Conclusion:

  • The ARIMA(0,0,0) model indicates that there is no significant autoregressive or moving average pattern in the historical wOBA data, leading to a flat forecast.
  • The forecast suggests that wOBA will remain stable around 0.3105 in the coming years, without any strong trends.

K%¶

In [ ]:
forecast_metric(data, 'k_percent', 'K%', forecast_years=5)

Historical K% (2015–2024):

  • The historical K% (Strikeout Percentage) shows a clear upward trend from 2015 to 2019 and peaks in 2021 at around 23.5%.
  • After 2021, there is a slight decline, with K% stabilizing around 22.5%–23% over the next few years (2022–2024). This suggests a recent leveling off after a period of rapid increase.

ARIMA Model Selection:

  • The best ARIMA model selected by auto ARIMA is ARIMA(1, 0, 1), which means:
    • p=1: There is a significant autoregressive component, meaning past values of K% influence future values.
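    • d=0: No differencing was applied, because auto_arima's stationarity test did not call for it. With so few yearly observations, this is only weak evidence that the series is stationary.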
    • q=1: There is also a moving average component, meaning the model accounts for the error in previous time steps to improve predictions.

Model Summary:

  • The constant term (const = 21.81) is the model's estimate of the long-run mean of K%, the level toward which the forecasts revert.
  • The AR(1) term (0.819) is significant, meaning that the previous year's K% heavily influences the current year’s value.
  • The MA(1) term (0.256) is not statistically significant (p-value = 0.694), suggesting that the moving average component may not add much value in improving the model’s predictions.
  • Model Fit: The AIC (26.907) is only meaningful relative to other models fit to the same series. It was the lowest among the orders auto_arima tried, but further refinement could be considered.

K% Forecast (2025–2029):

  • The forecasted K% shows a slight decline over the next five years, dipping from around 22.5% to 22.1% by 2029.
  • This slight decrease suggests that the strikeout rate may stabilize or slightly decrease after peaking in recent years. However, the change is gradual, indicating that K% is expected to remain relatively high compared to historical levels.
  • The gradual, flattening decline in the forecast reflects the autoregressive model pulling K% back toward its estimated long-run level, consistent with the earlier growth in strikeouts having plateaued.
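
The forecast path can be roughly reproduced by hand from the reported coefficients. The sketch below assumes that `const` is the process mean (as in statsmodels' ARIMA parameterization) and that the MA(1) term only affects the first forecast step, so later steps follow the AR(1) recursion. The starting value of 22.5 is the approximate first forecast quoted in the text, and `ar1_forecast_path` is a hypothetical helper.

```python
def ar1_forecast_path(first_forecast, long_run_mean, ar_coef, steps):
    """Extend a one-step-ahead forecast using mean reversion at rate ar_coef."""
    path = [first_forecast]
    for _ in range(steps - 1):
        path.append(long_run_mean + ar_coef * (path[-1] - long_run_mean))
    return path

# const (21.81) and the AR(1) coefficient (0.819) come from the model summary;
# 22.5 is the approximate first forecast quoted in the text
path = ar1_forecast_path(first_forecast=22.5, long_run_mean=21.81, ar_coef=0.819, steps=5)
print([round(value, 2) for value in path])
```

Each step closes about 18% of the remaining gap to 21.81, which is consistent with the forecast ending near 22.1 after five years.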

Model Diagnostics:

  • The Ljung-Box (Q) test shows a p-value > 0.05, meaning there is no significant autocorrelation in the residuals, which is a good sign for model fit.
  • Heteroskedasticity and normality tests show no significant issues with the residuals.

Conclusion:

  • The ARIMA(1, 0, 1) model suggests that K% will stabilize around 22.1%–22.5% over the next five years, with no significant upward or downward trends expected.
  • While the model fits the data well, it is largely driven by the autoregressive component, with limited contributions from the moving average. This forecast suggests that the rapid rise in K% observed between 2015 and 2021 is likely to slow, leading to a more stable strikeout rate in the coming years.

BB%¶

In [ ]:
forecast_metric(data, 'bb_percent', 'BB%', forecast_years=5)

Historical BB% (2015–2024):

  • The historical data for BB% (Walk Percentage) shows significant variability, with noticeable peaks in 2018 and 2022 around 7.8%–8.0%, followed by declines.
  • This suggests that the walk percentage has fluctuated substantially over the past decade, making it difficult to predict based on historical trends.

ARIMA Model Selection:

  • The best ARIMA model selected for BB% is ARIMA(0, 0, 0) with an intercept, indicating that the model finds no discernible trend or autocorrelation in the data (seasonality was not considered, since auto_arima was run with seasonal=False).
  • The model essentially assumes that future BB% values will remain constant, reflecting the mean of the historical data.

Model Summary:

  • The constant term (7.43%) suggests that the model forecasts BB% to stabilize around this value in future years.
  • The AIC is 9.419. With no AR or MA terms selected, this is the simplest possible model: a constant plus noise.

BB% Forecast (2025–2029):

  • The forecasted BB% is expected to remain flat at approximately 7.43% over the next five years.
  • Given the absence of significant autoregressive or moving average components, the model predicts that future walk percentages will be close to the historical average without substantial changes.
  • The flat forecast reflects the conclusion that future values of BB% will not deviate significantly from recent historical averages.

Model Diagnostics:

  • The Ljung-Box (Q) test indicates no significant autocorrelation in the residuals, suggesting the model residuals behave like white noise.
  • Tests for heteroskedasticity and normality (Jarque-Bera test) indicate no major issues with the residuals, meaning the model fits the data reasonably well.

Conclusion:

  • The ARIMA(0, 0, 0) model predicts that BB% will remain stable at approximately 7.43% over the next five years.
  • The absence of clear trends or patterns in the historical data led the model to forecast future BB% values as constant.
In [ ]:
!cp "/content/drive/MyDrive/Colab Notebooks/silverstein_time_series.ipynb" ./
!jupyter nbconvert --to html "silverstein_time_series.ipynb"