Grouping Pitchers by Pitch Arsenal Using PCA and Clustering¶
By Scott Silverstein
- What is your name? Include all team members if submitting as a group.
- Scott Silverstein
From what perspective are you conducting the analysis? (Who are you? / Who are you working for?)
- I am an analytics consultant for a professional baseball team, aiming to explore patterns in pitchersā arsenals to improve player evaluation and game strategies.
What is your question?
- How can pitchers be clustered based on their pitch arsenals, and what insights can be derived about groups with specific pitch-dominance traits?
Describe your dataset(s) including URL (if available).
- The dataset is sourced from Baseball Savant and contains detailed statistics on pitchers' pitch types, speeds, spin rates, and break metrics across multiple seasons.
What are your variables? Include variable type (binary, categorical, numeric). If you have many variables, you can list the most important and summarize the rest.
- Key variables include:
- Numeric: Average pitch speed, spin rate, and break values for pitch types such as four-seam fastball (
ff_avg_speed
,ff_avg_spin
), slider (sl_avg_speed
), changeup, etc. - Numeric: Pitch usage frequencies (
n_ff_formatted
,n_sl_formatted
, etc.). - Total of 39 variables, including missing data for less common pitches.
- Numeric: Average pitch speed, spin rate, and break values for pitch types such as four-seam fastball (
- Key variables include:
How are your variables suitable for your analysis method?
- The variables are numeric and exhibit multicollinearity (e.g., spin rate correlates with speed), making PCA suitable for dimensionality reduction. Missing values will be imputed appropriately, as they represent unused pitch types. Clustering can group pitchers with similar arsenal profiles, highlighting unique traits.
Compare the outputs from clustering and PCA (alone or with clustering). What are the strengths and limitations of each? Where do they provide similar information? Where do they provide different information?
a. Clustering and PCA ComparisonClustering Purpose:
- Identified groups of pitchers based on their pitch arsenals and metrics (e.g., fastball speed, slider spin).
- The analysis resulted in four distinct clusters:
- Cluster 1: Fastball-dominant pitchers.
- Cluster 2: Slider-reliant pitchers.
- Cluster 3: Limited-velocity pitchers.
- Cluster 4: Balanced, varied pitchers.
PCA Purpose:
- Reduced the high-dimensional dataset (pitch metrics like speed, spin, and break) into uncorrelated principal components (PCs).
- PC1 was heavily weighted toward changeup metrics, while PC2 and PC3 captured variance related to fastballs and sliders.
- Retaining the first few PCs explained a large portion of the variance, simplifying the data for clustering.
b. Strengths Observed
- **Clustering**:
- Provided interpretable groups of pitchers based on distinct pitch styles.
- Identified actionable pitcher profiles, such as fastball-dominant pitchers (Cluster 1) and slider-heavy pitchers (Cluster 2).
- Highlighted differences in specialty pitches, like higher usage of sweepers and knuckleballs in Cluster 4.
- **PCA**:
- Simplified the dataset by reducing multicollinearity among metrics, such as speed and spin being correlated.
- PC1 highlighted changeup dominance, while PC2 focused on fastball speed and spin.
- Made clustering more robust by reducing the influence of noisy or redundant variables.
c. Limitations Observed
- **Clustering**:
- Direct clustering on raw metrics was sensitive to noise and overlapping values, making some clusters less distinct without PCA preprocessing.
- Less-used pitches, like knuckleballs and splitters, contributed little to some clusters, making interpretation harder.
- **PCA**:
- Principal components (e.g., PC1, PC2) were abstract and required analyzing feature loadings to interpret their contributions.
- PCA alone did not produce meaningful groups, requiring clustering for actionable insights.
d. Similarities Between PCA and Clustering
- Both methods revealed patterns in pitcher profiles:
- Clustering grouped pitchers into clear archetypes (e.g., fastball-dominant or slider-reliant).
- PCA explained these clusters by identifying which metrics contributed most variance, such as changeup speed and spin in PC1.
- Both confirmed that fastball metrics, as well as specialty pitches, were key in distinguishing pitchers.
e. Differences in Outputs
- **Interpretability**:
- Clustering directly produced interpretable groups (e.g., Cluster 1 = fastball-dominant pitchers).
- PCA required additional analysis of feature loadings to link PCs to real-world metrics.
- **Focus**:
- Clustering grouped data points (pitchers) into actionable categories.
- PCA focused on summarizing and reducing features (pitch metrics) while retaining variance.
f. Benefits of Combining PCA and Clustering
- PCA preprocessing improved clustering results:
- Noise and multicollinearity from features like fastball speed and spin were reduced.
- Clustering on the reduced dataset produced more robust and visually separable groups.
- PCA added context to clusters:
- Feature contributions from PCs explained why certain pitchers were grouped together, linking clusters to pitch metrics like changeup dominance in PC1 or fastball emphasis in PC2.
g. Final Comparison Summary
- Clustering identified four actionable pitcher profiles:
- Cluster 1: Fastball-dominant pitchers.
- Cluster 2: Slider-heavy pitchers.
- Cluster 3: Limited-velocity pitchers with less diverse arsenals.
- Cluster 4: Balanced pitchers with varied arsenals, including specialty pitches.
- PCA helped simplify the dataset and explain variance, complementing clustering to make results more interpretable.
What conclusions can you draw from these two methods to answer your question?
a. Clustering provided actionable groupings of pitchers based on their pitch arsenals and metrics.- The four clusters identified distinct pitching profiles:
- Cluster 1: Fastball-dominant pitchers with high speed and spin.
- Cluster 2: Slider-heavy pitchers with strong spin and break metrics.
- Cluster 3: Limited-velocity pitchers with less diversity in pitch types.
- Cluster 4: Balanced pitchers with varied arsenals, including specialty pitches like splitters and sweepers.
b. PCA helped simplify the high-dimensional dataset, making clustering more robust and interpretable. - PCA reduced multicollinearity by combining correlated features (e.g., speed, spin, and break) into uncorrelated principal components. - The first few PCs captured most of the variance, highlighting key contributions from changeup, fastball, and slider metrics.
c. Combining PCA and clustering enabled a clearer understanding of pitcher profiles. - PCA preprocessing improved the quality of clustering by reducing noise and redundancy in the data. - Clustering provided interpretable insights, while PCA explained the underlying variance contributing to the groupings.
d. The results revealed that pitchers could be categorized into clear archetypes based on pitch metrics. - Fastball-heavy pitchers were separated from slider-reliant pitchers. - Specialty pitches, like knuckleballs and splitters, contributed to the uniqueness of certain clusters. - These findings align with real-world observations of pitcher styles and usage patterns.
e. The combination of these methods answered the original question of identifying clusters based on pitch type. - Clustering provided distinct groups of pitchers. - PCA added context by showing which metrics (e.g., speed, spin) were most influential in defining these groups. - Together, these methods offered a data-driven way to classify pitchers and identify key characteristics of their pitch arsenals.
- The four clusters identified distinct pitching profiles:
What are the limitations of your analysis?
- Imputation for missing data may introduce bias.
- PCAās reduced interpretability might mask specific pitch dynamics.
- Clustering results depend heavily on chosen metrics (e.g., distance measures) and preprocessing.
%matplotlib inline
Packages¶
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster, set_link_color_palette
from sklearn.metrics import silhouette_score
Data Cleanup¶
Basic info¶
# Load the dataset
file_path = '/content/pitch_type.csv'
data = pd.read_csv(file_path)
# Display initial structure for context
print("Initial Dataset Shape:", data.shape)
print(data.info())
print(data.head())
Initial Dataset Shape: (1106, 39) <class 'pandas.core.frame.DataFrame'> RangeIndex: 1106 entries, 0 to 1105 Data columns (total 39 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 last_name, first_name 1106 non-null object 1 player_id 1106 non-null int64 2 year 1106 non-null int64 3 n_ff_formatted 1069 non-null float64 4 ff_avg_speed 1069 non-null float64 5 ff_avg_spin 1069 non-null float64 6 ff_avg_break_z_induced 1069 non-null float64 7 n_sl_formatted 818 non-null float64 8 sl_avg_speed 818 non-null float64 9 sl_avg_spin 814 non-null float64 10 sl_avg_break 818 non-null float64 11 n_ch_formatted 988 non-null float64 12 ch_avg_speed 988 non-null float64 13 ch_avg_spin 988 non-null float64 14 ch_avg_break 988 non-null float64 15 n_cu_formatted 954 non-null float64 16 cu_avg_speed 954 non-null float64 17 cu_avg_spin 954 non-null float64 18 cu_avg_break 954 non-null float64 19 n_si_formatted 918 non-null float64 20 si_avg_speed 918 non-null float64 21 si_avg_spin 918 non-null float64 22 si_avg_break 918 non-null float64 23 n_fc_formatted 517 non-null float64 24 fc_avg_speed 517 non-null float64 25 fc_avg_spin 517 non-null float64 26 fc_avg_break 517 non-null float64 27 n_fs_formatted 140 non-null float64 28 fs_avg_speed 140 non-null float64 29 fs_avg_spin 140 non-null float64 30 fs_avg_break 140 non-null float64 31 n_st_formatted 143 non-null float64 32 st_avg_speed 143 non-null float64 33 st_avg_spin 143 non-null float64 34 st_avg_break 143 non-null float64 35 n_sv_formatted 28 non-null float64 36 sv_avg_speed 28 non-null float64 37 sv_avg_spin 28 non-null float64 38 sv_avg_break 28 non-null float64 dtypes: float64(36), int64(2), object(1) memory usage: 337.1+ KB None last_name, first_name player_id year n_ff_formatted ff_avg_speed \ 0 Colon, Bartolo 112526 2015 29.1 90.9 1 Burnett, A.J. 150359 2015 11.7 91.7 2 Hudson, Tim 218596 2015 7.0 88.5 3 Buehrle, Mark 279824 2015 26.4 84.5 4 Sabathia, CC 282332 2015 25.2 90.8 ff_avg_spin ff_avg_break_z_induced n_sl_formatted sl_avg_speed \ 0 2255.0 15.5 9.7 82.8 1 2082.0 12.0 NaN NaN 2 2126.0 11.8 NaN NaN 3 2076.0 12.1 NaN NaN 4 2114.0 14.7 22.5 79.6 sl_avg_spin ... fs_avg_spin fs_avg_break n_st_formatted st_avg_speed \ 0 2178.0 ... NaN NaN NaN NaN 1 NaN ... NaN NaN NaN NaN 2 NaN ... 1369.0 10.3 NaN NaN 3 NaN ... NaN NaN NaN NaN 4 1823.0 ... NaN NaN NaN NaN st_avg_spin st_avg_break n_sv_formatted sv_avg_speed sv_avg_spin \ 0 NaN NaN NaN NaN NaN 1 NaN NaN NaN NaN NaN 2 NaN NaN NaN NaN NaN 3 NaN NaN NaN NaN NaN 4 NaN NaN NaN NaN NaN sv_avg_break 0 NaN 1 NaN 2 NaN 3 NaN 4 NaN [5 rows x 39 columns]
Identify Pitch Usage Columns and Create Binary Flag¶
To ensure that pitchers who do not throw a certain pitch are represented appropriately, we create binary flags for each pitch type. These flags indicate whether a pitcher uses a particular pitch (1 for usage, 0 for non-usage). This step helps preserve meaningful information about the absence of specific pitches, which can be critical for clustering and analysis.
# Identify columns related to pitch usage
pitch_columns = [col for col in data.columns if 'n_' in col]
# Create binary flags for each pitch type
for col in pitch_columns:
flag_col = col.replace('n_', 'has_')
data[flag_col] = data[col].notna().astype(int)
# Display new columns with binary flags
print("Sample Data with Binary Flags:")
print(data[[col for col in data.columns if 'has_' in col]].head())
Sample Data with Binary Flags: has_ff_formatted has_sl_formatted has_ch_formatted has_cu_formatted \ 0 1 1 1 1 1 1 0 1 1 2 1 0 0 1 3 1 0 1 1 4 1 1 1 0 has_si_formatted has_fc_formatted has_fs_formatted has_st_formatted \ 0 1 0 0 0 1 1 0 0 0 2 1 1 1 0 3 1 1 0 0 4 1 1 0 0 has_sv_formatted 0 0 1 0 2 0 3 0 4 0
Handle Missing Values by Imputation¶
Pitchers who do not throw a certain type of pitch naturally have missing data for related metrics (e.g., spin rate, speed). In this step, missing values in numeric columns are replaced with 0, explicitly indicating non-usage of that pitch. This ensures that clustering algorithms do not misinterpret missing values and that all pitchers are included in the analysis.
# Identify numeric columns for imputation
numeric_columns = [col for col in data.columns if data[col].dtype == 'float64']
# Replace missing values in numeric columns with 0
data[numeric_columns] = data[numeric_columns].fillna(0)
# Verify imputation
print("Sample Data after Missing Value Imputation:")
print(data[numeric_columns].head())
Sample Data after Missing Value Imputation: n_ff_formatted ff_avg_speed ff_avg_spin ff_avg_break_z_induced \ 0 29.1 90.9 2255.0 15.5 1 11.7 91.7 2082.0 12.0 2 7.0 88.5 2126.0 11.8 3 26.4 84.5 2076.0 12.1 4 25.2 90.8 2114.0 14.7 n_sl_formatted sl_avg_speed sl_avg_spin sl_avg_break n_ch_formatted \ 0 9.7 82.8 2178.0 6.3 7.4 1 0.0 0.0 0.0 0.0 8.8 2 0.0 0.0 0.0 0.0 0.0 3 0.0 0.0 0.0 0.0 21.1 4 22.5 79.6 1823.0 11.8 14.0 ch_avg_speed ... fs_avg_spin fs_avg_break n_st_formatted st_avg_speed \ 0 82.6 ... 0.0 0.0 0.0 0.0 1 86.3 ... 0.0 0.0 0.0 0.0 2 0.0 ... 1369.0 10.3 0.0 0.0 3 78.7 ... 0.0 0.0 0.0 0.0 4 83.9 ... 0.0 0.0 0.0 0.0 st_avg_spin st_avg_break n_sv_formatted sv_avg_speed sv_avg_spin \ 0 0.0 0.0 0.0 0.0 0.0 1 0.0 0.0 0.0 0.0 0.0 2 0.0 0.0 0.0 0.0 0.0 3 0.0 0.0 0.0 0.0 0.0 4 0.0 0.0 0.0 0.0 0.0 sv_avg_break 0 0.0 1 0.0 2 0.0 3 0.0 4 0.0 [5 rows x 36 columns]
Standardize Numeric Data¶
Pitch metrics such as spin rate and speed vary widely in scale. To prevent clustering algorithms from being biased toward features with larger ranges, I will standardize the numeric data. Standardization transforms the data to have a mean of 0 and a standard deviation of 1, ensuring that all features contribute equally to the analysis.
# Select columns for standardization (numeric and binary flags)
scaled_columns = numeric_columns + [col for col in data.columns if 'has_' in col]
# Standardize the selected columns
scaler = StandardScaler()
data[scaled_columns] = scaler.fit_transform(data[scaled_columns])
# Verify standardization
print("Sample Data after Standardization:")
print(data[scaled_columns].head())
Sample Data after Standardization: n_ff_formatted ff_avg_speed ff_avg_spin ff_avg_break_z_induced \ 0 -0.211812 0.067336 0.193389 0.107785 1 -1.223574 0.114787 -0.210085 -0.838111 2 -1.496867 -0.075016 -0.107468 -0.892162 3 -0.368809 -0.312270 -0.224079 -0.811085 4 -0.438586 0.061405 -0.135454 -0.108420 n_sl_formatted sl_avg_speed sl_avg_spin sl_avg_break n_ch_formatted \ 0 -0.318350 0.540557 0.435467 0.165177 -0.509354 1 -1.102058 -1.681349 -1.640706 -1.340870 -0.354382 2 -1.102058 -1.681349 -1.640706 -1.340870 -1.328491 3 -1.102058 -1.681349 -1.640706 -1.340870 1.007157 4 0.715822 0.454687 0.097064 1.479980 0.221228 ch_avg_speed ... sv_avg_break has_ff_formatted has_sl_formatted \ 0 0.252418 ... -0.156807 0.186042 0.593362 1 0.392578 ... -0.156807 0.186042 -1.685312 2 -2.876543 ... -0.156807 0.186042 -1.685312 3 0.104683 ... -0.156807 0.186042 -1.685312 4 0.301663 ... -0.156807 0.186042 0.593362 has_ch_formatted has_cu_formatted has_si_formatted has_fc_formatted \ 0 0.345591 0.399161 0.452541 -0.936888 1 0.345591 0.399161 0.452541 -0.936888 2 -2.893593 0.399161 0.452541 1.067364 3 0.345591 0.399161 0.452541 1.067364 4 0.345591 -2.505258 0.452541 1.067364 has_fs_formatted has_st_formatted has_sv_formatted 0 -0.380693 -0.38535 -0.161165 1 -0.380693 -0.38535 -0.161165 2 2.626785 -0.38535 -0.161165 3 -0.380693 -0.38535 -0.161165 4 -0.380693 -0.38535 -0.161165 [5 rows x 45 columns]
Review dataset¶
print("Final Cleaned Dataset Shape:", data.shape)
print(data.head())
Final Cleaned Dataset Shape: (1106, 48) last_name, first_name player_id year n_ff_formatted ff_avg_speed \ 0 Colon, Bartolo 112526 2015 -0.211812 0.067336 1 Burnett, A.J. 150359 2015 -1.223574 0.114787 2 Hudson, Tim 218596 2015 -1.496867 -0.075016 3 Buehrle, Mark 279824 2015 -0.368809 -0.312270 4 Sabathia, CC 282332 2015 -0.438586 0.061405 ff_avg_spin ff_avg_break_z_induced n_sl_formatted sl_avg_speed \ 0 0.193389 0.107785 -0.318350 0.540557 1 -0.210085 -0.838111 -1.102058 -1.681349 2 -0.107468 -0.892162 -1.102058 -1.681349 3 -0.224079 -0.811085 -1.102058 -1.681349 4 -0.135454 -0.108420 0.715822 0.454687 sl_avg_spin ... sv_avg_break has_ff_formatted has_sl_formatted \ 0 0.435467 ... -0.156807 0.186042 0.593362 1 -1.640706 ... -0.156807 0.186042 -1.685312 2 -1.640706 ... -0.156807 0.186042 -1.685312 3 -1.640706 ... -0.156807 0.186042 -1.685312 4 0.097064 ... -0.156807 0.186042 0.593362 has_ch_formatted has_cu_formatted has_si_formatted has_fc_formatted \ 0 0.345591 0.399161 0.452541 -0.936888 1 0.345591 0.399161 0.452541 -0.936888 2 -2.893593 0.399161 0.452541 1.067364 3 0.345591 0.399161 0.452541 1.067364 4 0.345591 -2.505258 0.452541 1.067364 has_fs_formatted has_st_formatted has_sv_formatted 0 -0.380693 -0.38535 -0.161165 1 -0.380693 -0.38535 -0.161165 2 2.626785 -0.38535 -0.161165 3 -0.380693 -0.38535 -0.161165 4 -0.380693 -0.38535 -0.161165 [5 rows x 48 columns]
PCA¶
# Select only numeric columns (including binary flags)
pca_columns = [col for col in data.columns if col.startswith(('ff_', 'sl_', 'cu_', 'si_', 'ch_', 'has_'))]
# Apply PCA
pca = PCA()
pca_result = pca.fit_transform(data[pca_columns])
# Explained variance
explained_variance = np.cumsum(pca.explained_variance_ratio_)
# Visualize the explained variance
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(explained_variance) + 1), explained_variance, marker='o', linestyle='--')
plt.title('Explained Variance by Principal Components')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.grid()
plt.show()
Cumulative Explained Variance:
- The y-axis represents the cumulative percentage of the total variance in the dataset explained by the principal components.
- The x-axis shows the number of principal components.
Key Observations:
- The curve rises steeply at first, indicating that the first few components explain a large proportion of the variance in the dataset.
- After about 6 components, the curve begins to flatten, meaning that additional components contribute less to the total variance.
- By 10 components, the cumulative explained variance exceeds 90%, suggesting that these 10 components capture most of the dataset's structure.
Choosing the Number of Components:
- Typically, the number of components is chosen where the cumulative variance reaches a satisfactory level (e.g., 90%).
- From the plot, retaining the first 10 components is a reasonable choice, as they explain over 90% of the variance.
Dimensionality Reduction:
- By reducing the dataset from potentially dozens of features to these 10 principal components, you retain the key information while discarding less important variability. This makes subsequent clustering more computationally efficient and interpretable.
Feature contributions¶
print("Feature contributions to PC2:")
print(components.loc['PC2'].sort_values(ascending=False).head())
Feature contributions to PC2: ff_avg_spin 0.304537 ff_avg_speed 0.292026 ff_avg_break_z_induced 0.291926 has_ff_formatted 0.287610 ch_avg_break 0.241365 Name: PC2, dtype: float64
Retain Multiple PCs for Clustering:¶
# Display contributions for all PCs
# Define the number of components based on explained variance \
num_components = len(pca.components_)
for i in range(num_components):
print(f"Feature contributions to PC{i+1}:")
print(components.loc[f'PC{i+1}'].sort_values(ascending=False).head())
Feature contributions to PC1: has_ch_formatted 0.373448 ch_avg_speed 0.372151 ch_avg_spin 0.358590 ch_avg_break 0.346982 has_si_formatted 0.253067 Name: PC1, dtype: float64 Feature contributions to PC2: ff_avg_spin 0.304537 ff_avg_speed 0.292026 ff_avg_break_z_induced 0.291926 has_ff_formatted 0.287610 ch_avg_break 0.241365 Name: PC2, dtype: float64 Feature contributions to PC3: has_cu_formatted 0.351846 cu_avg_spin 0.349911 cu_avg_speed 0.348376 cu_avg_break 0.322361 has_fc_formatted 0.164561 Name: PC3, dtype: float64 Feature contributions to PC4: si_avg_spin 0.331663 si_avg_break 0.328058 si_avg_speed 0.326781 has_si_formatted 0.324357 sl_avg_speed 0.252549 Name: PC4, dtype: float64 Feature contributions to PC5: has_ff_formatted 0.359402 ff_avg_speed 0.354541 ff_avg_spin 0.329061 ff_avg_break_z_induced 0.245096 si_avg_break 0.137768 Name: PC5, dtype: float64 Feature contributions to PC6: has_st_formatted 0.714181 has_fc_formatted 0.565005 sl_avg_break 0.093005 cu_avg_break 0.071189 ch_avg_spin 0.065638 Name: PC6, dtype: float64 Feature contributions to PC7: has_sv_formatted 0.898582 has_st_formatted 0.391422 cu_avg_break 0.127061 sl_avg_break 0.068204 sl_avg_spin 0.061492 Name: PC7, dtype: float64 Feature contributions to PC8: has_fc_formatted 0.760846 has_sv_formatted 0.201797 sl_avg_break 0.195804 has_fs_formatted 0.118418 ch_avg_spin 0.054058 Name: PC8, dtype: float64 Feature contributions to PC9: has_fs_formatted 0.815169 ch_avg_break 0.306076 ff_avg_break_z_induced 0.275782 ch_avg_spin 0.193832 cu_avg_break 0.128117 Name: PC9, dtype: float64 Feature contributions to PC10: cu_avg_break 0.657918 sl_avg_break 0.344856 has_ff_formatted 0.128736 ff_avg_speed 0.097633 ff_avg_spin 0.088544 Name: PC10, dtype: float64 Feature contributions to PC11: ff_avg_break_z_induced 0.627330 sl_avg_break 0.408757 cu_avg_break 0.184616 si_avg_break 0.174545 ch_avg_break 0.065785 Name: PC11, dtype: float64 Feature contributions to PC12: sl_avg_break 0.671894 cu_avg_speed 0.220477 has_cu_formatted 0.173505 has_fs_formatted 0.108697 has_ff_formatted 0.087315 Name: PC12, dtype: float64 Feature contributions to PC13: ch_avg_break 0.501528 ch_avg_spin 0.385547 ff_avg_spin 0.236594 cu_avg_spin 0.107025 sl_avg_spin 0.092937 Name: PC13, dtype: float64 Feature contributions to PC14: ff_avg_spin 0.477263 ch_avg_spin 0.294721 si_avg_spin 0.290919 cu_avg_spin 0.232700 sl_avg_spin 0.182240 Name: PC14, dtype: float64 Feature contributions to PC15: ch_avg_spin 0.660581 ff_avg_break_z_induced 0.148038 sl_avg_speed 0.139365 has_ff_formatted 0.136807 has_cu_formatted 0.128818 Name: PC15, dtype: float64 Feature contributions to PC16: cu_avg_spin 0.594090 sl_avg_spin 0.340587 si_avg_break 0.199052 has_ff_formatted 0.188391 ff_avg_speed 0.172436 Name: PC16, dtype: float64 Feature contributions to PC17: si_avg_break 0.709988 ff_avg_spin 0.202035 ch_avg_spin 0.186809 has_sl_formatted 0.133405 sl_avg_speed 0.131780 Name: PC17, dtype: float64 Feature contributions to PC18: sl_avg_spin 0.624594 has_cu_formatted 0.254985 si_avg_break 0.231020 cu_avg_speed 0.149822 cu_avg_break 0.088291 Name: PC18, dtype: float64 Feature contributions to PC19: ff_avg_speed 0.490108 cu_avg_speed 0.340608 ch_avg_speed 0.303088 si_avg_speed 0.211675 sl_avg_speed 0.097424 Name: PC19, dtype: float64 Feature contributions to PC20: si_avg_spin 0.764599 has_ff_formatted 0.190109 ff_avg_speed 0.098559 ch_avg_speed 0.059756 has_cu_formatted 0.046594 Name: PC20, dtype: float64 Feature contributions to PC21: cu_avg_speed 0.608264 has_ff_formatted 0.338904 has_ch_formatted 0.094710 has_si_formatted 0.075140 sl_avg_speed 0.048278 Name: PC21, dtype: float64 Feature contributions to PC22: ch_avg_speed 0.622794 has_ff_formatted 0.296005 has_si_formatted 0.094704 has_cu_formatted 0.089103 si_avg_break 0.035498 Name: PC22, dtype: float64 Feature contributions to PC23: has_sl_formatted 0.698334 ff_avg_speed 0.124128 cu_avg_speed 0.110215 ch_avg_speed 0.043043 si_avg_spin 0.038070 Name: PC23, dtype: float64 Feature contributions to PC24: si_avg_speed 0.690029 has_ff_formatted 0.203760 has_sl_formatted 0.071871 has_ch_formatted 0.043950 ff_avg_spin 0.024593 Name: PC24, dtype: float64
Clustering¶
Prepare for Clustering¶
# Drop unnecessary columns
data_preprocessed = data.drop(columns=['player_id', 'year'], errors='ignore')
# Select only numeric columns
numeric_columns = data_preprocessed.select_dtypes(include=['float64', 'int64']).columns
data_numeric = data_preprocessed[numeric_columns]
# Scale the numeric data
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data_numeric)
# Display the scaled data
print("Scaled Data Shape:", data_scaled.shape)
print("Sample of Scaled Data:")
print(data_scaled[:5])
Scaled Data Shape: (1106, 45) Sample of Scaled Data: [[-2.11811731e-01 6.73362924e-02 1.93388597e-01 1.07784706e-01 -3.18350034e-01 5.40557305e-01 4.35466804e-01 1.65177097e-01 -5.09353553e-01 2.52418250e-01 2.81334731e-01 3.62595619e-01 -1.21385113e+00 5.21363108e-01 -3.97349215e-01 2.14584159e-01 2.04698190e+00 3.21247601e-01 3.93041929e-01 3.31132291e-01 -6.68806664e-01 -9.36237137e-01 -9.31095654e-01 -8.67091784e-01 -3.11510237e-01 -3.80523474e-01 -3.72628380e-01 -3.71057661e-01 -2.90245005e-01 -3.84982900e-01 -3.83559141e-01 -3.75373683e-01 -1.38388765e-01 -1.61064310e-01 -1.60011322e-01 -1.56807243e-01 1.86042433e-01 5.93361812e-01 3.45591086e-01 3.99160545e-01 4.52540637e-01 -9.36887887e-01 -3.80693494e-01 -3.85349567e-01 -1.61164593e-01] [-1.22357446e+00 1.14787034e-01 -2.10085260e-01 -8.38110999e-01 -1.10205843e+00 -1.68134860e+00 -1.64070579e+00 -1.34086972e+00 -3.54381644e-01 3.92577543e-01 1.97653402e-01 -4.29528335e-01 1.91681848e+00 5.50887386e-01 -9.58971254e-02 1.50734259e-01 1.84751802e+00 4.27731647e-01 2.49191715e-01 -3.49732940e-02 -6.68806664e-01 -9.36237137e-01 -9.31095654e-01 -8.67091784e-01 -3.11510237e-01 -3.80523474e-01 -3.72628380e-01 -3.71057661e-01 -2.90245005e-01 -3.84982900e-01 -3.83559141e-01 -3.75373683e-01 -1.38388765e-01 -1.61064310e-01 -1.60011322e-01 -1.56807243e-01 1.86042433e-01 -1.68531237e+00 3.45591086e-01 3.99160545e-01 4.52540637e-01 -9.36887887e-01 -3.80693494e-01 -3.85349567e-01 -1.61164593e-01] [-1.49686669e+00 -7.50159333e-02 -1.07467631e-01 -8.92162183e-01 -1.10205843e+00 -1.68134860e+00 -1.64070579e+00 -1.34086972e+00 -1.32849079e+00 -2.87654328e+00 -2.66800516e+00 -2.58986639e+00 1.83844620e-03 3.11002629e-01 1.73703423e-01 7.89233264e-01 1.68325365e+00 3.47149126e-01 2.89421012e-01 -9.12972301e-02 1.49261794e+00 9.44407261e-01 8.74190001e-01 6.47033123e-02 1.43444678e+00 2.45381493e+00 2.34939449e+00 1.88010848e+00 -2.90245005e-01 -3.84982900e-01 -3.83559141e-01 -3.75373683e-01 -1.38388765e-01 -1.61064310e-01 -1.60011322e-01 -1.56807243e-01 1.86042433e-01 -1.68531237e+00 -2.89359316e+00 3.99160545e-01 4.52540637e-01 1.06736357e+00 2.62678511e+00 -3.85349567e-01 -1.61164593e-01] [-3.68809395e-01 -3.12269643e-01 -2.24078573e-01 -8.11085408e-01 -1.10205843e+00 -1.68134860e+00 -1.64070579e+00 -1.34086972e+00 1.00715728e+00 1.04682778e-01 1.07140945e-01 3.26589985e-01 -3.20910998e-01 1.59690705e-01 3.37841509e-02 7.07198265e-03 3.45672328e-01 1.97495873e-01 2.26029393e-01 7.76745783e-02 1.17978017e+00 8.74166325e-01 7.06715833e-01 7.16959879e-01 -3.11510237e-01 -3.80523474e-01 -3.72628380e-01 -3.71057661e-01 -2.90245005e-01 -3.84982900e-01 -3.83559141e-01 -3.75373683e-01 -1.38388765e-01 -1.61064310e-01 -1.60011322e-01 -1.56807243e-01 1.86042433e-01 -1.68531237e+00 3.45591086e-01 3.99160545e-01 4.52540637e-01 1.06736357e+00 -3.80693494e-01 -3.85349567e-01 -1.61164593e-01] [-4.38586135e-01 6.14049497e-02 -1.35454257e-01 -1.08420027e-01 7.15821875e-01 4.54686545e-01 9.70640172e-02 1.47997987e+00 2.21228306e-01 3.01663407e-01 6.62170165e-01 1.10556179e-01 -1.24612607e+00 -2.48642269e+00 -2.39717100e+00 -1.89246256e+00 1.15526102e+00 3.78806545e-01 3.18678683e-01 5.84590004e-01 -6.68806664e-01 1.10754750e+00 7.69411291e-01 2.72963729e+00 -3.11510237e-01 -3.80523474e-01 -3.72628380e-01 -3.71057661e-01 -2.90245005e-01 -3.84982900e-01 -3.83559141e-01 -3.75373683e-01 -1.38388765e-01 -1.61064310e-01 -1.60011322e-01 -1.56807243e-01 1.86042433e-01 5.93361812e-01 3.45591086e-01 -2.50525763e+00 4.52540637e-01 1.06736357e+00 -3.80693494e-01 -3.85349567e-01 -1.61164593e-01]]
K- Means Clustering¶
# Find optimal k using the elbow method
inertia = []
k_values = range(1, 11)
for k in k_values:
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(data_scaled)
inertia.append(kmeans.inertia_)
# Plot the elbow curve
plt.figure(figsize=(8, 5))
plt.plot(k_values, inertia, marker='o', linestyle='--')
plt.title('Elbow Method for Optimal K')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.grid()
plt.show()
# Perform K-Means with the chosen k
optimal_k = 4
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
kmeans_labels = kmeans.fit_predict(data_scaled)
# Add cluster labels to the original dataset
data_preprocessed['Cluster'] = kmeans_labels
print("Cluster Labels Assigned:")
print(data_preprocessed[['Cluster']].head())
Cluster Labels Assigned: Cluster 0 1 1 3 2 2 3 3 4 1
The elbow method suggests that the optimal number of clusters is around 4, as the rate of inertia reduction significantly slows after this point. This indicates that using 4 clusters balances the trade-off between compactness (inertia) and the number of clusters.
Hierarchical Clustering¶
# Compute the linkage matrix
linkage_matrix = linkage(data_scaled, method='ward')
# Set a custom color palette for cluster branches
set_link_color_palette(['orange', 'blue', 'green', 'red', 'purple'])
# Plot the dendrogram with custom colors and simplified labels
plt.figure(figsize=(12, 8))
dendrogram(
linkage_matrix,
color_threshold=40, # Threshold for coloring clusters
above_threshold_color='grey', # Color for branches above the threshold
truncate_mode='level', # Show only the top levels of the tree
p=5, # Display the top 5 levels of the hierarchy
)
plt.axhline(y=40, color='red', linestyle='--', label='Cluster Cut Threshold') # Add a horizontal threshold line
plt.title('Cluster Threshold with Simplified Labels')
plt.xlabel('Data Points (Truncated)')
plt.ylabel('Euclidean Distance')
plt.legend()
plt.grid()
plt.show()
This dendrogram visualizes hierarchical clustering, with distinct cluster groups highlighted below the red threshold line (at a distance of 40). The colors represent the identified clusters, showing how data points are grouped based on their similarity at different levels of the hierarchy.
Evaluate cluster results¶
# Calculate silhouette scores for K-Means and Hierarchical clustering
silhouette_kmeans = silhouette_score(data_scaled, kmeans_labels)
silhouette_hierarchical = silhouette_score(data_scaled, hierarchical_labels)
print(f"Silhouette Score for K-Means: {silhouette_kmeans}")
print(f"Silhouette Score for Hierarchical Clustering: {silhouette_hierarchical}")
Silhouette Score for K-Means: 0.23400809694736652 Silhouette Score for Hierarchical Clustering: 0.35425332275556287
Hierarchical clustering is more effective for this dataset, as it forms clusters with better internal cohesion and separation compared to k-means. You might consider using hierarchical clustering results for further analysis.
Visualize¶
# Visualize K-Means clusters
plt.figure(figsize=(8, 6))
sns.scatterplot(x=data_scaled[:, 0], y=data_scaled[:, 1], hue=kmeans_labels, palette='viridis')
plt.title('K-Means Clustering Results')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend(title='Cluster')
plt.show()
# Visualize Hierarchical clusters
plt.figure(figsize=(8, 6))
sns.scatterplot(x=data_scaled[:, 0], y=data_scaled[:, 1], hue=hierarchical_labels, palette='viridis')
plt.title('Hierarchical Clustering Results')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend(title='Cluster')
plt.show()
This scatter plot shows the results of k-means clustering with 4 clusters, where data points are color-coded by their assigned cluster. Most clusters are compact and closely grouped around the center, but there is an outlier in Cluster 0 (purple) that is far removed from the rest of the data, indicating a potential anomaly.
Interpretable information¶
Agggregate
# Add hierarchical cluster labels to the dataset (if not already done)
data_preprocessed['Hierarchical_Cluster'] = hierarchical_labels
# Use these cluster labels for aggregation and visualization
cluster_column = 'Hierarchical_Cluster'
Cluster Summary
# Select numeric columns for analysis
numeric_columns = data_preprocessed.select_dtypes(include=['float64', 'int64']).columns
# Summarize each cluster by calculating the mean of numeric metrics
cluster_summary = data_preprocessed.groupby('Hierarchical_Cluster')[numeric_columns].mean()
# Display the summarized results
print("Cluster Summary:")
print(cluster_summary)
# Save for further inspection
cluster_summary.to_csv('hierarchical_cluster_summary.csv', index=True)
Cluster Summary: n_ff_formatted ff_avg_speed ff_avg_spin \ Hierarchical_Cluster 1 0.021327 0.206482 0.159335 2 -0.236109 0.183633 0.246863 3 -1.903898 -5.324254 -5.065765 4 0.082753 0.180623 0.175744 ff_avg_break_z_induced n_sl_formatted sl_avg_speed \ Hierarchical_Cluster 1 0.254782 0.023007 0.114032 2 0.109715 -0.885356 -0.772709 3 -4.081182 -0.191481 -0.300164 4 0.123431 0.031881 0.017766 sl_avg_spin sl_avg_break n_ch_formatted ch_avg_speed \ Hierarchical_Cluster 1 0.104619 0.083446 -1.156690 -2.021971 2 -0.774306 -0.712496 -0.006881 -0.117597 3 -0.275192 -0.185006 0.296919 -0.107041 4 0.018342 0.016176 0.179616 0.343231 ... sv_avg_break has_ff_formatted has_sl_formatted \ Hierarchical_Cluster ... 1 ... -0.156807 0.186042 0.116072 2 ... 6.037079 0.186042 -0.790119 3 ... -0.156807 -5.375118 -0.268839 4 ... -0.156807 0.186042 0.016676 has_ch_formatted has_cu_formatted has_si_formatted \ Hierarchical_Cluster 1 -2.018138 -0.071826 -0.213031 2 -0.117150 -0.638132 0.072214 3 -0.092137 -0.385817 0.452541 4 0.341964 0.047898 0.014292 has_fc_formatted has_fs_formatted has_st_formatted \ Hierarchical_Cluster 1 0.295456 2.301652 0.218785 2 -0.292664 0.156356 -0.172464 3 0.254829 -0.218127 0.017407 4 -0.050349 -0.377326 -0.031574 has_sv_formatted Hierarchical_Cluster 1 -0.161165 2 6.204837 3 -0.161165 4 -0.161165 [4 rows x 45 columns]
Identify dominant pitch types¶
# Identify pitch usage columns
pitch_columns = [col for col in data_preprocessed.columns if 'has_' in col]
# Summarize pitch usage by cluster
pitch_usage_summary = data_preprocessed.groupby('Hierarchical_Cluster')[pitch_columns].sum()
# Display the dominant pitch types for each cluster
print("Dominant Pitch Types by Cluster:")
print(pitch_usage_summary)
# Save the summary for inspection
pitch_usage_summary.to_csv('pitch_usage_summary.csv', index=True)
Dominant Pitch Types by Cluster: has_ff_formatted has_sl_formatted has_ch_formatted \ Hierarchical_Cluster 1 27.534280 17.178649 -298.684417 2 5.209188 -22.123330 -3.280187 3 -198.879360 -9.947051 -3.409051 4 166.135892 14.891733 305.373655 has_cu_formatted has_si_formatted has_fc_formatted \ Hierarchical_Cluster 1 -10.630276 -31.528602 43.727475 2 -17.867686 2.021990 -8.194598 3 -14.275242 16.744004 9.428680 4 42.773204 12.762609 -44.961558 has_fs_formatted has_st_formatted has_sv_formatted Hierarchical_Cluster 1 340.644538 32.380143 -23.852360 2 4.377975 -4.828996 173.735431 3 -8.070702 0.644046 -5.963090 4 -336.951811 -28.195192 -143.919981
# Melt the dataset for easier faceting
key_metrics = [col for col in data_preprocessed.columns if 'avg_' in col] # All pitch-related metrics
melted_data = pd.melt(
data_preprocessed,
id_vars=['Hierarchical_Cluster'],
value_vars=key_metrics,
var_name='Metric',
value_name='Value'
)
# Use Seaborn's FacetGrid to create facet-wrapped boxplots
g = sns.FacetGrid(melted_data, col="Metric", col_wrap=4, height=4, sharey=False)
g.map(sns.boxplot, 'Hierarchical_Cluster', 'Value', order=sorted(data_preprocessed['Hierarchical_Cluster'].unique()))
g.set_titles("{col_name}")
g.set_axis_labels("Cluster", "Value")
g.tight_layout()
plt.show()
# Define pitch groups
pitch_groups = {
'Fastball': ['ff_avg_speed', 'ff_avg_spin', 'ff_avg_break_z_induced'],
'Slider': ['sl_avg_speed', 'sl_avg_spin', 'sl_avg_break'],
'Curveball': ['cu_avg_speed', 'cu_avg_spin', 'cu_avg_break'],
'Changeup': ['ch_avg_speed', 'ch_avg_spin', 'ch_avg_break'],
'Sinker': ['si_avg_speed', 'si_avg_spin', 'si_avg_break'],
'Cutter': ['fc_avg_speed', 'fc_avg_spin', 'fc_avg_break'],
'Splitter': ['fs_avg_speed', 'fs_avg_spin', 'fs_avg_break'],
'Knuckleball': ['st_avg_speed', 'st_avg_spin', 'st_avg_break'],
'Sweeper': ['sv_avg_speed', 'sv_avg_spin', 'sv_avg_break']
}
# Calculate the number of pitch groups
n_pitch_groups = len(pitch_groups)
n_cols = 2 # Number of columns in the grid
n_rows = (n_pitch_groups + 1) // n_cols # Rows needed for the grid
# Create a grid of subplots
fig, axes = plt.subplots(n_rows, n_cols, figsize=(16, 4 * n_rows), constrained_layout=True)
axes = axes.flatten() # Flatten the axes array for easy iteration
# Plot each pitch group
for i, (pitch_type, metrics) in enumerate(pitch_groups.items()):
melted = pd.melt(data_preprocessed, id_vars='Hierarchical_Cluster', value_vars=metrics)
sns.boxplot(x='Hierarchical_Cluster', y='value', hue='variable', data=melted, ax=axes[i])
axes[i].set_title(f'{pitch_type} Metrics Across Clusters')
axes[i].set_xlabel('Cluster')
axes[i].set_ylabel('Value')
axes[i].legend(title='Metric')
axes[i].grid()
# Hide any unused subplots
for j in range(i + 1, len(axes)):
axes[j].set_visible(False)
# Save the combined figure as an image
plt.savefig('pitch_metrics_across_clusters.png', dpi=300)
plt.show()
Cluster Summaries
Cluster 1
- Fastball:
- High average speed and moderate spin.
- Low induced break.
- Slider:
- Moderate speed, spin, and break.
- Curveball, Changeup, Sinker:
- Balanced usage, with moderate values across speed, spin, and break.
- Specialty Pitches (Splitter, Sweeper, Knuckleball):
- Rarely used or low values compared to other clusters.
- Interpretation: Likely represents fastball-reliant pitchers who also mix in sliders and other secondary pitches sparingly.
- Fastball:
Cluster 2
- Fastball:
- Moderate speed but lower spin and induced break.
- Slider and Curveball:
- Higher spin rates and break compared to Cluster 1.
- Changeup:
- Moderate speed and spin.
- Specialty Pitches:
- Minimal or no usage, particularly for splitter and knuckleball.
- Interpretation: Likely slider-heavy pitchers, complemented by occasional use of curveballs and fastballs.
- Fastball:
Cluster 3
- Fastball:
- Low speed, spin, and break across metrics.
- Slider and Curveball:
- Also lower than average usage across all metrics.
- Changeup and Sinker:
- Limited presence or effectiveness.
- Specialty Pitches:
- Rare to non-existent use of advanced pitch types like splitter or sweeper.
- Interpretation: Pitchers with limited velocity or overall pitch diversity, potentially rookies, minor leaguers, or specialized relief pitchers.
- Fastball:
Cluster 4
- Fastball:
- Moderate to high speed, spin, and induced break (second only to Cluster 1).
- Slider and Curveball:
- Balanced metrics across speed, spin, and break.
- Changeup:
- Moderate values across all metrics, more usage than in other clusters.
- Specialty Pitches:
- Higher use of splitters, sweepers, and knuckleballs compared to other clusters.
- Interpretation: Likely balanced pitchers with a varied arsenal, incorporating specialty pitches alongside fastballs and breaking balls.
- Fastball:
High-Level Definitions for the Clusters
- Cluster 1: Fastball-dominant pitchers with moderate use of secondary pitches.
- Cluster 2: Slider-reliant pitchers, with strong spin and break metrics for breaking pitches.
- Cluster 3: Low-velocity or limited arsenal pitchers, potentially specialists or less experienced players.
- Cluster 4: Balanced arsenal pitchers who incorporate fastballs, secondary pitches, and some specialty pitches effectively.
!cp "/content/drive/MyDrive/Colab Notebooks/pca_clustering_silverstein.ipynb" ./
!jupyter nbconvert --to html "pca_clustering_silversteinipynb"
[NbConvertApp] Converting notebook pca_clustering_silversteinipynb to html [NbConvertApp] WARNING | Alternative text is missing on 7 image(s). [NbConvertApp] Writing 1286567 bytes to pca_clustering_silversteinipyn.html