Clustering MLB Players into Performance Archetypes Using SVM and Feature Cutoffs¶

By Scott Silverstein

1) From what perspective are you conducting the analysis?

  • This analysis is conducted from the perspective of a baseball blogger who is passionate about deep statistical insights. The goal is to provide fans and analysts with a fresh perspective on MLB player performance by uncovering hidden patterns in player statistics. Using advanced machine learning techniques, like clustering and Support Vector Machines (SVM), this analysis aims to define new player archetypes, offering readers a unique way to think about player roles and value beyond traditional categories.

2) What is your question?

  • Can we discover and define new player archetypes in Major League Baseball (MLB) by clustering players based on their performance statistics, and then use Support Vector Machines (SVM) to formalize and classify these archetypes?
  • The goal is to move beyond traditional player labels and instead identify nuanced roles, such as "Power-Speed combos" or "unicorn hitters," by analyzing the complete range of player stats.

3) Describe your dataset(s) including URL (if available)

  • The dataset consists of Major League Baseball (MLB) player statistics from FanGraphs, one of the most popular sources for advanced baseball metrics. The dataset includes a variety of performance statistics for each player, covering aspects of hitting, fielding, and base running. These statistics allow for a comprehensive analysis of player abilities. Key statistics in the dataset include:

    • Batting Average (AVG)
    • Home Runs (HR)
    • Runs Batted In (RBI)
    • Stolen Bases (SB)
    • Slugging Percentage (SLG)
    • On-Base Percentage (OBP)
    • Etc.

The dataset does not have a publicly available URL for this project, but it has been sourced from FanGraphs' leaderboards.

4) What is(are) your independent variable(s) and dependent variable(s)? Include variable type (binary, categorical, numeric). If you have many variables, you can list the most important and summarize the rest (e.g. important variables are... also, 12 other binary, 5 categorical...).

  • Independent Variables:

    • The dataset contains numerous performance metrics that will be used as independent variables to cluster and classify players. These variables are all numeric and include, but are not limited to:

      • Batting Average (AVG) – numeric
      • Home Runs (HR) – numeric
      • Runs Batted In (RBI) – numeric
      • Stolen Bases (SB) – numeric
      • Slugging Percentage (SLG) – numeric
      • On-Base Percentage (OBP) – numeric
      • Strikeout Rate (K%) – numeric
      • Many more numeric performance metrics (plate-discipline, batted-ball, and pitch-level stats)

In total, the dataset contains numerous performance statistics for each player, all of which are numeric and describe different aspects of their game.

Dependent Variable: There is no predefined dependent variable in this case, as we are using unsupervised learning to discover and define new player archetypes. However, once clusters are formed, we will assign labels to these clusters based on their performance characteristics and use these labels as the dependent variable in subsequent SVM classification.

5) How are your variables suitable for your analysis method?

The performance metrics in the dataset are highly suitable for both clustering and Support Vector Machine (SVM) analysis due to the following reasons:

a) Numerical Nature: All the independent variables are numeric, which is ideal for distance-based clustering algorithms like K-means. Numeric variables allow for straightforward calculations of distances and help to create meaningful groupings of players based on their performance metrics.

b) Diversity of Performance Metrics: The dataset includes a wide variety of stats (e.g., batting, power, speed, and discipline), which provides a comprehensive view of each player's abilities. This diversity enables the clustering algorithm to uncover nuanced patterns in player performance and group similar players together based on their roles.

c) High Dimensionality: SVMs perform well in high-dimensional spaces, meaning the large number of stats (features) can be handled effectively by the SVM to find decision boundaries between clusters. SVM can separate players into distinct categories with complex decision boundaries when needed.

d) Feature Scaling: Stats have very different ranges (e.g., Home Runs vs. On-Base Percentage), so scaling techniques are used to normalize them and ensure each variable contributes equally to the clustering and classification process (a short sketch follows this section).

This combination of features makes the dataset ideal for both the initial clustering (to define player archetypes) and the subsequent use of SVM to classify players into these categories.
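
To make point (d) concrete, here is a minimal sketch using scikit-learn's StandardScaler (the same scaler applied later in the notebook). The Home Run and On-Base Percentage values below are made up for illustration and are not taken from the FanGraphs data.

# Minimal sketch (illustrative values): HR is a counting stat with a large
# range, OBP a rate stat with a small range; standardization puts them on a
# comparable scale so neither dominates distance-based clustering.
import pandas as pd
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame({
    "HR":  [5, 18, 32, 45],
    "OBP": [0.300, 0.340, 0.370, 0.410],
})

scaled = StandardScaler().fit_transform(toy)
print(pd.DataFrame(scaled, columns=toy.columns).round(2))
# After standardization both columns have mean 0 and unit variance.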

6) What are your conclusions (include references to one or two CLEARLY INDICATED AND IMPORTANT graphs or tables in your output)?

  • The analysis using K-Means clustering and SVM classification reveals four distinct player categories: Power Hitters, Power-Speed Hybrids, Speedy Contact Hitters, and Defensive Specialists. These clusters reflect different offensive and defensive skill sets among players. The t-SNE visualization (Figure 1) clearly shows that Cluster 1 (Defensive Specialists) is distinctly separated from the other groups, while Clusters 0 and 3 overlap, indicating similar characteristics but with differences in power and speed. Additionally, the variance analysis (Table 1) of key metrics like Home Runs, ISO, and Stolen Bases supports the differentiation between power-focused players and contact/speed-oriented players across the clusters.

7) What are your assumptions and limitations? What robustness checks did you perform or would you perform?

Assumptions:

Statistical Independence:

  • We assume player stats are independent, even though some (like OBP and AVG) may be correlated.

Consistent Performance:

  • Player performance in the dataset is assumed to reflect their typical ability, without accounting for factors like injuries or anomalies.

Cluster Interpretability:

  • We assume the clustering algorithm will naturally group players into meaningful categories.

Balanced Dataset:

  • We assume the dataset contains enough diversity in player types for well-defined clusters.

Limitations:

Feature Redundancy:

  • Correlated stats could reduce the clarity of clusters.

Snapshot of Performance:

  • The data may not capture long-term player ability due to year-to-year variability.

Cluster Overlap:

  • Some clusters may not have clear or interpretable boundaries.

SVM Complexity:

  • The decision boundaries created by SVM may be complex and less intuitive.

Data Imbalance:

  • Certain player types may be underrepresented, impacting the clustering and classification.

Robustness Checks:

Feature Correlation:

  • We would check for highly correlated variables and adjust for redundancy.

Cross-Validation:

  • Use cross-validation to ensure SVM generalizes well to new players.

Clustering Sensitivity:

  • Test different numbers of clusters to ensure optimal grouping.

Normalization:

  • Scale stats to prevent any one feature from dominating the analysis.

Outlier Detection:

  • Address outliers that could distort clustering.
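
As a concrete starting point for the feature-correlation and outlier checks listed above, a sketch like the following could be run on the numeric columns of the FanGraphs DataFrame (called data, matching the notebook below); the 0.9 correlation threshold and |z| > 4 cutoff are illustrative choices, not fixed rules.

# Sketch of two robustness checks: flag highly correlated feature pairs and
# count extreme values per feature. Thresholds are illustrative.
import numpy as np
import pandas as pd

def correlation_and_outlier_report(data, corr_threshold=0.9, z_cutoff=4.0):
    numeric = data.select_dtypes(include=["float64", "int64"])

    # 1) Highly correlated pairs (candidates for dropping or combining)
    corr = numeric.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack().loc[lambda s: s > corr_threshold].sort_values(ascending=False)
    print(f"{len(pairs)} feature pairs with |corr| > {corr_threshold}")
    print(pairs.head(10))

    # 2) Simple z-score outlier count per feature
    z = (numeric - numeric.mean()) / numeric.std(ddof=0)
    outlier_counts = (z.abs() > z_cutoff).sum().sort_values(ascending=False)
    print("Features with the most extreme values:")
    print(outlier_counts.head(10))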

Packages¶

In [1]:
%matplotlib inline
In [2]:
%%capture
from pandas.errors import PerformanceWarning
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.model_selection import cross_val_score
from sklearn.manifold import TSNE
In [3]:
# Silence warnings
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=pd.errors.SettingWithCopyWarning)
warnings.simplefilter("ignore", PerformanceWarning)

EDA¶

In [4]:
data = pd.read_csv("/content/fangraphs-leaderboards.csv", encoding='latin1')
data.head()
Out[4]:
Season Name Team G PA HR R RBI SB BB% ... TG.4 Bats.4 FPTS.4 FPTS/G.4 SPTS.4 SPTS/G.4 XBR.3 NameASCII PlayerId MLBAMID
0 2024 Corbin Carroll ARI 158 684 22 121 74 35 0.106725 ... 162 L NaN NaN NaN NaN 8.625678 Corbin Carroll 25878 682998
1 2018 Dee Strange-Gordon SEA 141 588 4 62 36 30 0.015306 ... 162 L NaN NaN NaN NaN 6.418709 Dee Strange-Gordon 8203 543829
2 2023 Gunnar Henderson BAL 150 622 28 100 82 10 0.090032 ... 162 L NaN NaN NaN NaN 6.097221 Gunnar Henderson 26289 683002
3 2017 Xander Bogaerts BOS 148 635 10 94 62 15 0.088189 ... 162 R NaN NaN NaN NaN 6.003955 Xander Bogaerts 12161 593428
4 2018 Ozzie Albies ATL 158 684 24 105 72 14 0.052632 ... 162 B NaN NaN NaN NaN 5.864576 Ozzie Albies 16556 645277

5 rows × 1722 columns

In [5]:
# Check for duplicate column names
duplicate_column_names = data.columns[data.columns.duplicated()].tolist()

# Display duplicate column names
print("Duplicate Column Names:", duplicate_column_names)
Duplicate Column Names: []
In [6]:
# Transpose the DataFrame and drop duplicates based on identical column values
duplicate_columns = data.T.duplicated()

# List the columns that are duplicated
repeated_columns = data.columns[duplicate_columns]

# Display the repeated columns
print("Repeated Columns:", repeated_columns)
Repeated Columns: Index(['Team.1', 'G.1', 'PA.1', 'HR.1', 'R.1', 'RBI.1', 'SB.1', 'AVG.1',
       'BB%.1', 'K%.1',
       ...
       'wOppTeamV.4', 'wTeamV.4', 'wNetBatV.4', 'TG.4', 'Bats.4', 'FPTS.4',
       'FPTS/G.4', 'SPTS.4', 'SPTS/G.4', 'XBR.3'],
      dtype='object', length=1391)
In [7]:
# Transpose the DataFrame and check for duplicated columns based on identical content
duplicate_columns = data.T.duplicated()

# Get the names of the duplicated columns
repeated_columns = data.columns[duplicate_columns]

# Drop the duplicated columns from the DataFrame
data = data.drop(columns=repeated_columns)

# Display the remaining columns to verify that duplicates are removed
print(f"Deleted {len(repeated_columns)} duplicated columns.")
Deleted 1391 duplicated columns.
In [8]:
# Display a list of all column names
column_names = data.columns.tolist()

# Print the list of column names
print("List of columns in the DataFrame:")
print(column_names)
List of columns in the DataFrame:
['Season', 'Name', 'Team', 'G', 'PA', 'HR', 'R', 'RBI', 'SB', 'BB%', 'K%', 'ISO', 'BABIP', 'AVG', 'OBP', 'SLG', 'wOBA', 'xwOBA', 'wRC+', 'BsR', 'Off', 'Def', 'WAR', 'AB', 'H', '1B', '2B', '3B', 'BB', 'IBB', 'SO', 'HBP', 'SF', 'SH', 'GDP', 'CS', 'GB', 'FB', 'LD', 'IFFB', 'Pitches', 'Balls', 'Strikes', 'IFH', 'BU', 'BUH', 'BB/K', 'OPS', 'GB/FB', 'LD%', 'GB%', 'FB%', 'IFFB%', 'HR/FB', 'IFH%', 'BUH%', 'wRAA', 'wRC', 'Bat', 'Fld', 'Rep', 'Pos', 'RAR', 'Dol', 'Spd', 'WPA', '-WPA', '+WPA', 'RE24', 'REW', 'pLI', 'phLI', 'PH', 'WPA/LI', 'Clutch', 'FB%.1', 'FBv', 'SL%', 'SLv', 'CT%', 'CTv', 'CB%', 'CBv', 'CH%', 'CHv', 'SF%', 'SFv', 'KN%', 'KNv', 'XX%', 'wFB', 'wSL', 'wCT', 'wCB', 'wCH', 'wSF', 'wKN', 'wFB/C', 'wSL/C', 'wCT/C', 'wCB/C', 'wCH/C', 'wSF/C', 'wKN/C', 'O-Swing%', 'Z-Swing%', 'Swing%', 'O-Contact%', 'Z-Contact%', 'Contact%', 'Zone%', 'F-Strike%', 'SwStr%', 'FA% (sc)', 'FC% (sc)', 'FS% (sc)', 'FO% (sc)', 'SI% (sc)', 'SL% (sc)', 'CU% (sc)', 'KC% (sc)', 'EP% (sc)', 'CH% (sc)', 'SC% (sc)', 'KN% (sc)', 'vFA (sc)', 'vFC (sc)', 'vFS (sc)', 'vFO (sc)', 'vSI (sc)', 'vSL (sc)', 'vCU (sc)', 'vKC (sc)', 'vEP (sc)', 'vCH (sc)', 'vSC (sc)', 'vKN (sc)', 'FA-X (sc)', 'FC-X (sc)', 'FS-X (sc)', 'FO-X (sc)', 'SI-X (sc)', 'SL-X (sc)', 'CU-X (sc)', 'KC-X (sc)', 'EP-X (sc)', 'CH-X (sc)', 'SC-X (sc)', 'KN-X (sc)', 'FA-Z (sc)', 'FC-Z (sc)', 'FS-Z (sc)', 'FO-Z (sc)', 'SI-Z (sc)', 'SL-Z (sc)', 'CU-Z (sc)', 'KC-Z (sc)', 'EP-Z (sc)', 'CH-Z (sc)', 'SC-Z (sc)', 'KN-Z (sc)', 'wFA (sc)', 'wFC (sc)', 'wFS (sc)', 'wFO (sc)', 'wSI (sc)', 'wSL (sc)', 'wCU (sc)', 'wKC (sc)', 'wEP (sc)', 'wCH (sc)', 'wSC (sc)', 'wKN (sc)', 'wFA/C (sc)', 'wFC/C (sc)', 'wFS/C (sc)', 'wFO/C (sc)', 'wSI/C (sc)', 'wSL/C (sc)', 'wCU/C (sc)', 'wKC/C (sc)', 'wEP/C (sc)', 'wCH/C (sc)', 'wSC/C (sc)', 'wKN/C (sc)', 'O-Swing% (sc)', 'Z-Swing% (sc)', 'Swing% (sc)', 'O-Contact% (sc)', 'Z-Contact% (sc)', 'Contact% (sc)', 'Zone% (sc)', 'Pace', 'wSB', 'UBR', 'Age Rng', 'Lg', 'wGDP', 'Pull%', 'Cent%', 'Oppo%', 'Soft%', 'Med%', 'Hard%', 'TTO%', 'CH% (pi)', 'CS% (pi)', 'CU% (pi)', 'FA% (pi)', 'FC% (pi)', 'FS% (pi)', 'KN% (pi)', 'SB% (pi)', 'SI% (pi)', 'SL% (pi)', 'XX% (pi)', 'vCH (pi)', 'vCS (pi)', 'vCU (pi)', 'vFA (pi)', 'vFC (pi)', 'vFS (pi)', 'vKN (pi)', 'vSB (pi)', 'vSI (pi)', 'vSL (pi)', 'vXX (pi)', 'CH-X (pi)', 'CS-X (pi)', 'CU-X (pi)', 'FA-X (pi)', 'FC-X (pi)', 'FS-X (pi)', 'KN-X (pi)', 'SB-X (pi)', 'SI-X (pi)', 'SL-X (pi)', 'XX-X (pi)', 'CH-Z (pi)', 'CS-Z (pi)', 'CU-Z (pi)', 'FA-Z (pi)', 'FC-Z (pi)', 'FS-Z (pi)', 'KN-Z (pi)', 'SB-Z (pi)', 'SI-Z (pi)', 'SL-Z (pi)', 'XX-Z (pi)', 'wCH (pi)', 'wCS (pi)', 'wCU (pi)', 'wFA (pi)', 'wFC (pi)', 'wFS (pi)', 'wKN (pi)', 'wSB (pi)', 'wSI (pi)', 'wSL (pi)', 'wXX (pi)', 'wCH/C (pi)', 'wCS/C (pi)', 'wCU/C (pi)', 'wFA/C (pi)', 'wFC/C (pi)', 'wFS/C (pi)', 'wKN/C (pi)', 'wSB/C (pi)', 'wSI/C (pi)', 'wSL/C (pi)', 'wXX/C (pi)', 'O-Swing% (pi)', 'Z-Swing% (pi)', 'Swing% (pi)', 'O-Contact% (pi)', 'Z-Contact% (pi)', 'Contact% (pi)', 'Zone% (pi)', 'FRM', 'AVG+', 'BB%+', 'K%+', 'OBP+', 'SLG+', 'ISO+', 'BABIP+', 'LD+%', 'GB%+', 'FB%+', 'HR/FB%+', 'Pull%+', 'Cent%+', 'Oppo%+', 'Soft%+', 'Med%+', 'Hard%+', 'EV', 'LA', 'Barrels', 'Barrel%', 'maxEV', 'HardHit', 'HardHit%', 'Events', 'CStr%', 'CSW%', 'L-WAR', 'PPTV', 'CPTV', 'BPTV', 'DSV', 'DGV', 'BTV', 'wPPTV', 'wCPTV', 'wBPTV', 'wDSV', 'wDGV', 'wBTV', 'EBV', 'ESV', 'wOppTeamV', 'wTeamV', 'wNetBatV', 'TG', 'Bats', 'XBR', 'Age', 'NameASCII', 'PlayerId', 'MLBAMID']
In [9]:
# Sanity check on the first column's name (this rename is a no-op when the header is already 'Season')
data.rename(columns={'Season': 'Season'}, inplace=True)

# Verify the column renaming
print(data.columns)
Index(['Season', 'Name', 'Team', 'G', 'PA', 'HR', 'R', 'RBI', 'SB', 'BB%',
       ...
       'wOppTeamV', 'wTeamV', 'wNetBatV', 'TG', 'Bats', 'XBR', 'Age',
       'NameASCII', 'PlayerId', 'MLBAMID'],
      dtype='object', length=331)

Missing Values¶

In [10]:
# Check for missing values
data.isnull().sum()
Out[10]:
0
Season 0
Name 0
Team 0
G 0
PA 0
... ...
XBR 142
Age 0
NameASCII 0
PlayerId 0
MLBAMID 0

331 rows × 1 columns


For this project, dropping columns with more than 20% missing data is a reasonable approach. Since the goal is to cluster MLB players based on performance statistics and define clear player archetypes, significant amounts of missing data could distort the clustering and obscure relationships between variables. While imputation strategies may be useful in some cases, they can introduce bias and reduce clarity. By focusing on solid, complete data, we ensure that the resulting player archetypes are more accurate and interpretable, allowing for a clearer understanding of how performance metrics define player roles.

In [11]:
# Threshold: keep columns with at least 80% non-missing values
threshold = len(data) * 0.80

# Drop columns with more than 20% missing values
data = data.dropna(axis=1, thresh=threshold)

# Recalculate the percentage of missing values per column
missing_percentages = (data.isnull().sum() / len(data)) * 100

# Sort and display the top 15 columns with the highest percentage of missing values
top_15_null_columns = missing_percentages.sort_values(ascending=False).head(15)

# Display
print(top_15_null_columns)
phLI        18.400000
XX%         14.618182
XBR         10.327273
Fld          0.727273
UBR          0.218182
wGDP         0.218182
CH% (pi)     0.000000
FS% (pi)     0.000000
FC% (pi)     0.000000
FA% (pi)     0.000000
CU% (pi)     0.000000
Season       0.000000
TTO%         0.000000
SI% (pi)     0.000000
Med%         0.000000
dtype: float64

Imputing

In [12]:
# Select only numeric columns (int and float) for skewness and imputation
numeric_columns = data.select_dtypes(include=['float64', 'int64'])

# Calculate skewness for numeric columns only
skewness = numeric_columns.skew()

# Define a function to impute based on skewness
def impute_based_on_skewness(df, skewness):
    for column in df.columns:
        if df[column].isnull().sum() > 0:
            if abs(skewness[column]) > 0.5:
                # Skewed distribution: the median is more robust to outliers
                df[column] = df[column].fillna(df[column].median())
            else:
                # Roughly symmetric distribution: impute with the mean
                df[column] = df[column].fillna(df[column].mean())

# Apply the function to impute missing values for the numeric data
impute_based_on_skewness(numeric_columns, skewness)

# Replace the numeric columns in the original dataframe with the imputed ones
data[numeric_columns.columns] = numeric_columns

# Check the result to see if missing values are imputed
print(data.isnull().sum())
Season       0
Name         0
Team         0
G            0
PA           0
            ..
XBR          0
Age          0
NameASCII    0
PlayerId     0
MLBAMID      0
Length: 264, dtype: int64
In [13]:
# Display only the columns with missing values
missing_columns = data.columns[data.isnull().any()]  # Get columns with missing values
missing_data = data[missing_columns].isnull().sum()  # Count missing values in those columns

# Display the columns with their missing value count
print(missing_data)
Series([], dtype: float64)
In [14]:
# Check the data types of the problematic columns
print(data[['Age Rng', 'Bats']].dtypes)
# Display the first few rows of the problematic columns
print(data[['Age Rng', 'Bats']].head(10))
Age Rng    object
Bats       object
dtype: object
   Age Rng Bats
0  23 - 23    L
1  30 - 30    L
2  22 - 22    L
3  24 - 24    R
4  21 - 21    B
5  23 - 23    R
6  27 - 27    L
7  30 - 30    B
8  25 - 25    R
9  27 - 27    B
In [15]:
# Split the Age Rng into Min Age and Max Age
data[['Min Age', 'Max Age']] = data['Age Rng'].str.split(' - ', expand=True)

# Convert the new columns to numeric
data['Min Age'] = pd.to_numeric(data['Min Age'], errors='coerce')
data['Max Age'] = pd.to_numeric(data['Max Age'], errors='coerce')
In [16]:
# Check for missing values in the dataset
missing_values = data.isnull().sum()

# Filter and display only columns with missing values
columns_with_missing_values = missing_values[missing_values > 0]

# Display the columns with missing values
print(columns_with_missing_values)
Series([], dtype: int64)
In [17]:
# Calculate skewness for Min Age and Max Age
min_age_skewness = data['Min Age'].skew()
max_age_skewness = data['Max Age'].skew()

print(f"Skewness of 'Min Age': {min_age_skewness:.2f}")
print(f"Skewness of 'Max Age': {max_age_skewness:.2f}")

# Plot histograms to visualize the distribution
plt.figure(figsize=(10, 5))

plt.subplot(1, 2, 1)
data['Min Age'].hist(bins=20)
plt.title('Min Age Distribution')

plt.subplot(1, 2, 2)
data['Max Age'].hist(bins=20)
plt.title('Max Age Distribution')

plt.tight_layout()
plt.show()
Skewness of 'Min Age': 0.46
Skewness of 'Max Age': 0.46
In [18]:
# Impute missing values in Min Age and Max Age using the mean
data['Min Age'].fillna(data['Min Age'].mean(), inplace=True)
data['Max Age'].fillna(data['Max Age'].mean(), inplace=True)

# Verify that there are no more missing values
print(data[['Min Age', 'Max Age']].isnull().sum())
Min Age    0
Max Age    0
dtype: int64

Data Information¶

Feature Scaling

In [19]:
# Check the range of values in numeric columns
range_check = data.select_dtypes(include=['float64', 'int64']).agg(['min', 'max'])

# Display range check
print(range_check)
     Season    G   PA  HR    R  RBI  SB       BB%        K%       ISO  ...  \
min    2015   44  186   0   13   10   0  0.015306  0.043155  0.050710  ...   
max    2024  162  753  62  149  144  73  0.221713  0.439024  0.379249  ...   

     wOppTeamV    wTeamV  wNetBatV   TG        XBR  Age  PlayerId  MLBAMID  \
min   0.000000 -0.523237 -0.387861   58 -10.284589   20       393   116338   
max   0.748088  0.000000  0.681861  163   8.625678   41     33333   807799   

     Min Age  Max Age  
min       20       20  
max       41       41  

[2 rows x 261 columns]
In [20]:
# Step 1: Retain 'PlayerId', 'MLBAMID', 'Season', and 'Name' for later analysis (exclude them from PCA)
columns_to_exclude = ['PlayerId', 'MLBAMID', 'Season', 'Name']
features_for_pca = data.drop(columns=columns_to_exclude)

# Step 2: Select only numeric columns for PCA
numeric_features = features_for_pca.select_dtypes(include=['float64', 'int64'])

# Step 3: Standardize the features before applying PCA
scaler = StandardScaler()
scaled_features = scaler.fit_transform(numeric_features)

# Step 4: Apply PCA (retain 95% of variance)
pca = PCA(n_components=0.95)
pca_features = pca.fit_transform(scaled_features)

# Display how many components were retained after PCA
print(f"PCA reduced the dataset to {pca_features.shape[1]} components.")

# Step 5: Visualize the explained variance by each principal component
explained_variance = pca.explained_variance_ratio_

plt.figure(figsize=(8, 5))
plt.plot(range(1, len(explained_variance) + 1), explained_variance.cumsum(), marker='o', linestyle='--')
plt.title('Explained Variance by Principal Components')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.grid(True)
plt.show()
PCA reduced the dataset to 72 components.

Explanation

The explained variance plot shows that PCA reduced the dataset to 72 components. The first 20 components capture around 70% of the variance, with the curve gradually flattening as more components are added. Approximately 90% of the total variance is captured by the first 50 components, while the remaining components contribute minimally. This suggests that most of the important variance is explained by the initial components, and the additional components provide diminishing returns in terms of variance captured.
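
As a quick check on that reading of the curve, the cumulative explained variance can be queried directly from the fitted pca object above; the 70%/90%/95% targets below are just the figures referenced in the text.

# Count how many components are needed to reach a few variance targets,
# using the pca object fitted in the previous cell.
import numpy as np

cumvar = np.cumsum(pca.explained_variance_ratio_)
for target in (0.70, 0.90, 0.95):
    n_needed = int(np.argmax(cumvar >= target)) + 1
    print(f"{n_needed} components capture >= {target:.0%} of the variance")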

Modeling¶

K-Means Clustering¶

I will also exclude age, because it would likely group all players of the same age into the same cluster. I am also removing anything else driven largely by age or playing time, such as pitches, strikes, balls, at-bats (AB), and plate appearances (PA). Since age is effectively a label we could simply group by, letting it drive the clustering would not be interesting.

In [21]:
# Step 1: Exclude non-performance columns AND 'Age', 'Min Age', 'Max Age' for clustering
columns_to_exclude = ['PlayerId', 'MLBAMID', 'Season', 'Name', 'Age', 'Min Age', 'Max Age', 'Pitches','Strikes','Balls', 'AB', 'PA']
features_for_kmeans = data.drop(columns=columns_to_exclude)

# Step 2: Initialize KMeans model (with 3 clusters, adjust based on your needs)
kmeans = KMeans(n_clusters=3, random_state=42)

# Step 3: Fit the KMeans model on PCA-transformed features
kmeans.fit(pca_features)

# Step 4: Get cluster labels for each player (these labels will be used as the target in SVM)
cluster_labels = kmeans.labels_

# Step 5: Retain 'PlayerId', 'MLBAMID', 'Season', and 'Name' for post-analysis (IDs and seasons)
ids_and_season = data[['PlayerId', 'MLBAMID', 'Season', 'Name']]

# Step 6: Combine the IDs, seasons, and cluster labels to review
final_result = pd.concat([ids_and_season.reset_index(drop=True), pd.DataFrame(cluster_labels, columns=['Cluster'])], axis=1)

# Step 7: View the final result to see which players belong to each cluster
print(final_result.head())

# Get the unique clusters and their counts
cluster_counts = pd.Series(cluster_labels).value_counts()

# Display the number of players in each cluster
print(cluster_counts)

# Display the total number of unique clusters
print(f"Total number of clusters: {len(cluster_counts)}")
   PlayerId  MLBAMID  Season                Name  Cluster
0     25878   682998    2024      Corbin Carroll        2
1      8203   543829    2018  Dee Strange-Gordon        0
2     26289   683002    2023    Gunnar Henderson        2
3     12161   593428    2017     Xander Bogaerts        0
4     16556   645277    2018        Ozzie Albies        0
1    491
0    444
2    440
Name: count, dtype: int64
Total number of clusters: 3

Three clusters seems too few; let's use the elbow method to choose a better number.

In [22]:
# Step 1: Calculate inertia (within-cluster sum of squares) for different cluster sizes
inertia_values = []
cluster_range = range(2, 11)  # Test cluster sizes from 2 to 10

for n_clusters in cluster_range:
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    kmeans.fit(pca_features)
    inertia_values.append(kmeans.inertia_)

# Step 2: Plot the inertia values to find the "elbow"
plt.figure(figsize=(8, 5))
plt.plot(cluster_range, inertia_values, marker='o', linestyle='--')
plt.title('Elbow Method for Optimal Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia (Within-Cluster Sum of Squares)')
plt.grid(True)
plt.show()

Elbow method interpretation: The elbow in the plot (where the inertia starts to decrease more slowly) seems to occur around 4 or 5 clusters. After this point, the reduction in inertia (within-cluster sum of squares) starts to level off. This suggests that 4 or 5 clusters might be a good choice, as adding more clusters beyond this point does not significantly reduce the inertia.
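
To complement the elbow plot, the silhouette score (imported above but not otherwise used in this notebook) can be computed over the same range of cluster counts; this is a sketch run on the same pca_features, where higher average silhouette values indicate tighter, better-separated clusters.

# Average silhouette score for k = 2..10 on the PCA-transformed features.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 11):
    km = KMeans(n_clusters=k, random_state=42)
    labels = km.fit_predict(pca_features)
    print(f"k = {k}: silhouette = {silhouette_score(pca_features, labels):.3f}")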

In [23]:
# Step 1: Initialize KMeans with 4 clusters
optimal_n_clusters = 4
kmeans = KMeans(n_clusters=optimal_n_clusters, random_state=42)

# Step 2: Fit the KMeans model on PCA-transformed features (excluding age-related columns)
kmeans.fit(pca_features)

# Step 3: Get the new cluster labels
cluster_labels = kmeans.labels_

# Step 4: Retain 'PlayerId', 'MLBAMID', 'Season', and 'Name' for post-analysis (IDs and seasons)
ids_and_season = data[['PlayerId', 'MLBAMID', 'Season', 'Name']]

# Step 5: Combine the IDs, seasons, and cluster labels
final_result = pd.concat([ids_and_season.reset_index(drop=True), pd.DataFrame(cluster_labels, columns=['Cluster'])], axis=1)

# Step 6: View the final result
print(final_result.head())

# Get the unique clusters and their counts
cluster_counts = pd.Series(cluster_labels).value_counts()

# Display the number of players in each cluster
print(cluster_counts)

# Display the total number of unique clusters
print(f"Total number of clusters: {len(cluster_counts)}")
   PlayerId  MLBAMID  Season                Name  Cluster
0     25878   682998    2024      Corbin Carroll        3
1      8203   543829    2018  Dee Strange-Gordon        0
2     26289   683002    2023    Gunnar Henderson        3
3     12161   593428    2017     Xander Bogaerts        0
4     16556   645277    2018        Ozzie Albies        0
0    470
3    400
2    364
1    141
Name: count, dtype: int64
Total number of clusters: 4

SVM¶

I am excluding age because it would dominate the clustering process.

In [24]:
# Step 1: Exclude ID, season, name, age, and playing-time columns (same exclusion list as for clustering)
features_for_svm = data.drop(columns=columns_to_exclude)

# Step 2: Select only numeric columns for SVM
numeric_features = features_for_svm.select_dtypes(include=['float64', 'int64'])

# Step 3: Split the PCA-transformed numeric data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(pca_features, cluster_labels, test_size=0.3, random_state=42)

# Step 4: Initialize and train the SVM model
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)

# Step 5: Evaluate the SVM model
print(f"Training Accuracy: {svm_model.score(X_train, y_train)}")
print(f"Test Accuracy: {svm_model.score(X_test, y_test)}")

# Step 6: Make predictions on the test set
predicted_labels = svm_model.predict(X_test)

# Step 7: Combine the IDs, seasons, and predicted labels for final analysis
final_result_svm = pd.concat([ids_and_season.reset_index(drop=True), pd.DataFrame(predicted_labels, columns=['Predicted Cluster'])], axis=1)

# Step 8: View the final result (showing predicted clusters)
print(final_result_svm.head())
Training Accuracy: 1.0
Test Accuracy: 0.9564164648910412
   PlayerId  MLBAMID  Season                Name  Predicted Cluster
0     25878   682998    2024      Corbin Carroll                3.0
1      8203   543829    2018  Dee Strange-Gordon                0.0
2     26289   683002    2023    Gunnar Henderson                2.0
3     12161   593428    2017     Xander Bogaerts                0.0
4     16556   645277    2018        Ozzie Albies                0.0

The SVM model has shown strong performance, achieving a training accuracy of 1.0 (100%) and a test accuracy of 95.6%. The perfect training accuracy indicates that the model fits the training data exceptionally well, classifying every training example correctly. The slightly lower test accuracy suggests that the model still generalizes well to unseen data, predicting cluster assignments for the test set with high accuracy. This balance between training and test accuracy indicates that the model is not overfitting severely, though the perfect training accuracy does raise a minor concern.

The Predicted Cluster column in the results represents the SVM model's classification of players into one of the four clusters. For instance, Corbin Carroll (2024 season) is assigned to Cluster 3, Dee Strange-Gordon (2018 season) is in Cluster 0, and Gunnar Henderson (2023 season) is in Cluster 2. These clusters were originally created using K-Means clustering, and the SVM model is now predicting these clusters based on the PCA-reduced features. The high test accuracy suggests that the model is successfully distinguishing between different types of players across seasons based on performance data.

Overfitting uncertainty

Given how accurate the model was, I will run cross-validation to confirm that it is not overfitting.

In [25]:
# Step 1: Exclude ID, season, name, age, and playing-time columns (same exclusion list as for clustering)
features_for_svm = data.drop(columns=columns_to_exclude)

# Step 2: Select only numeric columns for SVM
numeric_features = features_for_svm.select_dtypes(include=['float64', 'int64'])

# Step 3: Perform cross-validation using the PCA-transformed features and cluster labels as target
svm_model = SVC(kernel='linear')

# Perform 5-fold cross-validation (using PCA-transformed features and cluster labels)
cv_scores = cross_val_score(svm_model, pca_features, cluster_labels, cv=5)

# Step 4: Display the cross-validation scores and the mean score (rounded to 3 decimal places)
cv_scores_rounded = [round(score, 3) for score in cv_scores]
mean_cv_score_rounded = round(cv_scores.mean(), 3)

# Step 5: Print the rounded scores
print(f"Cross-validation scores for each fold: {cv_scores_rounded}")
print(f"Mean cross-validation accuracy: {mean_cv_score_rounded}")
Cross-validation scores for each fold: [0.949, 0.964, 0.935, 0.96, 0.945]
Mean cross-validation accuracy: 0.951

Interpretation

The cross-validation results show consistent performance across the five folds, with accuracies ranging from 94.9% to 96.4%. This indicates that the SVM model generalizes well, as there are no significant drops in accuracy across the different subsets of the data. The mean cross-validation accuracy of 95.1% aligns closely with the individual fold scores, confirming that the model maintains high accuracy on average.

The low variance between the fold scores, ranging from 0.935 to 0.964, suggests that the model is not overfitting and is robust in predicting the cluster labels generated by K-Means. Overall, the model generalizes effectively across the dataset, making it well-suited for classifying players based on performance metrics. Based on this information, I will confidently rely on the initial model, as overfitting is not a significant concern.
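
As an additional check beyond overall accuracy, per-cluster precision and recall can be inspected on the held-out test set; the sketch below uses scikit-learn's classification_report and confusion_matrix on the X_test/y_test split created earlier.

# Per-cluster precision/recall on the held-out test set, to confirm that no
# single cluster is driving the overall accuracy.
from sklearn.metrics import classification_report, confusion_matrix

test_predictions = svm_model.predict(X_test)
print(confusion_matrix(y_test, test_predictions))
print(classification_report(y_test, test_predictions, digits=3))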

Plotting Clusters

t-SNE Chart

In [26]:
# Impute NaNs with the median value of each column
numeric_columns = numeric_columns.fillna(numeric_columns.median())

# Check again for any remaining NaNs
print(numeric_columns.isnull().sum().sum())
0
In [40]:
# Apply t-SNE for 2D visualization of clusters
tsne = TSNE(n_components=2, random_state=42)
tsne_features = tsne.fit_transform(numeric_columns)

# Plot the t-SNE results
plt.figure(figsize=(8, 6))
plt.scatter(tsne_features[:, 0], tsne_features[:, 1], c=cluster_labels, cmap='viridis', s=50)
plt.title('t-SNE Visualization of Clusters')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.colorbar(label='Cluster')
plt.grid(True)

# Save the figure as a PNG file before displaying
plt.savefig("tsne_plot.png", format='png')

# Display the plot in the notebook
plt.show()

This t-SNE visualization provides a two-dimensional view of the clusters derived from the player performance data. Cluster 1 (blue) is distinctly separated on the left, suggesting these players have characteristics quite different from the other clusters, reinforcing the idea that they may be defensive specialists or outliers in their performance metrics. The other three clusters (0, 2, and 3) overlap significantly, especially Cluster 0 (purple) and Cluster 3 (yellow), indicating that these players share more similar traits but still have distinct differences in specific metrics (e.g., power versus speed). This suggests further exploration into their unique metrics is necessary to fully distinguish between them.

Cluster Characteristics¶

Moving forward, it would be beneficial to investigate the characteristics of each cluster more closely to understand what defines them. For example, Cluster 0 may consist of players with a particular skill set (e.g., speed-focused), while Cluster 3 could represent players with different attributes (e.g., power hitters). Additionally, if there is a concern about overfitting, further tuning of the SVM’s regularization parameters (such as adjusting the C value) could help find a better balance between model complexity and performance. Overall, the model is performing exceptionally well, and the next logical step would be to analyze and interpret the meaning of these clusters in the context of player performance.
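
For the regularization tuning mentioned above, a small grid search over C (and, optionally, the kernel) could be run against the same PCA features and K-Means labels; the parameter grid below is illustrative, and this sketch is not part of the fitted results reported earlier.

# Tune the SVM's C (and kernel) with 5-fold cross-validated grid search.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "C": [0.1, 1, 10, 100],
    "kernel": ["linear", "rbf"],
}
grid = GridSearchCV(SVC(), param_grid, cv=5, n_jobs=-1)
grid.fit(pca_features, cluster_labels)

print("Best parameters:", grid.best_params_)
print(f"Best cross-validated accuracy: {grid.best_score_:.3f}")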

In [28]:
# Step 1: Add the predicted cluster labels from the SVM model to the original dataset
final_result_svm = pd.concat([ids_and_season.reset_index(drop=True), pd.DataFrame(predicted_labels, columns=['Predicted Cluster'])], axis=1)

# Step 2: Combine the predicted clusters with the original data
data_with_predicted_clusters = pd.concat([data, pd.DataFrame(predicted_labels, columns=['Predicted Cluster'])], axis=1)

# Step 3: Exclude non-performance columns like 'PlayerId', 'MLBAMID', 'Season', and 'Name'
performance_data = data_with_predicted_clusters.drop(columns=columns_to_exclude)

# Step 4: Select only numeric performance-related columns for analysis
numeric_columns = performance_data.select_dtypes(include=['float64', 'int64'])

# Step 5: Calculate the mean values of the numeric performance features for each predicted cluster
cluster_means_svm = numeric_columns.groupby('Predicted Cluster').mean()

# Step 6: Display the mean values of features for each predicted cluster
print(cluster_means_svm)
                            G         HR          R        RBI         SB  \
Predicted Cluster                                                           
0.0                142.500000  20.173913  82.557971  70.833333  16.318841   
1.0                143.218182  18.018182  78.654545  66.563636  15.145455   
2.0                142.982456  20.359649  82.026316  71.464912  16.052632   
3.0                137.735849  18.745283  77.754717  65.915094  13.716981   

                        BB%        K%       ISO     BABIP       AVG  ...  \
Predicted Cluster                                                    ...   
0.0                0.085231  0.191754  0.184516  0.312136  0.273484  ...   
1.0                0.079007  0.191835  0.170632  0.303551  0.264341  ...   
2.0                0.084487  0.204223  0.181110  0.310047  0.267394  ...   
3.0                0.083558  0.201764  0.178396  0.307451  0.265367  ...   

                       Events     CStr%      CSW%     L-WAR       DGV  \
Predicted Cluster                                                       
0.0                421.768116  0.163732  0.261157  3.694578  0.057971   
1.0                427.654545  0.163004  0.261697  3.128038  0.000000   
2.0                423.807018  0.160050  0.264392  3.428330  0.017544   
3.0                402.084906  0.165295  0.266392  3.173385  0.047170   

                   wOppTeamV    wTeamV  wNetBatV          TG       XBR  
Predicted Cluster                                                       
0.0                 0.039185 -0.006331  0.032854  156.811594  2.430008  
1.0                 0.009599 -0.004448  0.005151  158.218182  2.140175  
2.0                 0.044106 -0.005176  0.038931  158.324561  2.398534  
3.0                 0.035820 -0.003017  0.032803  152.339623  2.276904  

[4 rows x 250 columns]
In [29]:
# Calculate the mean for each cluster across all numeric features
cluster_means = numeric_columns.groupby(cluster_labels).mean()

# Calculate the variance for each cluster across all numeric features
cluster_variance = numeric_columns.groupby(cluster_labels).var()

# Display the top metrics with the highest variance across clusters
top_variance_metrics = cluster_variance.mean().sort_values(ascending=False).head(10)
print("Top 10 metrics with highest variance across clusters:")
print(top_variance_metrics)

# Display the means for top variance metrics
print("Cluster means for top variance metrics:")
print(cluster_means[top_variance_metrics.index])
Top 10 metrics with highest variance across clusters:
Events     2594.675240
HR/FB%+    1725.110894
BB%+       1203.723488
GB         1039.402947
FB          817.909615
HardHit     795.804251
ISO+        768.148936
SO          761.338795
K%+         599.628192
Soft%+      387.422831
dtype: float64
Cluster means for top variance metrics:
       Events     HR/FB%+        BB%+          GB          FB     HardHit  \
0  440.780851   80.703824   89.055797  197.259574  145.644681  145.542553   
1  151.418440  112.259210  104.759454   64.021277   54.134752   59.574468   
2  430.741758  145.403572  129.478573  171.250000  166.381868  186.648352   
3  402.565000  111.274168  101.098256  167.200000  153.812500  165.832500   

         ISO+          SO        K%+      Soft%+  
0   88.771715  103.010638  82.691111  100.873666  
1  111.819973   48.659574  92.867043   95.261669  
2  141.396042  128.060440  93.531665   86.066942  
3  110.258241  131.455000  99.133567   96.413571  
In [30]:
# Visualize the means of top variance features using a bar chart
high_variance_features = top_variance_metrics.index  # Features with high variance

# Calculate means for these features by cluster
cluster_means_top_variance = cluster_means[high_variance_features]

# Plot the top variance features by cluster
cluster_means_top_variance.T.plot(kind='bar', figsize=(12, 8))
plt.title("Top-Variance Performance Metrics by Predicted Cluster")
plt.ylabel("Mean Value")
plt.xlabel("Performance Metrics")
plt.xticks(rotation=45)
plt.legend(title="Cluster")
plt.grid(True)
plt.tight_layout()
plt.show()
  • Cluster 2 tends to perform best in power-hitting metrics (HR/FB%, ISO+, HardHit), suggesting it's likely a power hitter group.
  • Cluster 1 generally performs the lowest across most metrics, indicating this group could be more contact-focused with fewer strikeouts and walks.
  • Cluster 0 and Cluster 3 fall in between, with Cluster 0 hitting a large number of ground balls and Cluster 3 showing a good balance of power metrics but with higher strikeouts.

Separating cluster 0 and cluster 3

In [31]:
# Impute NaNs with the median value of each column
numeric_columns = numeric_columns.fillna(numeric_columns.median())

# Check again for any remaining NaNs
print(numeric_columns.isnull().sum().sum())
0
In [34]:
# Compare key metrics for Cluster 0 and Cluster 3
key_metrics = ['HR', 'ISO', 'SB', 'K%', 'BB%']

# cluster labels
cluster_labels_series = pd.Series(cluster_labels)

# Filter the data for Cluster 0 and Cluster 3
cluster_0_3_data = numeric_columns[cluster_labels_series.isin([0, 3])]

# Compare mean values of key metrics for both clusters
cluster_0_3_means = cluster_0_3_data.groupby(cluster_labels_series).mean()[key_metrics]

# Display the results
print(cluster_0_3_means)
          HR       ISO         SB        K%       BB%
0  15.148936  0.147338  10.804255  0.174129  0.074476
3  21.242500  0.179519   9.027500  0.223099  0.085378
  • Cluster 3 appears to be composed of power hitters with higher home runs, ISO, and strikeouts, but lower speed and slightly better plate discipline.
  • Cluster 0, on the other hand, seems to feature more balanced players with a focus on speed (higher stolen bases), better contact (lower strikeout rate), and lower power.

Cluster names¶

  1. Cluster 0: Speedy Contact Hitters
  • Characteristics:

    • Moderate power (lower HR and ISO compared to other clusters).
    • High stolen bases (SB), suggesting strong baserunning ability and speed.
    • Lower strikeout rate (K%) and decent contact skills, meaning players in this group likely prioritize putting the ball in play over swinging for the fences.

    Suggested Name: "Speedy Contact Hitters"

  2. Cluster 1: Defensive Specialists
  • Characteristics:
    • Cluster 1 was previously isolated in t-SNE and had the lowest overall performance across key metrics.
    • Likely composed of contact-oriented players who don’t excel in power or stolen bases but may have solid defensive abilities.
    • Likely lower offensive contributions but potentially high-value players for their defensive skills.

Suggested Name: "Defensive Specialists"

  3. Cluster 2: Power-Speed Hybrids
  • Characteristics:
    • Balanced power and speed, excelling in home runs (HR) and ISO while also performing well in stolen bases (SB).
    • Higher strikeout rate (K%), suggesting aggressive hitters who combine both power and speed.
    • Versatile players who can contribute in multiple areas offensively.

Suggested Name: "Power-Speed Hybrids" (players who bring both power and speed to the game).

  4. Cluster 3: Power Hitters
  • Characteristics:
    • Highest in home runs (HR) and ISO, making them the power hitters of the group.
    • Lower in stolen bases (SB) and higher strikeout rates, suggesting a focus on hitting for power over contact or speed.
    • These players likely drive in runs and hit for extra bases but at the cost of strikeouts.

Suggested Name: "Power Hitters"

My Favorite Players

In [35]:
# Assigning descriptive names to the clusters
cluster_names = {
    0: "Speedy Contact Hitters",
    1: "Defensive Specialists",
    2: "Power-Speed Hybrids",
    3: "Power Hitters"
}


# Function to look up a player and return their cluster information
def get_player_cluster(player_name):
    # Search for the player in the DataFrame
    player_info = final_result[final_result['Name'].str.contains(player_name, case=False, na=False)]

    # If player is found
    if not player_info.empty:
        cluster_label = player_info.iloc[0]['Cluster']  # Get the cluster label
        cluster_name = cluster_names[cluster_label]  # Get the cluster name from the dictionary
        return f"{player_name} belongs to '{cluster_name}' (Cluster {cluster_label})"
    else:
        return f"Player {player_name} not found in the dataset."

# Example usage:
print(get_player_cluster("Randy Arozarena"))
Randy Arozarena belongs to 'Power Hitters' (Cluster 3)
In [36]:
print(get_player_cluster("Francisco Lindor"))
Francisco Lindor belongs to 'Speedy Contact Hitters' (Cluster 0)

This is doing exactly what I wanted it to do!

In [41]:
!cp "/content/drive/MyDrive/Colab Notebooks/silverstein_svm.ipynb" ./
!jupyter nbconvert --to html "silverstein_svm.ipynb"
[NbConvertApp] Converting notebook silverstein_svm.ipynb to html
[NbConvertApp] Writing 1074074 bytes to silverstein_svm.html