Introduction¶

The goal of this project is to address a critical question in baseball analytics: Which types of players are underpaid, overpaid, or fairly paid based on their on-field performance and contract details? By combining advanced machine learning techniques with detailed performance and contract data, the project aims to provide actionable insights into salary fairness across different player categories.

Which machine learning methods did you implement?
The project implemented Ridge Regression, K-Means Clustering, Regression Models, Support Vector Machines (SVM), Ensemble Models (Random Forest and Gradient Boosting), and Neural Networks. Each method contributed uniquely to the analysis, providing a comprehensive and multi-faceted perspective on salary fairness.

Discuss the key contribution of each method to your analysis. If a method didn't contribute, discuss why it didn't. A sentence or two for each method is plenty.

  • Ridge Regression: This method was used for feature selection, helping to reduce multicollinearity and refine the dataset for subsequent analysis. It ensured the model considered the most impactful features while avoiding overfitting.
  • K-Means Clustering: It grouped players into distinct categories based on performance metrics, providing a foundation to evaluate salary fairness within specific player types (e.g., Power Hitters, Balanced Hitters, Utility Players).
  • Regression Models: These predicted fair salaries for players and categorized them as underpaid, overpaid, or fairly paid. This provided an objective baseline for salary fairness.
  • Support Vector Machines (SVM): SVMs analyzed the relationship between player categories (clusters) and salary fairness. They helped identify which clusters were more likely to contain underpaid or overpaid players.
  • Ensemble Models: Random Forest and Gradient Boosting models captured complex interactions between features and provided more robust predictions of salary fairness. These models highlighted subtle patterns missed by simpler methods.
  • Neural Networks: The neural network refined predictions further, capturing deeper nonlinear relationships between features and salary fairness. It improved classification accuracy and added robustness to the analysis.

Did all methods support your conclusions, or did some provide conflicting results? If some conflicted, how did you reconcile the differences?
While most methods aligned in identifying key trends, there were some discrepancies. For instance, SVM struggled with predicting overpaid players, as reflected in low precision and recall for that category. This was reconciled by prioritizing ensemble methods and neural networks for final predictions, as they provided better overall accuracy and insight into the patterns of salary fairness across player clusters.

Data Sources¶

  1. Statcast Data: Advanced player metrics, including launch angle, hard-hit percentage, and swing speed from Baseball Savant.
  2. Player Contracts Data: Salary and contract details, including annual average values (AAV) and total contract values from Spotrac.
  3. Career Batting Stats: Comprehensive batting data sourced from Fangraphs.

This workflow integrates clustering to understand player types, regression to define salary fairness benchmarks, and classification to validate the alignment between these insights. By combining these methods with advanced ensemble models and neural networks, the project ensures robust, interpretable insights into salary fairness for teams and players.

Packages¶

In [1]:
!pip install optuna
Requirement already satisfied: optuna in /usr/local/lib/python3.10/dist-packages (4.1.0)
Requirement already satisfied: alembic>=1.5.0 in /usr/local/lib/python3.10/dist-packages (from optuna) (1.14.0)
Requirement already satisfied: colorlog in /usr/local/lib/python3.10/dist-packages (from optuna) (6.9.0)
Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from optuna) (1.26.4)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from optuna) (24.2)
Requirement already satisfied: sqlalchemy>=1.4.2 in /usr/local/lib/python3.10/dist-packages (from optuna) (2.0.36)
Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from optuna) (4.66.6)
Requirement already satisfied: PyYAML in /usr/local/lib/python3.10/dist-packages (from optuna) (6.0.2)
Requirement already satisfied: Mako in /usr/local/lib/python3.10/dist-packages (from alembic>=1.5.0->optuna) (1.3.8)
Requirement already satisfied: typing-extensions>=4 in /usr/local/lib/python3.10/dist-packages (from alembic>=1.5.0->optuna) (4.12.2)
Requirement already satisfied: greenlet!=0.4.17 in /usr/local/lib/python3.10/dist-packages (from sqlalchemy>=1.4.2->optuna) (3.1.1)
Requirement already satisfied: MarkupSafe>=0.9.2 in /usr/local/lib/python3.10/dist-packages (from Mako->alembic>=1.5.0->optuna) (3.0.2)
In [63]:
# Core libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing and feature engineering
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, LabelEncoder
from sklearn.feature_selection import RFE

# Model training and evaluation
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, cross_val_score
from sklearn.linear_model import RidgeCV, LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.cluster import KMeans

# Metrics
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error, r2_score,
    mean_absolute_percentage_error, explained_variance_score,
    classification_report, confusion_matrix
)

# Dimensionality reduction
from sklearn.decomposition import PCA

# Deep learning
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.utils import to_categorical

# Optimization
import optuna

# Statistical analysis
from scipy.stats import f_oneway

# Warnings
import warnings
from sklearn.exceptions import UndefinedMetricWarning
In [62]:
warnings.filterwarnings('ignore')
warnings.filterwarnings("ignore", category=UndefinedMetricWarning)
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=Warning)

Step 1 Data Preprocessing¶

Merging data¶

In [3]:
# Load datasets
career_batting = pd.read_csv('/content/career batting.csv')
contracts = pd.read_csv('/content/contracts.csv')
hitting = pd.read_csv('/content/hitting.csv')

# Step 1: Standardize and Align the 'Player' Column
# Fix Career Batting dataset
career_batting.rename(columns={'Name': 'Player'}, inplace=True)

# Fix Hitting dataset: Split 'last_name, first_name' into 'Player'
hitting['Player'] = hitting['last_name, first_name'].str.split(', ').str[::-1].str.join(' ')
hitting['Player'] = hitting['Player'].str.strip().str.title()

# Standardize 'Player' column in all datasets
contracts['Player'] = contracts['Player'].str.strip().str.title()
career_batting['Player'] = career_batting['Player'].str.strip().str.title()

# Step 2: Merge Datasets
data = pd.merge(hitting, contracts, on='Player', how='inner')
data = pd.merge(data, career_batting, on='Player', how='inner')

# Step 3: Display Merge Results
print(f"Merged dataset contains {data.shape[0]} rows and {data.shape[1]} columns.")
print(data.head())
Merged dataset contains 427 rows and 64 columns.
  last_name, first_name  player_id  year  home_run   woba  xwoba  \
0         DeLuca, Jonny     676356  2024         6  0.269  0.272   
1       Marte, Starling     516782  2024         7  0.313  0.337   
2           Brown, Seth     664913  2024        14  0.289  0.294   
3       Andujar, Miguel     609280  2024         4  0.306  0.286   
4        Correa, Carlos     621043  2024        14  0.385  0.358   

   avg_swing_speed  blasts_contact  blasts_swing  squared_up_contact  ...  \
0             70.8             8.0           6.2                26.1  ...   
1             72.4            17.2          12.4                33.6  ...   
2             73.3            14.5          10.5                30.9  ...   
3             71.6            11.6           9.9                31.0  ...   
4             74.5            19.9          16.2                32.2  ...   

        AVG       OBP       SLG      wOBA  xwOBA        wRC+       BsR  \
0  0.216867  0.277778  0.331325  0.268877  0.277   76.974637  0.649684   
1  0.268657  0.326975  0.388060  0.313053  0.338  103.886335  2.114106   
2  0.231183  0.282500  0.379032  0.289137  0.296   91.145522 -0.659635   
3  0.284768  0.319749  0.377483  0.305782  0.287  102.854265 -2.035956   
4  0.310345  0.387978  0.517241  0.385305  0.359  155.497900 -0.577044   

         Off        Def       WAR  
0  -8.887511   5.785232  0.925904  
1   3.833610 -10.129863  0.622260  
2  -4.712187 -10.903075 -0.235582  
3  -0.994141  -5.335523  0.444545  
4  22.727909   6.440044  4.275914  

[5 rows x 64 columns]
In [4]:
# Rename columns for clarity and usability
data.rename(columns={
    'Team\n                        Currently With': 'Current Team',
    'Age\n                        At Signing': 'Age at Signing'
}, inplace=True)

# Display the updated column names to confirm changes
print(data.columns)
Index(['last_name, first_name', 'player_id', 'year', 'home_run', 'woba',
       'xwoba', 'avg_swing_speed', 'blasts_contact', 'blasts_swing',
       'squared_up_contact', 'squared_up_swing', 'avg_swing_length', 'swords',
       'exit_velocity_avg', 'launch_angle_avg', 'sweet_spot_percent', 'barrel',
       'barrel_batted_rate', 'solidcontact_percent', 'flareburner_percent',
       'poorlyunder_percent', 'poorlytopped_percent', 'poorlyweak_percent',
       'hard_hit_percent', 'avg_best_speed', 'avg_hyper_speed',
       'z_swing_percent', 'z_swing_miss_percent', 'oz_swing_percent',
       'oz_swing_miss_percent', 'out_zone_swing', 'meatball_swing_percent',
       'meatball_percent', 'iz_contact_percent', 'in_zone_swing_miss',
       'whiff_percent', 'swing_percent', 'Player', 'Rank', 'Pos',
       'Current Team', 'Age at Signing', 'Value', 'AAV', 'G', 'PA', 'HR', 'R',
       'RBI', 'SB', 'BB%', 'K%', 'ISO', 'BABIP', 'AVG', 'OBP', 'SLG', 'wOBA',
       'xwOBA', 'wRC+', 'BsR', 'Off', 'Def', 'WAR'],
      dtype='object')

To keep the analysis grounded in past performance, I am eliminating any expected ("x") stats such as xwOBA.

In [5]:
# Remove columns starting with 'x' (e.g., xWOBA, xwOBA, etc.)
columns_to_drop = [col for col in data.columns if col.lower().startswith('x')]
data = data.drop(columns=columns_to_drop, errors='ignore')

# Display updated dataset columns to confirm
print(f"Remaining columns in the dataset: {data.columns.tolist()}")
Remaining columns in the dataset: ['last_name, first_name', 'player_id', 'year', 'home_run', 'woba', 'avg_swing_speed', 'blasts_contact', 'blasts_swing', 'squared_up_contact', 'squared_up_swing', 'avg_swing_length', 'swords', 'exit_velocity_avg', 'launch_angle_avg', 'sweet_spot_percent', 'barrel', 'barrel_batted_rate', 'solidcontact_percent', 'flareburner_percent', 'poorlyunder_percent', 'poorlytopped_percent', 'poorlyweak_percent', 'hard_hit_percent', 'avg_best_speed', 'avg_hyper_speed', 'z_swing_percent', 'z_swing_miss_percent', 'oz_swing_percent', 'oz_swing_miss_percent', 'out_zone_swing', 'meatball_swing_percent', 'meatball_percent', 'iz_contact_percent', 'in_zone_swing_miss', 'whiff_percent', 'swing_percent', 'Player', 'Rank', 'Pos', 'Current Team', 'Age at Signing', 'Value', 'AAV', 'G', 'PA', 'HR', 'R', 'RBI', 'SB', 'BB%', 'K%', 'ISO', 'BABIP', 'AVG', 'OBP', 'SLG', 'wOBA', 'wRC+', 'BsR', 'Off', 'Def', 'WAR']

EDA¶

Average Annual Value Distribution (Target)¶

In [6]:
# Convert AAV to numeric by removing dollar signs and commas
data['AAV'] = data['AAV'].replace({'\$': '', ',': ''}, regex=True).astype(float)

# Plot the distribution of AAV
plt.figure(figsize=(10, 6))
plt.hist(data['AAV'], bins=30, edgecolor='k', alpha=0.7)
plt.title('Distribution of Average Annual Value (AAV)', fontsize=14)
plt.xlabel('AAV ($)', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
[Figure: histogram of the AAV distribution]

This histogram shows the distribution of Average Annual Value (AAV) for player contracts, revealing a highly skewed distribution where most players earn significantly less, with a few earning substantially higher salaries. This imbalance will impact the regression and clustering models, as special care will be needed to handle the skewness (e.g., log transformation or robust scaling) to ensure fairness and accuracy when identifying overpaid and underpaid players.
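As a concrete illustration of the log-transform option, the target can be fit in log space and the predictions inverted back to dollars. This is a minimal sketch, not part of the original pipeline; it assumes a feature matrix X and raw AAV target y like those built in the modeling cells below.

In [ ]:
# Sketch: fit Ridge on log(1 + AAV) to reduce skew, then invert predictions.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

def fit_on_log_salary(X, y):
    """Train in log space and return dollar-scale test predictions."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
    model = RidgeCV(alphas=np.logspace(-4, 4, 100), cv=5)
    model.fit(X_tr, np.log1p(y_tr))              # log1p compresses the long right tail
    return np.expm1(model.predict(X_te)), y_te   # expm1 maps predictions back to dollars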

Relationships between performance metrics and salary¶

In [7]:
plt.figure(figsize=(10, 6))
sns.scatterplot(x=data['woba'], y=data['AAV'])
plt.title('Scatter Plot of AAV vs wOBA', fontsize=14)
plt.xlabel('wOBA', fontsize=12)
plt.ylabel('AAV ($)', fontsize=12)
plt.grid(alpha=0.3)
plt.show()
[Figure: scatter plot of AAV vs. wOBA]

The scatter plot shows that higher woba values generally correspond to higher salaries, but there is significant variance, particularly at mid-range woba values, suggesting inconsistencies in how performance translates to pay. This variability indicates the need for models that capture nonlinear relationships and account for potential outliers when predicting salary fairness.

Correlation Heatmap¶

In [8]:
# Filter numeric columns for correlation calculation, excluding irrelevant features
numeric_data = data.select_dtypes(include=['float64', 'int64']).drop(columns=['player_id'], errors='ignore')

# Identify the top 10 features most correlated with AAV
correlation_with_aav = numeric_data.corr()['AAV'].abs().sort_values(ascending=False)
top_10_features = correlation_with_aav.index[1:11]  # Exclude 'AAV' itself

# Generate a heatmap for the top 10 features
plt.figure(figsize=(12, 8))
sns.heatmap(numeric_data[top_10_features].corr(), annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlation Matrix of Top 10 Features Most Correlated with AAV (Excluding Less Relevant Columns)', fontsize=14)
plt.show()
[Figure: correlation heatmap of the top 10 features most correlated with AAV]

The heatmap indicates that features such as barrel, home_run, HR, and wOBA exhibit strong positive correlations with each other and with AAV, making them significant predictors of player salary. However, the high collinearity among these metrics underscores the need for feature selection techniques (e.g., Ridge regression) to prioritize the most impactful predictors while minimizing redundancy and improving model interpretability.
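One way to quantify that collinearity directly is the variance inflation factor (VIF). The sketch below uses statsmodels, which is not otherwise used in this notebook, so treat it as a supplementary diagnostic; numeric_data and top_10_features are assumed from the cell above.

In [ ]:
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(df):
    """VIF per column; values well above ~10 indicate strong collinearity."""
    X = df.dropna().astype(float)
    vifs = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return pd.Series(vifs, index=X.columns).sort_values(ascending=False)

# Example (identical columns like home_run/HR will show infinite VIF):
# print(vif_table(numeric_data[top_10_features]))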

Pairwise Scatterplot¶

In [9]:
# Select the top features from the correlation heatmap for pairplot
top_features_correlation = ['barrel', 'home_run', 'HR', 'Off', 'RBI', 'R', 'WAR', 'PA', 'out_zone_swing', 'woba']

# Generate pairplot for the selected features
sns.pairplot(data[top_features_correlation], diag_kind='kde', plot_kws={'alpha': 0.7})
plt.suptitle('Pairwise Scatter Plots of Features from Correlation Heatmap', y=1.02, fontsize=14)
plt.show()
[Figure: pairwise scatter plots of the selected features; output hidden in this export]

The pairwise scatter plots show strong linear and nonlinear relationships between many of the selected features (e.g., barrel, home_run, woba, and WAR). This indicates that these features are interrelated, which could introduce multicollinearity in regression models and clustering analysis. For clustering, the clear patterns suggest well-defined player groupings are possible; however, regularization techniques like Ridge regression can help address multicollinearity by prioritizing the most relevant features, ensuring both interpretability and robust model performance.

Modeling¶

Penalized Regression with Ridge¶

In [10]:
# Step 1: Drop irrelevant columns that won't affect clustering
columns_to_drop = ['Current Team', 'Value', 'Pos', 'player_id', 'Player']
numeric_data = data.drop(columns=columns_to_drop, errors='ignore')

# Step 2: Retain only numeric columns
numeric_data = numeric_data.select_dtypes(include=['float64', 'int64'])

# Step 3: Define X (features) and y (target - AAV for salary prediction)
X = numeric_data.drop(columns=['AAV'], errors='ignore')  # Drop 'AAV' from features
y = data['AAV']  # Target variable is the player's salary (AAV)

# Step 4: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 5: Apply Ridge Regression with cross-validation to select the optimal alpha
ridge = RidgeCV(alphas=np.logspace(-4, 4, 100), cv=5).fit(X_train, y_train)

# Step 6: Output model performance
print(f"Optimal Alpha: {ridge.alpha_}")
print(f"Training Score: {ridge.score(X_train, y_train)}")
print(f"Testing Score: {ridge.score(X_test, y_test)}")
Optimal Alpha: 3944.206059437664
Training Score: 0.488496115355994
Testing Score: 0.47744649196711564

The Ridge regression model achieved an optimal alpha of ~3944, with a training score of 0.49 and a testing score of 0.48, indicating moderate predictive power and a minimal gap between the two scores. While the scores suggest that the model captures some variance in player salary (AAV), other external factors not included in the dataset (e.g., market trends, team-specific strategies) likely play a role in salary determination. However, since the primary objective of the project is to provide interpretable clustering and salary fairness insights rather than perfect salary prediction, these scores are sufficient to proceed with meaningful analysis.
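Since the introduction credits Ridge with guiding feature selection, it helps to see which features the penalized fit weights most heavily. This short sketch uses the fitted ridge object from the cell above; note the coefficients are on raw feature scales here, so standardizing X first would make the magnitudes directly comparable.

In [ ]:
import pandas as pd

# Rank features by absolute Ridge coefficient (largest influence on predicted AAV).
ridge_coefs = pd.Series(ridge.coef_, index=X.columns)
print(ridge_coefs.abs().sort_values(ascending=False).head(10))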

Clustering with K-Means¶

Step 1: Standardize the Features

In [11]:
# Step 1: Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("Features have been standardized.")
Features have been standardized.

Step 2: Determine the Optimal Number of Clusters Using the Elbow Method

In [12]:
# Step 2: Determine the optimal number of clusters using the Elbow Method
wcss = []  # Within-cluster sum of squares
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, random_state=42)
    kmeans.fit(X_scaled)
    wcss.append(kmeans.inertia_)

# Plot the Elbow Method
plt.figure(figsize=(8, 5))
plt.plot(range(1, 11), wcss, marker='o')
plt.title('Elbow Method for Optimal Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.show()
[Figure: elbow plot of WCSS vs. number of clusters]

The Elbow Method plot shows a noticeable "elbow" at 3 clusters, where the within-cluster sum of squares (WCSS) begins to decrease at a slower rate. This suggests that using 3 clusters would balance simplicity and the ability to capture meaningful groupings in the data.
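As a cross-check on the elbow (a suggestion, not something run in the original notebook), the silhouette score measures cluster separation directly and should also favor a small k if 3 is a good choice; X_scaled is assumed from the standardization step above.

In [ ]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Higher silhouette = tighter, better-separated clusters.
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(X_scaled)
    print(f"k={k}: silhouette = {silhouette_score(X_scaled, labels):.3f}")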

Step 3: Apply K-Means Clustering with 3 Clusters

In [13]:
optimal_clusters = 3
kmeans = KMeans(n_clusters=optimal_clusters, random_state=42)
cluster_labels = kmeans.fit_predict(X_scaled)

Visualize Clusters

In [14]:
# Calculate cluster centers
cluster_centers = pd.DataFrame(kmeans.cluster_centers_, columns=X.columns)

# Calculate the range of each feature across clusters
feature_ranges = cluster_centers.max() - cluster_centers.min()

# Rank features by their range across clusters
cluster_feature_importance = feature_ranges.sort_values(ascending=False)
print("Feature Importance for Clusters (based on range):")
print(cluster_feature_importance)
Feature Importance for Clusters (based on range):
RBI                       1.782278
home_run                  1.763592
HR                        1.763592
barrel                    1.751891
R                         1.705439
PA                        1.700571
out_zone_swing            1.648301
in_zone_swing_miss        1.644526
G                         1.632397
SLG                       1.569091
wOBA                      1.535930
woba                      1.534729
wRC+                      1.532636
K%                        1.501313
barrel_batted_rate        1.469382
avg_swing_speed           1.452471
ISO                       1.451022
WAR                       1.444572
avg_hyper_speed           1.414773
avg_best_speed            1.404051
whiff_percent             1.369697
squared_up_swing          1.358466
swords                    1.336693
oz_swing_miss_percent     1.326162
iz_contact_percent        1.276464
z_swing_miss_percent      1.263196
blasts_contact            1.254969
OBP                       1.232009
hard_hit_percent          1.230051
Off                       1.209755
AVG                       1.197710
exit_velocity_avg         1.193627
blasts_swing              1.184209
squared_up_contact        0.979993
avg_swing_length          0.851049
poorlyweak_percent        0.772629
SB                        0.720279
solidcontact_percent      0.666515
BABIP                     0.592202
meatball_percent          0.555028
flareburner_percent       0.545582
Def                       0.517075
poorlytopped_percent      0.486297
BB%                       0.477383
z_swing_percent           0.451283
sweet_spot_percent        0.428835
meatball_swing_percent    0.425121
Age at Signing            0.404148
launch_angle_avg          0.356471
swing_percent             0.228362
poorlyunder_percent       0.155923
BsR                       0.126523
oz_swing_percent          0.035725
year                      0.000000
dtype: float64

The feature importance rankings based on the range across cluster centers indicate that RBI, home_run, and HR are the most defining features for distinguishing the clusters, suggesting that offensive performance metrics play a significant role in cluster differentiation. Similarly, features like barrel, PA, and out_zone_swing also contribute strongly, emphasizing both power-hitting tendencies and plate discipline as key traits in cluster separation. On the other hand, features with lower importance, such as BsR (base running), oz_swing_percent, and year, have minimal impact, indicating that these characteristics are less relevant in defining the player groupings.

In [15]:
# Select top 10 features by importance
top_features = cluster_feature_importance.head(10).index.tolist()

# Examine cluster centers for top features
top_cluster_centers = cluster_centers[top_features]
print("Cluster Centers for Top Features:")
print(top_cluster_centers)

# Visualize cluster centers to identify patterns
# (pandas' .plot creates its own figure, so a separate plt.figure call would
#  only leave an empty figure behind)
top_cluster_centers.T.plot(kind='bar', figsize=(14, 8), colormap='viridis')
plt.title("Top Feature Averages Across Clusters", fontsize=16)
plt.xlabel("Features", fontsize=14)
plt.ylabel("Cluster Center Values", fontsize=14)
plt.legend(title="Cluster", bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
Cluster Centers for Top Features:
        RBI  home_run        HR    barrel         R        PA  out_zone_swing  \
0  0.850462  0.955295  0.955295  0.977622  0.764058  0.718688        0.725266   
1  0.098277 -0.177575 -0.177575 -0.245649  0.214206  0.317940        0.238906   
2 -0.931817 -0.808296 -0.808296 -0.774270 -0.941381 -0.981883       -0.923036   

   in_zone_swing_miss         G       SLG  
0            0.905643  0.654965  0.782040  
1           -0.201446  0.389540  0.006054  
2           -0.738883 -0.977432 -0.787052  
[Figure: bar chart of top feature averages across the three clusters]

The cluster centers reveal distinct differences across the top features, which help characterize the player types within each cluster. Cluster 0 shows significantly high values in power-related metrics like RBI, home runs, HR, and SLG, indicating it likely represents "Power Hitters." Cluster 1 has moderate values across most features, suggesting it includes "Balanced Hitters" with a focus on consistency rather than extremes. Cluster 2 exhibits the lowest values across all key metrics, particularly RBI and SLG, which aligns with the profile of "Utility Players" or players with lower offensive contributions.

To label the clusters:

  • Cluster 0 can be labeled as "Power Hitters" based on their dominance in power metrics like RBI and SLG.
  • Cluster 1 can be labeled as "Balanced Hitters" given their moderate performance across metrics.
  • Cluster 2 can be labeled as "Utility Players," as their overall contribution appears lower across the measured metrics.
In [16]:
# Add the numeric cluster labels from KMeans to the dataset
data['Cluster'] = cluster_labels  # This adds the numeric cluster labels (0, 1, 2) to the 'Cluster' column in the DataFrame

# Define a dictionary for cluster labels
cluster_labels_map = {
    0: "Power Hitters",      # Cluster 0 corresponds to Power Hitters
    1: "Balanced Hitters",   # Cluster 1 corresponds to Balanced Hitters
    2: "Utility Players"     # Cluster 2 corresponds to Utility Players
}

# Map the descriptive cluster labels to the numeric clusters
data['Cluster_Label'] = data['Cluster'].map(cluster_labels_map)

# Check if the mapping is successful by displaying a sample of the dataset
print(data[['Cluster', 'Cluster_Label']].drop_duplicates())
   Cluster     Cluster_Label
0        2   Utility Players
1        0     Power Hitters
3        1  Balanced Hitters

Regression Model to Define Salary Fairness¶

Based on Ridge Results

In [17]:
# Step 1: Predict salaries using the trained Ridge model
y_test_pred = ridge.predict(X_test)

# Step 2: Define a threshold for fairness (10% of average salary as an example)
threshold = 0.1 * y.mean()

# Step 3: Define salary fairness categories
def classify_salary(actual, predicted, threshold):
    if actual < predicted - threshold:
        return "Underpaid"
    elif actual > predicted + threshold:
        return "Overpaid"
    else:
        return "Fairly Paid"

# Step 4: Apply the classification logic to the test set
fairness_labels = [
    classify_salary(actual, predicted, threshold)
    for actual, predicted in zip(y_test, y_test_pred)
]

# Step 5: Combine the results into a new DataFrame
fairness_results = X_test.copy()
fairness_results['Actual Salary'] = y_test
fairness_results['Predicted Salary'] = y_test_pred
fairness_results['Fairness'] = fairness_labels

# Display a summary of the fairness classification
print(fairness_results[['Actual Salary', 'Predicted Salary', 'Fairness']].head(10))
     Actual Salary  Predicted Salary     Fairness
419      4300000.0      4.671809e+06  Fairly Paid
75        800000.0      1.828226e+06    Underpaid
177       800000.0      4.191798e+06    Underpaid
30        800000.0      2.206821e+06    Underpaid
358     12166667.0      2.783316e+06     Overpaid
271       800000.0      1.493708e+06    Underpaid
155      6750000.0      5.686262e+06     Overpaid
152       800000.0      1.996744e+06    Underpaid
165     16250000.0      2.218170e+07    Underpaid
175     51000000.0      2.838382e+07     Overpaid

The salary fairness analysis evaluates whether a player's actual salary aligns with their predicted salary based on performance metrics. Using Ridge regression, salaries are predicted from player performance data, and fairness is determined by comparing actual and predicted salaries within a defined threshold. Players are categorized as "Underpaid" if their actual salary falls significantly below their performance-based prediction, "Overpaid" if their salary exceeds the prediction by a large margin, or "Fairly Paid" if the two are closely aligned, ensuring a performance-driven evaluation of compensation.

In [18]:
# Plot the fairness distribution
plt.figure(figsize=(8, 5))
sns.countplot(data=fairness_results, x='Fairness', order=["Underpaid", "Fairly Paid", "Overpaid"])
plt.title("Salary Fairness Distribution", fontsize=16)
plt.xlabel("Fairness Category", fontsize=14)
plt.ylabel("Count", fontsize=14)
plt.tight_layout()
plt.show()
[Figure: count of players per salary fairness category]

The distribution of salary fairness categories shows that the majority of players are classified as either "Underpaid" or "Overpaid," with relatively fewer players falling into the "Fairly Paid" category. This suggests a potential mismatch between player performance and compensation for many players, emphasizing the need for performance-driven salary evaluations.

Model Evaluation

In [19]:
# Evaluate Ridge regression performance
mse = mean_squared_error(y_test, y_test_pred)
mae = mean_absolute_error(y_test, y_test_pred)
r2 = r2_score(y_test, y_test_pred)

print("Model Performance Metrics:")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"R-squared (R2): {r2:.2f}")

# Evaluate classification distribution
fairness_distribution = fairness_results['Fairness'].value_counts(normalize=True) * 100
print("\nFairness Classification Distribution:")
print(fairness_distribution)

# Visualize residuals
residuals = y_test - y_test_pred
plt.figure(figsize=(8, 5))
plt.scatter(y_test, residuals, alpha=0.7)
plt.axhline(y=0, color='r', linestyle='--')
plt.title("Residuals of Actual vs. Predicted Salary")
plt.xlabel("Actual Salary")
plt.ylabel("Residuals")
plt.show()
Model Performance Metrics:
Mean Squared Error (MSE): 52552780126582.57
Mean Absolute Error (MAE): 4420454.71
R-squared (R2): 0.48

Fairness Classification Distribution:
Fairness
Overpaid       45.348837
Underpaid      40.697674
Fairly Paid    13.953488
Name: proportion, dtype: float64
[Figure: residuals of actual vs. predicted salary]

Let's try some feature selection.

In [20]:
# Use RFE with Linear Regression
selector = RFE(estimator=LinearRegression(), n_features_to_select=10)  # Adjust the number of features
selector = selector.fit(X_train, y_train)

# Get selected features
selected_features = X.columns[selector.support_]
print("Selected Features by RFE:")
print(selected_features)

# Train the model using only selected features
X_train_selected = X_train[selected_features]
X_test_selected = X_test[selected_features]

linear_model_rfe = LinearRegression()
linear_model_rfe.fit(X_train_selected, y_train)
y_test_pred_rfe = linear_model_rfe.predict(X_test_selected)

# Evaluate the RFE model
mse_rfe = mean_squared_error(y_test, y_test_pred_rfe)
mae_rfe = mean_absolute_error(y_test, y_test_pred_rfe)
r2_rfe = r2_score(y_test, y_test_pred_rfe)

print("RFE Model Performance:")
print(f"Mean Squared Error (MSE): {mse_rfe:.2f}")
print(f"Mean Absolute Error (MAE): {mae_rfe:.2f}")
print(f"R-squared (R2): {r2_rfe:.2f}")
Selected Features by RFE:
Index(['woba', 'BB%', 'K%', 'ISO', 'BABIP', 'AVG', 'OBP', 'SLG', 'wOBA',
       'WAR'],
      dtype='object')
RFE Model Performance:
Mean Squared Error (MSE): 60572139713454.79
Mean Absolute Error (MAE): 5034050.13
R-squared (R2): 0.40

The results of the Recursive Feature Elimination (RFE) model reveal the following:

The selected features (woba, BB%, K%, ISO, BABIP, AVG, OBP, SLG, wOBA, WAR) reflect a well-rounded set of performance metrics, including batting efficiency (woba, OBP), power (ISO, SLG), and overall player contribution (WAR). These features provide a balanced foundation for predicting salaries based on key aspects of player performance.

Model performance metrics indicate moderate predictive accuracy. The Mean Squared Error (MSE) of $6.06 \times 10^{13}$ and Mean Absolute Error (MAE) of 5,034,050.13 suggest that salary predictions deviate significantly from actual values on average. Additionally, the R-squared (R²) value of 0.40 highlights that the model explains only 40% of the variance in salary, leaving substantial room for improvement.

The moderate R² score shows that while the selected features provide some explanatory power, the model does not fully capture the underlying patterns in the data. The relatively high MSE and MAE suggest that while the model has improved with a refined feature set, it still has limitations in accurately predicting player salaries. To improve, additional feature engineering, alternative modeling techniques (e.g., ensemble methods like Random Forest or Gradient Boosting), and careful reassessment of the dataset for outliers or salary discrepancies should be considered.
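One inexpensive robustness check is to cross-validate the RFE feature set rather than rely on a single train/test split; cross_val_score is already imported above but unused. A sketch, assuming X, y, and selected_features as defined in the preceding cells:

In [ ]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# 5-fold R^2 for the RFE-selected features; the spread shows split sensitivity.
scores = cross_val_score(LinearRegression(), X[selected_features], y, cv=5, scoring='r2')
print(f"CV R^2: {scores.mean():.2f} +/- {scores.std():.2f}")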

Polynomial Features

In [21]:
# Generate polynomial features
poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# Train a linear model on polynomial features
poly_model = LinearRegression()
poly_model.fit(X_train_poly, y_train)
y_test_pred_poly = poly_model.predict(X_test_poly)

# Evaluate the polynomial model
mse_poly = mean_squared_error(y_test, y_test_pred_poly)
mae_poly = mean_absolute_error(y_test, y_test_pred_poly)
r2_poly = r2_score(y_test, y_test_pred_poly)

print("Polynomial Regression Model Performance:")
print(f"Mean Squared Error (MSE): {mse_poly:.2f}")
print(f"Mean Absolute Error (MAE): {mae_poly:.2f}")
print(f"R-squared (R2): {r2_poly:.2f}")
Polynomial Regression Model Performance:
Mean Squared Error (MSE): 366931699370140.38
Mean Absolute Error (MAE): 14483942.33
R-squared (R2): -2.65

The Polynomial Regression model demonstrates poor performance, as reflected in its evaluation metrics. The Mean Squared Error (MSE) of $3.67 \times 10^{14}$ and the Mean Absolute Error (MAE) of 14,483,942.33 indicate that the predictions are significantly off from actual values, with large errors on average.

The R-squared (R²) value of -2.65 is particularly concerning, as it implies that the model performs worse than a simple baseline model (e.g., predicting the mean salary). This suggests that the polynomial regression model is likely overfitting the training data or failing to capture meaningful relationships in the features.

The model's performance highlights the limitations of using polynomial regression in this context, especially when the dataset's relationships are not inherently nonlinear or when the feature set lacks sufficient predictive power. Simpler models like linear regression with regularization (e.g., Ridge) or ensemble models might perform better with these data characteristics.
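To make the suggested fix concrete, polynomial terms can be combined with standardization and a cross-validated Ridge penalty in a single pipeline, which usually tames this kind of overfitting. A sketch only, assuming the X_train/X_test split from the Ridge cell; it was not run in this notebook.

In [ ]:
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import RidgeCV

# Degree-2 terms -> standardize -> penalized fit with a CV-chosen alpha.
poly_ridge = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    RidgeCV(alphas=np.logspace(-2, 4, 50)),
)
poly_ridge.fit(X_train, y_train)
print(f"Test R^2: {poly_ridge.score(X_test, y_test):.2f}")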

Ensemble Methods for Determining Pay Equity¶

In [22]:
# Step 1: Apply Random Forest Regressor
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)

# Evaluate Random Forest Model
mse_rf = mean_squared_error(y_test, y_pred_rf)
mae_rf = mean_absolute_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

print("Random Forest Model Performance:")
print(f"Mean Squared Error (MSE): {mse_rf:.2f}")
print(f"Mean Absolute Error (MAE): {mae_rf:.2f}")
print(f"R-squared (R2): {r2_rf:.2f}")

# Step 2: Apply Gradient Boosting Regressor
gb_model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
gb_model.fit(X_train, y_train)
y_pred_gb = gb_model.predict(X_test)

# Evaluate Gradient Boosting Model
mse_gb = mean_squared_error(y_test, y_pred_gb)
mae_gb = mean_absolute_error(y_test, y_pred_gb)
r2_gb = r2_score(y_test, y_pred_gb)

print("\nGradient Boosting Model Performance:")
print(f"Mean Squared Error (MSE): {mse_gb:.2f}")
print(f"Mean Absolute Error (MAE): {mae_gb:.2f}")
print(f"R-squared (R2): {r2_gb:.2f}")
Random Forest Model Performance:
Mean Squared Error (MSE): 58179247382947.98
Mean Absolute Error (MAE): 4567600.40
R-squared (R2): 0.42

Gradient Boosting Model Performance:
Mean Squared Error (MSE): 58321976580156.20
Mean Absolute Error (MAE): 4581316.48
R-squared (R2): 0.42

This still isn't giving me a strong predictor. Let me do some feature engineering.

Check Correlation

In [23]:
# Ensure only numeric columns are used for correlation
numeric_data = data.select_dtypes(include=['float64', 'int64'])

# Check correlations with AAV
if 'AAV' in numeric_data.columns:
    correlation_with_aav = numeric_data.corr()['AAV'].sort_values(ascending=False)
    print("Correlation of features with AAV:")
    print(correlation_with_aav.head(20))  # Display top 20 correlations
else:
    print("Error: 'AAV' column not found in the numeric data.")
Correlation of features with AAV:
AAV                   1.000000
barrel                0.599061
home_run              0.591440
HR                    0.591440
Off                   0.560521
RBI                   0.554944
R                     0.544367
WAR                   0.542641
PA                    0.481392
out_zone_swing        0.433293
woba                  0.421734
wOBA                  0.421301
wRC+                  0.420590
SLG                   0.408186
in_zone_swing_miss    0.406619
G                     0.383526
OBP                   0.381380
ISO                   0.370662
blasts_swing          0.357841
avg_hyper_speed       0.330052
Name: AAV, dtype: float64

Re-run the models with interaction terms added and low-correlation features removed

In [24]:
# Step 2: Remove features with low correlation (e.g., less than 0.3)
low_corr_threshold = 0.3  # Adjust threshold as needed
low_corr_features = correlation_with_aav[correlation_with_aav.abs() < low_corr_threshold].index.tolist()
data = data.drop(columns=low_corr_features, errors='ignore')
print(f"Removed low correlation features: {low_corr_features}")

# Step 3: Add interaction terms for high-correlation features
data['home_run_RBI'] = data['home_run'] * data['RBI']  # Interaction of power metrics
data['barrel_WAR'] = data['barrel'] * data['WAR']      # Interaction of power and value metrics
data['PA_R'] = data['PA'] * data['R']                 # Interaction of plate appearances and runs

# Step 4: Prepare dataset again for modeling
numeric_data = data.select_dtypes(include=['float64', 'int64'])
X = numeric_data.drop(columns=['AAV'], errors='ignore')  # Features
y = data['AAV']  # Target variable

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 5: Train and evaluate Random Forest and Gradient Boosting again

# Random Forest
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
print("\nRandom Forest Updated Metrics:")
print(f"R-squared: {r2_score(y_test, y_pred_rf):.2f}")
print(f"MAE: {mean_absolute_error(y_test, y_pred_rf):.2f}")
print(f"MSE: {mean_squared_error(y_test, y_pred_rf):.2f}")

# Gradient Boosting
gb_model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
gb_model.fit(X_train, y_train)
y_pred_gb = gb_model.predict(X_test)
print("\nGradient Boosting Updated Metrics:")
print(f"R-squared: {r2_score(y_test, y_pred_gb):.2f}")
print(f"MAE: {mean_absolute_error(y_test, y_pred_gb):.2f}")
print(f"MSE: {mean_squared_error(y_test, y_pred_gb):.2f}")
Removed low correlation features: ['hard_hit_percent', 'swords', 'BB%', 'Age at Signing', 'avg_swing_speed', 'SB', 'avg_swing_length', 'solidcontact_percent', 'squared_up_swing', 'sweet_spot_percent', 'launch_angle_avg', 'iz_contact_percent', 'squared_up_contact', 'BABIP', 'z_swing_percent', 'poorlyunder_percent', 'meatball_swing_percent', 'BsR', 'flareburner_percent', 'meatball_percent', 'swing_percent', 'oz_swing_percent', 'z_swing_miss_percent', 'whiff_percent', 'Def', 'oz_swing_miss_percent', 'poorlytopped_percent', 'poorlyweak_percent', 'K%']

Random Forest Updated Metrics:
R-squared: 0.59
MAE: 3809383.21
MSE: 41409893747045.52

Gradient Boosting Updated Metrics:
R-squared: 0.58
MAE: 3828080.14
MSE: 41900967779212.24

Still low. Let's try some hyperparameter tuning.

In [25]:
# Random Forest Parameter Grid
rf_param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Gradient Boosting Parameter Grid
gb_param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Random Forest Randomized Search
rf_random_search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_distributions=rf_param_grid,
    n_iter=20,  # Number of random combinations to try
    scoring='neg_mean_squared_error',
    cv=3,
    n_jobs=-1,
    random_state=42,
    verbose=2
)
rf_random_search.fit(X_train, y_train)
print("Best Random Forest Parameters:", rf_random_search.best_params_)
print("Best Random Forest Score (MSE):", -rf_random_search.best_score_)

# Gradient Boosting Randomized Search
gb_random_search = RandomizedSearchCV(
    estimator=GradientBoostingRegressor(random_state=42),
    param_distributions=gb_param_grid,
    n_iter=20,  # Number of random combinations to try
    scoring='neg_mean_squared_error',
    cv=3,
    n_jobs=-1,
    random_state=42,
    verbose=2
)
gb_random_search.fit(X_train, y_train)
print("\nBest Gradient Boosting Parameters:", gb_random_search.best_params_)
print("Best Gradient Boosting Score (MSE):", -gb_random_search.best_score_)

# Evaluate Random Forest
y_pred_rf = rf_random_search.best_estimator_.predict(X_test)

mse_rf = mean_squared_error(y_test, y_pred_rf)
rmse_rf = np.sqrt(mse_rf)
mae_rf = mean_absolute_error(y_test, y_pred_rf)
mape_rf = mean_absolute_percentage_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)
explained_variance_rf = explained_variance_score(y_test, y_pred_rf)

print("\nRandom Forest Final Metrics:")
print(f"MSE: {mse_rf:.2f}")
print(f"RMSE: {rmse_rf:.2f}")
print(f"MAE: {mae_rf:.2f}")
print(f"MAPE: {mape_rf:.2%}")  # Show as percentage
print(f"R-squared: {r2_rf:.2f}")
print(f"Explained Variance Score: {explained_variance_rf:.2f}")

# Evaluate Gradient Boosting
y_pred_gb = gb_random_search.best_estimator_.predict(X_test)

mse_gb = mean_squared_error(y_test, y_pred_gb)
rmse_gb = np.sqrt(mse_gb)
mae_gb = mean_absolute_error(y_test, y_pred_gb)
mape_gb = mean_absolute_percentage_error(y_test, y_pred_gb)
r2_gb = r2_score(y_test, y_pred_gb)
explained_variance_gb = explained_variance_score(y_test, y_pred_gb)

print("\nGradient Boosting Final Metrics:")
print(f"MSE: {mse_gb:.2f}")
print(f"RMSE: {rmse_gb:.2f}")
print(f"MAE: {mae_gb:.2f}")
print(f"MAPE: {mape_gb:.2%}")  # Show as percentage
print(f"R-squared: {r2_gb:.2f}")
print(f"Explained Variance Score: {explained_variance_gb:.2f}")
Fitting 3 folds for each of 20 candidates, totalling 60 fits
Best Random Forest Parameters: {'n_estimators': 300, 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_depth': None}
Best Random Forest Score (MSE): 42198347442005.25
Fitting 3 folds for each of 20 candidates, totalling 60 fits

Best Gradient Boosting Parameters: {'n_estimators': 100, 'min_samples_split': 10, 'min_samples_leaf': 2, 'max_depth': 7, 'learning_rate': 0.1}
Best Gradient Boosting Score (MSE): 43175407439658.445

Random Forest Final Metrics:
MSE: 39907321734235.45
RMSE: 6317224.21
MAE: 3822361.80
MAPE: 143.91%
R-squared: 0.60
Explained Variance Score: 0.60

Gradient Boosting Final Metrics:
MSE: 40593618058757.37
RMSE: 6371312.11
MAE: 3833741.15
MAPE: 146.88%
R-squared: 0.60
Explained Variance Score: 0.60

The results for both the Random Forest and Gradient Boosting models indicate only modest predictive accuracy. With an $R^2$ score of 0.60, both models explain approximately 60% of the variance in AAV, leaving a substantial portion of the variability unexplained. Additionally, the high Mean Absolute Percentage Error (MAPE) values of 143.91% (Random Forest) and 146.88% (Gradient Boosting) reflect significant prediction errors, especially for lower-salary players, whose small salaries inflate percentage-based error.

Despite this limited predictive accuracy, these models still serve as a useful baseline for defining Overpaid, Fairly Paid, and Underpaid categories. The predicted AAV values establish a systematic framework to benchmark actual salaries. By using an objective threshold (here, an actual salary within a band of ±10% of the mean AAV around the predicted salary counts as Fairly Paid), the models enable a consistent methodology to classify player salary fairness, even if precise salary predictions are less reliable.

The fairness classifications derived from these predictions will be instrumental in downstream tasks. This ensures that subsequent analyses, such as determining which player groups are most overpaid or underpaid, remain interpretable and actionable.

Interpretation

The Proportion of Players by Salary Fairness chart illustrates that the majority of players are either classified as "Fairly Paid" or "Underpaid," with "Overpaid" players constituting the smallest group. This suggests that, based on the model, actual salaries are more aligned with predictions for a significant portion of the dataset.

The Distribution of Actual Salaries by Fairness Category box plot highlights that "Overpaid" players generally have much higher actual salaries compared to the "Fairly Paid" and "Underpaid" groups. The spread of salaries is wide in the "Overpaid" category, indicating variability in overpayment, while "Fairly Paid" and "Underpaid" categories show much tighter distributions.

The Distribution of Predicted Salaries by Fairness Category box plot shows that the model predicts significantly higher salaries for the "Overpaid" group compared to the "Underpaid" group, aligning with the fairness classification. The "Fairly Paid" group has predicted salaries closely concentrated around the middle range, which aligns with its definition. The patterns provide validation that the model's classifications of fairness are consistent with both actual and predicted salary trends.
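The three charts described above are not reproduced in this export. A sketch of how they could be regenerated, assuming the data frame carries the 'Fairness' and 'Predicted Salary' columns constructed in the next cell:

In [ ]:
import matplotlib.pyplot as plt
import seaborn as sns

# Proportion of players in each fairness category
data['Fairness'].value_counts(normalize=True).plot(kind='bar')
plt.title("Proportion of Players by Salary Fairness")
plt.ylabel("Proportion")
plt.show()

# Actual and predicted salary distributions per fairness category
for col, label in [('AAV', 'Actual'), ('Predicted Salary', 'Predicted')]:
    sns.boxplot(data=data, x='Fairness', y=col)
    plt.title(f"Distribution of {label} Salaries by Fairness Category")
    plt.show()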

SVM to See Which Types of Players Are Paid Fairly¶

I planned to use Optuna for hyperparameter tuning, since its Bayesian optimization explores the parameter space more intelligently than exhaustive methods like grid or random search. For this small search space, however, a simplified randomized search was fast enough, and that is what the executed cell below uses; an Optuna version is sketched first for reference.
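A minimal Optuna objective for this SVM could look like the following (the parameter ranges are illustrative assumptions, and X_train/y_train refer to the split built in the cell below):

In [ ]:
import optuna
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def objective(trial):
    # Sample SVM hyperparameters; a log scale for C mirrors the usual search space.
    C = trial.suggest_float('C', 1e-2, 1e2, log=True)
    kernel = trial.suggest_categorical('kernel', ['linear', 'rbf'])
    model = SVC(C=C, kernel=kernel, gamma='scale', random_state=42)
    return cross_val_score(model, X_train, y_train, cv=3, scoring='accuracy').mean()

# study = optuna.create_study(direction='maximize')
# study.optimize(objective, n_trials=25)
# print(study.best_params)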

In [32]:
# Step 1: Ensure Clusters and Fairness Labels Exist
# Add the clusters from the K-Means model
data['Cluster'] = kmeans.labels_  # Use the labels from the K-Means model
cluster_labels_map = {
    0: "Power Hitters",      # High in RBI, HR, SLG, and related metrics
    1: "Balanced Hitters",   # Moderate across all metrics
    2: "Utility Players"     # Lower contributions across all metrics
}
data['Cluster_Label'] = data['Cluster'].map(cluster_labels_map)

# Add the 'Fairness' column if it doesn't exist
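# NOTE: this predict call assumes rf_model was fit on exactly these features
# (top_features + ['Cluster']); a model trained on a different feature set
# would raise a feature-mismatch error here.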
if 'Predicted Salary' not in data.columns:
    data['Predicted Salary'] = rf_model.predict(data[top_features + ['Cluster']])

threshold = 0.1 * data['AAV'].mean()

def classify_salary(actual, predicted, threshold):
    if actual < predicted - threshold:
        return "Underpaid"
    elif actual > predicted + threshold:
        return "Overpaid"
    else:
        return "Fairly Paid"

data['Fairness'] = [
    classify_salary(actual, predicted, threshold)
    for actual, predicted in zip(data['AAV'], data['Predicted Salary'])
]

# Map Fairness to numeric labels for SVM
data['Fairness_Label'] = data['Fairness'].map({'Fairly Paid': 0, 'Underpaid': 1, 'Overpaid': 2})

# Step 2: Prepare the SVM Data
X = data[top_features + ['Cluster']]  # Include cluster information as a feature
y = data['Fairness_Label']

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Step 3: Simplified Random Search with Reduced Trials
reduced_svm_param_grid = {
    'C': [0.1, 1, 10],        # Limited options for regularization strength
    'kernel': ['linear', 'rbf'],  # Focus on common kernels
    'gamma': ['scale']         # Use default gamma for simplicity
}

svm_model = SVC(random_state=42)
svm_random_search = RandomizedSearchCV(
    estimator=svm_model,
    param_distributions=reduced_svm_param_grid,
    n_iter=5,  # Fewer trials for faster results
    scoring='accuracy',
    cv=2,  # Reduce cross-validation folds
    n_jobs=-1,
    random_state=42,
    verbose=1  # Reduce verbosity
)
svm_random_search.fit(X_train, y_train)

# Step 4: Evaluate the SVM Model
best_svm = svm_random_search.best_estimator_
y_pred = best_svm.predict(X_test)

print("Simplified Random Search Best Parameters:", svm_random_search.best_params_)
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Fairly Paid', 'Underpaid', 'Overpaid']))

# Step 5: Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues",
            xticklabels=['Fairly Paid', 'Underpaid', 'Overpaid'],
            yticklabels=['Fairly Paid', 'Underpaid', 'Overpaid'])
plt.title("Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()

# Step 6: Visualize Fairness by Cluster
fairness_cluster_summary = data.groupby(['Cluster', 'Fairness']).size().unstack(fill_value=0)
fairness_cluster_summary.plot(kind='bar', stacked=True, figsize=(12, 6), colormap='viridis')
plt.title("Fairness Categories by Clusters")
plt.xlabel("Player Cluster")
plt.ylabel("Number of Players")
plt.legend(title="Fairness Category")
plt.tight_layout()
plt.show()
Fitting 2 folds for each of 5 candidates, totalling 10 fits
Simplified Random Search Best Parameters: {'kernel': 'rbf', 'gamma': 'scale', 'C': 0.1}

Classification Report:
              precision    recall  f1-score   support

 Fairly Paid       0.65      0.78      0.70        40
   Underpaid       0.42      0.55      0.48        29
    Overpaid       0.00      0.00      0.00        17

    accuracy                           0.55        86
   macro avg       0.36      0.44      0.39        86
weighted avg       0.44      0.55      0.49        86

/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
[Figure: confusion matrix for the SVM classifier]
[Figure: fairness categories by cluster (stacked bars)]

The classification report indicates that the SVM model is reasonably effective at identifying "Fairly Paid" players, with a precision of 0.65 and recall of 0.78, but it fails entirely on "Overpaid" players, where both precision and recall are 0. This imbalance suggests the model is better suited to identifying fairly paid players than to flagging overpaid ones.

The confusion matrix visualizes this, showing a clear concentration of correct predictions in the "Fairly Paid" category but significant misclassifications between "Underpaid" and "Overpaid." The stacked bar chart adds context: fairness categories vary noticeably by player cluster, with "Power Hitters" spread across all three fairness types, while "Utility Players" are predominantly fairly paid (see the cluster table below).

Can It Be Improved Through Boosting and Bagging¶

In [35]:
# Step 1: Ensure Clusters and Fairness Labels Exist
data['Fairness_Label'] = data['Fairness'].map({'Fairly Paid': 0, 'Underpaid': 1, 'Overpaid': 2})
X = data[top_features + ['Cluster']]  # Use top features and cluster as predictors
y = data['Fairness_Label']

# Step 2: Split the Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Step 3: Random Forest with Random Search
rf_param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

rf_random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=rf_param_grid,
    n_iter=10,  # Fewer iterations for faster results
    scoring='accuracy',
    cv=3,
    n_jobs=-1,
    random_state=42,
    verbose=2
)
rf_random_search.fit(X_train, y_train)
rf_best = rf_random_search.best_estimator_

# Random Forest Evaluation
rf_y_pred = rf_best.predict(X_test)
print("\nRandom Forest Classification Report:")
print(classification_report(y_test, rf_y_pred, target_names=['Fairly Paid', 'Underpaid', 'Overpaid']))

# Confusion Matrix for Random Forest
conf_matrix_rf = confusion_matrix(y_test, rf_y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix_rf, annot=True, fmt="d", cmap="Blues",
            xticklabels=['Fairly Paid', 'Underpaid', 'Overpaid'],
            yticklabels=['Fairly Paid', 'Underpaid', 'Overpaid'])
plt.title("Confusion Matrix - Random Forest")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()

# Step 4: Gradient Boosting with Random Search
gb_param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

gb_random_search = RandomizedSearchCV(
    estimator=GradientBoostingClassifier(random_state=42),
    param_distributions=gb_param_grid,
    n_iter=10,  # Fewer iterations for faster results
    scoring='accuracy',
    cv=3,
    n_jobs=-1,
    random_state=42,
    verbose=2
)
gb_random_search.fit(X_train, y_train)
gb_best = gb_random_search.best_estimator_

# Gradient Boosting Evaluation
gb_y_pred = gb_best.predict(X_test)
print("\nGradient Boosting Classification Report:")
print(classification_report(y_test, gb_y_pred, target_names=['Fairly Paid', 'Underpaid', 'Overpaid']))

# Confusion Matrix for Gradient Boosting
conf_matrix_gb = confusion_matrix(y_test, gb_y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix_gb, annot=True, fmt="d", cmap="Blues",
            xticklabels=['Fairly Paid', 'Underpaid', 'Overpaid'],
            yticklabels=['Fairly Paid', 'Underpaid', 'Overpaid'])
plt.title("Confusion Matrix - Gradient Boosting")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()

# Step 5: Fairness Distribution by Clusters
fairness_cluster_summary = data.groupby(['Cluster', 'Fairness']).size().unstack(fill_value=0)
fairness_cluster_summary.plot(kind='bar', stacked=True, figsize=(12, 6), colormap='viridis')
plt.title("Fairness Categories by Clusters")
plt.xlabel("Player Cluster")
plt.ylabel("Number of Players")
plt.legend(title="Fairness Category")
plt.tight_layout()
plt.show()
Fitting 3 folds for each of 10 candidates, totalling 30 fits

Random Forest Classification Report:
              precision    recall  f1-score   support

 Fairly Paid       0.63      0.72      0.67        40
   Underpaid       0.35      0.38      0.37        29
    Overpaid       0.44      0.24      0.31        17

    accuracy                           0.51        86
   macro avg       0.48      0.45      0.45        86
weighted avg       0.50      0.51      0.50        86

[Figure: confusion matrix for the Random Forest classifier]
Fitting 3 folds for each of 10 candidates, totalling 30 fits

Gradient Boosting Classification Report:
              precision    recall  f1-score   support

 Fairly Paid       0.68      0.80      0.74        40
   Underpaid       0.50      0.52      0.51        29
    Overpaid       0.22      0.12      0.15        17

    accuracy                           0.57        86
   macro avg       0.47      0.48      0.47        86
weighted avg       0.53      0.57      0.54        86

[Figure: confusion matrix for the Gradient Boosting classifier]
[Figure: fairness categories by cluster (stacked bars)]

The results demonstrate a moderate level of success in identifying salary fairness categories using ensemble models like Random Forest and Gradient Boosting. The classification report for Gradient Boosting reveals strong performance for the "Fairly Paid" category with an F1-score of 0.74, but weaker performance for "Underpaid" (0.51) and "Overpaid" (0.15). This suggests the model is more adept at identifying fairly paid players, but struggles with the less represented "Overpaid" category, likely due to class imbalance.

The confusion matrices further highlight these observations, as most misclassifications occur between the "Underpaid" and "Overpaid" categories, indicating overlap in features or insufficient differentiation by the model. The fairness-by-cluster visualization (and the table below) reveals that "Utility Players" (Cluster 2) are predominantly fairly paid, while "Power Hitters" (Cluster 0) skew toward the underpaid and overpaid extremes. This suggests that the model captures some meaningful patterns but still faces challenges with generalizing fairness across all clusters.

Overall, while the models provide useful insights, there is clear room for improvement in handling class imbalance and in feature selection or representation. Techniques such as class rebalancing (a sketch follows below) or further-tuned ensembles could lift predictive accuracy, especially for the "Underpaid" and "Overpaid" categories.
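Class rebalancing is cheap to try. The sketch below assumes the same X_train/y_train split used for the searches above and was not run in this notebook: Random Forest accepts class weights directly, while Gradient Boosting takes per-sample weights instead.

# Reweight the rare "Overpaid" class so the models stop ignoring it
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Random Forest: weight each class inversely to its frequency
rf_balanced = RandomForestClassifier(class_weight='balanced', random_state=42)
rf_balanced.fit(X_train, y_train)

# Gradient Boosting has no class_weight argument; pass per-sample weights
sample_weights = compute_sample_weight(class_weight='balanced', y=y_train)
gb_balanced = GradientBoostingClassifier(random_state=42)
gb_balanced.fit(X_train, y_train, sample_weight=sample_weights)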

Fairness Distribution by Cluster

In [38]:
# Define the mapping for clusters to player categories
cluster_labels_map = {
    0: "Power Hitters",
    1: "Balanced Hitters",
    2: "Utility Players"
}

# Replace the cluster numbers in the table with their corresponding labels
fairness_cluster_summary.index = fairness_cluster_summary.index.map(cluster_labels_map)

# Display the updated table
print(fairness_cluster_summary)
Fairness          Fairly Paid  Overpaid  Underpaid
Cluster                                           
Power Hitters              36        46         69
Balanced Hitters           45        29         51
Utility Players           116         8         27

Power Hitters have a significant proportion of both "Overpaid" and "Underpaid" players, with relatively fewer categorized as "Fairly Paid," highlighting a notable disparity in salary alignment within this group. Balanced Hitters show a more even distribution, but still have a notable number of "Underpaid" players. Utility Players, on the other hand, have the highest number of "Fairly Paid" players and very few "Overpaid" players, indicating that salaries for this group are generally more aligned with their performance.

Proportional Analysis

In [39]:
fairness_proportions = fairness_cluster_summary.div(fairness_cluster_summary.sum(axis=1), axis=0)
print(fairness_proportions)
Fairness          Fairly Paid  Overpaid  Underpaid
Cluster                                           
Power Hitters        0.238411  0.304636   0.456954
Balanced Hitters     0.360000  0.232000   0.408000
Utility Players      0.768212  0.052980   0.178808

Power Hitters show the weakest alignment, with 45.7% categorized as "Underpaid," 30.5% as "Overpaid," and only 23.8% as "Fairly Paid." Balanced Hitters fare somewhat better, at 40.8% "Underpaid," 23.2% "Overpaid," and 36.0% "Fairly Paid," reflecting moderate alignment overall. Utility Players stand out with 76.8% "Fairly Paid," only 5.3% "Overpaid," and 17.9% "Underpaid," the strongest salary alignment among the clusters.
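These proportions look far from uniform across clusters; a chi-square test of independence on the counts table above would quantify that impression. A minimal sketch, assuming SciPy is available alongside the notebook's other imports:

# Does the fairness category depend on the player cluster?
from scipy.stats import chi2_contingency

chi2, p_value, dof, expected = chi2_contingency(fairness_cluster_summary.values)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p_value:.3g}")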

Visualization

In [40]:
fairness_cluster_summary.plot(kind='bar', stacked=True, figsize=(12, 6), colormap='viridis')
plt.title("Fairness Categories by Clusters")
plt.xlabel("Player Cluster")
plt.ylabel("Number of Players")
plt.legend(title="Fairness Category")
plt.tight_layout()
plt.show()
[Figure: Fairness Categories by Clusters, stacked bar chart with named clusters]
In [41]:
sns.boxplot(x='Cluster', y='AAV', hue='Fairness', data=data)
plt.title("Actual Salary Distribution by Cluster and Fairness")
plt.show()

sns.boxplot(x='Cluster', y='Predicted Salary', hue='Fairness', data=data)
plt.title("Predicted Salary Distribution by Cluster and Fairness")
plt.show()
[Figure: Actual Salary Distribution by Cluster and Fairness, boxplot]
[Figure: Predicted Salary Distribution by Cluster and Fairness, boxplot]

The salary distributions across clusters show substantial variability, particularly for "Power Hitters," where both actual and predicted salaries span a wide range with several extreme outliers. These outliers motivate the neural-network analysis that follows: a more flexible model may categorize fairness more reliably, and flagged outliers can be reviewed or removed to bring predictions into closer alignment with cluster-specific performance metrics.
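One simple way to flag those outliers for review, sketched here under the assumption that data still carries the 'AAV' and 'Cluster' columns used in the boxplots (this cell is illustrative and was not run in the notebook):

# Flag per-cluster salary outliers with the 1.5*IQR rule
def iqr_outliers(s):
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

# True where a player's AAV is extreme relative to their own cluster
data['AAV_outlier'] = data.groupby('Cluster')['AAV'].transform(iqr_outliers)
print(data.groupby('Cluster')['AAV_outlier'].sum())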

Neural Networks for Further Analysis¶

Data Preparation¶

In [45]:
# Step 1: Normalize the feature set
scaler = StandardScaler()
numeric_features = ['barrel', 'home_run', 'HR', 'RBI', 'SLG', 'Predicted Salary', 'Cluster']  # Relevant features; note 'Cluster' is a categorical ID scaled here as if numeric
X = data[numeric_features]
X_scaled = scaler.fit_transform(X)

# Step 2: Encode the target variable (Fairness Labels)
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(data['Fairness_Label'])  # Ensure 'Fairness_Label' is numeric (0, 1, 2)
y_categorical = to_categorical(y_encoded)  # Convert to one-hot encoding for multi-class classification

# Step 3: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_categorical, test_size=0.2, random_state=42, stratify=y_encoded)

print("Data prepared:")
print(f"Training set: {X_train.shape}, {y_train.shape}")
print(f"Testing set: {X_test.shape}, {y_test.shape}")
Data prepared:
Training set: (341, 7), (341, 3)
Testing set: (86, 7), (86, 3)
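An aside on bookkeeping: train_test_split returns bare NumPy arrays here, so any later cell that re-attaches columns such as 'Cluster' has to rely on row positions. A minimal index-preserving variant (an alternative sketch, not what this notebook ran) keeps the row labels through the split:

# Imports repeated for self-containment (already loaded earlier in the notebook)
import pandas as pd
from sklearn.model_selection import train_test_split

# Wrap the scaled matrix in a DataFrame so the split carries row labels
X_scaled_df = pd.DataFrame(X_scaled, columns=numeric_features, index=data.index)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_scaled_df, y_categorical, test_size=0.2, random_state=42, stratify=y_encoded)

# Later cells can attach columns by label instead of by position
cluster_train = data.loc[X_tr.index, 'Cluster']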

Model Preparation¶

In [50]:
# Step 1: Rebuild X_train and X_test as DataFrames with named columns
X_train_df = pd.DataFrame(X_train, columns=data[top_features].columns[:X_train.shape[1]])
X_test_df = pd.DataFrame(X_test, columns=data[top_features].columns[:X_test.shape[1]])

# Attach 'Cluster' to each split. Caveat: X_train_df/X_test_df carry fresh
# 0..n-1 indices, so these .loc lookups pull clusters by position in `data`
# rather than by the shuffled split order; splitting with an index-preserving
# DataFrame (as sketched above) avoids this.
X_train_df['Cluster'] = data.loc[X_train_df.index, 'Cluster'].values
X_test_df['Cluster'] = data.loc[X_test_df.index, 'Cluster'].values

# Convert back to numpy arrays
X_train_with_clusters = X_train_df.values
X_test_with_clusters = X_test_df.values

# Step 2: Build the Neural Network Model
model = Sequential([
    Dense(64, input_dim=X_train_with_clusters.shape[1], activation='relu'),  # first hidden layer with 64 neurons; input_dim declares the feature count
    Dropout(0.3),  # Dropout to prevent overfitting
    Dense(32, activation='relu'),  # Hidden layer with 32 neurons
    Dropout(0.3),
    Dense(16, activation='relu'),  # Another hidden layer with 16 neurons
    Dense(y_train.shape[1], activation='softmax')  # Output layer with softmax for multi-class classification
])

# Compile the model
model.compile(optimizer='adam',  # Adam optimizer
              loss='categorical_crossentropy',  # Loss for multi-class classification
              metrics=['accuracy'])

print(model.summary())

# Step 3: Train the Model
history = model.fit(X_train_with_clusters, y_train,
                    validation_split=0.2,
                    epochs=50,  # Adjust epochs as needed
                    batch_size=16,  # Mini-batch size
                    verbose=1)

# Step 4: Evaluate the Model
loss, accuracy = model.evaluate(X_test_with_clusters, y_test)
print(f"\nTest Loss: {loss}")
print(f"Test Accuracy: {accuracy}")

# Step 5: Visualize Training Progress
plt.figure(figsize=(12, 6))
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Model Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

plt.figure(figsize=(12, 6))
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

# Step 6: Analyze Results by Cluster
# Predict fairness categories for the test set
y_pred = np.argmax(model.predict(X_test_with_clusters), axis=1)

# Map predictions back to fairness categories
# (this assumes the labels were encoded 0='Fairly Paid', 1='Underpaid',
#  2='Overpaid'; check label_encoder.classes_ to confirm the actual order)
fairness_pred = pd.Series(y_pred).map({0: 'Fairly Paid', 1: 'Underpaid', 2: 'Overpaid'})

# Add predictions and clusters to a summary dataframe
results_df = pd.DataFrame({
    'Cluster': X_test_df['Cluster'],
    'True Fairness': pd.Series(np.argmax(y_test, axis=1)).map({0: 'Fairly Paid', 1: 'Underpaid', 2: 'Overpaid'}),
    'Predicted Fairness': fairness_pred
})

# Analyze fairness distribution by cluster
fairness_by_cluster = results_df.groupby('Cluster')['Predicted Fairness'].value_counts(normalize=True).unstack()
print(fairness_by_cluster)

# Visualize fairness distribution by cluster
fairness_by_cluster.plot(kind='bar', stacked=True, figsize=(12, 6), colormap='viridis')
plt.title('Fairness Prediction Distribution by Cluster')
plt.xlabel('Player Cluster')
plt.ylabel('Proportion')
plt.legend(title='Fairness Category')
plt.tight_layout()
plt.show()
/usr/local/lib/python3.10/dist-packages/keras/src/layers/core/dense.py:87: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead.
  super().__init__(activity_regularizer=activity_regularizer, **kwargs)
Model: "sequential_1"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Layer (type)                         ┃ Output Shape                ┃         Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ dense_4 (Dense)                      │ (None, 64)                  │             576 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_2 (Dropout)                  │ (None, 64)                  │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_5 (Dense)                      │ (None, 32)                  │           2,080 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_3 (Dropout)                  │ (None, 32)                  │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_6 (Dense)                      │ (None, 16)                  │             528 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_7 (Dense)                      │ (None, 3)                   │              51 │
└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘
 Total params: 3,235 (12.64 KB)
 Trainable params: 3,235 (12.64 KB)
 Non-trainable params: 0 (0.00 B)
None
Epoch 1/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 4s 43ms/step - accuracy: 0.2246 - loss: 1.1693 - val_accuracy: 0.5217 - val_loss: 1.0670
Epoch 2/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.4648 - loss: 1.0619 - val_accuracy: 0.5652 - val_loss: 1.0218
Epoch 3/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - accuracy: 0.5145 - loss: 1.0363 - val_accuracy: 0.5507 - val_loss: 0.9828
Epoch 4/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - accuracy: 0.5609 - loss: 1.0067 - val_accuracy: 0.5652 - val_loss: 0.9460
Epoch 5/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.5792 - loss: 0.9247 - val_accuracy: 0.5652 - val_loss: 0.9179
Epoch 6/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step - accuracy: 0.6251 - loss: 0.8729 - val_accuracy: 0.5797 - val_loss: 0.9050
Epoch 7/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step - accuracy: 0.5675 - loss: 0.8712 - val_accuracy: 0.5797 - val_loss: 0.8876
Epoch 8/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - accuracy: 0.6938 - loss: 0.7504 - val_accuracy: 0.5942 - val_loss: 0.8745
Epoch 9/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 14ms/step - accuracy: 0.5608 - loss: 0.8675 - val_accuracy: 0.6087 - val_loss: 0.8453
Epoch 10/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 1s 12ms/step - accuracy: 0.6494 - loss: 0.8256 - val_accuracy: 0.6522 - val_loss: 0.8230
Epoch 11/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.6412 - loss: 0.8263 - val_accuracy: 0.6232 - val_loss: 0.7975
Epoch 12/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.6405 - loss: 0.7704 - val_accuracy: 0.6377 - val_loss: 0.7743
Epoch 13/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.6179 - loss: 0.7574 - val_accuracy: 0.6377 - val_loss: 0.7590
Epoch 14/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.6185 - loss: 0.7804 - val_accuracy: 0.6522 - val_loss: 0.7559
Epoch 15/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.7186 - loss: 0.7183 - val_accuracy: 0.6522 - val_loss: 0.7604
Epoch 16/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.6558 - loss: 0.7328 - val_accuracy: 0.6377 - val_loss: 0.7365
Epoch 17/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.6746 - loss: 0.7809 - val_accuracy: 0.6377 - val_loss: 0.7204
Epoch 18/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.7195 - loss: 0.6682 - val_accuracy: 0.6377 - val_loss: 0.7254
Epoch 19/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.6132 - loss: 0.7899 - val_accuracy: 0.6377 - val_loss: 0.7142
Epoch 20/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.6123 - loss: 0.7903 - val_accuracy: 0.6377 - val_loss: 0.7098
Epoch 21/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.6870 - loss: 0.7149 - val_accuracy: 0.6522 - val_loss: 0.7213
Epoch 22/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.7263 - loss: 0.6558 - val_accuracy: 0.6522 - val_loss: 0.7162
Epoch 23/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.6732 - loss: 0.7611 - val_accuracy: 0.6232 - val_loss: 0.6950
Epoch 24/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.6790 - loss: 0.7092 - val_accuracy: 0.6377 - val_loss: 0.6889
Epoch 25/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.7029 - loss: 0.6892 - val_accuracy: 0.6377 - val_loss: 0.6879
Epoch 26/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - accuracy: 0.7311 - loss: 0.6062 - val_accuracy: 0.6377 - val_loss: 0.6824
Epoch 27/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 1s 21ms/step - accuracy: 0.6925 - loss: 0.6339 - val_accuracy: 0.6377 - val_loss: 0.6810
Epoch 28/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 1s 21ms/step - accuracy: 0.7426 - loss: 0.6591 - val_accuracy: 0.6522 - val_loss: 0.6876
Epoch 29/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.7164 - loss: 0.6015 - val_accuracy: 0.6377 - val_loss: 0.6923
Epoch 30/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.6936 - loss: 0.6121 - val_accuracy: 0.6377 - val_loss: 0.6968
Epoch 31/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.6732 - loss: 0.6753 - val_accuracy: 0.6667 - val_loss: 0.6781
Epoch 32/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.6719 - loss: 0.6605 - val_accuracy: 0.6667 - val_loss: 0.6685
Epoch 33/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.7308 - loss: 0.6109 - val_accuracy: 0.6667 - val_loss: 0.6671
Epoch 34/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.6786 - loss: 0.6823 - val_accuracy: 0.6667 - val_loss: 0.6625
Epoch 35/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.6741 - loss: 0.6703 - val_accuracy: 0.6667 - val_loss: 0.6680
Epoch 36/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.6687 - loss: 0.6631 - val_accuracy: 0.6377 - val_loss: 0.6789
Epoch 37/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.6559 - loss: 0.6474 - val_accuracy: 0.6377 - val_loss: 0.6729
Epoch 38/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.7335 - loss: 0.6217 - val_accuracy: 0.6667 - val_loss: 0.6539
Epoch 39/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.7076 - loss: 0.6215 - val_accuracy: 0.6522 - val_loss: 0.6550
Epoch 40/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.7345 - loss: 0.6303 - val_accuracy: 0.6377 - val_loss: 0.6819
Epoch 41/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - accuracy: 0.7237 - loss: 0.6648 - val_accuracy: 0.6522 - val_loss: 0.6706
Epoch 42/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - accuracy: 0.7043 - loss: 0.6001 - val_accuracy: 0.6667 - val_loss: 0.6686
Epoch 43/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.7683 - loss: 0.6141 - val_accuracy: 0.6522 - val_loss: 0.6635
Epoch 44/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step - accuracy: 0.6912 - loss: 0.6185 - val_accuracy: 0.6522 - val_loss: 0.6581
Epoch 45/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.7393 - loss: 0.6437 - val_accuracy: 0.6522 - val_loss: 0.6593
Epoch 46/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.7281 - loss: 0.6000 - val_accuracy: 0.6232 - val_loss: 0.6618
Epoch 47/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.6727 - loss: 0.6101 - val_accuracy: 0.6377 - val_loss: 0.6695
Epoch 48/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.7014 - loss: 0.6247 - val_accuracy: 0.6377 - val_loss: 0.6630
Epoch 49/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.7559 - loss: 0.5823 - val_accuracy: 0.6667 - val_loss: 0.6585
Epoch 50/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 14ms/step - accuracy: 0.6760 - loss: 0.6398 - val_accuracy: 0.6522 - val_loss: 0.6477
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - accuracy: 0.7394 - loss: 0.6679 

Test Loss: 0.6681965589523315
Test Accuracy: 0.7209302186965942
[Figure: Model Accuracy, training vs. validation]
[Figure: Model Loss, training vs. validation]
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 40ms/step
Predicted Fairness  Fairly Paid  Overpaid  Underpaid
Cluster                                             
0                      0.391304  0.260870   0.347826
1                      0.545455  0.181818   0.272727
2                      0.390244  0.341463   0.268293
[Figure: Fairness Prediction Distribution by Cluster, stacked bar chart]

The neural network's test-set predictions give a distinct view of salary fairness across clusters. For "Power Hitters" (Cluster 0) the model predicts a relatively balanced distribution: 39% "Fairly Paid," 35% "Underpaid," and a noticeable 26% "Overpaid." "Balanced Hitters" (Cluster 1) receive the highest share of "Fairly Paid" predictions (54%), with 27% "Underpaid" and 18% "Overpaid," indicating the best alignment overall. "Utility Players" (Cluster 2) are the most evenly distributed, at 39% "Fairly Paid," 34% "Overpaid," and 27% "Underpaid." Note that these are predicted proportions on the 86-player test set, not the label-based distribution reported earlier (where Utility Players were 76.8% fairly paid), which accounts for much of the apparent discrepancy.
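Because results_df pairs true and predicted labels with cluster membership, per-cluster accuracy is a short computation and shows where the network's predictions are most trustworthy. A small sketch using the dataframe built above:

# Per-cluster accuracy from the prediction summary
per_cluster_acc = (
    results_df
    .assign(correct=lambda df: df['True Fairness'] == df['Predicted Fairness'])
    .groupby('Cluster')['correct']
    .mean()
)
print(per_cluster_acc)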

The accuracy and loss curves show steady improvement before the validation loss flattens out, and the model reaches a test accuracy of about 72% (validation accuracy plateaued near 65%). While the network provides useful insight into fairness within player clusters, the modest accuracy and overlapping proportions in some clusters suggest room for refinement: additional features, class weighting, or stronger regularization (sketched below) could sharpen classification and offer deeper insight into salary dynamics across player types.
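One such refinement, sketched below and not run here: early stopping keyed to the plateauing validation loss, plus class weights to counter the imbalance noted earlier. The import path assumes the TensorFlow-backed Keras used above.

from tensorflow.keras.callbacks import EarlyStopping

# Per-class weights from the one-hot training labels (rarer classes weigh more)
class_counts = y_train.sum(axis=0)
class_weight = {i: len(y_train) / (len(class_counts) * c)
                for i, c in enumerate(class_counts)}

# Stop when validation loss stalls and keep the best weights seen
early_stop = EarlyStopping(monitor='val_loss', patience=10,
                           restore_best_weights=True)

history = model.fit(X_train_with_clusters, y_train,
                    validation_split=0.2,
                    epochs=200,  # the callback decides the real endpoint
                    batch_size=16,
                    class_weight=class_weight,
                    callbacks=[early_stop],
                    verbose=0)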

In [61]:
!cp "/content/drive/MyDrive/Colab Notebooks/final_project.ipynb" ./
!jupyter nbconvert --to html "final_project.ipynb"
[NbConvertApp] Converting notebook final_project.ipynb to html
[NbConvertApp] WARNING | Alternative text is missing on 18 image(s).
[NbConvertApp] Writing 1328985 bytes to final_project.html