As a lifelong baseball fanatic and former collegiate player, I’ve always been passionate about analyzing the game through data. Baseball, with its vast datasets and rich statistics, is a perfect match for any data enthusiast. This project grew out of my desire to build a reliable data pipeline for generating insights and supporting the baseball analyses I plan to publish on a future blog.
For a detailed narrative of the project, including the motivation and challenges faced, refer to the Project Writeup.
The Baseball Stats Pipeline automates the collection, transformation, and storage of baseball statistics using Python, Airflow, and GCP. It creates a robust relational database for querying and analysis. For an overview of the project methodology and key outcomes, see the Project Presentation.
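The collect, transform, and store flow described above can be illustrated with a minimal sketch. This is not the project's actual code: the sample records, the `batting_stats` table, and the function names are all hypothetical, and an in-memory SQLite database stands in for the GCP-hosted relational database. In the real pipeline each step would presumably run as an Airflow task; here they are plain Python functions so the example runs on its own.

```python
import sqlite3

# Hypothetical raw records, standing in for whatever the collection step returns.
RAW_ROWS = [
    {"player": "Player A", "season": "2023", "at_bats": "500", "hits": "150"},
    {"player": "Player B", "season": "2023", "at_bats": "0", "hits": "0"},
]


def transform(rows):
    """Cast string fields to integers and derive batting average (hits / at-bats)."""
    cleaned = []
    for row in rows:
        at_bats = int(row["at_bats"])
        hits = int(row["hits"])
        # Avoid dividing by zero for players with no at-bats.
        batting_avg = round(hits / at_bats, 3) if at_bats else None
        cleaned.append((row["player"], int(row["season"]), at_bats, hits, batting_avg))
    return cleaned


def load(conn, records):
    """Create the (assumed) table if needed and upsert the transformed records."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS batting_stats (
            player      TEXT    NOT NULL,
            season      INTEGER NOT NULL,
            at_bats     INTEGER NOT NULL,
            hits        INTEGER NOT NULL,
            batting_avg REAL,
            PRIMARY KEY (player, season)
        )
        """
    )
    # INSERT OR REPLACE keeps reruns idempotent: the same player/season is overwritten.
    conn.executemany(
        "INSERT OR REPLACE INTO batting_stats VALUES (?, ?, ?, ?, ?)", records
    )
    conn.commit()


def run_pipeline(conn, raw_rows):
    """Run transform and load in order, then return a simple query result."""
    load(conn, transform(raw_rows))
    return conn.execute(
        "SELECT player, batting_avg FROM batting_stats ORDER BY player"
    ).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    for player, avg in run_pipeline(connection, RAW_ROWS):
        print(player, avg)
```

Keying the table on player and season and loading with an upsert means a scheduled rerun overwrites existing rows instead of duplicating them, which is one common way to keep a recurring load safe to repeat.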
For more information on these features and the technical implementation, refer to the Project Presentation.
The project produced a simplified but functional entity-relationship diagram (ERD) that structures baseball statistics for analysis:
For further insights into the challenges and next steps, review the Project Writeup.
This project was a transformative experience, allowing me to apply classroom concepts to a personal passion. The resulting pipeline enables me to explore and analyze baseball data efficiently, and it will serve as the backbone for future projects.
For a summary of the project objectives, methodology, and outcomes, visit the Project Presentation.