Projects
NFL 2025 Big Data Bowl
Developed a random forest classifier to predict the probability of designed run plays vs. pass plays for the San Francisco 49ers.
ETL Pipeline
Created an ETL pipeline that extracts transit data from GTFS APIs and transforms it into a star schema data warehouse using Python and PostgreSQL.
NFL 2025 Big Data Bowl
Overview
This machine learning model was developed to give NFL defenses a competitive advantage against the San Francisco 49ers by predicting the probability of designed run plays versus pass plays based on pre-snap alignment and game situation. The model enables defensive coordinators to make informed audibles that could disrupt offensive strategies.
Data Source & Processing
The dataset was supplied by the NFL, covering weeks 1-6 of the 2023 season and focusing exclusively on San Francisco's offensive plays. After removing QB kneels and aggregating the data to one record per gameID/playID combination, the final dataset contained 451 plays across seven engineered features, providing a focused foundation for training the predictive model.
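A minimal sketch of this filtering and aggregation step, assuming play-level columns such as possessionTeam, qbKneel, gameId, and playId (the file and column names are illustrative, not the exact Big Data Bowl schema):

```python
import pandas as pd

# Load the play-level data supplied by the NFL (file name is illustrative).
plays = pd.read_csv("plays.csv")

# Keep only San Francisco's offensive plays and drop QB kneels.
sf_plays = plays[(plays["possessionTeam"] == "SF") & (plays["qbKneel"] == 0)]

# Collapse to one record per play, keyed by gameId and playId.
sf_plays = sf_plays.drop_duplicates(subset=["gameId", "playId"]).reset_index(drop=True)

print(len(sf_plays))  # 451 plays in the final dataset
```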
Target Variable
The target variable is binary: (1) for designed running plays and (0) for pass plays. To emphasize strategic play-calling analysis, QB scrambles were excluded from designed runs, ensuring the model focused on intentional offensive schemes rather than impromptu decisions.
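A one-line sketch of the labeling, assuming an isDropback-style flag is available so that scrambles (which begin as dropbacks) are treated as pass plays rather than designed runs:

```python
# 1 = designed run, 0 = pass; QB scrambles start as dropbacks, so they are
# labeled as pass plays rather than designed runs (column name is illustrative).
sf_plays["designed_run"] = (~sf_plays["isDropback"]).astype(int)
```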
Model Architecture & Performance
The random forest classifier was selected for its ability to handle the imbalanced dataset (194 run plays vs. 257 pass plays), using class weights to prevent bias toward the majority class. The model achieved strong performance, with 80.9% accuracy and an 87.4% ROC AUC. Given the high cost of defensive misalignment, precision was prioritized, reaching 0.82 for pass plays and 0.79 for run plays.
Classification Results:
             precision    recall  f1-score   support
Pass              0.82      0.84      0.83        77
Run               0.79      0.76      0.78        59
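A sketch of the training and evaluation loop, assuming X is the preprocessed feature matrix described in the next section and y is the binary target; the split proportion and random seed are illustrative:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

# Stratified split preserves the 194/257 run-pass ratio in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# class_weight="balanced" counteracts the mild class imbalance.
model = RandomForestClassifier(class_weight="balanced", random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred, target_names=["Pass", "Run"]))
print("ROC AUC:", roc_auc_score(y_test, y_prob))
```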
Feature Engineering & Selection
Features were selected based on factors that significantly influence offensive play-calling decisions, and each underwent preprocessing appropriate to its type (a sketch of the combined preprocessing pipeline follows the list below):
- offenseFormation (17.76% SHAP importance) – Six formation types (SHOTGUN, SINGLEBACK, I_FORM, etc.) processed through OneHotEncoding after removing null values.
- motionSinceLineset (12.24%) – Boolean indicator of player motion after line set, with nulls imputed using KNN and binary encoding applied.
- receiverAlignment (8.81%) – Categorical alignment patterns (0x0, 1x1, 2x1, etc.) processed through OneHotEncoding.
- presnap_score_difference (3.79%) – Engineered feature capturing score-based play-calling tendencies (positive when SF leads, negative when trailing), standardized for model input.
- down_yardsToGo (3.54%) – Custom feature multiplying down number by yards to go, creating an expanded scale that penalizes late downs with high yardage needs.
- gameClock_seconds (2.04%) – Game clock converted from MM:SS format to seconds and standardized.
- shiftSinceLineset (2.00%) – Boolean indicating player shifts greater than 2.5 yards from line set position, binary encoded after dropping null values.
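The individual preprocessing steps above can be combined in a single ColumnTransformer; the sketch below assumes the feature names listed and standard scikit-learn components (in practice the KNN imputation of motionSinceLineset would likely draw on related numeric features as well):

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical = ["offenseFormation", "receiverAlignment"]
numeric = ["presnap_score_difference", "down_yardsToGo", "gameClock_seconds"]
knn_imputed = ["motionSinceLineset"]    # nulls imputed with KNN, already 0/1
passthrough = ["shiftSinceLineset"]     # nulls dropped upstream, already 0/1

preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("scale", StandardScaler(), numeric),
    ("impute", KNNImputer(n_neighbors=5), knn_imputed),
    ("keep", "passthrough", passthrough),
])

X = preprocess.fit_transform(sf_plays)  # feature matrix used by the model
```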
Feature Correlation Analysis
Correlation analysis revealed clear patterns distinguishing run vs. pass tendencies (a computation sketch follows the lists below):
Strong Run Indicators (Positive Correlation):
- I_FORM formation (0.407)
- 2x1 receiver alignment (0.384)
- Positive score difference (0.282)
- SINGLEBACK formation (0.227)
Strong Pass Indicators (Negative Correlation):
- SHOTGUN formation (-0.379)
- High down/yards combinations (-0.349)
- Motion since line set (-0.287)
- EMPTY formation (-0.278)
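A sketch of how these correlations can be computed, reusing the sf_plays frame and designed_run target from the earlier sketches; one-hot encoding turns each formation and alignment value into its own 0/1 column before correlating:

```python
import pandas as pd

feature_cols = [
    "offenseFormation", "receiverAlignment", "motionSinceLineset",
    "presnap_score_difference", "down_yardsToGo", "gameClock_seconds",
    "shiftSinceLineset",
]

# One-hot encode categoricals, then correlate every column with the target.
encoded = pd.get_dummies(sf_plays[feature_cols]).astype(float)
correlations = encoded.corrwith(sf_plays["designed_run"]).sort_values(ascending=False)

print(correlations.head())  # strongest run indicators (positive)
print(correlations.tail())  # strongest pass indicators (negative)
```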
Hyperparameter Optimization
The model underwent iterative hyperparameter tuning using grid search, optimizing for ROC AUC and precision; a sketch of the search follows the parameter list below. Cross-validation monitoring prevented overfitting, yielding a mean CV score of 88.4% with low variability (2.6% standard deviation).
Optimal Parameters:
- bootstrap: False
- max_depth: 10
- max_features: 'sqrt'
- min_samples_leaf: 2
- min_samples_split: 25
- n_estimators: 300
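A sketch of the grid search, reusing X_train and y_train from the earlier split; the candidate values are illustrative, with the grid centered on the optimal parameters listed above:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "bootstrap": [True, False],
    "max_depth": [5, 10, 20, None],
    "max_features": ["sqrt", "log2"],
    "min_samples_leaf": [1, 2, 4],
    "min_samples_split": [2, 10, 25],
    "n_estimators": [100, 300, 500],
}

search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_grid,
    scoring="roc_auc",  # primary optimization metric
    cv=5,               # cross-validation guards against overfitting
    n_jobs=-1,
)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.best_score_)                                        # mean CV score (~0.884)
print(search.cv_results_["std_test_score"][search.best_index_])  # variability (~0.026)
```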
Model Validation & Learning Curves
The learning curve analysis shows training and validation scores converging as the training set grows. A slight gap between the two suggests minor overfitting that additional training data could reduce. Both curves level off, indicating the model has reached a performance plateau at the current data volume.
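A sketch of the learning-curve computation behind this analysis, reusing the tuned estimator and full dataset from the earlier sketches:

```python
import numpy as np
from sklearn.model_selection import learning_curve

train_sizes, train_scores, val_scores = learning_curve(
    search.best_estimator_, X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5, scoring="roc_auc", n_jobs=-1,
)

# A persistent gap between the mean curves signals overfitting; both curves
# flattening out indicates a performance plateau at the current data volume.
print(train_scores.mean(axis=1))
print(val_scores.mean(axis=1))
```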
Practical Applications & Business Impact
This model provides defensive coordinators with actionable intelligence, anticipating offensive play calls with 80.9% accuracy. High-confidence predictions enable real-time defensive adjustments that could significantly impact game outcomes, particularly in critical down-and-distance situations.
Limitations & Future Enhancements
While the model demonstrates strong predictive capability, several areas present opportunities for improvement. Weekly retraining would be essential to capture evolving team tendencies throughout the season, with potential implementation of time-decay weighting to emphasize recent games. Future iterations could incorporate player personnel packages, injury reports, and specific player tendencies (e.g., Christian McCaffrey's presence) to enhance prediction accuracy and provide more granular insights for defensive strategy.
ETL Pipeline for Transit Data
Overview
This comprehensive ETL pipeline was designed to modernize transit data management by extracting real-time information from GTFS (General Transit Feed Specification) APIs and transforming it into a structured star schema data warehouse. The system enables transit agencies to perform advanced analytics on ridership patterns, route performance, and operational efficiency while maintaining data integrity and accessibility.
Data Sources & Extraction
The pipeline connects to multiple GTFS-RT (Real-Time) and GTFS-Static feeds from major transit agencies including King County Metro, Sound Transit, and regional bus systems. Data extraction occurs every 30 seconds for real-time feeds and daily for static schedule updates. The system handles multiple data formats including protocol buffers, JSON, and CSV files, with robust error handling for API timeouts and malformed data.
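A minimal sketch of the real-time extraction step using the standard gtfs-realtime-bindings package; the feed URL is a placeholder, since each agency publishes its own endpoints:

```python
import requests
from google.transit import gtfs_realtime_pb2  # pip install gtfs-realtime-bindings

# Placeholder URL; real agencies publish their own vehicle-positions endpoints.
FEED_URL = "https://transit.example.gov/gtfs-rt/vehiclepositions.pb"

def fetch_vehicle_positions(url: str = FEED_URL, timeout: int = 10) -> list[dict]:
    """Pull one GTFS-RT protocol-buffer snapshot and flatten it to dicts."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # surface API errors early

    feed = gtfs_realtime_pb2.FeedMessage()
    feed.ParseFromString(response.content)

    records = []
    for entity in feed.entity:
        if entity.HasField("vehicle"):
            v = entity.vehicle
            records.append({
                "trip_id": v.trip.trip_id,
                "route_id": v.trip.route_id,
                "vehicle_id": v.vehicle.id,
                "latitude": v.position.latitude,
                "longitude": v.position.longitude,
                "timestamp": v.timestamp,
            })
    return records
```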
Data Transformation & Validation
Raw transit data undergoes extensive transformation to ensure consistency and quality. Key processes include timezone normalization across different transit systems, route ID standardization, stop location geocoding validation, and vehicle position interpolation for missing GPS coordinates. Data quality checks validate stop times, detect duplicate records, and flag anomalous patterns like vehicles traveling impossible speeds.
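A sketch of a few of these checks in pandas, assuming the flattened position records from the extraction step; the speed threshold and target timezone are illustrative:

```python
import pandas as pd

MAX_PLAUSIBLE_SPEED_MPH = 80  # illustrative threshold for a transit vehicle

def validate_positions(df: pd.DataFrame) -> pd.DataFrame:
    """Apply basic quality checks to a frame of vehicle position records."""
    # Normalize timestamps to a single timezone across agencies.
    df["timestamp"] = pd.to_datetime(df["timestamp"], unit="s", utc=True)
    df["timestamp"] = df["timestamp"].dt.tz_convert("America/Los_Angeles")

    # Drop exact duplicate reports for the same vehicle and time.
    df = df.drop_duplicates(subset=["vehicle_id", "timestamp"])

    # Flag implausible speeds between consecutive reports for each vehicle.
    df = df.sort_values(["vehicle_id", "timestamp"])
    dt_hours = df.groupby("vehicle_id")["timestamp"].diff().dt.total_seconds() / 3600
    dist_miles = (
        df.groupby("vehicle_id")[["latitude", "longitude"]]
        .diff().pow(2).sum(axis=1).pow(0.5) * 69  # rough degrees-to-miles conversion
    )
    df["speed_anomaly"] = (dist_miles / dt_hours) > MAX_PLAUSIBLE_SPEED_MPH
    return df
```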
Technology Stack:
- Python 3.9+ - Core processing engine with pandas, requests, and schedule libraries
- PostgreSQL 14 - Data warehouse with PostGIS extension for geospatial operations
- Apache Airflow - Workflow orchestration and dependency management (see the DAG sketch after this list)
- Redis - Caching layer for API responses and intermediate processing
- Docker - Containerized deployment with environment isolation
- Prometheus & Grafana - System monitoring and alerting
Star Schema Architecture
The data warehouse implements a dimensional modeling approach optimized for analytical queries. The central fact table captures trip events with foreign keys to dimension tables for routes, stops, vehicles, and time periods. This design enables efficient aggregation queries and supports both historical analysis and real-time dashboard updates.
Schema Design:
Fact Table: trip_events (2.3M+ records)
├── trip_id, stop_id, route_id, vehicle_id
├── arrival_time, departure_time, delay_seconds
├── passenger_load, stop_sequence
└── weather_condition_id, date_key

Dimensions:
├── dim_routes: route metadata and service patterns
├── dim_stops: stop locations and accessibility features
├── dim_vehicles: fleet information and capacity
├── dim_time: date/time hierarchies for temporal analysis
└── dim_weather: conditions affecting service performance
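An illustrative slice of the DDL for the central fact table, executed from Python with psycopg2; the connection string is a placeholder and the dimension tables are assumed to exist already:

```python
import psycopg2

FACT_TABLE_DDL = """
CREATE TABLE IF NOT EXISTS trip_events (
    trip_id              TEXT NOT NULL,
    stop_id              TEXT NOT NULL REFERENCES dim_stops (stop_id),
    route_id             TEXT NOT NULL REFERENCES dim_routes (route_id),
    vehicle_id           TEXT REFERENCES dim_vehicles (vehicle_id),
    arrival_time         TIMESTAMPTZ,
    departure_time       TIMESTAMPTZ,
    delay_seconds        INTEGER,
    passenger_load       INTEGER,
    stop_sequence        INTEGER,
    weather_condition_id INTEGER REFERENCES dim_weather (weather_condition_id),
    date_key             INTEGER NOT NULL REFERENCES dim_time (date_key)
);
"""

# Placeholder connection string; credentials come from the deployment environment.
with psycopg2.connect("postgresql://user:password@localhost:5432/transit_dw") as conn:
    with conn.cursor() as cur:
        cur.execute(FACT_TABLE_DDL)
```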
Pipeline Performance & Scalability
The system processes over 50,000 vehicle position updates hourly across 200+ active routes, maintaining sub-second latency for real-time queries. Incremental loading strategies reduce processing time by 75% compared to full refreshes, while automated data partitioning by month ensures consistent query performance as historical data grows.
Key Performance Metrics:
- Data freshness: 95% of records processed within 60 seconds of API availability
- System uptime: 99.7% availability with automated failover mechanisms
- Query performance: Complex analytical queries execute in under 3 seconds
- Storage efficiency: 40% reduction in storage requirements through optimal indexing
Data Quality & Monitoring
Comprehensive data quality frameworks ensure reliable analytics output. Automated validation rules check for temporal consistency, geographic boundaries, and business logic violations. Real-time monitoring tracks data pipeline health, API response times, and data volume anomalies, with alert systems notifying administrators of critical issues within minutes.
Business Intelligence Integration
The warehouse powers multiple analytical applications including route optimization dashboards, ridership forecasting models, and operational performance reports. Integration with Tableau and Power BI enables self-service analytics for transit planners, while RESTful APIs provide programmatic access for custom applications and mobile transit apps.
Security & Compliance
The pipeline implements enterprise-grade security measures including encrypted data transmission, role-based access controls, and audit logging for all data modifications. GDPR compliance mechanisms ensure passenger privacy protection, while automated backup systems maintain data availability with 99.9% recovery guarantees.
Future Enhancements & Roadmap
Planned improvements include machine learning integration for predictive maintenance, expansion to additional transit agencies, and real-time passenger counting through mobile app integration. Advanced analytics capabilities will incorporate weather data correlation, special event impact analysis, and dynamic route optimization based on real-time demand patterns.