Train delays can disrupt travel plans and cost passengers valuable time. This project therefore focuses on predicting train delays by analyzing a rich dataset of train journeys with features such as departure and arrival stations, timestamps, and delay information. We used a classification approach, grouping delays into classes that reflect specific minute ranges.
To improve the accuracy of these predictions, we combined data preprocessing, feature engineering, and machine learning models, in particular XGBoost with hyperparameters tuned through Bayesian optimization and randomized search (RandomizedSearchCV). Our solution not only improves delay predictions but also provides actionable insights for better schedule management.
Introduction
Train delay prediction is a critical challenge for enhancing passenger satisfaction. With the advent of machine learning, analyzing historical train journey data offers a powerful way to anticipate delays and optimize schedules. Our project aimed to develop a robust predictive model that can accurately forecast train delays based on a dataset containing features such as departure and arrival stations, times, and delay records. By leveraging advanced machine learning techniques, we sought to address the problem of delay prediction and provide actionable insights to railway customers.
We met once a week for a jour fixe to code together and discuss the next steps, and we used several digital tools to communicate and collaborate:
Slack: our central communication tool
Miro: for brainstorming, idea development, milestone setting, planning and strategy
Programming environment: Google Colab
Methodology
1. Data Pre-processing
In our project, the initial phase involved extensive data pre-processing to ensure the quality and usability of the data for machine learning. We started by loading a dataset containing train schedules and delays, where we handled missing values by replacing NaN with zeros or false values depending on the context. We converted categorical features, such as day_of_week, month, ab_bahnhof (departure station), and zugnr (train number), into a categorical data type, which is important for machine learning models that can handle categorical inputs. Additionally, we converted timestamp data into Unix timestamps for better numerical handling in the model. Unnecessary columns like index and date were dropped, and false or invalid entries in the an_verspaetung (arrival delay) column were removed to ensure data integrity.
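The pre-processing steps described above can be sketched roughly as follows. This is a minimal illustration on a tiny invented table, not our actual pipeline: the columns ab_zeit (departure time) and ab_verspaetung (departure delay) are hypothetical names chosen for the example, and "invalid" target entries are reduced here to missing values only.

```python
import pandas as pd

# Tiny invented stand-in for the real dataset. The columns "ab_zeit" and
# "ab_verspaetung" are hypothetical names used only for this sketch.
raw = pd.DataFrame({
    "index": [0, 1, 2, 3],
    "date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
    "day_of_week": [0, 0, 1, 1],
    "month": [1, 1, 1, 1],
    "ab_bahnhof": ["Hamburg Hbf", "Berlin Hbf", "Hamburg Hbf", "Berlin Hbf"],
    "zugnr": ["ICE 701", "ICE 702", "ICE 701", "RE 5"],
    "ab_zeit": pd.to_datetime([
        "2024-01-01 08:00:00", "2024-01-01 09:30:00",
        "2024-01-02 08:00:00", "2024-01-02 17:45:00",
    ]),
    "ab_verspaetung": [2.0, None, 0.0, 12.0],
    "an_verspaetung": [3.0, 7.0, None, 15.0],
})


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Remove rows without a usable target value (assumption: missing is the
    # only kind of invalid entry handled in this sketch).
    df = df[df["an_verspaetung"].notna()].copy()

    # Replace remaining NaN values in numeric features with zero.
    df["ab_verspaetung"] = df["ab_verspaetung"].fillna(0)

    # Convert timestamps to Unix seconds (naive timestamps are treated as UTC).
    df["ab_zeit"] = (df["ab_zeit"] - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)

    # Mark categorical features so that models with categorical support can use them.
    for col in ["day_of_week", "month", "ab_bahnhof", "zugnr"]:
        df[col] = df[col].astype("category")

    # Drop columns that carry no information for the model.
    return df.drop(columns=["index", "date"])


clean = preprocess(raw)
print(clean.dtypes)
```

Filtering on the target first keeps the zero-filling from turning a missing arrival delay into a fake "on time" label.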
2. Data Analysis
In the data analysis phase, we examined the distribution of the an_verspaetung values by grouping them into discrete bins, categorizing the delay times into classes, e.g. class 1 for delays of zero to five minutes [0-5]. In a real-world scenario, the delay class would determine which reimbursement of costs a passenger is eligible for. This categorization was crucial because it allowed us to switch from a regression problem to a classification problem, which simplifies modeling and lets us predict the probability that a train arrives within each delay range. We also analyzed the station pairs (departure and arrival stations) and determined the train paths to better understand the delay patterns. Additionally, we enriched the dataset with weather information from APIs, since weather could potentially affect train delays, thus adding another layer of features for our model.
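As an illustration of the binning step, the following sketch maps delay minutes to classes with pandas. Only class 1 for [0-5] minutes comes from our description above; the remaining bin edges (15, 30, 60 minutes) and the number of classes are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Class 1 covers [0, 5] minutes as described in the text. All further
# edges are assumptions chosen for illustration.
BIN_EDGES = [0, 5, 15, 30, 60, np.inf]
CLASS_LABELS = [1, 2, 3, 4, 5]


def to_delay_class(delays: pd.Series) -> pd.Series:
    # include_lowest=True makes the first bin closed on the left, so a
    # delay of exactly 0 minutes falls into class 1. Values outside the
    # bins (e.g. negative delays) become NaN.
    return pd.cut(delays, bins=BIN_EDGES, labels=CLASS_LABELS, include_lowest=True)


delays = pd.Series([0, 3, 5, 6, 20, 45, 120], name="an_verspaetung")
classes = to_delay_class(delays)
print(classes.value_counts().sort_index())
```

Looking at the class counts this way also shows how unbalanced the classes are, which matters when interpreting accuracy later on.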
3. Machine Learning Model Development
For model development, we used XGBoost, a powerful gradient boosting algorithm, as a classifier. We first split the data into training and test sets so that the model's performance could be evaluated on unseen data. We then used Bayesian optimization and RandomizedSearchCV to tune hyperparameters such as max_depth, learning_rate, and n_estimators, improving the model's accuracy. The final model was trained on the processed data, with the categorized delay as the target variable. Finally, we evaluated the model using a confusion matrix to understand its performance across the different delay classes.
4. Results and Impact
The results of our model showed a reasonable accuracy of around 66%, with the confusion matrix highlighting that most predictions were accurate for trains with no delays or very short delays (under 6 minutes). However, predicting longer delays was more challenging, as seen in lower precision and recall for higher delay classes. This analysis indicates that while the model can effectively predict on-time or near on-time arrivals, more complex delays (e.g., those influenced by multiple factors like weather or station congestion) require further feature engineering and perhaps more sophisticated models. The project underscores the value of categorizing delay times for more manageable predictions and reveals the impact of external factors like weather on train punctuality, suggesting areas for future improvement.
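To make the relationship between overall accuracy and per-class precision and recall concrete, here is a small sketch with scikit-learn. The labels and predictions are invented for illustration and are not our model's output. They only mimic the pattern described above, where the frequent short-delay class is predicted well and the rare long-delay class is missed.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
)

# Invented example: class 0 = short delay (frequent), class 2 = long delay (rare).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 1, 0, 1, 0, 0]
labels = [0, 1, 2]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=labels)
accuracy = accuracy_score(y_true, y_pred)

# zero_division=0 reports precision 0 for classes that are never predicted.
precision, recall, _, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, zero_division=0
)

print(cm)
print("accuracy:", accuracy)
for label, p, r, n in zip(labels, precision, recall, support):
    print(f"class {label}: precision={p:.2f} recall={r:.2f} support={n}")
```

In this toy case the accuracy is carried almost entirely by the majority class, which is why per-class metrics are worth reporting alongside it.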
Team & Roles
Lars
Search, data collection and analysis
Mirko Brokhoff
Pre-processing, Initial model training, feature development
Carl
Pre-processing, Hyperparameter optimization
Emma Dierschke
Pre-processing, UI, the lion’s share of this blog post
Mentors
Nils Uhrberg
Mathis Hunke