Streetcar Delay Predictor

ML, Python


Description

Many customers rely on the TTC to travel to their destinations safely and promptly. Customer dissatisfaction could lead to massive repercussions for the TTC including decreased ridership, loyalty and profitability. With this understanding, it’s important to know how long delays are when they occur so that patrons can plan accordingly.

Our machine learning model predicts, with low error, the delay in minutes for a TTC streetcar given the route, day, location, incident type, and direction. On this task the model reached a Root Mean Square Error (RMSE) of ~1.4 minutes.

Background

We chose TTC data because we wanted something with real-world application. The TTC provides a wealth of data on everything from buses to subways to streetcars. While we needed to do some preprocessing and data cleaning, we felt the dataset was large enough and useful enough for our purposes.

map of toronto with streetcar routes highlighted

Data Exploration

To start this project, we first needed to get to know the data better, so I created some charts to take a look at it in different ways.

bar chart of delay times by route

Leslie-Barnes station has the second-highest number of delays and, by a wide margin, the longest delay time!

bar chart of delay times for Leslie-Barnes station

We can also see the time of day when most delays occur.

chart of delay times by time of day
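
The aggregations behind charts like these can be sketched with pandas. This is only an illustration: the sample rows and the column names (route, hour, min_delay) are assumptions made up for the example, not the actual TTC schema or the code used in the project.

```python
import pandas as pd

# Hypothetical sample rows; column names are assumptions, not the real TTC schema.
delays = pd.DataFrame({
    "route": [501, 501, 504, 504, 510, 510],
    "hour": [8, 17, 8, 9, 17, 17],
    "min_delay": [5, 12, 7, 3, 20, 9],
})

# Per route: how many delays, total delay minutes, and average delay.
by_route = delays.groupby("route")["min_delay"].agg(["count", "sum", "mean"])

# Number of delays recorded in each hour of the day.
by_hour = delays.groupby("hour").size()

print(by_route)
print(by_hour)
```

With matplotlib installed, something like by_route["sum"].plot.bar() would turn the first table into a bar chart.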

Model

The regressor was built as a decision tree using scikit-learn. We needed a regressor because we were trying to predict a numerical value.

To use this regressor, I first had to convert our categorical data into numerical form, so I implemented one-hot encoding. Our model reached an RMSE of ~1.4 min, much lower than the ~20 min baseline described under Baseline Comparison. To check for fit, we compared the RMSE on the training set with the RMSE on the testing set. Since the difference was less than 1 minute and the error itself was low, we took this as evidence that the model was neither overfitting nor badly underfitting. Below is a visualization of the decision tree.

decision tree of our model
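
The steps described in this section (one-hot encoding, fitting a decision tree regressor, and comparing training and testing RMSE) can be sketched roughly as follows. This is a minimal sketch rather than the project's actual code: the data is synthetic, and the column names, category values, max_depth, and split size are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; every value here is made up for the example.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "route": rng.choice(["501", "504", "510"], n),
    "day": rng.choice(["Monday", "Saturday"], n),
    "location": rng.choice(["Stop A", "Stop B", "Stop C"], n),
    "incident": rng.choice(["Mechanical", "Held By", "Investigation"], n),
    "direction": rng.choice(["E/B", "W/B"], n),
})
# Assumed relationship: the delay depends mostly on the incident type, plus noise.
base = df["incident"].map({"Mechanical": 10.0, "Held By": 6.0, "Investigation": 4.0})
df["min_delay"] = base + rng.normal(0, 1.5, n)

features = ["route", "day", "location", "incident", "direction"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["min_delay"], test_size=0.25, random_state=0
)

# One-hot encode the categorical columns, then fit the decision tree on the result.
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    DecisionTreeRegressor(max_depth=4, random_state=0),
)
model.fit(X_train, y_train)


def rmse(y_true, y_pred):
    """Root mean square error, in the same units as the target (minutes)."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


train_rmse = rmse(y_train, model.predict(X_train))
test_rmse = rmse(y_test, model.predict(X_test))
gap = abs(test_rmse - train_rmse)  # a large gap would hint at overfitting

print(f"train RMSE: {train_rmse:.2f}  test RMSE: {test_rmse:.2f}  gap: {gap:.2f}")
```

Wrapping the encoder and the tree in one pipeline means the encoding learned from the training set is reused unchanged on the testing set, and handle_unknown="ignore" keeps prediction from failing on a category that only appears at test time.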

Errors

Here is a visualization of the model's prediction errors. The more yellow dots you see, the greater the error!

visualization of the model's errors

Baseline Comparison

For a baseline, I simply calculated the average delay time for each type of delay incident. The baseline had an RMSE of ~20 min, so the model was a clear improvement.

baseline performance chart comparison to our model
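
A per-incident average baseline of this kind can be sketched as below. The tiny data frames, the column names, and the fallback to the overall mean for an incident type not seen in training are assumptions for illustration, not details taken from the project.

```python
import numpy as np
import pandas as pd

# Hypothetical training and testing rows; values are made up for the example.
train = pd.DataFrame({
    "incident": ["Mechanical", "Mechanical", "Held By", "Held By"],
    "min_delay": [10.0, 14.0, 4.0, 6.0],
})
test = pd.DataFrame({
    "incident": ["Mechanical", "Held By", "Diversion"],
    "min_delay": [11.0, 7.0, 8.0],
})

# Baseline "model": the mean delay for each incident type in the training data.
incident_means = train.groupby("incident")["min_delay"].mean()
overall_mean = train["min_delay"].mean()

# Predict each test row from its incident type; unseen types get the overall mean
# (this fallback is an assumption of the sketch).
baseline_pred = test["incident"].map(incident_means).fillna(overall_mean)

# RMSE of the baseline on the test rows.
baseline_rmse = float(np.sqrt(((test["min_delay"] - baseline_pred) ** 2).mean()))

print(f"baseline RMSE: {baseline_rmse:.2f} min")
```

Computing the model's RMSE on the same test rows gives a like-for-like comparison between the two.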

Contributions

I constructed both the simple decision tree regression model using scikit-learn and the gradient-boosted regression model using CatBoost. I also tested them for fit and compared their accuracy. In addition, I conducted much of the early data exploration (building all the charts featured here) and designed the slide deck.