Project Description

Using NYC Subway data from 2009-2011, I analyzed each subway line to draw conclusions about on-time performance and reliability. I parsed several CSV files and, using NumPy and scikit-learn, generated data that I plotted with Bokeh, an interactive visualization library. For this project, I chose to stick to standard Python data types. In hindsight, it would have been smarter to convert some of these to the DataFrames that Pandas provides, and this is something I’d like to implement in the future.
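
As a rough idea of what that Pandas conversion could look like, here is a minimal sketch. The column names (line, year, month, on_time_pct), the sample rows, and the helper mean_otp_by_line are all hypothetical, made up purely for illustration; they are not the project's actual CSV layout or data.

```python
import io

import pandas as pd

# Hypothetical sample rows; the real CSV layout and values are assumptions.
SAMPLE_CSV = """line,year,month,on_time_pct
4,2009,1,80.0
4,2009,2,82.0
L,2009,1,95.0
L,2009,2,93.0
"""


def mean_otp_by_line(csv_source):
    """Load on-time-performance rows and average them per subway line."""
    # Read line names as strings so numbered and lettered routes
    # ("4", "L") share one dtype.
    df = pd.read_csv(csv_source, dtype={"line": str})
    # Group by line, average the (assumed) on_time_pct column,
    # and sort so the worst-performing lines come first.
    return df.groupby("line")["on_time_pct"].mean().sort_values()


otp = mean_otp_by_line(io.StringIO(SAMPLE_CSV))
print(otp)
```

A real CSV path can be passed in place of the io.StringIO object, since pd.read_csv accepts either. The resulting Series could then be handed to a plotting library, which would replace the manual bookkeeping I did with lists and dictionaries.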

This was my first experience with Bokeh, and it was a pretty intense experience analyzing the real-world chaos that is the NYC Subway system. Even though the system was far more efficient than I thought it would be, there were some key trends. To summarize the report's findings: the longer the route, or the more time the route spends in Manhattan, the worse on-time performance (OTP) gets. Also, any route which interacts with the 4/5/6 route instantly takes an OTP hit, because the 4/5/6 is pretty much the only route which serves Eastern Manhattan.

This helps explain why the Second Avenue Subway is being built and is in such heavy demand: it will be the second route to serve Eastern Manhattan. However, this line connects to the rest of the subway system at only a limited number of stations, which could be a problem. Also, my data collection stopped at 2011, and it would be interesting to use the real-time data from 2012 and beyond. These are items I wish to investigate further at a later point.