Going beyond sample datasets and basic visualizations, I was looking for open data in my professional domain, the aviation and airport industry. Potential candidates for visualization are connections, routes, flight plans, and airport and airline performance. Performance is usually measured by comparing scheduled operations with actual milestones. Delays of arriving or departing flights not only affect passengers and many parties inside and outside the airport community; they also drive sentiment, perception and reputation, and eventually cost money. This kind of data is not something operators like to release, but thanks to the Freedom of Information Act (FOIA), a US federal law, the public gets access to all kinds of statistics. The US DOT (Department of Transportation) offers a variety of datasets for download. One of them covers the on-time arrival performance of US airlines in the US and their delay causes since 2003 (link). You can filter by airline, airport and timeframe, review the summary on the DOT website, or download the set as CSV for your own analysis. I downloaded the complete dataset for 2015, a 2.25 MB file with roughly 13,500 records.
For each month-airport-airline combination, it provides the total arriving flights, the cancelled and diverted flights, and the delay count and total delay time by cause (weather, carrier, NAS, security, late aircraft), covering 14 carriers at 322 airports.
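To get a feel for this structure before opening any visualization tool, the records can be totalled per airline-airport pair with a few lines of Python. This is only a minimal sketch: the column names (carrier, airport, arr_flights and so on) are assumptions about the CSV header, and the sample rows are made up for illustration, not taken from the DOT file.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows. The column names are assumptions about the
# CSV header, not a verified copy of the DOT file layout.
SAMPLE_CSV = """year,month,carrier,airport,arr_flights,arr_del15,arr_cancelled,arr_diverted
2015,1,AA,JFK,1200,250,30,2
2015,2,AA,JFK,1100,200,45,1
2015,1,DL,ATL,9000,1300,60,10
2015,1,DL,JFK,800,150,12,0
"""


def flights_per_carrier_airport(csv_file):
    """Sum arriving flights over all months for each (carrier, airport) pair."""
    totals = defaultdict(int)
    for row in csv.DictReader(csv_file):
        key = (row["carrier"], row["airport"])
        # Counts may be written as "1200.00", so parse via float first.
        totals[key] += int(float(row["arr_flights"]))
    return dict(totals)


totals = flights_per_carrier_airport(io.StringIO(SAMPLE_CSV))
for (carrier, airport), flights in sorted(totals.items()):
    print(carrier, airport, flights)
```

For the real file, the io.StringIO object would be replaced by open("your_download.csv", newline=""), where the file name is whatever you saved the download as.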
This summary dataset is based on an even more detailed dataset of individual flight information (link), which provides the key information for every single flight. You can download this dataset as well, one month at a time; I chose the prezipped file with all columns. The data for January 2015 came to 470,000 records (one month!) in a 200 MB CSV file.
For a simple start, let’s begin with the first dataset, which already summarizes the on-time performance for the complete year 2015 across all carriers and airports. I will skip the LibreOffice chart feature and instead use Tableau Public, which is free to download and use, though it comes with some restrictions. The learning curve is not too steep, and you should be up and running with simple visualizations in no time. Since it accepts text/CSV files, we can jump right in. Let’s have a look at the result, the most basic visualization I could think of: a matrix and a bar chart combined into a dashboard. You can access the viz right here and explore the details.
Initially it shows all data for the whole year 2015. The matrix displays at which airports the carriers operate (a darker color represents more traffic, a white cell means no service).
Please note: some of the data is wrong due to a conversion error when importing and preparing the CSV file. Lesson learned: always spot-check a random sample of the data in the viz before releasing it to the public.
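Such a random check can also be done on the prepared data itself, before it reaches the viz. The sketch below draws a random sample and flags records that violate simple plausibility rules. The field names and the rules are assumptions chosen for illustration, and the damaged row is a made-up example; the post does not say which conversion error actually occurred.

```python
import random

# Hypothetical records as they might look after import; all values are made up.
records = [
    {"carrier": "AA", "airport": "JFK", "month": 1, "arr_flights": 1200, "arr_del15": 250},
    {"carrier": "DL", "airport": "ATL", "month": 1, "arr_flights": 9000, "arr_del15": 1300},
    # A row damaged during import (hypothetical example):
    {"carrier": "UA", "airport": "ORD", "month": 1, "arr_flights": 3, "arr_del15": 410},
]


def implausible(record):
    """Return the reasons why a record looks wrong (empty list if it looks fine)."""
    problems = []
    if record["arr_flights"] < 0 or record["arr_del15"] < 0:
        problems.append("negative count")
    if record["arr_del15"] > record["arr_flights"]:
        problems.append("more delayed flights than arriving flights")
    return problems


def spot_check(rows, sample_size, seed=None):
    """Check a random sample of rows and return the suspicious ones with reasons."""
    rng = random.Random(seed)
    sample = rng.sample(rows, min(sample_size, len(rows)))
    flagged = []
    for row in sample:
        problems = implausible(row)
        if problems:
            flagged.append((row, problems))
    return flagged


flagged = spot_check(records, sample_size=3, seed=42)
for record, problems in flagged:
    print(record["carrier"], record["airport"], record["month"], problems)
```

A sample only gives a hint, of course; once the rules are written down, nothing stops you from running them over every record.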
Clicking into a single cell reveals the arrival delay details of one airline at one airport.
(click into a white cell to show all data again)
Click a column to show the sum of all delays for one airline at all airports, optionally sorted by traffic volume.
Click a row to show the sum of all delays for all airlines at one airport, optionally sorted by the airlines’ traffic volume.
How well does the visualization convey information? Let’s look at the strong and weak points.
+ The matrix shows the available connections by colored cell. The color represents the traffic volume.
– The color density does not work well due to the distribution of the data. Most cells appear in the same orange shade. Out of the ~13,500 records, roughly 11,000 have fewer than 500 flights, and only 24 have more than 10,000 flights.
+ The matrix works as a filter for the bar chart, which creates an exploratory feature.
– In the bar chart we find the performance per month, but the counts are shown on a linear scale: a high number of arrivals next to low counts of delays, diversions and cancellations. There is no visual impact, as the bar chart always looks the same; it does not highlight any problems.
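Both weak points come down to scale. One possible remedy, sketched here under my own assumptions and not necessarily what the next iteration will use, is to color the matrix on a logarithmic scale and to plot the share of delayed flights instead of raw counts. The rows are made-up numbers, and delay_rate and log_shade are hypothetical helper names.

```python
import math

# Made-up yearly totals: (carrier, airport, arriving flights, delayed flights).
rows = [
    ("AA", "JFK", 12000, 2400),
    ("DL", "ATL", 90000, 9000),
    ("OO", "ASE", 400, 120),
]


def delay_rate(flights, delayed):
    """Share of arriving flights that were delayed (0.0 if there were no flights)."""
    return delayed / flights if flights else 0.0


def log_shade(flights, max_flights):
    """Map a flight count to a 0..1 color intensity on a logarithmic scale."""
    if flights <= 0 or max_flights <= 0:
        return 0.0
    return math.log1p(flights) / math.log1p(max_flights)


max_flights = max(flights for _, _, flights, _ in rows)
for carrier, airport, flights, delayed in rows:
    linear_shade = flights / max_flights
    print(
        f"{carrier}-{airport}: delayed {delay_rate(flights, delayed):.0%}, "
        f"linear shade {linear_shade:.2f}, "
        f"log shade {log_shade(flights, max_flights):.2f}"
    )
```

On a linear scale the small station is almost invisible next to the large hub, while the logarithmic shade keeps it distinguishable; the delay rate makes a small station with a high share of delays stand out even though its absolute counts are tiny.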
Conclusion: Though it was quite easy to come up with this first visualization (in less than 20 minutes), there are weak points we need to fix in order to create an impact.
Let’s try another iteration of it.