After reviewing the flaws of the previous visualization of the DOT airline performance data in part 1, I created an improved version with the same dataset. It is a separate viz because the first version had some mistakes due to number conversion during the CSV import. I cleaned and checked the data and used calculated fields to derive the sums of delays.
Airline Performance in the US 2015
The basic concept is still the same: the matrix on the top left controls the dashboard. Initially you see all data for 2015 combined; clicking into cells drills down.
I changed the bar chart to stacked bars, comparing total to delayed flights in one bar for each month.
I moved the split delay reasons into a separate bar chart and added a pie chart which reveals the main reasons for delays (surprisingly, weather and security have the smallest share!). The two lists are Top 10 style lists highlighting the airports and airlines with the most delays.
How does the visualization convey information? Let’s look at the strong and weak points of the second iteration.
+ The presentation of the key information is improved; we can see immediately that the viz is about delays.
– The dashboard starts to look a bit disorganized, and the viewer’s eyes move around without a centre of attention.
+ The bar chart now makes sense: you can compare total flights and delays.
– The detailed delay reasons over time do not add much value, as the distribution of reasons is quite similar.
Conclusion: Spending more time on both data and visualizations improved the overall impact, though the result is still a bit cluttered.
Going beyond sample datasets and basic visualizations, I was looking for open data in my professional domain, the aviation and airport industry. Potential candidates for visualizations are connections, routes, flight plans, and airport and airline performance. Performance is usually the comparison of scheduled operations vs. actual milestones. Delays of arriving or departing flights not only affect passengers and many parties inside and outside the airport community, they also drive sentiment, perception and reputation, and eventually cost money.

This kind of data is not something operators like to release, but thanks to the Freedom of Information Act (FOIA), a US federal law, the public gets access to all kinds of statistics. From the US DOT (Department of Transportation) you can access and download a variety of datasets; one of them is the On-Time Arrival Performance of US airlines in the US and their delay causes since the year 2003 (link). You can filter by airline, airport and timeframe, review the summary on the DOT website, or download the set as CSV for your own analysis. I downloaded the complete dataset for 2015, a 2.25 MB file with roughly 13,500 records.
Arrival Delays in Tableau
Airline Delays in the US in 2015 by DOT
It provides total arriving flights, cancelled and diverted flights, and the delay count and total delay time by reason (weather, carrier, NAS, security, late aircraft) for each month-airport-airline combination, covering 14 carriers at 322 airports.
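To get a feel for the data before building the charts, the per-reason delay counts can be aggregated in a few lines of Python. The field names below (`carrier_ct`, `weather_ct`, etc.) and the sample values are my own illustration of the CSV layout, not necessarily the actual DOT headers:

```python
from collections import Counter

# Synthetic records mimicking month-airport-airline rows from the DOT CSV;
# the column names are assumptions for illustration, not the real headers.
rows = [
    {"carrier": "AA", "airport": "JFK", "month": 1,
     "carrier_ct": 120, "weather_ct": 8, "nas_ct": 95,
     "security_ct": 1, "late_aircraft_ct": 140},
    {"carrier": "DL", "airport": "ATL", "month": 1,
     "carrier_ct": 200, "weather_ct": 15, "nas_ct": 180,
     "security_ct": 2, "late_aircraft_ct": 210},
]

reasons = ["carrier_ct", "weather_ct", "nas_ct", "security_ct", "late_aircraft_ct"]

# Sum the delay counts per reason over all rows.
totals = Counter()
for row in rows:
    for reason in reasons:
        totals[reason] += row[reason]

# Share of each reason in the overall delay count (the pie chart view).
grand_total = sum(totals.values())
shares = {reason: totals[reason] / grand_total for reason in reasons}
print(shares)
```

Even on this toy sample the pattern from the pie chart shows up: security and weather contribute the smallest shares.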
Big Data – nothing new anymore. Occasionally it is still used as a buzzword, but for many companies it has become a productive tool for analyzing huge volumes of data. I noticed the term Big Data disappeared from the 2015 and 2016 Gartner reports “Hype Cycle for Emerging Technologies Identifies the Computing Innovations That Organizations Should Monitor” (compare the 2013 and 2014 reports); I guess it jumped straight into the plateau of productivity.
We know the definition (volume, velocity, variety, plus variability and complexity), read books about it, join conferences and meetups, but for a developer (outside a corporate environment with access to some kind of big data) the question remains: “How do I get Big Data?” While we have access to the various tools and platforms, there is no stream we can easily tap into. Of course you could create random data in the millions, but this would not create content that you could analyze. There are efforts to publish data accessible to the public (open data), but it is hardly large in volume and it is usually not streaming.
In the attempt to get at least a small portion of Big Data, I only found Twitter to play with. As a message-based social networking service it certainly falls into the Big Data space, with more than 310 million active users and 6,000 tweets a second (up from 5,000 tweets a day in 2007). The 3 V’s are ticked, and fortunately Twitter gives developers API access to the data. You can search the tweets, retrieve information about user accounts and listen to the status stream (though you can only tap into the public stream, which is supposedly 1% of all tweets; the gardenhose and firehose are off limits, and you only get expensive access to them through some data resellers). Still, we can poke our nose into Big Data ‘lite’.
I did some experiments with the Twitter search and streaming APIs, also in the context of aviation and airports. I started to persist trends, search results and a filtered live stream into MongoDB. I will share some of my findings soon.
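A minimal sketch of the kind of filtering I mean before persisting tweets: keep only status updates that mention aviation terms, then hand the matches to a MongoDB insert. The keyword list and the tweet dict shape are made up for illustration; the actual streaming client and the pymongo calls are left out:

```python
# Hypothetical keyword filter applied to a tweet stream; in the real setup
# the matching documents would go into a MongoDB collection via pymongo.
KEYWORDS = {"airport", "airline", "delay", "runway", "boarding"}

def is_aviation_tweet(tweet: dict) -> bool:
    """Return True if the tweet text mentions any aviation keyword."""
    words = set(tweet.get("text", "").lower().split())
    return bool(words & KEYWORDS)

# A tiny fake stream standing in for the live public stream.
stream = [
    {"id": 1, "text": "Stuck at the airport again, two hour delay"},
    {"id": 2, "text": "Great coffee this morning"},
]
matches = [t for t in stream if is_aviation_tweet(t)]
print([t["id"] for t in matches])
```

Filtering before inserting keeps the collection small and focused, which matters when the stream runs for days.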
The same question as some years back: how do I get started with it?
The good news: there is plenty of material out there about D3: books, websites, tutorials. I purchased a couple of books from Packt and O’Reilly as PDF versions; I think they have the most titles, though most of them are based on earlier D3 versions.
I won’t recommend a specific one; most follow the same principles: a short intro to the DOM and an explanation of the D3 basics, followed by more or less complex samples. As with any other technology or programming language, you learn most hands-on, and there are so many examples and good tutorials available. I never attempted to understand everything: understand the concepts, dissect existing visualizations and get creative!
I will revisit some of the old visualizations I created and also try to convert some newer ideas into visualizations. Tinkering and playing has become much easier now; I recommend JSFiddle if you want to avoid a local web server setup, though JSFiddle does not support D3 4.0 yet (at the time of writing this).
I am working with D3 again after experimenting with visualization tools like Tableau and quickly realizing that outstanding visualizations can only be created with the power of D3, though you have to work with source code and understand what is going on under the hood of a modern web browser. Please don’t forget D3 is released under the BSD license!
I have been observing and looking at MongoDB for a couple of years already, but never used it in any really serious or professional context. I have been a content PostgreSQL user since 2005 and have been exposed to Oracle for more than 15 years. MongoDB, a NoSQL database, showed up in 2009; it is a free and open-source, cross-platform, document-oriented database that breaks with the traditional relational database concept. You can read more about its history and the key facts at Wikipedia.
Here is a good overview of the key facts about the products in the current NoSQL landscape.
Release 3.0 came out last year, in 2015, with a significant number of improvements and changes. Time to spend time on it again and jump straight into the 3.x release world. I noticed a lot of books, tutorials and websites refer to the earlier versions and unfortunately use deprecated functions and features.
It takes very little to get started. Download and install it on Windows, Linux or Mac and get your hands dirty with the DB shell, which I recommend doing at the beginning to pick up the basics and concepts before using a GUI. Free yourself from the relational DB way of thinking and try to adopt the concept of documents instead. There are also a number of books about MongoDB out there, but the website is quite rich in information and you find myriads of people talking about it in blogs and forums.
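To illustrate what thinking in documents means for data like the airline delays: instead of normalizing airline, airport and delay reasons into separate tables, one month-airport-airline record can live in a single nested document. The field names below are my own illustration, not a schema from the DOT data or the MongoDB docs:

```python
import json

# One month-airport-airline record as a single nested document,
# rather than rows spread over several joined relational tables.
# All field names and values are made up for illustration.
record = {
    "month": 6,
    "airport": "ORD",
    "airline": "UA",
    "flights": {"arrivals": 4200, "cancelled": 35, "diverted": 8},
    "delays": {
        "carrier": 310, "weather": 40, "nas": 290,
        "security": 2, "late_aircraft": 380,
    },
}

# In the mongo shell a document like this is inserted and queried as-is,
# e.g. db.performance.find({"delays.weather": {$gt: 30}}) reaches into
# the nested structure with dot notation instead of a join.
total_delays = sum(record["delays"].values())
print(json.dumps({"total_delays": total_delays}))
```

The whole record travels together, so an aggregation over one airport-month touches one document instead of five tables.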
When you look for a GUI, there are a couple of them, some free, some commercial; I recommend having a try with 3T’s MongoChef. Though it is commercial (with a decent price tag), it is free to use for non-commercial (!) activities.
Here is a good page with an overview of current tools.
Some findings around replication, following up on an earlier blog post.
I have a simple streaming replication setup with one primary and one standby DB server. My reference DB is about 42 GB in size.
A full vacuum will break the replication. Remember, PostgreSQL recreates every table in full mode; this won’t fit through any default-sized xlogs.
A regular vacuum works fine though.
In a complete disaster scenario, where both primary and standby are corrupted, I have to rely on the nightly pg_dump.
The import on the primary DB server (an AWS m2.4xl instance) takes about 3 hours with the default parameters.
We can speed it up significantly by utilizing more than one CPU for the data and index creation using the -j parameter. Constraints cannot be created in parallel though.
sudo -u postgres pg_restore -j 8 -v -d dbname dumpfilename
This way I could reduce the restore time to 70 minutes.
I still have to get the standby in sync with pg_basebackup, which takes another 60 minutes. This I could reduce to 30 minutes by tar-ing and copying the complete DB files over to the standby.
(But pg_basebackup works while the primary server is in use, provided you have enough xlogs to cover the time until you want to link up the standby.)
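For reference, the timings quoted above work out to roughly a 2.6x speedup for the parallel restore and a 2x speedup for the tar-based standby sync:

```python
# Worked arithmetic for the timings quoted above (all in minutes).
default_restore, parallel_restore = 180, 70   # pg_restore: default vs -j 8
basebackup, tar_copy = 60, 30                 # standby sync: pg_basebackup vs tar+copy

restore_speedup = default_restore / parallel_restore
sync_speedup = basebackup / tar_copy
print(round(restore_speedup, 2), sync_speedup)
```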