Big Data – How to create it or get access to it?

Big Data – Nothing new anymore. Occasionally still used as a buzzword, but for many companies it has become a productive tool for analyzing huge volumes of data. I noticed the term Big Data disappeared from the 2015 and 2016 editions of the Gartner report “Hype Cycle for Emerging Technologies Identifies the Computing Innovations That Organizations Should Monitor” (compare the 2013 and 2014 reports); I guess it jumped straight onto the plateau of productivity.

We know the definition (volume, velocity, variety, plus variability and complexity), read books about it, join conferences and meetups, but for a developer outside a corporate environment with access to some kind of big data the question remains: “How do I get Big Data?” While we have access to the various tools and platforms, there is no stream we can easily tap into. Of course you could create random data in the millions, but that would not produce content worth analyzing. There are efforts to publish data accessible to the public (open data), but it is hardly ever large in volume and it is usually not streaming.

In my attempt to get at least a small portion of Big Data, Twitter was the only thing I found to play with. As a message-based social networking service it certainly falls into the Big Data space, with more than 310 million active users and 6.000 tweets a second (up from 5.000 tweets a day in 2007). The 3 V’s are ticked, and fortunately Twitter gives developers API access to the data. You can search the tweets, retrieve information about user accounts and listen to the status stream (though you can only tap into the public stream, which is supposedly 1% of all tweets; the gardenhose and firehose are off limits, and you only get expensive access to them through some data resellers). Still, we can poke our nose into Big Data ‘lite’.

I did some experiments with the Twitter search and streaming APIs, also in the context of aviation and airports. I started to persist trends, search results and a filtered live stream into MongoDB. I will share some of my findings soon.
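As a teaser, here is a minimal sketch of the stream-to-MongoDB setup in Node.js. It is not my production code: it assumes the twit and mongodb npm packages, and the credentials, database name and keyword list are placeholders.

    // Persist a keyword-filtered slice of the public Twitter stream into MongoDB.
    var Twit = require('twit');
    var MongoClient = require('mongodb').MongoClient;

    var T = new Twit({
      consumer_key: 'YOUR_KEY',               // placeholder credentials
      consumer_secret: 'YOUR_SECRET',
      access_token: 'YOUR_TOKEN',
      access_token_secret: 'YOUR_TOKEN_SECRET'
    });

    MongoClient.connect('mongodb://localhost:27017/twitter', function (err, db) {
      if (err) throw err;
      var tweets = db.collection('tweets');

      // statuses/filter taps the public stream (the supposed 1% sample).
      var stream = T.stream('statuses/filter', { track: ['airport', 'aviation'] });

      stream.on('tweet', function (tweet) {
        tweets.insertOne(tweet);              // store the raw tweet document as-is
      });
    });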

Twitter Stream


D3.js Getting started (again..)

The same question as some years back: How to get started with it ?

The learning curve for beginners is certainly a bit steep if you don’t have a basic background in at least HTML, CSS and JavaScript; in that case you should rather go for a visualization tool like Tableau if you are looking for fast results.
The good news: there is plenty of material out there about D3: books, websites, tutorials. I purchased a couple of books from Packt Books and O’Reilly as PDF versions; I think they have the most titles, though most of them are based on earlier D3 versions.

Packt Books

O’Reilly Books

I won’t recommend a specific one; most follow the same principles: a short intro to the DOM, an explanation of the D3 basics, followed by more or less complex samples. As with any other technology or programming language, you learn most hands-on, and there are so many examples and good tutorials available. I never attempted to understand everything: understand the concepts, dissect existing visualizations and get creative !
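To give a flavour of what the books teach, here is a minimal, self-contained sketch of the central D3 concept, the data join: selectAll/data/enter creates one element per datum, which already yields a primitive div-based bar chart. The values and styling are of course arbitrary.

    <!DOCTYPE html>
    <html>
    <body>
    <script src="https://d3js.org/d3.v4.min.js"></script>
    <script>
    var data = [10, 25, 40, 60];

    // The data join: one div enters the DOM for every datum without a match.
    d3.select('body').selectAll('div')
        .data(data)
      .enter().append('div')
        .style('width', function (d) { return d * 4 + 'px'; })
        .style('background', 'steelblue')
        .style('margin', '2px')
        .style('color', 'white')
        .text(function (d) { return d; });
    </script>
    </body>
    </html>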


D3.js Data-Driven-Documents Revisited

I came across this fantastic JavaScript library for DOM manipulation in 2011; at that time it was still fairly new in the visualization space, and I picked it up for a couple of visualizations, moving away from prefuse and other thick-client embedded libraries. Now, 5 years later, it has matured massively and been adopted at large scale. Just recently version 4.0 was released. One of the main features is certainly the modularity.
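A small sketch of what the modularity means in practice: instead of the full bundle you can load individual modules; d3-selection, for example, stands alone without dependencies (the URL below is the standalone build published on d3js.org).

    <!-- Load only the selection module instead of the full D3 bundle -->
    <script src="https://d3js.org/d3-selection.v1.min.js"></script>
    <script>
      // Each standalone module adds its exports to the shared d3 namespace,
      // so the familiar calls keep working.
      d3.select('body').append('p').text('Hello from d3-selection alone');
    </script>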

D3.js Website

I will revisit some of the old visualizations I created and also try to convert some newer ideas into visualizations. Tinkering and playing has become much easier now; I recommend JSFiddle if you want to avoid a local web server setup, though JSFiddle does not support version 4.0 yet (at the time of writing this).

JSFiddle

You can also run your visualization with the good old GitHub Gist and the bl.ocks site, or with the browser plugins.

I am working with D3 again after experimenting with visualization tools like Tableau and quickly realizing that outstanding visualizations can only be created with the power of D3, though you have to work with source code and understand what is going on under the hood of a modern web browser. Please don’t forget D3 is released under the BSD license !

Eventually MongoDB

I have been observing and looking at MongoDB for a couple of years already, but never used it in any really serious or professional context. I have been a content PostgreSQL user since 2005 and have been exposed to Oracle for more than 15 years. MongoDB showed up in 2009 as a NoSQL database: a free and open-source, cross-platform, document-oriented database that breaks with the traditional relational database concept. You can read more about its history and the key facts on Wikipedia.
Here is a good overview of the key facts of the current NoSQL product landscape.

Release 3.0 came out in 2015 with a significant number of improvements and changes. Time to spend time on it again and jump straight into the 3.x release world. I noticed a lot of books, tutorials and websites refer to the earlier versions and unfortunately use deprecated functions and features.

It takes very little to get started. Download and install it on Windows, Linux or Mac and get your hands dirty with the DB shell, which I recommend doing in the beginning to pick up the basics and concepts before using a GUI. Free yourself from the relational DB way of thinking and try to adopt the concept of documents instead. There are also a number of books about MongoDB out there, but the website is quite rich in info and you will find myriads of people talking about it in blogs and forums.
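To give an idea, here is a minimal sketch of a first shell session; the airport database, the flights collection and the document fields are made up for illustration.

    // No schema to define up front: database and collection
    // are created implicitly on the first insert.
    use airport

    db.flights.insert({
      flightNo: "XY123",                             // fictitious sample flight
      movement: "ARR",
      scheduled: ISODate("2016-07-01T10:30:00Z"),
      pax: { total: 154, business: 12, infants: 3 }  // nested document instead of a join
    })

    db.flights.find({ movement: "ARR" })             // query by example
    db.flights.createIndex({ scheduled: 1 })         // indexes work much like in an RDBMS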


When you look for a GUI, there are a couple of them, some free, some commercial; I recommend having a try with 3T’s MongoChef. Though it is commercial (with a decent price tag), it is free to use for non-commercial (!) activities.

MongoChef

Here is a good page with an overview of current tools.

PostgreSQL Replication II

Some findings around replication, following up on an earlier blog post.

I have a simple streaming replication setup with 1 primary and 1 standby DB server. My reference DB is about 42GB in size.

  • A full vacuum will break the replication. Remember, PostgreSQL recreates every table in full mode; the resulting WAL volume won’t fit through any default-sized xlogs.
    A regular vacuum works fine though.
  • In a complete disaster scenario where both primary and standby are corrupted, I have to rely on the nightly pg_dump.
    The import on the primary DB server (an AWS m2.4xl instance) takes about 3 hours with the default parameters.
    We can speed it up significantly by utilizing more than the 1 default CPU for data and index creation via the -j parameter (constraints cannot be created in parallel though):
    sudo -u postgres pg_restore -j 8 -v -d dbname dumpfilename
    This way I could reduce the restore time to 70 minutes.
    I still have to get the standby in sync with pg_basebackup, which takes another 60 minutes. This I could reduce to 30 minutes by tar’ing and copying the complete DB files over to the standby.
    (But pg_basebackup works while the primary server is in use, provided you have enough xlogs to cover the time until you link up the standby; see the sketch below.)
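For reference, a sketch of the pg_basebackup call to (re)build the standby; host name, replication user and data directory are placeholders for your own setup:

    sudo -u postgres pg_basebackup -h primary.example.com -U replicator \
         -D /var/lib/postgresql/9.5/main -v -P --xlog-method=stream

--xlog-method=stream fetches the WAL over a second connection while the base copy runs, which is why it can work against a live primary.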

Airports – Ready for the Cloud ?

Unlike airlines, which are used to distributed operations and to having systems like a reservation system hosted centrally at their hub (originating in the times of mainframe servers, when access to this crucial part of their operations was only available via a remote connection), airports still tend to follow a much more traditional approach. Airport operations are local and not geographically distributed like airlines; over decades they established local on-premise data centres and created a mindset that full control is only possible with the servers and IT services right in their basement. Along come big IT departments with teams of server, network and DB admins plus support.

St. Albert at Dublin Airport, circa 1950 (CC by National Library of Ireland)

This paradigm is slowly changing, because airports need to cut costs and operate more efficiently. In parallel we can observe an attitude change at management level: they are becoming more open to solutions which are outside of their physical control and buy into the concept of SaaS, consuming a service on a subscription basis with a well-defined SLA and availability. This shift started with less crucial back-office systems, like email servers and document repositories, and is now moving on towards more operation-critical systems. Slow adopters, or companies restricted by policies or governance issues, start by moving towards a private cloud, eventually cutting down on operations costs. Airports start to understand that internet availability in the year 2015 reached a commodity level like water and electricity, and they are starting to adopt even public-cloud-hosted services.
Zero-tolerance systems like ATC, or something less life-critical like a FIDS system, will certainly remain local solutions, but AODBs are moving into the cloud. All the vendors jumped on the bandwagon and offer some kind of cloud solution, be it a private cloud offering (with the vendor) or even a deployment to a public cloud. The potential in this approach is the opportunity to offer an AODB solution at a fraction of the price of traditional AODB projects. Deploying to a public cloud, without any local requirements other than an internet connection and a browser, a small airport can start using an AODB without any upfront investment, maybe at a price as cheap as a 3.000,- Euro monthly subscription. This assumes a smaller airport (less than 1 million PAX/year, or something like 25..50 commercial flights a day plus GA) operating with simple requirements (flight plan import and management, operational flight tracking, billing, Type B and AFTN message interfaces).

To answer the question: Yes, they are ready.
But it depends on the IT strategy of medium to big airports, or on the restricted budgets and needs of smaller airports.

Let’s see who is serving the long tail in the airport market !

Airport AODB and Big Data

There are 4 technology terms that almost every IT company in any vertical business picked up during the last few years: Cloud, Big Data, Mobile and Internet of Things. In this series I review some of these buzzwords in the Airport IT business context.

Today I will review the question: Does AODB data qualify for Big Data ?

What is Big Data ? A term that everyone uses, from consumers, consultants and users to people at CxO level. All have it, want it, need it, do it. Everyone joins the crowd, though few really know what it is, how to apply it in one’s own context, or whether one actually has a reasonable use-case for Big Data. I will not attempt to explain Big Data in this article, but rather pick key elements and try to apply them. Big Data is too big to swallow, and it comes in so many use cases, technologies, brands, products, flavours and attributes, from Amazon, Google and Facebook to SAP HANA, NoSQL, MongoDB, MapReduce, Hadoop and the NSA. You can google the term Big Data, read the Wikipedia page about it or read one of the hundreds of books about it; I can recommend a few here that I read.

Terra Ceia Island Farms gladiolus being loaded onto a U.S. Airlines plane at the Sarasota Airport (by State Library and Archives of Florida)

Do not expect an explicit yes or no answer in the end; we will do a Big Data compliance check for every attribute, but I like to stay open to finding use cases.

The 3 (+3*) keywords usually used to define Big Data:

  • Volume
    An enormous amount of data created by human interaction, machines and sensors
  • Velocity
    A massive and continuous flow of data from various sources, usually, but not necessarily, in real-time.
  • Variety
    Data from many sources in structured and unstructured form
  • Veracity*
    Noise and abnormality in data.
  • Validity*
    Correctness and accuracy of the data
  • Volatility*
    Data relevance. Lifespan of the data in which data shall be considered for any analysis.

(* Only the first 3 V’s – Volume, Velocity and Variety – are the canonical Big Data attributes, the additional V’s show up sometimes in discussion or papers, but they basically apply to all data.)

Some basic facts about an AODB product: an Airport Operational Database is an enterprise application with a closed and rather predictable user-base. Depending on the size of the airport and the community it serves, we might have up to 200 concurrent users and 1.000 active accounts. It is not consumer-facing, it is not social media, there are no click-streams and no sensor input.

Volume

We need to build a data scenario for a typical or average AODB setup to discuss the term volume. I will sketch a typical scenario in order to estimate the total number of records or attributes over a typical time-span. Please consider this a simplified sample; the figures might vary a lot depending on various factors, like country and region, international or regional traffic, hub or primary airport, etc.
Usually the airport size in publications or comparisons is derived from the number of PAX and/or aircraft movements per year.
For some references please refer to the ACI website.

A midsized airport with around 20 to 30 million PAX a year might have around 500 turnarounds a day, i.e. 1.000 movements (arrivals and departures).

Let’s assume every movement has ..

  • 20 milestones (scheduled, estimated and actual timings, A-CDM milestones or proprietary timings, ..), each updated 3 times (on average!)
  • 10 load attributes (number of pax, total, by class, infants, cargo and weights, ..), each updated 2 times (on average!)
  • 20 network messages (IATA Type B, AFTN, CFMU, ..); this can vary extremely depending on the system landscape
  • 25 various attributes (flight number, registration, tail number, flight plan ID, aircraft type, callsign, resources, connecting flights, ..), each updated 2 times (on average!)

This results in 20 × 3 + 10 × 2 + 20 + 25 × 2 = 150 attributes (inclusive of updates) per movement. Applied to 1.000 movements a day, we will have
150.000 attributes per day,
4.500.000 attributes per month,
27.000.000 attributes per season (6 months, one seasonal schedule).

This approach is conservative: it covers neither audit nor system logging, and it does not consider a situation where the AODB serves as a central data repository (warehouse?) with data feeds from other systems for permanent storage. In more complex environments I have seen requirements to process and store 10.000.000 ground radar updates or 1.000.000 updates from the building management system per day.

Do 27 million attributes in 6 months qualify as big data volume ?
In this case I would say no, but taking into account the option to store more than one season of data, and maybe covering more than one location in a multi-airport setup: maybe yes !

Velocity

Do 150.000 attributes a day qualify as big data velocity ?
Breaking it down to an average of 1,7 updates a second: rather not. It does not require a Big Data architecture to process this.
Compare with Twitter (not a fair comparison though): ~10.000 tweets a second.

Variety

First, we have almost no unstructured data. Once the AODB has been put in place and into production, there are hardly any changes to the structure of the data. Unstructured data might come with free-format messages or partially free-format content.
The variety also depends on the complexity of the IT landscape and the number of interfaces; AODBs often play the role of a central integration hub, so we face a lot of inbound data streams, but they usually arrive in an agreed format.
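To illustrate the point about agreed formats with partially free content: schematically, an IATA Type B movement (MVT) message looks something like the made-up example below. The first lines follow a fixed structure, while the SI line carries free-text supplementary information.

    MVT
    XY123/01.DABCD.AMS
    AD1020/1035 EA1150 LHR
    SI DEICING CAUSED DELAYED PUSHBACK

(Fictitious flight XY123 on the 1st, registration D-ABCD, departing AMS: actual off-block 10:20, airborne 10:35, estimated arrival 11:50 at LHR.)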


Thoughts on Big Data Analysis

One of the selling points of Big Data is the analytics you can apply to the vast amount of data to identify patterns and extract useful knowledge and business value from the data collected. This might help to improve your business strategies or processes, focus on certain areas of value, or even predict future scenarios given certain repeating conditions. We can definitely see value added in the AODB context here, though a lot of the data is given and can be adjusted only with limitations or not at all, e.g. flight schedules provided by airlines (usually the result of slot coordination procedures) and the airport’s physical resources (stands, gates, belts, ..). The potential lies in the analysis of actual data: even though the airport can’t necessarily change schedules, patterns emerging from actual vs. scheduled data (e.g. delays in dependency of certain weather, season, etc.) let us optimize the resources. Analysing connecting-flight info can help to improve turnaround ground times and avoid delays; detecting frequent aircraft changes can help to improve gate allocation and other scenarios.
And looking at the big picture: if we were able to collect data network-wide, countrywide or at a level like Eurocontrol, Big Data analysis would certainly create more valuable insights and improve on-time performance.

Big Data Bookshelf

Big Data: Principles and Best Practices of Scalable Realtime Data Systems

Big Data For Dummies

Ethics of Big Data: Balancing Risk and Innovation

Data Science for Business: What you need to know about data mining and data-analytic thinking

Some assorted links