Big Data – How to create it or get access to it?

Big Data – nothing new anymore. It is occasionally still used as a buzzword, but for many companies it has become a productive tool to analyze huge volumes of data. I noticed the term Big Data disappeared from the 2015 and 2016 editions of the Gartner report “Hype Cycle for Emerging Technologies Identifies the Computing Innovations That Organizations Should Monitor” (compare the 2013 and 2014 reports); I guess it jumped straight into the plateau of productivity.

We know the definition (volume, velocity, variety, plus variability and complexity), read books about it, and join conferences and meetups, but for a developer outside a corporate environment with access to some kind of big data the question remains: “How do I get Big Data?” While we have access to the various tools and platforms, there is no stream we can easily tap into. Of course you could create millions of random records, but that would not produce content you could meaningfully analyze. There are efforts to publish data accessible to the public (open data), but it is hardly a large volume and it is usually not streaming.

In my attempt to get at least a small portion of Big Data, Twitter was the only source I found to play with. As a message-based social networking service it certainly falls into the Big Data space, with more than 310 million active users and 6,000 tweets a second (up from 5,000 tweets a day in 2007). The 3 V’s are ticked, and fortunately Twitter gives developers API access to the data. You can search tweets, retrieve information about user accounts and listen to the status stream (though you can only tap into the public stream, which is supposedly 1% of the total; the gardenhose and firehose are off limits, and you only get expensive access to them through some data resellers). Still, we can poke our nose into Big Data ‘lite’.

I did some experiments with the Twitter search and streaming API, also in the context of aviation and airports. I started to persist trends, search results and the filtered live stream into MongoDB. I will share some of my findings soon.
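As a sketch of what such an experiment can look like: the snippet below trims a raw tweet down to the fields worth persisting before it goes into MongoDB. The field names follow Twitter's classic v1.1 status payload (`id_str`, `text`, `user`, `entities`); the function name and the chosen fields are my own illustration, and the actual MongoDB write is only indicated in a comment since it needs a running database.

```python
from datetime import datetime, timezone

def trim_status(status: dict) -> dict:
    """Reduce a raw v1.1 Twitter status object to the fields we persist.

    The key names follow Twitter's classic streaming payload; adjust
    them if your API version delivers a different layout.
    """
    return {
        "_id": status["id_str"],   # tweet id doubles as a natural MongoDB key
        "text": status["text"],
        "user": status["user"]["screen_name"],
        "hashtags": [h["text"].lower()
                     for h in status.get("entities", {}).get("hashtags", [])],
        "fetched_at": datetime.now(timezone.utc).isoformat(),
    }

# Example payload, shaped like a streaming status:
raw = {
    "id_str": "740000000000000001",
    "text": "Arrived on time at #changi",
    "user": {"screen_name": "avgeek"},
    "entities": {"hashtags": [{"text": "Changi"}]},
}
doc = trim_status(raw)
print(doc["hashtags"])   # ['changi']
# A pymongo collection would then take: collection.insert_one(doc)
```

Keeping only a handful of fields keeps the collection small enough to index and query comfortably, which matters once the stream runs for weeks.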

Twitter Stream



Airport AODB and Big Data

There are four technology terms that almost every IT company in any vertical business has picked up during the last few years: Cloud, Big Data, Mobile and the Internet of Things. In this series I review some of these buzzwords in the context of the airport IT business.

Today I will review the question: does AODB data qualify as Big Data?

What is Big Data? A term that everyone uses, from consumers, consultants and users to people at CxO level. All have it, want it, need it, do it. Everyone joins the crowd, though few really know what it is, how to apply it in their own context, or whether they actually have a reasonable use case for Big Data. I will not attempt to explain Big Data in this article, but rather pick key elements and try to apply them. Big Data is too big to swallow, and it comes in so many use cases, technologies, brands, products, flavours and attributes, from Amazon, Google and Facebook to SAP HANA, NoSQL, MongoDB, MapReduce, Hadoop and the NSA. You can google the term Big Data, read the Wikipedia page about it or read one of the hundreds of books on the topic; I can recommend a few here that I have read.


Terra Ceia Island Farms gladiolus being loaded onto a U.S. Airlines plane at the Sarasota Airport (by State Library and Archives of Florida)

Do not expect an explicit yes or no answer in the end. We will do a Big Data compliance check for each paradigm, but I still like to stay open to finding use cases.

The 3 (+3*) keywords usually used to define Big Data:

  • Volume
    An enormous amount of data created by human interaction, machines and sensors
  • Velocity
    A massive and continuous flow of data from various sources, usually, but not necessarily, in real time
  • Variety
    Data from many sources in structured and unstructured form
  • Veracity*
    Noise and abnormality in data.
  • Validity*
    Correctness and accuracy of the data
  • Volatility*
    Data relevance: the lifespan during which the data shall be considered for analysis

(* Only the first 3 V’s – Volume, Velocity and Variety – are the canonical Big Data attributes; the additional V’s show up sometimes in discussions or papers, but they basically apply to all data.)

Some basic facts about an AODB product: an Airport Operational Database is an enterprise application with a closed and rather predictable user base. Depending on the size of the airport and the community it serves, we might have up to 200 concurrent users and 1,000 active accounts. It is not consumer facing; there is no social media, no click-streams and no sensor input.


To discuss the term volume we need to build a data scenario for a typical or average AODB setup. I will try to create a typical scenario in order to arrive at a total number of records or attributes over a typical time span. Please consider this a simplified sample; the figures may vary a lot depending on various factors, such as country and region, international or regional traffic, hub or primary airport status, etc.
In publications or comparisons, airport size is usually derived from the number of PAX and/or aircraft movements per year.
For some references please refer to the ACI website.

A midsized airport with around 20 to 30 million PAX a year might have around 500 turnarounds a day, which translates to 1,000 movements (arrival and departure).

Let’s assume every movement has:

  • 20 milestones (scheduled, estimated and actual timings, A-CDM milestones or proprietary timings, ..). Each of these milestones gets 3 updates (on average!)
  • 10 load attributes (number of PAX in total and by class, infants, cargo and weights, ..). Each of these attributes gets 2 updates (on average!)
  • 20 network messages (IATA Type B, AFTN, CFMU, ..). This can vary extremely depending on the system landscape.
  • 25 various attributes (flight number, registration, tail number, flight plan ID, aircraft type, callsign, resources, connecting flights, ..). Each of these attributes gets 2 updates (on average!)

This results in 150 attributes (inclusive of updates) per movement. Applied to 1,000 movements a day, we get

150,000 attributes per day,
4,500,000 attributes per month,
27,000,000 attributes per season (6 months, one seasonal schedule)
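The arithmetic behind these totals can be written down in a few lines. All figures are the assumptions from the scenario above, not measurements:

```python
# Back-of-the-envelope volume estimate for the AODB scenario above.
MILESTONES, MILESTONE_UPDATES = 20, 3
LOAD_ATTRS, LOAD_UPDATES = 10, 2
NETWORK_MESSAGES = 20
MISC_ATTRS, MISC_UPDATES = 25, 2
MOVEMENTS_PER_DAY = 1_000

per_movement = (MILESTONES * MILESTONE_UPDATES    # 60
                + LOAD_ATTRS * LOAD_UPDATES       # 20
                + NETWORK_MESSAGES                # 20
                + MISC_ATTRS * MISC_UPDATES)      # 50

per_day = per_movement * MOVEMENTS_PER_DAY
per_month = per_day * 30
per_season = per_month * 6    # one seasonal schedule = 6 months

print(per_movement, per_day, per_season)   # 150 150000 27000000
```

Adjust the constants to your own airport's profile; the structure of the estimate stays the same.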

This approach is conservative: it does not cover audit or system logging, nor does it consider a situation where the AODB serves as a central data repository (warehouse?) with data feeds from other systems for permanent storage. In more complex environments I have seen requirements to process and store 10,000,000 ground radar updates or 1,000,000 updates from the building management system per day.

Do 27 million attributes in 6 months qualify as big data volume?
In this case I would say no, but taking into account the option to store more than one season of data and perhaps to cover more than one location in a multi-airport setup, maybe yes!


Do 150,000 attributes a day qualify as big data velocity?
Breaking it down to an average of 1.7 updates a second: rather not. It does not require a Big Data architecture to process this.
Compare with Twitter (not a fair comparison, though): ~10,000 tweets a second.
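The 1.7 updates a second follow directly from the daily volume estimate:

```python
# Average update rate implied by the volume estimate above.
# Note this is a flat average; real traffic peaks during the day.
ATTRIBUTES_PER_DAY = 150_000
SECONDS_PER_DAY = 24 * 60 * 60

rate = ATTRIBUTES_PER_DAY / SECONDS_PER_DAY
print(f"{rate:.1f} updates/second")   # 1.7 updates/second
```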


First, we have almost no unstructured data. Once the AODB has been put in place and is in production, there are hardly any changes in the structure of the data. Unstructured data might come with free-format messages or partially free-format content.
The variety also depends on the complexity of the IT landscape and the number of interfaces. AODBs often play the role of a central system integration hub, so we face a lot of inbound data streams, but they usually arrive in an agreed format.
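To illustrate the structured-vs-free-text distinction, here is a deliberately toy check for whether an inbound message looks like one of the structured IATA Type B message types an AODB typically ingests. Real Type B parsing (addresses, priority codes, line formats) is far more involved; the message samples and the function are my own illustration.

```python
# Illustrative only: classify an inbound message by whether its first
# line is a known structured Type B message identifier.
STRUCTURED_TYPES = {"MVT", "LDM", "PTM", "DIV"}

def classify(message: str) -> str:
    first_line = message.strip().splitlines()[0].strip()
    return "structured" if first_line in STRUCTURED_TYPES else "free-text"

mvt = "MVT\nXY123/05.ABCDE.SIN\nAA0705/0712"
remark = "Pushback delayed, tug unavailable"
print(classify(mvt), classify(remark))   # structured free-text
```

In practice the structured share dominates by far, which is exactly why the variety criterion is the weakest of the three V's for an AODB.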


Thoughts on Big Data Analysis

One of the selling points of Big Data is the analytics you can apply to the vast amount of data, to identify patterns and extract useful knowledge and business value from what has been collected. This might help to improve your business strategies or processes, focus on certain areas of value, or even predict future scenarios given certain repeating conditions. We can definitely see value added in the AODB context here, though much of the data is given and can be adjusted only with limitations or not at all, e.g. flight schedules provided by airlines (usually the result of slot coordination procedures) and the airport's physical resources (stands, gates, belts, ..). The potential lies in the analysis of actual data: even if the airport can't necessarily change schedules, patterns emerging from actual vs. scheduled data (e.g. delays depending on certain weather, season, etc.) let us optimize the resources. Analysing connecting-flight information can help to improve turnaround ground times and avoid delays; detecting frequent aircraft changes can help to improve gate allocation, among other scenarios.
And looking at the big picture: if we were able to collect data network-wide, countrywide or on a level like Eurocontrol, Big Data analysis would certainly create more valuable insights and improve on-time performance.
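A minimal sketch of the "actual vs. scheduled" pattern analysis described above: group arrival delays by an observed condition (here: weather) and compare the averages. The records and field names are made up for the example; a real analysis would draw on persisted AODB movement data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical movement records with observed delay and weather condition.
movements = [
    {"flight": "XY123", "weather": "fog",   "delay_min": 35},
    {"flight": "XY456", "weather": "fog",   "delay_min": 25},
    {"flight": "XY789", "weather": "clear", "delay_min": 5},
    {"flight": "XY321", "weather": "clear", "delay_min": 3},
]

# Group delays by condition, then average per group.
delays_by_weather = defaultdict(list)
for m in movements:
    delays_by_weather[m["weather"]].append(m["delay_min"])

avg_delay = {w: mean(d) for w, d in delays_by_weather.items()}
print(avg_delay)   # {'fog': 30, 'clear': 4}
```

The same group-and-compare pattern scales from a one-liner like this to a full historical analysis over seasons of movement data; only the grouping keys (weather, season, aircraft type, ..) and the data volume change.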

Big Data Bookshelf

Big Data: Principles and Best Practices of Scalable Realtime Data Systems

Big Data For Dummies

Ethics of Big Data: Balancing Risk and Innovation

Data Science for Business: What you need to know about data mining and data-analytic thinking

Some assorted links