Evolution of Data Science, Part 3 – Moving forward: Considerations for Big Data & Analytics in the Cloud era

In 2013, 5 exabytes of content were created per day.

Every minute, there are 200,000 tweets, 2.5 million Facebook shares and 4 million Google searches.

By 2020, we’ll have 30 billion wireless devices (check out more Internet stats)

Houston, we have a big data challenge.

What is Big Data? Do you have it?

IBM developed a conceptual framework, the “4 V’s” of Big Data – Volume (how big is it?), Variety (is it of differing types?), Veracity (what level of uncertainty lies in the data?) and Velocity (how fast is it arriving?).

This can help characterize if you are dealing with big data.

It’s less useful in day-to-day practice – you “just know” when you’re dealing with big data, because you’ll run into all sorts of issues collecting and analyzing it, or you’ll already be using the well-known big data tools. Is that you? You might have big data.

Why the cloud?

As a start-up, it would have been incredibly difficult to do what we are doing with the upfront costs of physical infrastructure. For us, that was never really an option, so we worked out our total cost of ownership for our predicted workload on various cloud providers, and went with AWS. It was also around the time that Amazon Redshift, the petabyte-scale relational data store, was released. Redshift fit our use case perfectly, and we still use it extensively today.

Once upon a time, the ability to store and process truly big datasets was accessible only to large corporates – the companies that could afford to host large physical infrastructure. The cloud changed that, and with this rise in accessibility we have seen a massive uptake of the tools and techniques of Data Science.

The industry is also realizing the value of the data it holds, and the magnitude of success that can be achieved by using that data to make data-driven decisions. With this increase in accessibility and capability, many big data technology companies are emerging today, and huge further growth is still predicted: the industry will be worth $50 billion by 2017. Not bad, considering that the information on the Internet weighs about 50 grams.

What are your requirements?

As with anything in IT, the first thing to do is a detailed requirements analysis: take a step back, look at your data process flow, and check whether you are making the most of the analysis available to you.

Some key considerations:

  • What are your key business drivers?
    • What will you gain from this exercise?
    • How will it save/make the company money?
  • What’s your data structure, or lack thereof? Structured, unstructured or semi-structured?
    • This will tell you what storage and analysis tools you need. Are they flat files, databases, logs?
  • Where is it coming from?
    • If your company has multiple data sources, it can make sense to organize a data warehouse into dimension and fact tables, maybe using something like Microsoft Analysis Services
    • What do you need to do to ETL?
  • What are your “real time” requirements?
    • Consider storing what you have in aggregate form for easy access
    • Consider caching the big queries – Redis, Memcached
    • Consider search tools like Elasticsearch for unstructured real time data
  • Do you need aggregations on the fly, or do your aggregations require long historical context?
    • Consider big data streaming options like Kafka, Amazon Kinesis, Flume etc. if you need real-time aggregations and alerts in response to high-velocity data
    • Consider MapReduce or Redshift if you need to query historical data
  • What skills does your team currently have?
    • Tools like Hive, Impala and Spark (often via front ends like Hue) make querying huge data sets easy for anyone with basic SQL knowledge
  • How will you make use of your data analytics?
    • Is your data for exploratory analysis, or fixed charting for end-users?
    • How often will you access it?
    • What will you visualize? Do you need Tableau, Pentaho, Kibana etc.?
  • So, what tools will you use and how will they fit together?
    • These might be influenced by your current technologies & the services available to you from your cloud provider
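The pre-aggregation idea above can be sketched in a few lines of Python: roll raw events up into daily counts once, so the common “events per day” question never has to scan the raw store again. The event shape and field names here are illustrative assumptions, not a real schema.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw events as they might land from an ingestion pipeline.
raw_events = [
    {"user": "a", "ts": "2014-06-01T09:15:00"},
    {"user": "b", "ts": "2014-06-01T18:40:00"},
    {"user": "a", "ts": "2014-06-02T07:05:00"},
]

def rollup_daily(events):
    """Aggregate raw events into per-day counts for cheap, repeated querying."""
    counts = defaultdict(int)
    for event in events:
        day = datetime.fromisoformat(event["ts"]).date().isoformat()
        counts[day] += 1
    return dict(counts)

# The aggregate table is tiny compared to the raw store, so dashboards can
# query it directly instead of scanning every raw event.
daily = rollup_daily(raw_events)
```

In practice the rollup would run on a schedule (or as a streaming job) and land in its own table, but the shape of the trade-off is the same: a little write-time work buys much cheaper reads.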
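The query-caching suggestion is the classic cache-aside pattern. A real deployment would put Redis or Memcached behind `cache`; here a plain dict stands in so the sketch is self-contained, and `run_expensive_query` is a hypothetical placeholder for a slow warehouse query.

```python
cache = {}       # stand-in for Redis/Memcached
query_runs = 0   # instrumentation to show the cache is doing its job

def run_expensive_query(sql):
    """Placeholder for a slow query against the data warehouse."""
    global query_runs
    query_runs += 1
    return f"results for: {sql}"

def cached_query(sql):
    # Cache-aside: check the cache first, fall back to the warehouse on a
    # miss, then populate the cache so the next caller gets a fast hit.
    if sql in cache:
        return cache[sql]
    result = run_expensive_query(sql)
    cache[sql] = result
    return result

cached_query("SELECT count(*) FROM views")  # miss: hits the warehouse
cached_query("SELECT count(*) FROM views")  # hit: served from the cache
```

With Redis you would also set a TTL on each key so stale results expire, rather than letting cached aggregates live forever.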

Tips for Big Data in the cloud

  • Keep it flexible
    • Design your systems to be loosely coupled and reduce single points of failure
  • Store wisely
    • Keep in mind your accessibility requirements before you archive anything, but also consider archiving data you aren’t using. Readily available query infrastructure is costly, while object and archive stores in the cloud are cheap.
    • Sometimes, your data comes from the cloud. In these instances, it makes sense to keep it there
  • Think through your data pipeline thoroughly, and define your data models carefully
    • The data model is the heart of your big data operations. A well-designed data model, defining your structured, semi-structured and unstructured stores, is critical to success.
    • Things can be difficult to change later when you have high-throughput systems – schemaless or not
    • Really think about how you will want to be able to query the data!
  • Leverage the work & tools of others
    • Don’t re-invent the wheel: Use libraries and PaaS services
  • Automate everything possible
    • Yeah, this will save you a lot of time in the long run
  • Check your agility
    • Using the cloud today is all about agility. Run lots of tests, see what works, and make sure you are responsive to change. If you’re not, ask why.
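As a toy illustration of the “store wisely” tip, the decision of what to archive can be as simple as a last-access cutoff. The dataset catalogue and the 90-day threshold below are invented for the example; in practice you would drive this from your storage system’s access metadata and a lifecycle policy on your cloud object store.

```python
from datetime import date, timedelta

# Hypothetical catalogue of datasets with the date each was last queried.
datasets = {
    "clickstream_raw_2013": date(2014, 1, 3),
    "daily_rollups": date(2014, 6, 20),
    "ad_hoc_export": date(2013, 11, 30),
}

def archive_candidates(catalogue, today, max_idle_days=90):
    """Return datasets untouched for longer than the idle threshold."""
    cutoff = today - timedelta(days=max_idle_days)
    return sorted(name for name, last_used in catalogue.items()
                  if last_used < cutoff)

# Anything this returns can move off costly query infrastructure and into
# a cheap object or archive store.
stale = archive_candidates(datasets, today=date(2014, 6, 25))
```

Most cloud object stores let you codify exactly this rule as a lifecycle policy, so the tiering happens automatically instead of via a script.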

Many of these topics could have been whole blog posts themselves – I hope these tips give you at least a good starting point on your Big Data and Analytics adventure! Which would you like to hear more about? Tweet and tell me @ChrisJRiddell

Find out more about Parrot Analytics here.

Good luck with your Big Data!

-By Chris Riddell 
