Big Data on the Cloud

I’m excited to share a successful implementation of the MapReduce framework on Amazon’s AWS infrastructure.

Our end client was a Fortune 500 beverage company that received input data from a couple of market research companies, such as Nielsen. The first step was to move away from the proprietary client, Nitro, and gain more control over the monthly data analysis by storing the data in a MySQL database. Doing so brought the data analysis period down from 5 weeks to 2 weeks.

Since then, my team has implemented a MapReduce-style architecture that distributes data processing across parallel worker nodes: ETL runs on Windows nodes and data analysis runs on Linux nodes. Their output is aggregated on a master node, which releases the data to the dashboard. To publish the data we used iCharts, a leading online charting system that supports various input data formats and gives users great flexibility.
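To make the worker/master flow concrete, here is a minimal sketch in Python. It is not our production code: the records, the brand names, and the function names (split, map_chunk, reduce_partials, run_job) are all made up for illustration, and a local thread pool stands in for what were separate EC2 worker nodes in the real setup.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical input records: (brand, units_sold). Invented for illustration.
RECORDS = [
    ("cola", 10), ("lemon", 4), ("cola", 7),
    ("orange", 3), ("lemon", 6), ("cola", 1),
]

def split(records, n_workers):
    """Partition the input into n_workers chunks (round-robin)."""
    return [records[i::n_workers] for i in range(n_workers)]

def map_chunk(chunk):
    """Worker step: compute partial totals per brand for one chunk."""
    partial = Counter()
    for brand, units in chunk:
        partial[brand] += units
    return partial

def reduce_partials(partials):
    """Master step: merge the partial totals from all workers."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return dict(total)

def run_job(records, n_workers=3):
    """Fan the chunks out to parallel workers, then aggregate on the master."""
    chunks = split(records, n_workers)
    # Threads stand in for worker nodes here; this is an assumption of the
    # sketch, not how the EC2-based system was wired together.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce_partials(partials)

totals = run_job(RECORDS)
print(totals)
```

Because each worker only sees its own chunk and the master only merges partial results, adding more workers changes how the input is split, not the final totals.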

The amazing fact is that the total processing time has come down from the initial 12 days to just 3 days, at the same infrastructure cost! This is possible only because of the Cloud: the nodes are started on the fly by AWS scripts as soon as the monthly data becomes available and, after 3 days of processing, are stopped and parked in the AWS account again. Here are a few stats to give a feel for the size of the data I am talking about:

  • Total raw data size : 130GB
  • Total data points crunched : 34 trillion
  • Data points after analysis : 240 billion
  • Total reports : 121 (on average 6 table charts per report)
  • Total EC2 instances : 11 Windows, 12 Linux (for 3 days)
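
A few derived numbers help put these stats in perspective. The short Python calculation below uses only the figures quoted in this post; the one assumption is that every instance runs around the clock for the full 3 days, which is a simplification.

```python
# Figures quoted in this post.
RAW_POINTS = 34 * 10**12        # data points crunched
ANALYZED_POINTS = 240 * 10**9   # data points after analysis
REPORTS = 121
CHARTS_PER_REPORT = 6           # an average, so the chart total is approximate
WINDOWS_NODES = 11
LINUX_NODES = 12
DAYS_BEFORE = 12
DAYS_AFTER = 3

# How much the analysis condenses the raw data points.
reduction_ratio = RAW_POINTS / ANALYZED_POINTS

# Overall speedup in elapsed processing time.
speedup = DAYS_BEFORE / DAYS_AFTER

# Assumption: all instances run 24 hours a day for the whole 3-day window.
instance_hours = (WINDOWS_NODES + LINUX_NODES) * DAYS_AFTER * 24

approx_charts = REPORTS * CHARTS_PER_REPORT

print(f"reduction ratio : about {reduction_ratio:.1f} to 1")
print(f"speedup         : {speedup:.0f}x")
print(f"instance-hours  : {instance_hours}")
print(f"table charts    : about {approx_charts}")
```

In other words, roughly 142 raw data points are condensed into each analyzed data point, and each monthly run consumes on the order of 1,656 instance-hours under the always-on assumption.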

Though by today’s standards this does not quite qualify as “Big Data”, I’m sure that this solution can easily scale to handle petabytes of data. The Cloud has tremendous power when it comes to elasticity and scalability, and this project was a perfect example of that.

Manoj Patil is the Chief Architect of Big Data Analytics at Compassites. He brings 13+ years of well-rounded experience in architecting and implementing solutions for enterprises across verticals such as Data Analytics, Supply Chain Management, BI, Payments, and Life Sciences. He loves working on solutions for Big Data customers using different cloud services. Prior to joining Compassites, Manoj worked in various roles, including programmer, project manager, and delivery manager, at organizations such as Talentbea Inc, Persistent Systems and Vmoksha Technologies.

