AWS-based Big Data project

Challenge

After initial research, FortySeven engineers outlined the following key requirements for the solution:

  • Extensive reporting on application usage data collected from users' devices, with the volume of reported data expected to grow over time
  • Ability to calculate existing and new statistic types over all collected data; in other words, the raw data must be stored indefinitely
  • Ability to filter the calculated statistics and build charts based on specific criteria
  • Ability to run the statistical calculations automatically for specific periods

FortySeven developers implemented a multi-stage data pipeline:

  • First stage: collection of raw data. To lower the cost of the reporting API and protect it from unauthorized access, the web development team used the Amazon Cognito service, which provides fine-grained permission management for both authenticated users and guests. Reports are collected by the Android apps on the devices and sent directly to an S3 bucket with Cognito-controlled, write-only access. This keeps the transaction secure: it is impossible to steal information from an Amazon S3 bucket that can only be written to. The result of this stage is a large number of small report files in a staging S3 bucket (a sketch of the device-side upload follows this list)
  • Second stage: data compression. Before storing data in the main bucket, the pipeline converts it into larger, compressed, columnar-format (Apache Parquet) files optimized for partial reads. The conversion is performed by a Spark job running on an AWS EMR Hadoop cluster built on Spot Instances, which is started and shut down automatically. Here the benefit AWS brings to the project is clear: the EMR service manages the Hadoop cluster automatically, without engineering effort (see the compaction sketch after this list)
  • Third stage: statistics calculation. FortySeven engineers launched a similar EMR-based cluster and ran a distributed Spark job to calculate the required statistics. The statistical output is saved in a MySQL database hosted on AWS; the result consists of several SQL tables optimized for date-range queries (see the statistics sketch after this list)
  • Fourth stage: visualization of statistics. A web interface running on Apache Tomcat uses the D3 JavaScript library to draw charts from the results stored in the SQL database, filtered according to the chosen criteria (a servlet sketch follows this list).
  • The app's users can identify an item by scanning its barcode or QR code
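The source does not include client code, but the device-side upload in the first stage can be pictured as follows. This is a minimal sketch assuming the AWS SDK for Android; the identity pool ID, region, bucket name, and key layout are hypothetical placeholders, not values from the project.

```java
import android.content.Context;

import com.amazonaws.auth.CognitoCachingCredentialsProvider;
import com.amazonaws.regions.Region;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.s3.AmazonS3Client;

import java.io.File;

public class ReportUploader {

    // Placeholder identity pool and bucket names, for illustration only.
    private static final String IDENTITY_POOL_ID =
            "us-east-1:00000000-0000-0000-0000-000000000000";
    private static final String REPORT_BUCKET = "example-usage-reports";

    /** Uploads one report file; call from a background thread, not the UI thread. */
    public void upload(Context context, File reportFile) {
        // Cognito hands out temporary credentials whose IAM role allows
        // s3:PutObject on the report bucket and nothing else, so a client
        // can write its own reports but never read anyone else's data.
        CognitoCachingCredentialsProvider credentials =
                new CognitoCachingCredentialsProvider(
                        context, IDENTITY_POOL_ID, Regions.US_EAST_1);

        AmazonS3Client s3 = new AmazonS3Client(credentials);
        s3.setRegion(Region.getRegion(Regions.US_EAST_1));

        // Key each report by Cognito identity and timestamp so uploads never collide.
        String key = "reports/" + credentials.getIdentityId()
                + "/" + System.currentTimeMillis() + ".json";
        s3.putObject(REPORT_BUCKET, key, reportFile);
    }
}
```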
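The second-stage conversion can be pictured as a small Spark application in Java. The sketch below is illustrative only: the bucket paths and the report_date partition column are assumptions, not the project's actual job.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReportCompactionJob {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("report-compaction")
                .getOrCreate();

        // Read the many small JSON report files from the staging bucket.
        Dataset<Row> reports = spark.read().json("s3://example-usage-reports/reports/");

        // Coalesce them into a handful of large files and write them as
        // compressed Parquet (snappy by default), partitioned by date so
        // later jobs can read only the date ranges they need.
        reports.coalesce(16)
                .write()
                .partitionBy("report_date")
                .parquet("s3://example-report-archive/parquet/");

        spark.stop();
    }
}
```

Submitted as an EMR step, a job like this lets a transient cluster shut itself down as soon as the step finishes.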
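The third stage can be sketched the same way: a Spark job that aggregates the Parquet archive and appends the result to MySQL over JDBC. The query, column names, and database endpoint below are illustrative; the source does not list the actual statistics calculated.

```java
import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class DailyStatsJob {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("daily-stats")
                .getOrCreate();

        // The date partitioning from stage two means only the needed
        // partitions are actually read here.
        spark.read()
                .parquet("s3://example-report-archive/parquet/")
                .createOrReplaceTempView("reports");

        Dataset<Row> daily = spark.sql(
                "SELECT report_date, COUNT(*) AS events, "
                        + "COUNT(DISTINCT device_id) AS active_devices "
                        + "FROM reports GROUP BY report_date");

        // Append the aggregates to a MySQL table indexed on report_date,
        // which is what makes the web UI's date-range queries fast.
        Properties jdbc = new Properties();
        jdbc.setProperty("user", "stats");
        jdbc.setProperty("password", System.getenv("STATS_DB_PASSWORD"));
        jdbc.setProperty("driver", "com.mysql.cj.jdbc.Driver");

        daily.write()
                .mode(SaveMode.Append)
                .jdbc("jdbc:mysql://stats-db.example.com:3306/stats", "daily_stats", jdbc);

        spark.stop();
    }
}
```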
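On the visualization side, the Tomcat application needs an endpoint that turns a date range into JSON for the D3 charts. The servlet below is a hypothetical sketch (the endpoint, table, and column names are assumed); it is not the project's actual interface.

```java
import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Serves precomputed statistics as JSON for the D3 charts.
@WebServlet("/api/stats")
public class StatsServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        String from = req.getParameter("from"); // e.g. 2015-01-01
        String to = req.getParameter("to");     // e.g. 2015-01-31

        StringBuilder json = new StringBuilder("[");
        try (Connection db = DriverManager.getConnection(
                     "jdbc:mysql://stats-db.example.com:3306/stats",
                     "web", System.getenv("WEB_DB_PASSWORD"));
             PreparedStatement stmt = db.prepareStatement(
                     "SELECT report_date, events, active_devices FROM daily_stats "
                             + "WHERE report_date BETWEEN ? AND ? ORDER BY report_date")) {
            stmt.setString(1, from);
            stmt.setString(2, to);
            try (ResultSet rows = stmt.executeQuery()) {
                while (rows.next()) {
                    if (json.length() > 1) json.append(',');
                    json.append(String.format(
                            "{\"date\":\"%s\",\"events\":%d,\"activeDevices\":%d}",
                            rows.getString(1), rows.getLong(2), rows.getLong(3)));
                }
            }
        } catch (Exception e) {
            resp.sendError(500, "query failed");
            return;
        }
        json.append(']');

        resp.setContentType("application/json");
        resp.getWriter().write(json.toString());
    }
}
```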
Industry
Consumer Goods and Services
Expertise
Mobile Application Development, Front-End Development
Technologies
Java, Apache Spark Over Hadoop, Apache Parquet, MySQL Server, Apache Tomcat, Linux, AWS

Approach

Statistical processing on this scale requires large amounts of disk space and CPU power. The Amazon Web Services platform was chosen because it provides the complete set of services and resources required for this solution at a reasonable cost. AWS offers relatively inexpensive storage for an almost unlimited amount of data, and it supports automated deployment of Hadoop clusters, which was necessary to fulfill the customer's requirements. On this foundation, FortySeven developers implemented the multi-stage data pipeline described above.
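As an illustration of that automated deployment, a transient EMR cluster can be launched programmatically. The sketch below uses the AWS SDK for Java; the instance types, counts, roles, release label, and bid price are placeholders, not project values. With keepJobFlowAliveWhenNoSteps set to false, the cluster terminates itself once its steps finish, which is how the second and third stages turn their Spot-Instance clusters on and off.

```java
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.Application;
import com.amazonaws.services.elasticmapreduce.model.InstanceGroupConfig;
import com.amazonaws.services.elasticmapreduce.model.InstanceRoleType;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.MarketType;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;

public class ClusterLauncher {

    public static void main(String[] args) {
        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

        RunJobFlowRequest request = new RunJobFlowRequest()
                .withName("report-processing")
                .withReleaseLabel("emr-5.30.0")           // placeholder release
                .withApplications(new Application().withName("Spark"))
                .withServiceRole("EMR_DefaultRole")
                .withJobFlowRole("EMR_EC2_DefaultRole")
                .withInstances(new JobFlowInstancesConfig()
                        .withInstanceGroups(
                                new InstanceGroupConfig()
                                        .withInstanceRole(InstanceRoleType.MASTER)
                                        .withInstanceType("m4.large")
                                        .withInstanceCount(1),
                                new InstanceGroupConfig()
                                        .withInstanceRole(InstanceRoleType.CORE)
                                        .withInstanceType("m4.large")
                                        .withInstanceCount(3)
                                        // Spot pricing keeps the workers cheap.
                                        .withMarket(MarketType.SPOT)
                                        .withBidPrice("0.05")) // placeholder bid
                        // Shut the cluster down automatically when its steps finish.
                        .withKeepJobFlowAliveWhenNoSteps(false));

        RunJobFlowResult result = emr.runJobFlow(request);
        System.out.println("Started cluster " + result.getJobFlowId());
    }
}
```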

Result

The Amazon Web Services platform made it possible to implement a complex big data pipeline while avoiding the main bottlenecks of any big data processing:

  • Disk space
  • CPU power
  • Distributed computing

Amazon S3 is one of the cheapest storage solutions in its class, and the EMR service provides an easy, highly automated way to deploy and scale a large Hadoop cluster on demand, significantly reducing development effort.