
Big data streaming analytics case with Apache Kafka, Spark (Flink) and BI systems


Today we will look at an example of building a big data streaming analytics system based on Apache Kafka, Spark, Flink, a NoSQL DBMS, and either the Tableau BI system or visualization in Kibana.

Read on to find out who should analyze Twitter posts in real time and why, how to implement this technically, how to visualize the results in BI dashboards for data-driven decision-making, and what the Kappa architecture has to do with it.

Once Again about Big Data Analytics for Business: Marketing Problem Setting

Advertising and marketing are still the largest consumers of Big Data and data science technologies. Moreover, modern businesses seek not only to satisfy a client's emerging need but also to shape it, by stimulating demand or anticipating the consumer's desires.

For example, visitors to recreation parks, summer festivals and outdoor sports events are interested in fast delivery of picnic groceries or ready-to-eat meals. A potential client can be identified by analyzing their activity on social networks online.


For example, hashtags such as #rest, #parkgorky or #weekend under photos on Instagram or Twitter, together with geolocation data, indicate that a person is walking in a specific area right now and, depending on the weather, might be happy to have a hot coffee or a cool green tea with a hearty burger or a healthy lunch.

This holds, of course, only if the user is not already at a cafe, i.e. the message lacks hashtags such as #cafe, #lunch or #summertime. By analyzing such posts and tweets in real time, a food tech company can significantly increase its profits through such ad hoc sales.
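The rule described above can be sketched in a few lines of Python. This is only an illustration: the two hashtag sets are taken from the examples in the text, while the function names and the `has_geolocation` flag are assumptions made for the sketch, not part of any real system.

```python
import re

# Hypothetical hashtag sets, taken from the examples in the text.
LEISURE_TAGS = {"rest", "parkgorky", "weekend"}
ALREADY_EATING_TAGS = {"cafe", "lunch", "summertime"}

HASHTAG_RE = re.compile(r"#(\w+)")


def extract_hashtags(text: str) -> set[str]:
    """Return the lowercase hashtags found in a post, without the '#'."""
    return {tag.lower() for tag in HASHTAG_RE.findall(text)}


def is_potential_customer(text: str, has_geolocation: bool) -> bool:
    """Flag a geotagged post with leisure hashtags and no 'already eating' hashtags."""
    tags = extract_hashtags(text)
    outdoors = bool(tags & LEISURE_TAGS)
    eating = bool(tags & ALREADY_EATING_TAGS)
    return has_geolocation and outdoors and not eating
```

In a real pipeline a check like this would run on every incoming post, so it is deliberately cheap: one regular expression pass and two set intersections.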

Thus, the key capabilities of the big data streaming analytics system for this case will be the following:

  • scalability, accuracy and the highest data processing speed (in real time or near real-time);
  • intelligent analysis of the collected information and automated decision-making, for example, generating personal special offers that take into account the client's historical interests and current context, such as geolocation, time of day, weather and other factors;
  • visualization of analysis results on an interactive dashboard.
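To make the automated decision-making step more concrete, here is a minimal rule-based sketch of offer generation. It is a stand-in for the machine learning model mentioned later in the article, not the model itself. The `Context` fields, the 15 °C threshold and the lunch hours are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Context:
    """Current characteristics of a client (all fields are illustrative)."""
    temperature_c: float      # current weather at the client's location
    hour: int                 # local hour of day, 0..23
    likes_healthy_food: bool  # derived from the client's historical interests


def pick_offer(ctx: Context) -> str:
    """Choose a special offer from weather, time of day and past interests."""
    # Hypothetical threshold: below 15 °C a hot drink is offered.
    drink = "hot coffee" if ctx.temperature_c < 15 else "cool green tea"
    meal = "healthy lunch" if ctx.likes_healthy_food else "hearty burger"
    # Hypothetical lunch window: a meal is offered only from 11:00 to 16:00.
    if 11 <= ctx.hour < 16:
        return f"{drink} and a {meal}"
    return drink
```

For example, `pick_offer(Context(temperature_c=5, hour=12, likes_healthy_food=False))` yields an offer of hot coffee and a hearty burger.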

We will consider how to implement this in practice below.

Big Data Streaming Analytics ML System Architecture

A typical Big Data system for the need described above has a classic Kappa architecture, which allows relatively inexpensive real-time processing of unique events without in-depth historical analysis. Technically, this can be implemented as follows [1]:

  • read data from social networks in real-time mode;
  • aggregate it by extracting the hashtags of interest and determining the relationships between them;
  • make calculations, forming personal recommendations using machine learning (ML) models;
  • visualize the results of data analysis in a BI system dashboard.
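One simple way to define the relationship between hashtags, as in the aggregation step above, is to count how often two hashtags appear in the same post. The sketch below assumes this co-occurrence measure. The source does not say which measure is actually used, and the function names are illustrative.

```python
import re
from collections import Counter
from itertools import combinations

HASHTAG_RE = re.compile(r"#(\w+)")


def hashtag_pairs(text: str) -> list[tuple[str, str]]:
    """Return every unordered pair of distinct hashtags in one post."""
    # Sorting makes each pair canonical, so (a, b) and (b, a) count as one.
    tags = sorted({tag.lower() for tag in HASHTAG_RE.findall(text)})
    return list(combinations(tags, 2))


def cooccurrence(posts: list[str]) -> Counter:
    """Count how many posts mention each pair of hashtags together."""
    counts: Counter = Counter()
    for text in posts:
        counts.update(hashtag_pairs(text))
    return counts
```

In a streaming setting the same counting would be applied per time window rather than to a complete list, but the pairing logic stays the same.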

In particular, the Twitter API allows you to receive data in real time, process it and transfer it further along the processing pipeline, which will look like this:

  • data is collected in JSON format using the Twitter API and written to Apache Kafka topics for online analytics, as well as to Hadoop HDFS to accumulate history;
  • Spark applications are responsible for batch and stream computing, as well as ML;
  • a NoSQL or MPP DBMS that best meets the predetermined requirements for storage and read/write speed can serve as the analytical data warehouse, for example, Apache HBase, Hive, Greenplum, Cassandra or Elasticsearch;
  • to generate reports and visualize analysis results, you can use ready-made BI solutions such as Tableau, integrated with the analytical DBMS through special connectors.
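The first step of this pipeline, turning a raw JSON tweet into a message for a Kafka topic, might look like the sketch below. The topic name and the choice of fields to keep are assumptions made for illustration. The field names (`id_str`, `text`, `coordinates`, `created_at`) follow the classic tweet object and should be checked against the API version actually used.

```python
import json

TOPIC = "tweets"  # hypothetical topic name


def to_kafka_record(raw_json: str) -> tuple[bytes, bytes]:
    """Convert one raw tweet into a (key, value) pair of bytes for Kafka."""
    tweet = json.loads(raw_json)
    # Keep only the fields the downstream analytics needs (an assumption).
    slim = {
        "id": tweet["id_str"],
        "text": tweet.get("text", ""),
        "coordinates": tweet.get("coordinates"),
        "created_at": tweet.get("created_at"),
    }
    # The tweet id is used as the key, so repeated copies of the same tweet
    # are routed to the same partition.
    key = slim["id"].encode("utf-8")
    value = json.dumps(slim, ensure_ascii=False, sort_keys=True).encode("utf-8")
    return key, value
```

With a client library such as kafka-python, the resulting pair would then be passed to the producer, for example `producer.send(TOPIC, key=key, value=value)`. The broker address and the producer configuration depend on the deployment and are not shown here.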

However, such an online big data analytics system can be implemented with more than just the Big Data technologies noted in the figure. Read on to see what alternatives are possible for each of the components described.

Apache Kafka And Other Implementation Technologies

The complexity of connecting the system components to each other and the availability of ready-made integration connectors can become a criterion for choosing a particular framework. 

For example, the Greenplum-Spark Connector 2.0, which we talked about here, was released in October 2020.

The same Greenplum MPP DBMS can also be connected to Apache Kafka using the Greenplum Stream Server (GPSS) or the Java-based PXF (Platform eXtension Framework), which we discussed in this article.

For the specifics of creating your own Apache Spark connector to the Tableau BI system, read this article.

In addition, the functional and non-functional requirements for this system component can serve as criteria for choosing an analytical DBMS. For example, Elasticsearch offers near-instant indexing of new data in JSON and other semi-structured formats, with fuzzy search support and ML modules, which we mentioned here.

Built-in integration with Kibana lets you visualize the results of data analytics, as was done in the ad conversion analysis case study.
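As an illustration of how JSON documents reach Elasticsearch and how fuzzy search is expressed, the sketch below builds a request body in the newline-delimited format of the Elasticsearch Bulk API and a `match` query with the `fuzziness` option. The index name, the document fields and the helper function names are hypothetical, and sending the requests over HTTP is left out.

```python
import json

INDEX = "tweets"  # hypothetical index name


def bulk_body(docs: list[dict]) -> str:
    """Build a Bulk API body: an action line, then the document, per entry."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": INDEX, "_id": doc["id"]}}))
        lines.append(json.dumps(doc, ensure_ascii=False))
    # Every line of a bulk body, including the last one, ends with a newline.
    return "".join(line + "\n" for line in lines)


def fuzzy_query(field: str, text: str) -> dict:
    """Build a match query that tolerates small typos in the search text."""
    return {"query": {"match": {field: {"query": text, "fuzziness": "AUTO"}}}}
```

A query such as `fuzzy_query("text", "wekend")` is intended to match documents containing "weekend" despite the typo. The exact behavior depends on the index mapping and the analyzer in use.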

The advantage of this solution is that there are no costs for a commercial Tableau BI license: instead, Apache Kafka is used together with the ELK stack components (Elasticsearch, Logstash, Kibana). The machine learning algorithms are implemented in PySpark code running on the Spark framework [2].

However, Apache Flink provides similar capabilities and can be used instead of Spark if you need fast real-time data processing.

Similar to Spark, Flink also provides SQL modules and machine learning libraries, including the Alink set of algorithms. Like Spark, Flink lets you write code in Java, Scala and Python, with improved performance thanks to the updates in the 1.13.0 release of May 2021 [3]. For the similarities and differences between these distributed frameworks (“Apache Spark vs Flink”), see our separate article.

You can learn the technical details of implementing this case, along with other similar examples of streaming big data analytics based on Apache Kafka, Spark and Greenplum, in specialized courses at our licensed training and professional development center in Moscow for developers, managers, architects, engineers, administrators, Data Scientists and Big Data analysts.
