Scalable and fast real-time data with Apache Kafka

Thursday, April 9, 2020

When there is a need to analyze team performance, decision makers usually want to see graphs. Visualizing data is absolutely crucial when trying to understand large data sets, and the more data the graphs cover and the more up-to-date it is, the better.
However, there is a long route from first collecting the data to getting it into the visualization stage, and there can be challenges along the way.

In the following illustrations, I have built dashboards with Kibana, a data visualization tool that is part of the Elastic product family known as the ELK stack. In this post I want to talk about how Kafka, a stream processing platform originally developed at LinkedIn, can help in providing real-time visualizations of any data source. Kafka is a stream processor built on the traditional publish-subscribe messaging model: Kafka itself is the one that moves the data around and processes it along the way, while the systems pushing data in are called producers and the ones subscribing to Kafka's information streams are called consumers.
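To make the producer side concrete, here is a minimal sketch using the kafka-python client. The broker address, topic name, and event fields are all hypothetical:

from kafka import KafkaProducer
import json

# Connect to a (hypothetical) local Kafka broker.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish one event to a hypothetical "jira-issues" topic.
producer.send("jira-issues", {"issue": "PROJ-42", "status": "In Progress"})
producer.flush()  # block until the message is actually delivered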

[Screenshot: Kibana dashboard]

What really sets the ELK stack, and especially Kibana, apart as a visualization tool is that the data is handled as logs. Kafka does not hand the data to Kibana as the structured data tables that many other data visualization tools expect. Instead, Kafka periodically delivers the state changes of the producer (in this case a PostgreSQL database) as log messages, which is the trick that keeps the ELK stack scalable while vast amounts of data keep flowing in real time.
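As a simplified, hypothetical illustration, a single such log message for a row update might look something like the following. The before/after structure mirrors the convention of common change-data-capture tools, but the exact format depends on the connector in use:

{
  "op": "u",
  "before": { "id": 42, "status": "Open" },
  "after": { "id": 42, "status": "In Progress" },
  "ts_ms": 1586449200000
}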

[Screenshot: Kibana visualization]

Instead of the traditional model where you download the data and then start visualizing it, the user subscribes to the Kafka "topic" that holds the data they are interested in, pulls the full history of log messages for that data, and keeps receiving every change and addition that happens on that topic. This way the subscribers can pull the data into a search engine, a visualization tool, or wherever else they want to use it.
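A minimal consumer sketch with the kafka-python client shows this replay-then-follow pattern; the topic name and broker address are again hypothetical:

from kafka import KafkaConsumer
import json

# Subscribe to a hypothetical topic, starting from the earliest offset
# so the full history is replayed before new events stream in.
consumer = KafkaConsumer(
    "jira-issues",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:  # blocks and keeps yielding new events as they arrive
    print(message.value)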


The flexibility of Kafka really comes to light when we look at what happens inside the Kafka streams.
In a Kafka stream you can process the data any way you desire, and the KSQL engine built on top of Kafka offers a very SQL-like query language that any data engineer should be able to pick up quickly. One of the most common use cases is aggregating transactional data into weekly or monthly averages before pushing it to the visualization tool. For example, given a hypothetical transactions stream with an amount column per account_id, a rolling 30-day average could be computed like this:

CREATE TABLE monthly_averages AS
  SELECT account_id, AVG(amount) AS avg_amount
  FROM transactions WINDOW TUMBLING (SIZE 30 DAYS) GROUP BY account_id;
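The resulting table is updated continuously as new events arrive on the source stream, and its changelog can itself be consumed downstream, for example by a connector feeding Elasticsearch and Kibana.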

With this technology, as illustrated in the images above, we are able to visualize data from Jira issues (or any other data source) and include it in our visualizations and reports without delay.
To conclude:
Stop downloading the data. Subscribe to data streams instead.

Tags: PostgreSQL, Databases, Kafka, Real-time analytics