Creating a streaming data pipeline with Kafka Streams

Creating a rule-based streaming data topology

Jason Snouffer
ITNEXT


What is a streaming topology?

A topology is a directed acyclic graph (DAG) of stream processors (nodes) connected by streams (edges). Two key properties of a DAG are that it is finite and that it contains no cycles. Structuring a streaming application as a topology allows the data processors to be small, focused microservices that can be easily distributed and scaled, and that can execute their work in parallel.

Why use Kafka Streams?

Kafka Streams is an API, developed by Confluent as part of Apache Kafka, for building streaming applications that consume Kafka topics, analyze, transform, or enrich the input data, and then send the results to another Kafka topic. It lets you do this with concise code in a way that is distributed and fault-tolerant. Kafka Streams defines a processor topology as a logical abstraction for your stream processing code.

A simple Kafka Streams topology

Key concepts of Kafka Streams

  • A stream is an unbounded, continuously updating data set, consisting of an ordered, replayable, and fault-tolerant sequence of key-value pairs.
  • A stream processor is a node in the topology that receives one input record at a time from its upstream processors, applies its operation to that record, and may optionally produce one or more output records for its downstream processors.
  • A source processor is a processor that does not have any upstream processors.
  • A sink processor is a processor that does not have any downstream processors.

Getting Started

For this tutorial, I will be using the Java APIs for Kafka and Kafka Streams. I assume a basic understanding of building a Java project with Maven, a rudimentary familiarity with Kafka, and that a Kafka instance has already been set up. Lenses.io provides a quick and easy containerized solution for setting up a Kafka instance here.

To get started, we need to add kafka-clients and kafka-streams as dependencies to the project pom.xml:
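A minimal sketch of the dependency block follows; the version numbers are illustrative placeholders that should match your Kafka cluster, and the Gson artifact is only needed for the JSON examples later in this article:

<!-- Versions are illustrative placeholders -->
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>3.7.0</version>
</dependency>
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-streams</artifactId>
    <version>3.7.0</version>
</dependency>
<dependency>
    <groupId>com.google.code.gson</groupId>
    <artifactId>gson</artifactId>
    <version>2.10.1</version>
</dependency>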

Building a Streaming Topology

One or more input, intermediate, and output topics are needed for the streaming topology. Information on creating new Kafka topics can be found here. Once we have created the requisite topics, we can create the streaming topology. Here is an example of a topology for an input topic whose values are serialized as JSON (serialized and deserialized with Gson).

Simple example of streaming topology
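As a minimal sketch of such a topology (the application id, broker address, and the topic names input-topic and output-topic are illustrative assumptions), the code below reads JSON values with a Gson-backed Serde and forwards them, unchanged, to an output topic:

import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class SimpleTopology {

    // A Gson-backed Serde: values travel through Kafka as UTF-8 JSON bytes.
    static Serde<JsonObject> jsonSerde() {
        return Serdes.serdeFrom(
            (topic, data) -> data.toString().getBytes(StandardCharsets.UTF_8),
            (topic, bytes) -> JsonParser.parseString(
                new String(bytes, StandardCharsets.UTF_8)).getAsJsonObject());
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "simple-topology");   // assumed id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker

        StreamsBuilder builder = new StreamsBuilder();

        // Source processor: reads JSON records from the input topic.
        KStream<String, JsonObject> input =
            builder.stream("input-topic", Consumed.with(Serdes.String(), jsonSerde()));

        // Sink processor: writes the records, untouched, to the output topic.
        input.to("output-topic", Produced.with(Serdes.String(), jsonSerde()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}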

The above example is a very simple streaming topology, but at this point it doesn't really do anything. It is important to note that the topology is executed and persisted by the application running the previous code snippet; it does not run inside the Kafka brokers. All topology processing overhead is paid by the creating application.

A running topology can be stopped by executing:

streams.close();
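In a long-running service, a common idiom (a sketch here, not from the original snippet) is to register the close call in a JVM shutdown hook so the topology stops cleanly when the process is terminated:

Runtime.getRuntime().addShutdownHook(new Thread(streams::close));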

To make this topology more useful, we need to define rule-based branches (or edges). In the next example, we create a basic topology with three branches, based on the value of a specific field in the JSON message payload.

Streaming topology with multiple branches
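Here is a minimal sketch of such a branching topology, continuing from the input stream above; the "type" field and the three output topic names are illustrative assumptions. With the split() API (Kafka Streams 2.8+), each record is routed to the first branch whose predicate matches:

import org.apache.kafka.streams.kstream.Branched;

// Route each record to the first matching branch, keyed off an assumed "type" field.
input.split()
    .branch((key, value) -> value.has("type")
            && "purchase".equals(value.get("type").getAsString()),
        Branched.withConsumer(ks ->
            ks.to("purchase-topic", Produced.with(Serdes.String(), jsonSerde()))))
    .branch((key, value) -> value.has("type")
            && "refund".equals(value.get("type").getAsString()),
        Branched.withConsumer(ks ->
            ks.to("refund-topic", Produced.with(Serdes.String(), jsonSerde()))))
    .defaultBranch(Branched.withConsumer(ks ->
        ks.to("other-topic", Produced.with(Serdes.String(), jsonSerde()))));

On Kafka Streams versions before 2.8, the equivalent (since-deprecated) API is KStream#branch(Predicate...), which returns an array of streams.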

The topology we just created would look like the following graph:

Downstream consumers can consume the branch topics from the previous example exactly like any other Kafka topic. Those downstream processors may produce output topics of their own, so it can be useful to combine their results with the original input topic. We can use the Kafka Streams API to define rules for joining the resulting output topics into a single stream.

Crossing the Streams

Kafka Streams models its stream joining functionality on SQL joins. There are three kinds of joins:

  • inner join: emits an output record only when both input topics have records with the same key.
  • left join: emits an output record for each record in the left (primary) input topic. If the other topic has no record with a matching key, the corresponding value in the output is set to null.
  • outer join: emits an output record for each record in either input topic. If only one side contains a given key, the other side's value is null.

For our example, we are joining the stream of input records with the results from the downstream processors. In this case it makes the most sense to perform a left join, with the input topic as the primary topic. This ensures the joined stream always contains the original input records, even when no processor results are available.
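Here is a minimal sketch of that left join, again with illustrative topic names. Note that KStream-KStream joins are windowed, so an input record and a processor result are only paired if they arrive within the configured time difference, and both topics must be co-partitioned (same key and partition count):

import java.time.Duration;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.StreamJoined;

// Results produced by the downstream processors, keyed the same way as the input.
KStream<String, JsonObject> results =
    builder.stream("results-topic", Consumed.with(Serdes.String(), jsonSerde()));

KStream<String, JsonObject> joined = input.leftJoin(
    results,
    (inputValue, resultValue) -> {
        JsonObject out = inputValue.deepCopy();
        // resultValue is null when no processor result arrived within the window
        out.add("result", resultValue);
        return out;
    },
    JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofMinutes(5)), // assumed window
    StreamJoined.with(Serdes.String(), jsonSerde(), jsonSerde()));

joined.to("joined-topic", Produced.with(Serdes.String(), jsonSerde()));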

The final overall topology will look like the following graph:

It is programmatically possible to have the same service create and execute both streaming topologies, but I avoided doing so in this example to keep the graph acyclic.
