Beginning with Pega® Platform 7.3, you can connect to Apache Kafka servers through a dedicated data set rule. Apache Kafka is a fault-tolerant and scalable platform that you can use as a data source for real-time analysis of customer records (such as messages, calls, and so on) as they occur. The most efficient way of using Kafka data sets in your application is through Data Flow rules that include event strategies.
Pega Platform includes Kafka client version 0.10.0.1, which is compatible with Kafka server version 0.10.
To establish a connection between Pega Platform and an Apache Kafka server, create the following components in your application:
- Kafka configuration instance
- Kafka data set rule
Kafka configuration instances
A Kafka configuration is a data instance that you create in the Data-Admin-Kafka class of your application. The purpose of this instance is to establish a client connection between Pega Platform and an external Apache Kafka server or a cluster of servers.
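For orientation, the following minimal sketch shows the kind of connection settings that a plain Apache Kafka Java client needs. It is not the Pega Platform implementation, and how a Kafka configuration instance maps to these settings internally is an assumption. The property names are standard Kafka client settings; the host names and the group ID are hypothetical placeholders.

```java
import java.util.Properties;

class KafkaConnectionSketch {

    // Collects the basic settings that a plain Kafka consumer client expects.
    // The values passed in are hypothetical; replace them with your own servers.
    static Properties consumerProperties(String bootstrapServers, String groupId) {
        Properties props = new Properties();
        // Comma-separated list of host:port pairs for the server or cluster.
        props.setProperty("bootstrap.servers", bootstrapServers);
        // Consumers that share a group ID divide the topic partitions among themselves.
        props.setProperty("group.id", groupId);
        // Classes that turn the raw bytes of record keys and values into strings.
        props.setProperty("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        return props;
    }

    public static void main(String[] args) {
        Properties props = consumerProperties(
                "kafka1.example.com:9092,kafka2.example.com:9092", // hypothetical hosts
                "pega-sketch-group");                               // hypothetical group ID
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

The sketch only builds the settings. Passing them to a real client requires the Apache Kafka client library on the classpath.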
Creating a Kafka configuration instance
For more information, see Creating a Kafka configuration instance.
Kafka data sets
Each Kafka server or server cluster that you connect to stores streams of records in categories that are called topics. For each topic that you want to access from Pega Platform, you must create a Kafka data set rule. When configuring a Kafka data set, you can select an existing topic in the target Kafka configuration instance, or you can create a topic if the Kafka cluster is configured for the autocreation of topics. Optionally, you can also specify the partition keys to apply to the data in distributed data flow runs, and whether to read historical Kafka records, that is, records that were published before the real-time data flow run that references this Kafka data set started.
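The general idea behind partition keys can be illustrated with a short sketch: records that carry the same key are always routed to the same partition, so related records stay together. This is a simplified illustration and not how Pega Platform or Kafka computes partitions. The class and method names are hypothetical, and the sketch uses a basic array hash, whereas the default partitioner in the Kafka Java client uses a murmur2 hash of the key bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class PartitionKeySketch {

    // Maps a record key to a partition index in the range [0, partitionCount).
    // Simplification: Arrays.hashCode stands in for Kafka's murmur2 hash.
    static int partitionFor(String key, int partitionCount) {
        if (partitionCount <= 0) {
            throw new IllegalArgumentException("partitionCount must be positive");
        }
        int hash = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
        // floorMod keeps the result non-negative even when the hash is negative.
        return Math.floorMod(hash, partitionCount);
    }

    public static void main(String[] args) {
        int partitions = 6; // hypothetical partition count for a topic
        for (String customerId : new String[] {"C-1001", "C-1002", "C-1001"}) {
            System.out.println(customerId + " -> partition "
                    + partitionFor(customerId, partitions));
        }
    }
}
```

In this sketch, both records for the hypothetical customer C-1001 resolve to the same partition, which is the property that makes key-based partitioning useful for per-customer processing.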
Creating a Kafka data set
For more information, see Creating a Kafka data set.
Kafka data sets in data flows
You can use Kafka data sets as either a source or a destination in a Data Flow rule. Data flows that reference Kafka data sets can run only in real-time mode. Because Kafka servers support partitioning, you can distribute data flow runs to process data across all the nodes that are configured as part of the Data Flow service, which increases the throughput and resiliency of data flow processing.
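To illustrate why partitioning helps with distribution, the following sketch spreads the partitions of a topic across a set of nodes so that each node processes its own share. The round-robin strategy, the class name, and the node names are assumptions for illustration only; the assignment strategy that the Data Flow service actually uses is not described here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PartitionDistributionSketch {

    // Assigns partitions 0..partitionCount-1 to nodes in round-robin order.
    static Map<String, List<Integer>> assign(List<String> nodes, int partitionCount) {
        if (nodes.isEmpty()) {
            throw new IllegalArgumentException("at least one node is required");
        }
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String node : nodes) {
            assignment.put(node, new ArrayList<>());
        }
        for (int partition = 0; partition < partitionCount; partition++) {
            String node = nodes.get(partition % nodes.size());
            assignment.get(node).add(partition);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Hypothetical node names and partition count.
        List<String> nodes = Arrays.asList("node-a", "node-b");
        System.out.println(assign(nodes, 5));
    }
}
```

With two nodes and five partitions, the sketch gives partitions 0, 2, and 4 to node-a and partitions 1 and 3 to node-b. If a node becomes unavailable, rerunning the assignment over the remaining nodes covers every partition again, which is the resiliency benefit in miniature.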