What is Kafka?

Apache Kafka is an open-source distributed streaming system used for stream processing, real-time data pipelines, and data integration at scale. Originally created at LinkedIn in 2011 to handle real-time data feeds, Kafka quickly evolved from a messaging queue into a full-fledged event streaming platform capable of handling more than a million messages per second, or trillions of messages per day.

Apache Kafka functions as a distributed event store and stream-processing platform, developed as an open-source project by the Apache Software Foundation. It is written in Java and Scala, with the primary goal of providing a unified, high-throughput, low-latency platform for handling real-time data feeds.

Why Kafka?

Kafka offers numerous benefits and has become a cornerstone in the technology landscape. Adopted by more than 80% of Fortune 100 companies across diverse industries, Kafka serves a wide range of use cases, both large and small. It stands out as the go-to technology for developers and architects working on the next wave of scalable, real-time data streaming applications. While various technologies on the market accomplish similar goals, Kafka’s widespread popularity can be attributed to several key factors.

Everything you need to know about Kafka in 10 minutes:

Event streaming is the digital counterpart to the central nervous system of the human body. It forms the technological bedrock for the ‘always-on’ world, where businesses are increasingly software-defined and automated, and where the user of software is, more and more, software itself.


In technical terms, event streaming involves the real-time capture of data from various sources such as databases, sensors, mobile devices, cloud services, and software applications. This data is recorded as streams of events, stored persistently for future retrieval. Event streaming encompasses the ongoing processes of manipulating, processing, and responding to these event streams in both real-time and retrospective manners. Additionally, it includes the dynamic routing of event streams to different destination technologies based on specific requirements. The ultimate goal of event streaming is to maintain a seamless and continuous flow of data, ensuring that pertinent information is available at the right place and time.


For what purposes can event streaming be employed?

  • Processing payments and financial transactions in real-time is crucial for industries like stock exchanges, banks, and insurance.
  • Real-time tracking and monitoring of vehicles, including cars, trucks, fleets, and shipments, is essential in sectors such as logistics and the automotive industry.
  • Continuously capturing and analyzing sensor data from IoT devices or other equipment is imperative, especially in environments like factories and wind parks.
  • Swiftly collecting and responding to customer interactions and orders is vital in various sectors, including retail, the hotel and travel industry, and mobile applications.
  • Monitoring patients in hospital care and predicting changes in their condition to ensure timely treatment during emergencies is a critical healthcare application.
  • Connecting, storing, and making data produced by different divisions of a company readily available is fundamental for organizational coherence.
  • Serving as the foundation for data platforms, event-driven architectures, and microservices is integral to fostering technological innovation and efficiency.

Kafka combines three key capabilities, so you can implement your use cases for event streaming end-to-end with a single battle-tested solution:

  1. To publish (write) and subscribe to (read) streams of events, including continuous import/export of your data from other systems.
  2. To store streams of events durably and reliably for as long as you want.
  3. To process streams of events as they occur or retrospectively (see the sketch after this list).
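
To make the third capability concrete, here is a minimal sketch of a stream processor built with the Kafka Streams library for Java. It assumes a local broker at localhost:9092; the application id and the topic names payments-raw and payments-normalized are hypothetical:

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;

    import java.util.Properties;

    public class ProcessingSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payments-processor"); // hypothetical app id
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed local broker
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            // Continuously read events from one topic, transform each value,
            // and write the results to another topic.
            StreamsBuilder builder = new StreamsBuilder();
            builder.<String, String>stream("payments-raw")     // hypothetical input topic
                   .mapValues(value -> value.toUpperCase())
                   .to("payments-normalized");                 // hypothetical output topic

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        }
    }

The same topology keeps running as new events arrive, which is what distinguishes stream processing from one-off batch jobs.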

How does Kafka work in a nutshell?

Kafka is a distributed system consisting of servers and clients that communicate via a high-performance TCP network protocol. It can be deployed on bare-metal hardware, virtual machines, and containers, in on-premises as well as cloud environments.

Servers: Kafka operates as a cluster comprising one or more servers, capable of spanning multiple datacenters or cloud regions. Within this cluster, some servers form the storage layer, known as brokers. Other servers run Kafka Connect, which continuously imports and exports data as event streams. This integration allows Kafka to seamlessly connect with your existing systems, including relational databases and other Kafka clusters. Designed to support mission-critical scenarios, a Kafka cluster is highly scalable and fault-tolerant: if any server fails, the remaining servers take over its work to guarantee continuous operations without any data loss.

Clients: They enable you to develop distributed applications and microservices capable of concurrently reading, writing, and processing streams of events at scale, in a fault-tolerant manner, even in scenarios involving network issues or machine failures. Kafka ships with a set of built-in clients, complemented by numerous clients contributed by the Kafka community. These clients cover a range of programming languages, including Java and Scala (which includes the higher-level Kafka Streams library), Go, Python, C/C++, and more. Additionally, REST APIs are available for integration.
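
As an illustration, here is a minimal sketch of a client reading events with the built-in Java consumer. It assumes a local broker at localhost:9092; the consumer group id and the topic name payments are hypothetical:

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    public class ConsumerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "payments-readers");        // hypothetical consumer group
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("payments")); // hypothetical topic
                while (true) {
                    // Fetch whatever events have arrived since the last poll.
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("key=%s value=%s offset=%d%n",
                                record.key(), record.value(), record.offset());
                    }
                }
            }
        }
    }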

Main Concepts and Terminology

An event records the fact that something happened, either in the world or in your business.
It is also called a record or message in the documentation. When you read or write data to Kafka, you do so in the form of events. Conceptually, an event has a key, a value, a timestamp, and optional metadata headers. Here is a hypothetical example event:
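
    Event key: "Alice"
    Event value: "Made a payment of $200 to Bob"
    Event timestamp: "Jun. 25, 2020 at 2:06 p.m."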

Producers & Consumers

Producers are those client applications that publish (write) events to Kafka, and consumers are those that subscribe to (read and process) these events. Kafka achieves its high scalability, a key design element, by fully decoupling producers and consumers, making them agnostic of each other. For example, producers never need to wait for consumers. Kafka provides various guarantees such as the ability to process events exactly-once.
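
For instance, publishing a single event with the built-in Java producer might look like this minimal sketch; the broker address and the topic name payments are assumptions:

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    import java.util.Properties;

    public class ProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Publish one event; the send is asynchronous, and the callback
                // reports where the event landed (or any error).
                producer.send(
                    new ProducerRecord<>("payments", "Alice", "Made a payment of $200 to Bob"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            exception.printStackTrace();
                        } else {
                            System.out.printf("stored in %s-%d at offset %d%n",
                                    metadata.topic(), metadata.partition(), metadata.offset());
                        }
                    });
            } // close() flushes pending sends before returning
        }
    }

Note that the producer neither knows nor cares whether any consumer is currently reading the topic; that decoupling is what the paragraph above describes.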

Events are organized and durably stored in topics. Very simplified, a topic is similar to a folder in a filesystem, and the events are the files in that folder. An example topic name could be “payments”. Topics in Kafka are always multi-producer and multi-subscriber: a topic can have zero, one, or many producers that write events to it, as well as zero, one, or many consumers that subscribe to these events.
Events in a topic can be read as often as needed; unlike traditional messaging systems, Kafka does not delete events after consumption. Instead, you define how long Kafka should retain your events through a per-topic configuration setting, after which old events will be discarded. Kafka’s performance is effectively constant with respect to data size, so storing data for a long time is perfectly fine.
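
As a sketch, that per-topic retention setting (retention.ms) can be supplied when creating a topic with the Java AdminClient. The broker address, the topic name, and the single-broker replication factor of 1 are assumptions for a local setup:

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    public class TopicSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker

            try (AdminClient admin = AdminClient.create(props)) {
                // A "payments" topic with 3 partitions and a 7-day retention window;
                // events older than retention.ms become eligible for deletion.
                NewTopic payments = new NewTopic("payments", 3, (short) 1)
                        .configs(Map.of("retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000)));
                admin.createTopics(List.of(payments)).all().get();
            }
        }
    }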

To make your data fault-tolerant and highly available, every topic can be replicated, even across geo-regions or datacenters, so that multiple brokers always maintain copies of the data in case something goes wrong, such as broker failures or maintenance. A common production setting is a replication factor of 3, meaning there will always be three copies of your data. This replication is performed at the level of topic partitions.
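
To see that replication in action, a sketch like the following (using the Java AdminClient, version 3.0 or newer) lists each partition’s leader broker, replica set, and in-sync replicas for the hypothetical payments topic:

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.TopicDescription;

    import java.util.List;
    import java.util.Properties;

    public class ReplicationSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker

            try (AdminClient admin = AdminClient.create(props)) {
                TopicDescription desc = admin.describeTopics(List.of("payments"))
                                             .allTopicNames().get()
                                             .get("payments");
                // Each partition reports its leader broker, the full replica set,
                // and the in-sync replicas (ISR).
                desc.partitions().forEach(p ->
                        System.out.printf("partition %d: leader=%s replicas=%s isr=%s%n",
                                p.partition(), p.leader().id(), p.replicas(), p.isr()));
            }
        }
    }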