Notes on Kafka


What is Kafka

"Apache Kafka is a distributed streaming platform. It lets you publish and subscribe to streams of records; it lets you store streams of records in a fault-tolerant way; it lets you process streams of records as they occur."

 

Why Apache Kafka

  • Distributed, resilient architecture, fault tolerant
  • Horizontal scalability
  • High performance (latency of less than 10 ms), real time
  • Widely adopted; the de facto standard for building streaming platforms

Benefits of using a Message Queue (such as Kafka)

  • Decouple producers from consumers
  • Prevent data loss when a system fails
  • Absorb the load when traffic spikes
  • Serve as temporary storage, which can also feed batch processing
  • Data may come from many different services that use different protocols to communicate; using a message queue as a central bus simplifies this complexity
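
The decoupling benefit above can be sketched with a plain in-memory queue standing in for the broker (a simplified illustration, not Kafka itself): the producer and consumer never talk to each other directly, and the bounded buffer absorbs bursts.

```python
import queue
import threading

# In-memory stand-in for a broker: producer and consumer only know the queue.
buf = queue.Queue(maxsize=100)  # bounded buffer absorbs traffic spikes

def producer(n):
    for i in range(n):
        buf.put(f"event-{i}")   # producer never references the consumer
    buf.put(None)               # sentinel: no more data

def consumer(out):
    while True:
        msg = buf.get()
        if msg is None:
            break
        out.append(msg)

received = []
t1 = threading.Thread(target=producer, args=(5,))
t2 = threading.Thread(target=consumer, args=(received,))
t1.start(); t2.start()
t1.join(); t2.join()
print(received)  # events arrive in FIFO order
```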

Use cases

  • Messaging system
  • Website Activity tracking
  • Gather metrics from many different locations
  • Log Aggregation
  • Stream processing (with the Kafka Streams API, Spark, or Flink, for example)
  • De-coupling of system dependencies
  • Integration with Spark, Flink, Storm, Hadoop, and many other Big Data technologies

How Kafka works at high level

Kafka runs on a cluster of servers called brokers. Brokers receive records from producers, store them, and serve them to consumers.

Kafka stores data in topics. A topic is like a category of data. Each topic may have many partitions. Each partition is an ordered, immutable sequence of records that is continually appended to -- a structured commit log. 
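
The partition-as-commit-log idea can be sketched in a few lines (a simplified in-memory model; real Kafka persists log segments on disk):

```python
# A partition modeled as an append-only, offset-indexed log.
class Partition:
    def __init__(self):
        self.log = []

    def append(self, record):
        offset = len(self.log)   # offsets are sequential within a partition
        self.log.append(record)  # records are only ever appended, never mutated
        return offset

    def read(self, offset):
        return self.log[offset]  # consumers read by offset

p = Partition()
o1 = p.append("M1")
o2 = p.append("M2")
print(o1, o2)        # 0 1
print(p.read(0))     # M1
```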

Data stored in Kafka has a retention period. Beyond the retention period, data is discarded to free up space, whether or not it has been consumed.
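
A sketch of that retention behavior (simplified illustration; the 7-day window is just the common default, and real Kafka purges whole log segments, not individual records):

```python
import time

RETENTION_SECONDS = 7 * 24 * 3600  # e.g. a 7-day retention period

def purge(log, now):
    """Keep only records whose timestamp is inside the retention window,
    regardless of whether they were consumed."""
    cutoff = now - RETENTION_SECONDS
    return [(ts, rec) for ts, rec in log if ts >= cutoff]

now = time.time()
log = [(now - 10 * 24 * 3600, "old"),     # 10 days old: past retention
       (now - 3600, "recent")]            # 1 hour old: kept
kept = purge(log, now)
print(kept)  # only the "recent" record survives
```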

Each partition is replicated across a configurable number of servers for fault tolerance. Among the replicas, one is the leader and the others are followers; the replicas that are fully caught up with the leader form the in-sync replica set (ISR). ZooKeeper is used to manage the election of partition leaders, among other things.
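
The failover idea can be sketched as follows (a toy model with hypothetical broker ids; the real controller logic is more involved):

```python
# Sketch: if the leader broker fails, a new leader is promoted from the
# surviving in-sync replicas (ISR).
def elect_leader(leader, isr, failed_broker):
    if leader != failed_broker:
        return leader, isr                       # leader unaffected
    remaining = [b for b in isr if b != failed_broker]
    return remaining[0], remaining               # promote a surviving ISR member

# Broker 1 leads partition 0 with replicas on brokers 1, 2, 3; broker 1 dies.
leader, isr = elect_leader(leader=1, isr=[1, 2, 3], failed_broker=1)
print(leader, isr)  # 2 [2, 3]
```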

Consumers are organized into consumer groups; each consumer in a group reads from one or more partitions of the topic. A partition is consumed by exactly one consumer within a consumer group. Kafka dynamically assigns partitions to the consumers in a group: if new consumer instances join the group, they take over some partitions from other members; if an instance dies, its partitions are redistributed to the remaining instances.
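
A round-robin-style assignment and the rebalance that follows a member leaving can be sketched like this (a simplified model; real Kafka supports several pluggable assignment strategies):

```python
# Sketch: assign each partition to exactly one consumer in the group,
# round-robin. Re-running the function after membership changes models
# a rebalance.
def assign(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = [0, 1, 2, 3, 4, 5]
a = assign(partitions, ["c1", "c2", "c3"])
print(a)  # two partitions per consumer

b = assign(partitions, ["c1", "c2"])   # c3 left the group: rebalance
print(b)  # its partitions are redistributed, three per consumer
```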

Kafka only provides a total order over records within a partition, not between different partitions in a topic. It guarantees the following:

  • Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a record M1 is sent by the same producer as record M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.
  • A consumer instance sees records in the order they are stored in the log.
  • For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any records committed to the log.
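
Since ordering holds only within a partition, records that must stay ordered should share a key: the producer hashes the key to pick a partition, so the same key always lands in the same partition. A sketch (using CRC32 as a deterministic stand-in for Kafka's actual murmur2 key hashing):

```python
import zlib

NUM_PARTITIONS = 3

def partition_for(key: bytes) -> int:
    # Deterministic stand-in for Kafka's key-hashing partitioner.
    return zlib.crc32(key) % NUM_PARTITIONS

p1 = partition_for(b"user-42")
p2 = partition_for(b"user-42")
print(p1 == p2)  # True: same key -> same partition -> per-key ordering holds
```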

Kafka has four core APIs:

  • The Producer API: publish a stream of records to one or more topics
  • The Consumer API: subscribe to one or more topics and process the stream of records
  • The Streams API: allow an application to act as a stream processor, consuming input streams and producing output streams
  • The Connector API: build and run reusable producers and consumers that connect Kafka topics to existing systems, such as databases

 

Deploy using CLI

# Start server
> bin/zookeeper-server-start.sh config/zookeeper.properties
> bin/kafka-server-start.sh config/server.properties

# create a topic
> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
# list topics
> bin/kafka-topics.sh --list --zookeeper localhost:2181

# send messages
> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
This is a message
This is another message

# start a consumer
> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning

Deploy using Docker

https://github.com/wurstmeister/kafka-docker

Clients: 

Kafka is written in Scala and Java, and it has client libraries for many languages, including Go, Python, Node.js, and C++: https://cwiki.apache.org/confluence/display/KAFKA/Clients

Ecosystem:

Kafka can be integrated into many other systems. A list of them can be found at https://cwiki.apache.org/confluence/display/KAFKA/Ecosystem.
