Partitioning a topic
Distributed systems are already there all over the place. A distributed system is a system whose pieces could be running in different in different run times, these different run times could be on the same machine, they could be on different machines in the same server room, they could be on different machines in the same city, they could be on different machines in the same state, they could be on different machines in the same zone, they could be on different machines in the same planet. But when you talk to it, it talks as if it is one thing, one system. That is a distributed system.
Kafka is a distributed system. You connected to Kafka and told it to create a topic, it created the first one in New Zealand, you asked it to create another topic, it created the second one on a machine in Algeria. When you ask for data from topic one, you get it. When you ask for data from topic two, you get it. That is a distributed system.
Ok — so what is topic partitioning.
Given you have this setup. And the topic is supposed to store all the thoughts that came to your mind while you are alive, the server in new zealand would not be able to hold it. It would run out of disk space. Some of your high definition thoughts would bring it down, and you would not be able to store the next thought. Kafka loses it’s omnipotence. It would become a human. Except it won’t as it allows topic partitioning.
Note the scale of data that the systems of our times are working with. A distributed system works over multiple machines. The machines are helping the distributed system function. But that is not the only thing that the machine does. The machine might be doing bitcoin mining, might be storing high definition movies. Given the scale of the data and the fact that the machine size available is insufficient, there is a need to be able to store the data across machines. One is the size.
Kafka cluster
Partitioning is a fundamental concept. Topic is broken into partitions. Kafka works with M topics, and each topic would have N partitions. Clearly, there are many machines involved. You can see that there is a swarm of machines, a cluster of nodes. Distributed system. The word cluster captures the kafka system better than just distributed system. Distributed system can be any kind of architecture. Hub Spoke architecture. Cluster captures the kafka architecture well. Cluster of nodes where topics run. The functioning of the Kafka system is NOT the heavy part. The data is what makes the system heavy. Complexity of Kafka is not what requires so many machines. It is the volume of data along with Kafka’s high standards of availability, durability etc.
Distributing events across partitions of a topic
Key Value pair is the fundamental data abstraction for an event. If key is not provided, then the value gets distributed across the machines in the cluster. Round robin approach would be used to store the value if key is not provided. If key is provided, a hash code is computed and a mod N is done where N is the number of partitions, and that node is where the value gets stored.
Key guarantees ordering. Key is not a unique identifier for the value. It is more like a context for the value. For instance, if all the events happening on an order is tracked on a topic, then the order instance would be the key. All orderOne related events would sit together, in the right order on the particular partition. orderTwo related events sit together and so on.
All events associated with the same customer would sit in the same partition as another example.
Partitioning is good.