What is Apache Kafka?
Apache Kafka is an open-source, distributed, scalable publish-subscribe messaging system maintained by the Apache Software Foundation. The code is written in Scala and was originally developed at LinkedIn. It was open-sourced in 2011 and later became a top-level Apache project.
The project aims to provide a unified, low-latency platform for handling real-time data feeds. It is becoming increasingly valuable in enterprise infrastructures that require integration between systems: each system publishes or subscribes to particular Kafka topics according to its needs.
Kafka's design was influenced by transaction logs: the idea behind Apache Kafka is a massively scalable message queue constructed like a transaction log.
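The transaction-log idea can be sketched as an append-only sequence of records, each addressed by a monotonically increasing offset. This is a minimal illustrative model, not Kafka's actual API; the names (`CommitLog`, `append`, `read_from`) are made up for the example.

```python
class CommitLog:
    """Illustrative append-only log: records are never mutated or removed."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Append a record and return its offset (its position in the log)."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset):
        """Return every record at or after the given offset."""
        return self._records[offset:]


log = CommitLog()
log.append("user-signup")   # offset 0
log.append("page-view")     # offset 1
# A reader that has already processed offset 0 simply resumes from offset 1:
print(log.read_from(1))
```

Because readers address the log by offset rather than removing messages, many independent consumers can read the same log at their own pace.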
What Does Apache Kafka Do?
Kafka is a platform designed to handle real-time streams of data.
Kafka organizes data into particular topics. Data producers write to topics as “publishers”, while consumers, or “subscribers”, are configured and programmed to read from topic queues.
Topic messages are persisted on disk and replicated within the cluster to prevent data loss. Kafka's cluster-centric design offers strong durability and fault-tolerance guarantees.
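The publisher/subscriber relationship described above can be modeled in a few lines. This is a simplified in-memory sketch (with invented names like `Broker` and `poll`), not the real Kafka client API: each consumer tracks its own read offset per topic, so polling never removes messages for other subscribers.

```python
from collections import defaultdict


class Broker:
    """Toy stand-in for a Kafka cluster: a dict of named topic queues."""

    def __init__(self):
        self.topics = defaultdict(list)

    def publish(self, topic, message):
        self.topics[topic].append(message)


class Consumer:
    """Each subscriber keeps its own per-topic offset into the broker's log."""

    def __init__(self, broker):
        self.broker = broker
        self.offsets = defaultdict(int)

    def poll(self, topic):
        messages = self.broker.topics[topic][self.offsets[topic]:]
        self.offsets[topic] += len(messages)
        return messages


broker = Broker()
broker.publish("clicks", {"page": "/home"})
c1, c2 = Consumer(broker), Consumer(broker)
print(c1.poll("clicks"))  # each subscriber independently sees every message
print(c2.poll("clicks"))
```

Note the contrast with a traditional queue: reading does not consume the message, which is what lets multiple downstream systems subscribe to the same topic.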
Apache Kafka vs. Amazon Kinesis
Amazon Kinesis is modeled after Apache Kafka. It is known to be incredibly fast, reliable, and easy to operate.
Amazon Kinesis has built-in cross-replication, while Kafka requires you to configure it yourself. Cross-replication is the practice of syncing data across logical or physical data centers. It is not mandatory, and you should consider it only if you need it.
Engineers who are sold on the value proposition of Kafka, but wish it were offered as a managed service similar to the Kinesis model, should keep an eye on http://confluent.io.
When to use?
As previously mentioned, Kafka is often chosen as an integration system in enterprise environments, similar to traditional message brokers such as ActiveMQ or RabbitMQ. Integration between systems is assisted by Kafka clients available in a variety of languages, including Java, Scala, Ruby, Python, Go, Rust, and Node.js.
Other use cases for Kafka include website activity tracking, whether for real-time processing or for loading into Hadoop or analytic data warehousing systems for offline processing and reporting.
But the most interesting recent use of Kafka (and Kinesis) is in stream processing. More and more applications and enterprises are building architectures that include processing pipelines consisting of multiple stages. For example, in stage 1 of a multi-stage design, raw input data is consumed from Kafka topics and then aggregated, enriched, or otherwise transformed. In stage 2, the transformed data is published to new topics for further consumption or follow-up processing in a later stage.
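A two-stage pipeline of this shape can be sketched with in-memory topics. This is an illustrative simplification (no offset tracking, and topic names like "raw-events" and "enriched-events" are invented for the example), not a real streaming framework.

```python
from collections import defaultdict

# Toy stand-in for a set of Kafka topics.
topics = defaultdict(list)


def publish(topic, msg):
    topics[topic].append(msg)


def stage1():
    """Stage 1: consume raw events, enrich them, publish to a new topic."""
    for event in topics["raw-events"]:
        enriched = {**event, "region": "us-east"}  # example enrichment
        publish("enriched-events", enriched)


def stage2():
    """Stage 2: follow-up processing reads the enriched topic."""
    return [e for e in topics["enriched-events"] if e["region"] == "us-east"]


publish("raw-events", {"user": "alice"})
stage1()
print(stage2())  # [{'user': 'alice', 'region': 'us-east'}]
```

The key design point is the decoupling: stage 2 never talks to stage 1 directly, only to the intermediate topic, so either stage can be scaled, replaced, or replayed independently.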
Keep an eye on supergloo.com for more articles and tutorials on Kafka and data processing pipelines using streams. And if there are any particular “topics” you would like to see, please mention them in the comments below.
Reference: Apache Kafka – What Is It And Does It Compare To Amazon Kinesis? from our JCG partner Todd McGrath at the Supergloo blog.