Enterprise Java

Important Production bugs and fixes for Storm and Kafka integration

I will describe here a few details for Storm and Kafka integration modules, a few important bugs that you should be aware and how to overcome some of them (especially for production installations).

I am heavily using Apache Storm in production installations with Kafka as my main input source (Spout).

Storm integration modules with Kafka and versions:

Recently, I upgrade to Storm 1.0.3 (from 0.9.6) and to Kafka 0.9.0.1 (from 0.8.2.2).
Unfortunately, Storm 1.0.3 has 2 major bugs that you have to resolve in order to use it in a production environment.

Major bugs (related to Kafka):

  1. “New Kafka spout crashes if partitions are reassigned while tuples are in-flight [JIRA-2104]
    This is fixed in 1.0.x branch (Pull-1980)
  2. “Storm-kafka-client: Failed tuples are not always replayed” [JIRA-2087]
    This is fixed in 1.x branch (Pull-1826)

I faced the above bugs when started the migration process from Storm 0.9.6 to 1.0.3. When stressed my topologies, various things started to not work or either saw stalled Workers that had stopped processing data.
After reading many logs and doing many tests, we finally understood the problem (KafkaSpout bugs). We paused the migration process and we were looking to fix these problems.
Luckily, Storm committers had already fixed these bugs, so solution was already provided.
A big thanks to Storm community!!!!

In order to resolve these issues, I ported these two fixes in a forked version of “storm-kafka-client” and release the new customized module with a new maven version (1.0.3-<custom>1.0) . Then I just reference the new custom version in my projects.
Afterwards, we started stress tests again and everything work as expected.
Be aware that bug “2087” is fixed only in 1.x branch, but it is very easy to port it to 1.0.3 version.

Fortunately, a few days ago Storm 1.1.0 was released. This release already fixes these bugs and many others. I have not tested yet, but I will try it soon.
There was no Storm 1.1.0 release when I ported back these fixes to 1.0.3 release line.

If you plan to stay with Storm 1.0.3 release, then you have to be aware with a few additional bugs of this release that you may want to fix them in your “custom” release:

  • “Kafka outage can lead to lockup of topology” [STORM-2440] [FIX]
  • “ReportErrorAndDie doesn’t always die” [STORM-2194] [FIX]
  • “Utils.sleep method doesn’t set interrupted flag after catching InterruptedException” [STORM-2396] [FIX]
  • “Event Logger bolt is instantiated even if topology.eventlogger.executors=0” [STORM-2389] [FIX]
  • “Fail-back Blob deletion also fails in BlobSynchronizer.syncBlobs” [STORM-2386] [FIX] (related to Nimbus HA)
  • “Storm-HDFS’s listFilesByModificationTime is broken” [STORM-2350] [FIX]
  • “Type mismatch in ReadClusterState’s ProfileAction processing Map” [STORM-2345] [FIX]

Most of the above bugs (except 2440 & 2194) are already resolved in Storm 1.1.0 release. New release contains new features that you might be interested (Streaming SQL, Druid and OpenTSB integration, more).

Best regards,
Adrianos Dadis.
Real Democracy requires Free Software

Adrianos Dadis

Adrianos is working as senior software engineer in telcos business domain. Particularly interested in enterprise integration, multi-tier architecture and middleware services. He mainly works with Weblogic, JBoss, Java EE, Spring, Drools, Oracle SOA Suite and various ESBs.
Subscribe
Notify of
guest

This site uses Akismet to reduce spam. Learn how your comment data is processed.

0 Comments
Inline Feedbacks
View all comments
Back to top button