Important Production bugs and fixes for Storm and Kafka integration

Adrianos Dadis, April 11th, 2017. Last Updated: April 10th, 2017


Here I describe a few details of the Storm and Kafka integration modules, a few important bugs that you should be aware of, and how to overcome some of them (especially for production installations).

I use Apache Storm heavily in production installations, with Kafka as my main input source (Spout).

Storm integration modules with Kafka and versions:

Storm 0.x supports Kafka 0.8.x with existing module storm-kafka
Storm 1.0.x supports Kafka 0.9.x with new module storm-kafka-client
Storm 1.x supports Kafka 0.10.x with new module storm-kafka-client

Recently, I upgraded to Storm 1.0.3 (from 0.9.6) and to Kafka 0.9.0.1 (from 0.8.2.2).
Unfortunately, Storm 1.0.3 has two major bugs that you have to resolve before you can use it in a production environment.

Major bugs (related to Kafka):

“New Kafka spout crashes if partitions are reassigned while tuples are in-flight” [JIRA-2104]
This is fixed in 1.0.x branch (Pull-1980)
“Storm-kafka-client: Failed tuples are not always replayed” [JIRA-2087]
This is fixed in 1.x branch (Pull-1826)

I hit the above bugs when I started the migration from Storm 0.9.6 to 1.0.3. When I stressed my topologies, various things stopped working, or I saw stalled Workers that had stopped processing data.
After reading many logs and running many tests, we finally understood the problem (KafkaSpout bugs). We paused the migration while we looked for a way to fix these problems.
Luckily, the Storm committers had already fixed these bugs, so a solution was already available.
A big thanks to the Storm community!!!!

To resolve these issues, I ported these two fixes into a forked version of “storm-kafka-client” and released the customized module under a new Maven version (1.0.3-<custom>1.0). Then I simply referenced the new custom version in my projects.
Afterwards, we ran the stress tests again and everything worked as expected.
Be aware that bug “2087” is fixed only in the 1.x branch, but it is very easy to port the fix to version 1.0.3.

Fortunately, Storm 1.1.0 was released a few days ago. This release already fixes these bugs and many others. I have not tested it yet, but I will try it soon.
Storm 1.1.0 had not yet been released when I backported these fixes to the 1.0.3 release line.

If you plan to stay on the Storm 1.0.3 release, you should be aware of a few additional bugs in this release that you may want to fix in your “custom” release:

“Kafka outage can lead to lockup of topology” [STORM-2440] [FIX]
“ReportErrorAndDie doesn’t always die” [STORM-2194] [FIX]
“Utils.sleep method doesn’t set interrupted flag after catching InterruptedException” [STORM-2396] [FIX]
“Event Logger bolt is instantiated even if topology.eventlogger.executors=0” [STORM-2389] [FIX]
“Fail-back Blob deletion also fails in BlobSynchronizer.syncBlobs” [STORM-2386] [FIX] (related to Nimbus HA)
“Storm-HDFS’s listFilesByModificationTime is broken” [STORM-2350] [FIX]
“Type mismatch in ReadClusterState’s ProfileAction processing Map” [STORM-2345] [FIX]
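
To illustrate the kind of problem behind the “Utils.sleep” item above (STORM-2396), here is a minimal sketch of the general Java idiom of restoring the interrupted flag after catching InterruptedException. This is not the actual Storm code or the actual fix; the class name InterruptSafeSleep and its sleep method are hypothetical and exist only for this example.

```java
// Hypothetical helper, NOT Storm's Utils class: it only demonstrates the idiom.
public final class InterruptSafeSleep {

    private InterruptSafeSleep() {
    }

    /**
     * Sleeps for the given number of milliseconds.
     * Returns true if the sleep completed, false if the thread was interrupted.
     */
    public static boolean sleep(long millis) {
        try {
            Thread.sleep(millis);
            return true;
        } catch (InterruptedException e) {
            // Thread.sleep clears the interrupted status when it throws.
            // Setting it again lets the calling code notice the interrupt
            // and shut down, instead of looping as if nothing happened.
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Simulate an interrupt arriving before the sleep call.
        Thread.currentThread().interrupt();
        boolean completed = sleep(1000);
        System.out.println("completed=" + completed
                + ", interrupted=" + Thread.currentThread().isInterrupted());
    }
}
```

If the catch block swallowed the exception without calling interrupt() again, a worker loop that checks Thread.currentThread().isInterrupted() would never see the interrupt request, which is why this class of bug matters for clean shutdowns.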

Most of the above bugs (except 2440 & 2194) are already resolved in the Storm 1.1.0 release. The new release also contains new features that might interest you (Streaming SQL, Druid and OpenTSDB integration, and more).

Best regards,
Adrianos Dadis.
Real Democracy requires Free Software

Reference:

Important Production bugs and fixes for Storm and Kafka integration from our JCG partner Adrianos Dadis at the Java, Integration and the virtues of source blog.