This blog entry attempts to summarize the technologies in the CEP domain, touching over their prime feature(s) as well as their lackings. It seems sometimes that the term CEP is being overused (as did happen with ‘ESB’) and the write-up below reflects our perception and version of it.
ESPER (http://esper.codehaus.org/) is a popular open source component for complex event processing (CEP) available for java. It includes rich support for pattern matching and stream processing based on sliding time or length windows. Despite the fervent discussion over the term ‘CEP’ (http://www.dbms2.com/2011/08/25/renaming-cep-or-not/), ESPER seem to be a good fit for the term CEP, as it appears to be able to really identify “complex events” from a stream of simple events, thanks to ESPER’s EPL (Event Processing Language).
Recently, while searching for an open source solution for real time CEP, our group stumbled across Twitter’s Storm project (https://github.com/nathanmarz/storm
). It claims to be most comparable to Yahoo’s S4, while being in the same space as “Complex Event Processing” systems like Esper and Streambase. I am not sure about Streambase, but digging deeper into the Storm project made it look much different from CEP and from the ESPER solution. Ditto with S4 (http://incubator.apache.org/s4/
). While S4 and Storm seem to be good at real time stream processing in a distributed mode and they appeared (as they claim) to be the “Hadoop for Real Time”, they didn’t seem to have provisions to match patterns (and thus to indicate complex events).
Searching for a definition for CEP (that our study can relate to) led to the following bullets, (http://colinclark.sys-con.com/node/1985400
) which include the below four as prerequisite for a system/solution to be called a CEP component/project/solution:
- Domain Specific Language
- Continuous Query
- Time or Length Windows
- Temporal Pattern Matching
The lack of continuous query supporting time/length windows and temporal pattern matching seem to altogether missing in current versions of the S4 and Storm projects. Probably, it is due to their infancy and they will mature up to include such features in future. As of now, they only seem fit for pre-processing the event stream before passing it over to a CEP engine like ESPER. Their ability to do distributed processing (a la map-reduce mode) can help to speed up the pre-processing where events can be filtered off, or enriched via some lookup/computation etc. There have also been some attempts to integrate Storm with Esper (http://tomdzk.wordpress.com/2011/09/28/storm-esper/
While the processing systems like S4 and Storm lack important features of CEP, ESPER based systems have the disadvantage of being memory bound. Having too many events or having a large time windows can potentially cause ESPER to run out of memory. If ESPER is used to process real time streams, for e.g. from a social media, there will be lot of data accumulating in the ESPER memory. On a high level, the problem statement is to invent a CEP solution for big data. On a finer level, the problem statement include architecting a CEP solution for handling on-board (batched) as well as in-flight (Real-Time) data.
In DarkStar’s terminology (http://www.eventprocessing-communityofpractice.org/EPS-presentations/Clark_EP.pdf), the requirement is “having matched a registered pattern in real time, discover similar patterns in the database”. Since being memory bound is a limitation, it may prove useful if some mechanism to condense the in-memory events can be arrived at. The condensed data however should still be meaningful and retain the context of the original stream.
to be continued (as we research further) …
Reference: The complex (event) world from our JCG partner Abhishek Jain at the NS.Infra blog.