The idea of this article is to explain the general features of Apache big data stream processing frameworks. Also provide a crisp comparative analysis of Apache’s big data streaming frameworks against the generic features. So that it useful to select the right framework for the application development.
In the big data world, there are many tools and frameworks available to process the large volume of data in offline mode or batch mode. But the need for real time processing to analyze the data arriving at high velocity on the fly and provide analytics or enrichment services is also high. In the last couple of year this is an ever changing landscape, with many new entrants of streaming frameworks. So choosing the real time processing engine becomes a challenge.
The real time streaming engines interacts with stream or messaging frameworks such as Apache Kafka, RabbitMQ, Apache Flume to receive the data in real time.
It process the data inside the cluster computing engine which typically runs on top of a cluster manager such as Apache YARN, Apache Mesos or Apache Tez.
The processed data sent back to message queues ( Apache Kafka, RabbitMQ, Flume) or written into storage such as HDFS, NFS.
3. Characteristics of Real Time Stream Process Engines
3.1 Programming Models
They are two types of programming models present in real time streaming frameworks.
This approach provides basic components, using which the streaming application can be created. For example, In Apache Storm, the spout is used to connect to different sources and receive the data and bolts are used to process the received data.
This is more of a functional programming approach, where the framework allows us to define higher order functions. This declarative APIs provides more advanced operations like windowing or state management & it is considered more flexible.
3.2 Message Delivery Guarantee
There are three message delivery guarantee mechanisms. They are : at most once, at least
once & exactly once.
3.2.1 At most once
This is a best effort delivery mechanism. The message may be delivered one or more times. So the possibilities of getting duplicate events processed are very high.
3.2.2 At least once
This mechanism will ensure that the message is delivered at-least once.
But in the process of delivering at least once, the framework might deliver the
message more than once. So, duplicate message might be received and processed.
This might result in unnecessary complications, where the processing logic is not
3.2.3 Exactly once
The framework will ensure that the message is delivered and processed exactly once.
The message delivery is guaranteed and there won’t be any duplicate messages.
So, “Exactly Once” delivery guarantee is considered to be best of all.
3.3 State Management
Statement management defines the way events are accumulated in side the frameworks before it actually process the data. This is a critical factor while deciding the framework for real time analytics.
3.3.1 Stateless processing
The frameworks which process the incoming events independently without the knowledge
of any previous events are considered to be stateless. The data enrichment and data processing applications might need kind of processing power.
3.3.2 Stateful Processing
The stream processing frameworks can make use of the previous events to process the
incoming events, by storing them in cache or external databases. Real time analytics applications need stateful processing, so that it can collect the data for a specific interval and process them before it really recommends any suggestions to the user.
3.4 Processing Modes
Processing mode defines, how the incoming data is processed. There are three processing modes: Event, Micro batch & batch.
3.4.1 . Event Mode
Each and every incoming message is processed independently. It may or may not maintain the state information.
3.4.2 Micro Batch
The incoming events are accumulated for a specific time window and the collected events processed together as batch.
The incoming events are processed like a bounded stream of inputs.
This allows to process the large finite set of incoming events.
3.5 Cluster Manager
The real time processing frameworks runs in cluster computing environment might need a cluster manager. The support for cluster manager is critical to support the scalability and performance requirement of the application. The frameworks might run on standalone mode, their own cluster manager, Apache YARN, Apache Mesos or Apache Tez.
3.5.1 Standalone Mode
The support to run on standalone mode is useful during development phase, where the developers can run the code in their development environment, they do not need to deploy their code in the large cluster computing environment.
3.5.2 Proprietary Cluster Manager
Some of real time processing frameworks might support their own cluster managers, such Apache Spark has its own Standalone Cluster manager, which is bundled with the software. This reduces the overhead of installing , configuration and maintenance of other cluster managers such as Apache Yarn or Apache Mesos.
3.5.3 Support for Industry Standard Cluster Managers
If you already have a big data environment and want to leverage the cluster for real time processing, then support to existing cluster computing manager is very critical. The real time stream processing frameworks must support Apache YARN, Apache Mesos or Apache Tez.
3.6 Fault Tolerance
Most of the Big Data frameworks follows master slave architecture. Basically the master is responsible for running the job on the cluster and monitor the clients in the cluster. So, the framework must handle failures at the master node as well as failure in client nodes. Some frameworks might need some external tools like monit/supervisord to monitor the master node. For example, Apache Spark streaming has its own monitoring process for the master (driver) node. If the master node fails it will be automatically restarted. If the client node fails, master takes care of restarting them. But in Apache Storm, the master has to be monitored using monit.
3.7 External Connectors
The framework must support seamless connection to external data generation sources such Twitter feeds, Kafka , RabbitMQ, Flume, RSS Feeds, Hekad, etc. The frameworks must provide standard inbuilt connectors as well as provision to extend the connectors to connect various streaming data sources.
3.7.1 Social Media Connectors – Twitter / RSS Feeds
3.7.2 Message Queue Connectors -Kafka / RabbitMQ
3.7.3 Network Port Connectors – TCP/UDP Ports
3.7.4 Custom Connectors – Support to develop customized connectors to read from custom applications.
3.8 Programming Language Support
Most of these frameworks supports JVM languages, especially Java & Scala. Some also supports Python. The selection of the framework might depend on the language of the choice.
3.9 Reference Data Storage & Access
The real time processing engines, might need to refer some data bases to enhance or aggregate the given data. So, the framework must provide features to integrate and efficient access to the reference data. Some frameworks provide ways to internally cache the reference data in memory (Eg. Apache Spark Broadcast Variable). Apache Samza and Apache Flink supports storing the reference data internally in each cluster node, so that jobs can access them internally without connecting to the data base over the network.
Following are the various methods available in the big data streaming frameworks:
3.9.1 In-memory cache : Allows to store reference data inside cluster nodes, so that it improves the performance by reducing the delay in connecting to external data bases.
3.9.2 Per Client Data Base Storage: Allows to store data in 3rd party database systems like MySQL, SQLite,MongoDB etc inside the streaming cluster. Also provides API support to connect and retrieve data from those data bases & provides efficient data base connection methodologies.
3.9.3. Remote DBMS connection: These systems support connecting to the external databases outside the streaming clusters. This is considered to be less efficient due to higher latency introduced due to network connectivity and bottlenecks introduced due to network communication.
3.10 Latency and throughput
Though hardware configuration plays a major role in latency and throughput, some of the design factors of the frameworks affects the performance. The factors are : Network IO, efficient use memory, reduced disk access, in memory cache for reference data. For example, Apache Kafka Streaming API provides higher throughput and low latency due to reduced network I/O, hence the messaging framework and computing engines are in the same cluster. Similarly, Apache Spark uses the memory to cache the data, there by reduces the disk access results in low latency and higher throughput.
4. Feature Comparison Table
Following table provides a comparison Apache streaming frameworks against the above discussed features.
The above frameworks supports both statefull and stateless processing modes.
This article summarizes the various features of the streaming framework, which are critical selection criteria for new streaming application. Every application is unique and has its own specific functional and non-functional requirement, so the right framework is completely depends on the requirement.
6.1 Apache Spark Streaming – http://spark.apache.org/streaming/
6.2 Apache Storm – http://storm.apache.org/
6.3 Apache Flink – https://flink.apache.org/
6.4 Apache Samza – http://samza.apache.org/
6.5 Apache Kafka Streaming API – http://kafka.apache.org/documentation.html#streams