WHITE PAPER
KAFKA VS. KINESIS: Apache Kafka and Amazon Kinesis Comparison and Migration Guide. By: Parviz Deyhim
US: +1 877 773 3306
UK: +44 800 634 3414
HK: +852 3521 0215
CONTENTS
03 Introduction
04 Apache Kafka Overview
05 Amazon Kinesis Overview
06 Core Concepts
08 Cost Comparison
14 Design and Architectural Decisions
18 Application Changes
23 Migrating Platforms
29 Conclusion
30 Appendix A
32 Appendix B
INTRODUCTION Streaming data processing has become increasingly prevalent. As a result, different platforms and frameworks have been introduced to reduce the complexity of requirements such as durable and scalable high-throughput data ingest. While traditional pub-sub messaging frameworks such as RabbitMQ and ActiveMQ have been around to help with those challenges, one solution that has changed the landscape since its inception is Apache Kafka. Apache Kafka, an open-source framework developed at LinkedIn, has been a popular choice for a variety of use-cases such as stream processing and data transformation due to its well-engineered, scalable and durable design. However, one of the shortcomings of Apache Kafka is the lack of cloud-native design for high availability and monitoring. As a result, we've found that running and operating Apache Kafka in a cloud environment requires a great deal of time and effort committed by the operations and engineering teams. An alternative to Apache Kafka with similar features is Amazon Kinesis, a data ingest service hosted and managed by Amazon Web Services (AWS).
Similar to other platform-as-a-service offerings, Amazon Kinesis eliminates the need for developers to manage and operate their own infrastructure. Since the inception of Amazon Kinesis, our clients have been asking the following questions:
• What are the architectural differences between the two systems?
• What are the application and API differences?
• What are the cost differences between the two platforms?
In this document we will answer those questions by examining:
1. The common and important concepts pertaining to Apache Kafka and Amazon Kinesis
2. The economic and technical considerations of using Apache Kafka and Amazon Kinesis
3. The application API differences
For those readers who are interested in migrating from Apache Kafka to Amazon Kinesis, the last section of this document provides sample code to help with the migration process.
03
APACHE KAFKA OVERVIEW Apache Kafka is an open-source distributed pub-sub messaging solution that was initially developed at LinkedIn. Apache Kafka users are responsible for installing and managing their clusters, and also for accounting for requirements such as high availability, durability, and recovery. Apache Kafka consists of multiple nodes referred to as Brokers. Brokers are responsible for accepting messages (leaders) and replicating the messages to the rest of the brokers in the cluster (followers). The distributed nature of Apache Kafka allows the system to scale out and provides high availability (HA) in case of node failure. The membership (leaders and followers) of Brokers in a cluster is tracked and administered via Apache Zookeeper, yet another open-source distributed membership framework.
[Figure: Apache Kafka architecture — producer applications write to Kafka Brokers deployed across Availability Zones / Data Centres, with cluster membership coordinated by Zookeeper nodes; consumer applications read from the Brokers.]
For more details on how Apache Kafka works, please refer to the following guide: Apache Kafka
04
AMAZON KINESIS OVERVIEW Amazon Kinesis, also a pub-sub messaging solution, is hosted by Amazon Web Services (AWS) and provides a similar set of capabilities to Apache Kafka. This section describes the high-level architectural differences between the two systems. Amazon Kinesis is a fully managed service hosted within a given AWS region (e.g. us-east-1) that spans multiple Availability Zones (e.g. us-east-1a). Similar to Apache Kafka, Amazon Kinesis is responsible for accepting end-users' messages and replicating them to multiple Availability Zones for high availability and durability. The fully managed nature of Amazon Kinesis eliminates the need for users to maintain infrastructure or be concerned with details such as replication and other system configuration.
[Figure: Amazon Kinesis architecture — producer applications write to a Kinesis stream that is replicated across multiple Availability Zones / Data Centres; consumer applications read from the stream.]
05
CORE CONCEPTS Throughout this document we'll be referring to platform-specific terms and concepts. The following table provides a summary and mapping of the important concepts in Apache Kafka and the corresponding concepts in Amazon Kinesis. A more detailed comparison is provided in the "Application Changes" section.

Kafka Concept → Kinesis Concept
Topic → Stream
Partition → Shard
Broker → N/A
Apache Kafka Producer → Amazon Kinesis Producer
Apache Kafka Consumer → Amazon Kinesis Consumer
Offset number → Sequence Number
Replication → Not Required
APACHE KAFKA TOPIC VS. AMAZON KINESIS STREAM An Apache Kafka Topic and an Amazon Kinesis Stream both represent an ordered, immutable and partitioned list of messages. New messages are appended to the end of this list, and each message has a unique identifier. APACHE KAFKA PARTITION VS. AMAZON KINESIS SHARD In Apache Kafka, each topic consists of one or more partitions. The Shard is the analogous concept in Amazon Kinesis. The intent of distributing each topic or stream over multiple partitions or Shards is to increase write/read throughput by distributing the load between multiple nodes. APACHE KAFKA REPLICATION VS. N/A Replication provides higher durability and availability in cases where the resource hosting the topic/stream experiences failures. In Apache Kafka, users have the ability to define the topic's replication factor. Amazon Kinesis automatically stores data across multiple Availability Zones synchronously, and as a result users are not required to define a replication strategy.
06
APACHE KAFKA OFFSET NUMBER VS. AMAZON KINESIS SEQUENCE NUMBER Each record within an Apache Kafka topic partition or Amazon Kinesis stream Shard is given a unique number. In Apache Kafka this number is referred to as the "Offset," while in Amazon Kinesis it is referred to as the "Sequence Number." Both platforms guarantee that the offsets or sequence numbers in a given partition or Shard are ordered and sequentially increasing. APACHE KAFKA PRODUCER VS. AMAZON KINESIS PRODUCER The producers are the application components that submit records to Apache Kafka or Amazon Kinesis. The producers handle sending multiple records to the platform, partition data across partitions or Shards, and perform tasks like compression and failure handling.
APACHE KAFKA CONSUMER VS. AMAZON KINESIS CONSUMER The consumers are the application components that fetch records from Amazon Kinesis or Apache Kafka via the provided APIs. Similar to the producer applications, the consumers deal with failures and with reading records from multiple partitions or Shards. APACHE KAFKA BROKER VS. N/A The Brokers are Apache Kafka nodes that host one or more Apache Kafka partitions. Amazon Kinesis is a hosted service, and the nodes hosting the Shards are abstracted from the users.
07
COST COMPARISONS The following section provides an overview of the different cost factors involved in running Apache Kafka or using Amazon Kinesis.
APACHE KAFKA COST FACTORS The cost of running and maintaining an Apache Kafka cluster involves a number of factors that users have to be aware of. It is common to calculate the cost simply as the cost of the underlying hardware, but in order to accurately estimate the cost of hosting Apache Kafka, users also have to include the cost of replication, the effort required to maintain (patch and upgrade) the cluster, monitoring, maintaining dependent systems such as Apache Zookeeper, and maintaining Brokers distributed between multiple datacenters or availability-zones. The following section provides an overview of the factors that should be considered when comparing the cost of hosting Apache Kafka vs. Amazon Kinesis. Hosting Cost Factors The cost of hosting Apache Kafka consists of the following factors: Infrastructure Costs + Data Durability Costs + Maintenance Costs
Before we dive into the costs of hosting Apache Kafka, it's important to note that running and maintaining an Apache Kafka cluster also involves running and hosting a highly available and reliable Apache Zookeeper cluster. Apache Kafka relies on Apache Zookeeper for some of its important and vital functions, so any cost calculation will be widely inaccurate if it does not include the cost of hosting and maintaining an Apache Zookeeper cluster. Infrastructure costs The cost of hosting Apache Kafka includes the cost of running an infrastructure capable of supporting the velocity of the incoming data (in terms of records/sec) and the cost of storing data according to the data retention requirements. Based on our experience, one cost usually outweighs the other: either the velocity of the incoming data requires deploying a cluster whose CPU core, memory and networking needs outweigh the data-retention storage requirements, or the required storage footprint for data retention outweighs the required amount of CPU cores, memory or networking bandwidth. In some larger deployments with high traffic requirements, both factors can be equally important.
08
Durability costs The other factors that should be considered are the costs involved in providing durability and high availability. The cost of durability is directly influenced by the replication factor of the Apache Kafka cluster, which in turn influences the cost of the required storage footprint. For example, with 1 TB per day of incoming data and a replication factor of 3, the total size of the stored data on local disk is 3 TB. While Apache Kafka replication provides durability and protection in case of Broker node failures, it does not protect against data-center or availability-zone outages. In order to protect against data-center/availability-zone outages, Apache Kafka Broker nodes have to be deployed in multiple datacenters or availability-zones. Multi-datacenter/availability-zone deployment introduces additional costs, such as inter-datacenter or inter-availability-zone bandwidth costs.
Managing the Apache Kafka Framework — Apache Kafka Specific Tasks:
• Monitoring and alerting on Apache Kafka Broker failures
• Monitoring and alerting on Apache Kafka resource utilization (Disk, CPU and Memory)
• Monitoring and alerting on Apache Kafka partition throughput
• Migrating Apache Kafka partitions to new nodes to increase throughput
• Tuning Apache Kafka JVM settings
• Scaling Brokers to increase CPU, Memory and Disk resources
• Upgrading the Apache Kafka version
• Recovering/replacing failed Brokers
• Failing over to a different cluster in a different data-center or availability-zone
• Multi-AZ deployment
Maintenance Cost Factors Apache Kafka is a well-engineered framework, and due to its complex engineering nature, there are various factors involved in maintaining a production-grade cluster. The following tables list the high-level tasks involved in maintaining Apache Zookeeper and Apache Kafka clusters. Apache Kafka relies on Apache Zookeeper for some of its internal functions, so it's important to consider the effort required to host both frameworks.
09
Managing the Apache Zookeeper Framework — Apache Zookeeper Specific Tasks:
• Monitoring and alerting on Apache Zookeeper node failures
• Monitoring and alerting on Apache Zookeeper resource utilization (Disk, CPU and Memory)
• Apache Zookeeper JVM tuning
• Scaling Zookeeper nodes to increase CPU, Memory and Disk resources
• Upgrading the Zookeeper version
• Replacing failed Zookeeper nodes
• Multi-AZ deployment
AMAZON KINESIS COST FACTORS Given that Amazon Kinesis is a hosted service, it involves fewer cost factors compared to Apache Kafka. The following section focuses on the cost factors involved in using Amazon Kinesis. Hosting Cost Factors One of the main benefits of using Amazon Kinesis is that users are not responsible for hosting and maintaining a distributed cluster. Infrastructure costs Since Amazon Kinesis is a hosted service, beyond the cost of using the service there are no additional infrastructure costs involved. In terms of storage, users' data is retained for 24 hours at no additional cost; if a longer data retention period is required, users pay additional charges. Durability costs Amazon Kinesis automatically replicates users' data to multiple availability zones for durability. Clients do not have to be concerned with the cost of replication or additional storage costs due to the replication factor.
10
Maintenance Cost Factors As compared to Apache Kafka, the maintenance tasks are limited to a few areas as demonstrated in the table below.
COST COMPARISON EXAMPLE We are generally not in favor of providing pricing information, since it's practically impossible to provide an accurate number that satisfies different use cases. However, we can provide a pricing example for a hypothetical workload.
Apache Kafka Specific Tasks → Amazon Kinesis Specific Tasks
• Monitoring and alerting on Apache Kafka Broker failures → N/A, handled by the Kinesis service
• Monitoring and alerting on Apache Kafka resource utilization (Disk, CPU and Memory) → N/A, handled by the Kinesis service
• Monitoring and alerting on Apache Kafka partition throughput → Monitoring and alerting on CloudWatch Shard metrics
• Migrating Apache Kafka partitions to new nodes to increase throughput → Amazon Kinesis API to add & remove Shards
• Tuning Apache Kafka JVM settings → N/A, handled by the Kinesis service
• Scaling Brokers to increase CPU, Memory and Disk resources → N/A, handled by the Kinesis service
• Upgrading the Apache Kafka version → N/A, handled by the Kinesis service
• Recovering/replacing failed Brokers → N/A, handled by the Kinesis service
• Failing over to a different cluster in a different data-center or availability-zone → N/A, handled by the Kinesis service

We'll use the following requirements to calculate the cost of Apache Kafka, and borrow the traffic estimate to calculate the cost of using Amazon Kinesis.

Table 1.1
Requirement → Value
Estimated Daily Traffic → 1 TB
Apache Kafka Data Retention Days → 7
Data Payload Size → 1 KB
Apache Kafka Replication Factor → 3
* Apache Kafka Monthly Required Storage with 30% headroom → 31.5 TB
** Daily Records/Sec → 11574

* Total storage required after considering the daily total incoming traffic, retention days, the storage headroom and the replication factor.
** Records/Sec = (Daily traffic in KB / 86400) / Payload size.
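The footnoted formulas can be checked with a short calculation. This is a sketch; decimal units (1 TB = 10^9 KB) are assumed, which reproduces the table's records-per-second figure:

```python
# Records/Sec footnote: (daily traffic in KB / 86400) / payload size.
daily_traffic_kb = 1 * 10**9          # 1 TB/day, decimal units assumed
payload_kb = 1                        # 1 KB per record
records_per_sec = (daily_traffic_kb / 86400) / payload_kb
print(round(records_per_sec))         # 11574

# Raw retained storage before headroom: daily TB x retention days x replication.
raw_tb = 1 * 7 * 3
print(raw_tb)                         # 21 (the table's 31.5 TB figure additionally includes headroom)
```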
11
Apache Kafka Costs Given the requirements above and what we've discussed in the previous sections, we can estimate the Apache Kafka costs as follows:

Cost Factors → Value → Price
* Broker d2.xlarge EC2 Instances (Annual) → 6 → $10,390.00
** Zookeeper c3.xlarge EC2 Nodes (Annual) → 3 → $651
*** Between availability-zone bandwidth cost (Annual) → 744 TB → $7,618.56
**** Maintenance Cost (Annual) → 0.3 FTE → $30,000.00
3-yr total cost → — → $145,978.68

* We used Amazon d2.xlarge 3-yr upfront-reserved instances to calculate this cost. We believe this pricing model is a close estimation of hardware server/storage prices. $5,195 3-yr upfront payment / 3 = $1,731.67 annual cost × 6 instances = $10,390.00.
** As discussed above, it is critical to include the cost of maintaining Apache Zookeeper alongside Apache Kafka in our calculations. In this example, we're using c3.xlarge instances to host the Apache Zookeeper nodes.
*** The cost of sending traffic between multiple brokers hosted in 3 availability zones with a replication factor of 3: 1 TB per day of traffic replicated to 2 other availability zones = 2 TB per day of inter-AZ traffic; 2 TB × 31 days × 12 months. Refer to the Amazon monthly price calculator for more details.
**** Based on our experience, we believe 30% of a DevOps engineer's time is required to support the activities mentioned in the Apache Kafka "Maintenance Cost Factors" section. We've assumed a $100K/year salary to calculate the cost.
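The annual figures combine into the 3-year total as follows. This is a sketch; the $0.01/GiB inter-AZ transfer rate and the binary TB-to-GiB conversion are assumptions that reproduce the bandwidth figure:

```python
# Annual Apache Kafka cost factors (a sketch of the table's arithmetic).
broker_annual = 5195 / 3 * 6                  # d2.xlarge 3-yr upfront / 3 yrs x 6 brokers
zookeeper_annual = 651.0                      # 3 c3.xlarge Zookeeper nodes
bandwidth_annual = 2 * 1024 * 31 * 12 * 0.01  # 2 TiB/day inter-AZ x 31 days x 12 months x $0.01/GiB (assumed rate)
maintenance_annual = 0.3 * 100_000            # 30% of a $100K/yr DevOps engineer

annual = broker_annual + zookeeper_annual + bandwidth_annual + maintenance_annual
print(round(annual, 2))       # 48659.56
print(round(annual * 3, 2))   # 145978.68 (the 3-yr total in the table)
```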
12
Amazon Kinesis Costs Using the traffic numbers defined in Table 1.1, we can estimate the cost of using Amazon Kinesis:

Cost Factors → Value → Price
* Number of Shards Required (Annual) → 12 → $7,356
** Maintenance Cost (Annual) → 0.1 FTE → $10,000
3-yr total cost → — → $92,068

* The number of Shards required to support the incoming traffic of 11,574 records/sec with a 1 KB payload. Refer to Amazon Kinesis pricing for the Shard price calculation.
** Similar calculation method to the Apache Kafka maintenance costs. We're assuming 10% of a DevOps engineer's time should be dedicated to supporting Amazon Kinesis maintenance activities.
*** The costs associated with the producer/consumer application changes.
**** The cost of training engineers.

APACHE KAFKA COST BENEFITS As demonstrated by the example above, we believe that for the majority of workloads, using Amazon Kinesis is financially beneficial. However, there are cases where Apache Kafka can be more cost effective. One example of such a scenario is when the incoming traffic consists of small payloads with a high number of records-per-sec and a short data retention period. Because the majority of the cost of hosting Apache Kafka is driven by the amount of storage required to host the retained data, in the scenario where the payload is small and the retention time is limited, the storage requirements are minimal. In such cases, Apache Kafka may prove to be more cost effective.
13
DESIGN AND ARCHITECTURAL DECISIONS This section walks through performance, scalability, durability and delivery semantics of both platforms. We also provide an example of creating an Amazon Kinesis Stream that provides similar characteristics as our existing Apache Kafka cluster.
In addition to increasing the capacity, given that Apache Kafka holds historical data, users may be required to increase the disk footprint capacity of the cluster. The process of adding more disk capacity is achieved by adding more nodes to the cluster.
SCALABILITY Both Apache Kafka and Amazon Kinesis rely on the concept of replicated partitions to provide linear scalability. More specifically both frameworks provide the ability for the users to partition the data to multiple distinct groups and the system handles the replication of the data to multiple nodes. In the case of Apache Kafka, the scalability of each partition depends on the number of CPU cores, the amount of memory and the performance of the local disks of the node hosting that partition.
In the case of Amazon Kinesis, given the hosted nature of the service, the throughput of each Shard is pre-advertised by the Amazon Kinesis team. Currently each Amazon Kinesis Shard provides 1,000 PUT records-per-sec or 1 MBps of write traffic, and 2 MBps or 5 transactions-per-sec of read traffic. An example of calculating the number of Amazon Kinesis Shards is provided in the following section.
In Apache Kafka, in order to increase the throughput of the system, the users have to add more hardware capacity to the cluster and migrate the existing partitions to the newly added resources. This process assumes that there are more partitions than the number of cluster nodes. Otherwise, to increase the throughput of the cluster, one has to add more resources to the existing resources, also known as scaling up.
Increasing the scalability of Amazon Kinesis is easier than Apache Kafka. In order to increase the throughput of a given Amazon Kinesis stream, more Shards can be added by splitting the existing Shards. The benefit of the Amazon Kinesis throughput model is that users have prior knowledge of the exact performance numbers to expect from every provisioned Shard. In contrast, Apache Kafka throughput numbers depend on the type of the resource hosting the Apache Kafka nodes. In most cases users have to perform a load test to find out the throughput numbers each node
14
can sustain. This process can be error prone and at times can cause overprovisioning of Apache Kafka clusters. The disadvantage of the Amazon Kinesis throughput model is that the current read limit of 5 transactions-per-second limits how many applications can read from Amazon Kinesis at any given time. In other words, if you have multiple applications that each need to pull data once a second from all Amazon Kinesis Shards, the maximum number of applications that can be supported by a single Amazon Kinesis Stream is 5. In order to increase the number of concurrent applications consuming all Amazon Kinesis Shards, one has to limit how often each application reads from Amazon Kinesis to stay below the 5 reads-per-sec per Shard limit. Alternatively, users can increase the number of Shards to increase the overall number of fetches/sec allowed by the Amazon Kinesis stream. An example of calculating the required number of Shards to enable parallel reads is provided in the "Architecting An Amazon Kinesis Stream" section.
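The read-limit arithmetic can be sketched in a few lines. This is illustrative only; the 5 reads-per-second figure is the per-Shard limit described above:

```python
# Per-Shard read limit described above: 5 transactions (reads) per second.
READS_PER_SEC_PER_SHARD = 5

# If every application polls every Shard once per second, each consumes
# 1 read/sec of the 5 reads/sec budget, so at most 5 apps fit:
max_apps = READS_PER_SEC_PER_SHARD // 1
print(max_apps)  # 5

# Alternatively, with more applications, each must poll less frequently:
num_apps = 8
max_poll_rate = READS_PER_SEC_PER_SHARD / num_apps  # polls/sec per application
print(max_poll_rate)  # 0.625
```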
LATENCY Latency is defined as the time elapsed from the moment a given record is accepted by the platform until the time a consumer is able to read that same record.
Amazon Kinesis is a better fit for use-cases with higher throughput and larger payloads than for smaller payloads with a higher records-per-sec rate (see the cost comparison section).
DURABILITY As mentioned in the previous sections, Apache Kafka provides durability by replicating data to multiple Broker nodes. Amazon Kinesis provides the same durability guarantees by replicating the data to multiple availability zones. The major difference between the two systems is that users need to configure and control the Apache Kafka replication strategy, while Amazon Kinesis replication is handled by Amazon. DELIVERY SEMANTICS Both Apache Kafka and Amazon Kinesis provide at-least-once delivery semantics. More accurately, both systems may at times deliver duplicate records to the consumers. In most cases the reason for duplicated records is a retry at the producer or consumer level. Enforcing idempotency within the consumer application can produce exactly-once semantics; the details of writing idempotent applications are beyond the scope of this document. What users should be aware of is that there is a potential for duplicate messages, and consumers have to tolerate such scenarios.
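One common way to enforce idempotency is to track identifiers of records that have already been processed. The sketch below is a minimal in-memory illustration; the `handle` function is hypothetical, and a production consumer would persist this state durably and bound its size:

```python
# In-memory idempotency sketch: skip records whose offset/sequence number
# has already been processed (hypothetical helper, not a library API).
seen = set()

def handle(record_id, payload):
    """Process a record at most once, tolerating at-least-once redelivery."""
    if record_id in seen:
        return False  # duplicate delivery: ignore
    seen.add(record_id)
    # ... apply the record's side effect here (e.g. write to a database) ...
    return True

print(handle(101, "event-a"))  # True  (first delivery is processed)
print(handle(101, "event-a"))  # False (redelivered duplicate is skipped)
```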
Apache Kafka can be configured to achieve < 1 second latency depending on the cluster and the producer/consumer configurations. Amazon Kinesis latency has been shown to be in the range of 1–5 seconds. Applications that require < 1 second latency are not an ideal use-case for Amazon Kinesis.
15
Architecting An Amazon Kinesis Stream The following example demonstrates how to design an Amazon Kinesis Stream given the cluster configuration provided below:
In order to meet the performance of our existing Apache Kafka cluster, we need to ensure that our Amazon Kinesis stream can sustain the rate of the incoming and outgoing traffic by calculating the required number of Amazon Kinesis Shards to provide the same level of performance/throughput.
Requirements → Values
1. Daily Ingested Data Size (TB) → 0.5
2. Retention days → 7
3. Replication Factor → 2
4. Payload Size (KB) → 2
5. Number of Apache Kafka Partitions → 5
6. Number of consumer applications → 3
7. Fetches per sec per consumer application → 1
8. Incoming traffic rate (KBps) → 5787
9. Incoming traffic rate (Rec/sec) → 2894
10. Per Apache Kafka partition incoming traffic rate (KBps) → 1157
11. Per Apache Kafka partition incoming traffic rate (Rec/sec) → 579
12. Fetch per consumer application (KBps) → 5787
13. Number of Amazon Kinesis Shards required (Writes) → 6
14. Number of Amazon Kinesis Shards required (Reads) → 9
15. Number of Amazon Kinesis Shards required → 9

Incoming traffic (Write)
A simple calculation tells us how many Shards are required to sustain our incoming traffic:
Number of Shards to support the incoming traffic = MAX (KBps/1000, Records-per-sec/1000)
Number of Shards to support the incoming traffic = MAX (5787/1000, 2894/1000)
Number of Shards to support the incoming traffic = MAX (~6, ~3)
Number of Shards to support the incoming traffic = 6
Using six Shards, our Amazon Kinesis stream can sustain up to 6,000 KBps or 6,000 records/sec, which matches our existing Apache Kafka cluster's incoming traffic.
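The same calculation in code (a sketch; `math.ceil` stands in for the rounding-up implied by the "~6" above):

```python
import math

# Shards to sustain the incoming (write) traffic, per the formula above.
kbps, records_per_sec = 5787, 2894
write_shards = max(math.ceil(kbps / 1000), math.ceil(records_per_sec / 1000))
print(write_shards)  # 6
```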
16
Outgoing traffic (Read)
Our current Apache Kafka cluster supports three consumer applications, each fetching at the rate of once per second, or 5787 KBps. Given these numbers, we can calculate the number of Shards required to support our Amazon Kinesis consumers:
Number of Shards to support the outgoing traffic = MAX (KBps/2000, Consumer fetches-per-sec/5)
Number of Shards to support the outgoing traffic = MAX (5787×3/2000, 1×3/5)
Number of Shards to support the outgoing traffic = MAX (17361/2000, 3/5)
Number of Shards to support the outgoing traffic = MAX (~9, ~1)
Number of Shards to support the outgoing traffic = 9
Using 9 Shards, our Amazon Kinesis stream can sustain up to 18,000 KBps or 45 fetches/sec, which matches our existing Apache Kafka cluster's outgoing traffic.

Total Shards required
Now that we've calculated how many Shards are required to support both our incoming and outgoing traffic, we can calculate the total number of Shards:
Total Shards = MAX (# Shards to support incoming traffic, # Shards to support outgoing traffic)
Total Shards = MAX (6, 9)
Total Shards = 9
17
APPLICATION CHANGES Let's quickly walk through the differences between producer and consumer applications in Apache Kafka and Amazon Kinesis.
PRODUCER CHANGES Both Apache Kafka and Amazon Kinesis producers perform a similar set of high-level tasks, such as:
1. Accepting records from higher-level applications
2. Performing batching and/or compression
3. Partitioning the records between partitions/Shards
4. Submitting record(s) to Apache Kafka or Amazon Kinesis

Note: Users can use the Amazon Kinesis API or the KPL (Kinesis Producer Library) to interact with Amazon Kinesis producer APIs. The comparison below assumes the use of KPL rather than the direct API.

However, despite the similarities, there are some important differences that users should be aware of. This section provides a comparison of the features and behaviors of Apache Kafka and Amazon Kinesis producers. In order to better organize the producer behaviors and logic, we've divided the actions into the following areas:
• Submitting records
• Compression and batching
• Failure handling
• Backpressure handling [TBD]
18
Apache Kafka → Amazon Kinesis

API Class
• Apache Kafka: KafkaProducer
• Amazon Kinesis: KinesisProducer

Submitting Records — General Behavior
• Apache Kafka: KafkaProducer.send() has async behavior, returning immediately. It also provides a callback method to act on completed records.
• Amazon Kinesis: KinesisProducer.addUserRecord() has async and sync behaviors, returning a Future object as the result of the call. Users can block on the Future to wait for the status of the submission, or use its async capabilities to check the status as events are provided to the user code.

Submitting Records — Destination
• Apache Kafka: KafkaProducer sends records to each partition's Broker leader.
• Amazon Kinesis: KinesisProducer sends records to a single API endpoint regardless of the number of Shards.

Record Completeness
• Apache Kafka: The "acks" configuration setting controls whether a record is considered successfully submitted when one or all Brokers have received it.
• Amazon Kinesis: The Amazon Kinesis API synchronously replicates data across three facilities in an AWS Region and returns 200 OK to the user.

Record Partitioning
• Apache Kafka: Users can provide a partition number on a per-record basis to specify the exact partition records should be submitted to. The default partitioning logic is hash(key) % numPartitions, and a custom partitioner can be provided to the client using "partitioner.class".
• Amazon Kinesis: Partition keys are Unicode strings that associate data records with Shards using the hash-key ranges of the Shards. An MD5 hash function maps partition keys to 128-bit integer values, which in turn map data records to Shards. Users can override hashing of the partition key by explicitly specifying a hash value using the ExplicitHashKey parameter.

Batching
• Apache Kafka: KafkaProducer.send() adds records to a memory buffer. The "batch.size" setting controls how many records to keep before submitting, and the "linger.ms" parameter controls how often to send batches.
• Amazon Kinesis: KinesisProducer.addUserRecord() buffers records in memory until it's ready to submit. The "RecordMaxBufferedTime" configuration parameter controls how long records are buffered and how often batched records are submitted.
19
Compression
• Apache Kafka: The client can compress data before sending to Kafka if "compression.type" is set.
• Amazon Kinesis: The Kinesis Producer Library (KPL) does not provide compression, but users can compress records in their application logic.

Failure Handling
• Apache Kafka: The client will retry failed submissions if configured via the "retries" config parameter. KafkaProducer.send() provides a callback method to act on record failures.
• Amazon Kinesis: The client will retry failed submissions. KinesisProducer.addUserRecord() returns Future objects that users can use to check the status of record submissions.
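The two partitioning schemes from the table can be sketched side by side. This is illustrative only: Kafka's default partitioner actually uses murmur2 (CRC32 is a stand-in hash here), and the Kinesis-style mapping assumes the Shards split the 128-bit MD5 hash-key range evenly:

```python
import hashlib
import zlib

def kafka_partition(key: bytes, num_partitions: int) -> int:
    """Shape of Kafka's default partitioner: hash(key) % numPartitions.
    (Kafka actually uses murmur2; CRC32 is a stand-in hash here.)"""
    return zlib.crc32(key) % num_partitions

def kinesis_shard(partition_key: str, num_shards: int) -> int:
    """Kinesis-style mapping: MD5 of the partition key -> 128-bit integer,
    then the Shard whose hash-key range contains it (even split assumed)."""
    h = int.from_bytes(hashlib.md5(partition_key.encode("utf-8")).digest(), "big")
    range_size = 2**128 // num_shards
    return min(h // range_size, num_shards - 1)

# The same key always lands on the same partition/Shard:
print(kafka_partition(b"user-42", 5), kinesis_shard("user-42", 5))
```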
CONSUMER CHANGES This section provides an overview of the consumer application differences between Apache Kafka and Amazon Kinesis. Generally speaking, consumers perform the following tasks:
1. Consuming records from Apache Kafka or Amazon Kinesis
2. Performing an action, such as writing to a database or transforming the record to some other format
3. Performing load-balancing and failure handling
While at a high level both Apache Kafka and Amazon Kinesis consumers perform similar tasks, there are differences that users should be aware of. The rest of this section focuses on the differences, both conceptually and at the API level. In order to better organize the consumer behaviors and logic, we've divided the consumer actions into the following high-level areas:
• Load balancing: how Kafka and Kinesis distribute reading between multiple consumers
• Offset control: the logic that both Kafka and Kinesis follow to track which records have been processed throughout the life of the consumer
• Failure handling: how Kafka and Kinesis handle various failure scenarios
Note: Users can use the Amazon Kinesis API or the KCL (Kinesis Client Library) to interact with Amazon Kinesis consumer APIs. The comparison below assumes the use of KCL rather than the direct API.
20
API Class
• Apache Kafka: KafkaConsumer
• Amazon Kinesis: A user-implemented KCL IRecordProcessor

Consuming Records
• Apache Kafka: KafkaConsumer.poll() consumes records from Kafka.
• Amazon Kinesis: The user-implemented IRecordProcessor.processRecords() is handed a list of records to process.

Load-Balancing
• Apache Kafka: Kafka load-balances records between subscribers in a given "Consumer Group". Each consumer can have one or more partitions assigned to it. Assigning partitions to consumers is automatic and happens at the Kafka Broker level.
• Amazon Kinesis: Each instance of a KCL application uses a KCL worker, which in turn creates one KCL RecordProcessor per Kinesis Shard. A single application instance therefore creates multiple RecordProcessors to handle reading from multiple Shards, and when multiple KCL application instances are deployed, the work of reading from Shards is divided between them automatically. Assigning Shards to KCL workers happens at the client level and is coordinated with the help of a DynamoDB table that holds each worker's state.

Offset Control
• Apache Kafka: Kafka allows consumers to persist their offset position outside of Kafka (e.g. in Zookeeper). Consumers can automatically checkpoint their position in the stream to Kafka, but in that case they can't handle failures at the record level. Consumers can also manually checkpoint their positions to Kafka using KafkaConsumer.commitSync or KafkaConsumer.commitAsync, which allows for record-level failure handling.
• Amazon Kinesis: The KCL leverages DynamoDB to persist the sequence number of each KCL worker. Using external storage other than DynamoDB is not supported via the KCL; users have to use Kinesis's APIs to develop their own logic. The KCL does not provide an automatic checkpoint capability, but users can develop similar logic with the KCL's manual checkpoint feature, IRecordProcessorCheckpointer.checkpoint, to persist their position in the stream.

Consumer Position Control
• Apache Kafka: Kafka allows consumers to manually control their offset position at any given time using KafkaConsumer.seek and KafkaConsumer.seekToBeginning. In other words, Kafka consumers can move forwards or backwards in the stream.
• Amazon Kinesis: The KCL provides the ability to start from the beginning or the end of the stream. If the ability to seek to a specific sequence number is required, the KCL cannot be used; users have to use the following direct Kinesis APIs: GetShardIteratorRequest.setShardIteratorType("AT_SEQUENCE_NUMBER" or "AFTER_SEQUENCE_NUMBER") and GetShardIteratorRequest.setStartingSequenceNumber(specialSequenceNumber).
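The practical consequence of manual checkpointing is easiest to see in a small simulation. The sketch below is plain Scala with no Kafka or KCL dependencies, and every name in it is illustrative: it processes records and commits its position only after each record succeeds, so a crash mid-stream causes a replay from the last committed offset rather than data loss — the same guarantee KafkaConsumer.commitSync or IRecordProcessorCheckpointer.checkpoint gives a real consumer.

```scala
object CheckpointSimulation {
  // Consume records starting at startOffset, committing the offset only
  // after each record is successfully processed (the manual-commit pattern).
  // Returns (last committed offset, records processed in this run).
  def consume(records: Seq[String],
              startOffset: Int,
              failAt: Option[Int]): (Int, Seq[String]) = {
    var committed = startOffset
    val processed = scala.collection.mutable.Buffer[String]()
    for (i <- startOffset until records.length) {
      if (failAt.contains(i)) return (committed, processed.toSeq) // simulated crash
      processed += records(i)
      committed = i + 1 // checkpoint after successful processing
    }
    (committed, processed.toSeq)
  }

  def main(args: Array[String]): Unit = {
    val stream = Seq("r0", "r1", "r2", "r3")
    // First run crashes before processing r2.
    val (offset, first) = consume(stream, 0, failAt = Some(2))
    // A restart resumes from the last committed offset: no loss, no gap.
    val (_, second) = consume(stream, offset, failAt = None)
    println(first ++ second)
  }
}
```

With auto-commit, the offset could have advanced past r2 before it was processed, and the restart would silently skip it.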
Failure Handling

Broker Failure
• Apache Kafka: The Kafka client transparently handles Kafka broker failures and adapts as partitions migrate within the cluster.
• Amazon Kinesis: Handled by the Kinesis service APIs; consumers don't have to handle this failure.

Consumer Failure
• Apache Kafka: Kafka Brokers perform health-checks on consumers by tracking consumer heartbeats and re-balance the partitions between the remaining consumers if there are failures.
• Amazon Kinesis: In the case of KCL worker failures, the existing healthy KCL workers spawn new record processors to take over the failed processors. The coordination and health-checks happen through DynamoDB and the concept of a lease. If consumers are not leveraging the KCL, the failure handling described here has to be implemented by the users.
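The KCL's lease mechanism can be approximated with a simple model: each shard has a lease carrying an owner and a last-renewed timestamp, and any healthy worker may take over a lease whose heartbeat has expired. A minimal sketch of that idea in plain Scala — DynamoDB is replaced by an in-memory map, and all names and the timeout value are illustrative:

```scala
object LeaseSimulation {
  case class Lease(shardId: String, owner: String, lastRenewedMs: Long)

  // A lease is considered expired if it hasn't been renewed within this window.
  val failoverTimeMs = 10000L

  // A healthy worker scans the lease table and takes over any expired lease,
  // mirroring how KCL workers coordinate through a shared DynamoDB table.
  def takeOverExpired(leases: Map[String, Lease],
                      me: String,
                      nowMs: Long): Map[String, Lease] =
    leases.map { case (shard, lease) =>
      if (nowMs - lease.lastRenewedMs > failoverTimeMs)
        shard -> Lease(shard, me, nowMs) // failed worker: steal the lease
      else
        shard -> lease
    }
}
```

In the real KCL the takeover is a conditional write to DynamoDB, so two workers racing for the same expired lease cannot both win.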
MIGRATING PLATFORMS So far we've covered the differences between Apache Kafka and Amazon Kinesis in both technical and economic terms. While we think both systems have their own relevant use-cases, we believe that for the majority of workloads there is a justifiable case, in terms of cost reduction and eliminated maintenance effort, for migrating to Amazon Kinesis. To migrate from Apache Kafka to Amazon Kinesis, first review the previous sections to get a better understanding of the required Producer and Consumer application changes. The rest of this section provides high-level guidance on how to migrate existing data from Apache Kafka to Amazon Kinesis, along with sample code demonstrating the data migration process.

Important concepts The following concepts should be considered before architecting a solution for copying data from Apache Kafka to Amazon Kinesis.

The copy process should reach equilibrium: The copy process should be architected so that at some point the system reaches equilibrium, meaning the rate of copying data to Amazon Kinesis reaches or exceeds the rate at which data is being ingested by Apache Kafka. A copy process that never reaches equilibrium will never conclude. To ensure equilibrium is reached, it is important to know the rate at which data is being ingested into Apache Kafka and to ensure the copy process can meet or exceed that rate. A convenient way to guarantee equilibrium is to stop the Apache Kafka producers and let the copy process conclude. However, this may not be acceptable: stopping producers creates delay in the data processing pipeline and forces incoming data to be stopped or spooled at the source, which can potentially result in data loss.

The copy process should provide at minimum an at-least-once delivery semantic: The process of copying data to Amazon Kinesis, including reading from Apache Kafka and writing to Amazon Kinesis, should support an at-least-once delivery semantic. This ensures that in case of failure, whether on read or write, the copy process does not lose data.

The consumer applications need to be idempotent: Due to the complexity of the copy process, where multiple systems are involved and the incoming and outgoing data are read and written over the network, it is possible for the copy process to introduce duplicate records.
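Whether the copy ever concludes can be checked with simple arithmetic: with an ingest rate into Kafka of r_in MB/s, a copy rate of r_copy MB/s, and a backlog of B MB, the copy catches up only when r_copy > r_in, after roughly B / (r_copy − r_in) seconds. A small sketch of that calculation in plain Scala (the figures in the comment are illustrative):

```scala
object Equilibrium {
  // Returns the catch-up time in seconds, or None if the copy process
  // can never reach equilibrium (copy rate <= ingest rate).
  def catchUpSeconds(backlogMB: Double,
                     ingestMBps: Double,
                     copyMBps: Double): Option[Double] =
    if (copyMBps > ingestMBps) Some(backlogMB / (copyMBps - ingestMBps))
    else None

  // e.g. a 600 MB backlog, 2 MB/s ingest, 5 MB/s copy rate:
  // catchUpSeconds(600, 2, 5) == Some(200.0), i.e. ~3.5 minutes to drain.
}
```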
In the majority of cases, duplicate records are due to the failure-recovery and retry strategies that the copy process has in place. Because of the potential for duplicate records, the consumer applications should be idempotent (at-least-once semantic).

The copy process should handle backpressure: It's important for the copy process to be able to handle backpressure and adjust to Amazon Kinesis API throttling. Backpressure is the side effect of a network slowdown or a slowdown of the producer application that writes to Amazon Kinesis. In either scenario, the rate of incoming data exceeds the rate of outgoing data and can potentially cause system instability.

The copy process should be able to recover from failures: It is important for the copy process to be able to recover from failures. There are three common failures that should be handled by the copy process: 1. The copy process itself dies 2. The resource(s) hosting the copy process die 3. The network is unreachable. During any of the above scenarios, the copy process SHOULD NOT: 1. Experience data loss (discussed above) 2. Lose its position in the stream and be forced to restart from the beginning of the stream*

* Restarting from the beginning of the stream creates duplicate records, and while idempotent consumers can handle duplicates, it can also cause data processing lag and other processing issues.
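Consumer-side idempotence is usually achieved by keying each record with a unique identifier and ignoring identifiers that have already been applied. A minimal sketch in plain Scala — in production the seen-set would live in a durable store such as DynamoDB or the target database, and all names here are illustrative:

```scala
object IdempotentConsumer {
  // Applies records keyed by a unique id; duplicates introduced by the
  // copy process's retries are silently dropped, so at-least-once
  // delivery becomes effectively exactly-once at the destination.
  def apply(records: Seq[(String, String)]): Seq[String] = {
    val seen = scala.collection.mutable.Set[String]()
    records.flatMap { case (id, payload) =>
      if (seen.add(id)) Some(payload) // first occurrence: apply it
      else None                       // duplicate: ignore it
    }
  }
}
```

For example, a retried batch that redelivers record "1" produces no duplicate output: IdempotentConsumer(Seq(("1", "a"), ("2", "b"), ("1", "a"))) yields only Seq("a", "b").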
MIGRATING DATA WITH APACHE SPARK STREAMING To support the process of copying data between Apache Kafka and Amazon Kinesis, we've decided to use Apache Spark Streaming, which supports the important factors mentioned above. Let's quickly review how Apache Spark handles the cases described in the previous section.

The copy process should reach equilibrium: Apache Spark can help guarantee equilibrium by distributing the copy process across multiple distinct nodes. The distributed nature of Apache Spark provides the performance and throughput required to create equilibrium between the two systems.

The copy process should provide at minimum an at-least-once delivery semantic: Spark Streaming provides an at-least-once delivery semantic while reading from Apache Kafka, holding data in memory during the copy, and writing to Amazon Kinesis.

The copy process should handle backpressure: Spark Streaming has the ability to rate-limit reading from Apache Kafka in cases where writing to Amazon Kinesis has slowed down.

The copy process should be able to recover from failures: Spark Streaming can recover from node failures without data loss and can resume from the last record that was successfully copied to Amazon Kinesis.
PREPARATION In preparation for migrating the data stored in Apache Kafka to Amazon Kinesis, we'll perform the following tasks: 1. Calculate the number of Amazon Kinesis Shards 2. Create the Amazon Kinesis Stream and Shards 3. Create an Amazon EMR Spark cluster

Calculate the number of Amazon Kinesis Shards to create Refer to the previous section ("Architecting An Amazon Kinesis Stream") for an example of calculating the number of Kinesis Shards. Based on the example provided, our Amazon Kinesis Stream requires 9 Shards.
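The shard count falls out of the published per-shard limits: each Kinesis shard accepts up to 1 MB/s or 1,000 records/s of writes and serves up to 2 MB/s of reads, so the stream needs enough shards to cover the largest of the three ratios. A sketch of that calculation in plain Scala (the traffic figures in the example comment are illustrative, not taken from the earlier section):

```scala
object ShardSizing {
  // Per-shard Kinesis limits: 1 MB/s write, 1000 records/s write, 2 MB/s read.
  def requiredShards(ingressMBps: Double,
                     recordsPerSec: Double,
                     egressMBps: Double): Int =
    List(ingressMBps / 1.0, recordsPerSec / 1000.0, egressMBps / 2.0)
      .map(x => math.ceil(x).toInt)
      .max

  // e.g. 9 MB/s of writes at 9000 records/s, read by a single consumer:
  // requiredShards(9.0, 9000.0, 9.0) == 9
}
```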
Create the Amazon Kinesis Stream and Shards Refer to the following guide on how to create Amazon Kinesis streams: http://docs.aws.amazon.com/streams/latest/dev/learning-kinesis-module-one-create-stream.html

Create an Amazon EMR Spark cluster To execute the Apache Spark Streaming code (below) we'll use Amazon EMR to create a Spark cluster. For demonstration purposes, a one- or two-node EMR cluster suffices. Refer to the following guide to create an Amazon EMR cluster: http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-spark-launch.html

Spark Streaming Data Migration Code The following sample code demonstrates how to take advantage of Apache Spark to copy data from Apache Kafka to Amazon Kinesis. Note: it is important to reiterate that this is sample code, not meant to be used in production without further modification and without implementing some of the throttling logic. We first configure the core Apache Spark component (SparkContext) and set the checkpoint HDFS directory. The checkpoint directory is where Spark Streaming persists important metadata to avoid data loss in case of cluster interruptions (the copy process should be able to recover from failures):
/* Apache Spark Settings */
val conf = new SparkConf().setMaster("yarn").setAppName("SparkDataCopy")
val sc = new SparkContext(conf)
val checkPointDir = "/spark/checkpoint/sparkkafkacopy/"

Next we set up the Amazon Kinesis configuration. In this example we're using the "KafkaKinesisMigration" stream created in the "us-west-2" AWS region:

/* Amazon Kinesis Settings */
val region = "us-west-2"
val streamName = "KafkaKinesisMigration"

Similarly we'll create the Apache Kafka-specific configuration. Remember to replace the brokerList string with your Apache Kafka Broker hostnames:

/* Apache Kafka Settings */
val topics = "KafkaKinesisMigration2"
val brokerList: String = "hostname1:9092,hostname2:9092"
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String](
  "metadata.broker.list" -> brokerList,
  "key.deserializer" -> "org.apache.kafka.common.serialization.ByteArrayDeserializer",
  "value.deserializer" -> "org.apache.kafka.common.serialization.ByteArrayDeserializer")
As mentioned in the previous section, our data copy logic has to be able to handle backpressure (the copy process should handle backpressure). One scenario where we can experience backpressure is when Apache Spark is copying data to Amazon Kinesis at a higher rate than the Amazon Kinesis APIs allow. As mentioned previously, the total Amazon Kinesis Stream throughput depends on the total number of Shards created. The following code keeps track of the number of messages and bytes that have been successfully copied to Amazon Kinesis. Later on we'll use these metrics to slow down the copy process if we're close to the Amazon Kinesis Shard capacity:

/* Tracking how many bytes and messages have been sent */
val bytesSent = sc.accumulator[Long](0, "BytesSent")
val messegesSent = sc.accumulator[Long](0, "MessagesSent")
val startTimeMs = sc.accumulator[Long](System.currentTimeMillis, "StartTimeMs")

The important data extraction and copy logic happens in the following two sections. We first read messages from Apache Kafka:

val messages = KafkaUtils.createDirectStream[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](
  ssc, kafkaParams, topicsSet)

And later copy the messages to Amazon Kinesis:

val producer = KinesisConnection.getConnection()
partitionOfRecords.foreach(msg => {
  val (key, value) = Helper.getKV(msg._1, msg._2)
  msgSize = value.size
  val futureRes = producer.addUserRecord(streamName, key.toString(), ByteBuffer.wrap(value.getBytes()))
  Futures.addCallback(futureRes, myCallback)
The main logic that handles backpressure is implemented here:

val (msgPerSec, bytePerSec) = Helper.performanceMetrics(startTimeMs.value, messegesSent.value, bytesSent.value)
logger.debug("MsgPerSec: " + msgPerSec)
logger.debug("BytesPerSec: " + bytePerSec)

while (msgPerSec > msgPerSecThrottle || bytePerSec > bytesPerSecThrottle) {
  Helper.throttle(500)
}

while (producer.getOutstandingRecordsCount >= 10000) {
  logger.info("BACKPRESSURE INVOKED: " + producer.getOutstandingRecordsCount)
  Helper.throttle(1000)
}

The full sample code is provided in Appendix B.
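Helper is a class in this sample project rather than a library API. A plausible implementation of its rate-tracking pieces looks like the following plain-Scala sketch — the per-shard figures are the published Kinesis write limits (1,000 records/s and 1 MB/s per shard); everything else, including the method names, is an assumption matching the sample's usage:

```scala
object ThrottleHelper {
  // Aggregate write thresholds for a stream: per-shard Kinesis write
  // limits are 1000 records/s and 1 MB/s, multiplied by the shard count.
  def writeThrottleThresholds(numShards: Int): (Long, Long) =
    (numShards * 1000L, numShards * 1024L * 1024L)

  // Average throughput since the copy started; the copy loop compares
  // these against the thresholds and sleeps when it is running hot.
  def performanceMetrics(startTimeMs: Long,
                         messagesSent: Long,
                         bytesSent: Long,
                         nowMs: Long): (Double, Double) = {
    val elapsedSec = math.max((nowMs - startTimeMs) / 1000.0, 0.001)
    (messagesSent / elapsedSec, bytesSent / elapsedSec)
  }
}
```

Note this tracks the average rate since start; a production implementation would use a sliding window so an early burst doesn't throttle the process long after the burst has passed.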
CONCLUSION Both Apache Kafka and Amazon Kinesis are well-engineered solutions that help meet stream-processing requirements. Apache Kafka is an open-source solution in which users have the flexibility to configure different aspects of the platform; however, users are also tasked with maintaining the Apache Kafka infrastructure. Amazon Kinesis, on the other hand, provides a similar set of capabilities, but since it's a managed offering, it trades some flexibility for eliminating the need for users to maintain infrastructure. When deciding on the right solution, users have to balance flexibility, cost, and API features. In this document we argued that Amazon Kinesis, given its hosted nature, provides a lower cost of maintenance, while Apache Kafka provides a richer API and greater flexibility at the cost of higher hosting and maintenance effort. We hope that by reading this document one can evaluate the cost and flexibility factors of each solution and decide the best path forward for their specific workloads.
APPENDIX A Apache Kafka and Amazon Kinesis Producer and Consumer Configuration Comparison

Producer Configuration Comparison (Apache Kafka setting -> Amazon Kinesis equivalent)

bootstrap.servers -> Kinesis endpoint URL
receive.buffer.bytes -> N/A
request.timeout.ms -> RequestTimeout
key.serializer -> N/A
value.serializer -> N/A
sasl.kerberos.service.name -> N/A
timeout.ms -> AWS SDK Configuration
acks -> N/A
block.on.buffer.full -> N/A
buffer.memory -> 5MB
compression.type -> N/A
max.in.flight.requests.per.connection -> N/A
metadata.fetch.timeout.ms -> N/A
metadata.max.age.ms -> N/A
client.id -> N/A
batch.size -> CollectionMaxCount
connections.max.idle.ms -> AWS SDK Configuration
linger.ms -> RecordMaxBufferedTime
max.block.ms -> N/A
reconnect.backoff.ms -> AWS SDK Configuration
max.request.size -> 5MB
retry.backoff.ms -> RateLimit
partitioner.class -> N/A
retries -> N/A
metric.reporters -> CloudWatch Metrics
metrics.num.samples -> CloudWatch Metrics
metrics.sample.window.ms -> CloudWatch Metrics
Consumer Configuration Comparison (Apache Kafka setting -> Amazon Kinesis equivalent)

bootstrap.servers -> Kinesis API Endpoint
key.deserializer -> N/A
value.deserializer -> N/A
fetch.min.bytes -> N/A
group.id -> ApplicationName
heartbeat.interval.ms -> N/A
max.partition.fetch.bytes -> N/A
session.timeout.ms -> failoverTimeMillis
auto.offset.reset -> N/A
enable.auto.commit -> N/A
auto.commit.interval.ms -> N/A
connections.max.idle.ms -> ClientConfiguration.connectionMaxIdleMillis
request.timeout.ms -> ClientConfiguration.connectionTimeout
send.buffer.bytes -> ClientConfiguration.socketSendBufferSizeHint
receive.buffer.bytes -> ClientConfiguration.socketReceiveBufferSizeHint
check.crcs -> N/A
client.id -> workerIdentifier
fetch.max.wait.ms -> N/A
metadata.max.age.ms -> N/A
metric.reporters -> N/A
metrics.num.samples -> metricsMaxQueueSize
metrics.sample.window.ms -> metricsBufferTimeMillis
reconnect.backoff.ms -> ClientConfiguration.RetryPolicy
retry.backoff.ms -> ClientConfiguration.RetryPolicy
partition.assignment.strategy -> N/A
APPENDIX B Apache Spark Streaming sample code

import java.nio.ByteBuffer

import com.amazonaws.services.kinesis.producer._
import com.google.common.util.concurrent.{FutureCallback, Futures}
import kafka.serializer.DefaultDecoder
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
import org.slf4j.LoggerFactory

import scala.collection.JavaConverters._

object KafkaCopy extends App {
  import org.apache.spark.streaming.kafka._

  val logger = LoggerFactory.getLogger("SparkCopy")

  /* Spark Settings */
  val conf = new SparkConf().setMaster("yarn").setAppName("SparkDataCopy")
  val sc = new SparkContext(conf)
  val checkPointDir = "/spark/checkpoint/sparkkafkacopy"

  /* Amazon Kinesis Settings */
  val region = "us-west-2"
  val streamName = "KafkaKinesisMigration"

  /* Apache Kafka Settings */
  val topics = "KafkaKinesisMigration2"
  val brokerList: String = "hostname1:9092,hostname2:9092"
  val topicsSet = topics.split(",").toSet
  val kafkaParams = Map[String, String](
    "metadata.broker.list" -> brokerList,
    "key.deserializer" -> "org.apache.kafka.common.serialization.ByteArrayDeserializer",
    "value.deserializer" -> "org.apache.kafka.common.serialization.ByteArrayDeserializer")

  /* Tracking how many bytes and messages have been sent */
  val bytesSent = sc.accumulator[Long](0, "BytesSent")
  val messegesSent = sc.accumulator[Long](0, "MessagesSent")
  val startTimeMs = sc.accumulator[Long](System.currentTimeMillis, "StartTimeMs")

  /* Spark Streaming logic starts here */
  def functionToCreateContext(): StreamingContext = {
    val ssc = new StreamingContext(sc, Seconds(1))
    val topicMap = topics.split(",").map((_, 2.toInt)).toMap
    val messages = KafkaUtils.createDirectStream[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](
      ssc, kafkaParams, topicsSet)

    val numShards = Helper.getKinesisNumberOfShards(streamName, region)
    logger.info("Number of Shards: " + numShards)
    val (msgPerSecThrottle, bytesPerSecThrottle) = Helper.getWriteThrottleThresholds(numShards)
    logger.info("Throttle Msg/Sec Thresholds: " + msgPerSecThrottle)
    logger.info("Throttle Bytes/Sec Thresholds: " + bytesPerSecThrottle)

    messages.foreachRDD(msgRDD => {
      msgRDD.foreachPartition(partitionOfRecords => {
        var msgSize = 0
        val myCallback = new FutureCallback[UserRecordResult] {
          override def onFailure(throwable: Throwable): Unit = {
            throwable match {
              case e: UserRecordFailedException =>
                val result = e.getResult
                result.getAttempts.asScala.foreach(a =>
                  println("Error Details: " + a.getDelay + " " + a.getDuration + " " +
                    a.getErrorCode + " " + a.getErrorMessage))
              case _ =>
            }
          }

          override def onSuccess(v: UserRecordResult): Unit = {
            logger.debug("PutRecords SUCCESSFUL")
            messegesSent += 1
            bytesSent += msgSize
          }
        }

        val producer = KinesisConnection.getConnection()
        partitionOfRecords.foreach(msg => {
          val (key, value) = Helper.getKV(msg._1, msg._2)
          msgSize = value.size
          val futureRes = producer.addUserRecord(streamName, key.toString(), ByteBuffer.wrap(value.getBytes()))
          Futures.addCallback(futureRes, myCallback)

          val (msgPerSec, bytePerSec) = Helper.performanceMetrics(startTimeMs.value, messegesSent.value, bytesSent.value)
          logger.debug("MsgPerSec: " + msgPerSec)
          logger.debug("BytesPerSec: " + bytePerSec)

          while (msgPerSec > msgPerSecThrottle || bytePerSec > bytesPerSecThrottle) {
            Helper.throttle(500)
          }

          while (producer.getOutstandingRecordsCount >= 10000) {
            logger.info("BACKPRESSURE INVOKED: " + producer.getOutstandingRecordsCount)
            Helper.throttle(1000)
          }
        })
      })
    })
    ssc.checkpoint(checkPointDir)
    ssc
  }

  val ssc = StreamingContext.getOrCreate(checkPointDir, functionToCreateContext _)
  ssc.start()
  ssc.awaitTermination()
}