
Scaling ElasticSearch Writes

ElasticSearch is arguably one of the most popular distributed systems out there. At Appbase, we use ElasticSearch as the distributed layer for streaming results.

In this post, we talk about a less frequently covered topic on ElasticSearch - scaling writes in the real world. Unlike benchmarks that are designed to test the limits, we constrained ourselves to system designs and parameters that make economic sense.

Our key goals with scaling have been threefold:

  • Sustain over 100,000 writes per second on a small fleet of machines.
  • Show that writes scale linearly with the number of nodes.
  • Build a deployment setup that's easy to scale up and down in an automated fashion.

This chart best sums up our observations. We were in fact able to get writes to scale linearly as a function of the number of nodes using commodity compute instances, even slightly exceeding our original goal of 100,000 writes per second.

Assuming this relationship holds even as we scale further (and that's a big assumption!), we could in fact reach 1,000,000 writes per second with 150 nodes; a number previously achieved with Cassandra - a write-optimized database - running on 330 nodes. That said, such a workload would sit at the outlier end of real-world workloads.

Our Deployment Setup: A Deepdive

First, we need an easy way to deploy our setup if we are ever going to successfully index huge workloads.

We dockerize our deployment setup for reliability and to make it agnostic to the underlying IaaS provider. Behind the scenes, each of our Docker containers maps to a single C4.2xlarge EC2 instance (8 vCPUs) on AWS; this choice largely determines the number of nodes required to hit the benchmark. For instance, had we gone with C4.8xlarge (36 vCPUs) instances, we would have hit 100k writes with far fewer nodes. However, beefy instances are cost prohibitive and make it harder to scale up and down.

[Image: a sketch of our deployment setup, yay!]

While our IaaS provider here is AWS, one advantage of a completely dockerized (err, containerized) setup is that we don't have to care about that fact.

We get this setup in three steps:

  1. First create a docker file with ElasticSearch and Marvel (our monitoring plugin for ES).
  2. Next, we use Tutum for an automated deployment of the docker containers onto AWS nodes.
  3. Use Tutum's GUI to scale up and down our ElasticSearch cluster.

ElasticSearch Configuration

We keep a 1:1 mapping of shards and nodes, and keep one replica of the dataset.
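As a minimal sketch, the 1:1 shard-to-node mapping with one replica corresponds to index settings like the following. The 24-node figure comes from the cluster size mentioned later in this post, and the index name "messages" is purely illustrative:

```python
import json

NUM_NODES = 24  # assumption: the 24-node setup described below

settings = {
    "settings": {
        "number_of_shards": NUM_NODES,   # 1:1 mapping of shards to nodes
        "number_of_replicas": 1,         # one replica of the dataset
    }
}

# These settings would be applied at index-creation time, e.g.:
#   curl -XPUT 'http://localhost:9200/messages' -d '<json body below>'
print(json.dumps(settings))
```

Note that the number of primary shards is fixed at index creation, so it has to be chosen with the target cluster size in mind.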

Workload generation

An individual document payload for this test is roughly 100 bytes. We use a grammar generator to create phrases from a small set of words.

Some sample phrases look like:

{ "message": "most professors ate most sandwiches" }

{ "message": "some dogs ate a curious caterpillar with a professor under the dog on the bathtub with every dog inside some professors in most professors under every professor" }
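A grammar generator along these lines can be sketched as follows. The word lists and grammar rules here are illustrative stand-ins, not the ones used in our actual tests; the shape (subject, verb, object, plus a variable tail of prepositional phrases) mirrors the sample phrases above:

```python
import random

DETERMINERS = ["a", "some", "every", "most", "the"]
NOUNS = ["professor", "dog", "sandwich", "caterpillar", "bathtub"]
VERBS = ["ate", "chased", "found"]
PREPOSITIONS = ["with", "under", "on", "inside"]

def noun_phrase(rng):
    # determiner + noun, e.g. "every professor"
    return f"{rng.choice(DETERMINERS)} {rng.choice(NOUNS)}"

def phrase(rng, max_pp=3):
    # subject + verb + object, padded with a random number of
    # prepositional phrases to vary the document size
    parts = [noun_phrase(rng), rng.choice(VERBS), noun_phrase(rng)]
    for _ in range(rng.randrange(max_pp + 1)):
        parts.append(f"{rng.choice(PREPOSITIONS)} {noun_phrase(rng)}")
    return " ".join(parts)

def document(rng):
    return {"message": phrase(rng)}
```

Each call to `document(random.Random())` yields one small JSON-serializable payload in the ~100-byte range described above.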

We then use up to five C3.8xlarge instances, each running four processes, to generate test loads ranging from thousands to hundreds of thousands of requests per second.

For each test, documents are inserted into one index and one type.
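A load-generator process would batch such documents into ElasticSearch's newline-delimited `_bulk` format. The sketch below shows how such a payload can be assembled; the index and type names are illustrative, since the post only says that all documents go into one index and one type:

```python
import json

def bulk_payload(docs, index="messages", doc_type="message"):
    """Build the newline-delimited body for POST /_bulk."""
    lines = []
    for doc in docs:
        # action metadata line, then the document source on the next line
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # a _bulk body must end with a newline
```

The resulting string is POSTed to the cluster's `/_bulk` endpoint; batching many documents per request is what makes high sustained write rates feasible in the first place.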

Key Results

In doing these tests, we indexed over 1 billion documents in just under two hours. The peak merge rate we saw was 129 MB/s - yes, our 24-node ElasticSearch cluster was gobbling data at that speed.

Most mind-boggling of all: at our peak sustained ingestion rate, we would have added over 13 billion documents in a single day, and this would have cost us a mere USD 244.
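As a back-of-the-envelope check on those figures: the sustained write rate below is inferred from the 13-billion-docs/day claim, and the per-instance-hour cost is derived from the quoted USD 244 over the 24-node cluster, not looked up from AWS pricing:

```python
DOCS_PER_DAY = 13e9
SECONDS_PER_DAY = 24 * 60 * 60

# ~150,000 writes per second sustained over a full day
sustained_rate = DOCS_PER_DAY / SECONDS_PER_DAY

NODES = 24
instance_hours = NODES * 24            # instance-hours in one day
implied_hourly_cost = 244 / instance_hours   # USD per instance-hour
```

The implied cost of roughly $0.42 per instance-hour is in the right ballpark for a commodity compute instance, which is the point: this throughput doesn't require exotic hardware.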

Below are the raw snapshots from Marvel, a plugin for monitoring ElasticSearch performance.

Going Ahead

We found in these experiments that compute was the clear bottleneck. Working with a dockerized deployment has been an absolute joy; it additionally gives us the freedom to host on any infrastructure.

The linearity of writes has been a pleasant surprise. It would be an interesting experiment to see how this relationship holds as the complexity of the cluster increases.

Lastly, this post has been focused solely on scaling indexing and doesn't cover read and query performance. We would like to explore that aspect in future posts.

Founder at Appbase, read my musings on the db world.