I’m currently working on a large logging project that was initially implemented using AWS Elasticsearch. Having worked with large-scale mainline Elasticsearch clusters for several years, I’m absolutely stunned at how poor Amazon’s implementation is and I can’t fathom why they’re unable to fix or at least improve it.
A Quick Rundown
Elasticsearch stores data in various indexes you explicitly create or that can be automatically created as soon as you ship it some data. The records in each index are split across a definable number of shards, which are then balanced across the nodes in your cluster (as evenly as possible if your shard count is not divisible by the number of nodes.) There are two main types of shards in Elasticsearch; primary shards and replica shards. Replica shards provide resiliency in case of a failed node, and users can specify a different number of replica shards for each index as well.
Mainline Elasticsearch Operation
Elasticsearch is, well, elastic. It can be quite finicky sometimes, however, generally speaking, you can add nodes to a cluster or remove them, and as long as there are an appropriate number of replicas in the case of removing a node, Elasticsearch will move shards around and even the load across the nodes in a cluster. Things generally just work.
Running expensive queries can sometimes cause nodes to fall over and things of that nature, but there are a lot of tuneables available to help keep the rails on. With an appropriate number of replica shards, if a node does fall over it shouldn’t have significant impact.
Mainline Elasticsearch also has a number of add-ons available including X-Pack, including features like auditing, fine-grained ACLs and monitoring and alerting. Much of X-Pack recently became free to use, likely in response to Splunk’s new licensing options.
Amazon Elasticsearch Operation
As has happened before, Amazon took the open-source side of Elasticsearch, did a hard fork, and has been selling it as a hosted service, slowly implementing their own versions of features that have been available in one form or fashion in mainline Elasticsearch for years.
Amazon’s implementation is missing a lot of things like RBAC and auditing, which is particularly problematic in our environment because we are ingesting log data from different teams and would prefer to be able to segment them from one another; currently, anyone with access to Elasticsearch has full privileges to everything and can accidentally delete other people’s data, change how it’s replicated across nodes, and cause data ingestion to stop completely by adding a bad index template.
This is frustrating, but it’s not the big issue with their service. Shard rebalancing, a central concept to Elasticsearch working as well as it does, does not work on AWS’s implementation, and that negates basically everything good about Elasticsearch.
In a normal scenario, as data is added to nodes, sometimes one can become more full than others. This is understandable because there are no guarantees that records ingested will be the same size, or that shard counts will always evenly divide out across all of the nodes in a cluster. It’s not a big deal though, because Elasticsearch can rebalance shards across the nodes, and if one node does happen to fill up, the other nodes will happily start taking in data in its place.
This is not supported on Amazon. Some nodes can fill up (much) more quickly than others.
And it gets worse. On Amazon, if a single node in your Elasticsearch cluster runs out of space, the entire cluster stops ingesting data, full stop. Amazon’s solution to this is to have users go through a nightmare process of periodically changing the shard counts in their index templates and then reindexing their existing data into new indices, deleting the previous indices, and then reindexing the data again to the previous index name if necessary. This should be wholly unnecessary, is computationally expensive, and requires that a raw copy of the ingested data be stored along with the parsed record because the raw copy will need to be parsed again to be reindexed. Of course, this also doubles the storage required for "normal" operation on AWS.
Oops! I didn’t reindex the entire cluster as often as I should have and a node filled up! What do I do?
You have two options here. The first, delete enough data that the cluster comes back to life and then start reindexing things and hope nothing crashes. Hope you had a backup of what you needed to dump.
The second option is to add more nodes to the cluster or resize the existing ones to larger instance types.
But wait, how do you add nodes or shuffle things around if you can’t rebalance shards?
Amazon’s solution to this is a blue-green deployment. They spin up an entire new cluster, copy the entire contents of the previous cluster into the new one, and then switch over and destroy the old cluster.
These resizing jobs can take days with large clusters – as you can imagine, duplicating trillions of records can take some time. It also puts an insane load on the previous cluster (which is likely already over capacity in some ways if you’ve ended up here) and can actually cause the cluster to fall over. I’ve gone through several of these operations on 30+ node clusters in AWS, and only one time have I seen it complete successfully on its own.
So, you tried to resize your cluster and the job is broken. What now?
Interacting With Amazon
So your cluster resize job broke (on a service you probably chose so you wouldn’t have to deal with this stuff in the first place), so you open a top severity ticket with AWS support. Invariably, they’ll complain about your shard count or sizing and will helpfully add a link to the same shard sizing guidelines you’ve read 500 times by now. And then you wait for them to fix it. And wait. And wait. The last time I tried to resize a cluster and it locked up, causing a major production outage, it took SEVEN DAYS for them to get everything back online. They got the cluster itself online within a couple of days, but when things broke, apparently the nodes running Kibana lost connectivity to the main cluster. AWS support then spent the next four days trying things and then asking me if Kibana was working. They couldn’t even tell if they’d fixed the problem and had to have me verify whether they had restored connectivity between their own systems. I’ve since given up on doing anything but deleting data if a node manages to fill up.
Our spend with AWS as an organization is massive. This allows us the opportunity to periodically meet with their SMEs in various areas and discuss implementation strategies and get into the weeds about a lot of technical stuff. We set up a meeting with one of their Elasticsearch SMEs, during which time I spent the bulk of the meeting explaining Elasticsearch fundamentals and describing the… quirks… of their product. The gentleman was wholly blindsided that the whole thing falls over if a node fills up. If the SME they sent doesn’t know the fundamentals of how their product works, no wonder it takes their support team seven days to bring back a production cluster.
The logging project I dove into has its share of architectural faults and poor design decisions that we’re currently working through. And I certainly expected AWS Elasticsearch to be different from mainline. However, so many fundamental features are either disabled or missing in AWS Elasticsearch that it’s exacerbated almost every other issue we face.
For light use and small clusters, I could certainly see AWS Elasticsearch working reasonably well for users, but for clusters approaching petabyte-scale, it’s been an unending nightmare.
I’m extremely curious why Amazon’s implementation of Elasticsearch cannot rebalance shards; it’s pretty fundamental Elasticsearch functionality. Even with Amazon’s feature set being lacking compared to mainline Elasticsearch, it would certainly be an acceptable product for large clusters if it simply worked properly. I cannot fathom how Amazon decided to ship something so broken, and how they haven’t been able to improve the situation after over two years.
As others have posited, it seems to make sense that these may be symptoms of their implementation having been designed as a giant, multi-tenant cluster, trying to provide isolation and make it look like a stand-alone cluster to end-users. Even with options like encrypted data at rest and encrypted transport, it seems plausible. Or perhaps their tooling and configurations are simply remnants of an earlier architecture.
At the end of the day, as a friend of mine pointed out, it’s rather funny that they still call it "Elastic" when you can’t add or remove nodes from your clusters without spinning up a brand new cluster and migrating all of your data.