Elasticsearch is an open-source search and analytics engine based on Apache Lucene. When building applications on change data capture (CDC) data using Elasticsearch, you’ll want to architect the system to handle frequent updates or modifications to the existing documents in an index.
In this blog, we’ll walk through the different options available for updates including full updates, partial updates and scripted updates. We’ll also discuss what happens under the hood in Elasticsearch when modifying a document and how frequent updates impact CPU utilization in the system.
Example application with frequent updates
To better understand use cases that have frequent updates, let’s look at a search application for a video streaming service like Netflix. When a user searches for a show, e.g., “political thriller”, they are returned a set of relevant results based on keywords and other metadata.
Let’s look at an example document in Elasticsearch of the show “House of Cards”:
Embedded content: https://gist.github.com/julie-mills/1b1b0f87dcca601a6f819d3086db4c27
The search can be configured in Elasticsearch to use name and description as full-text search fields. The views field, which stores the number of views per title, can be used to boost content, ranking more popular shows higher. The views field is incremented every time a user watches an episode of a show or a movie.
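As a rough sketch of how such a search could be expressed, the request body below combines a full-text match on name and description with a popularity boost from views. The function_score query and its field_value_factor function are standard Elasticsearch query DSL; the log1p modifier, the "sum" boost mode and the helper name build_search_query are assumptions chosen for illustration, not the configuration used in the example above.

```python
def build_search_query(text):
    """Build a search body that matches on name/description and boosts by views."""
    return {
        "query": {
            "function_score": {
                # Full-text relevance comes from the two text fields.
                "query": {
                    "multi_match": {
                        "query": text,
                        "fields": ["name", "description"],
                    }
                },
                # Popularity boost: log(1 + views), so very popular titles
                # do not completely drown out text relevance (assumed choice).
                "field_value_factor": {
                    "field": "views",
                    "modifier": "log1p",
                    "missing": 0,
                },
                # Add the popularity score to the text relevance score.
                "boost_mode": "sum",
            }
        }
    }


body = build_search_query("political thriller")
# With the official Python client (assumed 8.x), this could be sent as:
#   es.search(index="movies", query=body["query"])
```

Because the ranking reads views at query time, every change to that field has to reach the index, which is what makes this configuration update-heavy.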
When using this search configuration in an application at the scale of Netflix, the number of updates performed can easily cross millions per minute, as estimated from the Netflix Engagement Report. According to the report, users watched ~100 billion hours of content on Netflix between January and July. Assuming an average watch time of 15 minutes per episode or movie, the number of views per minute reaches 1.3 million on average. With the search configuration specified above, each view would require an update, so the index would see more than a million updates per minute.
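The estimate can be reproduced with a few lines of arithmetic. The hours watched and the 15-minute average come from the paragraph above; the 212-day period is an assumption that reads "January to July" as January 1 through July 31 of a non-leap year.

```python
HOURS_WATCHED = 100e9      # ~100 billion hours, from the text
MINUTES_PER_VIEW = 15      # average watch time per view, from the text
DAYS_IN_PERIOD = 212       # assumption: January 1 through July 31

total_views = HOURS_WATCHED * 60 / MINUTES_PER_VIEW   # hours -> minutes -> views
minutes_in_period = DAYS_IN_PERIOD * 24 * 60
views_per_minute = total_views / minutes_in_period

print(f"{views_per_minute:,.0f} views per minute")    # roughly 1.3 million
```

A shorter period or a shorter average watch time would push the number higher, so the figure is best read as an order-of-magnitude estimate.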
Many search and analytics applications can experience frequent updates, especially when built on CDC data.
Performing updates in Elasticsearch
Let’s delve into a general example of how to perform an update in Elasticsearch with the code below:
Embedded content: https://gist.github.com/julie-mills/c2bc1b4d32198fbc9df0975cd44546c0
Full updates versus partial updates in Elasticsearch
When performing an update in Elasticsearch, you can use the index API to replace an existing document or the update API to make a partial update to a document.
With the index API, your application retrieves the entire document, makes changes to it and then sends the whole document back to Elasticsearch to be reindexed. With the update API, you simply send the fields you wish to modify, instead of the entire document. This still results in the document being reindexed but minimizes the amount of data sent over the network. The update API is especially useful in cases where the document size is large and sending the entire document over the network would be time consuming.
Let’s see how both the index API and the update API work using Python code.
Full updates using the index API in Elasticsearch
Embedded content: https://gist.github.com/julie-mills/d64019542768baad2825e2f9c6bf94e6
As you can see in the code above, a full update with the index API requires two separate calls to Elasticsearch, one to fetch the document and one to index the modified version, which can result in slower performance and higher load on your cluster.
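The two-call pattern can be sketched as follows. This is a minimal illustration, assuming the official Python client (8.x, where the document body is passed as document=) and a hypothetical helper named full_update_views; the es argument is expected to be a connected client.

```python
def full_update_views(es, index, doc_id, new_views):
    """Replace a whole document in order to change a single field."""
    # Call 1: fetch the current document so that no other field is lost.
    current = es.get(index=index, id=doc_id)["_source"]

    # Modify the copy on the client side.
    updated = {**current, "views": new_views}

    # Call 2: send the entire document back; Elasticsearch reindexes it.
    es.index(index=index, id=doc_id, document=updated)
    return updated


# Usage, assuming a local cluster and a "movies" index (both hypothetical):
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch("http://localhost:9200")
#   full_update_views(es, "movies", "1", 101)
```

Note that another writer could change the document between the two calls, so this pattern is also open to lost updates unless it is combined with optimistic concurrency control.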
Partial updates using the update API in Elasticsearch
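Partial updates still reindex the whole document internally: Elasticsearch fetches the existing document, merges in the changed fields and indexes the result. The difference is that all of this happens on the server, so only a single network call is required, which gives better performance.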
Embedded content: https://gist.github.com/julie-mills/49125b47699cd0b6c2b2a0c824e8e2c0
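A minimal sketch of a partial update, assuming the official Python client (8.x, where the partial document is passed as doc=) and a hypothetical helper named set_views. Only the changed field travels over the network.

```python
def set_views(es, index, doc_id, views):
    """Set the views field to an absolute value with a single call."""
    # Only the field being changed is sent; the server merges it into the
    # stored document and reindexes the result.
    return es.update(index=index, id=doc_id, doc={"views": views})


# Usage, assuming a connected client and a "movies" index (both hypothetical):
#   set_views(es, "movies", "1", 101)
```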
You can use the update API in Elasticsearch to update the view count but, by itself, the update API cannot be used to increment the view count based on the previous value. That is because we need the older view count to set the new view count value.
Let’s see how we can fix this using a powerful scripting language, Painless.
Partial updates using Painless scripts in Elasticsearch
Painless is a scripting language designed for Elasticsearch and can be used for query and aggregation calculations, complex conditionals, data transformations and more. Painless also enables the use of scripts in update queries to modify documents based on complex logic.
In the example below, we use a Painless script to perform an update in a single API call and increment the new view count based on the value of the old view count.
Embedded content: https://gist.github.com/julie-mills/50da3261ae1866bd95734544c98b58af
The Painless script is fairly intuitive: it simply increments the view count of the targeted document by 1.
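For reference, a scripted increment might look like the sketch below. It assumes the official Python client (8.x) and a hypothetical helper named increment_views; passing the increment through params rather than hardcoding it in the source lets Elasticsearch reuse the compiled script, and retry_on_conflict is an optional setting for retrying when two increments hit the same document at once.

```python
def increment_views(es, index, doc_id, count=1, retries=3):
    """Increment views on the server, based on the value currently stored."""
    script = {
        "lang": "painless",
        # ctx._source is the stored document; the new value is derived from the old one.
        "source": "ctx._source.views += params.count",
        "params": {"count": count},
    }
    # retry_on_conflict re-runs the script if a concurrent write got in first.
    return es.update(index=index, id=doc_id, script=script, retry_on_conflict=retries)


# Usage, assuming a connected client and a "movies" index (both hypothetical):
#   increment_views(es, "movies", "1")
```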
Updating a nested object in Elasticsearch
Nested objects in Elasticsearch are a data structure that allows for the indexing of arrays of objects as separate documents within a single parent document. Nested objects are useful when dealing with complex data that naturally forms a nested structure, like objects within objects. In a typical Elasticsearch document, arrays of objects are flattened, but using the nested data type allows each object in the array to be indexed and queried independently.
Painless scripts can also be used to update nested objects in Elasticsearch.
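As an illustration, suppose each show has a nested field named episodes whose objects carry an id and their own views counter. That field layout and the helper name increment_episode_views are assumptions made for this sketch; the script loops over the array in the stored document and changes only the matching object.

```python
# Painless source: find the matching episode in the array and bump its counter.
NESTED_SCRIPT = """
for (def episode : ctx._source.episodes) {
  if (episode.id == params.episode_id) {
    episode.views += params.count;
  }
}
"""


def increment_episode_views(es, index, doc_id, episode_id, count=1):
    """Increment views on one object inside the (hypothetical) episodes array."""
    script = {
        "lang": "painless",
        "source": NESTED_SCRIPT,
        "params": {"episode_id": episode_id, "count": count},
    }
    return es.update(index=index, id=doc_id, script=script)


# Usage, assuming a connected client (hypothetical values):
#   increment_episode_views(es, "movies", "1", episode_id="s01e01")
```

Even though only one nested object changes, the parent document and all of its nested objects are reindexed together.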
Adding a new field in Elasticsearch
Adding a new field to a document in Elasticsearch can be accomplished through an index operation.
You can partially update an existing document with the new field using the Update API. When dynamic mapping on the index is enabled, introducing a new field is straightforward. Simply index a document containing that field and Elasticsearch will automatically figure out the suitable mapping and add the new field to the mapping.
With dynamic mapping on the index disabled, you will need to use the update mapping API. You can see an example below of how to update the index mapping by adding a “category” field to the movies index.
Embedded content: https://gist.github.com/julie-mills/b83e89341f4db23e021df4ca6b5ed644
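With the official Python client (assumed 8.x), the same mapping change could be sketched as below. The keyword type is an assumption suited to exact-match filtering on category, and add_category_field is a hypothetical helper name.

```python
def add_category_field(es, index="movies"):
    """Add a category field to an existing index mapping."""
    # "keyword" is an assumed type; "text" would be the choice for full-text search.
    return es.indices.put_mapping(
        index=index,
        properties={"category": {"type": "keyword"}},
    )


# Usage, assuming a connected client (hypothetical):
#   add_category_field(es)
```

Adding a field this way does not touch existing documents; they only gain a value for category when they are next indexed or updated.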
Updates in Elasticsearch under the hood
While the code is simple, Elasticsearch internally is doing a lot of heavy lifting to perform these updates because data is stored in immutable segments. As a result, Elasticsearch cannot simply make an in-place update to a document. The only way to perform an update is to reindex the entire document, regardless of which API is used.
Elasticsearch uses Apache Lucene under the hood. A Lucene index is composed of one or more segments. A segment is a self-contained, immutable index structure that represents a subset of the overall index. When documents are added or updated, new Lucene segments are created and older documents are marked for soft deletion. Over time, as new documents are added or existing ones are updated, multiple segments may accumulate. To optimize the index structure, Lucene periodically merges smaller segments into larger ones.
Updates are essentially inserts in Elasticsearch
Since each update operation is a reindex operation, all updates are essentially inserts with soft deletes.
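The toy model below illustrates the idea. It is not how Lucene is implemented: real segments hold many documents and are created on refresh, whereas this sketch writes one tiny segment per operation to keep the bookkeeping visible. All class and method names are hypothetical.

```python
class ToyIndex:
    """Toy model of append-only segments with soft deletes."""

    def __init__(self):
        self.segments = []    # each segment is a list of (doc_id, doc), never modified
        self.deleted = set()  # (segment_no, position) pairs marked as deleted
        self.live = {}        # doc_id -> (segment_no, position) of the live copy

    def index(self, doc_id, doc):
        if doc_id in self.live:
            self.deleted.add(self.live[doc_id])   # soft delete the old copy
        self.segments.append([(doc_id, doc)])     # write the full document to a new segment
        self.live[doc_id] = (len(self.segments) - 1, 0)

    def update(self, doc_id, changes):
        seg, pos = self.live[doc_id]
        old_doc = self.segments[seg][pos][1]
        # An update is an insert of the merged document plus a soft delete.
        self.index(doc_id, {**old_doc, **changes})

    def merge(self):
        # Copy only live documents into one new segment and drop the rest.
        merged = [self.segments[s][p] for s, p in sorted(self.live.values())]
        self.segments = [merged]
        self.deleted.clear()
        self.live = {doc_id: (0, i) for i, (doc_id, _) in enumerate(merged)}

    def stored_copies(self):
        return sum(len(segment) for segment in self.segments)


idx = ToyIndex()
idx.index("1", {"name": "House of Cards", "views": 0})
for n in range(1, 4):
    idx.update("1", {"views": n})
print(idx.stored_copies(), len(idx.deleted))  # 4 copies stored, 3 of them soft-deleted
idx.merge()
print(idx.stored_copies(), len(idx.deleted))  # 1 copy stored, 0 soft-deleted
```

Three single-field changes leave four full copies of the document behind until a merge reclaims the space.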
There are cost implications for treating an update as an insert operation. The soft deletion of data means that old data is still retained for some period of time, bloating the storage and memory of the index. Performing soft deletes, reindexing and garbage collection operations also takes a heavy toll on CPU, a toll that is exacerbated by repeating these operations on all replicas.
Updates can get trickier as your product grows and your data changes over time. To keep Elasticsearch performant, you may need to change the number of shards, the analyzers or the tokenizers in your cluster, and each of these changes requires reindexing the affected data. For production applications, this often means setting up a new cluster and migrating all of the data over. Migrating clusters is both time intensive and error prone, so it’s not an operation to take lightly.
Updates in Elasticsearch
The simplicity of the update operations in Elasticsearch can mask the heavy operational tasks happening under the hood of the system. Elasticsearch treats each update as an insert, requiring the full document to be recreated and reindexed. For applications with frequent updates, this can quickly become expensive as we saw in the Netflix example where millions of updates happen every minute. We recommend either batching updates using the Bulk API, which adds latency to your workload, or looking at alternative solutions when faced with frequent updates in Elasticsearch.
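One way to batch, sketched under assumptions: collect view events for a short window, collapse repeated views of the same title into a single increment, and send the result through the Bulk API. The helper name build_bulk_operations and the windowing approach are illustrative; the operations list follows the Bulk API's action-then-body layout.

```python
from collections import Counter


def build_bulk_operations(index, viewed_ids):
    """Turn a window of view events into one scripted update per document."""
    counts = Counter(viewed_ids)   # e.g. three views of "1" become one +3 update
    operations = []
    for doc_id, count in counts.items():
        # Action line: which document to update.
        operations.append({"update": {"_index": index, "_id": doc_id}})
        # Body line: how to update it.
        operations.append({
            "script": {
                "lang": "painless",
                "source": "ctx._source.views += params.count",
                "params": {"count": count},
            }
        })
    return operations


ops = build_bulk_operations("movies", ["1", "2", "1", "1"])
# With the official Python client (assumed 8.x), this could be sent as:
#   es.bulk(operations=ops)
```

Each document is still reindexed in full, but it is reindexed once per window instead of once per view, at the cost of view counts lagging by up to one window.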
Rockset, a search and analytics database built in the cloud, is a mutable alternative to Elasticsearch. Being built on RocksDB, a key-value store popularized for its mutability, Rockset can make in-place updates to documents. This results in only the value of individual fields being updated and reindexed rather than the entire document. If you’d like to compare the performance of Elasticsearch and Rockset for update-heavy workloads, you can start a free trial of Rockset with $300 in credits.