Journal Entry | Log4j Vulnerability

When the Log4Shell vulnerability was announced, the Openverse team immediately reviewed their services to determine whether any were affected. While most of the stack the team develops is written in Python, Openverse relies on a number of open source applications across several languages. Specifically, the search engine relies on an ElasticSearch cluster, which was determined to be the only vulnerable service. The team waited for an in-progress “data refresh” to complete before deploying the necessary security patch to the ElasticSearch cluster.

The patching process required careful planning as ElasticSearch makes Openverse’s 600+ million openly licensed images searchable. Redeploying the cluster would require reindexing the data for several hours, and the team wanted to avoid users experiencing any downtime during the patching process. With six production ElasticSearch nodes (three master [sic] and three data), the team restarted the nodes one by one with the patch for Log4Shell. Additional information on the patch can be found here.

To perform this per-node restart with no downtime, it was important that ElasticSearch did not treat the restarting node as “unavailable” and attempt to redistribute the index across the five remaining nodes. This was prevented by setting the “delayed allocation timeout” to 20 minutes. The script the team wrote performed the following:

  • Extend the timeout for the entire cluster
  • For each node:
    • Restart the ElasticSearch container with the additional Log4Shell patch JVM argument
    • Wait until the node is reintegrated with the cluster and the entire cluster is considered “healthy”
  • Reduce the timeout to its original value
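
The steps above can be sketched roughly as follows. This is a minimal illustration and not the team's actual script. It assumes the “delayed allocation timeout” corresponds to the Elasticsearch index setting `index.unassigned.node_left.delayed_timeout`, that “healthy” means a `green` status with every node present, and that the original timeout was the Elasticsearch default of one minute. The `restart_node` callable is hypothetical: it stands in for whatever restarts a node's container with the extra JVM argument, which this post does not name.

```python
import json
import time
import urllib.request

# Assumed setting name; the post only calls it the "delayed allocation timeout".
DELAYED_TIMEOUT_SETTING = "index.unassigned.node_left.delayed_timeout"


def es_call(base_url, method, path, body=None):
    """Send a JSON request to the Elasticsearch REST API and decode the reply."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(
        base_url + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def set_delayed_timeout(base_url, value, call=es_call):
    """Apply the delayed-allocation timeout to every index in the cluster."""
    body = {"settings": {DELAYED_TIMEOUT_SETTING: value}}
    return call(base_url, "PUT", "/_all/_settings", body)


def cluster_is_healthy(base_url, expected_nodes, call=es_call):
    """Return True once all nodes have rejoined and the status is green."""
    health = call(base_url, "GET", "/_cluster/health")
    return (health["status"] == "green"
            and health["number_of_nodes"] == expected_nodes)


def rolling_restart(base_url, nodes, restart_node, extended="20m",
                    original="1m", poll_seconds=10,
                    call=es_call, sleep=time.sleep):
    """Restart nodes one at a time while shard reallocation is delayed."""
    set_delayed_timeout(base_url, extended, call)
    try:
        for node in nodes:
            # Hypothetical hook: restart this node's container with the patch.
            restart_node(node)
            while not cluster_is_healthy(base_url, len(nodes), call):
                sleep(poll_seconds)
    finally:
        # Restore the timeout even if a restart or health check fails.
        set_delayed_timeout(base_url, original, call)
```

In this sketch `base_url` should point at an endpoint that stays reachable while an individual node restarts, such as a load balancer, because `es_call` does not retry failed connections. Wrapping the loop in `try`/`finally` is a design choice of the sketch so that the extended timeout is not left in place after an error.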

This script was tested several times on a staging ElasticSearch cluster before being run on the production cluster. The entire restart process took approximately six minutes and did not interfere with the active use of Openverse. No user action was needed, and none is needed now.

Props to @aetherunbound, @zackkrida, and @ronnybadilla for their work patching the Log4Shell vulnerability and composing this journal entry.