infrastructure – Make Openverse

This was a passing thought I had that I wanted to note somewhere. Currently the ingestion server is a small Falcon app that runs most aspects of the data refresh, but then also (in staging/prod) interacts with a fleet of “indexer worker” EC2 instances when performing the Postgres -> Elasticsearch indexing.

We have plans for moving the data refresh steps from the ingestion server into Airflow. Most of these steps are operations on the various databases, so they’re not very processor-intensive on the server end. However, the indexing steps are intensive, which is why they’re spread across 6 machines in production (and even then it can take a number of hours to complete).

We could replicate this process in Airflow by setting up Celery-based workers so that the tasks run on a separate instance from the webserver/scheduler. Ultimately I’d like to go this route (or use something like the ECS Executor rather than Celery), but that’s a non-trivial effort to complete.

One other way we could accomplish this would be to use ECS tasks! We could have a container defined specifically for the indexing step, which expects to receive the range on which to index and all necessary connection information. We could then kick off n of those jobs using the EcsRunTaskOperator, and use the EcsTaskStateSensor to determine when they complete. This could be done in our current setup without any new Airflow infrastructure. It’d also allow us to remove the indexer workers, which currently sit idle (albeit in the stopped state) in EC2 until they are used.
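As a rough illustration of the "range plus connection information" idea above, here is a minimal sketch of how the work could be split into n pieces and turned into per-task container overrides. The environment variable names (`INDEX_START`, `INDEX_END`, `DATABASE_HOST`), the container name, and the helper functions are all hypothetical; only the `containerOverrides` dictionary shape follows what ECS RunTask accepts, and each dictionary could in principle be handed to an EcsRunTaskOperator as its `overrides` argument.

```python
from typing import Dict, List, Tuple


def split_range(start: int, end: int, n: int) -> List[Tuple[int, int]]:
    """Split the half-open range [start, end) into n contiguous chunks."""
    if n <= 0:
        raise ValueError("n must be positive")
    total = max(end - start, 0)
    base, extra = divmod(total, n)
    chunks = []
    lo = start
    for i in range(n):
        # The first `extra` chunks take one additional row each.
        size = base + (1 if i < extra else 0)
        chunks.append((lo, lo + size))
        lo += size
    return chunks


def build_overrides(
    container: str, lo: int, hi: int, connection_env: Dict[str, str]
) -> dict:
    """Build an ECS-style overrides dict for one indexing task (hypothetical names)."""
    env = {**connection_env, "INDEX_START": str(lo), "INDEX_END": str(hi)}
    return {
        "containerOverrides": [
            {
                "name": container,
                "environment": [
                    {"name": key, "value": value} for key, value in sorted(env.items())
                ],
            }
        ]
    }


def plan_indexing_tasks(
    start: int, end: int, n: int, container: str, connection_env: Dict[str, str]
) -> List[dict]:
    """Return one overrides dict per chunk of the range."""
    return [
        build_overrides(container, lo, hi, connection_env)
        for lo, hi in split_range(start, end, n)
    ]
```

For example, `plan_indexing_tasks(0, 100, 6, "indexer", {"DATABASE_HOST": "db.example"})` yields six override dictionaries whose ranges cover 0 to 100 without gaps, mirroring the six indexer workers used in production today.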

#airflow, #data-refresh, #ecs, #infrastructure, #openverse

Yesterday at 20:20 UTC, we released version 2.5.5 of our API! Along with a few dependency upgrades and DevEx improvements/fixes, this release also brings an important change regarding anonymous API requests. As of v2.5.5, media searches made without an API key cannot request more than 20 results per page.

This change was made in order to mitigate behavior we were seeing on the API which was adversely affecting performance for other users, our capacity to update the data that backs Openverse, and our ability to deploy new changes to the API.

Our API Terms of Service state:

– A user must adhere to all rate limits, registration requirements, and comply with all requirements in the Openverse API documentation;
– A user must not scrape the content in the Openverse Catalog;
– A user must not use multiple machines to circumvent rate limits or otherwise take measures to bypass our technical or security measures;
– A user must not operate in a way that negatively affects other users of the API or impedes the WordPress Foundation’s ability to provide its services;

Background

Beginning around May 18th, we saw a significant increase in traffic.

Total requests made to api.openverse.engineering over the last 30 days

While the digital demographics (browser, user agent, OS, device type, etc.) were quite varied, one feature stuck out – these requests were all being made with the page_size=500 parameter.

Total requests made to api.openverse.engineering over the last 30 days **using the `page_size=500` parameter**

Over the course of the last 30 days, these requests constituted almost 80% of our total traffic! While our application is designed to handle this many requests, it is not designed to handle each request querying for 500 results per page (the default page size is 20). As such, this created significant strain on our Elasticsearch cluster and eventually caused disruptions in the API’s ability to serve results. The image below combines a few of our monitoring tools to show a general correlation between the page_size=500 requests and our Elasticsearch resource utilization.

Request count compared to Elasticsearch resource utilization
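To put the page size difference in perspective, here is a small back-of-the-envelope calculation. It rounds "almost 80%" up to 0.8 and assumes, purely for illustration, that all remaining traffic used the default page size of 20; neither simplification comes from our metrics.

```python
def page_size_ratio(large_page_size: int, default_page_size: int) -> float:
    """How many times more results one large request asks for than a default one."""
    return large_page_size / default_page_size


def share_of_requested_results(
    large_fraction: float, large_page_size: int, default_page_size: int
) -> float:
    """Fraction of all requested results attributable to the large-page requests.

    Assumes every other request uses the default page size (an illustrative
    simplification, not a measured figure).
    """
    large = large_fraction * large_page_size
    small = (1 - large_fraction) * default_page_size
    return large / (large + small)


ratio = page_size_ratio(500, 20)
share = share_of_requested_results(0.8, 500, 20)
```

Under these assumptions each `page_size=500` request asks for 25 times as many results as a default request, and those requests account for roughly 99% of all results requested.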

Even before this release, our application was set up to throttle individual, anonymous users to 1 request/second. These page_size=500 requests were coming from a myriad of different hosts; the initiator was able to circumvent the individual throttles by employing a large number of machines (also known as a botnet). These machines were also predominantly tied to a single data center and a single ASN, which led us to believe this was orchestrated by a single user.

This behavior was clearly in violation of our Terms of Service, since it was:

– Not using a registered API key for high-volume use
– Scraping data from Openverse
– Using multiple machines to circumvent the application throttles
– Consuming significant enough resources that it impacted other users of Openverse
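To illustrate why a per-user throttle of 1 request/second does not bound total traffic when requests are spread across many hosts, here is a minimal sketch. The `PerClientThrottle` class and `count_allowed` helper are hypothetical and are not how the Openverse API implements throttling; they only model the idea of limiting each client independently.

```python
from typing import Dict, Iterable, Tuple


class PerClientThrottle:
    """Allow at most one request per client every `min_interval` seconds."""

    def __init__(self, min_interval: float = 1.0) -> None:
        self.min_interval = min_interval
        self._last_allowed: Dict[str, float] = {}

    def allow(self, client_id: str, now: float) -> bool:
        last = self._last_allowed.get(client_id)
        if last is not None and now - last < self.min_interval:
            return False  # this client already made a request too recently
        self._last_allowed[client_id] = now
        return True


def count_allowed(
    throttle: PerClientThrottle, requests: Iterable[Tuple[str, float]]
) -> int:
    """Count how many (client_id, timestamp) requests pass the throttle."""
    return sum(1 for client_id, ts in requests if throttle.allow(client_id, ts))


# Ten requests within one second from a single host...
single_host = [("host-a", i / 10) for i in range(10)]
# ...versus the same ten requests spread across ten different hosts.
many_hosts = [(f"host-{i}", i / 10) for i in range(10)]
```

With these inputs, `count_allowed(PerClientThrottle(), single_host)` lets one request through, while `count_allowed(PerClientThrottle(), many_hosts)` lets all ten through, since each host stays within its own limit.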

Mitigation

As mentioned above, we deployed a change that returns a 401 Unauthorized for any anonymous request to the API that includes a page_size greater than the default of 20. Almost immediately after deployment, we saw this mitigation take effect when observing request behavior:

Total number of `page_size=500` requests made over the course of 6 hours, separated by return status code. The screenshot of a Cloudflare analytics page shows a consistent number of requests (split between 301 and 200) starting at 9:00 PST. At 13:00 PST, the number of 401 responses begins to overtake the number of 200 responses. After 13:15, the number of 200 responses drops to zero and all responses returned are 401s.

In the above graph, you can see where we deployed v2.5.5 (~13:00 PST) – the number of 200 OK responses decreased, and the number of 401 Unauthorized responses increased significantly! Eventually all of the page_size=500 requests were being rejected as unauthorized.
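The rule described above can be sketched as a small, framework-free check. This is only an illustration of the behavior, not the code shipped in v2.5.5: the function name, the constants, and the response messages are assumptions, and any limits that apply to authenticated requests are left out.

```python
from typing import Optional, Tuple

DEFAULT_PAGE_SIZE = 20  # default page size mentioned in this post
MAX_ANONYMOUS_PAGE_SIZE = 20  # anonymous requests may not exceed the default


def check_page_size(page_size: Optional[int], has_api_key: bool) -> Tuple[int, str]:
    """Return an (HTTP status, message) pair for the requested page size.

    `page_size` is None when the parameter was omitted from the request.
    """
    requested = DEFAULT_PAGE_SIZE if page_size is None else page_size
    if not has_api_key and requested > MAX_ANONYMOUS_PAGE_SIZE:
        # Anonymous callers asking for more than the default are rejected.
        return 401, "Unauthorized: an API key is required for this page size"
    return 200, "OK"
```

For instance, `check_page_size(500, has_api_key=False)` returns a 401 status, while the same anonymous caller asking for 20 results (or omitting `page_size` entirely) still gets a 200.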

With this change, we were able to successfully mitigate the botnet and return our resource consumption to typical levels. This can be seen easily with a few Elasticsearch metrics:

Elasticsearch metrics over the last 12 hours

While the intention behind Openverse is to make openly licensed media easy to access, we don’t currently have the capacity to enable users to access the entire dataset at once. We do plan on exploring options for this in the future.

We’re pleased that this mitigation was successful, and we will continue to be vigilant in ensuring uninterrupted access to Openverse for our users!

#openverse, #infrastructure, #api


Applying ECS to the ingestion server/data refresh

Mitigating out of terms API usage
