Openverse is a search engine for openly-licensed media.
The Openverse team builds the Openverse Catalog, API, and front-end application, as well as integrations between Openverse and WordPress. Follow this site for updates and discussions on the project.
Yesterday at 20:20 UTC, we released version 2.5.5 of our API! Along with a few dependency upgrades and developer-experience (DevEx) improvements and fixes, this release brings an important change regarding anonymous API requests. As of v2.5.5, media searches made without an API key cannot request more than 20 results per page.
We made this change to mitigate behavior we were seeing on the API that was adversely affecting performance for other users, our capacity to update the data that backs Openverse, and our ability to deploy new changes to the API.
The following provisions of our Terms of Service are relevant here:
– A user must adhere to all rate limits, registration requirements, and comply with all requirements in the Openverse API documentation;
– A user must not scrape the content in the Openverse Catalog;
– A user must not use multiple machines to circumvent rate limits or otherwise take measures to bypass our technical or security measures;
– A user must not operate in a way that negatively affects other users of the API or impedes the WordPress Foundation’s ability to provide its services;
Beginning around May 18th, we saw a significant increase in traffic.
While the digital demographics (browser, user agent, OS, device type, etc.) were quite varied, one feature stuck out – these requests were all being made with the page_size=500 parameter.
Over the course of the last 30 days, these requests constituted almost 80% of our total traffic! While our application is designed to handle this many requests, it is not designed to handle each request querying for 500 results per page (the default page size is 20). As a result, this traffic put significant strain on our Elasticsearch cluster and eventually disrupted the API’s ability to serve results. The image below combines a few of our monitoring tools to show a general correlation between the page_size=500 requests and our Elasticsearch resource utilization.
Even before this release, our application was set up to throttle individual, anonymous users to 1 request/second. These page_size=500 requests were coming from a myriad of different hosts; the initiator was able to circumvent the individual throttles by employing a large number of machines (also known as a botnet). These machines were also predominantly tied to a single data center and a single ASN, which led us to believe this was orchestrated by a single user.
This behavior was clearly in violation of our Terms of Service, since it was:
– Not using a registered API key for high-volume use
– Scraping data from Openverse
– Using multiple machines to circumvent the application throttles
– Consuming significant enough resources that it impacted other users of Openverse
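To illustrate why a per-user throttle of 1 request/second does not stop traffic spread across many machines, here is a minimal sketch. It is not the Openverse implementation: the `PerClientThrottle` class, the client identifiers, and the in-memory bookkeeping are all assumptions made for the example.

```python
class PerClientThrottle:
    """Hypothetical throttle: at most one request per client per interval."""

    def __init__(self, min_interval: float = 1.0) -> None:
        self.min_interval = min_interval  # seconds between allowed requests
        self._last_allowed = {}  # client id -> time of last allowed request

    def allow(self, client_id: str, now: float) -> bool:
        last = self._last_allowed.get(client_id)
        if last is not None and now - last < self.min_interval:
            return False  # this client already made a request in the interval
        self._last_allowed[client_id] = now
        return True


throttle = PerClientThrottle()

# One host sending 100 requests in the same second: only the first is allowed.
single_host = sum(throttle.allow("host-0", now=0.0) for _ in range(100))

# 100 distinct hosts sending one request each in that second: all are allowed.
many_hosts = sum(throttle.allow(f"bot-{i}", now=0.0) for i in range(100))

print(single_host, many_hosts)
```

Because each machine is tracked separately, every one of them stays under its own limit while the combined request rate grows with the number of machines.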
As mentioned above, we deployed a change that returns a 401 Unauthorized response for any anonymous request to the API that includes a page_size greater than the default of 20. Almost immediately after deployment, we saw this mitigation take effect when observing request behavior:
In the above graph, you can see where we deployed v2.5.5 (~13:00 PST) – the number of 200 OK responses decreased, and the number of 401 Unauthorized responses increased significantly! Eventually all of the page_size=500 requests were being rejected as unauthorized.
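The rule introduced in v2.5.5 can be sketched roughly as follows. This is an illustration of the behavior described in this post, not the API's actual code: the function name `status_for_search` and its parameters are hypothetical, and the sketch does not model any upper limit that may apply to requests made with an API key.

```python
from typing import Optional

DEFAULT_PAGE_SIZE = 20  # default page size stated in this post


def status_for_search(page_size: Optional[int], has_api_key: bool) -> int:
    """Return the HTTP status a media search would receive under the new rule."""
    # A request that omits page_size falls back to the default.
    requested = DEFAULT_PAGE_SIZE if page_size is None else page_size
    if not has_api_key and requested > DEFAULT_PAGE_SIZE:
        return 401  # Unauthorized: anonymous request above the default size
    return 200  # OK


print(status_for_search(500, has_api_key=False))   # anonymous, page_size=500
print(status_for_search(None, has_api_key=False))  # anonymous, default size
print(status_for_search(500, has_api_key=True))    # request with an API key
```

Under this sketch, anonymous requests at or below the default page size are unaffected, which is why ordinary traffic continued to receive 200 OK responses.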
With this change, we were able to successfully mitigate the botnet and return our resource consumption to typical levels. This can be seen easily with a few Elasticsearch metrics:
While the intention behind Openverse is to make openly licensed media easy to access, we don’t currently have the capacity to enable users to access the entire dataset at once. We do plan on exploring options for this in the future.
We’re pleased that this mitigation was successful, and we will continue to be vigilant in ensuring uninterrupted access to Openverse for our users!