Openverse is a search engine for openly-licensed media.
The Openverse team builds the Openverse Catalog, API, and front-end application, as well as integrations between Openverse and WordPress. Follow this site for updates and discussions on the project.
You can also come chat with us in #Openverse on Slack. We have a weekly developer chat at 15:00 UTC on Tuesdays.
#1104: Pass actor for staging deploys with the `-f` flag
#1103: Add `GITHUB_TOKEN` to GitHub CLI step
#1098: Update other references of media count to 700 million
#1041: Bump filelock from 3.9.0 to 3.10.7 in /ingestion_server
#1040: Bump pytest-order from 1.0.1 to 1.1.0 in /ingestion_server
#1039: Bump aws-actions/configure-aws-credentials from 1 to 2
#1038: Use `ACCESS_TOKEN` for the Project automation
#1034: Dispatch workflows instead of regular reuse to show deployment runs
#1031: Use the `issue.node_id` for GraphQL API
#1061: http in api response
#1033: Deployment workflow runs do not show in workflow run history
#1074: Create DAG to fix PhyloPic's `foreign_identifier` column
#1072: Offset iNaturalist DAG from monthly by one day
#1071: 🔄 synced file(s) with WordPress/openverse
#1064: Bump apache-airflow[amazon,http,postgres] from 2.5.1 to 2.5.2
#435: Update env vars to fix URL scheme for related endpoints
#1014: Pass ISSUE_ID and PROJECT_ID to the new_issue workflow
#1011: Add release-drafter API configuration to enable testing in #987
#1006: Add CNAME in other use of `actions-gh-pages`
#981: Switch to internal header on single results
#944: Absorb `build-nginx` job into `build-images` job
#983: The workflow for new project automation needs a GitHub token
#976: Cutting a release does not successfully run CI/CD workflow
#972: Dump Django URL resolver configuration and confirm all routes are expected API routes
#971: Staging API does not automatically deploy after merge to main
#970: Reporting HTML view `SELECT`s all media records
#968: [Improvement] Diagrams with transparent background are not great in dark mode
#966: General setup guide requires Homebrew, but has no info on installation
#960: timeout is required to successfully create the elastic search indexes using the just file
#958: Tags incorrectly escaped utf-8 characters to `uxxxx`
#953: Make `searchTerm` non-required for Audio track and Image cell
#859: Consider JSON5 for `package.json` files
#388: API ECS Migration
#463: Single result page should use header with navigation links
#478: Optimize CI pipeline avoiding running jobs for unrelated changes
#482: Filter counter in button and tab
#675: Use `thumbnail_url` for thumbnail generation when present
#755: Build UI for API consumers to get their key and check their usage (original #335)
#1057: 🔄 synced file(s) with WordPress/openverse
#1052: Update README.md with documentation reference
#1049: Handle the upper case licenses in the add_license_dag
#922: Add `.github` to CODEOWNERS
#916: Update Vue from 2.7.10 to 2.7.14
#910: Add user validation, concurrency, manual runs to deployment workflow
#909: Add get-image-tag as dependency for nginx build step
#828: Move peerDependencyRules to root package.json
#879: Yellow background when reporting an image from Gutenberg
#878: Update reverse proxy to allow for path prefix rewriting on the API
#877: Refactor deployment workflow into separate workflows per app and environment
#1051: Adjust schedule for long running queries termination
#1050: Add DAG for terminating long-running queries
#1045: Use Python to group items by license to speed up the query
#1003: Remove alternate image extraction from SMK, fix foreign landing URL
Closed issues
#1044: `add_license_url` DAG is inefficient and fails due to timeout
#840: Add production API deployment action
#838: Bump elasticsearch-dsl from 7.4.0 to 7.4.1 in /api
#836: Bump python-decouple from 3.7 to 3.8 in /api
#835: Bump python-decouple from 3.7 to 3.8 in /ingestion_server
#832: Bump elasticsearch-dsl from 7.4.0 to 7.4.1 in /ingestion_server
#831: Bump pytest from 7.2.1 to 7.2.2 in /ingestion_server
#830: Bump renovatebot/github-action from 34.152.5 to 34.154.4
#806: Fix crash when more than one `q` parameter is provided in URL
#781: `setSearchTerm` fails when `query.q` is an array
#760: Update “Week in Openverse” script to support monorepo
#381: Set frontend memory and cpu to match staging
Closed issues
#407: Update deployment action to generate a block per service
#391: Set up and deploy production API on ECS
#401: Restore the functionality of the weekly Make post
#398: Bump ipython from 8.3.0 to 8.10.0 in /automations/python
#396: Fix project automation logic around closed PRs
#366: Proposal: Openverse project process
Closed issues
#414: Remove the old header code
#413: Add option to sort search results by created_on
#276: Omit issues with closed PRs of moving to the “In Progress” column in the project board
#768: Load_data steps for `image` skipped during Wikimedia reingestion
#766: Update to new version of Phylopic API
#689: Investigate converting iNaturalist to an incremental DAG
#684: inaturalist data quality: issue warning with missing photo ids
#2188: Download translations in bulk to prevent GlotPress throttling
#2187: Use aria-label for WordPress affiliation link
#2185: Move the sidebar in the DOM order
#2184: Reduce GlotPress limit further to ensure all languages
#2180: Add “skip to content” links to the homepage and the 404 page; fix footer role
#2178: Add “skip to content link” to the Single result pages
#2177: Use h1 for the main heading on the homepage
#2172: Make translations more reliably present in all environments
#2169: Update common’s size estimate to 2.5 billion
#2146: Remove the searchBy creator filter from the filters list
#2189: Add translator note for Dutch translation of `search-guide.example.prefix.content`
#2183: Improve accessibility of the WordPress link in the footer
#2182: Expanding the filters sidebar does not focus a screen reader on the filters section
#2179: Default layout should not nest `footer` inside `main`
#2176: The first heading on the home page should be `h1`
#373: Update @dhruvkb's SSH key in `globally_authorized_keys`
#376: Match frontend linting dependency versions to frontend/pull/2121
#369: chore(deps): update alex-page/github-project-automation-plus action to v0.8.3
#386: iFrame removal
#993: 🔄 synced file(s) with WordPress/openverse
#768: Load_data steps for `image` skipped during Wikimedia reingestion
#766: Update to new version of Phylopic API
#109: Update Europeana endpoint and accomodate v2 API changes
#1139: Bump ipython from 8.9.0 to 8.10.0 in /ingestion_server
#1135: Bump cryptography from 39.0.0 to 39.0.1 in /api
#1082: Add zero-downtime deployments & data transformations guide
Closed issues
#1030: Add documentation describing the data migration process we should follow
#2155: Fix the header scrolling
#2154: Enable new header feature flag in production
#2149: Remove ring offset to reduce ring thickness
#2147: Tighten the condition for audio playback to continue across pages
#2144: Increase horizontal padding in mobile search grid
#2140: Update content switcher tabs fonts and icons to match the mockups
#2121: Update eslint- and prettier- related dependencies
#2104: Add Debounce to filter selection
#2101: Update search term when navigating with browser`s back and forward buttons
#2093: Serve Prometheus metrics on a separate port; fix metrics development workflow
Closed issues
#2176: The first heading on the home page should be `h1`
#2151: Browser zoom-in the search input on mobile when typing
#2150: The header is not fixed when scrolling to bottom
#2133: Increase padding of search results grid on mobile size
#2129: Focus outline thickness is incorrect in the ‘Clear filters’ button
#2118: Audio keeps playing on a single image result
#2082: Add the search term as a query parameter to the single result page
#2010: Minimize the use of JS for layout in the VAudioTrack component
#2009: Menu and breakpoint improvement in new header
#1295: Homepage search type buttons for small width viewports run off screen in languages with longer labels
#810: Requests to invalid / non-existent resources should return a 404 HTTP status
#811: Photos > Images should use a server-side, not client-side redirect
#373: Update @dhruvkb's SSH key in `globally_authorized_keys`
#371: Include deployment secrets in `WordPress/openverse`
#370: Add note about nuxt memory usage after deploy
This was a passing thought I had that I wanted to note somewhere. Currently the ingestion server is a small Falcon app that runs most aspects of the data refresh, but then also (in staging/prod) interacts with a fleet of “indexer worker” EC2 instances when performing the Postgres -> Elasticsearch indexing.
We have plans for moving the data refresh steps from the ingestion server into Airflow. Most of these steps are operations on the various databases, so they’re not very processor-intensive on the server end. However, the indexing steps are intensive, which is why they’re spread across 6 machines in production (and even then it can take a number of hours to complete).
We could replicate this process in Airflow by setting up Celery-based workers so that the tasks run on a separate instance from the webserver/scheduler. Ultimately I’d like to go this route (or use something like the ECS Executor rather than Celery), but that’s a non-trivial effort to complete.
One other way we could accomplish this would be to use ECS tasks! We could have a container defined specifically for the indexing step, which expects to receive the range on which to index and all necessary connection information. We could then kick off n of those jobs using the EcsRunTaskOperator, and use the EcsTaskStateSensor to determine when they complete. This could be done in our current setup without any new Airflow infrastructure. It’d also allow us to remove the indexer workers, which currently sit idle (albeit in the stopped state) in EC2 until they are used.
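As a rough illustration of the "range on which to index" idea, here is a minimal sketch of how a record range might be divided among n indexing tasks. It assumes records are addressed by a contiguous integer id range, which the post does not state; `IndexRange` and `split_index_ranges` are hypothetical names, not part of the ingestion server or Airflow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexRange:
    """Hypothetical half-open id range handed to one indexing task."""

    start: int  # inclusive
    end: int  # exclusive


def split_index_ranges(min_id: int, max_id: int, n_tasks: int) -> list[IndexRange]:
    """Split the inclusive range [min_id, max_id] into at most n_tasks chunks.

    Assumption: ids are integers and roughly evenly populated, so
    equal-width chunks give each task a similar amount of work.
    """
    if n_tasks < 1:
        raise ValueError("n_tasks must be at least 1")
    if max_id < min_id:
        return []

    total = max_id - min_id + 1
    base, extra = divmod(total, n_tasks)

    ranges = []
    start = min_id
    for i in range(n_tasks):
        # The first `extra` chunks take one additional id each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # fewer ids than tasks: stop creating empty chunks
        ranges.append(IndexRange(start, start + size))
        start += size
    return ranges


if __name__ == "__main__":
    # Six chunks, mirroring the six indexer workers mentioned above.
    for chunk in split_index_ranges(1, 1_000_000, 6):
        print(chunk)
```

Each resulting range could then be passed to one container run as its parameters; how those parameters are delivered (environment variables, command overrides, and so on) is left open here.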
All Openverse contributors are invited to attend a new meeting to review our current projects and roadmap for the rest of the year. The first of these sessions will be held on August 10th, 2022 at 15:00 UTC. Visit the #openverse channel in the Make WP Slack to see a video chat link prior to the meeting start.
Here are some helpful documents for use during the meeting:
Yesterday at 20:20 UTC, we released version 2.5.5 of our API! Along with a few dependency upgrades and DevEx improvements/fixes, this release also brings an important change regarding anonymous API requests. After v2.5.5, any media searches that are made without an API key cannot request more than 20 results per page.
This change was made in order to mitigate behavior we were seeing on the API which was adversely affecting performance for other users, our capacity to update the data that backs Openverse, and our ability to deploy new changes to the API.
– A user must adhere to all rate limits, registration requirements, and comply with all requirements in the Openverse API documentation;
– A user must not scrape the content in the Openverse Catalog;
– A user must not use multiple machines to circumvent rate limits or otherwise take measures to bypass our technical or security measures;
– A user must not operate in a way that negatively affects other users of the API or impedes the WordPress Foundation’s ability to provide its services;
Background
Beginning around May 18th, we saw a significant increase in traffic.
Total requests made to api.openverse.engineering over the last 30 days
While the digital demographics (browser, user agent, OS, device type, etc.) were quite varied, one feature stuck out – these requests were all being made with the page_size=500 parameter.
Total requests made to api.openverse.engineering over the last 30 days using the page_size=500 parameter
Over the course of the last 30 days, these requests constituted almost 80% of our total traffic! While our application is designed to handle this many requests, it is not designed to handle each request querying for 500 results per page (the default page size is 20). As such, this created significant strain on our Elasticsearch cluster and eventually caused disruptions in the API’s ability to serve results. The image below combines a few of our monitoring tools to show a general correlation between the page_size=500 requests and our Elasticsearch resource utilization.
Request count compared to Elasticsearch resource utilization
Even before this release, our application was set up to throttle individual, anonymous users to 1 request/second. These page_size=500 requests were coming from a myriad of different hosts; the initiator was able to circumvent the individual throttles by employing a large number of machines (also known as a botnet). These machines were also predominantly tied to a single data center and a single ASN, which led us to believe this was orchestrated by a single user.
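To illustrate why a per-user throttle does not cap aggregate load, here is a minimal sketch of a throttle keyed by client. It is not how the Openverse API implements throttling; `PerClientThrottle` and its `allow` method are hypothetical, and timestamps are passed in explicitly to keep the example deterministic.

```python
class PerClientThrottle:
    """Hypothetical throttle: at most one request per interval per client."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval  # seconds between allowed requests
        self._last_allowed: dict[str, float] = {}

    def allow(self, client_id: str, now: float) -> bool:
        """Return True if this client may make a request at time `now`."""
        last = self._last_allowed.get(client_id)
        if last is not None and now - last < self.min_interval:
            return False  # same client, too soon after its last request
        self._last_allowed[client_id] = now
        return True


if __name__ == "__main__":
    throttle = PerClientThrottle()
    # One host is limited to one request per second...
    print(throttle.allow("host-0", now=0.0), throttle.allow("host-0", now=0.5))
    # ...but many distinct hosts are each allowed in the same instant.
    print(sum(throttle.allow(f"bot-{i}", now=0.0) for i in range(100)))
```

Because the state is tracked per client, a request spread across many hosts passes the check on every host, which is the gap described above.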
This behavior was clearly in violation of our Terms of Service, since it was:
– Not using a registered API key for high-volume use
– Scraping data from Openverse
– Using multiple machines to circumvent the application throttles
– Consuming significant enough resources that it impacted other users of Openverse
Mitigation
As mentioned above, we deployed a change that returns a 401 Unauthorized for any anonymous request to the API that includes a page_size greater than the default of 20. Almost immediately after deployment, we saw this mitigation take effect when observing request behavior:
Total number of page_size=500 requests made over the course of 6 hours, separated by return status code
In the above graph, you can see where we deployed v2.5.5 (~13:00 PST) – the number of 200 OK responses decreased, and the number of 401 Unauthorized responses increased significantly! Eventually all of the page_size=500 requests were being rejected as unauthorized.
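The rule behind this mitigation can be sketched in a few lines. This is an illustration of the behavior described in this post, not the API's actual code; `status_for_request` and `DEFAULT_PAGE_SIZE` are hypothetical names, and any limits that apply to requests made with an API key are not modelled.

```python
from typing import Optional

DEFAULT_PAGE_SIZE = 20  # default page size stated in this post


def status_for_request(page_size: Optional[int], has_api_key: bool) -> int:
    """Return the HTTP status this sketch would produce for a search request.

    Anonymous requests asking for more than the default page size are
    rejected with 401; everything else is treated as 200 here.
    """
    requested = DEFAULT_PAGE_SIZE if page_size is None else page_size
    if not has_api_key and requested > DEFAULT_PAGE_SIZE:
        return 401  # Unauthorized: larger pages require an API key
    return 200  # assumption: limits for keyed requests are out of scope


if __name__ == "__main__":
    print(status_for_request(500, has_api_key=False))
    print(status_for_request(None, has_api_key=False))
```

Checking only the requested page size keeps the rule cheap to evaluate and leaves anonymous requests at the default size unaffected.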
With this change, we were able to successfully mitigate the botnet and return our resource consumption to typical levels. This can be seen easily with a few Elasticsearch metrics:
Elasticsearch metrics over the last 12 hours
While the intention behind Openverse is to make openly licensed media easy to access, we don’t currently have the capacity to enable users to access the entire dataset at once. We do plan on exploring options for this in the future.
We’re pleased that this mitigation was successful, and we will continue to be vigilant in ensuring uninterrupted access to Openverse for our users!