A week in Openverse: 2023-03-13 – 2023-03-20

openverse

Merged PRs

  • #939: Add console_prod handler to query logging to allow in production
  • #936: Always build both api & ingestion server images for either service
  • #935: Deregister media model admins and dependents
  • #934: Add Django DB logging option
  • #933: Add application name to DB
  • #931: Remove Docker image loading from docs steps
  • #930: Fix links on the main Storybook page
  • #927: Fix global audio player's close button
  • #925: Build `api` when ingestion server changes
  • #922: Add `.github` to CODEOWNERS
  • #918: Fix global audio player layout
  • #917: Update pinia and pinia/testing
  • #916: Update Vue from 2.7.10 to 2.7.14
  • #915: Fix background color on report pages
  • #910: Add user validation, concurrency, manual runs to deployment workflow
  • #909: Add get-image-tag as dependency for nginx build step
  • #895: Skip more jobs based on changed files
  • #894: Simplify and fix bundle size workflow
  • #893: Only generate POT file if `en.json5` has changed
  • #891: Add ability to boost search results by authority
  • #889: Prepare Docker setup for monorepo
  • #888: Adding brand assets
  • #886: Split deployment workflow into 4 separate workflows
  • #882: Only run stack label addition step on pull requests
  • #873: Project Proposal: Detecting, filtering, and blurring results that include sensitive terms
  • #844: Implement analytics in Nuxt
  • #828: Move peerDependencyRules to root package.json
  • #397: Provider tally extraction script

Closed issues

  • #929: The links in Storybook have not been updated to monorepo
  • #928: Frontend PRs fail CI
  • #926: Global audio player cannot be closed when the audio is playing
  • #921: Action Required: Fix Renovate Configuration
  • #920: Django check in CI is flakey because of plausible check
  • #913: Global audio player is broken
  • #908: `SEMANTIC_VERSION` is not supplied to nginx image
  • #906: Port conflict with Slack
  • #879: Yellow background when reporting an image from Gutenberg
  • #878: Update reverse proxy to allow for path prefix rewriting on the API
  • #877: Refactor deployment workflow into separate workflows per app and environment
  • #871: Jamendo thumbnails are failing
  • #865: Move Docker-only directories from root to `docker/`
  • #849: Skip frontend docker image build and its tests on non-frontend code changes
  • #827: Move pnpm peerDependencyRules.allowedVersions to the root package.json
  • #825: Set up wrangling for events
  • #380: Initial analysis of Redis provider tallies pre & post iNaturalist ingestion
  • #689: Add additional logging around search_controller's ES query building

openverse-catalog

Merged PRs

  • #1051: Adjust schedule for long running queries termination
  • #1050: Add DAG for terminating long-running queries
  • #1045: Use Python to group items by license to speed up the query
  • #1003: Remove alternate image extraction from SMK, fix foreign landing URL

Closed issues

  • #1044: `add_license_url` DAG is inefficient and fails due to timeout
  • #1043: The Noun Project
  • #1039: Allow Flickr backfill to complete, turn notifications back on
  • #875: Duplicates identified in SMK data
  • #826: Provider: The Noun Project

openverse-infrastructure

Merged PRs

  • #420: secure staging api admin
  • #418: Add db logging and debug log level to production api
  • #417: Add api-production subdomain to access
  • #415: Add user validation, concurrency, manual runs to deployment workflow
  • #414: Add existing API aliases to ECS deployment
  • #413: Restore frontend capacity
  • #412: Add separate deployment workflows per environment/service
  • #411: Add photon auth key to ECS deployment
  • #401: Make desired count configurable, set to 5 in production

Closed issues

  • #399: Increase API ECS service count to match current EC2 production
  • #392: Point `api.openverse.engineering` to `api-production.openverse.engineering`
  • #366: Move staging ECS API to staging.openverse.org/api path route instead of openverse.engineering subdomain.

#openverse, #week-in-openverse

A week in Openverse: 2023-03-06 – 2023-03-13

openverse

Merged PRs

  • #890: Add a stemming override for the word “universe”
  • #885: Add stack to label sync, allow emoji to be defined for whole group
  • #872: Make deployment action “uses” explicit
  • #870: Update sentry; fix config
  • #863: Fix weekly update workflow
  • #862: Add feature flag for fake marking results as sensitive
  • #858: Remove `prepare` script to prevent i18n overwrites inside Docker
  • #851: Make codeowners more specific
  • #848: Identify and fix cause of cURL error 23 when setting up pre-commit
  • #846: Bump boto3 from 1.26.81 to 1.26.84 in /ingestion_server
  • #843: Add preferences for analytics
  • #842: Update homepage copy to “700 million”
  • #841: Bump boto3 from 1.26.81 to 1.26.84 in /api
  • #840: Add production API deployment action
  • #838: Bump elasticsearch-dsl from 7.4.0 to 7.4.1 in /api
  • #836: Bump python-decouple from 3.7 to 3.8 in /api
  • #835: Bump python-decouple from 3.7 to 3.8 in /ingestion_server
  • #832: Bump elasticsearch-dsl from 7.4.0 to 7.4.1 in /ingestion_server
  • #831: Bump pytest from 7.2.1 to 7.2.2 in /ingestion_server
  • #830: Bump renovatebot/github-action from 34.152.5 to 34.154.4
  • #806: Fix crash when more than one `q` parameter is provided in URL
  • #804: RFC + POC: Add Plausible for analytics
  • #798: Handle incorrect types in cookie value
  • #788: Update home link screen reader text
  • #786: Add stack label if available, make get-changes composite action
  • #785: Add actions to search forms

Closed issues

  • #871: Jamendo thumbnails are failing
  • #857: Locales missing in Docker images
  • #854: Add non-production feature flag for marking half of results as sensitive
  • #852: `TypeError` term.trim is not a function
  • #850: PR review requests are not following the CODEOWNERS assignements
  • #845: Handle `precommit` recipe exiting with code 23
  • #839: Update homepage copy to “700 million”
  • #821: Add feature flag for analytics
  • #782: Invalid cookie value causes an error
  • #781: `setSearchTerm` fails when `query.q` is an array
  • #760: Update “Week in Openverse” script to support monorepo
  • #521: `DataCloneError` raised on search in Safari
  • #522: Switch to `ENTRYPOINT` instead of `CMD` in our Dockerfile
  • #545: Dependency Dashboard

openverse-catalog

Merged PRs

  • #1042: Update `LICENSE` to match main repo
  • #1041: Tweak Flickr time division settings, add logs
  • #1038: Add trailing slash to Jamendo thumbnail URLs
  • #1037: 🔄 synced file(s) with WordPress/openverse
  • #1036: Temporarily turn off scheduled image data refreshes, increase matview refresh timeout
  • #1035: Add logging to iNaturalist date check
  • #1034: Add flickr sub provider auditing dag
  • #1031: Adjust Flickr max records to account for incorrect reporting
  • #1028: Improve license URL validation
  • #1005: Add a DAG for backfilling license_url when meta_data is null
  • #976: Add Airflow variable used to configure overrides for task timeouts

Closed issues

  • #1027: `_get_valid_cc_url` makes a network request even for known valid license urls
  • #1024: Improve iNaturalist date check logging
  • #724: Allow execution timeouts to be overridden by Variables
  • #676: Identify new Flickr sub-providers
  • #511: Ensure that all media have `license_url` in `meta_data` field

openverse-infrastructure

Merged PRs

  • #410: 🔄 synced file(s) with WordPress/openverse
  • #408: Increase frontend memory and CPU back up
  • #406: Bump API to v4.0.0, point bump script to monorepo
  • #402: Add API service to ECS Cloudwatch dashboards
  • #398: Stand up production API on ECS
  • #397: Construct API URLs dynamically, change staging domain
  • #386: Add stack check as required for monorepo
  • #381: Set frontend memory and cpu to match staging

Closed issues

  • #407: Update deployment action to generate a block per service
  • #400: Set up API in Cloudwatch ECS dashboard
  • #391: Set up and deploy production API on ECS

#openverse, #week-in-openverse

A week in Openverse: 2023-02-13 – 2023-02-20

openverse

Merged PRs

  • #418: Update sync to account changes to PR template path
  • #407: Change ‘Reverted’ to ‘Rollback’ in project docs
  • #406: Remove front matter from project proposal template
  • #405: Fix authenticated logins for the label PR action
  • #402: Update actions/checkout action to v3
  • #401: Restore the functionality of the weekly Make post
  • #398: Bump ipython from 8.3.0 to 8.10.0 in /automations/python
  • #396: Fix project automation logic around closed PRs
  • #366: Proposal: Openverse project process

Closed issues

  • #414: Remove the old header code
  • #413: Add option to sort search results by created_on
  • #276: Omit issues with closed PRs of moving to the “In Progress” column in the project board

openverse-catalog

Merged PRs

  • #994: Add an “Airflow Alert” issue template
  • #969: Add dayshift to tsv filenames for reingestion workflows

Closed issues

  • #1000: Jobe Alert
  • #997:
  • #768: Load_data steps for `image` skipped during Wikimedia reingestion
  • #766: Update to new version of Phylopic API
  • #689: Investigate converting iNaturalist to an incremental DAG
  • #684: inaturalist data quality: issue warning with missing photo ids

openverse-api

Merged PRs

  • #1144: Add screen to API Docker image
  • #1143: Bump django from 4.1.6 to 4.1.7 in /api
  • #1142: Add API rollback workflow
  • #1141: 🔄 synced file(s) with WordPress/openverse
  • #1140: Bump ipython from 8.9.0 to 8.10.0 in /api
  • #1139: Bump ipython from 8.9.0 to 8.10.0 in /ingestion_server
  • #916: Add option to sort search results by `created_on`

openverse-frontend

Merged PRs

  • #2192: Simplify `get-translations.js` and add error handling and fallbacks
  • #2191: Add a directive for translators to not translate Netherlands
  • #2190: 🔄 synced file(s) with WordPress/openverse
  • #2188: Download translations in bulk to prevent GlotPress throttling
  • #2187: Use aria-label for WordPress affiliation link
  • #2185: Move the sidebar in the DOM order
  • #2184: Reduce GlotPress limit further to ensure all languages
  • #2180: Add “skip to content” links to the homepage and the 404 page; fix footer role
  • #2178: Add “skip to content link” to the Single result pages
  • #2177: Use h1 for the main heading on the homepage
  • #2172: Make translations more reliably present in all environments
  • #2169: Update common’s size estimate to 2.5 billion
  • #2162: Remove the unused old header code
  • #2146: Remove the searchBy creator filter from the filters list
  • #2134: Fix the Search Help (Syntax Guide) links

Closed issues

  • #2189: Add translator note for Dutch translation of `search-guide.example.prefix.content`
  • #2183: Improve accessibility of the WordPress link in the footer
  • #2182: Expanding the filters sidebar does not focus a screen reader on the filters section
  • #2179: Default layout should not nest `footer` inside `main`
  • #2176: The first heading on the home page should be `h1`
  • #2170: Not all locales show up in picker
  • #2163: Remove the `old_header` code
  • #2125: Keyboard navigation to the footer on the search page is impossible
  • #2116: Syntax Guide page links don’t work
  • #1344: Creator filter is unclear

openverse-infrastructure

Merged PRs

  • #379: Deploy API 2.7.6
  • #376: Include Make-related secrets in Terraform
  • #375: Update API email address to openverse.org
  • #373: Update @dhruvkb’s SSH key in `globally_authorized_keys`
  • #372: Sync config with actual infra
  • #371: Include deployment secrets in `WordPress/openverse`
  • #370: Add note about nuxt memory usage after deploy
  • #357: Add CloudWatch agent to API boxes

Closed issues

  • #338: Update API email address to use openverse.org

#openverse, #week-in-openverse

A week in Openverse: 2023-02-08 – 2023-02-15

openverse

Merged PRs

  • #398: Bump ipython from 8.3.0 to 8.10.0 in /automations/python
  • #395: Change authentication token
  • #376: Match frontend linting dependency versions to frontend/pull/2121
  • #369: chore(deps): update alex-page/github-project-automation-plus action to v0.8.3

Closed issues

  • #391: PR Project Automation is failing
  • #389:
  • #387: Baseline SEO improvements
  • #386: iFrame removal

openverse-catalog

Merged PRs

  • #993: 🔄 synced file(s) with WordPress/openverse
  • #974: Update Europeana endpoint
  • #969: Add dayshift to tsv filenames for reingestion workflows

Closed issues

  • #1000: Jobe Alert
  • #997:
  • #768: Load_data steps for `image` skipped during Wikimedia reingestion
  • #766: Update to new version of Phylopic API
  • #109: Update Europeana endpoint and accomodate v2 API changes

openverse-api

Merged PRs

  • #1140: Bump ipython from 8.9.0 to 8.10.0 in /api
  • #1139: Bump ipython from 8.9.0 to 8.10.0 in /ingestion_server
  • #1135: Bump cryptography from 39.0.0 to 39.0.1 in /api
  • #1082: Add zero-downtime deployments & data transformations guide

Closed issues

  • #1030: Add documentation describing the data migration process we should follow

openverse-frontend

Merged PRs

  • #2184: Reduce GlotPress limit further to ensure all languages
  • #2177: Use h1 for the main heading on the homepage
  • #2172: Make translations more reliably present in all environments
  • #2169: Update common’s size estimate to 2.5 billion
  • #2166: Update URL in opensearch.xml
  • #2165: Group localized URLs of the same page in the sitemap
  • #2161: Ensure the lineage is traced correctly
  • #2160: Fix homepage zooming on iPhone
  • #2158: Minify homepage images
  • #2157: Baseline SEO improvements
  • #2156: 🔄 synced file(s) with WordPress/openverse
  • #2155: Fix the header scrolling
  • #2154: Enable new header feature flag in production
  • #2149: Remove ring offset to reduce ring thickness
  • #2147: Tighten the condition for audio playback to continue across pages
  • #2144: Increase horizontal padding in mobile search grid
  • #2140: Update content switcher tabs fonts and icons to match the mockups
  • #2121: Update eslint- and prettier- related dependencies
  • #2120: Update babel-related dependencies
  • #2109: Extract Prometheus module
  • #2104: Add Debounce to filter selection
  • #2101: Update search term when navigating with browser`s back and forward buttons
  • #2093: Serve Prometheus metrics on a separate port; fix metrics development workflow

Closed issues

  • #2176: The first heading on the home page should be `h1`
  • #2170: Not all locales show up in picker
  • #2164: Sitemap is not correct for i18n routes
  • #2159: Missing Translation Strings
  • #2151: Browser zoom-in the search input on mobile when typing
  • #2150: The header is not fixed when scrolling to bottom
  • #2133: Increase padding of search results grid on mobile size
  • #2129: Focus outline thickness is incorrect in the ‘Clear filters’ button
  • #2118: Audio keeps playing on a single image result
  • #2082: Add the search term as a query parameter to the single result page
  • #2010: Minimize the use of JS for layout in the VAudioTrack component
  • #2009: Menu and breakpoint improvement in new header
  • #1295: Homepage search type buttons for small width viewports run off screen in languages with longer labels
  • #810: Requests to invalid / non-existent resources should return a 404 HTTP status
  • #811: Photos > Images should use a server-side, not client-side redirect
  • #812: All pages should output a canonical URL tag
  • #813: Add hreflang directives
  • #505: Debounce filter selection

openverse-infrastructure

Merged PRs

  • #373: Update @dhruvkb’s SSH key in `globally_authorized_keys`
  • #372: Sync config with actual infra
  • #371: Include deployment secrets in `WordPress/openverse`
  • #370: Add note about nuxt memory usage after deploy
  • #369: Bump API to v2.7.5
  • #368: Bump catalog-airflow from v1.5.0 to v1.5.1
  • #367: 🔄 synced file(s) with WordPress/openverse
  • #365: Split next root modules by service

Closed issues

  • #347: Split `next` root modules’ `main.tf` into separate files for each service

#openverse, #week-in-openverse

Applying ECS to the ingestion server/data refresh

This was a passing thought I had that I wanted to note somewhere. Currently the ingestion server is a small Falcon app that runs most aspects of the data refresh, but then also (in staging/prod) interacts with a fleet of “indexer worker” EC2 instances when performing the Postgres -> Elasticsearch indexing.

We have plans for moving the data refresh steps from the ingestion server into Airflow. Most of these steps are operations on the various databases, so they’re not very processor-intensive on the server end. However, the indexing steps are intensive, which is why they’re spread across 6 machines in production (and even then it can take a number of hours to complete).

We could replicate this process in Airflow by setting up Celery-based workers so that the tasks run on a separate instance from the webserver/scheduler. Ultimately I’d like to go this route (or use something like the ECS Executor rather than Celery), but that’s a non-trivial effort to complete.

One other way we could accomplish this would be to use ECS tasks! We could define a container specifically for the indexing step, which expects to receive the range to index and all necessary connection information. We could then kick off n of those jobs using the EcsRunTaskOperator and use the EcsTaskStateSensor to determine when they complete. This could be done in our current setup without any new Airflow infrastructure. It’d also allow us to remove the indexer workers, which currently sit idle (albeit in the stopped state) in EC2 until they are used.
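
The fan-out described above can be sketched in plain Python. This is only a rough illustration under stated assumptions, not the ingestion server's actual logic: the helper names (`split_id_range`, `build_overrides`), the container name `indexer`, and the `START_ID`/`END_ID` environment variables are all hypothetical. It splits an ID range into n contiguous chunks and builds one ECS-style overrides payload per chunk, which is the kind of dictionary that could be handed to the `overrides` argument of `EcsRunTaskOperator`.

```python
def split_id_range(start: int, end: int, n: int) -> list[tuple[int, int]]:
    """Split the half-open range [start, end) into at most n contiguous chunks."""
    if n < 1:
        raise ValueError("n must be at least 1")
    total = end - start
    if total <= 0:
        return []
    base, extra = divmod(total, n)
    chunks = []
    cursor = start
    for i in range(n):
        # The first `extra` chunks take one additional item each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # fewer items than workers: stop creating empty chunks
        chunks.append((cursor, cursor + size))
        cursor += size
    return chunks


def build_overrides(chunk: tuple[int, int], container_name: str = "indexer") -> dict:
    """Build an ECS RunTask-style overrides payload for one chunk.

    The container name and environment variable names are hypothetical.
    """
    start, end = chunk
    return {
        "containerOverrides": [
            {
                "name": container_name,
                "environment": [
                    {"name": "START_ID", "value": str(start)},
                    {"name": "END_ID", "value": str(end)},
                ],
            }
        ]
    }


# Six chunks, matching the six production machines mentioned above.
task_overrides = [build_overrides(c) for c in split_id_range(0, 1_000_000, 6)]
```

Each entry in `task_overrides` would correspond to one ECS task; connection details are left out here and would need to be supplied the same way.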

#airflow, #data-refresh, #ecs, #infrastructure, #openverse

Community Meeting Recap (09 August 2022)

Meeting start

🎉 Done!

👀 Needs review

🚧 In progress

An issue is in the todo column and unassigned.

💬 Agenda discussion

One of our agenda items was already tackled in the previous week, so no discussion on it was necessary. We discussed what else is needed before we can close out & deploy the catalog v1.3.0 milestone.

@krysal also brought to folks’ attention that we still need to run a data refresh in order to confirm some issues we completed in the catalog are addressed downstream.

Meeting end

#openverse, #openverse-weekly-community-meeting

Openverse Prioritization Meeting 2022-08-10

All Openverse contributors are invited to attend a new meeting to review our current projects and roadmap for the rest of the year. The first of these sessions will be held on August 10th 2022 at 1500 UTC. Visit the #openverse channel in the Make WP Slack to see a video chat link prior to the meeting start.

Here are some helpful documents for use during the meeting:

And some background documentation that may help facilitate conversation:

#planning, #prioritization, #roadmap

Mitigating out of terms API usage

Yesterday at 20:20 UTC, we released version 2.5.5 of our API! Along with a few dependency upgrades and DevEx improvements/fixes, this release also brings an important change regarding anonymous API requests. After v2.5.5, any media searches that are made without an API key cannot request more than 20 results per page.

This change was made in order to mitigate behavior we were seeing on the API which was adversely affecting performance for other users, our capacity to update the data that backs Openverse, and our ability to deploy new changes to the API.

Our API Terms of Service state:

– A user must adhere to all rate limits, registration requirements, and comply with all requirements in the Openverse API documentation;

– A user must not scrape the content in the Openverse Catalog;

– A user must not use multiple machines to circumvent rate limits or otherwise take measures to bypass our technical or security measures;

– A user must not operate in a way that negatively affects other users of the API or impedes the WordPress Foundation’s ability to provide its services;

Background

Beginning around May 18th, we saw a significant increase in traffic.

Total requests made to api.openverse.engineering over the last 30 days

While the digital demographics (browser, user agent, OS, device type, etc.) were quite varied, one feature stuck out – these requests were all being made with the page_size=500 parameter.

Total requests made to api.openverse.engineering over the last 30 days using the page_size=500 parameter

Over the course of the last 30 days, these requests constituted almost 80% of our total traffic! While our application is designed to handle this many requests, it is not designed to handle each request querying for 500 results per page (the default page size is 20). As a result, these requests put significant strain on our Elasticsearch cluster and eventually disrupted the API’s ability to serve results. The image below combines a few of our monitoring tools to show a general correlation between the page_size=500 requests and our Elasticsearch resource utilization.

Request count compared to Elasticsearch resource utilization

Even before this release, our application was set up to throttle individual, anonymous users to 1 request/second. These page_size=500 requests were coming from a myriad of different hosts; the initiator was able to circumvent the individual throttles by employing a large number of machines (also known as a botnet). These machines were also predominantly tied to a single data center and a single ASN, which led us to believe this was orchestrated by a single user.
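
To illustrate why a per-user throttle alone could not contain this traffic, here is a minimal sketch, assuming a simple limiter keyed on the client address. The API's real throttling code is not shown in this post; the class name `PerClientThrottle` and the addresses are made up for the example.

```python
class PerClientThrottle:
    """Allow at most one request per `min_interval` seconds for each client key."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self._last_allowed = {}  # client key -> time of the last allowed request

    def allow(self, client: str, now: float) -> bool:
        last = self._last_allowed.get(client)
        if last is not None and now - last < self.min_interval:
            return False  # this client already made a request in the window
        self._last_allowed[client] = now
        return True


throttle = PerClientThrottle()

# One host sending 100 requests within the same second: only the first passes.
single_host = sum(throttle.allow("10.0.0.1", now=0.0) for _ in range(100))

# 100 distinct hosts sending one request each in that second: all of them pass.
many_hosts = sum(throttle.allow(f"10.0.1.{i}", now=0.0) for i in range(100))
```

The limiter bounds each client individually, so the aggregate load grows with the number of machines involved.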

This behavior was clearly in violation of our Terms of Service, since it was:

  1. Not using a registered API key for high-volume use
  2. Scraping data from Openverse
  3. Using multiple machines to circumvent the application throttles
  4. Consuming significant enough resources that it impacted other users of Openverse

Mitigation

As mentioned above, we deployed a change that returns a 401 Unauthorized for any anonymous request to the API that includes a page_size greater than the default of 20. Almost immediately after deployment, we saw this mitigation take effect when observing request behavior:

Screenshot of a Cloudflare analytics page. The graph in the center shows total requests with page_size=500, separated by status code over 6 hours. A consistent number of requests (split between 301 and 200) can be seen starting at 9:00 PST. At 13:00 PST, the number of 401 requests begins to overtake the number of 200 requests. After 13:15, the number of 200 requests drops to zero and all requests returned are 401s.
Total number of page_size=500 requests made over the course of 6 hours, separated by return status code

In the above graph, you can see where we deployed v2.5.5 (~13:00 PST) – the number of 200 OK responses decreased, and the number of 401 Unauthorized responses increased significantly! Eventually all of the page_size=500 requests were being rejected as unauthorized.
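
The rule itself is small. The following is a rough sketch of the check as described in this post, not the code that shipped in v2.5.5; the function name `page_size_status` and its parameters are hypothetical, and it does not model any other limits that may apply to requests made with an API key.

```python
from typing import Optional

DEFAULT_PAGE_SIZE = 20  # default page size stated in this post


def page_size_status(page_size: Optional[int], has_api_key: bool) -> int:
    """Return the HTTP status this rule would produce for a search request."""
    requested = DEFAULT_PAGE_SIZE if page_size is None else page_size
    if not has_api_key and requested > DEFAULT_PAGE_SIZE:
        return 401  # Unauthorized: anonymous request asked for too many results
    return 200
```

Under this sketch, `page_size_status(500, has_api_key=False)` yields 401, while an anonymous request that omits `page_size` or asks for 20 results still yields 200.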

With this change, we were able to successfully mitigate the botnet and return our resource consumption to typical levels. This can be seen easily with a few Elasticsearch metrics:

Elasticsearch metrics over the last 12 hours

While the intention behind Openverse is to make openly licensed media easy to access, we don’t currently have the capacity to enable users to access the entire dataset at once. We do plan on exploring options for this in the future.

We’re pleased that this mitigation was successful, and we will continue to be vigilant in ensuring uninterrupted access to Openverse for our users!

#openverse, #infrastructure, #api

A week in Openverse: 2022-03-21 – 2022-03-28

openverse

Merged PRs

  • #202: Update feature_request.md label template to remove priority and aspect
  • #198: Update bug_report.md to remove default priority label
  • #197: Update bug_report.md to remove `Expectation` section
  • #194: Add infrastructure repo to synced repo list

Closed issues

  • #157: Create Openverse GitHub activity overview dashboard
  • #140: Remove “Expectation” header from bug report template
  • #73: [Feature] Configure ESLint and Prettier for JS scripts
  • #31: [Meta] 3D Models
  • #30: Remove requests for reviews from closed PRs

openverse-catalog

Merged PRs

  • #441: 🔄 Synced file(s) with WordPress/openverse
  • #440: 🔄 Synced file(s) with WordPress/openverse
  • #424: Add LRU cache to `is_valid_license_info`
  • #423: Change PhyloPic date range & schedule interval
  • #422: Round duration for provider ingestion completion message
  • #421: Enable XCom pickling in Airflow
  • #397: Add data refresh to Airflow

Closed issues

  • #419: Add an `lru_cache` to `is_valid_license_info`
  • #410: Change Phylopic to @weekly
  • #377: Enable XCom pickling
  • #373: Format “Airflow DAG Load Data Complete” duration
  • #353: Data refresh orchestration DAG

openverse-api

Merged PRs

  • #591: 🔄 Synced file(s) with WordPress/openverse
  • #590: 🔄 Synced file(s) with WordPress/openverse
  • #586: 🔄 Synced file(s) with WordPress/openverse
  • #584: Replace plural `categories` as field name with singular `category`
  • #583: Replace plural `categories` as field name with singular `category`
  • #580: Add CI check for uncommitted migrations
  • #577: Remove `query_serializer` for reporting endpoints
  • #576: Use `https` for hyperlinked APIs by replacing the URLs

Closed issues

  • #573: Return secure URLs for the fields thumbnail, detail_url and related_url.
  • #571: Run `makemigrations` in CI to prevent merging PRs with missing migrations.

openverse-frontend

Merged PRs

  • #1187: 🔄 Synced file(s) with WordPress/openverse
  • #1186: Mock services using jest.mock
  • #1183: 🔄 Synced file(s) with WordPress/openverse
  • #1182: Fix missing nuxt types
  • #1178: Add useFetchState composable
  • #1175: Add the 3D model SVG
  • #1173: Remove redundant type and simplify media service
  • #1172: Content page component design fixes
  • #1168: Update audio categories
  • #1166: Remove source links from sources page
  • #1163: Add support for TypeScript in Vue SFCs.
  • #1153: Add local visual regression infrastructure
  • #1150: Typescriptify `api-service`
  • #1148: Hotfix for negative values in peaks
  • #1147: Strictly filter sentry errors
  • #1144: Enable HTTPS in local development
  • #1142: fix focus outline placement button
  • #1140: Fix hero search button layout error
  • #1139: Add import extension linting rule
  • #1134: Convert license utils and constants to TS
  • #1131: Use links instead of buttons for header search type switcher
  • #1040: Convert the search store to Pinia

Closed issues

  • #1169: Audio category filter not working correctly
  • #1149: Add types to `data/api-service`
  • #1145: Add sentry ignore filters
  • #1143: Enable https in local development
  • #1138: Enable `import/extensions` rule for ESLint
  • #1136: Layout error in the hero search button in some locales
  • #1130: Search type switcher items in the header should use a link instead of a button
  • #1128: Europeana and SoundCloud don’t support search filters
  • #1122: Add 3D model icon svg to the project
  • #1121: Reduce to a single source of truth for search filters
  • #1110: Fix play/pause button focus outline placement
  • #1090: Create `VContentPage` component
  • #1037: Convert `search` store from Vuex to Pinia
  • #1019: Configure CI to run visual regression tests
  • #1017: Configure local visual regression testing
  • #1008: Providers links from Source page not working properly
  • #931: Include `utils/license.js` in `tsconfig.json`

#openverse, #week-in-openverse

A week in Openverse: 2022-03-14 – 2022-03-21

openverse

Merged PRs

  • #171: RFC: 3D Model Support

openverse-catalog

Merged PRs

  • #418: Fix invalid license urls from Finnish Museum API
  • #417: Use published Docker image in primary docker-compose.yml
  • #416: Fix schedule intervals on Cleveland Museum & Wikimedia Commons
  • #415: Reduce noise in NYPL ingestion
  • #414: Update API requests for Museum Victoria DAG
  • #413: Add ConnectionError to acceptable flaky exceptions for Freesound
  • #412: Add OFEO-SG subprovider
  • #409: Group test runs by module or class
  • #404: 🔄 Synced file(s) with WordPress/openverse
  • #402: Make ‘sound’ category more specific
  • #395: Handle duplicate keys in load_data task

Closed issues

  • #408: Group tests by test class in pytest to prevent test collisions
  • #406: Smithsonian workflow is missing configuration for sub-providers
  • #401: NYPL provider script is noisy regarding missing primary creators
  • #392: Finnish Museum `pull_data` freezes and times out
  • #391: PhyloPic DAG detects no content even when data exists
  • #390: Museum Victoria DAG fails to pull data
  • #389: Freesound pull_data task fails when getting audio file size
  • #388: Handle duplicate keys in the TSV load_data task
  • #379: Change Wikimedia Commons schedule interval to @daily
  • #378: Use published Docker image in primary docker-compose.yml
  • #368: Rename the “ingestion server” to “data refresh”

openverse-api

Merged PRs

  • #570: Add missing migrations
  • #568: Add throttle exemptions
  • #566: 🔄 Synced file(s) with WordPress/openverse
  • #556: Add pronunciation as valid sound category
  • #554: Add parameter to exclude certain sources

Closed issues

  • #565: Create an unrestricted rate limit model
  • #553: Query param to exclude a source
  • #526: Sound category mismatch
  • #391: Monitoring all the things

openverse-frontend

Merged PRs

  • #1137: Remove lodash.findindex from dependencies
  • #1129: Fix audio track null duration and add defaultRef
  • #1120: Update tailwindcss-rtl, talkback and typescript
  • #1115: 🔄 Synced file(s) with WordPress/openverse
  • #1112: Tweaks to the Image Details page
  • #1098: Fix mature content report submission
  • #1072: Refactor media store results getters
  • #1058: Convert more utils to TypeScript
  • #1057: Run e2e tests inside a docker container

Closed issues

  • #1111: Wrong font size on image details page and has horizontal scrolling on mobile
  • #1106: Replace `lodash.isempty` with domain-specific implementation
  • #1105: Replace `lodash.findindex` with `Array.prototype.findIndex`
  • #1079: Mature content report submission is broken
  • #1076: Audio track current time sometimes being set to non-real number
  • #1056: Faulty logic for audio count on the all results view
  • #1030: Audit tree-shaking and dead-code removal when using environment flags from `node_env.ts`
  • #929: Add types to `utils/get-parameter-by-name.js`
  • #920: Add types to `utils/attribution-html.js`
  • #895: Homepage search button text doesn’t fit in some locales
  • #756: Switch to Pinia

openverse-browser-extension

Merged PRs

  • #32: 🔄 Synced file(s) with WordPress/openverse

#openverse, #week-in-openverse