Openverse Now Includes Over 1 Million Audio Records

Openverse has reached a crucial milestone of including over 1,000,000 Creative Commons licensed audio files in our collection of GPL-compatible media. When we launched audio earlier this year we included roughly ten thousand files.

The growth of the collection over the year is a testament to our contributors, partners, and audio sources. In particular, work to refactor our Provider DAGs, the scripts which collect Openverse media from third-party APIs, led by sponsored contributor @stacimc, helped us reach this exciting milestone. Thanks to Staci and all the Openverse contributors who made this possible!

Openverse indexes audio from Jamendo, Freesound, and Wikimedia Commons. Do you have another source of openly-licensed audio you’re excited about? Please send Openverse contributors your suggestions using our GitHub issue form.

Community Meeting Recap (22 November 2022)

[Slack: Meeting start]

🎉 Done!

Notably, this week included many submissions from community contributors!

👀 Needs review

🚧 In progress/To Do

A special highlight was made for issues that are available for pick up from:

📒 Agenda

We briefly discussed the following points:

Comments are welcome in each issue/post.

[Slack: Meeting end]

#openverse-weekly-community-meeting

Preparing the next migration of the Catalog

As we have fixed and reactivated data ingestion via more provider scripts, some shortcomings in the data model have been uncovered. Today I went through the backlog of Catalog issues and noticed we had accumulated a small pile of problems that seem to need addressing primarily at the database level.

For some, the discussion has already begun in other posts, but I wanted to gather everything I have noticed in one place, so we can discuss whether to make all the changes together, split them into more manageable chunks, or find alternative solutions. These are disruptive, breaking changes, given that they alter the structure of the TSV files (a key part of the ingestion process), and they need to be implemented carefully. As a result, they are usually postponed as a last resort until it is confirmed that they are necessary.

Columns to modify or add

  • filesize : change type to bigint
  • duration : change type to bigint
  • mature : add a boolean column
  • description : add a text or varchar(n) column
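
As a rough illustration of what these changes amount to at the database level, the sketch below renders them as Postgres ALTER TABLE statements. The helper name, the table passed in, and the choice of text (rather than varchar(n)) for description are assumptions made for illustration; an actual migration would also have to cover the TSV files and the API.

```python
# Hypothetical sketch: render the proposed column changes as Postgres DDL.
# Helper and constant names are illustrative, not actual catalog code.

# Columns whose type would change (column name -> new type).
MODIFIED_COLUMNS = {"filesize": "bigint", "duration": "bigint"}

# Columns that would be added (column name -> type). "text" is one of the
# two options mentioned for description; it is an assumption here.
ADDED_COLUMNS = {"mature": "boolean", "description": "text"}


def build_statements(table: str) -> list[str]:
    """Return one ALTER TABLE statement per proposed change."""
    statements = [
        f"ALTER TABLE {table} ALTER COLUMN {column} TYPE {new_type};"
        for column, new_type in MODIFIED_COLUMNS.items()
    ]
    statements += [
        f"ALTER TABLE {table} ADD COLUMN {column} {new_type};"
        for column, new_type in ADDED_COLUMNS.items()
    ]
    return statements


if __name__ == "__main__":
    # Assumes the audio table is one of the tables affected.
    for statement in build_statements("audio"):
        print(statement)
```

Keeping the changes in plain dictionaries makes it easy to review the full set in one place, whether they end up being applied together or in smaller chunks.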

Issues

The need for these changes comes from this list of issues that could be concentrated into a milestone:

Also, consider the following:

Previous related work

We have added columns in the past; you can see the PR for including the category in the image table. We also plan to add more media types next year, so this could serve as an opportunity to refine this process and make it simpler.

Finally, the call is to start thinking about this. There is no hurry currently, but the need is becoming more and more evident. What do you think? Do you see it as viable? When could we start this endeavor?

#data-normalization, #postgres

Thinking towards 2023

The Openverse community has been discussing plans for upcoming projects and our vision for the coming year in a few different places. I’d like to suggest we focus all remaining discussion here.

I am personally curious about the following guiding questions:

1. What could we improve on from the approach taken in our 2022 roadmap?
2. What guiding, big-picture goals should inform our work in 2023?
3. What projects might we take on, related to those goals?

I’d like to propose we take about 3 weeks to discuss these topics in depth, supplemented with synchronous sessions. We can move our community “priorities” meeting one week earlier to accommodate this. I’ll post about that shortly.

Record number of contributors for the Catalog

This week we will be deploying a new version of the catalog, v1.3.6. Our continuous deployment setup automatically ships changes to DAGs on each commit to main, but changes to our Docker image (e.g. configuration or dependency version changes) require a new release. Thus we occasionally need to cut a release and deploy a new version of the catalog. This release also marks the completion of the Provider DAG refactors milestone, a huge effort to modernize and update our provider ingestion scripts.

What’s particularly exciting about this release is that it has the largest number of unique contributors to the catalog since the start of the Openverse project! We had contributions from over 16 different maintainers & community members (and even a few bots), many of whom contributed to the catalog for the first time.

Thank you to all the following who made this possible (and those who contributed in ways other than committing):

This project would not be possible without you all!

Community Meeting Recap (15 November 2022)

[Slack: Meeting start]

🎉 Done!

👀 Needs review

🚧 In progress/To Do

📒 Agenda

We briefly discussed the following points:

Comments are welcome in each issue/post.

[Slack: Meeting end]

#openverse-weekly-community-meeting

Community Meeting Recap (25 October 2022)

[Slack: Meeting start]

🎉 Done!

👀 Needs review

🚧 In progress/To Do

📒 Agenda

We briefly discussed the following points:

Comments are welcome in each issue/post.

[Slack: Meeting end]

#openverse-weekly-community-meeting

Handling very large (>2GB) files

As we gear up to tackle the Provider DAG Stability milestone, @stacimc and I were looking over the existing issues and came across this one: Filesize exceeds postgres integer column maximum size. The specific details can be seen in the linked issue, but the summary is that we were trying to ingest a Wikimedia record which referenced an audio clip that was over 8 hours in length. The filesize for this record exceeded the maximum value size allowed for a Postgres integer column and broke the ingestion.

After discussing this, we came up with a few options:

  1. Reject any records that exceed this 2GB size limit at the MediaStore level. It seems unlikely that users would want audio records this large, especially considering that we don’t make any distinction on length beyond “> 10 minutes” in the search filters.
  2. Set values greater than this column maximum to NULL. Records that exceed that size will not have filesize information stored, but all other information will be available.
  3. Alter the column to use a Postgres bigint type. This will require migrations on both the catalog and the API, and could be extremely cumbersome.

Of the 3 options above, Staci and I thought that the first would be most appropriate and easiest to execute. It also wouldn’t preclude us from accepting larger file sizes in the future, should we wish to take a different approach to including them in the catalog.
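
To make the first two options concrete, here is a minimal sketch of the check. The function name, the exception class, and the policy argument are hypothetical and do not reflect the actual MediaStore code. The one fixed fact is that a Postgres integer column holds values up to 2,147,483,647 (2^31 - 1), which is just under 2 GiB when read as a byte count.

```python
from typing import Optional

# Largest value a 4-byte signed Postgres integer column can hold.
POSTGRES_INT_MAX = 2**31 - 1  # 2147483647


class FilesizeTooLarge(ValueError):
    """Raised when a record should be rejected outright (option 1)."""


def check_filesize(filesize: Optional[int], policy: str = "reject") -> Optional[int]:
    """Apply option 1 ("reject") or option 2 ("nullify") to a filesize in bytes.

    Hypothetical helper: the name and the policy values are assumptions.
    """
    if filesize is None or filesize <= POSTGRES_INT_MAX:
        return filesize  # fits in an integer column, nothing to do
    if policy == "reject":
        raise FilesizeTooLarge(f"filesize {filesize} exceeds {POSTGRES_INT_MAX}")
    if policy == "nullify":
        return None  # keep the record but drop the filesize value
    raise ValueError(f"unknown policy: {policy!r}")
```

With the "reject" policy the caller decides what to do with the exception (for example, skip the record and log it), which keeps the door open for accepting larger files later without changing the check itself.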

What do folks think, does that seem like a suitable next step? Are there other alternatives we haven’t considered that might be worth pursuing?

#data-cleaning, #postgres

Storing provider supplied thumbnails

Note: This is adapted from a conversation that happened on the Make WP Slack (see the link for more context).

Our current situation regarding thumbnails

Thumbnails were collected by providers and put in the thumbnail column in the catalog. We have data there now, and the column has not been dropped. Since this column also exists in the API, the data from it (for both audio and image) gets copied into the thumbnail column there as well. However, only the Audio model uses the thumbnail column; the Image model ignores it and tries to create a scaled-down thumbnail using the primary image URL. The MediaStore class in the provider has been modified so that no new thumbnail URLs can be added to the thumbnail column, but they can be added in the metadata under meta_data["thumbnail_url"].

We have run into scenarios recently where our own attempts to create thumbnails from the full-size image URLs have caused consistent timeouts and prevented the records from being shown in search results. SMK is a recent example of this: (1) (2). It may be advantageous to use the provider’s pre-computed thumbnails in these cases. Crucially, not all providers supply or are expected to supply thumbnails. As such, it seems like we may want to reduce the width of our tables by removing this frequently-empty column.

Proposal

For image: we first go through and copy all of the existing thumbnail values in the catalog into meta_data["thumbnail_url"]. Then we perform a new-table-without-column-swap and remove the thumbnail column from the table. This will help conserve space and reduce the width of the table. Then we update the image_view materialized view to populate a thumbnail column for the API using the meta_data dictionary. An example of this is in the Audio view’s audio_set_foreign_identifier. This will allow us to keep the thumbnail column in the API but reduce space on the catalog, since a migration of that size on the API would likely be untenable at the moment. We’ll then make two changes to the logic on both ends:

  1. Add thumbnail_url handling logic in the ImageStore which will put that value in the meta_data dictionary automatically
  2. Change the Image model’s logic on the API to first try and use image.thumbnail (which has been copied from meta_data["thumbnail_url"] during the data refresh), then try image.url when creating the scaled down image
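
The two changes could look roughly like the following. This is a simplified sketch using plain dictionaries and strings; both function names are hypothetical, and the real ImageStore and Image model are more involved than this.

```python
from typing import Optional


def add_thumbnail_to_meta_data(record: dict) -> dict:
    """Catalog side: move a provider-supplied thumbnail_url into meta_data.

    Hypothetical stand-in for the handling the ImageStore would do.
    """
    record = dict(record)  # shallow copy so the caller's dict is not mutated
    thumbnail_url = record.pop("thumbnail_url", None)
    meta_data = dict(record.get("meta_data") or {})
    if thumbnail_url:
        meta_data["thumbnail_url"] = thumbnail_url
    record["meta_data"] = meta_data
    return record


def thumbnail_source(thumbnail: Optional[str], url: str) -> str:
    """API side: prefer the provider thumbnail, fall back to the full image URL.

    The returned URL is the one that would be used to create the
    scaled-down image.
    """
    return thumbnail or url
```

Because the fallback treats an empty or missing thumbnail the same way, records from providers that never supply thumbnails keep working exactly as they do today.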

We could potentially do the same for Audio; it looks like about half of the records do not have thumbnails:

deploy@localhost:openledger> select count(*) from audio where thumbnail is null;
+--------+
| count  |
|--------|
| 427124 |
+--------+
SELECT 1
Time: 4.354s (4 seconds), executed in: 4.328s (4 seconds)
deploy@localhost:openledger> select count(*) from audio;
+--------+
| count  |
|--------|
| 949384 |
+--------+
SELECT 1
Time: 3.899s (3 seconds), executed in: 3.887s (3 seconds)
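
For reference, the share of audio records without a thumbnail follows directly from the two counts above:

```python
# Counts copied from the two queries above.
audio_without_thumbnail = 427124
audio_total = 949384

# Fraction of audio records whose thumbnail column is NULL.
share_missing = audio_without_thumbnail / audio_total

if __name__ == "__main__":
    print(f"{share_missing:.1%} of audio records have no thumbnail")
```

This comes out to roughly 45%, slightly under half.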

What do folks think about this approach? I think any effort to reduce the width of our DB tables is an important one, as it will make migrations & data management easier down the line.

#data-normalization, #postgres, #thumbnails

Community Meeting Recap (18 October 2022)

[Slack: Meeting start]

🎉 Done!

  • Retirement of legacy TSV loading workflow [PR] [Slack]
  • Community PR to fix the button height [Issue] [Slack]
  • Community improvements to the CONTRIBUTING.md file [Issue] [Slack]
  • Improvements to the Recent Searches feature [Issue#1] [Issue#2] [Slack]
  • UI/UX Improvements to the header [PR#1] [PR#2] [Slack]

👀 Needs review

wordpress.org (Openverse theme)

Frontend

Catalog

🚧 In progress/To Do

📒 Agenda

  • Thumbnails in the catalog and API [Slack: Start] [Slack: End]
  • WordPress.org theme redirect (covered in 👀 Needs review above)

[Slack: Meeting end]

#openverse-weekly-community-meeting