Preparing the next migration of the Catalog

As we have fixed and reactivated data ingestion via more provider scripts, some shortcomings in the data model have been uncovered. Today I went through the backlog of the Catalog's issues and noticed we have accumulated a small pile of problems that seem to need addressing primarily at the database level.

For some of these, the discussion has already begun in other posts, but I wanted to gather everything I have noticed in one place, so we can discuss whether to do all the changes together, whether it is better to split them into more manageable chunks, or whether there are alternative solutions. These are disruptive, breaking changes, given that they alter the structure of the TSV files (a key part of the ingestion process), and they need to be implemented carefully; that is why they are usually postponed until it is confirmed that they are necessary.
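
On the question of chunks versus alternative solutions: for the type changes specifically, one pattern worth weighing against an in-place ALTER COLUMN ... TYPE (which rewrites the whole table under an exclusive lock) is add-backfill-swap. The following is only a rough sketch, using filesize from the list below as the example and a hypothetical identifier key for batching; it is not a decided approach:

    -- Step 1: add the wide column next to the old one. Adding a nullable
    -- column without a default is a metadata-only change in PostgreSQL.
    ALTER TABLE image ADD COLUMN filesize_big bigint;

    -- Step 2: backfill in batches to keep locks short; repeat until the
    -- UPDATE reports zero affected rows. "identifier" is a hypothetical
    -- key used here only for batching.
    UPDATE image
    SET filesize_big = filesize
    WHERE identifier IN (
        SELECT identifier
        FROM image
        WHERE filesize_big IS NULL
          AND filesize IS NOT NULL
        LIMIT 10000
    );

    -- Step 3: once the backfill is complete, swap the columns in one
    -- short transaction.
    BEGIN;
    ALTER TABLE image DROP COLUMN filesize;
    ALTER TABLE image RENAME COLUMN filesize_big TO filesize;
    COMMIT;

The trade-off is that the ingestion code would have to tolerate both columns while the backfill runs, so whether this is actually simpler than a one-shot migration is part of what we should discuss.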

Columns to modify or add

  • filesize: change type to bigint
  • duration: change type to bigint
  • mature: add a boolean column
  • description: add a text or varchar(n) column
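
To make the scope concrete, here is a minimal sketch of the DDL, assuming for illustration that all four columns land on the image table (duration in particular may belong to a different table once more media types arrive, and the default for mature and the choice between text and varchar(n) are open questions, not decisions):

    -- Widen the numeric columns. In PostgreSQL, changing integer -> bigint
    -- rewrites the whole table under an exclusive lock, which is part of
    -- what makes these migrations disruptive on large tables.
    ALTER TABLE image
        ALTER COLUMN filesize TYPE bigint,
        ALTER COLUMN duration TYPE bigint;

    -- Add the new columns. The NOT NULL DEFAULT and the varchar length
    -- are placeholders for discussion.
    ALTER TABLE image
        ADD COLUMN mature boolean NOT NULL DEFAULT false,
        ADD COLUMN description varchar(2000);

Since PostgreSQL 11, adding a column with a constant default no longer rewrites the table, so the second statement is cheap; the type changes are the expensive part and drive most of the disruption mentioned above.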

Issues

The need for these changes comes from the following list of issues, which could be grouped into a single milestone:

Also, consider the following:

Previous related work

We have added columns in the past (see the PR that included the category column in the image table), and we plan to add more media types next year, so this could be an opportunity to refine the process and make it simpler.

Finally, this is a call to start thinking about the problem. There is no hurry at the moment, but the need is becoming more and more evident. What do you think? Do you see it as viable? When could we start this endeavor?

#data-normalization, #postgres