Preparing the next migration of the Catalog

As we have fixed and reactivated data ingestion via more provider scripts, some shortcomings in the data model have been uncovered. Today I went through the backlog of the Catalog's issues and noticed we have accumulated a small pile of problems that seem to need addressing primarily at the database level.

For some of these, the discussion has already begun in other posts, but I wanted to gather everything I have noticed in one place, so we can discuss whether to do all the changes together, whether it is better to split them into more manageable chunks, or whether there are alternative solutions. These are disruptive, breaking changes, given that they alter the structure of the TSV files (a key part of the ingestion process), and they need to be implemented carefully; that is why they are usually postponed until it is confirmed that they are necessary.
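
On the question of chunks versus alternative solutions: for the type changes specifically, one pattern worth weighing against an in-place ALTER COLUMN ... TYPE (which rewrites the whole table under an exclusive lock) is add-backfill-swap. The following is only a rough sketch, using filesize from the list below as the example and a hypothetical identifier key for batching; it is not a decided approach:

    -- Step 1: add the wide column next to the old one. Adding a nullable
    -- column without a default is a metadata-only change in PostgreSQL.
    ALTER TABLE image ADD COLUMN filesize_big bigint;

    -- Step 2: backfill in batches to keep locks short; repeat until the
    -- UPDATE reports zero affected rows. "identifier" is a hypothetical
    -- key used here only for batching.
    UPDATE image
    SET filesize_big = filesize
    WHERE identifier IN (
        SELECT identifier
        FROM image
        WHERE filesize_big IS NULL
          AND filesize IS NOT NULL
        LIMIT 10000
    );

    -- Step 3: once the backfill is complete, swap the columns in one
    -- short transaction.
    BEGIN;
    ALTER TABLE image DROP COLUMN filesize;
    ALTER TABLE image RENAME COLUMN filesize_big TO filesize;
    COMMIT;

The trade-off is that the ingestion code would have to tolerate both columns while the backfill runs, so whether this is actually simpler than a one-shot migration is part of what we should discuss.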

Columns to modify or add

  • filesize: change type to bigint
  • duration: change type to bigint
  • mature: add a boolean column
  • description: add a text or varchar(n) column
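
To make the scope concrete, here is a minimal sketch of the DDL, assuming for illustration that all four columns land on the image table (duration in particular may belong to a different table once more media types arrive, and the default for mature and the choice between text and varchar(n) are open questions, not decisions):

    -- Widen the numeric columns. In PostgreSQL, changing integer -> bigint
    -- rewrites the whole table under an exclusive lock, which is part of
    -- what makes these migrations disruptive on large tables.
    ALTER TABLE image
        ALTER COLUMN filesize TYPE bigint,
        ALTER COLUMN duration TYPE bigint;

    -- Add the new columns. The NOT NULL DEFAULT and the varchar length
    -- are placeholders for discussion.
    ALTER TABLE image
        ADD COLUMN mature boolean NOT NULL DEFAULT false,
        ADD COLUMN description varchar(2000);

Since PostgreSQL 11, adding a column with a constant default no longer rewrites the table, so the second statement is cheap; the type changes are the expensive part and drive most of the disruption mentioned above.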

Issues

The need for these changes comes from the following list of issues, which could be grouped into a single milestone:

Also, consider the following:

Previous related work

We have added columns in the past (see the PR that included the category column in the image table), and we plan to add more media types next year, so this could be an opportunity to refine the process and make it simpler.

Finally, this is a call to start thinking about the problem. There is no hurry at the moment, but the need is becoming more and more evident. What do you think? Do you see it as viable? When could we start this endeavor?

#data-normalization, #postgres