Catalog Postgres Exploration Effort

On May 22nd at 1500 UTC several OpenverseOpenverse Openverse is a search engine for openly-licensed media, including images and audio. Find Openverse on GitHub and at https://openverse.org. members will be gathering for a meeting to discuss and engage in some exploratory operations on a copy of the catalog database to get a better sense of how Postgres performs given our current scale & purpose. This was spurred on by this discussion on the decoupling popularity calculations IP; I personally had a feeling regarding how certain Postgres changes would affect our query time, performance, etc., but based on the level of unpredictable behavior we’ve seen with the matview refresh, few of us feel confident about being able to predict outcomes with Postgres. We thought it might be useful to spend some time performing some exploratory operations & optimizations with Postgres!

Some things I think might be useful to try:

  • Reducing the width of the coreCore Core is the set of software required to run WordPress. The Core Development Team builds WordPress. media table
  • Performing some normalization steps (e.g. pulling provider out into a separate table)
  • Adding an index on the provider column

Some ways that we could assess the performance changes of the above:

  • Have a sample set of data we use for upsert tests, one small (~1k rows) and another large (~100k rows)
  • Develop some sample updates we might perform (e.g. the thumbnail column update we tried to perform for Flickr which was projected to take an extremely long time)
  • A matview refresh or recreation

Our plan will be to create a new catalog DB instance based on a snapshot, and perform all the above against that instance. @olgabulat and @stacimc have confirmed to join the meeting, and Staci and I will continue to work on things together throughout the day while we’re co-located.

If folks have any suggestions for other operations or queries for us to test, we’d love to hear them! Hopefully we can come away from this with a better sense of “what will happen” when we change the catalog schema in certain ways 😁

#catalog #postgres