Catalog Performance Testing Results

On May 22, several Openverse maintainers met to investigate performance improvements to the Catalog database. We created three test instances of the Catalog DB from restored snapshots of production data:

  • baseline: A restored production snapshot with no changes applied
  • provider-index: A snapshot with a b-tree index added on the provider column for both the audio and image tables
  • provider-partition: A snapshot with the image and audio tables partitioned on the provider column

We then ran a series of SQL queries against all three environments in order to compare their relative performance. The queries selected were representative of operations that are routinely performed on the catalog. The exact queries used to modify the provider-index and provider-partition environments, as well as the queries used in testing, are documented below.

Setting up the test environments

provider-index

To create the provider-index test DB, we restored a production snapshot and ran the following queries to add a b-tree index on provider for both media tables:

CREATE INDEX audio_provider ON public.audio USING btree (provider);
CREATE INDEX image_provider ON public.image USING btree (provider);

provider-partition

To create the provider-partition test DB, we restored a production snapshot and ran the following queries to create new audio_part and image_part tables, which are partitioned on the provider column. For brevity, only the queries for the audio table are shown:

CREATE TABLE audio_part (
    identifier uuid DEFAULT public.uuid_generate_v4(),
    created_on timestamp with time zone NOT NULL,
    updated_on timestamp with time zone NOT NULL,
    ingestion_type character varying(80),
    provider character varying(80),
    source character varying(80),
    foreign_identifier character varying(3000),
    foreign_landing_url character varying(1000),
    url character varying(3000) NOT NULL,
    thumbnail character varying(3000),
    filetype character varying(5),
    duration integer,
    bit_rate integer,
    sample_rate integer,
    category character varying(80),
    genres character varying(80)[],
    audio_set jsonb,
    set_position integer,
    alt_files jsonb,
    filesize integer,
    license character varying(50) NOT NULL,
    license_version character varying(25),
    creator character varying(2000),
    creator_url character varying(2000),
    title character varying(5000),
    meta_data jsonb,
    tags jsonb,
    watermarked boolean,
    last_synced_with_source timestamp with time zone,
    removed_from_source boolean NOT NULL,
    standardized_popularity double precision,
    PRIMARY KEY (identifier, provider)
) PARTITION BY LIST (provider);

-- Add indices
CREATE UNIQUE INDEX audio_part_provider_fid_idx
    ON audio_part
        USING btree (provider, md5(foreign_identifier));
CREATE UNIQUE INDEX audio_part_url_key
    ON audio_part
    USING btree (url, provider);

-- Create partitions
CREATE TABLE jamendo PARTITION OF audio_part FOR VALUES IN ('jamendo');
CREATE TABLE freesound PARTITION OF audio_part FOR VALUES IN ('freesound');
CREATE TABLE wikimedia_audio PARTITION OF audio_part FOR VALUES IN ('wikimedia_audio');

-- Insert all records from the relevant media table
INSERT INTO audio_part SELECT * FROM audio;

CREATE INDEX audio_part_partition ON audio_part (provider);
SET constraint_exclusion=on;

The partitioned table matches the schema of the existing media tables as closely as possible. There are a few necessary changes worth noting:

  • In order to partition by provider, provider was added to the primary key (previously just identifier)
  • The unique indices all need to include the partition key, which meant that we had to add a unique constraint on (url, provider) rather than just (url), as we have in the current Catalog media tables and in the API. This means that with this setup, it is theoretically possible to have a record which passes the constraints of the Catalog but not those of the API.

Tests in the provider-partition environment were run against the image_part and audio_part tables.
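
As a quick sanity check of the partitioned setup (not one of the timed tests), partition pruning can be verified with EXPLAIN: a query filtered on provider should scan only the matching partition. A minimal example, assuming the partitions created above:

-- Expectation: the plan should touch only the freesound partition,
-- not the jamendo or wikimedia_audio partitions
EXPLAIN SELECT COUNT(*) FROM audio_part WHERE provider = 'freesound';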

Test queries

Test queries were selected for the following common operations:

  • SELECT and UPDATE by provider
  • Large upserts (as performed during ingestion by provider DAGs)
  • Batch updates (for data migrations and popularity refreshes; see https://github.com/WordPress/openverse/pull/2331)

SELECT Met

Select all records for Metropolitan Museum (~370k rows)

SELECT COUNT(*) FROM image WHERE provider = 'met';

SELECT Wikimedia Audio

Select all records for Wikimedia Audio (~1 million rows)

SELECT COUNT(*) FROM audio WHERE provider = 'wikimedia_audio';

UPDATE thumbnails

Set thumbnail to null for all Wikimedia Audio records.

UPDATE audio
SET thumbnail = null
WHERE provider = 'wikimedia_audio';

UPDATE duration

Set duration to 0 for all Wikimedia Audio records.

UPDATE audio
SET duration = 0
WHERE provider = 'wikimedia_audio';

UPDATE standardized popularity

Calculate standardized popularity for all Wikimedia Audio records.

UPDATE audio
SET standardized_popularity = standardized_audio_popularity(audio.provider, audio.meta_data)
WHERE provider = 'wikimedia_audio';

UPSERT Wikimedia Audio

For upserts, we manually ran all of the queries generated during the load_data steps for some production DagRuns. This includes creating a loading table, loading the data from a production TSV, cleaning the data, and finally upserting it into the media table. Only the performance of the actual upsert step is compared in this study, but all of the queries can be found by looking at the logs of the relevant DagRun. If you do not have access to production Airflow, you can run a DAG locally to see all of the queries.
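
For context, below is a greatly simplified sketch of the shape of the final upsert step. This is not the exact generated query (the real one lists every column and reads from a per-DagRun loading table; loading_table is just a placeholder name here), but it shows the ON CONFLICT clause supported by the unique index on (provider, md5(foreign_identifier)):

-- Simplified sketch only; the real query is generated by the DAG and includes all columns.
-- 'loading_table' is a placeholder for the per-DagRun loading table name.
INSERT INTO audio
    (created_on, updated_on, provider, foreign_identifier, url, license, removed_from_source)
SELECT now(), now(), provider, foreign_identifier, url, license, false
FROM loading_table
ON CONFLICT (provider, md5(foreign_identifier))
DO UPDATE SET
    updated_on = now(),
    url = EXCLUDED.url;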

For Wikimedia Audio we used queries from the 2023-05-04 DagRun, loading 804 records.

UPSERT Freesound

Queries copied from the 2023-03-10 DagRun, loading 152,500 records.

UPSERT Europeana

Queries copied from the 2023-05-22 DagRun, loading 23,724 records.

CREATE Wikimedia Audio temp table

Create a table for all Wikimedia Audio records, as would be used in the first step of a batched update. (~370k rows)

CREATE TABLE rows_to_update_wiki_audio AS
    SELECT ROW_NUMBER() over() row_id, identifier
    FROM audio
    WHERE provider = 'wikimedia_audio' AND updated_on < NOW();

CREATE Met temp table

Create a table for all Metropolitan records, as would be used in the first step of a batched update. (~1 million rows)

CREATE TABLE rows_to_update_met AS
    SELECT ROW_NUMBER() over() row_id, identifier
    FROM image
    WHERE provider = 'met' AND updated_on < NOW();

CREATE Flickr temp table

Create a table for all Flickr records, as would be used in the first step of a batched update. (~500 million rows)

CREATE TABLE rows_to_update_flickr AS
    SELECT ROW_NUMBER() over() row_id, identifier
    FROM image
    WHERE provider = 'flickr' AND updated_on < NOW();

Flickr Large Batch Update

Update a batch of 10,000 Flickr records.

UPDATE image
    SET standardized_popularity = standardized_image_popularity(provider, meta_data)
    WHERE identifier in (
        SELECT identifier FROM rows_to_update_flickr
        WHERE row_id >= 300000000 AND row_id < 300010000
        FOR UPDATE SKIP LOCKED
    );

Flickr Small Batch Update

Update a batch of 1,000 Flickr records.

UPDATE image
    SET standardized_popularity = standardized_image_popularity(provider, meta_data)
    WHERE identifier in (
        SELECT identifier FROM rows_to_update_flickr
        WHERE row_id >= 1000000 AND row_id < 1001000
        FOR UPDATE SKIP LOCKED
    );

Results

Test                              | baseline         | provider-index | provider-partition
SELECT Met                        | 11 min 15 sec    | 5.598 sec      | 6.778 sec
SELECT Wikimedia Audio            | 2 min 32 sec     | 0.510 sec      | 0.206 sec
UPDATE thumbnails                 | 4 min 54 sec     | 1 min 14 sec   | 1 min 7 sec
UPDATE duration                   | 1 min 12 sec     | 1 min 9 sec    | 1 min 16 sec
UPDATE standardized popularity    | 1 min 18 sec     | 1 min 21 sec   | 1 min 18 sec
UPSERT Wikimedia Audio            | 33.751 sec       | 1.532 sec      | 1 min 2 sec
UPSERT Freesound                  | 52 sec           | 35 sec         | 28 sec
UPSERT Europeana                  | 44 min 22 sec    | 30 min 13 sec  | 6 min 51 sec
CREATE Wikimedia Audio temp table | 2 sec            | 1 sec          | 40 sec
CREATE Met temp table             | 6 min 40 sec     | 7 sec          | 24 sec
CREATE Flickr temp table          | 18 hr            | 40 min 13 sec  | 9 hr 35 min 20 sec
Flickr large batch UPDATE         | 1 hr 5 min 9 sec | 50 min 10 sec  | 2 hr 42 min 30 sec
Flickr small batch UPDATE         | 5 min 1 sec      | 2 min 12 sec   | 31 min 7 sec

Inconsistencies Noted During Testing

When testing ingestion UPSERTs, we initially compared the performance of the query on provider-index and provider-partition to the runtime of the upsert_data task in the production DagRun from which the test was sourced. In each case, this production upsert took less than a minute to complete, while the same query took close to an hour in our test environments. Alarmingly, this seemed to indicate that the index and partition both massively degraded the performance of UPSERTs relative to baseline.

However, when we ran the queries on the baseline testing environment — which was theoretically identical to the production environment in which the DAGs ran — we found that the runtime was considerably longer (in fact, even longer than the runtime on provider-index and provider-partition). The timings given in the chart above are for these tests run on baseline. I also tested running the queries manually on the actual production database, and found they ran in less than a minute, just as they had during the DagRun!

This leaves us with a major inconsistency: the performance of queries on baseline, a restored snapshot of production with no additional changes, is clearly much worse than the performance on production, at least in the case of ingestion upserts. We concluded that the effect of routine operations performed on production (daily ingestion, for example) creates too many confounding variables. One theory, considered only after the tests had already been performed, was that the optimizer in the test environments may not have been able to plan queries optimally due to the lack of statistics in newly restored databases; per the docs:

Once restored, it is wise to run ANALYZE on each database so the optimizer has useful statistics. An easy way to do this is to run vacuumdb -a -z to VACUUM ANALYZE all databases; this is equivalent to running VACUUM ANALYZE manually.
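
For reference, refreshing statistics on the restored snapshots, which we did not do before these tests, would look something like the following:

-- Refresh planner statistics (and clean up dead tuples) on the restored media tables
VACUUM (ANALYZE) audio;
VACUUM (ANALYZE) image;
-- Or, from the shell, for every database at once: vacuumdb -a -z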

Consequently, in our analysis we only compare the relative performance between our three test environments.

This is important to consider when looking at the data for batched updates in particular. A batched update on provider-index ran about 23% faster than the same update on baseline; however, it still took over 50 minutes to update just 10,000 records. Extrapolated out to the roughly 50,000 batches needed, a full update of the Flickr provider would take nearly five years. This is obviously untenable.

However, upserts on baseline took about 60 times as long as the identical upsert on production. If the same ratio holds true for batch updates, and they take about 50 seconds when run on production, the full Flickr update could be run in about 28 days, which is comfortably within our desired timeframe for a popularity refresh.
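
For transparency, the back-of-the-envelope arithmetic behind those two estimates, assuming ~500 million Flickr rows split into 10,000-row batches:

-- ~50,000 batches at the measured (provider-index) and hoped-for (production) per-batch times
SELECT (500000000 / 10000) * interval '50 minutes 10 seconds' AS provider_index_estimate, -- ≈ 4.8 years
       (500000000 / 10000) * interval '50 seconds'            AS production_estimate;     -- ≈ 28.9 days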

As a final note, the order in which queries were performed dramatically affected performance. For example, a SELECT on wikimedia_audio would run much more quickly after an UPDATE had been performed on the same rows. We tried to run the queries in as close to the same order as possible across the three environments, but because the test plan was evolving while we worked, there may be some inconsistencies (for example, the surprisingly quick performance of baseline in the Wikimedia Audio temp table test).

Analysis

It is useful to first examine our priorities in the Catalog, with respect to the types of queries that were analyzed in this work. By far our most commonly performed operations are (or will be) ingestion upserts and batched updates.

Upserts during ingestion are not currently a bottleneck. The pull_data step, which pulls data from external provider APIs, accounts for by far the greater part of ingestion runtime, even in reingestion DAGs. It is therefore not critical to improve the performance of the upsert step, as long as we do not worsen it.

Batched updates are a greater concern, as large updates have historically taken considerable time (timing out after weeks in some cases). A batched update, as currently implemented, has two steps: the creation of the temp table, which is done once, and the actual UPDATE, which is run potentially many times, once per batch. Improving the performance of the UPDATE therefore has a significantly greater impact on the performance of a batched update overall.

With that context, we can look at the two proposals under test:

provider-partition

The provider partition had surprisingly mixed results. In the Wikimedia Audio upsert test, it had the worst performance of all three environments by a considerable margin, but it was easily the most performant for the Freesound and Europeana upserts. It’s possible the order in which the queries were run was a factor.

While the partition environment was much faster than baseline at creating the Met and Flickr temp tables for the batched update, the actual updates took considerably longer than baseline.

provider-index

The provider index offered improved performance across the board, with one exception: the UPDATE standardized popularity test, in which all three environments completed within three seconds of one another. Note that these comparisons are against baseline and not production, as discussed in the Inconsistencies section.

In particular, provider-index consistently performed better during the UPDATE step of a batched update.

Conclusions

Because the batched UPDATE step is the operation most critically in need of performance improvement, I think we should not pursue partitioning the media tables by provider, and should instead consider adding the provider index. That said, as noted throughout this post, the many inconsistencies found during testing and the small sample size make this recommendation compelling, but far from conclusive.

I propose the following next steps:

  • First, try running a batched update of wikimedia_audio, our largest audio provider, in production. This will allow us to verify the theory that batched updates, much like upserts, will be faster in production than in our tests.
  • Add the provider index to the audio table in production. Because audio is considerably smaller than image, this is a quick operation (a sketch of the command is given after this list).
  • Re-run the batched update and compare the performance. If the results are promising, consider repeating the process with the image table. If not, the index can be safely dropped.
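
Assuming we go ahead, the index can most likely be added, and later dropped if needed, without blocking ingestion writes. A sketch, using the same index definition as in the test environment:

-- Build the index without taking a write lock on the audio table
CREATE INDEX CONCURRENTLY audio_provider ON public.audio USING btree (provider);

-- If the results are not promising, it can be removed just as safely
DROP INDEX CONCURRENTLY audio_provider;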

Note that this work only investigated two of the proposals put forth at the recent Catalog meet-up. There are many others we would like to test in the future, including pulling license information into its own table, and reducing the size of large text columns.

If you have read this far, thank you very much! I’d love to hear your thoughts.

#catalog, #postgres

Catalog Postgres Exploration Effort

On May 22nd at 1500 UTC, several Openverse members will be gathering for a meeting to discuss and engage in some exploratory operations on a copy of the catalog database, to get a better sense of how Postgres performs given our current scale & purpose. This was spurred on by this discussion on the decoupling popularity calculations IP; I personally had a feeling regarding how certain Postgres changes would affect our query time, performance, etc., but based on the level of unpredictable behavior we’ve seen with the matview refresh, few of us feel confident about being able to predict outcomes with Postgres. We thought it might be useful to spend some time performing some exploratory operations & optimizations with Postgres!

Some things I think might be useful to try:

  • Reducing the width of the core media table
  • Performing some normalization steps (e.g. pulling provider out into a separate table)
  • Adding an index on the provider column

Some ways that we could assess the performance changes of the above:

  • Have a sample set of data we use for upsert tests, one small (~1k rows) and another large (~100k rows)
  • Develop some sample updates we might perform (e.g. the thumbnail column update we tried to perform for Flickr, which was projected to take an extremely long time; a sketch of that update follows this list)
  • A matview refresh or recreation
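
For example, the Flickr thumbnail update mentioned above boils down to a single provider-scoped UPDATE; the difficulty is its runtime at Flickr’s scale, not its complexity. A naive, unbatched sketch:

-- Naive form of the Flickr thumbnail update; in practice this would need to be batched
UPDATE image
SET thumbnail = null
WHERE provider = 'flickr';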

Our plan will be to create a new catalog DB instance based on a snapshot, and perform all of the above against that instance. @olgabulat and @stacimc have confirmed they will join the meeting, and Staci and I will continue to work on things together throughout the day while we’re co-located.

If folks have any suggestions for other operations or queries for us to test, we’d love to hear them! Hopefully we can come away from this with a better sense of “what will happen” when we change the catalog schema in certain ways 😁

#catalog #postgres

Preparing the next migration of the Catalog

As we have fixed and reactivated data ingestion via more provider scripts, some shortcomings in the data model have been uncovered. Today I went through the backlog of the Catalog’s issues and noticed we have accumulated a small pile of problems that seem to need to be addressed primarily at the database level.

For some, the discussion has already begun in other posts, but I wanted to gather everything I have noticed in one place, so we can discuss whether to make all of the changes together, break them into more manageable chunks, or find alternative solutions. These are disruptive, breaking changes, given that they alter the structure of the TSV files (a key part of the ingestion process), and they need to be implemented carefully, so they are usually postponed until it is confirmed that they are necessary.

Columns to modify or add

  • filesize : change type to bigint
  • duration : change type to bigint
  • mature : add a boolean column
  • description : add a text or varchar(n) column (a minimal DDL sketch follows this list)
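
For concreteness, the simplest possible shape of the catalog-side DDL is sketched below. This ignores the corresponding TSV and API changes, the exact tables each change applies to are still to be decided, and in practice the type changes may need to happen via a table swap rather than in-place ALTERs:

-- Sketch only: changing a column's type rewrites the table, so these may instead be
-- done via a new-table swap; the TSV layout and API schema also need matching changes.
ALTER TABLE audio ALTER COLUMN filesize TYPE bigint;
ALTER TABLE audio ALTER COLUMN duration TYPE bigint;
ALTER TABLE image ADD COLUMN mature boolean;
ALTER TABLE image ADD COLUMN description text;  -- or character varying(n), per the open question above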

Issues

The need for these changes comes from this list of issues that could be concentrated into a milestone:

Also, consider the following:

Previous related work

We have added columns in the past (see the PR for including the category column in the image table), and we plan to add more media types next year, so this could serve as an opportunity to refine the process and make it simpler.

Finally, this is a call to start thinking about these changes. There is no hurry currently, but the need is becoming more and more evident. What do you think? Do you see it as viable? When could we start this endeavor?

#data-normalization, #postgres

Handling very large (>2GB) files

As we gear up to tackle the Provider DAG Stability milestone, @stacimc and I were looking over the existing issues and came across this one: Filesize exceeds postgres integer column maximum size. The specific details can be seen in the linked issue, but the summary is that we were trying to ingest a Wikimedia record which referenced an audio clip over 8 hours in length. The filesize for this record exceeded the maximum value allowed for a Postgres integer column and broke the ingestion.

After discussing this, we came up with a few options:

  1. Reject any records that exceed this 2GB size limit at the MediaStore level. It seems unlikely that users would want audio records this large, especially considering that we don’t make any distinction on length beyond “> 10 minutes” in the search filters.
  2. Set values greater than this column maximum to NULL. Records that exceed that size will not have filesize information stored, but all other information will be available.
  3. Alter the column to use a Postgres bigint type. This will require migrations on both the catalog and the API, and could be extremely cumbersome.

Of the 3 options above, Staci and I thought that the first would be most appropriate and easiest to execute. It also wouldn’t preclude us from accepting larger file sizes in the future, should we wish to take a different approach to including them in the catalog.

What do folks think, does that seem like a suitable next step? Are there other alternatives we haven’t considered that might be worth pursuing?

#data-cleaning, #postgres

Storing provider supplied thumbnails

Note: This is adapted from a conversation that happened on the Make WP Slack (see the link for more context).

Our current situation regarding thumbnails

Thumbnails were collected from providers and put in the thumbnail column in the catalog. We have data there now, and the column has not been dropped. Since this column also exists in the API, the data from it (for both audio and image) gets copied into the thumbnail column there as well. However, only the Audio model uses the thumbnail column; the Image model ignores it and tries to create a scaled down thumbnail using the primary image URL. The MediaStore class in the provider scripts has been modified so that no new thumbnail URLs can be added to the thumbnail column, but they can be added in the metadata under meta_data["thumbnail_url"].

We have run into scenarios recently where our own attempts to create thumbnails from the full-size image URLs have caused consistent timeouts and prevented the records from being shown in search results. SMK is a recent example of this: (1) (2). It may be advantageous to use the provider’s pre-computed thumbnails in these cases. Crucially, not all providers supply, or are expected to supply, thumbnails. As such, it seems like we may want to reduce the width of our tables by removing this frequently empty column.

Proposal

For image: we first go through and copy all of the existing thumbnail values in the catalog into meta_data["thumbnail_url"] (a rough sketch of this copy step is given after the list below). Then we perform a new-table-without-column-swap and remove the thumbnail column from the table. This will help conserve space and reduce the width of the table. Then we update the image_view materialized view to populate a thumbnail column for the API using the meta_data dictionary. An example of this is in the Audio view’s audio_set_foreign_identifier. This will allow us to keep the thumbnail column in the API but reduce space on the catalog, since a migration of that size on the API would likely be untenable at the moment. We’ll then make two changes to the logic on both ends:

  1. Add thumbnail_url handling logic in the ImageStore which will put that value in the meta_data dictionary automatically
  2. Change the Image model’s logic on the API to first try and use image.thumbnail (which has been copied from meta_data["thumbnail_url"] during the data refresh), then try image.url when creating the scaled down image
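
A rough sketch of the copy step mentioned above, assuming we only touch rows that actually have a thumbnail value (at this scale the real run would likely need to be batched):

-- Copy existing catalog thumbnails into meta_data['thumbnail_url'] before dropping the column
UPDATE image
SET meta_data = jsonb_set(coalesce(meta_data, '{}'::jsonb), '{thumbnail_url}', to_jsonb(thumbnail))
WHERE thumbnail IS NOT NULL;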

We could potentially do the same for Audio; it looks like about half of the records do not have thumbnails:

deploy@localhost:openledger> select count(*) from audio where thumbnail is null;
+--------+
| count  |
|--------|
| 427124 |
+--------+
SELECT 1
Time: 4.354s (4 seconds), executed in: 4.328s (4 seconds)
deploy@localhost:openledger> select count(*) from audio;
+--------+
| count  |
|--------|
| 949384 |
+--------+
SELECT 1
Time: 3.899s (3 seconds), executed in: 3.887s (3 seconds)

What do folks think about this approach? I think any effort to reduce the width of our DB tables is an important one, as it will make migrations & data management easier down the line.

#data-normalization, #postgres, #thumbnails