Catalog Performance Testing Results

On May 22, several Openverse maintainers met to investigate performance improvements to the Catalog database. We created three test instances of the Catalog DB from restored snapshots of production data:

  • baseline: A restored production snapshot with no changes applied
  • provider-index: A snapshot with a b-tree index added on the provider column for both the audio and image tables
  • provider-partition: A snapshot with the image and audio tables partitioned on the provider column

We then ran a series of SQL queries against all three environments in order to compare their relative performance. The queries selected were representative of operations that are routinely performed on the catalog. The exact queries used to modify the provider-index and provider-partition environments, as well as the queries used in testing, are documented below.

Setting up the test environments

provider-index

To create the provider-index test DB, we restored a production snapshot and ran the following queries to add a b-tree index on provider for both media tables:

CREATE INDEX audio_provider ON public.audio USING btree (provider);
CREATE INDEX image_provider ON image USING btree (provider);
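
As a quick sanity check (not part of the recorded test plan), EXPLAIN can be used to confirm that the planner actually uses the new index for provider filters. The expected plan shape below is an assumption; actual plans depend on table statistics:

-- Hypothetical verification query; plan details will vary with statistics.
EXPLAIN SELECT COUNT(*) FROM image WHERE provider = 'met';
-- Expected: an index or bitmap scan using image_provider rather than a
-- sequential scan over the full image table.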

provider-partition

To create the provider-partition test DB, we restored a production snapshot and ran the following queries to create new audio_part and image_part tables, which are partitioned on the provider column. Only the queries for the audio table are shown here as an example; the image table was handled analogously:

CREATE TABLE audio_part (
    identifier uuid DEFAULT public.uuid_generate_v4(),
    created_on timestamp with time zone NOT NULL,
    updated_on timestamp with time zone NOT NULL,
    ingestion_type character varying(80),
    provider character varying(80),
    source character varying(80),
    foreign_identifier character varying(3000),
    foreign_landing_url character varying(1000),
    url character varying(3000) NOT NULL,
    thumbnail character varying(3000),
    filetype character varying(5),
    duration integer,
    bit_rate integer,
    sample_rate integer,
    category character varying(80),
    genres character varying(80)[],
    audio_set jsonb,
    set_position integer,
    alt_files jsonb,
    filesize integer,
    license character varying(50) NOT NULL,
    license_version character varying(25),
    creator character varying(2000),
    creator_url character varying(2000),
    title character varying(5000),
    meta_data jsonb,
    tags jsonb,
    watermarked boolean,
    last_synced_with_source timestamp with time zone,
    removed_from_source boolean NOT NULL,
    standardized_popularity double precision,
    PRIMARY KEY (identifier, provider)
) PARTITION BY LIST (provider);

-- Add indices
CREATE UNIQUE INDEX audio_part_provider_fid_idx
    ON audio_part
        USING btree (provider, md5(foreign_identifier));
CREATE UNIQUE INDEX audio_part_url_key
    ON audio_part
    USING btree (url, provider);

-- Create partitions
CREATE TABLE jamendo PARTITION OF audio_part FOR VALUES IN ('jamendo');
CREATE TABLE freesound PARTITION OF audio_part FOR VALUES IN ('freesound');
CREATE TABLE wikimedia_audio PARTITION OF audio_part FOR VALUES IN ('wikimedia_audio');

-- Insert all records from the relevant media table
INSERT INTO audio_part SELECT * FROM audio;

CREATE INDEX audio_part_partition ON audio_part (provider);
SET constraint_exclusion=on;
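
After loading, a couple of hedged sanity checks (not something we recorded as part of the test plan) can confirm that the copy is complete and that the planner prunes to a single partition for provider filters:

-- Row counts should match between the original and partitioned tables.
SELECT (SELECT COUNT(*) FROM audio) AS original_rows,
       (SELECT COUNT(*) FROM audio_part) AS partitioned_rows;

-- Only the freesound partition should appear in the resulting plan.
EXPLAIN SELECT COUNT(*) FROM audio_part WHERE provider = 'freesound';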

The partitioned table matches the schema of the existing media tables as closely as possible. There are a few necessary changes worth noting:

  • In order to partition by provider, provider was added to the primary key (previously just identifier)
  • The unique indices all need to include the partition key, which meant that we had to add a unique constraint on (url, provider) rather than just (url) as we have in the current Catalog media tables and in the API. This means that with this setup, it is theoretically possible to have a record which passes the constraints of the Catalog but not the API (see the example after this list).
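
To make that last point concrete, here is a purely illustrative example (hypothetical values, abbreviated column list): both inserts satisfy audio_part's unique index on (url, provider), but the second row would violate a url-only unique constraint like the one enforced in the API.

-- Two records sharing a url but coming from different providers.
INSERT INTO audio_part
    (created_on, updated_on, provider, url, license, removed_from_source)
VALUES
    (NOW(), NOW(), 'jamendo',   'https://example.com/track.mp3', 'by', false),
    (NOW(), NOW(), 'freesound', 'https://example.com/track.mp3', 'by', false);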

Tests in the provider-partition environment were run against the image_part and audio_part tables.

Test queries

Test queries were selected for the following common operations:

  • SELECT and UPDATE by provider
  • Large upserts (as performed during ingestion by provider DAGs)
  • Batch updates (for data migrations and popularity refreshes. See https://github.com/WordPress/openverse/pull/2331)

SELECT Met

Count all records for the Metropolitan Museum (~370k rows).

SELECT COUNT(*) FROM image WHERE provider = 'met';

SELECT Wikimedia Audio

Count all records for Wikimedia Audio (~1 million rows).

SELECT COUNT(*) FROM audio WHERE provider = 'wikimedia_audio';

UPDATE thumbnails

Set thumbnail to null for all Wikimedia Audio records.

UPDATE audio
SET thumbnail = null
WHERE provider = 'wikimedia_audio';

UPDATE duration

Set duration to 0 for all Wikimedia Audio records.

UPDATE audio
SET duration = 0
WHERE provider = 'wikimedia_audio';

UPDATE standardized popularity

Calculate standardized popularity for all Wikimedia Audio records.

UPDATE audio
SET standardized_popularity = standardized_audio_popularity(audio.provider, audio.meta_data)
WHERE provider = 'wikimedia_audio';

UPSERT Wikimedia Audio

For upserts, we manually ran all of the queries generated during the load_data steps for some production DagRuns. This includes creating a loading table, loading the data from a production TSV, cleaning the data, and finally upserting it into the media table. Only the performance of the final upsert step is compared in this study, but all of the queries can be found in the logs of the relevant DagRun. If you do not have access to production Airflow, you can run a DAG locally to see all of the queries.
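
For readers without access to those logs, the upsert step is, in rough outline, an INSERT ... ON CONFLICT DO UPDATE from the loading table into the media table. The sketch below is heavily abbreviated and illustrative; the loading table name and column list are assumptions, and the real query is generated by the DAG and lists every column:

-- Rough shape of the upsert step only (not the exact generated SQL).
INSERT INTO audio
    (identifier, created_on, updated_on, provider, foreign_identifier,
     url, license, removed_from_source, meta_data, tags)
SELECT
    uuid_generate_v4(), NOW(), NOW(), provider, foreign_identifier,
    url, license, false, meta_data, tags
FROM audio_loading_table  -- hypothetical loading table name
ON CONFLICT (provider, md5(foreign_identifier))
DO UPDATE SET
    updated_on = NOW(),
    url        = EXCLUDED.url,
    meta_data  = EXCLUDED.meta_data,
    tags       = EXCLUDED.tags;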

For Wikimedia Audio we used queries from the 2023-05-04 DagRun, loading 804 records.

UPSERT Freesound

Queries copied from the 2023-03-10 DagRun, loading 152,500 records.

UPSERT Europeana

Queries copied from the 2023-05-22 DagRun, loading 23,724 records.

CREATE Wikimedia Audio temp table

Create a table for all Wikimedia Audio records, as would be used in the first step of a batched update. (~1 million rows)

CREATE TABLE rows_to_update_wiki_audio AS
    SELECT ROW_NUMBER() over() row_id, identifier
    FROM audio
    WHERE provider = 'wikimedia_audio' AND updated_on < NOW();

CREATE Met temp table

Create a table for all Metropolitan Museum records, as would be used in the first step of a batched update. (~370k rows)

CREATE TABLE rows_to_update_met AS
    SELECT ROW_NUMBER() over() row_id, identifier
    FROM image
    WHERE provider = 'met' AND updated_on < NOW();

CREATE Flickr temp table

Create a table for all Flickr records, as would be used in the first step of a batched update. (~500 million rows)

CREATE TABLE rows_to_update_flickr AS
    SELECT ROW_NUMBER() over() row_id, identifier
    FROM image
    WHERE provider = 'flickr' AND updated_on < NOW();

Flickr Large Batch Update

Update a batch of 10,000 Flickr records.

UPDATE image
    SET standardized_popularity = standardized_image_popularity(provider, meta_data)
    WHERE identifier in (
        SELECT identifier FROM rows_to_update_flickr
        WHERE row_id >= 300000000 AND row_id < 300010000
        FOR UPDATE SKIP LOCKED
    );

Flickr Small Batch Update

Update a batch of 1,000 Flickr records.

UPDATE image
SET standardized_popularity = standardized_image_popularity(provider, meta_data)
     WHERE identifier in (
         SELECT identifier FROM rows_to_update_flickr
         WHERE row_id >= 1000000 AND row_id < 1001000
         FOR UPDATE SKIP LOCKED
     );

Results

Total runtime of each test query in the three environments:

Test                          baseline           provider-index    provider-partition
SELECT Met                    11 min 15 sec      5.598 sec         6.778 sec
SELECT Wikimedia Audio        2 min 32 sec       0.510 sec         0.206 sec
UPDATE thumbnails             4 min 54 sec       1 min 14 sec      1 min 7 sec
UPDATE duration               1 min 12 sec       1 min 9 sec       1 min 16 sec
UPDATE popularity             1 min 18 sec       1 min 21 sec      1 min 18 sec
UPSERT Wikimedia Audio        33.751 sec         1.532 sec         1 min 2 sec
UPSERT Freesound              52 sec             35 sec            28 sec
UPSERT Europeana              44 min 22 sec      30 min 13 sec     6 min 51 sec
CREATE Wikimedia temp table   2 sec              1 sec             40 sec
CREATE Met temp table         6 min 40 sec       7 sec             24 sec
CREATE Flickr temp table      18 hr              40 min 13 sec     9 hr 35 min 20 sec
Flickr large batch UPDATE     1 hr 5 min 9 sec   50 min 10 sec     2 hr 42 min 30 sec
Flickr small batch UPDATE     5 min 1 sec        2 min 12 sec      31 min 7 sec

Inconsistencies Noted During Testing

When testing ingestion UPSERTs, we initially compared the performance of the query on provider-index and provider-partition to the runtime of the upsert_data task in the production DagRun from which the test was sourced. In each case, this production upsert took less than a minute to complete, while the same query took close to an hour in our test environments. Alarmingly, this seemed to indicate that the index and partition both massively degraded the performance of UPSERTs relative to baseline.

However, when we ran the queries on the baseline testing environment — which was theoretically identical to the production environment in which the DAGs ran — we found that the runtime was considerably longer (in fact, even longer than the runtime on provider-index and provider-partition). The timings given in the chart above are for these tests run on baseline. I also tested running the queries manually on the actual production database, and found they ran in less than a minute, just as they had during the DagRun!

This leaves us with a major inconsistency: the performance of queries on baseline, a restored snapshot of production with no additional changes, is clearly much worse than the performance on production, at least in the case of ingestion upserts. We concluded that the effect of routine operations performed on production (daily ingestion, for example) creates too many confounding variables. One theory, considered only after the tests had already been performed, was that the optimizer in the test environments may not have been able to plan queries optimally due to the lack of statistics in newly restored databases; per the docs:

Once restored, it is wise to run ANALYZE on each database so the optimizer has useful statistics. An easy way to do this is to run vacuumdb -a -z to VACUUM ANALYZE all databases; this is equivalent to running VACUUM ANALYZE manually.
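
Had we realized this before testing, the remediation suggested by the quoted documentation would have been straightforward to apply to each restored instance; something along these lines (untested as part of this investigation):

-- Refresh planner statistics on the restored media tables.
-- (Running vacuumdb -a -z from a shell covers every database at once.)
VACUUM ANALYZE image;
VACUUM ANALYZE audio;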

Consequently, in our analysis we only compare the relative performance between our three test environments.

This is important to consider when looking at the data for batched updates in particular. A batched update on provider-index ran about 23% faster than the same update on baseline (50 min 10 sec versus 1 hr 5 min 9 sec); however, it still took over 50 minutes to update just 10,000 records. Extrapolated out to the roughly 50,000 batches needed (~500 million Flickr rows at 10,000 rows per batch), a full update of the Flickr provider would take nearly five years. This is obviously untenable.

However, upserts on baseline took about 60 times as long as the identical upsert on production. If the same ratio holds true for batch updates, and they take about 50 seconds when run on production, the full Flickr update could be run in about 28 days, which is comfortably within our desired timeframe for a popularity refresh.

As a final note, the order in which queries were run dramatically affected performance. For example, a SELECT on wikimedia_audio ran much more quickly after an UPDATE had been performed on the same rows. We tried to run the queries in as close to the same order as possible across the three environments, but because the test plan was evolving while we worked, there may be some inconsistencies (for example, the very quick baseline performance in the Wikimedia temp table test).

Analysis

It is useful to first examine our priorities in the Catalog, with respect to the types of queries that were analyzed in this work. By far our most commonly performed operations are (or will be) ingestion upserts and batched updates.

Upserts during ingestion are not currently a bottleneck. The pull_data step, which pulls data from external provider APIs, accounts for by far the greater part of ingestion runtime; even in reingestion DAGs, there is no pressing need to improve the performance of the upsert step, as long as we do not worsen it.

Batched updates are a greater concern, as large updates have historically taken considerable time (timing out after weeks in some cases). A batched update, as currently implemented, has two steps: the creation of the temp table, which is done once, and the actual UPDATE, which is run once per batch, potentially many thousands of times. Improving the performance of the UPDATE therefore has a significantly greater impact on the overall performance of a batched update.
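
To make that two-step shape concrete, here is a rough sketch of the batch loop written as a single anonymous block. This is illustrative only: the real DAG issues each batch as its own statement and commits between batches, and the batch size and table names simply reuse the examples above.

-- Illustrative only; the actual DAG commits after every batch rather than
-- running the whole loop in one transaction.
DO $$
DECLARE
    batch_size  bigint := 10000;
    batch_start bigint := 1;
    max_row     bigint;
BEGIN
    -- Step one (run once) is the temp table creation shown earlier.
    SELECT max(row_id) INTO max_row FROM rows_to_update_flickr;
    -- Step two, repeated for each batch of row_ids.
    WHILE batch_start <= max_row LOOP
        UPDATE image
        SET standardized_popularity =
                standardized_image_popularity(provider, meta_data)
        WHERE identifier IN (
            SELECT identifier FROM rows_to_update_flickr
            WHERE row_id >= batch_start AND row_id < batch_start + batch_size
            FOR UPDATE SKIP LOCKED
        );
        batch_start := batch_start + batch_size;
    END LOOP;
END;
$$;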

With that context, we can look at the two proposals under test:

provider-partition

The provider partition had surprisingly mixed results. In the Wikimedia Audio upsert test it had the worst performance of all three environments by a considerable margin, yet it was easily the most performant for the Freesound and Europeana upserts. It's possible that the order in which the queries were run was a factor.

While the partition environment was generally much faster at creating temp tables for the batched update (the Wikimedia temp table being the anomaly noted above), the actual updates took considerably longer than on baseline.

provider-index

The provider index offered improved performance across the board, with one exception: the standardized popularity UPDATE test, in which all three environments completed within 3 seconds of each other. Note that this compares the index to baseline rather than to production, as discussed in the Inconsistencies section.

In particular, provider-index showed consistently better performance on the UPDATE step of a batched update.

Conclusions

Because the batched UPDATE step is the operation most critically in need of performance improvement, I think we should not pursue partitioning the media tables by provider, and should instead consider adding the provider index. As noted throughout this post, the many inconsistencies found during testing and the small sample size mean this evidence is compelling, but far from conclusive.

I propose the following next steps:

  • First, try running a batched update of wikimedia_audio, our largest audio provider, in production. This will allow us to verify the theory that batched updates, much like upserts, will be faster in production than in our tests.
  • Add the provider index to the audio table in production (see the sketch after this list). Because audio is considerably smaller than image, this is a quick operation.
  • Re-run the batched update and compare the performance. If the results are promising, consider repeating the process with the image table. If not, the index can be safely dropped.
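
For the second step, a sketch of how the index could be added without blocking ingestion writes (assuming we keep the same index name used in testing):

-- CREATE INDEX CONCURRENTLY avoids holding a long write lock on audio,
-- but cannot be run inside a transaction block.
CREATE INDEX CONCURRENTLY audio_provider ON public.audio USING btree (provider);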

Note that this work only investigated two of the proposals put forth at the recent Catalog meet-up. There are many others we would like to test in the future, including pulling license information into its own table, and reducing the size of large text columns.

If you have read this far, thank you very much! I’d love to hear your thoughts.

#catalog #postgres

Catalog Postgres Exploration Effort

On May 22nd at 1500 UTC, several Openverse members will be gathering for a meeting to discuss and engage in some exploratory operations on a copy of the catalog database, to get a better sense of how Postgres performs given our current scale and purpose. This was spurred on by this discussion on the decoupling popularity calculations IP; I personally had a feeling regarding how certain Postgres changes would affect our query time, performance, etc., but given the level of unpredictable behavior we've seen with the matview refresh, few of us feel confident about being able to predict outcomes with Postgres. We thought it might be useful to spend some time performing some exploratory operations and optimizations with Postgres!

Some things I think might be useful to try:

  • Reducing the width of the core media table
  • Performing some normalization steps (e.g. pulling provider out into a separate table)
  • Adding an index on the provider column

Some ways that we could assess the performance changes of the above:

  • Have a sample set of data we use for upsert tests, one small (~1k rows) and another large (~100k rows)
  • Develop some sample updates we might perform (e.g. the thumbnail column update we tried to perform for Flickr which was projected to take an extremely long time)
  • A matview refresh or recreation

Our plan will be to create a new catalog DB instance based on a snapshot and perform all of the above against that instance. @olgabulat and @stacimc have confirmed they will join the meeting, and Staci and I will continue to work on things together throughout the day while we're co-located.

If folks have any suggestions for other operations or queries for us to test, we’d love to hear them! Hopefully we can come away from this with a better sense of “what will happen” when we change the catalog schema in certain ways 😁

#catalog #postgres

Concurrency/resource issues on monthly runs

Today (or yesterday, depending on your timezone), March 1st began at 00:00 UTC. For our Airflow instance, this meant that the scheduler kicked off all @hourly, @daily, @weekly, and @monthly DAG runs simultaneously.

While this has not historically been a problem, the iNaturalist workflow is set to run @monthly. iNaturalist is a particularly resource-intensive DAG, and the massive amount of data it processes has required some other adjustments to our existing DAGs. The iNaturalist DAG does check for new data before proceeding with any run, but when it identifies new data to process, it must reprocess the entire dataset, since there is no way to detect which records have changed since the last run.

For our Airflow cluster, this meant that iNaturalist was running alongside almost all of our other scheduled DAGs, and this caused interruptions for some of them. We saw a myriad of seemingly inexplicable issues on the cluster, ranging from missing log files to TSVs that should have existed but did not. This pointed to a disk space problem, yet when I checked the instance itself it had plenty of free disk space. I suspect the disk did fill up and I simply wasn't able to catch it in the act: the iNaturalist DAG loads the Catalog of Life data as its first step, which could have pushed usage over the edge. With other ingestion processes running at the same time, it's entirely plausible that the disk temporarily ran out of space.

@stacimc and I monitored the instance throughout the afternoon, pausing the iNaturalist DAG and waiting for the other DAGs to finish processing. Everything after that point ran successfully, and we re-enabled the iNaturalist DAG (and the data refresh DAG). Everything seems to have returned to normal at this point, though we plan on restarting the Docker stack on the instance once iNaturalist is complete and prior to the data refresh run this weekend.

Moving forward, we've identified a number of ways to improve our workflows and infrastructure.

#airflow #catalog #database

Preparing for iNaturalist

Today we were able to merge some massive and significant changes contributed by @beccawidom to the iNaturalist DAG! This PR includes a number of changes, namely:

  • The transformation steps have changed from “CSV -> Postgres -> TSV -> Postgres” to “CSV -> Postgres -> Postgres”. This significantly reduces disk space usage, time, and processing overhead, and was a necessary change in order to process all of the iNaturalist data in a reasonable timeframe. It also serves as a proof-of-concept for future bulk data imports, since the transformation and data-cleaning steps happen entirely in SQL (an Openverse first!).
  • Images are now connected with the Catalog of Life, which provides English vernacular names. This should help improve search relevancy over the current scientific names.

I want to take a moment to celebrate this huge accomplishment, and the tremendous work @beccawidom poured into this effort. Thank you!


Now that this DAG is ready to be run once again, we're faced with the impressive and daunting prospect that we could, in a matter of days, increase the size of the image catalog by ~137 million records (a roughly 23.3% increase in size). With that in mind, it's important to consider the implications of including this data.

We have a weekly image data refresh process which transfers images from the catalog into our API for public use. Presently, this data refresh takes around 47 hours without the popularity recalculation and 60 hours with it. If we assume these times scale linearly with catalog size, we can expect them to become roughly 58 hours and 74 hours respectively (47 × 1.233 ≈ 58; 60 × 1.233 ≈ 74). Since these are run weekly, that still leaves roughly 94 hours in the week before data refreshes start queuing up while previous ones are still running.

Here are some steps we can take to monitor the process:

  1. Take a manual database snapshot of the catalog prior to enabling the iNaturalist DAG.
  2. Enable the DAG shortly after the weekly data refresh has completed. This will allow iNaturalist to run without other significant database operations occurring.
  3. Disable the DAG after the run while we verify the following steps.
  4. Monitor the next scheduled image data refresh closely for significant aberrations in step duration.
  5. Make a number of searches after the data refresh is complete to see how results are affected. We can try some searches that we would expect to return iNaturalist data (e.g. cat, mushroom, alligator) and some that we would not (e.g. computer, transistor, book).
  6. Re-enable the iNaturalist DAG.

One of our big-picture goals for 2023 is search relevancy, and a key piece required for making improvements in that area is understanding how our existing document scoring works. I’m not sure that we can predict how adding this much data will affect our result relevancy. In the case where we notice result relevancy is negatively impacted (e.g. unrelated queries are flooded with iNaturalist results), there are a few actions we can take to mitigate this:

  • Alter the weight of the provider in the API (@sarayourfriend had mentioned this as an option).
  • Set the authority boost of the provider in the ingestion server and reindex the images.
  • Disable the iNaturalist provider in the API.

We would like to do all we can to avoid the last option. I don’t presume that the iNaturalist data will require taking the above actions, but I wanted to outline them and open up space in case other folks have mitigation ideas.

We’re incredibly excited for the addition of this data!

#catalog #database