Image data reingestion strategy

Goal

In order to keep the data in the Openverse Catalog up to date over time, it is necessary to ‘recheck’ the information for each image periodically. This makes tracking metrics that change with time, e.g., ‘views’, more sensible. It also gives us the possibility of removing images from the index that were removed from the source for some reason.

Ingestion Strategy

We would prefer to reingest the information for newer images more frequently, and the information for older images less frequently. This is because we assume the information for an image is updated at the source in more interesting ways while the image is still relatively new. For example, assume a picture is viewed 100 times per month.

month | total views | % increase
------|-------------|-----------
1     | 100         | infinite
2     | 200         | 100%
3     | 300         | 50%
4     | 400         | 33%
5     | 500         | 25%
6     | 600         | 20%
7     | 700         | 17%
8     | 800         | 14%
9     | 900         | 13%
10    | 1000        | 11%
11    | 1100        | 10%
12    | 1200        | 9%

As we see, given consistent monthly views, the ‘percent increase’ of the total views metric drops off as the picture ages. (In reality, it appears that in most cases, pictures are mostly viewed when they are new.)
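
To make the drop-off concrete, here is a small Python sketch (illustrative only, not part of the catalog code) that reproduces the table above for a picture viewed 100 times per month:

```python
# Constant 100 views per month; print the running total and the percent
# increase of that total from one month to the next.
views_per_month = 100

previous_total = 0
for month in range(1, 13):
    total = views_per_month * month
    if previous_total == 0:
        increase = "infinite"  # no previous total to compare against
    else:
        pct = 100 * (total - previous_total) / previous_total
        increase = f"{int(pct + 0.5)}%"  # round half up, as in the table above
    print(month, total, increase)
    previous_total = total
```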

Thus, it makes sense to focus more on keeping the information up-to-date for the most recently uploaded images.

Main metric: days ingested per day of processing

The basic thing to consider when trying to figure out a proper strategy for keeping image data up to date via reingestion is: For what percentage of the Provider’s collection can we reingest the image data per day? This tells us how sparsely we need to spread out the reingestion of image data. For example, if we can only reingest 1% per day, then we’d expect the mean time between reingestions of the data for a given image to be about 100 days. Since we ingest from most providers based on the date an image was uploaded, a convenient approximation of this is: How many days’ worth of uploaded data can we ingest per day?
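
As a quick illustration of that relationship (a throwaway Python sketch; the function name is made up for this post):

```python
def mean_days_between_reingestions(fraction_per_day):
    """If we can reingest a given fraction of a provider's collection per day,
    the mean time between reingestions of any one image is the reciprocal."""
    return 1 / fraction_per_day

print(mean_days_between_reingestions(0.01))  # 1% per day -> 100.0 days between reingestions
```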

Example: Wikimedia Commons

For Wikimedia Commons (WMC), we can ingest about 25 days of uploaded image data per day. This means it takes around 15 days to ingest the data of all images uploaded to Wikimedia Commons in a year (the year 2019 was used to test these numbers; other years will differ based on the number of images uploaded in those years. It is not clear whether to expect more or fewer images per year for years further back in time).
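
The arithmetic behind that estimate, as a quick sketch (25 days of data per day is the measured WMC rate mentioned above):

```python
days_of_data_per_day = 25  # measured ingestion rate for Wikimedia Commons
days_in_year = 365

# Days of processing needed to (re)ingest one year's worth of WMC uploads.
print(days_in_year / days_of_data_per_day)  # -> 14.6, i.e., around 15 days
```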

Algorithm to choose which days to (re)ingest

Basically, we assume we’ll ingest some number n of days of data each day. We set some maximum number of days D that we’re willing to wait between reingestions of the data for a given image, subject to the constraint that we need to have nD > x, where x is the total number of days of data we want to ingest eventually (over any window of D days we can ingest at most nD days of data, so the whole collection fits into one reingestion cycle only if nD > x). If we have ‘wiggle room’ (e.g., if nD > x + y for some y >= 1), we use it to create a ‘smooth increase’ in the number of days between ingestions, with perhaps the first reingestion happening when the image is only 30 days old, rather than D days old.
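
Below is a minimal Python sketch of one way to realize this idea. The function name, the 30-day starting gap, and the gap-doubling rule are assumptions made for illustration here; this is not the catalog’s actual implementation.

```python
def build_reingestion_schedule(n, max_gap_days, total_days, first_gap_days=30):
    """Return a list of ages (in days) at which an image's data gets reingested,
    using at most n ingestion slots per day.

    The gap between reingestions starts at first_gap_days and doubles until it
    hits max_gap_days (the 'smooth increase'), then stays at max_gap_days until
    the schedule reaches total_days back in time.
    """
    # Basic feasibility constraint from the text: nD > x.
    assert n * max_gap_days > total_days, "not enough daily capacity for this max gap"

    ages = [0]  # always ingest the current day's new uploads
    age, gap = 0, first_gap_days
    while age < total_days and len(ages) < n:
        age += gap
        ages.append(age)
        gap = min(gap * 2, max_gap_days)
    return ages

# Example with made-up numbers: capacity for 25 days of data per day, at most a
# year between reingestions, and a collection spanning roughly 17 years of uploads.
print(build_reingestion_schedule(n=25, max_gap_days=365, total_days=17 * 365))
```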

Example: Wikimedia Commons calculations/options

If we want to reingest the data for an image at least yearly, the best we can do (optimizing for the total number of days ingested eventually) is to reingest the data once per year. This means we’d be able to reingest data that is up to 24 years old (assuming that we can only ingest 25 days per day, and noting that we need to also ingest the current date’s data). In months, that means we’d ingest the data for an image when it is

0, 12, 24, 36, ..., or 288

months old. Now, noticing that Wikimedia Commons has only existed since July of 2003, we could modify our strategy to reingest their data when it is

0, 1, 3, 5, 7, 12, 24, 36, ..., or 252

months old. We’ve cut off a bit of the end of our ingestion, but we’d still get all data that is newer than 21 years old ingested once per year.
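
A quick way to sanity-check these month lists (a hypothetical snippet, not catalog code) is that the length of a list is the number of days of data ingested per day, and its largest entry is how far back the schedule reaches:

```python
# Strict yearly reingestion: 0, 12, 24, ..., 288 months old.
yearly = list(range(0, 289, 12))
print(len(yearly), max(yearly) // 12)  # -> 25 days per day, 24 years of coverage

# Modified schedule with extra early rechecks: 0, 1, 3, 5, 7, 12, 24, ..., 252.
modified = [0, 1, 3, 5, 7] + list(range(12, 253, 12))
print(len(modified), max(modified) // 12)  # -> 26 slots (~25 per day), 21 years of coverage
```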

Concurrent instances of the Algo for WMC

In some cases (e.g., Wikimedia Commons), it seems to be possible to run multiple instances of the algorithm concurrently. In the case of WMC, running even just two instances in parallel gives us much more room to work with. In this case, we’d be able to ingest about 50 days per day of WMC data. One implementation would be to reingest data when it is

0, 1, 2, 3, 4, 5, 6, 9, 12, 15, 18, 21, 24, 30, 36, 42, 48, ..., or 246

months old. This sequence means that data for images at most 6 months old is ingested monthly, data for images 6-24 months old is ingested every 3 months, and data 24-246 months old is ingested every 6 months. This still covers more than 20 years back in time, so we’d be able to use these numbers for about 3 years from now. For more longevity, we could instead reingest data when it is

0, 1, 2, 3, 4, 5, 6, 9, 12, 15, 18, 21, 24, 30, 36, 42, ..., 120, 132, 144, ..., or 384

months old. This is the same as the previous example, but we slow down to only reingesting data that is at least 10 years old once per year. This covers 32 years back in time, and so for WMC, this algorithm would suffice for a reingestion strategy for the next 15 years (the current date is 2020).
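
The same kind of check applies to the two-instance schedules above (again a hypothetical snippet, not catalog code):

```python
# First option: 0-6 months every month, 9-21 every 3 months, 24-246 every 6 months.
option_a = list(range(0, 7)) + list(range(9, 24, 3)) + list(range(24, 247, 6))
print(len(option_a), max(option_a) / 12)  # -> 50 days per day, 20.5 years of coverage

# Longer-lived option: same start, every 6 months up to 120, then yearly up to 384.
option_b = (list(range(0, 7)) + list(range(9, 24, 3))
            + list(range(24, 121, 6)) + list(range(132, 385, 12)))
print(len(option_b), max(option_b) / 12)  # -> 51 slots (~50 per day), 32 years of coverage
```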

Clearly, the number of days of data we can ingest per day is crucial to being able to reingest providers’ catalogs on reasonable time scales.

Addendum — Minimum Provider API Capabilities

It seems like the Wikimedia Commons Provider API is roughly the lower bound, in terms of its performance-to-volume ratio, for which we’ll be able to successfully reingest the entire catalog on a reasonable timescale, and for some years into the future. Our assumption here is that reingesting the data for an image less often than once per year isn’t sufficient. Roughly speaking, if a Provider API is slower than WMC’s, then it must have lower volume for us to succeed, and if it has higher volume than WMC’s, then it must be faster.
