Openverse is a search engine for openly-licensed media.
The Openverse team builds the Openverse Catalog, API, and front-end application, as well as integrations between Openverse and WordPress. Find Openverse on GitHub and at https://openverse.org. Follow this site for updates and discussions on the project.
To keep the data in the Openverse Catalog up to date, it is necessary to ‘recheck’ the information for each image periodically. This makes tracking metrics that change with time, e.g., ‘views’, more sensible. It also lets us remove images from the index once they have been removed from the source.
We would prefer to reingest the information for newer images more frequently, and the information for older images less frequently. This is because we assume that an image’s information changes at the source in more interesting ways while the image is new. For example, assume a picture is viewed 100 times per month.
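The arithmetic behind this example can be sketched in a few lines of Python. This is only an illustration under the stated assumption of a constant 100 views per month; the function name `percent_increase_by_month` is hypothetical and not part of the Openverse codebase.

```python
def percent_increase_by_month(views_per_month, months):
    """Percent growth of the cumulative view count in months 2..months.

    Assumes (as in the example above) a constant number of views per month.
    """
    increases = []
    for month in range(2, months + 1):
        # Total views accumulated before this month started.
        previous_total = views_per_month * (month - 1)
        increases.append(100.0 * views_per_month / previous_total)
    return increases


# Month 2 doubles the total; each later month adds a smaller share.
for month, pct in enumerate(percent_increase_by_month(100, 12), start=2):
    print(f"month {month:2d}: total views grow by {pct:.1f}%")
```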
As we see, given consistent monthly views, the ‘percent increase’ of the total views metric drops off as the picture ages: the second month doubles the total, while the twelfth month adds only about 9%. (In reality, it appears that in most cases pictures are mostly viewed when they are new.)
Thus, it makes sense to focus more on keeping the information up-to-date for the most recently uploaded images.
The basic question to consider when working out a strategy for keeping image data up to date via reingestion is: what percentage of the Provider’s collection can we reingest per day? This tells us how sparsely we need to spread out the reingestion of image data. For example, if we can only reingest 1% per day, then we’d expect the mean time between reingestions of the data for a given image to be about 100 days. Since we ingest from most providers based on the date an image was uploaded, a convenient approximation of this is: how many days’ worth of uploaded data can we ingest per day?
For Wikimedia Commons (WMC), we can ingest about 25 days of uploaded image data per day. This means it takes around 15 days to ingest the data for all images uploaded to Wikimedia Commons in a year. (The year 2019 was used to test these numbers; other years will differ based on the number of images uploaded in them, and it is not clear whether to expect more or fewer images per year further back in time.)
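Both back-of-the-envelope figures above (about 100 days between reingestions at 1% per day, and about 15 days to ingest a year of WMC uploads at 25 days per day) follow from simple division. A minimal sketch, with hypothetical helper names chosen only for this illustration:

```python
import math


def mean_days_between_reingestion(fraction_per_day):
    """If we reingest this fraction of a collection daily, each image is
    revisited on average once every 1 / fraction days."""
    return 1.0 / fraction_per_day


def days_to_ingest(days_of_data, days_ingested_per_day):
    """Whole days needed to work through a span of uploaded data."""
    return math.ceil(days_of_data / days_ingested_per_day)


print(round(mean_days_between_reingestion(0.01)))  # 1% per day
print(days_to_ingest(365, 25))                     # one year at 25 days per day
```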
Basically, we assume we’ll ingest some number n of days each day. We set some maximum number of days D we’re willing to wait between reingestion of the data for a given image, subject to the constraint that we need to have nD > x, where x is the total number of days of data we want to ingest eventually. If we have ‘wiggle room’ (e.g., if nD > x + y for some y >= 1), we use it to create a ‘smooth increase’ in the number of days between ingestion, with perhaps the first reingestion being when the image is only 30 days old, rather than D days old.
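The constraint can be checked with a short calculation. The sketch below is one reading of the paragraph above, not code from the project: `n`, `D`, and `x` are the symbols from the text, the function names are hypothetical, and the value used for `x` (roughly 17 years of WMC uploads, July 2003 to 2020) is an assumption for illustration.

```python
def is_feasible(n, D, x):
    """True when ingesting n days of data per day, with at most D days
    between reingestions, covers x total days of data (n * D > x)."""
    return n * D > x


def wiggle_room(n, D, x):
    """Slack y = n * D - x; a positive value can be spent on revisiting
    newer images more often than once every D days."""
    return n * D - x


n = 25          # days of data ingested per day (WMC figure from the text)
D = 365         # reingest everything at least yearly
x = 17 * 365    # assumed: about 17 years of uploads to cover

print(is_feasible(n, D, x), wiggle_room(n, D, x))
```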
If we want to reingest the data for an image at least yearly, the best we can do (optimizing for the total number of days ingested eventually) is to reingest the data once per year. This means we’d be able to reingest data that is up to 24 years old (assuming that we can only ingest 25 days per day, and noting that we need to also ingest the current date’s data). In months, that means we’d ingest the data for an image when it is
0, 12, 24, 36, ..., or 288
months old. Now, noticing that Wikimedia Commons has only existed since July of 2003, we could modify our strategy to reingest their data when it is
0, 1, 3, 5, 7, 12, 24, 36, ..., or 252
months old. We’ve cut off a bit of the end of our ingestion, but we’d still get all data that is newer than 21 years old ingested once per year.
In some cases (e.g., Wikimedia Commons), it seems to be possible to run multiple instances of the algorithm concurrently. For WMC, running even just two instances in parallel gives us much more room to work with: we’d be able to ingest about 50 days of WMC data per day. One implementation would be to reingest data when it is
0, 1, 2, ..., 6, 9, 12, ..., 24, 30, 36, ..., or 246
months old. This sequence means that data for images at most 6 months old is ingested monthly, data for images 6-24 months old is ingested every 3 months, and data for images 24-246 months old is ingested every 6 months. This still covers more than 20 years back in time, so we’d be able to use these numbers for about 3 years from now. For more longevity, we could instead reingest data when it is
0, 1, 2, ..., 6, 9, 12, ..., 24, 30, 36, ..., 120, 132, 144, ..., or 384
months old. This is the same as the previous example, but we slow down to reingesting data that is at least 10 years old only once per year. This covers 32 years back in time, and so for WMC, this algorithm would suffice as a reingestion strategy for the next 15 years (the current date is 2020).
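The schedules described above can be generated from a list of tiers, and the length of a schedule gives the number of days of data that must be ingested each day to sustain it (one day of uploads per listed age). The following is a sketch under that reading of the text; `build_schedule` and its tier format are hypothetical and not taken from the Openverse Catalog code.

```python
def build_schedule(tiers):
    """Return the image ages (in months) at which data is reingested.

    tiers is a list of (last_age_in_months, step_in_months) pairs in
    increasing age order. Age 0 (the current date's data) is always included.
    """
    ages = [0]
    for last_age, step in tiers:
        # Keep stepping until the next age would pass the end of this tier.
        while ages[-1] + step <= last_age:
            ages.append(ages[-1] + step)
    return ages


# Once per year, up to 24 years back.
yearly = build_schedule([(288, 12)])

# Monthly to 6 months, every 3 months to 24, every 6 months to 246.
tiered = build_schedule([(6, 1), (24, 3), (246, 6)])

print(len(yearly), len(tiered))  # 25 50
```

The two counts line up with the figures in the text: the yearly schedule needs 25 days of data per day, and the tiered schedule needs 50, which is what two parallel instances would provide for WMC.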
Clearly, the number of days of data ingested per day is crucial to being able to reingest providers’ catalogs on reasonable time scales.
It seems that the Wikimedia Commons Provider API is about the lowest performance-to-volume ratio for which we’ll be able to reingest the entire catalog on a reasonable timescale, both now and for some years into the future. Our assumption here is that reingesting the data for an image less often than once per year isn’t sufficient. Roughly speaking, if a Provider API is slower than WMC’s, then it must have lower volume for us to succeed, and if it is higher volume than WMC’s, then it must be faster.