Today we were able to merge some massive and significant changes contributed by @beccawidom to the iNaturalist DAG! This PR includes a number of changes, namely:
- The transformation steps have changed from “CSV -> Postgres -> TSV -> Postgres” now to “CSV -> Postgres -> Postgres”. This significantly reduces disk space, time, and processing overhead, and was a necessary change in order to process all of the iNaturalist data in a reasonable timeframe. It also serves as a proof-of-concept for future bulk data imports, since the transformation & data cleaning steps are happening entirely in SQL (an Openverse Openverse is a search engine for openly-licensed media, including photos, audio, and video. Openverse is also the name for the collection of related code repositories that make up the project. find Openverse at https://openverse.org. first!).
- Images are now connected with the Catalog of Life, which provides English vernacular names. This should help improve search relevancy over the current scientific names.
I want to take a moment to celebrate this huge accomplishment, and the tremendous effort @beccawidom poured into this effort. Thank you!
Now that this DAG is ready to be run once again, we’re faced with the impressive and daunting notion that we could, in a matter of days, increase the size of the image catalog by ~137 million (a roughly 23.3% increase in size). With that information, it’s important to consider the implications of including this data.
We have a weekly image data refresh process which transfers images from the catalog into our API An API or Application Programming Interface is a software intermediary that allows programs to interact with each other and share data in limited, clearly defined ways. for public use. Presently, this data refresh takes around 47 hours without the popularity recalculation and 60 hours with the popularity recalculation. If we are to assume these times are linear, we can expect those times to become 58 hours and 74 hours respectively. Since these are run weekly, this still gives us about 100 hours left in the week before we start having data refreshes queued while previous ones are running.
Here are some steps we can take to monitor the process:
- Take a manual database snapshot of the catalog prior to enabling the iNaturalist DAG.
- Enable the DAG shortly after the weekly data refresh has completed. This will allow iNaturalist to run without other significant database operations occurring.
- Disable the DAG after the run while we verify the following steps.
- Monitor the next scheduled image data refresh closely for significant aberrations in step duration.
- Make a number of searches after the data refresh is complete to see how results are affected. We can make a number of searches which we would expect to return iNaturalist data (e.g. cat, mushroom, alligator) and some we expect should not (e.g. computer, transistor, book).
- Re-enable the iNaturalist DAG.
One of our big-picture goals for 2023 is search relevancy, and a key piece required for making improvements in that area is understanding how our existing document scoring works. I’m not sure that we can predict how adding this much data will affect our result relevancy. In the case where we notice result relevancy is negatively impacted (e.g. unrelated queries are flooded with iNaturalist results), there are a few actions we can take to mitigate this:
- Alter the weight of the provider in the API (@sarayourfriend had mentioned this as an option).
- Set the authority boost of the provider in the ingestion server and reindex the images.
- Disable the iNaturalist provider in the API.
We would like to do all we can to avoid the last option. I don’t presume that the iNaturalist data will require taking the above actions, but I wanted to outline them and open up space in case other folks have mitigation ideas.
We’re incredibly excited for the addition of this data!