Today (or yesterday, depending on your timezone) was the start of March 1st, UTC 00:00. For our Airflow instance, this meant that the scheduler kicked off all
@monthly DAG runs simultaneously.
While this has not historically been a problem, the iNaturalist workflow is set to run
@monthly. iNaturalist is a particularly resource intensive DAG, and the massive amount of data it processes has required some other adjustments to our existing DAGs. The iNaturalist DAG does run a check for new data before proceeding for any run, but when it does identify that there is new data to process, it must reprocess the entire dataset (since there is no way to detect which records have changed from the last run).
For our Airflow cluster, this meant that iNaturalist was running alongside almost all of our other scheduled DAGs, and this caused some interruptions with other DAGs. We saw a myriad of seemingly-inexplicable issues on the cluster, ranging from log files missing to TSVs failing to exist when they should. This seemed to point to a disk space issue, but when I checked the instance itself it had plenty of disk space. I suspect I wasn’t able to catch it, but the iNaturalist DAG initially loads the Catalog of Life data as its first step which could have pushed it over the edge. Alongside other ingestion processes, it’s totally plausible that the disk ran out of space with everything going on.
@stacimc and I monitored the instance throughout the afternoon, pausing the iNaturalist DAG and waiting for the other DAGs to finish processing. Everything after that point ran successfully, and we re-enabled the iNaturalist DAG (and the data refresh DAG). Everything seems to have returned to normal at this point, though we plan on restarting the Docker stack on the instance once iNaturalist is complete and prior to the data refresh run this weekend.
Moving forward, we’ve identified a number of ways to improve our workflows and infrastructure:
#airflow #catalog #database
Today we were able to merge some massive and significant changes contributed by @beccawidom to the iNaturalist DAG! This PR includes a number of changes, namely:
- The transformation steps have changed from “CSV -> Postgres -> TSV -> Postgres” now to “CSV -> Postgres -> Postgres”. This significantly reduces disk space, time, and processing overhead, and was a necessary change in order to process all of the iNaturalist data in a reasonable timeframe. It also serves as a proof-of-concept for future bulk data imports, since the transformation & data cleaning steps are happening entirely in SQL (an Openverse Openverse is a search engine for openly-licensed media, including images and audio. Find Openverse on GitHub and at https://openverse.org. first!).
- Images are now connected with the Catalog of Life, which provides English vernacular names. This should help improve search relevancy over the current scientific names.
I want to take a moment to celebrate this huge accomplishment, and the tremendous effort @beccawidom poured into this effort. Thank you!
Now that this DAG is ready to be run once again, we’re faced with the impressive and daunting notion that we could, in a matter of days, increase the size of the image catalog by ~137 million (a roughly 23.3% increase in size). With that information, it’s important to consider the implications of including this data.
We have a weekly image data refresh process which transfers images from the catalog into our API An API or Application Programming Interface is a software intermediary that allows programs to interact with each other and share data in limited, clearly defined ways. for public use. Presently, this data refresh takes around 47 hours without the popularity recalculation and 60 hours with the popularity recalculation. If we are to assume these times are linear, we can expect those times to become 58 hours and 74 hours respectively. Since these are run weekly, this still gives us about 100 hours left in the week before we start having data refreshes queued while previous ones are running.
Here are some steps we can take to monitor the process:
- Take a manual database snapshot of the catalog prior to enabling the iNaturalist DAG.
- Enable the DAG shortly after the weekly data refresh has completed. This will allow iNaturalist to run without other significant database operations occurring.
- Disable the DAG after the run while we verify the following steps.
- Monitor the next scheduled image data refresh closely for significant aberrations in step duration.
- Make a number of searches after the data refresh is complete to see how results are affected. We can make a number of searches which we would expect to return iNaturalist data (e.g. cat, mushroom, alligator) and some we expect should not (e.g. computer, transistor, book).
- Re-enable the iNaturalist DAG.
One of our big-picture goals for 2023 is search relevancy, and a key piece required for making improvements in that area is understanding how our existing document scoring works. I’m not sure that we can predict how adding this much data will affect our result relevancy. In the case where we notice result relevancy is negatively impacted (e.g. unrelated queries are flooded with iNaturalist results), there are a few actions we can take to mitigate this:
- Alter the weight of the provider in the API (@sarayourfriend had mentioned this as an option).
- Set the authority boost of the provider in the ingestion server and reindex the images.
- Disable the iNaturalist provider in the API.
We would like to do all we can to avoid the last option. I don’t presume that the iNaturalist data will require taking the above actions, but I wanted to outline them and open up space in case other folks have mitigation ideas.
We’re incredibly excited for the addition of this data!