Concurrency/resource issues on monthly runs – Make Openverse

March 1st began at 00:00 UTC today (or yesterday, depending on your timezone). For our Airflow instance, this meant that the scheduler kicked off all @hourly, @daily, @weekly, and @monthly DAG runs simultaneously.
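As a rough illustration of why these runs pile up, the sketch below assumes the standard Airflow preset-to-cron mapping and checks which presets match a given instant. The helper names (`fires_at`, `presets_firing`) are made up for this example, and the matcher only understands plain numbers and `*`.

```python
from datetime import datetime

# Standard Airflow schedule presets and their cron equivalents
# (minute, hour, day-of-month, month, day-of-week).
PRESETS = {
    "@hourly": "0 * * * *",
    "@daily": "0 0 * * *",
    "@weekly": "0 0 * * 0",
    "@monthly": "0 0 1 * *",
}


def fires_at(cron: str, moment: datetime) -> bool:
    """Return True if a simple cron expression (numbers or '*') matches `moment`."""
    fields = cron.split()
    # Python counts Monday as 0; cron counts Sunday as 0.
    cron_weekday = (moment.weekday() + 1) % 7
    actual = [moment.minute, moment.hour, moment.day, moment.month, cron_weekday]
    return all(f == "*" or int(f) == value for f, value in zip(fields, actual))


def presets_firing(moment: datetime) -> list:
    """List the presets whose cron expression matches `moment`."""
    return [name for name, cron in PRESETS.items() if fires_at(cron, moment)]
```

At 00:00 UTC on the 1st of any month, `@hourly`, `@daily`, and `@monthly` always match. Under this mapping, `@weekly` joins them only when the 1st falls on a Sunday.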

This has not historically been a problem, but the iNaturalist workflow is also set to run @monthly. iNaturalist is a particularly resource-intensive DAG, and the massive amount of data it processes has required some other adjustments to our existing DAGs. The iNaturalist DAG does check for new data before proceeding on every run, but when it identifies that there is new data to process, it must reprocess the entire dataset (since there is no way to detect which records have changed since the last run).
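The decision described above can be sketched as a small function. This is only an illustration of the all-or-nothing behavior, not the DAG's actual code: `plan_run`, its parameters, and the returned labels are hypothetical names chosen for the example.

```python
from datetime import datetime
from typing import Optional


def plan_run(last_processed: Optional[datetime], source_updated: datetime) -> str:
    """Decide what a scheduled run should do (hypothetical helper).

    There is no per-record change detection, so the only two outcomes are
    skipping the run or reprocessing the whole dataset.
    """
    if last_processed is not None and source_updated <= last_processed:
        # Nothing new since the last successful run.
        return "skip"
    # Any new data at all means starting over from scratch.
    return "full_reprocess"
```

The point of the sketch is that there is no middle ground: a single changed record costs as much as a completely new dataset.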

For our Airflow cluster, this meant that iNaturalist was running alongside almost all of our other scheduled DAGs, which caused interruptions in those DAGs. We saw a range of seemingly inexplicable issues on the cluster, from missing log files to TSVs that did not exist when they should have. This seemed to point to a disk space issue, but when I checked the instance itself it had plenty of disk space. I suspect I simply didn't catch it in time: the iNaturalist DAG loads the Catalog of Life data as its first step, which could have pushed the disk over the edge. Alongside the other ingestion processes, it's totally plausible that the disk ran out of space with everything going on.
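A single manual check can easily miss a short-lived spike like this. Below is a minimal sketch of sampling free disk space over time using only the Python standard library. The function names, the 10% warning threshold, and the sampling interval are assumptions for illustration, not a description of the monitoring actually in place on the instance.

```python
import shutil
import time


def free_fraction(path: str = "/") -> float:
    """Return the fraction of the filesystem holding `path` that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def watch_disk(path: str = "/", samples: int = 5,
               interval_seconds: float = 60.0, warn_below: float = 0.10) -> float:
    """Sample free space repeatedly and return the lowest fraction observed."""
    lowest = 1.0
    for _ in range(samples):
        current = free_fraction(path)
        lowest = min(lowest, current)
        if current < warn_below:
            # Hypothetical alert; a real setup would notify someone instead.
            print(f"Low disk space on {path}: {current:.1%} free")
        time.sleep(interval_seconds)
    return lowest
```

Tracking the lowest value seen, rather than the current one, is what makes a transient dip visible after the fact.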

@stacimc and I monitored the instance throughout the afternoon, pausing the iNaturalist DAG and waiting for the other DAGs to finish processing. Everything after that point ran successfully, and we re-enabled the iNaturalist DAG (and the data refresh DAG). Everything seems to have returned to normal at this point, though we plan on restarting the Docker stack on the instance once iNaturalist is complete and prior to the data refresh run this weekend.

Moving forward, we’ve identified a number of ways to improve our workflows and infrastructure:

#airflow #catalog #database