Today (or yesterday, depending on your timezone) was the start of March 1st, UTC 00:00. For our Airflow instance, this meant that the scheduler kicked off all
@monthly DAG runs simultaneously.
While this has not historically been a problem, the iNaturalist workflow is set to run
@monthly. iNaturalist is a particularly resource intensive DAG, and the massive amount of data it processes has required some other adjustments to our existing DAGs. The iNaturalist DAG does run a check for new data before proceeding for any run, but when it does identify that there is new data to process, it must reprocess the entire dataset (since there is no way to detect which records have changed from the last run).
For our Airflow cluster, this meant that iNaturalist was running alongside almost all of our other scheduled DAGs, and this caused some interruptions with other DAGs. We saw a myriad of seemingly-inexplicable issues on the cluster, ranging from log files missing to TSVs failing to exist when they should. This seemed to point to a disk space issue, but when I checked the instance itself it had plenty of disk space. I suspect I wasn’t able to catch it, but the iNaturalist DAG initially loads the Catalog of Life data as its first step which could have pushed it over the edge. Alongside other ingestion processes, it’s totally plausible that the disk ran out of space with everything going on.
@stacimc and I monitored the instance throughout the afternoon, pausing the iNaturalist DAG and waiting for the other DAGs to finish processing. Everything after that point ran successfully, and we re-enabled the iNaturalist DAG (and the data refresh DAG). Everything seems to have returned to normal at this point, though we plan on restarting the Docker stack on the instance once iNaturalist is complete and prior to the data refresh run this weekend.
Moving forward, we’ve identified a number of ways to improve our workflows and infrastructure:
#airflow #catalog #database