Today (or yesterday, depending on your timezone) was the start of March 1st, UTC 00:00. For our Airflow instance, this meant that the scheduler kicked off all `@monthly` DAG runs simultaneously.
While this has not historically been a problem, the iNaturalist workflow is set to run `@monthly`. iNaturalist is a particularly resource-intensive DAG, and the massive amount of data it processes has required some other adjustments to our existing DAGs. The iNaturalist DAG does check for new data before proceeding with any run, but when it does identify new data to process, it must reprocess the entire dataset (since there is no way to detect which records have changed since the last run).
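Since there is no per-record change detection, the gate reduces to an all-or-nothing check. A minimal sketch, assuming a timestamp comparison (`has_new_data` is a hypothetical name; the real DAG's check may compare file hashes, ETags, or record counts instead):

```python
from datetime import datetime
from typing import Optional


def has_new_data(source_modified: datetime,
                 last_successful_run: Optional[datetime]) -> bool:
    """Return True when the upstream dump is newer than our last ingestion.

    Hypothetical gate: if it returns False, the DAG short-circuits and
    skips the (very expensive) full reprocess.
    """
    if last_successful_run is None:
        # Never ingested before: process everything.
        return True
    return source_modified > last_successful_run
```

In an Airflow DAG this kind of predicate would typically sit behind a short-circuit task so that all downstream processing is skipped when it returns False.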
For our Airflow cluster, this meant that iNaturalist was running alongside almost all of our other scheduled DAGs, and this caused interruptions to some of them. We saw a myriad of seemingly inexplicable issues on the cluster, ranging from missing log files to TSVs failing to exist when they should have. This seemed to point to a disk space issue, but when I checked the instance itself, it had plenty of disk space. I suspect the spike came and went before I could catch it: the iNaturalist DAG loads the Catalog of Life data as its first step, which could have pushed disk usage over the edge. Alongside the other ingestion processes running at the time, it's entirely plausible that the disk briefly ran out of space.
@stacimc and I monitored the instance throughout the afternoon, pausing the iNaturalist DAG and waiting for the other DAGs to finish processing. Everything after that point ran successfully, and we re-enabled the iNaturalist DAG (and the data refresh DAG). Everything seems to have returned to normal at this point, though we plan on restarting the Docker stack on the instance once iNaturalist is complete and prior to the data refresh run this weekend.
Moving forward, we've identified a number of ways to improve our workflows and infrastructure.
#airflow #catalog #database
We initiated our first data refresh after adding the entirety of the iNaturalist data this last week, ballooning the catalog from ~590m records to ~712m. Unfortunately, the `image_view` matview refresh step (which normally takes 16 hours) ran far longer than expected. Here's a link to the graph view for the run, for those with access.
The matview refresh task hit its timeout of 1 day but didn't immediately stop, due to a known issue that Rebecca is working on addressing. Typically when this happens, the query completes successfully (albeit well after the intended timeout), but Airflow, having attempted to issue a task failure for the entire duration the task ran over the timeout, marks the task as failed. This initiated a retry of the task. Once I noticed the retry attempt, I paused the DAG and marked the task as failed. For the same reason as above, this would not actually halt the query, so I executed a `pg_terminate_backend` on the matview refresh's backend. Even that took quite some time to complete.
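For anyone reproducing this, the terminate flow is roughly: find the offending backend in `pg_stat_activity`, then pass its pid to `pg_terminate_backend`. A sketch of the two queries (the exact filter text is illustrative):

```python
# Locate the backend running the matview refresh. Excluding our own
# backend avoids matching this very query's own text in pg_stat_activity.
FIND_REFRESH_PID = """
SELECT pid, state, now() - query_start AS runtime
FROM pg_stat_activity
WHERE query ILIKE '%REFRESH MATERIALIZED VIEW%'
  AND pid <> pg_backend_pid();
"""

# pg_terminate_backend returns true once the signal is sent; the backend
# itself may still take a while to roll back and exit, which matches the
# slow termination observed here.
TERMINATE_BACKEND = "SELECT pg_terminate_backend(%(pid)s);"
```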
I wanted to verify whether or not the query had run successfully the first time, even though it exceeded its allotted timeout. I initially tried the rapid count calculation on the view, but it returned ~589m records. To rule out an issue with tuples failing to be updated on the view for the rapid count, I ran a full `SELECT COUNT(*) FROM image_view`. Sadly, this returned the same amount:
```
deploy@localhost:openledger> select count(*) from image_view;
| count     |
|-----------|
| 589538189 |
Time: 1623.020s (27 minutes 3 seconds), executed in: 1623.017s (27 minutes 3 seconds)
```
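For context, the "rapid count" here is (I believe) the planner's row estimate, which is nearly free to read but only as fresh as the last VACUUM/ANALYZE, whereas `COUNT(*)` is exact but requires a full scan. Sketched as two queries (constant names are illustrative):

```python
# Cheap estimate from the system catalog: updated by VACUUM/ANALYZE,
# so it can lag behind the true row count.
RAPID_COUNT = """
SELECT reltuples::bigint AS approximate_rows
FROM pg_class
WHERE relname = 'image_view';
"""

# Exact but expensive: scans the whole materialized view
# (~27 minutes at ~589m rows in the session above).
EXACT_COUNT = "SELECT COUNT(*) FROM image_view;"
```

Running both and getting the same ~589m figure is what ruled out a stale-estimate explanation.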
This meant that the matview refresh did not complete successfully even after 52 hours of running, which is quite disheartening.
I think our best bet is to remove retries on this task and increase the timeout to account for the larger dataset. We can't do the latter, though, until we have a sense of how long the query will actually take now. To assess that, I've opened a `screen` session on the catalog EC2 instance and started the `REFRESH MATERIALIZED VIEW CONCURRENTLY image_view` command there (the session is named `matview-refresh`). I've also set a reminder to check on it in about 55 hours. Once it completes successfully, we'll have a better understanding of how to proceed.
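Once the standalone run yields a real duration, the task change might look like the following. This is only a sketch with placeholder numbers: the 55-hour figure is just the reminder window above, and the 50% padding is an assumed safety margin, not a decided value.

```python
from datetime import timedelta

# Placeholder until the screen-session run finishes and we have a
# measured duration for the ~712m-record refresh.
observed_runtime = timedelta(hours=55)

# Hypothetical task arguments: no retries, so a refresh that outlives its
# timeout is not automatically restarted on top of a still-running query,
# and a timeout padded ~50% over the observed runtime.
MATVIEW_REFRESH_TASK_ARGS = {
    "retries": 0,
    "execution_timeout": observed_runtime * 1.5,
}
```

These would be passed through to the refresh task's operator once the measurement is in.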
Given that this is merely the first step of the data refresh, we can probably expect further complications in later steps as well 😅
This was a passing thought I had that I wanted to note somewhere. Currently the ingestion server is a small Falcon app that runs most aspects of the data refresh, but then also (in staging/prod) interacts with a fleet of “indexer worker” EC2 instances when performing the Postgres -> Elasticsearch indexing.
We have plans for moving the data refresh steps from the ingestion server into Airflow. Most of these steps are operations on the various databases, so they’re not very processor-intensive on the server end. However, the indexing steps are intensive, which is why they’re spread across 6 machines in production (and even then it can take a number of hours to complete).
We could replicate this process in Airflow by setting up Celery-based workers so that the tasks run on a separate instance from the webserver/scheduler. Ultimately I’d like to go this route (or use something like the ECS Executor rather than Celery), but that’s a non-trivial effort to complete.
One other way we could accomplish this would be with ECS tasks! We could define a container specifically for the indexing step, which would expect to receive the range of records to index and all necessary connection information. We could then kick off `n` of those jobs using the `EcsRunTaskOperator` and wait for them to complete using the `EcsTaskStateSensor`. This could be done in our current setup without any new Airflow infrastructure. It would also allow us to remove the indexer workers, which currently sit idle in EC2 (albeit in the `stopped` state) until they are used.
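The per-task input described above ("the range on which to index") could be computed with a small helper before fanning out the ECS tasks. A sketch (`split_ranges` is a hypothetical helper; the actual `EcsRunTaskOperator`/`EcsTaskStateSensor` wiring is omitted):

```python
def split_ranges(total_records: int, n_workers: int) -> list:
    """Split [0, total_records) into n contiguous half-open (start, end)
    ranges, one per indexer ECS task; the last range absorbs any remainder.
    """
    step = total_records // n_workers
    ranges = []
    for i in range(n_workers):
        start = i * step
        end = total_records if i == n_workers - 1 else start + step
        ranges.append((start, end))
    return ranges

# Each (start, end) pair would become the payload for one EcsRunTaskOperator
# run, with an EcsTaskStateSensor watching each task until completion.
```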
#airflow, #data-refresh, #ecs, #infrastructure, #openverse