Post-iNaturalist Data Refresh Status

We initiated our first data refresh after adding the entirety of the iNaturalist data last week, which ballooned the catalog from 590m records to ~712m. Unfortunately, the image_view matview refresh step (which normally takes 16 hours) ran far longer than expected. Here’s a link to the graph view for the run for those with access.

The matview refresh task hit its 1-day timeout but didn’t immediately stop, due to a known issue that Rebecca is working on addressing. Typically when this happens, the query completes successfully (albeit well after the intended timeout), but Airflow, having spent the entire overrun trying to fail the task, still marks it as failed. This initiated a retry of the task. Once I noticed the retry attempt, I paused the DAG and marked the task as failed. For the same reason as above, this would not actually halt the query, so I executed pg_terminate_backend against the backend running the matview refresh. This also took quite some time to complete.
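For reference, terminating the refresh looks roughly like the following. This is a minimal sketch assuming psycopg2 and placeholder connection details, not the exact commands run against the catalog instance:

import psycopg2

# Placeholder connection string; the real termination was run directly against the catalog DB.
conn = psycopg2.connect("dbname=openledger user=deploy host=localhost")
conn.autocommit = True

with conn.cursor() as cur:
    # Find the backend running the matview refresh and ask Postgres to terminate it.
    cur.execute(
        """
        SELECT pid, pg_terminate_backend(pid)
        FROM pg_stat_activity
        WHERE query ILIKE '%REFRESH MATERIALIZED VIEW%image_view%'
          AND pid <> pg_backend_pid();
        """
    )
    print(cur.fetchall())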

I wanted to verify whether the query had actually run successfully the first time, even though it went beyond its allotted timeout. I initially tried the rapid count calculation on the view, but it returned ~589m records. To rule out the possibility that the rapid count was simply working from stale tuple statistics for the view, I ran a full SELECT COUNT(*) FROM image_view. Sadly, this returned the same number:

deploy@localhost:openledger> select count(*) from image_view;
+-----------+
| count     |
|-----------|
| 589538189 |
+-----------+
SELECT 1
Time: 1623.020s (27 minutes 3 seconds), executed in: 1623.017s (27 minutes 3 seconds)

This meant that the matview refresh did not complete successfully even after running for 52 hours, which is quite disheartening.

I think our best bet is to remove retries on this task and increase the timeout to account for the larger dataset. We can’t do the latter, though, until we have a sense of how long the query will actually take now. To assess that, I’ve opened a screen session (named matview-refresh) on the catalog EC2 instance and started the REFRESH MATERIALIZED VIEW CONCURRENTLY image_view command in it. I’ve also set a reminder to check on this in about 55 hours. Once that’s able to complete successfully, we’ll have a better understanding of how to proceed.
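For the retry/timeout change, here is a minimal sketch of what the task configuration might look like, assuming the refresh is issued through something like a PostgresOperator; the DAG id, task id, connection id, and timeout value are all placeholders rather than the actual DAG code:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="image_view_matview_refresh_sketch",  # hypothetical DAG id
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    refresh_image_view = PostgresOperator(
        task_id="refresh_image_view",            # hypothetical task id
        postgres_conn_id="postgres_openledger",  # hypothetical connection id
        sql="REFRESH MATERIALIZED VIEW CONCURRENTLY image_view;",
        retries=0,  # don't automatically retry a multi-day refresh
        # Placeholder until the manual run tells us how long this actually takes now.
        execution_timeout=timedelta(days=3),
    )

The important bits are retries=0 and an execution_timeout informed by the manual run; the rest is scaffolding for the sketch.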

Given that this is merely the first step of the data refresh, we can probably expect further complications in later steps as well 😅

#airflow #data-refresh

Applying ECS to the ingestion server/data refresh

This was a passing thought I had that I wanted to note somewhere. Currently the ingestion server is a small Falcon app that runs most aspects of the data refresh and, in staging/production, also interacts with a fleet of “indexer worker” EC2 instances when performing the Postgres -> Elasticsearch indexing.

We have plans to move the data refresh steps from the ingestion server into Airflow. Most of these steps are operations on the various databases, so they’re not very processor-intensive on the server end. However, the indexing steps are intensive, which is why they’re spread across 6 machines in production (and even then they can take a number of hours to complete).

We could replicate this process in Airflow by setting up Celery-based workers so that the tasks run on a separate instance from the webserver/scheduler. Ultimately I’d like to go this route (or use something like the ECS Executor rather than Celery), but that’s a non-trivial effort.

One other way we could accomplish this would be to use ECS tasks! We could have a container defined specifically for the indexing step, which expects to receive the range of records to index and all necessary connection information. We could then kick off n of those jobs using the EcsRunTaskOperator and wait for them to finish using the EcsTaskStateSensor. This could be done in our current setup without any new Airflow infrastructure. It’d also allow us to remove the indexer workers, which currently sit idle (albeit in the stopped state) in EC2 until they’re used.
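A rough, untested sketch of the idea follows. The cluster name, task definition, container name, environment variables, and the literal task ARN passed to the sensor are all hypothetical placeholders (in practice the ARN would be wired through from the run step, e.g. via XCom), and the exact operator parameters depend on the Amazon provider version:

from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.hooks.ecs import EcsTaskStates
from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator
from airflow.providers.amazon.aws.sensors.ecs import EcsTaskStateSensor

WORKER_COUNT = 6  # matches the current number of indexer workers in production

with DAG(
    dag_id="ecs_indexing_sketch",  # hypothetical DAG id
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    for i in range(WORKER_COUNT):
        run_indexer = EcsRunTaskOperator(
            task_id=f"run_indexer_{i}",
            cluster="data-refresh",            # hypothetical ECS cluster name
            task_definition="indexer-worker",  # hypothetical task definition
            launch_type="FARGATE",
            overrides={
                "containerOverrides": [
                    {
                        "name": "indexer",  # hypothetical container name
                        "environment": [
                            # Each task receives its slice of the index range; connection
                            # info would be passed the same way (or via secrets).
                            {"name": "WORKER_INDEX", "value": str(i)},
                            {"name": "WORKER_COUNT", "value": str(WORKER_COUNT)},
                        ],
                    }
                ]
            },
            wait_for_completion=False,  # let the sensor below do the waiting
        )

        wait_for_indexer = EcsTaskStateSensor(
            task_id=f"wait_for_indexer_{i}",
            cluster="data-refresh",
            # Placeholder ARN; in practice this would come from the run step above.
            task="arn:aws:ecs:us-east-1:000000000000:task/data-refresh/EXAMPLE",
            target_state=EcsTaskStates.STOPPED,
            failure_states={EcsTaskStates.NONE},  # only fail if the task can't be found
        )

        run_indexer >> wait_for_indexer

The appeal of this shape is that the n indexer containers only exist for the duration of the run, rather than sitting around as stopped EC2 instances between data refreshes.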

#airflow, #data-refresh, #ecs, #infrastructure, #openverse