Post-iNaturalist Data Refresh Status

We initiated our first data refresh after adding the entirety of the iNaturalist data this last week, ballooning the catalog from 590m records to ~712m. Unfortunately, the image_view matview refresh step (which normally takes 16 hours) went far longer than expected. Here’s a link to the graph view for the run for those with access.

The matview refresh task hit the timeout of 1 day, but didn’t immediately stop due to a known issue that Rebecca is working on addressing. Typically when this happens, the query completes successfully (albeit well after the intended timeout) but Airflow, having attempted to issue a task failure for the entire duration the task went over the timeout, marks the task as failed. This initiated a retry on the task. Once I noticed the retry attempt, I proceeded to pause the DAG and mark the task as failed. For the same reason as just mentioned, this would not actually halt the query, so I went and executed a pg_terminate_backend on the matview refresh. This took quite some time to actually execute as well.

I wanted to verify whether or not the query ran successfully the first time even if it went beyond its alloyed timeout. I initially tried the rapid count calculation on the view, but it returned ~589m records. I wanted to verify that this was not an issue with tuples failing to be updated on the view for the rapid count, so I ran a full SELECT COUNT(*) FROM image_view. Sadly, this also returned that amount:

deploy@localhost:openledger> select count(*) from image_view;
+-----------+
| count     |
|-----------|
| 589538189 |
+-----------+
SELECT 1
Time: 1623.020s (27 minutes 3 seconds), executed in: 1623.017s (27 minutes 3 seconds)

This meant that the matview refresh did not complete successfully even after 52 hours running, which is quite disheartening.

I think our best bet is to remove retries on this task and increase the timeout to account for the larger data. We cannot do the latter though until we have a sense of how long the query will actually take now. In order to assess that, I’ve opened up a screen session on the catalog EC2 instance and began the REFRESH MATERIALIZED VIEW CONCURRENTLY image_view command on it (the session is named matview-refresh). I’ve also set a reminder to check on this in about 55 hours. Once that’s able to complete successfully, we’ll have a better understanding of how to proceed.

Given that this is merely the first step of the data refresh, we can probably expect further complications in later steps as well 😅

#airflow #data-refresh