We initiated our first data refresh after adding the entirety of the iNaturalist data this last week, ballooning the catalog from 590m records to ~712m. Unfortunately, the image_view
matview refresh step (which normally takes 16 hours) went far longer than expected. Here’s a link to the graph view for the run for those with access.
The matview refresh task hit the timeout of 1 day, but didn’t immediately stop due to a known issue that Rebecca is working on addressing. Typically when this happens, the query completes successfully (albeit well after the intended timeout) but Airflow, having attempted to issue a task failure for the entire duration the task went over the timeout, marks the task as failed. This initiated a retry on the task. Once I noticed the retry attempt, I proceeded to pause the DAG and mark the task as failed. For the same reason as just mentioned, this would not actually halt the query, so I went and executed a pg_terminate_backend
on the matview refresh. This took quite some time to actually execute as well.
I wanted to verify whether or not the query ran successfully the first time even if it went beyond its alloyed timeout. I initially tried the rapid count calculation on the view, but it returned ~589m records. I wanted to verify that this was not an issue with tuples failing to be updated on the view for the rapid count, so I ran a full SELECT COUNT(*) FROM image_view
. Sadly, this also returned that amount:
deploy@localhost:openledger> select count(*) from image_view;
+-----------+
| count |
|-----------|
| 589538189 |
+-----------+
SELECT 1
Time: 1623.020s (27 minutes 3 seconds), executed in: 1623.017s (27 minutes 3 seconds)
This meant that the matview refresh did not complete successfully even after 52 hours running, which is quite disheartening.
I think our best bet is to remove retries on this task and increase the timeout to account for the larger data. We cannot do the latter though until we have a sense of how long the query will actually take now. In order to assess that, I’ve opened up a screen
session on the catalog EC2 instance and began the REFRESH MATERIALIZED VIEW CONCURRENTLY image_view
command on it (the session is named matview-refresh
). I’ve also set a reminder to check on this in about 55 hours. Once that’s able to complete successfully, we’ll have a better understanding of how to proceed.
Given that this is merely the first step of the data refresh, we can probably expect further complications in later steps as well 😅
#airflow #data-refresh