Applying ECS to the ingestion server/data refresh

This was a passing thought I had that I wanted to note somewhere. Currently the ingestion server is a small Falcon app that runs most aspects of the data refresh, but then also (in staging/prod) interacts with a fleet of “indexer worker” EC2 instances when performing the Postgres -> Elasticsearch indexing.

We have plans for moving the data refresh steps from the ingestion server into Airflow. Most of these steps are operations on the various databases, so they’re not very processor-intensive on the server end. However, the indexing steps are intensive, which is why they’re spread across 6 machines in production (and even then it can take a number of hours to complete).

We could replicate this process in Airflow by setting up Celery-based workers so that the tasks run on a separate instance from the webserver/scheduler. Ultimately I’d like to go this route (or use something like the ECS Executor rather than Celery), but that’s a non-trivial effort to complete.

One other way we could accomplish this would be to use ECS tasks! We could have a container defined specifically for the indexing step, which expects to receive the range on which to index and all necessary connection information. We could then kick off n of those jobs using the EcsRunTaskOperator, and wait for completion using the EcsTaskStateSensor to determine when they complete. This could be done in our current setup without any new Airflow infrastructure. It’d also allow us to remove the indexer workers, which currently sit idle (albeit in the stopped state) in EC2 until they are used.

#airflow, #data-refresh, #ecs, #infrastructure, #openverse