ecs – Make Openverse – WordPress.org

Welcome!

Openverse is a search engine for openly-licensed media.

The Openverse team builds the Openverse Catalog, API, and front-end application, as well as integrations between Openverse and WordPress. Follow this site for updates and discussions on the project.

You can also come chat with us in #openverse on the Make WP Chat. We have a weekly developer chat at 15:00 UTC on Mondays.

If you’re a new contributor, welcome! Have a look at our good first issues or our guide for new contributors.

This is a passing thought that I wanted to note somewhere. Currently, the ingestion server is a small Falcon app that runs most aspects of the data refresh, but in staging and production it also coordinates a fleet of “indexer worker” EC2 instances when performing the Postgres -> Elasticsearch indexing.

We have plans for moving the data refresh steps from the ingestion server into Airflow. Most of these steps are operations on the various databases, so they’re not very processor-intensive on the server end. However, the indexing steps are intensive, which is why they’re spread across 6 machines in production (and even then it can take a number of hours to complete).

We could replicate this process in Airflow by setting up Celery-based workers so that the tasks run on a separate instance from the webserver/scheduler. Ultimately I’d like to go this route (or use something like the ECS Executor rather than Celery), but that’s a non-trivial effort to complete.

One other way we could accomplish this would be to use ECS tasks! We could have a container defined specifically for the indexing step, which expects to receive the range of records to index and all necessary connection information. We could then kick off n of those jobs using the EcsRunTaskOperator and use the EcsTaskStateSensor to determine when they complete. This could be done in our current setup without any new Airflow infrastructure. It’d also allow us to remove the indexer workers, which currently sit idle (albeit in the stopped state) in EC2 until they are used.
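To make the "n jobs, each with its own range" idea a bit more concrete, here is a minimal sketch of how the ranges and per-task container overrides might be computed. The function names, the environment variable names (`INDEX_START_ID`, `INDEX_END_ID`, `ELASTICSEARCH_URL`), and the container name are all hypothetical; only the `containerOverrides` shape comes from the ECS RunTask API. It also assumes the records have contiguous-ish integer IDs, which may not hold exactly.

```python
from math import ceil


def split_ranges(start_id: int, end_id: int, n_tasks: int) -> list[tuple[int, int]]:
    """Split the half-open interval [start_id, end_id) into at most
    n_tasks contiguous chunks of roughly equal size."""
    if n_tasks < 1:
        raise ValueError("n_tasks must be at least 1")
    total = end_id - start_id
    if total <= 0:
        return []
    chunk = ceil(total / n_tasks)
    return [
        (lo, min(lo + chunk, end_id))
        for lo in range(start_id, end_id, chunk)
    ]


def build_overrides(container_name: str, start: int, end: int, es_url: str) -> dict:
    """Build an ECS RunTask `overrides` payload for one indexing task.

    The environment variable names are assumptions made for this sketch;
    the indexing container would need to read whatever names we settle on.
    """
    return {
        "containerOverrides": [
            {
                "name": container_name,
                "environment": [
                    # ECS expects environment values to be strings.
                    {"name": "INDEX_START_ID", "value": str(start)},
                    {"name": "INDEX_END_ID", "value": str(end)},
                    {"name": "ELASTICSEARCH_URL", "value": es_url},
                ],
            }
        ]
    }


if __name__ == "__main__":
    # Hypothetical values: 6 tasks, matching the current production worker count.
    for start, end in split_ranges(0, 100, 6):
        print(build_overrides("indexer", start, end, "http://es:9200"))
```

Each resulting dictionary could then be passed as the `overrides` argument to one EcsRunTaskOperator instance, with one operator per range. Sensitive connection details would probably be better placed in the task definition's secrets than in plain environment overrides.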

#airflow, #data-refresh, #ecs, #infrastructure, #openverse



Applying ECS to the ingestion server/data refresh