DAG Status Information

This document is a living record of which DAGs are disabled or unstable, and why. It helps us track the issues a DAG might have and know which DAGs can be turned back on, and when.

A “Disabled” DAG is turned off in production. An “Unstable” DAG is turned on (often in order to consume partial data) but is raising expected or known errors.

Note: The DAG column links to Airflow directly, which is not currently publicly accessible. We are working on improving our Role-Based Access setup in order to allow community members to view Airflow without an account. We’ll make an announcement on the Make WP blog once this is ready!

| DAG | Status | Reason | Last Updated |
| --- | --- | --- | --- |
| Common crawl | Disabled | Common crawl infrastructure and processing is not yet set up | |
| Flickr | Disabled | Strange bug in the provider API which causes us to receive many hundreds of thousands of duplicate records, tracked in our repo here | 2022-10-25 |
| Finnish Museums | Disabled | The pull_data task is consistently timing out, and was most recently manually failed after running >15 days. We need to fix the issue with the timeout and potentially convert this to a dated dag. | |
| Image expiration | Disabled | Not yet investigated | |
| Oauth | Disabled | Oauth not set up or needed for providers yet | |
| iNaturalist | Disabled | Paused while we improve load performance | 2022-10-27 |
| WordPress Photo Directory | Disabled | Filesize retrieval and page iteration are both broken | 2022-11-01 |
| Audio data refresh | Disabled | Both data refreshes were paused due to the image popularity refresh taking 4 days longer than expected to run and failing to finish. The image data refresh is being attempted again, but we are holding off on restarting the audio refresh until image completes. | 2022-11-29 |
| Wikimedia Commons Workflow | Unstable | Wikimedia results are often timing out due to very large batches. The DAG is enabled because partial data is still ingested | 2022-11-29 |
| Wikimedia Reingestion | Unstable | The large batch issue causes timeouts on multiple individual ingestion days; each one results in a Slack notification. This will be resolved either when the large batch issue is fixed, or by implementing an option to aggregate reingestion errors into a single Slack message. | 2022-11-29 |

Last updated: