Provider ingestion script refactor guide

This document describes code patterns/changes to consider while upgrading a provider ingestion script.

As part of Catalog milestone v1.3.2, we are going through our current set of provider API ingestion scripts and refactoring them to use the new ProviderDataIngester class. These refactors are improving our ability to make a single change across all provider scripts and have given us more “levers” for controlling ingestion flow, reporting, etc.

While we iterate through these refactors, it can be useful to take this time to improve general code health. We’ve had some lovely improvements to data normalization & cleaning on the script end via the MediaStore class, and Python itself has come a long way since the start of the catalog. At the same time, refactors should aim to change the behavior of the code as little as possible, so try to avoid changing or adding data.

When refactoring a module, please consider the following:

  • Remove bespoke URL cleaning/normalization methods in favor of the MediaStore’s inherent URL cleaning. [Example]
  • When adding items to the meta_data JSONB column, avoid storing any keys that have a value of None – this provides little value for Elasticsearch and bloats the space required to store the record. [Example]
  • When extracting information from a returned API record, catch missing required data as early as possible. This will prevent unnecessary processing when we know that the record will be invalid/uningestable anyway. [Example]
  • Remove log lines that may be excessive or noisy (e.g. logging the license_url of every record). [Example]
  • Ensure that a missing required field (e.g. foreign_identifier) doesn’t default to an empty string; instead, skip further processing of the record entirely. [Example]
  • Use staticmethods where possible for scope reduction & ease of testing. [Example]
  • Remove bespoke filetype parsing methods in favor of MediaStore’s filetype extraction. [Example]
  • If overriding the __init__ function of a provider data ingester class, be sure to have the override accept **kwargs and pass those kwargs into the super().__init__ call. Also avoid naming the superclass explicitly, as Python 3 will handle the method resolution order on its own. [Example]
  • Use self.get_response_json for any additional requests that need to be made, so the ingester’s delay and timeout settings are respected and so changes to the entrypoint down the line can affect all requests. [Example]
  • Provider APIs may have changed since the first scripts were written (e.g. phylopic), so run the DAG locally for testing if possible.
  • Take a look at the image and audio stores for info on the data we’re looking for from providers. Open new issues as appropriate to ensure that we’re using all of the available information and that information is being directed into the right fields.
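
The points above about catching missing required data early and not defaulting required fields to an empty string can be sketched roughly as follows. This is a minimal illustration, not code from an actual provider script: the function name and the record keys ("id", "url", "title") are assumptions chosen for the example.

```python
def get_record_data(record):
    """Return a dict of extracted fields, or None if the record is unusable.

    Hypothetical example: the keys "id", "url" and "title" are assumed,
    real provider APIs will use their own field names.
    """
    # Check required fields first so invalid records exit before any other work.
    foreign_identifier = record.get("id")
    if foreign_identifier in (None, ""):
        return None  # no empty-string default; skip the record entirely

    image_url = record.get("url")
    if image_url in (None, ""):
        return None

    # Optional fields are only read once the record is known to be ingestable.
    return {
        "foreign_identifier": foreign_identifier,
        "image_url": image_url,
        "title": record.get("title"),
    }
```

Returning None as soon as a required value is missing keeps the rest of the method free of defensive checks and avoids spending time on records that would be rejected anyway.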
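
For the meta_data and staticmethod points, one possible shape is a small static helper that builds the dictionary and drops None values before it is stored. The class below is a stand-in written for this example (it does not inherit from the real ingester class), and the metadata keys are assumed.

```python
class ExampleIngester:
    """Hypothetical ingester fragment, shown without its real base class."""

    @staticmethod
    def _get_meta_data(record):
        # A static method needs no instance state, so tests can call it
        # directly without constructing an ingester.
        candidates = {
            "description": record.get("description"),
            "category": record.get("category"),
            "views": record.get("views"),
        }
        # Drop only None; falsy but meaningful values such as 0 are kept.
        return {key: value for key, value in candidates.items() if value is not None}
```

A test can then exercise the helper on its own, for example `ExampleIngester._get_meta_data({"description": "d", "views": 0})`, which keeps the zero view count and omits the missing category.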
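
The __init__ and get_response_json points might look something like the sketch below. BaseIngester here is a simplified stand-in so the example runs on its own; its constructor arguments and the get_response_json signature are assumptions and should be checked against the real ProviderDataIngester class. The endpoint URL and the get_extra_details method are likewise invented for illustration.

```python
class BaseIngester:
    """Simplified stand-in for the real base class (signatures are assumed)."""

    def __init__(self, conf=None, date=None):
        self.conf = conf
        self.date = date

    def get_response_json(self, query_params, endpoint=None, **kwargs):
        # The real method would perform the HTTP request with the shared
        # delay and timeout settings; this stub just echoes its inputs.
        return {"endpoint": endpoint, "query_params": query_params}


class ExampleIngester(BaseIngester):
    def __init__(self, **kwargs):
        # Forward every keyword argument, and let super() resolve the parent
        # class instead of naming it explicitly.
        super().__init__(**kwargs)
        self.detail_endpoint = "https://api.example.com/detail"  # hypothetical

    def get_extra_details(self, foreign_identifier):
        # Route additional requests through the shared helper rather than
        # calling an HTTP library directly.
        return self.get_response_json(
            {"id": foreign_identifier}, endpoint=self.detail_endpoint
        )
```

Because the override forwards **kwargs untouched, new arguments added to the base class later reach it without each provider script needing an update.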
