Openverse Catalog

About

Openverse Catalog is our database of metadata about CC-licensed works from across the internet. This data is found and parsed via Common Crawl data and open APIs, and the code is located in the openverse-catalog repository.

The Openverse Catalog powers the Openverse API, which powers the Openverse frontend and the Openverse Browser Extension. For more details, see the main Openverse Frontend repository.

Data Tables

The main data of the Openverse Catalog is in the PostgreSQL image table and the (soon to be added) audio table, in the openledger database. They share the following schema (annotated with the source of each column), in addition to their media-specific fields.

Column                    Type                       Comes From
identifier                uuid                       DB-generated
created_on                timestamp with time zone   script-generated (now)
updated_on                timestamp with time zone   script-generated (now)
ingestion_type            character varying(80)      script-generated
provider                  character varying(80)      script-generated
source                    character varying(80)      script-generated
foreign_identifier        character varying(3000)    metadata
foreign_landing_url       character varying(1000)    metadata
url                       character varying(3000)    metadata
thumbnail                 character varying(3000)    metadata
filesize                  integer                    metadata
license                   character varying(50)      metadata
license_version           character varying(25)      metadata
creator                   character varying(2000)    metadata
creator_url               character varying(2000)    metadata
title                     character varying(5000)    metadata
meta_data                 jsonb                      metadata
tags                      jsonb                      metadata
watermarked               boolean                    metadata
last_synced_with_source   timestamp with time zone   script-generated (now)
removed_from_source       boolean                    script-generated (f)

Additional fields for image:

Column   Type      Comes From
width    integer   metadata
height   integer   metadata

Additional fields for audio:

Column        Type                      Comes From
duration      integer                   metadata (in milliseconds)
bit_rate      integer                   metadata
sample_rate   integer                   metadata
category      character varying(3000)   metadata
genres        character varying(80)[]   metadata
audio_set     jsonb                     metadata
alt_files     jsonb                     metadata
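
For illustration, here is a minimal Python sketch of querying the image table with psycopg2. The connection parameters and the provider value are placeholders, not real credentials; note that jsonb columns such as tags come back as native Python objects.

```python
import psycopg2

# All connection parameters here are placeholders; real credentials live in
# the deployment environment, not in this handbook.
conn = psycopg2.connect(
    dbname="openledger",
    host="localhost",
    user="deploy",
    password="example-password",
)

with conn, conn.cursor() as cur:
    # jsonb columns such as `tags` and `meta_data` are returned as native
    # Python lists/dicts, so no manual JSON parsing is needed.
    cur.execute(
        """
        SELECT identifier, provider, url, license, license_version, tags
        FROM image
        WHERE provider = %s
        LIMIT 5
        """,
        ("flickr",),
    )
    for row in cur.fetchall():
        print(row)
```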

Providers

Providers are sites that host CC-licensed works. We have direct partnerships with some of our providers and ingest their content through their public APIs. However, we identify new providers mainly through processing Common Crawl data.

The current providers are as follows:

Providers to review are tracked via GitHub issues.

Data Process Flow & Management

The purpose of the Openverse Catalog is to facilitate the discovery of 1.4 billion CC-licensed works by leveraging open data from Common Crawl and open APIs.

Airflow – Workflow Management

Apache Airflow is used to manage the workflow for the various API ETL jobs and will sync the extracted Common Crawl image data to the upstream database.

The workflows can be managed from the Admin Panel.
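
As a rough illustration of the workflow shape, here is a minimal Airflow DAG sketch in the style of the provider API scripts. The DAG id, callable, and schedule below are invented for the example and are not taken from the repository.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def ingest_provider_data(**context):
    # Hypothetical stand-in for a real provider API script.
    print("Ingesting data for execution date {}".format(context["ds"]))


dag = DAG(
    dag_id="example_provider_workflow",  # illustrative name only
    default_args={
        "owner": "data-eng-admin",
        "retries": 2,
        "retry_delay": timedelta(minutes=15),
    },
    schedule_interval="@daily",
    start_date=datetime(2021, 1, 1),
    catchup=False,
)

ingest_task = PythonOperator(
    task_id="ingest_provider_data",
    python_callable=ingest_provider_data,
    provide_context=True,  # Airflow 1.10.x convention
    dag=dag,
)
```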

How to Deploy New Airflow DAGs or Provider API Scripts to Production

  1. Log into the EC2 instance running Airflow. Ask Zack (zackkrida@automattic.com) to add your keys if necessary.
    ssh ec2-user@ec2-54-85-128-241.compute-1.amazonaws.com
  2. Change directory to cc_catalog_airflow
    cd cccatalog/src/cc_catalog_airflow
  3. If necessary, copy environment data to .env in this directory
  4. Fetch all remotes
    git fetch --all
  5. Check out the desired git tag
    git checkout tags/<tag-name>
  6. Run the deploy script: ./deploy_in_prod.sh
  7. (optional) Navigate to the Admin Panel to see things working (or failing to work, as the case may be)

Note: if you do this while a particular job is running, that job will fail. Its context will be saved, however, so the job should be retried once the new container comes up.

Weekly Airflow log rotation

At the moment, the Airflow scheduler logs absolutely must be rotated weekly in order to avoid filling up the disk (thereby breaking everything that runs on the host EC2 instance, potentially even the Docker daemon). For instructions, see this page.
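
As one possible approach (not necessarily the one documented on the linked page), a small script can prune scheduler log directories older than a week. The log path below assumes Airflow's default per-day directory layout and is a placeholder for the actual host path.

```python
import shutil
import time
from pathlib import Path

# Assumption: scheduler logs live in date-named subdirectories under this
# path (Airflow's default layout); adjust for the actual host.
LOG_ROOT = Path("/home/ec2-user/airflow/logs/scheduler")
MAX_AGE_SECONDS = 7 * 24 * 60 * 60  # one week


def prune_old_scheduler_logs():
    now = time.time()
    for day_dir in LOG_ROOT.iterdir():
        # Remove any per-day log directory not modified within a week.
        if day_dir.is_dir() and now - day_dir.stat().st_mtime > MAX_AGE_SECONDS:
            shutil.rmtree(day_dir)
            print("Removed {}".format(day_dir))


if __name__ == "__main__":
    prune_old_scheduler_logs()
```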

AWS Managed Services

  • Common Crawl Data Pipelines – The Common Crawl corpus contains petabytes of web crawl data. Common Crawl publishes a new dataset at the end of each month. An EMR cluster of 100 c4.8xlarge instances is configured to automatically parse the data to identify all domains that link to creativecommons.org (a simplified sketch of this extraction step follows this list). Spot pricing is used to keep the cost of this job under $100, and the number of instances keeps the execution time under 1 hour (±10 minutes). If the execution time increases, it may be time to scale up (or out). Benchmarking has shown that the R3/R4 instance families are also suitable for this job; however, their spot pricing fluctuates.
    • Name: Common Crawl ETL – Extract all domains that link to creativecommons.org and save the output in Parquet files. Queries can be performed against this data to identify potential providers for inclusion in Openverse Search. A query has been developed to help with this process and is saved in Athena.
      • Query name: Top_CC_Licensed_Domains. This query requires the user to specify the Common Crawl partition that should be analyzed.
    All providers that pass the Provider Review Process have been integrated into another data pipeline, which extracts the image URL if it links to a Creative Commons license.
    • Name: Common Crawl Provider Images ETL – Extract the image URL and associated metadata, if it links to a CC license.
  • Glue Job
    • Name: ProcessCommonCrawlData
    • Description: Aggregate the most recent common crawl data and move it from the script’s working directory to the interim Data Lake location. A partition is also created, so that the data can be analyzed by partitions (which helps to minimize costs).
  • Glue ETL Trigger
    • Name: trg_ProcessCommonCrawlData
    • Description: Trigger the execution of the “ProcessCommonCrawlData” job. This runs on the 7th day of each month at 11 PM.
  • Glue Crawler
    • Name: LoadCommonCrawlData
    • Description: This refreshes the commonsmapper database in Athena and will also load recently added data/partitions. This runs on the 7th day of each month at 11:30 PM (after the new partition has been moved to the interim Data Lake).
  • Athena
    • This is the UI used to query the partitioned Common Crawl data.
  • S3 buckets
    • commonsmapper
      • licenses_by_domain: defunct; this was used by the former Director of Product Engineering.
      • common_crawl_image_data: the output directory for CC-licensed images that were extracted from each provider. The directory structure is common_crawl_index/provider_name/filename.csv
    • commonsmapper-v2
      • scripts: contains the files that are used by the EMR cluster.
      • bootstrap: config files for the EMR cluster.
      • output: temporary output location for the Common Crawl data pipeline => “Common Crawl ETL”. I manually clean up this directory after confirming that all operations were performed successfully by the AWS EMR cluster & Glue jobs.
      • log: this will be your best friend when debugging any issues.
        • pipeline/df-01792251K0EHZMVMJQ82: log files for the “Common Crawl ETL” data pipeline.
        • pipeline/df-09203082Y9C4462IMYIB: log files for the “Common Crawl Provider Images ETL” data pipeline.
      • data: The interim Data Lake. The data is partitioned using its respective Common Crawl index.
  • AWS Key
    • Key pair name is cc-catalog. Saved in the Shared-CC Catalog folder in LastPass, filename => cc-catalog key.
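
To make the “Common Crawl ETL” step above more concrete, here is a heavily simplified PySpark sketch of the domain-extraction logic. It assumes the crawl's outbound links have already been flattened into (page_url, target_url) rows; the input and output paths are placeholders, not the pipeline's real locations.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cc-domain-extract").getOrCreate()

# Assumption: outbound links from the crawl have already been flattened
# into (page_url, target_url) rows; both paths below are placeholders.
links = spark.read.parquet("s3://example-bucket/common-crawl-links/")

# Keep only links that point at creativecommons.org.
cc_links = links.filter(F.col("target_url").contains("creativecommons.org"))

# Extract the linking page's domain and count CC links per domain.
domains = (
    cc_links
    .withColumn("domain", F.regexp_extract("page_url", r"https?://([^/]+)", 1))
    .groupBy("domain")
    .count()
    .orderBy(F.desc("count"))
)

domains.write.mode("overwrite").parquet("s3://example-bucket/cc-domains/")
```

The production job distributes this same aggregation across the 100-instance EMR cluster described above.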

Archive

  • rsync.net is used to store archived Common Crawl data.

Research

Internal AI policy

Following internal review and discussion, we have concluded that, as an organization and team, we do not wish to engage contractors or services in ways that would facilitate the development of military applications of technology. This includes, but is not limited to, any use of data supplied by CC in connection with those services that contributes to the training of artificial intelligence and image recognition tools by the service provider.

Openverse staff will make reasonable efforts during procurement and partnership processes to evaluate service providers and determine how our data or contributions will be used, including, without limitation, whether they are used in service of such initiatives. This will include evaluating contractual terms of engagement and requesting modifications that preclude such uses when feasible. If, after entering an agreement, it is determined that our data is being used in this way, Openverse management will determine the best course of action, which may involve canceling the service or partnership.

Regular reingestion of current image information

Currently, we generally only get information about a given image when it is first uploaded to one of our larger providers. For popularity data, this won’t suffice; it’s not very meaningful to know how many views an image had just after it was uploaded. Thus, we have come up with a strategy that lets us regularly update information about images uploaded in the past, with a preference towards ‘freshening’ the information about recently uploaded images. This will also (in the future) allow us to remove images that were taken down at the source from Openverse Search. The strategy is outlined here.
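
As an illustration of the idea (the actual schedule lives in the linked strategy document), an age-weighted refresh schedule might look like the sketch below. The tier boundaries are invented for the example.

```python
from datetime import date, timedelta

# Illustrative tiers only; the real boundaries are defined in the strategy
# document linked above, not here.
REINGESTION_TIERS = [
    (timedelta(days=30), timedelta(days=1)),    # < 1 month old: refresh daily
    (timedelta(days=365), timedelta(days=30)),  # < 1 year old: refresh monthly
]
OLDEST_TIER_INTERVAL = timedelta(days=180)      # everything older: twice a year


def refresh_interval(uploaded_on: date, today: date) -> timedelta:
    """Return how often an image's metadata should be re-fetched,
    preferring recently uploaded images."""
    age = today - uploaded_on
    for max_age, interval in REINGESTION_TIERS:
        if age < max_age:
            return interval
    return OLDEST_TIER_INTERVAL


print(refresh_interval(date(2021, 5, 1), date(2021, 5, 10)))  # 1 day, 0:00:00
print(refresh_interval(date(2015, 5, 1), date(2021, 5, 10)))  # 180 days, 0:00:00
```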

Long-term image ranking ideas

In the long run, and in order to have ‘fair’ metrics that are comparable across different providers, it may prove fruitful to determine, e.g., how many times a given image is direct-linked on the internet. These sorts of metrics would probably be easiest to glean from Common Crawl.
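
A toy sketch of one such metric: count the number of distinct domains that direct-link each image, which is less gameable than raw link counts. The input pairs below are stand-in data for what would, in practice, be extracted from Common Crawl.

```python
from collections import Counter
from urllib.parse import urlparse

# Stand-in data: (linking_page_url, image_url) pairs that would, in
# practice, come from Common Crawl.
links = [
    ("https://blog.example.com/post", "https://provider.org/img/123.jpg"),
    ("https://news.example.net/story", "https://provider.org/img/123.jpg"),
    ("https://blog.example.com/post", "https://provider.org/img/456.jpg"),
    ("https://blog.example.com/other", "https://provider.org/img/123.jpg"),
]


def direct_link_counts(pairs):
    """Count distinct linking domains per image, so a single site
    hot-linking an image many times only counts once."""
    seen = {(urlparse(page).netloc, image) for page, image in pairs}
    return Counter(image for _, image in seen)


print(direct_link_counts(links))
# Counter({'https://provider.org/img/123.jpg': 2,
#          'https://provider.org/img/456.jpg': 1})
```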

Community Involvement

We would like to increase the level of community contribution to Openverse Catalog. Details can be found on the Openverse Catalog Community Involvement page.
