Openverse is a search engine for openly-licensed media. It is both a website where you can search for, discover, and learn how to use and attribute media, and an openly accessible REST API.
The Openverse team builds the Openverse Catalog, API, and front-end application, as well as integrations between Openverse and WordPress. Follow this site for updates and discussions on the project.
You can also come chat with us in #openverse on the Make WP Chat. We have a weekly developer chat at 15:00 UTC on Mondays.
The Openverse Catalog is our database of metadata about CC-licensed works from across the internet. This data is found and parsed via Common Crawl data and open APIs, and the code is located in the openverse-catalog repository.
The Openverse Catalog powers the Openverse API, which powers the Openverse frontend and the Openverse Browser Extension. For more details, see the main Openverse Frontend repository.
The main data of the Openverse Catalog is in the PostgreSQL image table and the (soon to be added) audio table, in the openledger database. They share the following schema (annotated with the source of each field), in addition to their type-specific fields.
| Column | Type | Comes From |
|---|---|---|
| identifier | UUID | DB-generated |
| created_on | timestamp with time zone | script-generated (now) |
| updated_on | timestamp with time zone | script-generated (now) |
| ingestion_type | character varying(80) | script-generated |
| provider | character varying(80) | script-generated |
| source | character varying(80) | script-generated |
| foreign_identifier | character varying(3000) | metadata |
| foreign_landing_url | character varying(1000) | metadata |
| url | character varying(3000) | metadata |
| thumbnail | character varying(3000) | metadata |
| filesize | integer | metadata |
| license | character varying(50) | metadata |
| license_version | character varying(25) | metadata |
| creator | character varying(2000) | metadata |
| creator_url | character varying(2000) | metadata |
| title | character varying(5000) | metadata |
| meta_data | jsonb | metadata |
| tags | jsonb | metadata |
| watermarked | boolean | metadata |
| last_synced_with_source | timestamp with time zone | script-generated (now) |
| removed_from_source | boolean | script-generated (f) |
Additional fields for image:
| Column | Type | Comes From |
|---|---|---|
| width | integer | metadata |
| height | integer | metadata |
Additional fields for audio:
| Column | Type | Comes From |
|---|---|---|
| duration | integer | metadata (in milliseconds) |
| bit_rate | integer | metadata |
| sample_rate | integer | metadata |
| category | | |
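To make the schema concrete, here is a minimal sketch of querying the image table with Python and psycopg2. The connection details are placeholders (the real host, credentials, and database settings come from the catalog's environment configuration), and the query simply pulls a few of the shared columns described above.

```python
import psycopg2

# Placeholder connection details; real values come from the environment
# configuration of the catalog deployment.
conn = psycopg2.connect(
    host="localhost",
    dbname="openledger",
    user="deploy",
    password="changeme",
)

with conn, conn.cursor() as cur:
    # Fetch a handful of recently updated rows for a single provider,
    # using the shared columns listed in the tables above.
    cur.execute(
        """
        SELECT identifier, foreign_landing_url, url, license, license_version
        FROM image
        WHERE provider = %s AND removed_from_source = false
        ORDER BY updated_on DESC
        LIMIT 10
        """,
        ("flickr",),
    )
    for row in cur.fetchall():
        print(row)

conn.close()
```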
Providers are sites that host CC-licensed works. We have direct partnerships with some of our providers and ingest their content through their public APIs. However, we identify new providers mainly through processing Common Crawl data.
Providers to review are tracked via GitHub issues.
The purpose of the Openverse Catalog is to facilitate the discovery of 1.4 billion pieces of CC-licensed content by leveraging open data from Common Crawl and open APIs.
Apache Airflow is used to manage the workflow for the various API ETL jobs and will sync the extracted Common Crawl image data to the upstream database.
The workflows can be managed from the Admin Panel.
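For orientation, here is a rough sketch of what such a workflow looks like as an Airflow DAG in Python. The DAG id, schedule, and `ingest_provider_data` callable are hypothetical stand-ins; the real provider workflows live in the openverse-catalog repository and are more involved. The import paths assume Airflow 2.x.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest_provider_data(**kwargs):
    """Placeholder for a provider API script that pulls metadata
    and loads it into the upstream (openledger) database."""
    pass


# Hypothetical DAG: one daily task for a provider ingestion script.
dag = DAG(
    dag_id="example_provider_workflow",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
)

ingest = PythonOperator(
    task_id="ingest_provider_data",
    python_callable=ingest_provider_data,
    dag=dag,
)
```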
1. Log into the EC2 instance that runs Airflow. Ask Zack (zackkrida@automattic.com) to add your keys if necessary: `ssh ec2-user@ec2-54-85-128-241.compute-1.amazonaws.com`
2. Change directory to cc_catalog_airflow: `cd cccatalog/src/cc_catalog_airflow`
3. If necessary, copy environment data to `.env` in this directory.
4. Fetch all remotes: `git fetch --all`
5. Check out the git tag: `git checkout tags/`
6. Run the deploy script: `./deploy_in_prod.sh`
7. (optional) Navigate to the Admin Panel to see things working (or failing to work, as the case may be).
Note: If you do this while a particular job is running, that job will fail. However, the context is saved, so the job should be retried once the new container comes up.
At the moment, the Airflow scheduler logs absolutely must be rotated weekly in order to avoid filling up the disk (thereby breaking everything that runs on the host EC2 instance, potentially even the Docker daemon). For instructions, see this page.
Common Crawl Data Pipelines

The Common Crawl corpus contains petabytes of web crawl data, and Common Crawl publishes a new dataset at the end of each month. An EMR cluster of 100 c4.8xlarge instances is configured to automatically parse the data and identify all domains that link to creativecommons.org. Spot pricing is used to keep the cost of this job under $100, and the number of instances keeps the execution time under 1 hour (±10 minutes). If the execution time increases, it may be time to scale up (or out). Benchmarking has shown that the R3/R4 instance families are also suitable for this job; however, the spot pricing fluctuates.
Name: Common Crawl ETL – Extract all domains that link to creativecommons.org and save the output in Parquet files. Queries can be performed on the data to identify potential providers for inclusion in Openverse Search. A query has been developed to help with this process and is saved in Athena (query name: Top_CC_Licensed_Domains). This query requires the user to specify the Common Crawl partition that should be analyzed.
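As an illustration of how a partition-scoped query like this can be run programmatically, here is a sketch using boto3's Athena client. The table name, column names, partition value, and results bucket are all placeholders, not the actual saved Top_CC_Licensed_Domains query.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Placeholder query: table, columns, partition value, and the results
# bucket must be replaced with the real Athena resources.
response = athena.start_query_execution(
    QueryString="""
        SELECT provider_domain, COUNT(*) AS licensed_links
        FROM commonsmapper.parsed_links
        WHERE crawl_partition = 'CC-MAIN-2021-04'
        GROUP BY provider_domain
        ORDER BY licensed_links DESC
        LIMIT 100
    """,
    QueryExecutionContext={"Database": "commonsmapper"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(response["QueryExecutionId"])
```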
All providers that pass the Provider Review Process have been integrated into another data pipeline, which extracts the image URL if it links to a Creative Commons license.
Name: Common Crawl Provider Images ETL – Extract the image URL and associated metadata, if it links to a CC license.
Description: Aggregate the most recent Common Crawl data and move it from the script's working directory to the interim Data Lake location. A partition is also created so that the data can be analyzed by partition (which helps to minimize costs).
Description: This refreshes the commonsmapper database in Athena and also loads recently added data/partitions. It runs on the 7th day of each month at 11:30 PM (after the new partition has been moved to the interim Data Lake).
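One common way to register newly added partitions with Athena is an `MSCK REPAIR TABLE` statement; whether this pipeline uses that or a Glue crawler isn't specified here, so the snippet below is only an illustrative sketch with placeholder names.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Illustrative only: scan the table's S3 location and register any
# partitions missing from the metastore. Table and bucket are placeholders.
athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE commonsmapper.parsed_links",
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
```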
This is the UI to query the partitioned Common Crawl data.
licenses_by_domain: defunct; this was used by the former Director of Product Engineering.
common_crawl_image_data: the output directory for CC-licensed images that were extracted from each provider. The directory structure is common_crawl_index/provider_name/filename.csv (see the sketch after this list).
scripts: contains the files that are used by the EMR cluster.
bootstrap: config files for the EMR cluster.
output: temporary output location for the Common Crawl Data pipeline => “Common Crawl ETL”. I manually clean up this directory after confirming that all operations were performed successfully by the AWS EMR Cluster & Glue Jobs.
log: this will be your best friend to debug any issues.
pipeline/df-01792251K0EHZMVMJQ82: log files for the “Common Crawl ETL” data pipeline.
pipeline/df-09203082Y9C4462IMYIB: log files for the “Common Crawl Provider Images ETL” data pipeline.
data: The interim Data Lake. The data is partitioned using its respective Common Crawl index.
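As a quick sketch of working with the common_crawl_image_data layout described above, the following lists the extracted CSV files for one Common Crawl index with boto3. The bucket name and the index value are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Placeholder bucket; keys follow the layout described above:
# common_crawl_image_data/<common_crawl_index>/<provider_name>/<filename>.csv
paginator = s3.get_paginator("list_objects_v2")
pages = paginator.paginate(
    Bucket="example-cc-catalog-bucket",
    Prefix="common_crawl_image_data/CC-MAIN-2021-04/",
)
for page in pages:
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"])
```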
AWS Key
The key pair name is cc-catalog. It is saved in the Shared-CC Catalog folder in LastPass, with the filename cc-catalog key.
Following internal review and discussion, we have concluded that as an organization and team, we do not wish to engage contractors or services in ways that would facilitate the development of military applications of technology. This includes, but is not limited to, the use of data that is supplied by CC through any means in connection with the services that contributes to the training of artificial intelligence and image recognition tools by the service provider.
Openverse staff will make reasonable efforts during procurement and partnership processes to evaluate service providers to determine how our data or contributions will be used including, without limitation, whether it is used in service of such initiatives. This will include evaluating contractual terms of engagement and requesting modifications that preclude such uses when feasible. If after entering an agreement, it is determined that this is the case, Openverse management will determine the best course of action, which may involve canceling the service or partnership.
Currently, we generally only get information about a given image when it is first uploaded to one of our larger providers. For popularity data, this won’t suffice; it’s not very meaningful to know how many views an image has just after it was uploaded. Thus, we have come up with a strategy that will allow us to regularly update information about images uploaded in the past, with a preference towards ‘freshening’ the information about recently uploaded images. This will also (in the future) allow us to remove images that were taken down at the source from Openverse Search. The strategy is outlined here.
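As a very rough sketch of what "freshen recent uploads first" could look like against the catalog's image table, the query below selects the least-recently-synced rows within a recent upload window. The connection details and the 90-day window are illustrative assumptions, not the parameters of the actual strategy linked above.

```python
import psycopg2

# Illustrative connection details and window size only.
conn = psycopg2.connect(host="localhost", dbname="openledger", user="deploy")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT identifier, foreign_identifier, provider
        FROM image
        WHERE created_on > now() - interval '90 days'
        ORDER BY last_synced_with_source ASC NULLS FIRST
        LIMIT 1000
        """
    )
    batch = cur.fetchall()  # candidates to re-request from the provider API
conn.close()
```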
In the long run, and in order to have ‘fair’ metrics that are comparable across different providers, it may prove fruitful to determine, e.g., how many times a given image is direct-linked on the internet. These sorts of metrics would probably be easiest to glean from Common Crawl.