Title: data-normalization – Make Openverse – WordPress.org

---

#  Tag Archives: data-normalization

 [  ](https://profiles.wordpress.org/krysal/) [Krystle Salazar](https://profiles.wordpress.org/krysal/)
11:02 pm _on_ November 18, 2022     
Tags: data-normalization, [postgres ( 5 )](https://make.wordpress.org/openverse/tag/postgres/)

# 󠀁[Preparing the next migration of the Catalog](https://make.wordpress.org/openverse/2022/11/18/preparing-the-next-migration-of-the-catalog/)󠁿

As we have fixed and reactivated data ingestion via more provider scripts, some 
shortcomings in the data model have been uncovered. Today I went through the backlog
of Catalog’s issues and noticed we had accumulated a small pile of problems that
seems need to be addressed primarily at the database level.

For some, the discussion has already begun in [other posts](https://make.wordpress.org/openverse/2022/10/25/handling-very-large-2gb-files/),
but I wanted to gather everything I have noticed in one place, so we can discuss
if we do all the changes together, whether is better to do it in more manageable
chunks or find alternative solutions. These are disruptive and breaking changes 
given they alter the structure of the TSV files (a key part of the ingestion process)
and need to be implemented carefully, so are usually postponed until the last resort
and when is confirmed that they are necessary.

## Columns to modify or add

 * `filesize` : change type to `bigint`
 * `duration` : change type to `bigint`
 * `mature` : add a `boolean` column
 * `description` : add a `text` or `varchar(n)` column

### Issues

The need for these changes comes from this list of issues that could be concentrated
into a milestone:

 * [#730 Filesize exceeds postgres integer column maximum size](https://github.com/WordPress/openverse-catalog/issues/730)
 * [#873 Duration exceeds Postgres integer column maximum size](https://github.com/WordPress/openverse-catalog/issues/873)
 * [#364 Not sure if we have descriptions for each media item](https://github.com/WordPress/openverse-catalog/issues/364)
 * [#3 Collect mature content flag data from providers](https://github.com/WordPress/openverse-catalog/issues/3)

Also, consider the following:

 * [#235 Investigate the use of alembic for openledger migrations](https://github.com/WordPress/openverse-catalog/issues/235)

## Previous related work

We performed the addition of columns in the past, you can see the [PR for including the `category` in the image table](https://github.com/WordPress/openverse-catalog/pull/196),
and we plan to add more media types next year, so this could serve as an opportunity
to refine this process and make it simpler.

Finally, the call is to start thinking about this. There is no hurry currently, 
but the need is becoming more and more evident. What do you think? Do you see it
as viable? When could we start this endeavor?

[#data-normalization](https://make.wordpress.org/openverse/tag/data-normalization/),
[#postgres](https://make.wordpress.org/openverse/tag/postgres/)

 * [Login to Reply](https://login.wordpress.org/?redirect_to=https%3A%2F%2Fmake.wordpress.org%2Fopenverse%2F2022%2F11%2F18%2Fpreparing-the-next-migration-of-the-catalog%2F%23respond&locale=en_US)

 [  ](https://profiles.wordpress.org/aetherunbound/) [Madison Swain-Bowden](https://profiles.wordpress.org/aetherunbound/)
1:17 am _on_ October 25, 2022     
Tags: data-normalization, [postgres ( 5 )](https://make.wordpress.org/openverse/tag/postgres/),
thumbnails   

# 󠀁[Storing provider supplied thumbnails](https://make.wordpress.org/openverse/2022/10/25/storing-provider-supplied-thumbnails/)󠁿

**Note**: This is adapted from a [conversation that happened on the Make WP Slack](https://wordpress.slack.com/archives/C02012JB00N/p1666108900092929)(
see the link for more context).

## Our current situation regarding thumbnails

Thumbnails were collected by providers and put in the `thumbnail` column in the 
catalog. We have data there now, and [the column has not been dropped](https://github.com/WordPress/openverse-catalog/blob/ab90e8ee610c3bce2e04746787e2cb006bcd6960/docker/local_postgres/0003_openledger_image_schema.sql#L26).
Since [this column also exists in the API](https://github.com/WordPress/openverse-api/blob/c38ddfee022fce3f72c4894a804a5577f526abeb/api/catalog/api/models/mixins.py#L89-L97),
the data from it (for both audio and image) gets copied into the `thumbnail` column
there as well. However, [only the Audio model uses the `thumbnail` column](https://github.com/WordPress/openverse-api/blob/1666f084b81514d649f7801618836b3023690e44/api/catalog/api/views/audio_views.py#L62),
the [Image model ignores it and tries to create a scaled down thumbnail using the primary image URL](https://github.com/WordPress/openverse-api/blob/1666f084b81514d649f7801618836b3023690e44/api/catalog/api/views/image_views.py#L107-L121).
The MediaStore class in the provider [has been modified so that no new thumbnail URLs can be added to the `thumbnail` column](https://github.com/WordPress/openverse-catalog/pull/526/files),
but they can be added in the metadata under `meta_data["thumbnail_url"]`.

We have run into scenarios recently where our own attempts to create thumbnails 
from the full-size image URLs have caused consistent timeouts and prevent the records
from being shown in search results. SMK is a recent example of this: [(1)](https://github.com/WordPress/openverse-catalog/issues/494)
[(2)](https://github.com/WordPress/openverse-api/issues/895). It may be advantageous
to use the provider’s pre-computed thumbnails in these cases. Crucially, **not all
providers supply or are expected to supply thumbnails**. As such, it seems like 
we may want to reduce the width of our tables by removing this frequently-empty 
column.

## Proposal

For `image`: we first go through and copy all of the existing `thumbnail` values
in the catalog into `meta_data["thumbnail_url"]`. Then we perform a [new-table-without-column-swap](https://wordpress.slack.com/archives/C02012JB00N/p1666107428023019)
and remove the `thumbnail` column from from the table. This will help conserve space
and reduce the width of the table. Then we update the `image_view` materialized 
view to populate a `thumbnail` column for the APIAPI An API or Application Programming
Interface is a software intermediary that allows programs to interact with each 
other and share data in limited, clearly defined ways. using the `meta_data` dictionary.
An example of this is in the [Audio view’s ](https://github.com/WordPress/openverse-catalog/blob/2273271f5de608eb76ee976a026fbf5e32512791/docker/local_postgres/0007_openledger_audio_view.sql#L82)`
audio_set_foreign_identifier` . This will allow us to keep the `thumbnail` column
in the API but reduce space on the catalog, since a migrationMigration Moving the
code, database and media files for a website site from one server to another. Most
typically done when changing hosting companies. of that size on the API would likely
be untenable at the moment. We’ll then make two changes to the logic on both ends:

 1. Add `thumbnail_url` handling logic in the `ImageStore` which will put that value
    in the `meta_data` dictionary automatically
 2. Change the Image model’s logic on the API to first try and use `image.thumbnail`(
    which has been copied from `meta_data["thumbnail_url"]` during the data refresh),
    then try `image.url` when creating the scaled down image

We could potentially do the same for Audio, it looks like about half of the records
do not have thumbnails:

    ```wp-block-code
    deploy@localhost:openledger> select count(*) from audio where thumbnail is null;
    +--------+
    | count  |
    |--------|
    | 427124 |
    +--------+
    SELECT 1
    Time: 4.354s (4 seconds), executed in: 4.328s (4 seconds)
    deploy@localhost:openledger> select count(*) from audio;
    +--------+
    | count  |
    |--------|
    | 949384 |
    +--------+
    SELECT 1
    Time: 3.899s (3 seconds), executed in: 3.887s (3 seconds)
    ```

What do folks think about this approach? I think any effort to reduce the width 
of our DB tables is an important one, as it will make migrations & data management
easier down the line.

[#data-normalization](https://make.wordpress.org/openverse/tag/data-normalization/),
[#postgres](https://make.wordpress.org/openverse/tag/postgres/), [#thumbnails](https://make.wordpress.org/openverse/tag/thumbnails/)

 * [Login to Reply](https://login.wordpress.org/?redirect_to=https%3A%2F%2Fmake.wordpress.org%2Fopenverse%2F2022%2F10%2F25%2Fstoring-provider-supplied-thumbnails%2F%23respond&locale=en_US)

 [  ](https://profiles.wordpress.org/aetherunbound/) [Madison Swain-Bowden](https://profiles.wordpress.org/aetherunbound/)
9:02 pm _on_ October 10, 2022     
Tags: 3d-models, data-normalization, [provider ( 2 )](https://make.wordpress.org/openverse/tag/provider/)

# 󠀁[Next steps for Thingiverse data](https://make.wordpress.org/openverse/2022/10/10/next-steps-for-thingiverse-data/)󠁿

As we close out the [provider DAG refactor milestone](https://github.com/WordPress/openverse-catalog/milestone/8),
one of the providers we intended to refactor was [Thingiverse](https://github.com/WordPress/openverse-catalog/issues/596).
Thingiverse currently doesn’t have a [legacy provider DAG configuration](https://github.com/WordPress/openverse-catalog/blob/85a1db073aac9dcf6d03d0c0937693765c563ac1/openverse_catalog/dags/providers/provider_workflows.py#L22),
but we do have [data in the API under the image media type](https://api.openverse.engineering/v1/images/?source=thingiverse)
for it.

What complicates Thingiverse is that it is fundamentally a source of CC licensed
3D models, and we already have it [slated as a 3D model provider](https://github.com/WordPress/openverse-catalog/issues/473)
once that project goes underway, yet we currently have existing data in the `images`
table for it.

That to say, I’m struggling to determine what our next steps should be for this 
provider. It seems like it would not be ideal to continue ingesting this provider
under the “image” media type, given that we’ll want to distinguish it as a 3D model
provider later down the line. However, I also have some hesitancy around deleting
the 32,659 records we already have in the `image` table as folks may be referencing
those results. On the other hand it is only 32k records though and the results themselves
are probably not as useful _as images _because they’re all 3D renders ([example](https://search.openverse.engineering/search/image?q=maul&source=thingiverse));
perhaps it’s reasonable to remove them in favor of including them in the 3D models
index down the line.

I’m open to other thoughts, but I’d like to propose that we:

 1. Retire the existing provider script (3D models will look similar but will likely
    grab a lot of other fields).
 2. Delete the `thingiverse` data from the `image` table in the catalog.
 3. Remove the “Thingiverse” provider from the APIAPI An API or Application Programming
    Interface is a software intermediary that allows programs to interact with each
    other and share data in limited, clearly defined ways. search results via the Django
    Admin UIUI UI is an acronym for User Interface - the layout of the page the user
    interacts with. Think ‘how are they doing that’ and less about what they are doing..

Do folks have concern with these steps? Any alternative proposals?

[#data-normalization](https://make.wordpress.org/openverse/tag/data-normalization/)
[#3d-models](https://make.wordpress.org/openverse/tag/3d-models/) [#provider](https://make.wordpress.org/openverse/tag/provider/)

 * [Login to Reply](https://login.wordpress.org/?redirect_to=https%3A%2F%2Fmake.wordpress.org%2Fopenverse%2F2022%2F10%2F10%2Fnext-steps-for-thingiverse-data%2F%23respond&locale=en_US)

 [  ](https://profiles.wordpress.org/aetherunbound/) [Madison Swain-Bowden](https://profiles.wordpress.org/aetherunbound/)
11:10 pm _on_ September 30, 2022     
Tags: data-normalization, [provider ( 2 )](https://make.wordpress.org/openverse/tag/provider/)

# 󠀁[Next steps for Walters Art Museum data](https://make.wordpress.org/openverse/2022/09/30/next-steps-for-walters-art-museum-data/)󠁿

Today I attempted to refactor the Walters Art Museum provider APIAPI An API or Application
Programming Interface is a software intermediary that allows programs to interact
with each other and share data in limited, clearly defined ways. script (see [this GitHub issue](https://github.com/WordPress/openverse-catalog/issues/597)).
While working on this refactor, I noticed that I could neither use the [testing sandbox provided by the API](http://api.thewalters.org/help/index)
nor [create a user account to receive an API key](https://github.com/WaltersArtMuseum/walters-api/issues/75).
We have tried [reaching out a number of times](https://github.com/WordPress/openverse-catalog/issues/386)
over the past year to ask for the CC Search API key to no avail.

As it stands, we have no way of confirming that the API could be accessible once
this DAG is turned on. We only have 16,948 records in the catalog/API (confirmed
in both places). The last update to the API codebase was made on [August 7th, 2015](https://github.com/WaltersArtMuseum/walters-api/commits/master),
and the last update to any of our data was December 1st, 2020. The media that our
data references still exists AFAICT.

Given all this context, I propose that we:

 1. Create a one-off script to populate `height`, `width`, `filesize`, and `filetype`(
    see the [filesize/filtype](https://github.com/WordPress/openverse-catalog/issues/536)
    and [height/width](https://github.com/WordPress/openverse-catalog/issues/647) backfill
    GitHubGitHub GitHub is a website that offers online implementation of git repositories
    that can easily be shared, copied and modified by other developers. Public repositories
    are free to host, private repositories require a paid subscription. GitHub introduced
    the concept of the ‘pull request’ where code changes done in branches by contributors
    can be reviewed and discussed before being merged by the repository owner. [https://github.com/](https://github.com/)
    issues). This can likely be done without an API key using the direct image URLs
    we have in our database.
 2. Move the Walters provider script into the [Retired DAGs directory](https://github.com/WordPress/openverse-catalog/tree/main/openverse_catalog/dags/retired)
    and decommission the DAG.

It does not seem likely that API will become accessible to us again in the near 
future. The backfills described above would at least allow us to have the minimum
data we’d like to have now as part of our ongoing [data normalization effort](https://github.com/WordPress/openverse-catalog/milestone/4)
and allow us to continue to serve the data we have in the API.

What do y’all think?

[#data-normalization](https://make.wordpress.org/openverse/tag/data-normalization/),
[#provider](https://make.wordpress.org/openverse/tag/provider/)

 * [Login to Reply](https://login.wordpress.org/?redirect_to=https%3A%2F%2Fmake.wordpress.org%2Fopenverse%2F2022%2F09%2F30%2Fnext-steps-for-walters-art-museum-data%2F%23respond&locale=en_US)