Storing provider supplied thumbnails

Note: This is adapted from a conversation that happened on the Make WP Slack (see the link for more context).

Our current situation regarding thumbnails

Thumbnails were collected by providers and put in the thumbnail column in the catalog. We have data there now, and the column has not been dropped. Since this column also exists in the API, the data from it (for both audio and image) gets copied into the thumbnail column there as well. However, only the Audio model uses the thumbnail column; the Image model ignores it and tries to create a scaled-down thumbnail using the primary image URL. The MediaStore class in the provider has been modified so that no new thumbnail URLs can be added to the thumbnail column, but they can be added in the metadata under meta_data["thumbnail_url"].

We have run into scenarios recently where our own attempts to create thumbnails from the full-size image URLs have caused consistent timeouts and prevented the records from being shown in search results. SMK is a recent example of this: (1) (2). It may be advantageous to use the provider’s pre-computed thumbnails in these cases. Crucially, not all providers supply or are expected to supply thumbnails. As such, it seems like we may want to reduce the width of our tables by removing this frequently-empty column.

Proposal

For image: we first go through and copy all of the existing thumbnail values in the catalog into meta_data["thumbnail_url"]. Then we perform a new-table-without-column-swap and remove the thumbnail column from the table. This will help conserve space and reduce the width of the table. Then we update the image_view materialized view to populate a thumbnail column for the API using the meta_data dictionary. An example of this is in the Audio view’s audio_set_foreign_identifier. This will allow us to keep the thumbnail column in the API but reduce space on the catalog, since a migration of that size on the API would likely be untenable at the moment. We’ll then make two changes to the logic on both ends:

  1. Add thumbnail_url handling logic in the ImageStore which will put that value in the meta_data dictionary automatically
  2. Change the Image model’s logic on the API to first try to use image.thumbnail (which has been copied from meta_data["thumbnail_url"] during the data refresh), then try image.url when creating the scaled-down image
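
The two changes above could be sketched roughly as follows. This is only an illustration, not the actual ImageStore or Image model code: the function names and the plain dict standing in for a catalog row are hypothetical, and only the thumbnail, url and meta_data["thumbnail_url"] names come from the proposal.

```python
from typing import List, Optional


def store_thumbnail_in_meta(record: dict) -> dict:
    """Catalog side (hypothetical helper): move a provider-supplied
    thumbnail out of its own field and into meta_data["thumbnail_url"]."""
    cleaned = dict(record)
    meta = dict(cleaned.get("meta_data") or {})
    thumbnail = cleaned.pop("thumbnail", None)
    # Keep an existing meta_data value rather than overwriting it.
    if thumbnail and "thumbnail_url" not in meta:
        meta["thumbnail_url"] = thumbnail
    cleaned["meta_data"] = meta
    return cleaned


def thumbnail_candidates(thumbnail: Optional[str], url: str) -> List[str]:
    """API side (hypothetical helper): URLs to try, in order, when
    creating the scaled-down image. The provider thumbnail comes first,
    the full-size image URL is the fallback."""
    return [candidate for candidate in (thumbnail, url) if candidate]
```

A record without a provider thumbnail simply yields the full-size URL as its only candidate, which matches the current behaviour for images.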

We could potentially do the same for Audio; it looks like about half of the records do not have thumbnails:

deploy@localhost:openledger> select count(*) from audio where thumbnail is null;
+--------+
| count  |
|--------|
| 427124 |
+--------+
SELECT 1
Time: 4.354s (4 seconds), executed in: 4.328s (4 seconds)
deploy@localhost:openledger> select count(*) from audio;
+--------+
| count  |
|--------|
| 949384 |
+--------+
SELECT 1
Time: 3.899s (3 seconds), executed in: 3.887s (3 seconds)
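
For reference, the share implied by the two counts above can be worked out directly. The variable names here are just for illustration.

```python
# Counts taken from the two queries above.
audio_without_thumbnail = 427_124
audio_total = 949_384

share_missing = audio_without_thumbnail / audio_total
audio_with_thumbnail = audio_total - audio_without_thumbnail

print(f"{share_missing:.1%} of audio records have no thumbnail")
print(f"{audio_with_thumbnail} audio records do have one")
```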

What do folks think about this approach? I think any effort to reduce the width of our DB tables is an important one, as it will make migrations & data management easier down the line.

#data-normalization, #postgres, #thumbnails

Community Meeting Recap (18 October 2022)

[Slack: Meeting start]

🎉 Done!

  • Retirement of legacy TSV loading workflow [PR] [Slack]
  • Community PR to fix the button height [Issue] [Slack]
  • Community improvements to the CONTRIBUTING.md file [Issue] [Slack]
  • Improvements to the Recent Searches feature [Issue#1] [Issue#2] [Slack]
  • UI/UX improvements to the header [PR#1] [PR#2] [Slack]

👀 Needs review

wordpress.org (Openverse theme)

Frontend

Catalog

🚧 In progress/To Do

📒 Agenda

  • Thumbnails in the catalog and API [Slack: Start] [Slack: End]
  • WordPress.org theme redirect (covered in 👀 Needs review above)

[Slack: Meeting end]

#openverse-weekly-community-meeting

Community Meeting Recap (11 October 2022)

Meeting start

🎉 Done!

👀 Needs review

🚧 In progress/To Do

📒 Agenda

#openverse-weekly-community-meeting

Next steps for Thingiverse data

As we close out the provider DAG refactor milestone, one of the providers we intended to refactor was Thingiverse. Thingiverse currently doesn’t have a legacy provider DAG configuration, but we do have data in the API under the image media type for it.

What complicates Thingiverse is that it is fundamentally a source of CC licensed 3D models, and we already have it slated as a 3D model provider once that project goes underway, yet we currently have existing data in the images table for it.

That is to say, I’m struggling to determine what our next steps should be for this provider. It seems like it would not be ideal to continue ingesting this provider under the “image” media type, given that we’ll want to distinguish it as a 3D model provider later down the line. However, I also have some hesitancy around deleting the 32,659 records we already have in the image table, as folks may be referencing those results. On the other hand, it is only 32k records, and the results themselves are probably not as useful as images because they’re all 3D renders (example); perhaps it’s reasonable to remove them in favor of including them in the 3D models index down the line.

I’m open to other thoughts, but I’d like to propose that we:

  1. Retire the existing provider script (3D models will look similar but will likely grab a lot of other fields).
  2. Delete the thingiverse data from the image table in the catalog.
  3. Remove the “Thingiverse” provider from the API search results via the Django Admin UI.

Do folks have concerns with these steps? Any alternative proposals?

#data-normalization #3d-models #provider

Recap: Priorities Meeting 2022-10-05

Members of the Openverse community met over video and audio today to discuss Openverse priorities for the month of October. We discussed three major things (you can read the raw notes here):

  1. Status updates of current projects
  2. What we’ll be working on this month
  3. New projects we’re excited to start

The need for better project documentation

One of the first things we mentioned was the need for deeper project documentation in Make. There isn’t one easy place to view current projects at a glance. @beccawidom mentioned how useful the Dag Status page is and how it could be a good model for a projects page.

iFrame Removal

The project is ongoing and progressing quickly. Development is well underway and can be viewed on the staging site with the ?ff_new_header=on flag enabled. Here is an example link.

@olgabulat completed the mobile version of the ‘recent searches’ feature today. There are a few open PRs in need of review:

and more needing to be started:

Provider DAG Refactors

This project is also ongoing and advancing nicely. The tracking milestone has the latest updates and assignments. The project continues to reveal issues with existing DAGs, from faulty logic, to unhandled errors, to data integrity concerns.

Infrastructure Improvements

There’s ongoing work to switch to a single load balancer for all applications.

Browse and Insert Openverse Media in WP Core

This project began serendipitously from a community PR to add Openverse browsing to Gutenberg. Since then there’s been more discussion and some designs created.

Data normalization

We talked about this project a lot and decided we should devote time to planning this month. @krysal plans to share a draft of a plan to help clarify the scope and goals of the work.

Upcoming projects

We discussed how the iFrame removal project won’t have much active development work in the coming weeks, so it would be a good time to start planning for other projects. We identified two:

  • The session state cookie milestone would be great to prevent UI pop-in and for future storage of user state.
  • @zackkrida also wants to start Analytics and will kick off the project with an RFC for frontend event collection.

We also discussed the Content Safety project at length. It’s very important work but we don’t have a clear lead for that project yet. As @aetherunbound mentioned, it would be worthwhile and possible this month to start scoping out this project. Examples Madison gave are to gather and create issues, and write an RFC for the project.

Action Items

Finally, we decided on some next steps to keep work moving.

  • @krysal: Share a draft of the Data Normalization plan
  • @aetherunbound: Create a new DAG Stability and Bugfixes project and kickoff post
  • @zackkrida: Create a new Analytics project 
  • @zackkrida: Write a frontend event collection (analytics) RFC
  • @zackkrida: Create an RFC request for handling removed images/dead links
  • All: Identify a lead for the Content Safety project

#priorities #openverse-priorities

iNaturalist Retrospective Recap

Time: Wednesday, September 28th at 20:00 UTC
Attendees: @beccawidom, @krysal, @sarayourfriend, @aetherunbound

We held a retrospective for discussing & processing the recent iNaturalist provider ingestion script contribution from @beccawidom. We spent several minutes writing sticky notes of our thoughts, then spent a few more minutes grouping them and voting on them. We then had a discussion on the various groups and the points/action items which arose from each group.

We did not get to all groups and also wanted to allow for asynchronous discussion. Below are the notes from that meeting, but I will also create discussion comments for each major group so we can aggregate responses in that manner.

Stickies

Screenshot of the stickies which were generated

Notes

Big Picture Questions

  • What are the big questions?
    • Are we going to stay in Postgres?
      • If we’re switching the infrastructure to be flat-files, should we spend time optimizing it? -> Yes
      • Though we want to remove Postgres, we should plan for it to stick around for a few years
    • Is non-API consumption/bulk-import a priority going forward? -> Yes

Communication transparency

  • Moving a lot of our internal communication to the Make WP blog would be nice, but we need better tools for facilitating that.
  • DMs vs public is a hard distinction for new contributors
    • It can be difficult to determine what is a Python question vs an Openverse question
    • Public discussions could generally be better since more people could contribute
    • Rebecca now feels confident enough to throw some code into GitHub with potential questions and use that as a vector for communication & discussion
    • What would help improve this process or make it easier for people to get more comfortable with the various communication channels?
  • A public “project thread” seemed desirable for Rebecca; maybe project threads in general could go in the Make blog so the public could see what we’re working on
    • GitHub issues preclude nested comments because everything is flat; discussions could be used, but we only have 1 level of nesting
    • What tool could we use to do project threads publicly?
      • Could we use P2? Could Make become P2? Could we have a public P2?
    • We do have GitHub issues or milestones that reflect the project thread, but still don’t have the content/conversationality of the project thread

Documentation

  • Lots of different things were new for Rebecca at the same time, not just Openverse-specific
    • Perhaps we could have a “contributor onboarding process”?

#community-contributors, #inaturalist, #retrospective

Next steps for Walters Art Museum data

Today I attempted to refactor the Walters Art Museum provider API script (see this GitHub issue). While working on this refactor, I noticed that I could neither use the testing sandbox provided by the API nor create a user account to receive an API key. We have tried reaching out a number of times over the past year to ask for the CC Search API key to no avail.

As it stands, we have no way of confirming that the API could be accessible once this DAG is turned on. We only have 16,948 records in the catalog/API (confirmed in both places). The last update to the API codebase was made on August 7th, 2015, and the last update to any of our data was December 1st, 2020. The media that our data references still exists AFAICT.

Given all this context, I propose that we:

  1. Create a one-off script to populate height, width, filesize, and filetype (see the filesize/filetype and height/width backfill GitHub issues). This can likely be done without an API key using the direct image URLs we have in our database.
  2. Move the Walters provider script into the Retired DAGs directory and decommission the DAG.
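
As a rough idea of what the backfill in step 1 might compute per image, here is a minimal sketch that derives the four fields from raw image bytes using only the standard library. Everything in it is an assumption for illustration: probe_image is a hypothetical name, the bytes would have to be downloaded from the direct image URL first, only PNG and simple JPEG layouts are handled, and the "jpg"/"png" filetype values are placeholders. A real script would more likely lean on an imaging library.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
# JPEG start-of-frame markers carry the image dimensions.
# 0xC4, 0xC8 and 0xCC sit in the same range but are not frame headers.
SOF_MARKERS = set(range(0xC0, 0xD0)) - {0xC4, 0xC8, 0xCC}


def probe_image(data: bytes) -> dict:
    """Return filesize, filetype, width and height for image bytes
    (hypothetical helper; unknown formats leave the last three as None)."""
    info = {"filesize": len(data), "filetype": None, "width": None, "height": None}
    if len(data) >= 24 and data[:8] == PNG_SIGNATURE and data[12:16] == b"IHDR":
        info["filetype"] = "png"
        # IHDR stores width then height as big-endian 32-bit integers.
        info["width"], info["height"] = struct.unpack(">II", data[16:24])
    elif data[:2] == b"\xff\xd8":
        info["filetype"] = "jpg"
        pos = 2
        # Walk the segments until a start-of-frame header turns up.
        while pos + 9 <= len(data) and data[pos] == 0xFF:
            marker = data[pos + 1]
            if marker == 0xFF:  # padding byte before a marker
                pos += 1
                continue
            if marker in SOF_MARKERS:
                # Layout: marker (2), length (2), precision (1), height (2), width (2).
                info["height"], info["width"] = struct.unpack(">HH", data[pos + 5 : pos + 9])
                break
            (length,) = struct.unpack(">H", data[pos + 2 : pos + 4])
            pos += 2 + length
    return info
```

The returned dict maps directly onto the height, width, filesize and filetype columns named in step 1; rows whose bytes cannot be parsed would be left for manual review.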

It does not seem likely that the API will become accessible to us again in the near future. The backfills described above would at least allow us to have the minimum data we’d like to have now as part of our ongoing data normalization effort and allow us to continue to serve the data we have in the API.

What do y’all think?

#data-normalization, #provider

Openverse Monthly Priorities Meeting 2022-10-05

Openverse contributors will host a community meeting to discuss priorities for the month of October at 1500 UTC on 2022-10-05.

A sync video chat link will be provided. We hope to see you there!

You can read the notes document for these meetings to catch up on past discussions.

Community Meeting Recap (27 September 2022)

Meeting start

🎉 Done!

👀 Needs review

🚧 In progress/To Do

Meeting end

#openverse-weekly-community-meeting

Community Meeting Recap (20 September 2022)

Meeting start

🎉 Done!

👀 Needs review

🚧 In progress/To Do

💬 Agenda discussion

Meeting end

#openverse-weekly-community-meeting