Community Meeting Recap (2023-07-04)

[Meeting Start]

📝 Agenda

In this meeting we discussed the Openverse dataset, specifically in the context of sharing it with interested parties. We were joined by Aaron Gokaslan (MosaicML, Cornell University) and Apolinário (Hugging Face). The following discussion topics emerged.

  • Deduplication
    • Openverse doesn’t yet do any deduplication, but we plan to explore it in the future.
    • We also discussed what metrics to consider when determining the canonical and best version of any media item and how aggregating disparate information for duplicates across platforms can improve relevance. [Slack]
  • Synthetic captions
    • MosaicML has an AI captioning pipeline to generate synthetic captions in addition to the ground truth.
    • Openverse could utilise these captions because metadata useful for ML applications, like generated labels, would also be useful to human audiences. [Slack]
  • Dataset updates
    • Openverse is constantly collecting new metadata and identifying new images.
    • In the HF Datasets format, new data could simply be appended to the dataset, while updating existing records would essentially mean rebuilding it.
    • We discussed different approaches to handling newly added records versus updates to existing records. [Slack]
  • Academic paper
    • There are no technical academic papers referencing Openverse, either published by us or anyone else. Madison did give a talk at PyData that was related to this.
    • We discussed that publishing a paper might be interesting, and Madison’s slides could be a good jumping-off point. [Slack]
  • Conclusions
    • We’re continuing the discussion on GitHub.
    • Contributions to the conversations are welcome.
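
To make the "Dataset updates" point more concrete, here is a minimal sketch of why appending new records is cheap while updating existing ones means rewriting the dataset. It uses plain Python lists rather than any particular dataset library, and the record fields (`id`, `title`) and function names are hypothetical, chosen only for illustration. It is not how Openverse or Hugging Face actually handle updates.

```python
from typing import Dict, List, Tuple

# Hypothetical record shape: a dict with an "id" and a "title" field.
Record = Dict[str, str]


def split_batch(dataset: List[Record], batch: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Separate an incoming batch into brand-new records and changed ones."""
    existing_ids = {r["id"] for r in dataset}
    new = [r for r in batch if r["id"] not in existing_ids]
    changed = [r for r in batch if r["id"] in existing_ids]
    return new, changed


def append_records(dataset: List[Record], new: List[Record]) -> List[Record]:
    """Append-only path: existing rows are left untouched."""
    return dataset + new


def apply_updates(dataset: List[Record], changed: List[Record]) -> List[Record]:
    """Update path: every row is visited and the whole dataset is rewritten."""
    by_id = {r["id"]: r for r in changed}
    return [by_id.get(r["id"], r) for r in dataset]


dataset = [{"id": "a", "title": "Cat"}, {"id": "b", "title": "Dog"}]
batch = [{"id": "b", "title": "Dog (updated)"}, {"id": "c", "title": "Bird"}]

new, changed = split_batch(dataset, batch)
result = append_records(apply_updates(dataset, changed), new)
```

Under these assumptions, a batch that contains only new records never touches existing rows, which is why the two cases may warrant different handling.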

🔔 Reminder

Openverse contributors will host a sync video meeting to discuss priorities for July at 1500 UTC on July 5th 2023, links for which will be posted in the #openverse channel of the Making WordPress Chat.

[Meeting End]