We call the following endpoints from Wikimedia Commons:
https://commons.wikimedia.org/w/api.php?action=query&generator=allimages&prop=imageinfo&gailimit=40&gaisort=timestamp&gaistart=YYYY-MM-DDT00:00:00Z&gaiend=YYYY-MM-DDT00:00:00Z&iiprop=url|user|dimensions|extmetadata&iiurlwidth=300&format=jsonhttps://commons.wikimedia.org/w/api.php?action=query&generator=allimages&prop=imageinfo&gailimit=40&gaisort=timestamp&gaistart=YYYY-MM-DDT00:00:00Z&gaiend=YYYY-MM-DDT00:00:00Z&iiprop=url|user|dimensions|extmetadata&iiurlwidth=300&format=json&gaicontinue=SOME_PICTURE.jpg
For both of these, we replace YYYY-MM-DD with dates (we only pull data for images uploaded/updated between these dates). We use the second query if there are more than 40 images between the specified dates, giving a continue location as the next picture (represented by SOME_PICTURE.jpg). These requests each return a JSONJSON JSON, or JavaScript Object Notation, is a minimal, readable format for structuring data. It is used primarily to transmit data between a server and web application, as an alternative to XML. of the following form:
{
"batchcomplete": "",
"continue": {
"gaicontinue": "20191101000510|James_Allen_da_Luz.png",
"continue": "gaicontinue||"
},
"query": {
"pages": {
"83547663": {
"pageid": 83547663,
"ns": 6,
"title": "File:'Buurpraatje', SK-A-2607.jpg",
"imagerepository": "local",
"imageinfo": [
{
"user": "Mr.Nostalgic",
"size": 1353326,
"width": 3746,
"height": 2400,
"thumburl": "https://upload.wikimedia.org/wikipedia/commons/thumb/a/a2/%27Buurpraatje%27%2C_SK-A-2607.jpg/300px-%27Buurpraatje%27%2C_SK-A-2607.jpg",
"thumbwidth": 300,
"thumbheight": 192,
"url": "https://upload.wikimedia.org/wikipedia/commons/a/a2/%27Buurpraatje%27%2C_SK-A-2607.jpg",
"descriptionurl": "https://commons.wikimedia.org/wiki/File:%27Buurpraatje%27,_SK-A-2607.jpg",
"descriptionshorturl": "https://commons.wikimedia.org/w/index.php?curid=83547663",
"extmetadata": {
"DateTime": {
"value": "2019-11-01 00:04:22",
"source": "mediawiki-metadata",
"hidden": ""
},
"ObjectName": {
"value": "'Buurpraatje', SK-A-2607",
"source": "mediawiki-metadata",
"hidden": ""
},
"CommonsMetadataExtension": {
"value": 1.2,
"source": "extension",
"hidden": ""
},
"Categories": {
"value": "1897 paintings|19th-century paintings in the Rijksmuseum Amsterdam|CC-Zero|Paintings by Jozef Israëls in the Rijksmuseum Amsterdam",
"source": "commons-categories",
"hidden": ""
},
"Assessments": {
"value": "",
"source": "commons-categories",
"hidden": ""
},
"ImageDescription": {
"value": "<br><br><big><b>Identificatie</b></big><br><b>Titel(s): '</b>Buurpraatje'<br><b>Objecttype:</b> schilderij <br><b>Objectnummer:</b> SK-A-2607<br><b>Opschriften / Merken:</b> signatuur, linksonder: ‘Jozef Israels’<br><b>Omschrijving:</b> 'Buurpraatje'. Twee vrouwen maken een praatje voor een boerenwoning. De linker vrouw is geleund op het houten hek om het erf van de woning. Bij het hek staan enkele dunne bomen.<br><br><big><b>Vervaardiging</b></big><br><b>Vervaardiger:</b> schilder: Jozef Israëls<br><b>Datering:</b> 1897<br><b>Fysieke kenmerken:</b> olieverf op doek<br><b>Materiaal:</b> doek olieverf <br><b>Afmetingen:</b> drager: h 40,5 cm. × b 62,8 cm. × d 3,4 cm. (incl. achterkantbescherming)buitenmaat: d 12,5 cm. (drager incl. SK-L-3257)<br><br><big><b>Verwerving en rechten</b></big><br><b>Credit line:</b> Schenking van de heer en mevrouw Drucker-Fraser, Montreux<br><b>Verwerving:</b> schenking 4-mei-1912<br><b>Copyright:</b> Publiek domein",
"source": "commons-desc-page"
},
"DateTimeOriginal": {
"value": "1897",
"source": "commons-desc-page"
},
"Credit": {
"value": "<a rel=\"nofollow\" class=\"external free\" href=\"http://hdl.handle.net/10934/RM0001.COLLECT.7868\">http://hdl.handle.net/10934/RM0001.COLLECT.7868</a>",
"source": "commons-desc-page",
"hidden": ""
},
"Artist": {
"value": "Rijksmuseum",
"source": "commons-desc-page"
},
"LicenseShortName": {
"value": "CC0",
"source": "commons-desc-page",
"hidden": ""
},
"UsageTerms": {
"value": "Creative Commons Zero, Public Domain Dedication",
"source": "commons-desc-page",
"hidden": ""
},
"AttributionRequired": {
"value": "false",
"source": "commons-desc-page",
"hidden": ""
},
"LicenseUrl": {
"value": "http://creativecommons.org/publicdomain/zero/1.0/deed.en",
"source": "commons-desc-page",
"hidden": ""
},
"Copyrighted": {
"value": "True",
"source": "commons-desc-page",
"hidden": ""
},
"Restrictions": {
"value": "",
"source": "commons-desc-page",
"hidden": ""
},
"License": {
"value": "cc0",
"source": "commons-templates",
"hidden": ""
}
}
}
]
},
...
"83547651": {
"pageid": 83547651,
"ns": 6,
"title": "File:Wuppertal, Marienstr. 99.jpg",
"imagerepository": "local",
"imageinfo": [
{
"user": "Im Fokus",
"size": 4395166,
"width": 2736,
"height": 3648,
"thumburl": "https://upload.wikimedia.org/wikipedia/commons/thumb/8/8d/Wuppertal%2C_Marienstr._99.jpg/300px-Wuppertal%2C_Marienstr._99.jpg",
"thumbwidth": 300,
"thumbheight": 400,
"url": "https://upload.wikimedia.org/wikipedia/commons/8/8d/Wuppertal%2C_Marienstr._99.jpg",
"descriptionurl": "https://commons.wikimedia.org/wiki/File:Wuppertal,_Marienstr._99.jpg",
"descriptionshorturl": "https://commons.wikimedia.org/w/index.php?curid=83547651",
"extmetadata": {
"DateTime": {
"value": "2019-11-01 00:02:50",
"source": "mediawiki-metadata",
"hidden": ""
},
"ObjectName": {
"value": "Wuppertal, Marienstr. 99",
"source": "mediawiki-metadata",
"hidden": ""
},
"CommonsMetadataExtension": {
"value": 1.2,
"source": "extension",
"hidden": ""
},
"Categories": {
"value": "2019 photographs of Wuppertal|Marienstraße 99 (Wuppertal)|Self-published work",
"source": "commons-categories",
"hidden": ""
},
"Assessments": {
"value": "",
"source": "commons-categories",
"hidden": ""
},
"ImageDescription": {
"value": "Wuppertal, Wohnquartier Nordstadt, Marienstr. 99",
"source": "commons-desc-page"
},
"DateTimeOriginal": {
"value": "2019-10-24 14:03:52",
"source": "commons-desc-page"
},
"Credit": {
"value": "<span class=\"int-own-work\" lang=\"en\">Own work</span>",
"source": "commons-desc-page",
"hidden": ""
},
"Artist": {
"value": "<a href=\"//commons.wikimedia.org/wiki/User:Im_Fokus\" title=\"User:Im Fokus\">Im Fokus</a>",
"source": "commons-desc-page"
},
"LicenseShortName": {
"value": "CC BY-SA 4.0",
"source": "commons-desc-page",
"hidden": ""
},
"UsageTerms": {
"value": "Creative Commons Attribution-Share Alike 4.0",
"source": "commons-desc-page",
"hidden": ""
},
"AttributionRequired": {
"value": "true",
"source": "commons-desc-page",
"hidden": ""
},
"LicenseUrl": {
"value": "https://creativecommons.org/licenses/by-sa/4.0",
"source": "commons-desc-page",
"hidden": ""
},
"Copyrighted": {
"value": "True",
"source": "commons-desc-page",
"hidden": ""
},
"Restrictions": {
"value": "",
"source": "commons-desc-page",
"hidden": ""
},
"License": {
"value": "cc-by-sa-4.0",
"source": "commons-templates",
"hidden": ""
}
}
}
]
}
}
}
}
We have elided a number of entries for brevity. Note the continue.gaicontinue field in the json. This will be used in the second request listed above. Below is a table showing the mapping from the data contained in such a json to columns in the image table in PostgreSQL. Fields from the json are preceded by $ to mark them. We have omitted the prefix query.pages.XXXXXX (where XXXXXX is the key for a given page) since it is identical for each field.
Column | Comes From
-------------------------|-------------------------------
foreign_identifier | $pageid
foreign_landing_url | $imageinfo[0].descriptionshorturl
url | $imageinfo[0].url
thumbnail | $imageinfo[0].thumburl
width | $imageinfo[0].width
height | $imageinfo[0].height
license | derived from $imageinfo[0].extmetadata.LicenseUrl.value
license_version | derived from $imageinfo[0].extmetadata.LicenseUrl.value
creator | derived from $imageinfo[0].extmetadata.Artist.value
creator_url | derived from $imageinfo[0].extmetadata.Artist.value
title | $title
meta_data | See below
meta_data field
The meta_data field is a JSON of the following form:
{
"description": $imageinfo[0].extmetadata.ImageDescription.value (stripped of html tags)
}