Wikidata:Requests for permissions/Bot/MsynBot 12
- The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
- Approved--Ymblanter (talk) 20:56, 23 February 2023 (UTC)
MsynBot (talk • contribs • new items • new lexemes • SUL • Block log • User rights log • User rights • xtools)
Operator: MisterSynergy (talk • contribs • logs)
Task/s: import missing GND ID (P227) identifier based on linked VIAF cluster
Code: there is currently an offline script to determine cases (245,000 ready to go, roughly 80,000 more to evaluate); the editing script is here at PAWS, but will be run on a local machine later.
Function details: This is based on a user request in Topic:Xc75x92e4ojr3yo7 on my user talk page, where it has already been discussed. The bot will import GND identifiers to Wikidata based on VIAF clusters that are already linked from Wikidata via VIAF ID (P214). Current limitations are: items about humans only; the GND to be imported is not yet used anywhere in Wikidata; the item does not have any GND claim yet; the year of birth is identical in the GND database and in Wikidata. With this setup, there are roughly 245,000 GND identifiers to be imported. I would later discuss with the involved users whether follow-up GND imports should be done as well using a similar strategy; however, involved users tend to be rather conservative regarding automated matching, so I do not want to promise too much. All imported identifiers will carry a reference that indicates that VIAF cluster matching was done, and it will contain a link to the VIAF dump it was taken from.
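The four limitations listed above can be read as a simple filter over candidate matches. The following is a minimal sketch of that reading, not the operator's actual script: the `Candidate` record, its field names and `is_importable` are hypothetical, and it assumes the facts about each item (human or not, existing GND claims, birth years from both sources) have already been collected elsewhere.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    """Hypothetical record for one item/GND pair found via a shared VIAF cluster."""
    qid: str                         # Wikidata item ID
    gnd_id: str                      # GND identifier found in the linked VIAF cluster
    is_human: bool                   # item is about a human
    item_has_gnd_claim: bool         # item already carries any GND ID (P227) claim
    gnd_used_elsewhere: bool         # this GND is already used somewhere in Wikidata
    birth_year_wikidata: Optional[int]
    birth_year_gnd: Optional[int]


def is_importable(c: Candidate) -> bool:
    """Apply the four limitations from the request; reject on any doubt."""
    if not c.is_human:
        return False
    if c.item_has_gnd_claim or c.gnd_used_elsewhere:
        return False
    # A missing birth year on either side cannot be "identical", so it is rejected.
    if c.birth_year_wikidata is None or c.birth_year_gnd is None:
        return False
    return c.birth_year_wikidata == c.birth_year_gnd
```

Treating a missing birth year as a rejection is an assumption of this sketch; it simply errs on the conservative side.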
Interested users could be @Kolja21, Wurgl, Emu, Epìdosis, Humanisator and potentially others. —MisterSynergy (talk) 18:24, 14 February 2023 (UTC)
- Support I've checked test cases. Very good result. --Kolja21 (talk) 18:31, 14 February 2023 (UTC)
- Support, limitations seem good (and could be a good example for future imports of other IDs from VIAF, I think); I look forward to seeing the test edits :) --Epìdosis 18:43, 14 February 2023 (UTC) P.S. I think it is implicit, but I would specify that of course deprecated VIAF ID (P214) should be excluded
- Yes, deprecated VIAF identifiers are being ignored of course. —MisterSynergy (talk) 19:42, 14 February 2023 (UTC)
- neutral – I have checked a few random cases and then checked those where the day/month of birth differs. For all those cases I would have manually added the GND too. So I see no problem. Neutral, because I have absolutely no idea about the bot policy here. --Wurgl (talk) 18:50, 14 February 2023 (UTC)
- @Wurgl: We manage to enter ~12,000 GNDs per year in deWP. The bot run therefore corresponds to roughly 20 years of manual/intellectual work. Because of the restriction to persons with life dates, a similarly good result can nevertheless be expected. --Kolja21 (talk) 18:58, 14 February 2023 (UTC)
- Support looks like a careful way to massively improve the GND coverage in WD. Thank you! Humanisator (talk) 04:08, 15 February 2023 (UTC)
- Comment Could a first run involve only clusters that contain only GND and WD? How many items would that be? Just in case something goes wrong somewhere else, then one would have this part on the safer side? Humanisator (talk) 04:33, 15 February 2023 (UTC)
- Imho it's sufficient if the import takes place in packages. The maintenance category de:Kategorie:Wikipedia:GND in Wikipedia fehlt, in Wikidata vorhanden will show us, if there are errors. --Kolja21 (talk) 07:14, 15 February 2023 (UTC)
- The maintenance category will show articles, not errors, and only if an article for the item exists in de-Wikipedia and a GND is missing in Wikipedia; i.e. if Wikipedia contains the same error or there is no article, nothing will be shown there. The birthday check is useful, but the conflated cluster https://viaf.org/viaf/163533083/ Brieger, Adolf, 1832-1912 mixed with Bergsten, Axel Gabriel, 1832-1912 (detected at maintenance page [1]) is one where the year of birth is the same. In the sample, among the first 10 clusters one had 17 sources, one had 8 and the rest fewer. So a packaging of the VIAF clusters to be used could for example be:
- GND-WD only - safest
- GND-WD-ISNI - ISNI often based on a VIAF contributor, so here maybe GND
- GND-WD and x1 to x2 others - let users check 50 writes, if OK, proceed with the next package
- GND-WD and x2+ others
- (There is also another class: VIAF linked from WD containing only GND https://viaf.org/viaf/472145858090523021993/#Orbach,_Raymond_Lee_1934- history: DNB|1089234848 add 2016-03-21T08:21:45.764171+00:00, WKP|Q7298945 delete 2022-08-28T12:26:10.687086+00:00, WKP added to https://viaf.org/viaf/632531/#Orbach,_Raymond_L.) Humanisator (talk) 10:59, 18 February 2023 (UTC)
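The packaging proposed above amounts to sorting each cluster into a tier by which sources it contains. A rough sketch of that idea follows; it is an illustration only, not something agreed in this discussion. The function name `package_tier` is hypothetical, the source codes `DNB` and `WKP` are taken from the VIAF history quoted above, `ISNI` as a source code is an assumption, and the thresholds are left open as in the proposal: `x2` must be supplied by the caller, and the lower bound x1 is collapsed into "at least one other source".

```python
from typing import Optional

# Source codes as they appear in VIAF link data; "ISNI" is an assumption here.
GND, WIKIDATA, ISNI = "DNB", "WKP", "ISNI"


def package_tier(sources: set[str], x2: int) -> Optional[int]:
    """Return the package (1 = safest ... 4) for a cluster, or None if it is out of scope.

    sources: the set of source codes present in one VIAF cluster.
    x2: upper bound on "other" sources for tier 3; deliberately has no default.
    """
    if GND not in sources or WIKIDATA not in sources:
        return None          # cluster does not link GND and Wikidata at all
    others = sources - {GND, WIKIDATA}
    if not others:
        return 1             # GND-WD only
    if others == {ISNI}:
        return 2             # GND-WD-ISNI
    if len(others) <= x2:
        return 3             # GND-WD and a limited number of others
    return 4                 # GND-WD and more than x2 others
```

Each tier could then be imported as its own batch, with a manual check of a sample before moving on to the next one.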
- There are very poor clusters in VIAF. Cluster A (https://viaf.org/viaf/295932749/#Alois_Niederalt) is a conflation containing Caudill, Mildred from LC. Caudill, Mildred is also in cluster B (https://viaf.org/viaf/5044167625770403660000/#Caudill,_Mildred), which contains only GND, and in cluster C (https://viaf.org/viaf/43377917/#Caudill,_Mildred), which contains only ISNI. But the ISNI record is based on only one source, namely LC, and lists the title "Helmand-Arghandab Valley, 1969"; the title and LC are in cluster A, so LC and ISNI have been put into different VIAF clusters. VIAF contains very poor splits and very poor conflations. Humanisator (talk) 15:55, 15 February 2023 (UTC)
- @Humanisator: Q95849#P214 marked as conflation since 2021, so no problem for the bot. "de:Kategorie:Wikipedia:GND in Wikipedia fehlt, in Wikidata vorhanden will show us, if there are errors" of course doesn't mean that these GNDs are wrong. These are the articles we will check and then we will see, if there are errors. Kolja21 (talk) 20:17, 15 February 2023 (UTC)
- VIAF wrong since 2014 KrBot, then copied? by APPERbot to dewiki ... see Talk:Q95849#VIAF. A lot can go wrong until something becomes marked as conflation in WD. Humanisator (talk) 09:37, 16 February 2023 (UTC)
- Currently I do not have VIAF cluster size or membership information available beyond GND identifiers. Computing this needs more memory than is available on my machine (or a completely new script with a different approach that makes detours to avoid memory usage).
- Only ~250 out of 245,000 affected items have a dewiki sitelink (~0.1%).
- —MisterSynergy (talk) 20:05, 15 February 2023 (UTC)
- Do you have a command line where you could run zcat, fgrep, sed, sort? If yes, I can try to write the commands to put the information into tabulated text files, or one file.
- Or could you process https://viaf.org/viaf/295932749/justlinks.json ? Humanisator (talk) 09:58, 16 February 2023 (UTC)
- Run the bot or another tool with the ~250, so there is already some progress and maybe errors can be detected? Would you have the permission to add the 250 now?
- Humanisator (talk) 09:50, 16 February 2023 (UTC)
- There are 38 million VIAF clusters (as far as I am aware), so I cannot download them individually. There is a monthly published cluster file available via https://viaf.org/viaf/data/ (currently 8.5 GB uncompressed) which I am using.
- In the meantime I got around the memory limitations by parsing the source file line by line, without loading it into memory. The extracted subset that I need for cluster evaluation is still very large, but it seems to fit now. It is computationally still a heavy operation and I do not have results to present yet. —MisterSynergy (talk) 10:34, 16 February 2023 (UTC)
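The line-by-line approach described above can be sketched as follows. This is not the operator's script: the function names are hypothetical, and the sketch assumes the gzipped links file has one link per line in the form cluster URI, a tab, then `SOURCE|value` (for example a `DNB|...` or `WKP|...` entry).

```python
import gzip
from collections import defaultdict
from typing import Iterator


def iter_links(path: str) -> Iterator[tuple[str, str, str]]:
    """Stream (cluster_id, source, value) triples without loading the file into memory."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            cluster_uri, _, link = line.rstrip("\n").partition("\t")
            source, sep, value = link.partition("|")
            if not sep:
                continue  # skip lines that do not match the assumed layout
            cluster_id = cluster_uri.rstrip("/").rsplit("/", 1)[-1]
            yield cluster_id, source, value


def sources_per_cluster(path: str, wanted_clusters: set[str]) -> dict[str, set[str]]:
    """Collect the source codes of selected clusters only, to keep the result small."""
    result: dict[str, set[str]] = defaultdict(set)
    for cluster_id, source, _ in iter_links(path):
        if cluster_id in wanted_clusters:
            result[cluster_id].add(source)
    return dict(result)
```

Restricting the result to clusters already linked from Wikidata keeps memory use proportional to the subset of interest rather than to the whole dump.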
- re "There are 38 million VIAF clusters (as much as I am aware), so I cannot download them individually." that wasn't proposed.
- re "There is a monthly published cluster file available via https://viaf.org/viaf/data/ (currently 8.5 GB uncompressed) which I am using." that is what I thought.
- re "I got around the memory limitations meanwhile by parsing the source file line by line, without loading it to memory." that was what "zcat, fgrep, sed, sort" targeted. These commands were the fastest I found to extract and manipulate lines from https://viaf.org/viaf/data/viaf-...-links.txt.gz before doing something else, e.g. loading into a postgresql DB.
- Humanisator (talk) 13:02, 16 February 2023 (UTC)
- Support --Emu (talk) 15:32, 16 February 2023 (UTC)
- Could we link a few test edits? I am not able to find them among the massive contributions of the bot.--Ymblanter (talk) 19:51, 22 February 2023 (UTC)
- @Ymblanter: See Topic:Xc75x92e4ojr3yo7 with 50 test cases (no errors found). Later you can track the edits here. @MisterSynergy: Imho you can now start with a first small package then we'll check these edits. --Kolja21 (talk) 23:42, 22 February 2023 (UTC)
- I will run a fresh test batch of ~50 edits soon and link it from here. Got distracted by another job that appeared in front of my eyes in the meantime :-) —MisterSynergy (talk) 23:53, 22 February 2023 (UTC)
- @Ymblanter: Here are 25 test edits (running from PAWS under my main account for convenience; the same script will be migrated to the MsynBot account once this is approved). —MisterSynergy (talk) 20:54, 23 February 2023 (UTC)