After completing my Python script to extract German IPA from English Wiktionary dumps, I realized I could run the same script on a dump of the German Wiktionary and extract even more data.
So I headed to the Wikimedia dumps page, downloaded the latest dewiktionary dump, ran my script, and received … nothing!
I opened both dump files in glogg, located the entries for the same word in both files, and compared the structure of the XML elements and the MediaWiki markup.
The structure for en.wiktionary is sketched in my previous post, and the structure of de.wiktionary relevant for data extraction looks like this:
|Starts a page|
|holds the page title, i.e. the word|
|start of the page content|
|start of the section for German [word]|
|the Pronunciation section|
|the IPA part of the Pronunciation section|
|one or more entries with standard and regional pronunciations|
I adapted my original script to handle the dewiktionary page structure.
The adapted script, extract_de_ipa.py, and the extracted data file, de_dewikt.csv, can be found in my german-ipa-dict repository on GitHub.
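To illustrate the walk through a page's wikitext described in the table above, here is a minimal sketch. It is not the code from extract_de_ipa.py. The function name `extract_ipa` is hypothetical, and the template names it looks for (`{{Sprache|Deutsch}}`, `{{Aussprache}}`, `{{IPA}}`, `{{Lautschrift|…}}`) are my assumption about how de.wiktionary marks up these parts, so check them against an actual dump before relying on them.

```python
import re

# Assumed markup: each pronunciation is wrapped as {{Lautschrift|...}}.
LAUTSCHRIFT = re.compile(r"\{\{Lautschrift\|([^}|]+)\}\}")


def extract_ipa(wikitext):
    """Return the IPA strings found in the German section of one page's wikitext."""
    in_german = False   # inside the "== word ({{Sprache|Deutsch}}) ==" section
    in_pron = False     # inside the {{Aussprache}} (pronunciation) section
    results = []
    for line in wikitext.splitlines():
        stripped = line.strip()
        if stripped.startswith("== "):
            # A level-2 heading starts a new language section.
            in_german = "{{Sprache|Deutsch}}" in stripped
            in_pron = False
        elif not in_german:
            continue
        elif stripped.startswith("{{Aussprache}}"):
            in_pron = True
        elif in_pron and stripped.startswith(":{{IPA}}"):
            # One or more standard and regional pronunciations on this line.
            results.extend(LAUTSCHRIFT.findall(stripped))
        elif in_pron and stripped.startswith("{{"):
            # Any other section template ends the pronunciation section.
            in_pron = False
    return results
```

The function expects the content of a single page's text element as a string and returns a list of IPA strings, which is empty if the page has no German section or no IPA line.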