After completing my Python script to extract German IPA from English Wiktionary dumps, I realized I could also run the script on a dump of German Wiktionary, and extract even more data.
So I headed to the Wikimedia dumps page, downloaded the latest dewiktionary dump, ran my script, and received … nothing!
To find out why, I opened both dump files in glogg, located the entries for the same word in both files, and compared the structure of the XML elements and the MediaWiki markup.
The structure for en.wiktionary is sketched in my previous post, and the structure of de.wiktionary relevant for data extraction looks like this:
<page> | starts a page |
<title> | holds the page title, i.e. the word |
<text> | beginning of the page content |
== [word] ({{Sprache|Deutsch}}) == | beginning of the section for the German [word] |
{{Aussprache}} | the pronunciation section |
:{{IPA}} | the IPA part of the pronunciation section |
{{Lautschrift| | one or more entries with standard and regional pronunciations |
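The markers in this table can be walked through in a few lines of Python. This is only a minimal sketch of the idea, not the code from extract_de_ipa.py: the function name `extract_german_ipa`, the regular expressions, and the hand-written sample wikitext are all assumptions made for illustration, and real pages will contain variations this does not cover.

```python
import re

# Heading of a German entry, e.g. "== Haus ({{Sprache|Deutsch}}) =="
GERMAN_HEADING = re.compile(
    r"^==[ \t]*(.+?)[ \t]*\(\{\{Sprache\|Deutsch\}\}\)[ \t]*==[ \t]*$",
    re.MULTILINE,
)
# Any level-2 heading; used to find where the German section ends.
ANY_HEADING = re.compile(r"^==[^=].*==[ \t]*$", re.MULTILINE)
# The first parameter of a {{Lautschrift|...}} template.
LAUTSCHRIFT = re.compile(r"\{\{Lautschrift\|([^}|]*)")


def extract_german_ipa(wikitext):
    """Return the IPA strings of the German section, or [] if none is found."""
    heading = GERMAN_HEADING.search(wikitext)
    if heading is None:
        return []
    start = heading.end()
    next_heading = ANY_HEADING.search(wikitext, start)
    end = next_heading.start() if next_heading else len(wikitext)

    in_pronunciation = False
    for line in wikitext[start:end].splitlines():
        stripped = line.strip()
        if stripped.startswith("{{Aussprache}}"):
            in_pronunciation = True
        elif in_pronunciation and stripped.startswith(":{{IPA}}"):
            # Keep only non-empty transcriptions from this line.
            return [ipa.strip() for ipa in LAUTSCHRIFT.findall(stripped) if ipa.strip()]
    return []


# Hand-written sample in the shape described above (not copied from a dump).
SAMPLE = """== Haus ({{Sprache|Deutsch}}) ==

{{Aussprache}}
:{{IPA}} {{Lautschrift|haʊ̯s}}
"""

if __name__ == "__main__":
    print(extract_german_ipa(SAMPLE))
```

The sketch only reads the first `:{{IPA}}` line after `{{Aussprache}}` and stops at the next level-2 heading, so sections for other languages on the same page are ignored.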
I adapted my original script to handle the dewiktionary page structure.
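To give an idea of what handling the page structure involves at the XML level, here is a hedged sketch of streaming `<page>` elements with the standard library. It is an illustration under assumptions, not the repository's actual code: the helper names `local_name` and `iter_pages` are made up, and the namespace URI in the in-memory sample is just an example of the kind of namespace MediaWiki export files declare.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}page' down to 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, text) for every <page> in a MediaWiki XML export.

    `source` is a file name or a binary file object. The file is parsed
    incrementally, so the whole dump is never held in memory at once.
    """
    title, text = None, None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            if title is not None:
                yield title, text or ""
            title, text = None, None
            elem.clear()  # drop the page's children to keep memory use low


# Tiny in-memory stand-in for a dump file (hand-written, for illustration).
SAMPLE_DUMP = (
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
    b"<page><title>Haus</title><revision><text>== Haus ==</text></revision></page>"
    b"<page><title>Leer</title><revision><text></text></revision></page>"
    b"</mediawiki>"
)

if __name__ == "__main__":
    for page_title, page_text in iter_pages(io.BytesIO(SAMPLE_DUMP)):
        print(page_title, len(page_text))
```

For a real, compressed dump, the same generator should work on a file object returned by `bz2.open(path, "rb")`, where `path` points at the downloaded file.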
The script extract_de_ipa.py and the extracted data file de_dewikt.csv can be found in my german-ipa-dict repository on GitHub.
Enjoy 😉