An answer on LSE brought me to the open-dict-data/ipa-dict repository on GitHub, which contains IPA data for several languages, including German.
I downloaded the .zip files from the Releases link and looked at the German data. Out of interest, I opened the de_homonyms.txt file and noticed strange entries, for example:
/ʔapgəˈzakt/ abgesackt, abgesagt
Confused by the unexpected word pair (and by the stress indicator), I checked this entry in Wiktionary, and while the English version does not give a pronunciation for abgesagt, the German version has (expected) [ˈapɡəˌzaːkt], as opposed to abgesackt [ˈapɡəˌzakt].
So I checked the file more thoroughly, and found that many “homonyms” were simply based on casing (upper case for nouns, lower case for verbs and adjectives).
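A check like this can be automated. The following is only a minimal sketch, assuming every line of de_homonyms.txt has the shape `/ipa/ word1, word2, ...` (inferred from the single entry quoted above); the helper names and the second sample line are made up for illustration.

```python
def parse_homonym_line(line):
    """Split '/ipa/ word1, word2' into (ipa, [word1, word2]).

    Assumed format: IPA string without spaces, then a comma-separated word list.
    """
    ipa, _, rest = line.strip().partition(" ")
    words = [w.strip() for w in rest.split(",") if w.strip()]
    return ipa, words


def is_case_only_group(words):
    """True if all words in the group are the same word apart from casing.

    lower() is used instead of casefold(), because casefold() maps 'ß' to 'ss'
    and would wrongly merge pairs such as 'Maße' and 'Masse'.
    """
    return len(words) > 1 and len({w.lower() for w in words}) == 1


sample = [
    "/ʔapgəˈzakt/ abgesackt, abgesagt",  # entry quoted in the post
    "/ˈʔɛsən/ Essen, essen",             # hypothetical entry, invented for this sketch
]
for line in sample:
    ipa, words = parse_homonym_line(line)
    label = "case only" if is_case_only_group(words) else "different words"
    print(ipa, words, label)
```

Run over the whole file, the same two helpers would give a count of how many entries are homonyms only because of casing.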
Then I noticed unusual IPA transcriptions, such as
- ao for “au” (wiktionary: aʊ̯)
- ɔø for “eu” (wiktionary: ɔɪ̯)
- ae for “ei” (wiktionary: aɪ̯)
- failure to distinguish long and short vowels
- failure to distinguish e and ɛ
which brought me to the conclusion that I could not use this data for my purposes.
Now, I am not saying that people don’t talk like that; maybe there are some German dialects with exactly these pronunciations. But it is impossible to tell, because the repository named as the original source, kdelaney/germanipa, does not exist anymore.
So, why not extract the IPA pronunciations from Wiktionary? As it happens, I had already downloaded a relatively recent XML dump of en.wiktionary, and had a look at the file.
With a size of 8.4 GB, the file enwiktionary-20220120-pages-meta-current.xml cannot be opened with the editors I typically use, so I installed glogg to view the file.
It turns out that, while the Wiktionary export is an XML file, we do not need specific XML tools to access the contents, as Wikipedia exports typically have one XML element per line.
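The one-element-per-line layout can also be inspected without a special viewer by reading only the first few lines of the dump. This is a small sketch; the function name `head` and the default of 40 lines are arbitrary choices of mine.

```python
from itertools import islice


def head(path, n=40):
    """Return the first n lines of a text file without reading the rest."""
    with open(path, encoding="utf-8") as f:
        # islice stops after n lines, so the 8.4 GB file is never loaded fully.
        return [line.rstrip("\n") for line in islice(f, n)]


# Usage, assuming the dump is in the current directory:
# for line in head("enwiktionary-20220120-pages-meta-current.xml"):
#     print(line)
```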
To extract the IPA pronunciations, we only need to analyze these lines in hierarchical order:
<page> | starts a page |
<title> | holds the page title, i.e. the word |
<text> | beginning of the page content |
==German== | beginning of the section for the German word |
====Pronunciation==== | the Pronunciation section |
* {{IPA|de| or ** {{IPA|de| | German pronunciation in MediaWiki template syntax |
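The walk through these markers can be sketched as a small line-by-line state machine. This is not the extract_de_ipa_en.py script itself, only a minimal illustration under a few assumptions of mine: the first wikitext line shares a line with the `<text ...>` tag, the Pronunciation heading may appear at any level of three or more `=` signs, and only the first transcription of each `{{IPA|de|...}}` template is kept. The sample page is an invented fragment, not real dump content.

```python
import re

# First argument after {{IPA|de| ; further arguments on the same line are ignored.
IPA_RE = re.compile(r"^\*{1,2} ?\{\{IPA\|de\|([^|}]+)")
TITLE_RE = re.compile(r"<title>(.*?)</title>")
# Level-2 language heading such as ==German== (does not match ===Noun===).
LANG_RE = re.compile(r"^==([^=]+)==\s*$")


def extract_german_ipa(lines):
    """Yield (title, ipa) pairs from an iterable of dump lines."""
    title, in_german, in_pron = None, False, False
    for line in lines:
        line = line.strip()
        if line.startswith("<page>"):
            # A new page resets all state.
            title, in_german, in_pron = None, False, False
            continue
        m = TITLE_RE.search(line)
        if m:
            title = m.group(1)
            continue
        if line.startswith("<text"):
            # Drop the opening tag, keep the wikitext that follows it.
            line = line.partition(">")[2]
        m = LANG_RE.match(line)
        if m:
            in_german = m.group(1) == "German"
            in_pron = False
            continue
        if not in_german:
            continue
        if line.startswith("==="):
            # Any deeper heading either opens or closes the Pronunciation section.
            in_pron = line.strip("=").strip() == "Pronunciation"
            continue
        if in_pron and title:
            m = IPA_RE.match(line)
            if m:
                yield title, m.group(1)


# Invented sample page for illustration only.
sample = """<page>
<title>abgesackt</title>
<text bytes="0" xml:space="preserve">==German==

===Pronunciation===
* {{IPA|de|[ˈapɡəˌzakt]}}

===Verb===
{{head|de|past participle}}
</text>
</page>""".splitlines()

print(list(extract_german_ipa(sample)))
```

Because the function accepts any iterable of lines, an open file handle on the dump can be passed in directly, so the file is streamed rather than loaded into memory.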
Using some Python knowledge from Google Instant University, I came up with the script extract_de_ipa_en.py to extract German IPA pronunciations from a Wiktionary dump and save them in .csv format.
The script and the generated data file can be found in my german-ipa-dict repository on GitHub.
Enjoy 😉