German IPA from en.wiktionary

An answer on LSE brought me to the open-dict-data/ipa-dict repository on GitHub which contains IPA data for several languages, including German.

I downloaded the .zip files from the Releases link and looked at the German data. Out of interest, I opened the de_homonyms.txt file, and noticed strange entries, for example

/ʔapgəˈzakt/ abgesackt, abgesagt

Confused by the unexpected word pair (and by the stress indicator), I checked this entry in Wiktionary. While the English version does not give a pronunciation for abgesagt, the German version has the expected [ˈapɡəˌzaːkt], as opposed to [ˈapɡəˌzakt] for abgesackt.

So I checked the file more thoroughly, and found that many “homonyms” were simply based on casing (upper case for nouns, lower case for verbs and adjectives).
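
A quick way to quantify this is to check, for each line of de_homonyms.txt, whether the listed words differ only in capitalization. Here is a minimal sketch, assuming every line has the form shown above (an IPA string between slashes, followed by a comma-separated word list). The function names are my own, and the second sample line is made up for illustration.

```python
def parse_homonym_line(line):
    """Split '/ipa/ word1, word2' into (ipa, [word1, word2])."""
    # The words never contain a slash, so the last slash ends the IPA part.
    head, _, words = line.strip().rpartition("/")
    return head + "/", [w.strip() for w in words.split(",") if w.strip()]


def differs_only_by_case(words):
    """True if all words are identical once capitalization is ignored."""
    # lower() rather than casefold(): casefold() would turn "ß" into "ss"
    # and wrongly merge words like "Maße" and "Masse".
    return len({w.lower() for w in words}) == 1


sample = [
    "/ʔapgəˈzakt/ abgesackt, abgesagt",  # entry quoted above
    "/ˈʔɛsən/ Essen, essen",             # made-up entry for illustration
]
for line in sample:
    ipa, words = parse_homonym_line(line)
    print(ipa, words, differs_only_by_case(words))
```

Counting the lines for which differs_only_by_case returns True would show how many of the "homonyms" are really just noun/verb capitalization pairs.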

Then I noticed unusual IPA transcriptions, such as

  • ao for “au” (wiktionary: aʊ̯)
  • ɔø for “eu” (wiktionary: ɔɪ̯)
  • ae for “ei” (wiktionary: aɪ̯)
  • failure to distinguish long and short vowels
  • failure to distinguish e and ɛ

which brought me to the conclusion that I could not use this data for my purposes.
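
To get a feel for how widespread these transcriptions are, one could scan the IPA strings for the three diphthong spellings listed above. This is only a rough sketch: the mapping contains just those three pairs, the function name is hypothetical, and a plain substring test may also match vowel sequences across a syllable boundary.

```python
# ipa-dict spelling -> Wiktionary spelling, taken from the list above
DIPHTHONGS = {"ao": "aʊ̯", "ɔø": "ɔɪ̯", "ae": "aɪ̯"}


def nonstandard_diphthongs(ipa):
    """Return the ipa-dict diphthong spellings that occur in an IPA string."""
    return [spelling for spelling in DIPHTHONGS if spelling in ipa]


for ipa in ["/haos/", "/ʔapgəˈzakt/"]:  # first string is made up
    print(ipa, nonstandard_diphthongs(ipa))
```

Note that this only detects the diphthong spellings. The missing vowel length and the merged e/ɛ cannot be recovered from the data this way.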

Now I am not saying that people don’t talk like that; maybe there are German dialects with exactly these pronunciations. But it is impossible to tell, because the source named by the repository, kdelaney/germanipa, no longer exists.

So, why not extract the IPA pronunciations from Wiktionary? As it happens, I had already downloaded a relatively recent XML dump of en.wiktionary, so I had a look at the file.

With a size of 8.4 GB, the file enwiktionary-20220120-pages-meta-current.xml cannot be opened with the editors I typically use, so I installed glogg to view the file.

It turns out that, while the Wiktionary export is an XML file, we do not need specific XML tools to access the contents, as Wikipedia exports typically have one XML element per line.

To extract the IPA pronunciations, we only need to analyze these lines in hierarchical order:

<page>                    starts a page
<title>                   holds the page title, i.e. the word
<text>                    beginning of the page content
==German==                beginning of the section for the German word
====Pronunciation====     the Pronunciation section
* {{IPA|de| or
** {{IPA|de|              German pronunciation in MediaWiki template syntax
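
The hierarchy above maps onto a small state machine that remembers the current page title and whether we are inside the German section and its Pronunciation subsection. The following is a simplified sketch of that idea, not the actual extract_de_ipa_en.py. The function name and the sample page are made up. It also assumes that the first line of wikitext sits on the same line as the opening <text ...> tag, and it accepts a Pronunciation heading at any level below the language heading.

```python
def extract_german_ipa(lines):
    """Yield (title, ipa) pairs from an iterable of dump lines."""
    title = None
    in_german = in_pron = False
    for raw in lines:
        line = raw.strip()
        if line.startswith("<page>"):
            # New page: forget everything from the previous one.
            title, in_german, in_pron = None, False, False
            continue
        if line.startswith("<title>"):
            title = line[len("<title>"):].split("</title>")[0]
            continue
        if line.startswith("<text"):
            # Assumption: the wikitext starts right after the opening tag.
            line = line.partition(">")[2]
        if line.startswith("==") and not line.startswith("==="):
            # Level-2 heading: a language section starts.
            in_german = line.strip("= ") == "German"
            in_pron = False
        elif line.startswith("==="):
            # Deeper heading: only Pronunciation inside German counts.
            in_pron = in_german and line.strip("= ") == "Pronunciation"
        elif in_pron and line.lstrip("* ").startswith("{{IPA|de|"):
            args = line.split("{{IPA|de|", 1)[1].split("}}", 1)[0]
            for arg in args.split("|"):
                # Keep /phonemic/ and [phonetic] arguments, skip the rest.
                if arg.startswith(("/", "[")):
                    yield title, arg


# Made-up sample page, shaped like the dump lines described above.
sample = """\
  <page>
    <title>Tag</title>
      <text bytes="123" xml:space="preserve">==English==
===Pronunciation===
* {{IPA|en|/tæɡ/}}
==German==
===Pronunciation===
* {{IPA|de|/taːk/}}
===Noun===
</text>
  </page>
"""

for title, ipa in extract_german_ipa(sample.splitlines()):
    print(title, ipa)
```

For the real dump, pass an open file handle (opened with encoding="utf-8") instead of the sample lines. Iterating over the handle reads one line at a time, so the 8.4 GB file never has to fit into memory.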

Using some Python knowledge from Google Instant University, I came up with the script extract_de_ipa_en.py to extract German IPA pronunciations from a Wiktionary dump and save them in .csv format.

The script and the generated data file can be found in my german-ipa-dict repository on GitHub.

Enjoy 😉
