German IPA from de.wiktionary

After completing my Python script to extract German IPA from English Wiktionary dumps, I realized I could also run the script on a dump of German Wiktionary, and extract even more data.

So I headed to the Wikimedia dumps page, downloaded the latest dewiktionary dump, ran my script, and received … nothing!

I opened both dump files in glogg, located the entries for the same word in both files, and compared the structure of the XML elements and the MediaWiki markup.

The structure for en.wiktionary is sketched in my previous post, and the structure of de.wiktionary relevant for data extraction looks like this:

<page>                                 starts a page
<title>                                holds the page title, i.e. the word
<text>                                 beginning of the page content
== [word] ({{Sprache|Deutsch}}) ==     beginning of the section for German [word]
{{Aussprache}}                         the Pronunciation section
:{{IPA}}                               the IPA part of the Pronunciation section
{{Lautschrift|                         one or more entries with standard and regional pronunciations

German Wiktionary makes more use of MediaWiki templates than English Wiktionary does.
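The structure above can be walked with a few regular expressions. What follows is only a minimal sketch and not the code from extract_de_ipa.py: the function name `extract_german_ipa` and the three patterns are my own assumptions, and it only looks at the first German section of a page.

```python
import re

# Heading of a German entry, e.g. "== Haus ({{Sprache|Deutsch}}) =="
GERMAN_HEADING = re.compile(
    r"^==[ \t]*(.+?)[ \t]*\(\{\{Sprache\|Deutsch\}\}\)[ \t]*==[ \t]*$",
    re.MULTILINE,
)
# Any level-2 heading ("== ... =="), used to find the end of the German section
ANY_LEVEL2_HEADING = re.compile(r"^==[^=].*==[ \t]*$", re.MULTILINE)
# One {{Lautschrift|...}} template; extra parameters after a second "|" are ignored
LAUTSCHRIFT = re.compile(r"\{\{Lautschrift\|([^}|]+)(?:\|[^}]*)?\}\}")


def extract_german_ipa(wikitext):
    """Return the IPA strings from the :{{IPA}} line of the first German section."""
    heading = GERMAN_HEADING.search(wikitext)
    if heading is None:
        return []  # page has no German section

    # The section runs until the next level-2 heading or the end of the page
    next_heading = ANY_LEVEL2_HEADING.search(wikitext, heading.end())
    end = next_heading.start() if next_heading else len(wikitext)
    section = wikitext[heading.end():end]

    ipa = []
    in_pronunciation = False
    for line in section.splitlines():
        stripped = line.strip()
        if stripped.startswith("{{Aussprache}}"):
            in_pronunciation = True
        elif in_pronunciation and stripped.startswith(":{{IPA}}"):
            ipa.extend(entry.strip() for entry in LAUTSCHRIFT.findall(stripped))
        elif in_pronunciation and stripped.startswith("{{"):
            # A new top-level template ends the Pronunciation block
            in_pronunciation = False
    return ipa
```

Called on the wikitext of a single page, it returns a list with one string per `{{Lautschrift|` entry, or an empty list if the page has no German section or no IPA line.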

I adapted my original script to handle the dewiktionary page structure.
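For readers who want to try something similar, here is a rough sketch of how the pages of a dump can be streamed without loading the whole file into memory. It is not the adapted script itself: `iter_pages`, `local_name` and `open_dump` are hypothetical helper names, and the approach (the standard library's `xml.etree.ElementTree.iterparse`) is simply one that I know works on MediaWiki export XML.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that MediaWiki export XML puts on tags."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) for every <page> element in a MediaWiki XML dump.

    `source` is a filename or a binary file object.
    """
    title, text = None, None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text or ""
            title, text = None, None
            elem.clear()  # release the finished page's subtree


def open_dump(path):
    """Open a dump file, transparently handling the .bz2 compression."""
    return bz2.open(path, "rb") if path.endswith(".bz2") else open(path, "rb")


# Example use (the file name is only an illustration):
#
# with open_dump("dewiktionary-latest-pages-articles.xml.bz2") as dump:
#     for title, wikitext in iter_pages(dump):
#         if "{{Sprache|Deutsch}}" in wikitext:
#             print(title)
```

Matching on the local tag name keeps the sketch independent of the export schema version that appears in the dump's XML namespace.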

The script extract_de_ipa.py and the extracted data file de_dewikt.csv can be found in my german-ipa-dict repository on GitHub.

Enjoy 😉
