G2P Part 6: Solve inconsistencies in your text with a g2p pre-processing mapping
This is the 6th blog post in a seven-part series about a software tool called g2p. In this post we’ll discuss how to use g2p for the common natural language processing task of text normalization.
G2P Blog Series Index
- How to write a basic mapping in G2P Studio
- Writing mappings on your computer
- Advanced mappings
- ReadAlong Studio & Other Applications
- Preprocessing mappings
Adding a ‘pre-processing’ mapping
It’s often not sufficient to just write a mapping between the characters in a language’s orthography and the IPA, as illustrated in the second and third examples below. Real-world text input is pretty messy, and if we want ReadAlongs or Convertextract - or any other tool that uses g2p - to work properly, we need to account for as much of that messiness as possible. Solving this kind of messiness is generally called ‘Text Normalization’.1 This ‘normalization’ can be about ensuring that the same Unicode characters are used consistently, or about converting symbols into their pronounced form, like & or 123.
For example, maybe your language uses underlines in its orthography. There are two commonly confusable Unicode characters here: U+0331 COMBINING MACRON BELOW and U+0332 COMBINING LOW LINE, and they look almost identical (cf. g̱ (U+0331) vs g̲ (U+0332)). So, let’s ‘normalize’ to consistently use U+0331.
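As a minimal sketch (in plain Python, outside of g2p’s own machinery), this kind of equivalence can be expressed as a one-character substitution; the function name here is illustrative:

```python
import unicodedata

# U+0332 COMBINING LOW LINE -> U+0331 COMBINING MACRON BELOW
CANONICAL_UNDERLINE = {ord("\u0332"): "\u0331"}

def normalize_underlines(text: str) -> str:
    """Replace every U+0332 with U+0331 so the text uses one underline."""
    return text.translate(CANONICAL_UNDERLINE)

word = "g\u0332a"                  # written with COMBINING LOW LINE
fixed = normalize_underlines(word)
print(unicodedata.name(fixed[1]))  # -> COMBINING MACRON BELOW
```

In g2p itself, the same substitution would simply be one rule in a pre-processing mapping, as described below.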
Second, maybe we have a text with a lot of symbols like ‘&’ in it. We could write a pre-processing mapping for that as well (example in Danish):
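Such a mapping could be written as a plain `in,out` CSV like g2p’s other mappings; the row below is illustrative (‘og’ is the Danish word for ‘and’):

```csv
in,out
&,og
```

In practice you might map ‘&’ to ‘ og ’ with surrounding spaces so that it comes out as a separate word in running text.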
A third example can be seen in the Gitksan mapping, where the writing system uses a single quote to mark ejectives and glottal stops, but there are many apostrophe-like confusable characters, like ’ (U+2019) or ʼ (U+02BC). In this mapping we can see that they’re all mapped to the plain apostrophe ' (U+0027).
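You can see why these characters are so easily mixed up by printing their Unicode names; this small sketch (the character set here is illustrative, not the full Gitksan list) also collapses them all to U+0027:

```python
import unicodedata

# A few apostrophe-like characters that often sneak into keyboarded text.
CONFUSABLES = "\u2018\u2019\u02bc"   # ‘ ’ ʼ
for ch in CONFUSABLES:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Map them all to the plain apostrophe U+0027.
TO_APOSTROPHE = {ord(ch): "'" for ch in CONFUSABLES}
print("k\u2019a".translate(TO_APOSTROPHE))  # -> k'a
```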
How do we link this up with the rest of our mappings? We recommend calling these mappings <yourlang>-equiv, for “equivalencies”, a term that is more neutral than, and sometimes preferred to, “normalization”. Under the hood, g2p creates a directed graph between all possible mappings. Similar to when using g2p for ReadAlongs, suppose we have a g2p pipeline from ‘dan’ to ‘eng-arpabet’ that goes through the g2p graph like so: ‘dan’ → ‘dan-ipa’ → ‘eng-ipa’ → ‘eng-arpabet’. We basically want to add one more conversion at the start of this path that performs the normalization step. So, we configure a new mapping from ‘dan’ → ‘dan-equiv’ containing our normalizations, then we rename the existing ‘dan’ → ‘dan-ipa’ mapping to ‘dan-equiv’ → ‘dan-ipa’. Then we run g2p update, and the next time we convert from ‘dan’ → ‘eng-arpabet’, the text will pass through the normalization mapping too.
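As a sketch of what this looks like on disk, the language’s config file would end up registering both mappings. The field names and filenames below follow the general pattern of g2p’s bundled mappings but are illustrative; check g2p’s own mapping directories for the exact schema:

```yaml
mappings:
  # New pre-processing ("equivalencies") mapping: dan -> dan-equiv
  - display_name: Danish equivalencies
    in_lang: dan
    out_lang: dan-equiv
    type: mapping
    mapping: dan_equiv.csv      # illustrative filename
  # Existing orthography-to-IPA mapping, now starting from dan-equiv
  - display_name: Danish to IPA
    in_lang: dan-equiv
    out_lang: dan-ipa
    type: mapping
    mapping: dan_to_ipa.csv     # illustrative filename
```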
not to be confused with Unicode Normalization, which is a different usage of the same term! ↩