G2P Part 1: Getting from 'a' to 'b' with g2p - why g2p exists and will let you do awesome things
This is the first blog post in a seven-part series about a software tool called g2p. This post describes some of the background context for why g2p was created, and subsequent posts will go into more detail about how to use it.
g2p is a tool for systematically converting certain characters[1] into other ones. This sounds fairly simple, but it can actually be incredibly powerful and useful! For example, maybe you want to get the pronunciation from a word’s spelling. g2p can help with that! Or maybe a language you’re learning or teaching has different writing systems and you want to convert between them. Or maybe your language has a historic or legacy way of writing and you want to convert it to the new writing system. There are also other uses for g2p, which I’ll explain in following posts - keep reading to learn the basics of g2p!
G2P Blog Series Index
- How to write a basic mapping in G2P Studio
- Writing mappings on your computer
- Advanced mappings
- ReadAlong Studio & Other Applications
- Preprocessing mappings
Who is involved with this project?
What are the motivations behind G2P?
There are many reasons why you might want to systematically convert between different characters. Here are a few possible use cases:
Use Case #1: Getting the pronunciation from a word’s spelling
Sometimes you want to convert from a language’s writing system (also known as its orthography) to its pronunciation. This is a very common task in natural language processing and is essential in the creation of text-to-speech and automatic speech recognition systems. In another post in this series, I will describe how g2p is used in a project called “ReadAlongs”.
“Letters” in a writing system are usually referred to as “graphemes” and their corresponding meaningful sounds are referred to as “phonemes”; hence “g2p” or “grapheme-to-phoneme”. It gets a little more complicated than that though, because sometimes a grapheme is made of more than one character, as with the digraph “th” which can be pronounced unvoiced as in ‘thin’ or voiced as in ‘that’. The International Phonetic Alphabet (IPA) is not so ambiguous! In IPA, the ‘th’ in ‘thin’ is written as θ and the ‘th’ in ‘that’ is written as ð.
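To make the digraph point concrete, here is a minimal sketch in plain Python. It is not how g2p itself is implemented, and the tiny rule table is a hypothetical one I made up for illustration. It shows why a converter should try longer graphemes like “th” before single characters.

```python
# Toy grapheme-to-phoneme table (hypothetical; not a real g2p mapping).
RULES = {
    "th": "θ",  # digraph: always treated as unvoiced in this toy table
    "t": "t",
    "h": "h",
    "i": "ɪ",
    "n": "n",
}


def convert(text, rules=RULES):
    """Scan left to right, trying the longest matching grapheme first."""
    graphemes = sorted(rules, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for grapheme in graphemes:
            if text.startswith(grapheme, i):
                out.append(rules[grapheme])
                i += len(grapheme)
                break
        else:
            # No rule applies: pass the character through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


print(convert("thin"))  # θɪn
print(convert("hint"))  # hɪnt
```

Note that a flat table like this cannot tell the ‘th’ in ‘thin’ from the ‘th’ in ‘that’. Handling that kind of ambiguity needs extra context, which a simple lookup does not have.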
Use Case #2: A language with multiple writing systems
Some languages have two (or more!) different writing systems. Take Cree for example, where you can either write a word in Standard Roman Orthography like “ê-wêpâpîhkêwêpinamâhk” or in Syllabics like ᐁᐍᐹᐲᐦᑫᐍᐱᓇᒫᕽ. My colleague Eddie has a great blog post about a tool he created to convert between the two here. g2p can help with this kind of transformation between writing systems.
Use Case #3: Converting from legacy writing systems
Some languages historically used “font hacks” to render the characters in their writing system before they were supported on computers. There’s a longer discussion to be had here, but the tl;dr version is that when computers were gaining widespread use among speakers of Indigenous languages, they typically weren’t able to render (i.e., display) characters outside of the 128 characters supported by the American Standard Code for Information Interchange (ASCII), or even any of the extensions to ASCII that provide a total of 256 characters (e.g., Latin-1 for Western European languages). To get around this, language communities would come up with their own custom fonts (often referred to as “font hacks” or “font encodings”) where they would override the display of a character like “©”, which existed in Latin-1, so that it appeared as ‘ǧ’ instead (example taken from the Heiltsuk Doulos font). For more information on this topic, please check out ‘Seeing the Heiltsuk Orthography from Font Encoding through to Unicode’ or ‘Applications and innovations in typeface design for North American Indigenous languages’.
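As a rough illustration of what converting a font hack to Unicode involves, here is a small Python sketch. The one-entry table only covers the “©” to ‘ǧ’ example above. A real conversion would need an entry for every character the custom font overrides, and the names `LEGACY_TO_UNICODE` and `to_unicode` are just ones I picked for this example.

```python
# Hypothetical one-entry table based on the single example in the text.
# A full font-encoding conversion would list every overridden character.
LEGACY_TO_UNICODE = str.maketrans({
    "\u00a9": "\u01e7",  # © (COPYRIGHT SIGN) -> ǧ (g with caron)
})


def to_unicode(legacy_text):
    """Replace font-hack characters with the characters they were displayed as."""
    return legacy_text.translate(LEGACY_TO_UNICODE)


print(to_unicode("\u00a9"))  # ǧ
```

The text stored under a font hack really does contain “©”. It only looks like ‘ǧ’ when the custom font is installed, which is why converting it to proper Unicode matters.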
Using G2P Studio
If you want to use g2p to convert some text in one of the supported languages[2], simply visit the G2P Studio, select a language from the dropdown, and type in your text. That’s all there is to it! To learn how to add support for other languages and use g2p for other cool things, go on to the next part of the series!
[1] Because the word ‘letter’ usually refers to a character within a specific alphabet or writing system, I’m going to use the word ‘character’ instead of ‘letter’ throughout this post. Similarly, despite the name of this tool being ‘Grapheme-to-Phoneme’, in reality g2p can be used to convert any characters to any other characters, not just graphemes (contrastive units of a writing system) to phonemes (contrastive units of a sound system).
[2] At time of writing, this includes the following languages (listed with their ISO 639-3 codes): alq - Anishinàbemiwin, atj - Atikamekw, crg - Michif, crj - Southern & Northern East Cree, crx - Plains Cree, crm - Moose Cree, csw - Swampy Cree, ctp - Western Highland Chatino, dan - Danish, fra - French, git - Gitksan, gla - Scottish Gaelic, gwi - Gwich’in, haa - Hän, ikt - Inuinnaqtun, iku - Inuktitut, Kaska, kwk - Kwak’wala, lml - Raga, mic - Mi’kmaq, moh - Kanien’kéha, oji - Anishinaabemowin, see - Seneca, srs - Tsuut’ina, tau - Upper Tanana, tce - Southern Tutchone, ttm - Northern Tutchone, tgx - Tagish, tli - Tlingit