G2P Part 3: Make a basic mapping with g2p
This is the third blog post in a seven-part series describing how to make a basic mapping with g2p on your computer. If you haven’t already, please read the introductory blog post for g2p and the post about basic mappings with G2P Studio.
- Background
- How to write a basic mapping in G2P Studio
- Writing mappings on your computer
- Advanced mappings
- ReadAlong Studio & Other Applications
- Preprocessing mappings
- Contributing
NOTE!
As of September 2023, there is a new version of g2p available: 2.0. The instructions in this blog were originally written for version 1.x. If you already have g2p installed, we recommend that you upgrade your installation before continuing with this post.
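If you are not sure which version you have, here is a minimal sketch of one way to check from Python. It only uses the standard library's importlib.metadata. The helper name installed_major_version is made up for this post, and it assumes the version string starts with a number, like 2.0.1.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_major_version(package="g2p"):
    """Return the major version of an installed package, or None if it is missing.

    Hypothetical helper for illustration; assumes a version string like "2.0.1".
    """
    try:
        return int(version(package).split(".")[0])
    except PackageNotFoundError:
        return None


major = installed_major_version("g2p")
if major is None:
    print("g2p is not installed yet")
elif major < 2:
    print("g2p 1.x found: consider running `pip install --upgrade g2p`")
else:
    print(f"g2p {major}.x found")
```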
What you need to know to understand this post
- To be able to follow along, I suggest having some sort of text editor, like Visual Studio Code.
- You will also need to have Python installed on your computer. If you need a bit of help, my colleague Eddie Antonio Santos wrote a very good blog article on installing Python on Mac.
What’s the gist of what we’re about to do?
As described in the previous post, in order to use g2p we need to understand its building blocks: rules and mappings. Rules are patterns that describe how to turn some input text into some other output text. When we combine a series of ordered rules together for a specific purpose, we call this a mapping. This blog post will show you how to write rules and mappings on your computer to use with g2p.
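To make the idea of "a series of ordered rules" concrete, here is a toy sketch in plain Python. It is not how g2p works internally (g2p's real engine handles contexts, indices and much more), and the function name apply_rules is invented for this illustration.

```python
def apply_rules(text, rules):
    """Apply each rule to the text in order; later rules see earlier output.

    Toy illustration only, not g2p's actual implementation.
    """
    for rule in rules:
        text = text.replace(rule["in"], rule["out"])
    return text


# A "mapping" here is just an ordered list of rules.
mapping = [
    {"in": "a", "out": "æ"},
]

print(apply_rules("kat", mapping))  # kæt
```

Because the rules run one after another, changing their order can change the result, which is why a mapping is an ordered list rather than an unordered set.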
How do I follow along?
You can follow along by writing your rules and mappings on your computer using a text editor like Visual Studio Code.
You should also install g2p by running pip3 install g2p (see footnote 1) in your command line or an integrated terminal within Visual Studio Code. See this post for more information on installing Python packages with pip.
Use in Python and the command line
Installation
There are two ways to install g2p:
- install the latest published version of g2p, for use as is;
- install an editable version to create your own mappings or edit g2p yourself.
To install the latest published version of g2p, we recommend installing it using pip:
pip install g2p
If you are going to be creating your own mappings or editing g2p in any way, you must first fork g2p by going to https://github.com/roedoejet/g2p and forking the project to your own GitHub space.
Once you’ve forked g2p, clone your own fork:
git clone https://github.com/YourGitHubUsername/g2p.git
and then do an editable pip installation:
cd g2p && pip install -e .
Usage
You can use g2p programmatically within Python with the make_g2p function:
>>> from g2p import make_g2p
>>> transducer = make_g2p('dan', 'eng-arpabet')
>>> transducer('hej').output_string
'HH EH Y'
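If you want to convert several words in a script rather than in the interactive prompt, something like the following sketch might be a starting point. It only uses the make_g2p call and the output_string attribute shown above. The function name convert_words is hypothetical, and you need g2p installed to actually call it.

```python
def convert_words(words, in_lang="dan", out_lang="eng-arpabet"):
    """Convert each word with one shared transducer and return the outputs."""
    # Imported inside the function so this file can be loaded without g2p installed.
    from g2p import make_g2p

    # Build the transducer once and reuse it for every word.
    transducer = make_g2p(in_lang, out_lang)
    return [transducer(word).output_string for word in words]
```

Calling convert_words(["hej"]) should give you a one-item list with the same result as the interactive session above.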
Command Line Interface
You can also use g2p from the command line. The basic command for conversions is:
g2p convert <input_text> <in_lang> <out_lang>
So in practice:
$ g2p convert hej dan eng-arpabet
HH EH Y
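If you would rather call the command line tool from a Python script, a minimal sketch using the standard library's subprocess module could look like this. The two function names are made up for this post, and run_convert assumes the g2p command is available on your PATH.

```python
import subprocess


def build_convert_command(text, in_lang, out_lang):
    """Build the argument list for `g2p convert <input_text> <in_lang> <out_lang>`."""
    return ["g2p", "convert", text, in_lang, out_lang]


def run_convert(text, in_lang, out_lang):
    """Run the command and return its output as text (requires g2p on your PATH)."""
    command = build_convert_command(text, in_lang, out_lang)
    result = subprocess.run(
        command, capture_output=True, text=True, encoding="utf-8", check=True
    )
    return result.stdout.strip()
```

Passing the arguments as a list, rather than as one long string, means you don't have to worry about quoting special characters in the input text.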
Basic Rule writing
Each rule must be defined to have a sequence of one or more input characters and a sequence of zero or more output characters. We can define these rules in g2p either using a tabular spreadsheet format (CSV) or using a format called JavaScript Object Notation or JSON.
For example, rules written in the tabular comma-separated values (CSV) format:
in | out |
---|---|
a | æ |
or using JSON:
[
  {
    "in": "a",
    "out": "æ"
  }
]
Both of the above formats capture our first rule, which turns an orthographic ‘a’ into a broad IPA /æ/. You can choose to write your rules in either format, although JSON will offer you slightly more flexibility when writing advanced rules.
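If you start with CSV and later want to switch to JSON, a small script can do the conversion for you. The sketch below is only an illustration: the function name csv_rules_to_json is invented, and it assumes your CSV file has no header row and lists its columns in the order in, out, context_before, context_after.

```python
import csv
import io
import json

# Assumed column order for a CSV rules file with no header row.
COLUMNS = ["in", "out", "context_before", "context_after"]


def csv_rules_to_json(csv_text):
    """Turn CSV rule rows into the JSON list-of-objects form, skipping empty contexts."""
    rules = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # ignore blank lines
        # Always keep "in" and "out"; keep a context only if the cell is filled in.
        rule = {
            key: cell
            for key, cell in zip(COLUMNS, row)
            if cell or key in ("in", "out")
        }
        rules.append(rule)
    return json.dumps(rules, ensure_ascii=False, indent=2)


print(csv_rules_to_json("a,æ\n"))
```

Setting ensure_ascii=False keeps characters like æ readable in the output instead of turning them into escape codes.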
If we want to write rules that depend on a particular context, we need a couple more columns (CSV) or keys (JSON) than just in and out. This is where we use context_before and context_after. So, our second rule from above was to turn /k/ to [kʰ] when the character after /k/ is /æ/. Here, we could write the rules like this:
in | out | context_before | context_after |
---|---|---|---|
k | kʰ | | æ |
or like this using JSON:
[
  {
    "in": "k",
    "out": "kʰ",
    "context_after": "æ"
  }
]
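One way to picture what context_before and context_after do is with lookbehind and lookahead in a regular expression: the context has to be there, but it is not replaced. The following is a toy sketch using Python's re module, not g2p's own code. The function name apply_context_rule is hypothetical, and it treats the contexts as plain literal text.

```python
import re


def apply_context_rule(text, rule):
    """Replace rule["in"] with rule["out"] only where the contexts match.

    Toy illustration: contexts are treated as literal text, not as patterns.
    """
    before = re.escape(rule.get("context_before", ""))
    after = re.escape(rule.get("context_after", ""))
    pattern = re.escape(rule["in"])
    if before:
        pattern = f"(?<={before})" + pattern  # must be preceded by context_before
    if after:
        pattern = pattern + f"(?={after})"  # must be followed by context_after
    # A function is used as the replacement so the output is inserted literally.
    return re.sub(pattern, lambda match: rule["out"], text)


rule = {"in": "k", "out": "kʰ", "context_after": "æ"}
print(apply_context_rule("kæt", rule))  # kʰæt
print(apply_context_rule("kit", rule))  # kit (no æ after k, so nothing changes)
```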
Mapping configuration
When you combine multiple rules in g2p for a particular purpose, this is called a mapping. In addition to each file containing your rules, you need a configuration file that tells g2p how to process your rules. We write mapping configurations in YAML files titled config-g2p.yaml. “.yaml” is the file extension for YAML, which originally stood for ‘Yet Another Markup Language’. That might be how you’re feeling right now after having already learned about Comma Separated Value (CSV) files and JavaScript Object Notation (JSON)!
Here is a basic configuration for your mapping:
mappings:
  - language_name: My Test Language # this is a shared value for all the mappings in this configuration
    display_name: My Test Language to IPA # this is a 'display name'. It is a user-friendly name for your mapping.
    in_lang: test # This is the code for your language input. By convention in g2p this should contain your language's ISO 639-3 code
    out_lang: test-ipa # This is the code for the output of your mapping. By convention in g2p we suffix -ipa to the in_lang for mappings between an orthography and IPA
    type: mapping
    authors: # This is a way to keep track of who has contributed to the mapping
      - Aidan Pine
    rules_path: test_to_ipa.json # This is the path to a file containing your rules. It should be in the same folder as your config-g2p.yaml file
If you are familiar with YAML, you will see that you can have more than one mapping under the mappings key. With a second mapping added, the file would look like this:
mappings:
  - language_name: My Test Language
    display_name: My Test Language to IPA
    in_lang: test
    out_lang: test-ipa
    type: mapping
    authors:
      - Aidan Pine
    rules_path: test_to_ipa.json
  - language_name: My Test Language
    display_name: My Test Language IPA to Arpabet
    in_lang: test-ipa
    out_lang: test-arpabet
    type: mapping
    authors:
      - Aidan Pine
    rules_path: test_ipa_to_arpabet.json
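Notice that the out_lang of the first mapping (test-ipa) is the in_lang of the second, so the two mappings line up end to end. My understanding is that g2p can chain mappings whose codes connect like this, which is how a conversion such as dan to eng-arpabet can pass through intermediate steps. The toy sketch below shows the idea with a breadth-first search. It is not g2p's own code, and the function name find_conversion_path is invented for this post.

```python
from collections import deque


def find_conversion_path(mappings, start, goal):
    """Breadth-first search for a chain of codes leading from start to goal."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for mapping in mappings:
            next_code = mapping["out_lang"]
            if mapping["in_lang"] == path[-1] and next_code not in seen:
                seen.add(next_code)
                queue.append(path + [next_code])
    return None  # no chain of mappings connects the two codes


# Only the two keys that matter for chaining are kept here.
mappings = [
    {"in_lang": "test", "out_lang": "test-ipa"},
    {"in_lang": "test-ipa", "out_lang": "test-arpabet"},
]
print(find_conversion_path(mappings, "test", "test-arpabet"))
```

This is one reason to be careful with your in_lang and out_lang codes: a typo in either one breaks the chain.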
If you’re not familiar with YAML, and you’re not just copying and pasting from here, I recommend having a look at one of the many tutorials on how to use YAML properly before attempting to write your own mapping configuration, or looking at some of the examples of configurations in g2p/mappings/langs/*.
Bringing it all together
If you followed the previous two sections, you should have created two files: test_to_ipa.json and config-g2p.yaml. Your config-g2p.yaml file should look like this:
mappings:
  - language_name: My Test Language # this is a shared value for all the mappings in this configuration
    display_name: My Test Language to IPA # this is a 'display name'. It is a user-friendly name for your mapping.
    in_lang: test # This is the code for your language input. By convention in g2p this should contain your language's ISO 639-3 code
    out_lang: test-ipa # This is the code for the output of your mapping. By convention in g2p we suffix -ipa to the in_lang for mappings between an orthography and IPA
    type: mapping
    authors: # This is a way to keep track of who has contributed to the mapping
      - Aidan Pine
    rules_path: test_to_ipa.json # This is the path to a file containing your rules. It should be in the same folder as your config-g2p.yaml file
and your test_to_ipa.json file should look like this:
[
  {
    "in": "k",
    "out": "kʰ",
    "context_after": "æ"
  }
]
Then, type the following command in your command line or integrated terminal, replacing /path/to/config-g2p.yaml with the path from your current working directory to your config-g2p.yaml file:
g2p convert "kæt" test test-ipa --config /path/to/config-g2p.yaml
You should see the output kʰæt printed below the command. Congratulations! You did your first conversion with g2p. Try changing your rules around or converting other text, and experiment to see what happens.
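If you'd like to set up the two files with a script, for example to start each experiment from a clean copy, here is a minimal sketch. It writes the same configuration and rule shown above into a temporary folder. The helper name write_mapping_files and the use of a temporary folder are just choices made for this illustration.

```python
import json
import tempfile
from pathlib import Path

# Same configuration as above, without the explanatory comments.
CONFIG_TEXT = """\
mappings:
  - language_name: My Test Language
    display_name: My Test Language to IPA
    in_lang: test
    out_lang: test-ipa
    type: mapping
    authors:
      - Aidan Pine
    rules_path: test_to_ipa.json
"""

RULES = [{"in": "k", "out": "kʰ", "context_after": "æ"}]


def write_mapping_files(folder):
    """Write config-g2p.yaml and test_to_ipa.json side by side in `folder`."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    config_path = folder / "config-g2p.yaml"
    rules_path = folder / "test_to_ipa.json"
    config_path.write_text(CONFIG_TEXT, encoding="utf-8")
    rules_path.write_text(
        json.dumps(RULES, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return config_path, rules_path


config_path, rules_path = write_mapping_files(tempfile.mkdtemp())
# Print the command to run next, pointing at the file we just wrote.
print(f'g2p convert "kæt" test test-ipa --config {config_path}')
```

The script prints the g2p convert command with the path filled in, so you can copy it straight into your terminal.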
Footnotes
- See this link for more information on the difference between pip and pip3: https://www.pythonpool.com/pip-vs-pip3/