G2P Part 7: Contributing a new mapping to g2p for everyone to use
This is the last part of the seven-part series on g2p. In this part, we’ll discuss how to contribute your mappings to the main g2p library.
G2P Blog Series Index
- Background
- How to write a basic mapping in G2P Studio
- Writing mappings on your computer
- Advanced mappings
- ReadAlong Studio & Other Applications
- Preprocessing mappings
- Contributing
NOTE!
As of September 2023, there is a new version of g2p available: 2.0. The instructions in this blog were originally written for version 1.x. If you already have g2p installed, we recommend that you upgrade your installation before continuing with this post.
Advanced: contributing your rules to the main g2p library
So, you’ve written some cool rules and you want to contribute. That’s awesome! There are lots of benefits to contributing your mapping to g2p. First of all, once your mapping is accepted, it will be available and live on G2P Studio. Second, once the next version of g2p is released with your mapping, it will be automatically built into the Convertextract library. Third, if your mapping is between a language’s writing system and the IPA, you can also get ReadAlongs support for your language.
So, you write your mapping once, and you get three things for free (G2P Studio, Convertextract and ReadAlongs). Here’s how:
- Fork g2p; see https://docs.github.com/en/github/getting-started-with-github/fork-a-repo for more details.
- Add a folder for your language, named with the appropriate ISO 639-3 code, to g2p/mappings/langs, i.e., create the folder g2p/mappings/langs/<yourlangcode>/
- Add a config-g2p.yaml file as described here in that folder.
- Add your mapping in that same folder.
- If your mapping goes to IPA, you can optionally run g2p update to update your mapping into g2p and then generate the mapping between your language and English IPA, as described in the ReadAlongs post.
- Run g2p update to add your mapping to g2p.
- Add some test data to g2p/tests/public/data.
- Submit your changes by creating a pull request.
Finally, either I or somebody else will review the changes, and you will get credit for those mappings and be added to the list of contributors.
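The folder-creation step can be sketched in a few lines of Python. This is only an illustration, not part of g2p: the helper name make_lang_folder is hypothetical, and the check only confirms that the code has the shape of an ISO 639-3 code (three lowercase letters), not that it is actually assigned to your language. The demo runs in a temporary directory so it does not touch a real checkout.

```python
import re
import tempfile
from pathlib import Path

# ISO 639-3 codes are three lowercase letters; this checks the shape only.
ISO_639_3_SHAPE = re.compile(r"[a-z]{3}")


def make_lang_folder(repo_root, lang_code):
    """Create g2p/mappings/langs/<lang_code>/ under repo_root and return its path.

    Hypothetical helper for illustration; g2p itself does not ship it.
    """
    if not ISO_639_3_SHAPE.fullmatch(lang_code):
        raise ValueError(f"{lang_code!r} does not look like an ISO 639-3 code")
    lang_dir = Path(repo_root) / "g2p" / "mappings" / "langs" / lang_code
    lang_dir.mkdir(parents=True, exist_ok=True)
    return lang_dir


# Demo in a throwaway directory standing in for your fork of g2p.
with tempfile.TemporaryDirectory() as fake_repo:
    folder = make_lang_folder(fake_repo, "fra")
    print(folder.relative_to(fake_repo).as_posix())
```

In a real fork you would pass the path of your local clone instead of the temporary directory, then add config-g2p.yaml and your mapping to the folder it returns.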
Adding tests
Testing your work is a really important part of software engineering. It lets us make changes to code and be confident that new features don’t break the expected functionality of g2p. In order to add tests for your mapping, you can add a CSV/TSV/PSV file with 4 columns to g2p/tests/public/data. The name of the file should be just the input language code, for example fra.psv for the French tests. The first column in the file is for the input language code, the second is for the output language code, the third is for the input text and the fourth is for the expected output of that mapping and input. Here is an example between French (fra) and French IPA (fra-ipa) asserting that ‘manger’ results in ‘mɑ̃ʒe’ and ‘écoutons’ results in ‘ekutɔ̃’:
fra|fra-ipa|manger|mɑ̃ʒe
fra|fra-ipa|écoutons|ekutɔ̃
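Before committing a test file, it can help to confirm that every row really has four columns. The following is a minimal sketch, assuming a pipe-delimited (PSV) file like the example above; the function name check_rows is hypothetical and is not part of g2p or its test suite.

```python
import csv
import io


def check_rows(text, delimiter="|"):
    """Return (line_number, message) pairs for rows that do not have 4 columns."""
    problems = []
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, quoting=csv.QUOTE_NONE)
    for line_number, row in enumerate(reader, start=1):
        if not row:
            continue  # ignore blank lines
        if len(row) != 4:
            problems.append((line_number, f"expected 4 columns, found {len(row)}"))
    return problems


sample = "fra|fra-ipa|manger|mɑ̃ʒe\nfra|fra-ipa|écoutons|ekutɔ̃\n"
print(check_rows(sample))  # prints [] because both rows have 4 columns
```

For a real file, you could read its contents with UTF-8 encoding and pass the text to the same function, changing the delimiter for CSV or TSV files.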
There is a script for running tests at the root of the g2p project called run_tests.py. You can run all of the tests using the following:
python run_tests.py all
or run just the language assertions, including tests like the ones shown above, using:
python run_tests.py langs
Writing g2p mappings that handle all the special cases can be quite tricky, especially when there are potential interactions between rules. To be confident that your g2p mappings work as you think, you should add a bunch of different words covering most of the spelling phenomena of the language you’re working on, with their expected IPA mapping. Ideally, you should also add some test cases to eng-ipa and eng-arpabet, to make sure the generated mapping works correctly. If you run into difficulties, feel free to post comments on this blog post or on the g2p library GitHub issues page!
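While you are still drafting your word list, a small loop that compares expected and actual output can show which words disagree. This is a rough sketch rather than how run_tests.py works internally: failing_rows and toy_convert are hypothetical names, and toy_convert is a hard-coded stand-in so the example runs without g2p installed. With g2p installed, a converter could be built on make_g2p, as noted in the comment.

```python
def failing_rows(rows, convert):
    """Return (text, expected, actual) for every row whose output differs.

    rows: iterable of (in_lang, out_lang, text, expected) tuples,
          i.e. the four columns of a test data file.
    convert: callable taking (in_lang, out_lang, text) and returning a string.
    """
    failures = []
    for in_lang, out_lang, text, expected in rows:
        actual = convert(in_lang, out_lang, text)
        if actual != expected:
            failures.append((text, expected, actual))
    return failures


# Stand-in converter so this sketch runs on its own. With g2p installed,
# something like the following should work in its place:
#   from g2p import make_g2p
#   convert = lambda i, o, t: make_g2p(i, o)(t).output_string
def toy_convert(in_lang, out_lang, text):
    table = {"manger": "mɑ̃ʒe", "écoutons": "ekutɔ̃"}
    return table.get(text, text)  # unknown words come back unchanged


rows = [
    ("fra", "fra-ipa", "manger", "mɑ̃ʒe"),
    ("fra", "fra-ipa", "écoutons", "ekutɔ̃"),
    ("fra", "fra-ipa", "bonjour", "bɔ̃ʒuʁ"),  # not in the toy table, so it fails
]
for text, expected, actual in failing_rows(rows, toy_convert):
    print(f"{text}: expected {expected}, got {actual}")
```

Passing the converter in as a parameter keeps the comparison logic independent of g2p, so the same loop works with the stand-in here and with a real transducer later.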