Fineen Computational linguist

How to use the new Convertextract application for 'quality control' of ELAN annotations

How to use the new Convertextract application for 'quality control' of ELAN annotations


Have you ever wanted to NOT spend hours tediously checking that k + ‘ is written as and not k’? If you said YES!, Convertextract is the app for you. With minimal technical knowledge, you can now systemically make your ELAN annotations consistent.

What you need to know to understand this post

I assume that you know some background about the g2p library. For the purposes of using these tools, a library is a collection of code and documentation, but if you would like to dig deeper you can check out this Wikipedia article The g2p library uses existing and custom mappings (i.e. arbitrary input->output conversions). For example, you might want k’ (input) to be converted to (output). The Mapping is the roadmap for converting. These conversions are arbitrary, so depending on your use case you may need to create new mappings. Most of the existing mappings convert Graphemes (a character in the writing system of a language) to Phonemes (their equivalent sound in the language), hence the name ‘g2p’. To see existing mappings click here To keep this post simple, I will not explain how to add new g2p mappings. The documentation for adding mappings is here

Who is involved with this project?

What is needed to replicate the content in the post?

  • g2p Mapping of the desired conversions
  • Language text to be converted
  • Convertextract app (read the post for installation!)

What are the motivations behind this technology?

As a Student Intern on the NRC’s Indigenous Language Technology (ILT) project, I was approached by the Kwak̓wala Corpus Collection group to help create a systemic way to streamline the quality control process for their ELAN annotation data. Having many different people with many different orthographic conventions (i.e. different ways of writing the same thing) all working on annotating Kwak̓wala language data had resulted in inconsistencies.

For example, there was four ways that people were writing t̓s:

  • t’s, t̕s, ts̓, ts’

So, I added mappings in the g2p library that took the alternative forms and streamlined them therefore producing only one form in the output.Then I added support for ELAN files in the Convertextract library, so that the process became automated. Aidan Pine then turned Convertextract into an app!

How to use the new Convertextract app for ‘quality control’ of ELAN annotations

Convertextract, created by Aidan Pine, is a python library which extracts text data and finds/replaces specific text based on arbitrary correspondences. Until now, only basic CLI (Command Line Interface) was supported. Using Convertextract in the CLI allowed the user to convert a file based on pre-existing Mappings in the g2p library or based on a custom Mapping (not described here). However, the downside is that some programming knowledge is needed to use the CLI. The latest update now includes a GUI (Graphical User Interface) in the form of an app (for Mac computers only). The app makes Convertextract more accessible for non-programmers.

1. G2P mapping

Convertextract will carry out the streamlining for you, but it has to know what to convert. The g2p Mapping is this roadmap. See the section What you need to know to understand this post for more information on how to see if your language is supported.

2. Language data

You language data must be in one of the supported file formats. The most recent addition is .eaf files, which allows ELAN annotations to be used! For a full list of supported file types click here.

3. Convertextract application


IMPORTANT The app works on Mac only!!!

To download the app:

In your downloads folder, find the .zip file and double click on it to unzip.

  • Downloads>convertextract

Right-click on the application in the dist folder and select Open.

  • Downloads>convertextract>dist>Convertextract

Note: If you try to double click to open the app, you will get a security message. Right-clicking to open will allow you to override the security message.

Using the app

This is what the app looks like when you open it.

convertextract GUI

All you have to do is add your language data, choose the encoding (usually ‘utf-8’ should suffice), and pick your g2p mapping! The output will be exported as a copy of the input file + _converted.ext in the filename.

Example case

When typing, there is more than one way to write in the Kwak̓wala language. Convertextract takes all of these possibilities and generates one output for the sake of consistency.

I used the following inputs for Convertextract:

  • Encoding: utf-8
  • Input_language: kwk-umista
  • Output_language: kwk-umista-con

Performing ‘quality control

  Input language Output language
Language code kwk-umista kwk-umista-con
Sample text kwak’wala kwak̓wala
  kwak]wala kwak̓wala

If you need help setting up the app or have any questions at all, please feel free to comment below or send me an email!

comments powered by Disqus