How to Train a Custom MT Engine
And how to become a trainer.
If a company that has so far relied on human translation for its documents wants to move to machine translation, what steps must it take to train its own MT engine?
This question is especially relevant as more companies integrate MT into their localisation workflows and bring post-editors into the process.
If a company wants to train its own custom MT engine, rather than just using generic engines like Google Translate or DeepL, here are the main steps it must go through:
1. Assess the Need and Define Objectives
Why MT? Faster delivery, cost reduction, scalability, etc.
Where to apply it? E.g., internal documentation, support content, product manuals, etc.
2. Collect and Organise Translation Data
Bilingual corpora (source + human translations) are gold. The more, the better.
Ideally, this includes:
Translation Memories (TMX format if possible)
Parallel documents
Legacy translated material
Monolingual corpora (target language only) can improve fluency.
Quality matters more than quantity—bad data = bad engine.
IMPORTANT
When training machine translation engines on client documents, data protection and confidentiality are essential. Translation data frequently contains sensitive or personally identifiable information, and mishandling it can lead to privacy violations or unauthorised exposure. To preserve customer trust and comply with data protection laws, data must be anonymised or pseudonymised, stored and transmitted securely, and restricted to authorised individuals. Strong data protection practices not only mitigate risk but also help maintain credibility and legal compliance.
Licensing and copyright issues are also fundamental when using data for MT training. Not all datasets are freely available for research or commercial use, and using copyrighted material without the required authorisation may be illegal. Always confirm your data’s licensing terms before training an MT engine, and verify that you are permitted to use, copy and, if necessary, distribute it. When in doubt, either request permission from the copyright holder or use datasets with explicit, permissive licences. Ignoring these concerns can have legal repercussions and undermine the legitimacy of your MT workflows.
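To make the anonymisation point above more concrete, here is a minimal Python sketch of one way to pseudonymise obvious personal data in a parallel corpus before training. The regular expressions and the file names source.txt / target.txt are purely illustrative assumptions; real projects need far more robust detection (names, addresses, account numbers) or dedicated anonymisation tooling.

import re

# Very rough patterns for illustration only; production anonymisation
# needs much more robust detection than this.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def pseudonymise(line: str) -> str:
    """Replace obvious PII with neutral placeholder tokens."""
    line = EMAIL.sub("<EMAIL>", line)
    line = PHONE.sub("<PHONE>", line)
    return line

# Apply the same masking to both sides of the corpus (placeholder file names).
for side in ("source.txt", "target.txt"):
    with open(side, encoding="utf-8") as f:
        masked = [pseudonymise(l.rstrip("\n")) for l in f]
    with open(side.replace(".txt", ".masked.txt"), "w", encoding="utf-8") as f:
        f.write("\n".join(masked) + "\n")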
3. Clean and Prepare the Data
Remove:
Misaligned segments
Duplicates
Poor translations
Standardise formats and terminology (especially if multiple vendors were involved).
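Much of this clean-up can be scripted. The following is a minimal sketch, assuming plain-text aligned files named source.txt and target.txt; the length-ratio threshold used to catch misalignments is an illustrative assumption and should be tuned for your own data.

# Minimal corpus filtering sketch: drop empty segments, exact duplicates,
# and pairs whose length ratio suggests a misalignment.
MAX_RATIO = 3.0

seen = set()
kept = []
with open("source.txt", encoding="utf-8") as src, open("target.txt", encoding="utf-8") as tgt:
    for s, t in zip(src, tgt):
        s, t = s.strip(), t.strip()
        if not s or not t:
            continue                      # empty segments
        if (s, t) in seen:
            continue                      # duplicates
        ratio = max(len(s), len(t)) / max(1, min(len(s), len(t)))
        if ratio > MAX_RATIO:
            continue                      # probable misalignment
        seen.add((s, t))
        kept.append((s, t))

with open("source.clean.txt", "w", encoding="utf-8") as fs, \
     open("target.clean.txt", "w", encoding="utf-8") as ft:
    for s, t in kept:
        fs.write(s + "\n")
        ft.write(t + "\n")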
4. Choose the Right MT Technology
Two main options:
Customisable MT engines from providers like:
Amazon Translate
Google AutoML
DeepL for Teams
SYSTRAN, ModernMT, Intento, KantanMT
Open-source NMT (Neural Machine Translation) frameworks, e.g., OpenNMT or MarianNMT = full control, but they require in-house expertise and infrastructure.
5. Train the Engine
Feed the MT engine with the cleaned data.
Training requires significant computational resources; cloud services with GPU support often help.
It may take hours to days, depending on the amount of data and the model.
What about style? When training or fine-tuning an MT engine, it is not currently possible to “teach” it to follow a complex style guide like the Chicago Manual of Style in its entirety, at least not in the explicit, rule-based way a human editor follows a checklist. Still, if your bilingual corpora contain target-language texts consistently written in the desired style, the model will implicitly learn to imitate that style over time.
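As a very rough illustration of what “feeding clean data to the engine” can look like in practice, here is a condensed fine-tuning sketch using Hugging Face Transformers and an existing MarianMT checkpoint. This is only one possible route (cloud platforms hide these details entirely); the checkpoint name, file names and hyperparameters are placeholder assumptions, and exact arguments vary between library versions.

from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

checkpoint = "Helsinki-NLP/opus-mt-en-fr"            # placeholder language pair
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Cleaned, aligned files from the previous step (placeholder names).
src = open("source.clean.txt", encoding="utf-8").read().splitlines()
tgt = open("target.clean.txt", encoding="utf-8").read().splitlines()
dataset = Dataset.from_dict({"src": src, "tgt": tgt})

def encode(batch):
    # Tokenise source and target together; truncation keeps segments manageable.
    return tokenizer(batch["src"], text_target=batch["tgt"],
                     truncation=True, max_length=128)

dataset = dataset.map(encode, batched=True, remove_columns=["src", "tgt"])

args = Seq2SeqTrainingArguments(output_dir="custom-mt", num_train_epochs=3,
                                per_device_train_batch_size=16,
                                fp16=True)            # fp16 assumes a GPU is available
trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=dataset,
                         data_collator=DataCollatorForSeq2Seq(tokenizer, model=model))
trainer.train()
trainer.save_model("custom-mt/final")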
6. Evaluate and Fine-Tune
Use BLEU scores or TER for automated evaluation—but always combine with human evaluation.
Adjust for domain, terminology, and tone.
Iterate with new or corrected data.
7. Integrate into the Workflow
Connect MT to CAT tools (e.g., memoQ, SDL Trados, Phrase).
Train and brief post-editors.
Use glossaries and term bases to control terminology.
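As a small illustration of terminology control, the sketch below checks MT output against a hypothetical two-column term base (termbase.csv) and flags segments where an approved target term is missing. The file name, column layout and example segments are assumptions, not part of any particular CAT tool.

import csv

# Hypothetical term base: source_term,approved_target_term
with open("termbase.csv", encoding="utf-8") as f:
    termbase = [(s.lower(), t.lower()) for s, t in csv.reader(f)]

def missing_terms(source_segment, mt_segment):
    """List approved target terms that should appear in the MT output but do not."""
    return [(s, t) for s, t in termbase
            if s in source_segment.lower() and t not in mt_segment.lower()]

# A QA step or post-editor briefing can use this to flag segments that break terminology.
issues = missing_terms("Open the control panel.", "Ouvrez le panneau.")
if issues:
    print("Terminology to check:", issues)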
8. Monitor, Retrain, and Maintain
Continuously collect feedback and corrections.
Feed post-edited data back into the training set.
Periodically retrain to improve output and adapt to changes in content/style.
Feeding the MT Engine
Format the Data
Your bilingual data (source + target language) needs to be in a structured format:
Most commonly: TMX (Translation Memory eXchange) or parallel plain text files (e.g., aligned source.txt and target.txt).
Each line should correspond to one translation unit (usually a sentence or segment).
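As an illustration, here is a rough sketch that converts a TMX file into two aligned plain-text files. It assumes a TMX 1.4-style file in which each tuv element carries an xml:lang attribute; the file names and language codes are placeholders, and inline tags inside segments are simply ignored.

import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
SRC_LANG, TGT_LANG = "en", "fr"           # placeholder language codes

tree = ET.parse("memory.tmx")             # placeholder file name
pairs = []
for tu in tree.getroot().iter("tu"):
    segs = {}
    for tuv in tu.iter("tuv"):
        lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
        seg = tuv.find("seg")
        if seg is not None and seg.text:
            segs[lang.split("-")[0]] = " ".join(seg.text.split())
    if SRC_LANG in segs and TGT_LANG in segs:
        pairs.append((segs[SRC_LANG], segs[TGT_LANG]))

# One translation unit per line, source and target kept in step.
with open("source.txt", "w", encoding="utf-8") as fs, \
     open("target.txt", "w", encoding="utf-8") as ft:
    for s, t in pairs:
        fs.write(s + "\n")
        ft.write(t + "\n")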
Clean and Preprocess
Remove:
– Empty segments
– Misalignments
– Special characters, formatting codes (like tags or XML)
Normalise punctuation, numbers, date formats, etc.
Tokenise and truecase (some MT engines handle this automatically). More about this later.
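A small sketch of the tag removal and normalisation steps above; the exact rules for quotes, entities and whitespace are assumptions and depend on your languages and style guide.

import html
import re
import unicodedata

TAG = re.compile(r"<[^>]+>")              # inline XML/HTML-style tags

def normalise(segment: str) -> str:
    """Strip markup and normalise punctuation and whitespace (illustrative rules only)."""
    segment = html.unescape(segment)                       # &amp; -> &
    segment = TAG.sub(" ", segment)                        # drop formatting tags
    segment = unicodedata.normalize("NFC", segment)
    segment = segment.replace("“", '"').replace("”", '"').replace("’", "'")
    segment = re.sub(r"\s+", " ", segment).strip()
    return segment

print(normalise("<b>She’s</b>  going to   school."))       # She's going to school.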
Upload/Import into the MT Training Tool
Depending on the tool, this could be:
A cloud-based interface (e.g., DeepL API, Google AutoML Translation, KantanMT)
A local script or command-line process if you’re using an open-source engine (like OpenNMT)
You’ll typically:
– Choose language pair(s)
– Upload your parallel corpus
– Add optional glossaries or custom tags
– Configure parameters like domain or model type
Train the Model
The tool uses the uploaded data to “teach” the model how to translate in that specific language pair and domain.
This process can take hours or even days, depending on:
– Volume of data
– Engine type
– Hardware capacity (GPUs are often required)
Test and Validate
After training, you usually test the engine with a separate evaluation set that was not part of the training data.
Measure quality with BLEU, TER, or manual post-editing tests.
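For the automatic part of this evaluation, the sacreBLEU library provides standard BLEU and TER implementations. A minimal sketch, assuming one plain-text file of MT output and one of reference translations (the file names are placeholders):

from sacrebleu.metrics import BLEU, TER    # pip install sacrebleu

# Hypotheses = MT output for the held-out test set; references = human translations.
hypotheses = open("test.mt.txt", encoding="utf-8").read().splitlines()
references = [open("test.ref.txt", encoding="utf-8").read().splitlines()]

print("BLEU:", BLEU().corpus_score(hypotheses, references).score)
print("TER: ", TER().corpus_score(hypotheses, references).score)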
More about pre-processing
Tokenisation and truecasing are common preprocessing steps in machine translation and natural language processing (NLP). Here’s what they mean:
– Tokenise
Definition:
Tokenisation is the process of breaking text into smaller units called tokens. These tokens are usually words, punctuation marks, or symbols.
Why it’s important:
Machine translation models process text more effectively when it’s divided into clear units. Tokenisation helps the model understand where words start and end.
Example:
Original sentence:
She’s going to school.
Tokenised:
She ’ s going to school .
Each word and punctuation mark becomes a separate token. Some tokenisers also split contractions, as shown above.
– Truecase
Definition:
Truecasing is the process of restoring the correct casing (capitalisation) of words, especially when you’re working with lowercased or inconsistent text.
Why it’s important:
In many corpora (like subtitles or web crawls), text may be inconsistently capitalised. Truecasing ensures that words appear with the most likely casing based on context.
Example:
Input:
the president visited washington.
Truecased:
The President visited Washington.
How it’s done:
A truecaser uses a trained model to predict whether each word should be capitalised based on the surrounding context.
These steps make the text cleaner and more uniform, helping the MT engine better learn patterns during training.
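To make the truecasing idea concrete, here is a toy frequency-based truecaser: it learns each word’s most common surface form from a corpus and restores that casing, capitalising the sentence start. Real truecasers (such as the Moses one) are more sophisticated and use context, so treat this only as a sketch; the corpus file name is a placeholder.

from collections import Counter, defaultdict

# Learn each word's most frequent surface form from a (placeholder) corpus.
counts = defaultdict(Counter)
with open("corpus.txt", encoding="utf-8") as f:
    for line in f:
        for tok in line.split():
            counts[tok.lower()][tok] += 1

best_form = {w: c.most_common(1)[0][0] for w, c in counts.items()}

def truecase(sentence: str) -> str:
    """Restore the most likely casing seen in the corpus, then capitalise the first word."""
    tokens = [best_form.get(t.lower(), t) for t in sentence.split()]
    if tokens:
        tokens[0] = tokens[0][:1].upper() + tokens[0][1:]
    return " ".join(tokens)

print(truecase("the president visited washington ."))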
Other Tools
If you’re using a neural MT framework like OpenNMT, Marian, or Fairseq, they often include or expect:
Preprocessing with similar steps (SentencePiece is now popular for subword tokenisation).
Truecasing may not be necessary if using subword units and case features.
What is SentencePiece?
SentencePiece is a tokeniser and text processor widely used in modern neural machine translation (NMT) and natural language processing (NLP). It works differently from traditional tokenisers like Moses.
SentencePiece splits text into subword units, which are smaller than words but larger than characters. This is particularly useful for handling:
Rare or unseen words
Morphologically rich languages
Misspellings or informal language
Unlike traditional tokenisers, SentencePiece doesn’t require pre-tokenisation with spaces. It treats the input as a raw stream of characters and learns how to break it up.
Why is it used in NMT?
Modern MT systems (like Google Translate, OpenNMT, or Marian) often use subword models to improve translation quality and vocabulary coverage.
SentencePiece:
Reduces vocabulary size (more efficient training)
Handles out-of-vocabulary (OOV) words better
Is language-independent (works on Chinese, Japanese, etc., without special rules)
How does it work?
Training phase – SentencePiece learns a vocabulary of subword units from a large corpus.
Encoding phase – It applies those subwords to segment new sentences.
Example
Input sentence:
unbelievably good performance
SentencePiece output (subwords):
▁unbeliev ably ▁good ▁performance
The ▁ symbol represents a space and marks a word boundary.
“unbelievably” is split into “unbeliev” and “ably” – this helps the model handle the word even if it is rare.
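In practice, the sentencepiece Python package handles both the training and encoding phases. A minimal sketch follows; the corpus file, vocabulary size and resulting subword splits are illustrative assumptions, since the actual pieces depend on the training data.

import sentencepiece as spm    # pip install sentencepiece

# Training phase: learn a subword vocabulary from a (placeholder) corpus file.
# vocab_size assumes the corpus is large enough to support it.
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="spm_demo",
    vocab_size=8000, model_type="unigram")   # unigram is SentencePiece's default algorithm

# Encoding phase: segment new sentences with the learned model.
sp = spm.SentencePieceProcessor(model_file="spm_demo.model")
pieces = sp.encode("unbelievably good performance", out_type=str)
print(pieces)                  # e.g. ['▁unbeliev', 'ably', '▁good', '▁performance']
print(sp.decode(pieces))       # back to the original string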
Common algorithms behind SentencePiece:
BPE (Byte-Pair Encoding)
Unigram Language Model (default in SentencePiece)
Used in:
Google’s T5 and BERT models
OpenNMT, MarianNMT
Any Transformer-based translation or language model that benefits from subword tokenisation
What is a Transformer? Transformers use the self-attention mechanism to process the whole input sequence at once, unlike standard recurrent neural networks (RNNs), which analyse text one piece at a time.
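To make “self-attention over the whole sequence at once” slightly more concrete, here is a bare-bones NumPy sketch of scaled dot-product self-attention. Real Transformers add learned query/key/value projections, multiple attention heads and many stacked layers; the dimensions below are arbitrary.

import numpy as np

def self_attention(X):
    """Toy scaled dot-product self-attention: queries, keys and values are X itself."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                     # every token attends to every token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the whole sequence
    return weights @ X                                # weighted mix of all positions at once

X = np.random.rand(5, 8)          # 5 tokens, 8-dimensional embeddings
print(self_attention(X).shape)    # (5, 8)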
Who Provides MT Training Services?
Big Companies and Language Service Providers (LSPs), for example, Translated, Unbabel, SDL/RWS, Appen.
These companies offer custom MT training as part of larger translation workflows, often tailored to enterprise needs.
They handle everything from data preparation to deployment and integration.
Freelancers and Small Tech Teams
Skilled freelancers can train and fine-tune MT models, especially for niche domains or specific language pairs.
Typical freelance backgrounds:
– Computational linguists
– Tech-savvy translators with NLP skills
– AI/ML freelancers with a language focus
Where Do Freelancers Learn These Skills?
Online Courses and Certifications
Coursera: Neural Machine Translation
Udemy: Practical NMT with OpenNMT, MarianNMT, or Hugging Face
edX / FutureLearn: Intro to NLP or Applied AI courses
Specialised MT Tool Documentation
MarianNMT
OpenNMT
Fairseq
Hugging Face Transformers
GitHub & Open-Source Communities
Many learn by experimenting with open-source MT engines and engaging in developer forums or communities like Stack Overflow, Reddit (r/MachineLearning), or Hugging Face’s Discord.
University Programs or Research Labs
Some translators with academic ties take part in NLP research projects or Master’s programs in Computational Linguistics or AI.