☆ Best practices for preparing XLIFF files¶

Document history¶

Date	Author	Comment
2018-01-30	Manuel Souto Pico	Creation
2018-02-02	Manuel Souto Pico	Encoding entities
2018-05-25	Manuel Souto Pico	Segmentation
2021-01-27	Manuel Souto Pico	Review and updates
2022-03-21	Manuel Souto Pico	Updated OMT project (SVN to Github) and OmegaT version (4.2 to 5.7)
2022-03-23	Manuel Souto Pico	Updated credentials of dummy user

Table of contents¶

Document history
Introduction
- Notation
- Sample project
1. Requirements
2. Recommendations
3. Common issues
Annexes
- Guidelines for creating XLIFF 1.2 files for OmegaT
  - Check list for translation
  - Check list for bilingual review
References

Introduction¶

This report includes recommendations for preparing content for translation in the XLIFF format with a view to producing XLIFF files that optimize the different language tasks and language asset management both in the short and the long run. It also focuses on common issues that can be problematic for language experts and how those issues can be avoided.

Often, there are no ultimate prescribed solutions, but following or not the recommendations in this document can make all the difference between hindering or crippling language tasks (the work of translators, reconcilers, reviewers, verifiers, etc.) and making their work enjoyable and bound to good results.

Info

cApStAn can take care of file preparation regardless of the native format provided that the client provides the original source files, and often that yields better results than what third-party engineers often can do. If you are a third-party engineer or a client, beware that the contents of this guide apply to you only after the possibility of relying on cApStAn for file preparation has been considered and discareded (for a good reason).

Notation¶

In this document, segments as the linguist sees them in the translation editor are represented as follows (preceded by the segment number):

31 This is a segment

Also, the cross mark (❌) and check mark (✔️) emojis are used to indicate after each exale what is discouraged or problematic and expected or recommended, respectively. These icons are not part of the text!

Sample project¶

To illustrate the DO's and DON'T's, an OmegaT sample project is provided, containing:

the original files (XML, HTML, SVG, etc.), in folder original
the problematic XLIFF files, in folder source/01_haram
the optimal XLIFF files, in folders source/02_halal_xlf1 (for XLIFF version v.1.2)
the optimal XLIFF files, in folders source/02_halal_xlf2 (for XLIFF version v.2.0)

The original files have been prepared as XLIFF in two different ways (so-called here "halal" and "haram"), to show the ideal preparation and its antithesis, respectively. This is the file structure in the project:

xliff_bestpractices_omtprj (project)
.
├── dictionary
├── glossary
├── manifest.rkm
├── omegat
│   └── ...
├── omegat.project
├── original
│   ├── entities.html
│   ├── markup_custom.xml
│   ├── markup_inline.svg
│   ├── markup_input.html
│   ├── markup_span.html
│   └── segmen_para.html
├── source
│   ├── 01_haram
│   │   ├── entities.html.xlf
│   │   ├── markup_custom.xml.xlf
│   │   ├── markup_inline.svg.xlf
│   │   ├── markup_input.html.xlf
│   │   ├── markup_span.html.xlf
│   │   └── segmen_para.html.xlf
│   └── 02_halal_xlf#
│       ├── entities.html.xlf
│       ├── markup_custom.xml.xlf
│       ├── markup_inline.svg.xlf
│       ├── markup_input.html.xlf
│       ├── markup_span.html.xlf
│       └── segmen_para.html.xlf
├── target
└── tm

You can open the XLIFF files and look at them in a text editor, but probably the best way to see the impact of the different preparation is to open the project in OmegaT and compare each haram file with its corresponding halal equivalent(s). The project can be downloaded twice with different names and two instances of OmegaT can be run simultaneously for a handier comparison.

To open the project in OmegaT:

Install and customize OmegaT 5.7 as per our installation and customization guide [1]
In OmegaT, go to Project > Download team project and enter the following details:
- Repository URL: https://github.com/capstanlqc/xliff_bestpractices_omt
- New local project folder: your preferred path to your local copy of the project.

That will create a local version of the project for you and open it. To open files in the translation editor, you may press Ctrl+L and then either type part of the file name or just click on a file to select it and open it. Typing also helps to filter the list of files.

Info

The project is public, therefore OmegaT should not ask for credentials to download it. It may ask for credentials however if you try to save changes. Please remember that this project is just to show issues in file preparation, the memory is not meant to be writeable.

The sample XLIFF files in the project above or the project itself can be re-created using Okapi Rainbow.[2] The Rainbow project, including the settings files, filters, segmentation rules etc., can also be downloaded from this link okapi_rainbow_project.tar.gz, although this is not necessary unless you want to recreate or customize the extraction process. Instructions to use the Rainbow project can be found in the README.md file.

1. Requirements¶

When creating XLIFF files, the only strict technical requirement is to create well-formed and valid XLIFF files, according to the XML syntax and the XLIFF specification [3]. Created XLIFF files can be validated with the strict XML schema [4] or using the XLIFF Checker [5].

Well-formedness and validity are the bare minimum, but of course it is perfectly possible to produce XLIFF files that are valid and compliant with the XLIFF standard but that are not translation-friendly. The main purpose of this report is to promote some best practices, upon which the following recommendations are based.

2. Recommendations¶

Based on the localization industry's best practices for preparing files for translation and on cApStAn's experience in localization of international large-scale assessments and questionnaires, we can define a number of important recommendations or expectations.

First and foremost, one generic recommendation:

Use the existing technology. There's a rich array of tools and libraries in the localization industry, both commercial and open source, that embody the know-how accumulated over the last decades. If you try to reinvent the wheel, it'll take you much longer and you'll achieve worse results. Use what exists, and improve it if you can, and only then develop your own technology if you see that what

Some of the tools you could use to manage the XLIFF roundtrip (extraction and merge):

Maxprograms' OpenXLIFF filters / XLIFF Manager (open source)
Okapi Rainbow (open source)
memoQ (commercial)
SDL Trados Studio (commercial)

Let's now be a bit more specific, about segmentation and inline codes:

Text must be segmented by sentence. Each segment should contain one single sentence, not a full paragraph or several sentences.
All translatable content must be extracted and all untranslatable content must be excluded during the extraction.
Inline codes should not be interpreted as end of paragraph or as boundaries between text units/blocks, to avoid breaking down sentences in fragments.
Inline codes and markup should ideally be represented using the notation specified by the XLIFF standard so that the translation editor can recognize, lock and display them as placeable tags.
If markup is not represented as tags and is escaped instead, then each markup block should be as short as possible, and custom tags might need to be created to protect escaped markup.
The number of tags should be as low as possible.
Some clean-up of the source files or some back and forth between localization engineers and source content authors is sometimes necessary).

Failure to follow those recommendations hampers language tasks or/and could affect translation quality.

There are other, recommendations about characters and whitespace might have a lesser impact on their own, but recurrent and concurrent failure to follow them may also contribute to making the translation process more difficult:

Unicode characters should be used instead escaped HTML entities.
Excessive whitespace and inline line breaks should be avoided.
To avoid truncations and overflowing text, text should be wrapped dynamically in the publication medium instead of using line break codes in the source text.
Comments should not be extracted as translatable text.

2.1. Segmentation¶

Translating or reviewing long paragraphs containing several sentences is inconvenient for several reasons.

On the one hand, it requires a higher cognitive effort from the linguist, which reduces productivity and increases the chances of errors, and on the other hand, it reduces the likelihood of propagation and internal reuse of translations within the project, which leads to rework (reducing turnaround time and savings) and to a higher number of inconsistencies, that are difficult to catch or fix afterwards.

Sentence-based segmentation boosts the reuse of translations, through the propagation of the translation of a repeated segment to all its repetitions, or through the higher availability of matches to translate similar segments, thus increasing consistency, and reduces the need for the linguist to run concordance searches in search of the different parts that need to be assembled in the final translation. Segmentation contributes both to final quality and satisfaction of the team.

Segmentation should occur after each sentence, so that each sentence is included in one segment and each segment contains only one sentence. For example, a paragraph typically contains several sentences, like the following:

Lorem ipsum dolor sit amet, consectetur adipiscing elit… Aliquam ex nisi, mattis pulvinar nulla sed, commodo mattis ligula. Nulla sit amet leo lacinia, pellentesque mi non, aliquam augue? Pellentesque tempor dictum dui in imperdiet. Fusce ligula arcu, hendrerit eu dignissim eget, consequat quis sem! Maecenas eget ligula dapibus, dictum purus vitae, sodales neque.

If the text is segmented, this long paragraph can be handled as independent sentences and therefore becomes much more manageable for the linguist. The expected result is:

1 Lorem ipsum dolor sit amet, consectetur adipiscing elit…
2 Aliquam ex nisi, mattis pulvinar nulla sed, commodo mattis ligula.
3 Nulla sit amet leo lacinia, pellentesque mi non, aliquam augue?
4 Pellentesque tempor dictum dui in imperdiet.
5 Fusce ligula arcu, hendrerit eu dignissim eget, consequat quis sem!
6 Maecenas eget ligula dapibus, dictum purus vitae, sodales neque.

To implement segmentation, you must use segmentation rules. Different tools might have slightly different implementations, but they all use regular expressions to match the patterns that correspond to sentence boundaries. SRX [6] is an XML-based standard of the localization industry used to define segmentation rules, and it can be used by Okapi Framework [7]. Segmentation rulesets can be easily created and customized in Okapi Ratel [8].

Existing libraries or tools used to prepare files as XLIFF normally include a basic set of default rules which often cover most of the needs, which can be varied. As you can see in the example above, segments do not only end in full stop, but might end in other punctuation marking the end of a sentence (i.e. in English: interrogation, ellipsis, etc.).

Sometimes it is necessary or convenient to create more specific rules to meet the specific needs of the source content, which is relatively easy in SRX. For example, if it's necessary to split the text in some context that the default rules do not cover. Or there are cases where punctuation symbols do not stand for the end of a sentence, such as in the case of abbreviations:

1 NBC canceled Mr. Robinson, the freshman comedy series.

To avoid segmenting after abbreviations and in other similar cases, exceptions are necessary to prevent or mask the general rule. Without the appropriate exceptions for abbreviations, we would obtain incorrect segmentations such as:

1 NBC canceled Mr.
2 Robinson, the freshman comedy series.

Luckily, there again, default rulesets used by available tools already include the most frequent abbreviations.

☞ In the sample OmegaT project provided, file 01_haram/segment_para.html.xlf shows a text that has been prepared without segmentation, whereas file 02_halal/segment_para.html.xlf has been prepared with sentence-based segmentation.

2.2. Inline codes¶

The source content might include inline codes, e.g. any HTML markup tags used to apply a certain behavior or property to part of the text. Often source content is HTML or some sort of similar markup language, where markup is used to define layout, formatting and/or structure. The preparation of the source files as XLIFF entails dealing with those codes as appropriate.

Codes can be of two kinds:

Suprasentential or intersentential codes (i.e. codes that embed a sentence, or stand outside of a sentence, or between sentences, or operate at a higher level than the sentence) should not be included in segments.
Intrasentential codes (i.e. codes that are included inside a sentence, often as spanning codes or code pairs) must be represented as inline elements (also called "content markup") according to the guidelines of the XLIFF specification [9].

The CAT tool will then correctly display inline elements as placeable tags, which translators can easily insert in the appropriate position in the translation of each segment. In the contrary, it is problematic for translators and reviewers to deal with inline codes that have not been protected as placeable tags, which also poses a risk to the integrity of the codes and the document.

2.2.1. Suprasentential codes¶

A leading opening tag appearing before the beginning of a sentence and its corresponding closing tag appearing after the end of the sentence are an example of suprasentential codes:

<p class="foo">What is the total length of the sticks in the line?</p>

Since they are expected to appear in exactly the same position in the target version of that segment, they don't need to appear in the translation editor and the translator does not need to see them:

1 What is the total length of the sticks in the line? ✔

1 <g0>What is the total length of the sticks in the line?</g0> ❌

2.2.2. Intrasentential codes¶

☞ The file markup_span.html in the sample project includes the following text:

<p><span style='font-size:12pt;font-family:"times new roman","serif"'>Code1:
</span><span style='font-size:12pt;font-family:"times new roman","serif"'>3/2 or
11/2 or 1.5</span></p>

which should be prepared as follows in the XLIFF file:

<source xml:lang="en"><bpt id="1" ctype="x-span">&lt;span style='font-size:12pt;
font-family:"times new roman","serif"'></bpt>Code1: <ept id="1">&lt;/span></ept>
<bpt id="2" ctype="x-span">&lt;span style='font-size:12pt;font-family:"times new
roman","serif"'></bpt>3/2 or 11/2 or 1.5<ept id="2">&lt;/span></ept></source>

which the CAT tool will display as, e.g.:

1 <g0>Code1: <g0><g1>3/2 or 11/2 or 1.5</g1> ✔

Representing the HTML tags as inline codes as specified in the XLIFF standard also reduces the length of each tag, thus better legibility is obtained and the segments are much easier to handle.

Some CAT tools display a floating tooltip including the original inline codes when the linguist hovers over the placeable tag (e.g. <g0> below) in the segment, thus showing the translator what the tag stands for. For formats like HTML, this can be helpful for the savvy translator, whereas for other more obscure formats like Open XML it is less useful.

⚠️ WARNING:
Escaping the HTML or XML markup by replacing < and > with < and > respectively, etc. is not a good approach, because the translation editor will consider the escaped markup as editable text rather than as locked codes and therefore the tags can be mishandled, let alone the fact that translation memories will be polluted. Furthermore, that approach does not reduce the length of inline codes, which means that the text is less readable and that can hamper the translation and bring about quality issues.

For example, escaping the markup in the above segment would produce the following XLIFF file:

<source xml:lang="en">&lt;span style=&apos;font-size:12pt;font-family:&quot;
times new roman&quot;,&quot;serif&quot;&apos;>Code1: &lt;/span&gt;
&lt;span style=&apos;font-size:12pt;font-family:&quot;times new roman&amp;
quot;,&quot;serif&quot;&apos;>3/2 or 11/2 or 1.5&lt;/span&gt;</source>

which would produce the following view in the translation editor, which is not translation-friendly:

1 Code1: 3/2 or 11/2 or 1.5 ❌

☞ In the sample project provided, files markup_custom_xml.xlf and markup_span.html.xlf show what the text looks like when the inline codes have been simply escaped (the ones in the 01_haram folder) and when they have been properly encapsulated as XLIFF content markup (the ones in folder 02_halal). There are two files to exemplify both standard HTML tags (e.g. ) as well as custom-crafted XML tags.

2.3. Encoding entities¶

HTML source content might contain named character entities (in the form of &char;) which are not allowed in XML and therefore cannot be used in an XLIFF file as such unless they are declared (except for amp, lt, gt, apos, quot). There is more than one possibility to prepare source HTML content containing named character entities, and some ways are preferable to others. Let us consider the ÷ entity as an example, found in the following HTML content:

<p>(40 &divide; 10) + (8 &divide; 2)</p>

2.3.1. Escaping named character entities¶

One possibility to create a valid XLIFF file is to escape the entities, as follows:

<source>(40 &amp;divide; 10) + (8 &amp;divide; 2)</source> ❌

The escaping approach is discouraged, though. Among other reasons, this approach will make the translation editor display the named character entity rather than the actual character itself (which reduces readability and could be misleading or even unintelligible for the linguist), as shown below:

1 40 [÷ 10) + (8 ÷ 2) ❌

Linguists need to see the character, not the code. While ÷ (÷) or π (π) might be more or less transparent in the appropriate context, other entities such as ≤ (≤) or &zwnj; (zero-width non-joiner) will be obscure and puzzling. The linguist could think that the named character entity must be maintained and therefore necessarily be used in the translation, whereas it might be the case that the target language spelling rules call for another character in that context. For example, this would be an incorrect translation according to French punctuation rules:

source: Punctuation works "differently" in French.
target: La ponctuation est "différente" en français. ❌

Compare with the correct translation:

source: Punctuation works "differently" in French.
target: La ponctuation est « différente » en français. ✔

Escapes were used to represent (by means of ASCII text only) characters that were not available in the character encoding you are using. The W3C (group in charge of the HTML specification) advises to use an encoding that allows to represent characters in their normal form, rather than using escaped named character entities, because using escapes can make it difficult to read and maintain source code, and can also significantly increase file size.[10] Nowadays, it should be possible to encode any text as UTF-8, which allows to use Unicode characters and removes the need for such escapes.

Escapes can be a way of avoiding the use of a character for other reasons (e.g. if they conflict with other elements), in which case, you might want to escape some entities specifically. Escapes might also be useful to represent invisible or ambiguous characters, or characters that would otherwise be difficult to handle, such as whitespace or invisible Unicode control characters (e.g. using &rlm; in HTML source content --and ‏ in the prepared XLIFF file-- helps spot these characters).[11] However, in all other cases, it is preferable to avoid the escape.

2.3.2. Unicode characters and code points¶

A different approach is to represent the same character by means of the universal Unicode code point expressed as a numeric entity (e.g. the hexadecimal entity ÷) or as the Unicode character itself (e.g. ÷). The latter is preferable because it simplifies maintenance and running text searches directly in the raw XLIFF files with, say, grep or any other text-based tool.

If it's unavoidable to have named character entities in the source content to represent special characters, a simple pre-processing of the source content can be used to convert them into Unicode code points or Unicode characters. However, the ideal scenario would be to configure the authoring tool where the source content is authored so that special characters or symbols can be inserted directly by means of the Unicode character or the numeric entity. How authors insert symbols or special characters that are not on their keyboard depends on their authoring tool and their platform, but normally they can either pick the character from a special character palette (e.g. the Character Map in Windows or Character Viewer in Mac) or insert it by using a key combination, e.g. ALT+0176 on the keypad to insert the degree symbol (i.e. °).[12]

Both the approaches mentioned in the previous paragraph will produce one of the two following XLIFF codes (depending on the encoding chosen):

<source>(40 &#x00F7; 10) + (8 &#x00F7; 2)</source> ✔

<source>(40 ÷ 10) + (8 ÷ 2)</source> ✔

which will be displayed as the following in the translation editor:

1 (40 ÷ 10) + (8 ÷ 2) ✔

In a nutshell, then: using the Unicode characters in their normal form is preferable to representing them with their numeric --preferably hexadecimal-- reference, and any of those two options is preferable to escaping the named character references with &, or declaring them in the preamble of the document.

Approach	Example	Well-formed	Recommended
Unicode character	`÷`	yes	yes (more preferable)
numeric (hex) character entity	`÷`	yes	yes (less preferable)
unescaped named character entity	`÷`	no (unless declared)	no (but feasible if declared in the document)
escaped named character entity	`&divide;`	yes	no

When preparing XLIFF files with a localization engineering tool, e.g. Okapi Rainbow, both named and numeric character entities in the source content will be encoded as the Unicode character in the XLIFF file. ☞ In the sample OmegaT project provided, you can see how the three possible inputs (in file original/entities.html) are encoded in the same way (as the Unicode character) using the recommended approach in file 02_halal/entities.html.xlf as well as in the discouraged way in file 01_haram/entities.html.xlf.

With a few negligible exceptions[13], there should be no reason why a UTF-8 encoding and Unicode characters cannot be used any content to be localized.

3. Common issues¶

Following the recommendations above is necessary but might not be enough to achieve an optimized process. The source content might present a number of pitfalls that require special attention when creating the XLIFF files. Let us see some of those frequent issues that may hamper language tasks.

3.1. Split sentences¶

Sometimes sentences might be broken in two or more parts because the extraction filter is treating an embedded code as the end of the paragraph. ☞ In the OmegaT project provided, file 01_haram/markup_input.html.xlf shows how the sentence is broken at the text input code:

<source>An emperor penguin is</source>
<source>cm taller than a little penguin.</source>

displayed in the translation editor as:

1 An emperor penguin is ❌
2 cm taller than a little penguin.

The original content (in file original/markup_input.html) looks like this:

<p>An emperor penguin is <input type="text" name="fname" autocomplete="off"
size="4" id="emperor-penguin-versus-little-penguin" class="height-different"
pattern="[0-9]+" title="How many centimeters." formmethod="post" required
autofocus /> cm taller than a little penguin.</p>

That code represents this display in the online questionnaire:

Here, the expected and recommended preparation is to represent the text input field as content markup, as follows (and file 02_halal/markup_input.html.xlf exemplifies):

<source xml:lang="en">An emperor penguin is <ph id="1" ctype="x-input">&lt;input
type="text" name="fname" autocomplete="off" size="4" id="emperor-penguin-versus-
little-penguin" class="height-different" pattern="[0-9]+"
formmethod="post" required autofocus /></ph> cm taller than a little penguin.
</source>

which will appear as follows in the translation editor (see file 02_halal/markup_input.html.xlf):

1 An emperor penguin is <x0/> cm taller than a little penguin. ✔

That preparation makes it very easy for the translator to insert the placeable tag in the translation of that segment in the position where the input field should appear in the target version. For example, in Khmer:

1target: ព្រះចៅអធិរាជភេនឃ្វីនមានកំពស់ខ្ពស់ជាងភេនឃ្វីនតូចជាង <x0/> ហ្វីត។ ✔

File markup_inline.svg.xlf shows a similar case of text broken down at the two embedding SVG tags <tspan> and </tspan>.

Other similar examples:

1 Click ❌
2 to move on.

1 See uses on ❌
2 to show 2 children in her pictograph
3 How many ❌
4 will they need to draw?

1 Feed 5 penguins for ❌
2 zeds.

1 Drag ❌
2 onto the graph.

1 When your drawing is done, click ❌
2 to fill the garden with boxes of flowers.

In all these cases the original content includes some element (e.g. "Click [X](BUTTON) to move on." or "Feed 5 penguins for [QUANTITY] zeds.") that has been interpreted as the end of a paragraph.

Apart from the problem that the full sentence will not be stored in the translation memory as one unit, this is also problematic when the target language expresses things in a different order than English, e.g. say, "Tó móvê ón klïck [X])". In that case the linguist is forced to break the expected one-to-one segment correspondence, in order to maintain the correct order.

Maintaining the correspondence will produce the wrong order in the final content according to the syntax of the target language:

1 klïck ❌
2 Tó móvê ón ❌

Breaking the natural correspondence will produce the right order in the final content but spoils the TM containing these translations, which will not be reusable when the same kind of content needs to be translated in a subsequent cycle of the same project:

#	source	target
`1`	Click	Tó móvê ón ❌
`2`	to move on	klïk ❌

or

#	source	target
`1`	Click	Tó móvê ón klïk ❌
`2`	to move on

Auto-propagation of broken translations can also become problematic if the translation of a repeated segment (corresponding to part of the sentence) must be different in different contexts, for example due to agreement with other parts of the sentence, e.g.

1 Front ❌
2 wheel
3 Front ❌
4 headlamp

For example, in Spanish adjectives need to agree in gender and number with the nouns they modify, e.g. the translation of "front" is "delantera" (feminine) in seg1 to agree with the Spanish equivalent of "wheel" (i.e. "rueda"), which has feminine grammatical gender, whereas it is "delantero" (masculine) in seg3 to agree with the Spanish equivalent of "headlamp" (i.e. "faro"), which has masculine grammatical gender.

Auto-propagation would produce the following translation:

#	source	target
`1`	Front	Rueda (fem.)
`2`	wheel	delantera (fem.)
`3`	Front	Faro (masc.)
`4`	headlamp	delantera (fem.) ❌

In these cases, to achieve the correct translation the linguist might need to disable the default auto-propagation (to prevent the translation of the first occurrence being pulled automatically into the second occurrence), but that manual step could fail or be easily overlooked.

#	source	target
`1`	Front	Rueda (fem.)
`2`	wheel	delantera (fem.)
`3`	Front	Faro (masc.)
`4`	headlamp	delantero (masc.) ✔

In a nutshell, split sentences can be a nuisance to translate into certain languages with different word order than English -- the linguist might have to work around the translation in difficult or impossible ways. Also, productivity and internal consistency can be compromised if the same text appears later in the same project, or in future cycles, with different or correct segmentation.

Expected preparation¶

The expected result in the cases above would have been to use a tag or a placeholder to encode the inline code:

1 See uses on %s to show 2 children in her pictograph ✔
2 How many %s will they need to draw? ✔

1 Click <BUTTON/> to move on. ✔

1 Front wheel ✔
2 Front headlamp ✔

In the examples above where the segment has been properly prepared with placeholders, it is very easy (and common practice) for the translator to insert the placeables tags in the position where the inline codes belong in the target version.

Tip¶

To avoid this kind of issue, then, ask a translation technologist or a trained linguist to run a source review on your draft XLIFF files and then adjust the extraction filter accordingly, so that the filter knows which inline codes must be extracted along with the surrounding text and protected, and not interpreted as end of a paragraph.

3.2. Markup nimiety¶

Segments overloaded with markup make translation and all related subsequent language tasks more difficult, therefore increasing the chance to introduce errors in the translation, especially in right-to-left languages such as Arabic or Hebrew.

Some inline codes are unavoidable, e.g. to provide style:

Put the lengths in order from <b>shortest</b> to <i>longest</i>. ✔

However, other tags are unnecessary and should be avoided or cleaned up before preparing the files. For example, closing and opening tags of the same kind in the middle of a sentence or even in the middle of a word:

Vehicl</strong><strong>es in 2000  ❌

When this happens repeatedly, it results in segments that are (unnecessarily) very translation unfriendly. For example:

<strong>Star</strong><strong>t </strong><strong>T</strong><strong>i</strong><strong>me</strong>  ❌

In that example, there are a lot of  tags there to do the same job that could be achieved with simply one tag pair. This tag multiplicity might arise from adding superfluous formatting in a word processor or a wysiwyg editor to create the source, or from converting with OCR or from PDF.

Expected preparation¶

The expected design of the source content in the case above would have been to embed the formatted text with one single tag pair.

<strong>Start Time</strong> ✔

This tag pair is actually suprasentential markup, which could be excluded from the prepared segment, thus producing the following simple display in the translation editor:

1 Start Time ✔

The solution of this issue does not affect how the source content is prepared, but instead it relates to the pre-processing of the source content, before it is prepared. The actual preparation (parsing, extraction and segmentation) is the same regardless of whether the issue is present in the source files.

Tip¶

Provide feedback and tips to content authors and item developers and run some tag clean-up before extracting the text that must be included in the XLIFF files (tools like CodeZapper or TransTools Document Cleaner can be used for that in Word)[14].

3.3. Ending segments at line breaks¶

In some cases, line breaks are used to limit the length of each line in the source text. During extraction, the text might be split down at these line break tags.

For example:

1 Since the 1970s, scientists have been
2   ❌
3 worried about the amount of Dioxin, a
4   ❌
5 toxin in fish caught in Baltic Sea.

Expected preparation¶

One potential approach would be to represent the line break tags are as placeholders:

1 Since the 1970s, scientists have been <g0/> worried about the amount of Dioxin, a <g1/> toxin in fish caught in Baltic Sea. ✔

However, a much better approach would be to get rid of those line breaks, which were there probably to hard wrap the text at a certain width to adapt to some expected space, and soft wrap the translation by other more convenient and dynamic means (e.g. styles) in the actual final publication medium. In general it is recommended to separate content from layout.

In any case it should not be assumed that the translator will keep the line break tags in the translation or that their location will need to be equivalent to the source

Therefore, our recommendation, in the first place, would be to avoid using line break tags in the source text. Secondly (assuming we are dealing with HTML content), the width of the text can be defined by means of CSS styles. That approach achieves the same exact results without introducing any noise in the source text and without affecting the work of the translator. See https://jsfiddle.net/msoutopico/3p7x8ryr/1/ or the screenshot below:

Tip¶

To avoid this kind of issue, then, ask a translation technologist or a trained linguist to run a source review on your draft XLIFF files and/or your original source content and then clean up your source content (prefereably) to avoid hard-coding text wrapping or (less preferably) adjust the extraction filter accordingly, so that the line break element is treated as an inline code (extracted along with the surrounding text and protected) and not as the end of the paragraph.

Annexes¶

Preparing XLIFF files for translation in OmegaT entails some additional tweaks due to the special characteristics of this CAT tool.

Guidelines for creating XLIFF 1.2 files for OmegaT¶

Check list for translation¶

There are two possibilities for creating new XILFF files for translation tasks.

A. Using the Okapi XLIFF filter:

The target element must be empty or missing, like so:

  <trans-unit id="1086880">
    <source>This is the source text.</source>
    <target/>
  </trans-unit>

Also, the trans-unit must have property xml:space="preserve".

B. Using the default OmegaT XLIFF filter:

The target element must exist and be populated with the source text.

For example:

  <trans-unit id="1086880">
    <source>This is the source text.</source>
    <target>This is the source text.</target>
  </trans-unit>

Check list for bilingual review¶

Make sure your bilingual/translated XLIFF files meet the following criteria:

The trans-unit must have attribute xml:space with value preserve if leading and trailing space included in the source text must be replicated in the translation.
The trans-unit must have attribute-value pair approved="yes".

For example:

  <trans-unit id="1086880" xml:space="preserve" approved="yes">
    <source>This is the source text.</source>
    <target>Esto es la traducción.</target>
  </trans-unit>

In this case the OmegaT project must use the Okapi XLIFF filter.

References¶

OmegaT is a free and open source computer-assisted translation tool (CAT-tool) that cApStAn uses to translate and review/edit XLIFF files in international large-scale translation projects. It offers many more technical possibilities than OLT. See our OmegaT installation and customization guide: https://slides.com/capstan/omegat-installation-and-customization-guide/fullscreen ⏎
See https://okapiframework.org/wiki/index.php?title=Rainbow ⏎
See http://docs.oasis-open.org/xliff/xliff-core/xliff-core.html ⏎
XML schema: https://docs.oasis-open.org/xliff/v1.2/os/xliff-core-1.2-strict.xsd ⏎
The XLIFF checker can be downloaded from https://www.maxprograms.com/products/xliffchecker.html ⏎
See https://okapiframework.org/wiki/index.php?title=SRX. ⏎
Okapi Framework is a set of libraries that can be used to prepare files for translation, among other things. ⏎
See http://okapiframework.org/wiki/index.php?title=Ratel ⏎
See http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html#Struct_InLine and http://docs.oasis-open.org/xliff/v1.2/xliff-profile-html/xliff-profile-html-1.2-cd02.html. ⏎
See https://www.w3.org/International/questions/qa-escapes#not ⏎
See https://www.w3.org/International/questions/qa-escapes ⏎
See https://support.office.com/en-us/article/insert-ascii-or-unicode-latin-based-symbols-and-characters-d13f58d3-7bcb-44a7-a4d5-972ee12e50e0 and https://support.apple.com/en-us/HT201586 for Mac. ⏎
See https://www.w3.org/International/questions/qa-chars-vs-markup#not ⏎
See http://kb.memoq.com/article/AA-00485/0/Cleaning-unnecessary-tags-with-TransTools-Document-Cleaner.html ⏎
Microsoft's Best Practices for Developing World-Ready Applications