The Byte Order Mark (BOM) and its Function in Software Localization


When the time comes to localize your software to another language, you might need to use the Unicode character U+FEFF byte order mark (BOM), which appears at the beginning of a text stream (file) and signals the text encoding. Why is this needed?

As you might know, computers store everything in binary: sequences of 1s and 0s. To represent text, they use a kind of map, or index, that assigns a number to each character. In the ASCII map, for example:

1100001 (97) refers to the letter a
1100010 (98) refers to the letter b
1100011 (99) refers to the letter c
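
To see what a real character map assigns to these letters, here is a minimal sketch using Python's built-in ord() and format(). The helper name to_bits is made up for this illustration.

```python
def to_bits(ch: str) -> str:
    """Return the number assigned to a single character, written in binary."""
    return format(ord(ch), "b")


for ch in "abc":
    # Prints the character, its number in the map, and that number in binary.
    print(ch, ord(ch), to_bits(ch))
```

For the letters a, b, and c, ASCII and Unicode agree on the numbers 97, 98, and 99.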

Why was the BOM created?

Because early computers were largely developed in the USA, the first widely used character map, ASCII (American Standard Code for Information Interchange), contained only a very limited number of characters and did not include accented letters or Asian and Arabic characters. In fact, it had only 128 characters.

Later, developers needed to support other languages with different letters (characters). New character maps appeared; some of them were based on ASCII and some were not, so the same number could stand for one character in one map and a completely different character in another.

After some time, people realized this was chaos, so Unicode appeared: a single universal character map that is complete and extensible, and even reserves a private-use range for customized symbols. UTF-8 (Unicode Transformation Format, 8-bit) is the most common way to store Unicode text as bytes. However, older single-byte character maps such as ISO-8859-1 (Latin-1) and Microsoft's Windows-1252 remained in use, and their indexes do not match UTF-8 outside the ASCII range. In addition, UTF-8 is variable-length: it uses one byte for ASCII characters and two to four bytes for everything else. So when a computer reads a document using the wrong character map, some characters match and some don't.

That is why sometimes you would see a question mark instead of an accented letter, or sometimes two letters instead of the right character (because it was using a shorter character map to read a longer character representation).
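
Both symptoms are easy to reproduce. The following is a small sketch, assuming Python 3 and the sample word "café"; it reads the same bytes with the wrong character map in each direction.

```python
text = "café"

utf8_bytes = text.encode("utf-8")      # 'é' becomes two bytes: C3 A9
latin1_bytes = text.encode("latin-1")  # 'é' becomes one byte: E9

# Reading UTF-8 bytes with the shorter Latin-1 map:
# each of the two bytes is shown as its own letter.
two_letters = utf8_bytes.decode("latin-1")

# Reading Latin-1 bytes as UTF-8: the lone E9 byte is not valid UTF-8,
# so it is replaced with U+FFFD, which is usually drawn as a question mark.
replaced = latin1_bytes.decode("utf-8", errors="replace")

print(two_letters)  # cafÃ©
print(replaced)     # caf followed by the replacement character
```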

When to use the BOM

Okay, now let’s talk about the BOM.

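Encodings such as UTF-16 store each character in two or more bytes, and different computers store those bytes in different orders: some put the most significant byte first (big-endian), others put the least significant byte first (little-endian). When a computer opens a UTF-16 document, it does not know which order was used (believe me, this is a very complex issue). So the smart guys said: “Let’s add a mark at the beginning of the file that reveals the byte order.” That’s the Byte Order Mark, or BOM: the character U+FEFF, which appears as the bytes FE FF in a big-endian file and FF FE in a little-endian one. Note that the BOM has nothing to do with writing direction; right-to-left languages such as Arabic are handled by separate Unicode mechanisms.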
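
The name "byte order mark" refers to the order in which the bytes of each character are stored. As a rough sketch, assuming Python 3 and its standard codecs module, the same letter encoded in UTF-16 comes out in two different byte orders, and so does the BOM itself:

```python
import codecs

letter = "A"  # Unicode code point U+0041

big_endian = letter.encode("utf-16-be")     # most significant byte first
little_endian = letter.encode("utf-16-le")  # least significant byte first

print(big_endian.hex())     # 0041
print(little_endian.hex())  # 4100

# The BOM character U+FEFF is stored in the same two ways,
# which is how a reader can tell the byte order of the rest of the file.
print(codecs.BOM_UTF16_BE.hex())  # feff
print(codecs.BOM_UTF16_LE.hex())  # fffe
```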

The fact is that a plain text file carries no built-in declaration of which character set it uses. The encoding has to be guessed or reported by something else, such as an HTTP header or file metadata.

The BOM looks different in UTF-8 (bytes EF BB BF) and in UTF-16 (bytes FE FF or FF FE), so it can also be used to determine which of these encodings a file uses.
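
A minimal sketch of that check, assuming Python 3: the function name detect_bom is hypothetical, and for brevity it only knows the UTF-8 and UTF-16 marks (UTF-32 and other encodings are ignored).

```python
import codecs

# Known byte sequences for the BOM in each encoding.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def detect_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is no BOM."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


print(detect_bom(b"\xef\xbb\xbfhello"))  # utf-8
print(detect_bom(b"hello"))              # None
```

A result of None only means that no BOM was found; the file may still be valid UTF-8, since the BOM is optional there.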

In other words, a program that finds a BOM at the start of a document can use it to load the right character map.

So, when do you use a BOM? It really depends on your needs; you might use it when you are using UTF-16, or whenever you have an encoding issue (seeing weird characters in the generated files).
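
If you generate resource files yourself, switching the BOM on or off can be a one-word change. Here is a small sketch, assuming Python 3, whose "utf-8-sig" codec writes a UTF-8 BOM and strips one when reading; the sample string is made up.

```python
import codecs

text = "Olá"

with_bom = text.encode("utf-8-sig")  # prepends the bytes EF BB BF
without_bom = text.encode("utf-8")   # no BOM

print(with_bom.startswith(codecs.BOM_UTF8))     # True
print(without_bom.startswith(codecs.BOM_UTF8))  # False

# Decoding with "utf-8-sig" works whether or not the BOM is present.
from_bom = with_bom.decode("utf-8-sig")
from_plain = without_bom.decode("utf-8-sig")

# Decoding with plain "utf-8" keeps the BOM as an invisible U+FEFF character,
# which can show up as a stray character at the start of a file.
leftover = with_bom.decode("utf-8")
```

Reading with "utf-8-sig" is the more forgiving choice when you do not control how the input files were saved.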

ICanLocalize, a leading translation service provider for mobile apps, offers the feature “Enable/Disable BOM” for software localization projects. Find out more about this option here. Feel free to get more details about our amazing rates and quality of work on www.icanlocalize.com. You are welcome to contact us at hello@icanlocalize.com or on Skype (icanlocalize). We will be happy to assist you!
