Understanding UTF-8, UTF-16, Endianness and the BOM (Byte Order Mark)


There is a difference between understanding something and being able to explain it. However, I was challenged to explain my knowledge and so I will attempt to do so. Let's start with unsigned integers, bits and bytes.

I'm sure that you will be familiar with a computer or device being described as 64-bit or 32-bit. And before 32-bit we had 16-bit computers and before that, in the days of the Commodore 64, computers were 8-bit. But aside from indicating the power and sophistication of these machines, what did these numbers really mean?

In an 8-bit machine each unit of data (or, to give it a more exact name, a word) is 8 bits long. And bits are those zeroes and ones so familiar to us from popular portrayals of machine code. For example, this is a binary number that is 8 bits long: 00000000. It can be written in decimal terms as 0 and in hexadecimal as 00. This is another 8-bit number: 11111111. It can be written in decimal terms as 255 and in hexadecimal as FF.

Note: Typically a byte is 8 bits long. So a word in an 8-bit machine is equal in size to a byte. The size of the word increases as the number of bits the processor can process (in one go) increases. So a 16-bit processor has a word size of 2 (bytes), a 32-bit processor has a word size of 4 (bytes) and a 64-bit processor has a word size of 8 (bytes).
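
As a quick check, Swift's MemoryLayout can report the size in bytes of each of these integer types (a minimal sketch; the size of Int follows the machine's native word, so 8 bytes on a 64-bit machine):

    // Fixed-width integer sizes are the same on every machine;
    // Int matches the native word size of the processor
    print(MemoryLayout<UInt8>.size)    // 1 byte
    print(MemoryLayout<UInt16>.size)   // 2 bytes
    print(MemoryLayout<UInt32>.size)   // 4 bytes
    print(MemoryLayout<Int>.size)      // 8 bytes on a 64-bit machine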

Limits of an 8-bit unsigned integer

Now, what we've done by setting all the bits in the eight-bit binary number to zero is to discover the limits of an 8-bit unsigned integer, because 00000000 is its minimum value, and by setting all the bits to 1 we've also discovered its maximum. It doesn't matter how we write it (in binary, decimal, hexadecimal or even octal form), the value is the same, but it might help to think of each code unit (or byte) as a number between 0 and 255 for now.
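
To make this concrete, here is a small Swift sketch showing the limits of a UInt8 and the same maximum written out in different bases:

    print(UInt8.min)                       // 0
    print(UInt8.max)                       // 255
    // The maximum written in binary and in hexadecimal
    print(String(UInt8.max, radix: 2))     // 11111111
    print(String(UInt8.max, radix: 16))    // ff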

Note: If you'd like to learn more about binary numbers see here.

UTF-8 standard

An unsigned 8-bit integer (or UInt8) is the basis of the Unicode UTF-8 standard. And the Unicode system at its most basic is a list of characters that corresponds to a list of numbers. So the UTF-8 standard is, in essence, a list of UInt8s paired with a list of characters. (See here for the actual list.)

If you are using UTF-8 the letter 'a' will always be equivalent to 97 and the letter 'z' to 122. Thanks to the Unicode standard this is true in HTML, where we can access a Unicode value using the &#104; entity syntax, and in Swift, where we use a hexadecimal escape like \u{68}, while in JavaScript we can write String.fromCharCode(104). No matter which syntax is employed, because the Unicode table tells us that a lowercase h is always represented by the number 104, this will always hold in any system that follows Unicode.
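
As a quick check in Swift (a minimal sketch; remember that the \u{} escape takes hexadecimal digits, and decimal 104 is hexadecimal 68):

    // Swift's \u{} escape takes hexadecimal digits: 0x68 is decimal 104
    let h = "\u{68}"
    print(h)                                     // h
    // Building the same character from the decimal value 104
    let fromDecimal = Character(UnicodeScalar(UInt8(104)))
    print(fromDecimal)                           // h
    // And going back from the character to the number
    print(Character("h").asciiValue!)            // 104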

There is a problem, however, because there are more characters in the world than a UInt8 can reference (remember that 255 maximum!). For this reason UTF-8 does something special. Instead of using every number between 0 and 255 to represent a single character, and thus limiting the number of characters to 256, it reserves the numbers 0 to 127 for single-byte characters and uses the numbers above 193 to prefix characters composed of multiple code units (or in other words an array of UInt8 numbers): the numbers 194 to 223 prefix characters encoded with two bytes, 224 to 239 those with three bytes, and 240 to 244 those with four bytes (the numbers in between, 128 to 191, serve only as the continuation bytes that follow a prefix). And because, when the UTF-8 string is parsed, it is known that these numbers are only ever used as prefixes and not as standalone characters, the system will know to expect a sequence of bytes rather than a solitary one.

Note: In UTF-8 one code unit = one byte, so a single-byte character uses one code unit, a two-byte character two, a three-byte character three and a four-byte character four. Examples of characters that need four bytes in UTF-8 include emoji.
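
One way to see these byte counts for yourself is through a Swift string's utf8 view, which exposes the individual UTF-8 code units as UInt8 values (a small sketch with a few example characters):

    print(Array("a".utf8))    // [97]                  one byte
    print(Array("é".utf8))    // [195, 169]            two bytes (the precomposed é)
    print(Array("€".utf8))    // [226, 130, 172]       three bytes
    print(Array("😀".utf8))   // [240, 159, 152, 128]  four bytes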

UTF-16

Let's now consider the difference between UTF-8 and UTF-16. UTF-8 is based on 8-bit numbers whereas UTF-16 is based on 16-bit numbers, which range from 0000000000000000 to 1111111111111111 (or in decimal terms 0 to 65,535). This added capacity (for larger numbers) means that while an e acute is treated in UTF-8 as two code units ([195, 169] combined), in UTF-16 the character is represented as a single 16-bit number (233). Of course, 233 is a valid UInt8 value, but UTF-8 couldn't build multi-byte sequences in the way it does if it did not reserve that number as a prefix byte (as discussed above).

While this 16-bit capacity covers most characters it doesn't cover them all, and so, as the Unicode Consortium (UC) outlines, there are characters that go beyond even the capacity of a UInt16:
Depending on the encoding form you choose (UTF-8, UTF-16, or UTF-32), each character will then be represented either as a sequence of one to four 8-bit bytes, one or two 16-bit code units, or a single 32-bit code unit. (The Unicode Consortium)
Let's take for example a smiley face emoji, which UTF-16 encodes as a pair of 16-bit code units: [55357, 56832].
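
The same characters viewed through Swift's utf16 view show these 16-bit code units directly (a small sketch):

    print(Array("é".utf16))   // [233]            one 16-bit code unit
    print(Array("😀".utf16))  // [55357, 56832]   two 16-bit code units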

What about the Endians?

OK, so UTF-8 is a system of relating 8-bit unsigned integers to characters and UTF-16 is its 16-bit counterpart. But when we save a file in a text editor like BBEdit we're not only faced with UTF-8 or UTF-16 options, but also with Little-Endian as well as BOM options.


These two things are related. To explain: in UTF-8 the code units are single bytes and are always read from left to right, no matter how many there are. In UTF-16 the situation is different, because each 16-bit code unit (a number that can rise to a maximum value of 65,535) is stored as two bytes, and those two bytes can be written with either the most significant or the least significant half of the number first. As the UC describes it:
Data types longer than a byte can be stored in computer memory with the most significant byte (MSB) first or last. The former is called big-endian, the latter little-endian. When data is exchanged, bytes that appear in the "correct" order on the sending system may appear to be out of order on the receiving system ... [whereas] UTF-8 is byte oriented and therefore does not have that issue. (Unicode Consortium)
To be clearer, let's say we have a UTF-16 e acute. It would be written in binary as 00000000_11101001. In big-endian ordering (the normal type of UTF-16 formatting in BBEdit) the bytes are stored left to right, or rather the most significant byte (the one representing the higher-value half of the number) is stored and read first, whereas in little-endian ordering the order is swapped around, so that rather than recording 00000000_11101001, or in hexadecimal 00 E9, the bytes are stored as E9 00 instead.
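
A short Swift sketch makes the two orderings visible by laying the 16-bit value 0x00E9 (233) out in memory both ways (bigEndian and littleEndian are standard properties on Swift's fixed-width integers):

    let eAcute: UInt16 = 0x00E9   // 233, the UTF-16 code unit for é
    // bigEndian and littleEndian rearrange the value so that, once laid out
    // in memory, its bytes appear in the requested order
    let bigEndianBytes = withUnsafeBytes(of: eAcute.bigEndian) { Array($0) }
    let littleEndianBytes = withUnsafeBytes(of: eAcute.littleEndian) { Array($0) }
    print(bigEndianBytes)      // [0, 233], i.e. 00 E9
    print(littleEndianBytes)   // [233, 0], i.e. E9 00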

The image at the head of this post goes some way to illustrate this endianness. It starts at the top with the least significant byte (i.e. the one representing the lower-value half) of the 16-bit number and at the bottom is the most significant byte. Little-endian byte ordering progresses downwards and big-endian progresses upwards, but really it doesn't matter which order the bytes are stored in as long as they are written and read in the same order. This is where a byte order mark (BOM) becomes useful: it means that a program can tell which order the bytes are arranged in, without which the app would have no way of knowing which format or which type of endianness was employed (see also here). However, some systems and/or apps cannot handle BOMs, hence their optional status.

BOM

A BOM is a special character (U+FEFF) placed at the very start of a file to indicate the order (big-endian or little-endian) of the bytes in a UTF-16 file. A UTF-8 string can also contain a BOM, but there it is only used to indicate that the file is UTF-8, because endianness has no meaning when it comes to 8-bit code units: they are only a single byte long.
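
To see a BOM in the raw bytes, here is a small Foundation sketch (the hexBytes helper is just for printing, and the note about .utf16 writing a BOM first reflects typical behaviour on a Mac rather than a guarantee):

    import Foundation

    // Small helper for printing bytes as hex pairs
    func hexBytes(_ data: Data) -> String {
        data.map { String(format: "%02X", $0) }.joined(separator: " ")
    }

    let e = "é"
    // .utf16 lets Foundation choose the byte order and (typically) write a BOM first
    print(hexBytes(e.data(using: .utf16)!))              // e.g. FF FE E9 00 on a little-endian Mac
    // The explicitly endian-marked encodings write no BOM
    print(hexBytes(e.data(using: .utf16BigEndian)!))     // 00 E9
    print(hexBytes(e.data(using: .utf16LittleEndian)!))  // E9 00
    // A UTF-8 BOM, where one is present, is the three bytes EF BB BF
    print(hexBytes("\u{FEFF}é".data(using: .utf8)!))     // EF BB BF C3 A9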

Practical Exercise (examining file sizes)

The question becomes: why use UTF-8 when the 16-bit and 32-bit encoding forms enable us to describe more complex characters using fewer code units? A major reason, when working mainly with unaccented Latin characters, is file size, as demonstrated in the practical example below.
  1. Open BBEdit (or similar) and create a file with a single letter from a to z and then save in UTF-8 format
  2. Go to Finder and inspect the file
  3. You should find that the file size is 1 byte
  4. Replace the character with an é or ü and re-save
  5. The size when you inspect the file is now 2 bytes (because é or ü are encoded with two UTF-8 code units)
  6. Now save as UTF-16 with no BOM.
  7. Don't look at the file size yet.
  8. Remember that UTF-16 can describe an é or ü using a single code unit
  9. So how big do you think the file will be?
  10. It will be 2 bytes because each UTF-16 code unit is represented by a 16-bit number and a byte is 8 bits in length, so each code unit is equal to 2 bytes.
  11. Change the single character back to a letter ranging from a to z
  12. Notice that the size remains 2 bytes (the size of a single UTF-16 code unit)
  13. Now replace the text with two characters, e.g. ab, and save again
  14. Notice the file size has risen to 4 bytes.
  15. Re-save as UTF-8 and notice how the file size drops to 2 bytes.
As you can see, for Latin characters using UTF-16 rather than UTF-8 doubles the file size, and using UTF-32 would double it again.
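
The same arithmetic can be checked without leaving Swift by asking Foundation how many bytes each encoding produces; the counts below match the BOM-less files saved in the exercise (a BOM would add 2 bytes to each UTF-16 figure):

    import Foundation

    print("a".data(using: .utf8)!.count)                // 1
    print("é".data(using: .utf8)!.count)                // 2
    print("é".data(using: .utf16LittleEndian)!.count)   // 2
    print("ab".data(using: .utf8)!.count)               // 2
    print("ab".data(using: .utf16LittleEndian)!.count)  // 4
    print("ab".data(using: .utf32LittleEndian)!.count)  // 8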

Beyond plain text (doc and docx file sizes)

If you were to repeat the practical exercise in Microsoft Word and save as a doc or docx file, the resulting files would be much larger in size. The reasons for this are different for each format. A doc file is a binary file format and so uses bytes not only to describe text but also for formatting and arrangement, as well as for padding to maintain a regularity of section sizes in the file. A docx file, on the other hand, is a zipped collection of XML files that describe structure, formatting and text arrangement. So, while each unaccented Latin character of an XML file saved in UTF-8 will be one byte in length, because of the supporting files, the XML tags, and so on, the file will be much larger in bytes than its number of characters.

Conclusion

This was an unexpected article to be writing and I've not really had the time to be as clear or thorough as I would like, but I hope it helps somewhat and doesn't contain too many mistakes. As I've thought of new things to add, I've done so and may continue to do so. I hope this doesn't make for too much of a hotchpotch.

