Unicode aims to create a worldwide convention that assigns a unique number, called a code point, to each character, so that the character can be represented everywhere by that number. Unicode can address more than 1 million code points, and more than 100,000 of them are currently assigned to characters. These cover the characters of the world's writing systems, along with symbols and emojis. A maximum of 1,114,112 code points can be assigned (17 * 65,536, that is, 17 planes of 65,536 code points each).
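As a small illustration of the mapping between characters and numbers, Python's built-in ord and chr functions convert a character to its code point and back. This is only a sketch; the variable names are chosen here for illustration.

```python
# Map characters to their code points with the built-in ord().
code_point_a = ord("A")        # LATIN CAPITAL LETTER A
code_point_euro = ord("€")     # EURO SIGN

# The code space holds 17 * 65536 code points, numbered from 0.
total_code_points = 17 * 65536
highest_code_point = total_code_points - 1

print(code_point_a, hex(code_point_euro))
print(total_code_points, hex(highest_code_point))

# chr() goes the other way, and rejects numbers past the last code point.
try:
    chr(total_code_points)
    beyond_range_accepted = True
except ValueError:
    beyond_range_accepted = False
print(beyond_range_accepted)
```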
As there are more than 2^16 Unicode code points, we cannot represent all of them in a two-byte representation. A byte is a number between 0 and 255, written as 00 to FF in hexadecimal. A combination of 2 bytes can represent at most 65,536 distinct values.
A combination of 4 bytes can represent more than 4 billion values, which is more than enough to hold every Unicode code point. But storing each character in a 4-byte chunk is inefficient, as every ASCII character needs only 1 byte.
UTF-8 is a trick that represents all the existing code points with a 1-byte unit: an ASCII character takes a single byte, and any other code point is represented by a sequence of 2 to 4 bytes. This is in contrast to UTF-16, which uses 2-byte units, and UTF-32, which uses 4-byte units.
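To make the size difference concrete, the sketch below encodes a few sample characters with Python's str.encode and compares the number of bytes each encoding needs. The helper name encoded_sizes and the sample characters are chosen here for illustration; the "-le" codec variants are used so that no byte order mark is added to the count.

```python
# Sample characters that need 1, 2, 3 and 4 bytes in UTF-8.
samples = ["A", "é", "€", "😀"]

def encoded_sizes(ch: str) -> tuple:
    """Return the byte length of ch in UTF-8, UTF-16 and UTF-32."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),   # little-endian, no byte order mark
        len(ch.encode("utf-32-le")),   # little-endian, no byte order mark
    )

for ch in samples:
    # Print the code point followed by the three sizes.
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

UTF-32 always costs 4 bytes per code point, while UTF-8 grows only when the code point requires it.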
Whenever we read a byte of UTF-8 encoded text, we can immediately determine from its value whether it is (1) an ASCII character, (2) the start of a byte sequence, or (3) a continuation byte.
(1) The byte represents a single-byte ASCII character
This is the case when the byte value is between:
0 to 127
00 to 7F
0000 0000 to 0111 1111
128 different values
(2) The byte represents the start of a byte sequence
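The byte starts a 2-byte sequence when its value is between:
194 to 223
C2 to DF
1100 0010 to 1101 1111
30 different values (192 and 193, C0 and C1, are excluded because they could only encode values that already fit in a single ASCII byte)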
The byte starts a 3-byte sequence when its value is between:
224 to 239
E0 to EF
1110 0000 to 1110 1111
16 different values
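The byte starts a 4-byte sequence when its value is between:
240 to 244
F0 to F4
1111 0000 to 1111 0100
5 different values (245 to 255, F5 to FF, never appear in valid UTF-8 because they would encode values beyond the last code point)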
(3) The byte represents the continuation or the end of a sequence
64 different values, in the following range:
128 to 191
80 to BF
1000 0000 to 1011 1111
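The three cases above can be sketched as a small Python function. The function name classify_utf8_byte and the labels it returns are hypothetical, chosen here for illustration; the ranges are the ones listed above, and any value outside them is reported as invalid.

```python
def classify_utf8_byte(value: int) -> str:
    """Classify one byte of UTF-8 text by its value alone."""
    if not 0 <= value <= 0xFF:
        raise ValueError("a byte must be between 0 and 255")
    if value <= 0x7F:
        return "ascii"           # (1) single-byte ASCII character
    if value <= 0xBF:
        return "continuation"    # (3) 80 to BF
    if 0xC2 <= value <= 0xDF:
        return "start-2"         # (2) starts a 2-byte sequence
    if 0xE0 <= value <= 0xEF:
        return "start-3"         # (2) starts a 3-byte sequence
    if 0xF0 <= value <= 0xF4:
        return "start-4"         # (2) starts a 4-byte sequence
    return "invalid"             # C0, C1 and F5 to FF

# Classify every byte of a short string mixing 1, 2 and 3-byte characters.
labels = [classify_utf8_byte(b) for b in "aé€".encode("utf-8")]
print(labels)
```

Because the ranges do not overlap, a reader that starts in the middle of a file can skip continuation bytes until it finds an ASCII byte or a start byte, and resume decoding from there.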
Use the od utility to view the bytes of a text file
Display the bytes of a text file hello.txt containing "Antoine" with the following commands:
od -t xC hello.txt  # hexadecimal bytes
od -t uC hello.txt  # decimal bytes
Result (od also prints an offset column, omitted here):
41 6e 74 6f 69 6e 65  # hexadecimal output
65 110 116 111 105 110 101  # decimal output
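The same byte values can be cross-checked without od. The sketch below encodes the string directly in Python rather than reading hello.txt, so it assumes the file holds exactly the text "Antoine" with no trailing newline; the variable names are illustrative.

```python
# Encode the text to UTF-8 and list each byte in hexadecimal and decimal.
text = "Antoine"
data = text.encode("utf-8")

hex_line = " ".join(f"{b:02x}" for b in data)
dec_line = " ".join(str(b) for b in data)

print(hex_line)  # 41 6e 74 6f 69 6e 65
print(dec_line)  # 65 110 116 111 105 110 101
```

Every byte is below 128, so each character of this text is a single-byte ASCII character.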