UTF-8 binaries

I have a question concerning UTF-8 binary numbers. On this page:

http://www.utf8-chartable.de/unicode-utf8-table.pl?start=256&utf8=bin

in the UTF-8 table (the "utf-8 (bin)" column), the character Ā is 11000100 10000000 in binary. I believed that each character in Unicode is 8 bits long, so why is Ā 16 bits long (11000100 10000000)? I am confused.
I believed that each character in Unicode is 8 bits long
Wrong. Unicode characters do not have a fixed length by themselves; it depends on the encoding. In UTF-32 every character is 4 bytes (32 bits) long. In UTF-16 every character is 2 bytes (16 bits) long, but that encoding cannot represent all Unicode characters.
UTF-8 is a multibyte encoding: each character is represented by one or more bytes. Characters from the ASCII range take one byte; everything else takes more.
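To make this concrete, here is a minimal C++ sketch (my own illustration, not from the table's page) that prints the raw bytes of two strings; the byte escapes \xC4\x80 are the UTF-8 encoding of Ā (U+0100):

#include <cstdio>

int main()
{
    const char* a     = "a";          // ASCII character: 1 byte in UTF-8
    const char* a_bar = "\xC4\x80";   // Ā (U+0100): 2 bytes in UTF-8

    for (const char* p = a; *p; ++p)
        std::printf("%02X ", static_cast<unsigned char>(*p));
    std::printf("\n");                // prints: 61

    for (const char* p = a_bar; *p; ++p)
        std::printf("%02X ", static_cast<unsigned char>(*p));
    std::printf("\n");                // prints: C4 80
}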
Wikipedia has a great explanation of UTF-8 and how it works:

http://en.wikipedia.org/wiki/UTF-8#Description
Thanks for correcting my wrong idea. As for the Wikipedia article, it is very interesting. However, I found two notions in it that confused me and that I am still unable to understand, namely:

1) the code point. Is it the character itself, or the position of the character in the Unicode table? And

2) the code unit.

What are they, in plain terms?
The code point is the position of a character in the Unicode table, uniquely identifying one [pseudo]character.

The code unit is what a character's representation consists of. UTF-16 and UTF-32 use a single 16-bit and 32-bit code unit respectively. UTF-8 uses one or more 8-bit code units to denote a single character.
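To sketch the difference with an example of my own: the single code point U+0100 (Ā) takes a different number of code units in each encoding. In C++11 terms:

#include <iostream>

int main()
{
    // One code point, Ā (U+0100), in three encodings.
    const char     utf8[]  = "\xC4\x80";  // 2 8-bit code units
    const char16_t utf16[] = u"\u0100";   // 1 16-bit code unit
    const char32_t utf32[] = U"\u0100";   // 1 32-bit code unit

    // sizeof counts the terminating null too, hence the "- 1".
    std::cout << sizeof(utf8)  / sizeof(utf8[0])  - 1 << '\n';  // 2
    std::cout << sizeof(utf16) / sizeof(utf16[0]) - 1 << '\n';  // 1
    std::cout << sizeof(utf32) / sizeof(utf32[0]) - 1 << '\n';  // 1
}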
MiiNiPaa, you mean that the code unit is the number of bits in the sequence used to encode one unique character? For example, the code unit of the character "a" is 8 bits or 1 byte. Is this what you mean?
you mean that the code unit is the number of bits in the sequence used to encode one unique character
No. It is a building block used to encode the value of a character. UTF-8 is a variable-length encoding: that means it can use more than one code unit to represent a character, depending on the character in question.
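For example, here is a rough sketch of how a UTF-8 encoder builds those code units from a code point. It handles only the 1- and 2-byte forms (code points up to U+07FF), which is enough for Ā; a real encoder also handles the 3- and 4-byte forms. The name utf8_encode is just made up for this sketch:

#include <cstdio>
#include <cstdint>

// Writes the UTF-8 code units for cp into out; returns how many were used.
int utf8_encode(std::uint32_t cp, unsigned char out[2])
{
    if (cp < 0x80) {                       // 0xxxxxxx: one code unit
        out[0] = static_cast<unsigned char>(cp);
        return 1;
    }
    out[0] = static_cast<unsigned char>(0xC0 | (cp >> 6));   // 110xxxxx
    out[1] = static_cast<unsigned char>(0x80 | (cp & 0x3F)); // 10xxxxxx
    return 2;
}

int main()
{
    unsigned char buf[2];
    int n = utf8_encode(0x0100, buf);      // Ā is code point U+0100
    for (int i = 0; i < n; ++i)
        std::printf("%02X ", buf[i]);      // prints: C4 80
    std::printf("\n");
}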
MiiNiPaa wrote:
UTF-16 and UTF-32 use a single 16-bit and 32-bit code unit


<hypertechnicality>
UTF-16 uses 2 code units for code points above U+FFFF
</hypertechnicality>
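For the curious, a small sketch of how such a surrogate pair is computed; the example code point U+1F600 (an emoji outside the Basic Multilingual Plane) is my own choice:

#include <cstdio>
#include <cstdint>

int main()
{
    // Code points above U+FFFF need two UTF-16 code units (a surrogate pair).
    std::uint32_t cp = 0x1F600;

    std::uint32_t v    = cp - 0x10000;          // remaining 20-bit value
    std::uint16_t high = 0xD800 + (v >> 10);    // high (lead) surrogate
    std::uint16_t low  = 0xDC00 + (v & 0x3FF);  // low (trail) surrogate

    std::printf("U+%04X -> %04X %04X\n",        // U+1F600 -> D83D DE00
                static_cast<unsigned>(cp),
                static_cast<unsigned>(high),
                static_cast<unsigned>(low));
}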

@dilver:

Sort of. A code unit is a measure of how large the 'units' are for encoding text. This does not change per character, but instead changes per encoding.

The character "a" can be expressed in 1 code unit... but the size of that code unit varies depending on the encoding:

UTF-8 has 8-bit code units.
UTF-16 has 16-bit code units.
UTF-32 has 32-bit code units.
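A quick way to see those sizes from C++, using the C++11 character types (char is assumed to hold UTF-8 here):

#include <iostream>

int main()
{
    std::cout << sizeof(char)     << '\n';  // 1 byte:  UTF-8 code unit
    std::cout << sizeof(char16_t) << '\n';  // 2 bytes: UTF-16 code unit
    std::cout << sizeof(char32_t) << '\n';  // 4 bytes: UTF-32 code unit
}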
I don't think it's all that "hypertechnical" to know that UTF-16 is a variable-length encoding, just like UTF-8.