Monday, 10 January 2022

Unicode Trivia U+00F0

Block: U+0080..00FF "Latin-1 Supplement"

The Basic Latin Unicode block (U+0000..007F) is fine if you're writing English, but it quickly runs out of steam for other languages. For example, the Icelandic alphabet ("stafrófið") has 32 letters:

Aa Áá Bb Dd Ðð Ee Éé Ff Gg Hh Ii Íí Jj Kk Ll Mm Nn Oo Óó Pp Rr Ss Tt Uu Úú Vv Xx Yy Ýý Þþ Ææ Öö

The fifth letter "ð" is "eth", seen here in a sign in Landmannalaugar:

One of Unicode's founding principles is "universal repertoire" and, indeed, "LATIN SMALL LETTER ETH" has been assigned the unique codepoint U+00F0. Before Unicode, region-specific characters had a tendency to "move around" in codepoint space. For instance, in IBM DOS Code Page 861, small eth was at position 0x8C.

For every codepoint, the Unicode Character Database maintains a plethora of information. For U+00F0, we can view that data using an online utility:

Amongst other things, we see that:

  • The official name of that codepoint is "LATIN SMALL LETTER ETH"
  • It belongs to the "Latin-1 Supplement" block (U+0080..00FF)
  • It primarily belongs to the "Latin" script
  • It was introduced in Unicode 1.1
  • Its General Category is "Lowercase Letter"
  • Its uppercase mapping is "Ð" U+00D0
  • Its titlecase mapping is also "Ð" U+00D0
  • etc.

The titlecase mapping is somewhat moot as there are no words in Icelandic that begin with "eth". This makes children's "A is for Apple"-style alphabet posters somewhat difficult to produce:

It also means that the uppercase letter "Ð" (U+00D0) only occurs in text where the whole word is uppercase, such as … erm … "STAFRÓFIÐ".

No comments:

Post a Comment