Friday, 14 January 2022

Unicode Trivia U+030C

Codepoint: U+030C "COMBINING CARON"
Block: U+0300..036F "Combining Diacritical Marks"

Imagine the Bjorn of a heavy metal ABBA tribute band (really?), who styles his name:

Bǰörn

There's a caron over the "j" and a (heavy metal) umlaut over the "o". That's five Unicode codepoints:

U+0042, U+01F0, U+00F6, U+0072, U+006E

When his name is converted to all uppercase for the tour poster:

BJ̌ÖRN

it mysteriously becomes six Unicode codepoints:

U+0042, U+004A, U+030C, U+00D6, U+0052, U+004E

This is because although Unicode has the codepoint U+01F0 "LATIN SMALL LETTER J WITH CARON", it has no single codepoint for "LATIN CAPITAL LETTER J WITH CARON". The case mapping algorithm uses data from the UCD to map U+01F0 to the pair U+004A/U+030C.
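
The expansion can be reproduced in Python, whose `str.upper()` applies the full (multi-codepoint) case mappings. This is only a small sketch; the helper name `codepoints` is made up for illustration.

```python
def codepoints(s):
    """Return the codepoints of s in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in s]

# "Bǰörn" spelled with escapes so the source file stays unambiguous.
name = "B\u01F0\u00F6rn"
poster = name.upper()

print(len(name), codepoints(name))
print(len(poster), codepoints(poster))
```

The second line printed should list six codepoints, with U+004A and U+030C standing in for the single U+01F0.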

U+030C is the codepoint for "COMBINING CARON".

If we convert the name back to titlecase, we get:

Bǰörn

This looks the same (hopefully) but is also made up of six codepoints:

U+0042, U+006A, U+030C, U+00F6, U+0072, U+006E

The "O WITH DIAERESIS" round-tripped okay, but not the "J WITH CARON". What's going on?

Case mapping and case folding are very knotty problems. There are plenty of edge cases in Unicode where converting to uppercase or lowercase and back again does not produce the original input. You cannot perform case-insensitive matching by simply converting both strings to uppercase and comparing for equality. Converting both to lowercase (or titlecase) does not work either.
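
A short Python sketch of the failed round trip follows. It assumes Python 3.8 or later, and uses `str.capitalize()` as a stand-in for titlecasing a single word. The variable names are illustrative only.

```python
original = "B\u01F0\u00F6rn"        # five codepoints, precomposed U+01F0
poster = original.upper()           # six codepoints: J + combining caron
back = poster.capitalize()          # first letter titlecased, rest lowercased

# The round trip keeps the combining caron instead of restoring U+01F0.
print(back == original)             # False
print([f"U+{ord(c):04X}" for c in back])

# Lowercasing both sides does not make them compare equal either...
print(original.lower() == poster.lower())        # False

# ...whereas full case folding maps both to the same sequence.
print(original.casefold() == poster.casefold())  # True
```

Case folding happens to work here because the folding of U+01F0 is itself the decomposed pair. In general, folding needs to be combined with normalization.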

If we look at the UCD entry for U+01F0 "LATIN SMALL LETTER J WITH CARON", we see:

  1. dm = 006A 030C
  2. uc = 004A 030C
  3. lc = #
  4. tc = 004A 030C
  5. cf = 006A 030C

This can be interpreted as:

  1. The "Decomposition Mapping" is "U+006A U+030C". That's lowercase "j" followed by a combining caron.
  2. The (non-simple) "Uppercase Mapping" is "U+004A U+030C". That's uppercase "J" followed by a combining caron.
  3. The (non-simple) "Lowercase Mapping" is the unaltered codepoint, i.e. "U+01F0". That's the single codepoint "LATIN SMALL LETTER J WITH CARON".
  4. The (non-simple) "Titlecase Mapping" is "U+004A U+030C". That's uppercase "J" followed by a combining caron.
  5. The (non-simple) "Case Folding" is "U+006A U+030C". That's lowercase "j" followed by a combining caron.
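
Most of these properties can be inspected from Python's standard library, as a rough cross-check of the entry above. The sketch assumes that the interpreter's bundled Unicode database includes the full case mappings (true of current CPython 3 releases); the helper `hexes` is hypothetical.

```python
import unicodedata

def hexes(s):
    """Format a string as space-separated four-digit hex codepoints."""
    return " ".join(f"{ord(c):04X}" for c in s)

ch = "\u01F0"
print(unicodedata.name(ch))
print("dm =", unicodedata.decomposition(ch))
print("uc =", hexes(ch.upper()))
print("lc =", hexes(ch.lower()))     # unchanged: the UCD writes this as "#"
print("tc =", hexes(ch.title()))
print("cf =", hexes(ch.casefold()))
```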

From these definitions, one can imagine algorithms for:

  1. Converting strings into normalized forms (NFC/NFD/NFKC/NFKD) to avoid ambiguity, although there are still lots of additional complications.
  2. Converting strings to uppercase.
  3. Converting strings to lowercase.
  4. Converting strings to titlecase.
  5. Comparing strings in a case-insensitive way.

Further complications occur when more than one diacritic is attached to a letter. And then there's the question of ordering (collating) text with diacritics...
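
As an illustration of the last item in the list, here is a minimal sketch of a case-insensitive comparison that combines normalization with case folding. It follows the commonly cited pattern of normalizing to NFD, folding, and normalizing again. The function names are made up, and the sketch ignores locale-specific rules (Turkish dotted and dotless "i", for example).

```python
import unicodedata

def nfd(s):
    """Canonically decompose s (Unicode normalization form NFD)."""
    return unicodedata.normalize("NFD", s)

def caseless_equal(a, b):
    """Compare two strings ignoring case and canonical-equivalence differences."""
    # Folding can produce text that is no longer normalized, hence the second NFD.
    return nfd(nfd(a).casefold()) == nfd(nfd(b).casefold())

print(caseless_equal("B\u01F0\u00F6rn", "BJ\u030C\u00D6RN"))    # True
print(caseless_equal("Bj\u030Co\u0308rn", "B\u01F0\u00F6RN"))   # True
print(caseless_equal("Bjorn", "B\u01F0\u00F6rn"))               # False
```

The diacritics still matter: plain "Bjorn" does not match the styled name, because the comparison ignores only case and encoding differences, not accents.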

Perhaps Bjorn was so busy wondering why there's no umlaut in "umlaut" that he missed a trick. He should have styled himself:

Ƀǰöṝñ

U+0243, U+01F0, U+00F6, U+1E5D, U+00F1
