Codepoint: U+030C "COMBINING CARON"
Block: U+0300..036F "Combining Diacritical Marks"
Imagine the performer playing Bjorn in a heavy metal ABBA tribute band (really?) who styles his name:
Bǰörn
There's a caron over the "j" and a (heavy metal) umlaut over the "o". That's five Unicode codepoints:
U+0042, U+01F0, U+00F6, U+0072, U+006E
When his name is converted to all uppercase for the tour poster:
BJ̌ÖRN
it mysteriously becomes six Unicode codepoints:
U+0042, U+004A, U+030C, U+00D6, U+0052, U+004E
This is because although Unicode has the codepoint U+01F0 "LATIN SMALL LETTER J WITH CARON", it has no single codepoint for "LATIN CAPITAL LETTER J WITH CARON". The case mapping algorithm uses data from the UCD to map U+01F0 to the pair U+004A/U+030C.
U+030C is the codepoint for "COMBINING CARON".
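The expansion is easy to reproduce. Here is a minimal sketch in Python, assuming the interpreter's built-in `str.upper()` applies the full (multi-codepoint) case mappings; the helper name `codepoints` is made up for illustration:

```python
def codepoints(text):
    """Return the codepoints of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

# "B", J WITH CARON (U+01F0), O WITH DIAERESIS (U+00F6), "r", "n"
name = "B\u01F0\u00F6rn"
poster = name.upper()

print(len(name), codepoints(name))
print(len(poster), codepoints(poster))
```

The second line of output should list U+004A followed by U+030C where the single U+01F0 used to be.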
If we convert the name back to titlecase, we get:
Bǰörn
This looks the same (hopefully) but is also made up of six codepoints:
U+0042, U+006A, U+030C, U+00F6, U+0072, U+006E
The "O WITH DIAERESIS" round-tripped okay, but not the "J WITH CARON". What's going on?
Case mapping and case folding are very knotty problems. There are plenty of edge cases in Unicode where converting to uppercase or lowercase and back again does not produce the original input. You cannot perform case-insensitive matching by simply converting both strings to uppercase and comparing for equality. Converting both to lowercase (or titlecase) does not work either.
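A rough sketch of the failed round trip in Python follows. One assumption to flag: `str.capitalize()` stands in for "titlecase" here, because Python's `str.title()` restarts a word after any uncased character, including the combining caron, and so would not reproduce the spelling shown above.

```python
name = "B\u01F0\u00F6rn"        # five codepoints, precomposed j-with-caron
poster = name.upper()            # six codepoints: J + COMBINING CARON
back = poster.capitalize()       # first letter titlecased, rest lowercased

# The round trip does not restore U+01F0; "j" + U+030C is left behind.
round_trip_ok = (back == name)

# Lowercasing both sides does not make them comparable either:
# U+01F0 lowercases to itself, while "J" + U+030C becomes "j" + U+030C.
lower_match = (name.lower() == poster.lower())

# Case folding maps U+01F0 to "j" + U+030C, so the folded forms agree.
casefold_match = (name.casefold() == poster.casefold())

print(round_trip_ok, lower_match, casefold_match)
```

In this particular example the folded forms happen to agree without any normalization; that is not guaranteed for arbitrary input.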
If we look at the UCD entry for U+01F0 "LATIN SMALL LETTER J WITH CARON", we see:
- dm = 006A 030C
- uc = 004A 030C
- lc = #
- tc = 004A 030C
- cf = 006A 030C
This can be interpreted as:
- The "Decomposition Mapping" is "U+006A U+030C". That's lowercase "j" followed by a combining caron.
- The (non-simple) "Uppercase Mapping" is "U+004A U+030C". That's uppercase "J" followed by a combining caron.
- The (non-simple) "Lowercase Mapping" is the unaltered codepoint, i.e. "U+01F0". That's the single codepoint "LATIN SMALL LETTER J WITH CARON".
- The (non-simple) "Titlecase Mapping" is "U+004A U+030C". That's uppercase "J" followed by a combining caron.
- The (non-simple) "Case Folding" is "U+006A U+030C". That's lowercase "j" followed by a combining caron.
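Most of these fields can be cross-checked from Python. This is only a sketch: it assumes the interpreter's bundled Unicode tables match the UCD version quoted above, and the labels `dm`, `uc`, `lc`, `tc`, and `cf` are simply reused from the listing (`hexes` is a made-up helper).

```python
import unicodedata

def hexes(text):
    """Format each codepoint of text as four or more hex digits."""
    return " ".join(f"{ord(ch):04X}" for ch in text)

ch = "\u01F0"  # LATIN SMALL LETTER J WITH CARON

fields = {
    "dm": unicodedata.decomposition(ch),  # already a hex string
    "uc": hexes(ch.upper()),
    "lc": hexes(ch.lower()),
    "tc": hexes(ch.title()),
    "cf": hexes(ch.casefold()),
}

for key, value in fields.items():
    print(f"{key} = {value}")
```

Note that Python reports the lowercase mapping as the codepoint itself (01F0) rather than the "#" shorthand used in the UCD listing.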
From these definitions, one can imagine algorithms for:
- Normalizing strings into one of the normalization forms (NFC/NFD/NFKC/NFKD) to avoid ambiguity, although there are still lots of additional complications.
- Converting strings to uppercase.
- Converting strings to lowercase.
- Converting strings to titlecase.
- Comparing strings in a case-insensitive way.
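As an illustration of the last item, here is a sketch of a case-insensitive comparison that combines normalization with case folding, following the "canonical caseless match" recipe from the Unicode Standard: NFD, then case fold, then NFD again. The function names are hypothetical, and a real application may also need locale-specific tailoring that this ignores.

```python
import unicodedata

def canonical_caseless_key(text):
    """Return NFD(casefold(NFD(text))) for use as a comparison key."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFD", decomposed.casefold())

def caseless_equal(a, b):
    """Compare two strings, ignoring case and canonical-equivalent spellings."""
    return canonical_caseless_key(a) == canonical_caseless_key(b)

original = "B\u01F0\u00F6rn"         # precomposed letters
poster = "BJ\u030C\u00D6RN"          # uppercase, combining caron
decomposed = "Bj\u030Co\u0308rn"     # fully decomposed spelling

print(caseless_equal(original, poster))
print(caseless_equal(original, decomposed))
print(caseless_equal(original, "Bjorn"))
```

The first two comparisons should succeed and the third should fail, since plain "Bjorn" has no caron or diaeresis at all.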
Perhaps Bjorn was so busy wondering why there's no umlaut in "umlaut" that he missed a trick. He should have styled himself: