Sunday 6 February 2022

Unicode Trivia U+0CDE

Block: U+0C80..0CFF "Kannada"

The Kannada script is one of those added in Unicode 1.0 as part of the importing of the ISCII character sets in 1991. The 1991 ISCII Standard encoded ten Indic character sets:

  1. Devanagari (DEV/57002)
  2. Bengali (BNG/57003)
  3. Tamil (TML/57004)
  4. Telugu (TLG/57005)
  5. Assamese (ASM/57006)
  6. Oriya (ORI/57007)
  7. Kannada (KND/57008)
  8. Malayalam (MLM/57009)
  9. Gujarati (GJR/57010)
  10. Punjabi (PNJ/57011)

As part of the importation process:

  • "Bengali" and "Assamese" were folded into a single "Bengali/Assamese" script known in Unicode data tables simply as "Bengali"
  • "Punjabi" was renamed "Gurmukhi" (the former is a language, the latter is a script)
  • "Oriya" was not renamed "Odia" (as this didn't happen until November 2011)

The nine remaining scripts were mapped to the 128-codepoint blocks we see in Unicode today:

  • Devanagari [U+0900..097F]
  • Bengali [U+0980..09FF]
  • Gurmukhi [U+0A00..0A7F]
  • Gujarati [U+0A80..0AFF]
  • Oriya [U+0B00..0B7F]
  • Tamil [U+0B80..0BFF]
  • Telugu [U+0C00..0C7F]
  • Kannada [U+0C80..0CFF]
  • Malayalam [U+0D00..0D7F]

Richard Ishida has an excellent page describing these scripts and the importation process, but here's a summary table I put together of the codepoints (with hexadecimal offsets within the blocks) that are purposefully aligned across the scripts:

The alignment was originally designed to make transliteration between the scripts trivial, but this was never truly practical.
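To make the alignment concrete, here is a minimal Python sketch of the naive "same offset, different block" mapping. The block starts come from the list above; the names `BLOCK_START`, `block_offset` and `transpose` are my own, purely for illustration. It ignores the gaps and script-specific codepoints, which is a large part of why the approach was never practical.

```python
# Block start codepoints, taken from the list of Unicode blocks above.
BLOCK_START = {
    "Devanagari": 0x0900,
    "Bengali": 0x0980,
    "Gurmukhi": 0x0A00,
    "Gujarati": 0x0A80,
    "Oriya": 0x0B00,
    "Tamil": 0x0B80,
    "Telugu": 0x0C00,
    "Kannada": 0x0C80,
    "Malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80  # each block spans 128 codepoints


def block_offset(ch: str, script: str) -> int:
    """Return the offset of ch within the named block, or raise ValueError."""
    offset = ord(ch) - BLOCK_START[script]
    if not 0 <= offset < BLOCK_SIZE:
        raise ValueError(f"U+{ord(ch):04X} is not in the {script} block")
    return offset


def transpose(ch: str, source: str, target: str) -> str:
    """Naively move ch to the same offset in another block.

    No check is made that the target codepoint is assigned or equivalent.
    """
    return chr(BLOCK_START[target] + block_offset(ch, source))


deva_ka = "\u0915"  # DEVANAGARI LETTER KA, offset 0x15
knda_ka = transpose(deva_ka, "Devanagari", "Kannada")
print(f"U+{ord(knda_ka):04X}")  # U+0C95, KANNADA LETTER KA
```

The same shift applied to a codepoint that sits in one of the gaps simply lands on whatever occupies that offset in the target block, assigned or not.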

We can see that the Tamil column has quite a few missing (grey) codepoints; Tamil has fewer isolated letters in its "alphabet" than other Brahmic scripts. This is partly because it does not have distinct letters for aspirated consonants.

There are obviously gaps in the rows of the chart above, which leave room for script-specific codepoints. So, for Kannada, there are extra codepoints:

  • U+0C80 "KANNADA SIGN SPACING CANDRABINDU" — a non-combining Candrabindu
  • U+0C84 "KANNADA SIGN SIDDHAM" — used at the beginning of texts as an invocation
  • U+0CBC "KANNADA SIGN NUKTA" — used to represent sounds not present in Kannada
  • U+0CD5 "KANNADA LENGTH MARK" — used to extend vowel sounds
  • U+0CD6 "KANNADA AI LENGTH MARK" — used to extend AI vowel sounds
  • U+0CDD "KANNADA LETTER NAKAARA POLLU" — a vowel-less form of NA

U+0CDE "KANNADA LETTER FA" was added in Unicode 1.0:

Unicode 1.0 Code Chart

But there is no letter FA for Kannada mentioned in ISCII 1991. Indeed, there is no letter FA in Kannada full stop. As Richard Ishida explains:

The Kannada character U+0CDE KANNADA LETTER FA "ೞ" was incorrectly named. A more appropriate name would be LLLA, rather than FA. Because of the rules for Unicode naming, the current name cannot, however, be changed. Fortunately this letter has not been actively used in Kannada since the end of the 10th century.

Fortunate, indeed!
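You can see the frozen name for yourself with Python's standard `unicodedata` module. This is only a sketch: the helper `lookup_or_none` is my own, and the second half assumes that the Unicode Character Database lists "KANNADA LETTER LLLA" as a formal correction alias and that your Python build exposes aliases through `unicodedata.lookup()`. If it doesn't, the helper just returns `None`.

```python
import unicodedata

LLLA = "\u0CDE"

# The formal character name can never change, so it still reports FA.
print(unicodedata.name(LLLA))  # KANNADA LETTER FA


def lookup_or_none(name: str):
    """Resolve a character name or alias; None if this build doesn't know it."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        return None


# Assumption: "KANNADA LETTER LLLA" is available as a correction alias
# in the Unicode data bundled with this Python build.
alias_target = lookup_or_none("KANNADA LETTER LLLA")
if alias_target is None:
    print("alias not known to this build")
else:
    print(f"alias resolves to U+{ord(alias_target):04X}")
```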

The table in Wikipedia seems to want to perpetuate the error, although, as a record of the actual importation process, it's unhelpfully accurate.
