Sunday, 20 February 2022

Unicode Trivia U+0EA5

Codepoint: U+0EA5 "LAO LETTER LO LOOT"
Block: U+0E80..0EFF "Lao"

The Lao script (Akson Lao) is a sister script of the Thai script; both derive from the Sukhothai script of the thirteenth century CE. As such, they have many similarities. For instance, both Lao and Thai consonants are given individual names. Here are the 27 Lao consonants with their typical names:

  1. ກ = chicken (ໄກ່)
  2. ຂ = egg (ໄຂ່)
  3. ຄ = water buffalo (ຄວາຍ)
  4. ງ = ox (ງົວ)
  5. ຈ = glass (ຈອກ)
  6. ສ = tiger (ເສືອ)
  7. ຊ = elephant (ຊ້າງ)
  8. ຍ = mosquito (ຍຸງ)
  9. ດ = child (ເດັກ)
  10. ຕ = eye (ຕາ)
  11. ຖ = bag (ຖົງ)
  12. ທ = flag (ທຸງ)
  13. ນ = bird (ນົກ)
  14. ບ = goat (ແບ້)
  15. ປ = fish (ປາ)
  16. ຜ = bee (ເຜິ້ງ)
  17. ຝ = rain (ຝົນ)
  18. ພ = mountain (ພູ)
  19. ຟ = fire (ໄຟ)
  20. ມ = cat (ແມວ)
  21. ຢ = medicine (ຢາ)
  22. ຣ = car (ຣົຖ)
  23. ລ = monkey (ລີງ)
  24. ວ = fan (ວີ)
  25. ຫ = goose (ຫ່ານ)
  26. ອ = bowl (ອື່ງ)
  27. ຮ = house (ເຮືອນ)

Each consonant's name begins with that consonant, a practice known as acrophony, in a similar fashion to English alphabet mnemonics such as "A is for apple, B is for banana, etc.":

[source]

Alas, the mapping of these consonants to the appropriate "column" of the Unicode Lao block is complicated by two factors:

  1. The Unicode encoding is based loosely on Thai Industrial Standard 620-2533 and has holes where unused characters are omitted.
  2. The names of four of the consonants were incorrect when they were added to Unicode 1.0.

These complications are discussed in Andrew West's N3137 notes:

The Unicode code charts note that the Lao block is "Based on TIS 620-2529". This statement is misleading as TIS 620-2529 is a Thai standard for representing the Thai script in an 8-bit code, and does not define names or code points for the Lao script. The Unicode Lao block is based on a mapping of Lao characters to the equivalent Thai characters in TIS 620, but is not actually based on this standard.

And:

The Unicode names for Lao consonants are based on the syllabic pronunciation of the character (i.e. consonant plus inherent vowel). All consonants belong to one of three tone classes: high, mid and low. Where two letters are only distinguished phonetically by their tone class, the modifiers SUNG "high" and TAM "low" are used to indicate the tone class of the letter (e.g. U+0E82 "LAO LETTER KHO SUNG" and U+0E84 "LAO LETTER KHO TAM"). However, the Unicode names for two of the consonants have the wrong tone class applied to them:

U+0E9D "LAO LETTER FO TAM" is a high tone class letter, and should have been named "LAO LETTER FO SUNG"

U+0E9F "LAO LETTER FO SUNG" is a low tone class letter, and should have been named "LAO LETTER FO TAM"

Whilst the Unicode names for 25 of the 27 consonants use this naming scheme, the names of two of the consonants use mnemonic names (presumably because they share the same vowel and tone class, and so could not otherwise be differentiated). Mnemonic names are how the consonants are normally identified in the Lao language, although there is no official list of standard mnemonic names for consonants, and different sources may use different mnemonic names for some letters.

The two letters whose Unicode names are based on mnemonic names are:

U+0EA3 "LAO LETTER LO LING"

U+0EA5 "LAO LETTER LO LOOT"

The mnemonic names for these two letters are the wrong way round. U+0EA5 is the normal letter [l] and is universally identified by the mnemonic name lo ling "lo as in ling [monkey]". On the other hand, U+0EA3 is a letter that is used to represent [r] in foreign words; however this letter has been officially deprecated by the Lao government since 1975, and is no longer in common use. The name element LO LOOT applied to U+0EA5 would seem to represent the mnemonic ro rot, "rot" meaning automobile, that should be applied to U+0EA3.

So U+0EA3 should be named "LAO LETTER RO ROT" (car) and U+0EA5 should be named "LAO LETTER LO LING" (monkey).
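If you want to see the mismatch for yourself, here is a minimal Python sketch that prints the formal name stored in the Unicode Character Database next to the name each letter arguably should have had. The MISNAMED table and the formal_name helper are my own illustrative names; the "intended" strings come from the discussion above, not from any official data file.

```python
import unicodedata

# Formal names are immutable, so the database still reports the original
# (incorrect) names. The "intended" values are taken from the text above.
MISNAMED = {
    0x0E9D: "LAO LETTER FO SUNG",
    0x0E9F: "LAO LETTER FO TAM",
    0x0EA3: "LAO LETTER RO ROT",
    0x0EA5: "LAO LETTER LO LING",
}


def formal_name(codepoint: int) -> str:
    """Return the formal character name recorded in the Unicode Character Database."""
    return unicodedata.name(chr(codepoint))


for cp, intended in MISNAMED.items():
    print(f"U+{cp:04X}: formal name {formal_name(cp)!r}, intended {intended!r}")
```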

It is interesting that the Unicode standard has effectively "nailed down" the names of the consonants even though Andrew West says there is no official standard.

It has always troubled me that English does not have a satisfactory mechanism for naming its letters. These are the names typically used in British English:

  1. a
  2. bee
  3. cee
  4. dee
  5. e
  6. eff
  7. gee
  8. aitch
  9. i
  10. jay
  11. kay
  12. el
  13. em
  14. en
  15. o
  16. pee
  17. cue
  18. ar
  19. ess
  20. tee
  21. u
  22. vee
  23. double-u
  24. ex
  25. wye
  26. zed

If we ignore "double-u" (which we've met before), the obvious elephant in the room is "cue" for "Q". Not only is it not acrophonic (only 15 of the 26 truly are), "Q" doesn't appear anywhere in its name.
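The count of 15 is easy to check mechanically. This is a small sketch, assuming "acrophonic" simply means "the name starts with the letter itself"; the helper names are mine.

```python
import string

# Letter names exactly as listed above (British English).
NAMES = [
    "a", "bee", "cee", "dee", "e", "eff", "gee", "aitch", "i", "jay",
    "kay", "el", "em", "en", "o", "pee", "cue", "ar", "ess", "tee",
    "u", "vee", "double-u", "ex", "wye", "zed",
]
LETTER_NAMES = dict(zip(string.ascii_lowercase, NAMES))


def acrophonic_letters() -> list[str]:
    """Letters whose name begins with the letter itself."""
    return [letter for letter, name in LETTER_NAMES.items() if name.startswith(letter)]


def letters_absent_from_own_name() -> list[str]:
    """Letters that do not occur anywhere in their own name."""
    return [letter for letter, name in LETTER_NAMES.items() if letter not in name]


print(len(acrophonic_letters()), "of", len(LETTER_NAMES), "names are acrophonic")
print("absent from own name:", letters_absent_from_own_name())
```

With these spellings, the only letters missing from their own names are "q" and "w".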

Monday, 14 February 2022

Unicode Trivia U+0E74

Codepoint: U+0E74 "THAI PHONETIC ORDER VOWEL SIGN SARA MAI MALAI"
Block: U+0E00..0E7F "Thai"

Huh? According to Unicode's own lookup utility, U+0E74 is an unassigned codepoint. But that wasn't always the case. Back in Unicode 1.0.0 (October 1991) it was U+0E74 "THAI PHONETIC ORDER VOWEL SIGN SARA MAI MALAI":

[source]

Alas, codepoints U+0E70 to U+0E74 only lasted until Unicode 1.0.1 (June 1992) when they were deleted. This was the only time a non-zero patch version (i.e. "major.minor.patch" where patch ≠ 0) of Unicode was officially released. The stability policy means that another patch release is highly unlikely and the removal of codepoints impossible:

Encoding Stability (since Unicode 2.0)

Once a character is encoded, it will not be moved or removed. 

This policy ensures that implementers can always depend on each version of the Unicode Standard being a superset of the previous version. The Unicode Standard may deprecate the character (that is, formally discourage its use), but it will not reallocate, remove, or reassign the character.

So why was U+0E74 "THAI PHONETIC ORDER VOWEL SIGN SARA MAI MALAI" and its siblings removed? According to the Notice, it was to bring the Unicode and ISO 10646 standards back in line; U+0E74 was never added to ISO 10646. According to the minutes of a May 1992 meeting of the Unicode Technical Committee:

The UTC has noticed the requirement to remove 5 THAI characters (U+0E70 - U+0E74) and 5 LAO characters (U+0EF0 - U+0EF4). In the interest of the merger between ISO 10646 and Unicode the UTC authorizes its representatives attending the SC2/WG2 meeting in Korea to be flexible on this subject.

The juxtaposition of "authorizes" and "flexible" made me smile.

It appears that Thai Phonetic Order Vowel Signs were redundant and could cause ambiguity:

Nowadays, the Thai syllable ไตร, normatively pronounced /trai/, is only encoded <U+0E44 THAI CHARACTER SARA AI MAIMALAI, U+0E15 THAI CHARACTER TO TAO, U+0E23 THAI CHARACTER RO RUA>, and the character U+0E3A is always visible when used; for most routine purposes it is little different to U+0E38 THAI CHARACTER SARA U.  However, in Unicode 1.0[.0], while <U+0E44, U+0E15, U+0E23> was rendered as at present, the same visible string could also be encoded as <U+0E15, U+0E3A, U+0E23, U+0E74 THAI PHONETIC ORDER VOWEL SIGN SARA MAI MALAI> - no glyph would be rendered for U+0E3A.

I think that's implying that the sequence <... U+0E74> could just as easily be encoded as <U+0E44 ...>. The original glyph charts suggest that too:

[source]

Of course, if someone legitimately used U+0E74 in a document between October 1991 and June 1992, their document would become officially invalid or corrupt after June 1992.
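You can confirm the current status of the removed Thai codepoints with Python's unicodedata module. This is only a sketch: the answer depends on the version of the Unicode Character Database bundled with your Python, and is_unassigned is my own helper name.

```python
import unicodedata


def is_unassigned(codepoint: int) -> bool:
    """True if the codepoint has general category Cn (other, not assigned)."""
    return unicodedata.category(chr(codepoint)) == "Cn"


# The five Thai codepoints deleted in Unicode 1.0.1.
removed_thai = range(0x0E70, 0x0E75)
print([f"U+{cp:04X}" for cp in removed_thai if is_unassigned(cp)])
```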

Friday, 11 February 2022

Unicode Trivia U+0DA5

Codepoint: U+0DA5 "SINHALA LETTER TAALUJA SANYOOGA NAAKSIKYAYA"
Block: U+0D80..0DFF "Sinhala"

As Richard Gillam says in "Unicode Demystified" (2003), page 330:

The Unicode Sinhala block runs from U+0D80 to U+0DFF. It does not follow the ISCII order, partly because the ISCII standard doesn't include a code page for Sinhala and partly because Sinhala includes a lot of sounds (and, thus, letters) that aren't present in any of the Indian scripts. The basic setup of the block is the same: anusvara and visarga first, followed by independent vowels, consonants, dependent vowels, and punctuation. Unlike in the ISCII-derived blocks, the al-lakuna (virama) precedes the dependent vowels, rather than following them.

The order of codepoints (or of text made up of codepoints) can be thought of in at least three ways:

  1. The order of codepoints within the character set, e.g. Unicode ("codepoint order")
  2. The order of letters in an 'alphabet', e.g. Sinhala abugida ("alphabet order")
  3. The typical order of words in a language's dictionary ("collation order")

As an example, we'll consider the letters (and only the standalone letters) from the Sinhala block (U+0D80..0DFF).

In codepoint order, these are:

  • 18 independent vowels (U+0D85..0D96)
  • 41 consonants (U+0D9A..0DC6)

The alphabet order (according to sites such as Omniglot) is the same as the codepoint order. This was presumably a factor in the ordering of the codepoints when the block was added to Unicode 3.0 in 1999.

However, in "collation order" these 59 letters (along with their Sinhalese and Romanized phonetic names) are:

  1. U+0D85 = "අ" = AYANNA = vowel a
  2. U+0D86 = "ආ" = AAYANNA = vowel aa
  3. U+0D87 = "ඇ" = AEYANNA = vowel ae
  4. U+0D88 = "ඈ" = AEEYANNA = vowel aae
  5. U+0D89 = "ඉ" = IYANNA = vowel i
  6. U+0D8A = "ඊ" = IIYANNA = vowel ii
  7. U+0D8B = "උ" = UYANNA = vowel u
  8. U+0D8C = "ඌ" = UUYANNA = vowel uu
  9. U+0D8D = "ඍ" = IRUYANNA = vowel vocalic r
  10. U+0D8E = "ඎ" = IRUUYANNA = vowel vocalic rr
  11. U+0D8F = "ඏ" = ILUYANNA = vowel vocalic l
  12. U+0D90 = "ඐ" = ILUUYANNA = vowel vocalic ll
  13. U+0D91 = "එ" = EYANNA = vowel e
  14. U+0D92 = "ඒ" = EEYANNA = vowel ee
  15. U+0D93 = "ඓ" = AIYANNA = vowel ai
  16. U+0D94 = "ඔ" = OYANNA = vowel o
  17. U+0D95 = "ඕ" = OOYANNA = vowel oo
  18. U+0D96 = "ඖ" = AUYANNA = vowel au
  19. U+0D9A = "ක" = ALPAPRAANA KAYANNA = consonant ka
  20. U+0D9B = "ඛ" = MAHAAPRAANA KAYANNA = consonant kha
  21. U+0D9C = "ග" = ALPAPRAANA GAYANNA = consonant ga
  22. U+0D9D = "ඝ" = MAHAAPRAANA GAYANNA = consonant gha
  23. U+0D9E = "ඞ" = KANTAJA NAASIKYAYA = consonant nga
  24. U+0D9F = "ඟ" = SANYAKA GAYANNA = consonant nnga
  25. U+0DA0 = "ච" = ALPAPRAANA CAYANNA = consonant ca
  26. U+0DA1 = "ඡ" = MAHAAPRAANA CAYANNA = consonant cha
  27. U+0DA2 = "ජ" = ALPAPRAANA JAYANNA = consonant ja
  28. U+0DA5 = "ඥ" = TAALUJA SANYOOGA NAAKSIKYAYA = consonant jnya
  29. U+0DA3 = "ඣ" = MAHAAPRAANA JAYANNA = consonant jha
  30. U+0DA4 = "ඤ" = TAALUJA NAASIKYAYA = consonant nya
  31. U+0DA6 = "ඦ" = SANYAKA JAYANNA = consonant nyja
  32. U+0DA7 = "ට" = ALPAPRAANA TTAYANNA = consonant tta
  33. U+0DA8 = "ඨ" = MAHAAPRAANA TTAYANNA = consonant ttha
  34. U+0DA9 = "ඩ" = ALPAPRAANA DDAYANNA = consonant dda
  35. U+0DAA = "ඪ" = MAHAAPRAANA DDAYANNA = consonant ddha
  36. U+0DAB = "ණ" = MUURDHAJA NAYANNA = consonant nna
  37. U+0DAC = "ඬ" = SANYAKA DDAYANNA = consonant nndda
  38. U+0DAD = "ත" = ALPAPRAANA TAYANNA = consonant ta
  39. U+0DAE = "ථ" = MAHAAPRAANA TAYANNA = consonant tha
  40. U+0DAF = "ද" = ALPAPRAANA DAYANNA = consonant da
  41. U+0DB0 = "ධ" = MAHAAPRAANA DAYANNA = consonant dha
  42. U+0DB1 = "න" = DANTAJA NAYANNA = consonant na
  43. U+0DB3 = "ඳ" = SANYAKA DAYANNA = consonant nda
  44. U+0DB4 = "ප" = ALPAPRAANA PAYANNA = consonant pa
  45. U+0DB5 = "ඵ" = MAHAAPRAANA PAYANNA = consonant pha
  46. U+0DB6 = "බ" = ALPAPRAANA BAYANNA = consonant ba
  47. U+0DB7 = "භ" = MAHAAPRAANA BAYANNA = consonant bha
  48. U+0DB8 = "ම" = MAYANNA = consonant ma
  49. U+0DB9 = "ඹ" = AMBA BAYANNA = consonant mba
  50. U+0DBA = "ය" = YAYANNA = consonant ya
  51. U+0DBB = "ර" = RAYANNA = consonant ra
  52. U+0DBD = "ල" = DANTAJA LAYANNA = consonant la
  53. U+0DC0 = "ව" = VAYANNA = consonant va
  54. U+0DC1 = "ශ" = TAALUJA SAYANNA = consonant sha
  55. U+0DC2 = "ෂ" = MUURDHAJA SAYANNA = consonant ssa
  56. U+0DC3 = "ස" = DANTAJA SAYANNA = consonant sa
  57. U+0DC4 = "හ" = HAYANNA = consonant ha
  58. U+0DC5 = "ළ" = MUURDHAJA LAYANNA = consonant lla
  59. U+0DC6 = "ෆ" = FAYANNA = consonant fa

Spot the anomaly? Well, U+0DA5 "SINHALA LETTER TAALUJA SANYOOGA NAAKSIKYAYA" is out of order.

U+0DA5 from r12a
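To make the difference between codepoint order and collation order concrete, here is a rough Python sketch. It is not real CLDR collation: collation_key is a hypothetical, hand-written key that moves only U+0DA5 so that it sorts directly after U+0DA2, as in the list above.

```python
import unicodedata


def sinhala_letters() -> list[str]:
    """Standalone Sinhala letters (general category Lo) in codepoint order."""
    return [
        chr(cp)
        for cp in range(0x0D85, 0x0DC7)
        if unicodedata.category(chr(cp)) == "Lo"
    ]


def collation_key(letter: str) -> float:
    # Simplified stand-in for a collation tailoring: U+0DA5 sorts just after
    # U+0DA2; every other letter keeps its codepoint position.
    return 0x0DA2 + 0.5 if letter == "\u0da5" else ord(letter)


by_codepoint = sinhala_letters()
by_collation = sorted(by_codepoint, key=collation_key)

for position, letter in enumerate(by_collation, start=1):
    print(position, f"U+{ord(letter):04X}", unicodedata.name(letter))
```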

As an English speaker, I'm used to the codepoint order, alphabet order and collation order of the letters "A" to "Z" being identical, so subtle anomalies like this feel jarring. So jarring, in fact, that I checked it against three different sources (Unicode CLDR, MySQL and dictionary.gov.lk) to make sure I hadn't made a transcription error.

It's a bit like having the English alphabet "ABCDEFGHIJKLMNOPQRSTUVWXYZ" but listing words in an English dictionary in a different order, such as "ABCDEFGHIJKLPMNOQRSTUVWXYZ".

You only really need to nail down the order of letters of a writing system when you start creating reference dictionaries. However, as the Sinhala Dictionary Compilation Institute says, this didn't happen until British colonial rule of what became Sri Lanka. It's impossible to imagine that the British compilers didn't impose some of their preconceptions on the process and therefore muddied the ordering waters.

As Richard Gillam pointed out, Sinhala has a large number of letters and U+0DA5 "SINHALA LETTER TAALUJA SANYOOGA NAAKSIKYAYA" is one of those that doesn't fit into the canonical Brahmic consonant ordering utilised by ISCII.

A survey by Weerasinghe, Herath and Gamage (2006) supplies many definitions of Sinhalese "dictionary order" in current use. Indeed, even if Unicode CLDR collation is adopted as a single de facto standard, the collation tailoring metadata is considered "live", and therefore liable to change anyway.


Monday, 7 February 2022

Unicode Trivia U+0D5A

Codepoint: U+0D5A "MALAYALAM FRACTION THREE EIGHTIETHS"
Block: U+0D00..0D7F "Malayalam"

Malayalam script is the main method of writing the Malayalam language of South West India, spoken by about forty million people. It is a Brahmic script, imported into Unicode 1.0 along with the other scripts covered by ISCII 1991.

Although "Malayalam" is a palindrome, I want to talk about their fractions. I like Unicode fractions. Have you noticed?

The original Unicode 1.0 block didn't have any Malayalam-specific fraction codepoints. These were added in Unicode 6.0 (proposal N2970 by V. S. Umamaheswaran 2005-08-23) and Unicode 9.0 (proposal N4429 by Shriramana Sharma 2013-04-25). The latter proposal, N4429, gives details of the old system of Malayalam fractions used before decimalisation:

Malayalam Fraction Multiplication Examples
Prācīna Gaṇitaṃ Malayāḷattil, Prof C K Moosathu, Kerala State Institute of Language, 1980

It appears to be similar to Tamil fractions, using 320 as a common denominator:

  • 1/320 (one three-hundred-and-twentieth) = U+0D2A U+0D4D U+0D24 = "പ്ത" (muntiri)
  • 2/320 (one one-hundred-and-sixtieth) = U+0D58 = "൘" (arakkāṇi)
  • 4/320 (one eightieth) = U+0D2E = "മ" (kāṇi)
  • 8/320 (one fortieth) = U+0D59 = "൙" (aramā)
  • 12/320 (three eightieths) = U+0D5A = "൚" (mūnnukāṇi)
  • 16/320 (one twentieth) = U+0D5B = "൛" (orumā)
  • 20/320 (one sixteenth) = U+0D76 = "൶" (mākāṇi)
  • 32/320 (one tenth) = U+0D5C = "൜" (raṇṭumā)
  • 40/320 (one eighth) = U+0D77 = "൷" (arakkāl)
  • 48/320 (three twentieths) = U+0D5D = "൝" (mūnnumā)
  • 60/320 (three sixteenths) = U+0D78 = "൸" (muṇṭāṇi)
  • 64/320 (one fifth) = U+0D5E = "൞" (nālŭmā)
  • 80/320 (one quarter) = U+0D73 = "൳" (kāl)
  • 160/320 (one half) = U+0D74 = "൴" (ara)
  • 240/320 (three quarters) = U+0D75 = "൵" (mukkāl)

Note that there's no single codepoint for "1/320" (the sequence U+0D2A U+0D4D U+0D24 achieves the required glyph) and "4/320" shares a glyph with U+0D2E "MALAYALAM LETTER MA". Other than these two exceptions, the names of the fraction codepoints are as expected:

  • U+0D58 "MALAYALAM FRACTION ONE ONE-HUNDRED-AND-SIXTIETH"
  • U+0D59 "MALAYALAM FRACTION ONE FORTIETH"
  • U+0D5A "MALAYALAM FRACTION THREE EIGHTIETHS"
  • U+0D5B "MALAYALAM FRACTION ONE TWENTIETH"
  • U+0D5C "MALAYALAM FRACTION ONE TENTH"
  • U+0D5D "MALAYALAM FRACTION THREE TWENTIETHS"
  • U+0D5E "MALAYALAM FRACTION ONE FIFTH"
  • U+0D73 "MALAYALAM FRACTION ONE QUARTER"
  • U+0D74 "MALAYALAM FRACTION ONE HALF"
  • U+0D75 "MALAYALAM FRACTION THREE QUARTERS"
  • U+0D76 "MALAYALAM FRACTION ONE SIXTEENTH"
  • U+0D77 "MALAYALAM FRACTION ONE EIGHTH"
  • U+0D78 "MALAYALAM FRACTION THREE SIXTEENTHS"

All fractions with a denominator of 320 can easily be represented by adding together parts. In the image above, the third answer is:

3/16 = 60/320 = "൸" (U+0D78)

The sixth answer is:

3/64 = (12 + 2 + 1)/320 = "൚൘പ്ത" (U+0D5A U+0D58 U+0D2A U+0D4D U+0D24)

There's the possibility of ambiguity here because we could construct 3/64 using (12+2+1)/320 or (8+4+2+1)/320, but I assume you always pick the biggest part you can at every step.
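My "always pick the biggest part" assumption is just a greedy algorithm, and it can be sketched in a few lines of Python. To be clear, the greedy rule is my guess, not something stated in N4429; the PARTS table simply restates the list of glyphs given earlier, and decompose and render are illustrative names.

```python
# Numerators (over 320) that have a glyph, mapped to their codepoint sequence.
PARTS = {
    240: "\u0d75", 160: "\u0d74", 80: "\u0d73", 64: "\u0d5e", 60: "\u0d78",
    48: "\u0d5d", 40: "\u0d77", 32: "\u0d5c", 20: "\u0d76", 16: "\u0d5b",
    12: "\u0d5a", 8: "\u0d59", 4: "\u0d2e", 2: "\u0d58",
    1: "\u0d2a\u0d4d\u0d24",  # no single codepoint for 1/320
}


def decompose(numerator: int) -> list[int]:
    """Greedily split numerator/320 into the available parts, largest first."""
    if not 0 < numerator < 320:
        raise ValueError("numerator must be between 1 and 319")
    parts = []
    remaining = numerator
    for part in sorted(PARTS, reverse=True):
        while remaining >= part:
            parts.append(part)
            remaining -= part
    return parts


def render(numerator: int) -> str:
    """Concatenate the glyphs for the greedy decomposition of numerator/320."""
    return "".join(PARTS[part] for part in decompose(numerator))


print(decompose(15), render(15))  # 3/64 = 15/320
```

Under the greedy rule, 15/320 comes out as 12 + 2 + 1 rather than 8 + 4 + 2 + 1, which agrees with the sixth answer in the image.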

Many Indic systems use base-4 for fractions, so the choice of 320 as the common denominator seems peculiar to me. If it was based on a currency or metric with subdivisions of 320, I haven't been able to find any references to that. I suppose base-320 has an advantage over dyadic fractions in that it is divisible by 5, 10, etc. But why not choose base-60 like the Sumerians, or even base-360?

  • Factors of 320: 1, 2, 4, 5, 8, 10, 16, 20, 32, 40, 64, 80, 160, 320 (14)
  • Factors of 360: 1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 18, 20, 24, 30, 36, 40, 45, 60, 72, 90, 120, 180, 360 (24)
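The two factor lists above can be reproduced with a naive divisor search; a quick sketch (divisors is my own helper, and brute force is fine for numbers this small):

```python
def divisors(n: int) -> list[int]:
    """All positive divisors of n, found by trial division."""
    return [d for d in range(1, n + 1) if n % d == 0]


for n in (320, 360):
    found = divisors(n)
    print(f"Factors of {n}: {found} ({len(found)})")
```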

I assume that the Malayali preferred powers of two over the convenience of dividing into thirds and the like. But the choice of 320 means that the sequence breaks down if you double in size from the smallest:

  • 1/320 (one three-hundred-and-twentieth) = U+0D2A U+0D4D U+0D24 = "പ്ത" (muntiri)
  • 2/320 (one one-hundred-and-sixtieth) = U+0D58 = "൘" (arakkāṇi)
  • 4/320 (one eightieth) = U+0D2E = "മ" (kāṇi)
  • 8/320 (one fortieth) = U+0D59 = "൙" (aramā)
  • 16/320 (one twentieth) = U+0D5B = "൛" (orumā)
  • 32/320 (one tenth) = U+0D5C = "൜" (raṇṭumā)
  • 64/320 (one fifth) = U+0D5E = "൞" (nālŭmā)
  • What now?

Or if you halve in size:

  • 160/320 (one half) = U+0D74 = "൴" (ara)
  • 80/320 (one quarter) = U+0D73 = "൳" (kāl)
  • 40/320 (one eighth) = U+0D77 = "൷" (arakkāl)
  • 20/320 (one sixteenth) = U+0D76 = "൶" (mākāṇi)
  • 10/320 (one thirty-second) = missing
  • 5/320 (one sixty-fourth) = missing
  • What now?

Or if you quarter in size (base-4 fractions):

  • 80/320 (one quarter) = U+0D73 = "൳" (kāl)
  • 160/320 (one half) = U+0D74 = "൴" (ara)
  • 240/320 (three quarters) = U+0D75 = "൵" (mukkāl)

  • 20/320 (one sixteenth) = U+0D76 = "൶" (mākāṇi)
  • 40/320 (one eighth) = U+0D77 = "൷" (arakkāl)
  • 60/320 (three sixteenths) = U+0D78 = "൸" (muṇṭāṇi)

  • 5/320 (one sixty-fourth) = missing
  • 10/320 (one thirty-second) = missing
  • 15/320 (three sixty-fourths) = missing

  • What now?

The remaining fractions (if you remove those found in the three schemes immediately above) are:

  • 12/320 (three eightieths) = U+0D5A = "൚" (mūnnukāṇi)
  • 48/320 (three twentieths) = U+0D5D = "൝" (mūnnumā)

These suggest division into fifths and then quartering thereafter:

  • 64/320 (one fifth) = U+0D5E = "൞" (nālŭmā)

  • 16/320 (one twentieth) = U+0D5B = "൛" (orumā)
  • 32/320 (one tenth) = U+0D5C = "൜" (raṇṭumā)
  • 48/320 (three twentieths) = U+0D5D = "൝" (mūnnumā)

  • 4/320 (one eightieth) = U+0D2E = "മ" (kāṇi)
  • 8/320 (one fortieth) = U+0D59 = "൙" (aramā)
  • 12/320 (three eightieths) = U+0D5A = "൚" (mūnnukāṇi)

  • 1/320 (one three-hundred-and-twentieth) = U+0D2A U+0D4D U+0D24 = "പ്ത" (muntiri)
  • 2/320 (one one-hundred-and-sixtieth) = U+0D58 = "൘" (arakkāṇi)
  • 3/320 (three three-hundred-and-twentieths) = missing

That's the only reason I can think of for having a symbol that became U+0D5A "MALAYALAM FRACTION THREE EIGHTIETHS".

Sunday, 6 February 2022

Unicode Trivia U+0CDE

Codepoint: U+0CDE "KANNADA LETTER FA"
Block: U+0C80..0CFF "Kannada"

The Kannada script is one of those added in Unicode 1.0 as part of the importing of the ISCII character sets in 1991. The 1991 ISCII Standard encoded ten Indic character sets:

  1. Devanagari (DEV/57002)
  2. Bengali (BNG/57003)
  3. Tamil (TML/57004)
  4. Telugu (TLG/57005)
  5. Assamese (ASM/57006)
  6. Oriya (ORI/57007)
  7. Kannada (KND/57008)
  8. Malayalam (MLM/57009)
  9. Gujarati (GJR/57010)
  10. Punjabi (PNJ/57011)

As part of the importation process:

  • "Bengali" and "Assamese" were folded into a single "Bengali/Assamese" script known in Unicode data tables simply as "Bengali"
  • "Punjabi" was renamed "Gurmukhi" (the former is a language, the latter is a script)
  • "Oriya" was not renamed "Odia" (as this didn't happen until November 2011)

The nine remaining scripts were mapped to the 128-codepoint blocks we see in Unicode today:

  • Devanagari [U+0900..097F]
  • Bengali [U+0980..09FF]
  • Gurmukhi [U+0A00..0A7F]
  • Gujarati [U+0A80..0AFF]
  • Oriya [U+0B00..0B7F]
  • Tamil [U+0B80..0BFF]
  • Telugu [U+0C00..0C7F]
  • Kannada [U+0C80..0CFF]
  • Malayalam [U+0D00..0D7F]

Richard Ishida has an excellent page describing these scripts and the importation process; but here's a summary table I put together of the codepoints (with hexadecimal offsets within the blocks) that are purposefully aligned in each script:

The alignment was originally designed to facilitate trivial transcription, but this was never truly practical.

We can see that the Tamil column has quite a few missing (grey) codepoints; Tamil has fewer isolated letters in its "alphabet" than other Brahmic scripts. This is partly because it does not have distinct letters for aspirated consonants.
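The alignment is easy to probe from Python: take the same offset in each of the nine blocks and look up the character name. This is only a sketch; BLOCK_STARTS restates the block ranges listed earlier and names_at_offset is my own helper.

```python
import unicodedata

# Start of each ISCII-derived block, keyed by the script name used in
# Unicode character names.
BLOCK_STARTS = {
    "DEVANAGARI": 0x0900, "BENGALI": 0x0980, "GURMUKHI": 0x0A00,
    "GUJARATI": 0x0A80, "ORIYA": 0x0B00, "TAMIL": 0x0B80,
    "TELUGU": 0x0C00, "KANNADA": 0x0C80, "MALAYALAM": 0x0D00,
}


def names_at_offset(offset: int) -> dict[str, str]:
    """Name of the character at the same offset in every block."""
    return {
        script: unicodedata.name(chr(start + offset), "<unassigned>")
        for script, start in BLOCK_STARTS.items()
    }


for script, name in names_at_offset(0x15).items():  # the letter KA
    print(f"{script:<10} {name}")
```

Offset 0x15 is the letter KA in all nine scripts, whereas offset 0x16 (KHA, an aspirated consonant) comes back unassigned for Tamil.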

There are obviously gaps in the rows of the chart above, which give space for script-specific codepoints. So, for Kannada, there are extra codepoints:

  • U+0C80 "KANNADA SIGN SPACING CANDRABINDU" — a non-combining Candrabindu
  • U+0C84 "KANNADA SIGN SIDDHAM" — used at the beginning of texts as an invocation
  • U+0CBC "KANNADA SIGN NUKTA" — used to represent sounds not present in Kannada
  • U+0CD5 "KANNADA LENGTH MARK" — used to extend vowel sounds
  • U+0CD6 "KANNADA AI LENGTH MARK" — used to extend AI vowel sounds
  • U+0CDD "KANNADA LETTER NAKAARA POLLU" — a vowel-less form of NA
  • U+0CDE "KANNADA LETTER FA"

U+0CDE "KANNADA LETTER FA" was added in Unicode 1.0:

Unicode 1.0 Code Chart

But there is no letter FA for Kannada mentioned in ISCII 1991. Indeed, there is no letter FA in Kannada full stop. As Richard Ishida explains:

The Kannada character U+0CDE KANNADA LETTER FA "ೞ" was incorrectly named. A more appropriate name would be LLLA, rather than FA. Because of the rules for Unicode naming, the current name cannot, however, be changed. Fortunately this letter has not been actively used in Kannada since the end of the 10th century.

Fortunate, indeed!

The table in Wikipedia seems to want to perpetuate the error; although, as a record of the actual importation process, it's un-usefully accurate.

Saturday, 5 February 2022

Unicode Trivia U+0C4D

Codepoint: U+0C4D "TELUGU SIGN VIRAMA"
Block: U+0C00..0C7F "Telugu"

Telugu is a Dravidian language spoken by about 100 million people worldwide. The Telugu script was added to Unicode 1.0 in 1991 as part of the migration of ISCII.

Telugu codepoints hit the headlines in February 2018 due to CVE-2018-4124, also known as the "Telugu Bug". The actual bug was in Apple's text layout engine (named "Core Text"), not in the Unicode specification. But that didn't stop some people pointing the finger and saying that Unicode composition was fundamentally flawed and hence, indirectly, the cause of the problem.

SerHack and Manish Goregaokar provide good, in-depth reports of the bug, but essentially "Core Text" mangles the heap when it sees codepoint sequences like the following:

  1. U+0C1C "TELUGU LETTER JA" = "జ"
  2. U+0C4D "TELUGU SIGN VIRAMA" = "్"
  3. U+0C1E "TELUGU LETTER NYA" = "ఞ"
  4. U+200C "ZERO WIDTH NON-JOINER" = ZWNJ
  5. U+0C3E "TELUGU VOWEL SIGN AA" = "ా"

That should be rendered as:

I won't be embedding the actual sequence in this post, just in case you haven't updated your iPhone software since 2018. But when presented to Apple's library before the fix, "Core Text" attempts to perform a memory optimization that ends up writing data to an invalid address, thereby usually crashing whichever application is running.

It turns out the ZWNJ is bogus and can be dropped:

But that four-codepoint sequence doesn't trigger the bug in "Core Text". It raises the interesting (but knotty) problem of what constitutes a "valid" sequence of codepoints. Whatever the result, crashing is probably not a good response under any circumstances.

The Unicode mailing list has a thread discussing the bug, with a reference to just how complicated glyph shaping for Indic fonts is to implement.

"Core Text" is proprietary Apple code, so we cannot inspect the source code, nor is it Apple's policy to explain fixes to critical security bugs.

P.S. Another codepoint I could have picked for the Telugu block trivia was the fabulously named U+0C78 "TELUGU FRACTION DIGIT ZERO FOR ODD POWERS OF FOUR" but I've already recently covered fractions and Mark Jason Dominus describes it brilliantly.

Friday, 4 February 2022

Unicode Trivia U+0BD0

Codepoint: U+0BD0 "TAMIL OM"
Block: U+0B80..0BFF "Tamil"

Looking for mildly interesting facts within a Unicode block means a bit of research. Take block U+0B80..0BFF "Tamil"; here are my go-to places for information:

  1. Wikipedia (language, script and block)
  2. Unicode code chart
  3. Unicode core specification (Section 12.6)
  4. ScriptSource
  5. Omniglot
  6. Richard Ishida's excellent r12a

Obviously, many of these sites link to other resources. And therein lies the "fun".

Looking at the official code chart I found U+0BD0 "TAMIL OM". Of this codepoint, r12a says:

OM is a religious concept found in all three major religions born in India viz. Hinduism, Jainism and Buddhism. ௐ [U+0BD0 TAMIL OM] is widely used in Hindu religious texts, temple publications, and as neon lamps of sign boards in shops etc.

Hmm. That's a bit jarring, isn't it? How did the reference to "neon lamps of sign boards in shops" make it into a list of sacred uses? A quick google of that exact phrase only turns up references to r12a. But I cannot imagine Richard Ishida conjuring up that phrase from thin air.

U+0BD0 "TAMIL OM" was added in Unicode 5.1 (April 2008); recently enough for there to be quite a good paper trail. Indeed, the proposal (N3119) to add it was submitted in April 2006 by the International Forum for Information Technology in Tamil (INFITT) Working Group 2 (WG02). Section 2.1 of the proposal says:

Devanagari and Gujarati scripts have a sign named OM in their Unicode ranges. However in Tamil the corresponding slot is left vacant. Gurmukhi script also has an OM sign. Tamil OM sign is widely used in Hindu religious texts, temple publications, and as neon lamps of sign boards in shops etc. OM is a religious concept found in all three major religions born in India viz. Hinduism, Jainism and Buddhism. This document proposes to add the character TAMIL OM in Unicode Tamil range at U+0BD0.

Surely this must be the source of the "neon lamps" narrative? Somehow Google haven't (yet) indexed it.

Written proposals to the Unicode committee usually have examples (known as "attestations") attached to their end. Alas, proposal N3119 does not provide a photograph of a neon shop sign.

It is also surprisingly difficult to find shop signage featuring the Tamil Om on the internet; though other Oms are available. The only good match I found was this:

[source]

This is from the IndiaMART page of Sudha Neon Lights of Chennai, Tamil Nadu. The Tamil Om is in green; the red "spear" is Vel, the divine javelin of Murugan, the Hindu God of war. The blue text underneath is, I believe, "முருகா" or "Muruga", an alternative spelling of Murugan.

Given the paucity of images of neon lamp signage in shops incorporating Tamil Om, I wonder just how common it is in Southern India and where the suggestion in N3119 actually comes from. Alas, INFITT/WG02 was dissolved some time before May 2020, so we may never know.

Thursday, 3 February 2022

Unicode Trivia U+0B77

Codepoint: U+0B77 "ORIYA FRACTION THREE SIXTEENTHS"
Block: U+0B00..0B7F "Oriya"

The Odia language (formerly named Oriya) is spoken in Odisha (formerly Orissa): 

[source]

Unlike many Brahmic scripts, the head bar of each glyph is not a contiguous, straight line. As Omniglot says:

The Odia script developed from the Kalinga script, one of the many descendants of the Brahmi script of ancient India. The earliest known inscription in the Odia language, in the Kalinga script, dates from 1051.

The curved appearance of the Odia script is a result of the practice of writing on palm leaves, which have a tendency to tear if you use too many straight lines.

Six "fraction signs" were added to the "Oriya" block in Unicode 6.0 (October 2010):

  • U+0B72 "୲" ORIYA FRACTION ONE QUARTER
  • U+0B73 "୳" ORIYA FRACTION ONE HALF
  • U+0B74 "୴" ORIYA FRACTION THREE QUARTERS
  • U+0B75 "୵" ORIYA FRACTION ONE SIXTEENTH
  • U+0B76 "୶" ORIYA FRACTION ONE EIGHTH
  • U+0B77 "୷" ORIYA FRACTION THREE SIXTEENTHS

The original proposal by Anshuman Pandey explains that they were primarily used to subdivide one rupee into sixteen annas. See also Section 9.5 of South Asian Scripts-I (6.0).

Why does it stop at U+0B77 "ORIYA FRACTION THREE SIXTEENTHS"? At first glance, it looks like there must be some codepoints missing, but Anshuman Pandey explains that this is an additive base-4 system, where you can express "N/16" for N=1..15 with at most two of the above codepoints:

  • 1/16 = "୵" = 1/16
  • 2/16 = "୶" = 1/8
  • 3/16 = "୷" = 3/16
  • 4/16 = "୲" = 1/4
  • 5/16 = "୲୵" = 1/4 + 1/16
  • 6/16 = "୲୶" = 1/4 + 1/8
  • 7/16 = "୲୷" = 1/4 + 3/16
  • 8/16 = "୳" = 1/2
  • 9/16 = "୳୵" = 1/2 + 1/16
  • 10/16 = "୳୶" = 1/2 + 1/8
  • 11/16 = "୳୷" = 1/2 + 3/16
  • 12/16 = "୴" = 3/4
  • 13/16 = "୴୵" = 3/4 + 1/16
  • 14/16 = "୴୶" = 3/4 + 1/8
  • 15/16 = "୴୷" = 3/4 + 3/16
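The table above falls straight out of a division by four: the quotient picks the quarters sign and the remainder picks the sixteenths sign. A minimal sketch of that reading (the function name oriya_sixteenths and the two lookup tables are mine):

```python
QUARTERS = {1: "\u0b72", 2: "\u0b73", 3: "\u0b74"}    # 1/4, 1/2, 3/4
SIXTEENTHS = {1: "\u0b75", 2: "\u0b76", 3: "\u0b77"}  # 1/16, 1/8, 3/16


def oriya_sixteenths(n: int) -> str:
    """Write n/16 (1 <= n <= 15) with at most two Oriya fraction signs."""
    if not 1 <= n <= 15:
        raise ValueError("n must be between 1 and 15")
    quarters, sixteenths = divmod(n, 4)
    # A zero digit contributes no sign at all.
    return QUARTERS.get(quarters, "") + SIXTEENTHS.get(sixteenths, "")


for n in range(1, 16):
    print(f"{n}/16 = {oriya_sixteenths(n)}")
```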

As supporting evidence, he also includes a passage from "First Lessons in Oriya" by A. H. Young (1953, revised by B. Das. Cuttack: Orissa Mission Press):

The leading principle of Oriya arithmetic, to divide by four rather than any other number, pervades also the system of fractions.

This suggests base-4 was used elsewhere in the region's number system. I haven't been able to find any other concrete examples for Odia, but Kharosthi numbers have a base-4 component and most other Brahmic scripts have fractions built upon quarters or sixteenths, such as Bengali that we saw earlier.

Wednesday, 2 February 2022

Unicode Trivia U+0AF1

Codepoint: U+0AF1 "GUJARATI RUPEE SIGN"
Block: U+0A80..0AFF "Gujarati"

Sometimes a codepoint loses its lustre. Take U+0AF1 "GUJARATI RUPEE SIGN" as an example.

  • October 1991 — The "Gujarati" block is imported from ISCII into Unicode 1.0 without a specific rupee symbol

The rise...

  • July 2001 — The Indian Ministry of Information Technology suggests the addition of a Gujarati rupee symbol
  • November 2001 — The Unicode Technical Committee agrees to "add this rupee sign for Gujarati to the list of proposed additions, since the symbol is not made from pieces that are already encoded Gujarati characters. The form of this character is very Gujarati-like, and it will be proposed for encoding at this location, rather than in the Currency Symbols block."
  • April 2003 — U+0AF1 "GUJARATI RUPEE SIGN" is formally added to Unicode 4.0

U+0AF1

And fall...

  • October 2009 — Anshuman Pandey proposes the addition of a Gujarati abbreviation sign
  • October 2009 — Anshuman Pandey also proposes that U+0AF1 be deprecated as, with the addition of the abbreviation sign, the Gujarati rupee can be rendered using the codepoint sequence:

    • U+0AB0 "GUJARATI LETTER RA"
    • U+0AC2 "GUJARATI VOWEL SIGN UU"
    • U+0AF0 "GUJARATI ABBREVIATION SIGN"

  • January 2012 — U+0AF0 "GUJARATI ABBREVIATION SIGN" is formally added to Unicode 6.1

Of course, you cannot just remove an existing codepoint from the Unicode standard. What would you do with all the documents that had already embedded U+0AF1 as the rupee symbol? Instead, an annotation was added to U+0AF1 saying:

preferred spelling is 0AB0 0AC2 0AF0

Job done? Not quite...

  • September 2018 — Charlotte Buff points out an inconsistency. She "identified the following 18 characters [including U+0AF1] that are strongly implied to be deprecated in the code charts, but actually aren’t in the UCD". She also raises the point that "U+0AF1 does not decompose into its preferred representation"
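The second point is simple to verify with Python's unicodedata module. In this sketch the codepoints are written as escape sequences, so the rupee sign itself never appears literally; normalized_forms is my own helper name.

```python
import unicodedata

RUPEE_SIGN = "\u0af1"             # GUJARATI RUPEE SIGN
PREFERRED = "\u0ab0\u0ac2\u0af0"  # RA + VOWEL SIGN UU + ABBREVIATION SIGN


def normalized_forms(text: str) -> dict[str, str]:
    """Apply all four Unicode normalization forms to text."""
    return {
        form: unicodedata.normalize(form, text)
        for form in ("NFC", "NFD", "NFKC", "NFKD")
    }


# An empty decomposition mapping means no normalization form can turn
# U+0AF1 into the preferred three-codepoint spelling.
print(repr(unicodedata.decomposition(RUPEE_SIGN)))
print(all(result == RUPEE_SIGN for result in normalized_forms(RUPEE_SIGN).values()))
```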

Should U+0AF1 be formally deprecated? Or should its usage be "discouraged"? Should codepoints in general be decomposed into their preferred spellings?

Personally, I think this is a case that's getting beyond the purview of the core Unicode Standard. Let's face it, U+0AF1 is already out there. Of course, it's difficult to know how prevalent it is; but even one occurrence makes it irrevocable.

And how exactly do you discourage the use of a codepoint, let alone deprecate it? Do you raid people's homes in the middle of the night and confiscate all the Gujarati Rupee codepoints?

The keen-eyed reader will have noticed I haven't actually used codepoint U+0AF1 in this post. I don't want to be woken up at 2am, thank you very much!

Tuesday, 1 February 2022

Unicode Trivia U+0A70

Codepoint: U+0A70 "GURMUKHI TIPPI"
Block: U+0A00..0A7F "Gurmukhi"

Sometimes the information in the Unicode Character Database (UCD) is either insufficient for some purpose or requires clarification. This is the role of Unicode Technical Reports (UTR) and Unicode Technical Notes (UTN).

Like other Brahmic scripts, the Gurmukhi script was imported into Unicode 1.0 as part of ISCII, where it was known as Punjabi.

U+0A02 and U+0A70 highlighted in yellow [source]

However, Gurmukhi differs in having two diacritics for nasalisation, the Bindi and Tippi:

U+0A02 "GURMUKHI SIGN BINDI"
U+0A70 "GURMUKHI TIPPI"

The ISCII codepage for Punjabi uses character 0xA2 for both diacritics with the expectation that the correct combining glyph will be rendered according to context. This logic is clarified in UTN #30 from Sukhjinder Sidhu (2006):

Bindi and Tippi are encoded using a single code point in ISCII (0xA2) and the underlying rendering engine selects the correct glyph. However, in Unicode they are given two separate code points.

Thus, 0xA2 should be converted to U+0A70 (Tippi) when:

  • The preceding letter is a consonant (ignoring any Nuktas)
  • The preceding letter is Vowel Sign I (U+0A3F), Vowel Sign U (U+0A41), Vowel Sign UU (U+0A42)
  • The preceding letter is Letter A (U+0A05), Letter I (U+0A07)

In all other cases, the sign should remain a Bindi (U+0A02).

When converting from Unicode to ISCII, both Bindi and Tippi should be converted to Bindi (0xA2).
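The rules quoted above translate fairly directly into code. What follows is my own rough sketch, not the reference logic from UTN #30: it assumes the text before the sign has already been converted to Unicode, and it assumes "consonant" means an assigned letter in U+0A15..0A39 or U+0A59..0A5E. All function names are hypothetical.

```python
import unicodedata

BINDI = "\u0a02"
TIPPI = "\u0a70"
NUKTA = "\u0a3c"
ISCII_NASAL = 0xA2

# Vowel signs I, U, UU and the independent letters A, I (from the rules above).
TIPPI_TRIGGERS = {"\u0a3f", "\u0a41", "\u0a42", "\u0a05", "\u0a07"}


def is_consonant(ch: str) -> bool:
    """Assumption: consonants are the assigned letters in these two ranges."""
    cp = ord(ch)
    in_range = 0x0A15 <= cp <= 0x0A39 or 0x0A59 <= cp <= 0x0A5E
    return in_range and unicodedata.category(ch) == "Lo"


def nasal_sign_after(preceding: str) -> str:
    """Choose Tippi or Bindi for ISCII 0xA2, given the converted text before it."""
    last = preceding[-1:]
    base = preceding.rstrip(NUKTA)[-1:]  # ignore any Nuktas for the consonant test
    if (base and is_consonant(base)) or last in TIPPI_TRIGGERS:
        return TIPPI
    return BINDI


def iscii_byte_for(sign: str) -> int:
    """Unicode to ISCII: both Bindi and Tippi collapse to 0xA2."""
    if sign not in (BINDI, TIPPI):
        raise ValueError("not a Gurmukhi nasalisation sign")
    return ISCII_NASAL
```

For example, nasal_sign_after("\u0a15") picks Tippi because the preceding letter is the consonant KA, while nasal_sign_after("\u0a15\u0a3e") picks Bindi because Vowel Sign AA is not one of the listed triggers.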

This special-case logic isn't part of the core Unicode Standard; it is advisory only. But Sukhjinder Sidhu points out:

If the advice in this document is not heeded, any resulting conversion will not be legible to readers of the Gurmukhi script,

One would hope that software vendors would take heed, but a casual read of Microsoft's .NET core library source reveals no implementation (or even mention) of UTN #30 in ISCIIEncoding.cs. The code maps 0xA2 to and from U+0A02 (Bindi) but provides no transformations for Tippi. At the top of the C# source file is a comment:

Ported from windows c_iscii. If you find bugs here, there're likely similar bugs in the windows version

I decided not to look any further.