Saturday 5 February 2022

Unicode Trivia U+0C4D

Codepoint: U+0C4D "TELUGU SIGN VIRAMA"
Block: U+0C00..0C7F "Telugu"

Telugu is a Dravidian language spoken by about 100 million people worldwide. The Telugu script was added to Unicode 1.0 in 1991 as part of the migration of ISCII.

Telugu codepoints hit the headlines in February 2018 because of CVE-2018-4124, also known as the "Telugu Bug". The actual bug was in Apple's text layout engine (named "Core Text"), not in the Unicode specification. That didn't stop some people from pointing the finger at Unicode and saying that composition was fundamentally flawed and hence, indirectly, the cause of the problem.

SerHack and Manish Goregaokar provide good, in-depth reports of the bug. In essence, "Core Text" mangles the heap when it sees codepoint sequences like the following:

  1. U+0C1C "TELUGU LETTER JA" = "జ"
  2. U+0C4D "TELUGU SIGN VIRAMA" = "్"
  3. U+0C1E "TELUGU LETTER NYA" = "ఞ"
  4. U+200C "ZERO WIDTH NON-JOINER" = ZWNJ
  5. U+0C3E "TELUGU VOWEL SIGN AA" = "ా"

That should be rendered as:

I won't be embedding the actual sequence in this post, just in case you haven't updated your iPhone software since 2018. But when presented to Apple's library before the fix, "Core Text" attempts to perform a memory optimization that ends up writing data to an invalid address, thereby usually crashing whichever application is running.

It turns out the ZWNJ is bogus and can be dropped:

But that four-codepoint sequence doesn't trigger the bug in "Core Text". This raises the interesting (but knotty) problem of what constitutes a "valid" sequence of codepoints. Whatever the answer, crashing is not a good response under any circumstances.
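If you want to poke at the two sequences yourself, here is a minimal Python sketch using the standard `unicodedata` module. The sequences are written with `\u` escapes so the raw characters never appear in this post, and the names `FIVE`, `FOUR` and `describe` are mine, purely for illustration. It only lists codepoints, names and general categories; it says nothing about how any particular shaping engine treats them.

```python
import unicodedata

# Escapes only: the literal characters are deliberately not embedded here.
FIVE = "\u0C1C\u0C4D\u0C1E\u200C\u0C3E"   # JA, VIRAMA, NYA, ZWNJ, AA
FOUR = FIVE.replace("\u200C", "")         # same sequence with the ZWNJ dropped


def describe(text):
    """Return (codepoint, name, general category) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
        for ch in text
    ]


# Print the descriptions, never the string itself.
for label, seq in (("five-codepoint", FIVE), ("four-codepoint", FOUR)):
    print(label)
    for codepoint, name, category in describe(seq):
        print(f"  {codepoint} {name} [{category}]")
```

Note that the script prints descriptions rather than the string, so nothing has to render the cluster.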

The Unicode mailing list has a thread discussing the bug, with a reference to just how complicated glyph shaping for Indic fonts is to implement.

"Core Text" is proprietary Apple code, so we cannot inspect the source code, nor is it Apple's policy to explain fixes to critical security bugs.

P.S. Another codepoint I could have picked for the Telugu block trivia was the fabulously named U+0C78 "TELUGU FRACTION DIGIT ZERO FOR ODD POWERS OF FOUR", but I've already recently covered fractions, and Mark Jason Dominus describes it brilliantly.
