Saturday 22 January 2022

Unicode Trivia U+0753

Codepoint: U+0753 "ARABIC LETTER BEH WITH THREE DOTS POINTING UPWARDS BELOW AND TWO DOTS ABOVE"
Block: U+0750..077F "Arabic Supplement"

Syriac is not the only script that makes extensive use of diacritics. The spread of the Arabic script throughout the world means it is used for diverse languages, many of which have sounds not found in Arabic. Part of the "Arabic Supplement" block contains a column "Extended Arabic letters" with the annotation:

These are primarily used in Arabic-script orthographies of African languages.

One codepoint, U+0753, has the somewhat precise name of "ARABIC LETTER BEH WITH THREE DOTS POINTING UPWARDS BELOW AND TWO DOTS ABOVE". When I render that codepoint using "Noto Sans Arabic" on my PC, I get this:

Noto Sans Arabic (2.004)

When I render it with a default local font, I get:

Arial (7.00)

Spot the difference!

There's definitely a discrepancy in the orientation of the lower dots, but which is correct? I came up with four possibilities:

  1. I have old/corrupt font files installed on my PC
  2. The name of the Unicode codepoint is incorrect
  3. The orientation of the lower dots doesn't really matter, so there is no issue
  4. One of the font glyphs is incorrect

Initially, I did indeed think it was an old version of Noto Sans Arabic installed on my machine. But I updated my local version of Noto Sans Arabic to 2.009 with the same results. Google web font specimens confirmed the issue is with Noto Sans Arabic in general:

Three of the four specimens suggest the name of the Unicode codepoint is probably correct. I also checked for similarly-named codepoints: there is no "ARABIC LETTER BEH WITH THREE DOTS POINTING DOWNWARDS BELOW AND TWO DOTS ABOVE".
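That name check can be repeated with Python's standard unicodedata module. This is only a sketch of one way to do it, not necessarily how I did it at the time: the prefix string and the helper name named_codepoints are my own choices for illustration, and the results depend on the Unicode version bundled with your Python (see unicodedata.unidata_version).

```python
import sys
import unicodedata

# Prefix shared by U+0753 and its closest relatives (chosen for illustration).
PREFIX = "ARABIC LETTER BEH WITH THREE DOTS"

def named_codepoints(prefix):
    """Return {codepoint: name} for every character whose name starts with prefix."""
    found = {}
    for cp in range(sys.maxunicode + 1):
        # Unassigned codepoints and surrogates have no name; default to "".
        name = unicodedata.name(chr(cp), "")
        if name.startswith(prefix):
            found[cp] = name
    return found

siblings = named_codepoints(PREFIX)
for cp, name in sorted(siblings.items()):
    print(f"U+{cp:04X} {name}")

# The name that would exist if the lower dots pointed the other way.
hypothetical = ("ARABIC LETTER BEH WITH THREE DOTS POINTING DOWNWARDS "
                "BELOW AND TWO DOTS ABOVE")
try:
    unicodedata.lookup(hypothetical)
    has_downwards = True
except KeyError:
    # lookup() raises KeyError when no character has that name.
    has_downwards = False

print("Unicode data version:", unicodedata.unidata_version)
print("Downwards variant exists:", has_downwards)
```

If the downwards variant is reported as missing, the name of U+0753 cannot simply have been mixed up with a neighbouring codepoint.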

Given how carefully the codepoint is named, I couldn't really imagine that the orientation of the lower dots in U+0753 was unimportant.

I then checked Unicode Updates and Errata but found no references to this or nearby codepoints.

So the finger of suspicion fell on the glyph within the Noto Sans Arabic font being incorrect. FontForge confirmed this:


I looked through the issues reported for Noto fonts, but found nothing, so I submitted a new one.

Of course, this has only a passing connection to the Unicode standard. But one can easily imagine the amount of noise that has to be ploughed through by the committee along the lines of "My text doesn't get displayed how I expected" just to get to genuine issues with the Unicode standard itself.

According to Wiktionary, U+0753 "ARABIC LETTER BEH WITH THREE DOTS POINTING UPWARDS BELOW AND TWO DOTS ABOVE" is:

The third letter of the Hausa alphabet in ajami script, equivalent to Latin script c.

I was initially a bit suspicious of this. Both Omniglot and Wikipedia suggest that the three dots go above that letter, making it more like U+062B "ARABIC LETTER THEH". However, Richard Ishida points out that there are lots of subtle local variations, and the initial Unicode proposal does show an "ARABIC LETTER BEH WITH THREE DOTS POINTING UPWARDS BELOW AND TWO DOTS ABOVE" in Figure 5. The proposal cites "Using Arabic Script in Writing the Languages of the Peoples of Muslim Africa" (1992) by Mohamed Chtatou:

"Figure 5" (Chtatou, 1992)

Richard Ishida again:

Unicode policy for the Arabic script is to encode fully precomposed characters rather than to use combining characters for ijam.
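One visible consequence of that policy can be checked with Python's unicodedata module: U+0753 has no canonical decomposition, so normalising it to NFD leaves it as a single codepoint. The sketch below is my own illustration rather than anything from Ishida's notes. For contrast it also shows U+0622 "ARABIC LETTER ALEF WITH MADDA ABOVE", which does decompose, although the madda is a different kind of mark and not ijam.

```python
import unicodedata

beh_variant = "\u0753"   # the letter discussed in this post
alef_madda = "\u0622"    # contrast case: a letter that does decompose

# An empty decomposition string means the character is atomic.
print(repr(unicodedata.decomposition(beh_variant)))

# NFD normalisation therefore leaves U+0753 unchanged.
nfd = unicodedata.normalize("NFD", beh_variant)
print([f"U+{ord(ch):04X}" for ch in nfd])

# ALEF WITH MADDA ABOVE splits into ALEF plus a combining MADDAH ABOVE.
print(unicodedata.decomposition(alef_madda))
nfd_alef = unicodedata.normalize("NFD", alef_madda)
print([f"U+{ord(ch):04X}" for ch in nfd_alef])
```

Because the dots are baked into the letter, there is no way to build U+0753 out of a plain BEH plus combining dot marks and have it compare as canonically equivalent, so each new dot arrangement needs its own codepoint and its own glyph in every font.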

It would appear that the task of supporting more obscure and/or infrequent Arabic script glyphs in Unicode (and in fonts) can only get harder.
