Tuesday, 25 January 2022

Unicode Trivia U+0837

Codepoint: U+0837 "SAMARITAN PUNCTUATION MELODIC QITSA"
Block: U+0800..083F "Samaritan"

The Samaritan script was derived from the Paleo-Hebrew script circa 600 BCE and was used alongside the Aramaic script in Judaism until the latter was repurposed as the Hebrew alphabet circa 100 BCE.

Samaritan is a right-to-left abjad with 22 basic consonants and diacritics to mark vowels:

Much is made of the extensive punctuation in the Samaritan script. Here are the fifteen codepoints of the "Punctuation" column (U+0830 to U+083E):


Monday, 24 January 2022

Unicode Trivia U+07C1

Codepoint: U+07C1 "NKO DIGIT ONE"
Block: U+07C0..07FF "NKo"

As reported by Dr Dianne White Oyler in "A Cultural Revolution in Africa: Literacy in the Republic of Guinea since Independence" (2001), the N'Ko script was developed by Souleymane Kanté in 1949, partly in response to

a 1944 challenge posed by the Lebanese journalist Kamal Marwa in an Arabic-language publication, Nahnu fi Afrikiya [We Are in Africa]. Marwa argued that Africans were inferior because they possessed no indigenous written form of communication. His statement that "African voices [languages] are like those of the birds, impossible to transcribe" reflected the prevailing views of many colonial Europeans. Although the journalist acknowledged that the Vai had created a syllabary, he discounted its cultural relevancy because he deemed it incomplete. [Page 588]

Kanté discarded both Arabic and Latin scripts as unable to transcribe all the characteristics of the Mande languages. Having developed a completely novel alphabet instead,

he called together children and illiterates and asked them to draw a line in the dirt; he noticed that seven out of ten drew the line from right to left. For that reason he chose a right-to-left orientation. In all Mande languages the pronoun n- means "I" and the verb ko represents the verb "to say". [Page 589]

So "N'Ko" means "I say" in all the target languages.

The right-to-left mantra extends not only to words, but to digits and numbers too. The ten digits zero to nine (U+07C0 "NKO DIGIT ZERO" to U+07C9 "NKO DIGIT NINE") face right:

N'Ko digits (top), Western Arabic (middle), Eastern Arabic (bottom)

This is particularly noticeable with U+07C1 "NKO DIGIT ONE": '߁'

Not only that, but the least significant digits of multi-digit N'Ko numbers are on the left, unlike almost all other writing systems. Latin, Greek, Arabic and Hebrew numbers place the least significant digit on the right, even though the latter two scripts are written right-to-left.

Consider the improbable phrase "There are 12345 eggs":

There are 12345 eggs = English

Υπάρχουν 12345 αυγά = Greek

 يوجد ١٢٣٤٥ بيضة = Arabic

יש 12345 ביצים = Hebrew

߁߂߃߄߅ ߞߟߌ߫ ߦߋ߫ ߦߋ߲߬ = N’Ko

In case of tofu:

Note that the codepoints for "1", "2", "3", "4" and "5" occur in ascending memory order in all cases. For example:
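A minimal Python sketch of the same point, using only the standard library. The string is built from escapes so that the source looks the same however an editor applies bidirectional rendering; the variable names are mine:

```python
import unicodedata

# N'Ko digits one to five, in the order they are stored in memory.
nko_number = "\u07C1\u07C2\u07C3\u07C4\u07C5"

# The codepoints ascend from the first stored character to the last.
codepoints = [f"U+{ord(ch):04X}" for ch in nko_number]

# Each character carries its own decimal digit value in the UCD.
values = [unicodedata.digit(ch) for ch in nko_number]

# Python's int() accepts Unicode decimal digits, most significant first.
as_int = int(nko_number)

print(codepoints)
print(values, as_int)
```

So the stored order is most-significant-digit first, exactly as for "12345" in Latin digits; only the rendering direction differs.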


At first, I wasn't sure how much "support" the Unicode standard gives for this type of anomaly. UCD's sister project CLDR (Common Locale Data Repository) has very little to say about N'Ko. There is scope for algorithmic number formatting, but I didn't find anything specific.

However, after a bit of thought I realised that, because directionality is a property of each codepoint and not of the script of the codepoints, digit ordering in N'Ko works "out of the box".

Consider these bidirectional class fields ("bc") from the UCD:

  • Latin
    • "A" (U+0041 "LATIN CAPITAL LETTER A") =  "L" = strong left-to-right
    • "1" (U+0031 "DIGIT ONE") = "EN" = European number (left-to-right)
  • Greek
    • "α" (U+03B1 "GREEK SMALL LETTER ALPHA") =  "L" = strong left-to-right
  • Arabic
    • "ا" (U+0627 "ARABIC LETTER ALEF") = "AL" =  Arabic letter (right-to-left)
    • "١" (U+0661 "ARABIC-INDIC DIGIT ONE") = "AN" =  Arabic number (left-to-right)
  • Hebrew
    • "א" (U+05D0 "HEBREW LETTER ALEF") = "R" = strong right-to-left
  • N'Ko
    • "ߊ" (U+07CA "NKO LETTER A") = "R" = strong right-to-left
    • "߁" (U+07C1 "NKO DIGIT ONE") = "R" = strong right-to-left

Unlike the other digits, N'Ko digits are marked as strongly right-to-left. The only other examples in Unicode 14.0 I could find were Adlam digits (1989).
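These values can be checked with Python's standard unicodedata module. This is only a sketch: the answers come from whichever Unicode version is bundled with the Python build, and the sample list is my own selection:

```python
import unicodedata

# One letter and one digit per script, written as escapes to avoid
# bidi surprises in the source file.
samples = [
    "\u0041",      # LATIN CAPITAL LETTER A
    "\u0031",      # DIGIT ONE
    "\u03B1",      # GREEK SMALL LETTER ALPHA
    "\u0627",      # ARABIC LETTER ALEF
    "\u0661",      # ARABIC-INDIC DIGIT ONE
    "\u05D0",      # HEBREW LETTER ALEF
    "\u07CA",      # NKO LETTER A
    "\u07C1",      # NKO DIGIT ONE
    "\U0001E951",  # ADLAM DIGIT ONE
]

# Map each codepoint to its bidirectional class ("bc").
bidi = {f"U+{ord(ch):04X}": unicodedata.bidirectional(ch) for ch in samples}

for codepoint, bc in bidi.items():
    print(codepoint, bc)
```

The N'Ko and Adlam digits report "R", whereas the Latin and Arabic-Indic digits report "EN" and "AN" respectively.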

Another interesting codepoint from the Unicode "NKo" block is U+07F7 "NKO SYMBOL GBAKURUNEN":

It's a decorative punctuation symbol used to mark the end of a major section of text and represents the three stones holding a cooking pot over a fire:

[source]

Finally, there can't be many alphabets that have their own day: April 14.

[Many thanks to Coleman Donaldson for help with the N'Ko language]

Sunday, 23 January 2022

Unicode Trivia U+0780

Codepoint: U+0780 "THAANA LETTER HAA"
Block: U+0780..07BF "Thaana"

The Thaana script is used to write the Maldivian language. According to Wikipedia, it's an abugida with no inherent vowel. According to the ISO standard, it's a right-to-left-written alphabet (as indicated by the hundreds digit of its numeric ISO-15924 code "170").

It first appeared in about 1705 CE and seems to have been developed with obfuscation in mind. The alphabet order is arbitrary and the consonant letterforms are derived from numeric figures:

On the top row, in white, are the 24 basic consonants in Thaana alphabetical order. These are the 24 consecutive Unicode codepoints U+0780 "THAANA LETTER HAA" to U+0797 "THAANA LETTER CHAVIYANI".

The second row shows the Arabic-Indic digits one to nine in blue and the Dhives Akuru digits one to six in red. Dhives Akuru was a Maldivian script used before Thaana. The main part of the alphabet looks very much like a simple replacement cipher.

An early version of the Thaana script, Gabulhi Thaana, was written scriptio continua, that is, without inter-word spacing or punctuation. This sounds like an absolute nightmare but was quite common in classical Greek and Latin. Before mechanical printing, Arabic was also written without spacing. This is, perhaps, why many writing systems have distinct letterforms for final letters in words.

According to "Scripts of Maldives", the early Thaana script, Gabulhi Thaana, got its name from the Maldivian word "gabulhi" meaning the in-between stage of a coconut, when it is neither fully ripe nor quite tender. Hence the idea of "immature" or "not fully-formed".

Saturday, 22 January 2022

Unicode Trivia U+0753

Codepoint: U+0753 "ARABIC LETTER BEH WITH THREE DOTS POINTING UPWARDS BELOW AND TWO DOTS ABOVE"
Block: U+0750..077F "Arabic Supplement"

Syriac is not the only script that makes extensive use of diacritics. The spread of the Arabic script throughout the world means it is used for diverse languages, many of which have sounds not found in Arabic. Part of the "Arabic Supplement" block contains a column "Extended Arabic letters" with the annotation:

These are primarily used in Arabic-script orthographies of African languages.

One codepoint, U+0753, has the somewhat precise name of "ARABIC LETTER BEH WITH THREE DOTS POINTING UPWARDS BELOW AND TWO DOTS ABOVE". When I render that codepoint using "Noto Sans Arabic" on my PC, I get this:

Noto Sans Arabic (2.004)

When I render it with a default local font, I get:

Arial (7.00)

Spot the difference!

There's definitely a discrepancy in the orientation of the lower dots, but which is correct? I came up with four possibilities:

  1. I have old/corrupt font files installed on my PC
  2. The name of the Unicode codepoint is incorrect
  3. The orientation of the lower dots doesn't really matter, so there is no issue
  4. One of the font glyphs is incorrect

Initially, I did indeed think it was an old version of Noto Sans Arabic installed on my machine. But I updated my local version of Noto Sans Arabic to 2.009 with the same results. Google web font specimens confirmed the issue is with Noto Sans Arabic in general:

Three of the four specimens suggest the name of the Unicode codepoint is probably correct. I checked that there are no similarly-named codepoints; there is no "ARABIC LETTER BEH WITH THREE DOTS POINTING DOWNWARDS BELOW AND TWO DOTS ABOVE".
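One way to repeat the name check is with Python's unicodedata.lookup, which raises KeyError for a name that does not exist. This is a sketch only: the helper name find is my own, and the result depends on the Unicode version bundled with the Python build:

```python
import unicodedata

UPWARDS = ("ARABIC LETTER BEH WITH THREE DOTS POINTING UPWARDS BELOW "
           "AND TWO DOTS ABOVE")
# The hypothetical sibling name with the dots pointing the other way.
DOWNWARDS = UPWARDS.replace("UPWARDS", "DOWNWARDS")


def find(name):
    """Return 'U+XXXX' for a character name, or None if no such name exists."""
    try:
        return f"U+{ord(unicodedata.lookup(name)):04X}"
    except KeyError:
        return None


print(find(UPWARDS))
print(find(DOWNWARDS))
```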

I couldn't really imagine that, carefully named as it is, the orientation of the lower dots in U+0753 was unimportant.

I then checked Unicode Updates and Errata but found no references to this or nearby codepoints.

So the finger of suspicion fell on the glyph within the Noto Sans Arabic font being incorrect. FontForge confirmed this:


I looked through the issues reported for Noto fonts, but found nothing, so I submitted a new one.

Of course, this has only a passing connection to the Unicode standard. But one can easily imagine the amount of noise that has to be ploughed through by the committee along the lines of "My text doesn't get displayed how I expected" just to get to genuine issues with the Unicode standard itself.

According to Wiktionary, U+0753 "ARABIC LETTER BEH WITH THREE DOTS POINTING UPWARDS BELOW AND TWO DOTS ABOVE" is

The third letter of the Hausa alphabet in ajami script, equivalent to Latin script c.

I was initially a bit suspicious of this. Both Omniglot and Wikipedia suggest that the three dots go above that letter, making it more like U+062B "ARABIC LETTER THEH". However, Richard Ishida points out that there are lots of subtle local variations and the initial Unicode proposal shows an "ARABIC LETTER BEH WITH THREE DOTS POINTING UPWARDS BELOW AND TWO DOTS ABOVE" in Figure 5. The proposal cites "Using Arabic Script in Writing the Languages of the Peoples of Muslim Africa" (1992) by Mohamed Chtatou:

"Figure 5" (Chtatou, 1992)

Richard Ishida again:

Unicode policy for the Arabic script is to encode fully precomposed characters rather than to use combining characters for ijam.

It would appear that the task of supporting more obscure and/or infrequent Arabic script glyphs in Unicode (and in fonts) can only get harder.

Friday, 21 January 2022

Unicode Trivia U+0740

Codepoint: U+0740 "SYRIAC FEMININE DOT"
Block: U+0700..074F "Syriac"

Syriac has got to be one of the dottiest scripts in Unicode. The fact that there's a 232-page book devoted to Syriac diacritics says a lot:

[source]

The dot is used for everything in Syriac from tense to gender, number, and pronunciation, and unsurprisingly represents one of the biggest obstacles to learning the language.

Section 9.3 of the Core Specification 14.0.0 gives an introduction to some of these complexities. Within the sub-section concerning exceptions to the diabolical diacritical rules, is this:

The feminine dot is usually placed to the left of a final taw.

This refers to codepoint U+0740 "SYRIAC FEMININE DOT". According to Richard Ishida, this non-spacing mark:

[...] is a feminine marker used with "ܬ" [U+072C SYRIAC LETTER TAW] to indicate a feminine suffix. East Syriac fonts should render as two dots below the base letter, whereas West Syriac fonts render as a single dot to the left of the base.

So far as I can tell, this is the only diacritic currently in Unicode that distinguishes (or elucidates) the underlying word's gender.

Below are variations of "ܩܛܠܬ" (= kill) distinguished solely by diacritics ("ܩ̇ܛܠܬ", "ܩܛ̣ܠܬ" and "ܩܛܠܬ݀") rendered with "Noto Sans Syriac":

Notice U+0740 "SYRIAC FEMININE DOT" at the end (left) of the last line.
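To see how such a word is stored, here is a small Python sketch. The base word is built from escapes (qaph, teth, lamadh, taw), and the helper strip_marks is my own illustration, not anything defined by Unicode:

```python
import unicodedata

# Base consonants: QAPH, TETH, LAMADH, TAW.
base = "\u0729\u071B\u0720\u072C"

# The same word with U+0740 SYRIAC FEMININE DOT after the final taw.
feminine = base + "\u0740"


def strip_marks(text):
    """Drop non-spacing marks (general category Mn), keeping base letters."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


for ch in feminine:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

print(strip_marks(feminine) == base)
```

The feminine dot follows the taw in memory; as a non-spacing mark it is drawn on or beside that base letter rather than taking a cell of its own.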

[Thanks to Richard Ishida and, indirectly, J F Coakley at Jericho Press]

Thursday, 20 January 2022

Unicode Trivia U+0640

Codepoint: U+0640 "ARABIC TATWEEL"
Block: U+0600..06FF "Arabic"

The FIFA World Cup Qatar 2022 logo applies kashida to the Latin script word for "Qatar":

[source]

This is the elongation of the connection between the "t" and "a". In Unicode, U+0640 "ARABIC TATWEEL" (alias "kashida") can be used to represent this elongation. Tatweels are usually only used in Arabic (or similar) scripts, so it's a nice cross-cultural reference in this context.

Here's an extreme example in the form of an Arabic script basmala:

[source]

Tatweels could be considered typographical formatting, but, because a tatweel character was part of ISO/IEC 8859-6 at position 0xE0, it was "inherited" by Unicode as a separate graphical codepoint.

Arabic tatweels are similar to Latin hyphens when used for text justification, but the rules are obviously very different. An excellent history of the topic is given by Titus Nemeth.

[At this point, my complete lack of understanding of Arabic will shine through. Apologies.]

Like its Unicode block-neighbour Hebrew, Arabic script is a right-to-left abjad. The name "Qatar" in Arabic is made up of three Arabic consonants:

  • U+0642 "ARABIC LETTER QAF"
  • U+0637 "ARABIC LETTER TAH"
  • U+0631 "ARABIC LETTER REH"

قطر

If, as part of text justification or for aesthetic effect, we want to widen the word, we could insert a tatweel between the tah and reh:

قطـر

In fact, we can add more tatweels in sequence:

قطـــــر

This is, of course, an artificial example; words of only three consonants are rarely stretched.
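In code, the stretching above is plain string insertion. A minimal Python sketch, where the helper stretch is a name I made up for illustration:

```python
import unicodedata

TATWEEL = "\u0640"  # ARABIC TATWEEL


def stretch(word, index, count=1):
    """Insert `count` tatweels before the character at position `index`."""
    return word[:index] + TATWEEL * count + word[index:]


# QAF, TAH, REH in memory order.
qatar = "\u0642\u0637\u0631"

one = stretch(qatar, 2)      # one tatweel between tah and reh
five = stretch(qatar, 2, 5)  # five tatweels in sequence

print([unicodedata.name(ch) for ch in one])
print(len(qatar), len(one), len(five))
```

This only shows where the codepoints go; choosing where a tatweel is typographically acceptable is the hard part and is not attempted here.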

Straight line tatweels are not the only mechanism that can be used to justify Arabic text. Others include:

  1. Whitespace
  2. Letterform lengthening/shortening
  3. Ligature variation



Wednesday, 19 January 2022

Unicode Trivia U+05D0

Codepoint: U+05D0 "HEBREW LETTER ALEF"
Block: U+0590..05FF "Hebrew"

Hebrew is usually written in an abjad script, right-to-left. Abjads are also known as consonant alphabets because they lack "letters" for vowel sounds. Diacritics indicating vowels are used for poetry, religious texts and teaching Hebrew.

When we type the 22 consonants, "alef" (U+05D0) to "taw" (U+05EA), a text renderer should render them right-to-left:

But how does it know that?

Within the UCD fields for U+05D0, we see:

bc = R

This means that the "bidirectionality class" for U+05D0 is "any strong right-to-left (non-Arabic-type) character" (UAX #44). This, together with the fiendishly complex bidirectional algorithm (UAX #9), allows text renderers to render arbitrary sequences of mixed-script codepoints correctly.
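As a quick check, the same field can be read for every letter from alef to taw with Python's unicodedata module (a sketch; the values come from the Unicode version bundled with the Python build):

```python
import unicodedata

# All codepoints from U+05D0 (alef) to U+05EA (taw), inclusive.
hebrew_letters = [chr(cp) for cp in range(0x05D0, 0x05EA + 1)]

# Collect the distinct bidirectional classes found in that range.
classes = {unicodedata.bidirectional(ch) for ch in hebrew_letters}

print(classes)
```

Every letter in the range reports the single class "R".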

The U+05D0 "HEBREW LETTER ALEF" codepoint is marked as Hebrew script:

sc = Hebr

but the Unicode bidirectional algorithm does not rely on script-level properties. That is, Unicode says that "alef" is usually rendered "right-to-left", not that "alef" is part of the "Hebrew" script and the "Hebrew" script is usually rendered "right-to-left".

There is no concept of lowercase and uppercase letters in Hebrew; the script is unicameral.

Finally, and appropriately, five Hebrew letters have "final forms" when they appear at the end of words:


This is similar to Greek sigma, but without the special-case handling necessary to overcome the lack of an uppercase final sigma.
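The final forms have their own codepoints, so they can be listed by filtering the character names. A small Python sketch, relying only on the standard unicodedata module:

```python
import unicodedata

# All codepoints from U+05D0 (alef) to U+05EA (taw), inclusive.
letters = [chr(cp) for cp in range(0x05D0, 0x05EA + 1)]

# Final forms carry the word FINAL in their Unicode names.
finals = [unicodedata.name(ch) for ch in letters
          if "FINAL" in unicodedata.name(ch)]

print(len(letters), len(letters) - len(finals))
for name in finals:
    print(name)
```

The range holds 27 codepoints: the 22 consonants plus the final forms of kaf, mem, nun, pe and tsadi.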