Sunday, 16 January 2022

Unicode Trivia U+047C

Block: U+0400..U+04FF "Cyrillic"

It is perhaps not surprising, given the history of writing, that there are so many references to religious aspects of letterforms in the Unicode standard. Take U+047C "CYRILLIC CAPITAL LETTER OMEGA WITH TITLO" as an example:

[Image: glyph of U+047C, taken from the Unicode Cyrillic chart]

It sits in the "Historic letters" section of the "Cyrillic" block. A look at the official Unicode charts reveals the following annotations:

  • [alias] Cyrillic "beautiful omega"
  • [note] despite its name, this character does not have a titlo, nor is it composed of an omega plus a diacritic
  • [see also] A64C Ꙍ cyrillic capital letter broad omega

Apparently, this glyph (or something that looks similar) is used in Church Slavonic religious texts for the interjection "Oh!" However, there has been some discussion within the Unicode community about this codepoint and its lowercase version:

These characters were originally encoded in the Unicode standard with an erroneous name and representation. After the UTC ruling on Everson et al. (2006), the representation was corrected and an annotation was added to U+047C, reading “despite its name, this character does not have a titlo, nor is it composed of an omega plus a diacritic”. However, no annotation was added to the lowercase form U+047D.

The character that is encoded here is a ligature of the Cyrillic broad (or wide) Omega (encoded at U+A64C and U+A64D) and the ‘great apostrof’, a stylized diacritical mark consisting of the soft breathing (encoded at U+0486) and the Cyrillic kamora (encoded at U+0311). The broad Omega (U+A64D) can occur by itself, without this diacritical mark, in pre-1700 printed Church Slavic books, though not in modern liturgical texts. Functionally, the character with the diacritical mark is analogous to the Greek character ὦ, which also consists of an Omega, a soft breathing mark and a Perispomene. Both the Greek and Church Slavic characters have identical functions: to record the exclamation ‘Oh!’ Since U+047C and U+047D were encoded without a canonical decomposition, though they are linguistically decomposable, they should not be decomposed to avoid an encoding ambiguity. However, in our opinion, the annotation as written does not make this clear.
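The claim that U+047C has no canonical decomposition, while its Greek analogue does, can be checked against the Unicode Character Database. The following is a minimal sketch using Python's standard `unicodedata` module; the helper `describe` and the constant names are introduced here for illustration only, and the Greek letter is assumed to be U+1F66 (omega with psili and perispomeni).

```python
import unicodedata

OMEGA_TITLO = "\u047c"  # CYRILLIC CAPITAL LETTER OMEGA WITH TITLO
GREEK_OMEGA = "\u1f66"  # GREEK SMALL LETTER OMEGA WITH PSILI AND PERISPOMENI


def describe(ch):
    """Collect the UCD name, the decomposition field and the NFD form of one character."""
    return {
        "name": unicodedata.name(ch),
        # Empty string means the UCD defines no decomposition mapping.
        "decomposition": unicodedata.decomposition(ch),
        "nfd": [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", ch)],
    }


cyrillic = describe(OMEGA_TITLO)
greek = describe(GREEK_OMEGA)

print(cyrillic)
print(greek)
```

With the UCD data bundled in current Python releases, the Cyrillic letter comes back unchanged from NFD normalization, whereas the Greek letter splits into a base omega plus two combining marks. That asymmetry is the encoding ambiguity the quoted passage warns about: a hand-built sequence of broad omega plus diacritics would not be canonically equivalent to U+047C.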

There has been a suggestion to rename (or alias) U+047C to "CYRILLIC LETTER BROAD OH" with the observation:

In addition, the Unicode note “beautiful omega” should refer to A64C, not to this character.

At the time of writing, there are no name aliases in the UCD for any of these codepoints.
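One way to probe this is `unicodedata.lookup`, which resolves formal character names and also formal name aliases. The sketch below is only indicative: the helper `resolve` is hypothetical, and the result depends on the Unicode version bundled with the Python interpreter in use.

```python
import unicodedata


def resolve(name):
    """Return the character for a formal name or alias, or None if the UCD has no such name."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        return None


official = resolve("CYRILLIC CAPITAL LETTER OMEGA WITH TITLO")
proposed = resolve("CYRILLIC LETTER BROAD OH")

print(official is not None)  # the formal name resolves
print(proposed is not None)  # the suggested name is not expected to resolve
```

If the suggested name were ever adopted as a formal alias, the second lookup would start returning U+047C.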

It all goes to show that:

  1. Naming codepoints is a perilous task.
  2. The complexity of the competing interests makes errors inevitable.
  3. If mistakes are made, the Unicode stability policy makes fixing them difficult or unappealing.
  4. Unicode annotations can be more revealing than the raw UCD data.

