chilliant: December 2021

Sunday, 19 December 2021

Universe 5: Font Detection

Any Unicode codepoint browser ends up as an exercise in font wrangling. Universe is no exception.

The threat of fingerprinting means a browser script cannot typically enumerate the locally installed fonts. There is also currently no native way of determining if a font supports a particular codepoint. One would think that determining font support for codepoints would therefore be problematic, but it turns out to be relatively easy for our use-case with Chrome.

Consider the following JavaScript snippet:

var canvas = document.createElement("canvas");
var context = canvas.getContext("2d");
context.font = "10px font-a,font-b,font-c";
var measure = context.measureText(text);

If a character in the "text" string is not supported by "font-a", Chrome text rendering falls back to using "font-b". If "font-b" also doesn't support the character, "font-c" is used. If the character is not supported by "font-c" either, a system default is used.

We can take advantage of this fallback mechanism by using a "blank" font that is guaranteed to render every glyph as zero-width with no visible mark. Fortunately, just such a font already exists, Adobe Blank:

@font-face {
  font-family: "blank";
  src: url("https://raw.githubusercontent.com/adobe-fonts/adobe-blank/master/AdobeBlank.otf.woff") format("woff");
}

Now, we can write a function to test a font for supported characters:

function IsBlank(font, text) {
  var canvas = document.createElement("canvas");
  var context = canvas.getContext("2d");
  // Fall back to the zero-width "blank" font if "font" lacks a glyph
  context.font = `10px "${font}",blank`;
  var measure = context.measureText(text);
  // No advance width and an empty bounding box means "blank" was used
  return (measure.width <= 0) &&
    (measure.actualBoundingBoxRight <= -measure.actualBoundingBoxLeft);
}

The actual code in universe.js has some optimisations and additional features; see "MeasureTextContext()" and "TextIsBlank()".

Using this technique, we can iterate over some "well-known" fonts and render the codepoint in each font that supports it. For our example of codepoint U+0040:

The origins of each of the glyphs above are:

"notoverse" is the 32-by-32 pixel bitmap glyph described previously.
"(default)" is the default font used for this codepoint by Chrome. In my case, it's "Times New Roman".
"(sans-serif)" is the default sans serif font: "Arial".
"(serif)" is the default serif font: "Times New Roman".
"(monospace)" is the default monospace font: "Consolas".
"r12a.io" is the PNG from Richard Ishida's UniView online app.
"glyphwiki.org" is the SVG from GlyphWiki. It looks a bit squashed because it's a half-width glyph; GlyphWiki is primarily concerned with CJK glyphs.
"unifont" is the GNU Unifont font. I couldn't find a webfont-friendly source for version 14, so I had to piggyback version 12 from Terence Eden.
"noto" is a composite font of several dozen Google Noto fonts. See ".noto" in universe.css.
Subsequent "Noto ..." glyphs are from the corresponding, individual Noto font, in priority order.
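The iteration over candidate fonts mentioned above could be sketched roughly as follows. This is not the code from universe.js: "FontsSupportingCodepoint" is a hypothetical helper, and its "isBlank" parameter is assumed to behave like the "IsBlank(font, text)" function shown earlier.

```javascript
// Hypothetical helper, not taken from universe.js.
// "isBlank" is assumed to behave like IsBlank(font, text) above:
// it returns true when "font" cannot render "text".
function FontsSupportingCodepoint(fonts, codepoint, isBlank) {
  // Convert the codepoint to a string (handles astral codepoints too)
  var text = String.fromCodePoint(codepoint);
  // Keep only the fonts for which the blank fallback was not needed
  return fonts.filter(font => !isBlank(font, text));
}
```

In the browser this would be called as FontsSupportingCodepoint(["Arial", "Consolas"], 0x0040, IsBlank). A font that is not installed falls straight through to the blank font, so it is reported as unsupported.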

Saturday, 18 December 2021

Universe 4: User Interface

The Universe character browser has a very basic user interface. In addition to the HTML5 presentation, there is a URL scheme that allows you to jump to various subpages.

Landing Page

universe.html#

The landing page gives a brief introduction and a collapsible table of statistics. As with all pages, a header contains breadcrumbs and a footer lists some useful jumping-off points.

Codepoints

To examine a specific codepoint, expressed in hexadecimal between "0000" and "10FFFF" inclusive, use:

universe.html#U+<hex>

In this case, "#U+0040" examines "@". The subpage lists the block, column and codepoint details, as well as font support, UCD data fields and external links, where appropriate.

Planes

To list all 17 Unicode planes, use:

universe.html#P

To examine details of the plane containing codepoint "U+<hex>", use:

universe.html#P+<hex>

Sheets

In this context, a "sheet" is a 32-by-32 grid of 1024 contiguous codepoints; it is how the Notoverse glyphs are organised.

To list all 161 sheets which contain allocated codepoints, use:

universe.html#S

To examine the sheet containing codepoint "U+<hex>", use:

universe.html#S+<hex>
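As a rough illustration of the arithmetic behind planes and sheets, the sketch below locates a codepoint. "LocateInSheet" is a hypothetical name, and the row-major layout within a sheet is an assumption; the post only states that a sheet holds 1024 contiguous codepoints in a 32-by-32 grid.

```javascript
// Hypothetical helper, not taken from universe.js.
function LocateInSheet(codepoint) {
  return {
    plane: codepoint >> 16,       // 65,536 codepoints per plane
    sheet: codepoint >> 10,       // 1,024 codepoints per sheet
    row: (codepoint >> 5) & 31,   // ASSUMPTION: cells are laid out row by row
    column: codepoint & 31
  };
}
```

With this numbering, the whole Unicode range spans 0x110000 / 1024 = 1088 possible sheets, of which only the 161 containing allocated codepoints are listed.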

Blocks

To list all 320 named Unicode blocks, use:

universe.html#B

To examine the block containing codepoint "U+<hex>", use:

universe.html#B+<hex>

This will list constituent codepoints, organised by column.

Columns

To examine just the column containing codepoint "U+<hex>", use:

universe.html#C+<hex>

This will list constituent codepoints in more detail.

Queries

To query the UCD for matching codepoints, use:

universe.html#<key>=<value>

The "<key>" can be:

One of the abbreviated UCD field names, such as "gc" or "na".
"id" to search the codepoint id (e.g. "U+0040").
"basic" to search the computed codepoint basic category.
"kind" to search the computed codepoint kind.
"script" to search the codepoint script list (see below).
"extra" to search NamesList.txt annotations.
"text" to search all the above.

The "<value>" can be simple text or a JavaScript regular expression in the form "/<expr>/<flags>". For example:

universe.html#na=/\bANT\b/i

This searches for codepoints whose name field contains the whole word "ANT" case-insensitively.
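One way to turn a "<value>" into a test function is sketched below. "MakeMatcher" is a hypothetical name, and the rule for simple text (a case-insensitive substring match) is an assumption; the post does not say how plain values are compared.

```javascript
// Hypothetical helper, not taken from universe.js.
function MakeMatcher(value) {
  // "/<expr>/<flags>" becomes a JavaScript regular expression
  var match = /^\/(.+)\/([a-z]*)$/.exec(value);
  if (match) {
    try {
      var regexp = new RegExp(match[1], match[2]);
      return text => {
        regexp.lastIndex = 0; // "g" and "y" flags make test() stateful
        return regexp.test(text);
      };
    } catch (e) {
      // Invalid expression or flags: treat the value as simple text below
    }
  }
  // ASSUMPTION: simple text is a case-insensitive substring match
  var needle = value.toLowerCase();
  return text => text.toLowerCase().includes(needle);
}
```

For example, MakeMatcher("/\\bANT\\b/i") accepts "ANT" but rejects "GIANT PANDA", whereas the plain value "ant" would accept both under the substring assumption.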

Scripts

Queries with the special key "script" search the fields "sc" and "scx". To query for the 29 codepoints of the Ogham script, use:

universe.html#script=Ogam

To list all 210 scripts of the ISO 15924 standard, use:

universe.html#script

Search

Queries with the special key "search" perform full-text searches. To bring up a search dialog, use:

universe.html#search
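Taken together, the URL scheme described in this post could be dispatched by a small parser along the following lines. This is only a sketch under stated assumptions: "ParseHash" and the "page" names it returns are hypothetical and are not taken from universe.js.

```javascript
// Hypothetical helper, not taken from universe.js.
function ParseHash(hash) {
  var text = hash.replace(/^#/, "");
  try {
    text = decodeURIComponent(text);
  } catch (e) {
    // Malformed percent-escape: carry on with the raw text
  }
  if (text === "") {
    return { page: "landing" };
  }
  // "U+<hex>", "P+<hex>", "S+<hex>", "B+<hex>" or "C+<hex>"
  var match = /^([UPSBC])\+([0-9A-Fa-f]{1,6})$/.exec(text);
  if (match) {
    var codepoint = parseInt(match[2], 16);
    if (codepoint > 0x10FFFF) {
      return { page: "unknown" };
    }
    var pages = { U: "codepoint", P: "plane", S: "sheet", B: "block", C: "column" };
    return { page: pages[match[1]], codepoint: codepoint };
  }
  // Bare keys that list things or open the search dialog
  var lists = { P: "planes", S: "sheets", B: "blocks", script: "scripts", search: "search" };
  if (Object.prototype.hasOwnProperty.call(lists, text)) {
    return { page: lists[text] };
  }
  // "<key>=<value>" queries; the value is left for the caller to interpret
  var equals = text.indexOf("=");
  if (equals > 0) {
    return { page: "query", key: text.slice(0, equals), value: text.slice(equals + 1) };
  }
  return { page: "unknown" };
}
```

In the browser, this would typically be fed from "location.hash" inside a "hashchange" event listener.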

Gears

As mentioned earlier, loading the full UCD database and glyph sheets for the first time can take quite a few minutes. Searches can also take a few seconds. For long-running JavaScript functions, we display animated gears:

To keep the page responsive, we wrap the long-running functionality inside a call to the "Gears()" function in universe.js:

function Gears(parent, asynchronous, complete) {
  // Show the animated gears while the asynchronous work runs
  var gears = document.createElement("img");
  gears.className = "gears";
  gears.title = "Please wait...";
  gears.src = "gears.svg";
  parent.appendChild(gears);
  gears.onload = async () => {
    // Only start the work once the gears image is ready to display
    var result = await asynchronous();
    if (gears.parentNode === parent) {
      parent.removeChild(gears);
    }
    if (complete) {
      complete(result);
    }
  };
}

Inside the asynchronous, long-running function we have to make sure we periodically call "YieldAsync()":

function YieldAsync(milliseconds) {
  // Yield every few milliseconds
  // (or every time this function is called if argument is missing)
  var now = performance.now();
  if (!YieldAsync.already) {
    // First call
    YieldAsync.already = Promise.resolve(undefined);
    YieldAsync.channel = new MessageChannel();
  } else if (milliseconds && ((now - YieldAsync.previous) < milliseconds)) {
    // Resolve immediately
    return YieldAsync.already;
  }
  YieldAsync.previous = now;
  return new Promise(resolve => {
    YieldAsync.channel.port1.addEventListener("message",
      () => resolve(), { once: true });
    YieldAsync.channel.port1.start();
    YieldAsync.channel.port2.postMessage(null);
  });
}

This was inspired by a much-underrated StackOverflow answer.
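To show how the two functions might fit together, here is a minimal sketch of a long-running loop that yields as it goes. "FindMatchesAsync" is a hypothetical function, not part of universe.js; its "yieldAsync" parameter is assumed to behave like "YieldAsync()" above, and the 20 millisecond interval is an arbitrary choice.

```javascript
// Hypothetical long-running search, not taken from universe.js.
// "yieldAsync" is assumed to behave like YieldAsync(milliseconds) above.
async function FindMatchesAsync(records, predicate, yieldAsync) {
  var matches = [];
  for (var record of records) {
    if (predicate(record)) {
      matches.push(record);
    }
    // Let the browser repaint, but (ASSUMPTION) at most every 20 milliseconds
    await yieldAsync(20);
  }
  return matches;
}
```

It could then be wrapped as Gears(parent, () => FindMatchesAsync(records, predicate, YieldAsync), matches => { ... }), so that the gears stay animated until the search completes.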

Friday, 17 December 2021

Universe 3: Fonts

The Universe project uses Google Noto fonts as much as possible. As the Noto project page says:

The name is also short for "no tofu", as the project aims to eliminate 'tofu': blank rectangles shown when no font is available for your text.

According to the Universe statistics page, UCD 14.0.0 has 144,762 "used" codepoints. The table below breaks these down by kind, alongside the "unused" codepoints classified as "private", "surrogate" and "noncharacter":

Kind	Criteria	Codepoints
format	gc is "Cf", "Zl" or "Zp"	165
control	gc is "Cc"	65
private	gc is "Co"	137,468
surrogate	gc is "Cs"	2,048
noncharacter	gc is "Cn"	66
modifier	gc is "Mc", "Me" or "Mn"	2,408
emoji	EPres is "Y"	1,185
graphic	otherwise	140,939
Total		284,344

Note that this "Kind" classification is slightly more fine-grained than Unicode's "basic type" but less so than "general category". Also note that the codepoints classified as "private", "surrogate" and "noncharacter" are fixed and will not change in subsequent Unicode versions.
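The criteria in the table can be expressed as a short classification function. This is a sketch only: "KindOf" is a hypothetical name, the record is assumed to be a plain object keyed by short UCD property aliases, and the order in which the criteria are tested (the table's row order) is an assumption.

```javascript
// Hypothetical helper, not taken from universe.js.
// "record" is assumed to hold UCD properties keyed by short alias.
function KindOf(record) {
  // ASSUMPTION: criteria are tested in the same order as the table rows
  switch (record.gc) {
    case "Cf": case "Zl": case "Zp":
      return "format";
    case "Cc":
      return "control";
    case "Co":
      return "private";
    case "Cs":
      return "surrogate";
    case "Cn":
      return "noncharacter";
    case "Mc": case "Me": case "Mn":
      return "modifier";
  }
  // Everything else is split on the Emoji_Presentation property
  return (record.EPres === "Y") ? "emoji" : "graphic";
}
```

For U+0040, whose general category is "Po" and whose EPres is "N", this returns "graphic".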

This still leaves a great many codepoints that need to be rendered. The relationship between Unicode codepoints and character glyphs within a font is non-trivial, to say the least; but it can be useful, in a character browser, to render an "archetype" glyph of each codepoint for illustrative purposes.

In Universe, each "used" codepoint has a 32-by-32 pixel bitmap glyph. The aim is to use Google Noto fonts wherever possible to construct these bitmaps, because:

Noto fonts are "free and open source".
Their codepoint coverage is relatively good.
They try to adhere to a consistent look and feel.

Consequently, I named the set of bitmap glyphs "Notoverse". One can think of Notoverse as an alternative to GNU's Unifont initiative, except:

The glyphs are 32-by-32, not 16-by-16.
The glyphs are 24-bit RGB, not 1-bit monochrome.
There is full Unicode 14.0.0 coverage.

The construction of the Notoverse glyphs was tedious and exhausting. I now know why so many similar projects run out of steam. The final breakdown of sources for each of the 144,762 glyphs is as follows:

Hanazono: 58,118
Google Noto (CJK): 43,937
Google Noto (non-CJK): 34,225
GlyphWiki: 4,948
r12a.io UniView: 1,022
BabelStone Khitan: 470
Scraped from Unicode code charts: 346
Manually constructed (including some Noto Emoji): 1,696

It turns out that Noto contributes about 54% of the glyphs; more if we include the glyphs manually constructed from Noto elements.
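The percentage can be checked directly from the breakdown above. The variable names in this snippet are illustrative only.

```javascript
// Glyph counts copied from the breakdown above
var sources = {
  "Hanazono": 58118,
  "Google Noto (CJK)": 43937,
  "Google Noto (non-CJK)": 34225,
  "GlyphWiki": 4948,
  "r12a.io UniView": 1022,
  "BabelStone Khitan": 470,
  "Scraped from Unicode code charts": 346,
  "Manually constructed": 1696
};
// Sum every source, then work out the share of the two Noto entries
var total = Object.values(sources).reduce((sum, count) => sum + count, 0);
var noto = sources["Google Noto (CJK)"] + sources["Google Noto (non-CJK)"];
var share = 100 * noto / total;
console.log(total, noto, share.toFixed(1) + "%");
```

The total matches the 144,762 "used" codepoints, and the two Noto entries together account for 78,162 of them.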

U+0061: LATIN SMALL LETTER A

Here is the list of the Google Noto fonts I used (in priority order):

Noto Sans
Noto Sans Armenian
Noto Sans Hebrew
Noto Sans Arabic
Noto Sans Syriac
Noto Sans Thaana
Noto Sans NKo/Noto Sans N Ko
Noto Sans Samaritan
Noto Sans Mandaic
Noto Sans Malayalam
Noto Sans Devanagari
Noto Sans Bengali
Noto Sans Gurmukhi
Noto Sans Gujarati
Noto Sans Oriya
Noto Sans Tamil
Noto Sans Tamil Supplement
Noto Sans Telugu
Noto Sans Kannada
Noto Sans Malayalam
Noto Sans Sinhala
Noto Sans Thai
Noto Sans Myanmar
Noto Sans Georgian
Noto Sans Cherokee
Noto Sans Canadian Aboriginal
Noto Sans Ogham
Noto Sans Runic
Noto Sans Tagalog
Noto Sans Hanunoo
Noto Sans Buhid
Noto Sans Tagbanwa
Noto Sans Khmer
Noto Sans Mongolian
Noto Sans Limbu
Noto Sans Tai Le
Noto Sans New Tai Lue
Noto Sans Buginese
Noto Sans Tai Tham
Noto Sans Balinese
Noto Sans Sundanese
Noto Sans Batak
Noto Sans Lepcha
Noto Sans Ol Chiki
Noto Sans Glagolitic
Noto Sans Coptic
Noto Sans Tifinagh
Noto Sans Yi
Noto Sans Lisu
Noto Sans Vai
Noto Sans Bamum
Noto Sans Syloti Nagri
Noto Sans PhagsPa
Noto Sans Saurashtra
Noto Sans Kayah Li
Noto Sans Rejang
Noto Sans Javanese
Noto Sans Cham
Noto Sans Tai Viet
Noto Sans Ethiopic
Noto Sans Linear A
Noto Sans Linear B
Noto Sans Phoenician
Noto Sans Lycian
Noto Sans Carian
Noto Sans Old Italic
Noto Sans Gothic
Noto Sans Old Permic
Noto Sans Ugaritic
Noto Sans Old Persian
Noto Sans Deseret
Noto Sans Shavian
Noto Sans Osmanya
Noto Sans Osage
Noto Sans Elbasan
Noto Sans Caucasian Albanian
Noto Sans Cypriot
Noto Sans Imperial Aramaic
Noto Sans Palmyrene
Noto Sans Nabataean
Noto Sans Hatran
Noto Sans Lydian
Noto Sans Meroitic
Noto Sans Kharoshthi
Noto Sans Old South Arabian
Noto Sans Old North Arabian
Noto Sans Manichaean
Noto Sans Avestan
Noto Sans Inscriptional Parthian
Noto Sans Inscriptional Pahlavi
Noto Sans Psalter Pahlavi
Noto Sans Old Turkic
Noto Sans Old Hungarian
Noto Sans Hanifi Rohingya
Noto Sans Old Sogdian
Noto Sans Sogdian
Noto Sans Elymaic
Noto Sans Brahmi
Noto Sans Kaithi
Noto Sans Sora Sompeng
Noto Sans Chakma
Noto Sans Mahajani
Noto Sans Sharada
Noto Sans Khojki
Noto Sans Multani
Noto Sans Khudawadi
Noto Sans Grantha
Noto Sans Newa
Noto Sans Tirhuta
Noto Sans Siddham
Noto Sans Modi
Noto Sans Takri
Noto Sans Warang Citi
Noto Sans Zanabazar Square
Noto Sans Soyombo
Noto Sans Pau Cin Hau
Noto Sans Bhaiksuki
Noto Sans Marchen
Noto Sans Masaram Gondi
Noto Sans Gunjala Gondi
Noto Sans Cuneiform
Noto Sans Egyptian Hieroglyphs
Noto Sans Anatolian Hieroglyphs
Noto Sans Mro
Noto Sans Bassa Vah
Noto Sans Pahawh Hmong
Noto Sans Medefaidrin
Noto Sans Miao
Noto Sans Nushu
Noto Sans Duployan
Noto Sans SignWriting
Noto Sans Wancho
Noto Sans Mende Kikakui
Noto Sans Meetei Mayek/Noto Sans MeeteiMayek
Noto Sans Adlam Unjoined
Noto Sans Indic Siyaq Numbers
Noto Serif Tibetan
Noto Serif Vithkuqi
Noto Serif Yezidi
Noto Serif Ahom
Noto Serif Dogra
Noto Serif Tangut
Noto Serif Hmong Nyiakeng/Noto Serif Nyiakeng Puachue Hmong
Noto Sans Symbols
Noto Sans Symbols2/Noto Sans Symbols 2
Noto Sans Math
Noto Sans Display
Noto Looped Lao
Noto Sans Lao
Noto Sans CJK SC/Noto Sans SC
Noto Music

Where there are two font names with a slash between them, the first is the local font name and the second the web font name. Alas, the naming is somewhat lax. See the HTML source and CSS for more details.

I hope I haven't trodden on anyone's toes by using their fonts in this way. The individual glyphs are down-sampled to 32-by-32 pixels and used for illustrative purposes only. I trust you'll agree that's "fair use".

So, it looks like we're still a long way from getting a pan-Unicode font, or even a set of fonts that achieve the same goal.

For completeness, here's a list of attempts at providing good Unicode coverage:

Lucida Unicode, Microsoft, 1993
Everson Mono, Michael Everson, 1995
Code2000 et al, James Kass, 1998-2008
GNU Unifont, 1998-
GNU FreeFont, 2002-
GlyphWiki, 2006-
Google Noto, 2012-

Thursday, 16 December 2021

Universe 2: Loading Resources

When the Universe web page first loads into a browser, about 100MB of data are pulled over the network in over 250 HTTP requests. This includes 77 web fonts (6MB), 161 glyph image sheets (77MB) and the Unicode Character Database, UCD, as a tab-separated value file (28MB).

It can take three or four minutes the first time around with a slow network connection, but all these resources are usually cached by the browser, so subsequent page loads perform very little actual data transfer. However, decoding the UCD is another matter.

Even the subset of the UCD data we're actually interested in takes up a quarter of a gigabyte when expressed as JSON. So caching the raw JSON file is problematic. I elected to transfer the data as sequential key-value deltas in TSV format. This reduces the size down to under 30MB and only 6.5MB when compressed "on the wire". It is also relatively quick to reconstitute the full UCD object hierarchy in JavaScript: it takes about seven seconds on my machine.

Here are the lines describing codepoint U+0040 ("@"):

na<tab>COMMERCIAL AT
lb<tab>AL
STerm<tab>N
Term<tab>N
SB<tab>XX
0040
<tab>= at sign

The first five lines (in "<key><tab><value>" form) list only the differences in UCD properties (keyed by short property aliases) compared to the preceding codepoint (i.e. U+003F "?").

The next line (in "<hex>" form) creates a new codepoint record for "U+0040".

The final line (in "<tab><extra>" form) adds an extra line of narrative to the preceding record. In this case, it refers to a NamesList.txt alias.
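The three line forms described above are enough to sketch a decoder. The following is not the code from "LoadDatabaseFileAsync()": "DecodeDeltas" and the shape of the records it returns are hypothetical, and skipping empty lines is an assumption.

```javascript
// Hypothetical decoder, not taken from universe.js.
function DecodeDeltas(tsv) {
  var current = {}; // running property values, keyed by short alias
  var records = [];
  for (var line of tsv.split(/\r?\n/)) {
    if (line === "") {
      continue; // ASSUMPTION: empty lines carry no information
    }
    var tab = line.indexOf("\t");
    if (tab < 0) {
      // "<hex>": snapshot the running properties as a new codepoint record
      records.push({ id: parseInt(line, 16), properties: { ...current }, extra: [] });
    } else if (tab === 0) {
      // "<tab><extra>": narrative belonging to the preceding record
      if (records.length > 0) {
        records[records.length - 1].extra.push(line.slice(1));
      }
    } else {
      // "<key><tab><value>": a change relative to the preceding codepoint
      current[line.slice(0, tab)] = line.slice(tab + 1);
    }
  }
  return records;
}
```

Because each record takes a copy of the running properties, later deltas do not disturb records that have already been created.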

The ucd.14.0.0.tsv file is constructed using a NodeJS script from the following sources:

ucd.all.flat.xml (194MB) from https://www.unicode.org/Public/UCD/latest/ucdxml/ucd.all.flat.zip
https://www.unicode.org/Public/UCD/latest/ucd/NamesList.html (1.6MB)
https://www.unicode.org/iso15924/iso15924.txt (13KB)

It is read and reconstituted by the "LoadDatabaseFileAsync()" function of universe.js. Notice that the final action of that function is to store the data in an IndexedDB database. This allows the JavaScript to test for the existence of that data on subsequent page reloads, which saves several seconds by obviating the need to fetch and decode the file each time. See "IndexedDatabaseGet()" and "IndexedDatabasePut()" in universe.js.

The downside of using this client-side caching is that it takes up 260MB of disk space:

We rely on the browser (Chrome, in this case) to manage the IndexedDB storage correctly, but even if the cache is purged, we fall back to parsing the TSV again.
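The cache-then-fallback flow can be sketched as below. The signatures of "IndexedDatabaseGet()" and "IndexedDatabasePut()" are not shown in this post, so this hypothetical "LoadCachedAsync" takes the storage and decoding steps as parameters instead of calling them directly.

```javascript
// Hypothetical helper, not taken from universe.js.
// ASSUMPTIONS: getAsync(key) resolves to undefined on a cache miss,
// putAsync(key, value) stores a value, decodeAsync() rebuilds the data.
async function LoadCachedAsync(key, getAsync, putAsync, decodeAsync) {
  var cached = await getAsync(key);
  if (cached !== undefined) {
    return cached; // Cache hit: no fetch or decode needed
  }
  // Cache miss (first visit, or the browser purged the storage)
  var fresh = await decodeAsync();
  try {
    await putAsync(key, fresh);
  } catch (e) {
    // Storage may be full or unavailable: the data is still usable
  }
  return fresh;
}
```

A failed write is deliberately ignored here, on the reasoning that the page can still work from the freshly decoded data.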

Wednesday, 15 December 2021

Universe 1: Introduction

Version 14.0.0 of Unicode was released back in September. There is a plethora of online resources for browsing the Unicode Character Database (UCD).

But they usually lack something: the ability to render the codepoints and/or to keep up to date with the ever-evolving Unicode standard.

Of course, you can just download the official Unicode code charts as a single PDF and manually cross-reference them with the UCD. But the PDF is over 100MB and the UCD is distributed as a collection of human-unfriendly text files.

Universe is my attempt at producing a client-side HTML/CSS/JavaScript Unicode browser. It's obviously a work in progress, but the main areas of interest (that I plan to cover in subsequent blog posts) are:

A user interface to navigate around Unicode codepoints, blocks, planes, etc.
Loading the UCD into the client (this is my biggest concern at present as it takes about ten seconds to load and parse the database).
A flexible search mechanism.
Font support.
A representative rendering of all the codepoints.
