My Current Process

Listing the codepoints.

Before April 2025

I adapted a Public Domain file called has_char, which I found somewhere (the location
is long forgotten; it was years ago), and renamed it get_codepoints.cc: it links to
fontconfig and lets me list all codepoints in a TTF or OTF file with a value greater
than or equal to space (it is a bit buggy; sometimes space itself is not reported).

For TTC, or potentially OTC, files I need to separate the individual fonts. I used to
use fontforge, but that has always been awkward, with tiny, faint text in its user
interface. If I have to do this for any future files I will use an old package I found
on github, which I have adapted as getfonts-20240622.

That was all driven by my create-codepoint-files script. It created a codepoints
file, and a formatted 'coverage' file listing all the codepoints under their Unicode
headings. It had a backup approach, using ttf2config.pl supplied in the examples/
directory of Font-TTF-Scripts. The TTC files I knew about were listed but probably
no longer referenced.

I then ran my generate-all-characters script. That read the codepoints file and
output a list (with spaces for missing items) on stdout. I ran it so infrequently
that I found using stdout to check what was happening, followed by Ctrl-C, useful.
Then I ran it again with the output sent to fontname-contents.txt.
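The idea behind generate-all-characters can be sketched as below. This is a minimal Python sketch, not the actual script: the one-hex-value-per-line input format, the function name render_rows and the 16-per-row layout are my assumptions.

```python
# Sketch: read a list of codepoints and print rows of 16 characters,
# with spaces standing in for codepoints the font does not contain.
# The file format and row width are assumptions, not the real script's.

def render_rows(codepoints, per_row=16):
    """Yield (row_start, text) pairs covering the range of codepoints."""
    present = set(codepoints)
    lo = min(present) // per_row * per_row
    hi = max(present)
    for start in range(lo, hi + 1, per_row):
        row = "".join(
            chr(cp) if cp in present else " "
            for cp in range(start, start + per_row)
        )
        yield start, row

if __name__ == "__main__":
    cps = [0x41, 0x42, 0x43, 0x45, 0x5A]  # hypothetical partial coverage
    for start, row in render_rows(cps):
        print(f"U+{start:04X} {row}")
```

Redirecting stdout to fontname-contents.txt, as described above, then captures the same output to a file.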

To maintain the Unicode blocks and their names I used update-blocks.sh and the
files it references, using a copy of Blocks.txt from Unicode, giving unicode-blocks.
That listed assigned blocks and the unassigned areas in between them. Updating
was only necessary if an updated or new font of interest had a codepoint for which
the block name was not known. That version of unicode-blocks was updated
for Unicode-15.1.0 and has now been renamed to unicode-blocks.old.
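The Blocks.txt format is simple, and the parsing that update-blocks.sh implies can be sketched as follows. The function name and the "(unassigned)" gap entries are my assumptions about the shape of unicode-blocks, not the real script.

```python
import re

def parse_blocks(text):
    """Parse Unicode Blocks.txt lines like '0000..007F; Basic Latin',
    skipping comments and blank lines, and insert '(unassigned)' entries
    for the gaps between assigned blocks."""
    pat = re.compile(r"^([0-9A-Fa-f]+)\.\.([0-9A-Fa-f]+);\s*(.+)$")
    blocks = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = pat.match(line)
        if m:
            blocks.append((int(m.group(1), 16), int(m.group(2), 16), m.group(3)))
    result = []
    prev_end = -1
    for start, end, name in blocks:
        if start > prev_end + 1:
            result.append((prev_end + 1, start - 1, "(unassigned)"))
        result.append((start, end, name))
        prev_end = end
    return result
```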

Changes in April 2025

While reviewing the binary TL2025-pretest build I noticed that Gnome was moving
to two new fonts, Adwaita Sans and Adwaita Mono. Adwaita Sans is a variable font
(multiple weights and normal or italic in a single TTF). I assumed I could process it
in LuaLatex (true), but initial examination showed that my scripts were failing to
find Latin uppercase A (U+0041) although it clearly exists in the file. That meant I
had to revise my process to allow comments in the codepoints file (in this case, a
comment recording that I had added U+0041 by hand).
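Allowing comments is a small parsing change; a sketch of comment-tolerant parsing is below. The '#' comment marker and the optional 'U+' prefix are my assumptions about the file syntax.

```python
def read_codepoints(lines):
    """Parse a codepoints file: one hex value per line, ignoring blank
    lines and anything after a '#' comment marker (an assumed syntax)."""
    cps = []
    for line in lines:
        line = line.split("#", 1)[0].strip()
        if line:
            cps.append(int(line.removeprefix("U+"), 16))
    return cps
```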

Two bigger problems arose when I looked at the conventional Mono (normal and
italic, regular and bold) font. First, there were items in an undefined block: these
two fonts are the first I have encountered which use Unicode-16.0. More critically,
I noticed that in one of the Cyrillic Extended blocks there was what looked like a
long vertical smear to the left of where the codepoints should be. I had assumed
that all combining codepoints were in blocks with Combining in their names. In
reality many blocks have one or more combining codepoints, but not all fonts
have those codepoints (e.g. they were added after the font was released, or are
needed only for variant languages).

I now create unicode-blocks as before. Then I manually review each block to see if
it has any combining codepoints. Doing this, and getting it to all fit together, nearly
drove me out of my mind (a cynic would probably say that it did do that).

Unfortunately, there is no comprehensive list of all combining codepoints. Everyone
else who cares only needs to know the rules for the script(s) their font covers, but I
want to be able to show the codepoints individually. Some blocks show a dotted
circle under the codepoint even when no space is put in front, but for many blocks
that is not true.
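One way to find the combining codepoints in a block is Python's unicodedata module, which exposes the general category and canonical combining class. This is my suggestion, not part of the original scripts, and it only reflects the Unicode version Python was built with.

```python
import unicodedata

def combining_in_range(start, end):
    """Return the codepoints in [start, end] that are combining marks:
    general category Mn/Mc/Me, or a non-zero canonical combining class."""
    marks = []
    for cp in range(start, end + 1):
        ch = chr(cp)
        cat = unicodedata.category(ch)
        if cat in ("Mn", "Mc", "Me") or unicodedata.combining(ch):
            marks.append(cp)
    return marks
```

For example, combining_in_range(0x0300, 0x036F) covers the Combining Diacritical Marks block, while the same call on a Latin letter range comes back empty.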

A further aggravation was that if I wrote the output to stdout, for Semitic (RTL)
languages the address range for a line appeared at the left, but the glyphs started at
the right margin and continued to the left. At first I thought this had always been
wrong, but I now think that the problem is I have difficulty distinguishing these
glyphs. For peace of mind I now write U+202D (LEFT-TO-RIGHT OVERRIDE) in front of both
spaces and codepoints which are present in these blocks. In 'vim' that shows as
'<202d>' and in LibreOffice Writer (lowriter) it is effective.
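The U+202D trick can be sketched like this; the function names and the per-cell placement are my assumptions about how the script applies the override.

```python
LRO = "\u202d"  # U+202D LEFT-TO-RIGHT OVERRIDE

def force_ltr(cell):
    """Prefix a display cell (a glyph or a placeholder space) with the
    LEFT-TO-RIGHT OVERRIDE so RTL glyphs stay in codepoint order."""
    return LRO + cell

def render_rtl_row(codepoints, present):
    """Render one row of an RTL block, overriding direction per cell."""
    return "".join(force_ltr(chr(cp) if cp in present else " ")
                   for cp in codepoints)
```

Note that U+202D is an override, not a mark: a renderer that honours the bidirectional algorithm will lay the cells out left to right regardless of the glyphs' inherent direction.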

The new process does not generate a blank line in front of headings, so I need to
add those blank lines manually.

As noted in 'newgenerate', usage is either
../newgenerate filename.codepoints 2>stderr
to see what is happening, or whether there is a combining block I have not specified,
or
../newgenerate filename.codepoints >filename-contents.txt 2>stderr
There is a lot of debug information available.

If anybody picks this up for a font which needs unicode-17 or later, look at changes
for all versions since 16.0. In 17.0-pre there is a new block with combining codepoints,
and perhaps other combining codepoints have been added.

The remainder of the process

I then open the contents.txt file in lowriter, save as an ODT file, and add a header and
footer (12pt; page number at right in the footer, fontconfig name and version or date in
the header, filename in the footer). Then I remove the title my text file had, ensure that
each page has a heading for the block (... continued), and remove unnecessary blank
lines or move things to a new page. Then I save that as a PDF. From April 2025 I also
need to add blank lines in front of the block headings.

When revising some old noto contents files to review coverage of combining codes I
discovered that from time to time fontconfig used some other random (and weird) font
for some of the block headings. So where an old noto font does not contain all of 0-9,
A-Z, a-z I ensure those headings and addresses are all edited to use Noto Sans.

After that, until early 2025 I took my languages-full.tex template, which used XeLaTeX
(now unmaintained) to see what the font covered. I now use my font-languages-new.tex
template with LuaLaTeX: I have dropped all the languages where I am unsure about
typographic standards (particularly, hyphenation) and use two-column text to use more
of the virtual page (but perhaps more than can be printed; I no longer have a working
printer and recommend using a decent PDF viewer).

Unfortunately, while processing the languages files for AdwaitaSans and AdwaitaMono
I discovered that LuaLaTeX has strange quirks - for AdwaitaMono it does not justify text,
and it seems to have strange ideas about where a line in English should break. I fixed
the Vietnamese tone markings and issues in my 'Quotes' section by replacing all spaces
and non-breaking spaces with space directives in millimetres or sub-millimetres. I cannot
see any logic to these hspace values.

Just because a codepoint is present does not mean it has the expected glyph; there are
sometimes errors. If there are small capitals, I have attempted to find which codepoints
they cover using my 'find-small-caps' script (it does not always work), or by listing what
is in any separate small-cap font file. If I found the sc codepoints, I merge them using
my 'merge-sc-codepoints' file.
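Merging the small-cap codepoints is essentially a sorted union; a sketch of what merge-sc-codepoints presumably does is below. The function name, the de-duplication and the duplicate report are my assumptions about the real script.

```python
def merge_codepoints(main, small_caps):
    """Merge two codepoint lists into one sorted, de-duplicated list,
    and report which small-cap codepoints were already in the main list."""
    main_set = set(main)
    duplicates = sorted(cp for cp in small_caps if cp in main_set)
    merged = sorted(main_set | set(small_caps))
    return merged, duplicates
```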

It is amazing what partial items exist in some fonts; in particular, the old AntPolt fonts
have archaic Polish items. My script reports anything I do not think is a codepoint.

In 2023-4 I began to understand that HarfBuzz can generate missing codepoints using
combining codepoints, often with very ugly results. With XeLaTeX it was easy, if tedious,
to identify such codepoints. With LuaLaTeX that does not seem to happen reliably.

Finally, I use languages-sc-new.tex to create a fontname-sc.tex file if there are any small
caps, use that to create a PDF, and then a fontname-languages.tex file to create the main
file including details about font weights and miscellania. I attempt to use polyglossia to
help keep text within the default margins, but for non-Latin writing systems that needs
OpenType tags and not every font has those for all its contents.

Reviewing often shows that the glyphs for the modern Greek alphabet either omit
diacriticals on capital letters or place them very badly. Therefore, not every font
where I show the monotonic Greek alphabet is usable for Greek text. Similarly, in small
capitals for Turkic languages (now only Turkish, I dropped Azeri), the combining dot above
needed to distinguish small cap normal "dotted i" from small cap "dotless i" may be
missing or very badly positioned.

After a lot of aggravation trying to show the tone accents for Vietnamese in small cap
fonts, I eventually abandoned that. At best the layout was ugly and poorly aligned, at
worst some of the lines showed lowercase letters for no obvious reason. LuaLaTeX is nasty
if you have unusual requirements.

My future plans

None.