Hangul and Hanja

I am now preparing files showing coverage of the South Korean permitted
Hanja for forenames and the Hanja shown for surnames in the 2015 census.

For NanumMyeongjo, which is or was a preferred Korean serif font, I have
used the Hanazono Mincho fonts to provide the Hanja. I am ignoring all
other Korean fonts which lack Hanja.

PDFs
Baekmuk-Batang-forenames.pdf
Baekmuk-Batang-surnames.pdf
Baekmuk-Dotum-forenames.pdf
Baekmuk-Dotum-surnames.pdf
Baekmuk-Gulim-forenames.pdf
Baekmuk-Gulim-surnames.pdf
NanumMyeongjo-forenames.pdf
NanumMyeongjo-surnames.pdf
NotoSansCJKkr-forenames.pdf
NotoSansCJKkr-surnames.pdf
NotoSerifCJKkr-forenames.pdf
NotoSerifCJKkr-surnames.pdf
UnBatang-forenames.pdf
UnBatang-surnames.pdf
UnDotum-forenames.pdf
UnDotum-surnames.pdf

XeLaTeX source
Baekmuk-Batang-forenames.tex
Baekmuk-Batang-surnames.tex
Baekmuk-Dotum-forenames.tex
Baekmuk-Dotum-surnames.tex
Baekmuk-Gulim-forenames.tex
Baekmuk-Gulim-surnames.tex
NanumMyeongjo-forenames.tex
NanumMyeongjo-surnames.tex
NotoSansCJKkr-forenames.tex
NotoSansCJKkr-surnames.tex
NotoSerifCJKkr-forenames.tex
NotoSerifCJKkr-surnames.tex
UnBatang-forenames.tex
UnBatang-surnames.tex
UnDotum-forenames.tex
UnDotum-surnames.tex

My scripts for creating these.

These too are in files/korean.

These scripts are just a hack. They are not safe for use on a multiuser
machine, nor for running more than one of them at a time (insecure
filenames in the current directory, and reuse of at least one script).

I began by assuming I would need to use Hanazono Mincho fonts to get
full coverage, but in the end I am only doing that for NanumMyeongjo,
which lacks all Hanja. For other fonts which lack all Hanja I will not bother.
The remaining fonts now show only what they actually support; the
naver-reduced file from process-naver.sh summarises what to change.

This eventually highlighted that the Hanja are not unique in either list,
so counting them requires collecting the output and sorting it. That showed
a further problem. With the Baekmuk and Un fonts I can see where items
are missing, so I trust my lists of codepoints. But for the NotoCJK fonts,
when I went to confirm how missing codepoints are shown, I eventually
found that everything appeared to be present. So there is some sort of bug
in my get-codepoints script.
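
The counting step can be sketched as follows. This is a minimal Python illustration, not one of the shell scripts described here; the function name count_hanja and the sample list are hypothetical.

```python
from collections import Counter

def count_hanja(hanja):
    """Return (number of distinct Hanja, sorted list of Hanja listed more than once)."""
    counts = Counter(hanja)
    repeated = sorted(ch for ch, n in counts.items() if n > 1)
    return len(counts), repeated

# Made-up sample in which one Hanja is listed twice.
sample = ["金", "李", "金", "朴"]
distinct, repeated = count_hanja(sample)
print(distinct, repeated)
```

Counting the raw lines would give 4 here, whereas only 3 distinct Hanja are present.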

./process-naver.sh

This eventually creates naver-reduced and takes a significant time to
run. First it reads data-naver.csv, invoking ./naverfields.sh for each line,
to create naver-reduced-initial. The first line of that started as the csv
column headings and is garbage.

It then cats naver-heading to begin naver-reduced, adds all except the
first line from naver-reduced-initial, and finally adds end statements for
multicols and document.
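
A rough Python sketch of that assembly, working on lists of lines rather than on the real files. The function name assemble and the sample contents are hypothetical, and the exact closing lines are assumed to be the usual LaTeX \end statements.

```python
def assemble(heading, initial_lines):
    """Heading first, then the initial output minus its first line, then the closers."""
    # Assumed closing statements for the multicols and document environments.
    closing = ["\\end{multicols}", "\\end{document}"]
    # initial_lines[0] began life as the CSV column headings, so it is dropped.
    return heading + initial_lines[1:] + closing

heading = ["% placeholder for the contents of naver-heading"]
initial = ["garbage from the column headings", "first entry", "second entry"]
for line in assemble(heading, initial):
    print(line)
```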

If I used a Makefile, this script would be described as precious - keep it
and copy it for each naver-filename.tex file and then adapt those as
required.

I eventually realised that my earlier namings (naver-fontname and
Korean-surnames-fontname) were unmanageable. So naver-reduced
now gets copied to fontname-forenames.txt. The template for
surnames-fontname.tex is in files/tools/templates.

Temporary files, showing the created format.

./myfields : the (two) Hangul, Unicode value in lowercase without any
prefix, Hanja glyph.

./mystart : fields for Korean, i.e. Hangul, space in normal font,
HanjaA with glyph, start of normal font markup.

./myval : Unicode value in uppercase, prefixed U+ and with
spaces before and after it.

./myend : closing '}' and newline.
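
As an illustration of the relationship between the unprefixed lowercase value in ./myfields and the form written to ./myval, here is a small Python sketch. The function name glyph_and_value is hypothetical and this is not the author's shell code.

```python
def glyph_and_value(code):
    """code is unprefixed lowercase hex, e.g. '91d1'.

    Returns the glyph for that codepoint and the value in the ./myval style:
    uppercase, prefixed U+, with a space before and after.
    """
    glyph = chr(int(code, 16))
    value = f" U+{code.upper()} "
    return glyph, value

glyph, value = glyph_and_value("91d1")
print(glyph + value)
```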

./process-navercodes.sh

This similarly reads through data-naver.csv, using ./navercodes.sh
which writes to ./myfields (q.v.) but here only writes the Unicode value.
Then it sorts (5-digit hex codes need to come after 4-digit).
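
The ordering requirement arises because a plain text sort compares character by character, so a 5-digit code such as 20000 lands before 4e00. A minimal Python sketch of sorting by numeric value instead (the function name sort_codes and the sample codes are illustrative only):

```python
def sort_codes(codes):
    """Sort unprefixed hex codepoints by numeric value, not as text."""
    return sorted(codes, key=lambda c: int(c, 16))

codes = ["91d1", "20000", "4e00"]
print(sorted(codes))      # text order: the 5-digit code comes first
print(sort_codes(codes))  # numeric order: the 5-digit code comes last
```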

./check-hangul.sh

Copy a specified codepoints file to the current directory (checking that
only one codepoints file is present), then loop through naver-unicodes:
if a code is not present, write it to filename.missing-hanja,
else write it to filename-found-hanja.
Display how many hanja lines were processed (zero means no
hanja in this codepoints file).
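
The found/missing split can be sketched in Python as below. This assumes both inputs hold codepoints in the same textual form (unprefixed lowercase hex); the function name check_hanja and the sample values are hypothetical, and the real script works on files rather than lists.

```python
def check_hanja(font_codepoints, wanted):
    """Split the wanted codes into those the font has and those it lacks."""
    supported = set(font_codepoints)
    found = [c for c in wanted if c in supported]
    missing = [c for c in wanted if c not in supported]
    return found, missing

font = ["4e00", "91d1"]      # made-up codepoints file contents
wanted = ["91d1", "20000"]   # made-up naver-unicodes contents
found, missing = check_hanja(font, wanted)
print(len(found), "found,", len(missing), "missing")
```

A check like this only works if the two lists use identical formatting, so a difference in case or prefix would make every code look missing.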