Medley UNICODE
UNICODE
By Ron Kaplan
This document was last edited in January 2025.
The UNICODE library package defines external file formats that enable Medley to read and write files in which 16-bit character codes are represented as UTF-8 byte sequences or UTF-16 byte pairs. It also provides for character codes to be converted, on reading, from Unicode codes to the equivalent codes in the Medley-internal Xerox Character Code Standard (XCCS) and, on writing, from XCCS codes to the equivalent Unicode codes.
Unicode external formats
Seven external formats are defined when the package is loaded. The first four translate character codes between XCCS and Unicode:
:UTF-8 codes are represented as UTF-8 byte sequences, and XCCS/Unicode character conversion takes place.
:UTF-16BE codes are represented as 2-byte pairs, with the high-order byte appearing first in the file, and characters are converted.
:UTF-16LE codes are represented as 2-byte pairs, with the low-order byte appearing first in the file, and characters are converted.
:UTF-8-SLUG A variant of :UTF-8 whose OUTCHARFN produces the Unicode slug code FFFD for XCCS codes whose mappings are not defined in the XCCS-to-Unicode tables.
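The slug behaviour described for :UTF-8-SLUG can be illustrated outside Medley with a short Python sketch. The names `XCCS_TO_UNICODE` and `xccs_to_utf8_slug` are hypothetical, and the table holds toy entries rather than the package's actual mappings. The only facts taken from the text are that unmapped codes come out as FFFD and that the result is written as UTF-8.

```python
# Illustrative only: a toy stand-in for the XCCS-to-Unicode tables.
# The two entries are chosen for the example, not copied from the package.
XCCS_TO_UNICODE = {0x0041: 0x0041, 0x0042: 0x0042}

SLUG = 0xFFFD  # the Unicode slug (replacement character) code


def xccs_to_utf8_slug(codes):
    """Map each code through the toy table, substitute the slug for
    codes with no mapping, and return the UTF-8 bytes of the result."""
    mapped = [XCCS_TO_UNICODE.get(code, SLUG) for code in codes]
    return "".join(chr(u) for u in mapped).encode("utf-8")
```

With this toy table, a mapped code passes through unchanged and any other code is written as the three UTF-8 bytes EF BF BD, which is the UTF-8 encoding of FFFD.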
The three other external formats translate byte sequences into codes, but do not translate the codes. These allow Medley to see and process characters in their native encoding.
:UTF-8-RAW codes are represented as UTF-8 byte sequences, but character conversion does not take place.
:UTF-16BE-RAW codes are represented as big-endian 2-byte pairs, but there is no conversion.
:UTF-16LE-RAW codes are represented as little-endian 2-byte pairs, but there is no conversion.
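To make the three byte layouts concrete, the following Python sketch encodes a single 16-bit code with no character conversion, as the raw formats do. The function names are hypothetical and this is not the package's implementation. It follows the standard UTF-8 and UTF-16 layouts for codes up to FFFF and ignores the surrogate range D800-DFFF.

```python
def utf16_be(code):
    """Two bytes, high-order byte first."""
    return bytes([code >> 8, code & 0xFF])


def utf16_le(code):
    """Two bytes, low-order byte first."""
    return bytes([code & 0xFF, code >> 8])


def utf8(code):
    """One to three bytes, depending on the size of the 16-bit code."""
    if code < 0x80:
        return bytes([code])
    if code < 0x800:
        return bytes([0xC0 | (code >> 6), 0x80 | (code & 0x3F)])
    return bytes([
        0xE0 | (code >> 12),
        0x80 | ((code >> 6) & 0x3F),
        0x80 | (code & 0x3F),
    ])
```

For the code 20AC (the euro sign), the big-endian pair is 20 AC, the little-endian pair is AC 20, and the UTF-8 sequence is E2 82 AC.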
All of these formats take the end-of-line convention for external files (which matters mostly for writing) from the variable EXTERNALEOL, whose value is LF, CR, or CRLF and is initially set to LF.
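The effect of the three end-of-line settings on the bytes written can be sketched in Python as follows. The names `EOL_BYTES` and `encode_lines` are hypothetical. The sketch only shows what LF, CR, and CRLF mean at the byte level for a UTF-8 file, with LF as the default, and does not show how Medley consults EXTERNALEOL internally.

```python
# The byte sequence that ends a line under each convention.
EOL_BYTES = {"LF": b"\n", "CR": b"\r", "CRLF": b"\r\n"}


def encode_lines(lines, eol="LF"):
    """Encode each line as UTF-8 and end it with the chosen terminator."""
    terminator = EOL_BYTES[eol]
    return b"".join(line.encode("utf-8") + terminator for line in lines)
```

For example, `encode_lines(["a", "b"], "CRLF")` ends each line with the two bytes 0D 0A, while the default call ends each line with the single byte 0A.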
The external format can be specified as a parameter when a stream is opened:
(OPENSTREAM 'foo.txt 'INPUT 'OLD '((EXTERNALFORMAT :UTF-8)))
(CL:OPEN 'foo.txt :DIRECTION :INPUT :EXTERNAL-FORMAT :UTF-8)
The opening parameters may be overridden if the calling function (e.g., TEdit) invokes READBOM and it detects a byte-order mark at the beginning of the file:
(READBOM STREAM) [Function]