Medley UNICODE
|
||
UNICODE
|
||
By Ron Kaplan
|
||
This document was last edited in January 2024.
|
||
|
||
The UNICODE library package defines external file formats that enable Medley to read and write files in which 16-bit character codes are represented as UTF-8 byte sequences or big-endian UTF-16 byte pairs. It also provides for character codes to be converted (on reading) from Unicode codes to equivalent codes in the Medley-internal Xerox Character Code Standard (XCCS) and (on writing) from XCCS codes to equivalent Unicode codes.
|
||
UTF-8 External formats
|
||
Four external formats are defined when the package is loaded:
|
||
:UTF-8 codes are represented as UTF-8 byte sequences and XCCS/Unicode character conversion takes place.
|
||
:UTF-16BE codes are represented as big-endian 2-byte pairs (the high-order byte appears first in the file), and XCCS/Unicode character conversion takes place.
|
||
The two other external formats translate byte sequences into codes, but do not translate the codes. These allow Medley to see and process characters in their native encoding.
|
||
:UTF-8-RAW codes are represented as UTF-8 byte sequences, but character conversion does not take place.
|
||
:UTF-16BE-RAW codes are represented as big-endian 2-byte pairs, but character conversion does not take place.
|
||
These formats all define the end-of-line convention (mostly for writing) for the external files according to the variable EXTERNALEOL (LF, CR, CRLF), with LF the default.
|
||
The external format can be specified as a parameter when a stream is opened:
|
||
(OPENSTREAM 'foo.txt 'INPUT 'OLD '((EXTERNALFORMAT :UTF-8)))
|
||
(CL:OPEN 'foo.txt :DIRECTION :INPUT :EXTERNAL-FORMAT :UTF-8)
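As a rough illustration of the difference between the converting and raw formats, the sketch below opens the same file under :UTF-8 and :UTF-8-RAW and collects the first character code seen under each. FIRST-CODES is a hypothetical helper, not part of the package, and the sketch assumes that READCCODE and CLOSEF behave as they do for ordinary Medley streams.

```lisp
(DEFINEQ (FIRST-CODES
  (LAMBDA (FILE)
    (* ; "Hypothetical helper: first character code of FILE as seen under :UTF-8 and under :UTF-8-RAW")
    (for FMT in '(:UTF-8 :UTF-8-RAW)
       collect (PROG (STREAM CODE)
                 (SETQ STREAM (OPENSTREAM FILE 'INPUT 'OLD
                                 (LIST (LIST 'EXTERNALFORMAT FMT))))
                 (SETQ CODE (READCCODE STREAM))
                 (CLOSEF STREAM)
                 (RETURN CODE))))))
```

For a file whose first character has different XCCS and Unicode codes, the two values would be expected to differ; for a character whose code coincides in the two standards they would be the same.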
|
||
The function STREAMPROP obtains or changes the external format of an open stream:
|
||
(STREAMPROP stream 'EXTERNALFORMAT) -> :XCCS
|
||
(STREAMPROP stream 'EXTERNALFORMAT :UTF-8) -> :XCCS
|
||
In the latter case, the stream's format is changed to :UTF-8 and the previous value is returned. In this example the previous value is :XCCS, Medley's historical default format.
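Because the previous format is returned, a caller can switch a stream's format temporarily and then restore it. The following is a minimal sketch of that pattern; READ-CODE-AS-UTF8 is a hypothetical name, and the use of READCCODE to read one character code is an assumption about how the caller consumes the stream.

```lisp
(DEFINEQ (READ-CODE-AS-UTF8
  (LAMBDA (STREAM)
    (* ; "Hypothetical helper: reads one character code with STREAM temporarily set to :UTF-8")
    (PROG (OLDFORMAT CODE)
      (* ; "STREAMPROP installs the new format and returns the previous one")
      (SETQ OLDFORMAT (STREAMPROP STREAM 'EXTERNALFORMAT :UTF-8))
      (SETQ CODE (READCCODE STREAM))
      (* ; "Restore whatever format the stream had before")
      (STREAMPROP STREAM 'EXTERNALFORMAT OLDFORMAT)
      (RETURN CODE)))))
```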
|
||
Entries can be placed on the variable *DEFAULT-EXTERNALFORMATS* to change the external format that is set by default when a file is opened on a particular device. Loading UNICODE executes
|
||
(PUSH *DEFAULT-EXTERNALFORMATS* '(UNIX :UTF-8))
|
||
so that all files opened (by OPENSTREAM, CL:OPEN, etc.) on the UNIX file device will be initialized with :UTF-8. Note that the UNIX and DSK file devices reference the same files (although some caution is needed because {UNIX} does not simulate Medley versioning), but the device name in a file name ({UNIX}/Users/... vs. {DSK}/Users/...) selects the particular device. The default setting above applies only to files specified with {UNIX}; a separate default entry for DSK must be established to change its default from :XCCS.
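A separate entry for DSK might look like the following. This is a sketch that assumes a DSK entry has the same (device format) shape as the UNIX entry shown above.

```lisp
(* ; "Assumed to mirror the UNIX entry: files opened on {DSK} would then default to :UTF-8")
(PUSH *DEFAULT-EXTERNALFORMATS* '(DSK :UTF-8))
```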
|
||
The user can also specify the external format on a per-stream basis by putting a function on the list STREAM-AFTER-OPEN-FNS. After OPENSTREAM opens a stream and just before it is returned to the calling function, the functions on that list are applied in order to arguments STREAM, ACCESS, PARAMETERS. They can examine and/or change the properties of the stream, in particular, by calling STREAMPROP to change the external format from its device default.
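A minimal sketch of such a function follows. UTF8-FOR-INPUT is a hypothetical name, and the sketch assumes that the list holds function names and that ACCESS is the atom INPUT for streams opened for input.

```lisp
(DEFINEQ (UTF8-FOR-INPUT
  (LAMBDA (STREAM ACCESS PARAMETERS)
    (* ; "Hypothetical after-open function: switches input streams to :UTF-8")
    (COND
      ((EQ ACCESS 'INPUT)
       (STREAMPROP STREAM 'EXTERNALFORMAT :UTF-8))))))

(* ; "Register the function so that it is applied to each newly opened stream")
(PUSH STREAM-AFTER-OPEN-FNS 'UTF8-FOR-INPUT)
```

As written, this overrides any EXTERNALFORMAT that the caller passed explicitly; a more careful version would first inspect PARAMETERS and leave such streams alone.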
|
||
The macro
|
||
(UNICODE.TRANSLATE CODE TRANSLATION-TABLE) [Macro]
|
||
is used by the external formats to perform the mappings described by XCCS-to-Unicode mapping-tables.
|
||
Mapping between Unicode and XCCS character codes
|
||
The XCCS/Unicode mapping tables are defined by the code-mapping files for particular XCCS character sets. These are typically located in the Library> sister directory
|
||
..>Unicode>Xerox>
|
||
and the variable UNICODEDIRECTORIES is initialized with a globally valid reference to that path. The global reference is constructed by prepending the value of the Unix environment-variable "MEDLEYDIR" to the suffix >Unicode>Xerox>.
|
||
The mapping files have conventional names of the form XCCS-[charsetnum]=[charsetname].TXT, for example, XCCS-0=LATIN.TXT, XCCS-357=RSYMBOLS4.TXT. The translations used by the external formats are read from these files by the function
|
||
(READ-UNICODE-MAPPING FILESPEC NOPRINT NOERROR) [Function]
|
||
where FILESPEC can be a list of files, charset octal strings ("0" "357"), or XCCS charset names (LATIN EXTENDED-LATIN GREEK). Reading will be silent if NOPRINT is non-NIL, and if NOERROR is non-NIL the process will not abort when an error occurs. The value is a flat list of the mappings for all the character sets, with elements of the form (XCCS-code Unicode-code).
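For instance, a call along the following lines would read two character sets and then pick out the codes that actually change under translation. MAPPINGS is a hypothetical variable name, and the filter is only an illustration of working with the returned pairs.

```lisp
(* ; "Read charsets 0 and 357 silently (NOPRINT = T) and without aborting on errors (NOERROR = T)")
(SETQ MAPPINGS (READ-UNICODE-MAPPING '("0" "357") T T))

(* ; "Collect only the pairs whose XCCS and Unicode codes differ")
(for PAIR in MAPPINGS
   when (NOT (EQP (CAR PAIR) (CADR PAIR)))
   collect PAIR)
```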
|
||
READ-UNICODE-MAPPING uses READ-UNICODE-MAPPING-FILENAMES to interpret the FILESPEC.
|
||
(READ-UNICODE-MAPPING-FILENAMES FILESPEC DIRS) [Function]
|
||
converts the list of mapping-file specifications into a list of corresponding files in any of the directories in DIRS, defaulting to UNICODEDIRECTORIES. If a file specification is the name of a subdirectory, it expands to the names of all of the mapping files in that subdirectory. Thus JIS will result in a list of all of the JIS>XCCS-*=JIS.TXT files.
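A call of roughly this form would expand a charset name and a subdirectory name together. It is a sketch under the assumption that the two kinds of specification can be mixed in one list; DIRS is omitted so that UNICODEDIRECTORIES is searched.

```lisp
(* ; "Assumed usage: expand a charset name and a subdirectory name into mapping-file names")
(READ-UNICODE-MAPPING-FILENAMES '(LATIN JIS))
```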
|
||
When UNICODE is loaded the mappings for the character sets specified in the variable DEFAULT-XCCS-CHARSETS are installed. This is initialized to
|
||
<EFBFBD><EFBFBD> |