This file describes the UNICODE Lisp Library package.
Contributed by Ron Kaplan, August 2020.
The UNICODE library package defines external file formats that enable Medley to read and write files where 16-bit character codes are represented as UTF8 byte sequences or big-endian UTF16 byte-pairs. It also provides for character codes to be converted (on reading) from Unicode codes to equivalent codes in the Medley-internal Xerox Character Code Standard (XCCS) and (on writing) from XCCS codes to equivalent Unicode codes.
Four external formats are defined when the package is loaded:
:UTF8 codes are represented as UTF8 byte sequences and XCCS/Unicode character
conversion takes place.
:UTF16BE codes are represented as 2-byte pairs, with the high-order byte appearing
first in the file, and characters are converted.
The two other external formats translate byte sequences into codes, but do not translate the codes. These allow Medley to see and process characters in their native encoding.
:UTF8-RAW codes are represented as UTF8 byte sequences, but character conversion
does not take place.
:UTF16BE-RAW codes are represented as big-endian 2-byte pairs but there is no
conversion.
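As a rough illustration of the two byte layouts (not the library's own code), the following Python sketch uses Python's built-in codecs to show how one Unicode code is written as a UTF8 byte sequence and as a big-endian UTF16 byte-pair. The helper name show_bytes is made up for this example, and no XCCS translation is involved, so it corresponds to what the -RAW formats would put in a file.

```python
def show_bytes(code):
    """Return (utf8_bytes, utf16be_bytes) for a single 16-bit Unicode code."""
    ch = chr(code)
    return ch.encode("utf-8"), ch.encode("utf-16-be")


# U+03B1 is GREEK SMALL LETTER ALPHA.
utf8, utf16be = show_bytes(0x03B1)
print(utf8.hex(), utf16be.hex())  # ceb1 03b1
```

The UTF16 pair is simply the code with its high-order byte first, while the UTF8 sequence varies in length with the size of the code.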
These formats all define the end-of-line convention (mostly for writing) for the external files according to the variable EXTERNALEOL (LF, CR, CRLF), with LF the default.
The external format can be specified as a parameter when a stream is opened:
(OPENSTREAM 'foo.txt 'INPUT 'OLD '((EXTERNALFORMAT :UTF8)))
(CL:OPEN 'foo.txt :DIRECTION :INPUT :EXTERNAL-FORMAT :UTF8)
The function STREAMPROP obtains or changes the external format of an open stream:
(STREAMPROP stream 'EXTERNALFORMAT) -> :XCCS
(STREAMPROP stream 'EXTERNALFORMAT :UTF8) -> :XCCS
In the latter case, the stream's format is changed to :UTF8 and the previous value is returned; in this example it is Medley's historical default format, :XCCS.
Entries can be placed on the variable *DEFAULT-EXTERNALFORMATS* to change the external format that is set by default when a file is opened on a particular device. Loading UNICODE executes
(PUSH *DEFAULT-EXTERNALFORMATS* '(UNIX :UTF8))
so that all files opened (by OPENSTREAM, CL:OPEN, etc.) on the UNIX file device will be initialized with :UTF8. Note that the UNIX and DSK file devices reference the same files (although some caution is needed because {UNIX} does not simulate Medley versioning), but the device name in a file name ({UNIX}/Users/... vs. {DSK}/Users/...) selects one or the other. The default setting above applies only to files specified with {UNIX}; a separate default entry for DSK must be established to change its default from :XCCS.
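The per-device default can be pictured as a lookup keyed on the device name in braces. The Python sketch below is only a model of that idea under stated assumptions: the dictionary stands in for *DEFAULT-EXTERNALFORMATS*, the function name default_format is hypothetical, and treating a name without a {device} prefix as falling back to :XCCS is a simplification rather than documented Medley behavior.

```python
import re

# Stand-in for *DEFAULT-EXTERNALFORMATS* after UNICODE has been loaded:
# only the UNIX device has an entry.
default_external_formats = {"UNIX": ":UTF8"}


def default_format(filename, fallback=":XCCS"):
    """Pick the default external format from the {DEVICE} prefix of a file name."""
    match = re.match(r"\{([^}]+)\}", filename)
    device = match.group(1).upper() if match else None
    return default_external_formats.get(device, fallback)


print(default_format("{UNIX}/Users/kaplan/foo.txt"))  # :UTF8
print(default_format("{DSK}/Users/kaplan/foo.txt"))   # :XCCS
```

Both names can refer to the same underlying file, but only the {UNIX} spelling picks up the :UTF8 default until a DSK entry is added.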
The user can also specify the external format on a per-stream basis by putting a function on the list STREAM-AFTER-OPEN-FNS. After OPENSTREAM opens a stream and just before it is returned to the calling function, the functions on that list are applied in order to arguments STREAM, ACCESS, PARAMETERS. They can examine and/or change the properties of the stream, in particular, by calling STREAMPROP to change the external format from its device default.
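The after-open hook mechanism can be modeled in a few lines. This Python sketch is an analogy, not Medley code: the stream is a plain dictionary, open_stream and utf8_for_text_files are invented names, and the rule "files ending in .txt get :UTF8" is just an example policy.

```python
# Stand-in for the list STREAM-AFTER-OPEN-FNS.
stream_after_open_fns = []


def open_stream(name, access, parameters=()):
    """Model of OPENSTREAM: build the stream, then run each hook in order."""
    stream = {"name": name, "EXTERNALFORMAT": ":XCCS"}  # assumed device default
    for fn in stream_after_open_fns:
        fn(stream, access, parameters)
    return stream


def utf8_for_text_files(stream, access, parameters):
    """Example hook: override the default format for .txt files only."""
    if stream["name"].lower().endswith(".txt"):
        stream["EXTERNALFORMAT"] = ":UTF8"


stream_after_open_fns.append(utf8_for_text_files)

print(open_stream("notes.txt", "INPUT")["EXTERNALFORMAT"])  # :UTF8
print(open_stream("prog.lcom", "INPUT")["EXTERNALFORMAT"])  # :XCCS
```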
The XCCS/Unicode mapping tables are defined by the code-mapping files for particular XCCS character sets. These are typically located in the Library sister directory
../Unicode/Xerox/
and the variable UNICODEDIRECTORIES is initialized with a globally valid reference to that path. The global reference is constructed by prepending the value of the Unix environment variable "MEDLEYDIR" to the suffix /Unicode/Xerox/. MEDLEYDIR should be set by the Medley start-up shell script (e.g. /Users/kaplan/local/medley3.5/lispcore/).
The mapping files have conventional names of the form XCCS-<charsetnum>=<charsetname>.TXT, for example, XCCS-0=LATIN.TXT, XCCS-357=SYMBOLS4.TXT. The translations used by the external formats are read from these files by the function
(READ-UNICODE-MAPPING FILESPEC NOPRINT NOERROR)
where FILESPEC can be a list of files, charset octal strings ("0" "357"), or XCCS charset names (LATIN EXTENDED-LATIN). Reading will be silent if NOPRINT is non-NIL, and the process will not abort when an error occurs if NOERROR is non-NIL. The value is a flat list of the mappings for all the character sets, with elements of the form (XCCS-code Unicode-code).
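The naming convention for mapping files can be unpacked mechanically. The Python sketch below is illustrative only: the function name parse_mapping_name is made up, and it assumes that the number in the file name is the charset number written in octal, which is how the "0" and "357" strings above are described.

```python
import re

# XCCS-<charsetnum>=<charsetname>.TXT, with <charsetnum> assumed to be octal.
NAME_RE = re.compile(r"^XCCS-([0-7]+)=([A-Z0-9-]+)\.TXT$")


def parse_mapping_name(filename):
    """Split a mapping file name into (charset number, charset name)."""
    match = NAME_RE.match(filename)
    if match is None:
        raise ValueError("not a conventional mapping file name: " + filename)
    return int(match.group(1), 8), match.group(2)


print(parse_mapping_name("XCCS-0=LATIN.TXT"))       # (0, 'LATIN')
print(parse_mapping_name("XCCS-357=SYMBOLS4.TXT"))  # (239, 'SYMBOLS4')
```

Octal 357 is decimal 239, so the same charset can be named by number string or by name.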
When UNICODE is loaded the mappings for the character sets specified in the variable DEFAULT-XCCS-CHARSETS are installed. This is initialized to
(LATIN SYMBOLS1 SYMBOLS2 EXTENDED-LATIN FORMS SYMBOLS3 SYMBOLS4 ACCENTED-LATIN GREEK)
but DEFAULT-XCCS-CHARSETS can be set to a different collection before UNICODE is loaded.
The internal translation tables used by the external formats are constructed from a list of correspondence pairs by the function
(MAKE-UNICODE-TRANSLATION-TABLES MAPPING [FROM-XCCS-VAR][TO-XCCS-VAR])
This returns a list of two arrays (XCCS-to-Unicode Unicode-to-XCCS) containing the relevant translation information organized for rapid access. If the optional from/to-variable arguments are provided, they are the names of variables whose top-level values will be set to these arrays, for convenience. For the external formats defined above, these variables are *XCCSTOUNICODE* and *UNICODETOXCCS*.
The macro
(UNICODE.TRANSLATE CODE TRANSLATION-TABLE)
is used by the external formats to perform the mappings described by the translation-tables.
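Conceptually, the two tables are inverse lookups built from the same list of pairs. The following Python sketch shows the idea with dictionaries instead of Medley's arrays. The function names are hypothetical, the second sample pair is an invented placeholder rather than a real XCCS assignment, and passing unmapped codes through unchanged is an assumption, since the text above does not say what happens to them.

```python
def make_translation_tables(mapping):
    """Build (xccs_to_unicode, unicode_to_xccs) from (XCCS-code, Unicode-code) pairs."""
    xccs_to_unicode = {}
    unicode_to_xccs = {}
    for xccs_code, unicode_code in mapping:
        xccs_to_unicode[xccs_code] = unicode_code
        unicode_to_xccs[unicode_code] = xccs_code
    return xccs_to_unicode, unicode_to_xccs


def translate(code, table):
    """Look up a code; assumption: codes with no entry are returned unchanged."""
    return table.get(code, code)


# Placeholder pairs for illustration only.
sample_mapping = [(0x41, 0x41), (0x1234, 0x5678)]
to_unicode, to_xccs = make_translation_tables(sample_mapping)

print(hex(translate(0x1234, to_unicode)))  # 0x5678
print(hex(translate(0x5678, to_xccs)))     # 0x1234
```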
The following utilities are provided for lower-level manipulation of codes and strings:
(XTOUCODE XCCSCODE) -> corresponding Unicode
(UTOXCODE UNICODE) -> corresponding XCCS code
(NUTF8CODEBYTES N) -> number of bytes in the UTF8 representation of N
(NUTF8STRINGBYTES STRING RAWFLG) -> number of UTF8 bytes in the UTF8
representation of STRING, translating XCCS to Unicode unless RAWFLG.
(XTOUSTRING XCCSSTRING RAWFLG) -> The string of bytes in the UTF8 representation
of the characters in XCCSSTRING (= the bytes in its UTF8 file encoding)
(HEXSTRING N WIDTH) -> the hex string for N, padded to WIDTH
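The byte counts reported by the UTF8 utilities follow directly from the standard UTF-8 length rules. The Python sketch below restates those rules. It is not the library's implementation, the function names are stand-ins, and it takes codes that are already Unicode (the RAWFLG case), so no XCCS translation is modeled.

```python
def n_utf8_code_bytes(n):
    """Number of bytes in the UTF-8 encoding of code point n."""
    if n < 0x80:
        return 1
    if n < 0x800:
        return 2
    if n < 0x10000:
        return 3
    return 4


def n_utf8_string_bytes(codes):
    """Total UTF-8 length of a sequence of Unicode codes."""
    return sum(n_utf8_code_bytes(code) for code in codes)


print(n_utf8_code_bytes(0x41))                  # 1
print(n_utf8_code_bytes(0x03B1))                # 2
print(n_utf8_code_bytes(0x20AC))                # 3
print(n_utf8_string_bytes([0x41, 0x03B1, 0x20AC]))  # 6
```

Every 16-bit code therefore needs at most three bytes in a UTF8 file.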
The UNICODE file also contains a function for writing a mapping file given a list of mapping pairs. The function
(WRITE-TRANSLATION-TABLE MAPPING [INCLUDEDCHARSETS] [FILE])
produces one or more mapping files for the mapping pairs in MAPPING. If the optional FILE argument is provided, then a single file with that name will be produced and will contain all the mappings for all the character sets in MAPPING. If FILE and INCLUDEDCHARSETS are not provided, then all of the mappings will again go to a single file with a composite name XCCS-csn1,csn2,csn3.TXT. Each csn may be a single charset number or a range of adjacent charset numbers. For example, if the mappings contain entries for characters in charsets LATIN, SYMBOLS1, SYMBOLS2, and SYMBOLS3, the file name will be XCCS-0,41-43.TXT.
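The composite name can be derived by sorting the charset numbers, collapsing adjacent runs into ranges, and printing each number in octal. The Python sketch below reproduces the XCCS-0,41-43.TXT example under the assumption, inferred from that example, that LATIN is charset 0 and SYMBOLS1 through SYMBOLS3 are charsets 41 through 43 octal. The function name composite_name is hypothetical.

```python
def composite_name(charset_numbers):
    """Build XCCS-<csn1>,<csn2>,...TXT with adjacent charsets collapsed to ranges."""
    nums = sorted(set(charset_numbers))
    if not nums:
        raise ValueError("no charsets given")
    runs = []
    start = prev = nums[0]
    for n in nums[1:]:
        if n == prev + 1:      # extends the current run
            prev = n
            continue
        runs.append((start, prev))
        start = prev = n
    runs.append((start, prev))
    parts = [f"{a:o}" if a == b else f"{a:o}-{b:o}" for a, b in runs]
    return "XCCS-" + ",".join(parts) + ".TXT"


print(composite_name([0o0, 0o41, 0o42, 0o43]))  # XCCS-0,41-43.TXT
```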
If INCLUDEDCHARSETS is provided, it specifies the character sets (possibly only a subset of those in MAPPING) for which files should be produced. This provides an implicit subsetting capability.
Finally, if FILE is not provided and INCLUDEDCHARSETS is T, then a separate file will be produced for each of the character sets, essentially a way of splitting a collection of character-set mappings into separate canonically named files (e.g. XCCS-357=SYMBOLS4.TXT).