* Updated all ISO8859 and Vendor Unicode mappings from current Unicode.org information * John Cowan says that East Asia is obsolete * Delete obsolete file
184 lines
7.2 KiB
Plaintext
184 lines
7.2 KiB
Plaintext
The following files describe windows code page behavior for the "ansi" code pages provided by Microsoft.
|
||
|
||
File Code Page Description
|
||
|
||
bestfit874.txt windows-874 ANSI/OEM Thai (same as 28605, ISO 8859-15); Thai (Windows)
|
||
bestfit932.txt shift_jis ANSI/OEM Japanese; Japanese (Shift-JIS)
|
||
bestfit936.txt gb2312 ANSI/OEM Simplified Chinese (PRC, Singapore); Chinese Simplified (GB2312)
|
||
bestfit949.txt ks_c_5601-1987 ANSI/OEM Korean (Unified Hangul Code)
|
||
bestfit950.txt big5 ANSI/OEM Traditional Chinese (Taiwan; Hong Kong SAR, PRC); Chinese Traditional (Big5)
|
||
bestfit1250.txt windows-1250 ANSI Central European; Central European (Windows)
|
||
bestfit1251.txt windows-1251 ANSI Cyrillic; Cyrillic (Windows)
|
||
bestfit1252.txt windows-1252 ANSI Latin 1; Western European (Windows)
|
||
bestfit1253.txt windows-1253 ANSI Greek; Greek (Windows)
|
||
bestfit1254.txt windows-1254 ANSI Turkish; Turkish (Windows)
|
||
bestfit1255.txt windows-1255 ANSI Hebrew; Hebrew (Windows)
|
||
bestfit1256.txt windows-1256 ANSI Arabic; Arabic (Windows)
|
||
bestfit1257.txt windows-1257 ANSI Baltic; Baltic (Windows)
|
||
bestfit1258.txt windows-1258 ANSI/OEM Vietnamese; Vietnamese (Windows)
|
||
|
||
These tables include "best fit" behavior which is not present in the other files. Examples of best fit
|
||
are converting fullwidth letters to their counterparts when converting to single byte code pages, and
|
||
mapping the Infinity character to the number 8.
|
||
|
||
932, 936, 949 and 950 are all double byte code pages. The remainder are single byte code pages. Each file
|
||
is encoded in the code page it describes, eg: bestfit1252.txt is encoded in the windows-1252 encoding. The only
|
||
non-ASCII characters however are in the comments so these files may be read by an ASCII parser if necessary.
|
||
|
||
Each file has sections of key word tags and records. Any text after a ; is ignored as are blank lines. Fields are
|
||
delimited by one or more space or tab characters. Each section begins one of the following tags:
|
||
|
||
CODEPAGE
|
||
CPINFO
|
||
MBTABLE
|
||
WCTABLE
|
||
DBCSRANGE (double byte code pages only)
|
||
DBSCTABLE (double byte code pages only)
|
||
|
||
Descriptions of each tag are:
|
||
|
||
CODEPAGE 932 ; Japanese - ANSI, OEM
|
||
|
||
The CODEPAGE tag contains 1 field and marks the start of the code page file.
|
||
|
||
Field 1 -- The only field is the decimal windows code page number for this code page.
|
||
|
||
CPINFO 2 0x3f 0x30fb ; DBCS CP, Unic Default Char = Katakana Middle Dot
|
||
|
||
The CPINFO tag describes the code page with 3 fields:
|
||
|
||
Field 1 -- "1" for a single byte code page, "2" for a double byte code page.
|
||
Field 2 -- Replacement characters for unassigned Unicode code points when written to this
|
||
code page (currently always ?)
|
||
Field 3 -- Replacement characters for illegal or unassigned code page values when converting to Unicode.
|
||
This is Katakana middle dot for 932 and ? for all other code pages.
|
||
|
||
MBTABLE 256
|
||
|
||
The MBTABLE tag marks the start of the "Multibyte" code page to Unicode conversion table. It has 1 field.
|
||
|
||
Field 1 -- This field contains the number of following records of code page to Unicode mappings. Note that
|
||
lead bytes dont have mappings, so this is not always 256. For 932 for example it is 196.
|
||
|
||
MultiByte Mapping Records:
|
||
|
||
0x00 0x0000 ;Null
|
||
0x01 0x0001 ;Start Of Heading
|
||
...
|
||
0x30 0x0030 ;Digit 0
|
||
0x31 0x0031 ;Digit 1
|
||
...
|
||
|
||
Each record consists of two fields to map from the code page to Unicode.
|
||
|
||
Field 1 -- The code page byte that is being mapped to Unicode, eg "0x3f"
|
||
Field 2 -- The Unicode UTF-16 code point that this byte maps to, eg "0x003f"
|
||
|
||
DBCSRANGE 2 ;2 DBCS Lead Byte Ranges: 0x81-0x9f and 0xe0-0xfc
|
||
|
||
The DBSCRANGE describes the number of double byte ranges for double byte code page. Ranges are consecutive
|
||
lead byte values such as 0x81-0x9f
|
||
|
||
Field 1 -- This field contains the number of double byte ranges. The next record is the 1st lead byte range
|
||
|
||
0x81 0x9f ;Lead Byte Range
|
||
|
||
This record describes the first lead byte range. It is the first record after DBCSRANGE and is followed by one
|
||
DBCSTABLE record for each lead byte in the range. If there are additional ranges, another Lead Byte Range record
|
||
will follow the last DBCSTABLE in the previous range.
|
||
|
||
Field 1 -- This field is the first lead byte used in this range.
|
||
|
||
Field 2 -- This field is the last lead byte used in this range
|
||
|
||
DBCSTABLE 147 ;LeadByte = 0x81
|
||
|
||
The DBCSTABLE record describes the mappings available for a particular lead byte. The comment is ignored but
|
||
descriptive. The lead byte of the first DBCSTABLE is the first lead byte of the previous Lead Byte Range
|
||
record. Each subsequent DBCSTABLE is for the next consecutive lead byte value.
|
||
|
||
Field 1 -- This field is the number of trail byte mappings following.
|
||
|
||
Double byte mapping records:
|
||
|
||
0x40 0x3000 ; Ideographic Space
|
||
0x41 0x3001 ; Ideographic Comma
|
||
...
|
||
|
||
Field 1 -- This field is the trail byte to map from.
|
||
|
||
Field 2 -- This field is the Unicode UTF-16 code point that this lead byte/trail byte combination map to.
|
||
|
||
Example:
|
||
|
||
DBCSRANGE 2 ;2 DBCS Lead Byte Ranges: 0x81-0x9f and 0xe0-0xfc
|
||
|
||
0x81 0x9f ;Lead Byte Range
|
||
|
||
DBCSTABLE 147 ;LeadByte = 0x81
|
||
|
||
0x40 0x3000 ; Ideographic Space
|
||
0x41 0x3001 ; Ideographic Comma
|
||
...
|
||
|
||
The preceeding example would map the byte sequences 0x81 0x40 to U+3000 and 0x81 0x41 to U+3001.
|
||
|
||
|
||
WCTABLE 698
|
||
|
||
The WCTABLE tag marks the start of the Unicode UTF-16 (WideChar) to "MultiByte" bytes. It has 1 field.
|
||
|
||
Field 1 -- This field contains the number of records of Unicode to byte mappings. Note that this is often
|
||
more than the number of round trip mappings supported by the code page due to windows "Best Fit" behavior.
|
||
|
||
Unicode UTF-16 (WideChar) Mapping Records:
|
||
|
||
These take two forms, differing between single byte and double byte code pages. Both forms have 2 fields:
|
||
|
||
Single byte WCTABLE records:
|
||
|
||
0x0000 0x00 ;Null
|
||
0x0001 0x01 ;Start Of Heading
|
||
...
|
||
0x0061 0x61 ;Latin Small Letter A
|
||
0x0062 0x62 ;Latin Small Letter B
|
||
0x0063 0x63 ;Latin Small Letter C
|
||
...
|
||
0x221e 0x38 ;Infinity << Best Fit Mapping
|
||
...
|
||
0xff41 0x61 ;Fullwidth Latin Small Letter A << Best Fit Mapping
|
||
0xff42 0x62 ;Fullwidth Latin Small Letter B << Best Fit Mapping
|
||
0xff43 0x63 ;Fullwidth Latin Small Letter C << Best Fit Mapping
|
||
...
|
||
|
||
Field 1 -- The Unicode UTF-16 code point for the character being converted.
|
||
|
||
Field 2 -- The single byte that this UTF-16 code point maps to. If a reverse mapping does not in the MBTABLE,
|
||
then this is a Best Fit mapping.
|
||
|
||
Multibyte WCTABLE records:
|
||
|
||
0x0000 0x0000 ; Null
|
||
0x0001 0x0001 ; Start Of Heading
|
||
...
|
||
0x0061 0x0061 ; a
|
||
0x0062 0x0062 ; b
|
||
0x0063 0x0063 ; c
|
||
...
|
||
0x221e 0x8187 ; Infinity
|
||
...
|
||
0xff41 0x8281 ; Fullwidth a
|
||
0xff42 0x8282 ; Fullwidth b
|
||
0xff43 0x8283 ; Fullwidth c
|
||
...
|
||
|
||
Field 1 -- The Unicode UTF-16 code point for the character being converted.
|
||
|
||
Field 2 -- The byte or bytes that this code point maps to as a 16 bit value. The high byte is the lead byte,
|
||
and the low byte is the trail byte. If the high byte is 0, then this is a single byte code point, with
|
||
the value of the low byte and no lead byte is emitted.
|
||
|
||
ENDCODEPAGE
|
||
|
||
This tag marks the end of the code page data. Anything after this marker is ignored.
|