Forth 200x: The optional Extended-Character word set

This word set deals with variable-width character encodings. It also works with fixed-width encodings.

Since the standard specifies ASCII encoding for characters, only ASCII-compatible encodings may be used. Because ASCII compatibility has so many benefits, most encodings actually are ASCII compatible. The characters beyond the ASCII encoding are called "extended characters" (xchars).

All words dealing with strings shall handle xchars when the xchar word set is present. This includes dictionary definitions. White space parsing does not have to treat code points greater than $20 as white space.

18.2 Additional terms and notation

18.2.1 Definition of Terms

18.2.2 Parsed-text notation

18.3 Additional usage requirements

18.3.1 Data types

18.3.1.1 Extended Characters

18.3.2 Environmental queries

Table 18.3: Environmental Query Strings


String	Value data type	Constant?	Meaning

`XCHAR-ENCODING`	c-addr u	no	Returns a printable ASCII string that represents the encoding, using the preferred MIME name (if any) or the name in the IANA character-set register^[1] (RFC-1700), such as "`ISO-LATIN-1`" or "`UTF-8`", with the exception of "`ASCII`", where the alias "`ASCII`" is preferred.
`MAX-XCHAR`	u	no	Maximal value for xchar
`XCHAR-MAXMEM`	u	no	Maximal memory consumed by an xchar in address units


^[1] http://www.iana.org/assignments/character-sets
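As an illustration of the queries in Table 18.3, the following sketch prints whatever the system reports for them. It assumes only the standard word ENVIRONMENT?; the name `.xchar-env` is hypothetical, and a system is free to answer any of these queries with false.

```forth
\ Hypothetical helper: print the xchar-related environmental queries.
\ ENVIRONMENT? returns false if the system does not know the query.
: .xchar-env ( -- )
  s" XCHAR-ENCODING" environment? if
    ." encoding: " type cr
  else
    ." XCHAR-ENCODING: query not recognized" cr
  then
  s" MAX-XCHAR" environment? if
    ." maximal xchar value: " u. cr
  then
  s" XCHAR-MAXMEM" environment? if
    ." maximal address units per xchar: " u. cr
  then ;

.xchar-env
```

A program can use the result of `XCHAR-MAXMEM` to size a buffer that must hold one xchar in the worst case.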

18.3.3 Common encodings

18.3.4 The Forth text interpreter

18.3.5 Input and Output

IO words such as KEY, EMIT, TYPE, READ-FILE, READ-LINE, WRITE-FILE, and WRITE-LINE operate on pchars. Therefore, it is possible that these words read or write incomplete xchars, which are completed in the next consecutive operation(s). The IO system shall combine these pchars into a complete xchar on output, or split an xchar into pchars on input, and shall not throw a "malformed xchar" exception when the combination of these pchars forms a valid xchar. -TRAILING-GARBAGE can be used to process an incomplete xchar at the end of such an IO operation. ACCEPT, as an input editor, may be aware of xchars in order to provide conveniences such as backspace or cursor movement.
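The use of -TRAILING-GARBAGE after a pchar-oriented read can be sketched as follows. The word `split-xchars` is a hypothetical name, not part of the word set; it assumes that -TRAILING-GARBAGE ( xc-addr u1 -- xc-addr u2 ) from the Extended-Character Extensions word set and /STRING from the String word set are available.

```forth
\ Hypothetical helper: split a freshly read buffer into the part that
\ consists of complete xchars and the leftover pchars of an incomplete
\ xchar at the end (u3 is 0 if the buffer ends on an xchar boundary).
: split-xchars ( xc-addr u1 -- xc-addr u2 c-addr u3 )
  2dup -trailing-garbage   ( xc-addr u1 xc-addr u2 )
  2swap 2 pick /string ;   ( xc-addr u2 c-addr u3 )
```

After a READ-FILE, a program could process the string xc-addr u2 first, then MOVE the u3 leftover pchars to the start of its buffer and let the next read append to them, so that the incomplete xchar is completed by the following operation.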

18.4 Additional documentation requirements

18.4.1 System documentation

18.4.1.1 Implementation-defined options

Since Unicode input and display pose a number of challenges, such as input method editors for different languages, left-to-right and right-to-left writing, and fonts that contain only a subset of Unicode glyphs, systems should document their capabilities. File IO and in-memory string handling should work transparently with xchars.

18.4.1.2 Ambiguous conditions

18.4.1.3 Other system documentation

18.4.2 Program documentation

18.5 Compliance and labeling

18.5.1 Forth-2012 systems

The phrase "Providing name(s) from the Extended-Character Extensions word set" shall be appended to the label of any Standard System that provides portions of the Extended-Character Extensions word set.

The phrase "Providing the Extended-Character Extensions word set" shall be appended to the label of any Standard System that provides all of the Extended-Character and Extended-Character Extensions word sets.

18.5.2 Forth-2012 programs

The phrase "Requiring name(s) from the Extended-Character Extensions word set" shall be appended to the label of Standard Programs that require the system to provide portions of the Extended-Character Extensions word set.

The phrase "Requiring the Extended-Character Extensions word set" shall be appended to the label of Standard Programs that require the system to provide all of the Extended-Character and Extended-Character Extensions word sets.

18.6 Glossary

18.6.1 Extended-Character words

( xc-addr u₁ -- u₂ )

u₂ is the number of pchars used to encode the first xchar stored in the string xc-addr u₁. To calculate the size of the xchar, only the bytes inside the buffer may be accessed. An ambiguous condition exists if the xchar is incomplete or malformed.
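For a UTF-8 system, the size computation described above might look like the following sketch. The names `utf8-size`, `utf8-count` and `sample` are hypothetical and chosen for illustration only; the code assumes that one pchar occupies one byte and that u₁ is at least 1, and it does not detect malformed or incomplete xchars (an ambiguous condition).

```forth
\ Hypothetical UTF-8 sketch: size in bytes of the xchar at xc-addr.
\ Only the first byte of the buffer is read. A stray continuation
\ byte ($80..$BF) is not detected and is reported as size 2.
: utf8-size ( xc-addr u1 -- u2 )
  drop c@
  dup $80 u< if drop 1 exit then   \ 0xxxxxxx: ASCII
  dup $E0 u< if drop 2 exit then   \ 110xxxxx: two-byte sequence
  dup $F0 u< if drop 3 exit then   \ 1110xxxx: three-byte sequence
  drop 4 ;                         \ 11110xxx: four-byte sequence

\ Count the xchars in a string by stepping from one xchar to the next.
: utf8-count ( xc-addr u -- n )
  0 >r
  begin dup 0> while
    2dup utf8-size /string         \ skip the first xchar
    r> 1+ >r
  repeat
  2drop r> ;

\ "A", U+00E9 and U+20AC encoded in UTF-8: 1 + 2 + 3 = 6 bytes.
create sample  $41 c,  $C3 c, $A9 c,  $E2 c, $82 c, $AC c,

sample 6 utf8-count .   \ prints 3
```

Stepping with the size rather than with CHAR+ is what lets string-walking code work for both fixed-width and variable-width encodings.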


Abbreviation	Description

<xchar>	the delimiting extended character


Symbol	Data type	Size on stack

pchar	primitive character	1 cell
xchar	extended character	1 cell
xc-addr	xchar-aligned address	1 cell