
30  Character Composition in Exported Font Metrics

The font metric exporter attempts to make exact matches between input and output encodings where possible. For example, the character ``A'' is usually available in both input and output fonts, and so the virtual font usually contains a simple re-mapping between the associated codes.

A more difficult case is where the exporter must compose characters which are not available in the input fonts. For example, if the output font contains an ``A with umlaut accent'' (Ä) character, the input fonts may not contain the accented character, although they may contain both an ``A'' and an ``umlaut accent'' (\"). In this case the exported virtual font should contain a virtual character packet that sets the letter A with a superimposed umlaut accent.

Composing characters requires that the exporter have some geometric sophistication, since simply overlaying accents often does not yield a visually pleasing result. Typically the bounding boxes of the marks must be aligned according to rules specific to the letter and accent, and correction for italic slant must also be taken into account.
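As a rough illustration of the geometry involved, the following Python sketch computes where to shift an accent so that it sits over a base letter. The centering rule, the gap parameter, and the names BBox and accent_offset are assumptions made for this illustration only; they are not the rules used by the exporter. The slant correction assumes an italic angle given in degrees counter-clockwise, so that a right-leaning italic font has a negative angle.

```python
import math
from typing import NamedTuple, Tuple


class BBox(NamedTuple):
    """Glyph bounding box: lower-left and upper-right corners."""
    llx: float
    lly: float
    urx: float
    ury: float


def accent_offset(base: BBox, accent: BBox,
                  italic_angle_deg: float = 0.0,
                  gap: float = 0.0) -> Tuple[float, float]:
    """Return the (dx, dy) shift that places `accent` over `base`.

    Assumed rule (hypothetical): centre the accent box horizontally over
    the base box, raise it so its bottom edge sits `gap` above the top of
    the base, then slide it sideways to follow the italic slant.
    """
    # Vertical shift: move the accent's bottom edge to the base's top edge.
    dy = (base.ury + gap) - accent.lly
    # Horizontal shift: align the horizontal centres of the two boxes.
    base_centre = (base.llx + base.urx) / 2.0
    accent_centre = (accent.llx + accent.urx) / 2.0
    dx = base_centre - accent_centre
    # Slant correction: a negative (right-leaning) angle moves a raised
    # accent to the right in proportion to how far it was raised.
    dx += -math.tan(math.radians(italic_angle_deg)) * dy
    return dx, dy
```

For an upright font the result is a plain centering shift; for an italic font the same accent lands further to the right the higher it is raised, which is why a single fixed overlay rarely looks right.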

Some compositions do not even fit the letter-plus-accent model. For example, we may wish to simulate a ligature ``fi'' by tucking a dotless ``i'' closely against an ``f''.

In order to provide a general and extensible method for composing virtual TEX characters from deficient input fonts, the TRUETEX metric exporter contains an interpreter for a character composition language. This language resembles a subset of PostScript, with similar operators and stack-oriented semantics.

TRUETEX includes a base composition script that does a fair job of composing characters used by TEX, and this is the default script used to export font metrics (you can write your own scripts to cover new cases or if you want to fine-tune the placement of composed character elements). The base script attempts to compose output characters from input characters in four ways:

When you direct TRUETEX to export font metrics, it will call upon a set of scripts in the character composition language to compose output characters which have no exact matches in the input encodings. These scripts are ASCII files containing the PostScript-like language. TRUETEX runs the script ``compose.ps'' in the font-encoding-files path preference item. If for some reason you need to disable the composition script, you can rename or move this file so that TRUETEX cannot find it; then TRUETEX will produce a valid metric export file, although you will get a message that TRUETEX could not find the composition script, compose.ps.

30.1  Composition Language for Experts

If you are a real expert in TEX and PostScript, you can modify compose.ps for special purposes.

The language syntax, data types, and operators all closely follow a subset of PostScript. We will describe what this subset is, and then refer you to a PostScript language reference for details on the language.

Remember to exit and restart TRUETEX if you make changes to the composition script, since TRUETEX only reads compose.ps once per instance, when it first exports a metric file.

Here are the pre-defined PostScript-like commands available in the composition language:

add and array begin clear cos count currentdict cvi cvn cvr cvs cvx def dictstack dict div dup end eq exch(f) exec exit false(f) forall getinterval get ge gt idiv ifelse if(f) index known length le lt mod mul neg not or(f) pop put readonly roll round sin stack store(f) string sub tan(f) true(f) vpl_literal vpl_set vpl_wd_ht_dp_ic where

(The commands marked ``f'' are actually macros defined at the beginning of the script compose.ps.)

The built-in commands above have the same effect as their PostScript counterparts. A few commands are specific to the metric exporter, and have the following function:

string     vpl_literal     - Emits literal property-list text for string, indented, followed by new line.
wd ht dp ic     vpl_wd_ht_dp_ic     - Sets width, height, depth, and italic correction of character to be output.
x y code     vpl_set     - Emits property-list text to set input character code preceded by relative motion of (x,y).

Since the task at hand is limited, the non-relevant portions of full PostScript are not implemented, and the function of some operators is restricted.

The file scanner is very simple and requires that all tokens be separated by white space; for example, procedure delimiters (curly braces) must be separated by white space from the adjacent items. Integers must start with a digit (subtract from zero to get a negative value), floating point numbers must start with + or -, and literal names must start with a slash; anything else is treated as a name. Literal names can be decimal number strings, like ``/48'', as we illustrate below (in standard PostScript a number cannot be used as a literal name).

The data types supported are: integer, Boolean (as integer), double, name, procedure, dictionary. There is no separate Boolean type as in PostScript. Instead, integers double as logical (Boolean) values, as in the C language: integer 0 is false, anything else is true. There are also no bitwise operators. PostScript overloads operators such as ``and'' to work on both integers and Booleans, and keeps the two types separate; because this language overloads the integer type instead, bitwise operations would require separate operators.

Certain polymorphisms and coercions provided by PostScript are not supported, notably: dictionaries can only use names as keys, and operators like roll that need integer arguments will not coerce floats to integers.
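The scanner rules above can be modelled in a few lines of Python. This is only a sketch of the stated rules, not the scanner in TRUETEX: the function names are hypothetical, tokens are classified by their leading character alone, treating a lone curly brace as its own token class is an assumption, and strings and comments are left out entirely.

```python
from typing import List, Tuple


def classify_token(token: str) -> str:
    """Classify one whitespace-delimited token by its leading character."""
    if not token:
        raise ValueError("empty token")
    if token in ("{", "}"):
        # Assumption: a brace standing alone is a procedure delimiter.
        return "procedure-delimiter"
    first = token[0]
    if first in "0123456789":
        return "integer"
    if first in "+-":
        return "float"
    if first == "/":
        return "literal-name"
    # Anything else is treated as a name.
    return "name"


def scan(source: str) -> List[Tuple[str, str]]:
    """Split on white space only, then classify every token."""
    return [(tok, classify_token(tok)) for tok in source.split()]
```

Scanning the text "/48 { 0 5 sub +1.5 mul }" yields a literal name, a procedure delimiter, two integers, a name, a float, another name, and a closing delimiter. Scanning "{dup" yields a single name, since the brace is not separated from its neighbor; this is the kind of mistake the white-space rule is meant to warn about.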

The systemdict, concomitant with the Boolean overloading of integers, contains definitions of ``true'' as 1 and ``false'' as 0.

The current implementation does not implement the following data types and operators: bitwise arithmetic, marks, random number functions, packed arrays (unpacked arrays are implemented), virtual memory, files, resources, the ``miscellaneous'' category of operators in the Adobe PostScript reference, graphics-related operators (graphics state, coordinate system, matrix, path construction, painting, insideness, forms, patterns, device setup and output), character-and-font operators, interpreter parameters, and Display PostScript. The interpreter maintains separate operand, dictionary, execution, and VM stacks as in PostScript.

The TRUETEX metric-exporter is connected to the interpreter such that five phases of execution occur: (1) the exporter exec's the interpreter's start() function (which is analogous to the PostScript start command, but not accessible to the composition script), (2) the exporter loads the composition macro file compose.ps, (3) the exporter does a begin followed by the operators and data to create the font information in the interpreter environment, (4) for each ``composable'' character, the exporter executes the definition of Compose in the interpreter, pushing the output character name as an argument, and (5) the exporter exec's an end to clear the font information loaded in (3).

The exporter does phases (1) and (2) only once per invocation of the previewer. It does phases (3) through (5) for each font exported. The exporter allocates and frees VM between phases (3) and (5), so that memory is reused between compositions. Exporting a font requires several hundred KB of free memory.
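The ordering of the five phases can be pictured with a small Python model. The class and method names here are hypothetical, and the log strings merely stand in for the real work; the sketch shows only which phases run once and which run for every exported font.

```python
from typing import Iterable, List


class ExporterSession:
    """Toy model of the phase ordering; it performs no real export."""

    def __init__(self) -> None:
        self.log: List[str] = []
        self._initialised = False

    def _ensure_initialised(self) -> None:
        # Phases (1) and (2): run once, on the first export only.
        if not self._initialised:
            self.log.append("start")
            self.log.append("load compose.ps")
            self._initialised = True

    def export_font(self, font_name: str,
                    composable_chars: Iterable[str]) -> None:
        self._ensure_initialised()
        # Phase (3): begin, then load the font information.
        self.log.append(f"begin {font_name}")
        # Phase (4): run Compose once per composable character.
        for char_name in composable_chars:
            self.log.append(f"Compose {char_name}")
        # Phase (5): end, discarding the font information.
        self.log.append(f"end {font_name}")
```

Exporting two fonts in one session logs "start" and "load compose.ps" a single time, followed by one begin/Compose/end group per font. This is also why an edited compose.ps is not picked up until you restart.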

A ``composable'' character is an output character which has no exact match in any of the input fonts' encodings.
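In Python terms, the set of composable characters could be found as sketched below. We assume here that an ``exact match'' means the same character name appearing in some input encoding; the function name and the sample character names are for illustration only.

```python
from typing import Iterable, List


def composable_characters(output_names: Iterable[str],
                          input_encodings: Iterable[Iterable[str]]) -> List[str]:
    """Return the output character names found in no input encoding."""
    available = set()
    for names in input_encodings:
        available.update(names)
    # Keep the output encoding's order for the characters left over.
    return [name for name in output_names if name not in available]
```

With output names A, Adieresis, and dieresis, and two input encodings that together supply A, B, and dieresis, only Adieresis is left over, and so only Adieresis would be passed to Compose.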

In phase (4), the Compose definition can rely on the following two items being defined:

InputFont Dictionary of input font information
OutputFont Dictionary of output font information

Each of these two dictionaries in turn contains the following item(s):

CharCodes (Release 4.0N and later) A dictionary giving, for each character name, the code integer. For example, for the digit zero character at code position 0x30 (decimal 48), CharCodes contains an entry keyed by the name /zero with value 48.
CharNames (Release 4.0N and later) A dictionary giving, for each code position (as a name consisting of the decimal representation of the code), the character name. For example, for the digit zero character at code position decimal 48, CharNames contains an entry keyed by the name /48 with value ``/zero''.
    You can see that CharCodes and CharNames are inverse dictionaries. The size of these dictionaries is the count of the encoded characters in the relevant encoding, typically 128 or 256 for TEX encodings, and anywhere from 215 (ANSI) to 400-1700 entries for Windows Unicode fonts.
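The inverse relationship can be sketched in Python, with ordinary dictionaries standing in for the interpreter's dictionaries. The function name and the tiny sample encoding are assumptions; note that the keys of the second table are decimal strings, mirroring literal names such as /48.

```python
from typing import Dict, Tuple


def build_tables(encoding: Dict[int, str]) -> Tuple[Dict[str, int], Dict[str, str]]:
    """Build CharCodes-like and CharNames-like tables from code -> name."""
    # Character name -> integer code.
    char_codes = {name: code for code, name in encoding.items()}
    # Decimal code string -> character name.
    char_names = {str(code): name for code, name in encoding.items()}
    return char_codes, char_names


sample_encoding = {48: "zero", 49: "one", 65: "A"}
char_codes, char_names = build_tables(sample_encoding)
```

Looking up "zero" in char_codes gives 48, and looking up "48" in char_names gives "zero". The two tables are exact inverses only as long as no character name occupies more than one code position in the encoding.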
Encoding (Release 4.0M and prior only) An array giving character names for code numbers. The length of this array gives the encoding size, which is CHARS_IN_FONT, typically 256 or 128. (This table is now obsoleted by CharCodes and CharNames.)

These tables were upgraded beginning with release 4.0N. The search, sort, and hash internals for dictionaries and encoding tables were also completely reworked, which makes the composition interpreter run much faster.

Furthermore, the input font dictionary also contains:

TeX_Metrics     Dictionary yielding for each character name an array:
    [ width     TFM width of character (integer)
      height    Likewise height
      depth     Likewise depth
      ic        Likewise italic correction
      llx       Lower-left X of glyph bounding box (integer)
      lly       Likewise Y
      urx       Upper-right X of glyph bounding box
      ury       Likewise Y
    ]
ItalicAngle     Angle in degrees counter-clockwise of italic slant (float)
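To make the layout of a TeX_Metrics entry concrete, here is a Python sketch that unpacks one eight-element array into named fields, in the order listed above. The type and function names are hypothetical, and the sample numbers are invented purely to show the field order; they are not metrics of any real font.

```python
from typing import NamedTuple, Sequence


class TeXMetrics(NamedTuple):
    """One TeX_Metrics entry, in the documented field order."""
    width: int
    height: int
    depth: int
    ic: int
    llx: int
    lly: int
    urx: int
    ury: int


def unpack_metrics(entry: Sequence[int]) -> TeXMetrics:
    """Turn an eight-element metrics array into named fields."""
    if len(entry) != 8:
        raise ValueError("a TeX_Metrics entry has exactly eight elements")
    return TeXMetrics(*entry)


# Invented sample values, for illustration only.
sample = unpack_metrics([722, 683, 0, 0, 15, 0, 706, 674])
```

With the fields named, a composition rule can refer to the top of a glyph's bounding box as sample.ury and to its advance width as sample.width, rather than to array positions 7 and 0.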

