14 Reading formulas in external form

14. Reading formulas in external form

Reading

A formula can be produced from an external representation, for example a string, using the pt function. It has one argument, a string denoting a formula, that is converted to the internal representation of the formula. For the following syntactical entities parsing functions are provided:

	(py string)		for parsing types
	(pv string)		for parsing variables
	(pt string)		for parsing terms
	(pf string)		for parsing formulas

The conversion occurs in two steps: lexical analysis and parsing.

14.1. Lexical analysis. In this stage the string is brocken into short sequences, called tokens.

A token can be one of the following:

An alphabetic symbol: A sequence of letters a–z and A–Z. Upper and lower case letters are considered different.
A number: A sequence of digits 0–9
A punctuation mark: One of the characters: ( ) [ ] . , ;
A special symbol: A sequence of characters, that are neither letters, digits, punctuation marks nor white space.

For example: abc, ABC and A are alphabetic symbols, 123, 0123 and 7 are numbers, ( is a punctuation mark, and <=, +, and #:-ˆ are special symbols.

Tokens are always character sequences of maximal length belonging to one of the above categories. Therefore fx is a single alphabetic symbol not two and likewise <+ is a single special symbol. The sequence alpha<=(-x+z), however, consists of the 8 tokens alpha, <=, (, -, x, +, z, and ). Note that the special symbols <= and - are separated by a punctuation mark, and the alphabetic symbols x and z are separated by the special symbol +.

If two alphabetic symbols, two special symbols, or two numbers follow each other they need to be separated by white space (spaces, newlines, tabs, formfeeds, etc.). Except for a few situations mentioned below, whitespace has no signiﬁcance other than separating tokens. It can be inserted and removed between any two tokens without affecting the signiﬁcance of the string.

Every token has a token type, and a value. The token type is one of the following: number, var-index, var-name, const, pvar-name, predconst, type-symbol, pscheme-symbol, postﬁx-op, preﬁx-op, binding-op, add-op, mul-op, rel-op, and-op, or-op, imp-op, pair-op, if-op, postﬁx-jct, preﬁx-jct, and-jct, or-jct, tensor-jct, imp-jct, quantor, dot, hat, underscore, comma, semicolon, arrow, lpar, rpar, lbracket, rbracket.

The possible values for a token depend on the token type and are explained below.

New tokens can be added using the function

(add -token string token -type value).

The inverse is the function

(remove -token string).

A list of all currently deﬁned tokens sorted by token types can be obtained by the function

(display -tokens).

14.2. Parsing. The second stage, parsing, extracts structure form the sequence of tokens.

Types. Type-symbols are types; the value of a type-symbol must be a type. If σ and τ are types, then σ;τ is a type (pair type) and σ=>τ is a type (function type). Parentheses can be used to indicate proper nesting. For example boole is a predeﬁned type-symbol and hence, (boole;boole)=>boole is again a type. The parentheses in this case are not strictly necessary, since ; binds stronger than =>. Both operators associate to the right.

Variables. Var-names are variables; the value of a var-name token must be a pair consisting of the type and the name of the variable (the same name string again! This is not nice and may be later, we ﬁnd a way to give the parser access to the string that is already implicit in the token). For example to add a new boolean variable called “ﬂag”, you have to invoke the function (add-token "flag" ’var-name (cons ’boole "flag")). This will enable the parser to recognize “flag3”, “flagˆ”, or “flagˆ14” as well.

Further, types, as deﬁned above, can be used to construct variables.

A variable given by a name or a type can be further modiﬁed. If it is followed by a ˆ, a partial variable is constructed. Instead of the ˆ a _ can be used to specify a total variable.

Total variables are the default and therefore, the _ can be omitted.

As another modiﬁer, a number can immediately follow, with no whitespace in between, the ˆ or the _, specifying a speciﬁc variable index.

In the case of indexed total variables given by a variable name or a type symbol, again the _ can be omitted. The number must then follow, with no whitespace in between, directly after the variable name or the type.

Note: This is the only place where whitespace is of any signiﬁcance in the input. If the ˆ, _, type name or variable name is separated from the following number by whitespace, this number is no longer considered to be an index for that variable but a numeric term in its own right.

For example, assuming that p is declared as a variable of type boole, we have:

p a total variable of type boole with name p and no index.
p_ a total variable of type boole with name p and no index.
pˆ a partial variable of type boole with name p and no index.
p2 a total variable of type boole with name p and index 2.
p_2 a total variable of type boole with name p and index 2.
pˆ2 a partial variable of type boole with name p and index 2.
boole a total anonymous variable of type boole with no index.
boole_ a total anonymous variable of type boole with no index.
booleˆ a partial anonymous variable of type boole with no index.
boole_2 a total anonymous variable of type boole with index 2.
boole2 a total anonymous variable of type boole with index 2.
booleˆ2 a partial anonymous variable of type boole with index 2.
boole 2 a total anonymous variable of type boole applied to the numeric term 2.
(boole)2 a total anonymous variable of type boole applied to the numeric term 2.
(boole)_2 a total anonymous variable of type boole with index 2.
boole=>booleˆ2 a partial anonymous variable of type function of boole to boole with index 2.

Terms are built from atomic terms using application and operators.

An atomic term is one of the following: a constant, a variable, a number, a conditional, or any other term enclosed in parentheses.

Constants have const as token type, and the respective term in internal form as value. There are also composed constants, so-called constant schemata. A constant schema has the form of the name of the constant schema (token type constscheme) followed by a list of types, the whole thing enclosed in parentheses. There are a few built in constant schemata: (Rec <typelist>) is the recursion over the types given in the type list; (EQat <type>) is the equality for the given type; (Eat <type>) is the existence predicate for the given type. The constant schema EQat can also be written as the relational inﬁx operator =; the constant schemata Eat can also be written as the preﬁx operator E.

For a number, the user deﬁned function make-numeric-term is called with the number as argument. The return value of make-numeric-term should be the internal term representation of the number.

To form a conditional term, the if operator if followed by a list of atomic terms is enclosed in square brackets. Depending on the constructor of the ﬁrst term, the selector, a conditional term can be reduced to one of the remaining terms.

From these atomic terms, compound terms are built not only by application but also using a variety of operators, that differ in binding strength and associativity.

Postﬁx operators (token type postfix-op) bind strongest, next in binding strength are preﬁx operators (token type prefix-op), next come binding operators (token type binding-op).

A binding operator is followed by a list of variables and ﬁnally a term. There are two more variations of binding operators, that bind much weaker and are discussed later.

Next, after the binding operators, is plain application. Juxtaposition of two terms means applying the ﬁrst term to the second. Sequences of applications associate to the left. According to the vector notation convention the meaning of application depends on the type of the ﬁrst term. Two forms of applications are deﬁned by default: if the type of the ﬁrst term is of arrow-form? then make-term-in-app-form is used; for the type of a free algebra we use the corresponding form of recursion. However, there is one exception: if the ﬁrst term is of type boole application is read as a short-hand for the “if...then ...else” construct (which is a special form) rather than boolean recursion. The user may use the function add-new-application to add new forms of applications. This function takes two arguments, a predicate for the type of the ﬁrst argument, and a function taking the two terms and returning another term intended to be the result of this form of application. Predicates are tested in the inverse order of their deﬁnition, so more general forms of applications should be added ﬁrst.

By default these new forms of application are not used for output; but the user might declare that certain terms should be output as formal application. When doing so it is the user’s responisbility to make sure that the syntax used for the output can still be parsed correctly by the parser! To do so the function (add-new-application-syntax pred toarg toop) can be used, where the ﬁrst argument has to be a predicate (i.e., a function mapping terms to #t and #f) telling whether this special form of application can be used. If so, the arguments toarg and toop have to be functions mapping the term to operator and argument of this “application” respectively.

After that, we have binary operators written in inﬁx notation. In order of decreasing binding strength these are:

multiplicative operators, leftassociative, token type mul-op;
additive operators, leftassociative, token type add-op;
relational operators, not associative, token type rel-op;
boolean and operators, leftassociative, token type and-op;
boolean or operators, leftassociative, token type or-op;
boolean implication operators, rightassociative, token type imp-op;
pairing operators, rightassociative, token type pair-op.

On the top level, we have two more forms of binding operators, one using the dot “.”, the other using square brackets “[ ]”. Recall that a binding operator is followed by a list of variables and a term. This notation can be augmented by a period “.” following after the variable list and before the term. In this case the scope of the binding extends as far to the right as possible. Bindings with the lambda operator can also be speciﬁed by including the list of variables in square brackets. In this case, again, the scope of the binding extends as far as possible.

Predeﬁned operators are E and = as described above, the binding operator lambda, and the pairing operator @ with two preﬁx operators left and right for projection.

The value of an operator token is a function that will obtain the internal representation of the component terms as arguments and returns the internal representation of the whole term.

If a term is formed by application, the function make-gen-application is called with two subterms and returns the compound term. The default here (for terms with an arrow type) is to make a term in application form. However other rules of composition might be introduced easily.

Formulas are built from atomic formulas using junctors and quantors.

The simplest atomic formulas are made from terms using the implicit predicate “atom”. The semantics of this predicate is well deﬁned only for terms of type boole. Further, a predicate constant (token type predconst) or a predicate variable (token type pvar) followed by a list of atomic terms is an atomic formula. Lastly, any formula enclosed in parentheses is considered an atomic formula.

The composition of formulas using junctors and quantors is very similar to the composition of terms using operators and binding. So, ﬁrst postﬁx junctors, token type postfix-jct, are applied, next preﬁx junctors, token type prefix-jct, and quantors, token type quantor, in the usual form: quantor, list of variables, formula. Again, we have a notation using a period after the list of variables, making the scope of the quantor as large as possible. Predeﬁned quantors are ex, excl, exca, and all.

The remaining junctors are binary junctors written in inﬁx form. In order of decreasing binding strength we have:

conjunction junctors, leftassociative, token type and-jct;
disjunction junctors, leftassociative, token type or-jct;
tensor junctors, rightassociative, token type tensor-jct;
implication junctors, rightassociative, token type imp-jct.

Predeﬁned junctors are & (and), ! (tensor), and -> (implication).

The value of junctors and quantors is a function that will be called with the appropriate subformulas, respectively variable lists, to produce the compound formula in internal form.

[next] [prev] [prev-tail] [front] [up]