Adding Comments

The exclamation symbol '!' can be used to denote that a line is a comment. This symbol is currently only allowed at the very beginning of a line.

Examples

Defining Parameters

Syntax

Details

Part	Description
ParameterName	A string containing the name of the parameter.
Value	A string containing the new value of the parameter.

The GOLD Builder has a series of parameters that describe the form and function of your grammar. They are as follows:

Parameter Name	Type	Description
Name	Optional	The name of the grammar.
Version	Optional	The version of the grammar. This can contain any alphanumeric string.
Author	Optional	The grammar's author.
About	Optional	A short description of the grammar.
Case Sensitive	Optional	Whether the grammar is considered to be case sensitive. When this parameter is set to 'True', the GOLD Builder will construct case sensitive tokenizer tables (DFA). In other words, if your language contains a terminal 'if', the text 'IF', 'If', and 'iF' will cause a syntax error. This parameter defaults to 'False'.
Auto Whitespace	Optional	In the previous version of the GOLD Parser, the whitespace terminal was always created when omitted in the grammar. Unfortunately, not all grammars make use of whitespace. This parameter is set to 'True' by default, but can be changed to 'False'. When 'False', the system will not automatically create a whitespace terminal unless it is manually defined.
Start Symbol	Required	The starting symbol in the grammar. When LALR parse tables are constructed by the GOLD Builder, an "accepted" grammar will reduce to this nonterminal.

Example

Defining Rules

Syntax

Details

Part	Description
`RuleName`	A string specifying the name of the nonterminal the rule derives.
`Symbols`	A list of 0 or more terminals and nonterminals.

Typically, rules in a grammar are declared using BNF (Backus-Naur Form) statements. This notation consists of a series of 0 or more symbols, where nonterminals are delimited by the angle brackets '<' and '>' and terminals are delimited by single quotes or not delimited at all.

In a rule such as <Statement> ::= if <Expression> then <Statements> end if, the symbols 'if', 'then', 'end', and 'if' are terminals and <Expression> and <Statements> are nonterminals.

If you are declaring a series of rules that derive the same nonterminal (i.e. different versions of a rule), you can use a single pipe character '|' in place of the rule's name and the "::=" symbol. The following declares a series of 3 different rules that define a 'Statement'. In this example, the shortcut notation is used to simplify the declaration.

"Enhanced" BNF

There is also an "Enhanced" BNF format which incorporates special notation for optional symbols (either terminals or nonterminals). At this time, the GOLD Builder only uses the original format. The final build of version 1.0 might incorporate the enhanced format, but this is not yet determined.

Additional Examples

The following two rules define a comma-delimited list of Identifiers. The use of single quotes to delimit the actual comma is not required.

Operator precedence is an important aspect of most programming languages. The following rules define the common arithmetic operators.

Defining Sets

Syntax

Details

Literal sets of characters are delimited using the square brackets '[' and ']' and pre-defined sets are delimited by the braces '{' and '}'. For instance, the text "[abcde]" denotes a set of characters consisting of the first five letters of the alphabet; while the text "{abc}" refers to a set named "abc".

Part	Description
`SetName`	A string specifying the name of the set being declared.
`SetExpression`	An arithmetic expression containing one or more sets.

Sets can then be declared by adding and subtracting previously declared sets and literal sets. The GOLD Builder provides a collection of pre-defined sets that contain characters often used to define terminals.

Examples

Additional Examples

The following declares a set named "Hex Char" containing the characters that are valid in a hexadecimal number.

The following declares a set containing the characters that can be placed inside a normal "string". In this case, the double quote is the delimiting character (which it is in most programming languages).

Defining Terminals

Syntax

Regular Expressions

The notation is rather simple, yet versatile enough to express any terminal needed. Basically, regular expressions consist of a series of characters that define the pattern of the terminal.

Declaration	Resulting Set
{Bracket} = [']']	]
{Quote} = ['']	'
{Vowels} = [aeiou]	aeiou
{Vowels 2} = {Vowels} + [y]	aeiouy
{Set 1} = [abc]	abc
{Set 2} = {Set 1} + [12] - [c]	ab12
{Set 3} = {Set 2} + [0123456789]	ab0123456789
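
The set arithmetic in the table above can be mimicked with ordinary Python sets. This is only an illustrative sketch, not how the GOLD Builder is implemented: it assumes '+' behaves as set union, '-' as set difference, and that a declaration is evaluated left to right. The names set_1, set_2, set_3 and show are hypothetical.

```python
# Model each character set as a Python set of single characters.
# Assumption: '+' is union, '-' is difference, evaluated left to right.
set_1 = set("abc")                        # {Set 1} = [abc]
set_2 = (set_1 | set("12")) - set("c")    # {Set 2} = {Set 1} + [12] - [c]
set_3 = set_2 | set("0123456789")         # {Set 3} = {Set 2} + [0123456789]


def show(chars):
    """Return the members of a set as a sorted string, for display only."""
    return "".join(sorted(chars))
```

Sets are unordered, so show() sorts the characters purely for display; the order in which a declaration lists its characters carries no meaning.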

Part	Description
`TerminalName`	A string of characters specifying the name of the terminal being declared.
`RegularExpression`	A regular expression defining the pattern of the terminal.

Literal sets of characters are delimited using the square brackets '[' and ']' and defined sets are delimited by the braces '{' and '}'. For instance, the text "[abcde]" denotes a set of characters consisting of the first five letters of the alphabet, while the text "{abc}" refers to a set named "abc". Neither of these is part of the "pure" notation for regular expressions, but both are widely used in other parser generators such as Lex/Yacc.

Sub-expressions are delimited by normal parentheses '(' and ')'. The pipe character '|' is used to denote alternate expressions.

A set, a sub-expression, or a single character can be followed by any of the following three symbols:

For example, the regular expression ab* translates to "an a followed by zero or more b's" and [abc]+ translates to "a series of one or more a's, b's or c's".

Examples

Special Terminals

Declaration	Valid strings
`Example1 = abc*`	`ab, abc, abcc, abccc, abcccc, ...`
`Example2 = ab?c`	`abc, ac`
`Example3 = a|b|c`	`a, b, c`
`Example4 = a[12]*b`	`ab, a1b, a2b, a12b, a21b, a22b, a111b, ...`
`Example5 = '*'+`	`*, **, ***, ****, ...`
`Example6 = {Letter}+`	`cat, dog, Sacramento, ...`
`Identifier = {Letter}{AlphaNumeric}*`	`e4, Param4b, Color2, temp, ...`
`ListFunction = c[ad]+r`	`car, cdr, caar, cadr, cdar, cddr, caaar, ...`
`ListFunction = c(a|d)+r`	The same as the above, using a different yet equivalent regular expression.
`NewLine = {CR}{LF}|{CR}`	Windows and DOS use {CR}{LF} for newlines, while classic Mac OS uses {CR} alone; this definition detects both. UNIX uses {LF} alone, which this pattern does not match.
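
Several of the declarations in the table above can be checked against Python's re module, whose operators '*', '+', '?' and '|' behave the same way for these simple cases. The translations below are assumptions made for illustration: in particular, {Letter} and {AlphaNumeric} are spelled out as ASCII ranges, and the names PATTERNS and matches are hypothetical.

```python
import re

# Hand-translated equivalents of four table entries (illustrative only).
PATTERNS = {
    "Example1": r"abc*",                    # Example1 = abc*
    "Example2": r"ab?c",                    # Example2 = ab?c
    "ListFunction": r"c[ad]+r",             # ListFunction = c[ad]+r
    "Identifier": r"[A-Za-z][A-Za-z0-9]*",  # Identifier = {Letter}{AlphaNumeric}*
}


def matches(name, text):
    """Return True when the whole of text matches the named pattern."""
    return re.fullmatch(PATTERNS[name], text) is not None
```

For instance, matches("ListFunction", "cadr") is true, while matches("ListFunction", "cr") is false because '+' requires at least one 'a' or 'd'.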

The Whitespace terminal is used by the GOLD Parser to represent information that can be ignored by the parsing engine. Normally this is defined as {Whitespace}+.

In addition, there are three Comment terminals that are used to define block and line comments.

Pre-Defined Character Sets

The GOLD Builder has a collection of useful pre-defined sets at your disposal. These include the sets that are often used for defining terminals as well as characters not accessible via the keyboard. This documentation also includes a Pre-Defined Character Set Chart.

Standard Characters

Commonly Used Character Sets

Extended Character Sets

Comment Terminals

One of the key principles in programming languages is the ability to incorporate comments and other documentation directly into the source code. Whether it is FORTRAN, COBOL or C++, the ability exists, but in varying forms.

Essentially, three types of comment terminals are used in programming languages: one that tells the compiler to ignore the remaining text on the current line, and two that denote the start and the end of a multi-line comment.

To accommodate the intricacies of comments, the GOLD Parser Builder provides for this special class of terminals.

Set Name	Characters
{HT}	Horizontal Tab character (#09).
{LF}	Line Feed character (#10).
{VT}	Vertical Tab character (#11). This character is rarely used.
{FF}	Form Feed character (#12). This character is also known as "New Page".
{CR}	Carriage Return character (#13).
{Space}	Space character (#32). Technically, this set is not needed since a "space" can be expressed by using single quotes: ' '. The set was added to allow the developer to indicate the character more explicitly and to improve readability.
{NBSP}	No-Break Space character (#160). The No-Break Space character is used to represent a space where a line break is not allowed. It is often used in source code for indentation.

Set Name	Characters
{Digit}	0123456789
{Letter}	abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
{AlphaNumeric}	This set includes all the characters in {Letter} and {Digit}
{Printable}	This set includes all standard characters that can be printed onscreen. This includes the characters from #32 to #127 and #160 (No-Break Space). The No-Break Space character was included since it is often used in source code.
{Whitespace}	This set includes all characters that are normally considered whitespace and ignored by the parser. The set consists of the Space, Horizontal Tab, Line Feed, Vertical Tab, Form Feed, Carriage Return and No-Break Space.

Set Name	Characters
{#n}	Using this notation, you can specify characters normally not accessible via the keyboard. In this version of the GOLD Builder, n can be any number between 0 and 255. For instance, {#169} specifies the copyright character ©.
{Letter Extended}	This set includes all the letters which are part of the extended Unicode character set.
{Printable Extended}	This set includes all the printable characters above #127. Although rarely used in programming languages, they could be used, for instance, as valid characters in a string literal.
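
As a rough illustration of the {#n} notation, the sketch below maps a code to its character. It assumes the codes coincide with the first 256 Unicode code points, which agrees with the two values given in this documentation (#160 for No-Break Space and #169 for ©). The function name char_from_code is hypothetical.

```python
def char_from_code(n):
    """Return the character for a {#n}-style code, restricted to 0..255."""
    if not 0 <= n <= 255:
        raise ValueError("code must be between 0 and 255")
    # Assumption: codes line up with Unicode code points 0..255.
    return chr(n)
```

Under that assumption, char_from_code(169) yields the copyright character and char_from_code(9) yields the Horizontal Tab listed under Standard Characters.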

This documentation contains an example of how to use the comment terminals in a grammar.

Examples of Comment Terminals

Below is a comparison of comment terminals in several common programming languages. Blank fields denote that the programming language lacks a terminal of that type. For instance, Visual Basic does not provide block comments.

Whitespace Terminal

In practically all programming languages, the parser recognizes (and usually ignores) the spaces, new lines, and other meaningless characters that exist between tokens. For instance, in the code

the fact that there are two spaces between the 'If' and 'Done', a new line after 'Then', and multiple spaces before 'Counter' is irrelevant.

Programming Language	Line Comment	Block Comment Start	Block Comment End
BASIC	REM
C (Original)		/*	*/
C (ANSI)	//	/*	*/
C++	//	/*	*/
COBOL	*
LISP	;
FORTRAN 90	!
Java	//	/*	*/
Pascal		{ or (*	} or *)
Prolog	%	/*	*/
SQL	--	/*	*/
Visual Basic	' (Single quote) or Rem

From the parser's point of view (in particular the Deterministic Finite Automata that it uses) these whitespace characters are recognized as a special terminal which can be discarded. In GOLD, this terminal is simply called the Whitespace terminal and can be defined to whatever is needed. If the Whitespace Terminal is not defined explicitly in the grammar outline, it will be implicitly declared as one or more of the characters in the pre-defined Whitespace set: {Whitespace}+.
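
The idea of a discardable whitespace terminal can be sketched in a few lines of Python. This is not the GOLD engine: the two-terminal tokenizer, the catch-all "Other" terminal, and the sample input are all assumptions made for illustration. The Whitespace pattern spells out the seven characters of the pre-defined {Whitespace} set.

```python
import re

# Whitespace = {Whitespace}+ written with explicit characters:
# Space, HT, LF, VT, FF, CR and No-Break Space.
WS = r" \t\n\x0b\x0c\r\xa0"

# "Other" is a hypothetical catch-all terminal for everything else.
TOKEN = re.compile(rf"(?P<Whitespace>[{WS}]+)|(?P<Other>[^{WS}]+)")


def tokenize(text):
    """Return the tokens of text, discarding every Whitespace token."""
    return [m.group() for m in TOKEN.finditer(text)
            if m.lastgroup != "Whitespace"]
```

With this sketch, tokenize("If  Done Then\n     Counter = Counter + 1") produces the same token list no matter how many spaces or new lines separate the words. A language where the end of a line is significant would need a narrower Whitespace definition that leaves the line break characters out.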

Normally, you would not need to worry about the Whitespace terminal unless you are designing a language where the end of a line is significant. This is the case with Visual Basic, BASIC and many, many others. The proper declaration can be seen in an example.

< RuleName >	::=	[ Symbols ]
[	|	Symbols ] ...

<Statement>	::=	if <Expression> then <Statements> end if
	|	while <Expression> do <Statements> end while
	|	for Id = <Range> loop <Statements> end for

<Statement>	::=	if <Expression> then <Statements> end if
<Statement>	::=	while <Expression> do <Statements> end while
<Statement>	::=	for Id = <Range> loop <Statements> end for

<Expression>	::=	<Mult Exp> '+' <Expression>
	|	<Mult Exp> '-' <Expression>
	|	<Mult Exp>

<Mult Exp>	::=	Identifier '*' <Mult Exp>
	|	Identifier '/' <Mult Exp>
	|	Identifier

Symbol	Description
`*`	Kleene Closure. This symbol denotes 0 or more of the specified character(s).
`+`	One or more. This symbol denotes 1 or more of the specified character(s).
`?`	Optional. This symbol denotes 0 or 1 of the specified character(s).

Terminal	Description
`Comment Start`	The Comment Start terminal defines the symbol used to begin a block comment. When the tokenizer engine reads this symbol from the source text, it will increment an internal counter and ignore all other tokens until the Comment End token is encountered. Comments will be nested.
`Comment End`	The Comment End terminal defines the symbol that will denote the end of a block comment.
`Comment Line`	The Comment Line terminal defines the symbol used to begin a line comment. Unlike Comment Start and Comment End, it needs no closing symbol: when the tokenizer reads it, it simply discards the rest of the line.
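
The counter described for Comment Start can be sketched as follows. This is a simplified, character-level illustration rather than the actual tokenizer: it assumes '/*' and '*/' as the Comment Start and Comment End symbols, and the function name strip_block_comments is hypothetical.

```python
def strip_block_comments(text, start="/*", end="*/"):
    """Remove block comments, tracking nesting depth with a counter."""
    out = []
    depth = 0  # number of block comments currently open
    i = 0
    while i < len(text):
        if text.startswith(start, i):
            depth += 1            # Comment Start: go one level deeper
            i += len(start)
        elif depth > 0 and text.startswith(end, i):
            depth -= 1            # Comment End: close the innermost level
            i += len(end)
        else:
            if depth == 0:        # keep text only outside all comments
                out.append(text[i])
            i += 1
    return "".join(out)
```

Because the counter only returns to zero when every Comment Start has been matched, the input "a /* b /* c */ d */ e" keeps just the text outside the outermost comment.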

<List>	::=	Identifier ',' <List>
	|	Identifier
