Adding Comments

Alex Angelopoulos (aka at mvps dot org)

 

Syntax

! Comment

 

Details

The exclamation mark '!' denotes that a line is a comment. This symbol is currently only allowed at the very beginning of a line.

 

Examples

! This is a comment
! This is also a comment

! Remember to always comment your code
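As a rough illustration of the rule above, the following Python sketch drops comment lines from grammar text. The function name strip_comment_lines is hypothetical and is not part of the GOLD Builder; the sketch only assumes that a comment is a line whose very first character is '!'.

```python
def strip_comment_lines(grammar_text):
    """Drop every line whose first character is '!' (a comment line)."""
    kept = [line for line in grammar_text.splitlines()
            if not line.startswith("!")]
    return "\n".join(kept)


sample = '! This is a comment\n"Name" = Demo\n! This is also a comment'
print(strip_comment_lines(sample))
```

A line such as ' ! text' (with a leading space) is kept, because the symbol is only recognized at the very beginning of a line.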

 

Defining Parameters

 

Syntax

" ParameterName " = Value

 

Part           Description
ParameterName  A string containing the name of the parameter
Value          A string containing the new value of the parameter

 

Details

The GOLD Builder has a series of parameters that describe the form and function of your grammar. They are as follows:

Parameter Name   Type      Description
Name             Optional  The name of the grammar.
Version          Optional  The version of the grammar. This can contain any alphanumeric string.
Author           Optional  The grammar's author.
About            Optional  A short description of the grammar.
Case Sensitive   Optional  Whether the grammar is considered to be case sensitive. When this parameter is set to "True", the GOLD Builder will construct case-sensitive tokenizer tables (DFA). In other words, if your language contains a terminal 'if', the text 'IF', 'If', and 'iF' will cause a syntax error. This parameter defaults to 'False'.
Auto Whitespace  Optional  In the previous version of the GOLD Parser, the whitespace terminal was always created when omitted in the grammar. Unfortunately, not all grammars make use of whitespace. This parameter is set to 'True' by default, but can be changed to 'False'. When 'False', the system will not automatically create a whitespace terminal unless it is manually defined.
Start Symbol     Required  The starting symbol in the grammar. When the GOLD Builder constructs the LALR parse tables, source text is "accepted" when it reduces to this nonterminal.

 

Example

"Name"    = My Programming Language
"Version" = 1.0 beta
"Author"  = John Q. Public
"About"   = This is a test declaration

"Case Sensitive" = False
"Start Symbol" = <Statement>

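The declaration format can be sketched in Python as follows. This is only an illustration of the "ParameterName" = Value shape, not the Builder's own reader; read_parameters and PARAM_LINE are hypothetical names, and the sketch assumes one declaration per line.

```python
import re

# Matches lines of the form  "ParameterName" = Value
PARAM_LINE = re.compile(r'^\s*"([^"]+)"\s*=\s*(.*?)\s*$')


def read_parameters(grammar_text):
    """Collect parameter declarations into a dict (hypothetical helper)."""
    params = {}
    for line in grammar_text.splitlines():
        match = PARAM_LINE.match(line)
        if match:
            params[match.group(1).strip()] = match.group(2)
    # "Start Symbol" is the only parameter the table marks as Required.
    if "Start Symbol" not in params:
        raise ValueError('the required "Start Symbol" parameter is missing')
    return params


example = '''
"Name"    = My Programming Language
"Version" = 1.0 beta
"Case Sensitive" = False
"Start Symbol" = <Statement>
'''
params = read_parameters(example)
print(params["Start Symbol"])
```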
 

Defining Rules

 

Syntax

< RuleName > ::= [ Symbols ]
               [ | Symbols ] ...

 

Part      Description
RuleName  A string specifying the name of the nonterminal the rule derives.
Symbols   A list of 0 or more terminals and nonterminals.

 

Details

Typically, rules in a grammar are declared using BNF (Backus-Naur Form) statements. Each statement consists of a series of 0 or more symbols, where nonterminals are delimited by the angle brackets '<' and '>' and terminals are either delimited by single quotes or not delimited at all.

For instance, the following declares the common if-statement.

<Statement> ::= if <Expression> then <Statements> end if

The symbols 'if', 'then', 'end', and 'if' are terminals and <Expression> and <Statements> are nonterminals. 

If you are declaring a series of rules that derive the same nonterminal (i.e. different versions of a rule), you can use a single pipe character '|' in place of the rule's name and the "::=" symbol. The following declares a series of 3 different rules that define a 'Statement'. In this example, the shortcut notation is used to simplify the declaration.

<Statement> ::= if <Expression> then <Statements> end if
  | while <Expression> do <Statements> end while
  | for Id = <Range> loop <Statements> end for

This is equivalent to:

<Statement> ::= if <Expression> then <Statements> end if
<Statement> ::= while <Expression> do <Statements> end while
<Statement> ::= for Id = <Range> loop <Statements> end for
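The equivalence between the two forms can be sketched in Python. The helper expand_alternatives is hypothetical and is not how the Builder reads grammars; it assumes each alternative sits on its own line and that no line starts with a quoted '|' terminal.

```python
def expand_alternatives(grammar_text):
    """Rewrite '|' shortcut lines as full '<Name> ::= ...' rules."""
    rules = []
    current_name = None
    for line in grammar_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("|"):
            if current_name is None:
                raise ValueError("'|' used before any rule name")
            # Reuse the name of the most recent rule.
            rules.append(f"{current_name} ::= {stripped[1:].strip()}")
        elif "::=" in stripped:
            name, body = stripped.split("::=", 1)
            current_name = name.strip()
            rules.append(f"{current_name} ::= {body.strip()}")
    return rules


shortcut = """
<Statement> ::= if <Expression> then <Statements> end if
  | while <Expression> do <Statements> end while
  | for Id = <Range> loop <Statements> end for
"""
for rule in expand_alternatives(shortcut):
    print(rule)
```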

 

Note: When text is read by the Builder, all characters delimited by single quotes are analyzed as literal strings. In other words, any text delimited by single quotes is considered to be exactly as printed. This allows you to specify characters that would normally be limited by the notation. For instance, when defining a rule, angle brackets are used to delimit nonterminals. By typing '<' and '>', you can specify these two characters without worrying about the system misinterpreting them. A single quote character can be specified by typing two single quotes ''.

 

"Enhanced" BNF

There is also an "Enhanced" BNF format, which incorporates special notation for optional symbols (either terminals or nonterminals). At this time, the GOLD Builder only uses the original format. The final build of version 1.0 might incorporate the enhanced format, but this is not yet determined.

 

Additional Examples

The following two rules define a comma-delimited list of Identifiers. The use of single quotes to delimit the actual comma is not required.

<List> ::= Identifier ',' <List>
  | Identifier

 

Operator precedence is an important aspect of most programming languages. The following rules define the common arithmetic operators.

<Expression> ::= <Mult Exp> '+' <Expression>
  | <Mult Exp> '-' <Expression>
  | <Mult Exp>

<Mult Exp> ::= Identifier '*' <Mult Exp>
  | Identifier '/' <Mult Exp>
  | Identifier
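
How nesting one rule inside another yields precedence can be sketched with a small hand-written recursive-descent parser in Python. This is not the LALR algorithm the GOLD Builder uses. It assumes that the operands of '+' and '-' are <Mult Exp>, that tokens are separated by spaces, and that identifiers are plain alphanumeric words. All function names are hypothetical.

```python
def parse_expression(tokens):
    """<Expression> ::= <Mult Exp> ('+' | '-') <Expression> | <Mult Exp>"""
    left, rest = parse_mult_exp(tokens)
    if rest and rest[0] in ("+", "-"):
        right, remaining = parse_expression(rest[1:])
        return (rest[0], left, right), remaining
    return left, rest


def parse_mult_exp(tokens):
    """<Mult Exp> ::= Identifier ('*' | '/') <Mult Exp> | Identifier"""
    if not tokens or not tokens[0].isalnum():
        raise SyntaxError("identifier expected")
    left, rest = tokens[0], tokens[1:]
    if rest and rest[0] in ("*", "/"):
        right, remaining = parse_mult_exp(rest[1:])
        return (rest[0], left, right), remaining
    return left, rest


def parse(text):
    """Return a nested tuple (operator, left, right) for the whole input."""
    tree, rest = parse_expression(text.split())
    if rest:
        raise SyntaxError(f"unexpected token {rest[0]!r}")
    return tree


print(parse("a + b * c"))
print(parse("a * b + c"))
```

Because <Mult Exp> sits below <Expression>, the multiplication is always grouped first, whichever side of the '+' it appears on.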

Defining Sets

 

Syntax

{ SetName } = SetExpression

 

Part           Description
SetName        A string specifying the name of the set being declared.
SetExpression  An arithmetic expression containing one or more sets.

 

Details

Literal sets of characters are delimited using the square brackets '[' and ']' and pre-defined sets are delimited by the braces '{' and '}'. For instance, the text "[abcde]" denotes a set of characters consisting of the first five letters of the alphabet, while the text "{abc}" refers to a set named "abc".

Sets can then be declared by adding and subtracting previously declared sets and literal sets. The GOLD Builder provides a collection of pre-defined sets that contain characters often used to define terminals.

Note: When text is read by the Builder, all characters delimited by single quotes are analyzed as literal strings. In other words, any text delimited by single quotes is considered to be exactly as printed. This allows you to specify characters that would normally be limited by the notation. For instance, when defining a rule, angle brackets are used to delimit nonterminals. By typing '<' and '>', you can specify these two characters without worrying about the system misinterpreting them. A single quote character can be specified by typing a double single quote ''.

 

Examples

Declaration                          Resulting Set
{Bracket}  = [']']                   ]
{Quote}    = ['']                    '
{Vowels}   = [aeiou]                 aeiou
{Vowels 2} = {Vowels} + [y]          aeiouy
{Set 1}    = [abc]                   abc
{Set 2}    = {Set 1} + [12] - [c]    ab12
{Set 3}    = {Set 2} + [0123456789]  ab0123456789
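
The set arithmetic in these declarations can be mirrored with Python's built-in set type. This is only a sketch: the variable names are illustrative, and it assumes '+' and '-' are applied left to right, which is what the {Set 2} result implies. Python's '-' binds tighter than '|', so parentheses are needed to reproduce that order.

```python
def literal(chars):
    """A literal set such as [abc] becomes a Python set of characters."""
    return set(chars)


vowels = literal("aeiou")
vowels_2 = vowels | literal("y")                   # {Vowels} + [y]
set_1 = literal("abc")
set_2 = (set_1 | literal("12")) - literal("c")     # {Set 1} + [12] - [c]
set_3 = set_2 | literal("0123456789")              # {Set 2} + [0123456789]

print("".join(sorted(set_2)))
```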

 

Additional Examples

The following declares a set named "Hex Char" containing the characters that are valid in a hexadecimal number.

{Hex Char} = {Digit} + [ABCDEF]

The following declares a set containing the characters that can be placed inside a normal "string". In this case, the double quote is the delimiting character (which it is in most programming languages).

{String Char} = {Printable} - ["]

Defining Terminals

 

Syntax

TerminalName = RegularExpression

 

Part               Description
TerminalName       A string of characters specifying the name of the terminal being declared.
RegularExpression  A regular expression defining the pattern of the terminal.

 

Regular Expressions

The notation is rather simple, yet versatile enough to express any terminal needed. Basically, regular expressions consist of a series of characters that define the pattern of the terminal.

Literal sets of characters are delimited using the square brackets '[' and ']' and defined sets are delimited by the braces '{' and '}'. For instance, the text "[abcde]" denotes a set of characters consisting of the first five letters of the alphabet, while the text "{abc}" refers to a set named "abc". Neither of these is part of the "pure" notation for regular expressions, but both are widely used in other parser generators such as Lex/Yacc.

Sub-expressions are delimited by normal parentheses '(' and ')'. The pipe character '|' is used to denote alternate expressions.

A set, a sub-expression, or a single character can be followed by any of the following three symbols:

*  Kleene Closure. This symbol denotes 0 or more of the specified character(s)
+  One or more. This symbol denotes 1 or more of the specified character(s)
?  Optional. This symbol denotes 0 or 1 of the specified character(s)

For example, the regular expression ab* translates to "an a followed by zero or more b's" and [abc]+ translates to "a series of one or more a's, b's or c's".

Note: When text is read by the Builder, all characters delimited by single quotes are analyzed as literal strings. In other words, any text delimited by single quotes is considered to be exactly as printed. This allows you to specify characters that would normally be limited by the notation. For instance, when defining a rule, angle brackets are used to delimit nonterminals. By typing '<' and '>', you can specify these two characters without worrying about the system misinterpreting them. A single quote character can be specified by typing a double single quote ''.

In the case of regular expressions, single quotes allow you to specify the following characters: ? * + ( ) { } [ ]

 

Examples

Declaration                             Valid strings
Example1     = abc*                     ab, abc, abcc, abccc, abcccc, ...
Example2     = ab?c                     abc, ac
Example3     = a|b|c                    a, b, c
Example4     = a[12]*b                  ab, a1b, a2b, a12b, a21b, a22b, a111b, ...
Example5     = '*'+                     *, **, ***, ****, ...
Example6     = {Letter}+                cat, dog, Sacramento, ...
Identifier   = {Letter}{AlphaNumeric}*  e4, Param4b, Color2, temp, ...
ListFunction = c[ad]+r                  car, cdr, caar, cadr, cdar, cddr, caaar, ...
ListFunction = c(a|d)+r                 The same as the above using a different, yet equivalent, regular expression.
NewLine      = {CR}{LF}|{CR}            Windows and DOS use {CR}{LF} for newlines; a lone {CR} is the classic Mac OS convention. This definition will detect both. UNIX uses a lone {LF}, which would need an additional |{LF} alternative.
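
For readers who know other regular expression dialects, several of these declarations can be checked against Python's re module. The patterns below are assumed translations and not output of the GOLD Builder. The main difference is that GOLD quotes a literal '*' with single quotes where Python uses a backslash. The matches helper is hypothetical.

```python
import re

# GOLD terminal name -> assumed equivalent Python pattern
PATTERNS = {
    "Example1": r"abc*",
    "Example2": r"ab?c",
    "Example3": r"a|b|c",
    "Example4": r"a[12]*b",
    "Example5": r"\*+",        # GOLD writes the literal as '*'
    "ListFunction": r"c[ad]+r",
}


def matches(name, text):
    """True if the whole of text matches the named pattern."""
    return re.fullmatch(PATTERNS[name], text) is not None


print(matches("Example1", "abccc"))
print(matches("ListFunction", "cadr"))
```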

 

Special Terminals

The Whitespace terminal is used by the GOLD Parser to represent information that can be ignored by the parsing engine. Normally this is defined as {Whitespace}+.

In addition, there are three Comment terminals that are used to define block and line comments.

Pre-Defined Character Sets

 

The GOLD Builder has a collection of useful pre-defined sets at your disposal. These include the sets that are often used for defining terminals as well as characters not accessible via the keyboard. This documentation also includes a Pre-Defined Character Set Chart.

 

Standard Characters

Set Name  Characters
{HT}      Horizontal Tab character (#09).
{LF}      Line Feed character (#10).
{VT}      Vertical Tab character (#11). This character is rarely used.
{FF}      Form Feed character (#12). This character is also known as "New Page".
{CR}      Carriage Return character (#13).
{Space}   Space character (#32). Technically, this set is not needed since a "space" can be expressed by using single quotes: ' '. The set was added to allow the developer to indicate the character more explicitly and to improve readability.
{NBSP}    No-Break Space character (#160). The No-Break Space character is used to represent a space where a line break is not allowed. It is often used in source code for indentation.

 

Commonly Used Character Sets

Set Name        Characters
{Digit}         0123456789
{Letter}        abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
{AlphaNumeric}  This set includes all the characters in {Letter} and {Digit}.
{Printable}     This set includes all standard characters that can be printed onscreen. This includes the characters from #32 to #127 and #160 (No-Break Space). The No-Break Space character was included since it is often used in source code.
{Whitespace}    This set includes all characters that are normally considered whitespace and ignored by the parser. The set consists of the Space, Horizontal Tab, Line Feed, Vertical Tab, Form Feed, Carriage Return and No-Break Space.
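
The contents of these sets can be written out in Python for reference. This is a sketch built only from the character codes listed above; the constant names are illustrative and are not identifiers used by the GOLD Builder.

```python
# Single characters from the Standard Characters table
HT, LF, VT, FF, CR, SPACE, NBSP = (chr(n) for n in (9, 10, 11, 12, 13, 32, 160))

DIGIT = set("0123456789")
LETTER = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
ALPHANUMERIC = LETTER | DIGIT
# #32 to #127 inclusive, plus the No-Break Space, as described in the table
PRINTABLE = {chr(n) for n in range(32, 128)} | {NBSP}
WHITESPACE = {SPACE, HT, LF, VT, FF, CR, NBSP}

print(len(ALPHANUMERIC), len(PRINTABLE), len(WHITESPACE))
```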

 

Extended Character Sets

Please see the Pre-Defined Character Set Chart for pictures of these characters.

Set Name              Characters
{#n}                  Using this notation, you can specify characters normally not accessible via the keyboard. In this version of the GOLD Builder, n can be any number between 0 and 255. For instance, {#169} specifies the copyright character ©.
{Letter Extended}     This set includes all the letters which are part of the extended Unicode character set.
{Printable Extended}  This set includes all the printable characters above #127. Although rarely used in programming languages, they could be used, for instance, as valid characters in a string literal.

Comment Terminals

 

One of the key principles in programming languages is the ability to incorporate comments and other documentation directly into the source code. Whether it is FORTRAN, COBOL or C++, the ability exists, but in varying forms.

Essentially, programming languages use three different types of comment terminals: one that tells the compiler to ignore the remaining text in the current line of code, and a pair that denote the start and end of a multi-line (block) comment.

To accommodate the intricacies of comments, the GOLD Parser Builder provides for this special class of terminals.

Comment Start  The Comment Start terminal defines the symbol used to begin a block comment. When the tokenizer engine reads this symbol from the source text, it will increment an internal counter and ignore all other tokens until the matching Comment End token is encountered. Block comments can therefore be nested.
Comment End    The Comment End terminal defines the symbol that will denote the end of a block comment.
Comment Line   The Comment Line terminal defines the symbol that begins a line comment. Unlike with the Comment Start and Comment End terminals, the tokenizer will simply discard the rest of the line.

This documentation contains an example on how to use the comment terminals in a grammar.
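
The counter-based behaviour described for Comment Start and Comment End can be sketched in Python as follows. This is an illustration under assumptions, not the tokenizer engine itself. The symbols '/*', '*/' and '//' are example choices, the function name is hypothetical, and the sketch works on raw characters without recognizing string literals.

```python
def strip_comments(text, start="/*", end="*/", line="//"):
    """Remove nested block comments and line comments from text."""
    out = []
    depth = 0                       # nesting counter for block comments
    i = 0
    while i < len(text):
        if text.startswith(start, i):
            depth += 1              # Comment Start: one level deeper
            i += len(start)
        elif depth > 0 and text.startswith(end, i):
            depth -= 1              # Comment End: one level back out
            i += len(end)
        elif depth > 0:
            i += 1                  # inside a block comment: ignore
        elif text.startswith(line, i):
            newline = text.find("\n", i)
            # Comment Line: skip to the end of the line, keep the newline
            i = len(text) if newline == -1 else newline
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


print(strip_comments("a /* b /* c */ d */ e"))
print(strip_comments("x = 1 // note\ny"))
```

In the first call the inner '*/' only lowers the counter from 2 to 1, so ' d ' is still treated as comment text.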

 

Examples of Comment Terminals

Below is a comparison of comment terminals in several common programming languages. Blank fields denote that the programming language lacks a terminal of that type. For instance, Visual Basic does not provide block comments.

Programming Language  Line Comment             Block Comment Start  Block Comment End
BASIC                 REM
C (Original)                                   /*                   */
C (ANSI)              //                       /*                   */
C++                   //                       /*                   */
COBOL                 *
LISP                  ;
FORTRAN 90            !
Java                  //                       /*                   */
Pascal                                         { or (*              } or *)
Prolog                %                        /*                   */
SQL                   --                       /*                   */
Visual Basic          ' (single quote) or Rem

Whitespace Terminal

 

In practically all programming languages, the parser recognizes (and usually ignores) the spaces, new lines, and other meaningless characters that exist between tokens. For instance, in the code

If  Done Then
   Counter =1;
End If

the fact that there are two spaces between the 'If' and 'Done', a new line after 'Then', and multiple spaces before 'Counter' is irrelevant.

From the parser's point of view (in particular, that of the Deterministic Finite Automaton that it uses), these whitespace characters are recognized as a special terminal which can be discarded. In GOLD, this terminal is simply called the Whitespace terminal and can be defined to whatever is needed. If the Whitespace terminal is not defined explicitly in the grammar outline, it will be implicitly declared as one or more of the characters in the pre-defined Whitespace set: {Whitespace}+.
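
The effect of discarding a {Whitespace}+ terminal can be sketched with a small Python tokenizer. It stands in for the DFA with Python's re module. The Word, Number and Symbol token classes are assumptions made for this example and are not defined anywhere in GOLD.

```python
import re

# The Whitespace alternative mirrors {Whitespace}+ : Space, HT, LF, VT, FF, CR, NBSP
TOKEN = re.compile(
    r"(?P<Whitespace>[ \t\n\x0b\x0c\r\xa0]+)"
    r"|(?P<Word>[A-Za-z]+)"
    r"|(?P<Number>[0-9]+)"
    r"|(?P<Symbol>.)"
)


def tokenize(source):
    """Return the token texts, discarding every Whitespace token."""
    tokens = []
    for match in TOKEN.finditer(source):
        if match.lastgroup != "Whitespace":
            tokens.append(match.group())
    return tokens


code = "If  Done Then\n   Counter =1;\nEnd If"
print(tokenize(code))
```

The two spaces, the new lines and the indentation all disappear from the token stream, so the parser sees the same tokens however the code is laid out.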

Normally, you would not need to worry about the Whitespace terminal unless you are designing a language where the end of a line is significant. This is the case with Visual Basic, BASIC and many, many others. The proper declaration can be seen in an example.