Basic Syntax

Pull request

Table of contents

Table of contents

Problem

The purpose of this proposal is to establish some basic syntactic elements of the Carbon language and make sure that the grammar is unambiguous and can be parsed by an LALR parser such as yacc or bison. The grammar presented here has indeed been checked by bison. The language features in this basic grammar include control flow by way of if and while, functions, simple structures, choice, pattern matching, and a sample of operators on Booleans and integers. The main syntactic categories are declaration, statement, and expression. Establishing these syntactic categories should help the other proposals choose syntax that is compatible with the rest of the language.

Background

The grammar proposed here is based on the following proposals:

Proposal

We summarize the four main syntactic categories here and define the grammar in the next section.

  • declaration includes function, structure, and choice definitions.

  • statement includes local variable definitions, assignment, blocks, if, while, match, break, continue, and return.

  • expression includes function calls, arithmetic, literals, and other syntaxes that evaluate to a value. In Carbon, a type is a kind of value, and Carbon makes no syntactic distinction between type-valued expressions and other kinds of expressions, so the expression nonterminal is used in positions where a type is expected.

  • pattern for the patterns in a match statement, for the left-hand side of a variable definition, and for describing the parameters of a function. The grammar treats patterns and expressions identically, but the type system will only allow pattern variables (expression ':' identifier) in patterns and not in value or type-producing expressions. Having the separate non-terminals for pattern and expression helps to document this distinction.

The proposal also specifies the abstract syntax.

When this proposal is accepted, the accompanying flex, bison, and C++ files that define the grammar and implement the parser actions will be placed in the executable_semantics directory of the carbon-lang repository.

In the near future there will be proposals regarding the semantic analysis (aka. type checking) and specification of runtime behavior for this subset of Carbon with the plan to add the accompanying executable forms of those specifications to the executable_semantics directory.

Looking towards growing the Carbon language, the intent is for other proposals to add to this grammar by modifying the flex and bison files and the abstract syntax definitions in the executable_semantics directory.

Details

Expressions and Patterns

The following grammar defines the concrete syntax for expressions. Below we comment on a few aspects of the grammar.

pattern:
  expression
;
expression:
  identifier
| expression designator
| expression '[' expression ']'
| expression ':' identifier
| integer_literal
| "true"
| "false"
| tuple
| expression "==" expression
| expression '+' expression
| expression '-' expression
| expression "and" expression
| expression "or" expression
| "not" expression
| '-' expression
| expression tuple
| "auto"
| "fnty" tuple return_type
;
designator:
  '.' identifier
;
tuple:
  '(' field_list ')'
;
field_list:
  /* empty */
| field
| field ',' field_list
;
field:
  pattern
| designator '=' pattern
;
return_type:
  /* empty */
| "->" expression
;

The grammar rule

expression:  expression ':' identifier

is for pattern variables. For example, in a variable definition such as

var Int: x = 0;

the Int: x is parsed with the grammar rule for pattern variables. In the right-hand side of the above grammar rule, the expression to the left of the : must evaluate to a type at compile time.

The grammar rule

expression:  "fnty" tuple return_type

is for function types. They are meant to play the role that function pointers play in C and that std::function plays in C++.

Regarding the grammar rule for the return type of a function:

return_type:  "->" expression

the expression is expected to evaluate to a type at compile-time.

The grammar rule

tuple:  '(' field_list ')'

is primarily for constructing a tuple, but it is also used for creating tuple types and tuple patterns, depending on the context in which the expression occurs.

Regarding the grammar rules for designator and for a named argument

designator: '.' identifier
field: designator '=' pattern

The period enables the addition of new keywords to Carbon without colliding with field names. The period also aids integrated development environments with auto-completion. The issue is that without the period, when the programmer is typing the identifier (and not yet typed the =), the IDE doesn’t know whether the identifier is for a positional argument, in which case it is a variable occurrence, or whether the identifier is a field name.

Statements

The following grammar defines the concrete syntax for statements.

statement:
  "var" pattern '=' expression ';'
| expression '=' expression ';'
| expression ';'
| "if" '(' expression ')' statement "else" statement
| "while" '(' expression ')' statement
| "break" ';'
| "continue" ';'
| "return" expression ';'
| '{' statement_list '}'
| "match" '(' expression ')' '{' clause_list '}'
;
statement_list:
  /* empty */
| statement statement_list
;
clause_list:
  /* empty */
| clause clause_list
;
clause:
  "case" pattern "=>" statement
| "default" "=>" statement
;

Declarations

The following grammar defines the concrete syntax for declarations.

declaration:
  "fn" identifier tuple return_type '{' statement_list '}'
| "fn" identifier tuple "=>" expression ';'
| "fn" identifier tuple return_type ';'
| "struct" identifier '{' member_list '}'
| "choice" identifier '{' alternative_list '}'
;
member:
  "var" expression ':' identifier ';'
;
member_list:
  /* empty */
| member member_list
;
alternative:
  identifier tuple ';'
;
alternative_list:
  /* empty */
| alternative alternative_list
;
declaration_list:
  /* empty */
| declaration declaration_list
;

In the grammar rule for function definitions

declaration:  "fn" identifier tuple return_type '{' statement_list '}'

the tuple is used as a pattern to describe the parameters of the function. The grammar for function definitions does not currently include implicit parameters, but the intent is to add them to the grammar in the future.

In the rule for field declarations

member:  "var" expression ':' identifier ';'

the expression must evaluate to a type at compile time. The same is true for the tuple in the grammar rule for an alternative:

alternative:  identifier tuple ';'

Precedence and Associativity

The following precedence and associativity specification is meant to approximate the definitions in the operators and precedence proposal #168 to the extent that is possible in a yacc/bison generated parser. The ordering is from lowest to highest precedence, with operators on the same line having equal precedence. Proposal 168 differs in that the operator groups are partially ordered instead of being totally ordered.

nonassoc '{' '}'
nonassoc ':' ','
left "or" "and"
nonassoc "==" "not"
left '+' '-'
left '.' "->"
nonassoc '(' ')' '[' ']'

Abstract Syntax

The output of parsing is an abstract syntax tree. There are many ways to define abstract syntax. Here we simply use C-style struct definitions. The definition of the abstract syntax is important because it is used in the specification of the semantic analysis and the specification of runtime behavior.

Abstract Syntax for Expressions

enum ExpressionKind { Variable, PatternVariable, Int, Bool,
                      PrimitiveOp, Call, Tuple, Index, GetField,
                      IntT, BoolT, TypeT, FunctionT, AutoT };
enum Operator { Neg, Add, Sub, Not, And, Or, Eq };

struct Expression {
  ExpressionKind tag;
  union {
    struct { string* name; } variable;
    struct { Expression* aggregate; string* field; } get_field;
    struct { Expression* aggregate; Expression* offset; } index;
    struct { string* name; Expression* type; } pattern_variable;
    int integer;
    bool boolean;
    struct { vector<pair<string,Expression*> >* fields; } tuple;
    struct {
      Operator operator_;
      vector<Expression*>* arguments;
    } primitive_op;
    struct { Expression* function; Expression* argument; } call;
    struct { Expression* parameter; Expression* return_type;} function_type;
  } u;
};

The correspondence between most of the grammar rules and the abstract syntax is straightforward. However, the parsing of the field_list deserves some explanation. The fields can be labeled with the grammar rule:

field:  designator '=' pattern

or unlabeled, with the rule

field: pattern

The unlabeled fields are given numbers (represented as strings) for field labels, starting with 0 and going up from left to right.

Regarding the rule for tuples:

tuple:  '(' field_list ')'

if the field list only has a single unlabeled item without a trailing comma, then the parse result for that item is returned. Otherwise a tuple AST node is created containing the parse results for the fields.

In the following grammar rules, if the parse result for tuple is not a tuple (because the field list was a single unlabeled item without a trailing comma), then a tuple is wrapped around the expression to ensure that it is a tuple.

expression: expression tuple
expression: "fnty" tuple return_type
function_definition:
  "fn" identifier tuple return_type '{' statement_list '}'
| "fn" identifier tuple DBLARROW expression ';'
| "fn" identifier tuple return_type ';'
alternative: identifier tuple ';'

Abstract Syntax for Statements

enum StatementKind { ExpressionStatement, Assign, VariableDefinition,
                     If,  Return, Sequence, Block, While, Break, Continue,
                     Match };

struct Statement {
  StatementKind tag;
  union {
    Expression* exp;
    struct { Expression* lhs; Expression* rhs; } assign;
    struct { Expression* pat; Expression* init; } variable_definition;
    struct { Expression* cond; Statement* then_; Statement* else_; } if_stmt;
    Expression* return_stmt;
    struct { Statement* stmt; Statement* next; } sequence;
    struct { Statement* stmt; } block;
    struct { Expression* cond; Statement* body; } while_stmt;
    struct {
      Expression* exp;
      list< pair<Expression*,Statement*> >* clauses;
    } match_stmt;
  } u;
};

Abstract Syntax for Declarations

struct FunctionDefinition {
  string name;
  Expression* param_pattern;
  Expression* return_type;
  Statement* body;
};

enum MemberKind { FieldMember };

struct Member {
  MemberKind tag;
  union {
    struct { string* name; Expression* type; } field;
  } u;
};

struct StructDefinition {
  string* name;
  list<Member*>* members;
};

enum DeclarationKind { FunctionDeclaration, StructDeclaration,
                       ChoiceDeclaration };

struct Declaration {
  DeclarationKind tag;
  union {
    struct FunctionDefinition* fun_def;
    struct StructDefinition* struct_def;
    struct {
      string* name;
      list<pair<string, Expression*> >* alts;
    } choice_def;
  } u;
};

Alternatives considered

Expressions, type expressions, and patterns could instead be defined using completely disjoint grammar productions, but that would require adding more keywords or symbols to avoid ambiguity and there would be a fair amount of repetition in grammar productions.

The proposal does not include declarations for uninitialized variables, leaving that to a later proposal.

In this proposal, assignment is a statement. It could instead be an expression as it is in C and C++. The arguments against assignment-as-an-expression include 1) it complicates reasoning about the ordering of side-effects and 2) it can cause confusion between = and == (SEI CERT C Coding Standard) Visual Studio Warning.

The syntax for variable initialization uses =, which is the same token used in assignment. An alternative would be to use a different syntax such as := to better signal the semantic differences between initialization and assignment. The author does not have a preference on this choice. The := alternative does not introduce any complications regarding ambiguity despite the other uses of : in the grammar.

Inside a choice, an alternative could instead begin with a keyword such as alt, fn, or var, as discussed in the proposal for sum types #157. The author does not have a preference in this regard.

The precedence for and and or are equal, following the suggestion of proposal #168, whereas in C++ && has higher precedence than ||.

The parameters of a function are described using the syntax of a pattern. There could instead be separate syntax that is special to function parameters, as there is in C++.

There are open questions regarding the design of function types, so we could alternatively postpone the specification of the syntax for function types. The choice to include them is based on the idea that Carbon will need something that fills the roles of std::function and C-style function pointers. Also, the features were selected for the basic syntax proposal in a way that aimed to make the proposal complete in the sense that each value-oriented feature (functions, tuples, structures) comes with syntax for 1) creating values, 2) using values, and 3) a type for classifying values.

The parameters of a function type are described using the syntax for type expression. There could instead be separate syntax that is special to the parameters of a function type, as there is in C++.

The keyword for introducing a function type is fnty, which is an arbitrary choice. Other alternatives include fn_type and fn. The fn alternative would conflict with future plans to use fn for lambda expressions. Some dislike of fn_type was voiced, which motivated the choice of fnty. It would be nice to do without a keyword, but as the grammar currently stands, that introduce ambiguity.

The pattern non-terminal is defined in terms of expression. It could instead be the other way around, defining expression in terms of pattern.

Regarding tuples and tuple types, this proposal follows #111 in that there is not a distinct syntax for tuple types. Instead, the intent is that when a tuple of types results from an expression in a context that expects a type, the type system will convert the tuple to a tuple type. An alternative approach has been discussed in which tuple types have a distinct syntax, such as Tuple(Int, Bool).

Rationale

Using code to validate our specification is a really promising direction, and this proposal seems like a good starting point. A reference implementation that’s simple enough to be part of the design iteration process should help us move faster, by quickly uncovering the places where our specifications are ambiguous, syntactically or semantically unsound, or don’t give the behavior we expect. In other words, it will help us keep ourselves honest, even at the proposal stage, which will help us avoid wasting time and effort implementing designs that turn out to be unworkable.

This can be considered as sort of a counterpart to In-progress design overview #83, in that the design specifics are being approved in order to bootstrap the specification process. We aren’t necessarily adopting the specific syntax and semantics expressed by this proposal, and those choices will need to be presented and justified from scratch by future proposals.

This decision is deferring the implementation to code review. The specific tooling used to implement the syntax checker, such as Bison, is a detail which may be changed, now or later, without requiring a proposal for core team review.