Pattern matching
Table of contents
- Overview
- Pattern instructions
- Instruction ordering
- Parser-driven pattern block pushing
- Function parameters
Overview
This document focuses on the implementation of pattern matching. See here for more on the design and fundamental concepts.
The SemIR for a pattern-matching operation is emitted in three steps:
- Pattern: Traverse the parse tree of the pattern to emit SemIR that abstractly describes the pattern.
- Scrutinee: Traverse the parse tree of the scrutinee expression to emit SemIR that evaluates it.
- Match: Traverse the pattern SemIR from step 1 (sometimes in conjunction with the scrutinee SemIR) to emit SemIR that actually performs pattern matching.
Pattern instructions
The SemIR emitted in the pattern step primarily consists of pattern instructions, which are instructions that describe the pattern itself. For example, given the pattern (x: i32, y:i32)
, the pattern step might emit the following SemIR:
%x.patt: %pattern_type.7ce = binding_pattern x [concrete]
%y.patt: %pattern_type.7ce = binding_pattern y [concrete]
%.loc4_21: %pattern_type.511 = tuple_pattern (%x.patt, %y.patt) [concrete]
Pattern instructions do not represent executable code, and are generally ignored during lowering. Instead, they descriptively represent the pattern itself as a kind of constant value, and their primary consumer is the match step. The type of a pattern instruction is a pattern type, which is represented by a PatternType
instruction. For example, the constants
block might define the types in the above SemIR like so:
%i32: type = class_type @Int, @Int(%int_32) [concrete]
%pattern_type.7ce: type = pattern_type %i32 [concrete]
%tuple.type: type = tuple_type (%i32, %i32) [concrete]
%pattern_type.511: type = pattern_type %tuple.type [concrete]
We can read this as saying that the type of %x.patt
and %y.patt
is “pattern that matches an i32
scrutinee”, and the type of %.loc4_21
is “pattern that matches a (i32, i32)
scrutinee”.
Pattern instructions are only emitted during the pattern step, but that step can emit non-pattern instructions as well. For example, in a pattern like (x: i32, a + b)
, i32
and a + b
are ordinary expressions, and so their SemIR must be emitted during the initial traversal of the parse tree, as with any other expression.
All the pattern instructions for a given full-pattern are grouped together in a distinct block that contains only pattern instructions. Consequently, Check::Context
maintains pattern_block_stack
as a separate InstBlockStack
for pattern blocks, and provides separate methods like AddPatternInst
for adding instructions to it.
Instruction ordering
The SemIR produced in the first two steps is (like most SemIR) generally in post-order, reflecting the order of the parse tree. However, the match step traversal is performed pre-order, starting with the root instruction of the pattern and traversing into its dependencies.
In some cases it is necessary for the pattern step to allocate instructions that won’t actually be emitted until the match step, because they are responsible for performing pattern matching. When that happens, they are allocated but not added to a block, and their IDs are stored in the Check::Context
so that they can be spliced into the current block at the appropriate point in the match step.
Currently this happens in two cases, which are handled using two maps in Check::Context
from pattern instruction IDs to the corresponding match instruction IDs:
- A name binding can be used within the same pattern that declares it:
match (x) { case (n: i32, n) => ...
For this to work, the name
n
needs to be added to the scope as soon as we handle its declaration, and it needs to resolve to theBindName
instruction that binds a value to that name. This means that theBindName
instruction needs to be allocated during the pattern step, even though it is part of matching, not part of the pattern.Context::bind_name_map
stores theseBindName
s, keyed by the correspondingBindingPattern
instruction. - A
var
pattern allocates storage during matching, which is represented by aVarStorage
instruction. This instruction must be allocated during the pattern step, so that it can be used as the output parameter of scrutinee expression evaluation during the scrutinee step.Context::var_storage_map
stores theseVarStorage
instructions, keyed by the correspondingVarPattern
instruction.
As noted earlier, the pattern step can also emit non-pattern instructions to evaluate expressions that are embedded in the pattern, such as the type expressions of binding patterns, and expressions that are used as patterns themselves (although those have not been implemented yet). The parse tree doesn’t mark these situations in advance: any given subpattern might turn out to be one that emits non-pattern instructions. To handle these situations, we speculatively push an instruction block onto the (non-pattern) stack whenever we are about to begin handling a subpattern, and then pop it at the end of the subpattern, with different treatment depending on whether the subpattern turned out to be a subexpression. This is handled by BeginSubpattern
, EndSubpatternAsExpr
, and EndSubpatternAsNonExpr
.
One further complication here is that the type expression can contain control flow (such as an if
expression). Consequently, we can’t represent the type expression SemIR as a single block; instead, we represent the SemIR for a given type expression as a single-entry, single-exit (SE/SE) region, potentially consisting of multiple blocks.
Note: The original motivation for rigorously excluding non-pattern instructions from the pattern block may no longer apply. In particular, it may make sense to put non-pattern instructions in the pattern block when they represent an expression that is part of the pattern. If so, substantial parts of this design might change. See issue #5351.
Parser-driven pattern block pushing
At the same time as all of that, we have to manage the pattern block stack as well. We attempt to do this precisely rather than speculatively, by leveraging the parser to precisely mark the nodes immediately before full-patterns, and pushing the pattern block stack when we handle those nodes. We then rely on signals from both the parser and the node stack to determine when to pop from the pattern block stack.
In the case of let
and var
decls, this is fairly straightforward: the beginning is marked by the LetIntroducer
or VarIntroducer
node, and the end is marked by the LetInitializer
or VarInitializer
, or by the VarDecl
in the case of a var
decl with no initializer. Similarly, the beginning of an impl forall
parameter list is marked by the Forall
node, and the end is marked by the ImplDecl
or ImplDefinitionStart
.
The case of a parameterized name (such as Bar(y: i32)
) is more challenging. The node immediately before the start of the full-pattern is an identifier, but an identifier doesn’t necessarily mark the start of a full-pattern. We’ve solved that by having the parser mark identifier nodes that are followed by full-patterns (using lookahead). Rather than use additional storage for what is logically a single bit of data, we effectively smuggle that bit into the kind enum by having separate node kinds IdentifierNameBeforeParams
and IdentifierNameNotBeforeParams
.
If the parameterized name is a name qualifier (such as the first part of Foo(X:! i32).Bar(y: i32)
), the node immediately after it will be the qualifier node. As of this writing, we bifurcate qualifier nodes into NameQualifierWithParams
and NameQualifierWithoutParams
, much like we do with identifier names, but we don’t actually use that information, and instead use the presence of parameters on the node stack to determine whether to pop the pattern block stack.
Open question: should we re-combine the two qualifier node kinds?
If the parameterized name is not part of a name qualifier, the node immediately after it will be a *Decl
or *DefinitionStart
node of the appropriate kind (for example FunctionDecl
or FunctionDefinitionStart
if the introducer was fn
). Note that this means the pattern block is still on the stack while handling the return type of a function. This is intentional, because we model the return type as declaring an output parameter (see below), which makes it functionally part of the parameter pattern.
Function parameters
Call
parameters and arguments
SemIR models a function call as a Call
instruction, which has an instruction block consisting of one instruction per argument. Correspondingly, the SemIR representation of a function has a block consisting of one instruction per parameter. We refer to these as Call
arguments and Call
parameters, because they don’t necessarily correspond to the colloquial meaning of “arguments” and “parameters” (which are sometimes referred to as syntactic arguments and parameters).
For example, consider this function:
fn F(T:! type, U:! type) -> Core.String;
The Call
instruction is a runtime-phase operation, so it notionally runs after compile-time parameters have already been bound to values. As a result, a Call
instruction calling F
does not pass values for either T
or U
. On the other hand, it does pass a reference to the storage that F
should construct the return value in. So although we would colloquially say that F
takes two parameters of type type
, it has a single Call
parameter of type Core.String
.
If Carbon supports general patterns in function parameter lists, that introduces additional ways that Call
parameters can diverge from the colloquial meaning. For example:
fn G(x: i32, var (y: i32, z: i32));
fn H(x: i32, (y: i32, var z: i32));
A var
pattern converts the scrutinee to a durable reference expression, and then performs further pattern matching on the object it refers to. As a result, G
has two Call
parameters: a value corresponding to x
, and a reference to an object of type (i32, i32)
, corresponding to both y
and z
. On the other hand, H
has 3 Call
parameters: values corresponding to x
and y
, and a reference corresponding to z
.
Caller and callee matching
The Call
parameters define the API boundary between the caller and callee at the SemIR level. As a result, responsibility for matching the arguments against the parameter list is split between the caller and the callee. Continuing the example from above, given the call G(0, (x, y))
, the caller is responsible for converting 0
to i32
, and for initializing a new (i32, i32)
object from (x, y)
, but the callee is responsible for binding the name x
to its first Call
parameter, and for destructuring its second Call
parameter and binding the names y
and z
to its elements.
In SemIR we represent this situation with special ParamPattern
instructions, which mark the boundary: there is exactly one ParamPattern
instruction for each Call
parameter, which matches the entire corresponding Call
argument. The subpatterns of the ParamPattern
s are matched on the callee side, and everything above them is matched on the caller side. There are multiple kinds of ParamPattern
instruction, which correspond to different ways of passing a parameter (such as by reference or by value).
When performing callee-side pattern matching, we do not have an actual scrutinee expression. Instead, for each ParamPattern
instruction we generate a corresponding Param
instruction, which reads from the corresponding entry in the Call
argument list, and we use that as the scrutinee of the ParamPattern
. Every ParamPattern
kind has a corresponding Param
kind.
The return slot
If a function has a declared return type, the function takes an additional Call
parameter, which points to the storage that should be initialized with the return value. This Call
parameter is represented as an OutParamPattern
instruction with a ReturnSlotPattern
instruction as a subpattern. The ReturnSlotPattern
also represents the return type declaration itself, such as in FunctionFields
. The SemIR that matches these patterns consists of a ReturnSlot
instruction, which binds the special name NameId::ReturnSlot
to the OutParam
instruction representing the storage passed by the caller.
This structure is analogous to the handling of an ordinary by-value parameter, which is represented in the Call
parameters as a ValueParamPattern
instruction with a BindingPattern
, and in the pattern-matching SemIR as a BindName
instruction that binds the parameter name to the ValueParam
instruction representing the argument passed by the caller.
Note that if the return type does not have an in-place value representation (meaning that the return value should not be passed in memory), these instructions will all still be generated, but the SemIR for return
statements will not access the ReturnSlot
, and the Call
argument list will not contain an argument corresponding to the OutParamPattern
(and so it will be one element shorter than the Call
parameter list). However, the ReturnSlotPattern
is still used, in its other role as a representation of the return type declaration. This leads to a potentially confusing situation, where the term “return slot” sometimes refers to the ReturnSlotPattern
(for example in FunctionFields::return_slot_pattern
), which is present for any function with a declared return type, and sometimes refers to the actual storage provided by the caller (for example in ReturnTypeInfo::has_return_slot
), which is present only if the return type has an in-place value representation.
TODO: When the return type isn’t in-place, the
OutParamPattern
should probably not be in theCall
parameter list (for consistency with theCall
argument list), and possibly theOutParamPattern
,OutParam
, andReturnSlot
instructions should not be emitted in the first place. Furthermore, we should find a way to resolve the inconsistent “return slot” terminology.