Subscript syntax and semantics

Pull request

Table of contents

Abstract

Adds support for subscripting using the conventional square-bracket syntax, with support for both array-like and slice-like semantics.

Cloned from #1356, which was cloned from #1059.

Problem

Carbon needs a convenient way of indexing into objects like arrays and slices.

Background

For purposes of this proposal, a “slice” is an object that refers to some region of an array, and provides access to it, such as a std::string_view or std::span in C++. Slices have subtly different indexing API semantics than arrays: indexing into an array can produce an l-value only if the array itself is an l-value, but indexing into a slice can produce an l-value regardless of the slice’s value category.

Interfaces are Carbon’s only static open extension mechanism, so user-defined indexing must be defined in terms of interfaces. In order to support definition checking of generic functions, it must be possible to fully typecheck indexing operations using only the information exposed by those interfaces. And in order to provide coherence, a given indexing expression must execute the same code regardless of how much type information we have about it, so long as we have enough type information to establish that the expression is valid.

The primary problem this proposal needs to solve is how to support both array-like and slice-like indexing semantics while satisfying those requirements.

Proposal

Carbon will support indexing using the conventional a[i] subscript syntax. When a is an l-value, the result of subscripting will always be an l-value, but when a is an r-value, the result can be an l-value or an r-value, depending on which interface the type implements:

  • If subscripting an r-value produces an r-value result, as with an array, the type should implement IndexWith.
  • If subscripting an r-value produces an l-value result, as with a slice, the type should implement IndirectIndexWith.

IndirectIndexWith is a subtype of IndexWith, and subscript expressions are rewritten to method calls on IndirectIndexWith if the type is known to implement that interface, or to method calls on IndexWith otherwise.

A type can implement at most one of these two interfaces. IndirectIndexWith provides a final blanket impl of IndexWith, which ensures coherence.

One implication of this design is that the requirement of coherence must apply to value categories as well as types (and for those purposes, “l-value” is a subcategory of “r-value”). This implies that for any valid expression containing an r-value sub-expression, replacing that sub-expression with an l-value expression cannot change the validity or semantics of the outer expression, so long as the r-value expression has the same type and evaluates to the same value. This, in turn, ensures that once we know a value implements IndexWith, additionally learning that it implements IndirectIndexWith can’t invalidate code that previously typechecked, or change its semantics.

Details

A subscript expression has the form “lhs [ index ]”. As in C++, this syntax has the same precedence as ., ->, and function calls, and associates left-to-right with all of them.

Its semantics are defined in terms of the following interfaces:

interface IndexWith(SubscriptType:! Type) {
  let ElementType:! Type;
  fn At[me: Self](subscript: SubscriptType) -> ElementType;
  fn Addr[addr me: Self*](subscript: SubscriptType) -> ElementType*;
}

interface IndirectIndexWith(SubscriptType:! Type) {
  impl as IndexWith(SubscriptType);
  fn Addr[me: Self](subscript: SubscriptType) -> ElementType*;
}

A subscript expression where lhs has type T and index has type I is rewritten based on the value category of lhs and whether T is known to implement IndirectIndexWith(I):

  • If T implements IndirectIndexWith(I), the expression is rewritten to “*(( lhs ).(IndirectIndexWith(I).Addr)( index ))”.
  • Otherwise, if lhs is an l-value, the expression is rewritten to “*(( lhs ).(IndexWith(I).Addr)( index ))”.
  • Otherwise, the expression is rewritten to “( lhs ).(IndexWith(I).At)( index )”.

On their own, these rules would oblige IndirectIndexWith types to define three methods to support a single syntax. Worse, they would permit those types to define those methods in inconsistent ways, which would violate coherence. To avoid those problems, IndirectIndexWith provides a blanket final impl for IndexWith:

final external impl forall
    [SubscriptType:! Type, T:! IndirectIndexWith(SubscriptType)]
    T as IndexWith(SubscriptType) {
  let ElementType:! Type = T.(IndirectIndexWith(SubscriptType)).ElementType;
  fn At[me: Self](subscript: SubscriptType) -> ElementType {
    return *(me.(IndirectIndexWith(SubscriptType).Addr)(index));
  }
  fn Addr[addr me: Self*](subscript: SubscriptType) -> ElementType* {
    return me->(IndirectIndexWith(SubscriptType).Addr)(index);
  }
}

Thus, a type that implements IndirectIndexWith need not, and cannot, provide its own definitions of IndexWith.At and IndexWith.Addr.

Examples

An array type could implement subscripting like so:

class Array(template T:! Type) {
  external impl as IndexWith(like i64) {
    let ElementType:! Type = T;
    fn At[me: Self](subscript: i64) -> T;
    fn Addr[addr me: Self*](subscript: i64) -> T*;
  }
}

And a slice type could look like this:

class Slice(T:! Type) {
  external impl as IndirectIndexWith(like i64) {
    let ElementType:! Type = T;
    fn Addr[me: Self](subscript: i64) -> T*;
  }
}

We can even approximate support for indexing on tuple types. We don’t have enough of a design for metaprogramming or variadics to specify it generically, but we can illustrate how it would look for a particular tuple type, such as (i8, f64):

external impl (i8, f64) as IndexWith(typeof(0)) {
  let ElementType:! Type = i8;
  fn At[me: Self](0) -> i8;
  fn Addr[addr me: Self*](0) -> i8*;
}

external impl (i8, f64) as IndexWith(typeof(1)) {
  let ElementType:! Type = f64;
  fn At[me: Self](1) -> f64;
  fn Addr[addr me: Self*](1) -> f64*;
}

However, this approach presupposes that every integer literal has a distinct type, which is not yet settled. It also only supports indexing with integer literals, not with constant values of ordinary integer types. This is somewhat contrary to Carbon’s metaprogramming philosophy: Carbon treats types as values in order to make metaprogramming more like ordinary programming, but here we’re doing the reverse, treating values as types. Consequently, the eventual design for tuple-style indexing may look very different.

Rationale based on Carbon’s goals

Indexing is a fundamental operation in procedural programming, and is especially common in performance-critical code, so supporting it with a concise, familiar syntax helps us make code easy to read, understand, and write, and support performance-critical software. Using the same syntax as C++, even for slice-like types, supports interoperability with and migration from existing C++ code.

Alternatives considered

Different subscripting syntaxes

The reason that slices have different semantics from arrays is that a slice represents an indirection, like a pointer. Indexing behaves the same for slice l-values and r-values for the same reason that dereferencing behaves the same for pointer l-values and r-values. Thus, our attempts to make a single subscripting syntax work for both arrays and slices are arguably analogous to trying to make the same method-call syntax work for both pointers and objects. Or, to put it another way, the difficulty with slices is that they behave like references.

From that point of view, it would be more consistent to have different subscripting syntaxes for the two cases. The most naive version of that would be to say that only array-like types support subscripting, and slice-like types should instead support dereferencing, so indexing into a slice would look like (*s)[i]. However, that would be syntactically onerous, and it’s not clear how to define the type of *s in a way that doesn’t beg the question.

Instead, we could define a distinct syntax for slice subscripting, such as s*[i], and define separate interfaces for the two subscript syntax. The two interfaces would be very similar to IndexWith and IndirectIndexWith, but neither would extend the other, and the rewrite from subscript syntax to method calls would not depend on type information (although it would depend on value category). However, any such syntax would be quite novel and unfamiliar, and there seem to be few good choices for how to spell it. s*[i] is almost the only syntax that seems both viable and somewhat mnemonic, but it would add further complexity to the already-fraught parsing of * (for humans as well as tools).

Multiple indices

This proposal does not support multiple comma-separated indices (such as a[i, j]), which is desirable for types like multidimensional arrays. We do support syntax like a[(i, j)], which is a single index whose type is a tuple, but the extra parens are syntactically noisy. We could add those parens implicitly, but that would effectively move the syntactic noise to the implementation, even for the single-index case (for example impl Foo as IndexWith((i64,))). It should be much cleaner to make the interfaces and their methods variadic, once we have a design for variadics.

Read-only subscripting

This proposal does not provide an obvious path for supporting types like C++’s std::string_view or std::span<const T>, whose subscripting operations expose read-only access to their contents. It is tempting to try to extend this proposal to support those use cases, both because of their inherent importance and because it already has to deal with read-only versus read-write access in order to support array r-values.

However, there’s a fundamental difference between the true immutability of something like an array r-value, and the contextual lack of mutable access provided by something like string_view. While value categories can express the former, they are not well-suited to expressing the latter. To address these use cases, Carbon will probably need something like C++’s const type system, but that should be largely orthogonal to this proposal.

Rvalue-only subscripting

This proposal does not support subscripting operations that can’t produce l-values. In particular, this means it does not support using subscript syntax to form slices, as in Python’s a[i:j] or Swift’s a[i...j]. To support this, we would need a separate pair of interfaces that return by value:

interface RValueIndexWith(SubscriptType:! Type) {
  let ElementType:! Type;
  fn Addr[addr me: Self*](subscript: SubscriptType) -> ElementType;
}

interface RValueIndirectIndexWith(SubscriptType:! Type) {
  extends RValueIndexWith(SubscriptType);
  let ElementType:! Type;
  fn Addr[me: Self](subscript: SubscriptType) -> ElementType;
}

final external impl [SubscriptType:! Type,
                     T:! RValueIndirectIndexWith(SubscriptType)]
    T as IndexWith(SubscriptType) {
  let ElementType:! Type =
      T.(RValueIndirectIndexWith(SubscriptType)).ElementType;
  fn Addr[addr me: Self*](subscript: SubscriptType) -> ElementType {
    return me->Addr(subscript);
  }
}

Note that we still need a pair of interfaces, and a blanket final impl to enforce consistency, because arrays and slices have different semantics in this context as well: taking a slice of an r-value array is invalid, because taking a slice is equivalent to (and presumably implemented in terms of) taking an address. On the other hand, taking a slice of an r-value slice is valid and should be supported.

We would likewise need to extend the rewrite rules for subscript syntax to detect and use implementations of these interfaces. This should not lead to a combinatorial explosion of cases, though; if T implements both IndexWith(I) and RValueIndexWith(I), the program should be ill-formed due to ambiguity.

However, it’s not clear if this would provide enough benefit to justify the added complexity.

Map-like subscripting

This proposal does not support subscripting operations that insert new elements into a collection, as in C++’s std::map and std::unordered_map, because it requires subscriptable types to support subscripting of r-values, which are immutable. We could support this with an additional interface for types that are subscriptable only as l-values, and a corresponding extension to the rewrite rules.

However, it’s debatable whether such insertion behavior is desirable; it has not been a clear-cut success in C++. Code like x = m[i]; looks like it reads from m, and the fact that it can write to m is surprising, and easy for even experienced programmers to overlook. Even for wary readers, it doesn’t convey the author’s intent, because it’s not clear whether the author assumed i is present, or is relying on the implicit insertion. Furthermore, the implicit insertion means that x = m[i]; won’t compile when m is const. On the other hand, while it’s relatively unsurprising that m[i] = x; might insert, that insertion is also potentially inefficient, since it must default-construct a new value before assigning x to it.

We could fix most of these problems by giving special treatment to syntax of the form m[i] = x;, and defining separate methods for it:

interface IndexWith(SubscriptType:! Type) {
  let ElementType:! Type;
  fn At[me: Self](subscript: SubscriptType) -> ElementType;
  fn Addr[addr me: Self*](subscript: SubscriptType) -> ElementType*;
  fn Assign[addr me: Self*](subscript: SubscriptType, element: ElementType) {
    (*me->Addr(index)) = element;
  }
}

interface IndirectIndexWith(SubscriptType:! Type) {
  extends IndexWith(SubscriptType);
  let ElementType:! Type;
  fn Addr[me: Self](subscript: SubscriptType) -> ElementType*;
  fn Assign[me: Self](subscript: SubscriptType, element: ElementType) {
    me->Addr(subscript) = element;
  }
}

final external impl [SubscriptType:! Type, T:! IndirectIndexWith(SubscriptType)]
    T as IndexWith(SubscriptType) {
  ...
  fn Assign[addr me: Self*](subscript: SubscriptType, element: ElementType) {
    me->(IndirectIndexWith.AssignAsRValue)(subscript, element);
  }
}

With this approach, expressions of the form m[i] = x would be rewritten to call the appropriate Assign method, while all other subscript expressions are rewritten as in the primary proposal. However, this does add some complexity to both the interfaces and the rewrite rules (for example, we presumably want (m[i]) = x to be treated the same as m[i] = x), so I’m leaving this as a potential future extension.

Usage-based rewriting

As an alternative way of getting std::map-like behavior, we could rewrite subscript expressions to use At instead of Addr when the usage context only needs an r-value, even if the operand is an l-value, in much the same way that Rust rewrites subscript expressions depending on whether the usage expects the result to be mutable. This would enable type authors to make m[i] = x; potentially-inserting while ensuring that x = m[i]; is not, by making Addr but not At potentially-inserting.

However, this would mean that m[i].Method() is potentially inserting if Method takes me by pointer, and would have the same performance drawback as in C++. Furthermore, it would violate the principle (which we have discussed but not yet adopted) that type information should flow from subexpressions to the parent expression, but never the reverse.

A variant of this approach would be to allow the compiler to choose between At and Addr when the operand is an l-value but the usage context only needs an r-value. This might enable the compiler to make the generated code more efficient and/or safer, and might create pressure for libraries to ensure At and Addr are consistent with each other.

However, if the compiler predictably chooses At under those conditions, that could create a temptation for library authors to leverage that implementation detail to try to implement a std::map-like API, by having Addr implicitly insert while At does not. That would have all the same drawbacks as if we supported it intentionally, plus the evolutionary drawbacks of having user code depend on unsupported implementation details.