`char` redesign

Abstract
Problem
Background
Proposal
Details
Rationale
Future work
Alternatives considered

Abstract

Add a char type literal mapping to Core.Char and equivalent to C++’s char.
- 8 bits, unsigned, treated as a single UTF-8 code unit.
Add a Core.CharLiteral type for character literals, similar to Core.IntLiteral.
Allow operations for char and Core.CharLiteral which reinforce the “character” concept, versus an integer value.
Revoke and replace #1964: Character Literals.

Problem

char is an important type due to its common use in C++ code. However, the related proposal #1964: Character Literals has several issues, including:

Lacks a decision for char handling; it is not mentioned in proposal #1964.
- Similarly, decides there are character literals, but more detail is needed for implementation.
Type literal naming no longer reflects naming consensus.
- Char8 seems potentially more equivalent to C++’s char8_t than to char, and for interop purposes these are slightly different types. The same applies to Char16 and Char32.
- As a design direction, we have been lowercasing type literals (such as u8).
Conflicting statements about behavior.
- For example, “Rationale” states that var b: u8 = 'a' + 1 would be supported, while “Operations” states that + is returning a character literal (not a u8).
- For character literals, states “Escape sequences which would result in non-UTF-8 encodings or more than one code point are not included.” However, it goes on to say that let smiley: Char16 = '\u{1F600}' is valid even though 1F600 would require multiple code units in both UTF-8 and UTF-16.
Unclear that it gives us a good UTF plan.
- Does not decide what a single character in a Carbon string is.
- Does not consider interop with C++’s char32_t family of types, or ICU compatibility.

In other words, it’s likely we want something similar to Char32, but it may be named something like Core.Char32 and have slightly different type behaviors than decided in #1964. On the other hand, we need something compatible with the C++ char in order to proceed with basic C++ interop, and #1964 doesn’t provide that.

Background

Proposal #1964: Character Literals is fundamental, and a lot of the underlying thoughts still apply. In particular, we still want character types to be distinct from numeric types.
Proposal #199: String literals is important because we want character and string literals to have mirrored escaping concepts.
Proposal #5448: Carbon <-> C++ Interop: Primitive Types left the question of character type mappings open. This proposal aims to answer it for char.
Issue #5903: Built-in character type questions addressed type questions.
Issue #5922: Built-in character operators addressed operators.

Proposal

The way char will work is:

Add a char type literal.
- Carbon’s str type will use char for elements.
- For interop, map Carbon’s char to C++’s char.
Add a Core.CharLiteral type for character literals, similar to Core.IntLiteral.
Provide operators which are consistent with the character concept.

This proposal additionally revokes and replaces proposal #1964, rather than trying to define which parts we are keeping and which are changing.

Details

Add a `char` type literal

char is intended to offer a basic construct for Carbon’s strings that is both compatible with UTF-8, and has high fidelity with C++ strings.

In support of that, important notes are:

char itself will be a type literal.
char notionally represents a UTF-8 code unit.
- It can contain invalid code units, as long as it remains 8 bits. We do not assume runtime validation.
char will be backed by Core.Char, in the prelude.
- Core.Char will adapt u8.
C++ interoperability will transparently map char and Cpp.char on API boundaries.
- When used with Carbon, C++ char will be unsigned by default (-funsigned-char). A program can switch back to signed (-fno-unsigned-char), and Carbon will maintain interoperability but bits will be interpreted differently in each language.
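To make the signedness point concrete, the following C++ sketch reads one 8-bit pattern both ways. It is an illustration rather than part of the proposal: the function names are hypothetical, and the signed reading assumes a two's-complement target (which C++20 guarantees).

```cpp
#include <cassert>
#include <cstdint>

// Reads an 8-bit pattern as an unsigned char would expose it (0 to 255).
inline int ReadAsUnsignedChar(std::uint8_t bits) {
  return static_cast<unsigned char>(bits);
}

// Reads the same pattern as a signed char would expose it (-128 to 127).
inline int ReadAsSignedChar(std::uint8_t bits) {
  return static_cast<signed char>(bits);
}

// Usage: 0xE9 is one bit pattern with two numeric readings.
inline void DemoSignedness() {
  assert(ReadAsUnsignedChar(0xE9) == 233);
  assert(ReadAsSignedChar(0xE9) == -23);
  // Values up to 0x7F read the same either way.
  assert(ReadAsUnsignedChar(0x41) == ReadAsSignedChar(0x41));
}
```

Only values above 7F differ between the two readings, which is why a program that switches C++ back to signed char keeps working at the bit level while comparisons and arithmetic on those values may disagree across the language boundary.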

Escape sequences

Escape sequences are the same as for a string literal. Only one escape sequence may be provided in a character literal.

Add a `Core.CharLiteral` type for character literals

Core.CharLiteral is the type of a character literal, similar to how Core.IntLiteral is the type of integer literals. It abstractly represents a single Unicode code point. This gives us a compile-time structure for characters that can be typed and referred to in programs.

Semantics of a character literal will be equivalent to a simple string literal, except that:

A character literal has a validated Unicode code point value.
The enclosing character is '.
The contents are precisely one character or escape sequence.
- The \x escape sequence is limited to values up to 7F, where the UTF-8 code unit and Unicode code point values are identical.

An important detail of the character literal type is that it gives us a way to track constant values at compile time. For example, 'a' + 1 has a constant value of 'b'. This means we can diagnose uses of character literals that don’t represent a valid Unicode code point, such as 'a' + 0xFFFFFF.
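The range check behind that diagnostic can be sketched in C++ as follows. This models the idea at run time rather than showing how the Carbon toolchain would implement it: `AddToCodePoint` and `kMaxCodePoint` are hypothetical names, and the sketch assumes that "valid" means only "within 0 to 10FFFF", since the proposal does not spell out surrogate handling.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Largest Unicode code point.
constexpr std::int64_t kMaxCodePoint = 0x10FFFF;

// Models `Core.CharLiteral + <integer>` on a known constant. Returns the new
// code point, or nullopt where a compiler would diagnose an invalid result.
inline std::optional<std::uint32_t> AddToCodePoint(std::uint32_t cp,
                                                   std::int64_t delta) {
  const std::int64_t value = cp;
  assert(value <= kMaxCodePoint);  // Precondition: the input is in range.
  // Compare against the remaining headroom so the sum itself cannot overflow.
  if (delta < -value || delta > kMaxCodePoint - value) {
    return std::nullopt;
  }
  return static_cast<std::uint32_t>(value + delta);
}
```

With this model, adding 1 to 0x61 ('a') yields 0x62 ('b'), while adding 0xFFFFFF yields no value because the sum exceeds 10FFFF.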

Operators

The provided operators are intended to cover the common operations a user would want to perform on characters. It is a non-goal to support use of char as an arbitrary byte or integer: developers should use u8 for that.

In general, char and Core.CharLiteral operators are intended to be mirrors of each other.

Conversion operators

char
- ImplicitAs: None
- ExplicitAs: To/from u8, plus the set of ImplicitAs for u8.
  - For example, u8 has ImplicitAs to u16, so char has ExplicitAs to u16.
Core.CharLiteral
- ImplicitAs: to char only
- ExplicitAs: To/from the set of ImplicitAs for i32 and u32.
  - For example, i32 has ImplicitAs to i64, so Core.CharLiteral has ExplicitAs to i64.
  - For example, i64 does not have ImplicitAs to i32; conversion requires two casts, ((i64_val as i32) as Core.CharLiteral).

Casting from a char to a Core.CharLiteral is not supported.
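As a rough analogy for the char side of these rules, a C++ wrapper around an 8-bit value can allow conversions only when they are spelled out. This is a sketch of the "explicit only" shape, not of Carbon's ImplicitAs/ExplicitAs machinery; `CharModel`, `AsU8`, and `AsU16` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in for Carbon's `char`: adapts an 8-bit unsigned value
// and offers no implicit conversions in either direction.
class CharModel {
 public:
  // Plays the role of `u8_value as char`.
  explicit constexpr CharModel(std::uint8_t unit) : unit_(unit) {}

  // Plays the role of `char_value as u8`.
  constexpr std::uint8_t AsU8() const { return unit_; }

  // Plays the role of `char_value as u16`, available because u8 widens to u16.
  constexpr std::uint16_t AsU16() const { return unit_; }

 private:
  std::uint8_t unit_;
};

// Explicit construction works, but neither direction converts implicitly.
static_assert(std::is_constructible<CharModel, std::uint8_t>::value,
              "explicit conversion from u8 is allowed");
static_assert(!std::is_convertible<std::uint8_t, CharModel>::value,
              "no implicit conversion from u8");
static_assert(!std::is_convertible<CharModel, std::uint8_t>::value,
              "no implicit conversion to u8");

// Usage: every conversion is visible at the call site.
inline void DemoExplicitConversions() {
  const CharModel c(0x61);
  assert(c.AsU8() == 0x61);
  assert(c.AsU16() == 0x61);
}
```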

Comparison operators

char
- EqWith and OrderedWith when both operands are char.
- ImplicitAs should allow substituting one operand with Core.CharLiteral.
Core.CharLiteral
- EqWith and OrderedWith when operands are Core.CharLiteral.

Arithmetic operators

char
- AddWith: char + <integer> -> char (with reversible operands)
  - Equivalent to ((((char as i16) + <integer>) as u8) as char)
- SubWith:
  - char - <integer> -> char (non-reversible operands)
    - Equivalent to ((((char as i16) - <integer>) as u8) as char)
  - char - char -> i32
    - Equivalent to (lhs as i32) - (rhs as i32).
    - ImplicitAs should allow substituting one operand with Core.CharLiteral.
Core.CharLiteral
- AddWith: Core.CharLiteral + <integer> -> Core.CharLiteral (with reversible operands)
- SubWith:
  - Core.CharLiteral - <integer> -> Core.CharLiteral (non-reversible operands)
  - Core.CharLiteral - Core.CharLiteral -> i32
    - Provides a Unicode code point delta.

`char` integer parameters

Arbitrary integers are supported for most of these operations. For example, we want to allow addition of negative numbers, even though the representation of char is unsigned, without requiring additional casts.

Overflow semantics

Operations will use error overflow semantics, similar to signed integers. For example, (('a' as char) + 500) is invalid code because it causes char overflow. That’s why conversions are to signed values (for example, char as i16).
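A C++ model of this behavior, under stated assumptions, looks like the following. `AddToChar` is a hypothetical name, the sketch reports overflow by returning no value (where Carbon would reject or trap), and it widens to `long long` rather than i16 so that any `int` delta is handled without a second overflow.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Models `char + <integer> -> char` with error-on-overflow semantics.
inline std::optional<std::uint8_t> AddToChar(std::uint8_t c, int delta) {
  // Widen to a signed type first, so negative deltas and out-of-range sums
  // are representable before the range check.
  const long long sum = static_cast<long long>(c) + delta;
  if (sum < 0 || sum > 0xFF) {
    return std::nullopt;  // Overflow: the result does not fit in 8 bits.
  }
  return static_cast<std::uint8_t>(sum);
}

// Usage: small deltas in either direction work; 'a' + 500 does not.
inline void DemoCheckedAdd() {
  assert(AddToChar('a', 1).value() == 'b');
  assert(AddToChar('b', -1).value() == 'a');
  assert(!AddToChar('a', 500).has_value());
}
```

Widening to a signed type is what lets a negative delta be applied to an unsigned representation without an extra cast at the call site.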

Preferring i32 returns

In arithmetic, i32 returns are preferred for deltas because they can represent the difference between any two Unicode code points. Even though char is only 8 bits, using i32 for its deltas too creates consistency with Core.CharLiteral.
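The two delta operations can be modeled in C++ as below. The function names are hypothetical; the point of the sketch is that a signed 32-bit result keeps the sign of the difference and covers the full code point range.

```cpp
#include <cassert>
#include <cstdint>

// Models `char - char -> i32`: (lhs as i32) - (rhs as i32).
inline std::int32_t CharDelta(std::uint8_t lhs, std::uint8_t rhs) {
  return static_cast<std::int32_t>(lhs) - static_cast<std::int32_t>(rhs);
}

// Models `Core.CharLiteral - Core.CharLiteral -> i32` on code point values.
inline std::int32_t CodePointDelta(std::uint32_t lhs, std::uint32_t rhs) {
  assert(lhs <= 0x10FFFF && rhs <= 0x10FFFF);  // Both must be code points.
  return static_cast<std::int32_t>(lhs) - static_cast<std::int32_t>(rhs);
}
```

For example, the delta from 'z' down to 'a' is -25, and the widest possible code point delta, 10FFFF in either direction, is about 1.1 million, which is far inside the i32 range.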

Revoke and replace proposal #1964: Character Literals

This revokes proposal #1964 for simplicity. Rather than trying to detail which decisions still apply and which don’t, this proposal assumes that none of the decisions there still apply. We can still point to the rationale in #1964 when explicitly maintaining a decision, but each maintained decision should go through that step.

Rationale

Performance-critical software
- The intent is that Carbon’s main string type privileges UTF-8 over other potential encodings. A char represents a single code unit within that, and is consequently efficient to access. It can also be invalid, meaning we don’t guarantee performing runtime validation for users (avoiding performance overhead), and that users might be able to use it for other encodings.
Software and language evolution
- Core.CharLiteral is designed as a Unicode code point, and even though this design doesn’t include a way to use values over 7F, we anticipate those will be added in the future. It’s being provided as a building block for more elaborate Unicode functionality, including both UTF-16 and UTF-32, even as we prioritize UTF-8.
Code that is easy to read, understand, and write
- Character literal syntax mirrors string literal syntax. The main divergence is that \x80 and higher hex escapes are not supported, because their behavior is potentially ambiguous; rejecting them also serves this goal.
Practical safety and testing mechanisms
- Restricting the set of operators valid for char gives us a way to do different sorts of validation that can be more character-oriented than if we treated it as an arbitrary byte.
- Treating Core.CharLiteral as a valid Unicode character allows us to provide static checking for some operations, such as verifying that 'a' + 1 results in another valid Unicode code point. Further checks follow transitively, including ones that involve char.
Interoperability with and migration from existing C++ code
- Modeling char as a UTF-8 code unit creates behavior which is very similar to C++, but still shifts towards a more character-oriented approach. We do expect some migration friction as a consequence (as use-cases might need either more casts, or to switch to a byte type).

Future work

There’s still significant future work, including:

signed char, unsigned char
char8_t, char16_t, char32_t
UTF-16 and UTF-32 support

It should not be assumed that there’s any restriction on the designs of those features, particularly no restrictions from #1964.

Alternatives considered

Align `char` fully with C++, or make it fully valid

Alternatives were discussed in zygoloid’s comment on #5903.

The comment notes that three options were proposed:

1. char is fully aligned with C++.

   There is no universal convention for what the value in a char means, and the numerical encoding of Unicode characters into char sequences might even be platform-dependent. For example, we might use some code page on Windows, EBCDIC on some IBM targets, and probably UTF-8 everywhere else. Likely the encoding would match what a character literal in C++ code would do for that target. Even when the target normally uses UTF-8, it would be reasonable to use an array of char as the type of the output buffer when transcoding from UTF-8 to some other encoding, and generally an encoded text buffer (in any encoding) would typically be represented as an array of char. It might also be reasonable to use an array of char for things that aren’t necessarily text, such as file contents.

2. char models a UTF-8 code unit, although it may not necessarily be valid, and may appear in a sequence that is not a valid UTF-8 encoding.

   As with the first option, char can represent an integer in [0, 255], although it is not an integer type. Higher-level abstractions would likely (eventually) be provided to represent different views of the code unit sequence as (for example) a sequence of code points or a sequence of graphemes, but the fundamental model exposes the encoding. Functions taking char or char sequences would assume UTF-8 encoding, and would need to consider how to handle invalid chars and invalid char sequences.

3. Use a foundation that enforces Unicode string validity, for some definition of “Unicode string validity”.

   The char type is a Unicode character. Strings would notionally be a sequence of Unicode characters, possibly also maintaining some higher-level string invariants. String indexing, if it exists, would likely treat the string as a sequence of Unicode characters. String invariants would be enforced by type conversion into the string type rather than within the string operations, and certain classes of invalid strings would be unrepresentable.

The rationale, as evaluated, is:

Privilege UTF-8 over other encodings: UTF-8 is typically the best choice for representing text, even when targeting languages where characters are 3 bytes in UTF-8 but 2 in UTF-16, and even on Windows where the system APIs typically operate primarily in UTF-16 or UCS-2. We should create affordances that encourage use of UTF-8 (such as having the char type be conventionally UTF-8).
- Our overall goal to support (only) modern environments and a general desire for consistency and portability argues against supporting non-Unicode encodings for character types.
- Having some convention for the meaning of the value of a char seems important, and the lack of one in C++ has been a notable problem over time, leading to the addition of char8_t et al, which have not been entirely satisfactory solutions due to the existing widespread usage of plain char.
Do not privilege any particular meaning of “validity”: There are many different ways in which you can view a sequence of UTF-8 code units as being valid or invalid. For example: Can a string start with a combining character? Can it have mismatched LRE/RLE/PDF characters in it? Can it be unnormalized, or must it be in NFC, or in NFD? Can it contain unassigned Unicode characters? Can it contain PUA characters? Can it contain non-characters? Picking any set of answers to these questions as being our canonical notion of “validity” is somewhat arbitrary.
Do not privilege any particular level for accessing elements of the string other than code units: There are many different layers of abstraction at which you can interpret the contents of a string. The atoms that users want to interact with, such as glyphs or grapheme clusters in rendering, or combining characters when editing or performing substring searches, aren’t in one-to-one correspondence with Unicode characters any more than they’re in one-to-one correspondence with UTF-8 code units. So it’s not clear that privileging Unicode-character-oriented access (or indeed any of the other higher-level Unicode views) is appropriate. However, code units are in direct correspondence with bytes of memory, which is directly relevant for low-level operations, so there is a reason to provide direct access to byte-level / code-unit-level operations.
- If string indexing operates on Unicode characters, it would either be non-constant-time or would require not storing strings as just a sequence of UTF-8. Having a constant-time indexing operation on strings seems very important (especially for interop and for meeting C++ developers where they are), even though a lot of the desired functionality (perhaps all of it) can be provided with iterator- or cursor-like machinery instead.
Enforcing validity is problematic for existing API structures: Requiring strings to be valid UTF-8 presents difficulties when moving text into or out of other sources. For example, when reading text from a validly-encoded UTF-8 file into a text buffer, one would need to deal with a read that ends in the middle of an encoding of a character. I don’t know how Rust deals with this, but it seems like it would create significant impedance mismatch with C-like buffered I/O utilities. Similarly, when interoperating with C++, it would create friction if our string representation requires strings to be valid UTF-8 encodings.
We can allow additional invariants without requiring them: For a known-to-be-valid UTF-8 sequence, a higher-level abstraction can be built, and similarly, yet-higher-level abstractions can be built for whatever other invariants we want to enforce. So using option 2 rather than option 3 as our foundation doesn’t prevent enforcing invariants in the type system (but nor does it encourage doing so).
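The indexing cost mentioned above can be illustrated with a C++ sketch. Finding where the n-th code point starts in a UTF-8 buffer requires scanning from the beginning, whereas indexing by code unit is a single array access. `IsContinuation` and `CodePointOffset` are hypothetical names, and the sketch assumes the input is valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// UTF-8 continuation bytes have the bit pattern 10xxxxxx.
inline bool IsContinuation(unsigned char unit) {
  return (unit & 0xC0) == 0x80;
}

// Returns the byte offset where the n-th code point (0-based) starts, or
// text.size() when the text holds n or fewer code points. Linear in the
// length of the text, unlike text[i], which is constant time.
inline std::size_t CodePointOffset(std::string_view text, std::size_t n) {
  assert(text.empty() ||
         !IsContinuation(static_cast<unsigned char>(text[0])));
  std::size_t seen = 0;
  for (std::size_t i = 0; i < text.size(); ++i) {
    if (IsContinuation(static_cast<unsigned char>(text[i]))) {
      continue;  // Still inside the previous code point.
    }
    if (seen == n) {
      return i;
    }
    ++seen;
  }
  return text.size();
}
```

For the four-byte text "a", U+00E9, "z" (61 C3 A9 7A), code point 2 starts at byte offset 3, so code point indices and code unit indices diverge as soon as one non-ASCII character appears.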

This proposal is choosing option 2, that char models a UTF-8 code unit without validation. In some sense, option 2 is still “fully aligned with C++”, but with C++’s char8_t rather than with C++’s char.
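The buffered-read concern from the rationale is one place where unvalidated code units matter in practice. As a C++ sketch, a reader that fills fixed-size buffers can detect that a buffer ends partway through a character and carry those bytes over to the next read. `IncompleteTailLength` is a hypothetical helper, and it assumes the bytes before the tail are valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Returns how many bytes at the end of `buffer` belong to a UTF-8 sequence
// that is cut off by the end of the buffer, or 0 if it ends on a boundary.
inline std::size_t IncompleteTailLength(std::string_view buffer) {
  // A truncated sequence has at most 3 bytes: a lead byte plus up to two
  // continuation bytes of a 4-byte sequence.
  const std::size_t limit = buffer.size() < 3 ? buffer.size() : 3;
  for (std::size_t back = 1; back <= limit; ++back) {
    const unsigned char unit =
        static_cast<unsigned char>(buffer[buffer.size() - back]);
    if ((unit & 0xC0) == 0x80) {
      continue;  // Continuation byte: keep looking for the lead byte.
    }
    // `unit` is ASCII or a lead byte; work out the sequence length it starts.
    std::size_t expected = 1;
    if ((unit & 0xE0) == 0xC0) {
      expected = 2;
    } else if ((unit & 0xF0) == 0xE0) {
      expected = 3;
    } else if ((unit & 0xF8) == 0xF0) {
      expected = 4;
    }
    return expected > back ? back : 0;
  }
  return 0;
}

// Usage: the first three bytes of U+1F600 (F0 9F 98 80) form a partial tail.
inline void DemoIncompleteTail() {
  assert(IncompleteTailLength("ab\xF0\x9F\x98") == 3);
  assert(IncompleteTailLength("\xF0\x9F\x98\x80") == 0);
}
```

Under option 2, the partial buffer is still a well-typed sequence of char; under option 3, it could not be held in a string type until the remaining bytes arrive.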

Raw character literals

Raw string literals use a # prefix. There’s limited use for this in character literals; technically, '\\' could instead be #'\'#, but that’s longer and extra characters may prove distracting. Raw string literals are more useful when there’s a longer character sequence, whereas character literals have one character by definition. For simplicity, character literals won’t have raw syntax.

Disallow hex escape sequences in character literals

A \x## escape sequence abstractly represents a UTF-8 code unit. Whereas values over 7F are valid in string literals (allowing arbitrary byte values), they are disallowed in character literals because we want more validated Unicode behavior. Developers could instead use \u escapes in place of \x.

It can still be useful to allow \x escapes for low-range values because some developers will still need to specify ANSI escapes. Carbon drops support for some escape sequences, such as \a, and specifically advises \x as an alternative for developers that need it. Requiring \a -> \x07 -> \u{07} is incrementally more verbose syntax, and developers may be confused why "\x1B" is allowed for strings but '\u{1B}' is required for characters.

Values over 7F are ambiguous between an arbitrary byte value and a Unicode code point, and so should be invalid. However, where both interpretations are identical for UTF-8 (values up to and including 7F), we will allow \x escape sequences.
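The ambiguity can be seen by encoding code points to UTF-8, sketched here in C++. `EncodeUtf8` is a hypothetical helper written for this illustration; it follows the standard UTF-8 bit layout and does not reject surrogates.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encodes one code point as a sequence of UTF-8 code units.
inline std::string EncodeUtf8(std::uint32_t cp) {
  assert(cp <= 0x10FFFF);  // Precondition: within the Unicode range.
  std::string out;
  if (cp < 0x80) {
    // One unit, numerically equal to the code point.
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}
```

Code point 7F encodes as the single unit 7F, so the byte reading and the code point reading of '\x7F' agree. Code point E9 encodes as the two units C3 A9, so '\xE9' would mean either the lone byte E9 or a character that does not fit in one char.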

Allow grapheme clusters in character literals

This proposal carries forward the decision in #1964 to not support grapheme clusters in character literals.

Reuse string literal syntax for character literals

Instead of using single quotes (for example, 'a'), we could use string literal syntax with a conversion (for example, "a" as char) for character literals. This was proposed because it would free up the single quote for other, unspecified syntax uses.

For background, character literals are common in C++. For example, Sourcegraph search statistics give the following match counts (some of these are in comments, which is a search limitation):

'(.|\\.)': 46.2 million
<<: over 100 million
>>: 10.4 million
%: 5.3 million

This creates several disadvantages for removing character literals in Carbon:

Migrating C++ developers to Carbon: The frequency of use can be expected to have trained developers, especially the C++ developers that Carbon is targeting, to expect single quotes to be used for characters. Repurposing them would create friction, because C++ developers would need to understand different meanings for the same syntax in C++ and Carbon, something Carbon prefers to avoid.
Increased runtime error risks: Runtime errors could take the form of simple increased overhead, such as converting a string literal to a str then to a char. However, they could also be more insidious, such as doing [0] on a string literal and not validating that the string is exactly one character (this would also likely return a null byte for ""[0]). By having a character literal type, Carbon encourages developers to stay within guide rails that make it easier to get compile-time behavior and program validation.
Block string literal use: We already have another use for single quotes in Carbon: block string literals. The syntax may need to change along with removing character literals, to make room for other uses of single quotes.
- If retained, it would constrain uses of single quotes. For example, a unary operator syntax has overlap (that is, if 'a and ''a are valid, then '''a is ambiguous).
- The choice of single quotes in proposal #1360: Change raw string literal syntax was made accounting for single quotes in character literals, and that commonality would be lost.
Tooling: The prevalence of single quotes being used for either strings or characters also affects their treatment in tools not specialized to Carbon: they expect them to be used for strings. For example, Rust’s use of single quotes for lifetime annotations has been observed to break language-agnostic syntax highlighting.
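The runtime risk described for string indexing has a direct C++ counterpart, sketched below. Both function names are hypothetical. The unchecked version shows what `[0]` on a string literal does; the checked version shows the validation that a dedicated character literal makes unnecessary at run time.

```cpp
#include <cassert>
#include <optional>
#include <string_view>

// Unchecked: takes the first code unit and silently ignores everything else.
// For "", index 0 is the literal's terminating null.
inline char FirstUnitUnchecked(const char* literal) {
  return literal[0];
}

// Checked: yields a value only when the text is exactly one code unit long.
inline std::optional<char> ExactlyOneUnit(std::string_view text) {
  if (text.size() != 1) {
    return std::nullopt;
  }
  return text[0];
}

// Usage: the unchecked form accepts inputs that are not single characters.
inline void DemoStringIndexing() {
  assert(FirstUnitUnchecked("") == '\0');
  assert(FirstUnitUnchecked("ab") == 'a');
  assert(!ExactlyOneUnit("ab").has_value());
}
```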

While a compelling proposal for a different use of single quotes may come up in the future, freeing up the character for other purposes is insufficient to justify a different syntax for character literals.

Treat single-character string literals as a third “text literal” type

A related alternative with the same goal of eliminating single quotes for character literals is that, rather than requiring single-character string literals be explicitly converted to char, they could instead have a third type of text literal. This would implicitly cast to either str or char.

This approach would lead to three literal types: StrLiteral, CharLiteral, and TextLiteral. The distinction of CharLiteral is important because we still want to support arithmetic on character literals, such as 'a' + 1 (which we would not want to be allowed for StrLiteral).

The existence of a third type would be important for generic code, even when not trying to use character literals. For example:

  fn StoreValue[U:! type](ref a: Optional(U), b: U) {
    a = b;
  }

  fn StrLogic[T:! type](a: T) {
    var x: Optional(T) = a;
    StoreValue(x, "str");
  }

  fn F() {
    StrLogic("a");
  }

Here, T is deduced to be TextLiteral. However, U has no valid value: it’s passed Optional(TextLiteral), while "str" is a StrLiteral (which should not be convertible to TextLiteral). As a consequence, this code is invalid, even though the same code would be valid if there were no TextLiteral type.

Advantages:

Avoids an explicit cast.

Disadvantages:

Shares most of the disadvantages of the primary explicit conversion approach.
- This includes the risk that developers will write "..."[0] instead of "..." as char when they need a character, although the frequency may be reduced.
Having additional types in common literals could lead to programmer errors in deducing generic types, as described above.
Implicit casts cause more operator ambiguity.
- How are operators that have different meanings for string and character literals handled, such as Cpp.std.cout << or <=>?
- In Carbon, we’d probably still want string operators to work; for example, "a" + "b" => "ab", and that can be compile-time. Is "a" + 1 a pointer to the null byte as it is in C++ (similar to &("a"[1])), a character addition ('a' + 1 => 'b'), or does it require an explicit cast in order to ensure behavior is deliberate?
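For reference, the two C++ readings of "add one" mentioned in the last point behave as follows. This is a small sketch with hypothetical function names; it goes through a named pointer rather than writing `"a" + 1` directly, which some compilers warn about.

```cpp
#include <cassert>

// String reading: pointer arithmetic. One past the 'a' in "a" is the
// literal's terminating null.
inline char UnitAfterStringPlusOne() {
  const char* text = "a";
  return *(text + 1);
}

// Character reading: integer arithmetic on the character's value. In C++ the
// result type is int.
inline int CharPlusOne() {
  return 'a' + 1;
}

// Usage: the same "+ 1" spelling gives unrelated results.
inline void DemoPlusOne() {
  assert(UnitAfterStringPlusOne() == '\0');
  assert(CharPlusOne() == 'b');
}
```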
