char redesign
Abstract
- Add a char type literal mapping to Core.Char and equivalent to C++’s char.
  - 8 bits, unsigned, treated as a single UTF-8 code unit.
- Add a Core.CharLiteral type for character literals, similar to Core.IntLiteral.
- Allow operations for char and Core.CharLiteral which reinforce the “character” concept, versus an integer value.
- Revokes and replaces #1964: Character Literals.
Problem
char is an important type due to its common use in C++ code. However, the related proposal #1964: Character Literals has several issues, including:
- Lacks a decision for char handling; it is not mentioned in proposal #1964.
  - Similarly, decides there are character literals, but more detail is needed for implementation.
- Type literal naming no longer reflects naming consensus.
  - Char8 seems potentially more equivalent to std::char8_t instead of char, and for interop purposes these are slightly different types. Similar applies to Char16 and Char32.
  - As a design direction, we have been lowercasing type literals (such as u8).
- Conflicting statements about behavior.
  - For example, “Rationale” states that var b: u8 = 'a' + 1 would be supported, while “Operations” states that + is returning a character literal (not a u8).
  - For character literals, states “Escape sequences which would result in non-UTF-8 encodings or more than one code point are not included.” However, it goes on to say that let smiley: Char16 = '\u{1F600}' is valid even though 1F600 would require multiple code units in both UTF-8 and UTF-16.
- Unclear that it gives us a good UTF plan.
  - Does not decide what a single character in a Carbon string is.
  - No consideration regarding interop with the std::char32_t family of types or ICU compatibility.
In other words, it’s likely we want something similar to Char32, but it may be named something like Core.Char32 and have slightly different type behaviors than decided in #1964. On the other hand, we need something compatible with the C++ char in order to proceed with basic C++ interop, and #1964 doesn’t provide that.
Background
- Proposal #1964: Character Literals is fundamental, and a lot of the underlying thoughts still apply. In particular, we still want character types to be distinct from numeric types.
- Proposal #199: String literals is important because we want character and string literals to have mirrored escaping concepts.
- Proposal #5448: Carbon <-> C++ Interop: Primitive Types left the question of character type mappings open. This proposal aims to answer it for char.
- Issue #5903: Built-in character type questions addressed type questions.
- Issue #5922: Built-in character operators addressed operators.
Proposal
The way char will work is:
- Add a char type literal.
  - Carbon’s str type will use char for elements.
  - For interop, map Carbon’s char to C++’s char.
- Add a Core.CharLiteral type for character literals, similar to Core.IntLiteral.
- Provide operators which are consistent with the character concept.
This proposal additionally revokes and replaces proposal #1964, rather than trying to define which parts we are keeping and which are changing.
Details
Add a char type literal
char is intended to offer a basic construct for Carbon’s strings that is both compatible with UTF-8, and has high fidelity with C++ strings.
In support of that, important notes are:
- char itself will be a type literal.
- char notionally represents a UTF-8 code unit.
  - It can contain invalid code units, as long as it remains 8 bits. We do not assume runtime validation.
- char will be backed by Core.Char, in the prelude.
  - Core.Char will adapt u8.
- C++ interoperability will transparently map char and Cpp.char on API boundaries.
  - When used with Carbon, C++ char will be unsigned by default (-funsigned-char). A program can switch back to signed (-fno-unsigned-char), and Carbon will maintain interoperability but bits will be interpreted differently in each language.
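To illustrate what “bits will be interpreted differently in each language” means, here is a minimal C++ sketch. It takes one 8-bit pattern and reads it both as an unsigned value (the view this proposal gives char) and as a signed value (the view C++ has when plain char is signed). The function names are hypothetical and exist only for this example; the proposal does not define them.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reads an 8-bit pattern as an unsigned value, which is how a char that
// adapts u8 would see it.
inline int AsUnsignedView(std::uint8_t bits) {
  return static_cast<int>(bits);
}

// Reads the same 8-bit pattern as a signed value, which is how C++ sees a
// plain char when it is signed.
inline int AsSignedView(std::uint8_t bits) {
  std::int8_t value;
  std::memcpy(&value, &bits, sizeof(value));  // Copy the bits unchanged.
  return static_cast<int>(value);
}

// Hypothetical usage: 0xC3 is the first UTF-8 code unit of U+00E9.
inline void SignednessExample() {
  assert(AsUnsignedView(0xC3) == 195);
  assert(AsSignedView(0xC3) == -61);
  // Values up to 0x7F read the same either way.
  assert(AsUnsignedView(0x41) == AsSignedView(0x41));
}
```

Only code units above 7F are affected, so ASCII-only data compares and orders the same under both settings.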
Escape sequences
Escape sequences are the same as for a string literal. Only one escape sequence may be provided in a character literal.
Add a Core.CharLiteral type for character literals
Core.CharLiteral is the type of a character literal, similar to how Core.IntLiteral is the type of integer literals. It abstractly represents a single Unicode code point. This gives us a compile-time structure for characters that can be typed and referred to in programs.
Semantics of a character literal will be equivalent to a simple string literal, except that:
- A character literal has a validated Unicode code point value.
- The enclosing character is '.
- The contents are precisely one character or escape sequence.
  - The \x escape sequence is limited to values up to 7F, where the UTF-8 code unit and Unicode code point values are identical.
An important detail of the character literal type is that it gives us a way to track constant values at compile time. For example, 'a' + 1 has a constant value of 'b'. This means we can diagnose uses of character literals that don’t represent a valid Unicode code point, such as 'a' + 0xFFFFFF.
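The compile-time tracking described above can be modeled in C++ with constexpr evaluation. This is only a sketch of the idea, not Carbon's implementation: IsValidCodePoint and AddToCharLiteral are hypothetical names, an empty optional stands in for a compile-time diagnostic, and validity is assumed to mean the range 0 to 10FFFF (the proposal does not say how surrogates are treated, so they are not rejected here).

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Assumption: "valid Unicode code point" means a value in [0, 0x10FFFF].
constexpr bool IsValidCodePoint(std::int64_t value) {
  return value >= 0 && value <= 0x10FFFF;
}

// Models `literal + offset` on a character literal. Returns the resulting
// code point, or an empty optional where Carbon would issue a diagnostic.
constexpr std::optional<char32_t> AddToCharLiteral(char32_t literal,
                                                   std::int64_t offset) {
  assert(IsValidCodePoint(literal));  // Precondition: literals are validated.
  const std::int64_t result = static_cast<std::int64_t>(literal) + offset;
  if (!IsValidCodePoint(result)) {
    return std::nullopt;
  }
  return static_cast<char32_t>(result);
}

// Both checks are evaluated entirely at compile time.
static_assert(*AddToCharLiteral(U'a', 1) == U'b');
static_assert(!AddToCharLiteral(U'a', 0xFFFFFF).has_value());
```

Because the value is a constant, the out-of-range case is caught before the program runs, which is the property the literal type is meant to provide.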
Operators
The provided operators are intended to cover the common operations a user would want to perform on characters. It is a non-goal to support use of char as an arbitrary byte or integer: developers should use u8 for that.
In general, char and Core.CharLiteral operators are intended to be mirrors of each other.
Conversion operators
- char
  - ImplicitAs: None
  - ExplicitAs: To/from u8, plus the set of ImplicitAs for u8.
    - For example, u8 has ImplicitAs to u16, so char has ExplicitAs to u16.
- Core.CharLiteral
  - ImplicitAs: to char only
  - ExplicitAs: To/from the set of ImplicitAs for i32 and u32.
    - For example, i32 has ImplicitAs to i64, so Core.CharLiteral has ExplicitAs to i64.
    - For example, i64 does not have ImplicitAs to i32; conversion requires two casts, ((i64_val as i32) as Core.CharLiteral).
Casting from a char to a Core.CharLiteral is not supported.
See also implicit numeric conversions.
Comparison operators
- char
  - EqWith and OrderedWith when both operands are char.
  - ImplicitAs should allow substituting one operand with Core.CharLiteral.
- Core.CharLiteral
  - EqWith and OrderedWith when operands are Core.CharLiteral.
Arithmetic operators
- char
  - AddWith: char + <integer> -> char (with reversible operands)
    - Equivalent to ((((char as i16) + <integer>) as u8) as char)
  - SubWith: char - <integer> -> char (non-reversible operands)
    - Equivalent to ((((char as i16) - <integer>) as u8) as char)
  - char - char -> i32
    - Equivalent to (lhs as i32) - (rhs as i32).
    - ImplicitAs should allow substituting one operand with Core.CharLiteral.
- Core.CharLiteral
  - AddWith: Core.CharLiteral + <integer> -> Core.CharLiteral (with reversible operands)
  - SubWith: Core.CharLiteral - <integer> -> Core.CharLiteral (non-reversible operands)
  - Core.CharLiteral - Core.CharLiteral -> i32
    - Provides a Unicode code point delta.
char integer parameters
Arbitrary integers are supported for most of these operations. For example, we want to allow addition of negative numbers, even though the representation of char is unsigned, without requiring additional casts.
Overflow semantics
Operations will use error overflow semantics, similar to signed integers. For example, (('a' as char) + 500) is invalid code because it causes char overflow. That’s why conversions are to signed values (for example, char as i16).
Preferring i32 returns
In arithmetic, i32 returns are preferred for deltas because i32 can represent the difference between any two Unicode code points. Even though char is only 8 bits, using i32 for returns there too creates consistency with Core.CharLiteral.
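The arithmetic rules in this section can be sketched as a small C++ model. Everything here is an assumption made for illustration: Char is a hypothetical stand-in for Carbon's char, an empty optional stands in for “error” overflow semantics, and the model widens to 64 bits (rather than i16) so that the C++ arithmetic itself cannot overflow for any 32-bit offset.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical stand-in for Carbon's char: one unsigned 8-bit code unit.
struct Char {
  std::uint8_t unit;
};

// Models `char + <integer>`: widen to a signed type, add, then narrow back.
// An empty result stands in for the overflow error.
constexpr std::optional<Char> Add(Char c, std::int32_t offset) {
  const std::int64_t wide = static_cast<std::int64_t>(c.unit) + offset;
  if (wide < 0 || wide > 0xFF) {
    return std::nullopt;
  }
  return Char{static_cast<std::uint8_t>(wide)};
}

// Models `char - <integer>` with the same widening and range check.
constexpr std::optional<Char> Sub(Char c, std::int32_t offset) {
  const std::int64_t wide = static_cast<std::int64_t>(c.unit) - offset;
  if (wide < 0 || wide > 0xFF) {
    return std::nullopt;
  }
  return Char{static_cast<std::uint8_t>(wide)};
}

// Models `char - char -> i32`: a signed delta between two code units.
constexpr std::int32_t Delta(Char lhs, Char rhs) {
  return static_cast<std::int32_t>(lhs.unit) -
         static_cast<std::int32_t>(rhs.unit);
}

// Hypothetical usage of the model.
inline void CharArithmeticExample() {
  assert(Add(Char{'a'}, 1)->unit == 'b');
  assert(Add(Char{'b'}, -1)->unit == 'a');  // Negative offsets need no cast.
  assert(Sub(Char{'b'}, 1)->unit == 'a');
  assert(!Add(Char{'a'}, 500).has_value());  // Overflow is an error.
  assert(Delta(Char{'a'}, Char{'z'}) == -25);
}
```

Widening to a signed type before the operation is what lets a negative offset or a negative delta work without extra casts, even though the stored value is unsigned.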
Revoke and replace proposal #1964: Character Literals
This revokes proposal #1964 for simplicity. Rather than trying to detail which decisions still apply and which don’t, this proposal assumes that none of the decisions there still apply. When we explicitly maintain a decision, we can still benefit by pointing towards its rationale in #1964, but we want each decision to go through that step.
Rationale
- Performance-critical software
  - The intent is that Carbon’s main string type privileges UTF-8 over other potential encodings. A char represents a single code unit within that, and is consequently efficient to access. It can also be invalid, meaning we don’t guarantee performing runtime validation for users (avoiding performance overhead), and that users might be able to use it for other encodings.
- Software and language evolution
  - Core.CharLiteral is designed as a Unicode code point, and even though this design doesn’t include a way to use values over 7F, we anticipate those will be added in the future. It’s being provided as a building block for more elaborate Unicode functionality, including both UTF-16 and UTF-32, even as we prioritize UTF-8.
- Code that is easy to read, understand, and write
  - Character literal syntax mirrors string literal syntax. The main divergence is \x80 and higher similar escapes, which are not supported due to potentially ambiguous behavior, still in furtherance of this goal.
- Practical safety and testing mechanisms
  - Restricting the set of operators valid for char gives us a way to do different sorts of validation that can be more character-oriented than if we treated it as an arbitrary byte.
  - Treating Core.CharLiteral as a valid Unicode character allows us to provide static checking for some operations, such as 'a' + 1 resulting in another valid Unicode code point; more is also transitively possible, including involving char.
- Interoperability with and migration from existing C++ code
  - Modeling char as a UTF-8 code unit creates behavior which is very similar to C++, but still shifts towards a more character-oriented approach. We do expect some migration friction as a consequence (as use-cases might need either more casts, or to switch to a byte type).
Future work
There’s still significant future work, including:
- signed char, unsigned char
- std::char8_t, std::char16_t, std::char32_t
- UTF-16 and UTF-32 support
It should not be assumed that there’s any restriction on the designs of those features, particularly no restrictions from #1964.
Alternatives considered
Align char fully with C++, or make it fully valid
Alternatives were discussed in zygoloid’s comment on #5903.
The comment notes that three options were proposed:
1. char is fully aligned with C++.

   There is no universal convention for what the value in a char means, and the numerical encoding of Unicode characters into char sequences might even be platform-dependent. For example, we might use some code page on Windows, EBCDIC on some IBM targets, and probably UTF-8 everywhere else. Likely the encoding would match what a character literal in C++ code would do for that target. Even when the target normally uses UTF-8, it would be reasonable to use an array of char as the type of the output buffer when transcoding from UTF-8 to some other encoding, and generally an encoded text buffer (in any encoding) would typically be represented as an array of char. It might also be reasonable to use an array of char for things that aren’t necessarily text, such as file contents.

2. char models a UTF-8 code unit, although it may not necessarily be valid, and may appear in a sequence that is not a valid UTF-8 encoding.

   As with the first option, char can represent an integer in [0, 255], although it is not an integer type. Higher-level abstractions would likely (eventually) be provided to represent different views of the code unit sequence as (for example) a sequence of code points or a sequence of graphemes, but the fundamental model exposes the encoding. Functions taking char or char sequences would assume UTF-8 encoding, and would need to consider how to handle invalid chars and invalid char sequences.

3. Use a foundation that enforces Unicode string validity, for some definition of “Unicode string validity”.

   The char type is a Unicode character. Strings would notionally be a sequence of Unicode characters, possibly also maintaining some higher-level string invariants. String indexing, if it exists, would likely treat the string as a sequence of Unicode characters. String invariants would be enforced by type conversion into the string type rather than within the string operations, and certain classes of invalid strings would be unrepresentable.
The rationale, as evaluated, is:
- Privilege UTF-8 over other encodings: UTF-8 is typically the best choice for representing text, even when targeting languages where characters are 3 bytes in UTF-8 but 2 in UTF-16, and even on Windows where the system APIs typically operate primarily in UTF-16 or UCS-2. We should create affordances that encourage use of UTF-8 (such as having the char type be conventionally UTF-8).
  - Our overall goal to support (only) modern environments and a general desire for consistency and portability argues against supporting non-Unicode encodings for character types.
  - Having some convention for the meaning of the value of a char seems important, and the lack of one in C++ has been a notable problem over time, leading to the addition of char8_t et al, which have not been entirely satisfactory solutions due to the existing widespread usage of plain char.
- Do not privilege any particular meaning of “validity”: There are many different ways in which you can view a sequence of UTF-8 code units as being valid or invalid. For example: Can a string start with a combining character? Can it have mismatched LRE/RLE/PDF characters in it? Can it be unnormalized, or must it be in NFC, or in NFD? Can it contain unassigned Unicode characters? Can it contain PUA characters? Can it contain non-characters? Picking any set of answers to these questions as being our canonical notion of “validity” is somewhat arbitrary.
- Do not privilege any particular level for accessing elements of the string other than code units: There are many different layers of abstraction at which you can interpret the contents of a string. The atoms that users want to interact with, such as glyphs or grapheme clusters in rendering, or combining characters when editing or performing substring searches, aren’t in one-to-one correspondence with Unicode characters any more than they’re in one-to-one correspondence with UTF-8 code units. So it’s not clear that privileging Unicode-character-oriented access (or indeed any of the other higher-level Unicode views) is appropriate. However, code units are in direct correspondence with bytes of memory, which is directly relevant for low-level operations, so there is a reason to provide direct access to byte-level / code-unit-level operations.
- If string indexing operates on Unicode characters, it would either be non-constant-time or would require not storing strings as just a sequence of UTF-8. Having a constant-time indexing operation on strings seems very important (especially for interop and for meeting C++ developers where they are), even though a lot of the desired functionality (perhaps all of it) can be provided with iterator- or cursor-like machinery instead.
- Enforcing validity is problematic for existing API structures: Requiring strings to be valid UTF-8 presents difficulties when moving text into or out of other sources. For example, when reading text from a validly-encoded UTF-8 file into a text buffer, one would need to deal with a read that ends in the middle of an encoding of a character. I don’t know how Rust deals with this, but it seems like it would create significant impedance mismatch with C-like buffered I/O utilities. Similarly, when interoperating with C++, it would create friction if our string representation requires strings to be valid UTF-8 encodings.
- We can allow additional invariants without requiring them: For a known-to-be-valid UTF-8 sequence, a higher-level abstraction can be built, and similarly, yet-higher-level abstractions can be built for whatever other invariants we want to enforce. So using option 2 rather than option 3 as our foundation doesn’t prevent enforcing invariants in the type system (but nor does it encourage doing so).
This proposal is choosing option 2, that char models a UTF-8 code unit without validation. In some sense, option 2 is still “fully aligned with C++”, but with C++’s char8_t rather than with C++’s char.
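Two points from the rationale above, constant-time indexing and reads that end in the middle of an encoded character, can be made concrete with a short C++ sketch. The function names are hypothetical, and both scanning functions assume the input is otherwise well-formed UTF-8; they are illustrations rather than a proposed library API.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// True for UTF-8 continuation bytes, which have the form 0b10xxxxxx.
constexpr bool IsContinuation(char unit) {
  return (static_cast<unsigned char>(unit) & 0xC0) == 0x80;
}

// Code-unit indexing: a single array access, so constant time.
constexpr unsigned char CodeUnitAt(std::string_view text, std::size_t index) {
  return static_cast<unsigned char>(text[index]);
}

// Code-point indexing: must scan from the start, so linear time. Returns the
// byte offset of the n-th code point (0-based), or text.size() if absent.
constexpr std::size_t CodePointOffset(std::string_view text, std::size_t n) {
  std::size_t seen = 0;
  for (std::size_t i = 0; i < text.size(); ++i) {
    if (IsContinuation(text[i])) {
      continue;  // Not the start of a code point.
    }
    if (seen == n) {
      return i;
    }
    ++seen;
  }
  return text.size();
}

// Returns how many trailing bytes belong to a code point whose encoding is
// cut off by the end of the buffer, or 0 if the buffer ends cleanly.
constexpr std::size_t IncompleteTailLength(std::string_view buffer) {
  // Step back over at most 3 continuation bytes to reach the last lead byte.
  std::size_t back = 0;
  while (back < buffer.size() && back < 3 &&
         IsContinuation(buffer[buffer.size() - 1 - back])) {
    ++back;
  }
  if (back == buffer.size()) {
    return 0;  // Empty, or no lead byte found.
  }
  const auto lead =
      static_cast<unsigned char>(buffer[buffer.size() - 1 - back]);
  std::size_t expected = 1;
  if ((lead & 0xE0) == 0xC0) {
    expected = 2;
  } else if ((lead & 0xF0) == 0xE0) {
    expected = 3;
  } else if ((lead & 0xF8) == 0xF0) {
    expected = 4;
  }
  const std::size_t present = back + 1;
  return present < expected ? present : 0;
}

// Hypothetical usage: "a", U+00E9 (C3 A9), "z".
inline void Utf8ViewExample() {
  constexpr std::string_view text = "a\xC3\xA9z";
  assert(CodeUnitAt(text, 1) == 0xC3);
  assert(CodePointOffset(text, 2) == 3);
  assert(IncompleteTailLength("a\xC3") == 1);
}
```

Under option 2, a buffer holding a partial trailing sequence is still an ordinary char sequence, so a buffered reader can keep those bytes and complete them on the next read instead of failing a validity check.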
Raw character literals
Raw string literals use a # prefix. There’s limited use for this in character literals; technically, '\\' could instead be #'\'#, but that’s longer and extra characters may prove distracting. Raw string literals are more useful when there’s a longer character sequence, whereas character literals have one character by definition. For simplicity, character literals won’t have raw syntax.
Disallow hex escape sequences in character literals
A \x## escape sequence abstractly represents a UTF-8 code unit. Whereas values over 7F are valid in string literals (allowing arbitrary byte values), these are disallowed in character literals because we want a more validated Unicode behavior. Developers could instead rely on \u escapes in place of \x.
It can still be useful to allow \x escapes for low-range values because some developers will still need to specify ANSI escapes. Carbon drops support for some escape sequences, such as \a, and specifically advises \x as an alternative for developers that need it. Requiring \a -> \x07 -> \u{07} is incrementally more verbose syntax, and developers may be confused why "\x1B" is allowed for strings but '\u{1B}' is required for characters.
Values over 7F are ambiguous between an arbitrary byte value and a Unicode code point, and so should be invalid. However, where both interpretations are identical for UTF-8 (values up to and including 7F), we will allow \x escape sequences.
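The cutoff at 7F follows directly from how UTF-8 encodes code points, which the following C++ sketch shows. EncodeUtf8 is a plain implementation of the standard UTF-8 encoding rules; HexEscapeIsUnambiguous is a hypothetical helper name for the check this section describes, not something the proposal defines.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encodes one Unicode code point as UTF-8 code units. Assumes the input is
// at most 0x10FFFF; surrogates are not checked in this sketch.
inline std::string EncodeUtf8(char32_t code_point) {
  assert(code_point <= 0x10FFFF);
  std::string out;
  if (code_point <= 0x7F) {
    out += static_cast<char>(code_point);
  } else if (code_point <= 0x7FF) {
    out += static_cast<char>(0xC0 | (code_point >> 6));
    out += static_cast<char>(0x80 | (code_point & 0x3F));
  } else if (code_point <= 0xFFFF) {
    out += static_cast<char>(0xE0 | (code_point >> 12));
    out += static_cast<char>(0x80 | ((code_point >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (code_point & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (code_point >> 18));
    out += static_cast<char>(0x80 | ((code_point >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((code_point >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (code_point & 0x3F));
  }
  return out;
}

// True when reading `value` as a code point and reading it as a raw UTF-8
// code unit give the same single byte.
inline bool HexEscapeIsUnambiguous(std::uint8_t value) {
  const std::string encoded = EncodeUtf8(value);
  return encoded.size() == 1 &&
         static_cast<std::uint8_t>(encoded[0]) == value;
}
```

For example, the escape value 1B encodes to the single byte 1B, while the code point 80 encodes to the two bytes C2 80, so a \x80 in a character literal could mean either the byte 80 or the code point U+0080.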
Allow grapheme clusters in character literals
This proposal carries forward the decision in #1964 to not support grapheme clusters in character literals.
Reuse string literal syntax for character literals
Instead of using single quotes (for example, 'a'), we could use string literal syntax with a conversion (for example, "a" as char) for character literals. This was proposed because it would free up the single quote for other, unspecified syntax uses.
For background, character literals are common in C++. For example, in SourceGraph search statistics (some of these are in comments – a search limitation):
- '(.|\\.)': 46.2 million
- <<: over 100 million
- >>: 10.4 million
- %: 5.3 million
This creates several disadvantages for removing character literals in Carbon:
- Migrating C++ developers to Carbon: The frequency of use can be expected to have trained developers to expect single quotes to be used for characters, especially the C++ developers that Carbon is targeting. Repurposing them would create a friction for C++ developers to need to understand the different meanings of the same syntax in each of C++ and Carbon, something Carbon prefers to avoid.
- Increased runtime error risks: Runtime errors could take the form of simple increased overhead, such as converting a string literal to a str then to a char. However, they could also be more insidious, such as doing [0] on a string literal and not validating that the string is exactly one character (this would also likely return a null byte for ""[0]). By having a character literal type, Carbon encourages developers to stay within guide rails that make it easier to get compile-time behavior and program validation.
- Block string literal use: We already have another use for single quotes in Carbon: block string literals. The syntax may need to change along with removing character literals, to make room for other uses of single quotes.
  - If retained, it would constrain uses of single quotes. For example, a unary operator syntax has overlap (that is, if 'a and ''a are valid, then '''a is ambiguous).
  - The choice of single quotes in proposal #1360: Change raw string literal syntax was made accounting for single quotes in character literals, and that commonality would be lost.
- Tooling: The prevalence of single quotes being used for either strings or characters also affects their treatment in tools not specialized to Carbon: they expect them to be used for strings. For example, Rust’s use of single quotes for lifetime annotations has been observed to break language-agnostic syntax highlighting.
While a compelling proposal for a different use of single quotes may come up in the future, freeing up the character for other purposes is insufficient to justify a different syntax for character literals.
Treat single-character string literals as a third “text literal” type
A related alternative with the same goal of eliminating single quotes for character literals is that, rather than requiring single-character string literals be explicitly converted to char, they could instead have a third type of text literal. This would implicitly cast to either str or char.
This approach would lead to three literal types: StrLiteral, CharLiteral, and TextLiteral. The distinction of CharLiteral is important because we still want to support arithmetic on character literals, such as 'a' + 1 (which we would not want to be allowed for StrLiteral).
The existence of a third type would be important for generic code, even when not trying to use character literals. For example:
    fn StoreValue[U:! type](ref a: Optional(U), b: U) {
      a = b;
    }
    fn StrLogic[T:! type](a: T) {
      var x: Optional(T) = a;
      StoreValue(x, "str");
    }
    fn F() {
      StrLogic("a");
    }
Here, T is deduced to be TextLiteral. However, U has no valid value: it’s passed Optional(TextLiteral), while "str" is a StrLiteral (which should not be convertible to TextLiteral). As a consequence, this code is invalid, even though the same code would be valid if there were no TextLiteral type.
Advantages:
- Avoids an explicit cast.
Disadvantages:
- Shares most of the disadvantages of the primary explicit conversion approach.
  - This includes the risk that developers will write "..."[0] instead of "..." as char when they need a character, although the frequency may be reduced.
- Having additional types in common literals could lead to programmer errors in deducing generic types, as described above.
- Implicit casts cause more operator ambiguity.
  - How are operators that have different meanings for string and character literals handled, such as Cpp.std.cout << or <=>?
  - In Carbon, we’d probably still want string operators to work; for example, "a" + "b" => "ab", and that can be compile-time. Is "a" + 1 a pointer to the null byte as it is in C++ (similar to &("a"[1])), a character addition ('a' + 1 => 'b'), or does it require an explicit cast in order to ensure behavior is deliberate?
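For reference, the existing C++ behaviors mentioned in this section can be checked with a few lines of standard C++. The function names are hypothetical and exist only to name each case; the behavior itself is ordinary C++.

```cpp
#include <cassert>

// A string literal decays to a pointer, so adding 1 is pointer arithmetic.
// For "a", the result points at the terminating null byte.
inline char StringLiteralPlusOne() {
  const char* text = "a";
  return *(text + 1);
}

// A character literal is an integer type, so adding 1 is integer arithmetic.
// The sum has type int and is narrowed back to char here.
inline char CharLiteralPlusOne() {
  return static_cast<char>('a' + 1);
}

// Indexing an empty string literal yields its null terminator, not an error.
inline char EmptyLiteralFirstElement() {
  return ""[0];
}

// Hypothetical usage showing the three results side by side.
inline void LiteralAmbiguityExample() {
  assert(StringLiteralPlusOne() == '\0');
  assert(CharLiteralPlusOne() == 'b');
  assert(EmptyLiteralFirstElement() == '\0');
}
```

The same-looking + 1 gives a null byte in one case and 'b' in the other, which is the ambiguity an implicitly converting text literal would have to resolve.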