Character literals

Pull request

Table of contents

Abstract

This proposal specifies lexical rules for constant characters in Carbon:

Put character literals in single quotes, like 'a'. Character literals work like numeric literals:

  • Every different literal value has its own type.
  • The literal itself doesn’t have a bit width as a consequence. Instead, variables use explicitly sized character types and character literals can be converted to these types when representable.
  • A character literal must contain exactly one code point.

Follows the plan from open design idea #1934: Character Literals.

Problem

Carbon currently has no lexical syntax for character literals, and only provides string literals and numeric literals. We wish to provide a distinct lexical syntax for character literals versus string literals.

The advantage of having an explicit character type fundamentally comes down to characters being represented as integers whereas strings are represented as buffers. This will allow characters to have different operations, and be more familiar to use. For example:

if (c >= 'A' and c <= 'Z') {
    c += 'a' - 'A';
}

The example above shows how we would be able to use operations similar to integers. Being able to use the comparison operations and supporting arithmetic operations provides an intuitive approach to using characters. This allows us to remove unnecessary logic of type conversion and other control flow logic, that is needed to work with a single element string. See Rationale for more examples showing more appropriate use of characters over using strings.

Background

Character Literals by definition is a type of literal in programming for the representation of a single character’s value within the source code of a computer program. Character literals between languages have some minor nuances but are fundamentally designed for the same purpose. Languages that have a dedicated character data type generally include character literals, for example C++, Java, Swift to name a few. Whereas other languages that lack distinct character type, like Python use strings of length one to serve the same purpose a character data type. For more information see Character Literals Wiki, Character Literals DBpedia

Proposal

Put character literals in single quotes, like 'a'. Character literals work like numeric literals:

  • Every different literal value has its own type.
  • The literal itself doesn’t have a bit width as a consequence. Instead, variables use explicitly sized character types and character literals can be converted to these types when representable. Follows the plan from #1934.
  • A character literal will model single Unicode code points that have a single concrete numerical representation. We will not be supporting other formulations like code unit sequences or grapheme clusters as these will be modeled with normal string literals.

Details

  • A character literal is a sequence enclosed with single quotes delimiter (‘), of UTF-8 code units that must be a valid encoding. This matches the UTF-8 encoding of Carbon source files.
  • A character literal must encode exactly one code point.
  • It supports addition and subtraction, as described below.
  • Character literals will support the relevant subset of the backslash (\) escape sequences in string literals, including \t, \n, \r, \", \', \\, \0, and \u{HHHH...}. See String Literals: Escape sequence.
    • Escape sequences which would result in non-UTF-8 encodings or more than one code point are not included.
    • The escape of an embedded newline is also excluded as it isn’t expected to be relevant for character literals.

We will not support:

  • character literals that don’t contain exactly one Unicode code point;
  • multi-line literals;
  • “raw” literals (using #’x’#);
  • \x escape sequences;
  • character literals with a single quote (') or back-slash (\), except as part of an escape sequence;
  • empty character literals ('');
  • a backslash followed by an (unescaped) newline;
  • ASCII control codes (0…31), including whitespace characters other than word space (tab, line feed, carriage return, form feed, and vertical tab), except when specified with an escape sequence.

Types

For the time being, Carbon will support three character types: Char8, Char16, and Char32. These types are capable of representing both code units and code points. It’s important to note that the support for different UTF-encoding code unit types will be addressed in a separate proposal. Please refer to the UTF code unit types proposalfor more information on that topic.

In Carbon, the type CharN represents a character, where N corresponds to the bit size of the character type (8, 16, or 32). We will only allow character literals that map directly to a complete value of a code point. Here are examples of character literals for each specific type:

  • Char8: The character literal consists of a single Unicode code point that can be represented within 8 bits. For example:

let allowed: Char8 = ‘a’

In this example, the character literal ’a’ corresponds to the Unicode code point 97, which is within the valid range of Char8 since 97 is less than or equal to 0x7F.

  • Char16: The character literal represents a Unicode code point that can be represented within 16 bits. Here’s an example:

let smiley: Char16 = ‘\u{1F600}’

The character literal ’\u{1F600}’ represents the smiley face emoji, which has the Unicode code point 128512. Since 128512 can be represented within 16 bits, it can be assigned to a variable of type Char16.

  • Char32: This character type allows the representation of Unicode code points within 32 bits. Here’s an example:

let musicalNote: Char32 = ‘🎵’

In this case, the character literal ’🎵’ corresponds to the musical note emoji with the Unicode code point 127925. Since 127925 falls within the range that can be represented by Char32, it can be assigned to a variable of type Char32.

By restricting character literals to those that can be directly mapped to code points within the specific character types, we ensure accurate representation and compatibility with the chosen character encoding scheme.

Operations

Character literals representing a single code point support the following operators:

  • Comparison: <, >, <=, >= ==
  • Plus: +. This doesn’t concatenate, but allows numerically adjusting the value:
    • Only one operand may be a character literal, the other must be an integer literal.
    • The result is the character literal whose numeric value is the sum of numeric value of the operands. If that sum is not a valid Unicode code point, it is an error.
  • Subtract: -. This will subtract the value of the two characters, or a character followed by an integer literal:
    • If the - is used between two character literals, the result will be an integer constant. For example, 'z' - 'a' is equivalent to 25.
    • If the - is used between a character literal followed by a integer literal, this will produce a character constant. For example 'z' - 4 is equivalent to 'v'.
    • If the - is used between a integer literal followed by a character literal 100 - 'a', this will be rejected unless the integer is cast to a character.

There is intentionally no implicit conversion from character literals to integer types, but explicit conversions are permitted between character literals and integer types. Carbon will separate the integer types from character types entirely.

Rationale

This proposal supports the goal of making Carbon code easy to read, understand, and write. Adding support for a specific character literal supports clean, readable, concise use and is a much more familiar concept that will make it easier to adopt Carbon coming from other languages. Have a distinct character literal will also allow us support useful operations designed to manipulate the literal’s value. When working with an explicit character type we can use operators that have unique behavior, for example say we wanted to advance a character to the next literal. In other languages the + operator is often used for concatenation, so using a String will produce a type error: "a" + 1. However with a character literal, we can support operations for these use cases:

var b: u8;

b = 'a' + 1;
b + 1 == 'c';

See Operations and No Distinct Character Literal for more information.

Further, this design follows other standards set in place by previous proposals. For example following the String Literals: Escaping Sequence and representing characters as integers with the behaviour inline with Integer Literals.

This also supports our goal for Interoperability with and migration from existing C++ code by ensuring that every kind of character literal that exists in C++ can be represented in a Carbon character literal. This is done in a way that is natural to adopt, understand, easy to read by having explicit character types mapped to the C++ character types and the correct associated encoding.

Finally, the choice to use Unicode and UTF-8 by default reflects the Carbon goal to prioritize modern OS platforms, hardware architectures, and environments. This reflects the growing adoption of UTF-8.

Alternatives considered

No distinct character types

Unlike C++, Carbon will separate the integer and the character types. We considered using u8, u16, and u32 instead of Char8, Char16, and Char32 to reduce the number of different types users needed to be aware of and convert between. We decided against it because it came with a number of disadvantages:

  • u8, u16, and u32 have the wrong arithmetic semantics: we don’t want wrapping, and many uN operations, like multiplication, division, and shift, are not meaningful on code units. There may be rare cases where you want to use those operations, such as if you’re implementing a conversion to or from code units. But in those rare cases it would be reasonable for the user to convert to an integer type to perform that operation and convert back when done.
  • Some operations want to be able to tell the difference between values that are intended to be UTF-8 instead of having no specified encoding.
  • Some operations want to be able to know that they’ve been given text rather than random bytes of data. For example, Print(0x41 as u8) would be expected to print "65" while Print('\u{41}') and Print(0x41 as Char8) would be expected to print "A".
  • It’s useful for developers to document the intended meaning of a value, and using a distinct type is one way to do that.

See UTF code unit types proposal for more information about UTF encoding types for a future proposal.

No distinct character literal

In principle, a character literal can be represented by reusing string literals similar to how Python handles character literals, however this would prevent performing operations on characters as integers. For example, the + operator on strings is used for concatenation, but + on a character would change its value.

// `digit` must be in the range 0..9.
fn DigitToChar(digit: i32) -> Char8 {
  return '0' + digit;
}

Furthermore, many properties of Unicode characters are defined on ranges of code points, motivating supporting comparison operators on code points.

fn IsDingBatCodePoint(c: Char32) -> bool {
  return c >= '\u{2700}' and c <= '\u{27BF}';
}

Supporting prefix declarations

No support is proposed for prefix declarations like u, U, or L. In practice they are used to specify the character literal types and their encoding in languages like C and C++. There are a several benefits to omitting prefix declarations; improved readablitly, simplifying how a character’s type is determined, and how we are encoding character literals. When declaring a character literal, the type is based on the contents of the character so that var c: u8 = 'a' is a valid character that can be converted to u8, in order to support prefix declarations we would need to extend our type system to have other explicit type checks like in C++; a UTF-16 u', UTF-32 U', and wide characters L'. This would be more familiar for individuals coming to Carbon from a C++ background, and simplify our approach for C++ Interoperability. At the cost of diverge from existing standards, for example Proposal 142 states all of Carbon source code should be UTF-8 encoded. Prefix declarations would detract the readability of the character literals and increase the complexity of character literal Types.

Allowing numeric escape sequences

This proposal does not support numeric escape sequences using \x. This simplifies the design of character types and literals, making them only represent code points and not code units. However this does come with the disadvantage of less consistency of character literals with string literals, since they now accept different escape sequences. We don’t want to remove numeric escape sequence from string literals, so we can support string use cases like representing invalid encodings.

This approach has the additional concern that if character literals don’t support numeric escape sequences, developers may choose to use numeric literals instead, at a cost of type-safety and readability. For example, it isn’t clear in var first_digit: Char8 = 0; whether 0 is supposed to be a NUL character or the encoding of the '0' character (48). We addressed this concern, and type safety concerns about distinguishing numbers and characters, by making the integer to character conversions explicit.

Supporting formulations of grapheme clusters and non-code-point code-units

Rather than explicitly limiting characters literals to a more integer-like representation of a single Unicode code point, we could represent characters literal formulations of grapheme clusters and non-code-point code units. What humans tend to think of as a “character” corresponds to a “grapheme cluster.” The encoding of a grapheme cluster can be arbitrarily long and complex, which would sacrifice the ability to perform integer operations. If we wanted to add support for other character formulations, we would need to use separate spellings to represent a small set of operations that are today expressed with integer-based math on C++’s character literals. This includes things like converting an integer between 0 and 9 into the corresponding digit character, or computing the difference between two digits/two other characters. For these reasons, we have decided to start out by representing character literals as single Unicode code points following a more integer-like model. However this topic should be revisited if we find that there is a significant need for the additional functionality and attendant complexity for these other character formulations.

Future Work

UTF code unit types proposal

There have been several ideas and discussions around how we would like to handle UTF code units. This section will hopefully provide some guidance for a future proposal when the topic is revisited for how we would like to build out encoding/decoding for character literals.

We will have the types Char8, Char16, and Char32 representing code units in UTF-8, UTF-16, and UTF-32, but we will not support all code units, but only those which map directly to the complete value of a code point. However, character literals will use their own types distinct from these:

  • We will support value preserving implicit conversions from character literals to code point or code unit types. In particular, a character literal converts to a Char8 UTF-8 code unit if it is less than or equal to 0x7F, and Char16 UTF-16 code unit if it is less than or equal to 0xFFFF.
  • Conversions from string or character literals to a non-value-preserving encoding must be explicit.
  • Conversions from string literals to Unicode strings are implicit, even though the numeric values of the encoding may change.

We can see whether the particular literal is represented in the variable’s type by only looking at the types.

let allowed: Char8 = 'a';

The above is allowed because the type of 'a' is the character literal consisting of the single Unicode code point 97, which can be converted to Char8 since 97 is less than or equal to 0x7F.

let error1: Char8 = '😃';
let error2: Char8 = 'AB';

However these should produce errors. The type of '😃' is the character literal consisting of the single Unicode code point 0x1F603, which is greater than 0x7F. The type of 'AB' is a character literal that is a sequence of two Unicode code points, which has no conversion to a type that only handles a single UTF-8 code unit.

All of '\n', and '\u{A}' represent the same character and so have the same type. However, explicitly converting this character literal to another character set might result in a character with a different value, but that still represents the newline character.