Numeric literals
Problem
This proposal specifies lexical rules for numeric constants in Carbon.
Background
We wish to cover literals for two categories of types:
- Integer types, which can represent some (typically contiguous) subset of the integers, ℤ.
- Real number types, which can represent some discrete subset of the real numbers, ℝ. (Typically only rational numbers can be represented, but that doesn’t matter for our purposes.)
Real number types may include additional values (infinities and NaN values). We do not provide a notation to express such values.
In C++, the following syntaxes are used:
- Integer literals
  - `12345` (decimal)
  - `0x1FE` (hexadecimal)
  - `0123` (octal)
  - `0b1010` (binary)
- Real number literals
  - Decimal
    - `123.`
    - `.123`
    - `123.456`
    - `123.e456` (= $123 \times 10^{456}$)
    - `.123e456`
    - `123.456e789`
    - `123e456` (no decimal point)
    - Any of the above with a `+` or `-` after `e`.
  - Hexadecimal
    - `0x123.p456` (= $123_{16} \times 2^{456}$)
    - `0x.123p456`
    - `0x123.456p789`
    - `0x123p456` (no hexadecimal point)
    - Any of the above with a `+` or `-` after `p`.
- Digit separators (`'`) may appear between any two digits
- An optional suffix defines the type
  - `U` (`unsigned`) and `L` (`long`) or `LL` (`long long`) for integers (order-independent, but `LUL` disallowed)
  - `F` (`float`) or `L` (`long double`) for real numbers
  - User-defined literals may have custom suffixes, starting with `_` for non-standard-library literals.
C++ numeric literals are case-insensitive, except in the suffix of a user-defined literal. Negative numbers are formed by applying a unary `-` operator to a non-negative literal.
The type of a literal in C++ depends primarily on its syntax and its suffix. However, for integer literals, the type also depends on the value; the language rules attempt to pick a type large enough to fit the value. An `unsigned` type is always used if a `U` suffix is present, is never used for a decimal literal without a `U` suffix, and otherwise may or may not be used depending on whether the value happens to fit into an unsigned type but not into a signed type of the same width.
Other languages use somewhat different rules, but the broad lexical structure above – an optional prefix for the base, a value, an optional exponent, and an optional suffix – is common across a large number of languages.
Proposal
We allow these syntaxes:
- Integer literals
  - `12345` (decimal)
  - `0x1FE` (hexadecimal)
  - `0b1010` (binary)
- Real number literals
  - `123.456` (digits on both sides of the `.`)
  - `123.456e789` (optional `+` or `-` after the `e`)
  - `0x1.2p123` (optional `+` or `-` after the `p`)
- Digit separators (`_`) may be used, but only in conventional locations
Note that real number literals always contain a `.` with digits on both sides, and integer literals never contain a `.`.
Literals are case-sensitive.
No support is proposed for literals with type suffixes, but without prejudice: this proposal takes no position on whether such literals should eventually be included.
Details
Integer literals
Decimal integers are written as a non-zero decimal digit followed by zero or more additional decimal digits, or as a single `0`.
Integers in other bases are written as a `0` followed by a base specifier character, followed by a sequence of digits in the corresponding base. The available base specifiers and corresponding bases are:
| Base specifier | Base | Digits |
|---|---|---|
| `b` | 2 | `0` and `1` |
| `x` | 16 | `0` … `9`, `A` … `F` |
The above table is case-sensitive. For example, `0b1` and `0x1A` are valid, and `0B1`, `0X1A`, and `0x1a` are invalid.
A zero at the start of a literal can never be followed by another digit: either the literal is `0`, the `0` begins a base specifier, or the next character is a decimal point (see below).
Real number literals
Real numbers are written as a decimal or hexadecimal integer followed by a period (`.`) followed by a sequence of one or more decimal or hexadecimal digits, respectively. A digit is required on each side of the period. For example, `0.` and `.3` are both invalid.
A real number can be followed by an exponent character, an optional `+` or `-` (defaulting to `+` if absent), and a character sequence matching the grammar of a decimal integer with some value N. For a decimal real number, the exponent character is `e`, and the effect is to multiply the given value by $10^{\pm N}$. For a hexadecimal real number, the exponent character is `p`, and the effect is to multiply the given value by $2^{\pm N}$. The exponent suffix is optional for both decimal and hexadecimal real numbers.
Note that a decimal integer followed by `e` is not a real number literal. For example, `3e10` is not a valid literal.
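As a rough sketch of the real number rules above, the following Python recognizes decimal and hexadecimal real literals and computes their exact mathematical value as a fraction. The names `DECIMAL_REAL`, `HEX_REAL`, and `real_literal_value` are hypothetical, digit separators are not handled, and this is an illustration under those assumptions and not a reference lexer.

```python
import re
from fractions import Fraction
from typing import Optional

# A decimal integer: a single 0, or a non-zero digit followed by more digits.
DEC_INT = r"(?:0|[1-9][0-9]*)"

# Digits are required on both sides of the period; the exponent is optional.
DECIMAL_REAL = re.compile(rf"({DEC_INT})\.([0-9]+)(?:e([+-]?{DEC_INT}))?")
HEX_REAL = re.compile(rf"0x([0-9A-F]+)\.([0-9A-F]+)(?:p([+-]?{DEC_INT}))?")


def real_literal_value(text: str) -> Optional[Fraction]:
    """Return the exact value of a real literal, or None if `text` is not one."""
    # (pattern, base of the digits, base that the exponent scales by)
    for pattern, digit_base, exponent_base in (
        (DECIMAL_REAL, 10, 10),
        (HEX_REAL, 16, 2),
    ):
        match = pattern.fullmatch(text)
        if match:
            whole, fraction, exponent = match.groups()
            mantissa = Fraction(
                int(whole + fraction, digit_base), digit_base ** len(fraction)
            )
            # A missing exponent is treated as an exponent of zero.
            return mantissa * Fraction(exponent_base) ** int(exponent or "0")
    return None


print(real_literal_value("123.456e-2"))  # exact value as a fraction
print(real_literal_value("0x1.8p1"))     # hexadecimal digits, power-of-two exponent
print(real_literal_value("3e10"))        # None: no period, so not a real literal
```

Working with exact fractions keeps the question of the literal's value separate from the later question of how that value converts to a particular real number type.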
When a real number literal is interpreted as a value of a real number type, its value is the representable real number closest to the value of the literal. In the case of a tie, the conversion to the real number type is invalid.
The decimal real number syntax allows for any decimal fraction to be expressed, that is, any number of the form $a \times 10^{-b}$, where a is an integer and b is a non-negative integer. Because the decimal fractions are dense in the reals and the set of values of the real number type is assumed to be discrete, every value of the real number type can be expressed as a real number literal. However, for certain applications, directly expressing the intended real number representation may be more convenient than producing a decimal equivalent that is known to convert to the intended value. Hexadecimal real number literals are provided in order to permit values of binary floating or fixed point real number types to be expressed directly.
Ties
As described above, a real number literal that lies exactly between two representable values for its target type is invalid. Such ties are extremely unlikely to occur by accident: for example, when interpreting a literal as `Float64`, `1.` would need to be followed by exactly 53 decimal digits (followed by zero or more `0`s) to land exactly half-way between two representable values, and the probability of `1.` followed by a random 53-digit sequence resulting in such a tie is one in $5^{53}$, or about 0.000000000000000000000000000000000009%. For `Float32`, it’s about 0.000000000000001%, and even for a typical `Float16` implementation with 10 fractional bits, it’s around 0.00001%.
Ties are much easier to express as hexadecimal floating-point literals: for example, `0x1.0000_0000_0000_08p+0` is exactly half way between `1.0` and the smallest `Float64` value greater than `1.0`, which is `0x1.0000_0000_0000_1p+0`.
Whether written in decimal or hexadecimal, a tie provides very strong evidence that the developer intended to express a precise floating-point value, and provided one bit too much precision (or one bit too little, depending on whether they expected some rounding to occur), so rejecting the literal seems like a better option than accepting it and making an arbitrary choice between the two possible values.
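The tie rule can be checked mechanically once a literal's exact value is known. The sketch below is an assumption-laden illustration: it treats Python's `float` as a stand-in for `Float64`, relies on `math.nextafter` (Python 3.9 or later), ignores values outside the finite range, and uses the hypothetical name `is_float64_tie`.

```python
import math
from fractions import Fraction


def is_float64_tie(value: Fraction) -> bool:
    """Return True if `value` lies exactly half-way between two adjacent doubles."""
    nearest = float(value)  # correctly rounded conversion to the nearest double
    if Fraction(nearest) == value:
        return False  # exactly representable, so there is nothing to round
    # The other candidate is the adjacent double on the far side of `value`.
    direction = math.inf if value > nearest else -math.inf
    other = math.nextafter(nearest, direction)
    return abs(value - Fraction(nearest)) == abs(Fraction(other) - value)


# 0x1.0000_0000_0000_08p+0 from the text: 1 + 8 * 16**-14, which is 1 + 2**-53.
hex_tie = Fraction(1) + Fraction(8, 16**14)
print(is_float64_tie(hex_tie))

# The same value written in decimal needs exactly 53 digits after "1.".
decimal_digits = str(5**53).zfill(53)
decimal_tie = Fraction(int("1" + decimal_digits), 10**53)
print(decimal_tie == hex_tie, is_float64_tie(decimal_tie))
```

Under this rule a lexer would reject both spellings, while a nearby value such as one tenth, which is not representable but also not a tie, would simply round to the nearest double.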
Digit separators
If digit separators (`_`) are included in literals, they must meet the respective condition:
- For decimal integers, the digit separators shall occur every three digits starting from the right. For example, `2_147_483_648`.
- For hexadecimal integers, the digit separators shall occur every four digits starting from the right. For example, `0x7FFF_FFFF`.
- For real number literals, digit separators can appear in the decimal and hexadecimal integer portions (prior to the period and after the optional `e` or `p`) as described in the previous bullets. For example, `2_147.483648e12_345` or `0x1_00CA.FEF00Dp+24`.
- For binary literals, digit separators can appear between any two digits. For example, `0b1_000_101_11`.
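The placement rules for integer literals can be expressed as a pattern with one alternative per case. This Python sketch covers integers only (real number literals are omitted for brevity), and the names `SEPARATED_INTEGER` and `has_valid_separators` are hypothetical.

```python
import re

# One alternative per case described above. With re.VERBOSE, whitespace and
# comments inside the pattern are ignored.
SEPARATED_INTEGER = re.compile(
    r"""
      0 | [1-9][0-9]*                     # decimal, no separators
    | [1-9][0-9]{0,2}(?:_[0-9]{3})+       # decimal, groups of three from the right
    | 0x[0-9A-F]+                         # hexadecimal, no separators
    | 0x[0-9A-F]{1,4}(?:_[0-9A-F]{4})+    # hexadecimal, groups of four from the right
    | 0b[01]+(?:_[01]+)*                  # binary, separator between any two digits
    """,
    re.VERBOSE,
)


def has_valid_separators(text: str) -> bool:
    """Return True if `text` is an integer literal with conventionally placed separators."""
    return SEPARATED_INTEGER.fullmatch(text) is not None


for literal in ("2_147_483_648", "0x7FFF_FFFF", "0b1_000_101_11", "123_4567_89"):
    print(literal, has_valid_separators(literal))
```

Requiring every group after the first separator to have exactly three (or four) digits is what rejects irregular groupings such as `123_4567_89`.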
Open question: digit separator placement
2020-09-15: core team meeting selected Alternative 0
As an alternative to the rule proposed above, we could consider different restrictions on where digit separators can appear:
Alternative 0: as presented above.
Alternative 1: allow any digit groupings (for example, `123_4567_89`).
Pro:
- Simpler, more flexible rule, that may allow some groupings that are conventional in a specific domain. For example, `var Date: d = 01_12_1983;`, or `var Int64: time_in_microseconds = 123456_000000;`.
- Culturally agnostic. For example, the Indian convention for digit separators would group the last three digits, and then every two digits before that (1,23,45,678 could be written `1_23_45_678`).
Con:
- Less self-checking that numeric literals are interpreted the way that the author intends.
Alternative 2: as above, but additionally require binary digits to be grouped in 4s.
Pro:
- More enforcement that digit grouping is conventional.
Con:
- No clear, established rule for how to group binary digits. In some cases, 8 digit groups may be more conventional.
- When used to express literals involving bit-fields, arbitrary grouping may be desirable. For example: `var Float32: flt_max = BitCast(Float32, 0b0_11111110_11111111111111111111111);`
Alternative 3: allow any regular grouping.
Pro:
- Can be applied uniformly to all bases.
Con:
- Provides no assistance for decimal numbers with a single digit separator.
- Does not allow binary literals to express an intent to initialize irregular bit-fields.
Alternatives considered
There are a number of different design choices we could make, as divergences from the above proposal. Those choices, along with the arguments that led to choosing the proposed design rather than each alternative, are presented below.
Integer bases
Octal literals
No support is proposed for octal literals. In practice, their appearance in C and C++ code in a sample corpus consisted of (in decreasing order of commonality and excluding `0` literals):
- file permissions,
- cases where decimal was clearly intended (`CivilDay(2020, 04, 01)`), and
- (in distant third place) anything else.
The number of intentional uses of octal literals, other than in file permissions, was negligible. We considered the following alternatives:
Baseline: This proposal suggests that we do not support octal literals. Octal literals are rare and mostly obsolescent. File permissions can be supported in some other way.
Alternative 1: Follow C and C++, and use `0` as the base prefix for octal.
Pro:
- More similar to C++ and other languages.
Con:
- Subtle and error-prone rule: for example, left-padding with zeroes for alignment changes the meaning of literals.
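To make the left-padding hazard concrete, the following Python sketch imitates the C and C++ rule that a leading `0` selects octal. The helper `c_style_value` is hypothetical and handles only unsuffixed decimal and octal spellings.

```python
def c_style_value(text: str) -> int:
    """Value of an unsuffixed decimal or octal literal under the C and C++ rule."""
    # A leading 0 followed by more digits switches the literal to base 8.
    if len(text) > 1 and text.startswith("0"):
        return int(text, 8)
    return int(text, 10)


# Padding a column of numbers with zeroes silently changes one of the values.
for literal in ("123", "0123"):
    print(literal, c_style_value(literal))
```

Here `0123` evaluates to 83 and not 123, which is the kind of silent change in meaning that the proposed rule (no digit after a leading `0`) avoids.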
Alternative 2: Use `0o` as the base prefix for octal.
Pro:
- Unlikely to be misinterpreted as decimal.
- Follows several other languages (for example, Python).
Con:
- Additional language complexity.
If we decide we want to introduce octal literals at a later date, use of alternative 2 is suggested.
Decimal literals
We could permit leading `0`s in decimal integers (and in floating-point numbers).
Pro:
- We would allow leading `0`s to be used to align columns of numbers.
Con:
- The same literal could be valid but have a different value in C++ and Carbon.
We could add an (optional) base specifier `0d` for decimal integers.
Pro:
- Uniform treatment of all bases. Left-padding with `0` could be achieved by using `0d000123`.
Con:
- No evidence of need for this functionality.
We could permit an `e` in decimal literals to express large powers of 10.
Pro:
- Many uses of (for example) `1e6` in our sample C++ corpus intend to form an integer literal instead of a floating-point literal.
Con:
- Would violate the expectations of many C++ programmers used to `e` indicating a floating-point constant.
We suggest that this syntax is not added at this point. However, it should be reconsidered at a later date, once developers are used to the requirement that real literals always contain a period.
Case sensitivity
We could make base specifiers case-insensitive.
Pro:
- More similar to C++.
Con:
- `0B1` is easily mistaken for `081`
- `0B1` can be confused with `0xB1`
- `0O17` is easily mistaken for `0017`
- Allowing more than one way to write literals will lead to style divergence.
We could make the digit sequence in hexadecimal integers case-insensitive.
Pro:
- More similar to C++.
- Some developers will be more comfortable writing hexadecimal digits in lowercase. Some tools, such as `md5`, will print lowercase.
Con:
- Allowing more than one way to write literals will lead to style divergence.
- Lowercase hexadecimal digits are less visually distinct from the `x` base specifier (for example, the digit sequence is more visually distinct in `0xAC` than in `0xac`).
We could require the digit sequence in hexadecimal integers to be written using lowercase letters `a`..`f`.
Pro:
- Some developers will be more comfortable writing hexadecimal digits in lowercase. Some tools, such as `md5`, will print lowercase.
- `B` and `D` are more likely to be confused with `8` and `0` than `b` and `d` are.
Con:
- Some developers will be more comfortable writing hexadecimal digits in uppercase. Some tools will print uppercase.
- Lowercase hexadecimal digits are less visually distinct from the `x` base specifier (for example, the digit sequence is more visually distinct in `0xAC` than in `0xac`).
Real number syntax
We could allow real numbers with no digits on one side of the period (`3.` or `.5`).
Pro:
- More similar to C++.
- Allows numbers to be expressed more tersely.
Con:
- Gives meaning to `tup.0` syntax that may be useful for indexing tuples.
- Gives meaning to `0.ToString()` syntax that may be useful for performing member access on literals.
- May harm readability by making the difference between an integer literal and a real number literal less significant.
- Allowing more than one way to write literals will lead to style divergence.
See also the section on floating-point literals in the Google style guide, which argues for the same rule.
We could allow a real number with an `e` or `p` to omit a period (`1e100`).
Pro:
- More similar to C++.
- Allows numbers to be expressed more tersely.
Con:
- Assuming that such numbers are integers rather than real numbers is a common error in C++.
We could allow the `e` or `p` to be written in uppercase.
Pro:
- More similar to C++.
- Most calculators use `E`, to avoid confusion with the constant `e`.
Con:
- Allowing more than one way to write literals will lead to style divergence.
- `E` may be confused with a hexadecimal digit.
We could require a `p` in a hexadecimal real number literal.
Pro:
- More similar to C++.
- When explicitly writing a bit-pattern for a floating-point type, it’s reasonable to always include the exponent value.
Con:
- Less consistent.
- Makes hexadecimal floating-point values even more expert-only.
We could arbitrarily pick one of the two values when a real number is exactly half-way between two representable values.
Pro:
- More similar to C++.
- Would accept more cases, and it’s likely that either of the two possible values would be acceptable in practice.
Con:
- Would either need to specify which option is chosen or, following C++, accept that programs using such literals have non-portable semantics.
- Numbers specified to the exact level of precision required to form a tie are a strong signal that the programmer intended to specify a particular value.
Digit separator syntax
2020-09-15: core team meeting chose to forward digit separator to painter
2020-10-05: painter selected Alternative 2: `_` as digit separator
There are various different characters we could attempt to use as a digit separator. The options we considered are:
Alternative 0: `'` as a digit separator.
Pro:
- Follows C++ syntax.
- Used in several (mostly European) writing conventions.
Con:
- `'` is also likely to be used to introduce character literals.
Alternative 1: `,` as a digit separator.
Pro:
- More similar to how numbers are written in English text and many other cultures.
Con:
- Commas are expected to be widely used in Carbon programs for other purposes, where there may be digits on both sides of the comma. For example, there could be readability problems if `f(1, 234)` called `f` with two arguments but `f(1,234)` called `f` with a single argument.
- Comma is interpreted as a decimal point in the conventions of many cultures.
- Unprecedented in common programming languages.
Alternative 2: `_` as a digit separator.
Pro:
- Follows convention of C#, Java, JavaScript, Python, D, Ruby, Rust, Swift, …
- Culturally agnostic, because it doesn’t match any common human writing convention.
Con:
- Underscore is not used as a digit grouping separator in any common human writing convention.
Alternative 3: whitespace as a digit separator.
Pro:
- Used and understood by many cultures.
- Never interpreted as a decimal point instead of a grouping separator.
- Also usable to the right of a decimal point.
Con:
- Omitted separators in lists of numbers may result in distinct numbers being spliced together. For example, `f(1, 23, 4 567)` may be interpreted as three separate numerical arguments instead of four arguments with a missing comma.
- Unprecedented in other programming languages.
Alternative 4: `.` as digit separator, `,` as decimal point.
Pro:
- More familiar to cultures that write numbers this way.
Con:
- As with `,` as a digit separator, `,` as a decimal point is problematic.
- This usage is unfamiliar and would be surprising to programmers; programmers from cultures where `,` is the decimal point in regular writing are likely already accustomed to using `.` as the decimal point in programming environments, and the converse is not true.
Alternative 5: No digit separator syntax.
Pro:
- Simpler language rules.
- More consistent source syntax, as there is no choice as to whether to use digit separators or not.
Con:
- Harms the readability of long literals.
Rationale
The proposal provides a syntax that is sufficiently close to that used both by C++ and many other languages to be very familiar. However, it selects a reasonably minimal subset of the syntaxes. This minimal approach provides benefits directly in line with both the simplicity and readability goals of Carbon:
- Reduces unnecessary choices for programmers.
- Simplifies the syntax rules of the language.
- Improves consistency of written Carbon code.
That said, it still provides sufficient variations to address important use cases for the goal of not leaving room for a lower level language:
- Hexadecimal and binary integer literals.
- Scientific notation floating point literals.
- Hexadecimal (scientific) floating point literals.
Painter rationale
The primary aesthetic benefit of `'` to the painter is consistency with C++. However, its rare usage in C++ at this point reduces this advantage to a very small one, while there is broad convergence amongst other languages around `_`. The choice here has no risk of significant meaning or building up patterns of reading for users that might be disrupted by the change, and so it seems reasonable to simply converge with other languages to end up in the less surprising and more conventional syntax space.
Open questions
Placement restrictions of digit separators:
- The core team had consensus for the proposed restricted placement rules.
Use `_` or `'` as the digit separator character:
- The core team deferred this decision to the painter.
- The painter selected `_`.