# Numeric literals
<!--
Part of the Carbon Language project, under the Apache License v2.0 with LLVM
Exceptions. See /LICENSE for license information.
SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-->
[Pull request](https://github.com/carbon-language/carbon-lang/pull/143)
<!-- toc -->
## Table of contents
- [Problem](#problem)
- [Background](#background)
- [Proposal](#proposal)
- [Details](#details)
    - [Integer literals](#integer-literals)
    - [Real number literals](#real-number-literals)
        - [Ties](#ties)
    - [Digit separators](#digit-separators)
        - [Open question: digit separator placement](#open-question-digit-separator-placement)
- [Alternatives considered](#alternatives-considered)
    - [Integer bases](#integer-bases)
        - [Octal literals](#octal-literals)
        - [Decimal literals](#decimal-literals)
        - [Case sensitivity](#case-sensitivity)
    - [Real number syntax](#real-number-syntax)
    - [Digit separator syntax](#digit-separator-syntax)
- [Rationale](#rationale)
    - [Painter rationale](#painter-rationale)
    - [Open questions](#open-questions)
<!-- tocstop -->
## Problem
This proposal specifies lexical rules for numeric constants in Carbon.
## Background
We wish to cover literals for two categories of types:
- Integer types, that can represent some (typically contiguous) subset of the
integers, ℤ.
- Real number types, that can represent some
[discrete](https://en.wikipedia.org/wiki/Isolated_point) subset of the real
numbers, ℝ. (Typically only rational numbers can be represented, but that
doesn't matter for our purposes.)
Real number types may include additional values (infinities and NaN values). We
do not provide a notation to express such values.
In C++, the following syntaxes are used:
- Integer literals
    - `12345` (decimal)
    - `0x1FE` (hexadecimal)
    - `0123` (octal)
    - `0b1010` (binary)
- Real number literals
    - Decimal
        - `123.`
        - `.123`
        - `123.456`
        - `123.e456` (= 123 \* 10<sup>456</sup>)
        - `.123e456`
        - `123.456e789`
        - `123e456` (no decimal point)
        - Any of the above with a `+` or `-` after `e`.
    - Hexadecimal
        - `0x123.p456` (= 123<sub>16</sub> \* 2<sup>456</sup>)
        - `0x.123p456`
        - `0x123.456p789`
        - `0x123p456` (no hexadecimal point)
        - Any of the above with a `+` or `-` after `p`.
- Digit separators (`'`) may appear between any two digits
- An optional suffix defines the type
    - `U` (`unsigned`) and `L` (`long`) or `LL` (`long long`) for integers
      (order-independent, but `LUL` disallowed)
    - `F` (`float`) or `L` (`long double`) for real numbers
    - User-defined literals may have custom suffixes, starting with `_` for
      non-standard-library literals.
C++ numeric literals are case-insensitive, except in the suffix of a
user-defined literal. Negative numbers are formed by applying a unary `-`
operator to a non-negative literal.
The type of a literal in C++ depends primarily on its syntax and its suffix.
However, for integer literals, the type also depends on the value; the language
rules attempt to pick a type large enough to fit the value. An `unsigned` type
is always used if a `U` suffix is present, is never used for a decimal literal
without a `U` suffix, and otherwise may or may not be used depending on whether
the value happens to fit into an unsigned type but not into a signed type of the
same width.
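As a rough illustration of the value-dependent rule above, the following Python sketch walks the C++ candidate type list for an integer literal that has no `L` or `LL` suffix. The bit widths (32-bit `int`, 64-bit `long` and `long long`) are an assumption that matches common LP64 platforms, and `cpp_literal_type` is a hypothetical helper name, not part of any library.

```python
# Assumed LP64 bit widths; other platforms may differ.
WIDTHS = {"int": 32, "long": 64, "long long": 64}


def fits(value: int, type_name: str, unsigned: bool) -> bool:
    """Return True if a non-negative value fits the (assumed) width."""
    bits = WIDTHS[type_name]
    limit = 2**bits if unsigned else 2 ** (bits - 1)
    return 0 <= value < limit


def cpp_literal_type(value: int, is_decimal: bool, has_u_suffix: bool) -> str:
    """Pick the first type in the C++ candidate list that can hold value."""
    for name in ("int", "long", "long long"):
        # Signed candidates are skipped entirely when a `U` suffix is present.
        if not has_u_suffix and fits(value, name, unsigned=False):
            return name
        # Unsigned candidates apply with `U`, or for non-decimal literals.
        if (has_u_suffix or not is_decimal) and fits(value, name, unsigned=True):
            return "unsigned " + name
    return "no standard type fits"
```

Under these assumed widths, `cpp_literal_type(0xFFFFFFFF, False, False)` yields `unsigned int`, while the decimal literal with the same value, `cpp_literal_type(4294967295, True, False)`, yields `long`.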
Other languages use somewhat different rules, but the broad lexical structure
above -- an optional prefix for the base, a value, an optional exponent, and an
optional suffix -- is common across a large number of languages.
## Proposal
We allow these syntaxes:
- Integer literals
    - `12345` (decimal)
    - `0x1FE` (hexadecimal)
    - `0b1010` (binary)
- Real number literals
    - `123.456` (digits on both sides of the `.`)
    - `123.456e789` (optional `+` or `-` after the `e`)
    - `0x1.2p123` (optional `+` or `-` after the `p`)
- Digit separators (`_`) may be used, but only in conventional locations
Note that real number literals always contain a `.` with digits on both sides,
and integer literals never contain a `.`.
Literals are case-sensitive.
This proposal does not include literals with type suffixes, but without
prejudice: it argues neither for nor against adding such literals.
## Details
### Integer literals
Decimal integers are written as a non-zero decimal digit followed by zero or
more additional decimal digits, or as a single `0`.
Integers in other bases are written as a `0` followed by a base specifier
character, followed by a sequence of digits in the corresponding base. The
available base specifiers and corresponding bases are:
| Base specifier | Base | Digits |
| -------------- | ---- | ------------------------ |
| `b` | 2 | `0` and `1` |
| `x` | 16 | `0` ... `9`, `A` ... `F` |
The above table is case-sensitive. For example, `0b1` and `0x1A` are valid, and
`0B1`, `0X1A`, and `0x1a` are invalid.
A zero at the start of a literal can never be followed by another digit: either
the literal is `0`, the `0` begins a base specifier, or the next character is a
decimal point (see below).
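The integer rules above are small enough to capture in a single regular expression. The following Python sketch is one possible reading of them, ignoring digit separators and assuming that at least one digit must follow a base specifier; `INTEGER_LITERAL` and `integer_value` are hypothetical names used only for illustration.

```python
import re

# One alternative per form: `0`, a decimal integer with no leading zero,
# a binary literal, or a hexadecimal literal with uppercase digits only.
INTEGER_LITERAL = re.compile(r"0|[1-9][0-9]*|0b[01]+|0x[0-9A-F]+")


def integer_value(text: str) -> int:
    """Return the value of a separator-free integer literal, or raise ValueError."""
    if INTEGER_LITERAL.fullmatch(text) is None:
        raise ValueError(f"invalid integer literal: {text!r}")
    if text.startswith("0b"):
        return int(text[2:], 2)
    if text.startswith("0x"):
        return int(text[2:], 16)
    return int(text)
```

With this sketch, `integer_value("0x1A")` returns 26, while `0B1`, `0X1A`, `0x1a`, and `0123` all raise `ValueError`.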
### Real number literals
Real numbers are written as a decimal or hexadecimal integer followed by a
period (`.`) followed by a sequence of one or more decimal or hexadecimal
digits, respectively. A digit is required on each side of the period. `0.` and
`.3` are both invalid.
A real number can be followed by an exponent character, an optional `+` or `-`
(defaulting to `+` if absent), and a character sequence matching the grammar of
a decimal integer with some value _N_. For a decimal real number, the exponent
character is `e`, and the effect is to multiply the given value by
10<sup>±_N_</sup>. For a hexadecimal real number, the exponent character
is `p`, and the effect is to multiply the given value by
2<sup>±_N_</sup>. The exponent suffix is optional for both decimal and
hexadecimal real numbers.
Note that a decimal integer followed by `e` is not a real number literal. For
example, `3e10` is not a valid literal.
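The grammar and the effect of the exponent can be sketched in Python using exact rational arithmetic. This is an illustrative reading of the rules above, not a reference lexer: it ignores digit separators, and the names `DECIMAL_REAL`, `HEX_REAL`, and `real_value` are hypothetical.

```python
import re
from fractions import Fraction

# Decimal form: integer part, `.`, one or more digits, optional `e` exponent.
DECIMAL_REAL = re.compile(r"(0|[1-9][0-9]*)\.([0-9]+)(?:e([+-]?)(0|[1-9][0-9]*))?")
# Hexadecimal form: `0x`, uppercase digits, `.`, digits, optional `p` exponent.
HEX_REAL = re.compile(r"0x([0-9A-F]+)\.([0-9A-F]+)(?:p([+-]?)(0|[1-9][0-9]*))?")


def real_value(text: str) -> Fraction:
    """Return the exact value of a separator-free real number literal."""
    forms = ((DECIMAL_REAL, 10, 10), (HEX_REAL, 16, 2))
    for pattern, digit_base, exponent_base in forms:
        match = pattern.fullmatch(text)
        if match is None:
            continue
        whole, fraction, sign, exponent = match.groups()
        # Read all digits as one integer, then scale by the fraction length.
        mantissa = Fraction(
            int(whole + fraction, digit_base), digit_base ** len(fraction)
        )
        # A missing exponent behaves like an exponent of +0.
        power = int(exponent) if exponent is not None else 0
        if sign == "-":
            power = -power
        return mantissa * Fraction(exponent_base) ** power
    raise ValueError(f"invalid real number literal: {text!r}")
```

For example, `real_value("1.5e-2")` is exactly 3/200 and `real_value("0x1.8")` is exactly 3/2, while `3e10`, `0.`, and `.3` raise `ValueError`.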
When a real number literal is interpreted as a value of a real number type, its
value is the representable real number closest to the value of the literal. In
the case of a [tie](#ties), the conversion to the real number type is invalid.
The decimal real number syntax allows for any decimal fraction to be expressed
-- that is, any number of the form _a_ x 10<sup>-_b_</sup>, where _a_ is an
integer and _b_ is a non-negative integer. Because the decimal fractions are
dense in the reals and the set of values of the real number type is assumed to
be discrete, every value of the real number type can be expressed as a real
number literal. However, for certain applications, directly expressing the
intended real number representation may be more convenient than producing a
decimal equivalent that is known to convert to the intended value. Hexadecimal
real number literals are provided in order to permit values of binary floating
or fixed point real number types to be expressed directly.
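Python's built-in `float.hex` and `float.fromhex` use the same `p`-exponent notation, which makes the convenience argument easy to see. The sketch below assumes that Python's `float` is an IEEE 754 binary64 value, which holds on common platforms. Note that Python prints lowercase hexadecimal digits, which the rules proposed here would reject.

```python
# Python's float is an IEEE 754 binary64 value on common platforms (assumption).
stored = 0.1

# float.hex() spells out the exact stored significand and binary exponent.
# Python prints lowercase hex digits; the proposal would require uppercase.
as_hex = stored.hex()

# float.fromhex() reads the same notation back without any rounding.
round_trip = float.fromhex(as_hex)

# A short hexadecimal literal names a binary fraction exactly: 1.5 * 2**1.
three = float.fromhex("0x1.8p1")
```

Here `as_hex` is `0x1.999999999999ap-4`: the decimal literal `0.1` only approximates the stored value, whereas the hexadecimal form states it directly.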
#### Ties
As described above, a real number literal that lies exactly between two
representable values for its target type is invalid. Such ties are extremely
unlikely to occur by accident: for example, when interpreting a literal as
`Float64`, `1.` would need to be followed by exactly 53 decimal digits (followed
by zero or more `0`s) to land exactly half-way between two representable values,
and the probability of `1.` followed by a random 53-digit sequence resulting in
such a tie is one in 5<sup>53</sup>, or about
0.000000000000000000000000000000000009%. For `Float32`, it's about
0.000000000000001%, and even for a typical `Float16` implementation with 10
fractional bits, it's around 0.00001%.
Ties are much easier to express as hexadecimal floating-point literals: for
example, `0x1.0000_0000_0000_08p+0` is exactly half way between `1.0` and the
smallest `Float64` value greater than `1.0`, which is `0x1.0000_0000_0000_1p+0`.
Whether written in decimal or hexadecimal, a tie provides very strong evidence
that the developer intended to express a precise floating-point value, and
provided one bit too much precision (or one bit too little, depending on whether
they expected some rounding to occur), so rejecting the literal seems like a
better option than accepting it and making an arbitrary choice between the two
possible values.
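A tie check of this kind can be sketched in Python with exact rational arithmetic. The sketch assumes that Python's `float` is an IEEE 754 binary64 value (standing in for `Float64`), that the value is within the finite range of that type, and that `math.nextafter` is available (Python 3.9 or later); `is_float64_tie` is a hypothetical helper name.

```python
import math
from fractions import Fraction


def is_float64_tie(value: Fraction) -> bool:
    """Return True if value lies exactly half-way between two adjacent doubles."""
    nearest = float(value)  # correctly rounded conversion in CPython
    exact = Fraction(nearest)
    if exact == value:
        return False  # exactly representable, so not a tie
    # Step from the rounded result toward the original value to find the
    # representable neighbor on the other side.
    direction = math.inf if value > exact else -math.inf
    neighbor = Fraction(math.nextafter(nearest, direction))
    return value - exact == neighbor - value
```

The hexadecimal example above has the value 1 + 2<sup>-53</sup>, so `is_float64_tie(Fraction(1) + Fraction(1, 2**53))` returns `True`, while `is_float64_tie(Fraction(1, 10))` returns `False`.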
### Digit separators
If digit separators (`_`) are included in literals, they must meet the
respective condition:
- For decimal integers, the digit separators shall occur every three digits
starting from the right. For example, `2_147_483_648`.
- For hexadecimal integers, the digit separators shall occur every four digits
starting from the right. For example, `0x7FFF_FFFF`.
- For real number literals, digit separators can appear in the decimal and
hexadecimal integer portions (prior to the period and after the optional `e`
or `p`) as described in the previous bullets. For example,
`2_147.483648e12_345` or `0x1_00CA.FEF00Dp+24`.
- For binary literals, digit separators can appear between any two digits. For
example, `0b1_000_101_11`.
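The placement rules for integer literals can be sketched as three regular expressions in Python. This is one reading of the bullets above, under the assumption that a literal either uses separators at every conventional position or uses none at all; the pattern names and `has_valid_separators` are hypothetical. Real number literals would apply the same decimal and hexadecimal patterns to the parts before the period and after the exponent character.

```python
import re

# With separators, every group after the first must have the full width.
DECIMAL = re.compile(r"0|[1-9][0-9]*|[1-9][0-9]{0,2}(_[0-9]{3})+")
HEXADECIMAL = re.compile(r"0x([0-9A-F]+|[0-9A-F]{1,4}(_[0-9A-F]{4})+)")
# Binary separators may fall between any two digits.
BINARY = re.compile(r"0b[01]+(_[01]+)*")


def has_valid_separators(text: str) -> bool:
    """Check an integer literal, including any `_` digit separators."""
    return any(p.fullmatch(text) for p in (DECIMAL, HEXADECIMAL, BINARY))
```

This accepts `2_147_483_648`, `0x7FFF_FFFF`, and `0b1_000_101_11`, and rejects groupings such as `123_4567_89` or `0x7F_FF`.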
#### Open question: digit separator placement
**2020-09-15: core team meeting selected Alternative 0**
As an alternative to the rule proposed above, we could consider different
restrictions on where digit separators can appear:
**Alternative 0:** as presented above.
**Alternative 1:** allow any digit groupings (for example, `123_4567_89`).
Pro:
- Simpler, more flexible rule, that may allow some groupings that are
conventional in a specific domain. For example, `var Date: d = 01_12_1983;`,
or `var Int64: time_in_microseconds = 123456_000000;`.
- Culturally agnostic. For example, the Indian convention for digit separators
would group the last three digits, and then every two digits before that
(1,23,45,678 could be written `1_23_45_678`).
Con:
- Less self-checking that numeric literals are interpreted the way that the
author intends.
**Alternative 2:** as above, but additionally require binary digits to be
grouped in 4s.
Pro:
- More enforcement that digit grouping is conventional.
Con:
- No clear, established rule for how to group binary digits. In some cases, 8
digit groups may be more conventional.
- When used to express literals involving bit-fields, arbitrary grouping may
be desirable. For example:
```carbon
var Float32: flt_max =
BitCast(Float32, 0b0_11111110_11111111111111111111111);
```
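The irregular 1-8-23 grouping in this example mirrors the sign, exponent, and fraction fields of an IEEE 754 binary32 value. The following Python sketch rebuilds the same bit pattern from those three fields and reinterprets it as a float, using `struct` as a stand-in for the `BitCast` call in the Carbon snippet; the variable names are illustrative only.

```python
import struct

# The three Float32 bit-fields from the example: sign, exponent, fraction.
sign, exponent, fraction = 0b0, 0b11111110, 2**23 - 1

# Pack the fields into one 32-bit pattern and spell it with irregular groups.
bits = (sign << 31) | (exponent << 23) | fraction
grouped = f"0b{sign:01b}_{exponent:08b}_{fraction:023b}"

# Reinterpret the pattern as an IEEE 754 binary32 value.
(flt_max,) = struct.unpack(">f", struct.pack(">I", bits))
```

The resulting `flt_max` equals (2 - 2<sup>-23</sup>) \* 2<sup>127</sup>, the largest finite binary32 value, and `grouped` reproduces the literal with one group per bit-field.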
**Alternative 3:** allow any regular grouping.
Pro:
- Can be applied uniformly to all bases.
Con:
- Provides no assistance for decimal numbers with a single digit separator.
- Does not allow binary literals to express an intent to initialize irregular
bit-fields.
## Alternatives considered
There are a number of different design choices we could make, as divergences
from the above proposal. Those choices, along with the arguments that led to
choosing the proposed design rather than each alternative, are presented below.
### Integer bases
#### Octal literals
No support is proposed for octal literals. In practice, their appearance in C
and C++ code in a sample corpus consisted of (in decreasing order of commonality
and excluding `0` literals):
- file permissions,
- cases where decimal was clearly intended (`CivilDay(2020, 04, 01)`), and
- (in _distant_ third place) anything else.
The number of intentional uses of octal literals, other than in file
permissions, was negligible. We considered the following alternatives:
**Baseline:** This proposal suggests that we do not support octal literals.
Octal literals are rare and mostly obsolescent. File permissions can be
supported in some other way.
**Alternative 1:** Follow C and C++, and use `0` as the base prefix for octal.
Pro:
- More similar to C++ and other languages.
Con:
- Subtle and error-prone rule: for example, left-padding with zeroes for
alignment changes the meaning of literals.
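To make the padding hazard concrete, the short Python sketch below reads the same digit string under both interpretations. It uses Python's `int` with an explicit base purely as a calculator; it is not a model of either language's lexer.

```python
# The same digit string read under the two competing interpretations.
padded = "0123"

as_decimal = int(padded, 10)  # leading zero treated as padding
as_octal = int(padded, 8)  # leading zero treated as a C-style octal prefix
```

Here `as_decimal` is 123 but `as_octal` is 83, so padding a column of numbers with zeroes silently changes their values under the C and C++ rule.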
**Alternative 2:** Use `0o` as the base prefix for octal.
Pro:
- Unlikely to be misinterpreted as decimal.
- Follows several other languages (for example, Python).
Con:
- Additional language complexity.
If we decide we want to introduce octal literals at a later date, use of
alternative 2 is suggested.
#### Decimal literals
**We could permit leading `0`s in decimal integers (and in floating-point
numbers).**
Pro:
- We would allow leading `0`s to be used to align columns of numbers.
Con:
- The same literal could be valid but have a different value in C++ and
Carbon.
**We could add an (optional) base specifier `0d` for decimal integers.**
Pro:
- Uniform treatment of all bases. Left-padding with `0` could be achieved by
using `0d000123`.
Con:
- No evidence of need for this functionality.
**We could permit an `e` in decimal literals to express large powers of 10.**
Pro:
- Many uses of (for example) `1e6` in our sample C++ corpus intend to form an
integer literal instead of a floating-point literal.
Con:
- Would violate the expectations of many C++ programmers used to `e`
indicating a floating-point constant.
We suggest that this syntax is not added at this point. However, it should be
reconsidered at a later date, once developers are used to the requirement that
real literals always contain a period.
#### Case sensitivity
**We could make base specifiers case-insensitive.**
Pro:
- More similar to C++.
Con:
- `0B1` is easily mistaken for `081`
- `0B1` can be confused with `0xB1`
- `0O17` is easily mistaken for `0017`
- Allowing more than one way to write literals will lead to style divergence.
**We could make the digit sequence in hexadecimal integers case-insensitive.**
Pro:
- More similar to C++.
- Some developers will be more comfortable writing hexadecimal digits in
lowercase. Some tools, such as `md5`, will print lowercase.
Con:
- Allowing more than one way to write literals will lead to style divergence.
- Lowercase hexadecimal digits are less visually distinct from the `x` base
specifier (for example, the digit sequence is more visually distinct in
`0xAC` than in `0xac`).
**We could require the digit sequence in hexadecimal integers to be written
using lowercase letters `a`..`f`.**
Pro:
- Some developers will be more comfortable writing hexadecimal digits in
lowercase. Some tools, such as `md5`, will print lowercase.
- `B` and `D` are more likely to be confused with `8` and `0` than `b` and `d`
are.
Con:
- Some developers will be more comfortable writing hexadecimal digits in
uppercase. Some tools will print uppercase.
- Lowercase hexadecimal digits are less visually distinct from the `x` base
specifier (for example, the digit sequence is more visually distinct in
`0xAC` than in `0xac`).
### Real number syntax
**We could allow real numbers with no digits on one side of the period (`3.` or
`.5`).**
Pro:
- More similar to C++.
- Allows numbers to be expressed more tersely.
Con:
- Gives a literal meaning to `tup.0` syntax, which may instead be useful for
indexing tuples.
- Gives a literal meaning to `0.ToString()` syntax, which may instead be useful
for performing member access on literals.
- May harm readability by making the difference between an integer literal and
a real number literal less significant.
- Allowing more than one way to write literals will lead to style divergence.
See also the section on
[floating-point literals](https://google.github.io/styleguide/cppguide.html#Floating_Literals)
in the Google style guide, which argues for the same rule.
**We could allow a real number with an `e` or `p` to omit a period (`1e100`).**
Pro:
- More similar to C++.
- Allows numbers to be expressed more tersely.
Con:
- Assuming that such numbers are integers rather than real numbers is a common
error in C++.
**We could allow the `e` or `p` to be written in uppercase.**
Pro:
- More similar to C++.
- Most calculators use `E`, to avoid confusion with the constant `e`.
Con:
- Allowing more than one way to write literals will lead to style divergence.
- `E` may be confused with a hexadecimal digit.
**We could require a `p` in a hexadecimal real number literal.**
Pro:
- More similar to C++.
- When explicitly writing a bit-pattern for a floating-point type, it's
reasonable to always include the exponent value.
Con:
- Less consistent.
- Makes hexadecimal floating-point values even more expert-only.
**We could arbitrarily pick one of the two values when a real number is exactly
half-way between two representable values.**
Pro:
- More similar to C++.
- Would accept more cases, and it's likely that either of the two possible
values would be acceptable in practice.
Con:
- Would either need to specify which option is chosen or, following C++,
accept that programs using such literals have non-portable semantics.
- Numbers specified to the exact level of precision required to form a tie are
a strong signal that the programmer intended to specify a particular value.
### Digit separator syntax
**2020-09-15: core team meeting chose to forward digit separator to painter**
**2020-10-05: painter selected Alternative 2: `_` as digit separator**
There are various different characters we could attempt to use as a digit
separator. The options we considered are:
**Alternative 0:** `'` as a digit separator.
Pro:
- Follows C++ syntax.
- Used in several (mostly European) writing conventions.
Con:
- `'` is also likely to be used to introduce character literals.
**Alternative 1:** `,` as a digit separator.
Pro:
- More similar to how numbers are written in English text and many other
cultures.
Con:
- Commas are expected to widely be used in Carbon programs for other purposes,
where there may be digits on both sides of the comma. For example, there
could be readability problems if `f(1, 234)` called `f` with two arguments
but `f(1,234)` called `f` with a single argument.
- Comma is interpreted as a decimal point in the conventions of many cultures.
- Unprecedented in common programming languages.
**Alternative 2:** `_` as a digit separator.
Pro:
- Follows convention of C#, Java, JavaScript, Python, D, Ruby, Rust, Swift,
...
- Culturally agnostic, because it doesn't match any common human writing
convention.
Con:
- Underscore is not used as a digit grouping separator in any common human
writing convention.
**Alternative 3:** whitespace as a digit separator.
Pro:
- Used and understood by many cultures.
- Never interpreted as a decimal point instead of a grouping separator.
- Also usable to the right of a decimal point.
Con:
- Omitted separators in lists of numbers may result in distinct numbers being
spliced together. For example, `f(1, 23, 4 567)` may be interpreted as three
separate numerical arguments instead of four arguments with a missing comma.
- Unprecedented in other programming languages.
**Alternative 4:** `.` as digit separator, `,` as decimal point.
Pro:
- More familiar to cultures that write numbers this way.
Con:
- As with `,` as a digit separator, `,` as a decimal point is problematic.
- This usage is unfamiliar and would be surprising to programmers; programmers
from cultures where `,` is the decimal point in regular writing are likely
already accustomed to using `.` as the decimal point in programming
environments, and the converse is not true.
**Alternative 5:** No digit separator syntax.
Pro:
- Simpler language rules.
- More consistent source syntax, as there is no choice as to whether to use
digit separators or not.
Con:
- Harms the readability of long literals.
## Rationale
The proposal provides a syntax that is sufficiently close to that used both by
C++ and many other languages to be very familiar. However, it selects a
reasonably minimal subset of the syntaxes. This minimal approach provides
benefits directly in line with both the simplicity and readability goals of
Carbon:
- Reduces unnecessary choices for programmers.
- Simplifies the syntax rules of the language.
- Improves consistency of written Carbon code.
That said, it still provides sufficient variations to address important use
cases for the goal of not leaving room for a lower level language:
- Hexadecimal and binary integer literals.
- Scientific notation floating point literals.
- Hexadecimal (scientific) floating point literals.
### Painter rationale
The primary aesthetic benefit of `'` to the painter is consistency with C++.
However, its rare usage in C++ at this point reduces this advantage to a very
small one, while there is broad convergence amongst other languages around `_`.
The choice here carries no significant semantic weight, and there is little
risk that users have built up reading patterns that the change would disrupt,
so it seems reasonable to simply converge with other languages and end up in
the less surprising and more conventional syntax space.
### Open questions
Placement restrictions of digit separators:
- The core team had consensus for the proposed restricted placement rules.
Use `_` or `'` as the digit separator character:
- The core team deferred this decision to the painter.
- The painter selected `_`.