Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

std::net::Ipv4Addr parsing violates the "strict" form in IETF RFC 6943 Section 3.1.1. #86964

Closed
zacknewman opened this issue Jul 7, 2021 · 14 comments · Fixed by #86984
Closed
Labels
C-bug Category: This is a bug.

Comments

@zacknewman
Copy link

zacknewman commented Jul 7, 2021

According to the std::net::Ipv4Addr documentation, "The four octets are in decimal notation, divided by . (this is called 'dot-decimal notation'). Notably, octal numbers and hexadecimal numbers are not allowed per IETF RFC 6943"; however the parsing of the &str 127.0000.0.1 violates the "strict" form mentioned in Section 3.1.1.—this form is what prohibits octal and hexadecimal numbers—meanwhile the &str 127.0.0.001 conforms to it. IETF RFC 6943 Section 3.1.1. states:

If the af argument of inet_pton() is AF_INET, the src string shall
be in the standard IPv4 dotted-decimal form:

ddd.ddd.ddd.ddd

where "ddd" is a one to three digit decimal number between 0 and
255. The inet_pton() function does not accept other formats (such
as the octal numbers, hexadecimal numbers, and fewer than four
numbers that inet_addr() accepts).
As shown above, inet_pton() uses what we will refer to as the
"strict" form of an IPv4 address literal. Some platforms also use
the strict form with getaddrinfo() when the AI_NUMERICHOST flag is
passed to it.

I tried this code:

use std::net::Ipv4Addr;
fn main() {

    // This causes a panic despite violating the "strict" form
    // mentioned in RFC 6943 Section 3.1.1. specifically
    // the fact the second octet is represented as a 4-digit base-10
    // number.
    assert!("127.0000.0.1".parse::<Ipv4Addr>().is_err());
    // This causes a panic despite conforming to the "strict" form
    // mentioned in RFC 6943 Section 3.1.1. specifically
    // it is a string of exactly 4 octets separated by '.'
    // where each octet is represented as a one-to-three digit
    // base-10 number.
    assert!("127.0.0.001".parse::<Ipv4Addr>().is_ok())
}

I expected to see this happen: both assert!s not causing a panic since the first &str has an octet represented by more than three base-10 numbers despite the requirement per RFC 6943 stating each octet be a "one to three digit decimal number". The second &str is valid per RFC 6943 since it is a string of exactly 4 octets separated by . where each octet is a "one to three digit decimal number".

Instead, this happened: Both lines cause a panic.

Meta

rustc --version --verbose:

rustc 1.53.0 (53cb7b09b 2021-06-17)
binary: rustc
commit-hash: 53cb7b09b00cbea8754ffb78e7e3cb521cb8af4b
commit-date: 2021-06-17
host: x86_64-apple-darwin
release: 1.53.0
LLVM version: 12.0.1

rustc --version --verbose:

rustc 1.55.0-nightly (885399992 2021-07-06)
binary: rustc
commit-hash: 885399992c4c1dde37b506b8507a7d69415646b9
commit-date: 2021-07-06
host: x86_64-apple-darwin
release: 1.55.0-nightly
LLVM version: 12.0.1
Backtrace

thread 'main' panicked at 'assertion failed: \"127.0000.0.1\".parse::<Ipv4Addr>().is_err()', src/main.rs:8:5
stack backtrace:
   0: rust_begin_unwind
             at /rustc/885399992c4c1dde37b506b8507a7d69415646b9/library/std/src/panicking.rs:515:5
   1: core::panicking::panic_fmt
             at /rustc/885399992c4c1dde37b506b8507a7d69415646b9/library/core/src/panicking.rs:92:14
   2: core::panicking::panic
             at /rustc/885399992c4c1dde37b506b8507a7d69415646b9/library/core/src/panicking.rs:50:5
   3: playground::main
             at /Users/zack/Documents/rust/playground/src/main.rs:8:5
   4: core::ops::function::FnOnce::call_once
             at /rustc/885399992c4c1dde37b506b8507a7d69415646b9/library/core/src/ops/function.rs:227:5

@zacknewman zacknewman added the C-bug Category: This is a bug. label Jul 7, 2021
@zacknewman zacknewman changed the title std::net::Ipv4Addr parsing violates the "strict" form in RFC 6943 Section 3.1.1. std::net::Ipv4Addr parsing violates the "strict" form in IETF RFC 6943 Section 3.1.1. Jul 7, 2021
@mbartlett21
Copy link
Contributor

mbartlett21 commented Jul 8, 2021

The second &str is valid per RFC 6943 since it is a string of exactly 4 octets separated by . where each octet is a "one to three digit decimal number".

I believe that the second input (127.0.0.001) is not valid, since the spec explicitly says that octal numbers are disallowed, and 001 is an octal number in the loose form.

The example in the RFC for an octal number is 012, which parses to 10 if you use octal, but 12 if you use decimal. If you instead have it parse it always as 12, that means that bad things can happen, such as different tools interpreting 012 as different things.

If you read the source code where the ipv4 address is parsed, it specifically disallows that:

// Disallow octal number in IP string.
// https://tools.ietf.org/html/rfc6943#section-3.1.1
match (p.peek_char(), p.read_number(10, None)) {
(Some('0'), Some(number)) if number != 0 => None,
(_, number) => number,
}

the first &str has an octet represented by more than three base-10 numbers despite the requirement per RFC 6943 stating each octet be a "one to three digit decimal number".

Regarding 127.0000.0.1, the documentation only says that the disallowing of hexadecimal and octal numbers are according to IETF RFC 6943, and does not say anything about only allowing three digits. Also, four or more zeroes are an edge case, and thus the only numbers that can be parsed there with more than three digits, due to the disallowing of octal numbers apart from 0.

@Urgau
Copy link
Member

Urgau commented Jul 8, 2021

From the release note of 1.53:

Ipv4::from_str will now reject octal format IP addresses in addition to rejecting hexadecimal IP addresses. The octal format can lead to confusion and potential security vulnerabilities and is no longer recommended (ietf6943).

@zacknewman
Copy link
Author

zacknewman commented Jul 8, 2021

I'm confused by these comments. Padded zeros is not unique to the base-8 numeral system. Base-2, Base-3, etc. all have the property of having an infinite number of padded zeros with the convention being to truncate them all. After all 127.0.0.1 is also valid in base-8, base-9, etc. The documentation is quite clear that the octets are to be interpreted in base-10 though, so this problem of how to interpret the octets is solved. Consequently the last octet in 127.0.0.001 is interpreted as 1 in base-10 for the same reason the first octet in 127.0.0.1 is interpreted as 127 in base-10 and not some other value (e.g., 87 in base-10) namely because the documentation says so.

After all objecting to 127.0.0.001 on the grounds that the last octet could be in base-8—again, all of the octets there could be in base-8—doesn't apply to the fact that 008.008.008.008 is also rejected by the parsing algorithm since clearly each of those octets are not in base-8 since 8 is not a symbol in the base-8 numeral system. This is of course nothing more than a consequence of incorrectly assuming padded 0s means base-8. The inconsistent treatment of 0 (in base-10) is also a consequence of this incorrect assumption. Why is 0 (in base-10) allowed to be represented by one or more 0s, but 1 (in base-10) cannot have padded 0s? There is nothing special about the syntactic representation of 0 (in base-10) that warrants this inconsistent treatment.

To me the solution is clear: more strictly follow IETF RFC 6943 and enforce that each octet be one-to-three digits in length. One could even follow a stricter format and require that there be no padded 0s at all—since the RFC doesn't actually specify that. This would then lead to consistent behavior (that's also more easily predictable since it follows a well-established convention) in that both 127.000.000.000 and 127.000.000.001 will either be accepted (since they both follow RFC 6943) or rejected (if a stricter form is required that requires the minimal number of digits necessary to represent each octet).

@mbartlett21
Copy link
Contributor

mbartlett21 commented Jul 8, 2021

This is of course nothing more than a consequence of incorrectly assuming padded 0s means base-8.

This is not an incorrect assumption, as octal numbers in the loose format are specified by the number having a leading zero. (See #83652 for the PR that banned octal numbers)

Why is 0 (in base-10) allowed to be represented by an arbitrary number of 0s, but 1 (in base-10) cannot have padded 0s? There is nothing special about the syntactic representation of 0 (in base-10) that warrants this inconsistent treatment.

This is an implementation detail in the way that it checks for leading zeroes: it checks if the number starts with a zero and rejects it if the resultant parsed number is not zero.

@zacknewman
Copy link
Author

zacknewman commented Jul 8, 2021

So we are re-defining how mathematics defines numeral systems? Saying that 089A is in base-8 (i.e., octal) because of the leading 0 makes no sense since none of those symbols except 0 exist in that system. Rust certainly doesn't treat the &str 0100 in base-8 here: assert!(110u32 == "0110".parse::<u32>().unwrap()).

Also the documentation clearly states that each octet is in "decimal notation". Suddenly treating the octet 001 as base-8 is a clear contradiction to the documentation.

@syvb
Copy link
Contributor

syvb commented Jul 8, 2021

Using leading zeros to designate octal literals is a fairly common convention. In other languages like JS and C, 010 == 8. Rust doesn't have that as an octal prefix because it is confusing. When using leading zeros to indicate octal numbers, 089A isn't a valid number at all (in the same way that ⼋ⓡ⮯ isn't a valid base-10 number).

@zacknewman
Copy link
Author

So it sounds like the documentation needs to be modified. It is confusing (because it is contradictory to reality) to say that each octet is in decimal notation when the actual implementation requires each octet to be either base-10 or base-8 where the latter is assumed when there are leading 0s and the value is not 0—at which point any positive-integer amount of 0s are allowed—and furthermore error when any octet is interpreted as base-8.

Seems much easier to both document and comprehend to simply require at most 3 decimal digits for each octet and either forbid leading 0s (including for the integer 0) or allow leading 0s (including 001).

@RalfJung
Copy link
Member

RalfJung commented Jul 9, 2021

So it sounds like the documentation needs to be modified. It is confusing (because it is contradictory to reality) to say that each octet is in decimal notation when the actual implementation requires each octet to be either base-10 or base-8 where the latter is assumed when there are leading 0s and the value is not 0—at which point any positive-integer amount of 0s are allowed—and furthermore error when any octet is interpreted as base-8.

"The actual implementation" is relevant only insofar as its behavior is externally observable. And -- correct me if I'm wrong -- in every case where the function succeeds, it (behaves as if it) interprets all digits in decimal, right? So I'd say it is correct to say that everything is "in decimal notation".

The only weird bit is that 0000 is accepted, but 0001 is not. This is strange, but not in contradiction with saying things are "in decimal notation".

@zacknewman
Copy link
Author

zacknewman commented Jul 9, 2021

When the documentation says the "four octets are in decimal notation", that is effectively a definition which in turn can be phrased as a logical biconditional. In this case, the logical biconditional I formed in my head was something like: "a &str will be parsed into an std::net::Ipv4Addr iff it is a sequence of exactly 4 base-10 numbers delimited by a .". Only one of those directions is true: if an std::net::Ipv4Addr is produced from parsing a &str, then the &str was a sequence of exactly 4 base-10 numbers delimited by a .. The other direction, which is observable via a parsing failure, is not true as can be seen with something like 001.001.001.001. That is a sequence of exactly 4 base-10 numbers delimited by a ., but it leads to a parsing failure.

As was discovered in the discussion here, it sounds like "decimal notation" as used in the documentation is not what it has meant my entire life nor what it means in a lot of areas in Rust. Specifically, any number that has leading 0s is not a base-10 number but in fact a base-8 number. With this modified definition of "decimal"; then my counterexample, 001.001.001.001, is in fact not a counterexample since each of those octets are not in "decimal" notation but in fact base-8. Even more perplexing as stated by @cuviper is that "a lone 0 is technically octal as well in C, not that it matters" which means the definition of "decimal" used in the documentation is neither the mathematical definition nor the C definition—since something like 0.0.0.0 is successfully parsed despite each of those octets being in base-8 per the C definition.

@RalfJung
Copy link
Member

RalfJung commented Jul 9, 2021

Specifically, any number that has leading 0s is not a base-10 number but in fact a base-8 number.

That's true in C. (Specifically, it is how C converts series of digits to numbers. "Numbers" don't have leading 0s, numbers are elements of the set of integers. The representation of a number as a string of digits can have leading 0s. At least that's how I think the term is usually used when one needs to be precise about series of digits vs. the thing they represent.)
However, it's not true for this function. Any number that has leading 0s is rejected by this function. This fact about C was brought up only to explain why leading 0s are rejected.

No base-8 occurs anywhere in the behavior of this function. If you disagree, please show an input-output pair for this function which demonstrates that base 8 is used. (The code inside the function doesn't matter, of course. Only its observable behavior does.)

I really don't understand why you keep insisting that this function has anything to do with base 8. I think it might be because of some quirk in how it is implemented, but unobservable implementation details are irrelevant. As a mathematician you should be even more used to this than most programmers, since the usual way of defining mathematical functions consists entirely of input-output pairs, there is not even such a thing as an "implementation"... but I am digressing. ;)

Even more perplexing as stated by @cuviper is that
"a lone 0 is technically octal as well in C, not that it matters" which means the definition of "decimal" used in the documentation is neither the mathematical definition nor the C definition

Even mathematically speaking, the sequence of digits "0" represents the same number in any base system. So, in a very precise formal sense, "not that it matters" is accurate. A base is just a way to convert a series of digits into a number, and for series of digits consisting only of 0s, the result is always the same.

The definition of "decimal" used in the documentation exactly matches the mathematical definition. The documentation just fails to mention that there is an additional restriction, namely that no leading 0s are accepted (except if all digits of an octet are 0). However, it remains the case that if an input is accepted by this function, then all octets are interpreted as decimal -- and moreover, all decimal representations without leading 0s are accepted by this function.

@zacknewman
Copy link
Author

zacknewman commented Jul 9, 2021

The definition of "decimal" used in the documentation exactly matches the mathematical definition. The documentation just fails to mention that there is an additional restriction, namely that no leading 0s are accepted (except if all digits of an octet are 0). However, it remains the case that if an input is accepted by this function, then all octets are interpreted as decimal -- and moreover, all decimal representations without leading 0s are accepted by this function.

Saying a definition matches exactly with another definition before subsequently changing the definition by adding other restrictions does not make sense to me. Does the definition of an even integer "match exactly" with the definition of an integer? No. Adding or removing restrictions to a definition changes the definition. So the fact that leading 0s is not allowed means we are not talking about mathematical base-10. If I define a function that takes an integer and it errors any time the integer is not even; however the documentation of that function makes no mention of that seems very problematic. I would not defend such documentation by saying "my definition of integer matches exactly with the definition of the mathematical definition of an integer except I also require the integer to be divisible by 2". My documentation should very clearly define what I mean by "integer", or better, simply state "even integer".

@RalfJung
Copy link
Member

RalfJung commented Jul 9, 2021

Okay, I guess I should have said that it is strictly a restriction of the mathematical definition (a subset, if you take the graph view of a function). But it doesn't disagree on any inputs where it is defined. Your talking about base 8 makes it sound like sometimes it is not matching the base 10 function even when it does produce a result (but matches the base 8 function instead), and it took a quite a lot of time to realize that is not what happens.

Moreover, the subset still contains the full graph of the inverse of the full "base 10" function where you always pick the shortest representation (i.e., the most common "serialization" function of numbers into base 10 representation). So unlike your "even" example it is a pretty reasonable parsing function still. There are only some "non-canonical" representations being rejected.

@zacknewman
Copy link
Author

zacknewman commented Jul 9, 2021

Well, I think we are talking in circles at this point honestly, lol. While I would agree my example is a more extreme example of documentation failure since conflating the definition of "even integer" with "integer" is more absurd than conflating the definition of "base-10 in the shortest representation (with the exception of 0 which is allowed any non-negative integer amount of leading 0s)" with "base-10", both examples suffer from the fact that additional failures occur than otherwise would occur based on the exact definition of the term that is used (in my example "integer" and in this example "decimal").

For me, as I mentioned in the other discussion, this really is not me being "pedantic". I wrote code based on the documentation that ended being invalid since I (mis-)interpreted "decimal" to be more general than what it actually means in this context. Ideally, in my opinion at least, that should not happen especially since it is not that much additional effort to more properly document what certain terms mean even if that means just adding one or two more examples highlighting parsing failures (or successes) that one (like myself) would not expect.

@RalfJung
Copy link
Member

RalfJung commented Jul 9, 2021

We already agree that the docs need to be improved. :) The rest of this discussion is then mostly philosophical and not going to lead to a productive outcome, so maybe we better stop it. ;)

GuillaumeGomez added a commit to GuillaumeGomez/rust that referenced this issue Oct 20, 2021
Reject octal zeros in IPv4 addresses

This fixes rust-lang#86964 by rejecting octal zeros in IP addresses, such that `192.168.00.00000000` is rejected with a parse error, since having leading zeros in front of another zero indicates it is a zero written in octal notation, which is not allowed in the strict mode specified by RFC 6943 3.1.1. Octal rejection was implemented in rust-lang#83652, but due to the way it was implemented octal zeros were still allowed.
JohnTitor added a commit to JohnTitor/rust that referenced this issue Oct 21, 2021
Reject octal zeros in IPv4 addresses

This fixes rust-lang#86964 by rejecting octal zeros in IP addresses, such that `192.168.00.00000000` is rejected with a parse error, since having leading zeros in front of another zero indicates it is a zero written in octal notation, which is not allowed in the strict mode specified by RFC 6943 3.1.1. Octal rejection was implemented in rust-lang#83652, but due to the way it was implemented octal zeros were still allowed.
JohnTitor added a commit to JohnTitor/rust that referenced this issue Oct 21, 2021
Reject octal zeros in IPv4 addresses

This fixes rust-lang#86964 by rejecting octal zeros in IP addresses, such that `192.168.00.00000000` is rejected with a parse error, since having leading zeros in front of another zero indicates it is a zero written in octal notation, which is not allowed in the strict mode specified by RFC 6943 3.1.1. Octal rejection was implemented in rust-lang#83652, but due to the way it was implemented octal zeros were still allowed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Category: This is a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants