Verify Binary Strings are Printable #161
Comments
Assign this one to me, I'll handle it.
That would be a very expensive check to add. I would just fail later when serializing the data. What do you think?
String.printable? is expensive? I was getting a strange error when I was, https://gist.github.com/jeregrine/b66c078133d90de7dfc5 On Fri, Jan 17, 2014 at 4:39 PM, José Valim [email protected]:
It is expensive because you need to go over the whole binary, so it is expensive for an operation that is supposed to only do type checking. Also, a string does not necessarily need to be printable: it can be valid but not printable. There is a chance your problem is more encoding related than printable related. Do you have the binary that caused the issues? @ericmj Can we improve postgrex to force utf-8 mode (maybe via an option) and do validation at the boundaries? Or should that be done by Ecto?
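To make the valid-versus-printable distinction concrete, here is a minimal sketch using only the standard `String` module. The sample binaries are made up for illustration and are not the binary from the report.

```elixir
# Valid UTF-8 that contains a NUL byte: well-formed, but not printable.
valid_not_printable = "abc" <> <<0>>

# Byte 233 on its own is Latin-1 for an accented "e", which is not a
# complete UTF-8 sequence, so the whole binary is invalid.
invalid = <<74, 111, 115, 233>>

IO.inspect(String.valid?(valid_not_printable))     # true
IO.inspect(String.printable?(valid_not_printable)) # false
IO.inspect(String.valid?(invalid))                 # false
```

Both checks have to scan the binary, which is the cost being discussed; only `String.valid?/1` answers the encoding question.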
+1 for the utf-8 mode. On Fri, Jan 17, 2014 at 4:45 PM, José Valim [email protected]:
To be clear, I don't think postgres will ever send bad data for a string field if marked as utf-8, but it is very important for us to validate it before sending to the database to ensure we won't insert bad data (and see the dreaded |
Today it is fine to do this in ecto because ecto is doing the serialization. But even when we start doing prepared statements (postgrex doing the serialization) I would prefer to do this check in ecto because postgrex should work regardless of encoding and doesn't really do any other data validations. I would prefer if this was a check in ecto's validator like all other data type checks. @josevalim makes a good point that we should not use |
@ericmj I think the validation should be per adapter during serialization, exactly because databases can come up with different rules. My suggestion to do it in postgrex is because everyone using postgrex can run into those issues. Obviously we can't validate all encodings there, but utf-8 can be easily done if we have such a configuration.
@josevalim Doing the validation in postgrex doesn't gain us much. We will just error before sending the query instead of getting an error response from the server. OTOH if we do the check during ecto's validation we can raise a proper The reason why we get the weird "insufficient data left in message" error is because Ecto is generating an erroneous query expression, which I think it never should, that's a serious bug. When we are doing prepared statements we won't have the issue of erroneous query, but we may still give postgrex an invalid string. When we send an invalid string the postgres server will respond with "invalid byte sequence for encoding 'UTF8'" which I think is an error message just as good as any that postgrex can respond with. For now I think we should use a variant of |
Yeah, to be clear, my goal for doing it in the adapter (it could be even in Ecto's part of the adapter) is just to avoid going through the whole thing twice. Unless the adapter sends the string as is, then we should do it in Ecto. Otherwise, if the adapter needs to encode the string into something else, it is probably better done at the adapter.
The only thing ecto does with the string is escape with So for now we can do the check while escaping the string. I don't know about the best solution for performance though. We can either rewrite the escaper to do the escaping manually while validating the string, hence going through it only once. OTOH Wdyt? |
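The "escape and validate in one pass" idea could look roughly like the sketch below. `EscapeSketch` is a hypothetical module, not Ecto's escaper, and doubling single quotes is only an assumed escaping rule chosen for illustration.

```elixir
defmodule EscapeSketch do
  # Hypothetical helper: walks the binary exactly once.
  # Valid code points are copied (single quotes are doubled, an assumed rule);
  # the first byte that does not start a valid UTF-8 code point aborts the walk.
  def escape(binary) when is_binary(binary), do: do_escape(binary, [])

  # A single quote is emitted twice.
  defp do_escape(<<?', rest::binary>>, acc), do: do_escape(rest, [acc, "''"])

  # Any other valid UTF-8 code point is copied unchanged.
  defp do_escape(<<char::utf8, rest::binary>>, acc),
    do: do_escape(rest, [acc, <<char::utf8>>])

  # End of input: flatten the accumulated iodata.
  defp do_escape(<<>>, acc), do: {:ok, IO.iodata_to_binary(acc)}

  # Anything left over is not valid UTF-8.
  defp do_escape(_invalid, _acc), do: {:error, :invalid_utf8}
end

IO.inspect(EscapeSketch.escape("O'Reilly"))
IO.inspect(EscapeSketch.escape(<<74, 111, 115, 233>>))
```

The trade-off raised in the thread still applies: a hand-written walk like this visits the string once, but it may well be slower than a built-in replace followed by a separate validity check.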
When we have prepared statements, we will simply do a :binary.replace/4 as well? It seems it is better to wait for prepared statements then and do nothing for now.
@josevalim When we have prepared statements we won't do anything with the string.
Yeah, but you said the adapter will raise and I agree it is an exception as good as any.
I'm closing this because it will be fixed when we do prepared statements.
blast from the past! We had a crash on hex.pm on serialising non-UTF8 binary into an embed. Here's the full stack trace:
(erlef/rebar3_hex#84 (comment))
It comes down to this:
iex> Ecto.Type.cast(:string, <<74, 111, 115, 233, 32, 86, 97, 108, 105, 109>>)
{:ok, <<74, 111, 115, 233, 32, 86, 97, 108, 105, 109>>}
but:
iex> String.valid?(<<74, 111, 115, 233, 32, 86, 97, 108, 105, 109>>)
false
We could solve it ourselves in the app by doing a pass on params, or using a custom type. Perhaps it's still worth revising as part of Ecto though? Maybe if not as a check on |
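A "pass on params" done in the application might be sketched as follows. `ParamsSketch` and its function name are hypothetical, and the sketch assumes params are plain maps, lists and scalars (no structs).

```elixir
defmodule ParamsSketch do
  # Hypothetical helper: returns the top-level keys whose values contain
  # at least one binary that is not valid UTF-8.
  def invalid_utf8_keys(params) when is_map(params) do
    for {key, value} <- params, invalid?(value), do: key
  end

  defp invalid?(value) when is_binary(value), do: not String.valid?(value)
  defp invalid?(value) when is_map(value), do: Enum.any?(value, fn {_k, v} -> invalid?(v) end)
  defp invalid?(value) when is_list(value), do: Enum.any?(value, &invalid?/1)
  defp invalid?(_other), do: false
end

params = %{
  "name" => <<74, 111, 115, 233, 32, 86, 97, 108, 105, 109>>,
  "email" => "jose@example.com"
}

IO.inspect(ParamsSketch.invalid_utf8_keys(params))
```

Running such a check before casting lets the application turn the bad input into an ordinary validation error instead of crashing at serialization time.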
I still think this should not be done in Ecto because of the reasons above. Both Jason and Plug will guarantee no bad UTF-8 will enter the system. In other words, if bad UTF-8 is reaching Ecto, it is likely already too late. Is this coming from Hex's custom parsers? If so, I would guarantee proper encoding when parsing.
@josevalim it comes from metadata inside tarball. Agreed, we should ensure it when extracting the tarball 👍 |
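One way to ensure encoding at the extraction boundary is sketched below. It is an assumption, not what hex.pm actually does: the helper name is hypothetical, and it assumes any non-UTF-8 metadata value is Latin-1, which happens to fit the bytes shown earlier in the thread. Rejecting the value outright would be an equally reasonable policy.

```elixir
defmodule MetadataSketch do
  # Hypothetical helper: valid UTF-8 passes through untouched; anything else
  # is reinterpreted as Latin-1 (an assumption) and converted to UTF-8.
  def to_utf8(value) when is_binary(value) do
    if String.valid?(value) do
      value
    else
      :unicode.characters_to_binary(value, :latin1, :utf8)
    end
  end
end

IO.inspect(MetadataSketch.to_utf8(<<74, 111, 115, 233, 32, 86, 97, 108, 105, 109>>))
```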
https:/elixir-lang/ecto/blob/master/lib/ecto/query/util.ex#L61