Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Verify Binary Strings are Printable #161

Closed
jeregrine opened this issue Jan 17, 2014 · 18 comments
Closed

Verify Binary Strings are Printable #161

jeregrine opened this issue Jan 17, 2014 · 18 comments

Comments

@jeregrine
Copy link
Contributor

https:/elixir-lang/ecto/blob/master/lib/ecto/query/util.ex#L61

@jeregrine
Copy link
Contributor Author

Assign this one to me, I'll handle it.

@josevalim
Copy link
Member

That would be a very expensive check to add. I would just fail later when serializing the data. What do you think?

@jeregrine
Copy link
Contributor Author

String.printable? is expensive? I was getting a strange error when I was,
accidently, trying to insert a binary into a text blob.

https://gist.github.com/jeregrine/b66c078133d90de7dfc5

On Fri, Jan 17, 2014 at 4:39 PM, José Valim [email protected]:

That would be a very expensive check to add. I would just fail later when
serializing the data. What do you think?


Reply to this email directly or view it on GitHubhttps://issues/161#issuecomment-32658589
.

@josevalim
Copy link
Member

It is expensive because you need to go over the while binary, so it is expensive for an operation that is supposed to only do type checking. Also, a string does not need to be necessarily printable, it can be valid but not printable. There is a chance your problem is more encoding related than printable related. Do you have the binary that caused the issues?

@ericmj Can we improve postgrex to force utf-8 mode (maybe via an option) and do validation at the boundaries? Or should that be done by Ecto?

@jeregrine
Copy link
Contributor Author

+1 for the utf-8 mode.

On Fri, Jan 17, 2014 at 4:45 PM, José Valim [email protected]:

It is expensive because you need to go over the while binary, so it is
expensive for an operation that is supposed to only do type checking. Also,
a string does not need to be necessarily printable, it can be valid but not
printable. There is a chance your problem is more encoding related than
printable related. Do you have the binary that caused the issues?

@ericmj https:/ericmj Can we improve postgrex to force
utf-8 mode (maybe via an option) and do validation at the boundaries? Or
should that be done by Ecto?


Reply to this email directly or view it on GitHubhttps://issues/161#issuecomment-32659138
.

@josevalim
Copy link
Member

To be clear, I don't think postgres will ever send bad data for a string field if marked as utf-8, but it is very important for us to validate it before sending to the database to ensure we won't insert bad data (and see the dreaded <?>).

@ericmj
Copy link
Member

ericmj commented Jan 17, 2014

Today it is fine to do this in ecto because ecto is doing the serialization. But even when we start doing prepared statements (postgrex doing the serialization) I would prefer to do this check in ecto because postgrex should work regardless of encoding and doesn't really do any other data validations. I would prefer if this was a check in ecto's validator like all other data type checks.

@josevalim makes a good point that we should not use String.printable? because not all valid utf-8 strings are printable. Here is postgres utf8 check: http://doxygen.postgresql.org/wchar_8c.html#a0f216452f58dc8433f423ae7f1c476b0. We cannot use String.valid? either because postgres doesn't accept NULL bytes. Maybe we can use String.valid? with some extra checks. I need to look into this closer.

@josevalim
Copy link
Member

@ericmj I think the validation should be per adapter during serialization. Exactly because databases can come up with different rules. My suggestion for doing in postgrex is because everyone using postgrex can run into those issues. Obviously we can't validate all encodings there, but utf-8 can be easily done if we have such configuration.

@ericmj
Copy link
Member

ericmj commented Jan 21, 2014

@josevalim Doing the validation in postgrex doesn't gain us much. We will just error before sending the query instead of getting an error response from the server. OTOH if we do the check during ecto's validation we can raise a proper Ecto.QueryError that points exactly to the faulty expression.

The reason why we get the weird "insufficient data left in message" error is because Ecto is generating an erroneous query expression, which I think it never should, that's a serious bug. When we are doing prepared statements we won't have the issue of erroneous query, but we may still give postgrex an invalid string. When we send an invalid string the postgres server will respond with "invalid byte sequence for encoding 'UTF8'" which I think is an error message just as good as any that postgrex can respond with.

For now I think we should use a variant of String.valid? to validate the string in ecto. We can revisit this when we start using prepared statements.

@josevalim
Copy link
Member

Yeah, to be clear, my goal for doing it in the adapter (it could be even in Ecto's part of the adapter) is just to avoid going through the whole thing twice. Unless the adapter sends the string as is, then we should do it in Ecto. Otherwise, if the adapter needs to encode the string into something else, it is probably better done at the adapter.

@ericmj
Copy link
Member

ericmj commented Jan 21, 2014

The only things ecto does with the string is escape with :binary.replace/4 and then concat it into the query. When we do prepared statements neither ecto or postgrex will do anything with the string, more than passing it along.

So for now we can do the check while escaping the string. I don't know about the best solution for performance though. We can either rewrite the escaper to do the escaping manually while checking for string validating, hence going through it once. OTOH :binary.replace/4 is most likely optimized so it may be worth to go through the string twice while still using :binary.replace/4.

Wdyt?

@josevalim
Copy link
Member

When we have prepared statements, we will simply do a :binary.replace/4 as well? It seems it is better to wait for prepared statements then and do nothing for now.

@ericmj
Copy link
Member

ericmj commented Jan 21, 2014

@josevalim When we have prepared statements we wont do anything with the string.

@josevalim
Copy link
Member

Yeah, but you said the adapter will raise and I agree it is an exception as good as any.

@ericmj
Copy link
Member

ericmj commented Mar 10, 2014

I'm closing this because it will be fixed when we do prepared statements.

@ericmj ericmj closed this as completed Mar 10, 2014
@wojtekmach
Copy link
Member

wojtekmach commented Dec 31, 2018

blast from the past!

We had a crash on hex.pm on serialising non-UTF8 binary into an embed. Here's the full stack trace:

Traceback (most recent call last):
  File "lib/jason.ex", line 199, in Jason.encode_to_iodata!/2 (jason)
  File "lib/postgrex/type_module.ex", line 713, in Postgrex.DefaultTypes.encode_params/3 (postgrex)
  File "lib/postgrex/query.ex", line 62, in DBConnection.Query.Postgrex.Query.encode/3 (postgrex)
  File "lib/db_connection.ex", line 1074, in DBConnection.encode/5 (db_connection)
  File "lib/db_connection.ex", line 1172, in DBConnection.run_prepare_execute/5 (db_connection)
  File "lib/db_connection.ex", line 480, in DBConnection.parsed_prepare_execute/5 (db_connection)
  File "lib/db_connection.ex", line 473, in DBConnection.prepare_execute/4 (db_connection)
  File "lib/postgrex.ex", line 167, in Postgrex.query/4 (postgrex)
  File "lib/ecto/adapters/sql.ex", line 627, in Ecto.Adapters.SQL.struct/10 (ecto_sql)
  File "lib/ecto/repo/schema.ex", line 603, in Ecto.Repo.Schema.apply/4 (ecto)
  File "lib/ecto/repo/schema.ex", line 311, in anonymous fn/15 in Ecto.Repo.Schema.do_update/3 (ecto)
  File "lib/ecto/repo/schema.ex", line 869, in anonymous fn/3 in Ecto.Repo.Schema.wrap_in_transaction/6 (ecto)
  File "lib/ecto/adapters/sql.ex", line 784, in anonymous fn/3 in Ecto.Adapters.SQL.checkout_or_transaction/4 (ecto_sql)
  File "lib/db_connection.ex", line 734, in DBConnection.transaction/3 (db_connection)
  File "lib/ecto/multi.ex", line 580, in Ecto.Multi.apply_operation/5 (ecto)
  File "lib/enum.ex", line 1899, in Enum."-reduce/3-lists^foldl/2-0-"/3 (elixir)
  File "lib/ecto/multi.ex", line 564, in anonymous fn/5 in Ecto.Multi.apply_operations/5 (ecto)
  File "lib/ecto/adapters/sql.ex", line 784, in anonymous fn/3 in Ecto.Adapters.SQL.checkout_or_transaction/4 (ecto_sql)
  File "lib/db_connection.ex", line 1341, in DBConnection.run_transaction/4 (db_connection)
  File "lib/ecto/repo/transaction.ex", line 16, in Ecto.Repo.Transaction.transaction/3 (ecto)
Jason.EncodeError: invalid byte 0xE9 in <<74, 111, 115, 233, 32, 86, 97, 108, 105, 109>>

(erlef/rebar3_hex#84 (comment))

It comes down to this:

iex> Ecto.Type.cast(:string, <<74, 111, 115, 233, 32, 86, 97, 108, 105, 109>>)
{:ok, <<74, 111, 115, 233, 32, 86, 97, 108, 105, 109>>}

but:

iex> String.valid?(<<74, 111, 115, 233, 32, 86, 97, 108, 105, 109>>)
false

We could solve it ourselves in the app by doing a pass on params, or using a custom type. Perhaps it's still worth revising as part of Ecto though? Maybe if not as a check on :string type, on a new :utf8_string type then?

@josevalim
Copy link
Member

I still think this should not be done in Ecto because of the reasons above. Both Jason and Plug will guarantee no bad UTF-8 will enter the system. In other words, if bad UTF-8 is reaching Ecto, it is likely already too late. Is this coming from Hex' custom parsers? If so, i would guarantee proper encoding when parsing.

@wojtekmach
Copy link
Member

@josevalim it comes from metadata inside tarball. Agreed, we should ensure it when extracting the tarball 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants