Verify Binary Strings are Printable #161

jeregrine · 2014-01-17T22:34:23Z

https:/elixir-lang/ecto/blob/master/lib/ecto/query/util.ex#L61

jeregrine · 2014-01-17T22:34:33Z

Assign this one to me, I'll handle it.

josevalim · 2014-01-17T22:39:22Z

That would be a very expensive check to add. I would just fail later when serializing the data. What do you think?

jeregrine · 2014-01-17T22:41:41Z

String.printable? is expensive? I was getting a strange error when I was,
accidently, trying to insert a binary into a text blob.

https://gist.github.com/jeregrine/b66c078133d90de7dfc5

On Fri, Jan 17, 2014 at 4:39 PM, José Valim [email protected]:

That would be a very expensive check to add. I would just fail later when
serializing the data. What do you think?

—
Reply to this email directly or view it on GitHubhttps://issues/161#issuecomment-32658589
.

josevalim · 2014-01-17T22:45:20Z

It is expensive because you need to go over the while binary, so it is expensive for an operation that is supposed to only do type checking. Also, a string does not need to be necessarily printable, it can be valid but not printable. There is a chance your problem is more encoding related than printable related. Do you have the binary that caused the issues?

@ericmj Can we improve postgrex to force utf-8 mode (maybe via an option) and do validation at the boundaries? Or should that be done by Ecto?

jeregrine · 2014-01-17T22:46:07Z

+1 for the utf-8 mode.

On Fri, Jan 17, 2014 at 4:45 PM, José Valim [email protected]:

It is expensive because you need to go over the while binary, so it is
expensive for an operation that is supposed to only do type checking. Also,
a string does not need to be necessarily printable, it can be valid but not
printable. There is a chance your problem is more encoding related than
printable related. Do you have the binary that caused the issues?

@ericmj https:/ericmj Can we improve postgrex to force
utf-8 mode (maybe via an option) and do validation at the boundaries? Or
should that be done by Ecto?

—
Reply to this email directly or view it on GitHubhttps://issues/161#issuecomment-32659138
.

josevalim · 2014-01-17T22:52:06Z

To be clear, I don't think postgres will ever send bad data for a string field if marked as utf-8, but it is very important for us to validate it before sending to the database to ensure we won't insert bad data (and see the dreaded <?>).

ericmj · 2014-01-17T22:55:48Z

Today it is fine to do this in ecto because ecto is doing the serialization. But even when we start doing prepared statements (postgrex doing the serialization) I would prefer to do this check in ecto because postgrex should work regardless of encoding and doesn't really do any other data validations. I would prefer if this was a check in ecto's validator like all other data type checks.

@josevalim makes a good point that we should not use String.printable? because not all valid utf-8 strings are printable. Here is postgres utf8 check: http://doxygen.postgresql.org/wchar_8c.html#a0f216452f58dc8433f423ae7f1c476b0. We cannot use String.valid? either because postgres doesn't accept NULL bytes. Maybe we can use String.valid? with some extra checks. I need to look into this closer.

josevalim · 2014-01-18T09:00:59Z

@ericmj I think the validation should be per adapter during serialization. Exactly because databases can come up with different rules. My suggestion for doing in postgrex is because everyone using postgrex can run into those issues. Obviously we can't validate all encodings there, but utf-8 can be easily done if we have such configuration.

ericmj · 2014-01-21T01:35:10Z

@josevalim Doing the validation in postgrex doesn't gain us much. We will just error before sending the query instead of getting an error response from the server. OTOH if we do the check during ecto's validation we can raise a proper Ecto.QueryError that points exactly to the faulty expression.

The reason why we get the weird "insufficient data left in message" error is because Ecto is generating an erroneous query expression, which I think it never should, that's a serious bug. When we are doing prepared statements we won't have the issue of erroneous query, but we may still give postgrex an invalid string. When we send an invalid string the postgres server will respond with "invalid byte sequence for encoding 'UTF8'" which I think is an error message just as good as any that postgrex can respond with.

For now I think we should use a variant of String.valid? to validate the string in ecto. We can revisit this when we start using prepared statements.

josevalim · 2014-01-21T11:24:40Z

Yeah, to be clear, my goal for doing it in the adapter (it could be even in Ecto's part of the adapter) is just to avoid going through the whole thing twice. Unless the adapter sends the string as is, then we should do it in Ecto. Otherwise, if the adapter needs to encode the string into something else, it is probably better done at the adapter.

ericmj · 2014-01-21T12:44:15Z

The only things ecto does with the string is escape with :binary.replace/4 and then concat it into the query. When we do prepared statements neither ecto or postgrex will do anything with the string, more than passing it along.

So for now we can do the check while escaping the string. I don't know about the best solution for performance though. We can either rewrite the escaper to do the escaping manually while checking for string validating, hence going through it once. OTOH :binary.replace/4 is most likely optimized so it may be worth to go through the string twice while still using :binary.replace/4.

Wdyt?

josevalim · 2014-01-21T13:04:03Z

When we have prepared statements, we will simply do a :binary.replace/4 as well? It seems it is better to wait for prepared statements then and do nothing for now.

ericmj · 2014-01-21T13:08:17Z

@josevalim When we have prepared statements we wont do anything with the string.

josevalim · 2014-01-21T13:08:42Z

Yeah, but you said the adapter will raise and I agree it is an exception as good as any.

ericmj · 2014-03-10T10:33:50Z

I'm closing this because it will be fixed when we do prepared statements.

wojtekmach · 2018-12-31T06:27:47Z

blast from the past!

We had a crash on hex.pm on serialising non-UTF8 binary into an embed. Here's the full stack trace:

Traceback (most recent call last):
  File "lib/jason.ex", line 199, in Jason.encode_to_iodata!/2 (jason)
  File "lib/postgrex/type_module.ex", line 713, in Postgrex.DefaultTypes.encode_params/3 (postgrex)
  File "lib/postgrex/query.ex", line 62, in DBConnection.Query.Postgrex.Query.encode/3 (postgrex)
  File "lib/db_connection.ex", line 1074, in DBConnection.encode/5 (db_connection)
  File "lib/db_connection.ex", line 1172, in DBConnection.run_prepare_execute/5 (db_connection)
  File "lib/db_connection.ex", line 480, in DBConnection.parsed_prepare_execute/5 (db_connection)
  File "lib/db_connection.ex", line 473, in DBConnection.prepare_execute/4 (db_connection)
  File "lib/postgrex.ex", line 167, in Postgrex.query/4 (postgrex)
  File "lib/ecto/adapters/sql.ex", line 627, in Ecto.Adapters.SQL.struct/10 (ecto_sql)
  File "lib/ecto/repo/schema.ex", line 603, in Ecto.Repo.Schema.apply/4 (ecto)
  File "lib/ecto/repo/schema.ex", line 311, in anonymous fn/15 in Ecto.Repo.Schema.do_update/3 (ecto)
  File "lib/ecto/repo/schema.ex", line 869, in anonymous fn/3 in Ecto.Repo.Schema.wrap_in_transaction/6 (ecto)
  File "lib/ecto/adapters/sql.ex", line 784, in anonymous fn/3 in Ecto.Adapters.SQL.checkout_or_transaction/4 (ecto_sql)
  File "lib/db_connection.ex", line 734, in DBConnection.transaction/3 (db_connection)
  File "lib/ecto/multi.ex", line 580, in Ecto.Multi.apply_operation/5 (ecto)
  File "lib/enum.ex", line 1899, in Enum."-reduce/3-lists^foldl/2-0-"/3 (elixir)
  File "lib/ecto/multi.ex", line 564, in anonymous fn/5 in Ecto.Multi.apply_operations/5 (ecto)
  File "lib/ecto/adapters/sql.ex", line 784, in anonymous fn/3 in Ecto.Adapters.SQL.checkout_or_transaction/4 (ecto_sql)
  File "lib/db_connection.ex", line 1341, in DBConnection.run_transaction/4 (db_connection)
  File "lib/ecto/repo/transaction.ex", line 16, in Ecto.Repo.Transaction.transaction/3 (ecto)
Jason.EncodeError: invalid byte 0xE9 in <<74, 111, 115, 233, 32, 86, 97, 108, 105, 109>>

(erlef/rebar3_hex#84 (comment))

It comes down to this:

iex> Ecto.Type.cast(:string, <<74, 111, 115, 233, 32, 86, 97, 108, 105, 109>>)
{:ok, <<74, 111, 115, 233, 32, 86, 97, 108, 105, 109>>}

but:

iex> String.valid?(<<74, 111, 115, 233, 32, 86, 97, 108, 105, 109>>)
false

We could solve it ourselves in the app by doing a pass on params, or using a custom type. Perhaps it's still worth revising as part of Ecto though? Maybe if not as a check on :string type, on a new :utf8_string type then?

josevalim · 2018-12-31T09:33:37Z

I still think this should not be done in Ecto because of the reasons above. Both Jason and Plug will guarantee no bad UTF-8 will enter the system. In other words, if bad UTF-8 is reaching Ecto, it is likely already too late. Is this coming from Hex' custom parsers? If so, i would guarantee proper encoding when parsing.

wojtekmach · 2018-12-31T09:53:59Z

@josevalim it comes from metadata inside tarball. Agreed, we should ensure it when extracting the tarball 👍

ericmj closed this as completed Mar 10, 2014

wojtekmach mentioned this issue Dec 31, 2018

Validate package metadata hexpm/hex_core#22

Open

josevalim pushed a commit that referenced this issue Nov 5, 2019

Document CTE limitation with source fields, closes #161.

28e7d36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Verify Binary Strings are Printable #161

Verify Binary Strings are Printable #161

jeregrine commented Jan 17, 2014

jeregrine commented Jan 17, 2014

josevalim commented Jan 17, 2014

jeregrine commented Jan 17, 2014

josevalim commented Jan 17, 2014

jeregrine commented Jan 17, 2014

josevalim commented Jan 17, 2014

ericmj commented Jan 17, 2014

josevalim commented Jan 18, 2014

ericmj commented Jan 21, 2014

josevalim commented Jan 21, 2014

ericmj commented Jan 21, 2014

josevalim commented Jan 21, 2014

ericmj commented Jan 21, 2014

josevalim commented Jan 21, 2014

ericmj commented Mar 10, 2014

wojtekmach commented Dec 31, 2018 •

edited

Loading

josevalim commented Dec 31, 2018

wojtekmach commented Dec 31, 2018

Verify Binary Strings are Printable #161

Verify Binary Strings are Printable #161

Comments

jeregrine commented Jan 17, 2014

jeregrine commented Jan 17, 2014

josevalim commented Jan 17, 2014

jeregrine commented Jan 17, 2014

josevalim commented Jan 17, 2014

jeregrine commented Jan 17, 2014

josevalim commented Jan 17, 2014

ericmj commented Jan 17, 2014

josevalim commented Jan 18, 2014

ericmj commented Jan 21, 2014

josevalim commented Jan 21, 2014

ericmj commented Jan 21, 2014

josevalim commented Jan 21, 2014

ericmj commented Jan 21, 2014

josevalim commented Jan 21, 2014

ericmj commented Mar 10, 2014

wojtekmach commented Dec 31, 2018 • edited Loading

josevalim commented Dec 31, 2018

wojtekmach commented Dec 31, 2018

wojtekmach commented Dec 31, 2018 •

edited

Loading