Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add rewrite rules for text conversion #213

Merged
merged 3 commits into from
Jul 26, 2019

Conversation

Martoon-00
Copy link
Member

@Martoon-00 Martoon-00 commented May 30, 2019

Added rewrite rule to eliminate toString . toText conversions.

Benchmarks without rewrite rule:

benchmarked toText/toString
time                 1.013 ms   (875.1 μs .. 1.148 ms)
                     0.819 R²   (0.691 R² .. 0.904 R²)
mean                 1.621 ms   (1.268 ms .. 2.619 ms)
std dev              2.134 ms   (331.2 μs .. 4.071 ms)
variance introduced by outliers: 97% (severely inflated)

And with rewrite rule:

benchmarked toText/toString
time                 186.1 μs   (183.3 μs .. 188.4 μs)
                     0.998 R²   (0.996 R² .. 0.999 R²)
mean                 189.4 μs   (187.7 μs .. 191.8 μs)
std dev              7.508 μs   (5.823 μs .. 10.36 μs)
variance introduced by outliers: 20% (moderately inflated)

Description

Related issues(s)

Fixed #212.

✓ Checklist for your Pull Request

Ideally a PR has all of the checkmarks set.

If something in this list is irrelevant to your PR, you should still set this
checkmark indicating that you are sure it is dealt with (be that by irrelevance).

  • I made sure my PR addresses a single concern, or multiple concerns which
    are inextricably linked. Otherwise I should open multiple PR's.
  • If your PR fixes/relates to an open issue then the description should
    reference this issue. See also auto linking on
    github
    .

Related changes (conditional)

  • Tests

    • If I added new functionality, I added tests covering it.
    • If I fixed a bug, I added a regression test to prevent the bug from
      silently reappearing again.
  • Documentation

    I checked whether I should update the docs and did so if necessary:

  • Record your changes

    • I added an entry to the changelog if my changes are visible to the users
      and
    • provided a migration guide for breaking changes if possible

Stylistic guide (mandatory)

  • My commit history is clean (only contains changes relating to my
    issue/pull request and no reverted-my-earlier-commit changes) and commit
    messages start with identifiers of related issues in square brackets.

    Example: [#42] Short commit description

    If necessary both of these can be achieved even after the commits have been
    made/pushed using rebase and squash.

@Martoon-00 Martoon-00 force-pushed the martoon/rewrite-string-conversions branch 2 times, most recently from 01a9d45 to 011c842 Compare May 30, 2019 16:15
@int-index
Copy link
Member

I tried to figure out why text couldn't do it by default.

Your rule seems equivalent to T.unpack (T.pack a) = a. Now, if we look at unpack and pack they are defined as

unpack = S.unstreamList . stream
{-# INLINE [1] unpack #-}

pack = unstream . S.map safe . S.streamList
{-# INLINE [1] pack #-}

After they get inlined, the rule seems to be

(S.unstreamList . stream) ((unstream . S.map safe . S.streamList) a)

If we also inline function composition, we get

S.unstreamList (stream (unstream (S.map safe (S.streamList a))))

stream and unstream surely cancel out via this rule:

{-# RULES "STREAM stream/unstream fusion" forall s. stream (unstream s) = s #-}

So we are left with

S.unstreamList (S.map safe (S.streamList a))

Now, what's this safe function? Turns out it's defined as

safe :: Char -> Char
safe c
    | ord c .&. 0x1ff800 /= 0xd800 = c
    | otherwise                    = '\xfffd'

Aha, so it's mapping some codepoints to '\xfffd'! There's a comment on top of it to explain this:

-- UTF-16 surrogate code points are not included in the set of Unicode
-- scalar values, but are unfortunately admitted as valid 'Char'
-- values by Haskell.  They cannot be represented in a 'Text'.  This
-- function remaps those code points to the Unicode replacement
-- character (U+FFFD, \'�\'), and leaves other code points
-- unchanged.

This logic is lost with your rewrite rule. Not a huge loss, but it does mean that your rewrite rule isn't meaning preserving. I bet you could write this rewrite rule instead:

{-# RULES "pack/unpack" forall s. S.unstreamList (S.map safe (S.streamList a)) = a #-}

@Martoon-00 Martoon-00 force-pushed the martoon/rewrite-string-conversions branch from 011c842 to 345c6d8 Compare May 30, 2019 22:10
@Martoon-00
Copy link
Member Author

Magnificent!

I tried to delve into sources, but stopped at not finding rules for pack/unpack 😒

The nice news is that with your rule I get the same results as with my incorrect approach, at least when applied to string literals. Pushed the fixed version.

@int-index
Copy link
Member

My rule does not preserve semantics either, but it should play nicer with other rewrite rules from text.

@@ -159,6 +162,87 @@ instance ToString T.Text where
instance ToString LT.Text where
toString = LT.unpack

{- [Note toString-toText-rewritting]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hell yeah GHC-style notes.

@Martoon-00 Martoon-00 force-pushed the martoon/rewrite-string-conversions branch from 29e2acd to 919db77 Compare May 30, 2019 22:41
```

This logic is lost with the mentioned rewrite rule.
Not a huge loss, but it does mean that this rewrite rule isn't meaning preserving.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to fully understand consequences of this loss and be more explicit about it. AFAIU, it means that if we have String which contains UTF-16 surrogate code point and apply toString . toText to it, we'll get different results with and without this rule. Since we ignore safe, does it mean that error will be called somewhere?
I propose to:

  1. Add a test where we generate a String with such character and demonstrate that behavior is different. I don't know what exactly that test should do and which behavior is expected, but without such test it's hard to understand in which case and how this rule changes the semantics.
  2. Mention it in Gotchas in README.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. It's not clear how can I illustrate this behavior. As comment to safe function suggests, Text cannot contain surrogate code points, so this safe will not change anything if my String is constructed from Text. Unless my text is constructed incorrectly, the rule does not change the semantics.

  2. Added.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. But what if you apply toString . toText to [c] where ord c .&. 0x1ff800 == 0xd800? AFAIU, with this rule and without this rule the result will be different, am I wrong?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, ya tuplu.

@gromakovsky
Copy link
Member

https://i.imgur.com/D1ZLGiG.png

Btw, Closes #212 should go to Related issues and description should go to Description :)

I had to add stringToText and textToString functions in order to
specify inlining pragma as mentioned

I don't see these functions, did you delete them?

@gromakovsky
Copy link
Member

/home/travis/build/serokell/universum/src/Universum/String/Conversion.hs:6:16:
    unknown flag in  {-# OPTIONS_GHC #-} pragma: -Wno-orphans

Sad story.

@Martoon-00
Copy link
Member Author

I don't see these functions, did you delete them?

Yes, forgot to update commit/issue description.

@Martoon-00
Copy link
Member Author

I'm trying to add a test on a character which is usually replaced by that safe, and apparently, the new rewrite rule doesn't fire there.
It indeed fires in benchmarks, and if I replace the rule with forall s. T.unpack (T.pack s) = s, then it also works in tests.

My guess is that, since unstream (stream s) = s rule is applied at the last phase of the simplifier, probably our rule simply does not fire. I'd like to check for certain with -ddump-simpl/-ddump-rule-rewrites/--dump-inlinings, but their output looks meaningless to me (like, if there were no inlinings or rewrites at all except for inlining toString and toText sometimes).

@int-index do you have any thoughts on this?

@Martoon-00 Martoon-00 force-pushed the martoon/rewrite-string-conversions branch from 3e06688 to 6608357 Compare June 3, 2019 20:25
@int-index
Copy link
Member

int-index commented Jun 3, 2019

It indeed fires in benchmarks, and if I replace the rule with forall s. T.unpack (T.pack s) = s, then it also works in tests.

I think my rule has a chance to fire after T.unpack and T.pack are inlined (and they are marked as {-# INLINE #-}), but the rule above has a chance to fire before they are inlined. Might as well keep both rules, or figure out what's going on with phases.

@Martoon-00
Copy link
Member Author

Kaef, it works.

Although I have a feeling that our tests are pretty unstable. 🙄

@gromakovsky
Copy link
Member

@Martoon-00 btw, did you consider toText . toString? It's probably less useful with our usage patterns, but anyway, doesn't it make sense to add it as well? Or does it work out of box?

README.md Outdated Show resolved Hide resolved
@Martoon-00
Copy link
Member Author

@Martoon-00 btw, did you consider toText . toString? It's probably less useful with our usage patterns, but anyway, doesn't it make sense to add it as well? Or does it work out of box?

Okay, why not add it.

@gromakovsky
Copy link
Member

Ok, cool. Just in case: did you run benchmarks and verify that toText . toString rules actually improves anything?

@gromakovsky gromakovsky added the type:fix Code is modified, but without breaking changes or new definitions. Not necessary a bug fix. label Jul 18, 2019
@gromakovsky
Copy link
Member

I assigned it fix type because I think that probability of breaking something (due to slightly changed semantics) is pretty much negligible.

@Martoon-00
Copy link
Member Author

Martoon-00 commented Jul 18, 2019

Ok, cool. Just in case: did you run benchmarks and verify that toText . toString rules actually improves anything?

Yes, I got something like 0.2 ms with the rule against ~4 ms without it.

@Martoon-00 Martoon-00 force-pushed the martoon/rewrite-string-conversions branch from 1e1a164 to 955a799 Compare July 18, 2019 21:30
@gromakovsky
Copy link
Member

It seems we should do #214 first.

@gromakovsky
Copy link
Member

Probably #216 should let us merge this PR.

@gromakovsky
Copy link
Member

Rebase on master and CI will hopefully pass here.

@Martoon-00 Martoon-00 force-pushed the martoon/rewrite-string-conversions branch from 955a799 to b0e3b26 Compare July 26, 2019 17:33
@gromakovsky
Copy link
Member

The base branch requires all commits to be signed.

:trollface:

Added rewrite rule to eliminate toString . toText conversions.

Closes #212.

Benchmarks without rewrite rule:

```
benchmarked toText/toString
time                 1.013 ms   (875.1 μs .. 1.148 ms)
                     0.819 R²   (0.691 R² .. 0.904 R²)
mean                 1.621 ms   (1.268 ms .. 2.619 ms)
std dev              2.134 ms   (331.2 μs .. 4.071 ms)
variance introduced by outliers: 97% (severely inflated)
```

And with rewrite rule:

```
benchmarked toText/toString
time                 185.6 μs   (182.7 μs .. 188.2 μs)
                     0.998 R²   (0.996 R² .. 0.999 R²)
mean                 187.0 μs   (185.2 μs .. 189.4 μs)
std dev              7.211 μs   (5.623 μs .. 10.07 μs)
variance introduced by outliers: 20% (moderately inflated)
```
Also add just a plain `T.unpack . T.pack = id` rewrite rule.
@Martoon-00 Martoon-00 force-pushed the martoon/rewrite-string-conversions branch from b0e3b26 to 4ac5a16 Compare July 26, 2019 19:59
@gromakovsky gromakovsky merged commit c75f43a into master Jul 26, 2019
@delete-merged-branch delete-merged-branch bot deleted the martoon/rewrite-string-conversions branch July 26, 2019 20:20
@Martoon-00
Copy link
Member Author

🎉

@Martoon-00
Copy link
Member Author

Martoon-00 commented Nov 22, 2022

Something happened that I completely cannot explain at the moment.

@int-index's observation shows that pack = unstream . S.map safe . S.streamList. I checked this statement back then, I couldn't miss checking it as in the such case I likely wouldn't be able to import those map and other methods.

And now when I open text's sources for that 1.2.3.1 version, instead I see pack = unstream . S.streamList . L.map safe. In fact, all versions of text in the version bounds added by this PR (1.0.0.0-1.2.3.1) seem to have pack implemented not how we saw it.

And like, git blame says the pack implementation to be the second one forever.

That all feels really weird.

The only explanation I see is that all sufficiently old text versions on Hackage got new revisions around 2020, and those revisions overwrite the internals. But like, it should be impossible to change a revision when there is no 100% guarantee it will not change the interface? And for such a change of pack that we witness, it would be nearly impossible to prove with the necessary assurance level. So something wrong seems to be going on under the surface 🤔

@int-index
Copy link
Member

@Martoon-00 You seem to be looking at Data.Text.Lazy instead of Data.Text.

@Martoon-00
Copy link
Member Author

Meeh, right 🤦. Thanks for saving me from this madness.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type:fix Code is modified, but without breaking changes or new definitions. Not necessary a bug fix.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rewrite rules for toString/toText
3 participants