Skip to content

Commit

Permalink
Merge pull request #213 from serokell/martoon/rewrite-string-conversions
Browse files Browse the repository at this point in the history
Add rewrite rules for text conversion
  • Loading branch information
gromakovsky authored Jul 26, 2019
2 parents f479cf3 + 4ac5a16 commit c75f43a
Show file tree
Hide file tree
Showing 6 changed files with 185 additions and 11 deletions.
5 changes: 5 additions & 0 deletions CHANGES.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,11 @@ Unreleased
* [#214](https:/serokell/universum/issues/214):
Update supported GHC versions (replace 7.10.3 with 8.6.5).

* [#212](https:/serokell/universum/issues/212)
Added rewrite rule for `toString . toText` case.
This may change semantics in some corner cases
(because `toString . toText` is not strictly the identity function).

1.5.0
=====

Expand Down
1 change: 1 addition & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -160,6 +160,7 @@ Gotchas [↑](#structure-of-this-tutorial)
* As a consequence of previous point, some functions like `traverse_`, `forM_`, `sequenceA_`, etc.
are generalized over `Container` type classes.
* `error` takes `Text`.
* We are exporting a rewrite rule which replaces `toString . toText :: Text -> Text` with `id`. Note that this changes semantics in some corner cases.


Things that you were already using, but now you don't have to import them explicitly [](#structure-of-this-tutorial)
Expand Down
14 changes: 14 additions & 0 deletions benchmark/Main.hs
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,7 @@ module Main where
import Universum hiding (show)

import Data.List (nub, zip5)
import qualified Data.Text as T
import Gauge (Benchmark, bench, bgroup, nf, whnf)
import Gauge.Main (defaultMain)
import Prelude (show)
Expand All @@ -24,6 +25,7 @@ main = defaultMain
, bgroupConcatMap
, bgroupMember
, bgroupFold
, bgroupTextConversion
]

bgroupList :: forall a .
Expand Down Expand Up @@ -171,3 +173,15 @@ bgroupFold = do
bgroup "foldl'" [ bench "flipped" $ nf flipFoldl' testList
, bench "base" $ nf ghcFoldl' testList
]

bgroupTextConversion :: Benchmark
bgroupTextConversion =
bgroup "text conversions"
[ let str = replicate 100000 'a'
countLength x = length (toString x)
in bench "toString . toText" $ whnf (countLength . toText) str

, let txt = T.replicate 100000 (T.singleton 'a')
countLength x = length (toText x)
in bench "toText . toString" $ whnf (countLength . toString) txt
]
110 changes: 110 additions & 0 deletions src/Universum/String/Conversion.hs
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,7 @@
{-# LANGUAGE FlexibleInstances #-}
{-# LANGUAGE MultiParamTypeClasses #-}
{-# LANGUAGE TypeSynonymInstances #-}
{-# OPTIONS_GHC -Wno-orphans #-}

-- | This module implements type class which allow to have conversion to and
-- from 'Text', 'String' and 'ByteString' types (including both strict and lazy
Expand Down Expand Up @@ -30,6 +31,8 @@ import Data.Bifunctor (first)
import Data.Either (Either)
import Data.Function (id, (.))
import Data.String (String)
import qualified Data.Text.Internal as T
import qualified Data.Text.Internal.Fusion.Common as TF

import Universum.Functor ((<$>))
import Universum.String.Reexport (ByteString, IsString, Read, Text, fromString)
Expand Down Expand Up @@ -159,6 +162,113 @@ instance ToString T.Text where
instance ToString LT.Text where
toString = LT.unpack

{-
@toString . toText@ pattern may occur quite often after inlining because
we tend to use 'Text' rather than 'String' in function signatures, but
there are still some libraries which use 'String's and thus make us perform
conversions back and forth.
Note that @toString . toText@ is not strictly equal to identity function, see
explanation in the comment below.
-}

{-# RULES "pack/unpack" [~0]
forall s. T.unpack (T.pack s) = s
#-}

{- [Note toString-toText-rewritting]
We can do even better if take rules defined in 'Data.Text' into account.
Quoting investigation of @int-index:
If we look at @unpack@ and @pack@ they are defined as
@
unpack = S.unstreamList . stream
{-# INLINE [1] unpack #-}
pack = unstream . S.map safe . S.streamList
{-# INLINE [1] pack #-}
@
After they get inlined, the rule seems to be
@
(S.unstreamList . stream) ((unstream . S.map safe . S.streamList) a)
@
If we also inline function composition, we get
@
S.unstreamList (stream (unstream (S.map safe (S.streamList a))))
@
`stream` and `unstream` surely cancel out via this rule:
@
{-# RULES "STREAM stream/unstream fusion" forall s. stream (unstream s) = s #-}
@
So we are left with
@
S.unstreamList (S.map safe (S.streamList a))
@
Now, what's this 'safe' function? Turns out it's defined as
@
safe :: Char -> Char
safe c
| ord c .&. 0x1ff800 /= 0xd800 = c
| otherwise = '\xfffd'
@
Aha, so it's mapping some codepoints to @'\xfffd'@!
There's a comment on top of it to explain this:
```
-- UTF-16 surrogate code points are not included in the set of Unicode
-- scalar values, but are unfortunately admitted as valid 'Char'
-- values by Haskell. They cannot be represented in a 'Text'. This
-- function remaps those code points to the Unicode replacement
-- character (U+FFFD, \'&#xfffd;\'), and leaves other code points
-- unchanged.
```
This logic is lost with the mentioned rewrite rule.
Not a huge loss, but it does mean that this rewrite rule isn't meaning preserving.
We hope that in most cases it's fine.
And if it's not, one can mark his function using either @pack@ or @unpack@
with @NOINLINE@ pragma to prevent the rule from firing.
So, eventually, we add the following rule:
-}
{-# RULES "pack/unpack internal" [1]
forall s. TF.unstreamList (TF.map T.safe (TF.streamList s)) = s
#-}

{- In case if GHC didn't manage to inline and rewrite everything in
the remaining phases (@Data.Text.pack@ is inlined at 1-st phase),
we still have "pack/unpack" rule. Hopefully, one of them will fire.
-}

{- The opposite rule is safe to have because 'T.safe' /is/ the identity
function for strings made up from valid characters, and text is guaranteed
to have only valid ones.
However, for this case there is no @unstream (stream s) = id@ rule,
so we don't delve deep into internals. As long as @stream@ and @unstream@
only perform conversion between text and stream of characters, they should
be safe to collapse.
-}
{-# RULES "unpack/pack" [~0]
forall s. T.pack (T.unpack s) = s
#-}

-- | Polymorhpic version of 'Text.Read.readEither'.
--
-- >>> readEither @Text @Int "123"
Expand Down
61 changes: 51 additions & 10 deletions test/Test/Universum/Property.hs
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
{-# LANGUAGE CPP #-}
{-# LANGUAGE CPP #-}
{-# LANGUAGE TypeFamilies #-}

module Test.Universum.Property
Expand All @@ -8,23 +8,31 @@ module Test.Universum.Property
import Universum

import Data.List (nub)
import Hedgehog (Property, Gen, MonadGen, forAll, property, assert, (===))
import Hedgehog (Gen, MonadGen, Property, assert, forAll, property, (===))
#if MIN_VERSION_hedgehog(1,0,0)
import Hedgehog (GenBase)
#endif
import Test.Tasty (testGroup, TestTree)
import Test.Tasty (TestTree, testGroup)
import Test.Tasty.Hedgehog

import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as LB
import qualified Data.Text as T
import qualified Data.Text.Lazy as LT
import qualified Hedgehog.Gen as Gen
import qualified Hedgehog.Range as Range
import qualified Universum as U
import qualified Hedgehog.Gen as Gen
import qualified Hedgehog.Range as Range
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as LB
import qualified Data.Text as T
import qualified Data.Text.Lazy as LT

hedgehogTestTree :: TestTree
hedgehogTestTree = testGroup "Tests" [utfProps, listProps, boolMProps]
hedgehogTestTree = testGroup "Tests" [stringProps, utfProps, listProps, boolMProps]

stringProps :: TestTree
stringProps = testGroup "String conversions"
[ testProperty "toString . toText = id" prop_StringToTextAndBack
, testProperty "`toString . toText` for UTF-16 surrogate"
prop_StringToTextAndBackSurrogate
, testProperty "toText . toString = id" prop_TextToStringAndBack
]

utfProps :: TestTree
utfProps = testGroup "utf8 conversion property tests"
Expand All @@ -47,9 +55,15 @@ unicode' = do
utf8String :: Gen U.String
utf8String = Gen.string (Range.linear 0 10000) unicode'

unicodeAllString :: Gen U.String
unicodeAllString = Gen.string (Range.linear 0 10000) Gen.unicodeAll

utf8Text :: Gen T.Text
utf8Text = Gen.text (Range.linear 0 10000) unicode'

unicodeAllText :: Gen T.Text
unicodeAllText = Gen.text (Range.linear 0 10000) Gen.unicodeAll

utf8Bytes :: Gen B.ByteString
utf8Bytes = Gen.utf8 (Range.linear 0 10000) unicode'

Expand All @@ -58,6 +72,33 @@ utf8Bytes = Gen.utf8 (Range.linear 0 10000) unicode'
-- > BU.toString (BU.fromString "\65534") == "\65533"
-- > True

prop_StringToTextAndBack :: Property
prop_StringToTextAndBack = property $ do
str <- forAll unicodeAllString
toString (toText str) === str

-- | See comment to this function:
-- <http://hackage.haskell.org/package/text-1.2.3.1/docs/src/Data.Text.Internal.html#safe>
--
-- While 'String' may contain surrogate UTF-16 code points, actually UTF-8
-- doesn't allow them, as well as 'Text'. 'Data.Text.pack' replaces invalid
-- characters with unicode replacement character, so by default
-- @toString . toText@ is not identity.
--
-- However, we have a rewrite rule by which we /replace/ @toString . toText@
-- occurrences with the identity function.
prop_StringToTextAndBackSurrogate :: Property
prop_StringToTextAndBackSurrogate = property $ do
-- Surrogate character like this one should remain intact
-- Without rewrite rule this string would be transformed to "\9435"
let str = "\xD800"
toString (toText str) === str

prop_TextToStringAndBack :: Property
prop_TextToStringAndBack = property $ do
txt <- forAll unicodeAllText
toText (toString txt) === txt

prop_StringToBytes :: Property
prop_StringToBytes = property $ do
str <- forAll utf8String
Expand Down
5 changes: 4 additions & 1 deletion universum.cabal
Original file line number Diff line number Diff line change
Expand Up @@ -82,7 +82,9 @@ library
, mtl
, safe-exceptions
, stm
, text
-- Make sure that "toString-toText-rewritting" note
-- is still valid when bumping this constraint.
, text >= 1.0.0.0 && <= 1.2.3.1
, transformers
, unordered-containers
, utf8-string
Expand Down Expand Up @@ -133,6 +135,7 @@ benchmark universum-benchmark
, universum
, containers
, gauge
, text
, unordered-containers

default-extensions: NoImplicitPrelude
Expand Down

0 comments on commit c75f43a

Please sign in to comment.