Remove UCS-2LE encoding/decoding? #9

laurent22 · 2017-06-10T09:21:04Z

I've got the following example:

use DiffMatchPatch\DiffMatchPatch;
$dmp = new DiffMatchPatch();
$diff = $dmp->patch_make('car', 'car 🚘');
var_dump($dmp->patch_toText($diff));
var_dump($dmp->patch_apply($diff, 'car'));

Which results in this exception:

Fatal error: Uncaught iconv(): Detected an illegal character in input string

/mnt/d/Web/www/joplin/vendor/symfony/phpunit-bridge/DeprecationErrorHandler.php:73
/mnt/d/Web/www/joplin/vendor/yetanotherape/diff-match-patch/src/Diff.php:971
/mnt/d/Web/www/joplin/vendor/yetanotherape/diff-match-patch/src/Patch.php:301
/mnt/d/Web/www/joplin/vendor/yetanotherape/diff-match-patch/src/DiffMatchPatch.php:270

So it's throwing an exception here when encoding the string to UCS-2LE with iconv:

$text2 = iconv($prevInternalEncoding, 'UCS-2LE', $text2);

If I comment this line and the line that encodes back to the original encoding:

$change[1] = iconv('UCS-2LE', $prevInternalEncoding, $change[1]);

then it works fine:

string(36) "@@ -1,3 +1,5 @@
 car
+ %F0%9F%9A%98
"
array(2) {
  [0]=>
  string(8) "car 🚘"
  [1]=>
  array(1) {
    [0]=>
    bool(true)
  }
}

So I'm wondering - what is the purpose of this encoding/decoding? Could it be removed, or maybe could it be skipped if the input string is UTF-8 (which I assume is commonly used format)? That would allow the lib to be compatible with emojis and other valid UTF-8 strings.

The text was updated successfully, but these errors were encountered:

yetanotherape · 2017-06-10T09:48:23Z

The purpose of encoding/decoding is speed optimisation. As each UTF-8 character can be for 1 to 4 bytes, all string functions work much slower with UTF-8 rather than UTF-16. Unfortunately, emoji seems to be not included in UCS-2LE.

I'll think how to deal with emoji and not to lose the speed.

laurent22 · 2017-06-10T13:35:24Z

Thanks for the explanation. I guess one option could also be to try the iconv call and, if it fails, to fall back to using the string as is.

yetanotherape · 2018-08-04T17:13:22Z

Finally, I've added support of emoji.

laurent22 mentioned this issue Jun 10, 2017

Don't encode UTF-8 to UCS-2LE #10

Closed

yetanotherape closed this as completed Aug 4, 2018

josephrocca mentioned this issue Aug 18, 2019

Diff breaks unicode characters for emojis google/diff-match-patch#59

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove UCS-2LE encoding/decoding? #9

Remove UCS-2LE encoding/decoding? #9

laurent22 commented Jun 10, 2017

yetanotherape commented Jun 10, 2017

laurent22 commented Jun 10, 2017

yetanotherape commented Aug 4, 2018

Remove UCS-2LE encoding/decoding? #9

Remove UCS-2LE encoding/decoding? #9

Comments

laurent22 commented Jun 10, 2017

yetanotherape commented Jun 10, 2017

laurent22 commented Jun 10, 2017

yetanotherape commented Aug 4, 2018