Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove UCS-2LE encoding/decoding? #9

Closed
laurent22 opened this issue Jun 10, 2017 · 3 comments
Closed

Remove UCS-2LE encoding/decoding? #9

laurent22 opened this issue Jun 10, 2017 · 3 comments

Comments

@laurent22
Copy link
Contributor

I've got the following example:

use DiffMatchPatch\DiffMatchPatch;
$dmp = new DiffMatchPatch();
$diff = $dmp->patch_make('car', 'car 🚘');
var_dump($dmp->patch_toText($diff));
var_dump($dmp->patch_apply($diff, 'car'));

Which results in this exception:

Fatal error: Uncaught iconv(): Detected an illegal character in input string

/mnt/d/Web/www/joplin/vendor/symfony/phpunit-bridge/DeprecationErrorHandler.php:73
/mnt/d/Web/www/joplin/vendor/yetanotherape/diff-match-patch/src/Diff.php:971
/mnt/d/Web/www/joplin/vendor/yetanotherape/diff-match-patch/src/Patch.php:301
/mnt/d/Web/www/joplin/vendor/yetanotherape/diff-match-patch/src/DiffMatchPatch.php:270

So it's throwing an exception here when encoding the string to UCS-2LE with iconv:

$text2 = iconv($prevInternalEncoding, 'UCS-2LE', $text2);

If I comment this line and the line that encodes back to the original encoding:

$change[1] = iconv('UCS-2LE', $prevInternalEncoding, $change[1]);

then it works fine:

string(36) "@@ -1,3 +1,5 @@
 car
+ %F0%9F%9A%98
"
array(2) {
  [0]=>
  string(8) "car 🚘"
  [1]=>
  array(1) {
    [0]=>
    bool(true)
  }
}

So I'm wondering - what is the purpose of this encoding/decoding? Could it be removed, or maybe could it be skipped if the input string is UTF-8 (which I assume is commonly used format)? That would allow the lib to be compatible with emojis and other valid UTF-8 strings.

@yetanotherape
Copy link
Owner

The purpose of encoding/decoding is speed optimisation. As each UTF-8 character can be for 1 to 4 bytes, all string functions work much slower with UTF-8 rather than UTF-16. Unfortunately, emoji seems to be not included in UCS-2LE.

I'll think how to deal with emoji and not to lose the speed.

@laurent22
Copy link
Contributor Author

Thanks for the explanation. I guess one option could also be to try the iconv call and, if it fails, to fall back to using the string as is.

@yetanotherape
Copy link
Owner

Finally, I've added support of emoji.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants