rtrbt/utf8tok

utf8tok

utf8tok - A non-allocating single-file C++17 library to split UTF-8 strings into grapheme clusters. Supports Unicode 11.0.

Unicode defines user-perceived characters as grapheme clusters, often consisting of multiple code points. utf8tok splits UTF-8 encoded strings into grapheme clusters, implementing a part of UAX #29.

Setup

Add utf8tok.h and graphemebreakproperties.inc to your project and include utf8tok.h wherever it is needed. In exactly one source file, also define UTF8TOK_IMPLEMENTATION; the implementation of all utf8tok functions is generated there.

Usage

utf8tok supplies several functions, but typically only std::optional<utf8tok::grapheme_cluster_view> utf8tok::next_grapheme_cluster(std::string_view &str_view, uint8_t* scratchBuffer, size_t scratchBufferSize) is needed.

next_grapheme_cluster expects a string_view containing the UTF-8 encoded text to split. On success it returns a grapheme_cluster_view (which is another name for string_view) and removes that cluster from the given string_view, so repeated calls walk through the text. To let you control all allocations, you supply the scratch buffer yourself; its contents do not need to be preserved between calls to next_grapheme_cluster. If the buffer is too small to separate the next grapheme cluster, std::nullopt is returned. A buffer size of 50 bytes is sufficient for most grapheme clusters, but because emoji sequences, for example, can be extended quite a lot, you might need more in extreme cases.
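The calling pattern described above can be sketched as a small loop. This is only an illustration under stated assumptions: collect_clusters and its parameter next are hypothetical names that are not part of utf8tok, and next stands for any callable with the signature quoted above, so the sketch compiles without the header. The loop stops on std::nullopt and also refuses to continue if a call consumes no input, since this README does not say what happens in that case.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical helper (not part of utf8tok): gathers every cluster of `text`.
// `next` is any callable shaped like utf8tok::next_grapheme_cluster:
//   (std::string_view&, std::uint8_t*, std::size_t) -> std::optional<std::string_view>
// Returns std::nullopt if `next` fails or makes no progress.
template <typename NextFn>
std::optional<std::vector<std::string_view>>
collect_clusters(std::string_view text, NextFn next) {
    // Caller-owned scratch space; 50 bytes is the size the README suggests.
    std::array<std::uint8_t, 50> scratch{};
    std::vector<std::string_view> clusters;

    while (!text.empty()) {
        const std::size_t before = text.size();
        std::optional<std::string_view> cluster =
            next(text, scratch.data(), scratch.size());

        // nullopt: per the README, the scratch buffer was too small.
        // No shrinkage: defensive guard against looping forever.
        if (!cluster || text.size() >= before) {
            return std::nullopt;
        }
        assert(cluster->size() <= before);  // a view into the input cannot outgrow it
        clusters.push_back(*cluster);
    }
    return clusters;
}
```

With utf8tok.h included, next could be a lambda that forwards to the library, for example [](std::string_view& s, std::uint8_t* buf, std::size_t n) { return utf8tok::next_grapheme_cluster(s, buf, n); }. If the result is std::nullopt, retrying with a larger scratch buffer is the obvious next step. The returned views point into the original text, so that text must outlive them.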

Generation

The grapheme cluster break property data is stored in graphemebreakproperty.inc, which can be regenerated by compiling and running utf8tok_generator. The program expects paths to the Unicode Consortium's grapheme break property file (found here) and the emoji data (found here).

Tests

To test conformance to UAX #29, the Unicode Consortium has published test cases here. These test case definitions can be converted to doctest test cases using the C# program found in tests/GraphemeTestGenerator.

Tests are run using the doctest library, which is licensed under the MIT license.
