
Semantic text segmentation. For sentence boundary detection, compound splitting and more.


dell-research-harvard/nnsplit

 
 


NNSplit


A tool to split text using a neural network. The main application is sentence boundary detection, but other tasks, e.g. compound splitting for German, are also supported.

Features

  • Robust: Not reliant on proper punctuation, spelling and case. See the metrics.
  • Small: NNSplit uses a byte-level LSTM, so weights are small (< 4MB) and models can be trained for every Unicode-encodable language.
  • Portable: NNSplit is written in Rust with bindings for Rust, Python, and JavaScript (Browser and Node.js). See how to get started in the usage section.
  • Fast: Up to 2x faster than spaCy sentencization. See the benchmark.
  • Multilingual: NNSplit currently has models for 9 different languages (German, English, French, Norwegian, Swedish, Simplified Chinese, Turkish, Russian and Ukrainian). Try them in the demo.
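
To make the "byte-level" point above concrete, here is a minimal sketch of why operating on UTF-8 bytes gives a fixed 256-symbol input vocabulary for any script, and how a split position expressed as a byte offset maps back to text. This is an illustration only, not NNSplit's actual code: `to_byte_ids` and `split_at_byte_offsets` are hypothetical helper names, and the split offset is supplied by hand rather than predicted by a model.

```python
def to_byte_ids(text):
    """Encode text as UTF-8 and return one integer in 0..255 per byte."""
    return list(text.encode("utf-8"))


def split_at_byte_offsets(text, offsets):
    """Split text at the given UTF-8 byte offsets (hypothetical helper).

    Offsets are assumed to fall on character boundaries; otherwise
    decoding a piece raises UnicodeDecodeError.
    """
    data = text.encode("utf-8")
    pieces = []
    start = 0
    for end in sorted(offsets):
        pieces.append(data[start:end].decode("utf-8"))
        start = end
    pieces.append(data[start:].decode("utf-8"))
    return pieces


samples = {
    "English": "Hello world.",
    "German": "Grüße",
    "Russian": "Привет",
    "Chinese": "你好",
}

for language, text in samples.items():
    ids = to_byte_ids(text)
    # Every id fits the same 256-symbol vocabulary, whatever the script.
    assert all(0 <= i <= 255 for i in ids)
    print(f"{language}: {len(text)} characters -> {len(ids)} bytes")

# A boundary chosen by hand for illustration; a real model would predict it.
first = "Grüße dich. "
boundary = len(first.encode("utf-8"))
print(split_at_byte_offsets(first + "Wie geht's", [boundary]))
```

Because the input is always a sequence of byte values, no per-language tokenizer or vocabulary file is needed, which is consistent with the small weight size mentioned above.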

Documentation has moved to the NNSplit website: https://bminixhofer.github.io/nnsplit.

License

NNSplit is licensed under the MIT license.
