Reduce size of language data #4140

Closed
wants to merge 153 commits into from

Commits on Aug 18, 2019

  1. Move Turkish lemmas to a json file

    Rather than a large dict in Python source, the data is now a big json
    file. This includes a method for loading the json file, falling back to
    a compressed file, and an update to MANIFEST.in that excludes json in
    the spacy/lang directory.
    
    This focuses on Turkish specifically because it has the most language
    data in core.
    polm committed Aug 18, 2019
    43eb680
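The loading behaviour described in the first commit (read the JSON file, fall back to a compressed copy) might look roughly like the sketch below. The function name `load_language_data` and the `.json.gz` naming next to the plain file are assumptions for illustration, not the code from this pull request.

```python
import gzip
import json
from pathlib import Path


def load_language_data(path):
    """Load JSON from `path`, falling back to `path + ".gz"` if it is missing.

    Hypothetical helper: the name and the fallback order are assumptions.
    """
    path = Path(path)
    if path.exists():
        with path.open("r", encoding="utf8") as f:
            return json.load(f)
    # Fall back to the gzipped copy that a distribution build would ship.
    gz_path = path.with_name(path.name + ".gz")
    if gz_path.exists():
        with gzip.open(gz_path, "rt", encoding="utf8") as f:
            return json.load(f)
    raise FileNotFoundError("No language data at {} or {}".format(path, gz_path))
```

Checking the uncompressed file first keeps a source checkout working without a build step, while an installed package only needs the `.gz` file.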
  2. Transition all lemmatizer.py files to json

    This covers all lemmatizer.py files of a significant size (>500k or so).
    Small files were left alone.
    
    None of the affected files have logic, so this was pretty
    straightforward.
    
    One unusual thing is that the lemma data for Urdu doesn't seem to be
    used anywhere. That may require further investigation.
    polm committed Aug 18, 2019
    81026e7
  3. Move large lang data to json for fr/nb/nl/sv

    These are the languages that use a lemmatizer directory (rather than a
    single file) and are larger than English.
    
    For most of these languages there were many language data files, in
    which case only the large ones (>500k or so) were converted to json. It
    may or may not be a good idea to migrate the remaining Python files to
    json in the future.
    polm committed Aug 18, 2019
    9a7a0ed
  4. Fix id lemmas.json

    The contents of this file were originally just copied from the Python
    source, but that used single quotes, so it had to be properly converted
    to json first.
    polm committed Aug 18, 2019
    438cbdf
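One way to do the conversion this commit describes, turning a single-quoted Python dict literal into valid JSON, is sketched below. The helper name and the sample entries are hypothetical; the commit does not say which tool was actually used.

```python
import ast
import json


def python_literal_to_json(source):
    """Parse a Python dict literal safely and re-serialize it as JSON."""
    # literal_eval only accepts literals, so no code in the source is executed.
    data = ast.literal_eval(source)
    return json.dumps(data, ensure_ascii=False, indent=2, sort_keys=True)


# Hypothetical sample entries, not taken from the real lemma data.
py_source = "{'berlari': 'lari', 'makanan': 'makan'}"
print(python_literal_to_json(py_source))
```

`ensure_ascii=False` keeps non-ASCII lemmas readable in the output file instead of escaping them.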
  5. Add .json.gz to gitignore

    This covers the json.gz files built as part of distribution.
    polm committed Aug 18, 2019
    6da699c
  6. Add language data gzip to build process

    Currently this gzips the data on every build; it works, but it should
    be changed to gzip only when the source file has been updated.
    polm committed Aug 18, 2019
    969c2c6
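The improvement suggested in this commit message, compressing only when the source has changed, could be done by comparing modification times. This is a minimal sketch under that assumption; `gzip_if_stale` is a hypothetical name and not part of the build script in this pull request.

```python
import gzip
import shutil
from pathlib import Path


def gzip_if_stale(src):
    """Write `src + ".gz"` only if it is missing or older than `src`.

    Returns True when the archive was (re)built, False when it was skipped.
    """
    src = Path(src)
    dest = src.with_name(src.name + ".gz")
    if dest.exists() and dest.stat().st_mtime >= src.stat().st_mtime:
        return False
    with src.open("rb") as f_in, gzip.open(dest, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return True
```

A content hash would be more robust than timestamps (for example after a fresh checkout), at the cost of reading every file on each build.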
  7. Return True from doc.is_... when no ambiguity

    * Make doc.is_sentenced return True if len(doc) < 2.
    
    * Make doc.is_nered return True if len(doc) == 0, for consistency.
    
    Closes explosion#3934
    honnibal authored and polm committed Aug 18, 2019
    52ec915
  8. Rename Binder->DocBox, and improve it.

    honnibal authored and polm committed Aug 18, 2019
    9568dee
  9. Set version to 2.1.5.dev0

    honnibal authored and polm committed Aug 18, 2019
    5552f3a
  10. more friendly textcat errors (explosion#3946)

    * more friendly textcat errors with require_model and require_labels
    
    * update thinc version with recent bugfix
    svlandeg authored and polm committed Aug 18, 2019
    9c199da
  11. Fix _serialize

    honnibal authored and polm committed Aug 18, 2019
    ab8d80e
  12. Reformat

    honnibal authored and polm committed Aug 18, 2019
    53acf1c
  13. Tidy up and auto-format

    ines authored and polm committed Aug 18, 2019
    0fedbfa
  14. Fix test

    ines authored and polm committed Aug 18, 2019
    91d8054
  15. Add warning message re Issue explosion#3853

    honnibal authored and polm committed Aug 18, 2019
    c846c27
  16. 💫 Fix issue explosion#3839: Incorrect entity IDs from Matcher with operators (explosion#3949)
    
    * Add regression test for issue explosion#3541
    
    * Add comment on bugfix
    
    * Remove incorrect test
    
    * Un-xfail test
    honnibal authored and polm committed Aug 18, 2019
    aede1ee
  17. 7477d3f
  18. 9e8ac78
  19. Add default encoding utf-8 for test file

    yash authored and polm committed Aug 18, 2019
    967eda3
  20. failing unit test for issue explosion#3869

    svlandeg authored and polm committed Aug 18, 2019
    537f559
  21. bf16a1b
  22. counter instead of preshcounter

    svlandeg authored and polm committed Aug 18, 2019
    984b62b
  23. cleanup

    svlandeg authored and polm committed Aug 18, 2019
    28a1f00
  24. Add warning for explosion#3853

    honnibal authored and polm committed Aug 18, 2019
    d627127
  25. Fix explosion#3853, and add warning

    honnibal authored and polm committed Aug 18, 2019
    f38f102
  26. Update Thinc version pin

    ines authored and polm committed Aug 18, 2019
    22ef030
  27. fix custom attribute links

    pmbaumgartner authored and polm committed Aug 18, 2019
    309f72f
  28. Update Thinc version pin

    ines authored and polm committed Aug 18, 2019
    fac34fe
  29. Fixing ngram bug (explosion#3953)

    * minimal failing example for Issue explosion#3661
    
    * referenced Issue explosion#3661 instead of Issue explosion#3611
    
    * cleanup
    svlandeg authored and polm committed Aug 18, 2019
    1b71661
  30. Set version to v2.1.5

    honnibal authored and polm committed Aug 18, 2019
    8632299
  31. Update landing with IRL videos [ci skip]

    ines authored and polm committed Aug 18, 2019
    bfa9110
  32. Increment version [ci skip]

    ines authored and polm committed Aug 18, 2019
    912a27b
  33. Fix symbol alignment

    honnibal authored and polm committed Aug 18, 2019
    bd01f90
  34. Fix attrs alignment

    honnibal authored and polm committed Aug 18, 2019
    83b3b96
  35. dbdbfe6
  36. contributor agreement

    pmbaumgartner authored and polm committed Aug 18, 2019
    e815480
  37. Add regression test for explosion#3972

    ines authored and polm committed Aug 18, 2019
    eaec450
  38. Add regression test for explosion#3951

    ines authored and polm committed Aug 18, 2019
    3748c96
  39. Auto-format [ci skip]

    ines authored and polm committed Aug 18, 2019
    54c29eb
  40. Update README.md [ci skip]

    ines authored and polm committed Aug 18, 2019
    c88252f
  41. Add docstring for spacy.gold.align

    honnibal authored and polm committed Aug 18, 2019
    c0e52aa
  42. Rename function arguments

    ines authored and polm committed Aug 18, 2019
    8ab4c8d
  43. Add API documentation

    ines authored and polm committed Aug 18, 2019
    46cca4b
  44. Add usage docs for aligning tokenization

    ines authored and polm committed Aug 18, 2019
    f130243
  45. Adjust example

    Not actually supported in this alignment interpretation
    ines authored and polm committed Aug 18, 2019
    17e97ac
  46. Add "Things to try" prompts

    ines authored and polm committed Aug 18, 2019
    6dcfd32
  47. Improve wording

    ines authored and polm committed Aug 18, 2019
    285ad08
  48. Add infobox

    ines authored and polm committed Aug 18, 2019
    f174d19
  49. Adjust wording [ci skip]

    ines authored and polm committed Aug 18, 2019
    90d565e
  50. Also add infobox to API docs [ci skip]

    ines authored and polm committed Aug 18, 2019
    88aca5a
  51. Bugfix/issue 3968 (explosion#3982)

    * Fix for issue-3968
    
    * Added contributor agreement
    
    * Made suggested changes
    FallakAsad authored and polm committed Aug 18, 2019
    13c226c
  52. Fix --force parameter of CLI package

    BreakBB authored and polm committed Aug 18, 2019
    c023fd5
  53. Add 'Prof.' to English tokenizer_exceptions

    BreakBB authored and polm committed Aug 18, 2019
    1df954e
  54. 9247f8c
  55. Fix typos [ci skip]

    ines authored and polm committed Aug 18, 2019
    4743761
  56. 889c829
  57. Update annotation docs for German

    - minor formatting fixes
    - remove STTS tags not used in Tiger
    - update list of dependency relations to match tiger2dep
    adrianeboyd authored and polm committed Aug 18, 2019
    4876843
  58. Add regression test for explosion#4002

    Test that the PhraseMatcher can match on overwritten NORM attributes.
    ines authored and polm committed Aug 18, 2019
    9d682dc
  59. Fix dependency copy for as_doc (explosion#3969)

    * failing unit test for issue 3962
    
    * attempt to fix Issue explosion#3962
    
    * create artificial unit test example
    
    * using length instead of self.length
    
    * sp
    
    * reformat with black
    
    * find better ancestor within span and use generic 'dep'
    
    * attach to span.root if there is no appropriate ancestor
    
    * comment span text
    
    * clean up ancestor code
    
    * reconstruct dep tree to keep same number of sentences
    svlandeg authored and polm committed Aug 18, 2019
    c1a3be7
  60. Remove old comment (explosion#4012)

    Norwegian used to borrow from French but that doesn't appear to have
    been true for a while now, so the comment that was here is no longer
    relevant.
    polm committed Aug 18, 2019
    8a87e22
  61. tokenizer doc fix

    svlandeg authored and polm committed Aug 18, 2019
    e6daba9
  62. proper error for missing cfg arguments

    svlandeg authored and polm committed Aug 18, 2019
    4a75b5e
  63. set default context width

    svlandeg authored and polm committed Aug 18, 2019
    04695c0
  64. small fix

    svlandeg authored and polm committed Aug 18, 2019
    e22ce52
  65. code cleanup

    svlandeg authored and polm committed Aug 18, 2019
    c7531f9
  66. get vector functionality + unit test

    svlandeg authored and polm committed Aug 18, 2019
    d0b2d45
  67. fixes in kb and gold

    svlandeg authored and polm committed Aug 18, 2019
    2950ee2
  68. 5a95412
  69. 2d9ca8d
  70. 8301291
  71. output tensors as part of predict

    svlandeg authored and polm committed Aug 18, 2019
    35ab66b
  72. formatting

    svlandeg authored and polm committed Aug 18, 2019
    3cf346d
  73. rename entity frequency

    svlandeg authored and polm committed Aug 18, 2019
    875638b
  74. fix for Issue explosion#4000

    svlandeg authored and polm committed Aug 18, 2019
    14bf047
  75. test corner cases

    svlandeg authored and polm committed Aug 18, 2019
    db8054a
  76. Errors.E145 for IO errors when reading KB

    svlandeg authored and polm committed Aug 18, 2019
    8755995
  77. Errors.E146 for IO error when FP is null

    svlandeg authored and polm committed Aug 18, 2019
    bbdc7d7
  78. format

    svlandeg authored and polm committed Aug 18, 2019
    1d2fb40
  79. format and bugfix

    svlandeg authored and polm committed Aug 18, 2019
    e983525
  80. format offsets

    svlandeg authored and polm committed Aug 18, 2019
    513de82
  81. replace assert's with custom error messages

    svlandeg authored and polm committed Aug 18, 2019
    fa6a940
  82. use pathlib instead

    svlandeg authored and polm committed Aug 18, 2019
    347467c
  83. return fix

    svlandeg authored and polm committed Aug 18, 2019
    ea29be6
  84. 💫 Improve error message when model.from_bytes() dies (explosion#4014)

    * Improve error message when model.from_bytes() dies
    
    When Thinc's model.from_bytes() is called with a mismatched model, often
    we get a particularly ungraceful error,
    
    e.g. "AttributeError: FunctionLayer has no attribute G"
    
    This is because we're trying to load the parameters for something like
    a LayerNorm layer, and the model architecture has some other layer there
    instead. This is obviously terrible, especially since the error *type*
    is wrong.
    
    I've changed it to raise a ValueError. The error message is still
    probably a bit terse, but it's hard to be sure exactly what's gone
    wrong.
    
    * Update spacy/pipeline/pipes.pyx
    
    * Update spacy/pipeline/pipes.pyx
    
    * Update spacy/pipeline/pipes.pyx
    
    * Update spacy/syntax/nn_parser.pyx
    
    * Update spacy/syntax/nn_parser.pyx
    
    * Update spacy/pipeline/pipes.pyx
    
    Co-Authored-By: Matthew Honnibal
    
    * Update spacy/pipeline/pipes.pyx
    
    Co-Authored-By: Matthew Honnibal
    
    
    Co-authored-by: Ines Montani
    2 people authored and polm committed Aug 18, 2019
    d44e263
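The pattern this commit describes, catching the misleading `AttributeError` and re-raising it as a `ValueError`, can be illustrated with the toy sketch below. `FunctionLayer`, `load_params` and `from_bytes_checked` here are stand-ins written for this example; they are not Thinc or spaCy APIs, and the error text is invented.

```python
class FunctionLayer:
    """Stand-in for a layer that has no `G` parameter (hypothetical)."""


def load_params(layer, params):
    """Toy loader that fails the way the commit message describes."""
    for name, value in params.items():
        getattr(layer, name)  # raises AttributeError if the layer lacks it
        setattr(layer, name, value)


def from_bytes_checked(layer, params):
    """Wrap the loader so that a mismatch surfaces as a ValueError."""
    try:
        load_params(layer, params)
    except AttributeError as err:
        raise ValueError(
            "Could not load parameters: the saved weights do not match "
            "the model architecture ({})".format(err)
        )
```

Callers that already handle `ValueError` for bad input then see an architecture mismatch as a data problem rather than a programming error.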
  85. Update GoldParse attributes in API docs (explosion#4023)

    * add `words`
    * update name of entity list to `ner`
    
    I think it might be a bit more consistent to have `ner` named `entities`
    or `ents` (and `ents` is actually set somewhere to `None`, which is a
    bit confusing), but it looks like renaming it would be a non-trivial
    decision.
    adrianeboyd authored and polm committed Aug 18, 2019
    67ee9d5
  86. Improve consistency of docs examples [ci skip]

    ines authored and polm committed Aug 18, 2019
    9bc0ec1
  87. 8bb0eab
  88. Improve section on disabling pipes [ci skip]

    ines authored and polm committed Aug 18, 2019
    15b0b35
  89. Add "Processing text" section [ci skip]

    ines authored and polm committed Aug 18, 2019
    bbd1dda
  90. dc78392
  91. Tidy up and auto-format [ci skip]

    ines authored and polm committed Aug 18, 2019
    53feaa3
  92. 433117b
  93. Also support "requirements" in model.json

    ines authored and polm committed Aug 18, 2019
    02bd3ca
  94. 8010c0b
  95. Fix bug in Span.similarity when called via hook

    ines authored and polm committed Aug 18, 2019
    269186c
  96. Fix formatting [ci skip]

    ines authored and polm committed Aug 18, 2019
    577658d
  97. 💫 Support simple training format in nlp.evaluate and add tests (explosion#4033)
    
    * Support simple training format in nlp.evaluate and add tests
    
    * Update docs [ci skip]
    ines authored and polm committed Aug 18, 2019
    889cd8e
  98. Don't raise NotImplemented in Pipe.update

    honnibal authored and polm committed Aug 18, 2019
    17e1750
  99. 7301006
  100. Update version

    honnibal authored and polm committed Aug 18, 2019
    1a90d04
  101. 2851839
  102. Resolve edge case when calling textcat.predict with empty doc (explosion#4035)
    
    * resolve edge case where no doc has tokens when calling textcat.predict
    
    * more explicit value test
    svlandeg authored and polm committed Aug 18, 2019
    85e384e
  103. Correct typo for AllenAI url on homepage (explosion#4050)

    * Typo fix for AllenAI url
    
    Changed incorrect home page url for AllenAI from appenai.org to allenai.org
    
    * Sign contributor agreement
    
    * Change date format
    mdaudali authored and polm committed Aug 18, 2019
    97369a9
  104. Corrected imported function (explosion#4062)

    The example showed an incorrect import
    ejarkm authored and polm committed Aug 18, 2019
    317407b
  105. Add links to tokenizer API docs to refer relevant information. (explosion#4064)
    
    * Add links to tokenizer API docs to refer relevant information.
    
    * Add suggested changes
    
    Co-Authored-By: Ines Montani
    2 people authored and polm committed Aug 18, 2019
    ff7e06f
  106. ensure the lang of vocab and nlp stay consistent (explosion#4057)

    * ensure the language of vocab and nlp stay consistent across serialization
    
    * equality with =
    svlandeg authored and polm committed Aug 18, 2019
    447585c
  107. Improve NER per type scoring (explosion#4052)

    * Improve NER per type scoring
    
    * include all gold labels in per type scoring, not only when recall > 0
    * improve efficiency of per type scoring
    
    * Create Scorer tests, initially with NER tests
    
    * move regression test explosion#3968 (per type NER scoring) to Scorer tests
    
    * add new test for per type NER scoring with imperfect P/R/F and per
    type P/R/F including a case where R == 0.0
    adrianeboyd authored and polm committed Aug 18, 2019
    c9ca6e6
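The per-type scoring described in this commit, where every gold label is reported even when its recall is 0, can be sketched in plain Python as follows. This is an illustration of the idea only: `per_type_prf` and the `(start, end, label)` tuple format are assumptions, not the `Scorer` implementation.

```python
from collections import defaultdict


def per_type_prf(gold_spans, pred_spans):
    """Compute precision/recall/F per label from (start, end, label) tuples."""
    gold, pred = set(gold_spans), set(pred_spans)
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for span in pred:
        counts[span[2]]["tp" if span in gold else "fp"] += 1
    # Gold spans that were never predicted still register their label here,
    # so a label with zero recall shows up in the result.
    for span in gold - pred:
        counts[span[2]]["fn"] += 1
    scores = {}
    for label, c in counts.items():
        p = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
        r = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = {"p": p, "r": r, "f": f}
    return scores
```

With gold spans for `PERSON`, `ORG` and `GPE` and predictions that only get `PERSON` right, the result still contains an entry for `GPE` with recall 0.0 instead of omitting it.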
  108. f2792fd
  109. Fix Pipe base class

    honnibal authored and polm committed Aug 18, 2019
    3000b96
  110. Set version to v2.1.7.dev1

    honnibal authored and polm committed Aug 18, 2019
    abbcf26
  111. Set version to v2.1.7

    honnibal authored and polm committed Aug 18, 2019
    6756c38
  112. Add span.tensor and token.tensor attributes

    honnibal authored and polm committed Aug 18, 2019
    ca4eeff
  113. 6834710
  114. Update .tensor docs [ci skip]

    ines authored and polm committed Aug 18, 2019
    d5ed25c
  115. Update gold corpus code to properly ingest a directory of jsonl… (explosion#4067)
    
    * Update gold corpus code to properly ingest a directory of jsonlines files
    
    In response to: explosion#3975
    
    * Update spacy/gold.pyx
    
    Co-Authored-By: Ines Montani
    2 people authored and polm committed Aug 18, 2019
    9e9a7fc
  116. 8dcb6b4
  117. Fix handling of kwargs in Language.evaluate

    Makes it consistent with other methods
    ines authored and polm committed Aug 18, 2019
    5c32fe0
  118. Fixed syntax error in lang/ko when using python 2 (explosion#4082) (closes explosion#4068)
    
    * fixed syntax error in declaring variables with python 2.7 in spacy/lang/ko/__init__.py
    
    * fixed syntax error in declaring variables with python 2.7 in spacy/lang/ko/__init__.py
    
    * Update __init__.py
    
    * Create veer-bains.md
    
    * Update __init__.py
    
    fixed syntax errors in variable datatype assignment when calling spacy.blank("ko") with python 2.7
    veer-bains authored and polm committed Aug 18, 2019
    32d1dac
  119. 487fe79
  120. Stopwords for Serbian language. (explosion#4078)

    * Serbian stopwords added. (cyrillic alphabet)
    
    * spaCy Contribution agreement included.
    
    * Test initialize updated
    Pavle992 authored and polm committed Aug 18, 2019
    72e9b40
  121. Update universe.json [ci skip]

    ines authored and polm committed Aug 18, 2019
    d155e22
  122. 💫 Sync branches (explosion#4084) [ci skip]

    * Update from master
    
    * Re-added Universe readme (explosion#3688) (closes explosion#3680)
    
    * Fix typo
    
    * Add version tag to `--base-model` argument (closes explosion#3720)
    
    * fixing regex matcher examples (explosion#3708) (explosion#3719)
    
    * Improve Token.prob and Lexeme.prob docs (resolves explosion#3701)
    
    * Fix DependencyParser.predict docs (resolves explosion#3561)
    
    * Update languages.json
    
    
    Co-authored-by: Bram Vanroy
    Co-authored-by: Aaron Kub
    3 people authored and polm committed Aug 18, 2019
    80879c9
  123. e5a25ee
  124. Raise error if annotation dict in simple training style has unexpected keys explosion#4074 (explosion#4079)
    
    * adding enhancement explosion#4074.
    
    * modified behavior to strictly require top level dictionary keys - issue explosion#4074
    
    * pass expected keys to error message and add links as expected top level key
    jenojp authored and polm committed Aug 18, 2019
    276b576
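
The strict top-level key check described in commit 124 can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the function name `check_annotation_keys`, the `EXPECTED_KEYS` tuple, and the error wording are all assumptions, and the commit message itself only names `links` as an expected key.

```python
# Assumed list of allowed top-level keys for the simple training style.
# Only "links" is named in the commit message; the rest are an assumption.
EXPECTED_KEYS = ("words", "tags", "heads", "deps", "entities", "cats", "links")


def check_annotation_keys(annotations, expected=EXPECTED_KEYS):
    """Raise ValueError if the dict has top-level keys outside `expected`."""
    unexpected = sorted(set(annotations) - set(expected))
    if unexpected:
        # Pass the expected keys into the message so the caller can spot a typo.
        raise ValueError(
            "Unexpected annotation keys {}; expected keys are {}".format(
                unexpected, list(expected)
            )
        )


# A well-formed annotation dict passes silently.
check_annotation_keys({"entities": [(0, 5, "ORG")], "links": {}})
```

Failing loudly on an unknown key catches misspellings such as `entitites`, which would otherwise be ignored silently during training.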
  125. Auto-format

    ines authored and polm committed Aug 18, 2019
    16db514
  126. 427d18b
  127. Use consistent casing for entity ruler patterns (see explosion#4063) [ci skip]

    ines authored and polm committed Aug 18, 2019
    d63cc30
  128. Add validate option to EntityRuler (explosion#4089)

    * Add validate option to EntityRuler
    
    * Add validate to EntityRuler, passed to Matcher and PhraseMatcher
    
    * Add validate to usage and API docs
    
    * Update website/docs/usage/rule-based-matching.md
    
    Co-Authored-By: Ines Montani
    
    * Update website/docs/usage/rule-based-matching.md
    
    Co-Authored-By: Ines Montani
    2 people authored and polm committed Aug 18, 2019
    9a7632a
  129. Adjust docs example [ci skip]

    ines authored and polm committed Aug 18, 2019
    b622240
  130. f2f0f56
  131. Update Binder version [ci skip]

    ines authored and polm committed Aug 18, 2019
    d245a21
  132. Add Serbian to languages [ci skip]

    ines authored and polm committed Aug 18, 2019
    f932900
  133. Update README.md [ci skip]

    ines authored and polm committed Aug 18, 2019
    f4a1311
  134. Set version to v2.1.8

    honnibal authored and polm committed Aug 18, 2019
    480b7c5
  135. Update Binder version [ci skip]

    ines authored and polm committed Aug 18, 2019
    b117a1d
  136. Update lemma and vector information after splitting a token (explosion#4097)
    
    * fixing vector and lemma attributes after retokenizer.split
    
    * fixing unit test with mockup tensor
    
    * xp instead of numpy
    svlandeg authored and polm committed Aug 18, 2019
    0f28f62
  137. Add entry for Blackstone in universe.json (explosion#4101)

    * Add entry for Blackstone in universe.json
    
    Add an entry for the Blackstone project. Checked JSON is valid.
    
    * Create ICLRandD.md
    
    * Fix indentation (tabs to spaces)
    
    It looks like during validation, the JSON file automatically changed spaces to tabs. This caused the diff to show *everything* as changed, which is obviously not true. This hopefully fixes that.
    
    * Try to fix formatting for diff
    
    * Fix diff
    
    
    Co-authored-by: Ines Montani
    2 people authored and polm committed Aug 18, 2019
    97c8308
  138. Update universe.json [ci skip]

    ines authored and polm committed Aug 18, 2019
    b95f839
  139. 855544b
  140. update lang/zh (explosion#4103)

    * update lang/zh
    
    * update lang/zh
    XiepengLi authored and polm committed Aug 18, 2019
    97ce4fe
  141. Create wip.yaml [ci skip]

    ines authored and polm committed Aug 18, 2019
    53a304c
  142. Fix file name [ci skip]

    ines authored and polm committed Aug 18, 2019
    138a5c9
  143. Delete wip.yml [ci skip]

    ines authored and polm committed Aug 18, 2019
    8971aa1
  144. CLI scripts for entity linking (wikipedia & generic) (explosion#4091)

    * document token ent_kb_id
    
    * document span kb_id
    
    * update pipeline documentation
    
    * prior and context weights as bools instead
    
    * entitylinker api documentation
    
    * drop for both models
    
    * finish entitylinker documentation
    
    * small fixes
    
    * documentation for KB
    
    * candidate documentation
    
    * links to api pages in code
    
    * small fix
    
    * frequency examples as counts for consistency
    
    * consistent documentation about tensors returned by predict
    
    * add entity linking to usage 101
    
    * add entity linking infobox and KB section to 101
    
    * entity-linking in linguistic features
    
    * small typo corrections
    
    * training example and docs for entity_linker
    
    * predefined nlp and kb
    
    * revert back to similarity encodings for simplicity (for now)
    
    * set prior probabilities to 0 when excluded
    
    * code clean up
    
    * bugfix: deleting kb ID from tokens when entities were removed
    
    * refactor train el example to use either model or vocab
    
    * pretrain_kb example for example kb generation
    
    * add to training docs for KB + EL example scripts
    
    * small fixes
    
    * error numbering
    
    * ensure the language of vocab and nlp stay consistent across serialization
    
    * equality with =
    
    * avoid conflict in errors file
    
    * add error 151
    
    * final adjustments to the train scripts - consistency
    
    * update of goldparse documentation
    
    * small corrections
    
    * push commit
    
    * turn kb_creator into CLI script (wip)
    
    * proper parameters for training entity vectors
    
    * wikidata pipeline split up into two executable scripts
    
    * remove context_width
    
    * move wikidata scripts in bin directory, remove old dummy script
    
    * refine KB script with logs and preprocessing options
    
    * small edits
    
    * small improvements to logging of EL CLI script
    svlandeg authored and polm committed Aug 18, 2019
    ffd89df
  145. 1a74cb0
  146. Correction of default lemmatizer lookup in English (Issue # 4104) (explosion#4110)
    
    * pytest file for issue4104 established
    
    * edited default lookup english lemmatizer for spun; fixes issue 4102
    
    * eliminated parameterization and sorted dictionary dependency in issue 4104 test
    
    * added contributor agreement
    ajrader authored and polm committed Aug 18, 2019
    ece8b77
  147. Remove Danish lemmatizer.py

    Missed this when I added the json.
    polm committed Aug 18, 2019
    59999e1
  148. Update to match latest explosion/srsly#9

    The way gzipped json is loaded/saved in srsly changed a bit.
    polm committed Aug 18, 2019
    2b4227a
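
For readers unfamiliar with gzipped json, the round trip referred to in commit 148 looks roughly like this using only the Python standard library. Note that this deliberately does not use srsly; the helper names `write_gzip_json` and `read_gzip_json` are hypothetical and are not claimed to match the srsly API.

```python
import gzip
import json


def write_gzip_json(path, data):
    """Serialize `data` as json and write it gzip-compressed to `path`."""
    with gzip.open(path, "wt", encoding="utf8") as file_:
        json.dump(data, file_)


def read_gzip_json(path):
    """Read a gzip-compressed json file and return the decoded object."""
    with gzip.open(path, "rt", encoding="utf8") as file_:
        return json.load(file_)
```

Opening the file in text mode (`"wt"` / `"rt"`) lets `json` work with strings directly while `gzip` handles the compression.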
  149. Only compress language data if necessary

    If a .json.gz file exists and is newer than the corresponding json file,
    it's not recompressed.
    polm committed Aug 18, 2019
    00e6420
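
The staleness check described in commit 149 can be sketched as below. This is an illustration under assumptions rather than the code in the commit: the function names are hypothetical, and "newer" is taken to mean a strictly later modification time.

```python
import gzip
import shutil
from pathlib import Path


def needs_compression(json_path):
    """Return True unless a .json.gz exists that is newer than the json file."""
    json_path = Path(json_path)
    gz_path = json_path.with_name(json_path.name + ".gz")
    if not gz_path.exists():
        return True
    # Assumption: "newer" means a strictly later modification time.
    return not gz_path.stat().st_mtime > json_path.stat().st_mtime


def compress_if_needed(json_path):
    """Write <name>.json.gz next to the json file; return True if it was written."""
    json_path = Path(json_path)
    if not needs_compression(json_path):
        return False
    gz_path = json_path.with_name(json_path.name + ".gz")
    with json_path.open("rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return True
```

Comparing modification times keeps repeated builds cheap, at the cost of trusting the filesystem timestamps.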
  150. Move en/el language data to json

    This only affected files >500kb: the noun data for both languages and
    the generic lookup table for English.
    polm committed Aug 18, 2019
    a322fc1
  151. Remove empty files in Norwegian tokenizer

    It's unclear why, but the Norwegian (nb) tokenizer had empty files for
    adj/adv/noun/verb lemmas. This may have been a result of copying the
    structure of the English lemmatizer.
    
    This removes the files but still creates the empty sets in the
    lemmatizer, which may not actually be necessary.
    polm committed Aug 18, 2019
    f5256c2
  152. Remove dubious entries in English lookup.json

    " furthest" and " skilled" - both prefixed with a space - were in the
    English lookup table. That seems obviously wrong so I have removed them.
    polm committed Aug 18, 2019
    ee9609a
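
Entries like the ones removed in commit 152 can be found with a short scan for keys that carry leading or trailing whitespace. The sketch below is hypothetical: `find_padded_keys` is not part of the pull request, and the lemma values in the sample table are placeholders, since the commit message only gives the two keys.

```python
import json


def find_padded_keys(table):
    """Return the sorted keys that start or end with whitespace."""
    return sorted(key for key in table if key != key.strip())


# Sample table; the keys come from the commit message, the values are placeholders.
lookup = json.loads('{" furthest": "far", " skilled": "skilled", "spun": "spin"}')
print(find_padded_keys(lookup))  # [' furthest', ' skilled']
```

Running such a check over each lookup table before packaging would flag this class of entry automatically.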
  153. Fix small issues with en/fr lemmatizers

    The en tokenizer was still including the removed _nouns.py file, so that
    reference is removed.
    
    The fr tokenizer is unusual in that it has a lemmatizer directory with
    both __init__.py and lemmatizer.py. lemmatizer.py had not been converted
    to load the json language data, so that was fixed.
    polm committed Aug 18, 2019
    f7204a9