
A question about document tokenization #13620

Open
HelloWorldLTY opened this issue Sep 7, 2024 · 0 comments
Hi, I found a very interesting result when tokenizing a document. The example code is:

import spacy

nlp = spacy.load("en_core_web_sm")
# doc = nlp("Apple is looking at. startup for $1 billion.")
# for token in doc:
#     print(token.text, token.pos_, token.dep_)
# Example text
text = '''Panel C: Gene Associations in LUAD and NATs

In LUAD tumors, ZNF71 is associated with JUN, SAMHD1, RNASEL, IFNGR1, IKKB, and EIF2A.
In non-cancerous adjacent tissues (NATs), the associated genes are OAS1, MP3K7, and IFNAR2.'''

# Process the text
doc = nlp(text)
out_sen = []
# Iterate over the sentences
for sent in doc.sents:
    if len(sent) != 0:
        print(sent.text)
        out_sen.append(sent)

The resulting out_sen has length 1; the entire text is treated as a single sentence. Is this a bug, or is it the default behavior? Thanks.

The spaCy version is 3.7.6.
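For comparison, here is a minimal sketch (not an official answer) of spaCy's rule-based sentencizer component, which splits on sentence-final punctuation instead of relying on the statistical dependency parser used by en_core_web_sm; the single-line example text below is my own simplification of the original:

```python
import spacy

# Blank English pipeline with only the rule-based sentencizer,
# which marks sentence boundaries at punctuation like "." and "?".
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = ("In LUAD tumors, ZNF71 is associated with JUN and SAMHD1. "
        "In non-cancerous adjacent tissues (NATs), the associated "
        "genes are OAS1, MP3K7, and IFNAR2.")

doc = nlp(text)
sentences = [sent.text for sent in doc.sents]
print(len(sentences))  # 2
```

Whether this matches the behavior you expect for the panel-caption text above depends on how the parser-based segmentation handles the embedded newlines and the heading line, which carry no sentence-final punctuation.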
