
A question about document tokenization #13620

Open
HelloWorldLTY opened this issue Sep 7, 2024 · 0 comments
Hi, I found a very interesting result when tokenizing a document. The example code is:

import spacy

nlp = spacy.load("en_core_web_sm")
# doc = nlp("Apple is looking at. startup for $1 billion.")
# for token in doc:
#     print(token.text, token.pos_, token.dep_)
# Example text
text = '''Panel C: Gene Associations in LUAD and NATs

In LUAD tumors, ZNF71 is associated with JUN, SAMHD1, RNASEL, IFNGR1, IKKB, and EIF2A.
In non-cancerous adjacent tissues (NATs), the associated genes are OAS1, MP3K7, and IFNAR2.'''

# Process the text
doc = nlp(text)
out_sen = []
# Iterate over the sentences
for sent in doc.sents:
    if len(sent) != 0:
        print(sent.text)
        out_sen.append(sent)

The resulting out_sen has length 1; the entire text is treated as a single sentence. Is this a bug, or is it the default behavior? Thanks.

The spaCy version is 3.7.6.
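For comparison, here is a minimal sketch (not an official answer) of spaCy's rule-based sentencizer component, which splits on sentence-final punctuation instead of relying on the statistical dependency parser used by en_core_web_sm; the single-line example text below is my own simplification of the original:

```python
import spacy

# Blank English pipeline with only the rule-based sentencizer,
# which marks sentence boundaries at punctuation like "." and "?".
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = ("In LUAD tumors, ZNF71 is associated with JUN and SAMHD1. "
        "In non-cancerous adjacent tissues (NATs), the associated "
        "genes are OAS1, MP3K7, and IFNAR2.")

doc = nlp(text)
sentences = [sent.text for sent in doc.sents]
print(len(sentences))  # 2
```

Whether this matches the behavior you expect for the panel-caption text above depends on how the parser-based segmentation handles the embedded newlines and the heading line, which carry no sentence-final punctuation.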
