Squad reader #535
- The data sample is a bit too large. Can we use just a few kB of samples?
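One way to get a sample of a few kB is to keep only the first article and a couple of its paragraphs. The sketch below assumes the standard SQuAD JSON layout (`data` -> `paragraphs` -> `qas`); the helper name `shrink_squad` and the in-memory stand-in data are hypothetical and not part of this PR.

```python
import json


def shrink_squad(dataset: dict, max_articles: int = 1, max_paragraphs: int = 2) -> dict:
    """Return a trimmed copy of a SQuAD-style dict (hypothetical helper)."""
    small_articles = []
    for article in dataset["data"][:max_articles]:
        trimmed = dict(article)  # shallow copy so the input dict is left untouched
        trimmed["paragraphs"] = article["paragraphs"][:max_paragraphs]
        small_articles.append(trimmed)
    return {"version": dataset.get("version", ""), "data": small_articles}


# Tiny in-memory stand-in for the real SQuAD file (made-up content).
full = {
    "version": "v2.0",
    "data": [
        {
            "title": f"article-{i}",
            "paragraphs": [
                {"context": f"context {i}-{j}", "qas": []} for j in range(5)
            ],
        }
        for i in range(3)
    ],
}

small = shrink_squad(full)
# Size of the trimmed sample once serialized, to check it stays in the kB range.
size_in_bytes = len(json.dumps(small).encode("utf8"))
```

With a real file, the same function could be applied to the result of `json.load` and the output written back with `json.dump`.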
Let's also add the documentation here:
Codecov Report
@@            Coverage Diff             @@
##           master     #535      +/-   ##
==========================================
+ Coverage   79.60%   79.73%   +0.13%
==========================================
  Files         224      226       +2
  Lines       16006    16119     +113
==========================================
+ Hits        12741    12852     +111
- Misses       3265     3267       +2
The doc build failed because one word is flagged as a misspelling: https:/asyml/forte/pull/535/checks?check_run_id=3884346346#step:6:143 If it is not actually misspelled, consider adding it to https:/asyml/forte/blob/master/docs/spelling_wordlist.txt
ft/onto/base_ontology.py (outdated)
    @dataclass
    class MRCAnswer(Annotation):
I found that we previously used `Phrase` instead of a specific type to represent the answer: https:/asyml/forte-wrappers/blob/main/src/huggingface/fortex/huggingface/question_and_answering_multi.py#L19
Sure, maybe I can delete `MRCAnswer` and just use `Phrase`?
Yeah, since I don't see any additional features introduced by `MRCAnswer`.
forte/datasets/mrc/squad_reader.py (outdated)
    pack.set_text(text)

    Document(pack, 0, context_end)
    passage = Passage(pack, 0, len(pack.text))
Looks like you are using `Document` for the reading part, and `Passage` for the whole data pack (including questions)?
`Passage` is supposed to mean the former (the reading material), so maybe we can use `Document` to cover the whole data pack.
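The span assignment suggested in this comment can be sketched without Forte itself. The `Span` dataclass below is a hypothetical stand-in for an annotation such as `Passage` or `Document`, and the newline-joined layout of context and questions is an assumption about how the pack text is built, not the reader's actual code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Span:
    """Hypothetical stand-in for an annotation: a label plus [begin, end)."""
    label: str
    begin: int
    end: int


def layout_spans(context: str, questions: List[str]) -> Tuple[str, List[Span]]:
    """Join context and questions into one text and compute the spans over it."""
    spans: List[Span] = []
    text = context
    context_end = len(context)
    # Passage covers only the reading material.
    spans.append(Span("Passage", 0, context_end))
    for question in questions:
        begin = len(text) + 1  # +1 skips the newline separator added below
        text = text + "\n" + question
        spans.append(Span("Question", begin, len(text)))
    # Document covers the whole pack text, questions included.
    spans.append(Span("Document", 0, len(text)))
    return text, spans


text, spans = layout_spans("Paris is in France.", ["Where is Paris?"])
```

In the reader, this would correspond to swapping the two offsets in the snippet under review, so that `Passage` ends at `context_end` and `Document` ends at `len(pack.text)`.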
        Returns: QA pairs and the context of a paragraph of a passage in SQuAD dataset.
        """
        with open(file_path, "r", encoding="utf8", errors="ignore") as file:
Are there other fields, like the question id? Those could be useful when doing an evaluation.
Add question ID to MRC question
I couldn't see the id in this PR. Did you push the changes?
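For reference, each entry under `qas` in a SQuAD v2.0 file carries an `id` field next to `question`, `answers`, and `is_impossible`. The following is a minimal sketch of carrying that id through parsing, using only plain dicts; the function name `iter_questions`, the output keys, and the sample data are hypothetical and do not reflect the reader in this PR.

```python
from typing import Dict, Iterator

# Made-up sample in the SQuAD v2.0 layout; real ids are longer opaque strings.
SAMPLE = {
    "version": "v2.0",
    "data": [
        {
            "title": "Paris",
            "paragraphs": [
                {
                    "context": "Paris is in France.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "Where is Paris?",
                            "is_impossible": False,
                            "answers": [{"text": "France", "answer_start": 12}],
                        },
                        {
                            "id": "q2",
                            "question": "Who is the mayor?",
                            "is_impossible": True,
                            "answers": [],
                        },
                    ],
                }
            ],
        }
    ],
}


def iter_questions(dataset: Dict) -> Iterator[Dict]:
    """Yield one record per question, keeping the question id for evaluation."""
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                yield {
                    "qid": qa["id"],
                    "question": qa["question"],
                    "is_impossible": qa.get("is_impossible", False),
                    # Character offsets [begin, end) of each answer in the context.
                    "answers": [
                        (a["answer_start"], a["answer_start"] + len(a["text"]))
                        for a in qa["answers"]
                    ],
                    "context": context,
                }


records = list(iter_questions(SAMPLE))
```

Keeping `qid` on each record makes it possible to match predictions against the gold answers by id at evaluation time.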
This PR adds a new dataset: SQuAD v2.0. Fixes #537
Description of changes
Add a new reader for the SQuAD dataset in datasets/mrc; add corresponding ontologies.
Possible influences of this PR.
Reused the `Passage` ontology in race_qa.
Test Conducted
Test squad reader