Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Squad reader #535

Merged
merged 8 commits into from
Oct 22, 2021
Merged

Squad reader #535

merged 8 commits into from
Oct 22, 2021

Conversation

qinzzz
Copy link
Collaborator

@qinzzz qinzzz commented Sep 29, 2021

This PR adds a new dataset: squadv2.0. Fix #537

Description of changes

Add a new reader for the SQuAD dataset in datasets/mrc; add corresponding ontologies.

Possible influences of this PR.

Resued ontology Passage in race_qa

Test Conducted

Test squad reader

@qinzzz qinzzz closed this Sep 29, 2021
@qinzzz qinzzz reopened this Sep 29, 2021
@hunterhector hunterhector marked this pull request as draft September 29, 2021 18:59
@qinzzz qinzzz changed the title Qinxin dev Squad reader Sep 29, 2021
Copy link
Member

@hunterhector hunterhector left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. The data sample is a bit too large. Can we just a few kb samples?

forte/datasets/mrc/squad_reader.py Outdated Show resolved Hide resolved
forte/datasets/mrc/squad_reader.py Show resolved Hide resolved
forte/datasets/mrc/squad_reader.py Outdated Show resolved Hide resolved
forte/datasets/mrc/squad_reader.py Outdated Show resolved Hide resolved
ft/onto/base_ontology.py Outdated Show resolved Hide resolved
forte/datasets/mrc/squad_reader.py Outdated Show resolved Hide resolved
@qinzzz qinzzz force-pushed the qinxin_dev branch 3 times, most recently from bbdaccb to 4e2d3d8 Compare October 1, 2021 05:48
Copy link
Member

@hunterhector hunterhector left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also add the documentation here:

https:/asyml/forte/blob/master/docs/code/data.rst#packs

forte/datasets/mrc/squad_reader.py Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Oct 13, 2021

Codecov Report

Merging #535 (418aceb) into master (5913ef6) will increase coverage by 0.13%.
The diff coverage is 96.46%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #535      +/-   ##
==========================================
+ Coverage   79.60%   79.73%   +0.13%     
==========================================
  Files         224      226       +2     
  Lines       16006    16119     +113     
==========================================
+ Hits        12741    12852     +111     
- Misses       3265     3267       +2     
Impacted Files Coverage Δ
forte/datasets/mrc/squad_reader.py 96.15% <96.15%> (ø)
tests/forte/datasets/mrc/squad_dataset_test.py 96.22% <96.22%> (ø)
ft/onto/base_ontology.py 95.25% <100.00%> (+0.16%) ⬆️
forte/pipeline.py 93.53% <0.00%> (+0.23%) ⬆️
forte/data/ontology/ontology_code_const.py 100.00% <0.00%> (+1.53%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5913ef6...418aceb. Read the comment docs.

@hunterhector
Copy link
Member

the doc build failed, one word is considered as misspelling: https:/asyml/forte/pull/535/checks?check_run_id=3884346346#step:6:143

If those are not misspelled words, consider adding them to https:/asyml/forte/blob/master/docs/spelling_wordlist.txt



@dataclass
class MRCAnswer(Annotation):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found that we previously use Phrase instead of a specific type to represent the answer: https:/asyml/forte-wrappers/blob/main/src/huggingface/fortex/huggingface/question_and_answering_multi.py#L19

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, maybe I can delete MRCAnswer and just use Phrase?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, since I find there's no additional features introduced by MRCAnswer

pack.set_text(text)

Document(pack, 0, context_end)
passage = Passage(pack, 0, len(pack.text))
Copy link
Member

@hunterhector hunterhector Oct 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you are using Document for the reading part, and Passage for the whole data pack (including questions)?

The meaning of Passage is supposed to be the former (the reading material). And maybe we can use Document to cover the whole datapack


Returns: QA pairs and the context of a paragraph of a passage in SQuAD dataset.
"""
with open(file_path, "r", encoding="utf8", errors="ignore") as file:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are there other things like question id? those could be useful when doing an evaluation.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add question ID to MRC question

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couldn't see the id in this PR, did you push the changes?

@hunterhector hunterhector merged commit a365f76 into asyml:master Oct 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

New reader for Question-Answersing dataset
2 participants