Squad reader #535

qinzzz · 2021-09-29T18:23:24Z

This PR adds a new dataset, SQuAD v2.0. Fixes #537.

Description of changes

Add a new reader for the SQuAD dataset in datasets/mrc; add corresponding ontologies.
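For readers unfamiliar with the data, here is a minimal sketch of walking the public SQuAD v2.0 JSON layout (`data` → `paragraphs` → `qas`). It is not the reader added in this PR: the function name `iter_squad_examples` and the tiny `SAMPLE` record are made up for illustration, and only the standard library is used.

```python
import json

def iter_squad_examples(squad_json: str):
    """Yield one flat dict per question in a SQuAD v2.0 style JSON string."""
    dataset = json.loads(squad_json)
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                yield {
                    "title": article.get("title", ""),
                    "context": context,
                    "id": qa["id"],
                    "question": qa["question"],
                    # v2.0 marks unanswerable questions with this flag.
                    "is_impossible": qa.get("is_impossible", False),
                    # Each answer carries its character offset in the context.
                    "answers": [
                        (a["text"], a["answer_start"]) for a in qa["answers"]
                    ],
                }

# Hypothetical two-question sample, not taken from the real dataset.
CONTEXT = "The capital of France is Paris."
SAMPLE = json.dumps({
    "version": "v2.0",
    "data": [{
        "title": "France",
        "paragraphs": [{
            "context": CONTEXT,
            "qas": [
                {"id": "q1", "question": "What is the capital of France?",
                 "is_impossible": False,
                 "answers": [{"text": "Paris",
                              "answer_start": CONTEXT.index("Paris")}]},
                {"id": "q2", "question": "Who founded Paris?",
                 "is_impossible": True, "answers": []},
            ],
        }],
    }],
})

examples = list(iter_squad_examples(SAMPLE))
for ex in examples:
    print(ex["id"], ex["question"], ex["answers"])
```

A reader built on top of such a loop would turn each paragraph into one data pack and each question/answer into annotations.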

Possible influences of this PR.

Reused the ontology Passage from race_qa.

Test Conducted

Test squad reader

hunterhector

The data sample is a bit too large. Can we include just a few KB of samples?
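One way to shrink a sample file is to keep only the first few articles and paragraphs. The sketch below assumes the standard SQuAD layout; `trim_squad` and its parameters are hypothetical names, not something in this PR.

```python
import json

def trim_squad(dataset: dict, max_articles: int = 1, max_paragraphs: int = 2) -> dict:
    """Return a copy of a SQuAD-style dict with only the leading entries kept."""
    trimmed = {"version": dataset.get("version", ""), "data": []}
    for article in dataset["data"][:max_articles]:
        trimmed["data"].append({
            "title": article.get("title", ""),
            "paragraphs": article["paragraphs"][:max_paragraphs],
        })
    return trimmed

# Synthetic stand-in for a large dataset file: 3 articles with 5 paragraphs each.
full = {
    "version": "v2.0",
    "data": [
        {"title": f"article-{i}",
         "paragraphs": [{"context": f"text {i}-{j}", "qas": []} for j in range(5)]}
        for i in range(3)
    ],
}

small = trim_squad(full, max_articles=1, max_paragraphs=2)
print(len(json.dumps(full)), "->", len(json.dumps(small)), "characters")
```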

forte/datasets/mrc/squad_reader.py

ft/onto/base_ontology.py

forte/datasets/mrc/squad_reader.py

forte/ontology_specs/base_ontology.json

hunterhector

Let's also add the documentation here:

https://github.com/asyml/forte/blob/master/docs/code/data.rst#packs

forte/datasets/mrc/squad_reader.py

codecov · 2021-10-13T15:10:32Z

Codecov Report

Merging #535 (418aceb) into master (5913ef6) will increase coverage by 0.13%.
The diff coverage is 96.46%.

@@            Coverage Diff             @@
##           master     #535      +/-   ##
==========================================
+ Coverage   79.60%   79.73%   +0.13%     
==========================================
  Files         224      226       +2     
  Lines       16006    16119     +113     
==========================================
+ Hits        12741    12852     +111     
- Misses       3265     3267       +2

Impacted Files	Coverage Δ
forte/datasets/mrc/squad_reader.py	`96.15% <96.15%> (ø)`
tests/forte/datasets/mrc/squad_dataset_test.py	`96.22% <96.22%> (ø)`
ft/onto/base_ontology.py	`95.25% <100.00%> (+0.16%)`	⬆️
forte/pipeline.py	`93.53% <0.00%> (+0.23%)`	⬆️
forte/data/ontology/ontology_code_const.py	`100.00% <0.00%> (+1.53%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

hunterhector · 2021-10-13T15:17:49Z

The doc build failed because one word is flagged as a misspelling: https://github.com/asyml/forte/pull/535/checks?check_run_id=3884346346#step:6:143

If the flagged words are not misspelled, consider adding them to https://github.com/asyml/forte/blob/master/docs/spelling_wordlist.txt

hunterhector · 2021-10-13T15:27:18Z

ft/onto/base_ontology.py

+
+
+@dataclass
+class MRCAnswer(Annotation):


I found that we previously used Phrase instead of a specific type to represent the answer: https://github.com/asyml/forte-wrappers/blob/main/src/huggingface/fortex/huggingface/question_and_answering_multi.py#L19

Sure, maybe I can delete MRCAnswer and just use Phrase?

Yeah, since I find that MRCAnswer introduces no additional features.

hunterhector · 2021-10-13T16:54:27Z

forte/datasets/mrc/squad_reader.py

+ pack.set_text(text)
+
+ Document(pack, 0, context_end)
+ passage = Passage(pack, 0, len(pack.text))


Looks like you are using Document for the reading part, and Passage for the whole data pack (including questions)?

The meaning of Passage is supposed to be the former (the reading material). And maybe we can use Document to cover the whole datapack
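The suggested span assignment can be sketched with plain offsets. This assumes the pack text is the context followed by newline-separated questions, which is an assumption about the layout rather than something shown in the diff; `layout_pack_text` is a hypothetical helper and no Forte classes are used.

```python
from typing import List, Tuple

def layout_pack_text(
    context: str, questions: List[str]
) -> Tuple[str, Tuple[int, int], Tuple[int, int], List[Tuple[int, int]]]:
    """Concatenate context and questions; return the text and character spans."""
    text = context
    question_spans = []
    for question in questions:
        begin = len(text) + 1          # skip the newline separator
        text = text + "\n" + question
        question_spans.append((begin, len(text)))
    passage_span = (0, len(context))   # reading material only
    document_span = (0, len(text))     # whole pack, questions included
    return text, passage_span, document_span, question_spans

text, passage_span, document_span, question_spans = layout_pack_text(
    "The capital of France is Paris.",
    ["What is the capital of France?", "Who founded Paris?"],
)
print(passage_span, document_span, question_spans)
```

With these spans, the Passage annotation would cover `passage_span` and the Document annotation would cover `document_span`.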

forte/datasets/mrc/squad_reader.py

hunterhector · 2021-10-13T16:56:30Z

forte/datasets/mrc/squad_reader.py

+
+ Returns: QA pairs and the context of a paragraph of a passage in SQuAD dataset.
+ """
+ with open(file_path, "r", encoding="utf8", errors="ignore") as file:


Are there other fields, like the question ID? Those could be useful when doing an evaluation.

Add question ID to MRC question

I couldn't see the ID in this PR. Did you push the changes?
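To illustrate why the question ID matters for evaluation, here is a simplified exact-match scorer that joins predictions to gold answers by ID. It is only a sketch: the official SQuAD script applies more normalization (articles, punctuation), and the function name and the IDs `q1`/`q2` are invented for this example.

```python
from typing import Dict, List

def exact_match(predictions: Dict[str, str], gold: Dict[str, List[str]]) -> float:
    """Fraction of questions whose prediction equals any gold answer."""
    if not gold:
        return 0.0
    correct = 0
    for question_id, answers in gold.items():
        prediction = predictions.get(question_id, "").strip().lower()
        # An unanswerable question has no gold answers; the empty string
        # is then the only correct prediction.
        references = {a.strip().lower() for a in answers} or {""}
        if prediction in references:
            correct += 1
    return correct / len(gold)

gold = {"q1": ["Paris"], "q2": []}            # q2 is unanswerable
predictions = {"q1": "paris", "q2": "London"}
score = exact_match(predictions, gold)
print(score)
```

Without a stable ID on each question annotation, this join between system output and gold data would have to rely on text matching.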

qinzzz closed this Sep 29, 2021

qinzzz reopened this Sep 29, 2021

hunterhector marked this pull request as draft September 29, 2021 18:59

qinzzz changed the title from "Qinxin dev" to "Squad reader" Sep 29, 2021

hunterhector reviewed Sep 29, 2021

qinzzz force-pushed the qinxin_dev branch 3 times, most recently from bbdaccb to 4e2d3d8, October 1, 2021 05:48

qinzzz added 2 commits October 1, 2021 01:49

add new reader and test script for SQUAD dataset

979084f

Add squad MRC dataset parser, including new ontology and testcase

43eca0a

qinzzz force-pushed the qinxin_dev branch from 4e2d3d8 to 42e59d2, October 1, 2021 05:50

qinzzz marked this pull request as ready for review October 1, 2021 17:38

qinzzz force-pushed the qinxin_dev branch from 42e59d2 to 160eaec, October 4, 2021 14:52

hunterhector reviewed Oct 4, 2021

forte/datasets/mrc/squad_reader.py (outdated, resolved)

forte/ontology_specs/base_ontology.json (resolved)

qinzzz force-pushed the qinxin_dev branch from d4cfe34 to d4fbf6c, October 6, 2021 03:47

hunterhector reviewed Oct 6, 2021

forte/datasets/mrc/squad_reader.py (outdated, resolved)

format with black; revise annotations

a87b695

qinzzz force-pushed the qinxin_dev branch from d4fbf6c to a87b695, October 6, 2021 19:00

add squad data example

fd2a1e5

add new word to spelling checklist

e358fb9

qinzzz force-pushed the qinxin_dev branch from cdff213 to e358fb9, October 13, 2021 15:19

hunterhector reviewed Oct 13, 2021

qinzzz force-pushed the qinxin_dev branch from 581df9e to ea25dc9, October 13, 2021 16:50

hunterhector reviewed Oct 13, 2021

forte/datasets/mrc/squad_reader.py (resolved)

hunterhector reviewed Oct 13, 2021

replace MRCanswer with Phrase; substitute Passage with Document

d2885ff

qinzzz force-pushed the qinxin_dev branch from ea25dc9 to d2885ff, October 20, 2021 16:47

hunterhector approved these changes Oct 20, 2021

hunterhector added 2 commits October 20, 2021 21:06

Merge branch 'master' into qinxin_dev

fd04f97

Merge branch 'master' into qinxin_dev

418aceb

hunterhector merged commit a365f76 into asyml:master Oct 22, 2021
