Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issues when indexing AclAnthology #2084

Merged
merged 9 commits into from
Apr 22, 2023
Merged

Issues when indexing AclAnthology #2084

merged 9 commits into from
Apr 22, 2023

Conversation

ygorg
Copy link
Contributor

@ygorg ygorg commented Mar 27, 2023

This PR is linked to #2069.

There are some issues when following the "Indexing the ACL Anthology with Anserini" tutorial:

  1. The venues attribute is now venue and is stored in paper rather than volume.
  2. The page_first and page_last attributes are sometimes in roman notation and thus strings, not number. I change the schema, because translating roman to int should happen in acl-anthology.
  3. Some files are too big for fasterxml.jackson. By default files should be less that 3mb (~1,345,000 char). I increased this size to 10mb.
  4. The script create_hugo_yaml.py produces a yaml file containing aliases and references (to prevent data redundancy). These cannot be parsed by fasterxml.jackson (YAML Anchors / References FasterXML/jackson-dataformats-text#98). One way to prevent this is modifying the script to prevent this behavior (see the updated doc). An other way is to postprocess the data in acl-anthology/build/data using
pip install ruamel.yaml.cmd
mv volumes.yaml volumes_old.yaml
yaml merge-expand volumes_old.yaml volumes.yaml
mv papers papers_old
for yaml_old in papers_old/*.yaml; do  # this is ~10minutes !
  yaml_new=${yaml_old##*/}
  yaml merge-expand $yaml_old papers/$yaml_new
done

Point number 4 is hacky, maybe there is a way to preprocess and remove the aliases in java in AclAnthology.java ? Maybe this dataset should be read using another format (jsonl, xml, bibtex) ?
I still need to update the unittests.

@ygorg
Copy link
Contributor Author

ygorg commented Apr 21, 2023

I updated the unittests with the same sample documents as before, but from a recent extract of acl-anthology github repo.

@aryamancodes
Copy link
Contributor

I followed the instructions on this PR and the related issue (including both the "hacks" described in point 4) and was able to successfully index ACL Anthology. The unit tests ran successfully too. Just a small nitpick with the post-processing script, the correct version should look like:

pip install ruamel.yaml.cmd
mv volumes.yaml volumes_old.yaml
yaml merge-expand volumes_old.yaml volumes.yaml
mv papers papers_old

# since we renamed the papers dir, it no longer exists
mkdir papers 

for yaml_old in papers_old/*.yaml; do  # this is ~10minutes !
  yaml_new=${yaml_old##*/}
  yaml merge-expand $yaml_old papers/$yaml_new
done

@lintool
Copy link
Member

lintool commented Apr 21, 2023

@ygorg thoughts? let's sort through these final issues and then merge in your PR?

@codecov-commenter
Copy link

Codecov Report

Patch coverage: 92.85% and project coverage change: +0.03 🎉

Comparison is base (910821a) 58.89% compared to head (e8e324d) 58.93%.

📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

Additional details and impacted files
@@             Coverage Diff              @@
##             master    #2084      +/-   ##
============================================
+ Coverage     58.89%   58.93%   +0.03%     
  Complexity     1188     1188              
============================================
  Files           194      194              
  Lines         11318    11328      +10     
  Branches       1486     1486              
============================================
+ Hits           6666     6676      +10     
  Misses         4170     4170              
  Partials        482      482              
Impacted Files Coverage Δ
...nserini/index/generator/AclAnthologyGenerator.java 89.04% <ø> (ø)
...main/java/io/anserini/collection/AclAnthology.java 81.52% <92.85%> (+2.25%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@ygorg
Copy link
Contributor Author

ygorg commented Apr 22, 2023

Thank you for the review.
@aryamancodes My original comment wasn't so clear, the solution you used is very slow, my primary recommandation was to modify create_hugo_yaml.py with this line Dumper.ignore_aliases = lambda self, data: True
Which simplifies the process so much.

@aryamancodes
Copy link
Contributor

That makes sense. I did test the primary recommendation and it works well!

@lintool
Copy link
Member

lintool commented Apr 22, 2023

Thanks @ygorg for the contribution!

@lintool lintool merged commit 39dfcb4 into castorini:master Apr 22, 2023
@ygorg ygorg deleted the fix-acl branch April 22, 2023 12:36
aryamancodes pushed a commit to aryamancodes/anserini that referenced this pull request Apr 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants