Problem with indexing ACLAnthology #2109

Closed
mobehbooei opened this issue Apr 27, 2023 · 7 comments

Comments

@mobehbooei
Contributor

Hi @ygorg and @aryamancodes,
I tried to follow https://github.com/castorini/anserini/blob/master/docs/acl-anthology.md, but something seems to be missing from the instructions.
First, I followed the Solr deployment instructions in https://github.com/castorini/anserini/blob/master/docs/solrini.md.
I ran python bin/create_hugo_yaml.py successfully and have the generated files in the acl-anthology/build/data/ directory.
I then tried to run the sh src/main/resources/solr/setup/acl-anthology.sh step, but the solr/setup directory was not created by any of the previous steps, and I can't find the acl-anthology.sh file anywhere!

(The same problem also occurs when I follow https://github.com/castorini/anserini/blob/master/docs/solrini.md and run pushd src/main/resources/solr && ./solr.sh ../../../../solrini localhost:9983 && popd.)

Am I missing some part of the instructions? I would appreciate any help with this.

@aryamancodes
Contributor

Hi @mobehbooei, let me try to set up Solr and get back to you on that issue. In the meantime, you can build an index for the ACL Anthology collection using Pyserini as described in this guide. A sample of building the index with Pyserini can also be seen in the "Steps to reproduce" section of this issue.

@ygorg
Contributor

ygorg commented Apr 27, 2023

I think this is because the README is out of date: when Anserini moved to Lucene 9.3 (on 02/08/22, 2725655), support for Elasticsearch and Solr was dropped (#1951).
Instead of the instructions you followed, you need to create the index using target/appassembler/bin/IndexCollection, without Solr being involved.
Alternatively, you may be able to clone Anserini from right before 02/08/2022 (or use anserini-0.14.4) while keeping the latest relevant source files for ACL indexing (see #2084). I have not tried that but was thinking about it; please let me know if it works!

@lintool
Member

lintool commented Apr 30, 2023

Hi @mobehbooei - yes, support for Solr has been dropped, so we should go back to direct Anserini indexing. The commands on this issue should work: #2069

i.e.,

python -m pyserini.index -collection AclAnthology -generator AclAnthologyGenerator -threads 8 -input build/data/ -index index/lucene-index-acl-paragraph -storePositions -storeDocvectors -storeContents -storeRaw -optimize

Can you please try it out and then update this page accordingly? https://github.com/castorini/anserini/blob/master/docs/acl-anthology.md

Send PR directly please.

@mobehbooei
Contributor Author

Hi everyone @ygorg @aryamancodes @lintool - Thanks for the responses. I tried the Pyserini approach, but I still run into the same issues as in #2069.
I get this error first:

2023-05-01 16:36:51,108 ERROR [main] collection.AclAnthology (AclAnthology.java:60) - Unable to open volumes.yaml

and then many occurrences of this error:

2023-05-01 16:36:51,582 ERROR [pool-2-thread-6] index.IndexCollection$LocalIndexerThread (IndexCollection.java:348) - pool-2-thread-6: Unexpected Exception:
java.lang.NullPointerException: null
        at io.anserini.collection.AclAnthology$Document.<init>(AclAnthology.java:154) ~[anserini-0.21.0-fatjar.jar:?]
        at io.anserini.collection.AclAnthology$Segment.readNext(AclAnthology.java:115) ~[anserini-0.21.0-fatjar.jar:?]
        at io.anserini.collection.FileSegment$1.hasNext(FileSegment.java:136) ~[anserini-0.21.0-fatjar.jar:?]
        at io.anserini.index.IndexCollection$LocalIndexerThread.run(IndexCollection.java:287) [anserini-0.21.0-fatjar.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]
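
The first error suggests the indexer could not read volumes.yaml, and the NullPointerException may simply follow from that. Below is a minimal pre-flight sketch to run before indexing. It assumes volumes.yaml is expected directly under the directory passed as -input (that location is an assumption, not something confirmed in this thread), and check_acl_input is a hypothetical helper name, not part of Anserini or Pyserini.

```python
from pathlib import Path


def check_acl_input(input_dir: str) -> list[str]:
    """Return a list of human-readable problems found in the input directory."""
    root = Path(input_dir)
    if not root.is_dir():
        return [f"input directory does not exist: {root}"]

    problems = []
    # Assumption: volumes.yaml sits directly under the input directory.
    volumes = root / "volumes.yaml"
    if not volumes.is_file():
        problems.append(f"missing file: {volumes}")
    elif volumes.stat().st_size == 0:
        problems.append(f"empty file: {volumes}")

    # Report the case where no YAML files were generated at all.
    if not any(root.rglob("*.yaml")):
        problems.append(f"no .yaml files found under {root}")
    return problems
```

Calling check_acl_input("build/data/") from the directory where the indexing command is run returns an empty list when nothing looks wrong. A relative -input path resolves against the current working directory, so running the command from a different directory is one possible cause worth ruling out.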

I tried this and this to prevent YAML from creating aliases, which Anserini can't parse, but I still get the same error!
I am using WSL on Windows and installed the pyserini package following this detailed version.
@aryamancodes, as you mentioned here, it worked for you, so do you have any idea what my problem is? Thanks.
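
To confirm whether the generated files still contain aliases, a quick scan can help. This is a heuristic sketch, assuming the files were written by PyYAML with its default anchor naming (&id001, *id001); find_yaml_aliases is a hypothetical helper, and free text that happens to contain such a token would be reported as well.

```python
import re

# PyYAML-style anchors look like "&id001" and aliases like "*id001".
# Assumption: the generated files use PyYAML's default "idNNN" naming.
ANCHOR_OR_ALIAS = re.compile(r"(?:^|[\s\[,:-])([&*]id\d+)\b")


def find_yaml_aliases(text: str) -> list[tuple[int, str]]:
    """Return (line_number, token) pairs for each anchor or alias found."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in ANCHOR_OR_ALIAS.finditer(line):
            hits.append((lineno, match.group(1)))
    return hits


sample = "author: &id001\n- name: A\neditor: *id001\ntitle: plain *emphasis* text\n"
print(find_yaml_aliases(sample))  # [(1, '&id001'), (3, '*id001')]
```

If the scan finds nothing, aliases are probably not the cause of the error. If it does find some, a common way to stop PyYAML from emitting them is to dump with a Dumper subclass whose ignore_aliases method returns True.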

@ygorg
Contributor

ygorg commented May 2, 2023

You might not have the latest version of the AclAnthology.java file, because in the latest version the error message is more verbose. Try updating the file or cloning the latest version of Anserini.
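
One way to see which Anserini build actually ran is to read it from the stack trace, since each frame names the jar it came from. A minimal sketch, assuming the jar follows the anserini-<version>-fatjar.jar naming seen in the trace above; anserini_versions is a hypothetical helper name.

```python
import re

# Matches jar references such as "anserini-0.21.0-fatjar.jar" in log lines.
FATJAR = re.compile(r"anserini-(\d+(?:\.\d+)*(?:-SNAPSHOT)?)-fatjar\.jar")


def anserini_versions(log_text: str) -> set[str]:
    """Return the distinct Anserini fatjar versions mentioned in a log."""
    return set(FATJAR.findall(log_text))


trace = "at io.anserini.collection.AclAnthology$Document.<init>(AclAnthology.java:154) ~[anserini-0.21.0-fatjar.jar:?]"
print(anserini_versions(trace))  # {'0.21.0'}
```

If the reported version is older than the commit that changed AclAnthology.java, the installed package is still using the old code regardless of what is checked out locally.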

@mobehbooei
Contributor Author

Thanks @ygorg. That worked. I needed the development installation of Pyserini to get the latest versions.

@lintool
Member

lintool commented May 30, 2023

Closing - ref: #2126 and castorini/pyserini#1537

lintool closed this as completed May 30, 2023