-
Notifications
You must be signed in to change notification settings - Fork 0
Turn the TiGer treebank into a real dependency treebank
License
wbwseeker/tiger2dep
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
This is the tiger2dep conversion script, which converts the TiGer corpus into dependency format. It now also converts the German part of the SMULTRON corpus (http://www.cl.uzh.ch/research/parallelcorpora/paralleltreebanks/smultron_en.html) and the 707 sentences of EuroParl manually annotated by Sebastian Padó (http://www.nlpado.de/~sebastian/data/projection/german_goldgold.xml.gz) The script contains two parts, one that applies a number of corrections to the source files, and one that converts the corrected version to dependencies. The dependency conversion is also capable of introducing empty heads for verb ellipses in the data (only tested for TiGer so far). REFERENCES: @inproceedings{SEEKER12.235.L12-1088, author = {Wolfgang Seeker and Jonas Kuhn}, title = {Making Ellipses Explicit in Dependency Conversion for a German Treebank}, booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012)}, year = {2012}, month = {May}, address = {Istanbul, Turkey}, editor = {Nicoletta Calzolari and Khalid Choukri and Thierry Declerck and Mehmet U\u{g}ur Do\u{g}an and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis}, publisher = {European Language Resources Association (ELRA)}, isbn = {978-2-9517408-7-7}, language = {English}, pages = {3132--3139}, note = {ACL Anthology Identifier: L12-1088}, url = {http://www.lrec-conf.org/proceedings/lrec2012/pdf/235_Paper.pdf} } @InProceedings{SEEKER14.809, author = {Wolfgang Seeker and Jonas Kuhn}, title = {An Out-of-Domain Test Suite for Dependency Parsing of German}, booktitle = {Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)}, year = {2014}, month = {may}, date = {26-31}, address = {Reykjavik, Iceland}, editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Hrafn Loftsson and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis}, publisher = {European Language Resources Association (ELRA)}, isbn = {978-2-9517408-8-4}, language = {english}, url = {http://www.lrec-conf.org/proceedings/lrec2014/pdf/809_Paper.pdf} } REQUIREMENTS: the script relies on BeautifulSoup 4 (included in archive) for reading the tigerxml USAGE: for producing a corrected version of the treebank (prerequisite for the conversion) $ python apply-corrections.py tiger_release_aug07.xml corrections/* > tiger_release_aug07.corrected.xml for producing the conversion: $ python tigerxml2conll09.py -i tiger_release_aug07.corrected.xml -m corrections/manual_heads.pl > tiger_release_aug07.corrected.conll09 type -h with the conversion script to get help on configuration options There are bash scripts (convert-tiger.sh, convert-smultron.sh, convert-europarl.sh) that you can run to convert the source files. Usage is documented inside the scripts. The conversion of TiGer needs quite a lot of memory because the whole corpus is represented as a DOM object for the error correction. I would run it on a server with a couple of GB available. Once this is done, the actual conversion to dependencies does not need much memory. LICENSE: The software is licensed under the GNU General Public License (GPL) version 3, which can be found under www.gnu.org/licenses or in the attached file LICENSE. Third party libraries are subject to their own license. ---- 05/27/2014 Wolfgang Seeker
About
Turn the TiGer treebank into a real dependency treebank
Resources
License
Stars
Watchers
Forks
Packages 0
No packages published