Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP - Dependency Tree Matcher (Validations + Semgrex operators) #2836

Merged
merged 1 commit into from
Oct 29, 2018

Conversation

skrcode
Copy link
Contributor

@skrcode skrcode commented Oct 9, 2018

Added validation and all major Semgrex operators

Description

  1. Added checks to validate input.
  2. Added all major Semgrex operator support. Includes support for'<','>','>>','<<','.','$+','$-','$++','$--'.
    Details regarding the operators can be found here : https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html
    Note: The subtree matching code needs to be re-looked at. I guess there are some caveats in the Semgrex pattern matching algorithm that need to be implemented. Unordered subtree matching takes exponential time and the heuristics implemented in Semgrex can be of help.

Types of change

Enhancement

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@ines ines added enhancement Feature requests and improvements ⚠️ wip Work in progress feat / matcher Feature: Token, phrase and dependency matcher labels Oct 9, 2018
@honnibal honnibal merged commit 0bf1408 into explosion:develop Oct 29, 2018
@skrcode
Copy link
Contributor Author

skrcode commented Oct 30, 2018

@honnibal There's definitely something better than can be done on the pattern matching algorithm. I've hit a roadblock in trying to find a combinatorial approach for calculating all matches correctly and quickly. I've gone through a lot of the literature on this as well as existing code modules and could not find anything suitable that we could extend on to. Any thoughts on this?

@syllog1sm
Copy link
Contributor

I'll spend some time on it. I wanted to merge anyway because the patch doesn't interfere with others so there's no reason to let it diverge.

Thanks a lot for your work on this, i'm really grateful for the research. I'm hoping that once we have a few real patterns we can optimise the empirical runtime, which might look different from the worst case complexity.

@skrcode
Copy link
Contributor Author

skrcode commented Nov 2, 2018

@honnibal Yeah, I agree ! I think even http://corenlp.run/ does not handle those exponential cases.
While trying out this example {} > {} > {} > {} > {} > {} > {} > {} > {} > {} > {} > {} > {} > {} > {} > {}
for the text The quick brown fox jumped over the lazy dog., though the length of the input is small,
a large number of matches could be possibly found leading to a timeout or too much memory consumption.

@skrcode
Copy link
Contributor Author

skrcode commented Mar 23, 2019

@honnibal I've rewritten some of the code here: #3465
I've thought about the above performance problem in depth and figured out that for atleast some input specification, there are going to be a large number(exponential) of output subtrees. For example, if the input pattern is something like "give me all subtree matchings with any node having any two children", then the result would obviously have to contain all subtrees(without ordering too) from the dependency parse containing a node with two children. This result in itself would be exponential and so it has to run in atleast exponential time and memory footprint. I feel that there isn't any significant performance improvement that can be theoretically done, although I've added some constant time optimizations that could perform fairly well in practical use-cases, considering that an average sentence length would be some number hovering around 10.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Feature requests and improvements feat / matcher Feature: Token, phrase and dependency matcher ⚠️ wip Work in progress
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants