Unexpected behavior in aarch64 #100

Closed
estebanpw opened this issue Apr 12, 2022 · 4 comments
Labels
bug Something isn't working

estebanpw commented Apr 12, 2022

Hello,

First, thank you for taking the time to make arm support possible :)

Second, I have found a case where vectorscan reports a false positive match on ARM aarch64. The same input does not produce a false positive in the original hyperscan on x64.

I have isolated a very small reproducible example with 2 input regexes and a couple of bytes of corpus text. The text that is scanned is:
xxxxxxxxxx?y\nTEXT12345xxxxxxxxxxxx

whereas the two regexes are:

^x\\z*x
y\\z*TEXT12345

The single match is reported as follows:

  • Match id: 1
  • Ending position of match: 23
  • Matched pattern: y\\z*TEXT12345
  • Input from 0 to 23: xxxxxxxxxx?y\nTEXT12345

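As far as I know, this should not match: the pattern requires a literal backslash after the y, followed by zero or more z characters and then TEXT12345, but in the input the backslash is followed by an n.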

One detail that may help: the two regexes only produce the match when compiled without the flag HS_FLAG_SOM_LEFTMOST (which is why I only report the ending position of the match). In my tests I was using the flags HS_FLAG_DOTALL | HS_FLAG_MULTILINE; as soon as HS_FLAG_SOM_LEFTMOST is added, the false match is no longer reported.

Furthermore, if I remove one or more x chars from the end of the input string (even though these are not part of the match), the match is no longer reported. The same happens with the x chars at the beginning. I know this is a strange example, but it comes from a much larger dataset of inputs and this is the smallest case I could pinpoint. Also note that when the regexes are compiled individually, neither of them produces a match.

The self-contained code of the example (notice the multiple backslashes for the escaping character \\\\):

#include <iostream>
#include <vector>
#include <stdexcept> // std::runtime_error
#include <string>    // std::string
#include <hs.h>
#include <cstring>

typedef struct match{
    unsigned int id;
    unsigned int from;
    unsigned int to;
} Match;

int on_match_counter(unsigned int id, unsigned long long from, unsigned long long to, unsigned int flags, void *ctx)
{
    std::vector<Match> * matches = (std::vector<Match> *) ctx;
    Match m;
    m.id = id;
    m.from = from;
    m.to = to;
    matches->push_back(m);
    return 0;
}

int main(int ac, char ** av)
{
    const char input[] = "xxxxxxxxxx?y\\nTEXT12345xxxxxxxxxxxx";

    std::vector<const char *> cstr_patterns;
    std::vector<unsigned> patterns_flags;
    std::vector<unsigned> patterns_ids;
    
    cstr_patterns.push_back("^x\\\\z*x");
    cstr_patterns.push_back("y\\\\z*TEXT12345");

    for(int i=0; i<(int) cstr_patterns.size(); i++)
    {
        patterns_flags.push_back(HS_FLAG_DOTALL | HS_FLAG_MULTILINE); // produces one match
        //patterns_flags.push_back(HS_FLAG_DOTALL | HS_FLAG_MULTILINE | HS_FLAG_SOM_LEFTMOST); // does not produce any
        patterns_ids.push_back(i);
    }


    hs_database_t * db_block = NULL;
    hs_compile_error_t * compile_err = NULL;
    hs_scratch_t * scratch = NULL;

    hs_error_t err = hs_compile_multi(cstr_patterns.data(), patterns_flags.data(),
                    patterns_ids.data(), cstr_patterns.size(), HS_MODE_BLOCK,
                    NULL, &db_block, &compile_err);

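    if (err != HS_SUCCESS)
    {
        // Keep the compiler's message before freeing the error object
        std::string msg = compile_err ? compile_err->message : "unknown error";
        hs_free_compile_error(compile_err);
        throw std::runtime_error("ERROR: Unable to compile: " + msg + "\n");
    }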
    err = hs_alloc_scratch(db_block, &scratch);
    if (err != HS_SUCCESS) {
        hs_free_database(db_block);
        throw std::runtime_error("ERROR: Unable to allocate scratch space. Exiting.\n");
    }

    std::vector<Match> matches;
    err = hs_scan(db_block, input, strlen(input), 0, scratch, on_match_counter, (void *) &matches);
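    if (err != HS_SUCCESS)
    {
        // Release resources before bailing out, as in the earlier error paths
        hs_free_scratch(scratch);
        hs_free_database(db_block);
        throw std::runtime_error("ERROR: Scanning");
    }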

    std::cout << "Found " << matches.size() << " match(es) with " << cstr_patterns.size() << " patterns" << std::endl;
    for(int i=0; i<(int)matches.size(); i++)
    {
        std::cout << "Match " << i << "\n\t[id:" << matches.at(i).id << "]@(" << matches.at(i).from << "," << matches.at(i).to << ")" << std::endl;
        std::cout << "\t" << cstr_patterns.at(matches.at(i).id) << std::endl;
        // substr takes (position, length), so pass to - from rather than to
        std::cout << "\t" << std::string(input).substr(matches.at(i).from, matches.at(i).to - matches.at(i).from) << std::endl;
        
    }
    
    hs_free_database(db_block);
    hs_free_scratch(scratch);

    return 0;
}

I compiled with g++-10 (Ubuntu 10.3.0-1ubuntu1~20.04) 10.3.0 on x64 and gcc10-g++ (GCC) 10.3.1 20210422 (Red Hat 10.3.1-1) on aarch64. Ragel version is Ragel State Machine Compiler version 6.10 March 2017 for both.

I noticed there was also a recent post with a similar problem here, and that this PR may fix it. I can rerun the test once the PR is merged.

Let me know if there is anything else I can provide. Thank you for your time.

@danlark1

After #93 I can't reproduce the problem on aarch64

@markos markos added the bug Something isn't working label Apr 18, 2022

markos commented Apr 18, 2022

Closed with #102

@markos markos closed this as completed Apr 18, 2022
@vmurashev

@markos, may I kindly ask whether this fix is important enough to warrant releasing 5.4.7?


markos commented Apr 21, 2022

@vmurashev I'm working on fixing #95 as well for 5.4.7; if that does not happen soon, expect 5.4.7 on Monday. :)
