Implement phone number analyzer #15915

Merged

@rursprung rursprung commented Sep 12, 2024

Description

this is largely based on elasticsearch-phone and internally uses
libphonenumber.
this intentionally only ports a subset of the features: only phone and
phone-search are supported right now, phone-email can be added
if/when there's a clear need for it.

using libphonenumber is required since parsing phone numbers is a
non-trivial task (even though it might seem trivial at first glance!),
as can be seen in the list falsehoods programmers believe about phone
numbers.

this allows defining the region to be used when analysing a phone
number. so far only the generic "unknown" region (ZZ) had been used,
which worked as long as international numbers were prefixed with + but
did not work when using local numbers (e.g. a number stored as
+4158... was not matched against a number entered as 004158... or
058...).

example configuration for an index:

{
  "index": {
    "analysis": {
      "analyzer": {
        "phone": {
          "type": "phone"
        },
        "phone-search": {
          "type": "phone-search"
        },
        "phone-ch": {
          "type": "phone",
          "phone-region": "CH"
        },
        "phone-search-ch": {
          "type": "phone-search",
          "phone-region": "CH"
        }
      }
    }
  }
}

this creates four analyzers: phone and phone-search do not explicitly
specify a region and thus fall back to ZZ (unknown region), so the
regional version of the international dialing prefix (e.g. 00 instead
of + in most of europe) will not be recognised. phone-ch and
phone-search-ch will try to parse the phone number as a swiss phone
number, so e.g. 00 as a prefix is recognised as the international
dialing prefix.
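
to illustrate why the region matters, here is a minimal sketch of the prefix handling. it is a toy model, not the plugin's code and not libphonenumber's API: the class `RegionPrefixSketch`, its hand-written one-entry region table and the digits after the prefixes are all assumptions made up for illustration.

```java
import java.util.Map;

// toy model only: libphonenumber ships real per-region metadata and validates far more.
final class RegionPrefixSketch {
    // hypothetical, hand-written entry: {country calling code, international prefix, national prefix}
    private static final Map<String, String[]> REGIONS = Map.of("CH", new String[] { "41", "00", "0" });

    static String normalize(String input, String regionCode) {
        // drop spaces and punctuation, keep only digits and '+'
        String number = input.replaceAll("[^0-9+]", "");
        if (number.startsWith("+")) {
            return number; // already in international form, the region is irrelevant
        }
        String[] region = REGIONS.get(regionCode);
        if (region == null) {
            return number; // e.g. "ZZ": nothing is known about local prefixes, leave as-is
        }
        if (number.startsWith(region[1])) {
            // regional spelling of the international prefix, e.g. 00 -> +
            return "+" + number.substring(region[1].length());
        }
        if (number.startsWith(region[2])) {
            // national number: replace the national prefix with the country calling code
            return "+" + region[0] + number.substring(region[2].length());
        }
        return number;
    }
}
```

with region `CH` the three spellings `+41 58 000 00 00`, `0041 58 000 00 00` and `058 000 00 00` all end up as the same string in this sketch, while with `ZZ` only the `+` form is left in a comparable shape.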

note that the analyzer is (currently) not meant to find phone numbers in
large text documents - instead it should be used on fields which contain
just the phone number (though extra text will be ignored) and it
collects the whole content of the field into a String in memory,
making it unsuitable for large field values.
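
the memory note above can be pictured with a small sketch. it assumes the tokenizer receives the field content as a `java.io.Reader` (as Lucene tokenizers generally do); the class and method names are hypothetical and this is not the plugin's actual code.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// hypothetical helper: drains the whole reader into one String before any parsing happens.
final class ReadWholeFieldSketch {
    static String readAll(Reader input) {
        StringBuilder content = new StringBuilder();
        char[] buffer = new char[1024];
        try {
            int read;
            // keep reading until the reader is exhausted; memory use grows with the field size
            while ((read = input.read(buffer)) != -1) {
                content.append(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return content.toString();
    }
}
```

since everything is buffered before parsing, the cost is proportional to the length of the field value, which is fine for a single phone number but not for large documents.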

this has been implemented in a new plugin which is however part of the
central opensearch repository as it was deemed too big an overhead to
have it in a separate repository but not important enough to bundle it
directly in analysis-common (see the discussion on the issue and the
PR for further details).

note that the new plugin has been added to the exclude list of the
javadoc check as this check is overzealous and also complains in many
cases where it shouldn't (e.g. on overridden methods - which it should
theoretically not do - or constructors which don't even exist). the
check first needs to be improved before this exclusion could be removed.

closes #11326

Signed-off-by: Ralph Ursprung [email protected]

Related Issues

Resolves #11326

Check List

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions github-actions bot added enhancement Enhancement or improvement to existing feature or request Search:Relevance labels Sep 12, 2024
@rursprung rursprung force-pushed the implement-phone-number-analyzer branch from 74429fe to d844ea9 Compare September 12, 2024 16:56

❌ Gradle check result for 74429fe: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?


❌ Gradle check result for d844ea9: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@rursprung rursprung force-pushed the implement-phone-number-analyzer branch from d844ea9 to 24e60a5 Compare September 13, 2024 13:06

❌ Gradle check result for 24e60a5: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@rursprung rursprung force-pushed the implement-phone-number-analyzer branch from 24e60a5 to f7669e2 Compare September 13, 2024 13:30

❕ Gradle check result for f7669e2: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.gateway.RecoveryFromGatewayIT.testShardStoreFetchMultiNodeMultiIndexesUsingBatchAction

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.


codecov bot commented Sep 13, 2024

Codecov Report

Attention: Patch coverage is 97.36842% with 2 lines in your changes missing coverage. Please review.

Project coverage is 71.94%. Comparing base (6020c58) to head (a3ac6dc).
Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...earch/analysis/phone/PhoneNumberTermTokenizer.java | 97.87% | 0 Missing and 1 partial ⚠️ |
| ...nalysis/phone/PhoneNumberTermTokenizerFactory.java | 80.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #15915      +/-   ##
============================================
+ Coverage     71.88%   71.94%   +0.05%     
- Complexity    64496    64535      +39     
============================================
  Files          5291     5296       +5     
  Lines        301668   301744      +76     
  Branches      43576    43585       +9     
============================================
+ Hits         216863   217094     +231     
+ Misses        67031    66764     -267     
- Partials      17774    17886     +112     



rursprung commented Sep 13, 2024

testShardStoreFetchMultiNodeMultiIndexesUsingBatchAction

❕ Gradle check result for f7669e2: UNSTABLE

* **TEST FAILURES:**
      1 org.opensearch.gateway.RecoveryFromGatewayIT.testShardStoreFetchMultiNodeMultiIndexesUsingBatchAction

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

this is a flaky test: #14304

and the failure of the "mend security check" also seems to be random (but i don't have the rights to re-trigger it)

@rursprung

@rursprung thank you, could you please resolve the conflicts?

done (let's hope it stays mergeable for longer this time!)

@rursprung rursprung requested a review from reta October 3, 2024 15:11

@reta reta left a comment

LGTM, thanks @rursprung ! @msfroh anything left on your side?


github-actions bot commented Oct 3, 2024

❌ Gradle check result for a3ac6dc: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?


github-actions bot commented Oct 3, 2024

✅ Gradle check result for a3ac6dc: SUCCESS


@msfroh msfroh left a comment

Looks good, thanks!

@reta reta merged commit d1fd47c into opensearch-project:main Oct 3, 2024
33 of 35 checks passed
@opensearch-trigger-bot

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/OpenSearch/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/OpenSearch/backport-2.x
# Create a new branch
git switch --create backport/backport-15915-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 d1fd47c652b4c6a2c0ec5d0ee574a0ff0d263177
# Push it to GitHub
git push --set-upstream origin backport/backport-15915-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/OpenSearch/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-15915-to-2.x.


reta commented Oct 3, 2024

@rursprung apologies, would you mind sending a manual backport to the 2.x branch? thank you

@rursprung rursprung deleted the implement-phone-number-analyzer branch October 4, 2024 06:19
@rursprung

@rursprung apologies, would you mind sending a manual backport to the 2.x branch? thank you

no worries, done: #16187

i'm a big fan of having a changelog, but it's causing a lot of merge conflicts here 🙁
i've seen another repo (don't remember which, but i have a vague feeling that it was even in the opensearch org?) which had a subfolder where you created one file per PR for the changelog and some automated tooling then collected all of that together and merged it into the main changelog file for the release (and deleted the other files). maybe that might be an idea here as well to avoid the merge conflicts? might be worth discussing in a separate issue (which should probably come from someone regularly contributing to this repo as you'll be much more affected by this issue than me)?

on another note: squash-merging destroys my nice atomic commits 🙁
i get that you do that for PRs where people just add a ton of "fix review finding" commits, but for the proper (linux kernel style ;)) PRs, where you force-push to have a nice linear commit history with atomic commits (each with a nice commit message), i think this is a big loss (and it makes it harder to find the actual culprit with tools like git-bisect and git-revert)


reta commented Oct 4, 2024

might be worth discussing in a separate issue (which should probably come from someone regularly contributing to this repo as you'll be much more affected by this issue than me)?

Please feel free to open an issue or kick off discussion!

on another note: squash-merging destroys my nice atomic commits

The clean repo history is useful, but this is a tradeoff for sure

rursprung added a commit to rursprung/documentation-website that referenced this pull request Oct 4, 2024
this is part of opensearch-project/OpenSearch#11326. the actual
implementation was done in opensearch-project/OpenSearch#15915. see the
commit message on the PR for further details.

resolves opensearch-project#8389
rursprung added a commit to rursprung/documentation-website that referenced this pull request Oct 4, 2024
this is part of opensearch-project/OpenSearch#11326. the actual
implementation was done in opensearch-project/OpenSearch#15915. see the
commit message on the PR for further details.

resolves opensearch-project#8389

Signed-off-by: Ralph Ursprung <[email protected]>
rursprung added a commit to rursprung/documentation-website that referenced this pull request Oct 11, 2024
this is part of opensearch-project/OpenSearch#11326. the actual
implementation was done in opensearch-project/OpenSearch#15915. see the
commit message on the PR for further details.

resolves opensearch-project#8389

Co-authored-by: Fanit Kolchina <[email protected]>
Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: Ralph Ursprung <[email protected]>
rursprung added a commit to rursprung/opensearch-api-specification that referenced this pull request Oct 11, 2024
this is part of opensearch-project/OpenSearch#11326. the actual
implementation was done in opensearch-project/OpenSearch#15915. see the
commit message on the PR for further details.

the new test group `analysis` has been added so that it can later be
extended with all other optional language analyzers (which are currently
also not covered).

Signed-off-by: Ralph Ursprung <[email protected]>
rursprung added a commit to rursprung/opensearch-api-specification that referenced this pull request Oct 11, 2024
this is part of opensearch-project/OpenSearch#11326. the actual
implementation was done in opensearch-project/OpenSearch#15915. see the
commit message on the PR for further details.

the new test group `analysis` has been added so that it can later be
extended with all other optional language analyzers (which are currently
also not covered).

note that the CI currently needs to fetch the image from
`opensearchstaging` as 2.18.0 hasn't been released yet. the `hub` and
`ref` config can be removed once 2.18.0 has been released.

Signed-off-by: Ralph Ursprung <[email protected]>
rursprung added a commit to rursprung/opensearch-api-specification that referenced this pull request Oct 11, 2024
this is part of opensearch-project/OpenSearch#11326. the actual
implementation was done in opensearch-project/OpenSearch#15915. see the
commit message on the PR for further details.

the new test group `analysis` has been added so that it can later be
extended with all other optional language analyzers (which are currently
also not covered).

note that the CI currently needs to fetch the image from
`opensearchstaging` as 2.18.0 hasn't been released yet. the `hub` and
`ref` config can be removed once 2.18.0 has been released.

Signed-off-by: Ralph Ursprung <[email protected]>
dk2k pushed a commit to dk2k/OpenSearch that referenced this pull request Oct 16, 2024
* add `Strings#isDigits` API

inspiration taken from [this SO answer][SO].

note that the stream is not parallelised to avoid the overhead of this
as the method is intended to be called primarily with shorter strings
where the time to set up would take longer than the actual check.

[SO]: https://stackoverflow.com/a/35150400

Signed-off-by: Ralph Ursprung <[email protected]>
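
for illustration, a minimal sketch of what such a digit check could look like with a sequential stream. this is an assumption based on the description above, not the actual `Strings#isDigits` code in OpenSearch; the class name `DigitsSketch` is hypothetical, and returning `false` for `null` or empty input is a choice made only for this sketch.

```java
// hypothetical stand-in for the helper described above; the real implementation may differ.
final class DigitsSketch {
    static boolean isDigits(String value) {
        // sequential (non-parallel) stream: for short strings the setup cost of a
        // parallel stream would outweigh the per-character check
        return value != null && !value.isEmpty() && value.chars().allMatch(Character::isDigit);
    }
}
```

for example, `DigitsSketch.isDigits("0041")` is `true` while `DigitsSketch.isDigits("+41")` is `false` because of the leading `+`.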

* add `phone` & `phone-search` analyzer + tokenizer

this is largely based on [elasticsearch-phone] and internally uses
[libphonenumber].
this intentionally only ports a subset of the features: only `phone` and
`phone-search` are supported right now, `phone-email` can be added
if/when there's a clear need for it.

using `libphonenumber` is required since parsing phone numbers is a
non-trivial task (even though it might seem trivial at first glance!),
as can be seen in the list [falsehoods programmers believe about phone
numbers][falsehoods].

this allows defining the region to be used when analysing a phone
number. so far only the generic "unknown" region (`ZZ`) had been used
which worked as long as international numbers were prefixed with `+` but
did not work when using local numbers (e.g. a number stored as
`+4158...` was not matched against a number entered as `004158...` or
`058...`).

example configuration for an index:
```json
{
  "index": {
    "analysis": {
      "analyzer": {
        "phone": {
          "type": "phone"
        },
        "phone-search": {
          "type": "phone-search"
        },
        "phone-ch": {
          "type": "phone",
          "phone-region": "CH"
        },
        "phone-search-ch": {
          "type": "phone-search",
          "phone-region": "CH"
        }
      }
    }
  }
}
```
this creates four analyzers: `phone` and `phone-search` do not
explicitly specify a region and thus fall back to `ZZ` (unknown region),
so the regional version of the international dialing prefix (e.g. `00`
instead of `+` in most of europe) will not be recognised. `phone-ch` and
`phone-search-ch` will try to parse the phone number as a swiss phone
number, so e.g. `00` as a prefix is recognised as the international
dialing prefix.

note that the analyzer is (currently) not meant to find phone numbers in
large text documents - instead it should be used on fields which contain
just the phone number (though extra text will be ignored) and it
collects the whole content of the field into a `String` in memory,
making it unsuitable for large field values.

this has been implemented in a new plugin which is however part of the
central opensearch repository as it was deemed too big an overhead to
have it in a separate repository but not important enough to bundle it
directly in `analysis-common` (see the discussion on the issue and the
PR for further details).

note that the new plugin has been added to the exclude list of the
javadoc check as this check is overzealous and also complains in many
cases where it shouldn't (e.g. on overridden methods - which it should
theoretically not do - or constructors which don't even exist). the
check first needs to be improved before this exclusion could be removed.

closes opensearch-project#11326

[elasticsearch-phone]: https://github.com/purecloudlabs/elasticsearch-phone
[libphonenumber]: https://github.com/google/libphonenumber
[falsehoods]: https://github.com/google/libphonenumber/blob/master/FALSEHOODS.md

Signed-off-by: Ralph Ursprung <[email protected]>

---------

Signed-off-by: Ralph Ursprung <[email protected]>
dk2k pushed a commit to dk2k/OpenSearch that referenced this pull request Oct 17, 2024
Labels
  • backport 2.x: Backport to 2.x branch
  • backport-failed
  • enhancement: Enhancement or improvement to existing feature or request
  • Search:Relevance
  • v2.18.0: Issues and PRs related to version 2.18.0
  • v3.0.0: Issues and PRs related to version 3.0.0
Development

Successfully merging this pull request may close these issues.

[RFC] new plugin with normalizer & analyzer for phone numbers
4 participants