[RFC] new plugin with normalizer & analyzer for phone numbers #11326

rursprung · 2023-11-24T12:58:54Z

UPDATE: RFC for new plugin
please use this issue as an RFC to have a new plugin under the opensearch-project org for the phone number normalizer & analyzer. i have implemented & open-sourced the plugin (it needs minor polishing of the git history & a port to 3.x - this can be done in a few minutes once we know where it'll live) and would very much like to see it hosted & owned by the project as i believe in the general usefulness of this.
see this comment for more details: #11326 (comment)

original post:

Is your feature request related to a problem? Please describe.
we have a use-case where we store (amongst other things) a phone number in a dedicated field of the document. this is ingested from another system where in turn it has been entered by users (while there's some validation there might still be some variation in how the number is written). a user can then trigger a search which (amongst other things) will try to match the phone number. since the text to be searched is entered by the user, the phone number might come in any format (with or without international calling prefix, calling prefix with + or 00 (or the national equivalent thereof), with or without separators (whitespaces, dashes, dots, you pick a character and chances are that a country is using it), with or without brackets for grouping numbers together, etc.).

as a corner case (doesn't really affect us, but relevant for a general solution): even e.g. just filtering for numbers doesn't work in case a number would be entered with alphabetical representation. the only one i actually know is 1-800-MICROSOFT in the USA, but i think you have lots of these over there?
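To make the corner case concrete, here is a minimal sketch of mapping a vanity number to digits. It assumes the standard telephone keypad letter layout; the function name `vanity_to_digits` is made up for illustration and is not part of any plugin discussed in this issue.

```python
# Standard telephone keypad layout: which letters sit on which digit.
KEYPAD = {
    "2": "ABC", "3": "DEF", "4": "GHI", "5": "JKL",
    "6": "MNO", "7": "PQRS", "8": "TUV", "9": "WXYZ",
}
LETTER_TO_DIGIT = {
    letter: digit for digit, letters in KEYPAD.items() for letter in letters
}


def vanity_to_digits(raw: str) -> str:
    """Hypothetical helper: map keypad letters to digits, keep ASCII digits,
    and drop everything else (dashes, spaces, brackets, ...)."""
    out = []
    for ch in raw.upper():
        if ch in "0123456789":
            out.append(ch)
        elif ch in LETTER_TO_DIGIT:
            out.append(LETTER_TO_DIGIT[ch])
    return "".join(out)


print(vanity_to_digits("1-800-MICROSOFT"))  # prints 1800642767638
```

A filter that only keeps digits would reduce the same input to `1800`, which is why plain digit filtering is not enough for such numbers.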

Describe the solution you'd like
it'd be great if OpenSearch could ship with a normalizer (or even a dedicated field type which automatically uses this normalizer?) for phone numbers which would cover most (if not all) cases. it could start with the most common ones and then be improved over time by the community when need arises.

Describe alternatives you've considered
everyone can build their own normalizer for phone numbers. the problem is that none of them will cover all (or even most) phone numbers and this just creates additional effort if everyone needs to re-invent the wheel.

the following is a very basic implementation which however doesn't cover most of the cases listed above (hence why it's hard to build a good one on your own):

{
  "analysis": {
    "char_filter": {
      "whitespace_remove": {
        "type": "pattern_replace",
        "pattern": "\\s",
        "replacement": ""
      },
      "transform_plus_to_00": {
        "type": "pattern_replace",
        "pattern": "\\+",
        "replacement": "00"
      }
    },
    "normalizer": {
      "phone_number_normalizer": {
        "type": "custom",
        "char_filter": [
          "whitespace_remove",
          "transform_plus_to_00"
        ],
        "filter": [
          "lowercase",
          "uppercase"
        ]
      }
    }
  }
}
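To show where this basic normalizer falls short, the following sketch approximates its two char filters and two token filters in plain Python. It is an emulation for illustration only, not how OpenSearch executes the normalizer, and the sample phone number is made up.

```python
import re


def basic_normalizer(value: str) -> str:
    """Approximation of the `phone_number_normalizer` defined above."""
    value = re.sub(r"\s", "", value)   # char_filter "whitespace_remove"
    value = value.replace("+", "00")   # char_filter "transform_plus_to_00"
    return value.lower().upper()       # filters "lowercase", then "uppercase"


# The same (made-up) swiss number written in different ways.
variants = [
    "+41 58 123 45 67",
    "0041 58 123 45 67",
    "058 123 45 67",
    "+41-58-123-45-67",
    "+41 (0)58 123 45 67",
]
for variant in variants:
    print(f"{variant!r:24} -> {basic_normalizer(variant)}")
```

Only the first two variants end up with the same value (`0041581234567`). The national format, the dash-separated format and the format with brackets each produce a different string, so they would not match each other in a search.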

Additional context
the wikipedia article on national conventions for writing telephone numbers seems to cover most (if not all?) ways of writing phone numbers

msfroh · 2023-11-29T17:20:06Z

I can see this being extremely challenging, given the wide variety of conventions. The most "normal" form would probably involve stripping any international calling prefix and removing any punctuation/whitespace.

I suppose it would be debatable as to whether a country-specific long-distance prefix should be removed. E.g. 1-800-MICROSOFT is functionally equivalent to 800MICROSOFT, because the 1 historically indicated a call outside your local area code.

macohen · 2023-11-29T17:21:02Z

I found this plugin that could be leveraged as a starting point: https://github.com/purecloudlabs/elasticsearch-phone. It looks like it's been abandoned. cc: @drewinin, @ININDevEvangelists - any thoughts, guidance, or warnings on this?

timsmithgenesys · 2023-11-29T17:51:52Z

@macohen I administer the purecloudlabs org and am unfamiliar with the project itself, but I can tell you that it has no maintainers and so can be considered abandoned; it's on my list of repos that are to be archived. It's licensed under a MIT license, so you are free to do with it as you please.

rursprung · 2023-12-11T11:42:38Z

thanks for finding this library, @macohen!

@timsmithgenesys:

> It's licensed under a MIT license, so you are free to do with it as you please.

thanks for this information! you have a LICENSE which contains the MIT license and a LICENSE.md which contains an Apache 2.0 license - is it correctly dual-licensed (would be good as all OpenSearch projects are under the Apache 2.0 license, making life easier) or is this a leftover and you consider it only licensed under MIT?

i've now seen that this library is backed by google's libphonenumber (which i didn't know before). that looks like it would help in abstracting away most of the discussions above (by leaving it to this library), though i haven't looked into it in any detail (yet). its Falsehoods Programmers Believe About Phone Numbers list is very informative and shows that we probably shouldn't attempt to implement this manually (e.g. with textual token filtering).

timsmithgenesys · 2023-12-11T15:06:37Z

> you have a LICENSE which contains the MIT license and a LICENSE.md which contains an Apache 2.0 license

Oh gross; I didn't notice that. I think the correct answer is this is meant to be licensed under Apache per the misspelled LICENCE.md as that was added first and by a human. The LICENSE file that contains the MIT license was added by automation, not intentionally.

I've deleted the MIT license and archived the repo to rectify this situation. I'll have to retract my previous statement about licensure. You're still free to do whatever you want with it, just per the remaining Apache license.

dblock · 2023-12-13T16:33:31Z

@macohen I think we could fork that abandoned library into opensearch-phone-number-analyzer or bring it into core here

macohen · 2023-12-13T16:36:53Z

Sounds good to me. Is it "core" enough to include here? I'm open to what seems right to maintainers.

dblock · 2023-12-13T16:55:41Z

@macohen core is pretty bloated, so I would include it only if you think it should be enabled by default, WDYT?

macohen · 2023-12-20T14:28:32Z

I do think it should be enabled by default - as long as it doesn't use any resources unless invoked.

However, I don't think we're at a stage yet where we should just move it in. @rursprung are you up for forking this to your own account, making it work with OpenSearch as a plugin and then we can take a public vote/decision on if we move this into the opensearch-project itself? I think I'd want to make sure this library can at least meet your use cases.

@dblock do you think the change from a plugin/extension to a core module would be significant enough that we would want to make a decision before anyone starts working on it?

cc: @dagneyb for an FYI

dblock · 2024-01-23T20:36:29Z

> @dblock do you think the change from a plugin/extension to a core module would be significant enough that we would want to make a decision before anyone starts working on it?

I don't feel strongly about it at all.

rursprung · 2024-06-11T14:49:32Z

i have now created an implementation for this (currently on 2.x because i'm testing there, but easily ported to 3.x): opensearch-phone-number-analyzer

this is not 100% ready for prime-time yet (namely i'm waiting for 1-2 PRs to end up in the plugin template and plan to then re-create the repo based on that and with everything in main first of course) but good enough to start the discussions.

this is largely based on the functionality from elasticsearch-phone from @purecloudlabs (@timsmithgenesys: thanks for clarifying the license here and hope that the attribution in the commit-msg & README is ok?).

@opensearch-project (CC @macohen):

> However, I don't think we're at a stage yet where we should just move it in. @rursprung are you up for forking this to your own account, making it work with OpenSearch as a plugin and then we can take a public vote/decision on if we move this into the opensearch-project itself? I think I'd want to make sure this library can at least meet your use cases.

i'd like to nominate this plugin to be moved to the opensearch-project org. is this issue the right place to discuss this? in the best case you'd then create a fresh repo with the same name here based on the plugin template (with main & 2.x please 🙂) and i can fork it and raise a PR with the actual content (same commits as you can currently see in my repo).
alternatively - if you prefer that - i can of course also provide it as a PR to the plugins folder in this repo, but if i understood right that's not really desired anymore for new plugins?

rursprung · 2024-06-20T16:25:29Z

i've now changed the title of the issue and added a comment at the beginning to clearly state that this is now an RFC to get the plugin moved over to this org. see my previous comment for more details.

dblock · 2024-06-20T22:29:17Z

@AmiStrn was trying to contribute a process for deciding how/when/why we bring plugins into the org

@rursprung Do you have any users of the fork? Get CI to be green, make a release? We can consider including a plugin in the main distribution if it’s widely used.

dblock · 2024-06-25T16:06:14Z

Note that we generally describe the process of moving repos into the org in https://github.com/opensearch-project/.github/blob/main/ADMINS.md#new-repos and have a proposal in opensearch-project/opensearch-plugins#216 to document moving plugins.

rursprung · 2024-06-26T08:42:13Z

@dblock: thanks for your answers!

> Do you have any users of the fork?

not yet. i published it right after building it to (a) get feedback on it and (b) see if it can be upstreamed directly.
i have however started integrating it in one of our applications (the reason why i built the plugin in the first place) and we will roll it out with that to our customers. the main issue here is that for this it will have to be available in some public cloud setups (i.e. managed services) - which would be much easier if it were part of this project

> Get CI to be green

that was a minor oversight (the lucene test expects the tokens to be in the specified order and seems that i mixed up something there; already fixed locally and on the branch using Set instead of List; waiting on #14179 to see what the correct approach is).

i'll update the repo once #14179 has been answered (probably re-create the repo from the template as a lot has happened there since i did so and then re-apply the commits with the necessary cleanups/code changes; as there are no releases, no forks, etc. there shouldn't be an issue with throwing away the existing git history in this case). this will then also include porting it from 2.x to main (i haven't tried it yet and hope that the lucene major release upgrade won't cause any issues here).

> make a release?

if ever possible i'd prefer to avoid going through that if we can move it to the opensearch project as then we'd have to do the whole setup twice. also, some things would then probably have to be renamed (i presume it'd be bad style if i'd publish it with org.opensearch.plugin for the java packages and maven coordinates), adding more noise.

> We can consider including a plugin in the main distribution if it’s widely used.

i think these are two steps:

1. host the project under opensearch-project on GH & include it in the release process
2. include the plugin as part of the "full" distribution (i.e. make it a standard plugin)

while the latter would be nice (full disclosure: we're not using it, we use the min distribution and add only the plugins we need for our own setup) the former would be my immediate goal with this issue.

dblock · 2024-06-27T21:30:00Z

> host the project under opensearch-project on GH & include it in the release process
>
> include the plugin as part of the "full" distribution (i.e. make it a standard plugin)

Both of these are significant amounts of work for people who work on this project. The second one is also a promise to continue working on the plugin for security fixes and other version upgrades forever. I think it's a reasonable ask that any plugin that is included in the org and is released with the distribution gets either a team publicly committing to do it (your company?) or at least some traction with users, first, no?

msfroh · 2024-09-06T23:34:56Z

This is a lightweight tokenizer. Adding all the bloat of a new plugin to support one more analyzer seems silly. Adding it to analysis-common would be much easier.

smacrakis · 2024-09-09T14:22:56Z

The standard way of denoting the international call prefix is "+", not "00" (https://en.wikipedia.org/wiki/List_of_international_call_prefixes)

If the field is known to be a phone number (not mixed with other text), then all punctuation and whitespace can be removed to canonicalize it, e.g., (234) 321-9321, 234.321.9321, 234 321 9321 should all canonicalize to 2343219321. It would be nice if we could add + to indicate the international call prefix, but unfortunately, numbers are often given without the country code. If the application is hosted in North America, and the number has 10 digits, it's pretty safe to guess that the prefix "+1" should be applied. But alas you can't count on that outside of North America. An industrial-strength implementation would look at other fields (notably address) to make a better guess. Probably ML to do this robustly.
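The canonicalization described in this comment could be sketched as follows. This is a simplified illustration with a made-up function name and flag; it assumes the field contains only a phone number and it applies the "+1" guess only when the caller explicitly opts in.

```python
def canonicalize(raw: str, assume_north_america: bool = False) -> str:
    """Hypothetical helper: strip punctuation/whitespace from a phone number.

    A leading "+" is preserved. If `assume_north_america` is set and the
    number has exactly 10 digits, "+1" is prepended as a guess.
    """
    has_plus = raw.strip().startswith("+")
    digits = "".join(ch for ch in raw if ch in "0123456789")
    if has_plus:
        return "+" + digits
    if assume_north_america and len(digits) == 10:
        return "+1" + digits
    return digits


for example in ["(234) 321-9321", "234.321.9321", "234 321 9321"]:
    print(canonicalize(example))  # each prints 2343219321

print(canonicalize("(234) 321-9321", assume_north_america=True))  # +12343219321
```

The sketch does not handle national trunk prefixes or `00`-style international prefixes, which is where a library such as libphonenumber comes in.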

rursprung · 2024-09-09T14:27:27Z

> This is a lightweight tokenizer. Adding all the bloat of a new plugin to support one more analyzer seems silly. Adding it to analysis-common would be much easier.

i'd be perfectly happy to provide a PR adding this to analysis-common if this is the general consensus! i agree that it'd involve a lot less overhead and would make my life easier!

dblock · 2024-09-11T23:49:49Z

+1 from me

* add `Strings#isDigits` API

inspiration taken from [this SO answer][SO].

note that the stream is not parallelised to avoid the overhead of this as the method is intended to be called primarily with shorter strings where the time to set up would take longer than the actual check.

[SO]: https://stackoverflow.com/a/35150400

Signed-off-by: Ralph Ursprung <[email protected]>

* add `phone` & `phone-search` analyzer + tokenizer

this is largely based on [elasticsearch-phone] and internally uses [libphonenumber]. this intentionally only ports a subset of the features: only `phone` and `phone-search` are supported right now, `phone-email` can be added if/when there's a clear need for it.

using `libphonenumber` is required since parsing phone numbers is a non-trivial task (even though it might seem trivial at first glance!), as can be seen in the list [falsehoods programmers believe about phone numbers][falsehoods].

this allows defining the region to be used when analysing a phone number. so far only the generic "unknown" region (`ZZ`) had been used which worked as long as international numbers were prefixed with `+` but did not work when using local numbers (e.g. a number stored as `+4158...` was not matched against a number entered as `004158...` or `058...`).

example configuration for an index:

```json
{
  "index": {
    "analysis": {
      "analyzer": {
        "phone": {
          "type": "phone"
        },
        "phone-search": {
          "type": "phone-search"
        },
        "phone-ch": {
          "type": "phone",
          "phone-region": "CH"
        },
        "phone-search-ch": {
          "type": "phone-search",
          "phone-region": "CH"
        }
      }
    }
  }
}
```

this creates four analyzers: `phone` and `phone-search` which do not explicitly specify a region and thus fall back to `ZZ` (unknown region, regional version of international dialing prefix (e.g. `00` instead of `+` in most of europe) will not be recognised) and `phone-ch` and `phone-search-ch` which will try to parse the phone number as a swiss phone number (thus e.g. `00` as a prefix is recognised as the international dialing prefix).

note that the analyzer is (currently) not meant to find phone numbers in large text documents - instead it should be used on fields which contain just the phone number (though extra text will be ignored) and it collects the whole content of the field into a `String` in memory, making it unsuitable for large field values.

this has been implemented in a new plugin which is however part of the central opensearch repository as it was deemed too big an overhead to have it in a separate repository but not important enough to bundle it directly in `analysis-common` (see the discussion on the issue and the PR for further details).

note that the new plugin has been added to the exclude list of the javadoc check as this check is overzealous and also complains in many cases where it shouldn't (e.g. on overridden methods - which it should theoretically not do - or constructors which don't even exist). the check first needs to be improved before this exclusion could be removed.

closes #11326

[elasticsearch-phone]: https://github.com/purecloudlabs/elasticsearch-phone
[libphonenumber]: https://github.com/google/libphonenumber
[falsehoods]: https://github.com/google/libphonenumber/blob/master/FALSEHOODS.md

Signed-off-by: Ralph Ursprung <[email protected]>

---------

Signed-off-by: Ralph Ursprung <[email protected]>
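As a rough sketch of how the analyzers from this commit message might be wired to a field, the snippet below builds an index-creation body in Python and prints it as JSON. The analyzer types and the `phone-region` setting come from the commit message. The field name `phone_number` is hypothetical, and pairing `phone` as the index-time analyzer with `phone-search` as the search-time analyzer is an assumption based on their names, not something the commit message spells out.

```python
import json

# Index-creation body: analyzer definitions as in the commit message,
# plus a hypothetical text field that uses them.
index_body = {
    "settings": {
        "index": {
            "analysis": {
                "analyzer": {
                    "phone-ch": {"type": "phone", "phone-region": "CH"},
                    "phone-search-ch": {
                        "type": "phone-search",
                        "phone-region": "CH",
                    },
                }
            }
        }
    },
    "mappings": {
        "properties": {
            # "phone_number" is a made-up field name for illustration.
            "phone_number": {
                "type": "text",
                "analyzer": "phone-ch",                # assumed: index time
                "search_analyzer": "phone-search-ch",  # assumed: search time
            }
        }
    },
}

print(json.dumps(index_body, indent=2))
```

The printed JSON could be sent as the body of an index-creation request on a cluster where the plugin is installed.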

reta · 2024-10-04T12:21:46Z

Thanks a lot for contribution @rursprung ! Please provide the pull request for opensearch-project/documentation-website#8389 with documentation update upon your convenience (before 2.18.0 release), thank you.

dblock · 2024-10-09T17:05:12Z

Also https://github.com/opensearch-project/opensearch-api-specification needs an update, with a working example. Thanks!

this is part of opensearch-project/OpenSearch#11326. the actual implementation was done in opensearch-project/OpenSearch#15915. see the commit message on the PR for further details. resolves opensearch-project/documentation-website#8389 Co-authored-by: Fanit Kolchina <[email protected]> Signed-off-by: Fanit Kolchina <[email protected]> Signed-off-by: Ralph Ursprung <[email protected]>

this is part of opensearch-project/OpenSearch#11326. the actual implementation was done in opensearch-project/OpenSearch#15915. see the commit message on the PR for further details. the new test group `analysis` has been added so that it can later be extended with all other optional language analyzers (which are currently also not covered). note that the CI currently needs to fetch the image from `opensearchstaging` as 2.18.0 hasn't been released yet. the `hub` and `ref` config can be removed once 2.18.0 has been released. Signed-off-by: Ralph Ursprung <[email protected]>

rursprung added the enhancement and untriaged labels Nov 24, 2023

macohen added the Search:Relevance label Nov 29, 2023

msfroh removed the untriaged label Nov 29, 2023

rursprung mentioned this issue Jun 11, 2024

analysis-common: make UniqueTokenFilter public #14179

Closed

3 tasks

rursprung changed the title ~~provide new normalizer (or even field type?) for phone numbers~~ [RFC] new plugin with normalizer & analyzer for phone numbers Jun 20, 2024

dblock mentioned this issue Jul 9, 2024

[PROPOSAL] Document a clear(er) bar/process for moving repos into opensearch-project opensearch-project/.github#209

Open

rursprung mentioned this issue Sep 12, 2024

Implement phone number analyzer #15915

Merged

3 tasks

reta added the v3.0.0 and v2.18.0 labels Sep 27, 2024

reta assigned rursprung Sep 27, 2024

reta closed this as completed in #15915 Oct 3, 2024

reta closed this as completed in d1fd47c Oct 3, 2024

rursprung mentioned this issue Oct 4, 2024

implement phone number analyzer (cherry-pick to 2.x) #16187

Merged

3 tasks

rursprung mentioned this issue Oct 4, 2024

document the new analysis-phonenumber plugin opensearch-project/documentation-website#8469

Open

1 task

rursprung mentioned this issue Oct 11, 2024

add phone number analysis plugin opensearch-project/opensearch-api-specification#609

Draft


rursprung commented Nov 24, 2023 (edited)

msfroh commented Nov 29, 2023

macohen commented Nov 29, 2023 (edited)

timsmithgenesys commented Nov 29, 2023

rursprung commented Dec 11, 2023 (edited)

timsmithgenesys commented Dec 11, 2023

dblock commented Dec 13, 2023

macohen commented Dec 13, 2023

dblock commented Dec 13, 2023

macohen commented Dec 20, 2023

dblock commented Jan 23, 2024

rursprung commented Jun 11, 2024 (edited)

rursprung commented Jun 20, 2024

dblock commented Jun 20, 2024 (edited)

dblock commented Jun 25, 2024 (edited)

rursprung commented Jun 26, 2024

dblock commented Jun 27, 2024

msfroh commented Sep 6, 2024

smacrakis commented Sep 9, 2024

rursprung commented Sep 9, 2024

dblock commented Sep 11, 2024

reta commented Oct 4, 2024

dblock commented Oct 9, 2024
