Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] new plugin with normalizer & analyzer for phone numbers #11326

Closed
rursprung opened this issue Nov 24, 2023 · 22 comments · Fixed by #15915
Closed

[RFC] new plugin with normalizer & analyzer for phone numbers #11326

rursprung opened this issue Nov 24, 2023 · 22 comments · Fixed by #15915
Assignees
Labels
enhancement Enhancement or improvement to existing feature or request Search:Relevance v2.18.0 Issues and PRs related to version 2.18.0 v3.0.0 Issues and PRs related to version 3.0.0

Comments

@rursprung
Copy link
Contributor

rursprung commented Nov 24, 2023

UPDATE: RFC for new plugin
please use this issue as an RFC to have a new plugin under the opensearch-project org for the phone number normalizer & analyzer). i have implemented & open-sourced the plugin (needs minor polishing for the git history & port to 3.x - this can be done in a few minutes once we know where it'll live) and would very much like to see it hosted & owned by the project as i believe in the general usefulness of this.
see this comment for more details: #11326 (comment)

original post:

Is your feature request related to a problem? Please describe.
we have a use-case where we store (amongst other things) a phone number in a dedicated field of the document. this is ingested from another system where in turn it has been entered by users (while there's some validation there might still be some variation in how the number is written). a user can then trigger a search which (amongst other things) will try to match the phone number. since the text to be searched is entered by the user, the phone number might come in any format (with or without international calling prefix, calling prefix with + or 00 (or the national equivalent thereof), with or without separators (whitespaces, dashes, dots, you pick a character and chances are that a country is using it), with or without brackets for grouping numbers together, etc.).

as a corner case (doesn't really affect us, but relevant for a general solution): even e.g. just filtering for numbers doesn't work in case a number would be entered with alphabetical representation. the only one i actually know is 1-800-MICROSOFT in the USA, but i think you have lots of these over there?

Describe the solution you'd like
it'd be great if OpenSearch could ship with a normalizer (or even a dedicated field type which automatically uses this normalizer?) for phone numbers which would cover most (if not all) cases. it could start with the most common ones and then be improved over time by the community when need arises.

Describe alternatives you've considered
everyone can build their own normalizer for phone numbers. the problem is that none of them will cover all (or even most) phone numbers and this just creates additional effort if everyone needs to re-invent the wheel.

the following is a very basic implementation which however doesn't cover most of the cases listed above (hence why it's hard to build a good one on your own):

{
  "analysis": {
    "char_filter": {
      "whitespace_remove": {
        "type": "pattern_replace",
        "pattern": "\\s",
        "replacement": ""
      },
      "transform_plus_to_00": {
        "type": "pattern_replace",
        "pattern": "\\+",
        "replacement": "00"
      }
    },
    "normalizer": {
      "phone_number_normalizer": {
        "type": "custom",
        "char_filter": [
          "whitespace_remove",
          "transform_plus_to_00"
        ],
        "filter": [
          "lowercase",
          "uppercase"
        ]
      }
    }
  }
}

Additional context
the wikipedia article on national conventions for writing telephone numbers seems to cover most (if not all?) ways of writing phone numbers

@rursprung rursprung added enhancement Enhancement or improvement to existing feature or request untriaged labels Nov 24, 2023
@msfroh
Copy link
Collaborator

msfroh commented Nov 29, 2023

I can see this being extremely challenging, given the wide variety of conventions. The most "normal" form would probably involve stripping any international calling prefix and removing any punctuation/whitespace.

I suppose it would be debatable as to whether a country-specific long-distance prefix should be removed. E.g. 1-800-MICROSOFT is functionally equivalent to 800MICROSOFT, because the 1 historically indicated a call outside your local area code.

@msfroh msfroh removed the untriaged label Nov 29, 2023
@macohen
Copy link
Contributor

macohen commented Nov 29, 2023

I found this plugin that could be leveraged as a starting point. https:/purecloudlabs/elasticsearch-phone. It looks like its been abandoned. cc: @drewinin, @ININDevEvangelists - any thoughts, guidance, or warnings on this?

@timsmithgenesys
Copy link

@macohen I administer the purecloudlabs org and am unfamiliar with the project itself, but I can tell you that it has no maintainers and so can be considered abandoned; it's on my list of repos that are to be archived. It's licensed under a MIT license, so you are free to do with it as you please.

@rursprung
Copy link
Contributor Author

rursprung commented Dec 11, 2023

thanks for finding this library, @macohen!

@timsmithgenesys:

It's licensed under a MIT license, so you are free to do with it as you please.

thanks for this information! you have a LICENSE which contains the MIT license and a LICENSE.md which contains an Apache 2.0 license - is it correctly dual-licensed (would be good as all OpenSearch projects are under the Apache 2.0 license, making life easier) or is this a leftover and you consider it only licensed under MIT?

i've now seen that this library is backed by google's libphonenumber (which i didn't know before), that looks like it would help in abstracting away most of the discussions above (by leaving it to this library), though i haven't looked into it in any detail (yet). its Falsehoods Programmers Believe About Phone Numbers list is very informative and shows that we probably shouldn't attempt at implementing this manually (e.g. with textual token filtering).

@timsmithgenesys
Copy link

you have a LICENSE which contains the MIT license and a LICENSE.md which contains an Apache 2.0 license

Oh gross; I didn't notice that. I think the correct answer is this is meant to be licensed under Apache per the misspelled LICENCE.md as that was added first and by a human. The LICENSE file that contains the MIT license was added by automation, not intentionally.

I've deleted the MIT license and archived the repo to rectify this situation. I'll have to retract my previous statement about licensure. You're still free to do whatever you want with it, just per the remaining Apache license.

@dblock
Copy link
Member

dblock commented Dec 13, 2023

@macohen I think we could fork that abandoned library into opensearch-phone-number-analyzer or bring it into core here

@macohen
Copy link
Contributor

macohen commented Dec 13, 2023

Sounds good to me. Is it "core" enough to include here? I'm open to what seems right to maintainers.

@dblock
Copy link
Member

dblock commented Dec 13, 2023

@macohen core is pretty bloated, so I would include it only if you think it should be enabled by default, WDYT?

@macohen
Copy link
Contributor

macohen commented Dec 20, 2023

I do think it should be enabled by default - as long as it doesn't use any resources unless invoked.

However, I don't think we're at a stage yet where we should just move it in. @rursprung are you up for forking this to your own account, making it work with OpenSearch as a plugin and then we can take a public vote/decision on if we move this into the opensearch-project itself? I think I'd want to make sure this library can at least meet your use cases.

@dblock do you think the change from a plugin/extension to a core module would be significant enough that we would want to make a decision before anyone starts working on it?

cc: @dagneyb for an FYI

@dblock
Copy link
Member

dblock commented Jan 23, 2024

@dblock do you think the change from a plugin/extension to a core module would be significant enough that we would want to make a decision before anyone starts working on it?

I don't feel strongly about it at all.

rursprung added a commit to rursprung/opensearch-phone-number-analyzer that referenced this issue Jun 11, 2024
this is largely based on [elasticsearch-phone] and internally uses
[libphonenumber].
this intentionally only ports a subset of the features: only `phone` and
`phone-search` are supported right now, `phone-email` can be added
if/when there's a clear need for it.

closes opensearch-project/OpenSearch#11326

[elasticsearch-phone]: https:/purecloudlabs/elasticsearch-phone
[libphonenumber]: https:/google/libphonenumber
@rursprung
Copy link
Contributor Author

rursprung commented Jun 11, 2024

i have now created an implementation for this (currently on 2.x because i'm testing there, but easily ported to 3.x): opensearch-phone-number-analyzer

this is not 100% ready for prime-time yet (namely i'm waiting for 1-2 PRs to end up in the plugin template and plan to then re-create the repo based on that and with everything in main first of course) but good enough to start the discussions.

this is largely based on the functionality from elasticsearch-phone from @purecloudlabs (@timsmithgenesys: thanks for clarifying the license here and hope that the attribution in the commit-msg & README is ok?).

@opensearch-project (CC @macohen):

However, I don't think we're at a stage yet where we should just move it in. @rursprung are you up for forking this to your own account, making it work with OpenSearch as a plugin and then we can take a public vote/decision on if we move this into the opensearch-project itself? I think I'd want to make sure this library can at least meet your use cases.

i'd like to nominate this plugin to be moved to the opensearch-project org. is this issue the right place to discuss this? in the best case you'd then create a fresh repo with the same name here based on the plugin template (with main & 2.x please 🙂) and i can fork it and raise a PR with the actual content (same commits as you can currently see in my repo).
alternatively - if you prefer that - i can of course also provide it as a PR to the plugins folder in this repo, but if i understood right that's not really desired anymore for new plugins?

@rursprung rursprung changed the title provide new normalizer (or even field type?) for phone numbers [RFC] new plugin with normalizer & analyzer for phone numbers Jun 20, 2024
@rursprung
Copy link
Contributor Author

i've now changed the title of the issue and added a comment at the beginning to clearly state that this is now an RFC to get the plugin moved over to this org. see my previous comment for more details.

@dblock
Copy link
Member

dblock commented Jun 20, 2024

@AmiStrn was trying to contribute a process for deciding how/when/why we bring plugins into the org

@rursprung Do you have any users of the fork? Get CI to be green, make a release? We can consider including a plugin in the main distribution if it’s widely used.

@dblock
Copy link
Member

dblock commented Jun 25, 2024

Note that we generally describe the process of moving repos into the org in https:/opensearch-project/.github/blob/main/ADMINS.md#new-repos and have a proposal in opensearch-project/opensearch-plugins#216 to document moving plugins.

@rursprung
Copy link
Contributor Author

@dblock: thanks for your answers!

Do you have any users of the fork?

not yet. i published it right after building it to (a) get feedback on it and (b) see if it can be upstreamed directly.
i have however started integrating it in one of our applications (the reason why i built the plugin in the first place) and we will roll it out with that to our customers. the main issue here is that for this it will have to be available in some public cloud setups (i.e. managed services) - which would be much easier if it were part of this project

Get CI to be green

that was a minor oversight (the lucene test expects the tokens to be in the specified order and seems that i mixed up something there; already fixed locally and on the branch using Set instead of List; waiting on #14179 to see what the correct approach is).

i'll update the repo once #14179 has been answered (probably re-create the repo from the template as a lot has happened there since i did so and then re-apply the commits with the necessary cleanups/code changes; as there are no releases, no forks, etc. there shouldn't be an issue with throwing away the existing git history in this case), this will then also porting it from 2.x to main (i haven't tried it yet and hope that the lucene major release upgrade won't cause any issues here).

make a release?

if ever possible i'd prefer to avoid going through that if we can move it to the opensearch project as then we'd have to do the whole setup twice. also, some things would then probably have to be renamed (i presume it'd be bad style if i'd publish it with org.opensearch.plugin for the java packages and maven coordinates), adding more noise.

We can consider including a plugin in the main distribution if it’s widely used.

i think these are two steps:

  • host the project under opensearch-project on GH & include it in the release process
  • include the plugin as part of the "full" distribution (i.e. make it a standard plugin)

while the latter would be nice (full disclosure: we're not using it, we use the min distribution and add only the plugins we need for our own setup) the former would be my immediate goal with this issue.

@dblock
Copy link
Member

dblock commented Jun 27, 2024

  • host the project under opensearch-project on GH & include it in the release process
  • include the plugin as part of the "full" distribution (i.e. make it a standard plugin)

Both of these are significant amounts of work for people who work on this project. The second one is also a promise to continue working on the plugin for security fixes and other version upgrades forever. I think it's a reasonable ask that any plugin that is included in the org and is released with the distribution gets either a team publicly committing to do it (your company?) or at least some traction with users, first, no?

@msfroh
Copy link
Collaborator

msfroh commented Sep 6, 2024

This is a lightweight tokenizer. Adding all the bloat of a new plugin to support one more analyzer seems silly. Adding it to analysis-common would be much easier.

@smacrakis
Copy link

The standard way of denoting the international call prefix is "+", not "00" (https://en.wikipedia.org/wiki/List_of_international_call_prefixes)

If the field is known to be a phone number (not mixed with other text), then all punctuation and whitespace can be removed to canonicalize it, e.g., (234) 321-9321, 234.321.9321, 234 321 9321 should all canonicalize to 2343219321. It would be nice if we could add + to indicate the international call prefix, but unfortunately, numbers are often given without the country code. If the application is hosted in North America, and the number has 10 digits, it's pretty safe to guess that the prefix "+1" should be applied. But alas you can't count on that outside of North America. An industrial-strength implementation would look at other fields (notably address) to make a better guess. Probably ML to do this robustly.

@rursprung
Copy link
Contributor Author

This is a lightweight tokenizer. Adding all the bloat of a new plugin to support one more analyzer seems silly. Adding it to analysis-common would be much easier.

i'd be perfectly happy to provide a PR adding this to analysis-common if this is the general consensus! i agree that it'd involve a lot less overhead and would make my life easier!

@dblock
Copy link
Member

dblock commented Sep 11, 2024

+1 from me

rursprung added a commit to rursprung/OpenSearch that referenced this issue Sep 12, 2024
this is largely based on [elasticsearch-phone] and internally uses
[libphonenumber].
this intentionally only ports a subset of the features: only `phone` and
`phone-search` are supported right now, `phone-email` can be added
if/when there's a clear need for it.

this allows defining the region to be used when analysing a phone
number. so far only the generic "unkown" region (`ZZ`) had been used
which worked as long as international numbers were prefixed with `+` but
did not work when using local numbers (e.g. a number stored as
`+4158...` was not matched against a number entered as `004158...` or
`058...`).

example configuration for an index:
```json
{
  "index": {
    "analysis": {
      "analyzer": {
        "phone": {
          "type": "phone"
        },
        "phone-search": {
          "type": "phone-search"
        },
        "phone-ch": {
          "type": "phone",
          "phone-region": "CH"
        },
        "phone-search-ch": {
          "type": "phone-search",
          "phone-region": "CH"
        }
      }
    }
  }
}
```
this creates four analyzers: `phone` and `phone-search` which do not
explicitly specify a region and thus fall back to `ZZ` (unknown region,
regional version of international dialing prefix (e.g. `00` instead of
`+` in most of europe) will not be recognised) and `phone-ch` and
`phone-search-ch` which will try to parse the phone number as a swiss
phone number (thus e.g. `00` as a prefix is recognised as the
international dialing prefix).

closes opensearch-project#11326

[elasticsearch-phone]: https:/purecloudlabs/elasticsearch-phone
[libphonenumber]: https:/google/libphonenumber

Signed-off-by: Ralph Ursprung <[email protected]>
rursprung added a commit to rursprung/OpenSearch that referenced this issue Sep 13, 2024
this is largely based on [elasticsearch-phone] and internally uses
[libphonenumber].
this intentionally only ports a subset of the features: only `phone` and
`phone-search` are supported right now, `phone-email` can be added
if/when there's a clear need for it.

using `libphonenumber` is required since parsing phone numbers is a
non-trivial task (even though it might seem trivial at first glance!),
as can be seen in the list [falsehoods programmers believe about phone
numbers][falsehoods].

this allows defining the region to be used when analysing a phone
number. so far only the generic "unkown" region (`ZZ`) had been used
which worked as long as international numbers were prefixed with `+` but
did not work when using local numbers (e.g. a number stored as
`+4158...` was not matched against a number entered as `004158...` or
`058...`).

example configuration for an index:
```json
{
  "index": {
    "analysis": {
      "analyzer": {
        "phone": {
          "type": "phone"
        },
        "phone-search": {
          "type": "phone-search"
        },
        "phone-ch": {
          "type": "phone",
          "phone-region": "CH"
        },
        "phone-search-ch": {
          "type": "phone-search",
          "phone-region": "CH"
        }
      }
    }
  }
}
```
this creates four analyzers: `phone` and `phone-search` which do not
explicitly specify a region and thus fall back to `ZZ` (unknown region,
regional version of international dialing prefix (e.g. `00` instead of
`+` in most of europe) will not be recognised) and `phone-ch` and
`phone-search-ch` which will try to parse the phone number as a swiss
phone number (thus e.g. `00` as a prefix is recognised as the
international dialing prefix).

note that the analyzer is (currently) not meant to find phone numbers in
large text documents - instead it should be used on fields which contain
just the phone number (though extra text will be ignored) and it
collects the whole content of the field into a `String` in memory,
making it unsuitable for large field values.

closes opensearch-project#11326

[elasticsearch-phone]: https:/purecloudlabs/elasticsearch-phone
[libphonenumber]: https:/google/libphonenumber
[falsehoods]: https:/google/libphonenumber/blob/master/FALSEHOODS.md

Signed-off-by: Ralph Ursprung <[email protected]>
rursprung added a commit to rursprung/OpenSearch that referenced this issue Sep 13, 2024
this is largely based on [elasticsearch-phone] and internally uses
[libphonenumber].
this intentionally only ports a subset of the features: only `phone` and
`phone-search` are supported right now, `phone-email` can be added
if/when there's a clear need for it.

using `libphonenumber` is required since parsing phone numbers is a
non-trivial task (even though it might seem trivial at first glance!),
as can be seen in the list [falsehoods programmers believe about phone
numbers][falsehoods].

this allows defining the region to be used when analysing a phone
number. so far only the generic "unkown" region (`ZZ`) had been used
which worked as long as international numbers were prefixed with `+` but
did not work when using local numbers (e.g. a number stored as
`+4158...` was not matched against a number entered as `004158...` or
`058...`).

example configuration for an index:
```json
{
  "index": {
    "analysis": {
      "analyzer": {
        "phone": {
          "type": "phone"
        },
        "phone-search": {
          "type": "phone-search"
        },
        "phone-ch": {
          "type": "phone",
          "phone-region": "CH"
        },
        "phone-search-ch": {
          "type": "phone-search",
          "phone-region": "CH"
        }
      }
    }
  }
}
```
this creates four analyzers: `phone` and `phone-search` which do not
explicitly specify a region and thus fall back to `ZZ` (unknown region,
regional version of international dialing prefix (e.g. `00` instead of
`+` in most of europe) will not be recognised) and `phone-ch` and
`phone-search-ch` which will try to parse the phone number as a swiss
phone number (thus e.g. `00` as a prefix is recognised as the
international dialing prefix).

note that the analyzer is (currently) not meant to find phone numbers in
large text documents - instead it should be used on fields which contain
just the phone number (though extra text will be ignored) and it
collects the whole content of the field into a `String` in memory,
making it unsuitable for large field values.

closes opensearch-project#11326

[elasticsearch-phone]: https:/purecloudlabs/elasticsearch-phone
[libphonenumber]: https:/google/libphonenumber
[falsehoods]: https:/google/libphonenumber/blob/master/FALSEHOODS.md

Signed-off-by: Ralph Ursprung <[email protected]>
rursprung added a commit to rursprung/OpenSearch that referenced this issue Sep 26, 2024
this is largely based on [elasticsearch-phone] and internally uses
[libphonenumber].
this intentionally only ports a subset of the features: only `phone` and
`phone-search` are supported right now, `phone-email` can be added
if/when there's a clear need for it.

using `libphonenumber` is required since parsing phone numbers is a
non-trivial task (even though it might seem trivial at first glance!),
as can be seen in the list [falsehoods programmers believe about phone
numbers][falsehoods].

this allows defining the region to be used when analysing a phone
number. so far only the generic "unkown" region (`ZZ`) had been used
which worked as long as international numbers were prefixed with `+` but
did not work when using local numbers (e.g. a number stored as
`+4158...` was not matched against a number entered as `004158...` or
`058...`).

example configuration for an index:
```json
{
  "index": {
    "analysis": {
      "analyzer": {
        "phone": {
          "type": "phone"
        },
        "phone-search": {
          "type": "phone-search"
        },
        "phone-ch": {
          "type": "phone",
          "phone-region": "CH"
        },
        "phone-search-ch": {
          "type": "phone-search",
          "phone-region": "CH"
        }
      }
    }
  }
}
```
this creates four analyzers: `phone` and `phone-search` which do not
explicitly specify a region and thus fall back to `ZZ` (unknown region,
regional version of international dialing prefix (e.g. `00` instead of
`+` in most of europe) will not be recognised) and `phone-ch` and
`phone-search-ch` which will try to parse the phone number as a swiss
phone number (thus e.g. `00` as a prefix is recognised as the
international dialing prefix).

note that the analyzer is (currently) not meant to find phone numbers in
large text documents - instead it should be used on fields which contain
just the phone number (though extra text will be ignored) and it
collects the whole content of the field into a `String` in memory,
making it unsuitable for large field values.

closes opensearch-project#11326

[elasticsearch-phone]: https:/purecloudlabs/elasticsearch-phone
[libphonenumber]: https:/google/libphonenumber
[falsehoods]: https:/google/libphonenumber/blob/master/FALSEHOODS.md

Signed-off-by: Ralph Ursprung <[email protected]>
rursprung added a commit to rursprung/OpenSearch that referenced this issue Sep 27, 2024
this is largely based on [elasticsearch-phone] and internally uses
[libphonenumber].
this intentionally only ports a subset of the features: only `phone` and
`phone-search` are supported right now, `phone-email` can be added
if/when there's a clear need for it.

using `libphonenumber` is required since parsing phone numbers is a
non-trivial task (even though it might seem trivial at first glance!),
as can be seen in the list [falsehoods programmers believe about phone
numbers][falsehoods].

this allows defining the region to be used when analysing a phone
number. so far only the generic "unkown" region (`ZZ`) had been used
which worked as long as international numbers were prefixed with `+` but
did not work when using local numbers (e.g. a number stored as
`+4158...` was not matched against a number entered as `004158...` or
`058...`).

example configuration for an index:
```json
{
  "index": {
    "analysis": {
      "analyzer": {
        "phone": {
          "type": "phone"
        },
        "phone-search": {
          "type": "phone-search"
        },
        "phone-ch": {
          "type": "phone",
          "phone-region": "CH"
        },
        "phone-search-ch": {
          "type": "phone-search",
          "phone-region": "CH"
        }
      }
    }
  }
}
```
this creates four analyzers: `phone` and `phone-search` which do not
explicitly specify a region and thus fall back to `ZZ` (unknown region,
regional version of international dialing prefix (e.g. `00` instead of
`+` in most of europe) will not be recognised) and `phone-ch` and
`phone-search-ch` which will try to parse the phone number as a swiss
phone number (thus e.g. `00` as a prefix is recognised as the
international dialing prefix).

note that the analyzer is (currently) not meant to find phone numbers in
large text documents - instead it should be used on fields which contain
just the phone number (though extra text will be ignored) and it
collects the whole content of the field into a `String` in memory,
making it unsuitable for large field values.

this has been implemented in a new plugin which is however part of the
central opensearch repository as it was deemed too big an overhead to
have it in a separate repository but not important enough to bundle it
directly in `analysis-common` (see the discussion on the issue and the
PR for further details).

closes opensearch-project#11326

[elasticsearch-phone]: https:/purecloudlabs/elasticsearch-phone
[libphonenumber]: https:/google/libphonenumber
[falsehoods]: https:/google/libphonenumber/blob/master/FALSEHOODS.md

Signed-off-by: Ralph Ursprung <[email protected]>
rursprung added a commit to rursprung/OpenSearch that referenced this issue Sep 27, 2024
this is largely based on [elasticsearch-phone] and internally uses
[libphonenumber].
this intentionally only ports a subset of the features: only `phone` and
`phone-search` are supported right now, `phone-email` can be added
if/when there's a clear need for it.

using `libphonenumber` is required since parsing phone numbers is a
non-trivial task (even though it might seem trivial at first glance!),
as can be seen in the list [falsehoods programmers believe about phone
numbers][falsehoods].

this allows defining the region to be used when analysing a phone
number. so far only the generic "unkown" region (`ZZ`) had been used
which worked as long as international numbers were prefixed with `+` but
did not work when using local numbers (e.g. a number stored as
`+4158...` was not matched against a number entered as `004158...` or
`058...`).

example configuration for an index:
```json
{
  "index": {
    "analysis": {
      "analyzer": {
        "phone": {
          "type": "phone"
        },
        "phone-search": {
          "type": "phone-search"
        },
        "phone-ch": {
          "type": "phone",
          "phone-region": "CH"
        },
        "phone-search-ch": {
          "type": "phone-search",
          "phone-region": "CH"
        }
      }
    }
  }
}
```
this creates four analyzers: `phone` and `phone-search` which do not
explicitly specify a region and thus fall back to `ZZ` (unknown region,
regional version of international dialing prefix (e.g. `00` instead of
`+` in most of europe) will not be recognised) and `phone-ch` and
`phone-search-ch` which will try to parse the phone number as a swiss
phone number (thus e.g. `00` as a prefix is recognised as the
international dialing prefix).

note that the analyzer is (currently) not meant to find phone numbers in
large text documents - instead it should be used on fields which contain
just the phone number (though extra text will be ignored) and it
collects the whole content of the field into a `String` in memory,
making it unsuitable for large field values.

this has been implemented in a new plugin which is however part of the
central opensearch repository as it was deemed too big an overhead to
have it in a separate repository but not important enough to bundle it
directly in `analysis-common` (see the discussion on the issue and the
PR for further details).

note that the new plugin has been added to the exclude list of the
javadoc check as this check is overzealous and also complains in many
cases where it shouldn't (e.g. on overridden methods - which it should
theoretically not do - or constructors which don't even exist). the
check first needs to be improved before this exclusion could be removed.

closes opensearch-project#11326

[elasticsearch-phone]: https:/purecloudlabs/elasticsearch-phone
[libphonenumber]: https:/google/libphonenumber
[falsehoods]: https:/google/libphonenumber/blob/master/FALSEHOODS.md

Signed-off-by: Ralph Ursprung <[email protected]>
@reta reta added v3.0.0 Issues and PRs related to version 3.0.0 v2.18.0 Issues and PRs related to version 2.18.0 labels Sep 27, 2024
rursprung added a commit to rursprung/OpenSearch that referenced this issue Sep 27, 2024
this is largely based on [elasticsearch-phone] and internally uses
[libphonenumber].
this intentionally only ports a subset of the features: only `phone` and
`phone-search` are supported right now, `phone-email` can be added
if/when there's a clear need for it.

using `libphonenumber` is required since parsing phone numbers is a
non-trivial task (even though it might seem trivial at first glance!),
as can be seen in the list [falsehoods programmers believe about phone
numbers][falsehoods].

this allows defining the region to be used when analysing a phone
number. so far only the generic "unkown" region (`ZZ`) had been used
which worked as long as international numbers were prefixed with `+` but
did not work when using local numbers (e.g. a number stored as
`+4158...` was not matched against a number entered as `004158...` or
`058...`).

example configuration for an index:
```json
{
  "index": {
    "analysis": {
      "analyzer": {
        "phone": {
          "type": "phone"
        },
        "phone-search": {
          "type": "phone-search"
        },
        "phone-ch": {
          "type": "phone",
          "phone-region": "CH"
        },
        "phone-search-ch": {
          "type": "phone-search",
          "phone-region": "CH"
        }
      }
    }
  }
}
```
this creates four analyzers: `phone` and `phone-search` which do not
explicitly specify a region and thus fall back to `ZZ` (unknown region,
regional version of international dialing prefix (e.g. `00` instead of
`+` in most of europe) will not be recognised) and `phone-ch` and
`phone-search-ch` which will try to parse the phone number as a swiss
phone number (thus e.g. `00` as a prefix is recognised as the
international dialing prefix).

note that the analyzer is (currently) not meant to find phone numbers in
large text documents - instead it should be used on fields which contain
just the phone number (though extra text will be ignored) and it
collects the whole content of the field into a `String` in memory,
making it unsuitable for large field values.

this has been implemented in a new plugin which is however part of the
central opensearch repository as it was deemed too big an overhead to
have it in a separate repository but not important enough to bundle it
directly in `analysis-common` (see the discussion on the issue and the
PR for further details).

note that the new plugin has been added to the exclude list of the
javadoc check as this check is overzealous and also complains in many
cases where it shouldn't (e.g. on overridden methods - which it should
theoretically not do - or constructors which don't even exist). the
check first needs to be improved before this exclusion could be removed.

closes opensearch-project#11326

[elasticsearch-phone]: https:/purecloudlabs/elasticsearch-phone
[libphonenumber]: https:/google/libphonenumber
[falsehoods]: https:/google/libphonenumber/blob/master/FALSEHOODS.md

Signed-off-by: Ralph Ursprung <[email protected]>
rursprung added a commit to rursprung/OpenSearch that referenced this issue Oct 3, 2024
this is largely based on [elasticsearch-phone] and internally uses
[libphonenumber].
this intentionally only ports a subset of the features: only `phone` and
`phone-search` are supported right now, `phone-email` can be added
if/when there's a clear need for it.

using `libphonenumber` is required since parsing phone numbers is a
non-trivial task (even though it might seem trivial at first glance!),
as can be seen in the list [falsehoods programmers believe about phone
numbers][falsehoods].

this allows defining the region to be used when analysing a phone
number. so far only the generic "unkown" region (`ZZ`) had been used
which worked as long as international numbers were prefixed with `+` but
did not work when using local numbers (e.g. a number stored as
`+4158...` was not matched against a number entered as `004158...` or
`058...`).

example configuration for an index:
```json
{
  "index": {
    "analysis": {
      "analyzer": {
        "phone": {
          "type": "phone"
        },
        "phone-search": {
          "type": "phone-search"
        },
        "phone-ch": {
          "type": "phone",
          "phone-region": "CH"
        },
        "phone-search-ch": {
          "type": "phone-search",
          "phone-region": "CH"
        }
      }
    }
  }
}
```
this creates four analyzers: `phone` and `phone-search` which do not
explicitly specify a region and thus fall back to `ZZ` (unknown region,
regional version of international dialing prefix (e.g. `00` instead of
`+` in most of europe) will not be recognised) and `phone-ch` and
`phone-search-ch` which will try to parse the phone number as a swiss
phone number (thus e.g. `00` as a prefix is recognised as the
international dialing prefix).

note that the analyzer is (currently) not meant to find phone numbers in
large text documents - instead it should be used on fields which contain
just the phone number (though extra text will be ignored) and it
collects the whole content of the field into a `String` in memory,
making it unsuitable for large field values.

this has been implemented in a new plugin which is however part of the
central opensearch repository as it was deemed too big an overhead to
have it in a separate repository but not important enough to bundle it
directly in `analysis-common` (see the discussion on the issue and the
PR for further details).

note that the new plugin has been added to the exclude list of the
javadoc check as this check is overzealous and also complains in many
cases where it shouldn't (e.g. on overridden methods - which it should
theoretically not do - or constructors which don't even exist). the
check first needs to be improved before this exclusion could be removed.

closes opensearch-project#11326

[elasticsearch-phone]: https:/purecloudlabs/elasticsearch-phone
[libphonenumber]: https:/google/libphonenumber
[falsehoods]: https:/google/libphonenumber/blob/master/FALSEHOODS.md

Signed-off-by: Ralph Ursprung <[email protected]>
rursprung added a commit to rursprung/OpenSearch that referenced this issue Oct 3, 2024
this is largely based on [elasticsearch-phone] and internally uses
[libphonenumber].
this intentionally only ports a subset of the features: only `phone` and
`phone-search` are supported right now, `phone-email` can be added
if/when there's a clear need for it.

using `libphonenumber` is required since parsing phone numbers is a
non-trivial task (even though it might seem trivial at first glance!),
as can be seen in the list [falsehoods programmers believe about phone
numbers][falsehoods].

this allows defining the region to be used when analysing a phone
number. so far only the generic "unkown" region (`ZZ`) had been used
which worked as long as international numbers were prefixed with `+` but
did not work when using local numbers (e.g. a number stored as
`+4158...` was not matched against a number entered as `004158...` or
`058...`).

example configuration for an index:
```json
{
  "index": {
    "analysis": {
      "analyzer": {
        "phone": {
          "type": "phone"
        },
        "phone-search": {
          "type": "phone-search"
        },
        "phone-ch": {
          "type": "phone",
          "phone-region": "CH"
        },
        "phone-search-ch": {
          "type": "phone-search",
          "phone-region": "CH"
        }
      }
    }
  }
}
```
this creates four analyzers: `phone` and `phone-search` which do not
explicitly specify a region and thus fall back to `ZZ` (unknown region,
regional version of international dialing prefix (e.g. `00` instead of
`+` in most of europe) will not be recognised) and `phone-ch` and
`phone-search-ch` which will try to parse the phone number as a swiss
phone number (thus e.g. `00` as a prefix is recognised as the
international dialing prefix).

note that the analyzer is (currently) not meant to find phone numbers in
large text documents - instead it should be used on fields which contain
just the phone number (though extra text will be ignored) and it
collects the whole content of the field into a `String` in memory,
making it unsuitable for large field values.

this has been implemented in a new plugin which is however part of the
central opensearch repository as it was deemed too big an overhead to
have it in a separate repository but not important enough to bundle it
directly in `analysis-common` (see the discussion on the issue and the
PR for further details).

note that the new plugin has been added to the exclude list of the
javadoc check as this check is overzealous and also complains in many
cases where it shouldn't (e.g. on overridden methods - which it should
theoretically not do - or constructors which don't even exist). the
check first needs to be improved before this exclusion could be removed.

closes opensearch-project#11326

[elasticsearch-phone]: https:/purecloudlabs/elasticsearch-phone
[libphonenumber]: https:/google/libphonenumber
[falsehoods]: https:/google/libphonenumber/blob/master/FALSEHOODS.md

Signed-off-by: Ralph Ursprung <[email protected]>
@reta reta closed this as completed in d1fd47c Oct 3, 2024
reta pushed a commit that referenced this issue Oct 4, 2024
* add `Strings#isDigits` API

inspiration taken from [this SO answer][SO].

note that the stream is not parallelised to avoid the overhead of this
as the method is intended to be called primarily with shorter strings
where the time to set up would take longer than the actual check.

[SO]: https://stackoverflow.com/a/35150400

Signed-off-by: Ralph Ursprung <[email protected]>

* add `phone` & `phone-search` analyzer + tokenizer

this is largely based on [elasticsearch-phone] and internally uses
[libphonenumber].
this intentionally only ports a subset of the features: only `phone` and
`phone-search` are supported right now, `phone-email` can be added
if/when there's a clear need for it.

using `libphonenumber` is required since parsing phone numbers is a
non-trivial task (even though it might seem trivial at first glance!),
as can be seen in the list [falsehoods programmers believe about phone
numbers][falsehoods].

this allows defining the region to be used when analysing a phone
number. so far only the generic "unkown" region (`ZZ`) had been used
which worked as long as international numbers were prefixed with `+` but
did not work when using local numbers (e.g. a number stored as
`+4158...` was not matched against a number entered as `004158...` or
`058...`).

example configuration for an index:
```json
{
  "index": {
    "analysis": {
      "analyzer": {
        "phone": {
          "type": "phone"
        },
        "phone-search": {
          "type": "phone-search"
        },
        "phone-ch": {
          "type": "phone",
          "phone-region": "CH"
        },
        "phone-search-ch": {
          "type": "phone-search",
          "phone-region": "CH"
        }
      }
    }
  }
}
```
this creates four analyzers: `phone` and `phone-search` which do not
explicitly specify a region and thus fall back to `ZZ` (unknown region,
regional version of international dialing prefix (e.g. `00` instead of
`+` in most of europe) will not be recognised) and `phone-ch` and
`phone-search-ch` which will try to parse the phone number as a swiss
phone number (thus e.g. `00` as a prefix is recognised as the
international dialing prefix).

note that the analyzer is (currently) not meant to find phone numbers in
large text documents - instead it should be used on fields which contain
just the phone number (though extra text will be ignored) and it
collects the whole content of the field into a `String` in memory,
making it unsuitable for large field values.

this has been implemented in a new plugin which is however part of the
central opensearch repository as it was deemed too big an overhead to
have it in a separate repository but not important enough to bundle it
directly in `analysis-common` (see the discussion on the issue and the
PR for further details).

note that the new plugin has been added to the exclude list of the
javadoc check as this check is overzealous and also complains in many
cases where it shouldn't (e.g. on overridden methods - which it should
theoretically not do - or constructors which don't even exist). the
check first needs to be improved before this exclusion could be removed.

closes #11326

[elasticsearch-phone]: https:/purecloudlabs/elasticsearch-phone
[libphonenumber]: https:/google/libphonenumber
[falsehoods]: https:/google/libphonenumber/blob/master/FALSEHOODS.md

Signed-off-by: Ralph Ursprung <[email protected]>

---------

Signed-off-by: Ralph Ursprung <[email protected]>
@reta
Copy link
Collaborator

reta commented Oct 4, 2024

Thanks a lot for contribution @rursprung ! Please provide the pull request for opensearch-project/documentation-website#8389 with documentation update upon your convenience (before 2.18.0 release), thank you.

rursprung added a commit to rursprung/documentation-website that referenced this issue Oct 4, 2024
this is part of opensearch-project/OpenSearch#11326. the actual
implementation was done opensearch-project/OpenSearch#15915. see the
commit message on the PR for further details.

resolves opensearch-project#8389
rursprung added a commit to rursprung/documentation-website that referenced this issue Oct 4, 2024
this is part of opensearch-project/OpenSearch#11326. the actual
implementation was done opensearch-project/OpenSearch#15915. see the
commit message on the PR for further details.

resolves opensearch-project#8389

Signed-off-by: Ralph Ursprung <[email protected]>
@dblock
Copy link
Member

dblock commented Oct 9, 2024

Also https:/opensearch-project/opensearch-api-specification needs an update, with a working example. Thanks!

rursprung added a commit to rursprung/documentation-website that referenced this issue Oct 11, 2024
this is part of opensearch-project/OpenSearch#11326. the actual
implementation was done opensearch-project/OpenSearch#15915. see the
commit message on the PR for further details.

resolves opensearch-project#8389

Co-authored-by: Fanit Kolchina <[email protected]>
Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: Ralph Ursprung <[email protected]>
rursprung added a commit to rursprung/opensearch-api-specification that referenced this issue Oct 11, 2024
this is part of opensearch-project/OpenSearch#11326. the actual
implementation was done opensearch-project/OpenSearch#15915. see the
commit message on the PR for further details.

the new tes group `analysis` has been added so that it can later be
extended with all other optional language analyzers (which are currently
also not covered).

Signed-off-by: Ralph Ursprung <[email protected]>
rursprung added a commit to rursprung/opensearch-api-specification that referenced this issue Oct 11, 2024
this is part of opensearch-project/OpenSearch#11326. the actual
implementation was done opensearch-project/OpenSearch#15915. see the
commit message on the PR for further details.

the new tes group `analysis` has been added so that it can later be
extended with all other optional language analyzers (which are currently
also not covered).

Signed-off-by: Ralph Ursprung <[email protected]>
rursprung added a commit to rursprung/opensearch-api-specification that referenced this issue Oct 11, 2024
this is part of opensearch-project/OpenSearch#11326. the actual
implementation was done opensearch-project/OpenSearch#15915. see the
commit message on the PR for further details.

the new tes group `analysis` has been added so that it can later be
extended with all other optional language analyzers (which are currently
also not covered).

note that the CI currently needs to fetch the image from
`opensearchstaging` as 2.18.0 hasn't been released yet. the `hub` and
`ref` config can be removed once 2.18.0 has been released.

Signed-off-by: Ralph Ursprung <[email protected]>
rursprung added a commit to rursprung/opensearch-api-specification that referenced this issue Oct 11, 2024
this is part of opensearch-project/OpenSearch#11326. the actual
implementation was done opensearch-project/OpenSearch#15915. see the
commit message on the PR for further details.

the new test group `analysis` has been added so that it can later be
extended with all other optional language analyzers (which are currently
also not covered).

note that the CI currently needs to fetch the image from
`opensearchstaging` as 2.18.0 hasn't been released yet. the `hub` and
`ref` config can be removed once 2.18.0 has been released.

Signed-off-by: Ralph Ursprung <[email protected]>
rursprung added a commit to rursprung/opensearch-api-specification that referenced this issue Oct 11, 2024
this is part of opensearch-project/OpenSearch#11326. the actual
implementation was done opensearch-project/OpenSearch#15915. see the
commit message on the PR for further details.

the new test group `analysis` has been added so that it can later be
extended with all other optional language analyzers (which are currently
also not covered).

note that the CI currently needs to fetch the image from
`opensearchstaging` as 2.18.0 hasn't been released yet. the `hub` and
`ref` config can be removed once 2.18.0 has been released.

Signed-off-by: Ralph Ursprung <[email protected]>
dk2k pushed a commit to dk2k/OpenSearch that referenced this issue Oct 16, 2024
* add `Strings#isDigits` API

inspiration taken from [this SO answer][SO].

note that the stream is not parallelised to avoid the overhead of this
as the method is intended to be called primarily with shorter strings
where the time to set up would take longer than the actual check.

[SO]: https://stackoverflow.com/a/35150400

Signed-off-by: Ralph Ursprung <[email protected]>

* add `phone` & `phone-search` analyzer + tokenizer

this is largely based on [elasticsearch-phone] and internally uses
[libphonenumber].
this intentionally only ports a subset of the features: only `phone` and
`phone-search` are supported right now, `phone-email` can be added
if/when there's a clear need for it.

using `libphonenumber` is required since parsing phone numbers is a
non-trivial task (even though it might seem trivial at first glance!),
as can be seen in the list [falsehoods programmers believe about phone
numbers][falsehoods].

this allows defining the region to be used when analysing a phone
number. so far only the generic "unkown" region (`ZZ`) had been used
which worked as long as international numbers were prefixed with `+` but
did not work when using local numbers (e.g. a number stored as
`+4158...` was not matched against a number entered as `004158...` or
`058...`).

example configuration for an index:
```json
{
  "index": {
    "analysis": {
      "analyzer": {
        "phone": {
          "type": "phone"
        },
        "phone-search": {
          "type": "phone-search"
        },
        "phone-ch": {
          "type": "phone",
          "phone-region": "CH"
        },
        "phone-search-ch": {
          "type": "phone-search",
          "phone-region": "CH"
        }
      }
    }
  }
}
```
this creates four analyzers: `phone` and `phone-search` which do not
explicitly specify a region and thus fall back to `ZZ` (unknown region,
regional version of international dialing prefix (e.g. `00` instead of
`+` in most of europe) will not be recognised) and `phone-ch` and
`phone-search-ch` which will try to parse the phone number as a swiss
phone number (thus e.g. `00` as a prefix is recognised as the
international dialing prefix).

note that the analyzer is (currently) not meant to find phone numbers in
large text documents - instead it should be used on fields which contain
just the phone number (though extra text will be ignored) and it
collects the whole content of the field into a `String` in memory,
making it unsuitable for large field values.

this has been implemented in a new plugin which is however part of the
central opensearch repository as it was deemed too big an overhead to
have it in a separate repository but not important enough to bundle it
directly in `analysis-common` (see the discussion on the issue and the
PR for further details).

note that the new plugin has been added to the exclude list of the
javadoc check as this check is overzealous and also complains in many
cases where it shouldn't (e.g. on overridden methods - which it should
theoretically not do - or constructors which don't even exist). the
check first needs to be improved before this exclusion could be removed.

closes opensearch-project#11326

[elasticsearch-phone]: https:/purecloudlabs/elasticsearch-phone
[libphonenumber]: https:/google/libphonenumber
[falsehoods]: https:/google/libphonenumber/blob/master/FALSEHOODS.md

Signed-off-by: Ralph Ursprung <[email protected]>

---------

Signed-off-by: Ralph Ursprung <[email protected]>
dk2k pushed a commit to dk2k/OpenSearch that referenced this issue Oct 17, 2024
* add `Strings#isDigits` API

inspiration taken from [this SO answer][SO].

note that the stream is not parallelised to avoid the overhead of this
as the method is intended to be called primarily with shorter strings
where the time to set up would take longer than the actual check.

[SO]: https://stackoverflow.com/a/35150400

Signed-off-by: Ralph Ursprung <[email protected]>

* add `phone` & `phone-search` analyzer + tokenizer

this is largely based on [elasticsearch-phone] and internally uses
[libphonenumber].
this intentionally only ports a subset of the features: only `phone` and
`phone-search` are supported right now, `phone-email` can be added
if/when there's a clear need for it.

using `libphonenumber` is required since parsing phone numbers is a
non-trivial task (even though it might seem trivial at first glance!),
as can be seen in the list [falsehoods programmers believe about phone
numbers][falsehoods].

this allows defining the region to be used when analysing a phone
number. so far only the generic "unkown" region (`ZZ`) had been used
which worked as long as international numbers were prefixed with `+` but
did not work when using local numbers (e.g. a number stored as
`+4158...` was not matched against a number entered as `004158...` or
`058...`).

example configuration for an index:
```json
{
  "index": {
    "analysis": {
      "analyzer": {
        "phone": {
          "type": "phone"
        },
        "phone-search": {
          "type": "phone-search"
        },
        "phone-ch": {
          "type": "phone",
          "phone-region": "CH"
        },
        "phone-search-ch": {
          "type": "phone-search",
          "phone-region": "CH"
        }
      }
    }
  }
}
```
this creates four analyzers: `phone` and `phone-search` which do not
explicitly specify a region and thus fall back to `ZZ` (unknown region,
regional version of international dialing prefix (e.g. `00` instead of
`+` in most of europe) will not be recognised) and `phone-ch` and
`phone-search-ch` which will try to parse the phone number as a swiss
phone number (thus e.g. `00` as a prefix is recognised as the
international dialing prefix).

note that the analyzer is (currently) not meant to find phone numbers in
large text documents - instead it should be used on fields which contain
just the phone number (though extra text will be ignored) and it
collects the whole content of the field into a `String` in memory,
making it unsuitable for large field values.

this has been implemented in a new plugin which is however part of the
central opensearch repository as it was deemed too big an overhead to
have it in a separate repository but not important enough to bundle it
directly in `analysis-common` (see the discussion on the issue and the
PR for further details).

note that the new plugin has been added to the exclude list of the
javadoc check as this check is overzealous and also complains in many
cases where it shouldn't (e.g. on overridden methods - which it should
theoretically not do - or constructors which don't even exist). the
check first needs to be improved before this exclusion could be removed.

closes opensearch-project#11326

[elasticsearch-phone]: https:/purecloudlabs/elasticsearch-phone
[libphonenumber]: https:/google/libphonenumber
[falsehoods]: https:/google/libphonenumber/blob/master/FALSEHOODS.md

Signed-off-by: Ralph Ursprung <[email protected]>

---------

Signed-off-by: Ralph Ursprung <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhancement or improvement to existing feature or request Search:Relevance v2.18.0 Issues and PRs related to version 2.18.0 v3.0.0 Issues and PRs related to version 3.0.0
Projects
Status: New
Status: Done
Development

Successfully merging a pull request may close this issue.

7 participants