[RFC] Introduce email field set - stage 2 #1593

Merged
merged 30 commits into from
Dec 13, 2021

Conversation

ebeahan
Member

@ebeahan ebeahan commented Aug 25, 2021

Summary

Continuing onto Stage 2 with this proposal to introduce the email.* field set to the schema.

Stage 2 (Candidate) Criteria:

  • Opened pull request for this draft revising the existing proposal
  • Completed field definitions
  • Included a real-world example source document
  • Identifies scope of impact of changes to ingestion mechanisms (e.g., beats/logstash), usage mechanisms (e.g., Kibana applications, detections), and the ECS project (e.g., docs, tooling)
  • Subject matter experts weighed in on the technical utility of field definitions in the pull request

Preview of markdown proposal

@ebeahan ebeahan added the RFC label Aug 25, 2021
@ebeahan ebeahan self-assigned this Aug 25, 2021
@ebeahan
Member Author

ebeahan commented Aug 25, 2021

Opening PR to capture any feedback or suggestions around the proposed set of email.* fields.

@wasserman

wasserman commented Sep 20, 2021

With regard to Display Name, you could consider some of these options:

  1. Allow for name/address pairs in all email fields instead of just the email address itself.
  2. Permit RFC 5322 Email formats like First Last <[email protected]>. Then a keyword with a normalizer could index just the emails while still keeping the display names intact. Of course the names wouldn't be searchable.
  3. Multi-fields with some variation of this could work and allow for some form of the keyword and text fields.

I personally ran with option 2. I was hoping to leverage the uax_url_email tokenizer, but I had to settle for a simple regex.

"normalizer": {
  "email_normalizer": {
    "type": "custom",
    "char_filter": [
      "email_filter"
    ]
  }
},
"char_filter": {
  "email_filter": {
    "type": "pattern_replace",
    "pattern": ".*?<([^>]+)>",
    "replacement": "$1"
  }
}
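
To see what the pattern_replace filter above does to a value, here is a minimal Python sketch of the same regex. It is only an approximation for illustration: Elasticsearch applies the pattern with Java regex semantics, where the replacement "$1" corresponds to r"\1" in Python. The function name normalize_email and the sample addresses are made up for this example.

```python
import re

# Same pattern as the char_filter above, expressed as a Python regex.
EMAIL_FILTER = re.compile(r".*?<([^>]+)>")


def normalize_email(value: str) -> str:
    """Keep only the address inside angle brackets, if there are any."""
    # r"\1" plays the role of "$1" in the Elasticsearch replacement.
    return EMAIL_FILTER.sub(r"\1", value)


# A name/address pair is reduced to the bare address.
print(normalize_email("First Last <first.last@example.com>"))
# A bare address has no angle brackets, so it passes through unchanged.
print(normalize_email("first.last@example.com"))
```

Since a normalizer only changes the indexed keyword term, the original string with the display name would still be visible in the stored document, which matches the trade-off described in option 2.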

@ebeahan
Member Author

ebeahan commented Nov 18, 2021

Proposed fields now include arrays of objects with both the email address and display name for the to, cc, and bcc recipients. The display_name and address fields have been added under email.from.

Field name                 Data type
email.from.address         keyword
email.from.display_name    keyword
email.to                   nested
email.to.address           keyword
email.to.display_name      keyword
email.subject              keyword
email.cc                   nested
email.cc.address           keyword
email.cc.display_name      keyword
email.bcc                  nested
email.bcc.address          keyword
email.bcc.display_name     keyword
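
For illustration, a document following the field list in this comment might look roughly like the sketch below. It reflects only the fields as proposed here, not necessarily the final schema, and every name, address, and subject in it is invented.

```python
import json

# Hypothetical event shaped after the proposed fields; all values are made up.
event = {
    "email": {
        # A single object: one sender address with its display name.
        "from": {
            "address": "alice@example.com",
            "display_name": "Alice Example",
        },
        "subject": "Quarterly report",
        # Arrays of objects: each recipient keeps its own address/name pair.
        "to": [
            {"address": "bob@example.com", "display_name": "Bob Example"},
            {"address": "carol@example.com", "display_name": "Carol Example"},
        ],
        "cc": [
            {"address": "dave@example.com", "display_name": "Dave Example"},
        ],
        "bcc": [],
    }
}

print(json.dumps(event, indent=2))
```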

Contributor

@djptek djptek left a comment

LGTM!

As an aside, when this is merged I'd like to use it as an example to update the docs around the nested type, as the difference between email.from and email.to illustrates this perfectly.

@devonakerr devonakerr requested review from peasead and removed request for devonakerr November 19, 2021 13:50
@ebeahan
Member Author

ebeahan commented Nov 22, 2021

The limitations around building visualizations with nested-type fields make me question whether using nested for the various email sender/recipient fields is the best direction.

I'm going to think this over a bit more and iterate on the proposed fields.

@djptek
Contributor

djptek commented Nov 23, 2021

@ebeahan re nested type - for the legal use case nested type is not necessary as that would center on *.address

Regarding visualisation of emails, we'd probably need a Chord or Sankey diagram. A quick look in Kibana Issues suggests those are not on the radar at this point in time, though there is a Sankey example using Vega on the Blog. Again, this would probably center on *.address, so not nested.

There might also be value in running address and display name through ML, which would probably be incompatible with the nested type.

The only use case off the top of my head where we might want nested might be Spoof detection, however, given the cardinality this would probably be best done using ML for the heavy lifting and then manual inspection of anomalies, so you could work around that

@peasead Do you have a specific use case/query/aggs in mind where we'd need to leverage nested type?

@ebeahan ebeahan requested a review from a team November 30, 2021 19:30
@ebeahan
Member Author

ebeahan commented Nov 30, 2021

@jamiehynds as the sponsor, can you take a look at how the email.* fields proposal is shaping up?

@peasead
Contributor

peasead commented Dec 1, 2021

> @peasead Do you have a specific use case/query/aggs in mind where we'd need to leverage nested type?

Thanks for your patience.

@djptek I don't have anything specific. I wasn't sure if there'd be a use case to query nested objects independently, but thinking more, I'm not sure that'd be needed.

@wasserman

wasserman commented Dec 1, 2021

FYI, nested was just thought to be a useful way to preserve the relationships between display names and emails. Multi-fields or a normalizer could work too. Ultimately, any smart way to avoid losing the display names in the process would do, since they could be valuable just to be able to see, if nothing else.
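
The relationship in question can be illustrated with a small Python simulation. Elasticsearch indexes a plain (non-nested) array of objects as one multi-valued list per leaf field, so the pairing between an address and its display name is not kept at query time, whereas nested keeps each object separate. The helper names and sample recipients below are hypothetical and only mimic that behaviour; this is not Elasticsearch code.

```python
def flatten_object_array(objects):
    """Mimic non-nested indexing: one multi-valued list per leaf field."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(key, []).append(value)
    return flat


def flat_match(flat, address, display_name):
    """Both values appear somewhere in the array, in any object."""
    return (address in flat.get("address", [])
            and display_name in flat.get("display_name", []))


def nested_match(objects, address, display_name):
    """Both values appear in the same object, as a nested query would require."""
    return any(o.get("address") == address and o.get("display_name") == display_name
               for o in objects)


# Made-up recipients for illustration.
to = [
    {"address": "bob@example.com", "display_name": "Bob Example"},
    {"address": "carol@example.com", "display_name": "Carol Example"},
]
flat = flatten_object_array(to)

# Bob's address paired with Carol's name: a false positive once flattened.
print(flat_match(flat, "bob@example.com", "Carol Example"))
print(nested_match(to, "bob@example.com", "Carol Example"))
```

The display names are still visible in the stored document either way; what is lost without nested is only the ability to query a specific address/name pairing.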

Contributor

@djptek djptek left a comment

LGTM!

@djptek
Contributor

djptek commented Dec 2, 2021

Thanks @wasserman

> nested was just thought to be a useful way to preserve the relationships between display names and emails

Looking at one use case that leverages this relationship, e.g. checking for spoofing of a known address, we'd need to involve additional data - a table/index defining the a priori relationship between a defined and countable set of address(es) and the legitimate display_name(s) for these address(es). This would require additional logic, including a join against that table/index. Elasticsearch joins are generally best implemented at ingest time rather than query time, so this use case could perhaps be addressed by building this into the ingest pipeline, or by reindexing a subset of data related to specific address(es) of interest.

Conversely, where the relationship between display_name and address is not explicitly defined a priori, there is no upper limit to the number of address(es) in the related use cases, so it may be preferable to avoid the nested type to ensure the most performant solution.
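
As a rough sketch of the spoofing check described above, the lookup against a predefined table could happen while the event is being processed, before indexing. Everything below is an assumption made for illustration: the KNOWN_SENDERS table, the function name, and the sample events are hypothetical, and a real ingest pipeline would express this with its own enrichment mechanism rather than Python.

```python
# Hypothetical a priori table: known address -> legitimate display names.
KNOWN_SENDERS = {
    "ceo@example.com": {"Pat Example"},
}


def display_name_mismatch(event, known=KNOWN_SENDERS):
    """Return True when a known address arrives with an unexpected display name."""
    sender = event.get("email", {}).get("from", {})
    address = (sender.get("address") or "").lower()
    legitimate = known.get(address)
    if legitimate is None:
        # Address is not in the table, so there is nothing to compare against.
        return False
    return sender.get("display_name") not in legitimate


# Made-up events for illustration.
ok = {"email": {"from": {"address": "ceo@example.com", "display_name": "Pat Example"}}}
odd = {"email": {"from": {"address": "ceo@example.com", "display_name": "IT Helpdesk"}}}
unknown = {"email": {"from": {"address": "someone@example.org", "display_name": "Pat Example"}}}

print(display_name_mismatch(ok))
print(display_name_mismatch(odd))
print(display_name_mismatch(unknown))
```

Because the check only reads email.from.address and email.from.display_name from a single object, it does not depend on the nested type.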

@ebeahan ebeahan merged commit 2b55ff8 into elastic:main Dec 13, 2021
@ebeahan
Member Author

ebeahan commented Dec 15, 2021

While implementing the email.* field set into the schema, reusing hash.* at email.attachments.file.hash.* felt like a better approach that aligns with the existing file.hash.* fields and adds a few additional hash fields for any attachment: #1688 (comment).

I'll capture these details fully in the proposal doc during stage 3.
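
As a hedged illustration of the email.attachments.file.hash.* layout mentioned above, the sketch below builds one attachment entry from raw bytes. The helper name attachment_entry and the sample content are invented, and only three hash algorithms are shown; the exact set of fields is left to the stage 3 proposal.

```python
import hashlib


def attachment_entry(content: bytes) -> dict:
    """Build one element of email.attachments with hashes under file.hash."""
    return {
        "file": {
            "hash": {
                "md5": hashlib.md5(content).hexdigest(),
                "sha1": hashlib.sha1(content).hexdigest(),
                "sha256": hashlib.sha256(content).hexdigest(),
            }
        }
    }


# Made-up attachment bytes for illustration.
event = {"email": {"attachments": [attachment_entry(b"example attachment bytes")]}}
print(event["email"]["attachments"][0]["file"]["hash"]["sha256"])
```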
