Merge custom and core multi_fields array #982

Merged 8 commits on Jan 6, 2021
Changes from 5 commits

1 change: 1 addition & 0 deletions CHANGELOG.next.md
@@ -41,6 +41,7 @@ Thanks, you're awesome :-) -->
* Introduced `--strict` flag to perform stricter schema validation when running the generator script. #937
* Added check under `--strict` that ensures composite types in example fields are quoted. #966
* Added `ignore_above` and `normalizer` support for keyword multi-fields. #971
* Added functionality for merging custom and core multi-fields. #982

#### Improvements

28 changes: 28 additions & 0 deletions scripts/schema/loader.py
@@ -171,6 +171,26 @@ def nest_fields(field_array):
    return schema_root


def array_of_dicts_to_set(array_vals):
    ret_set = set()
    for dict_val in array_vals:
        ret_set.add(frozenset(dict_val.items()))
    return ret_set


def set_of_sets_to_array(set_vals):
    ret_list = []
    for set_info in set_vals:
        ret_list.append(dict(set_info))
    return sorted(ret_list, key=lambda k: k['name'])


def dedup_and_merge_lists(list_a, list_b):
    list_a_set = array_of_dicts_to_set(list_a)
    list_b_set = array_of_dicts_to_set(list_b)
    return set_of_sets_to_array(list_a_set | list_b_set)

Member

Minor issue I stumbled across while testing this out. Not sure it would be a blocker to merging, but worth noting the behavior.

The union will remove exact duplicate items:

> list_a_set
{frozenset({('name', 'text'), ('type', 'text')})}

> list_b_set
{frozenset({('name', 'text'), ('type', 'text')}), frozenset({('type', 'keyword'), ('normalizer', 'lowercase'), ('name', 'caseless')})}

> list_a_set | list_b_set
{frozenset({('name', 'text'), ('type', 'text')}), frozenset({('type', 'keyword'), ('normalizer', 'lowercase'), ('name', 'caseless')})}

But if the sets are not exact duplicates, it could lead to duplicate field names:

> list_a_set
{frozenset({('type', 'text'), ('name', 'text')})}

> list_b_set
{frozenset({('normalizer', 'lowercase'), ('type', 'keyword'), ('name', 'caseless')}), frozenset({('type', 'keyword'), ('name', 'text')})}

> list_a_set | list_b_set
{frozenset({('normalizer', 'lowercase'), ('type', 'keyword'), ('name', 'caseless')}), frozenset({('type', 'text'), ('name', 'text')}), frozenset({('type', 'keyword'), ('name', 'text')})}

schema include file:

---
  - name: file
    title: File
    group: 2
    short: Fields describing files.
    description: >
      Custom file
    fields:
      - name: path
        multi_fields:
          - name: caseless
            type: keyword
            normalizer: lowercase
          - name: text  
            type: keyword <= I imagine this would only happen by accident 😃

Resulting intermediate state:

        multi_fields:
        - flat_name: file.path.caseless
          ignore_above: 1024
          name: caseless
          normalizer: lowercase
          type: keyword
        - flat_name: file.path.text
          ignore_above: 1024
          name: text
          type: keyword
        - flat_name: file.path.text
          name: text
          norms: false
          type: text
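
The behavior described in this comment can be reproduced in isolation. The sketch below restates the three helpers from the diff in condensed form (same logic, written with comprehensions) and feeds them the `file.path` example. The `core` and `custom` variable names are illustrative only and do not appear in the PR.

```python
def array_of_dicts_to_set(array_vals):
    # Each dict becomes a hashable frozenset of its (key, value) pairs.
    return {frozenset(d.items()) for d in array_vals}


def set_of_sets_to_array(set_vals):
    # Convert back to dicts and sort by name, as the diff does.
    return sorted((dict(s) for s in set_vals), key=lambda k: k['name'])


def dedup_and_merge_lists(list_a, list_b):
    return set_of_sets_to_array(
        array_of_dicts_to_set(list_a) | array_of_dicts_to_set(list_b))


# Illustrative inputs: one core multi-field and two custom ones.
core = [{'name': 'text', 'type': 'text'}]
custom = [
    {'name': 'caseless', 'type': 'keyword', 'normalizer': 'lowercase'},
    {'name': 'text', 'type': 'keyword'},  # same name, different type
]

merged = dedup_and_merge_lists(core, custom)
names = [mf['name'] for mf in merged]
print(names)  # 'text' appears twice because the two dicts are not identical
```

Because the union compares whole dicts, only byte-for-byte identical definitions collapse; anything that differs in a single attribute survives as a second entry with the same name.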

Contributor Author

Oh, good catch. What do we think the expected behavior should be in this scenario? I could add a check to ensure that no two fields in the resulting set share a name, and throw an error if they do. Or maybe just have core override?
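
For reference, the "throw an error" option could look roughly like the sketch below. `assert_unique_multi_field_names` is a hypothetical helper, not something in `loader.py`, and the choice of `ValueError` is an assumption.

```python
from collections import Counter


def assert_unique_multi_field_names(multi_fields):
    """Hypothetical check: raise if two multi-fields share a name."""
    counts = Counter(mf['name'] for mf in multi_fields)
    dupes = sorted(name for name, n in counts.items() if n > 1)
    if dupes:
        raise ValueError(
            'Duplicate multi-field names: {}'.format(', '.join(dupes)))


# Passes silently when every name is unique.
assert_unique_multi_field_names([
    {'name': 'caseless', 'type': 'keyword'},
    {'name': 'text', 'type': 'text'},
])
```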

Contributor

IMO we should dedupe on name and take the most recent definition in the case of dupes (this would allow for overrides).
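
A minimal sketch of that suggestion, assuming "most recent" means the entry from the list merged in last (the custom one). `dedup_by_name_last_wins` is a hypothetical name, not a function in the PR.

```python
def dedup_by_name_last_wins(list_a, list_b):
    """Hypothetical variant: key on name, later definitions replace earlier ones."""
    by_name = {}
    for mf in list_a + list_b:
        by_name[mf['name']] = dict(mf)  # copy so the inputs are not mutated
    return sorted(by_name.values(), key=lambda k: k['name'])


core = [{'name': 'text', 'type': 'text'}]
custom = [
    {'name': 'caseless', 'type': 'keyword', 'normalizer': 'lowercase'},
    {'name': 'text', 'type': 'keyword'},
]
result = dedup_by_name_last_wins(core, custom)
print(result)
```

With these inputs the custom `text` definition fully replaces the core one, so each name appears exactly once.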

Member

@webmat do you have any thoughts? I recall that back in #864, logic was removed from the tooling to allow custom fields supplied via --include to be more permissive:

"This means the tooling must now accept included files as they are, with all of the power this entails."

Perhaps we simply make sure to note that users need to be aware of introducing such duplicate fields?

Contributor

@webmat, Nov 2, 2020


I agree with @madirey. We should keep it simple and only ensure we have unique multi-field names.

The --include option is meant to override, so the ideal behaviour is for a custom multi-field definition to replace or be merged with an entry of the same name. I'm on the fence on whether to merge/replace an entry of the same name, though. Happy to be convinced either way.

But to take a concrete example, let's say someone has tuned a normalizer that works well for user agent strings, I want them to be able to replace the default user_agent.original.text multi-field with such a custom definition:

        multi_fields:
        - name: text
          norms: false
          type: text
          normalizer: ua_normalizer 

I think I have a preference for merging with the pre-existing multi-field definition of the same name, as this is more in line with how everything else is handled with custom fields. It also has the bonus of allowing a terser custom definition:

        - name: text
          normalizer: ua_normalizer 
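
The merge-by-name preference could be sketched as follows. This is only an illustration of the idea under discussion, not the code that was merged; `merge_multi_fields_by_name` is a hypothetical name, and the `core` entry mirrors the default `text` multi-field shown above.

```python
import copy


def merge_multi_fields_by_name(core, custom):
    """Hypothetical variant: same-name entries are merged, custom keys win."""
    merged = {mf['name']: copy.deepcopy(mf) for mf in core}
    for mf in custom:
        # New names are added; existing names are updated attribute by attribute.
        merged.setdefault(mf['name'], {}).update(copy.deepcopy(mf))
    return sorted(merged.values(), key=lambda k: k['name'])


core = [{'name': 'text', 'type': 'text', 'norms': False}]
custom = [{'name': 'text', 'normalizer': 'ua_normalizer'}]  # terse override
result = merge_multi_fields_by_name(core, custom)
print(result)
```

The terse custom entry only needs to state what changes; `type` and `norms` are carried over from the core definition.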



def merge_fields(a, b):
    """Merge ECS field sets with custom field sets."""
    a = copy.deepcopy(a)
@@ -184,6 +204,14 @@ def merge_fields(a, b):
            a[key].setdefault('field_details', {})
            a[key]['field_details'].setdefault('normalize', [])
            a[key]['field_details']['normalize'].extend(b[key]['field_details'].pop('normalize'))
        if 'multi_fields' in b[key]['field_details']:
            a[key].setdefault('field_details', {})
            a[key]['field_details'].setdefault('multi_fields', [])
            a[key]['field_details']['multi_fields'] = dedup_and_merge_lists(
                a[key]['field_details']['multi_fields'], b[key]['field_details']['multi_fields'])
            # if we don't do this then the update call below will overwrite a's field_details, with the original
            # contents of b, which undoes our merging the multi_fields
            del b[key]['field_details']['multi_fields']
        a[key]['field_details'].update(b[key]['field_details'])
        # merge schema details
        if 'schema_details' in b[key]:
90 changes: 90 additions & 0 deletions scripts/tests/unit/test_schema_loader.py
@@ -594,6 +594,96 @@ def test_merge_non_array_attributes(self):
        }
        self.assertEqual(merged_fields, expected_fields)

    def test_merge_multi_fields(self):
        schema1 = {
            'base': {
                'field_details': {
                    'multi_fields': [
                        {
                            'type': 'text',
                            'name': 'text'
                        },
                        {
                            'type': 'keyword',
                            'name': 'caseless',
                            'normalizer': 'lowercase'
                        }
                    ]
                },
                'fields': {
                    'message': {
                        'field_details': {
                            'multi_fields': [
                                {
                                    'type': 'text',
                                    'name': 'text'
                                }
                            ]
                        }
                    }
                }
            }
        }

        schema2 = {
            'base': {
                'field_details': {
                    'multi_fields': [
                        {
                            'type': 'text',
                            'name': 'text'
                        },
                        {
                            'type': 'text',
                            'name': 'almost_text',
                        }
                    ]
                },
                'fields': {
                    'message': {
                        'field_details': {
                            'multi_fields': [
                                {
                                    'type': 'keyword',
                                    'name': 'a_field'
                                }
                            ]
                        }
                    }
                }
            }
        }
        merged_fields = loader.merge_fields(schema1, schema2)
        expected_multi_fields = [
            {
                'type': 'text',
                'name': 'almost_text'
            },
            {
                'type': 'keyword',
                'name': 'caseless',
                'normalizer': 'lowercase'
            },
            {
                'type': 'text',
                'name': 'text'
            }
        ]

        expected_message_multi_fields = [
            {
                'type': 'keyword',
                'name': 'a_field'
            },
            {
                'type': 'text',
                'name': 'text'
            }
        ]
        self.assertEqual(merged_fields['base']['field_details']['multi_fields'], expected_multi_fields)
        self.assertEqual(merged_fields['base']['fields']['message']['field_details']
                         ['multi_fields'], expected_message_multi_fields)


if __name__ == '__main__':
    unittest.main()