Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EQL: Add optional fields and limit joining keys on non-null values only #79677

Merged
merged 4 commits into from
Oct 26, 2021

Conversation

astefan
Copy link
Contributor

@astefan astefan commented Oct 22, 2021

Add optional fields, with usage inside queries' filters and as join keys.
Optional fields have a question mark in front of their name (?some_field) and can be used in standalone event queries and sequences.

If the field exists in at least one index that's getting queried, that field will actually be used in queries.
If the field doesn't exist in any of the indices, all its mentions in query filters will be replaced with null.
For sequences, optional fields as keys can have null as a value, whereas non-optional fields will be restricted to non-null values only.

For example, a query like

sequence by ?process.entity_id, process.pid
  [process where transID == 2]
  [file where transID == 0] with runs=2

can return a sequence with a join key [null,123].
If the sequence will use process.pid as an optional field (sequence by ?process.entity_id, ?process.pid), as well, the sequence can now return join keys as [null,123], [null,null], [512,null].

Restrict joining keys to non-null values only for non-optional fields
@elasticmachine elasticmachine added the Team:QL (Deprecated) Meta label for query languages team label Oct 22, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-ql (Team:QL)

Copy link
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change can be made less intrusive by mapping the optional key onto existing concepts.
For example for optional attribute construct a special unresolved attribute by initializing the unresolved message with a singleton string (say 'optional' or '?').
Don't keep track of the optional attributes, simply let the analyzer resolve all of them and add a separate rule in Finish Analysis that looks for all unresolved attributes that have defined string and replace them with a null literal.

This takes advantage of folding the expressions and also extracting the null constant if defined as a key. The verifier would not be tripped (and just has to be updated to consider the null type for key compatibility) while BoxedQuery & co only need to care about converting null to exists queries.

Comment on lines 106 to 110
if (arr != null) {
spec.joinKeys(arr.toArray(new String[0]));
} else {
spec.joinKeys(new String[0]);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spec.joinKeys(arr != null ? arr.toArray(new String[0]) : new String[0]);

@@ -430,3 +430,63 @@ query = '''
process where substring(command_line, 5) regex (".*?net[1]? localgroup.*?", ".*? myappserver.py .*?")
'''

[[queries]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍
Please add some test(s) with until as well.

Copy link
Contributor Author

@astefan astefan Oct 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not easy to add a test with until given the current data. That's why I added some more data to extra.data and created more tests with that data set. There is a commented test in test_extra.toml but it is like that because I don't think until works as expected. I can uncomment that and adjust the join_keys to have a test with until.

expected_event_ids = [18,19,20]
join_keys = ["512","123"]

# Known issue: same key used as both optional and non-optional doesn't work as expected. Does it even make sense such scenario?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't think of a use case where this makes sense - a key has to be either optional or not.
This use-case needs to be prevent in the verifier.

import org.elasticsearch.xpack.ql.type.DataType;
import org.elasticsearch.xpack.ql.type.DataTypes;

public class OptionalUnresolvedAttribute extends UnresolvedAttribute {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it's currently defined the class is confusing.
It extends UnresolvedAttribute yet it can be resolved.
A better name might be OptionalAttribute.


public ExecutionManager(EqlSession eqlSession) {
this.session = eqlSession;
this.cfg = eqlSession.configuration();
this.optionalKeys = new HashSet<>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd opt for LHS to help with debugging and preserving the order of the keys specified in the initial query.

@@ -38,6 +40,7 @@
private static final Logger log = LogManager.getLogger();

private final boolean DEBUG = false;
private final Set<Expression> keyOptionals = new HashSet<>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Unless there's a performance problem, for debugging it's better to use LHS instead of a plain HS.
  2. The optional keys are attributes not expressions.

@@ -65,6 +68,10 @@ public Expression createExpression(String expression, ParserParams params) {
return invokeParser(expression, params, EqlBaseParser::singleExpression, AstBuilder::expression);
}

public Set<Expression> keyOptionals() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the public use of this getter? Why not use the same approach as for ParserParams? That is create the list during the method call since it's just needed inside invokeParser method.

try {
return ctx != null ? visitList(this, ctx.expression(), Attribute.class) : emptyList();
List<Attribute> keys = visitList(this, ctx.expression(), Attribute.class);
for (Attribute key : keys) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The information here is redundant and this optimizations has too many repercutions in the method signatures.
Either the ? is remembered through a separate set or the information is embedded inside the UnresolvedAttribute either by subclassing or using a special tag (such as a special string used for identifying it).
I opt for the latter since it avoid the set to be passed as a constructor in classes that have nothing to do with it - instead the discovery of the unresolvedattribute can be done directly at resolution time or through a dedicated rule that looks for them.

@@ -120,6 +121,10 @@ public QueryContainer addColumn(Attribute attr) {
return new Tuple<>(this, extractorRegistry.fieldExtraction(expression));
}

if (expression instanceof OptionalUnresolvedAttribute) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like leaked concern - better to make the attribute foldable which maps nicely to the concept of replacing it with null.

@@ -266,6 +269,9 @@ private void checkJoinKeyTypes(LogicalPlan plan, Set<Failure> localFailures) {
}

private static void doCheckKeyTypes(Join join, Set<Failure> localFailures, NamedExpression expectedKey, NamedExpression currentKey) {
if (expectedKey instanceof OptionalUnresolvedAttribute || currentKey instanceof OptionalUnresolvedAttribute) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better yet, just check if the datatype is not null

Copy link
Contributor

@Luegg Luegg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the ? syntax will be used for two distinct things: a) treat a non-existing field as null instead of failing the query and b) also use null as a join key

I was not part of the discussions around the feature but I'm wondering wether it wouldn't be better to keep these two concerns separated. In total, there are 4 different ways on how nulls and non-existing fields can be treated in join keys:

  • don't use nulls as join key, fail on non-existing field (new default)
  • use nulls as join key, fail on non-existing field
  • don't use nulls as join key, replace non-existing field with null
  • use nulls as join key, replace non-existing field with null (? syntax)

The second and third option will not be available to the user with this change but I think the second option is the behavior users might usually prefer if they want to use null as join keys (and get a proper error if they misspelled the field name).

Hence, would it make sense to have a separate syntax for including nulls as join keys? Something like sequence by foo with nulls, bar ...

@astefan
Copy link
Contributor Author

astefan commented Oct 25, 2021

Synced with @Luegg offline on the concern he raised and we agreed it's worth investigating the possibility of being explicit about the "allow/disallow nulls" behavior for join keys:

  • if it would be a useful feature to have
  • if it makes sense for it to be added to EQL, in the context of join keys as expressions
  • if all of the above are valid points, what would be the syntax for it

Replace missing field based on a context either with literal or a
dedicated attribute. This allows folding to occur without dealing with
aliases or making the named expression act as a literal (which can lead
to accidental folding).

Make the mandatory key constraint explicit inside each filter query
to help in propagating the constraint across sequences without special
handling.
Copy link
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the interest of time, after a discussion with @astefan I've pushed an update based on my comments.
Essentially instead of using one attribute that acts both as a literal, resolved attribute and unresolved one, 3 different attribute classes are introduced.
They simply act as markers and don't introduce new behavior. The upside is the Analyzer and Optimizer can perform fine grained replacement:

  • the analyzer does the resolution separately and in case things are not found replaces it with a literal (in a filter) or an MissionOptionalAttribute.
  • the optimizer adds non-null constraints based on all mandatory keys (the optional fields are excluded).
    This simplifies things by avoiding passing any set or list - the boxed query generator can determine that based on the type of keys and add the or query when needed.

git log:
Replace missing field based on a context either with literal or a dedicated attribute. This allows folding to occur without dealing with aliases or making the named expression act as a literal (which can lead to accidental folding).

Make the mandatory key constraint explicit inside each filter query to help in propagating the constraint across sequences without special handling.

@astefan astefan added auto-backport-and-merge auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) v7.16.1 labels Oct 26, 2021
@elasticsearchmachine elasticsearchmachine merged commit d4bf2dc into elastic:master Oct 26, 2021
@elasticsearchmachine
Copy link
Collaborator

💚 Backport successful

Status Branch Result
7.16

astefan added a commit to astefan/elasticsearch that referenced this pull request Oct 26, 2021
…ly (elastic#79677)

Add optional fields, with usage inside queries' filters and as join
keys. Optional fields have a question mark in front of their name
(`?some_field`) and can be used in standalone event queries and
sequences. If the field exists in at least one index that's getting
queried, that field will actually be used in queries. If the field
doesn't exist in any of the indices, all its mentions in query filters
will be replaced with `null`. For sequences, optional fields as keys can
have `null` as a value, whereas non-optional fields will be restricted
to non-null values only. For example, a query like

```
sequence by ?process.entity_id, process.pid
  [process where transID == 2]
  [file where transID == 0] with runs=2
```

can return a sequence with a join key `[null,123]`. If the sequence will
use `process.pid` as an optional field (`sequence by ?process.entity_id,
?process.pid`), as well, the sequence can now return join keys as
`[null,123]`, `[null,null]`, `[512,null]`.
Copy link
Contributor

@bpintea bpintea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

elasticsearchmachine pushed a commit that referenced this pull request Oct 26, 2021
…ly (#79677) (#79807)

Add optional fields, with usage inside queries' filters and as join
keys. Optional fields have a question mark in front of their name
(`?some_field`) and can be used in standalone event queries and
sequences. If the field exists in at least one index that's getting
queried, that field will actually be used in queries. If the field
doesn't exist in any of the indices, all its mentions in query filters
will be replaced with `null`. For sequences, optional fields as keys can
have `null` as a value, whereas non-optional fields will be restricted
to non-null values only. For example, a query like

```
sequence by ?process.entity_id, process.pid
  [process where transID == 2]
  [file where transID == 0] with runs=2
```

can return a sequence with a join key `[null,123]`. If the sequence will
use `process.pid` as an optional field (`sequence by ?process.entity_id,
?process.pid`), as well, the sequence can now return join keys as
`[null,123]`, `[null,null]`, `[512,null]`.
@astefan astefan deleted the eql_optional_fields2 branch October 26, 2021 14:38
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Oct 26, 2021
* upstream/master: (209 commits)
  Enforce license expiration (elastic#79671)
  TSDB: Automatically add timestamp mapper (elastic#79136)
  [DOCS]  `_id` is required for bulk API's `update` action (elastic#79774)
  EQL: Add optional fields and limit joining keys on non-null values only (elastic#79677)
  [DOCS] Document range enrich policy (elastic#79607)
  [DOCS] Fix typos in 8.0 security migration (elastic#79802)
  Allow listing older repositories (elastic#78244)
  [ML] track inference model feature usage per node (elastic#79752)
  Remove IncrementalClusterStateWriter & related code (elastic#79738)
  Reuse previous indices lookup when possible (elastic#79004)
  Reduce merging in PersistedClusterStateService (elastic#79793)
  SQL: Adjust JDBC docs to use milliseconds for timeouts (elastic#79628)
  Remove endpoint for freezing indices (elastic#78918)
  [ML] add timeout parameter for DELETE trained_models API (elastic#79739)
  [ML] wait for .ml-state-write alias to be readable (elastic#79731)
  [Docs] Update edgengram-tokenizer.asciidoc (elastic#79577)
  [Test][Transform] fix UpdateTransformActionRequestTests failure (elastic#79787)
  Limit CS Update Task Description Size (elastic#79443)
  Apply the reader wrapper on can_match source (elastic#78988)
  [DOCS] Adds new transform limitation item and a note to the tutorial (elastic#79479)
  ...

# Conflicts:
#	server/src/main/java/org/elasticsearch/index/IndexMode.java
#	server/src/test/java/org/elasticsearch/index/TimeSeriesModeTests.java
@astefan
Copy link
Contributor Author

astefan commented Oct 27, 2021

CC @jrodewig

lockewritesdocs pushed a commit to lockewritesdocs/elasticsearch that referenced this pull request Oct 28, 2021
…ly (elastic#79677)

Add optional fields, with usage inside queries' filters and as join
keys. Optional fields have a question mark in front of their name
(`?some_field`) and can be used in standalone event queries and
sequences. If the field exists in at least one index that's getting
queried, that field will actually be used in queries. If the field
doesn't exist in any of the indices, all its mentions in query filters
will be replaced with `null`. For sequences, optional fields as keys can
have `null` as a value, whereas non-optional fields will be restricted
to non-null values only. For example, a query like

```
sequence by ?process.entity_id, process.pid
  [process where transID == 2]
  [file where transID == 0] with runs=2
```

can return a sequence with a join key `[null,123]`. If the sequence will
use `process.pid` as an optional field (`sequence by ?process.entity_id,
?process.pid`), as well, the sequence can now return join keys as
`[null,123]`, `[null,null]`, `[512,null]`.
jrodewig added a commit that referenced this pull request Nov 3, 2021
Adds new sections for optional fields and optional `by` fields. Also revises some existing content to define **join keys**.

Closes #79910

Relates to #79677
elasticsearchmachine pushed a commit that referenced this pull request Nov 3, 2021
Adds new sections for optional fields and optional `by` fields. Also revises some existing content to define **join keys**.

Closes #79910

Relates to #79677
elasticsearchmachine pushed a commit that referenced this pull request Nov 3, 2021
Adds new sections for optional fields and optional `by` fields. Also revises some existing content to define **join keys**.

Closes #79910

Relates to #79677
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/EQL EQL querying auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) >bug Team:QL (Deprecated) Meta label for query languages team v7.16.0 v8.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants