-
Notifications
You must be signed in to change notification settings - Fork 24.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
EQL: Add optional fields and limit joining keys on non-null values only #79677
EQL: Add optional fields and limit joining keys on non-null values only #79677
Conversation
Restrict joining keys to non-null values only for non-optional fields
Pinging @elastic/es-ql (Team:QL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change can be made less intrusive by mapping the optional key onto existing concepts.
For example for optional attribute construct a special unresolved attribute by initializing the unresolved message with a singleton string (say 'optional' or '?').
Don't keep track of the optional attributes, simply let the analyzer resolve all of them and add a separate rule in Finish Analysis
that looks for all unresolved attributes that have defined string and replace them with a null literal.
This takes advantage of folding the expressions and also extracting the null constant if defined as a key. The verifier would not be tripped (and just has to be updated to consider the null type for key compatibility) while BoxedQuery & co only need to care about converting null to exists queries.
if (arr != null) { | ||
spec.joinKeys(arr.toArray(new String[0])); | ||
} else { | ||
spec.joinKeys(new String[0]); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spec.joinKeys(arr != null ? arr.toArray(new String[0]) : new String[0]);
@@ -430,3 +430,63 @@ query = ''' | |||
process where substring(command_line, 5) regex (".*?net[1]? localgroup.*?", ".*? myappserver.py .*?") | |||
''' | |||
|
|||
[[queries]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Please add some test(s) with until
as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not easy to add a test with until
given the current data. That's why I added some more data to extra.data
and created more tests with that data set. There is a commented test in test_extra.toml
but it is like that because I don't think until
works as expected. I can uncomment that and adjust the join_keys
to have a test with until
.
expected_event_ids = [18,19,20] | ||
join_keys = ["512","123"] | ||
|
||
# Known issue: same key used as both optional and non-optional doesn't work as expected. Does it even make sense such scenario? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't think of a use case where this makes sense - a key has to be either optional or not.
This use-case needs to be prevent in the verifier.
import org.elasticsearch.xpack.ql.type.DataType; | ||
import org.elasticsearch.xpack.ql.type.DataTypes; | ||
|
||
public class OptionalUnresolvedAttribute extends UnresolvedAttribute { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As it's currently defined the class is confusing.
It extends UnresolvedAttribute
yet it can be resolved.
A better name might be OptionalAttribute
.
|
||
public ExecutionManager(EqlSession eqlSession) { | ||
this.session = eqlSession; | ||
this.cfg = eqlSession.configuration(); | ||
this.optionalKeys = new HashSet<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd opt for LHS to help with debugging and preserving the order of the keys specified in the initial query.
@@ -38,6 +40,7 @@ | |||
private static final Logger log = LogManager.getLogger(); | |||
|
|||
private final boolean DEBUG = false; | |||
private final Set<Expression> keyOptionals = new HashSet<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Unless there's a performance problem, for debugging it's better to use LHS instead of a plain HS.
- The optional keys are attributes not expressions.
@@ -65,6 +68,10 @@ public Expression createExpression(String expression, ParserParams params) { | |||
return invokeParser(expression, params, EqlBaseParser::singleExpression, AstBuilder::expression); | |||
} | |||
|
|||
public Set<Expression> keyOptionals() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the public use of this getter? Why not use the same approach as for ParserParams? That is create the list during the method call since it's just needed inside invokeParser method.
try { | ||
return ctx != null ? visitList(this, ctx.expression(), Attribute.class) : emptyList(); | ||
List<Attribute> keys = visitList(this, ctx.expression(), Attribute.class); | ||
for (Attribute key : keys) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The information here is redundant and this optimizations has too many repercutions in the method signatures.
Either the ? is remembered through a separate set or the information is embedded inside the UnresolvedAttribute either by subclassing or using a special tag (such as a special string used for identifying it).
I opt for the latter since it avoid the set to be passed as a constructor in classes that have nothing to do with it - instead the discovery of the unresolvedattribute can be done directly at resolution time or through a dedicated rule that looks for them.
@@ -120,6 +121,10 @@ public QueryContainer addColumn(Attribute attr) { | |||
return new Tuple<>(this, extractorRegistry.fieldExtraction(expression)); | |||
} | |||
|
|||
if (expression instanceof OptionalUnresolvedAttribute) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like leaked concern - better to make the attribute foldable which maps nicely to the concept of replacing it with null.
@@ -266,6 +269,9 @@ private void checkJoinKeyTypes(LogicalPlan plan, Set<Failure> localFailures) { | |||
} | |||
|
|||
private static void doCheckKeyTypes(Join join, Set<Failure> localFailures, NamedExpression expectedKey, NamedExpression currentKey) { | |||
if (expectedKey instanceof OptionalUnresolvedAttribute || currentKey instanceof OptionalUnresolvedAttribute) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better yet, just check if the datatype is not null
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the ?
syntax will be used for two distinct things: a) treat a non-existing field as null instead of failing the query and b) also use null
as a join key
I was not part of the discussions around the feature but I'm wondering wether it wouldn't be better to keep these two concerns separated. In total, there are 4 different ways on how nulls and non-existing fields can be treated in join keys:
- don't use nulls as join key, fail on non-existing field (new default)
- use nulls as join key, fail on non-existing field
- don't use nulls as join key, replace non-existing field with null
- use nulls as join key, replace non-existing field with null (
?
syntax)
The second and third option will not be available to the user with this change but I think the second option is the behavior users might usually prefer if they want to use null
as join keys (and get a proper error if they misspelled the field name).
Hence, would it make sense to have a separate syntax for including nulls as join keys? Something like sequence by foo with nulls, bar ...
Synced with @Luegg offline on the concern he raised and we agreed it's worth investigating the possibility of being explicit about the "allow/disallow nulls" behavior for join keys:
|
Replace missing field based on a context either with literal or a dedicated attribute. This allows folding to occur without dealing with aliases or making the named expression act as a literal (which can lead to accidental folding). Make the mandatory key constraint explicit inside each filter query to help in propagating the constraint across sequences without special handling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the interest of time, after a discussion with @astefan I've pushed an update based on my comments.
Essentially instead of using one attribute that acts both as a literal, resolved attribute and unresolved one, 3 different attribute classes are introduced.
They simply act as markers and don't introduce new behavior. The upside is the Analyzer and Optimizer can perform fine grained replacement:
- the analyzer does the resolution separately and in case things are not found replaces it with a literal (in a filter) or an MissionOptionalAttribute.
- the optimizer adds non-null constraints based on all mandatory keys (the optional fields are excluded).
This simplifies things by avoiding passing any set or list - the boxed query generator can determine that based on the type of keys and add the or query when needed.
git log:
Replace missing field based on a context either with literal or a dedicated attribute. This allows folding to occur without dealing with aliases or making the named expression act as a literal (which can lead to accidental folding).
Make the mandatory key constraint explicit inside each filter query to help in propagating the constraint across sequences without special handling.
💚 Backport successful
|
…ly (elastic#79677) Add optional fields, with usage inside queries' filters and as join keys. Optional fields have a question mark in front of their name (`?some_field`) and can be used in standalone event queries and sequences. If the field exists in at least one index that's getting queried, that field will actually be used in queries. If the field doesn't exist in any of the indices, all its mentions in query filters will be replaced with `null`. For sequences, optional fields as keys can have `null` as a value, whereas non-optional fields will be restricted to non-null values only. For example, a query like ``` sequence by ?process.entity_id, process.pid [process where transID == 2] [file where transID == 0] with runs=2 ``` can return a sequence with a join key `[null,123]`. If the sequence will use `process.pid` as an optional field (`sequence by ?process.entity_id, ?process.pid`), as well, the sequence can now return join keys as `[null,123]`, `[null,null]`, `[512,null]`.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
…ly (#79677) (#79807) Add optional fields, with usage inside queries' filters and as join keys. Optional fields have a question mark in front of their name (`?some_field`) and can be used in standalone event queries and sequences. If the field exists in at least one index that's getting queried, that field will actually be used in queries. If the field doesn't exist in any of the indices, all its mentions in query filters will be replaced with `null`. For sequences, optional fields as keys can have `null` as a value, whereas non-optional fields will be restricted to non-null values only. For example, a query like ``` sequence by ?process.entity_id, process.pid [process where transID == 2] [file where transID == 0] with runs=2 ``` can return a sequence with a join key `[null,123]`. If the sequence will use `process.pid` as an optional field (`sequence by ?process.entity_id, ?process.pid`), as well, the sequence can now return join keys as `[null,123]`, `[null,null]`, `[512,null]`.
* upstream/master: (209 commits) Enforce license expiration (elastic#79671) TSDB: Automatically add timestamp mapper (elastic#79136) [DOCS] `_id` is required for bulk API's `update` action (elastic#79774) EQL: Add optional fields and limit joining keys on non-null values only (elastic#79677) [DOCS] Document range enrich policy (elastic#79607) [DOCS] Fix typos in 8.0 security migration (elastic#79802) Allow listing older repositories (elastic#78244) [ML] track inference model feature usage per node (elastic#79752) Remove IncrementalClusterStateWriter & related code (elastic#79738) Reuse previous indices lookup when possible (elastic#79004) Reduce merging in PersistedClusterStateService (elastic#79793) SQL: Adjust JDBC docs to use milliseconds for timeouts (elastic#79628) Remove endpoint for freezing indices (elastic#78918) [ML] add timeout parameter for DELETE trained_models API (elastic#79739) [ML] wait for .ml-state-write alias to be readable (elastic#79731) [Docs] Update edgengram-tokenizer.asciidoc (elastic#79577) [Test][Transform] fix UpdateTransformActionRequestTests failure (elastic#79787) Limit CS Update Task Description Size (elastic#79443) Apply the reader wrapper on can_match source (elastic#78988) [DOCS] Adds new transform limitation item and a note to the tutorial (elastic#79479) ... # Conflicts: # server/src/main/java/org/elasticsearch/index/IndexMode.java # server/src/test/java/org/elasticsearch/index/TimeSeriesModeTests.java
CC @jrodewig |
…ly (elastic#79677) Add optional fields, with usage inside queries' filters and as join keys. Optional fields have a question mark in front of their name (`?some_field`) and can be used in standalone event queries and sequences. If the field exists in at least one index that's getting queried, that field will actually be used in queries. If the field doesn't exist in any of the indices, all its mentions in query filters will be replaced with `null`. For sequences, optional fields as keys can have `null` as a value, whereas non-optional fields will be restricted to non-null values only. For example, a query like ``` sequence by ?process.entity_id, process.pid [process where transID == 2] [file where transID == 0] with runs=2 ``` can return a sequence with a join key `[null,123]`. If the sequence will use `process.pid` as an optional field (`sequence by ?process.entity_id, ?process.pid`), as well, the sequence can now return join keys as `[null,123]`, `[null,null]`, `[512,null]`.
Add optional fields, with usage inside queries' filters and as join keys.
Optional fields have a question mark in front of their name (
?some_field
) and can be used in standalone event queries and sequences.If the field exists in at least one index that's getting queried, that field will actually be used in queries.
If the field doesn't exist in any of the indices, all its mentions in query filters will be replaced with
null
.For sequences, optional fields as keys can have
null
as a value, whereas non-optional fields will be restricted to non-null values only.For example, a query like
can return a sequence with a join key
[null,123]
.If the sequence will use
process.pid
as an optional field (sequence by ?process.entity_id, ?process.pid
), as well, the sequence can now return join keys as[null,123]
,[null,null]
,[512,null]
.