Wildcard field optimised for wildcard queries #49993

Merged 32 commits on Mar 16, 2020
baf369f
First cut at Wildcard field optimised for wildcard queries
markharwood Dec 5, 2019
d15571b
Added docs, added illegal argument checking to test
markharwood Feb 24, 2020
42e55bc
Docs change
markharwood Feb 24, 2020
c7714a0
Docs change
markharwood Feb 24, 2020
67c8a57
Docs change
markharwood Feb 24, 2020
dac8408
Doc change
markharwood Feb 24, 2020
8f7f0a4
Docs fix
markharwood Feb 24, 2020
dddd1ae
Remove redundant inter test, add ignore_above test to unit test
markharwood Feb 25, 2020
3cb80a4
Added support for aggs
markharwood Feb 25, 2020
a4af4c7
Remove redundant code now that we’ve settled on 3grams with binary do…
markharwood Feb 25, 2020
f9893cd
Bugfix - BinaryDVIndexFieldData.sortField had the wrong implementation.
markharwood Feb 26, 2020
ce47c6c
Addressing latest review comments
markharwood Mar 4, 2020
0714c13
Renamed field from `wildcard_keyword ` to `wildcard`
markharwood Mar 4, 2020
b80c231
Added REST tests for sorting and aggs
markharwood Mar 4, 2020
377f81f
Fix invalid docs reference
markharwood Mar 4, 2020
095d0f3
Renamed WildcardOnBinaryDVQuery to AutomatonQueryOnBinaryDV.
markharwood Mar 5, 2020
c41f208
Removed outdated limitation from docs
markharwood Mar 5, 2020
39f248f
Addressed latest review comments apart from support for arrays. That’…
markharwood Mar 5, 2020
beabe03
Unused import
markharwood Mar 5, 2020
e6bd8b0
Dammit. Line length
markharwood Mar 5, 2020
488e64a
Fix rest test bug
markharwood Mar 5, 2020
ecb021d
Add support for prefix query. Set tokenised =false on elasticsearch-f…
markharwood Mar 6, 2020
fa527dd
Add support for multi fields
markharwood Mar 9, 2020
40c7929
Unused import
markharwood Mar 9, 2020
6ccdc3b
Checkstyle fix
markharwood Mar 9, 2020
9e0b2b8
Removed String.getBytes()
markharwood Mar 9, 2020
6255347
Bugfix - overly long byte arrays being serialised for field values.
markharwood Mar 9, 2020
93dbdd0
Unused import
markharwood Mar 9, 2020
ad132af
Added max clause protection and related test
markharwood Mar 10, 2020
f7656fa
Removed TaperedNgramTokenizer and numChars. Changed encoding of terms…
markharwood Mar 11, 2020
8435ec6
Addressed Adrien’s review comments (minus the use of custom Analyzer)
markharwood Mar 12, 2020
9641b72
Switched to reusing same Analyzer for all tokenisation. Added checks …
markharwood Mar 13, 2020
4 changes: 3 additions & 1 deletion docs/reference/mapping/types.asciidoc
@@ -7,7 +7,7 @@ document:
[float]
=== Core datatypes

-string:: <<text,`text`>> and <<keyword,`keyword`>>
+string:: <<text,`text`>>, <<keyword,`keyword`>> and <<wildcard,`wildcard`>>
<<number>>:: `long`, `integer`, `short`, `byte`, `double`, `float`, `half_float`, `scaled_float`
<<date>>:: `date`
<<date_nanos>>:: `date_nanos`
@@ -131,3 +131,5 @@ include::types/token-count.asciidoc[]
include::types/shape.asciidoc[]

include::types/constant-keyword.asciidoc[]

include::types/wildcard.asciidoc[]
53 changes: 53 additions & 0 deletions docs/reference/mapping/types/wildcard.asciidoc
@@ -0,0 +1,53 @@
[role="xpack"]
[testenv="basic"]
[[wildcard]]
=== Wildcard datatype
++++
<titleabbrev>Wildcard</titleabbrev>
++++

A `wildcard` field stores values optimised for wildcard grep-like queries.
Wildcard queries are possible on other field types but suffer from constraints:
* `text` fields limit matching of any wildcard expression to individual tokens rather than the original whole value held in a field.
* `keyword` fields are untokenized but slow at performing wildcard queries (especially patterns with leading wildcards).

Internally the `wildcard` field indexes the whole field value using ngrams and stores the full string.
The index is used as a rough filter to cut down the number of candidate values, which are then verified by retrieving and checking the full values.
This field is especially well suited to run grep-like queries on log lines. Storage costs are typically lower than those of `keyword`
fields but search speeds for exact matches on full terms are slower.
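
The two-phase approach described above (ngram index as a rough filter, then a check against the full stored value) can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the Lucene-based code in this PR: it assumes plain 3-character ngrams held in an in-memory map, and the class `NgramPrefilterSketch` and its methods `add`, `candidates`, `search` and `globMatches` are hypothetical names. Escaping, case handling and the real term encoding are ignored.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: trigram postings as a rough filter, full-value check as verification.
public class NgramPrefilterSketch {
    private static final int N = 3; // assumption: plain 3-character ngrams
    private final List<String> values = new ArrayList<>();
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Stores the full value and indexes every 3-character substring of it. */
    public int add(String value) {
        int docId = values.size();
        values.add(value);
        for (int i = 0; i + N <= value.length(); i++) {
            postings.computeIfAbsent(value.substring(i, i + N), k -> new HashSet<>()).add(docId);
        }
        return docId;
    }

    /** Phase 1: ids of values that contain every trigram of every literal run in the pattern. */
    public Set<Integer> candidates(String pattern) {
        Set<Integer> result = new TreeSet<>();
        for (int docId = 0; docId < values.size(); docId++) {
            result.add(docId);
        }
        // Literal runs are the pieces of the pattern between wildcard characters.
        for (String literal : pattern.split("[*?]+")) {
            for (int i = 0; i + N <= literal.length(); i++) {
                String gram = literal.substring(i, i + N);
                result.retainAll(postings.getOrDefault(gram, Collections.<Integer>emptySet()));
            }
        }
        return result;
    }

    /** Phase 2: verify each candidate against its full stored value. */
    public List<String> search(String pattern) {
        List<String> hits = new ArrayList<>();
        for (int docId : candidates(pattern)) {
            String value = values.get(docId);
            if (globMatches(pattern, value)) {
                hits.add(value);
            }
        }
        return hits;
    }

    /** Whole-value glob match: '*' matches any run of characters, '?' exactly one. */
    static boolean globMatches(String pattern, String text) {
        int p = 0, t = 0, star = -1, mark = 0;
        while (t < text.length()) {
            if (p < pattern.length() && pattern.charAt(p) == '*') {
                star = p++;   // remember the star and try matching it against nothing
                mark = t;
            } else if (p < pattern.length()
                    && (pattern.charAt(p) == '?' || pattern.charAt(p) == text.charAt(t))) {
                p++;
                t++;
            } else if (star >= 0) {
                p = star + 1; // backtrack: let the last star absorb one more character
                t = ++mark;
            } else {
                return false;
            }
        }
        while (p < pattern.length() && pattern.charAt(p) == '*') {
            p++;
        }
        return p == pattern.length();
    }
}
```

With the values "This string can be quite lengthy" and "lengthy and quite" indexed, the pattern `*quite*lengthy` keeps both as candidates because both contain all the required trigrams, but only the first survives verification. The filter can return false positives of this kind but, in this sketch, never drops a true match, because every matching value must contain each literal run of the pattern.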

You index and search a wildcard field as follows:

[source,console]
--------------------------------------------------
PUT my_index
{
  "mappings": {
    "properties": {
      "my_wildcard": {
        "type": "wildcard"
      }
    }
  }
}

PUT my_index/_doc/1
{
  "my_wildcard" : "This string can be quite lengthy"
}

GET my_index/_search
{
  "query": {
    "wildcard": {
      "my_wildcard": {
        "value": "*quite*lengthy"
      }
    }
  }
}
--------------------------------------------------


==== Limitations

* `wildcard` fields are untokenized like keyword fields, so do not support queries that rely on word positions such as phrase queries.

@@ -21,8 +21,6 @@

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.SortField;
-import org.apache.lucene.search.SortedSetSortField;
-import org.apache.lucene.search.SortedSetSelector;
import org.elasticsearch.common.Nullable;
import org.elasticsearch.common.util.BigArrays;
import org.elasticsearch.index.Index;
@@ -54,20 +52,7 @@ public BinaryDVAtomicFieldData loadDirect(LeafReaderContext context) throws Exce
public SortField sortField(@Nullable Object missingValue, MultiValueMode sortMode, XFieldComparatorSource.Nested nested,
boolean reverse) {
XFieldComparatorSource source = new BytesRefFieldComparatorSource(this, missingValue, sortMode, nested);
-/**
- * Check if we can use a simple {@link SortedSetSortField} compatible with index sorting and
- * returns a custom sort field otherwise.
- */
-if (nested != null ||
-(sortMode != MultiValueMode.MAX && sortMode != MultiValueMode.MIN) ||
-(source.sortMissingFirst(missingValue) == false && source.sortMissingLast(missingValue) == false)) {
-return new SortField(getFieldName(), source, reverse);
-}
-SortField sortField = new SortedSetSortField(fieldName, reverse,
-sortMode == MultiValueMode.MAX ? SortedSetSelector.Type.MAX : SortedSetSelector.Type.MIN);
-sortField.setMissingValue(source.sortMissingLast(missingValue) ^ reverse ?
-SortedSetSortField.STRING_LAST : SortedSetSortField.STRING_FIRST);
-return sortField;
+return new SortField(getFieldName(), source, reverse);
}

@Override
@@ -613,6 +613,16 @@ public boolean isFlattenedAllowed() {
public boolean isVectorsAllowed() {
return allowForAllLicenses();
}


/**
* Determine if Wildcard support should be enabled.
* <p>
* Wildcard is available for all license types except {@link OperationMode#MISSING}
*/
public synchronized boolean isWildcardAllowed() {
return status.active;
}

public boolean isOdbcAllowed() {
return isAllowedByLicense(OperationMode.PLATINUM);
@@ -0,0 +1,218 @@
setup:
  - skip:
      features: headers
      version: " - 7.9.99"
      reason: "wildcard fields were added from 8.0"

  - do:
      indices.create:
        index: test-index
        body:
          settings:
            number_of_replicas: 0
          mappings:
            properties:
              my_wildcard:
                type: wildcard
  - do:
      index:
        index: test-index
        id: 1
        body:
          my_wildcard: hello world
  - do:
      index:
        index: test-index
        id: 2
        body:
          my_wildcard: goodbye world

  - do:
      indices.refresh: {}

---
"Short prefix query":
  - do:
      search:
        body:
          track_total_hits: true
          query:
            wildcard:
              my_wildcard: {value: "hel*" }

  - match: {hits.total.value: 1}

---
"Long prefix query":
  - do:
      search:
        body:
          track_total_hits: true
          query:
            wildcard:
              my_wildcard: {value: "hello wor*" }

  - match: {hits.total.value: 1}

---
"Short unrooted query":
  - do:
      search:
        body:
          track_total_hits: true
          query:
            wildcard:
              my_wildcard: {value: "*ello*" }

  - match: {hits.total.value: 1}

---
"Long unrooted query":
  - do:
      search:
        body:
          track_total_hits: true
          query:
            wildcard:
              my_wildcard: {value: "*ello worl*" }

  - match: {hits.total.value: 1}

---
"Short suffix query":
  - do:
      search:
        body:
          track_total_hits: true
          query:
            wildcard:
              my_wildcard: {value: "*ld" }

  - match: {hits.total.value: 2}

---
"Long suffix query":
  - do:
      search:
        body:
          track_total_hits: true
          query:
            wildcard:
              my_wildcard: {value: "*ello world" }

  - match: {hits.total.value: 1}

---
"No wildcard wildcard query":
  - do:
      search:
        body:
          track_total_hits: true
          query:
            wildcard:
              my_wildcard: {value: "hello world" }

  - match: {hits.total.value: 1}

---
"Term query on wildcard field":
  - do:
      search:
        body:
          track_total_hits: true
          query:
            term:
              my_wildcard: "hello world"

  - match: {hits.total.value: 1}

---
"Terms query on wildcard field":
  - do:
      search:
        body:
          track_total_hits: true
          query:
            terms:
              my_wildcard: ["hello world", "does not exist"]

  - match: {hits.total.value: 1}

---
"Prefix query on wildcard field":
  - do:
      search:
        body:
          track_total_hits: true
          query:
            prefix:
              my_wildcard:
                value: "hell*"

  - match: {hits.total.value: 1}

---
"Sequence fail":
  - do:
      search:
        body:
          track_total_hits: true
          query:
            wildcard:
              my_wildcard: {value: "*world*hello*" }

  - match: {hits.total.value: 0}

---
"Aggs work":
  - do:
      search:
        body:
          track_total_hits: true
          query:
            wildcard:
              my_wildcard: {value: "*world*" }
          aggs:
            top_vals:
              terms: {field: "my_wildcard" }

  - match: {hits.total.value: 2}
  - length: { aggregations.top_vals.buckets: 2 }

---
"Sort works":
  - do:
      search:
        body:
          track_total_hits: true
          sort: [ { "my_wildcard": "desc" } ]

  - match: { hits.total.value: 2 }
  - length: { hits.hits: 2 }
  - match: { hits.hits.0._id: "1" }
  - match: { hits.hits.1._id: "2" }

  - do:
      search:
        body:
          track_total_hits: true
          sort: [ { "my_wildcard": "asc" } ]

  - match: { hits.total.value: 2 }
  - length: { hits.hits: 2 }
  - match: { hits.hits.0._id: "2" }
  - match: { hits.hits.1._id: "1" }
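
The hit counts asserted by the wildcard-query tests above can be cross-checked outside Elasticsearch with a small standalone program. The sketch below is only an illustration: it translates a wildcard pattern into a `java.util.regex.Pattern` and matches it against the two test values as whole strings. The class `WildcardExpectationsSketch` and its methods `toRegex` and `countHits` are hypothetical names, backslash escaping is not handled, and this says nothing about how the field type evaluates queries internally.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: replay the REST test patterns against the two indexed values.
public class WildcardExpectationsSketch {
    // The two values indexed in the test setup.
    static final List<String> DOCS = Arrays.asList("hello world", "goodbye world");

    /** Translates '*' to ".*" and '?' to "."; everything else is quoted literally. */
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*' || c == '?') {
                if (literal.length() > 0) {
                    regex.append(Pattern.quote(literal.toString()));
                    literal.setLength(0);
                }
                regex.append(c == '*' ? ".*" : ".");
            } else {
                literal.append(c);
            }
        }
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    /** Counts the values that match the pattern as a whole, not per token. */
    static long countHits(String wildcard) {
        Pattern pattern = toRegex(wildcard);
        return DOCS.stream().filter(doc -> pattern.matcher(doc).matches()).count();
    }

    public static void main(String[] args) {
        for (String q : Arrays.asList("hel*", "*ello*", "*ld", "hello world", "*world*hello*")) {
            System.out.println(q + " -> " + countHits(q));
        }
    }
}
```

The "Sequence fail" case is the interesting one: `*world*hello*` finds both literals in "hello world" but in the wrong order, so it counts zero hits, which is why an ngram filter alone is not enough and the full value must be checked.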


18 changes: 18 additions & 0 deletions x-pack/plugin/wildcard/build.gradle
@@ -0,0 +1,18 @@
evaluationDependsOn(xpackModule('core'))

apply plugin: 'elasticsearch.esplugin'

esplugin {
name 'wildcard'
description 'A plugin for a keyword field type with efficient wildcard search'
classname 'org.elasticsearch.xpack.wildcard.Wildcard'
extendedPlugins = ['x-pack-core']
}
archivesBaseName = 'x-pack-wildcard'

dependencies {
compileOnly project(path: xpackModule('core'), configuration: 'default')
testCompile project(path: xpackModule('core'), configuration: 'testArtifacts')
}

integTest.enabled = false
@@ -0,0 +1,31 @@
/*
* Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
* or more contributor license agreements. Licensed under the Elastic License;
* you may not use this file except in compliance with the Elastic License.
*/

package org.elasticsearch.xpack.wildcard;

import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.index.mapper.Mapper;
import org.elasticsearch.plugins.MapperPlugin;
import org.elasticsearch.plugins.Plugin;
import org.elasticsearch.xpack.wildcard.mapper.WildcardFieldMapper;

import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

public class Wildcard extends Plugin implements MapperPlugin {


public Wildcard(Settings settings) {
}

@Override
public Map<String, Mapper.TypeParser> getMappers() {
Map<String, Mapper.TypeParser> mappers = new LinkedHashMap<>();
mappers.put(WildcardFieldMapper.CONTENT_TYPE, new WildcardFieldMapper.TypeParser());
return Collections.unmodifiableMap(mappers);
}
}