
KV Ingest Processor splitting on whitespace in message #31786

Open
Evesy opened this issue Jul 4, 2018 · 18 comments
Assignees
Labels
:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >enhancement Team:Data Management Meta label for data/management team

Comments

@Evesy

Evesy commented Jul 4, 2018

I'm trying to parse some logs in the logfmt format using the KV processor, example log line below:

time="2018-07-04T09:36:25Z" level=info msg="Schedule is not due, skipping" logSource="pkg/controller/schedule_controller.go:325" nextRunTime="2018-07-05 01:00:00 +0000 UTC" schedule=daily

The processor being used is:

"kv": {
  "field": "message",
  "field_split": " ",
  "value_split": "="
}

Due to the whitespace inside the quoted msg value, the log is being incorrectly split midway through the message, resulting in the msg field containing only "Schedule

There's a similar issue open for the Logstash equivalent plugin here: logstash-plugins/logstash-filter-kv#9

It's my understanding that quoted values as above should be treated and parsed as a single value, and the quotes should then be stripped from the resulting field value.

If this isn't the case, it would be good to expose options for this behaviour, as the kv processor is a lot less versatile without them.
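
For reference, the behaviour can be reproduced with a _simulate call (a minimal sketch using the log line and processor configuration above; the index and id values are just placeholders):

POST /_ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "reproduce kv split on whitespace",
    "processors": [
      {
        "kv": {
          "field": "message",
          "field_split": " ",
          "value_split": "="
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "message": "time=\"2018-07-04T09:36:25Z\" level=info msg=\"Schedule is not due, skipping\" logSource=\"pkg/controller/schedule_controller.go:325\" nextRunTime=\"2018-07-05 01:00:00 +0000 UTC\" schedule=daily"
      }
    }
  ]
}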

Cheers,
Mike

@dliappis
Contributor

dliappis commented Jul 4, 2018

Reading the docs about the kv processor, I think what's happening here is the expected behavior, since:

  • field_split: Regex pattern to use for splitting key-value pairs
  • value_split: Regex pattern to use for splitting the key from the value within a key-value pair

The KV processor may be too simple for what you need to achieve, given that some of your values are enclosed in double quotes (when they contain spaces) and some are not.

If the order of your fields doesn't change, perhaps the grok processor would be a better fit here; for example, you can use the quoted-string pattern to match values enclosed in double quotes.
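
For the example line in the issue, a grok-based processor could look roughly like the sketch below (field names are illustrative, and this assumes a fixed field order; note that the stock QUOTEDSTRING pattern captures the value including the surrounding quotes):

"grok": {
  "field": "message",
  "patterns": [
    "time=%{QUOTEDSTRING:time} level=%{WORD:level} msg=%{QUOTEDSTRING:msg} logSource=%{QUOTEDSTRING:logSource} nextRunTime=%{QUOTEDSTRING:nextRunTime} schedule=%{WORD:schedule}"
  ]
}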

@dliappis dliappis added the :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP label Jul 4, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@Evesy
Author

Evesy commented Jul 9, 2018

Thanks for the response @dliappis

Unfortunately, the fields aren't consistent across all applications; they just all use the logfmt format. It would be great if the kv processor were extended a bit further to match the capabilities of the Logstash one.

@original-brownbear
Member

original-brownbear commented Jul 18, 2018

@jakelandis I'll handle this unless you've already started on it? :)

@jakelandis
Contributor

@original-brownbear - all yours

@jakelandis jakelandis removed their assignment Jul 18, 2018
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jul 20, 2018
Added more capabilities supported by LS to the KV processor:
* Stripping of brackets and quotes from values (`include_brackets` in corresponding LS filter)
* Adding key prefixes
* Trimming specified chars from keys and values

Refactored the way the filter is configured to avoid conditionals during execution.
Refactored Tests a little to not have to add more redundant getters for new parameters.

Closes elastic#31786
original-brownbear added a commit that referenced this issue Jul 20, 2018
* INGEST: Extend KV Processor (#31789)

Added more capabilities supported by LS to the KV processor:
* Stripping of brackets and quotes from values (`include_brackets` in corresponding LS filter)
* Adding key prefixes
* Trimming specified chars from keys and values

Refactored the way the filter is configured to avoid conditionals during execution.
Refactored Tests a little to not have to add more redundant getters for new parameters.

Relates #31786
* Add documentation
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jul 21, 2018
* INGEST: Extend KV Processor (elastic#31789)

Added more capabilities supported by LS to the KV processor:
* Stripping of brackets and quotes from values (`include_brackets` in corresponding LS filter)
* Adding key prefixes
* Trimming specified chars from keys and values

Refactored the way the filter is configured to avoid conditionals during execution.
Refactored Tests a little to not have to add more redundant getters for new parameters.

Relates elastic#31786
* Add documentation
original-brownbear added a commit that referenced this issue Jul 21, 2018
* INGEST: Extend KV Processor (#31789)

Added more capabilities supported by LS to the KV processor:
* Stripping of brackets and quotes from values (`include_brackets` in corresponding LS filter)
* Adding key prefixes
* Trimming specified chars from keys and values

Refactored the way the filter is configured to avoid conditionals during execution.
Refactored Tests a little to not have to add more redundant getters for new parameters.

Relates #31786
* Add documentation
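
For reference, the capabilities added in #31789 surface as kv processor options such as trim_key, trim_value, prefix, and strip_brackets (the names used later in this thread). A minimal sketch against the original log line might look like the block below; note that these options strip quotes and add key prefixes, but they do not change where field_split splits, so quoted values containing spaces still need a regex-based field_split like the ones discussed further down:

"kv": {
  "field": "message",
  "field_split": " ",
  "value_split": "=",
  "prefix": "log.",
  "trim_value": "\""
}
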
@original-brownbear
Member

@jakelandis assigning you here since you wanted to experiment some more with this :)

@philippkahr
Contributor

Hi @original-brownbear, any idea when this will be pushed to / released in a mainstream version? I am currently writing a Filebeat module to dissect log messages sent from a Fortigate firewall: elastic/beats#13245

Here is an ingest _simulate pipeline sample. Oddly, the mac fields are also not showing up. I am facing the same problem as @Evesy, as the log format sometimes includes more or fewer fields, so grok is not suitable.

POST /_ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "_description",
    "processors": [
      {
        "kv": {
          "field_split": " ",
          "value_split": "=",
          "field": "message",
          "target_field": "fortinet.message",
          "ignore_failure": true,
          "exclude_keys":[
            "srccountry",
            "dstcountry"
          ],
          "trim_value": "\""
          
          
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "message": "date=\"2019-10-06\" time=\"19:09:27\" devname=\"FGT-2\" devid=\"FG101E4J17OOO702\" logid=\"0001000014\" type=\"traffic\" subtype=\"local\" level=\"notice\" vd=\"root\" eventtime=\"1570381767687891019\" tz=\"+0200\" srcip=\"81.8.45.152\" srcport=\"43688\" srcintf=\"vlan78\" srcintfrole=\"wan\" dstip=\"212.188.109.206\" dstport=\"63390\" dstintf=\"root\" dstintfrole=\"undefined\" sessionid=\"12336602\" proto=\"6\" action=\"deny\" policyid=\"0\" policytype=\"local-in-policy\" service=\"tcp/63390\" dstcountry=\"Austria\" srccountry=\"Russian Federation\" trandisp=\"noop\" duration=\"0\" sentbyte=\"0\" rcvdbyte=\"0\" sentpkt=\"0\" appcat=\"unscanned\" crscore=\"5\" craction=\"262144\" crlevel=\"low\" mastersrcmac=\"e0:5f:b9:65:b5:01\" srcmac=\"e0:5f:b9:65:b5:01\" srcserver=\"0\""
      }
    }
  ]
}

@philippkahr
Contributor

Hi @jakelandis @original-brownbear

do you happen to have any update on when it will be released? I really need it to finish my Fortinet module.

@rverchere

rverchere commented Dec 9, 2019

@philippkahr, I succeeded in parsing Fortigate logs (CEF format enabled) with the following configuration, after some headache and regex magic.

The field_split regex checks that the space comes right before a word followed by an equals sign (the value_split), and only splits there.

"field_split" : """\s(?![-_,:()\w ]+?(\s+|$))""",

The kv filter:

"kv" : {
  "on_failure" : [
    {
      "append" : {
        "field" : "error.message",
        "value" : "{{ _ingest.on_failure_message }}"
      }
    }
  ],
  "field" : "message",
  "field_split" : """\s(?![-_,:()\w ]+?(\s+|$))""",
  "value_split" : "=",
  "target_field" : "cef.extensions",
  "trim_key" : "FTNTGT"
}

@crystalwm

@philippkahr, I succeeded parsing Fortigate (CEF format enabled), with the following configuration, after some headache and regex magic.

The field_split regex check if the space is only before a word followed by an equal (the value_split).

"field_split" : """\s(?![-_,:()\w ]+?(\s+|$))""",

The kv filter:

"kv" : {
  "on_failure" : [
    {
      "append" : {
        "field" : "error.message",
        "value" : "{{ _ingest.on_failure_message }}"
      }
    }
  ],
  "field" : "message",
  "field_split" : """\s(?![-_,:()\w ]+?(\s+|$))""",
  "value_split" : "=",
  "target_field" : "cef.extensions",
  "trim_key" : "FTNTGT"
}

The double quote can also be excluded after modifying the regex @rverchere provided. Thanks a lot.
"field_split" : """\s(?![-_,:()\w\" ]+?(\s+|$))""",

@rverchere

Hi,

I've enhanced my KV filter with the following parameters (for the CEF format):

"field_split" : """\s(?![-_.,:()\w ]+?(\s+|$))""",
"value_split" : """(?<!\\)=""",

@rjernst rjernst added the Team:Data Management Meta label for data/management team label May 4, 2020
@nanjum88

Hi,

Stuck with a similar issue: I tried @rverchere's regex patterns, but I'm still struggling with spaces between the file paths used as values.

sev="INFO" msg="Audit access" cat="\[AUDIT\]" pol="ENC104456DB_SQL_Operational" uinfo="SQLService\\UnPriviligedServices,GRP_Jenkins_Build,SG_DBSvcAccts_Rstrct...\\DCO,dco.elmae" sproc="D:\\Program Files\\Microsoft SQL Server\\MSSQL10_50.ENC104456DB\\MSSQL\\Binn\\sqlservr.exe" act="read_file" gp="f:\\mssql_data" filePath="\\ENC79.mdf:MSSQL_DBCC7" key="None" denyStr="PERMIT" showStr="Code (1A,2M)"
 
sev="INFO" msg="Event" event="Guardpath \\\\bnk11701fs\\bnk11701\\BE6789\\encdata\\Patth is not valid - will not guard (reason: Invalid Guard Path)!"
 
sev="ERROR" msg="failed to contact host" shost="syslogserver" nexttime="Tue Jun 16 05:12:05 PDT 2020"

Field split regex: """\s(?![-_,:()\w\" ]+?(\s+|\d+|[,_\.]+|$))"""

  "kv": {
    "field": "syslog_message",
    "field_split": """\s(?![-_,:()\w\" ]+?(\s+|\d+|[,_\.]+|$))""",
    "value_split": """(?<!\\\s-:)=""",
    "strip_brackets": true,
    "ignore_failure": true,
    "ignore_missing": true
  }

@nanjum88

Got it working:

{
  "kv": {
    "field": "syslog_message",
    "field_split": """\s(?![-_,:()\w\"\\! ]+?(\s+|\d+|[,_\.]+|$))""",
    "value_split": """(?<!\\)=""",
    "ignore_failure": true,
    "ignore_missing": true
  }
}

@bargarfj01

bargarfj01 commented Apr 17, 2021

got it working -

{
  "kv": {
    "field": "syslog_message",
    "field_split": """\s(?![-_,:()\w\"\\! ]+?(\s+|\d+|[,_\.]+|$))""",
    "value_split": """(?<!\\)=""",
    "ignore_failure": true,
    "ignore_missing": true
  }
}

Thank you for this example. It works with my kv use case as well.
Could you publish this regex somewhere as an example of parsing with kv? I've spent many hours trying to invent my own solution, and another couple of hours searching for a solution made by someone else.

@rowi9631

\s(?![-_,:()\w" ]+?(\s+|$))

I tried to use this example in the built-in ingest pipeline for Fortigate logs, but any time I modify the KV processor via the UI, it looks like my field_split regex (the example mentioned above) gets mangled; see the picture below.

[screenshot: the field_split regex shown mangled in the ingest pipeline UI]

@jakelandis
Contributor

Another alternative is to use the dissect processor, which supports k/v pairing too, but you need to know the shape of the message ahead of time. In the example below, I would need to know that there are 4 k/v pairs and that the second and fourth ones do not have quotes. Not ideal, but possibly helpful in some scenarios.

POST /_ingest/pipeline/_simulate?verbose
{
  "pipeline": {
    "description": "my test",
    "processors": [
      {
        "dissect": {
          "field": "message",
          "pattern" : "%{*a}=\"%{&a}\" %{*b}=%{&b} %{*c}=\"%{&c}\" %{*d}=%{&d}"
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "message": """time="2021-12-03T16:55:22Z" level=info msg="Here is the message. Still part of the message" application=something"""
      }
    }
  ]
}

@threatangler-jp

threatangler-jp commented Mar 8, 2022

We have been unable to get any of these solutions to work. Our log uses = as the key-value separator and space as the delimiter, but there are sometimes spaces in the values. A field may also contain other special characters such as " \ / & and others. Also, the number of fields in a log is variable, so I don't think the dissect option would work. Any advice?

@arve0

arve0 commented Mar 28, 2022

Given you have no = inside your values, this works for me:

"kv": {
    "field": "message",
    "field_split": "\\s(?![^=]+?(\\s|$))",
    "value_split": "=",
    "target_field": "log",
    "ignore_missing": true,
    "strip_brackets": true,
    "ignore_failure": true
}

The negative lookahead means "any characters that are not =, followed by whitespace or end of line". Test it with your own log lines at regex101.com.
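
Putting that together for the original log line from this issue, a _simulate request would look like the sketch below (the expectation, with strip_brackets enabled, is that the surrounding double quotes are also removed from values such as log.msg):

POST /_ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "kv": {
          "field": "message",
          "field_split": "\\s(?![^=]+?(\\s|$))",
          "value_split": "=",
          "target_field": "log",
          "ignore_missing": true,
          "strip_brackets": true,
          "ignore_failure": true
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "time=\"2018-07-04T09:36:25Z\" level=info msg=\"Schedule is not due, skipping\" logSource=\"pkg/controller/schedule_controller.go:325\" nextRunTime=\"2018-07-05 01:00:00 +0000 UTC\" schedule=daily"
      }
    }
  ]
}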
