fixed unescaped control character c causing jq parsing errors #1082

Merged


hitenkoku
Collaborator

What Changed

  • Fixed unescaped control characters (such as ^C) that caused jq parsing errors

I would appreciate it if you could review when you have time.

@hitenkoku hitenkoku added the invalid This doesn't seem right label Jun 4, 2023
@hitenkoku hitenkoku self-assigned this Jun 4, 2023
@hitenkoku hitenkoku linked an issue Jun 4, 2023 that may be closed by this pull request
@hitenkoku
Collaborator Author

Evidence

  • main
> ./main.exe json-timeline -f jq-parse-error.evtx -o main.json
> cat .\main.json | jq . > jq-main.json
parse error: Invalid string: control characters from U+0000 through U+001F must be escaped at line 10898, column 24
  • this PR
> ./1068.exe json-timeline -f jq-parse-error.evtx -o 1068.json
> cat .\1068.json | jq . > jq-1068json
>
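
For anyone reproducing the error above, the offending byte can be located without jq. The following is a minimal sketch, not part of this PR: `first_raw_control_in_string` is a hypothetical helper that scans JSON text and reports the 1-based line and byte column of the first raw control byte (below 0x20) found inside a string literal. Its column counting is an assumption and may not match jq's own numbering exactly.

```rust
/// Return the 1-based (line, byte column) of the first raw control byte
/// (0x00..=0x1F) that appears inside a JSON string literal, or `None`
/// if every string in the text is free of raw control bytes.
/// Hypothetical helper for illustration; not taken from hayabusa.
fn first_raw_control_in_string(json: &str) -> Option<(usize, usize)> {
    let (mut line, mut col) = (1usize, 0usize);
    let mut in_string = false; // are we between two unescaped quotes?
    let mut escaped = false; // was the previous byte a backslash?
    for b in json.bytes() {
        col += 1;
        if in_string {
            if b < 0x20 {
                // Raw control byte inside a string: JSON forbids this.
                return Some((line, col));
            }
            if escaped {
                escaped = false;
            } else if b == b'\\' {
                escaped = true;
            } else if b == b'"' {
                in_string = false;
            }
        } else if b == b'"' {
            in_string = true;
        }
        // Newlines outside strings are ordinary whitespace; track position.
        if b == b'\n' {
            line += 1;
            col = 0;
        }
    }
    None
}
```

Whitespace between tokens (including newlines and tabs) is ignored here, since JSON only forbids raw control characters inside strings.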

@codecov

codecov bot commented Jun 4, 2023

Codecov Report

Patch coverage: 85.71% and project coverage change: +0.01 🎉

Comparison is base (29dc0a0) 82.20% compared to head (9efa23b) 82.21%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1082      +/-   ##
==========================================
+ Coverage   82.20%   82.21%   +0.01%     
==========================================
  Files          24       24              
  Lines       19900    19967      +67     
==========================================
+ Hits        16358    16416      +58     
- Misses       3542     3551       +9     
Impacted Files Coverage Δ
src/afterfact.rs 67.15% <75.00%> (+0.12%) ⬆️
src/detections/configs.rs 66.86% <100.00%> (+1.00%) ⬆️


@YamatoSecurity
Collaborator

@hitenkoku Thanks so much! It will take a while to take benchmarks. I'll let you know when it's done.

Collaborator

@fukusuket fukusuket left a comment


@YamatoSecurity @hitenkoku
I have a question about the fix!
After this fix, the control characters will be removed (not replaced). Is that the expected behavior?
If that behavior matches the specification, then LGTM!🚀

@YamatoSecurity
Collaborator

@fukusuket I think it is better to leave them in and preserve the data. I thought ^C had to be escaped like \^C but that is not the case. @hitenkoku How are you avoiding the jq parse errors?

@YamatoSecurity
Collaborator

@hitenkoku I figured it out. ^C needs to be replaced with \u0003, and likewise for the other control characters.
I wonder if that will cause a problem when importing, etc., though.

@YamatoSecurity
Collaborator

Here is the full list

^@ (NUL) -> \u0000
^A (SOH) -> \u0001
^B (STX) -> \u0002
^C (ETX) -> \u0003
^D (EOT) -> \u0004
^E (ENQ) -> \u0005
^F (ACK) -> \u0006
^G (BEL) -> \u0007
^H (BS) -> \u0008
^I (TAB) -> \u0009
^J (LF) -> \u000A
^K (VT) -> \u000B
^L (FF) -> \u000C
^M (CR) -> \u000D
^N (SO) -> \u000E
^O (SI) -> \u000F
^P (DLE) -> \u0010
^Q (DC1) -> \u0011
^R (DC2) -> \u0012
^S (DC3) -> \u0013
^T (DC4) -> \u0014
^U (NAK) -> \u0015
^V (SYN) -> \u0016
^W (ETB) -> \u0017
^X (CAN) -> \u0018
^Y (EM) -> \u0019
^Z (SUB) -> \u001A
^[ (ESC) -> \u001B
^\ (FS) -> \u001C
^] (GS) -> \u001D
^^ (RS) -> \u001E
^_ (US) -> \u001F
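
The mapping above is mechanical: each code point below U+0020 becomes a backslash, `u`, and four hex digits. A minimal sketch of that replacement in Rust follows; `escape_control_chars` is a hypothetical name, and this is not necessarily how the PR implements it.

```rust
use std::fmt::Write;

/// Replace every raw control character (U+0000 through U+001F) with its
/// six-character JSON escape `\uXXXX`, leaving all other characters as-is.
/// Hypothetical helper for illustration; not taken from hayabusa.
fn escape_control_chars(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        if (c as u32) < 0x20 {
            // `{:04X}` pads to four uppercase hex digits, e.g. ETX -> \u0003.
            write!(out, "\\u{:04X}", c as u32).expect("writing to a String cannot fail");
        } else {
            out.push(c);
        }
    }
    out
}
```

JSON also permits the short forms `\b`, `\t`, `\n`, `\f` and `\r` for five of these characters; the sketch uses the `\uXXXX` form throughout to match the list.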

@hitenkoku
Collaborator Author

@YamatoSecurity I apologize for taking so long. Would you please confirm that the correction has been completed?

@YamatoSecurity YamatoSecurity added this to the v2.6.0 milestone Jun 9, 2023
Collaborator

@YamatoSecurity YamatoSecurity left a comment


LGTM! Thanks so much!
@fukusuket Could you check this as well?

@fukusuket
Collaborator

@hitenkoku @YamatoSecurity
I've verified that control characters are now Unicode-escaped!!

But the output seems to have slowed down, as shown below 😢 (I checked with all-evtx.tgz (6.1 GB))
./hayabusa json-timeline -d ../all-evtx -o 1.json --debug

main

Elapsed time: 00:06:54.837
Rule Parse Processing Time: 00:00:01.093
Analysis Processing Time: 00:06:36.165
Output Processing Time: 00:00:17.579

This PR

Elapsed time: 00:08:05.195
Rule Parse Processing Time: 00:00:01.213
Analysis Processing Time: 00:06:34.649
Output Processing Time: 00:01:29.330

I'm sorry, I have no good ideas right now, but is there a way to replace the control characters without slowing it down too much? 🤔

@YamatoSecurity
Collaborator

@fukusuket Thanks for taking the benchmarks. (I forgot to do that.. m(__)m )
Indeed, that is quite a slowdown. Maybe ask GPT-4 how to optimize? ^^v

@fukusuket
Collaborator

fukusuket commented Jun 10, 2023

The code below already iterates over the string with chars():
https://github.com/Yamato-Security/hayabusa/blob/main/src/detections/message.rs#L254-L258
If the control characters could be replaced in that same loop, it might improve performance a bit...?

I hope GPT-4 will give me the answer...w
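
One possible shape for the idea above, sketched under assumptions: control characters are always single bytes in UTF-8, so a cheap byte scan can detect the common case where nothing needs escaping and return the input without allocating. `escape_control_chars_fast` is a hypothetical name, the code is independent of the loop in message.rs, and it has not been benchmarked against hayabusa.

```rust
use std::borrow::Cow;
use std::fmt::Write;

/// Escape control characters (U+0000 through U+001F) as `\uXXXX`,
/// borrowing the input unchanged when it contains none of them.
/// Hypothetical helper for illustration; not taken from hayabusa.
fn escape_control_chars_fast(input: &str) -> Cow<'_, str> {
    // Bytes below 0x20 never occur inside multi-byte UTF-8 sequences,
    // so scanning bytes is enough to find the first control character.
    let first = match input.bytes().position(|b| b < 0x20) {
        Some(pos) => pos,
        None => return Cow::Borrowed(input), // fast path: nothing to escape
    };

    // Slow path: copy the clean prefix once, then escape the remainder.
    let mut out = String::with_capacity(input.len() + 5);
    out.push_str(&input[..first]);
    for c in input[first..].chars() {
        if (c as u32) < 0x20 {
            write!(out, "\\u{:04X}", c as u32).expect("writing to a String cannot fail");
        } else {
            out.push(c);
        }
    }
    Cow::Owned(out)
}
```

Returning `Cow` keeps the cost close to a single scan for fields without control characters, which is presumably the vast majority of event log values.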

@YamatoSecurity
Collaborator

Here are my benchmarks. Not a huge difference but indeed a little slower. (12%?)

Saved file: main.json (498.7 MB)

Elapsed time: 00:04:23.102
Errors were generated. Please check ./logs/errorlog-20230610_100253.log for details.

Rule Parse Processing Time: 00:00:02.171
Analysis Processing Time: 00:02:57.478
Output Processing Time: 00:01:23.453

Memory usage stats:
heap stats:     peak       total       freed     current        unit       count
  reserved:     3.0 GiB     3.0 GiB     0           3.0 GiB
 committed:     2.9 GiB     3.0 GiB   200.9 GiB  -197.9 GiB                          ok
     reset:     0
    purged:    15.4 GiB
   touched:    64.2 KiB     5.1 MiB    21.9 GiB   -21.9 GiB                          ok
  segments:    18          83          71          12                                not all freed!
-abandoned:     0           0           0           0                                ok
   -cached:     0           0           0           0                                ok
     pages:     0           0         295.7 Ki   -295.7 Ki                           ok
-abandoned:     0           0           0           0                                ok
 -extended:     0
 -noretire:     0
     mmaps:     0
   commits:     0
    resets:     0
    purges:     8.2 Ki
   threads:    32          32           0          32                                not all freed!
  searches:     0.0 avg
numa nodes:     1
   elapsed:   263.107 s
   process: user: 2192.814 s, system: 39.641 s, faults: 0, rss: 1.9 GiB, commit: 2.9 GiB

PR:

Saved file: 1068.json (498.7 MB)

Elapsed time: 00:04:33.380
Errors were generated. Please check ./logs/errorlog-20230610_100946.log for details.

Rule Parse Processing Time: 00:00:02.083
Analysis Processing Time: 00:02:57.179
Output Processing Time: 00:01:34.117

Memory usage stats:
heap stats:     peak       total       freed     current        unit       count
  reserved:     3.0 GiB     3.0 GiB     0           3.0 GiB
 committed:     1.9 GiB     3.0 GiB   207.7 GiB  -204.7 GiB                          ok
     reset:     0
    purged:    15.6 GiB
   touched:    64.2 KiB     5.1 MiB    21.9 GiB   -21.9 GiB                          ok
  segments:    18          83          71          12                                not all freed!
-abandoned:     0           0           0           0                                ok
   -cached:     0           0           0           0                                ok
     pages:     0           0         295.8 Ki   -295.8 Ki                           ok
-abandoned:     0           0           0           0                                ok
 -extended:     0
 -noretire:     0
     mmaps:     0
   commits:     0
    resets:     0
    purges:     8.1 Ki
   threads:    32          32           0          32                                not all freed!
  searches:     0.0 avg
numa nodes:     1
   elapsed:   273.384 s
   process: user: 2153.941 s, system: 41.945 s, faults: 0, rss: 2.0 GiB, commit: 1.9 GiB

@fukusuket
Collaborator

I'm really sorry.. 🙇 I was comparing to the #1090 branch, not the main branch.
The correct result is as follows.

main

Elapsed time: 00:07:45.238
Rule Parse Processing Time: 00:00:01.110
Analysis Processing Time: 00:06:21.660
Output Processing Time: 00:01:22.466

This PR

Elapsed time: 00:08:05.195
Rule Parse Processing Time: 00:00:01.213
Analysis Processing Time: 00:06:34.649
Output Processing Time: 00:01:29.330

#1090

Elapsed time: 00:06:54.837
Rule Parse Processing Time: 00:00:01.093
Analysis Processing Time: 00:06:36.165
Output Processing Time: 00:00:17.579

@YamatoSecurity
Collaborator

After #1089 is merged into main and this branch, I'll take benchmarks and compare this PR vs main.

@hitenkoku
Collaborator Author

@YamatoSecurity @fukusuket
Thanks for checking. I merged the #1089 code into this branch in 9efa23b.

@YamatoSecurity
Collaborator

main:
Elapsed time: 2:13:36
Analysis: 2:04:14
Output: 9:17
Memory: 48.1 GB

this PR:
Elapsed time: 2:17:36
Analysis: 2:05:23
Output: 12:05
Memory: 48.1 GB

Elapsed time is about 3% slower and memory usage is the same so I think it is good to merge.
@fukusuket If you have time, could you see if there is a way to make it faster and submit another PR if it is possible?
Although, since we have to scan everything, there may not be a way to do it faster.
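
The percentages can be checked directly from the timings quoted above. A small sketch (helper names are hypothetical) that converts `h:mm:ss` or `m:ss` strings to seconds and computes the relative slowdown:

```rust
/// Parse a duration written as "h:mm:ss", "m:ss", or "s" into whole seconds.
/// Returns `None` if any component is not a non-negative integer.
fn parse_hms(s: &str) -> Option<u64> {
    let mut total = 0u64;
    for part in s.split(':') {
        total = total * 60 + part.parse::<u64>().ok()?;
    }
    Some(total)
}

/// Relative slowdown of `new` compared to `base`, in percent.
/// Returns `None` on a parse failure or a zero baseline.
fn slowdown_percent(base: &str, new: &str) -> Option<f64> {
    let b = parse_hms(base)? as f64;
    let n = parse_hms(new)? as f64;
    if b == 0.0 {
        return None;
    }
    Some((n - b) / b * 100.0)
}
```

With the figures above, elapsed time goes from 8016 s to 8256 s, roughly 3% slower, while output processing alone goes from 557 s to 725 s, roughly 30% slower.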

@YamatoSecurity
Collaborator

I took benchmarks comparing to 2.5.1 as well. Thanks to the speed improvements in the last PR, it is almost the same speed as 2.5.1, even with the control character replacement and the checks to make sure fields get parsed correctly.

This PR

json-timeline minimal
elapsed time: 2:11:32
memory: 33.7 GB
filesize: 9.6 GB

2.5.1

json-timeline minimal
elapsed time: 2:10:44
memory: 33.7 GB
filesize: 9.6 GB

Collaborator

@fukusuket fukusuket left a comment


It's good that the performance is about the same! LGTM!!🚀

@hitenkoku
Collaborator Author

Thanks for your review. I'll merge it.

@hitenkoku hitenkoku merged commit 8f52995 into main Jun 11, 2023
@hitenkoku hitenkoku deleted the 1068-unescaped-control-character-c-causing-jq-parsing-errors branch August 5, 2023 12:11
Successfully merging this pull request may close these issues.

Unescaped control character ^C causing jq parsing errors
3 participants