This Code Provides a possibility to split large Reports into Chunks #680
@@ -0,0 +1,25 @@
# Mail collector bots

Review comment: Remove all README.md files on this pull request and add the documentation text to the following location:

This file should contain the documentation for:

* Generic Mail Attachment Fetcher
* Generic Mail URL Fetcher

Currently, no documentation is available.

## Generic Mail URL Fetcher

### Chunking

For line-based inputs the bot can split up large reports into smaller chunks.

This is particularly important for setups that use Redis as a message queue,
which has a per-message size limitation of 512 MB.

To configure chunking, set `chunk_size` to a value in bytes.
`chunk_replicate_header` determines whether the header line should be repeated
for each chunk that is passed on to a parser bot.

Specifically, to configure a large file input to work around Redis' size
limitation, set `chunk_size` to something like `384000000`, i.e. ~384 MB.
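The ~384 MB figure follows from Redis' limit and the base64 overhead described below; a minimal sketch of the arithmetic:

```python
# Rough budget for chunk_size under Redis' 512 MB per-message limit.
# The raw data is base64-encoded inside the report, which inflates it
# by a factor of 4/3, so raw chunks may use at most 3/4 of the limit.
REDIS_LIMIT = 512 * 1000 * 1000    # per-message limit in bytes
raw_budget = REDIS_LIMIT * 3 // 4  # raw bytes that fit after encoding
print(raw_budget)                  # 384000000, i.e. ~384 MB
```

Report metadata also counts toward the limit, so choosing a value somewhat below this budget leaves headroom.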
@@ -0,0 +1,134 @@
Review comment: All this code should move to https://github.com/certtools/intelmq/blob/master/intelmq/lib/bot.py#L576

# -*- coding: utf-8 -*-
"""
Support for splitting large raw reports into smaller ones.

The main intention of this module is to help work around limitations in
Redis, which limits strings to 512 MB. Collector bots can use the
functions in this module to split the incoming data into smaller pieces
which can be sent as separate reports.

Collectors usually don't really know anything about the data they
collect, so the data cannot be reliably split into pieces in all cases.
This module can be used for those cases, though, where users know that
the data is actually in a line-based format and can easily be split into
pieces at newline characters. For this to work, some assumptions are
made:

- The data can be split at any newline character

  This would not work for, e.g., CSV-based formats which allow
  newlines in values as long as they're within quotes.

- The lines are much shorter than the maximum chunk size

  Obviously, if this condition does not hold, it may not be possible to
  split the data into small enough chunks at newline characters.

Other considerations:

- To accommodate CSV formats, the code can optionally replicate the
  first line of the file at the start of all chunks.

- The Redis limit applies to the entire IntelMQ report, not just the
  raw data. The report has some metadata in addition to the raw data,
  and the raw data is encoded as base64 in the report. The maximum
  chunk size must take this into account, by multiplying the actual
  limit by 3/4 and subtracting a generous amount for the metadata.
"""


def split_chunks(chunk, chunk_size):
    """Split a bytestring into chunk_size pieces at ASCII newline characters.

    The return value is a list of bytestring objects. Appending all of
    them yields a bytestring equal to the input string. All items in the
    list except the last item end in newline. The items are shorter than
    chunk_size if possible, but may be longer if the input data has
    places where the distance between two newline characters is too long.

    Note in particular that the last item may not end in a newline!
    """
    chunks = []

    while len(chunk) > chunk_size:
        newline_pos = chunk.rfind(b"\n", 0, chunk_size)
        if newline_pos == -1:
            # No newline available to make the chunk smaller than
            # chunk_size. Search forward to get a minimal chunk longer
            # than chunk_size.
            newline_pos = chunk.find(b"\n", chunk_size)

        if newline_pos == -1:
            # No newline in chunk, so this is a leftover that may have
            # to be combined with the next data read.
            chunks.append(chunk)
            chunk = b""
        else:
            split_pos = newline_pos + 1
            chunks.append(chunk[:split_pos])
            chunk = chunk[split_pos:]
    if chunk:
        chunks.append(chunk)

    return chunks
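A quick illustration of both paths of the splitting logic (the function is restated here so the snippet runs standalone; the inputs are made-up examples):

```python
def split_chunks(chunk, chunk_size):
    # Restated from the diff above so this snippet runs standalone:
    # split at the last newline before chunk_size, falling forward to
    # the next newline when none fits within the limit.
    chunks = []
    while len(chunk) > chunk_size:
        newline_pos = chunk.rfind(b"\n", 0, chunk_size)
        if newline_pos == -1:
            newline_pos = chunk.find(b"\n", chunk_size)
        if newline_pos == -1:
            chunks.append(chunk)
            chunk = b""
        else:
            chunks.append(chunk[:newline_pos + 1])
            chunk = chunk[newline_pos + 1:]
    if chunk:
        chunks.append(chunk)
    return chunks

# Normal case: pieces end at newlines and stay within the limit.
print(split_chunks(b"aa\nbb\ncc\ndd", 6))   # [b'aa\nbb\n', b'cc\ndd']
# Oversized line: the piece is allowed to exceed chunk_size.
print(split_chunks(b"looooooong\nx", 4))    # [b'looooooong\n', b'x']
```

Note that concatenating the pieces always reproduces the input exactly; only the cut points vary.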


def read_delimited_chunks(infile, chunk_size):
    """Yield the contents of infile in chunk_size pieces ending at newlines.

    The individual pieces, except for the last one, end in newlines and
    are smaller than chunk_size if possible.
    """
    leftover = b""

    while True:
        new_chunk = infile.read(chunk_size)
        chunks = split_chunks(leftover + new_chunk, chunk_size)
        leftover = b""
        # The last item in chunks has to be combined with the next chunk
        # read from the file because it may not actually stop at a
        # newline, and to avoid very small chunks.
        if chunks:
            leftover = chunks[-1]
            chunks = chunks[:-1]
        for chunk in chunks:
            yield chunk

        if not new_chunk:
            if leftover:
                yield leftover
            break
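To see how leftovers are carried over between reads, the two helpers can be exercised against an in-memory file (both are restated from the diff so the snippet runs standalone; the sample data is made up):

```python
import io

def split_chunks(chunk, chunk_size):
    # Restated from the diff so this snippet runs standalone.
    chunks = []
    while len(chunk) > chunk_size:
        newline_pos = chunk.rfind(b"\n", 0, chunk_size)
        if newline_pos == -1:
            newline_pos = chunk.find(b"\n", chunk_size)
        if newline_pos == -1:
            chunks.append(chunk)
            chunk = b""
        else:
            chunks.append(chunk[:newline_pos + 1])
            chunk = chunk[newline_pos + 1:]
    if chunk:
        chunks.append(chunk)
    return chunks

def read_delimited_chunks(infile, chunk_size):
    # Restated from the diff: read chunk_size pieces, carrying the last
    # (possibly partial) piece over into the next read.
    leftover = b""
    while True:
        new_chunk = infile.read(chunk_size)
        chunks = split_chunks(leftover + new_chunk, chunk_size)
        leftover = b""
        if chunks:
            leftover = chunks[-1]
            chunks = chunks[:-1]
        for chunk in chunks:
            yield chunk
        if not new_chunk:
            if leftover:
                yield leftover
            break

data = io.BytesIO(b"one\ntwo\nthree\nfour\n")
pieces = list(read_delimited_chunks(data, 8))
print(pieces)  # [b'one\ntwo\n', b'three\n', b'four\n']
```

Only a constant amount of data beyond chunk_size is held in memory at a time, which is the point of the generator design.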


def generate_reports(report_template, infile, chunk_size, copy_header_line):
    """Generate reports from a template and input file, optionally split into chunks.

    If chunk_size is None, a single report is generated with the entire
    contents of infile as the raw data. Otherwise chunk_size should be
    an integer giving the maximum number of bytes in a chunk. The data
    read from infile is then split into chunks of this size at newline
    characters (see read_delimited_chunks). For each of the chunks, this
    function yields a copy of the report_template with that chunk as the
    value of the raw attribute.

    When splitting the data into chunks, if copy_header_line is true,
    the first line of the file is read before chunking and then prepended
    to each of the chunks. This is particularly useful when splitting
    CSV files.

    The infile should be a file-like object. generate_reports uses only
    two methods, readline and read, with readline only called once and
    only if copy_header_line is true. Both methods should return bytes
    objects.
    """
    if chunk_size is None:
        report = report_template.copy()
        report.add("raw", infile.read(), force=True)
        yield report
    else:
        header = b""
        if copy_header_line:
            header = infile.readline()
        for chunk in read_delimited_chunks(infile, chunk_size):
            report = report_template.copy()
            report.add("raw", header + chunk, force=True)
            yield report
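Usage looks roughly like the following sketch. StubReport is a hypothetical stand-in for IntelMQ's report object (only the copy and add methods that generate_reports relies on are provided), and the chunking is simplified to plain line grouping so the snippet runs standalone:

```python
import io

class StubReport(dict):
    """Hypothetical stand-in for an IntelMQ report template."""
    def copy(self):
        return StubReport(self)

    def add(self, key, value, force=False):
        self[key] = value

def generate_reports(report_template, infile, chunk_size, copy_header_line):
    # Restated from the diff, with read_delimited_chunks replaced by a
    # simple size-bounded line grouping for brevity.
    if chunk_size is None:
        report = report_template.copy()
        report.add("raw", infile.read(), force=True)
        yield report
    else:
        header = b""
        if copy_header_line:
            header = infile.readline()   # replicated per chunk
        chunk = b""
        for line in infile:
            if chunk and len(chunk) + len(line) > chunk_size:
                report = report_template.copy()
                report.add("raw", header + chunk, force=True)
                yield report
                chunk = b""
            chunk += line
        if chunk:
            report = report_template.copy()
            report.add("raw", header + chunk, force=True)
            yield report

template = StubReport({"feed.name": "example-feed"})
infile = io.BytesIO(b"id,value\n1,a\n2,b\n3,c\n")
reports = list(generate_reports(template, infile, chunk_size=8,
                                copy_header_line=True))
print([r["raw"] for r in reports])
# [b'id,value\n1,a\n2,b\n', b'id,value\n3,c\n']
```

Every yielded report starts with the CSV header line, which is what lets downstream parser bots treat each chunk as a complete file.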
Review comment: Correct me if I'm wrong, but I think we need to replicate this solution to all collectors, right?
Review comment: As written in #680 (comment), not every collector might be capable of splitting the data. It depends on the data format whether it can be split: CSV or some blocklists can be split this way, whilst XML, JSON, and binary formats cannot! The solution can be extended to the collectors where the format allows it. Candidates are: HTTP, Mail-Url, Mail-Attachment, File, FTP/FTPs (?)