This Code Provides a possibility to split large Reports into Chunks #680
As a work-around for limitations in the size of strings allowed by Redis, we want to be able to split large inputs into several reports. This commit adds basic support code for this together with corresponding unit tests. Part of issue certtools#547
The file collector bot now has two new parameters, chunk_size and chunk_replicate_header, controlling whether the input is to be split into chunks and how big those chunks should be and what to do with (CSV) header lines. Part of issue certtools#547
The mail URL collector bot now has two new parameters, chunk_size and chunk_replicate_header, controlling whether the input is to be split into chunks and how big those chunks should be and what to do with (CSV) header lines. Part of issue certtools#547
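To make the effect of the two parameters concrete, here is a minimal sketch of line-based chunking with optional header replication. It is not the code from this PR: the function name `split_lines_into_chunks` and its exact behavior (sizes counted in bytes, chunks never broken mid-line) are assumptions made for illustration only.

```python
import io


def split_lines_into_chunks(data: bytes, chunk_size: int, replicate_header: bool = True):
    """Yield chunks of roughly chunk_size bytes, always split at line boundaries.

    Hypothetical helper for illustration; the PR's real helper may differ.
    """
    lines = io.BytesIO(data)
    # With header replication, the first line is repeated at the top of every chunk.
    header = lines.readline() if replicate_header else b""
    minimum = 1 if header else 0  # number of entries in a chunk that holds no data rows
    chunk = [header] if header else []
    size = len(header)
    for line in lines:
        # Start a new chunk only if the current one already holds a data row,
        # so a single oversized line still ends up in a chunk of its own.
        if size + len(line) > chunk_size and len(chunk) > minimum:
            yield b"".join(chunk)
            chunk = [header] if header else []
            size = len(header)
        chunk.append(line)
        size += len(line)
    if len(chunk) > minimum:
        yield b"".join(chunk)


csv_input = b"ip,port\n1.1.1.1,80\n2.2.2.2,443\n3.3.3.3,22\n"
chunks = list(split_lines_into_chunks(csv_input, chunk_size=32))
```

In this sketch every chunk is a small, self-contained CSV document, which is what lets a CSV parser handle each resulting report independently.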
…v-split-csv-reports Manually resolved two conflicts. The fields feed.name and feed.code are now set by the CollectorBot's __add_report_fields method. Conflicts: intelmq/bots/collectors/file/collector_file.py intelmq/bots/collectors/mail/collector_mail_url.py
@@ -6,12 +6,27 @@ events. In combination with the Generic CSV Parser this should work great.
rename file to 'README.md'
Fixed merge conflicts in: Conflicts: intelmq/bots/collectors/file/collector_file.py intelmq/bots/collectors/mail/collector_mail_url.py
@SYNchroACK and I talked about this PR on IRC. Thank you very much for your valuable input. One idea to continue the integration of this PR into IntelMQ is to remove
@@ -0,0 +1,134 @@
# -*- coding: utf-8 -*-
""" |
All this code should move to https:/certtools/intelmq/blob/master/intelmq/lib/bot.py#L576
@@ -0,0 +1,158 @@
""" |
change accordingly (remove test_splitreports.py) to the code that was merged in CollectorBot (https:/certtools/intelmq/blob/master/intelmq/lib/bot.py#L576)
@@ -0,0 +1,25 @@
# Mail collector bots |
Remove all README.md files on this pull request and add the documentation text to the following location:
https:/certtools/intelmq/blob/master/docs/Bots.md#collectors
@@ -4,6 +4,8 @@
"description": "Fileinput collector fetches data from a file.",
"module": "intelmq.bots.collectors.file.collector_file",
"parameters": {
"chunk_replicate_header": true, |
Correct me if I'm wrong, but I think we need to replicate this solution to all collectors, right?
As written in #680 (comment), not every collector might be capable of splitting the data. Whether the data can be split depends on its format.
CSV or some blocklists can be split this way, whilst XML, JSON and binary data cannot!
The solution can be extended to the collectors which are
- known to be capable of handling this data
- currently collecting splittable data in some use cases.
Candidates are:
HTTP, Mail-Url, Mail-Attachment, File, FTP/FTPs (?)
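The point about data formats can be illustrated with a small sketch. The sample data and the helper `parses_as_json` are made up for this example: CSV chunks that each carry the header line still parse on their own, while pieces of a JSON document cut at line boundaries do not.

```python
import csv
import io
import json

# CSV: split after every row and replicate the header into each chunk.
csv_data = "ip,port\n192.0.2.1,80\n192.0.2.2,443\n"
header, *rows = csv_data.splitlines(keepends=True)
csv_chunks = [header + row for row in rows]
parsed_csv = [list(csv.DictReader(io.StringIO(chunk))) for chunk in csv_chunks]

# JSON: cut the same way (here into two pieces of two lines each).
json_data = '[\n  {"ip": "192.0.2.1"},\n  {"ip": "192.0.2.2"}\n]\n'
json_lines = json_data.splitlines(keepends=True)
json_chunks = ["".join(json_lines[:2]), "".join(json_lines[2:])]


def parses_as_json(text):
    """Return True if text is a complete, valid JSON document."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


json_results = [parses_as_json(chunk) for chunk in json_chunks]
```

Each CSV chunk yields one complete record, whereas neither JSON piece is a valid document by itself, so a parser downstream could not process such reports independently.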
I was doing the review when you were writing your comment, sorry. I understand your points, but from my perspective this is too small a feature to be in a new file. If you have the chance, study other options for where to put that code; however, here we can find code inside the ParserBot class which is only usable by parsers that receive CSV messages.
No Problem, Thanks for reviewing!
Have you seen that splitreports.py has more lines (with comments and documentation) than the affected collector bot? BR
I can't see that as an argument. Besides, documentation cannot count in that comparison, and comparing lines of code, they have roughly the same number.
I see the same situation in the ParserBot class: currently the class has code dedicated, for example, to CSV parsing.
Like the example before.
That is not an argument for submitting code in new files. Also, v1.0 will be used by people who want to have systems working with huge volumes of data (I already have one scenario which is just waiting for v1.0 to be published). So, from my side, no, sorry. However, @wagner-certat and @aaronkaplan can give you other feedback.
It seems to me that it's probably a good idea to make the splitting

Ability to split depends on parsers, not collectors

Whether a report generated by collector can be split is not a property
So it does make sense to make the splitting functionality available
Ideally IntelMQ would somehow make sure it will only be enabled for

Splitting reports is a work-around

The main arguments against it AFAICT are that
The splitting feature is a work-around for a limitation in Redis, one
For practicality reasons it's probably something we can and should live

API matters more than code organisation

I think API design, and in this case the design of the 'CollectorBot'

API ideas

I'm not sure what the API should be. Ideally, it's designed so that the

    def process(self):
        # ... removed lines to make the point clearer ...
        template = self.new_report()
        template.add("feed.url", "file://localhost%s" % filename)
        with open(filename, 'rb') as f:
            for report in generate_reports(template, f,
                                           self.parameters.chunk_size,
                                           self.parameters.chunk_replicate_header):
                self.send_message(report)

becomes something like this:

    def process(self):
        # ... removed lines to make the point clearer ...
        with open(filename, 'rb') as f:
            self.send_reports_from_file(f, [("feed.url", "file://localhost%s" % filename)])

    def send_reports_from_file(self, file, attributes):
        template = self.new_report()
        for key, value in attributes:
            template.add(key, value)
        for report in generate_reports(template, file,
                                       self.parameters.chunk_size,
                                       self.parameters.chunk_replicate_header):
            self.send_message(report)

where

And perhaps analogously a method that sends split reports from a string
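The string-based variant mentioned last could be sketched roughly as follows. This is only an illustration under several assumptions: `SketchCollector`, its dict-based reports and the method name `send_reports_from_string` are hypothetical, and `generate_reports` here is a simplified stand-in with the same argument order as in the snippet above, not the helper from this PR.

```python
import io


def generate_reports(template, infile, chunk_size, replicate_header):
    """Simplified stand-in (assumed behavior): one report per line-based chunk."""
    header = infile.readline() if replicate_header else b""
    if chunk_size is None:
        # No splitting requested: a single report carries the whole input.
        yield dict(template, raw=header + infile.read())
        return
    chunk = header
    for line in infile:
        # Close the current chunk once it holds data and the next line would overflow it.
        if len(chunk) > len(header) and len(chunk) + len(line) > chunk_size:
            yield dict(template, raw=chunk)
            chunk = header
        chunk += line
    if len(chunk) > len(header):
        yield dict(template, raw=chunk)


class SketchCollector:
    """Hypothetical minimal collector; reports are plain dicts in this sketch."""

    def __init__(self, chunk_size=None, chunk_replicate_header=True):
        self.chunk_size = chunk_size
        self.chunk_replicate_header = chunk_replicate_header
        self.sent = []

    def new_report(self):
        return {}

    def send_message(self, report):
        self.sent.append(report)

    def send_reports_from_file(self, file, attributes):
        template = self.new_report()
        for key, value in attributes:
            template[key] = value
        for report in generate_reports(template, file, self.chunk_size,
                                       self.chunk_replicate_header):
            self.send_message(report)

    def send_reports_from_string(self, data, attributes):
        # Wrap the bytes in a file-like object and reuse the file-based method.
        self.send_reports_from_file(io.BytesIO(data), attributes)


bot = SketchCollector(chunk_size=20)
bot.send_reports_from_string(b"ip\n192.0.2.1\n192.0.2.2\n192.0.2.3\n",
                             [("feed.url", "file://localhost/tmp/example.csv")])
```

Delegating to the file-based method keeps a single code path for splitting, so a collector that already holds its data in memory would not need its own chunking logic.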
I pushed some minor improvements to this PR to https:/certtools/intelmq/tree/dev-split-csv-reports
I commented on one of the changes in d42b26d. Other than that it looks good. I haven't actually run the code, though.
The tests are definitely broken and have not been fixed yet.