
This Code Provides a possibility to split large Reports into Chunks #680

Merged
9 commits merged into certtools:master from dev-split-csv-reports on May 10, 2017

Conversation

@dmth dmth (Contributor) commented Sep 13, 2016

The tests are definitely broken and have not been fixed yet.

bernhard-herzog and others added 6 commits July 28, 2016 21:22
As a work-around for limitations in the size of strings allowed by
Redis, we want to be able to split large inputs into several reports.
This commit adds basic support code for this together with corresponding
unit tests.

Part of issue certtools#547
The file collector bot now has two new parameters, chunk_size and
chunk_replicate_header, controlling whether the input is to be split
into chunks and how big those chunks should be and what to do with (CSV)
header lines.

Part of issue certtools#547
The mail URL collector bot now has two new parameters, chunk_size and
chunk_replicate_header, controlling whether the input is to be split
into chunks and how big those chunks should be and what to do with (CSV)
header lines.

Part of issue certtools#547
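The chunking behaviour these parameters describe can be sketched roughly as follows. This is a minimal illustration of the idea only; `split_chunks` is an invented name and signature, not IntelMQ's actual splitreports.py API:

```python
def split_chunks(lines, chunk_size, replicate_header):
    """Group line-based input (a list of bytes lines) into chunks of
    roughly chunk_size bytes; when replicate_header is True, prepend
    the first line (e.g. a CSV header) to every chunk."""
    header = lines[0] if (replicate_header and lines) else b""
    body = lines[1:] if (replicate_header and lines) else lines
    chunks, current, size = [], [], len(header)
    for line in body:
        # Flush the current chunk when adding this line would exceed the limit.
        if current and size + len(line) > chunk_size:
            chunks.append(header + b"".join(current))
            current, size = [], len(header)
        current.append(line)
        size += len(line)
    if current:
        chunks.append(header + b"".join(current))
    return chunks
```

With header replication enabled, a downstream CSV parser can treat every chunk as a complete, self-describing file.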
…v-split-csv-reports

Manually resolved two conflicts. The fields feed.name and feed.code are now
set by the CollectorBot's __add_report_fields method.

Conflicts:
	intelmq/bots/collectors/file/collector_file.py
	intelmq/bots/collectors/mail/collector_mail_url.py
@sebix sebix added this to the Release v1.1 milestone Sep 13, 2016
@sebix sebix added feature Indicates new feature requests or new features component: core needs: review labels Sep 13, 2016
@@ -6,12 +6,27 @@ events. In combination with the Generic CSV Parser this should work great.


rename file to 'README.md'

Dustin Demuth added 3 commits January 24, 2017 10:18
Fixed merge conflicts in:

Conflicts:
	intelmq/bots/collectors/file/collector_file.py
	intelmq/bots/collectors/mail/collector_mail_url.py
@dmth dmth (Contributor, Author) commented Jan 24, 2017

@SYNchroACK and I talked about this PR on IRC. Thank you very much for your valuable input.

One idea to continue the integration of this PR into IntelMQ is to remove splitreports.py and integrate its functions into bot.py's CollectorBot. After looking at it, I think this is not as good as it sounds:

  • splitreports.py is not complex, but it isn't trivial either, and thus it is very well documented. Integrating it would break the documentation of the module.
  • The functions which splitreports.py provides cannot be used in every collector, but only in those which are known to be capable of processing line-based data (such as CSV files). Integrating it into the CollectorBot in general, thus making it available to every collector natively, might lead to false assumptions, such as that every feed's data can be split into chunks.
    Explicitly importing this module and using its capabilities is the better way to go, I think.


@@ -0,0 +1,158 @@
"""

change accordingly (remove test_splitreports.py) to match the code that was merged into CollectorBot (https://github.com/certtools/intelmq/blob/master/intelmq/lib/bot.py#L576)

@@ -0,0 +1,25 @@
# Mail collector bots

Remove all README.md files on this pull request and add the documentation text to the following location:
https://github.com/certtools/intelmq/blob/master/docs/Bots.md#collectors

@@ -4,6 +4,8 @@
"description": "Fileinput collector fetches data from a file.",
"module": "intelmq.bots.collectors.file.collector_file",
"parameters": {
"chunk_replicate_header": true,

Correct me if I'm wrong, but I think we need to replicate this solution to all collectors, right?

@dmth dmth (Contributor, Author) commented Jan 24, 2017

As written in #680 (comment), not every collector might be capable of splitting the data. Whether it can be split depends on the data format:
CSV or some blocklists can be split this way, whilst XML, JSON, or binary data cannot!

The solution can be extended to the collectors which are

  1. known to be capable of handling this data, and
  2. currently collecting splittable data in some use cases.

Candidates are:
HTTP, Mail-Url, Mail-Attachment, File, FTP/FTPs (?)

@SYNchroACK SYNchroACK (Contributor) commented:

> @SYNchroACK and I talked about this PR on IRC. Thank you very much for your valuable input.
>
> One idea to continue the integration of this PR into IntelMQ is to remove splitreports.py and integrate its functions into bot.py's CollectorBot. After looking at it, I think this is not as good as it sounds:
>
> splitreports.py is not complex, but it isn't trivial either. Thus it was very well documented. Integrating it would break the documentation of the module.
> The functions which splitreports.py provides cannot be used in every collector, but only those which are known to be capable of processing line-based data (such as CSV files). Integrating it into the CollectorBot in general, thus making it available to every collector natively, might lead to false assumptions, such as that every feed's data can be split into chunks.
> Explicitly importing this module and using its capabilities is the better way to go, I think.

I was doing the review while you were writing your comment, sorry. I understand your points, but from my perspective it is too small a feature to be in a new file. If you have a chance, study other options for where to put that code; for example, here we can find code inside the ParserBot class which is only usable by parsers that receive CSV messages.

@dmth dmth (Contributor, Author) commented Jan 24, 2017

> I was doing the review when you were writing your comment. sorry.

No problem, thanks for reviewing!

> I understand your points but from my perspective is too small feature to be in a new file.

Have you seen that splitreports.py has more lines (with comments and documentation) than the affected collector bot?
An integration would mean that the majority of code and documentation within the collector bot would belong to a feature that can only be used by a certain subset of collectors, expects a certain data structure, and only handles a problem which occurs for a small number of users (those getting HUGE datasets) but becomes a showstopper for them when it happens. It seems very reasonable to me to extract this into a separate file.

BR
Dustin

@SYNchroACK SYNchroACK (Contributor) commented Jan 24, 2017

> > I was doing the review when you were writing your comment. sorry.
>
> No problem, thanks for reviewing!

> > I understand your points but from my perspective is too small feature to be in a new file.
>
> Have you seen that splitreports.py has more lines (with comments and documentation) than the affected collector bot?

I can't see that as an argument, and by the way, documentation cannot count for that comparison; comparing lines of code, they have roughly the same number.

> An integration would cause that the majority of code and documentation within the collector bot would be a feature that can only be used by a certain subset of collectors

I see the same situation for the ParserBot class: currently the class has code dedicated, for example, to CSV parsing.

> expecting a certain data-structure

Like the example before.

> and only handling a problem which occurs to a small amount of users (those getting HUGE datasets) but gets a showstopper for them when it happens. It's very reasonable for me to exclude this into a separate file.

That is not an argument for moving code into new files. Also, v1.0 will be used by people who want systems working with huge volumes of data (I already have one scenario which is just waiting for v1.0 to be published).

So, from my side, no, sorry. However, @wagner-certat and @aaronkaplan can give you other feedback.

@bernhard-herzog bernhard-herzog (Contributor) commented:

It seems to me that it's probably a good idea to make the splitting
functionality available with CollectorBot methods. Reasoning and a
sketch of an API:

Ability to split depends on parsers, not collectors

Whether a report generated by a collector can be split is not a property
of the collector but of the parser that has to process it, at least for
the more generic collectors. Whether the raw data can be split depends
on the format of the data after all, and currently at least, splitting
only works for formats like CSV that often can be split at newline
characters.

So it does make sense to make the splitting functionality available
automatically to all collectors as long as it is not enabled by default
(and I don't think anybody is proposing that).

Ideally IntelMQ would somehow make sure it will only be enabled for
collectors that feed into parsers for formats that support it, but
that's outside the scope of this issue.

Splitting reports is a work-around

The main arguments against it AFAICT are that

The splitting feature is a work-around for a limitation in Redis, one
which hopefully will go away with some future version of Redis or by
replacing Redis with something else, so that the work-around will not be
needed anymore. Do we really want to burden the API of the collector
class with this?

For practicality reasons it's probably something we can and should live
with. Luckily, once the work-around is not needed anymore, it should be
easy to remove it and for compatibility, keep the API and configuration
options as no-ops, or make them behave as if the chunk size is bigger
than the raw data.
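That no-op behaviour could look roughly like this. An illustrative sketch only; the function name is hypothetical and this is not IntelMQ's real splitting code:

```python
def split_or_passthrough(data, chunk_size):
    """Yield chunks of a raw report (bytes). A missing or oversized
    chunk_size means 'no splitting', which is how the work-around
    could be neutralised once the Redis limitation goes away."""
    if chunk_size is None or chunk_size >= len(data):
        # Behave as if the chunk size were bigger than the raw data.
        yield data
        return
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]
```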

API matters more than code organisation

I think API design, and in this case the design of the 'CollectorBot'
API matters more than where the code is. So the important bit here is
how a collector bot can access the splitting functionality. Making it
available directly via CollectorBot methods is probably a good idea.
Whether that also means the actual code should be moved from
splitreports.py to bot.py is a different and less important question
and anyway I don't see much of a problem in moving the code, regardless
of the amount of the documentation.

API ideas

I'm not sure what the API should be. Ideally, it's designed so that the
splitting is handled transparently from the collector bot writer's point
of view. That would probably involve using a somewhat more declarative
approach than now. For instance, the file collector bot's code to split
the report should be moved from the process method into a separate
method, so that this

def process(self):

    # ... removed lines to make the point clearer ...

    template = self.new_report()
    template.add("feed.url", "file://localhost%s" % filename)

    with open(filename, 'rb') as f:
        for report in generate_reports(template, f,
                                       self.parameters.chunk_size,
                                       self.parameters.chunk_replicate_header):
            self.send_message(report)

becomes something like this:

def process(self):

    # ... removed lines to make the point clearer ...

    with open(filename, 'rb') as f:
        self.send_reports_from_file(f, [("feed.url", "file://localhost%s" % filename)])


def send_reports_from_file(self, file, attributes):
    template = self.new_report()
    for key, value in attributes:
        template.add(key, value)

    for report in generate_reports(template, file,
                                   self.parameters.chunk_size,
                                   self.parameters.chunk_replicate_header):
        self.send_message(report)

where send_reports_from_file would be a method of CollectorBot

And perhaps analogously a method that sends split reports from a string
with a raw report. An easy way to implement that would be as a front-end
to send_reports_from_file that wraps the string in io.StringIO.
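Such a string front-end might look like this minimal sketch. The class is a stand-in, not IntelMQ's real CollectorBot, and send_reports_from_string is the hypothetical method name; note that since the file collector opens files in binary mode, a bytes variant via io.BytesIO might be needed as well:

```python
import io

class CollectorBotSketch:
    """Stand-in for CollectorBot, just enough to show the proposed
    front-end method; not the real IntelMQ class."""

    def send_reports_from_file(self, file, attributes):
        # The real method would build a template report and run
        # generate_reports(); here we only record what was read.
        self.last_raw = file.read()
        self.last_attributes = attributes

    def send_reports_from_string(self, raw, attributes):
        # Front-end that wraps the raw report in an in-memory file
        # object so the file-based splitting path is reused unchanged.
        self.send_reports_from_file(io.StringIO(raw), attributes)

bot = CollectorBotSketch()
bot.send_reports_from_string("a,b\n1,2\n", [("feed.url", "file://localhost/x.csv")])
```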

@ghost ghost assigned aaronkaplan Apr 21, 2017
@ghost ghost self-requested a review April 21, 2017 13:22
@ghost ghost assigned ghost and unassigned aaronkaplan Apr 21, 2017
@ghost ghost modified the milestones: v1.0 Stable Release, v1.1 Feature release Apr 24, 2017
@ghost

ghost commented Apr 24, 2017

I pushed some minor improvements to this PR to https://github.com/certtools/intelmq/tree/dev-split-csv-reports
Please review them or add them to this PR.

@bernhard-herzog bernhard-herzog (Contributor) commented:

> I pushed some minor improvements to this PR to https://github.com/certtools/intelmq/tree/dev-split-csv-reports
> Please review them or add them to this PR.

I commented on one of the changes in d42b26d. Other than that it looks good. I haven't actually run the code, though.

@ghost ghost merged commit fa7677a into certtools:master May 10, 2017
ghost pushed a commit that referenced this pull request May 10, 2017
@dmth dmth deleted the dev-split-csv-reports branch June 27, 2017 15:12