
This Code Provides a possibility to split large Reports into Chunks #680

Merged
9 commits merged into certtools:master from dev-split-csv-reports on May 10, 2017

Conversation

@dmth dmth (Contributor) commented Sep 13, 2016

The tests are definitely broken and have not been fixed yet.

bernhard-herzog and others added 6 commits July 28, 2016 21:22
As a work-around for limitations in the size of strings allowed by
Redis, we want to be able to split large inputs into several reports.
This commit adds basic support code for this together with corresponding
unit tests.

Part of issue certtools#547
The file collector bot now has two new parameters, chunk_size and
chunk_replicate_header, controlling whether the input is to be split
into chunks and how big those chunks should be and what to do with (CSV)
header lines.

Part of issue certtools#547
The mail URL collector bot now has two new parameters, chunk_size and
chunk_replicate_header, controlling whether the input is to be split
into chunks and how big those chunks should be and what to do with (CSV)
header lines.

Part of issue certtools#547
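The chunking behaviour these parameters describe can be sketched roughly as follows. This is a minimal illustration of the idea only; `split_chunks` is an invented name and signature, not IntelMQ's actual splitreports.py API:

```python
def split_chunks(lines, chunk_size, replicate_header):
    """Group line-based input (a list of bytes lines) into chunks of
    roughly chunk_size bytes; when replicate_header is True, prepend
    the first line (e.g. a CSV header) to every chunk."""
    header = lines[0] if (replicate_header and lines) else b""
    body = lines[1:] if (replicate_header and lines) else lines
    chunks, current, size = [], [], len(header)
    for line in body:
        # Flush the current chunk when adding this line would exceed the limit.
        if current and size + len(line) > chunk_size:
            chunks.append(header + b"".join(current))
            current, size = [], len(header)
        current.append(line)
        size += len(line)
    if current:
        chunks.append(header + b"".join(current))
    return chunks
```

With header replication enabled, a downstream CSV parser can treat every chunk as a complete, self-describing file.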
…v-split-csv-reports

Manually resolved two conflicts. The fields feed.name and feed.code are now
set by the CollectorBot's __add_report_fields method.

Conflicts:
	intelmq/bots/collectors/file/collector_file.py
	intelmq/bots/collectors/mail/collector_mail_url.py
@sebix sebix added this to the Release v1.1 milestone Sep 13, 2016
@sebix sebix added feature Indicates new feature requests or new features component: core needs: review labels Sep 13, 2016
@@ -6,12 +6,27 @@ events. In combination with the Generic CSV Parser this should work great.


rename file to 'README.md'

Dustin Demuth added 3 commits January 24, 2017 10:18
Fixed merge conflicts in:

Conflicts:
	intelmq/bots/collectors/file/collector_file.py
	intelmq/bots/collectors/mail/collector_mail_url.py
@dmth dmth (Contributor, Author) commented Jan 24, 2017

@SYNchroACK and I talked about this PR on IRC. Thank you very much for your valuable input.

One idea to continue the integration of this PR into IntelMQ is to remove splitreports.py and integrate its functions into bot.py's CollectorBot. After looking at it, I think this is not as good as it sounds:

  • splitreports.py is not complex, but it isn't trivial either, and thus it is very well documented. Integrating it would break the documentation of the module.
  • The functions which splitreports.py provides cannot be used in every collector, but only in those which are known to be capable of processing line-based data (such as CSV files). Integrating it into the CollectorBot in general, thus making it available to every collector natively, might lead to false assumptions, such as that every feed's data can be split into chunks.
    Explicitly importing this module and using its capabilities is the better way to go, I think.


@@ -0,0 +1,158 @@
"""

change accordingly (remove test_splitreports.py) to match the code that was merged into CollectorBot (https://github.com/certtools/intelmq/blob/master/intelmq/lib/bot.py#L576)

@@ -0,0 +1,25 @@
# Mail collector bots

Remove all README.md files on this pull request and add the documentation text to the following location:
https://github.com/certtools/intelmq/blob/master/docs/Bots.md#collectors

@@ -4,6 +4,8 @@
"description": "Fileinput collector fetches data from a file.",
"module": "intelmq.bots.collectors.file.collector_file",
"parameters": {
"chunk_replicate_header": true,

Correct me if I'm wrong, but I think we need to replicate this solution to all collectors, right?

@dmth dmth (Contributor, Author) commented Jan 24, 2017

As written in #680 (comment), not every collector might be capable of splitting the data. Whether it can be split depends on the data format:
CSV or some blocklists can be split this way, whilst XML, JSON, or binary data cannot!

The solution can be extended to the collectors which are

  1. known to be capable of handling this data, and
  2. currently collecting splittable data in some use cases.

Candidates are:
HTTP, Mail-Url, Mail-Attachment, File, FTP/FTPs (?)

@SYNchroACK SYNchroACK (Contributor) commented:

> @SYNchroACK and I talked about this PR on IRC. Thank you very much for your valuable input.
>
> One idea to continue the integration of this PR into IntelMQ is to remove splitreports.py and integrate its functions into bot.py's CollectorBot. After looking at it, I think this is not as good as it sounds:
>
> splitreports.py is not complex, but it isn't trivial either. Thus it was very well documented. Integrating it would break the documentation of the module.
> The functions which splitreports.py provides cannot be used in every collector, but only those which are known to be capable of processing line-based data (such as CSV files). Integrating it into the CollectorBot in general, thus making it available to every collector natively, might lead to false assumptions, such as that every feed's data can be split into chunks.
> Explicitly importing this module and using its capabilities is the better way to go, I think.

I was doing the review while you were writing your comment, sorry. I understand your points, but from my perspective it is too small a feature to be in a new file. If you have a chance, study other options for where to put that code; for example, here we can find code inside the ParserBot class which is only usable by parsers that receive CSV messages.

@dmth dmth (Contributor, Author) commented Jan 24, 2017

> I was doing the review when you were writing your comment. sorry.

No problem, thanks for reviewing!

> I understand your points but from my perspective is too small feature to be in a new file.

Have you seen that splitreports.py has more lines (with comments and documentation) than the affected collector bot?
An integration would mean that the majority of code and documentation within the collector bot would belong to a feature that can only be used by a certain subset of collectors, expects a certain data structure, and only handles a problem which occurs for a small number of users (those getting HUGE datasets) but becomes a showstopper for them when it happens. It seems very reasonable to me to extract this into a separate file.

BR
Dustin

@SYNchroACK SYNchroACK (Contributor) commented Jan 24, 2017

> > I was doing the review when you were writing your comment. sorry.
>
> No problem, thanks for reviewing!

> > I understand your points but from my perspective is too small feature to be in a new file.
>
> Have you seen that splitreports.py has more lines (with comments and documentation) than the affected collector bot?

I can't see that as an argument, and by the way, documentation cannot count for that comparison; comparing lines of code, they have roughly the same number.

> An integration would cause that the majority of code and documentation within the collector bot would be a feature that can only be used by a certain subset of collectors

I see the same situation for the ParserBot class: currently the class has code dedicated, for example, to CSV parsing.

> expecting a certain data-structure

Like the example before.

> and only handling a problem which occurs to a small amount of users (those getting HUGE datasets) but gets a showstopper for them when it happens. It's very reasonable for me to exclude this into a separate file.

That is not an argument for moving code into new files. Also, v1.0 will be used by people who want systems working with huge volumes of data (I already have one scenario which is just waiting for v1.0 to be published).

So, from my side, no, sorry. However, @wagner-certat and @aaronkaplan can give you other feedback.

@bernhard-herzog bernhard-herzog (Contributor) commented:

It seems to me that it's probably a good idea to make the splitting
functionality available with CollectorBot methods. Reasoning and a
sketch of an API:

Ability to split depends on parsers, not collectors

Whether a report generated by a collector can be split is not a property
of the collector but of the parser that has to process it, at least for
the more generic collectors. Whether the raw data can be split depends
on the format of the data after all, and currently at least, splitting
only works for formats like CSV that often can be split at newline
characters.

So it does make sense to make the splitting functionality available
automatically to all collectors as long as it is not enabled by default
(and I don't think anybody is proposing that).

Ideally IntelMQ would somehow make sure it will only be enabled for
collectors that feed into parsers for formats that support it, but
that's outside the scope of this issue.

Splitting reports is a work-around

The main arguments against it AFAICT are that

The splitting feature is a work-around for a limitation in Redis, one
which hopefully will go away with some future version of Redis or by
replacing Redis with something else, so that the work-around will not be
needed anymore. Do we really want to burden the API of the collector
class with this?

For practicality reasons it's probably something we can and should live
with. Luckily, once the work-around is not needed anymore, it should be
easy to remove it and for compatibility, keep the API and configuration
options as no-ops, or make them behave as if the chunk size is bigger
than the raw data.
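That no-op behaviour could look roughly like this. An illustrative sketch only; the function name is hypothetical and this is not IntelMQ's real splitting code:

```python
def split_or_passthrough(data, chunk_size):
    """Yield chunks of a raw report (bytes). A missing or oversized
    chunk_size means 'no splitting', which is how the work-around
    could be neutralised once the Redis limitation goes away."""
    if chunk_size is None or chunk_size >= len(data):
        # Behave as if the chunk size were bigger than the raw data.
        yield data
        return
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]
```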

API matters more than code organisation

I think API design, and in this case the design of the 'CollectorBot'
API matters more than where the code is. So the important bit here is
how a collector bot can access the splitting functionality. Making it
available directly via CollectorBot methods is probably a good idea.
Whether that also means the actual code should be moved from
splitreports.py to bot.py is a different and less important question
and anyway I don't see much of a problem in moving the code, regardless
of the amount of the documentation.

API ideas

I'm not sure what the API should be. Ideally, it's designed so that the
splitting is handled transparently from the collector bot writer's point
of view. That would probably involve using a somewhat more declarative
approach than now. For instance, the file collector bot's code to split
the report should be moved from the process method into a separate
method, so that this

def process(self):

    # ... removed lines to make the point clearer ...

    template = self.new_report()
    template.add("feed.url", "file://localhost%s" % filename)

    with open(filename, 'rb') as f:
        for report in generate_reports(template, f,
                                       self.parameters.chunk_size,
                                       self.parameters.chunk_replicate_header):
            self.send_message(report)

becomes something like this:

def process(self):

    # ... removed lines to make the point clearer ...

    with open(filename, 'rb') as f:
        self.send_reports_from_file(f, [("feed.url", "file://localhost%s" % filename)])


def send_reports_from_file(self, file, attributes):
    template = self.new_report()
    for key, value in attributes:
        template.add(key, value)

    for report in generate_reports(template, file,
                                   self.parameters.chunk_size,
                                   self.parameters.chunk_replicate_header):
        self.send_message(report)

where send_reports_from_file would be a method of CollectorBot

And perhaps analogously a method that sends split reports from a string
with a raw report. An easy way to implement that would be as a front-end
to send_reports_from_file that wraps the string in io.StringIO.
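Such a string front-end might look like this minimal sketch. The class is a stand-in, not IntelMQ's real CollectorBot, and send_reports_from_string is the hypothetical method name; note that since the file collector opens files in binary mode, a bytes variant via io.BytesIO might be needed as well:

```python
import io

class CollectorBotSketch:
    """Stand-in for CollectorBot, just enough to show the proposed
    front-end method; not the real IntelMQ class."""

    def send_reports_from_file(self, file, attributes):
        # The real method would build a template report and run
        # generate_reports(); here we only record what was read.
        self.last_raw = file.read()
        self.last_attributes = attributes

    def send_reports_from_string(self, raw, attributes):
        # Front-end that wraps the raw report in an in-memory file
        # object so the file-based splitting path is reused unchanged.
        self.send_reports_from_file(io.StringIO(raw), attributes)

bot = CollectorBotSketch()
bot.send_reports_from_string("a,b\n1,2\n", [("feed.url", "file://localhost/x.csv")])
```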

@ghost ghost assigned aaronkaplan Apr 21, 2017
@ghost ghost self-requested a review April 21, 2017 13:22
@ghost ghost assigned ghost and unassigned aaronkaplan Apr 21, 2017
@ghost ghost modified the milestones: v1.0 Stable Release, v1.1 Feature release Apr 24, 2017
@ghost

ghost commented Apr 24, 2017

I pushed some minor improvements to this PR to https://github.com/certtools/intelmq/tree/dev-split-csv-reports
Please review them or add them to this PR.

@bernhard-herzog bernhard-herzog (Contributor) commented:

> I pushed some minor improvements to this PR to https://github.com/certtools/intelmq/tree/dev-split-csv-reports
> Please review them or add them to this PR.

I commented on one of the changes in d42b26d. Other than that it looks good. I haven't actually run the code, though.

@ghost ghost merged commit fa7677a into certtools:master May 10, 2017
ghost pushed a commit that referenced this pull request May 10, 2017
@dmth dmth deleted the dev-split-csv-reports branch June 27, 2017 15:12