Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AlienVault OTX API #298

Merged
merged 15 commits into from
Sep 11, 2015
Merged

AlienVault OTX API #298

merged 15 commits into from
Sep 11, 2015

Conversation

robcza
Copy link
Contributor

@robcza robcza commented Aug 25, 2015

Creating this pull request to discuss several points:

  • I've added third party script directly to this branch https:/AlienVault-Labs/OTX-Python-SDK/blob/master/OTXv2.py not sure if this is smart as:
    • it will not get updated as the source changes
    • there is no license file
  • Created new collector - seems weird, however didn't see any other simple way. Should I create new specialized collectors for any other API sources?
  • The parser itself raises many questions
    • There is no straightforward method to determine classification.type of the event (AlienVault uses tags, however they are user defined and are close to random)
    • Included some metadata regarding the IoCs (author, name, description) and dumped them into the "comment" attribute. What format is preferable to store multiple pieces of information in an attribute like this?

Thank for your comments

@sebix
Copy link
Member

sebix commented Aug 25, 2015

It's not compatible to PEP8.

@robcza
Copy link
Contributor Author

robcza commented Aug 25, 2015

Ok, will study and check that .... should be ok by now

raw_report = utils.base64_decode(report.value("raw"))

for pulse in json.loads(raw_report):
comment = "author: " + pulse['author_name'] + "; name: " + \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using string formatting us usually more clear. Example:

            comment = ("author: {}; name: {}; description: {}"
                       "".format(pulse['author_name'], pulse['name'],
                                 pulse['description']))

The resulting string can be seen immediately.

@sebix
Copy link
Member

sebix commented Aug 25, 2015

Several lines are longer than 80 chars. You can use a tuple with one element for wrapping. This will be reduced to the element itself, e.g.:

myvar = ("long"
         "string")

In function calls, this is not necessary as the ending scope is clearly defined.

@sebix
Copy link
Member

sebix commented Aug 26, 2015

For the license of OTX-Python-SDK, I opened an issue there: AlienVault-OTX/OTX-Python-SDK#5
I am not the one who decides this case, but in my opinion we can't include that code as such. Additionally the license has to be compatible with Gnu Affero Public License.

I think the collector is fine.
For the classification I'm afraid we need a mapping.

@sebix
Copy link
Member

sebix commented Aug 26, 2015

OTX-Python-SDK is now licensed under Apache License 2.0: https:/AlienVault-Labs/OTX-Python-SDK/blob/master/LICENSE

@SYNchroACK
Copy link
Contributor

@robcza is it finished?

@sebix
Copy link
Member

sebix commented Sep 2, 2015

It should be at least mentioned in the COPYRIGHT(.md) files, that the included file is Apache licensed.

As the Apache license is deemed compatible with the (L)GPL v3, you can use Apache licensed code in your LGPL project.

If you use the Apache licensed code only in the form of a library, the conventional way is

  • keep the Apache licensed library in a separate directory/sub-tree of the project
  • mention in your documentation (readme) that you use such-and-such library and under which license it is being distributed.

In your source code, you don't need to mention that a function/class comes from a separate library. The fact that it has a different license should be obvious from the copyright statements in the respective source files.

programmers@SE: How to use apache license in my project which will be LGPL

def process(self):
report = self.receive_message()
if (report is None or not report.contains("raw") or
len(report.value("raw").strip()) == 0):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last check should probably be done in Bot.receive_message() and results in None. Was there any special message which caused problems and you fixed it that way?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really, will get rid of that check

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An empty raw (after strip) will be rejected anyway: https:/certtools/intelmq/blob/master/intelmq/lib/harmonization.py#L59

…ienvault-otx

Conflicts:
	intelmq/bots/BOTS

def process(self):
report = self.receive_message()
if (report is None or not report.contains("raw"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a SyntaxError

@robcza
Copy link
Contributor Author

robcza commented Sep 2, 2015

Uraaa! I just merged something and it worked :) Anyway, sorry for "premature push", I will check the code, prepare tests and will get back to you to discuss a few issues. So don't waste your time looking at the code right now.
xvyosl

@sebix
Copy link
Member

sebix commented Sep 2, 2015

Forgot that we have a surveillance state? I will watch at every commit you do!!!!!1!111

@sebix
Copy link
Member

sebix commented Sep 2, 2015

Here is a good summary with all steps needed in a development process in git (branches, rebasing etc): https://docs.scipy.org/doc/numpy/dev/gitwash/development_workflow.html

@robcza
Copy link
Contributor Author

robcza commented Sep 2, 2015

Well, ok, now it seems a lot better. However I'm aware of some issues:

  • there is no definition for malware.hash_type in harmonization.conf - I would like to add that, however I could be missing something
  • classification is still other/blacklist - I am not really happy with this, but the source is close to random regarding the tags they use for their IoCs. I believe, I could change the classification for some of the tags like "malware" and keep other/blacklist for the rest of them.
  • I tried to provide test data, however the test itself is still failing, too cryptic for me to solve
ERROR: test_event (__main__.TestAlienVaultOTXParserBot)
Test if correct Event has been produced.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "intelmq/tests/bots/parsers/alienvault/test_parser_otx.py", line 48, in test_event
    self.assertMessageEqual(0, EXAMPLE_EVENT)
  File "/usr/local/lib/python2.7/dist-packages/intelmq-1.0.0-py2.7.egg/intelmq/lib/test.py", line 333, in assertMessageEqual
    del expected_message['time.observation']
KeyError: u'time.observation'

----------------------------------------------------------------------
Ran 12 tests in 0.096s

FAILED (errors=1)

@sebix
Copy link
Member

sebix commented Sep 3, 2015

You are using the wrong format to add grouped fields:

"time": {
    "source": "2015-09-02T09:21:53+00:00",
    "observation": "2015-09-02T14:17:58+00:00"}}

should be:

"time.source": "2015-09-02T09:21:53+00:00",
"time.observation": "2015-09-02T14:17:58+00:00"

same applies to source, classification, feed and all others.

@SYNchroACK
Copy link
Contributor

@sebix can you explain a little bit more?

@sebix
Copy link
Member

sebix commented Sep 3, 2015

@SYNchroACK
Copy link
Contributor

yap, i got it! ;)

@robcza
Copy link
Contributor Author

robcza commented Sep 9, 2015

I've fixed the issue regarding the grouped attributes and added "additional" attribute to the event generated by the parser.
Also, I defined malware.hash_type in the harmonizotaion.conf and used it in the parser.

I see no easy way to make the classification more precise at the moment, however I will keep eye on changes introduced by AlienVault, hopefully they will come up with something to narrow it.

Feels like complete right now, however feel free to comment further. And btw. @sebix those tests are pretty useful, thanks for them

@SYNchroACK SYNchroACK added this to the Release 1 milestone Sep 9, 2015
@SYNchroACK SYNchroACK self-assigned this Sep 9, 2015
from OTXv2 import OTXv2
import json

from intelmq.bots.collectors.http.lib import fetch_url
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fetch_url is not used.

@sebix
Copy link
Member

sebix commented Sep 9, 2015

The hash type is usually stored along with the hash value as prefix. E.g $1$ for MD5, $$2a$ etc for Blowfish, $5$ for SHA-256.
See also WP: Crypt (C)

@SYNchroACK
Copy link
Contributor

@sebix so, malware.hash will also include the hash_type?!?!?!

@sebix
Copy link
Member

sebix commented Sep 9, 2015

On 09/09/2015 04:46 PM, Tomás Lima wrote:

@sebix https:/sebix so, malware.hash will also include
the hash_type?!?!?!

All comments made by me do represent my opinion and I don't want to
patronize neither you or any other "authoritative" project lead. I
indicate that the hash can also include the hash_type and a separate
filed is not necessary. Nothing else. Just that.

@aaronkaplan
Copy link
Member

On Sep 9, 2015, at 4:51 PM, Sebastian [email protected] wrote:

On 09/09/2015 04:46 PM, Tomás Lima wrote:

@sebix https:/sebix so, malware.hash will also include
the hash_type?!?!?!

All comments made by me do represent my opinion and I don't want to
patronize neither you or any other "authoritative" project lead. I
indicate that the hash can also include the hash_type and a separate
filed is not necessary. Nothing else. Just that.

Agreed, it makes more sense to rely on that.

@SYNchroACK
Copy link
Contributor

Agreed. It will require to add a util to convert a hash into to hash type along with the hash value as prefix.

@sebix
Copy link
Member

sebix commented Sep 9, 2015

I propose to replace this code:

HASHES = {
    'FileHash-SHA256': 'SHA-256',
    'FileHash-SHA1': 'SHA-1',
    'FileHash-MD5': 'MD5'
}

if indicator["type"] in [
        'FileHash-SHA256',
        'FileHash-SHA1',
        'FileHash-MD5']:
    event.add(
        'malware.hash',
        indicator["indicator"],
        sanitize=True)
    event.add(
        'malware.hash_type', HASHES[
            indicator["type"]], sanitize=True)

with this:

HASHES = {
    'FileHash-SHA256': '$5$',
    'FileHash-SHA1': '$sha1$',
    'FileHash-MD5': '$1$'
}
if indicator["type"] in HASHES:
    event.add('malware.hash',
              HASHES[indicator["type"]] + indicator["indicator"])

@robcza
Copy link
Contributor Author

robcza commented Sep 9, 2015

Included proposed changes, adjusted the description of malware.hash and made the parser code a bit more readable.

@sebix
Copy link
Member

sebix commented Sep 11, 2015

I just checked out your code locally. Had to rebase on master and fix several merge conflicts (please do not use merge to get changes from upstream ...) and adapted the files to current changes. Also fixed some style issues and made sure, the JSON dumps are always equal (the order of items in a dictionary is not always the same, thus we sort by keys now, also fixes Python 3 tests). Have a look here: sebix@8e2e225

@sebix
Copy link
Member

sebix commented Sep 11, 2015

OTXv2.py is ugly code and not compatible to py3. There is a PR on some issues, I commented there: AlienVault-OTX/OTX-Python-SDK#2
Hopefully they fix the issues upstream, otherwise I will fork the abandoned project and make the fixes myself. That's not blocking this PR, just wanted to rant...

@robcza
Copy link
Contributor Author

robcza commented Sep 11, 2015

Thank you @sebix, not using merge to get changes from upstream ever since you told me, so it shouldn't happen again. Key sorting makes sense, I'm totally ok with the changes you introduced.

event.add('comment', pulse['description'])
event.add('additional', additional, sanitize=True)
event.add('classification.type', 'blacklist', sanitize=True)
event.add('time.observation', report.value(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the following example to fill the event with time.observation, feed.url e feed.name

https:/certtools/intelmq/blob/master/intelmq/bots/parsers/autoshun/parser.py#L38

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@robcza You can cherry-pick my commit after adding my repo as remote to your local branch:

git remote add sebix [email protected]:sebix/intelmq.git
git remote update
git cherry-pick 8e2e225beda1b0ff5b417649f44396aca1e962d8

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SYNchroACK there seem to be a typo in that examplereports instead of report report

@sebix The proposal of yours also takes the "raw" field from the report making it impossible to add the "raw" record for the event itself later, failing later like this:
ERROR - Bot has found a problem.\nTraceback (most recent call last):\n File "/usr/local/lib/python2.7/dist-packages/intelmq-1.0.0-py2.7.egg/intelmq/lib/bot.py", line 117, in start\n self.process()\n File "/usr/local/lib/python2.7/dist-packages/intelmq-1.0.0-py2.7.egg/intelmq/bots/parsers/alienvault/parser_otx.py", line 75, in process\n event.add("raw", json.dumps(indicator), sanitize=True)\n File "/usr/local/lib/python2.7/dist-packages/intelmq-1.0.0-py2.7.egg/intelmq/lib/message.py", line 77, in add\n raise exceptions.KeyExists(key)\nKeyExists: key u\'raw\' already exists

Should we replace the raw field instead of adding (does not feel like a proper way copying the whole raw report again and again) it or change the "constructor" to ignore the "raw" attribute from report (sorry for the cpp terminology).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the code in lib/message.py: https:/certtools/intelmq/blob/master/intelmq/lib/message.py#L181
The Event-constructor differentiates between a Report and other parameter types like dict. Report is derived from Message which is in turn derived from dict. For a report instance, it only takes feed.(name|url) and time.observation, not raw. For everything else, like dicts and messages, the parameter is given to the constructor of the parent class, which is in the end dict, and all fields are copied. If this is really the case, it does mean that receive_message / the pipeline does not correctly unserialize the message!
So first please check the type of report e.g. self.logger.debug('report type {!r}'.format(type(report))), and/or some issubclass calls.

@robcza
Copy link
Contributor Author

robcza commented Sep 11, 2015

well, well, problem was actually in my outdated repo, tests are passing now with the cherry-picked changes from @sebix

@SYNchroACK
Copy link
Contributor

@sebix and @robcza , is it ok to merge it now?

@sebix
Copy link
Member

sebix commented Sep 11, 2015

Yes. I hope the merge commit from today 95afec8 does not break anything....

@robcza
Copy link
Contributor Author

robcza commented Sep 11, 2015

I'm not aware of any problems right now. I should focus on the classification later, but merging this PR is ok.

@SYNchroACK
Copy link
Contributor

ok. Thank you

SYNchroACK added a commit that referenced this pull request Sep 11, 2015
@SYNchroACK SYNchroACK merged commit efba3a8 into certtools:master Sep 11, 2015
@robcza robcza deleted the alienvault-otx branch September 11, 2015 13:33
import urlparse
import urllib
import urllib2
import simplejson as json
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last question: why did you chose simplejson over json? To cite the release notes of Python 2.7:

Updated module: The json module was upgraded to version 2.0.9 of the simplejson package, which includes a C extension that makes encoding and decoding faster. (Contributed by Bob Ippolito; issue 4136.)

Also see SO:
What are the differences between json and simplejson Python modules?
, links a unicode bug in simplejson and states:

json is simplejson, added to the stdlib.

@sebix sebix mentioned this pull request Aug 11, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants