Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Harmonization fixes #634

Merged
merged 19 commits into from
Sep 14, 2016
Merged

Harmonization fixes #634

merged 19 commits into from
Sep 14, 2016

Conversation

sebix
Copy link
Member

@sebix sebix commented Aug 9, 2016

  • Docs: types of fields link to type
  • tests for harmonization (regex is valid, length is int, type exists)
  • Fix a number of harmonization issues (together with @aaronkaplan )
    • set size limits for some fields
    • registry is an enum nowm, uppercase only
    • country codes are 2 chars, uppercase
    • Hashes are case-sensitive! (Previous data is invalid!)
    • New harmonization-type UppercaseString
  • intelmq_psql_initdb.py is adapted/fixed and tests for this file are added.

ping @bernhard-herzog

@sebix sebix added bug Indicates an unexpected problem or unintended behavior data-format component: core labels Aug 9, 2016
@sebix sebix added this to the Release v1.0 milestone Aug 9, 2016
@aaronkaplan
Copy link
Member

aaronkaplan commented Aug 9, 2016

looks ok, thx!!
@bernhard-herzog could you please still check if the changes work for you guys?

@sebix
Copy link
Member Author

sebix commented Aug 9, 2016

Before merging, we should announce it on intelmq-dev, there are some important changes included.

@sebix
Copy link
Member Author

sebix commented Aug 9, 2016

The tests fail because of issues on travis-ci's side, I hope they get fixed soon (as last time when the same issue appeared)

@bernhard-herzog
Copy link
Contributor

intelmq/bin/intelmq_psql_initdb.py is missing a mapping for the UppercaseString type.

@bernhard-herzog
Copy link
Contributor

What's the reasoning behind adding the "IF NOT EXISTS" to the "CREATE
TABLE" statement? It seems to me that it would be better to get an error
when attempting to execute the statement if the table already exists.

@bernhard-herzog
Copy link
Contributor

The new type and regex for event_hash seem wrong. The values are supposed to be SHA1 hashes, presumably represented as a hexadecimal string. This would fit the new length of 40 characters, but the regular expression allows much more than than only hexadecimal characters. In fact, the regex is identical to the one used for e.g. "malware.hash.md5", but that field does not have a length limit.

The easiest solution would be to change the regex on event_hash to allow only hexadecimal digits, or to use the same type definition (length, regex and type) for all fields containing cryptographic hashes of some form.

@sebix
Copy link
Member Author

sebix commented Aug 11, 2016

The problem with all the hashes is that we until recently never validated and normalized them, while they can have many different representations.
All these three strings are the SHA1-hashes for asd:

  • raw digest: b"\xe5N\xe7\xe2\x85\xfb\xb0'Ry\x14:\xbcLUNS\x14\xe7\xb4\x17\xec\xac\x83\xa5\x98J\x96O\xac\xba\xadh\x86j(A\xc3\xe8=\xdf\x12Z)\x85Vba\xc4\x01O\x9f\x96\x0e\xc6\x02S\xae\xbc\xda\x95\x13\xa9\xb4"
  • hex: 'e54ee7e285fbb0275279143abc4c554e5314e7b417ecac83a5984a964facbaad68866a2841c3e83ddf125a2985566261c4014f9f960ec60253aebcda9513a9b4'
  • base64: b'5U7n4oX7sCdSeRQ6vExVTlMU57QX7KyDpZhKlk+suq1ohmooQcPoPd8SWimFVmJhxAFPn5YOxgJTrrzalROptA=='

and then we can prefix all this with $sha1$ or $sha1$5000$.
We should allow only one representation, to allow comparing hashes and events. The examples above show 9 variations in total.

The first step in normalization was the attempt in #298 then we introduced different fields for md5 and sha1 as a hotfix 1d7a67b#diff-c8f7b29d126db063cf2e7268d2c9c5d6R152
This all started the mess we now have.

@Rafiot Could you please review if the rules for misp-related fields are correct?

@certtools/intelmq-contributors please review and give feedback

@bernhard-herzog
Copy link
Contributor

My concern about the regex for event_hash is specifically about that field itself. That field's regex allows the same variations as for the other hash fields even though its length of 40 and its description suggests that it should hold only hexadecimal SHA1 hashes. That's what seems inconsistent to me.

@bernhard-herzog
Copy link
Contributor

On the whole, apart from the relatively minor points I brought up in other comments, this change should be OK for us.

The important considerations are:

  • The schema for the events table should accept all events accepted by the validation of event
    attributes in intelmq/lib/message.py and intelmq/lib/harmonization.py. This makes sure that all
    events that reach the PostgreSQL output but can be stored in the database.

    This is mostly taken care of by the changes to intelmq_psql_initdb.py which doesn't impose
    arbitrary limits on the database types anymore. There's only one case missing: types it doesn't
    know about are mapped to varchar(2000). It does print a warning in that case, which is at least
    something, but I wonder whether it wouldn't be better to treat that case an an error.

  • Looking forward to future changes, particularly after a 1.0 release, we need to be careful when
    changing the types and particularly when tightening the constraints. By that point there will likely be
    installations with data that fits the old types/constraints but which may not fit the new ones.

    So are there any more constraints that maybe should be imposed on the data?
    If so, lets implement them now. Looking through harmonization.conf the following fields could
    perhaps use some constraints or more specific types:

    • destination.geolocation.latitude, destination.geolocation.longitude,
      source.geolocation.latitude, source.geolocation.longitude

      These are currently floats, but their values should perhaps be restricted to the obvious range.
      Min/max values are not supported yet, AFAICT.

@Rafiot
Copy link
Member

Rafiot commented Aug 16, 2016

@sebix: the MISP related changes are good for me.

@Rafiot
Copy link
Member

Rafiot commented Aug 16, 2016

I just re-ran the tests on travis, it still fails, but this time for a good reason :)

@sebix
Copy link
Member Author

sebix commented Aug 17, 2016

@Rafiot This is again travis' fault, it's fixed in the PR

Signed-off-by: Sebastian Wagner <[email protected]>
Signed-off-by: Sebastian Wagner <[email protected]>
registry is an enum nowm, uppercase only
country codes are 2 chars, uppercase
Hashes are case-sensitive!
New harmonization-type UppercaseString

Signed-off-by: Sebastian Wagner <[email protected]>
Signed-off-by: Sebastian Wagner <[email protected]>
Signed-off-by: Sebastian Wagner <[email protected]>
Signed-off-by: Sebastian Wagner <[email protected]>
@codecov-io
Copy link

codecov-io commented Aug 26, 2016

Current coverage is 72.27% (diff: 88.46%)

Merging #634 into master will increase coverage by 1.04%

@@             master       #634   diff @@
==========================================
  Files           199        201     +2   
  Lines          7339       7463   +124   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits           5228       5394   +166   
+ Misses         2111       2069    -42   
  Partials          0          0          

Powered by Codecov. Last update a15cce0...c6487a3

@SYNchroACK
Copy link
Contributor

@sebix everything cool from my side

@sebix
Copy link
Member Author

sebix commented Sep 2, 2016

Concerning the hashes: I think we need to postpone this topic entirely, there are multiple things to fix. E.g. we need a list of hashes and normalization. Both are non-trivial. E.g. we could partially use passlib's identify function

For now I propose we loosen the constraints and checks, to not pose any additional burden to users:

  • no binary representation, thus only printable chars
  • nearly arbitrary prefixes
  • max length of 200 (128 base64-chars+some prefixes)

There's only one case missing: types it doesn't
know about are mapped to varchar(2000). It does print a warning in that case, which is at least
something, but I wonder whether it wouldn't be better to treat that case an an error.

It's raising now.

So are there any more constraints that maybe should be imposed on the data?
If so, lets implement them now. Looking through harmonization.conf the following fields could
perhaps use some constraints or more specific types:

  • destination.geolocation.latitude, destination.geolocation.longitude,
    source.geolocation.latitude, source.geolocation.longitude

    These are currently floats, but their values should perhaps be restricted to the obvious range.
    Min/max values are not supported yet, AFAICT.

The only missing restriction are the min/max afaict. As we already defined them as float, the only valid format is the Signed degrees format, as opposed to compass direction formats.

Possible value restrictions:

  • Latitudes: [-90, 90]
  • Longitudes: [-180, 180]

What's the reasoning behind adding the "IF NOT EXISTS" to the "CREATE
TABLE" statement? It seems to me that it would be better to get an error
when attempting to execute the statement if the table already exists.

We tried to create a script which updates the table definition otherwise, but that's not possible anyway. Removed.

@aaronkaplan
Copy link
Member

aaronkaplan commented Sep 9, 2016

  • need to still fix something: need to normalise (lowercase, remove "-" dashes etc) the registry names to our fixed list of enums of registry names.

@aaronkaplan
Copy link
Member

aaronkaplan commented Sep 9, 2016

  • increase event hash to 100. See above.

@coveralls
Copy link

coveralls commented Sep 9, 2016

Coverage Status

Coverage increased (+1.04%) to 72.277% when pulling c6487a3 on sebix:harmonization-fixes into a15cce0 on certtools:master.

@aaronkaplan
Copy link
Member

I reviewed @bernhard-herzog and all your comments. We also let this PR rest a bit so that enough people can comment.
Based on what I read, we are generally okay with this PR. There are future improvements (such as normalising hash values to standard format) possible. But we will address those in 1.1.
I will merge this PR therefore.

Thanks everyone!

@aaronkaplan aaronkaplan merged commit 9996db6 into certtools:master Sep 14, 2016
@sebix sebix deleted the harmonization-fixes branch September 16, 2016 07:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Indicates an unexpected problem or unintended behavior component: core data-format
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants