Implemented valid UTF8 character checks #180

Open
wants to merge 5 commits into
base: master

zbalkan
Contributor

@zbalkan zbalkan commented Sep 26, 2024

Related issue
#179

Description

This continues issue wazuh/wazuh#23354 and follows up on the fix in PR wazuh/wazuh#23543.

This PR addresses an issue with the UTF-8 validation logic in the agent where valid UTF-8 multibyte characters were mistakenly being identified as invalid. The original implementation performed overly restrictive checks on sequences of bytes representing characters like Ü, ü, Õ, õ, Ö, ö, Ä, ä, Ş, ş, Ç, ç, causing the File Integrity Monitoring (FIM) module to incorrectly ignore file paths containing these characters.

Problem

The original validation logic checked for valid UTF-8 sequences but incorrectly marked certain valid multibyte characters as invalid due to overly restrictive rules on the leading byte of 2-, 3-, and 4-byte sequences. As a result, characters that are fully compliant with the UTF-8 standard were ignored, causing the FIM module to overlook legitimate file paths containing these characters. This led to unintended behavior in path validation and monitoring.

Solution

The macros for validating UTF-8 sequences have been updated to properly handle all valid UTF-8 byte ranges:

  • valid_2: Now properly validates 2-byte sequences, ensuring no overlong encodings occur and that valid 2-byte sequences are recognized.
  • valid_3: Correctly handles special cases where the leading byte is 0xE0 or 0xED. Overlong encodings starting with 0xE0 are excluded, and surrogate halves (reserved for UTF-16) starting with 0xED are correctly identified as invalid.
  • valid_4: Properly validates 4-byte sequences, ensuring sequences that start with 0xF0 are not overlong and that sequences do not exceed the Unicode limit (U+10FFFF).

With these fixes, the validation logic correctly identifies all valid UTF-8 sequences, including multibyte characters commonly used in various languages.
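The byte ranges described above can be sketched as plain C functions. This is a minimal illustration only, not the macros changed in this PR: the names `is_cont`, `seq2_ok`, `seq3_ok`, `seq4_ok`, and `utf8_is_valid` are hypothetical, and the checks follow the well-formed byte ranges of the UTF-8 standard.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if b is a continuation byte (10xxxxxx). */
static bool is_cont(unsigned char b) { return (b & 0xC0) == 0x80; }

/* 2-byte sequence: lead 0xC2..0xDF (0xC0 and 0xC1 would be overlong). */
static bool seq2_ok(const unsigned char *s) {
    return s[0] >= 0xC2 && s[0] <= 0xDF && is_cont(s[1]);
}

/* 3-byte sequence: 0xE0 needs a second byte >= 0xA0 (no overlongs),
 * 0xED needs a second byte <= 0x9F (no UTF-16 surrogates). */
static bool seq3_ok(const unsigned char *s) {
    if (!is_cont(s[1]) || !is_cont(s[2])) return false;
    if (s[0] == 0xE0) return s[1] >= 0xA0;
    if (s[0] == 0xED) return s[1] <= 0x9F;
    return s[0] >= 0xE1 && s[0] <= 0xEF;
}

/* 4-byte sequence: 0xF0 needs a second byte >= 0x90 (no overlongs),
 * 0xF4 needs a second byte <= 0x8F (nothing above U+10FFFF). */
static bool seq4_ok(const unsigned char *s) {
    if (!is_cont(s[1]) || !is_cont(s[2]) || !is_cont(s[3])) return false;
    if (s[0] == 0xF0) return s[1] >= 0x90;
    if (s[0] == 0xF4) return s[1] <= 0x8F;
    return s[0] >= 0xF1 && s[0] <= 0xF3;
}

/* Walks the buffer and accepts it only if every sequence is well formed.
 * The length checks keep the helpers from reading past the end. */
bool utf8_is_valid(const unsigned char *s, size_t len) {
    assert(s != NULL || len == 0);
    size_t i = 0;
    while (i < len) {
        if (s[i] <= 0x7F)                         i += 1;
        else if (len - i >= 2 && seq2_ok(s + i))  i += 2;
        else if (len - i >= 3 && seq3_ok(s + i))  i += 3;
        else if (len - i >= 4 && seq4_ok(s + i))  i += 4;
        else return false;
    }
    return true;
}
```

With this sketch, `utf8_is_valid((const unsigned char *)"\xC3\x9C", 2)` (the character Ü) returns true, while the overlong pair `"\xC0\x80"` is rejected.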

Configuration options

Logs/Alerts example

Tests

  • Compilation without warnings in every supported platform
    • Linux
    • Windows
    • MAC OS X
  • Source installation
  • Package installation
  • Source upgrade
  • Package upgrade
  • Review logs syntax and correct language
  • QA templates contemplate the added capabilities
  • Memory tests for Linux
    • Scan-build report
    • Coverity
    • Valgrind (memcheck and descriptor leaks check)
    • Dr. Memory
    • AddressSanitizer
  • Memory tests for Windows
    • Scan-build report
    • Coverity
    • Dr. Memory
  • Memory tests for macOS
    • Scan-build report
    • Leaks
    • AddressSanitizer
  • Retrocompatibility with older Wazuh versions
  • Working on cluster environments
  • Configuration on demand reports new parameters
  • The data flow works as expected (agent-manager-api-app)
  • Added unit tests (for new features)
  • Stress test for affected components
  • Decoder/Rule tests
    • Added unit testing files ".ini"
    • runtests.py executed without errors

@zbalkan zbalkan changed the title from "Implemented valud UTF8 character checks" to "Implemented valid UTF8 character checks" Sep 26, 2024
@zbalkan
Contributor Author

zbalkan commented Oct 13, 2024

Below is the breakdown of the UTF-8 characters based on acceptance.

Before

| Byte Sequence | Unicode Range | Byte Length | Status | Related Macro | Comments |
| --- | --- | --- | --- | --- | --- |
| 0x00 to 0x7F | U+0000 to U+007F | 1 | Accepted | valid_1 | Single-byte ASCII characters are accepted. |
| 0xC2 0x80 to 0xDF 0xBF | U+0080 to U+07FF | 2 | Accepted | valid_2 | Two-byte characters, including extended Latin, Greek, and Cyrillic. |
| 0xC0 0x80 to 0xC1 0xBF | U+0000 to U+007F | 2 | Excluded | valid_2 | Overlong encodings for ASCII characters are excluded. |
| 0xE1 0x80 0x80 to 0xEC 0xBF 0xBF | U+1000 to U+CFFF | 3 | Accepted | valid_3 | Valid three-byte characters, including many scripts, are accepted. |
| 0xEE 0x80 0x80 to 0xEF 0xBF 0xBF | U+E000 to U+FFFF | 3 | Accepted | valid_3 | Valid three-byte characters above the surrogate range are accepted. |
| 0xE0 0xA0 0x80 to 0xE0 0xBF 0xBF | U+0800 to U+0FFF | 3 | Excluded | valid_3 | Excluded because the macro rejects valid three-byte characters starting with 0xE0. |
| 0xE0 0x80 0x80 to 0xE0 0x9F 0xBF | U+0000 to U+07FF | 3 | Excluded | valid_3 | Overlong encodings using three bytes for the U+0000 to U+07FF range. |
| 0xED 0xA0 0x80 to 0xED 0xBF 0xBF | U+D800 to U+DFFF | 3 | Excluded | valid_3 | Surrogate code points reserved for UTF-16 are excluded (correctly). |
| 0xF1 0x80 0x80 0x80 to 0xF3 0xBF 0xBF 0xBF | U+40000 to U+FFFFF | 4 | Accepted | valid_4 | Valid four-byte characters in a narrow range (higher supplementary planes). |
| 0xF0 0x90 0x80 0x80 to 0xF0 0xBF 0xBF 0xBF | U+10000 to U+3FFFF | 4 | Excluded | valid_4 | Valid four-byte characters in the U+10000 to U+3FFFF range are excluded. |
| 0xF0 0x80 0x80 0x80 to 0xF0 0x8F 0xBF 0xBF | U+0000 to U+FFFF | 4 | Excluded | valid_4 | Overlong encodings using four bytes for ranges that could be encoded with fewer bytes. |

After

| Byte Sequence | Unicode Range | Byte Length | Status | Related Macro | Comments |
| --- | --- | --- | --- | --- | --- |
| 0x00 to 0x7F | U+0000 to U+007F | 1 | Accepted | valid_1 | Single-byte ASCII characters are accepted. |
| 0xC2 0x80 to 0xDF 0xBF | U+0080 to U+07FF | 2 | Accepted | valid_2 | Two-byte characters, including extended Latin, Greek, and Cyrillic. |
| 0xC0 0x80 to 0xC1 0xBF | U+0000 to U+007F | 2 | Excluded | valid_2 | Overlong encodings for ASCII characters are excluded. |
| 0xE1 0x80 0x80 to 0xEC 0xBF 0xBF | U+1000 to U+CFFF | 3 | Accepted | valid_3 | Valid three-byte characters, including many scripts, are accepted. |
| 0xEE 0x80 0x80 to 0xEF 0xBF 0xBF | U+E000 to U+FFFF | 3 | Accepted | valid_3 | Valid three-byte characters above the surrogate range are accepted. |
| 0xE0 0xA0 0x80 to 0xE0 0xBF 0xBF | U+0800 to U+0FFF | 3 | Accepted | valid_3 | Valid three-byte characters in the range U+0800 to U+0FFF are now accepted. |
| 0xE0 0x80 0x80 to 0xE0 0x9F 0xBF | U+0000 to U+07FF | 3 | Excluded | valid_3 | Overlong encodings using three bytes for the U+0000 to U+07FF range. |
| 0xED 0xA0 0x80 to 0xED 0xBF 0xBF | U+D800 to U+DFFF | 3 | Excluded | valid_3 | Surrogate code points reserved for UTF-16 are excluded (correctly). |
| 0xF0 0x90 0x80 0x80 to 0xF4 0x8F 0xBF 0xBF | U+10000 to U+10FFFF | 4 | Accepted | valid_4 | Valid four-byte characters from U+10000 to U+10FFFF are accepted. |
| 0xF0 0x80 0x80 0x80 to 0xF0 0x8F 0xBF 0xBF | U+0000 to U+FFFF | 4 | Excluded | valid_4 | Overlong encodings using four bytes for ranges that could be encoded with fewer bytes. |
| 0xF4 0x90 0x80 0x80 to 0xF7 0xBF 0xBF 0xBF | Out of Unicode range | 4 | Excluded | valid_4 | Invalid four-byte sequences that exceed the Unicode limit (U+10FFFF). |
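The mapping between byte sequences and code points in these tables can be checked by extracting the payload bits of each sequence. The helper below is a hypothetical sketch for that arithmetic only (the name `utf8_payload` is not part of the PR), and it deliberately performs no validation so that overlong and out-of-range sequences can be inspected too.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the payload bits of a single 1- to 4-byte
 * sequence of length n, without validating it. Returns UINT32_MAX if n is
 * not in 1..4. */
uint32_t utf8_payload(const unsigned char *s, size_t n) {
    /* Payload mask of the leading byte, indexed by sequence length. */
    static const unsigned char lead_mask[5] = {0x00, 0x7F, 0x1F, 0x0F, 0x07};

    assert(s != NULL);
    if (n < 1 || n > 4) return UINT32_MAX;

    uint32_t cp = s[0] & lead_mask[n];
    for (size_t i = 1; i < n; i++) {
        /* Each continuation byte contributes its low six bits. */
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }
    return cp;
}
```

For example, `0xE0 0xA0 0x80` yields 0x0800, `0xED 0xA0 0x80` yields 0xD800 (the first surrogate), and `0xF4 0x90 0x80 0x80` yields 0x110000, one past the Unicode limit.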

@zbalkan
Contributor Author

zbalkan commented Oct 13, 2024

Improved unit tests based on the edge cases.

UTF-8 Validation Test Coverage Matrix

| Test Case | Valid/Invalid | Covered Case |
| --- | --- | --- |
| test_valid_utf8_sequences | Valid | ASCII, 2-byte, 3-byte, 4-byte, complex scripts |
| test_invalid_utf8_sequences | Invalid | Overlong encodings, invalid sequences, surrogate halves |
| test_utf8_random_replace | Valid | Random byte stream with replacement, ensuring valid UTF-8 |
| test_utf8_random_not_replace | N/A | Random byte stream without replacement |
| test_utf8_edge_cases | Valid/Invalid | Edge: U+10FFFF (valid), beyond U+10FFFF (invalid) |
| New: test_empty_string | Valid | Empty string (valid UTF-8) |
| New: test_incomplete_utf8_sequences | Invalid | Incomplete 2-byte, 3-byte, 4-byte sequences |
| New: test_overlong_encodings | Invalid | Overlong encodings with 2, 3, or 4 bytes |
| New: test_surrogate_pair_boundary | Valid/Invalid | Just below and just in the surrogate range |
| New: test_maximal_overhead_cases | Valid | Maximal valid cases for each UTF-8 length |
| New: test_continuation_without_leading | Invalid | Continuation byte without a valid leading byte |

@zbalkan
Contributor Author

zbalkan commented Oct 13, 2024

UTF-8 Validation Test Coverage Matrix

| Test Case | Valid/Invalid | Covered Case |
| --- | --- | --- |
| test_valid_utf8_sequences | Valid | ASCII, 2-byte, 3-byte, 4-byte, complex scripts |
| test_invalid_utf8_sequences | Invalid | Overlong encodings, invalid sequences, surrogate halves |
| test_utf8_random_replace | Valid | Random byte stream with replacement, ensuring valid UTF-8 |
| test_utf8_random_not_replace | N/A | Random byte stream without replacement |
| test_utf8_edge_cases | Valid/Invalid | Edge: U+10FFFF (valid), beyond U+10FFFF (invalid) |
| test_empty_string | Valid | Empty string (valid UTF-8) |
| test_incomplete_utf8_sequences | Invalid | Incomplete 2-byte, 3-byte, 4-byte sequences |
| test_overlong_encodings | Invalid | Overlong encodings with 2, 3, or 4 bytes |
| test_surrogate_pair_boundary | Valid/Invalid | Just below and just in the surrogate range |
| test_maximal_overhead_cases | Valid | Maximal valid cases for each UTF-8 length |
| test_continuation_without_leading | Invalid | Continuation byte without a valid leading byte |
| New: test_surrogate_pair_extended_boundary | Valid/Invalid | U+D7FF (valid), U+DFFF (invalid, end of surrogate range) |
| New: test_multilingual_plane_cases | Valid | Characters from Supplementary Multilingual Plane (U+10000-U+1FFFF) |
| New: test_mixed_valid_invalid_utf8 | Invalid | Mixed valid and invalid UTF-8 sequences in a single string |
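A few of the edge cases in the matrix can be written as a table-driven check. The sketch below is an assumption about how such cases might look, not the unit tests added in this PR: `ref_utf8_valid`, `struct utf8_case`, and `count_utf8_case_failures` are hypothetical names, and the reference validator decodes each sequence and then rejects overlong forms, surrogates, and values above U+10FFFF.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decode-based reference validator. */
static bool ref_utf8_valid(const unsigned char *s, size_t len) {
    /* Smallest code point that needs n bytes, indexed by n. */
    static const uint32_t min_cp[5] = {0, 0x0, 0x80, 0x800, 0x10000};
    size_t i = 0;
    while (i < len) {
        unsigned char b = s[i];
        size_t n;
        uint32_t cp;
        if (b <= 0x7F)               { n = 1; cp = b; }
        else if ((b & 0xE0) == 0xC0) { n = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { n = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { n = 4; cp = b & 0x07; }
        else return false;              /* stray continuation or 0xF8..0xFF */
        if (len - i < n) return false;  /* incomplete sequence */
        for (size_t k = 1; k < n; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (uint32_t)(s[i + k] & 0x3F);
        }
        if (cp < min_cp[n]) return false;                /* overlong */
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  /* surrogate */
        if (cp > 0x10FFFF) return false;                 /* beyond Unicode */
        i += n;
    }
    return true;
}

struct utf8_case { const char *bytes; size_t len; bool expect_valid; };

static const struct utf8_case cases[] = {
    {"",                 0, true},   /* empty string */
    {"\xC3",             1, false},  /* incomplete 2-byte sequence */
    {"\xE2\x82",         2, false},  /* incomplete 3-byte sequence */
    {"\xF0\x9F\x98",     3, false},  /* incomplete 4-byte sequence */
    {"\xED\x9F\xBF",     3, true},   /* U+D7FF, just below surrogates */
    {"\xED\xA0\x80",     3, false},  /* U+D800, first surrogate */
    {"\xED\xBF\xBF",     3, false},  /* U+DFFF, last surrogate */
    {"\xF0\x90\x80\x80", 4, true},   /* U+10000, first 4-byte character */
    {"\x80",             1, false},  /* continuation without a lead byte */
    {"ok\xC0\xAF",       4, false},  /* valid ASCII followed by an overlong */
};

/* Returns how many cases disagree with the reference validator. */
size_t count_utf8_case_failures(void) {
    size_t failures = 0;
    for (size_t i = 0; i < sizeof cases / sizeof cases[0]; i++) {
        assert(cases[i].bytes != NULL);
        bool got = ref_utf8_valid((const unsigned char *)cases[i].bytes,
                                  cases[i].len);
        if (got != cases[i].expect_valid) failures++;
    }
    return failures;
}
```

Keeping the cases in one table makes it straightforward to run the same inputs against a different validator and compare the results.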

@vikman90 vikman90 linked an issue Oct 14, 2024 that may be closed by this pull request
@zbalkan
Contributor Author

zbalkan commented Oct 14, 2024

While this PR attempts to improve the existing solution, it would be better to use https://github.com/simdutf/is_utf8 for this task.

Development

Successfully merging this pull request may close these issues.

The non-UTF8 character check excludes valid UTF8 characters