ENH: Add "layout" mode for text extraction #2388

Merged: 41 commits, Jan 11, 2024

Commits on Jan 3, 2024

  1. ENH: text extraction "layout" mode

    - add _text_extraction/_layout_mode subpackage (initial version)
    - expose new subpackage functionality via new PageObject methods _layout_mode_fonts() and _layout_mode_text()
    - add "extraction_mode" parameter and layout_mode kwargs to existing PageObject.extract_text() method for experimental usage
    shartzog committed Jan 3, 2024
    86ed974
  2. BUG: bad refactor in _layout_mode/_fonts.py

    Remove unnecessary "any()" wrapper after refactoring for python 3.7
    shartzog committed Jan 3, 2024
    f43b84e
  3. STY: Address ruff issues

    shartzog committed Jan 3, 2024
    220de15
  4. 21d9f1b
  5. STY: final ruff fixes?

    shartzog committed Jan 3, 2024
    9fa3b5f
  6. 1545a27
  7. 81b6a83
  8. STY: Address mypy errors

    shartzog committed Jan 3, 2024
    bb9190b
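
The first commit above adds an "extraction_mode" parameter to PageObject.extract_text(). A minimal usage sketch follows, assuming a page object obtained from a pypdf reader; the helper name extract_plain_and_layout is hypothetical and only illustrates the call shape described in the commit message.

```python
from typing import Any, Tuple


def extract_plain_and_layout(page: Any) -> Tuple[str, str]:
    """Return (default, layout) text for a pypdf-style page object.

    Hypothetical helper: `page` is assumed to expose extract_text()
    with the "extraction_mode" parameter named in the commit message.
    """
    # Default extraction mode (no extra arguments).
    default_text = page.extract_text()
    # Layout mode, selected through the new "extraction_mode" parameter.
    layout_text = page.extract_text(extraction_mode="layout")
    return default_text, layout_text
```

Comparing the two strings side by side is a quick way to judge whether layout mode helps for a given document, since the commit describes the mode as experimental.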

Commits on Jan 4, 2024

  1. cefbfc6
  2. MAINT: Address PR review comments

    - DOC: standardize language. use "layout", not "structure/structural".
    - BUG: address bug introduced by ruff refactoring (remove "TYPE_CHECKING" block for Literal import)
    - DEV: use sys.version_info based import switch (not try/except) for Literal and TypedDict to correct vscode colors and prevent odd mypy errors
    - TST: add test created by @MartinThoma in py-pdf#2390
    - ENH: add remaining standard fonts and aliases
    shartzog committed Jan 4, 2024
    ff7e40f
  3. f37909c
  4. TST: fp.read() encoding fix

    shartzog committed Jan 4, 2024
    48e971e
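
The "sys.version_info based import switch" mentioned in the review-comments commit above can be sketched as follows. This is an illustration of the general pattern, not the PR's actual code; the ExtractionMode alias, the FontInfo dict, and the describe() function are hypothetical names, and the "plain" mode name is an assumption.

```python
import sys

# Static checkers understand version comparisons, so they resolve this
# branch directly instead of guessing which import a try/except yields.
if sys.version_info >= (3, 8):
    from typing import Literal, TypedDict
else:
    # Backport package; only reached on Python 3.7.
    from typing_extensions import Literal, TypedDict

# Hypothetical alias; "plain" is an assumed name for the default mode.
ExtractionMode = Literal["plain", "layout"]


class FontInfo(TypedDict):  # hypothetical structure for illustration
    name: str
    space_width: float


def describe(mode: ExtractionMode, font: FontInfo) -> str:
    """Format a short summary of an extraction mode and a font."""
    return f"{mode}: {font['name']} (space width {font['space_width']})"
```

With try/except, editors and mypy may treat both import paths as possible and report conflicting definitions, which matches the "odd mypy errors" the commit message refers to.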

Commits on Jan 5, 2024

  1. MAINT: Address review comments

    - PI: move json imports (debug only)
    - DEV: move `_set_state_param()` definition nearer to usage
    - MAINT: use `PdfReadError` vs `ValueError`
    - DOC: Comment/docstring improvements per review
    shartzog committed Jan 5, 2024
    8742bcc
  2. 8e9d879
  3. TST: space_vertically

    - DEV: add LAYOUT_NEW_BT_GROUP_SPACE_WIDTHS constant to _text_extraction __init__.py
    shartzog committed Jan 5, 2024
    4dc3250

Commits on Jan 6, 2024

  1. TST: missed rstrip()

    shartzog committed Jan 6, 2024
    d1d85a0
  2. Improve line coverage

    MartinThoma committed Jan 6, 2024
    70b2f31
  3. test to_dict

    MartinThoma committed Jan 6, 2024
    e7d5edd
  4. Add test with form

    MartinThoma committed Jan 6, 2024
    64d1df0

Commits on Jan 7, 2024

  1. ENH: TJ spacing and rotation handling

    - DEV: disambiguate "XformStack" and "xform" language in layout mode from existing extract_xform_text:
      - "xform" --> "transform"
      - "XformStack" --> "TextStateManager"
      - "xform_stack" --> "text_state_mgr" or "state_mgr"
    - DEV: move "TextStateParams" to its own file for easier discoverability and cross referencing during development
    - PI: reduce overhead of TextStateParams by eliminating unnecessary dataclass fields
    - DEV: rename _fonts.py to _font.py to properly reflect internal class name (Font)
    - DEV: rename "opands" to "operands" for all usages in _fixed_width_page.py
    - ENH: Use font.space_width * 2 as a fallback for assigning space_tx during TextStateParams.__post_init__()
      - applies when a font assigns width 0 for the " " char and uses TJ int operators for fine grained inter-word spacing as shown in crazyones.pdf
    - ENH: correct calculation for triggering a new BTGroup in recurs_to_target_op(). Remove abs() to prevent triggering on TJ spacing operators
    - ENH: rotation handling:
      - add "layout_mode_strip_rotated" kwarg to PageObject.extract_text() to assign new layout mode "strip_rotated" parameter.
      - produce a logger_warning if rotated text is found.
      - if strip_rotated == True, remove text that is rotated with respect to the page with warning "Rotated text discovered. Output will be incomplete."
      - if strip_rotated == False, include all text, rotated or not, and warn with "Rotated text discovered. Layout will be degraded."
    shartzog committed Jan 7, 2024
    377bbd1
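
The rotation handling described in the commit above can be sketched roughly as follows. This is not the PR's implementation: TextChunk, is_rotated(), and filter_rotated() are hypothetical names, and the rotation test on the transform matrix is an assumption made for illustration. Only the two warning messages are taken from the commit message.

```python
import logging
from typing import List, NamedTuple, Sequence, Tuple

logger = logging.getLogger(__name__)


class TextChunk(NamedTuple):  # hypothetical container for illustration
    text: str
    # Transform matrix entries (a, b, c, d, e, f), as in a PDF text matrix.
    matrix: Tuple[float, float, float, float, float, float]


def is_rotated(chunk: TextChunk) -> bool:
    """Assumed test: nonzero skew terms, or a negative x scale (180 degrees)."""
    a, b, c, _d, _e, _f = chunk.matrix
    return b != 0 or c != 0 or a < 0


def filter_rotated(
    chunks: Sequence[TextChunk], strip_rotated: bool = True
) -> List[TextChunk]:
    """Warn if rotated text exists; drop it only when strip_rotated is set."""
    if not any(is_rotated(chunk) for chunk in chunks):
        return list(chunks)
    if strip_rotated:
        logger.warning("Rotated text discovered. Output will be incomplete.")
        return [chunk for chunk in chunks if not is_rotated(chunk)]
    logger.warning("Rotated text discovered. Layout will be degraded.")
    return list(chunks)
```

Either way the caller gets a warning, so the choice is between incomplete output and a degraded layout, as the two messages state.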

Commits on Jan 8, 2024

  1. 3a0fc89
  2. fe7bb69
  3. cec0be3
  4. Merge pull request #1 from shartzog/main

    Main
    shartzog authored Jan 8, 2024
    955bd38
  5. BUG: address bugs caused by rename/refactor

    - correct submodule name in test_text_extraction.py
    - resolve indirect objects in /DescendantFonts
    - add "layout_mode_strip_rotated" explanation to extract-text.md
    - prevent double spacing for the first tj element of a bt group
    shartzog committed Jan 8, 2024
    579692a
  6. Merge branch 'text-layout-mode' of https:/shartzog/pypdf

    …into text-layout-mode
    shartzog committed Jan 8, 2024
    4402caa
  7. Fix ruff/mypy

    shartzog committed Jan 8, 2024
    41417eb
  8. 778f3c7
  9. 8279c79
  10. Tests Bug Fixes "Uncommon" Operators

    - add toy.pdf and toy.layout.pdf and associated test case for handling T*, ', ", TD, Tc, Tw, Tz, TL, and Ts operators
    - correct bugs associated with TL impacting T*, ', and " (the sign is reversed relative to the PDF 1.7 standard, a side effect of the layout mode algorithm)
    - make "_set_state_param" and "decode_tj" methods of the TextStateManager class rather than passing the text state manager to them manually
    shartzog committed Jan 8, 2024
    75aec12
  11. Typing / Style Corrections

    shartzog committed Jan 8, 2024
    f25e9d5
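
For context on the "uncommon" operators commit above, the following sketch shows how TL (leading) feeds into T*, ', ", and TD under the standard PDF 1.7 semantics. It is not the layout-mode code, which per the commit message uses a reversed sign for TL internally. The LineTracker class and its method names are hypothetical, and the sketch assumes an identity text matrix so that Td is a plain translation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class LineTracker:  # hypothetical helper for illustration
    x: float = 0.0
    y: float = 0.0
    leading: float = 0.0       # set by TL (and implicitly by TD)
    word_spacing: float = 0.0  # set by Tw
    char_spacing: float = 0.0  # set by Tc
    shown: List[Tuple[float, float, str]] = field(default_factory=list)

    def op_TL(self, leading: float) -> None:
        self.leading = leading

    def op_Td(self, tx: float, ty: float) -> None:
        # Move the start of the line by (tx, ty).
        self.x += tx
        self.y += ty

    def op_TD(self, tx: float, ty: float) -> None:
        # "tx ty TD" is equivalent to "-ty TL" followed by "tx ty Td".
        self.leading = -ty
        self.op_Td(tx, ty)

    def op_T_star(self) -> None:
        # "T*" is equivalent to "0 -TL Td": move down by the leading.
        self.op_Td(0.0, -self.leading)

    def op_Tj(self, text: str) -> None:
        self.shown.append((self.x, self.y, text))

    def op_quote(self, text: str) -> None:
        # "'" moves to the next line, then shows the string.
        self.op_T_star()
        self.op_Tj(text)

    def op_dquote(self, aw: float, ac: float, text: str) -> None:
        # '"' sets word and character spacing, then behaves like "'".
        self.word_spacing = aw
        self.char_spacing = ac
        self.op_quote(text)
```

Because T*, ', and " all route through the leading, a sign error in TL shifts every following line in the wrong direction, which is why one fix touched all three operators.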

Commits on Jan 9, 2024

  1. Font refactoring/tests

    - cover both Type0 DescendantFonts /W formats in tests
    - add `set_font()` to TextStateManager instead of setting font/font_size attributes directly
    shartzog committed Jan 9, 2024
    744a6db
  2. utf-8 instead of utf8

    Co-authored-by: Stefan
    MartinThoma and stefan6419846 authored Jan 9, 2024
    c5f0cd8
  3. oops

    MartinThoma authored Jan 9, 2024
    1b65085
  4. Fix sphinx build warning

    MartinThoma committed Jan 9, 2024
    373025d
  5. Run pre-commit

    MartinThoma committed Jan 9, 2024
    cdaa9ca
  6. Run pre-commit

    MartinThoma committed Jan 9, 2024
    64e4c83
  7. 878e407
  8. Use splitlines

    MartinThoma committed Jan 9, 2024
    e9962b3
  9. Use Optional

    MartinThoma committed Jan 9, 2024
    06e79d3
  10. f43201a
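
The "both Type0 DescendantFonts /W formats" covered by the font tests above are the two entry shapes the PDF specification allows in a CIDFont /W array: `c [w1 w2 ... wn]` (consecutive CIDs starting at c) and `c_first c_last w` (one width for a CID range). A minimal parsing sketch follows; parse_cid_widths is a hypothetical name, not the PR's function, and the input is assumed to be already resolved into plain Python numbers and lists.

```python
from typing import Dict, Sequence


def parse_cid_widths(w_array: Sequence) -> Dict[int, float]:
    """Expand a CIDFont /W array into a {cid: width} mapping.

    Assumes a well-formed array; a truncated one raises IndexError.
    """
    widths: Dict[int, float] = {}
    i = 0
    while i < len(w_array):
        first = w_array[i]
        second = w_array[i + 1]
        if isinstance(second, (list, tuple)):
            # Format 1: c [w1 w2 ... wn] assigns widths to c, c+1, ...
            for offset, width in enumerate(second):
                widths[first + offset] = float(width)
            i += 2
        else:
            # Format 2: c_first c_last w assigns one width to the range.
            width = float(w_array[i + 2])
            for cid in range(first, second + 1):
                widths[cid] = width
            i += 3
    return widths
```

For example, `[120, [400, 325, 500], 7080, 7082, 1000]` mixes both formats in one array: CIDs 120 to 122 get individual widths and CIDs 7080 to 7082 share the width 1000.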