Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RTL multi-line text gets bottom-up #901

Closed
gmischler opened this issue Aug 18, 2023 · 19 comments
Closed

RTL multi-line text gets bottom-up #901

gmischler opened this issue Aug 18, 2023 · 19 comments

Comments

@gmischler
Copy link
Collaborator

When arabic text is placed in a multi-line context (multi_cell(), write(), text regions, etc.), in addition to getting correctly printed right-to-left on each line, the actual lines end up in reverse order, ie. bottom-to-top.

I've modified the relevant example from "test_text_shaping.py" to demonstrate:

def test_arabic_right_to_left(tmp_path):
    # issue #549
    pdf = FPDF()
    pdf.add_page()
    pdf.add_font(
        family="KFGQPC", fname=HERE / "KFGQPC Uthmanic Script HAFS Regular.otf"
    )
    pdf.set_font("KFGQPC", size=36)
    pdf.ln(36)
    pdf.set_text_shaping(True)
    pdf.cell(txt="مثال على اللغة العربية. محاذاة لليمين.", new_x="LEFT", new_y="NEXT")
    pdf.ln()
    pdf.multi_cell(w=80, txt="مثال على اللغة العربية. محاذاة لليمين.", new_x="LEFT", new_y="NEXT")

    assert_pdf_equal(pdf, HERE / "arabic.pdf", tmp_path)

grafik

Maybe we were a bit too eager to declare this topic solved... 🙄

Now in hindsight, this is not really surprising, given how our line wrapping is implemented.
For any rtl text, we really should start scanning from the back.
Unfortunately we can't just look at the current fragment in isolation, as there might be several rtl fragments in sequence, which we need to wrap as a whole.
Food for thought...

@andersonhc
Copy link
Collaborator

I believe we should add the text direction as a property of each fragment and perform the line break algorithm using the correct direction.

While at it maybe we can improve our line break algorithm to make it compliant with the Unicode Line Breaking Algorithm.

@andersonhc
Copy link
Collaborator

andersonhc commented Sep 27, 2023

I'm adding an example with English text forced to be shaped right to left.
When the text is in a single fragment it works OK, but when there's multiple fragments (like markdown breaking bold/italic fragments) it orders the fragments left to right and produce the incorrect output.

from fpdf import FPDF

text = "This is a very long string to test how line break is working when text shaping is rendering it right to left. Doesn't make much sense in English."
text_md = "This is a very long string to test **how** line break is working when __text shaping__ is rendering it right to left. Doesn't make much sense **in English**."

pdf = FPDF()
pdf.add_page()
pdf.add_font(family="DejaVuSans", style="", fname="DejaVuSans.ttf")
pdf.add_font(family="DejaVuSans", style="B", fname="DejaVuSans-Bold.ttf")
pdf.add_font(family="DejaVuSans", style="I", fname="DejaVuSans-Oblique.ttf")
pdf.set_font('DejaVuSans', size=12)
pdf.set_text_shaping(True, script="latn", language="eng", direction="rtl")
pdf.multi_cell(w=pdf.epw,txt=text,markdown=True)
pdf.ln(3)
pdf.multi_cell(w=pdf.epw,txt=text_md,markdown=True)
pdf.output("test_rtl_line_break.pdf")
Screenshot 2023-09-27 083635

@andersonhc
Copy link
Collaborator

Fixed in #1096

@andersonhc
Copy link
Collaborator

done

@andrewlutz
Copy link

Hey @andersonhc I've upgrade to 2.7.8 but am still seeing this issue when using reshaping. I am using Google's NotoSans AR font.

Screenshot 2024-02-21 at 7 40 36 PM

Repro:

import tempfile

from arabic_reshaper import reshape
from bidi.algorithm import get_display
from fpdf import FPDF

class RTLTest:
    def test(self):
        pdf = FPDF(orientation='P', unit='mm', format='A4')
        pdf.add_font(
                        family='STANDARD',
                        style='',
                        fname=PATH + 'Sans-Pro-Regular.otf',
                    )
        pdf.add_font(
                        family='NOTO_SANS_AR',
                        style='',
                        fname=PATH + 'NotoSansArabic-Regular.ttf',
                    )
        pdf.set_font('STANDARD')
        pdf.set_fallback_fonts(['NOTO_SANS_AR'])

        pdf.add_page()

        text = u'في Company X نعمل بجد لاتخاذ إجراءات ضد المحتوى الضار وغير القانوني. ونتيجة لذلك، قمنا بتقييد إمكانية الوصول إلى المنشور الذي قمتَ بـ تم إنشاؤه.'
        shaped_text = get_display(reshape(text))

        pdf.set_font_size(8)
        pdf.multi_cell(w=0, text=shaped_text, align='R')
        pdf.ln()

        pdf.set_font_size(20)
        pdf.multi_cell(w=0, text=shaped_text, align='R')

        _, tmp_filename = tempfile.mkstemp()
        pdf.output(tmp_filename)
        return tmp_filename

@andersonhc
Copy link
Collaborator

Hi @andrewlutz, can you please remove the calls to arabic_reshaper and bidi.algorithm and try again? Just pass text instead of shaped_text in multicell

@andrewlutz
Copy link

I think the issue I'm having is that the string coming from our translation system is in LTR order, I need to run get_display(reshape(text)) in order to get in RTL. I've also tried pdf.set_text_shaping(use_shaping_engine=True, direction="rtl") but that doesn't transform the string to RTL.

Does FPDF assume that the text is already in LTR order?

@andersonhc
Copy link
Collaborator

I think the issue I'm having is that the string coming from our translation system is in LTR order, I need to run get_display(reshape(text)) in order to get in RTL. I've also tried pdf.set_text_shaping(use_shaping_engine=True, direction="rtl") but that doesn't transform the string to RTL.

Does FPDF assume that the text is already in LTR order?

just pass the unshaped text - if you are using set_text_shaping() fpdf will handle all the LTR-RTL conversion

@andersonhc
Copy link
Collaborator

andersonhc commented Feb 22, 2024

This is a test I did:

        pdf.set_font('STANDARD')
        pdf.set_fallback_fonts(['NOTO_SANS_AR'])

        pdf.add_page()

        pdf.set_text_shaping(use_shaping_engine=True)
        text = u'في Company X نعمل بجد لاتخاذ إجراءات ضد المحتوى الضار وغير القانوني. ونتيجة لذلك، قمنا بتقييد إمكانية الوصول إلى المنشور الذي قمتَ بـ تم إنشاؤه.'
        
        pdf.set_font_size(8)
        pdf.multi_cell(w=0, text=text, align='R')
        pdf.ln()

        pdf.set_font_size(20)
        pdf.multi_cell(w=0, text=text, align='R')

        pdf.output("issue-901.pdf")

and it produced:

issue901

If you use the fpdf text shaping engine you don't need bidi_algorithm and arabic reshaper

@andrewlutz
Copy link

If you drop the string produced using just pdf.set_text_shaping(use_shaping_engine=True) into google translate it produces gibberish. If you drop the string produced using bidi_algorithm and reshaper then google translate is 100% correct.

@andersonhc
Copy link
Collaborator

If you drop the string produced using just pdf.set_text_shaping(use_shaping_engine=True) into google translate it produces gibberish. If you drop the string produced using bidi_algorithm and reshaper then google translate is 100% correct.

If you copy text from the PDF file and past elsewhere it won't work. We still need to implement marked-content tags to proper signalize to the client the original text order.

@andrewlutz
Copy link

Ah ok thanks for clarifying that. I'll check the final result with our internationalization team.

@andrewlutz
Copy link

@andersonhc I'm at a bit of a loss here, I checked with an Arabic speaking colleague and the feedback I received was that the text produced when using pdf.set_text_shaping(use_shaping_engine=True) was scrambled text with no logical order.

The text below was produced with text shaping disabled and with get_display(reshape(text)). The first sentence (one line) is correctly translated/printed. When the sentence becomes multi-line it is printing bottom up with order of the words reversed.

image

@andersonhc
Copy link
Collaborator

@andrewlutz

I found and fixed the bug with text_shaping that was causing the RTL fragments get incorrectly ordered.
If you install the latest version you can use it:
pip install git+https:/py-pdf/fpdf2.git@master

When you use get_display(reshape())) you transform the whole text into LTR and fpdf is not aware of the correct direction of each part of the text, so the line break is done as normal LTR text.

@andrewlutz
Copy link

Thanks! Will test this out later today

@andrewlutz
Copy link

Hey @andersonhc I tried this but am still getting the same bottom-up result. Were you able to verify with the repro I provided? I'm wondering if somehow the latest fpdf didn't get pulled in correctly.

@andersonhc
Copy link
Collaborator

Hey @andersonhc I tried this but am still getting the same bottom-up result. Were you able to verify with the repro I provided? I'm wondering if somehow the latest fpdf didn't get pulled in correctly.

I installed the latest fpdf version from github master:

pip install git+https:/py-pdf/fpdf2.git@master

and ran the following python file:

pdf = FPDF(orientation='P', unit='mm', format='A4')
pdf.add_font(family='STANDARD', style='', fname=PATH + 'NotoSans-Regular.ttf',)
pdf.add_font(family='NOTO_SANS_AR', style='', fname=PATH + 'NotoSansArabic-Regular.ttf',)
pdf.set_font('STANDARD')
pdf.set_fallback_fonts(['NOTO_SANS_AR'])

pdf.add_page()

pdf.set_text_shaping(use_shaping_engine=True, direction="rtl")
text = u'في Company X نعمل بجد لاتخاذ إجراءات ضد المحتوى الضار وغير القانوني. ونتيجة لذلك، قمنا بتقييد إمكانية الوصول إلى المنشور الذي قمتَ بـ تم إنشاؤه.'
pdf.set_font_size(8)
pdf.multi_cell(w=pdf.epw, text=text, align='R')
pdf.ln()

pdf.set_font_size(20)
pdf.multi_cell(w=pdf.epw, text=text, align='R')

pdf.output("issue-901.pdf")

The result I got is attached. Visually it is similar to what you are getting in your last message and the one breaking on multiple lines is also visually similar.
issue-901.pdf

@andrewlutz
Copy link

Glad to hear it's working for you, that result looks good. Likely an issue with importing on my end.

When do you estimate a release with this fix will be ready?

@andrewlutz
Copy link

Hey @andersonhc are you able to do a release soon with the RTL fix? We'd really appreciate as it's currently a blocker for us :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants