Incorrect Item Size in KA-en #40

Closed
rgaudin opened this issue Oct 18, 2022 · 1 comment

@rgaudin
Member

rgaudin commented Oct 18, 2022

In this run, a kolibri2zim over the full khan-academy in English crashed with

Assertion failed at ../../SOURCE/libzim_release/src/writer/cluster.cpp:253
 size[489062] == provider->getSize()[1226905]
/usr/local/lib/python3.8/site-packages/libzim.so.7(_Z15_on_assert_failImmEvPKcS1_S1_T_T0_S1_i+0x1a9) [0x7f29e10d6c69]
/usr/local/lib/python3.8/site-packages/libzim.so.7(+0x197a44) [0x7f29e1103a44]
/usr/local/lib/python3.8/site-packages/libzim.so.7(_ZNK3zim6writer7Cluster13write_contentESt8functionIFvRKNS_4BlobEEE+0xde) [0x7f29e1103b2e]
/usr/local/lib/python3.8/site-packages/libzim.so.7(_ZNK3zim6writer7Cluster5writeEi+0xec) [0x7f29e110430c]
/usr/local/lib/python3.8/site-packages/libzim.so.7(_ZN3zim6writer13clusterWriterEPv+0x111) [0x7f29e1106141]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbbb2f) [0x7f29e0ea3b2f]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7fa3) [0x7f29e558cfa3]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f29e532eeff]
terminate called after throwing an instance of 'std::runtime_error'
  what():  
Assertion failed at ../../SOURCE/libzim_release/src/writer/cluster.cpp:253
 size[489062] == provider->getSize()[1226905]

This is due to the following assert inside libzim's writer:

void Cluster::write_data(writer_t writer) const
{
  for (auto& provider: m_providers)
  {
    ASSERT(provider->getSize(), !=, 0U);
    zim::size_type size = 0;
    while(true) {
      auto blob = provider->feed();
      if(blob.size() == 0) {
        break;
      }
      size += blob.size();
      writer(blob);
    }
    ASSERT(size, ==, provider->getSize());
  }
}

The code has been modified since (see https://github.com/openzim/libzim/blob/3a9f574d1aa2f722257f195fcdd6874e3517b8c6/src/writer/cluster.cpp#L246) and would now raise a RuntimeError exception instead, but the problem is the same: the size written to the ZIM differs from the size returned by the Provider's get_size().
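
To make the failure mode concrete, here is a minimal Python sketch of the same check, with no libzim dependency. `FakeProvider` and `write_data` are hypothetical stand-ins, not libzim or python-libzim APIs. The sketch only assumes the contract visible in the C++ above: the writer calls feed() until it receives an empty blob, then compares the total against the declared size.

```python
class FakeProvider:
    """Hypothetical stand-in for a content provider (not the libzim API)."""

    def __init__(self, declared_size, chunks):
        self.declared_size = declared_size
        self._chunks = iter(chunks)

    def get_size(self):
        return self.declared_size

    def feed(self):
        # An empty bytes object signals the end of the content.
        return next(self._chunks, b"")


def write_data(provider, writer):
    """Mirror of the feed loop: write every blob, then compare sizes."""
    written = 0
    while True:
        blob = provider.feed()
        if len(blob) == 0:
            break
        written += len(blob)
        writer(blob)
    if written != provider.get_size():
        raise RuntimeError(
            f"declared size ({provider.get_size()}) and written size "
            f"({written}) differ"
        )
    return written
```

A provider that declares 1226905 bytes but only feeds 489062 before returning an empty blob would trip the final check, which matches the two numbers in the trace.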

Since kolibri2zim only prints its debug line after an entry has been added to the creator, we don't know which Entry caused the issue.

My investigations point to a funneled file, as other types of content are added as strings and their size is calculated automatically.

Funneled ones, on the other hand, are files that we download directly from the Studio into the ZIM using scraperlib's URLItem.

Looking at the KA DB, I found a single file reported to have the expected size: c142275210f3f6dec3dfbdb1d9836e7b.mp4.

It works as expected when tested individually, so my guess is that a network/server error caused the downloaded content to be different. Note that we make an initial tiny request to find the size, in order to decide whether we need to download to disk or not.
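
The size probe and the storage decision described above could look roughly like the sketch below. It is an illustration only: `plan_download`, `check_download` and the 2 MiB `MEMORY_THRESHOLD` are hypothetical names and values, not scraperlib's, and the declared size is passed in as a plain integer rather than fetched over the network.

```python
MEMORY_THRESHOLD = 2 * 1024 * 1024  # assumed cut-off, in bytes


def plan_download(declared_size, threshold=MEMORY_THRESHOLD):
    """Decide where to buffer a file, given the size announced by the server."""
    if declared_size is None:
        # No size announced: buffering on disk is the safe default.
        return "disk"
    return "memory" if declared_size <= threshold else "disk"


def check_download(declared_size, chunks):
    """Count the bytes in chunks and fail if they differ from the declared size."""
    received = sum(len(chunk) for chunk in chunks)
    if received != declared_size:
        raise IOError(f"expected {declared_size} bytes, received {received}")
    return received
```

The point of `check_download` is that the announced size and the bytes actually received come from two separate requests, so they can disagree if the second one is cut short.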

We could re-run this and hope the problem has fixed itself, but this sounds like it could happen again given the large size of the content.

Fixing this would be difficult, though: the failure happens on a different, libzim-handled thread, long after we've added the item, so we can't catch the (libzim8+ only) exception and retry.
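
Because the exception cannot be caught where it is raised, one option is to validate the content before handing it to the creator. A minimal sketch under that assumption follows. `fetch_verified` and the `fetch` callable are hypothetical, and a real scraper would add back-off and logging.

```python
def fetch_verified(fetch, expected_size, attempts=3):
    """Call fetch() until it returns exactly expected_size bytes.

    fetch is any zero-argument callable returning bytes (hypothetical: in a
    scraper it would wrap the HTTP download). Raises IOError when every
    attempt returns the wrong number of bytes.
    """
    last_size = None
    for _ in range(attempts):
        content = fetch()
        if len(content) == expected_size:
            return content
        last_size = len(content)
    raise IOError(
        f"got {last_size} bytes instead of {expected_size} "
        f"after {attempts} attempts"
    )
```

The trade-off is that the whole file must be held in memory or on disk before it is added, which gives up the benefit of streaming it straight into the ZIM.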

@rgaudin rgaudin added the bug Something isn't working label Oct 18, 2022
rgaudin added a commit that referenced this issue Feb 9, 2023
- Investigating #40, using a copy of scraperlib's URLItem with verbose details
to identify which URL causes the issue

- not crashing on resource duplicates (duplicate content in different node IDs)
- fixed succeeded boolean that would cause a ZIM to be created even on exception

- [debug] raising first exception

- updated scraperlib to 2.0
rgaudin added a commit that referenced this issue Feb 10, 2023
- Duplicated and modified the URLProvider:
  - reading source until we reached specified size (/!\ risk of being stuck)
  - clearly returning an empty Blob at the end (might have been the reason)
- Added new feature to URLItem to not use URLProvider for content under 2MiB
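
The two provider changes listed above can be sketched as follows. This is not scraperlib's code: `SizedStreamProvider` and `drain` are hypothetical, the provider takes any file-like object, and `max_empty_reads` is an assumed guard against the "risk of being stuck" noted in the commit message.

```python
import io


class SizedStreamProvider:
    """Hypothetical provider: feeds a stream until `size` bytes were read."""

    def __init__(self, stream, size, chunk_size=1024, max_empty_reads=3):
        self.stream = stream
        self.size = size
        self.chunk_size = chunk_size
        self.max_empty_reads = max_empty_reads
        self.read_so_far = 0

    def get_size(self):
        return self.size

    def feed(self):
        # All declared bytes were delivered: explicitly signal the end.
        if self.read_so_far >= self.size:
            return b""
        remaining = self.size - self.read_so_far
        for _ in range(self.max_empty_reads):
            chunk = self.stream.read(min(self.chunk_size, remaining))
            if chunk:
                self.read_so_far += len(chunk)
                return chunk
        # The source dried up early: fail loudly instead of looping forever.
        raise IOError(
            f"stream ended after {self.read_so_far} of {self.size} bytes"
        )


def drain(provider):
    """Collect blobs until the provider returns an empty one."""
    blobs = []
    while True:
        blob = provider.feed()
        if not blob:
            return b"".join(blobs)
        blobs.append(blob)


data = drain(SizedStreamProvider(io.BytesIO(b"abcdef"), size=6, chunk_size=4))
```

In this sketch an empty blob is only ever returned once the declared size has been reached, so a short read surfaces as an error in `feed()` rather than as a size mismatch later on.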
@rgaudin rgaudin self-assigned this Feb 13, 2023
@rgaudin
Member Author

rgaudin commented Feb 13, 2023

Fixed in openzim/python-scraperlib@2d0cd09. As with other scrapers, we won't use URLItem/URLProvider directly, as we need to integrate a retry mechanism.

@rgaudin rgaudin closed this as completed Feb 13, 2023
@benoit74 benoit74 added this to the v1.1.0 milestone Jul 11, 2023