Issue parsing ePub files #14

Closed
tjenniges opened this issue Jul 19, 2018 · 6 comments

Comments

@tjenniges

I have attached 3 ePub files that fail to be parsed by ePubReader.

I found these files in the wild by searching Google by file type, in order to build an ePub dataset to test ePubReader against.

I have other files that fail too, but for the same reasons as the ones attached (TOC errors, etc.).

Good job so far.

Thanks.
childrens-literature.zip
GhV-oeb-page.zip

CF General.zip

@vers-one
Owner

childrens-literature has some incorrect formatting in its table of contents that does not conform to the EPUB navigation schema. More specifically, there are several pageTarget elements with the type attribute set to body, while only front, normal, and special are allowed. This is exactly the same issue as the one raised in the IDPF repository. I will add a workaround to the schema parser to ignore this kind of error.
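To illustrate the kind of workaround described above, here is a minimal sketch in Python using only the standard library. It is not the library's actual code: the function name `read_page_targets`, the `strict` flag, and the sample NCX fragment are all hypothetical. It assumes the standard NCX namespace and shows a reader that accepts a pageTarget with an unrecognized type instead of rejecting the whole file.

```python
import xml.etree.ElementTree as ET

NCX_NS = "{http://www.daisy.org/z3986/2005/ncx/}"
ALLOWED_PAGE_TARGET_TYPES = {"front", "normal", "special"}


def read_page_targets(ncx_xml, strict=False):
    """Return (type, label, src) tuples for each pageTarget in an NCX document.

    Hypothetical helper: in strict mode an unknown type raises ValueError;
    otherwise the type is reported as None and parsing continues.
    """
    root = ET.fromstring(ncx_xml)
    targets = []
    for target in root.iter(NCX_NS + "pageTarget"):
        page_type = target.get("type")
        if page_type not in ALLOWED_PAGE_TARGET_TYPES:
            if strict:
                raise ValueError(f"invalid pageTarget type: {page_type!r}")
            page_type = None  # tolerate the schema violation and keep going
        label = target.findtext(f"{NCX_NS}navLabel/{NCX_NS}text", default="")
        content = target.find(NCX_NS + "content")
        src = content.get("src") if content is not None else None
        targets.append((page_type, label.strip(), src))
    return targets


# Made-up sample: the second pageTarget uses the disallowed "body" type.
SAMPLE_NCX = """<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <pageList>
    <pageTarget id="p1" type="normal" value="1">
      <navLabel><text>1</text></navLabel>
      <content src="ch1.xhtml#p1"/>
    </pageTarget>
    <pageTarget id="p2" type="body" value="2">
      <navLabel><text>2</text></navLabel>
      <content src="ch1.xhtml#p2"/>
    </pageTarget>
  </pageList>
</ncx>"""

if __name__ == "__main__":
    print(read_page_targets(SAMPLE_NCX))
    # [('normal', '1', 'ch1.xhtml#p1'), (None, '2', 'ch1.xhtml#p2')]
```

Reporting the bad type as `None` rather than silently mapping it to `normal` keeps the schema violation visible to the caller while still letting the rest of the book load.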

GhV-oeb-page and CF General, on the other hand, are a different story. These are EPUB 3 files, and they use a totally different TOC format, which is essentially just a plain HTML5 file (with a few restrictions). Back in 2015, when this project was first published, there were not many EPUB 3 files available, so parsing the TOC out of HTML5 content did not seem worth it. On top of that, all the EPUB 3 files I was able to find for testing were backwards compatible with the EPUB 2 format. However, this is not the case with these files.
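For readers unfamiliar with the EPUB 3 format: the TOC lives in an XHTML navigation document, inside a nav element marked with epub:type="toc" that wraps nested ordered lists of links. A rough Python sketch of reading one follows. It is only an illustration, not how this library does it; the names `parse_nav_toc` and `_read_list` and the sample document are hypothetical, and it assumes the file is well-formed XML (a real-world file that uses HTML-only entities such as `&nbsp;` would need a more forgiving parser).

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"
EPUB_TYPE = "{http://www.idpf.org/2007/ops}type"


def parse_nav_toc(nav_xhtml):
    """Return the TOC of an EPUB 3 navigation document as nested dicts."""
    root = ET.fromstring(nav_xhtml)
    for nav in root.iter(XHTML + "nav"):
        # epub:type holds a space-separated list of values.
        if "toc" in (nav.get(EPUB_TYPE) or "").split():
            ol = nav.find(XHTML + "ol")
            return _read_list(ol) if ol is not None else []
    return []


def _read_list(ol):
    """Convert one <ol> into a list of {title, href, children} entries."""
    items = []
    for li in ol.findall(XHTML + "li"):
        # An entry is labelled by a link (a) or by a plain heading (span).
        label = li.find(XHTML + "a")
        if label is None:
            label = li.find(XHTML + "span")
        title = "".join(label.itertext()).strip() if label is not None else ""
        href = label.get("href") if label is not None else None
        child = li.find(XHTML + "ol")
        items.append({
            "title": title,
            "href": href,
            "children": _read_list(child) if child is not None else [],
        })
    return items


# Made-up navigation document with one nested section.
SAMPLE_NAV = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:epub="http://www.idpf.org/2007/ops">
  <body>
    <nav epub:type="toc">
      <h1>Contents</h1>
      <ol>
        <li><a href="ch1.xhtml">Chapter 1</a>
          <ol>
            <li><a href="ch1.xhtml#s1">Section 1.1</a></li>
          </ol>
        </li>
        <li><a href="ch2.xhtml">Chapter 2</a></li>
      </ol>
    </nav>
  </body>
</html>"""

if __name__ == "__main__":
    for entry in parse_nav_toc(SAMPLE_NAV):
        print(entry["title"], entry["href"], len(entry["children"]))
```

An EPUB 3 file that is not backwards compatible ships only this navigation document and no NCX file, which is why a reader that looks only for the NCX finds no TOC at all.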

I will add TOC parsing support for EPUB 3 as well, but it might take some time.

@tjenniges
Author

Thanks a lot. If I find more EPUB 3 files with issues, I will be sure to send them to you for your testing dataset.

@vers-one
Owner

Sorry for the delay. I am working on this.

@tjenniges
Author

OK. Let me know when you are finished, and I will run tests against an EPUB dataset I have.

@vers-one
Owner

I've made a preliminary version with improved support for EPUB 3 files. It can be found in the epub3 branch of the repository. I've also added an option to the NetCoreDemo application to test the library by opening all EPUB files in a directory (see the "3. Test the library by reading all EPUB files from a directory" option there). You can run it on your EPUB dataset.
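For anyone who wants to run a similar batch check outside the demo application, the general idea can be sketched in Python as below. This is not the NetCoreDemo code: `check_epub` and `check_directory` are hypothetical names, and the check itself is only a stand-in (it confirms the file is a ZIP archive containing META-INF/container.xml) that you would swap for a call into whatever reader you are testing.

```python
import zipfile
from pathlib import Path


def check_epub(path):
    """Stand-in check: the file is a ZIP archive with META-INF/container.xml."""
    with zipfile.ZipFile(path) as archive:
        archive.read("META-INF/container.xml")


def check_directory(directory, check=check_epub):
    """Run check on every .epub under directory.

    Returns a dict mapping each failing file path to its error message.
    """
    failures = {}
    for path in sorted(Path(directory).rglob("*.epub")):
        try:
            check(path)
        except Exception as error:
            # Keep going so that one bad file does not stop the whole run.
            failures[str(path)] = f"{type(error).__name__}: {error}"
    return failures


if __name__ == "__main__":
    import sys

    results = check_directory(sys.argv[1] if len(sys.argv) > 1 else ".")
    for name, message in results.items():
        print(f"FAILED {name}: {message}")
    print(f"{len(results)} file(s) failed")
```

Collecting the failures instead of stopping at the first one makes it easier to group files by error type, as with the TOC errors reported in this issue.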

This is a preliminary version and it is available as the source code only. There are still some things I plan to finish before making a new release.

@vers-one
Owner

vers-one commented Apr 8, 2019

I've released version 3.0.0 of the library with better EPUB 3 support. Please reopen this issue if you find any other problems with EPUB 3 files.
