Valid XML and Recursive analysis #22

adamretter · 2012-09-21T16:21:45Z

I have corrected jpylyzer so that it will always create a valid XML document even when scanning more than one file.

I have fixed the bug with '*', it is now correctly expanded by the shell.

I have added support for:

Analysing all files in a folder
Recursively analysing all files in a folder

…expansion rather than python. Calling 'jpylyzer /some/folder/*' now works

…alid XML document still. Root element is still 'jpylyzer' but each file analysed is now in an ''analysis' element.

…files in a folder and sub-folders

bitsgalore · 2012-09-26T10:12:32Z

Hi Adam,

While doing some tests with your modified code I ran into a number of problems. I used Python 2.7.1 under Win XP.

Behaviour without command-line arguments

Executing jpylyzer with no command-line arguments: instead of getting the usage message I now get this:

User warning: no images to check!

(Perhaps this only happens under Windows?)

Unicode encoding errors for some corrupted images

Some images that work fine with jpylyzer 1.6.3 now result in a variety of Unicode errors (I can send you a link to those images by e-mail, can't publicly share them because I'm unsure of their copyright status).

For instance, one image that has an illegal character in a codestream comment gives me this error (while working fine with 1.6.3):

jpylyzer.py test_latin_corrupt.jp2

Result:

UnicodeEncodeError: 'ascii' codec can't encode character u'\x95' in position 4768: ordinal not in range(128)

For another file I ended up with this:

UnicodeEncodeError: 'charmap' codec can't encode character u'\x9a' in position 160328: character maps to <undefined>

(I got this while scanning a whole directory, and had to hack into the code to find out which file was producing the error because no output is written until all images are analysed!).

For the very first versions of jpylyzer I originally used UTF-8 as the output encoding as well, but I quickly gave up on it because of similar encoding errors. This was also the reason why I switched to ASCII in the end, because it makes it very easy to filter out any characters that can't be processed, and I couldn't find a reliable way of making this work for UTF-8 under all possible circumstances. But if you have any suggestions please let me know, because you're right that ASCII isn't strictly allowed as an encoding in XML. A (quick and dirty) workaround would be to keep the encoding settings as ASCII, and then do a search & replace on the encoding declaration in the generated output string. ASCII is a subset of UTF-8 so in principle this should result in valid XML, although I'm not entirely sure how that might influence things like the rendering of entity references.
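To make the ASCII idea above concrete, here is a minimal Python sketch. It is not what jpylyzer does: `to_ascii_xml_text` is a hypothetical helper, and it relies on Python's standard `xmlcharrefreplace` error handler, which replaces each non-ASCII character with a numeric character reference instead of dropping it. Because the result contains only ASCII bytes, it reads back identically as UTF-8. Escaping of markup characters such as `&` and `<` is a separate step that this sketch does not cover.

```python
def to_ascii_xml_text(text):
    """Replace every non-ASCII character by a numeric character reference.

    Hypothetical helper, for illustration only.
    """
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


sample = u"comment with bullet \u2022 and e-acute \u00e9"
ascii_only = to_ascii_xml_text(sample)

# Pure ASCII bytes decode to the same text when read as UTF-8.
assert ascii_only.encode("ascii").decode("utf-8") == ascii_only
print(ascii_only)
```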

Fix of * file path expansion

I have fixed the bug with '*', it is now correctly expanded by the shell.

If I now try this:

jpylyzer.py *.jp2

I end up with:

E:\testjp2hecker>c:\python27\python C:\Temp\digital-preservation-jpylyzer-60406cb\jpylyzer.py *.jp2
Traceback (most recent call last):
  File "C:\Temp\digital-preservation-jpylyzer-60406cb\jpylyzer.py", line 358, in <module>
    main()
  File "C:\Temp\digital-preservation-jpylyzer-60406cb\jpylyzer.py", line 351, in main
    checkFiles(args.inputRecursiveFlag, root, jp2In)
  File "C:\Temp\digital-preservation-jpylyzer-60406cb\jpylyzer.py", line 311, in checkFiles
    for file in os.listdir(path):
WindowsError: [Error 123] The filename, directory name, or volume label syntax is incorrect: '*.jp2/*.*'

Looking into the code it seems that the issue with the '*' isn't actually fixed at all, it's simply removed altogether and replaced with the option of scanning folders. I'm not overly happy with this, also because of the implicit assumption that only files with a .jp2 extension are of interest. People may want to analyse files with non-standard extensions or JPX images as well.

Valid XML in case of multiple files

I definitely see how this could be useful, but there are some drawbacks as well.

The main one is that no output is written at all until the analysis of all files is completed. That means that if anything goes wrong in between (hardware problems, network interruptions, jpylyzer crashes) you simply won't get any output. For jpylyzer crashes we can largely avoid this using proper exception handlers, but then you also need some mechanism that tells you that an image wasn't processed because some exception occurred, i.e. put this somewhere in the output file.

See also my UnicodeEncodeError description above, where I couldn't even establish which file was causing the crash without digging into the source code.

Even then, imagine you're processing a directory tree with 100,000 images and then the network goes down while analysing the 99,999th image! Besides that I'm slightly worried about possible memory issues: all intermediate output is buffered in memory while jpylyzer is running, and I can imagine this causing problems for very large numbers of files. The main problem here is that unlike plain text you just can't write XML in append mode. In fact I've been deliberately avoiding writing 'proper' XML for multiple files for all the above reasons!
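One possible way around the "no append mode" problem, sketched here under assumptions rather than taken from jpylyzer: write the opening tag of the root element first, flush one element per file as soon as it is ready, and write the closing tag last. The root element `jpylyzer` and the per-file `analysis` element follow the names mentioned in the pull request description; `analyse` and `fake_analyse` are hypothetical stand-ins, and the code assumes Python 3. If the run is interrupted, the closing tag is missing, so the file is not well-formed, but every completed `analysis` element is already on disk and nothing accumulates in memory.

```python
import io
import xml.etree.ElementTree as ET


def write_results_incrementally(file_names, analyse, out):
    """Write one 'analysis' element per file as soon as it is available.

    `analyse` is a hypothetical callable that returns an Element for a file
    name; `out` is any text stream.
    """
    out.write("<jpylyzer>\n")
    for name in file_names:
        element = analyse(name)
        out.write(ET.tostring(element, encoding="unicode") + "\n")
        out.flush()  # results written so far survive a later crash
    out.write("</jpylyzer>\n")


def fake_analyse(name):
    # Stand-in for the real per-file analysis.
    element = ET.Element("analysis")
    ET.SubElement(element, "fileName").text = name
    return element


buffer = io.StringIO()
write_results_incrementally(["a.jp2", "b.jp2"], fake_analyse, buffer)
print(buffer.getvalue())
```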

Other implications of changing XML output format

The proposed changes include a modification of jpylyzer's output format. I see this as an interface change (i.e. it changes the interface to any applications that use jpylyzer's output), meaning that this would be a new major version (i.e. 2.0, not 1.7!). That's fine as such, but I just don't think this is the right moment for it: it would break compatibility with existing jpylyzer workflows as well as making training materials and schemas obsolete. Eventually that will probably happen anyway, but for now I would really like to focus the development efforts in jpylyzer on fixing existing bugs as well as improving overall stability and robustness.

Conclusion

Based on the above I'm not in favor of accepting the proposed changes at this stage. However, some of these ideas should be considered for jpylyzer 2 (whenever that might happen):

Modification of XML output format to allow for multiple files, regardless of whether multi-file support is actually implemented at all.
Improve exception handling in such a way that errors during the processing of a single file never lead to a full crash (add exception handler to CheckOneFile)
Include information in output file on whether a file was successfully validated

These would be the minimum conditions for implementing multi-file XML support and recursive scanning. I would then still be slightly worried about the impact of memory issues and hardware failure, but I'm aware I may be slightly paranoid on this. Please let me know if this makes any sense!
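The second and third points could look roughly like the following sketch. It is an assumption-laden illustration, not jpylyzer code: `check_one_file` stands in for the real per-file check, and the `status` and `error` element names are invented for this example. The wrapper catches any exception from a single file, so one bad image cannot abort the whole run, and it records the outcome in the element it returns.

```python
import xml.etree.ElementTree as ET


def check_one_file_safely(name, check_one_file):
    """Run `check_one_file` and always return an element describing `name`."""
    result = ET.Element("analysis")
    ET.SubElement(result, "fileName").text = name
    status = ET.SubElement(result, "status")
    try:
        result.append(check_one_file(name))
        status.text = "ok"
    except Exception as error:  # deliberately broad: no file may stop the run
        status.text = "failed"
        message = "%s: %s" % (type(error).__name__, error)
        ET.SubElement(result, "error").text = message
    return result


def working_check(name):
    # Stand-in for a check that succeeds.
    return ET.Element("properties")


def broken_check(name):
    # Stand-in for a check that crashes on a corrupted image.
    raise ValueError("boom")


good = check_one_file_safely("good.jp2", working_check)
bad = check_one_file_safely("bad.jp2", broken_check)
```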

adamretter · 2012-09-26T11:06:39Z

Hi Johan,

I only had a Mac available to test on at the time. I will test again on a
Windows PC, and I will also resolve each of the issues that you describe.

I will come back to you with details of the fixes inline. In the meantime,
can you please send me any images that I need to test with? My email
is [email protected]

Thanks Adam.


Adam Retter

skype: adam.retter
tweet: adamretter
http://www.adamretter.org.uk

…allow all file type(s) and improvised the recursive search to work with wildcard *

…wrapper (-w) commandline argument

…wn box'

adamretter · 2012-10-22T09:55:44Z

We added another member to our team here, and I was lucky enough to be able
to have her address the shortcomings in my jpylyzer changes that you
highlighted. Her name is Jaishree Davey and I have asked her to respond to
each of your concerns below. I include her responses below (annotated with
'JD:') along with some comments of my own.

We have spent a couple weeks on this, and if you are happy with the work we
have undertaken then I will have her send you a second pull request, that
we believe this time is well tested.

After all that, there is another change we would like to make to jpylyzer
whilst we have the resources available to do so. We would like to change
jpylyzer to stream its XML output as opposed to building up a DOM each time
it analyzes a file. The reason for this is that the DOM for any individual
file could become very large, especially if it was a jp2 file which
happened to have a huge XML box. By switching to a SAX like approach
instead (possibly using XMLGenerator) we could make jpylyzer much more
efficient and scalable. We have already invested almost a week into some
research on this and a first attempt. However, our conclusion was that we
would have to significantly refactor boxvalidator.py as the style of
recursion whilst building different parts of the globally defined DOM
cannot easily be adapted to a SAX approach.
In theory we are happy to undertake this refactoring, however I don't want
to do this work if it is unacceptable to you. My policy for Open Source
development here is to avoid maintaining forks of others' projects, so I want
to create something that is useful for the community and that you are happy
to merge back into your project.
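As a rough idea of what the streaming approach could look like, here is a small Python 3 sketch using `XMLGenerator` from the standard library. It is not the refactoring described above: `stream_report` is a hypothetical function, and the `width` and `height` element names are placeholders. Elements are written as soon as they are known, so no tree for the whole file is held in memory.

```python
import io
from xml.sax.saxutils import XMLGenerator


def stream_report(out, file_name, properties):
    """Emit one report as a stream of SAX events instead of building a DOM.

    `properties` is any iterable of (element_name, text) pairs. It may be a
    generator, so the values need not all exist in memory at the same time.
    """
    writer = XMLGenerator(out, encoding="utf-8")
    writer.startDocument()
    writer.startElement("jpylyzer", {})
    writer.startElement("fileName", {})
    writer.characters(file_name)  # markup characters are escaped here
    writer.endElement("fileName")
    for name, text in properties:
        writer.startElement(name, {})
        writer.characters(text)
        writer.endElement(name)
    writer.endElement("jpylyzer")
    writer.endDocument()


report = io.StringIO()
stream_report(report, "a&b.jp2", [("width", "512"), ("height", "256")])
print(report.getvalue())
```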

Comments inline below -

On 26 September 2012 11:12, Johan van der Knijff
[email protected]:

Hi Adam,

While doing some tests with your modified code I ran into a number of
problems. I used Python 2.7.1 under Win XP.
Behaviour without command-line arguments

Executing jpylyzer with no command-line arguments: instead of getting the
usage message I now get this:

User warning: no images to check!

(Perhaps this only happens under Windows?)

JD: Fixed Jpylyzer to print the usage message when it is run without
command-line arguments.

Unicode encoding errors for some corrupted images

Some images that work fine with jpylyzer 1.6.3 now result in a variety
of Unicode errors (I can send you a link to those images by e-mail,
can't publicly share them because I'm unsure of their copyright status).

For instance, one image that has an illegal character in a codestream
comment gives me this error (while working fine with 1.6.3):

jpylyzer.py test_latin_corrupt.jp2

Result:

UnicodeEncodeError: 'ascii' codec can't encode character u'\x95' in position 4768: ordinal not in range(128)

For another file I ended up with this:

UnicodeEncodeError: 'charmap' codec can't encode character u'\x9a' in position 160328: character maps to <undefined>

(I got this while scanning a whole directory, and had to hack into the
code to find out which file was producing the error because no output
is written until all images are analysed!).

For the very first versions of jpylyzer I originally used UTF-8 as the
output encoding as well, but I quickly gave up on it because of similar
encoding errors. This was also the reason why I switched to ASCII in
the end, because it makes it very easy to filter out any characters that
can't be processed, and I couldn't find a reliable way of making this work
for UTF-8 under all possible circumstances. But if you have any suggestions
please let me know, because you're right that ASCII isn't strictly
allowed as an encoding in XML. A (quick and dirty) workaround would be to
keep the encodings settings as ASCII, and then do a search & replace on
the encoding declaration in the generated output string. ASCII is a
subset of UTF-8 so in principle this should result in valid XML,
although I'm not entirely sure how that might influence things like the
rendering of entity references.

We think we have fixed this so that it should always work now with UTF-8.
It is certainly passing for the images that you sent us. A word of caution
about your choice of ASCII is that ASCII Extended does not map into UTF-8,
only the ASCII Base charset does. We did see problems with this and
jpylyzer when using accented characters etc.

JD: The images sent to us (Image1: test_latin_corrupt.jp2,
Image2: UTT_AdobeRGB_luratech.jp2) were analysed by Jpylyzer on Linux and
Windows 7. Both images worked on Linux, but Image2 (UTT_AdobeRGB_luratech.jp2)
wasn't processed on Windows 7, resulting in a UnicodeEncodeError.
Fixed the problem by checking the encoding of the output terminal and, if it
differs from UTF-8, setting it to 'UTF-8'. In this case, the default encoding
on Windows 7 was 'mbcs' and on Linux it was 'UTF-8'.
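A related technique, shown here only as a sketch and not as the fix described above, is to sidestep the terminal encoding entirely by encoding the XML to UTF-8 yourself and writing the bytes. `write_utf8` is a hypothetical helper and assumes Python 3, where text streams such as `sys.stdout` expose their underlying byte stream as `.buffer`.

```python
import io
import sys


def write_utf8(text, stream=None):
    """Write `text` to `stream` as UTF-8 bytes, ignoring the stream's own encoding."""
    if stream is None:
        stream = sys.stdout
    # Text streams expose the underlying byte stream as `.buffer`;
    # byte streams (files opened in 'wb' mode, BytesIO) are used directly.
    binary = getattr(stream, "buffer", stream)
    binary.write(text.encode("utf-8"))
    binary.flush()


# The character from the 'charmap' error above, written to an in-memory stream.
captured = io.BytesIO()
write_utf8(u"\x9a", captured)
```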

Fix of * file path expansion

I have fixed the bug with '*', it is now correctly expanded by the shell.

If I now try this:

jpylyzer.py *.jp2

I end up with:

E:\testjp2hecker>c:\python27\python C:\Temp\digital-preservation-jpylyzer-60406cb\jpylyzer.py *.jp2
Traceback (most recent call last):
  File "C:\Temp\digital-preservation-jpylyzer-60406cb\jpylyzer.py", line 358, in <module>
    main()
  File "C:\Temp\digital-preservation-jpylyzer-60406cb\jpylyzer.py", line 351, in main
    checkFiles(args.inputRecursiveFlag, root, jp2In)
  File "C:\Temp\digital-preservation-jpylyzer-60406cb\jpylyzer.py", line 311, in checkFiles
    for file in os.listdir(path):
WindowsError: [Error 123] The filename, directory name, or volume label syntax is incorrect: '*.jp2/*.*'

Looking into the code it seems that the issue with the '*' isn't actually
fixed at all, it's simply removed altogether and replaced with the option
of scanning folders. I'm not overly happy with this, also because of the
implicit assumption that only files with a *.jp2 extension are of
interest. People may want to analyse files with non-standard extensions or
JPX images as well.

You're right of course, my assumptions were far too many and my use cases too
narrow. My approach to using * worked differently on Windows and Unix-like
platforms, and has the problem that Python never knows what the original
filenames were on Unix, as they have been substituted by the shell doing the
expansion.
I think we have now fixed this * stuff properly. As you yourself pointed
out, you have to escape the * on Unix-like platforms (you don't need to do
this on Windows as it doesn't expand it). This can be done with either
single quotes, e.g. '*', or you can use the more common approach of
prefixing it with a backslash to escape it, e.g. \*
We have also addressed a couple of bugs in the recursion stuff, and we have
fixed it so that it does NOT just look for JPEG 2000 files ('*.jp2').

JD: Fixed wildcard (*) and recursive (-r) functionality to use Python's
glob.glob(), and made further changes to handle the recursive search in
subdirectories with or without the wildcard search. On Linux, the wildcard (*)
needs to be prefixed with a backslash ('\') in order to bypass the wildcard
expansion done by bash and pass control to Python. If the backslash is
omitted, Jpylyzer would still work, but the number of files analysed
may be slightly different depending on the search criteria if it's
recursive.
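A minimal sketch of pattern expansion with `glob.glob()`, assuming nothing about the actual jpylyzer code: `expand_inputs` is a hypothetical helper. It gives the same result whether the shell has already expanded the wildcard (each argument is then a literal file name) or has passed the pattern through unchanged, and it makes no assumption about file extensions.

```python
import glob
import os
import tempfile


def expand_inputs(patterns):
    """Expand each pattern with glob and return matching regular files.

    Directories are skipped and duplicates are removed; order is preserved.
    """
    found = []
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            if os.path.isfile(path) and path not in found:
                found.append(path)
    return found


# Small demonstration in a throwaway directory.
demo_dir = tempfile.mkdtemp()
for name in ("a.jp2", "b.jpx"):
    open(os.path.join(demo_dir, name), "w").close()
os.mkdir(os.path.join(demo_dir, "sub"))

# Pattern passed through unexpanded (Windows, or escaped on Unix).
from_pattern = expand_inputs([os.path.join(demo_dir, "*")])
# Pattern already expanded by the shell into literal file names.
from_names = expand_inputs([os.path.join(demo_dir, "a.jp2"),
                            os.path.join(demo_dir, "b.jpx")])
```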

Valid XML in case of multiple files

I definitely see how this could be useful, but there are some drawbacks as
well.

The main one is that no output is written at all until the analysis of
all files is completed. That means that if anything goes wrong in
between (hardware problems, network interruptions, jpylyzer crashes)
you simply won't get any output. For jpylyzer crashes we can largely
avoid this using proper exception handlers, but then you also need some
mechanism that tells you that an image wasn't processed because some
exception occurred, i.e. put this somewhere in the output file.

We changed the XML output to be as you had it before, i.e. a DOM is built
and output for each file jpylyzer analyzes. We then added a command line
parameter to jpylyzer which allows you to wrap the results (it still
constructs multiple DOMs as you were doing before).
Regarding error handling, I think the approach must be cleaned up in jpylyzer
and become consistent. Ideally error messages must be written to stderr and
not stdout, so that when you are embedding jpylyzer in a Unix/Windows
processing pipeline you can separately capture the output and errors.

JD: The output is now printed after each file is analysed. However, there
is a new command-line option, --wrapper or -w, for wrapping the
entire XML output in a single enclosing element. By default the XML output
is not wrapped.

See also my UnicodeEncodeError description above, where I couldn't
even establish which file was causing the crash without digging into
the source code.

Even then, imagine you're processing a directory tree with 100,000 images
and then the network goes down while analysing the 99,999th image! Besides
that I'm slightly worried about possible memory issues: all intermediate
output is buffered in memory while jpylyzer is running, and I can
imagine this causing problems for very large numbers of files. The main
problem here is that unlike plain text you just can't write XML in append
mode. In fact I've been deliberately avoiding writing 'proper' XML for
multiple files for all the above reasons!

Regards the memory issues, you had them already by using DOM, see our
comments at the start about switching to SAX.

Other implications of changing XML output format

The proposed changes include a modification of jpylyzer's output
format. I see this as an interface change (i.e. it changes the interface to
any applications that use jpylyzer's output), meaning that this would
be a new major version (i.e. 2.0, not 1.7!). That's fine as such, but I
just don't think this is the right moment for it: it would break
compatibility with existing jpylyzer workflows as well as making
training materials and schemas obsolete. Eventually that will probably
happen anyway, but for now I would really like to focus the development
efforts in jpylyzer on fixing existing bugs as well as improving
overall stability and robustness.

Agreed. We changed the output format back to how you had it before. When
--wrapper is used (which is off by default) it all gets wrapped in a single
enclosing element, but as this is not the default behaviour we have not
broken the API now.

Conclusion

Based on the above I'm not in favor of accepting the proposed changes at
this stage. However, some of these ideas should be considered for
jpylyzer 2 (whenever that might happen):

Modification of XML output format to allow for multiple files,
regardless of whether multi-file support is actually implemented at all.

Improve exception handling in such a way that errors during the
processing of a single file never lead to a full crash (add
exception handler to CheckOneFile)

Include information in output file on whether a file was
successfully validated

These would be the minimum conditions for implementing multi-file XML support and recursive scanning. I would then still be slightly worried
about the impact of memory issues and hardware failure, but I'm aware I
may be slightly paranoid on this. Please let me know if this makes any
sense!

I think (1) is now done in a way that is backwards compatible.
(2) needs to be done anyway as the current approach could be improved, even
before our changes!
(3) I think that's really part of (2).
Regarding the memory, see my comments at the beginning regarding SAX.


Thanks Adam.

Adam Retter

skype: adam.retter
tweet: adamretter
http://www.adamretter.org.uk

bitsgalore · 2013-01-16T16:52:19Z

Hi Adam and Jaishree,

I finally managed to have a look at your modifications, and did some tests. Overall I'm really happy with the changes, I didn't get any decode errors, and I also like the XML changes (use UTF-8 as default, incremental output updates).

However the recursive scanning option did still show problems (these all occurred running Windows XP):

Recursive scanning errors

Using Python 2.7, running something like this:

jpylyzer.py -r E:\somedirectory\ > output.xml

Results in:

IOError: [Errno 2] No such file or directory: 'E:\\somedirectory\\'

Alternatively, omitting the trailing backslash:

jpylyzer.py -r E:\somedirectory > output.xml

Gives:

IOError: [Errno 13] Permission denied: 'E:\\somedirectory'

Sometimes I get the same error message for subdirectories of the parent directory.

Under Python 3.2, command lines like the above give me a TypeError instead, e.g.:

TypeError: cannot serialize '<jpylyzer' (type str)

If I combine the -r option with wildcards, I get some unexpected behaviour as well. For instance, using the following command line (where a subdirectory of somedirectory contains .jp2 files!):

jpylyzer.py -r E:\somedirectory\*.jp2 > output.xml

Ends with:

User warning: no images found (or supplied) to check!

Although I can imagine this last example isn't really supposed to work in the first place?

This all looks like file paths are not handled properly by the code (most likely the behaviour is platform dependent), and it's probably quite easy to solve as well. One thing here that did catch my eye was this line in addRecursiveFiles:

if path.startswith("./"):

Which won't work under Windows, perhaps unless some normalisation happens before the call (I haven't looked in sufficient detail). Although I don't think this explains (all of) the above issues.
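To illustrate the kind of normalisation meant here (a sketch only, not a patch for addRecursiveFiles): `normpath` removes a leading current-directory component and a trailing separator under both path conventions, so a string test such as `startswith("./")` becomes unnecessary. The example imports `ntpath` and `posixpath` directly so that both behaviours can be shown on any platform; real code would simply use `os.path`. `clean_root` is a hypothetical helper name.

```python
import ntpath      # Windows path rules, importable on any platform
import posixpath   # POSIX path rules, importable on any platform


def clean_root(path, pathmod):
    """Normalise `path` using the rules of `pathmod` (ntpath or posixpath)."""
    return pathmod.normpath(path)


# Leading "./" or ".\" and trailing separators disappear on both platforms.
windows_relative = clean_root(".\\mytestJP2s\\", ntpath)
windows_forward = clean_root("./mytestJP2s", ntpath)
posix_relative = clean_root("./mytestJP2s/", posixpath)
windows_absolute = clean_root("E:\\somedirectory\\", ntpath)
```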

Next steps

Depending on your preferences, I can either wait for a new commit from you with these issues fixed, or otherwise pull it to a separate branch that I've already set up right away and continue working from that. (The reason I've set this up is that with Github's decision to drop support for file downloads I want to move the Windows binaries and PDF doc inside the repo, and some links on the OPF website have to be adapted as well, so until that's all sorted I don't want to touch the current main branch!). The advantage of pulling it right away is that others (e.g. me) can also help solving these remaining issues.

On a side note I also saw that some of the formatting in the User Manual has gone a bit wrong because of the MS Word to OpenOffice conversion. That's no problem, and I can fix that by pasting your changes into my original Word doc and then re-generating the PDF. (Ideally it would be nice to have the doc in Markdown, but with all the tables that's just not very practical here).

Thanks again, and let me know how you want to proceed!

Cheers,

Johan

adamretter · 2013-01-16T17:10:02Z

Hi Johan, let's see if we can't fix these path handling issues ASAP before
you pull. Thanks for the review.

Jaishree can you look at fixing these issues asap please?

bitsgalore · 2013-01-16T17:12:44Z

Addition to above: after I wrote that I recalled that I once did something similar (i.e. recursively go through directory tree), and after some digging I found this bit of code:

import os

def getFilesFromTree(rootDir):
    # Recurse into directory tree and return list of all files
    # NOTE: directory names are not included in the result

    filesList = []
    # os.walk yields (directory, subdirectory names, file names) for each directory
    for dirname, dirnames, filenames in os.walk(rootDir):
        for filename in filenames:
            thisFile = os.path.join(dirname, filename)
            filesList.append(thisFile)
    return filesList

This uses os.walk, which is quite a bit simpler than the code in addRecursiveFiles. Perhaps this might be a useful substitute?
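
As a rough sketch of how an os.walk-based helper could cover both the plain folder case and the recursive case from this pull request, something like the following might work. The function name `find_files` and its `recursive` parameter are hypothetical and do not come from the jpylyzer code; the sketch only assumes the documented behaviour of `os.walk`, where clearing `dirnames` in place during a top-down walk prevents descent into sub-directories.

```python
import os


def find_files(root_dir, recursive=False):
    # Hypothetical sketch, not the code from the pull request.
    files = []
    for dirpath, dirnames, filenames in os.walk(root_dir):
        for filename in sorted(filenames):
            files.append(os.path.join(dirpath, filename))
        if not recursive:
            # Emptying dirnames in place stops os.walk from descending,
            # so only the files directly inside root_dir are returned.
            dirnames[:] = []
    return files
```

Because os.walk starts at `root_dir` and only moves downwards, a helper of this shape has no way to wander into parent directories.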

bitsgalore · 2013-01-16T17:30:12Z

Btw, could you select the 'test1.7' branch when making the pull request for the new commit (so not the master branch)? Thanks!

bitsgalore · 2013-01-31T14:42:12Z

Further update: the recursive option is still misbehaving as well! Under certain circumstances when you use -r with a path (e.g. .\mytestJP2s), instead of processing that directory it processes its root folder plus all its underlying directories. I can't pin down exactly what triggers this, but I got this behaviour when the dir is specified as a relative path, and omitting the trailing backslash ('\') seems to trigger it as well. (This happened under WinXP).

In one of my tests this resulted in jpylyzer trying to scan my entire hard disk!!
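
One way to make this kind of runaway scan detectable in a test is to assert that every path the recursive option returns lies beneath the directory that was requested. The sketch below is an assumption about how such a check could be written, not part of jpylyzer; the name `is_within` is hypothetical. It relies on `os.path.realpath` to resolve relative paths and trailing separators, and on `os.path.commonpath`, which is only available in Python 3.

```python
import os


def is_within(root_dir, candidate):
    # Hypothetical check: True if candidate is root_dir or lies below it.
    root = os.path.realpath(root_dir)
    target = os.path.realpath(candidate)
    try:
        return os.path.commonpath([root, target]) == root
    except ValueError:
        # Raised on Windows when the two paths are on different drives.
        return False
```

A test could then run the recursive scan on a small directory tree and fail if `is_within` is false for any returned file.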

adamretter added 7 commits September 21, 2012 15:05

[bugfix] fixed * expansion for file paths so that the shell does the expansion rather than python. Calling 'jpylyzer /some/folder/*' now works

058c02a

[feature] support for processing all jp2 files in a folder

03ffbfd

[bugfix] XML should really be rendered as UTF-8

5bcb31d

[bugfix] When analyzing more than one JP2 file, we have to create a valid XML document still. Root element is still 'jpylyzer' but each file analysed is now in an ''analysis' element.

c5c6d8d

[feature] Added --recursive option to enable jpylyzer to analyse all files in a folder and sub-folders

5feb2e4

[version] bumped the version number

8d79919

[feature] updated documentation for 1.7 release

60406cb

adamretter and others added 7 commits October 3, 2012 12:44

[bugfix] Added usage message when no images are provided

9aab6d4

[bugfix] Fix * for file expansion, remove the check for .jp2 type to allow all file type(s) and improvised the recursive search to work with wildcard *

15afe12

[bugfix] Fixed Unicode encoding errors on some corrupted images

a6c2b3d

[bugfix] Fixed a issue with the recursive search

27aa6b2

[ignore] testing commit and push

e64f78a

[bugfix] Fixed to output valid XML in case of multiple files, added -wrapper (-w) commandline argument

7b25b0e

[bugfix] Removed print warning message from the xml output for 'unknown box'

f89ddf1

userjd added 3 commits January 28, 2013 15:52

Fixed Recursive scanning errors and modified Wildcard handling with -r option.

4747a41

Revert "Fixed Recursive scanning errors and modified Wildcard handling with -r option." This reverts commit 4747a41.

341d08a

[bugfix] clean commit of: Fixed Recursive scanning errors and modified Wildcard handling with -r option.

6dc7241

userjd added 2 commits February 8, 2013 17:39

[feature] Added Python3.x Compatibility to Jpylyzer for processing XML output in UTF-8 format and [bugfix] Recursive option

a5db6a3

Revert print warning message for 'ignoring unknown box'

157a0ea

bitsgalore merged commit 157a0ea into openpreserve:master Apr 24, 2013

bitsgalore mentioned this pull request Jan 19, 2022

Refactor findFiles function #178

Open
