-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Valid XML and Recursive analysis #22
Conversation
…expansion rather than python. Calling 'jpylyzer /some/folder/*' now works
…alid XML document still. Root element is still 'jpylyzer' but each file analysed is now in an ''analysis' element.
…files in a folder and sub-folders
Hi Adam, While doing some tests with your modified code I ran into a number of problems. I used Python 2.7.1 under Win XP. Behaviour without command-line argumentsExecuting jpylyzer with no command-line arguments: instead of getting the usage message I now get this: ` (Perhaps this only happens under Windows?) Unicode encoding errors for some corrupted imagesSome images that work fine with jpylyzer 1.6.3 now result in a variety of Unicode errors (I can send you a link to those images by e-mail, can't publicly share them because I'm unsure of their copyright status). For instance, one image that has an illegal character in a codestream comment gives me this error (while working fine with 1.6.3):
Result: UnicodeEncodeError: 'ascii' codec can't encode character u'\x95' in position 476 8: ordinal not in range(128)` For another file I ended up with this: UnicodeEncodeError: 'charmap' codec can't encode character u'\x9a' in position 1 60328: character maps to <undefined> (I got this while scanning a whole directory, and had to hack into the code to find out which file was producing the error because no output is written until all images are analysed!). For the very first versions of jpylyzer I originally used UTF-8 as the output encoding as well, but I quickly gave up on it because of similar encoding errors. This was also the reason why I switched to ASCII in the end, because it makes it very easy to filter out any characters that can't be processed, and I couldn't find a reliable way of making this work for UTF-8 under all possible circumstances. But if you have any suggestions please let me know, because you're right that ASCII isn't strictly allowed as an encoding in XML. A (quick and dirty) workaround would be to keep the encodings settings as ASCII, and then do a search & replace on the encoding declaration in the generated output string. ASCII is a subset of UTF-8 so in principle this should result in valid XML, although I'm not entirely sure how that might influence things like the rendering of entity references. Fix of * file path expansion
If I now try this: jpylyzer.py *.jp2` I end up with: E:\testjp2hecker>c:\python27\python C:\Temp\digital-preservation-jpylyzer-60406c b\jpylyzer.py *.jp2 Traceback (most recent call last): File "C:\Temp\digital-preservation-jpylyzer-60406cb\jpylyzer.py", line 358, in main() File "C:\Temp\digital-preservation-jpylyzer-60406cb\jpylyzer.py", line 351, in main checkFiles(args.inputRecursiveFlag, root, jp2In) File "C:\Temp\digital-preservation-jpylyzer-60406cb\jpylyzer.py", line 311, in checkFiles for file in os.listdir(path): WindowsError: [Error 123] The filename, directory name, or volume label syntax i s incorrect: '*.jp2/*.*' Looking into the code it seems that the issue with the '*' isn't actually fixed at all, it's simply removed altogether and replaced with the option of scanning folders. I'm not overly happy with this, also because of the implicit assumption that only files with a .jp2 extension are of interest. People may want to analyse files with non-standard extensions or JPX images as well. Valid XML in case of multiple filesI definitely see how this could be useful, but there are some drawbacks as well. The main one is that no output is written at all until the analysis of all files is completed. That means that if anything goes wrong in between (hardware problems, network interruptions, jpylyzer crashes) you simply won't get any output. For jpylyzer crashes we can largely avoid this using proper exception handlers, but then you also need some mechanism that tells you that an image wasn't processed because some exception occurred, i.e. put this somewhere in the output file. See also my UnicodeEncodeError description above, where I couldn't even establish which file was causing the crash without digging into the source code. Even then, imagine you're processing a directory tree with 100,000 images and then the network goes down while analysing the 99,999th image! Besides that I'm slightly worried about possible memory issues: all intermediate output is buffered in memory while jpylyzer is running, and I can imagine this causing problems for very large numbers of files. The main problem here is that unlike plain text you just can't write XML in append mode. In fact I've been deliberately avoiding writing 'proper' XML for multiple files for all the above reasons! Other implications of changing XML output formatThe proposed changes include a modification of jpylyzer's output format. I see this as an interface change (i.e. it changes the interface to any applications that use jpylyzer's output), meaning that this would be a new major version (i.e. 2.0, not 1.7!). That's fine as such, but I just don't think this is the right moment for it: it would break compatibility with existing jpylyzer workflows as well as making training materials and schemas obsolete. Eventually that will probably happen anyway, but for now I would really like to focus the development efforts in jpylyzer on fixing existing bugs as well as improving overall stability and robustness. ConclusionBased on the above I'm not in favor of accepting the proposed changes at this stage. However, some of these ideas should be considered for jpylyzer 2 (whenever that might happen):
These would be the minimum conditions for implementing multi-file XML support and recursive scanning. I would then still be slightly worried about the impact of memory issues and hardware failure, but I''m aware I may be slightly paranoid on this. Please let me know if this makes any sense! |
Hi Johan, I only had a Mac to test on available at the time, I will test again on a I will come back to you with details of the fixes inline below. In the mean Thanks Adam. On 26 September 2012 11:12, Johan van der Knijff
Adam Retter skype: adam.retter |
…allow all file type(s) and improvised the recursive search to work with wildcard *
…wrapper (-w) commandline argument
We added another member to our team here, and I was lucky enough to be able We have spent a couple weeks on this, and if you are happy with the work we After all that, there is another change we would like to make to jpylyzer Comments inline below - On 26 September 2012 11:12, Johan van der Knijff
JD: Fixed Jpylyzer to print the usage message when it is run without
We think we have fixed this so that it should always work now with UTF-8. JD: The images sent to us (Image1->test_latin_corrupt.jp2 ,
Your right of course, my assumptions were far too many and my use cases too JD: Fixed wildcard ()and recursive (-r) functionality to use the python
We changed the XML output to be as you had it before, i.e. a DOM is built JD: The output is now printed after each file is analysed. However, there
Regards the memory issues, you had them already by using DOM, see our
Agreed. We changed the output format back to how you had it before. When
I think (1) is now done in a way that is backwards compatible.
Thanks Adam. Adam Retter skype: adam.retter |
Hi Adam and Jaishree, I finaly managed to have a look at your modifications, and did some tests. Overall I'm realy happy with the changes, I didn't get any decode errors, and I also like the XML changes (use UTF-8 as default, incremental output updates). However the recursive scanning option did still show problems (these all occurred running Windows XP): Recursive scanning errorsUsing Python 2.7, running something like this:
Results in:
Alternatively, omitting the trailing backslash:
Gives:
Sometimes I get the same error message for subdirectories of the parent directory. Under Python 3.2, command lines like the above give me a TypeError instead, e.g.:
If I combine the -r options with wildcards, I get some unexpected behaviour as well. For instance, using the following command line (where a subdirectory of somedirectory contains .jp2 files!):
Ends with:
Although I can imagine this last example isn't really supposed to work in the first place? This all looks like file paths are not handled properly by the code (most likely the behaviour is platform dependent), and it's probably quite easy to solve as well. One thing here that did catch my eye was this line in addRecursiveFiles:
Which won't work under Windows, perhaps unless some normalisation happens before the call (I haven't looked in sufficient detail). Although I don't think this explains (all of) the bove issues. Next stepsDepending on your preferences, I can either wait for a new commit from you with these issue fixed, or otherwise pull it to a separate branch that I've already set up right away and continue working from that. (The reason I've set this up is that with Github's decision to drop support for file downloads I want to move the Windows binaries and PDF doc inside the repo, and some links on the OPF website have to be adapted as well, so until that's all sorted I don't want to touch the current main branch!). The advantage of pulling it right away is that others (e.g. me) can also help solving these remaining issues. On a side note I also saw that some of the formatting in the User Manual has gone a bit wrong because of the MS Word to OpenOffice conversion. That's no problem, and I can fix that by pasting your changes into my original Word doc and then re-generating the PDF. (Ideally it would be nice to have the doc in Markdown, but with all the tables that's just not very practical here). Thanks again, and let me know how you want to proceed! Cheers, Johan |
Hi Johan, lets see if we can't fix these path handling isues ASAP before Jaishree can you look at fixing these issues asap please?
|
Addition to above: after I wrote that I recalled that I once did something similar (i.e. recursively go through directory tree), and after some digging I found this bit of code: def getFilesFromTree(rootDir):
# Recurse into directory tree and return list of all files
# NOTE: directory names are disabled here!!
filesList=[]
for dirname, dirnames, filenames in os.walk(rootDir):
#Suppress directory names
for subdirname in dirnames:
thisDirectory=os.path.join(dirname, subdirname)
for filename in filenames:
thisFile=os.path.join(dirname, filename)
filesList.append(thisFile)
return filesList This uses os.walk, which is quite a bit simpler than the code in addRecursiveFiles. Perhaps this might be a useful substitute? |
Btw, could you select the 'test1.7' branch when making the pull request for the new commit (so not the master branch? Thanks! |
…g with -r option." This reverts commit 4747a41.
…d Wildcard handling with -r option.
Further update: the recursive option is still misbehaving as well! Under certain circumstances when you use -r with a path (e.g. .\mytestJP2s), instead of processing that directory it processes its root flder + all its underlying directories. Can't exactly pin down what triggers this, but I got tis behaviour both when the dir is specified as a relative path, and omitting the trailing backslash ('') seems to trigger this behaviour as well. (This happened under WinXP). In one of my tests this resulted in jpylyzer trying to scan my entire hard disk!! |
…L output in UTF-8 format and [bugfix] Recursive option
I have corrected jpylyzer so that it will always create a valid XML document even when scanning more than one file.
I have fixed the bug with '*', it is now correctly expanded by the shell.
I have added support for: