Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up UFOZ read/write. #221

Open
typesupply opened this issue Dec 20, 2018 · 4 comments
Open

Speed up UFOZ read/write. #221

typesupply opened this issue Dec 20, 2018 · 4 comments

Comments

@typesupply
Copy link
Member

Reading and writing UFOZ is extremely slow when only part of the wrapped UFO data have been modified. This is partly due to some issues in ufoLib (which I will address separately) but the changes that I am studying for ufoLib will need to be deeply integrated within defcon.

This issue is a public notepad for my examinations and tests. I'll be updating the tasks and results in the comments of this issue.

@typesupply
Copy link
Member Author

@anthrotype, I'm going to do my studying here and if it does require a change in ufoLib, I'll make an issue for discussion and proposal over there. I figured it'd be best to consolidate it here for now since defcon is where the problem is most acute as of this moment.

@anthrotype
Copy link
Member

sounds good
thanks!

@typesupply
Copy link
Member Author

This is the current read process inside of UFOReader:

  1. open zip from disk
  2. extract files upon request

Inside of defcon.Font a ufoLib.UFOReader object is created during __init__. UFOReader retains a references to a ZipFS object for all contents of the UFOZ. Also __init__ the glyph sets for each layer are obtained for the reader and retained in defcon's Layer objects. (This is required for lazy loading of glyphs in defcon.) These glyph sets retain references to sub-directories of the parent ZipFS object. (This is required for lazy loading of glyphs by the ufoLib.glifLib.GlyphSet objects.) Once __init__ is complete, the UFOReader object is no longer retained.

When a file that was not read during __init__ is needed by a defcon caller, a UFOReader is created again. Most top level files are not read during __init__. This redundancy costs time. It's not a huge amount of time for an individual file read, but if multiple files are loaded in one sequence, the time can add up.

This is the current write process inside of UFOWriter:

  1. create a temp directory on disk
  2. open zip from disk
  3. decompress all files in zip into temporary directory
  4. make changes to files in temporary directory
  5. create new zip container
  6. copy all files from temporary directory into zip container
  7. write zip container to disk

This is a very expensive process. In my test case of one style of Source Serif Pro (1415 glyphs) the write time is always a minimum of around two seconds. That's just the overhead of the write process as it is currently structured. I've analyzed this and it breaks down like this:

steps 1-3: 0.818 seconds
step 4: 0.0003 seconds
steps 5-7: 0.995 seconds

The obvious solution to this is to retain UFOReader and UFOWriter outside of __init__ and save. However, zips are tricky little files so it's not quite so simple. First, when writing, the ZipFS object must closed to trigger the compression and writing of the actual file. Once the ZipFS is closed it can't be used again. So, retaining ZipFS doesn't do any good.

Further complicating things is the possibility of external changes to the UFOZ by another application/process. When reading from ZipFS the data inside the zip is immediately pulled into memory. If the data inside the original zip is changed, the data that was pulled into memory will not reflect the change.

My current idea is to modify the write process so that part of the work can be reused. The logic would be like this:

1. if we don't have a copy of the contents of the zip in a temporary file system, make one and set it to be retained
2. write changes to the files in the temporary file system
3. create new zip container
4. copy all files from temporary directory into zip container
5. write zip container to disk

Using the timed results from above, during the first save, the time should break down like this:

step 1: 0.818 seconds
step 2: 0.0003 seconds
steps 3-5: 0.995 seconds

And subsequent saves should break down like this:

step 1: very little time
step 2: 0.0003 seconds
steps 3-5: 0.995 seconds

This will reduce the overhead on subsequent saves by half. If an external change is made, we need to discard the retained temporary file system redo step 1 as the temporary file system is no longer in sync.

I think there may be more optimizations possible, but this is what I have for now. I haven't yet worked out how this logic would be distributed to defcon and ufoLib. I'm going to do more thinking and testing before I get into that.

@typesupply
Copy link
Member Author

Below is an outline of what I am going to try first. If this works, there will be no changes needed in ufoLib. All of these changes would be applicable only to defcon's Font class.


First, I'd establish a defcon specific version of ZipFS. This will be used for reading only:

from fs.zipfs import ReadZipFS

class DefconZipFS(ReadZipFS):

  def __init__(self, *args, writableFS=None, **kwargs):
    self.writableFS = writableFS
    super(DefconZipFS, self).__init__(*args, **kwargs)

  def getbytes(self, path):
    if self.writableFS.exists(path):
      return self.writableFS.getbytes(path)
    data = super(DefconZipFS, self).getbytes(path)
    self.writableFS.setbytes(path, data)
    return data

In Font.__init__:

- make a write enabled instance of `ZipFS` (let's call this "internalFS")
  - retain it
- make an instance of `DefconZipFS` (let's call this "sourceFS") pointed at the data on disk
  - retain it

When creating a UFOReader:

- provide sourceFS as the path argument

This will produce the following behavior when anything is read from the UFOReader:

- if the file is in internalFS
  - read from there
    - return the data
- else
  - if sourceFS is still available
    - read from there
      - write the data into internalFS
      - return the data
  - else (this won't happen unless a file that was never in sourceFS is requested)
    - error

In Font.save:

- if sourceFS is still around
  - copy unread files to internalFS
  - release sourceFS
- make the changes to internalFS
- make a copy of internalFS (using BytesIO as the file) 
- write the zip data from the copy
- write the zip data to disk

This behavior should be off by default and may be activated via a kwarg in __init__.

There are situations that need to be handled that I don't have answers for yet:

  • What happens if an external change is detected?
  • When are the fsobjects released?
  • Probably more stuff.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants