Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor gac-run #55

Merged
merged 66 commits into from
Apr 1, 2020
Merged

Refactor gac-run #55

merged 66 commits into from
Apr 1, 2020

Conversation

carloshorn
Copy link
Collaborator

This PR should close #54 by introducing a the new module l1c_factory.
This factory is used by the command line script pygac-run, which now is also able to process directories and tar archives.

The critical point is the change in the Reader.read method, which now takes an open file object as input. This change could lead to backwards incompatibility if someone uses this method in a custom script. If you see any risk, I could use the utils.file_opener (where open files are passed through) again in the read method.

Copy link
Member

@mraspaud mraspaud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for taking the time to submit this big refactoring. Many good things in here, I like especially the simplification of the main functions.

However, I think the usage of the factory and builder classes are a bit misleading and not really necessary. Also don't forget to add unit tests for every new function and class that is added to the code.

pygac/utils.py Outdated
file - path to file or file object
"""
close = True
if is_gzip(file):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use just a try/except instead of writing a function for this ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, gzip.open does not raise an exception immediately, but the following lines should do the job:

@contextmanager
def file_opener(file):
    """Open a file depending on the input.

    Args:
        file - path to file or file object
    """
    # open file if necessary
    if is_file_object(file):
        open_file = file
        close = False
    else:
        open_file = open(file, mode='rb')
        close = True
    # check if it is a gzip file
    try:
        file_object = gzip.open(open_file)
        file_object.read(1)
    except OSError:
        file_object = open_file
    finally:
        file_object.seek(0)
    # provide file_object with the context
    try:
        yield file_object
    finally:
        if close:
            file_object.close()

pygac/tests/test_reader.py Show resolved Hide resolved
@@ -574,12 +574,11 @@ class KLMReader(Reader):

tsm_affected_intervals = TSM_AFFECTED_INTERVALS_KLM

def read(self, filename):
def read(self, fileobj):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This breaks the current API that other libs use (eg Satpy). Could we use the file opener in this function, but make it transparent if the file is already open ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expected this change to be critical, but it's an easy step to fix it. The function file_opener has no effect on open files, i.e. it does not close them.

pygac/reader.py Outdated Show resolved Hide resolved
pygac/reader.py Outdated
@property
def filename(self):
"""Get the property 'filename'."""
return self.__filename
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this attribute needs to be initialized in __init__

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is already initialized in the __init__. I added the filename to the argument list and therefore moved it upwards to the initialization of the other given arguments.
Since you are using the Reader.read method in satpy, I wanted to ask, if you keep a single reader instance and reuse it for may files, or if one reader instance represents one file.
In the former case, I would suggest to give the Reader.read the optional argument fileobj as seen in file_opener and keep setting the filename attribute in this method.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Should the filename attribute be initialized with the given filename, or would you prefer the variable Reader.head['data_set_name']?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe using 'data_set_name' would be more robust, right ?

pygac/l1c_factory.py Outdated Show resolved Hide resolved
bin/pygac-run Outdated Show resolved Hide resolved
pygac/l1c_factory.py Outdated Show resolved Hide resolved
pygac/l1c_factory.py Outdated Show resolved Hide resolved
pygac/l1c_factory.py Outdated Show resolved Hide resolved
@mraspaud
Copy link
Member

Also, in the future, please try to submit smaller changes at the time. This PR is quite large but could have been split into multiple smaller ones (ie one feature at the time). It would have made the review process less time consuming... ;)

@carloshorn
Copy link
Collaborator Author

Sorry for that lengthy PR, but it's hard to stop once started ;-)

Thank you for taking the time.

@carloshorn
Copy link
Collaborator Author

Hi @mraspaud,
Sorry again for the long PR.
Now, it seems to work. I manually tested it, but I will of cause write the unit tests for the new stuff.
Here, I wanted to summarize the changes witch should have gone into separate PRs:

  1. Added the possibility to pass the config file to the run-script (pygac-run)
  2. Added the possibility to pass directories to the run-script (pygac-run)
  3. Added the possibility to pass tar files to the run-script (pygac-run)
    3.1. Required to pass filename and file object to reader (all XxxReader.read methods)
    3.2. Offered to outsource and refactor the open method (Reader._open -> utils.file_opener)
    3.3. Required to set reader.filename in reader.read method
    3.4. Offered to implement Reader.filename as property with the setter taking care of the format
  4. Added the Reader.save method to simplify the file processing interface (reader main)
    4.1. Required to implement Reader._prepare_channels to pass the channels to gac_io.save_gac in the correct order
  5. Added l1c_processor to move the reader selection from pygac-run to the internal library.
    5.1. Required to write l1c_processor.get_reader to implement the reader selection
    5.2. Offered to implement the alternative constructor Reader.fromfile
    5.3. Offered to introduce ReaderError as exception during the search for the reader
    5.4. Required to introduce reader._validate_header to check if this file corresponds to the reader
    5.5. Offered to implement a decorator in pygac-run to log failed processing trials

Open questions/issues (not to be handled in this PR :-)):

  1. Use the header attribute data_set_name to set the attribute reader.filename. This would again simplify the reader.read interface, because then we don't need the filename attribute if a file object is given, because all the information would be gathered from inside the file.
  2. pygac should not expect any environment variable (especially the config path). The environment variable should only serve as a fallback if nothing has been given. That means that this variable should not be set in the init.py. At least no error log if everything is fine.

Many thanks again for your time for the code review!

@sfinkens
Copy link
Member

sfinkens commented Mar 6, 2020

Scanlines are masked based on the quality flags, e.g.

def _get_corrupt_mask(self):

But I don't know what they mean...

@carloshorn
Copy link
Collaborator Author

carloshorn commented Mar 6, 2020

The pod guide Table 3.1.2.1-2. Format of quality indicators and klm guide Table 8.3.1.3.2.1-1. and following tables give the details about each bit. These tables should be mentioned in the doc string of _get_corrupt_mask. Furthermore, the bit shift operations in this method(s) are very cryptic to read. Since this method is far from being the bottleneck, using np.binary_repr to split the array into its bits and give them the associated names would make it more readable end self-explanatory.

Edit: np.unpackbits could be even better.
Edit2: The code for POD could look like

def _get_corrupt_mask(self):
    """Get mask for corrupt scanlines.

    Note
        The quality indicator format is listed in
        https://www1.ncdc.noaa.gov/pub/data/satellite/publications/
        podguides/TIROS-N%20thru%20N-14/pdf/NCDCPOD3.pdf
        Table 3.1.2.1-2. Format of quality indicators.
    """
    BITS_PER_BYTE = 8
    # quality indicator bit position
    # Note the index numeration is inverted to Table 3.1.2.1-2. 
    # Format of quality indicators, because we skip the big
    # endian uint conversion and directly read the bitstream
    FATAL_FLAG = 0  # Data should not be used for product generation
    CALIBRATION = 4  # Insufficient data for calibration
    NO_EARTH_LOCATION = 5  # Earth location data not available
    # need a contiguous array for memory view
    quality_indicators = np.ascontiguousarray(
        self.scans['quality_indicators'])
    shape = (
        quality_indicators.nbytes//quality_indicators.itemsize, 
        BITS_PER_BYTE*quality_indicators.itemsize
    )
    bits = np.unpackbits(
        quality_indicators.view(dtype=np.ubyte)
    ).reshape(shape).astype(bool)
    subset = [FATAL_FLAG, CALIBRATION, NO_EARTH_LOCATION]
    mask = bits[:, subset].any(axis=1)
    return mask

The index locations of the quality indicator bits could also be stored in a dictionary in pod_reader and klm_reader.

@sfinkens
Copy link
Member

sfinkens commented Mar 9, 2020

Nice, thanks for clarifying and immediately fixing it! If the NO_EARTH_LOCATION flag is set, we probably shouldn't use the scanline for lat/lon interpolation. Of course we could start splitting the quality flags, but since we have quite a lot of changes in this PR already, I'd vote for restoring the original behaviour.

@carloshorn
Copy link
Collaborator Author

Nice, thanks for clarifying and immediately fixing it! If the NO_EARTH_LOCATION flag is set, we probably shouldn't use the scanline for lat/lon interpolation.

I had a look into the interpolator and would recommend to add an optional argument for a mask to the interpolation functions. This argument should be passed to the lat_lot_interpolator and mask the missing lines. It could look like this:

def lat_lon_interpolator(lons_subset, lats_subset, cols_subset, cols_full, mask=None):
    """Interpolate lat-lon values in the AVHRR data."""
    lines = lats_subset.shape[0]
    rows_subset = np.arange(lines)
    if mask is not None:
        rows_subset = rows_subset[mask]
    rows_full = np.arange(lines)

    along_track_order = 1
    cross_track_order = 3

    satint = gtp.SatelliteInterpolator((lons_subset, lats_subset),
                                       (rows_subset, cols_subset),
                                       (rows_full, cols_full),
                                       along_track_order,
                                       cross_track_order)

    return satint.interpolate()

What do you think? This time, I maybe wait before implementing...

Of course we could start splitting the quality flags, but since we have quite a lot of changes in this PR already, I'd vote for restoring the original behaviour.

I will revert to 487660d and keep the commits for another PR.

@mraspaud
Copy link
Member

Yes, can we take this in an other PR ?

@mraspaud
Copy link
Member

I'd like to stop the addition of features on this one for now, so that we can make a final test and merge if possible

@carloshorn
Copy link
Collaborator Author

As announced earlier, I reverted the bit unpacking commits and updated the author lists.
@mraspaud could you specify which tests you would like to see to close this PR.
Once closed, we can handle the bit unpacking, and lon/lat mask for interpolation in separate PRs.

@mraspaud
Copy link
Member

@carloshorn ok, thanks.
@ninahakansson @sfinkens could you please check that this PR works for you one last time ?

@ninahakansson
Copy link
Collaborator

@mraspaud PR works ok for my testcase.

@sfinkens
Copy link
Member

sfinkens commented Apr 1, 2020

Testing now

@sfinkens
Copy link
Member

sfinkens commented Apr 1, 2020

So, a couple of files are longer than before. Was the consensus to use these extra scanlines? Apart from that, I see some differences in two files

NSS.GHRR.NJ.D01335.S0618.E0812.B3567273.WI
NSS.GHRR.NJ.D01335.S1454.E1639.B3567778.GC

See: https://public.cmsaf.dwd.de/data/sfinkens/pygac_l1c_factory

@mraspaud
Copy link
Member

mraspaud commented Apr 1, 2020

@sfinkens aren't these just extra lines ?

@sfinkens
Copy link
Member

sfinkens commented Apr 1, 2020

@mraspaud I don't think so, both orbits have the same length and if you look at image4 for example, they look similar in the end, but the difference is not zero (although it is small)

@sfinkens
Copy link
Member

sfinkens commented Apr 1, 2020

Ah, but now I remember. This was because the extra lines are used for interpolation, right?

@mraspaud
Copy link
Member

mraspaud commented Apr 1, 2020

Yes that's probably the case.

@mraspaud
Copy link
Member

mraspaud commented Apr 1, 2020

So it looks like everyone is good with this, I'll be merging shortly. @carloshorn thank you very much for your sustained efforts and patience! That was an epic PR :)

@sfinkens
Copy link
Member

sfinkens commented Apr 1, 2020

Yeah, big thumbs up @carloshorn!

@mraspaud mraspaud changed the title L1c factory Refactor gac-run Apr 1, 2020
@mraspaud mraspaud merged commit 4c458ba into pytroll:master Apr 1, 2020
@carloshorn
Copy link
Collaborator Author

Nice, thank you all for the reviews.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Function check_file_version should not be part of pygac-run
7 participants