Warn or raise when needed DataLoader attributes aren't included #136

Open
spencerahill opened this issue Feb 2, 2017 · 4 comments

Comments

@spencerahill
Owner

It is possible to instantiate a GFDLDataLoader that doesn't include all of the attributes (e.g. data_dur) that are ultimately required to find the file. Currently, we don't warn or raise when this happens.

This leads to non-intuitive crashes. For example, when I ran a calculation using a Run whose data_loader mistakenly lacked a data_dur attribute, I got the traceback below, which arises because data_dur is None:

INFO:root:Getting input data: Var instance "prec_ls" (Thu Feb  2 00:42:23 2017)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/home/Spencer.Hill/py/scripts/main.py in <module>()
    127 if __name__ == '__main__':
    128     calcs = main(mp, print_table=mp.print_table, prompt_verify=True,
--> 129                  exec_calcs=mp.compute, parallelize=mp.parallelize)
    130
    131

/home/Spencer.Hill/py/aospy_user/aospy_user/main.pyc in main(main_params, exec_calcs, print_table, prompt_verify, parallelize)
    214     else:
    215         calcs = cs.create_calcs(param_combos, exec_calcs=exec_calcs,
--> 216                                 print_table=print_table)
    217     return calcs

/home/Spencer.Hill/py/aospy_user/aospy_user/main.pyc in create_calcs(self, param_combos, exec_calcs, print_table)
    160             if exec_calcs:
    161                 try:
--> 162                     calc.compute()
    163                 except RuntimeError as e:
    164                     logging.warn(repr(e))

/home/Spencer.Hill/py/aospy/aospy/calc.pyc in compute(self, save_files, save_tar_files)
    641         """Perform all desired calculations on the data and save externally."""
    642         data = self._prep_data(self._get_all_data(self.start_date,
--> 643                                                   self.end_date),
    644                                self.var.func_input_dtype)
    645         logging.info('Computing timeseries for {0} -- '

/home/Spencer.Hill/py/aospy/aospy/calc.pyc in _get_all_data(self, start_date, end_date)
    475                                                      end_date, n),
    476                                 self.var.func_input_dtype)
--> 477                 for n, var in enumerate(self.variables)]
    478
    479     def _local_ts(self, *data):

/home/Spencer.Hill/py/aospy/aospy/calc.pyc in _get_input_data(self, var, start_date, end_date, n)
    427             data = self.data_loader.load_variable(var, start_date, end_date,
    428                                                   self.time_offset,
--> 429                                                   **self.data_loader_attrs)
    430             name = data.name
    431             data = self._add_grid_attributes(

/home/Spencer.Hill/py/aospy/aospy/data_loader.pyc in load_variable(self, var, start_date, end_date, time_offset, **DataAttrs)
    187         """
    188         file_set = self._generate_file_set(var=var, start_date=start_date,
--> 189                                            end_date=end_date, **DataAttrs)
    190         ds = _load_data_from_disk(file_set)
    191         ds = _prep_time_data(ds)

/home/Spencer.Hill/py/aospy/aospy/data_loader.pyc in _generate_file_set(self, var, start_date, end_date, domain, intvl_in, dtype_in_vert, dtype_in_time, intvl_out)
    391             file_set = self._input_data_paths_gfdl(
    392                 name, start_date, end_date, domain, intvl_in, dtype_in_vert,
--> 393                 dtype_in_time, intvl_out)
    394             if all([os.path.isfile(filename) for filename in file_set]):
    395                 return file_set

/home/Spencer.Hill/py/aospy/aospy/data_loader.pyc in _input_data_paths_gfdl(self, name, start_date, end_date, domain, intvl_in, dtype_in_vert, dtype_in_time, intvl_out)
    422                     name, domain, dtype, intvl_in, year, intvl_out,
    423                     self.data_start_date.year, self.data_dur))
--> 424                  for year in range(start_date.year, end_date.year + 1)]
    425         files = list(set(files))
    426         files.sort()

/home/Spencer.Hill/py/aospy/aospy/utils/io.pyc in data_name_gfdl(name, domain, data_type, intvl_type, data_yr, intvl, data_in_start_yr, data_in_dur)
    153     """Determine the filename of GFDL model data output."""
    154     # Determine starting year of netCDF file to be accessed.
--> 155     extra_yrs = (data_yr - data_in_start_yr) % data_in_dur
    156     data_in_yr = data_yr - extra_yrs
    157     # Determine file name. Two cases: time series (ts) or time-averaged (av).

TypeError: unsupported operand type(s) for %: 'int' and 'NoneType'
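
For reference, the failing line works out which multi-year file chunk holds a given year. Below is a minimal sketch of that arithmetic, built from the two statements visible in the traceback. The function name chunk_start_year and the sample years are illustrative only and are not part of aospy.

```python
def chunk_start_year(data_yr, data_in_start_yr, data_in_dur):
    """Return the first year of the file chunk that contains data_yr.

    Mirrors the two statements from data_name_gfdl shown in the traceback;
    the function name and the sample years below are illustrative only.
    """
    extra_yrs = (data_yr - data_in_start_yr) % data_in_dur
    return data_yr - extra_yrs


# With 5-year chunks starting in 1980, year 1987 falls in the 1985 chunk.
print(chunk_start_year(1987, 1980, 5))

# With data_in_dur=None, the modulo fails with the same kind of TypeError
# as in the traceback.
try:
    chunk_start_year(1987, 1980, None)
except TypeError as err:
    print("TypeError:", err)
```

So the error surfaces several calls away from the place where data_dur was left out, which is why it is hard to interpret.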

I don't want to switch to positional arguments, but I think we should at the very least warn when the needed attributes are missing. Maybe raising is too much, since then a user won't even be able to import their object library -- it might be more user-friendly to warn, so that they can use other objects but also know that this particular object will fail if they try to use it.
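
One way the warning could look is sketched below. It rests on several assumptions: SketchGFDLLoader is a hypothetical stand-in rather than aospy's GFDLDataLoader, and the REQUIRED_ATTRS tuple is a guess at which attributes matter (only data_dur is confirmed by the traceback above).

```python
import warnings

# data_dur comes from this issue; the other names are assumed for
# illustration. The real required set depends on the loader.
REQUIRED_ATTRS = ("data_direc", "data_dur", "data_start_date")


class SketchGFDLLoader:
    """Hypothetical stand-in for a GFDL-style loader, not aospy's class."""

    def __init__(self, data_direc=None, data_dur=None, data_start_date=None):
        self.data_direc = data_direc
        self.data_dur = data_dur
        self.data_start_date = data_start_date
        # Collect the required attributes that were left as None.
        self.missing_attrs = [
            attr for attr in REQUIRED_ATTRS if getattr(self, attr) is None
        ]
        if self.missing_attrs:
            # Warn rather than raise, so that importing an object library
            # containing this loader still succeeds.
            warnings.warn(
                "{} is missing attributes needed to locate files: {}".format(
                    type(self).__name__, ", ".join(self.missing_attrs)),
                UserWarning,
                stacklevel=2,
            )


# Usage: capture the warning so it can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    loader = SketchGFDLLoader(data_direc="ROOT/pp", data_start_date=1980)
print(caught[0].message)
```

Storing the missing names on the instance would also let later code (for example, the step that builds file paths) raise a clearer error at the point of use.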

@spencerkclark
Collaborator

Agreed, that is a bad error message

@haydenbetts

haydenbetts commented Mar 3, 2019

Hey @spencerahill
I've been looking at this one today. I have a few prefatory questions.

TL;DR: I couldn't get the loader to work with GFDL data from ftp://nomads.gfdl.noaa.gov/gfdl_am2_1/AM2.1_1979-2000-AllForc_h1/ I am not sure whether I am using out-of-scope data (an obscure data format the loader is not designed to work with) or doing something wrong in aospy. Do you have a link to some post-processed GFDL data that is in scope for the GFDL loader?

Details:
I downloaded GFDL data from an experimental run of AM 2.1, specifically from ftp://nomads.gfdl.noaa.gov/gfdl_am2_1/AM2.1_1979-2000-AllForc_h1/, and tried to go through the process of loading it.

I tried loading files for the variable hur (relative_humidity, per https://pcmdi.llnl.gov/ipcc/standard_output14.html#Table_A1a).

Here are the paths that the loader generated and then threw an error on:

[['.../pp/atmos/ts/monthly/5yr/atmos.198001-198412.hur.nc',
'.../pp/atmos/ts/monthly/5yr/atmos.198501-198912.hur.nc',
'..../pp/atmos/ts/monthly/5yr/atmos.199001-199412.hur.nc',
'.../pp/atmos/ts/monthly/5yr/atmos.199501-199912.hur.nc']]

This is pretty close to the format of this GFDL post-processed data, but not quite.
A representative file from this directory was:
.../pp/atmos/ts/monthly/hur_A1.198001-198412.nc

That is, the loader expected
[rootdir]/[domain]/[dtype_in_time]/[intvl_in]/[data_dur]yr/[domain].[date range].[variable name].nc
and in this dataset, the format seemed to be:
[rootdir]/[domain]/[dtype_in_time]/[intvl_in]/[variable_name][a shorthand version of the IPCC Table identifier].[date range].nc
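
To make the comparison concrete, here is a small sketch that assembles a path from the bracketed pattern the loader expected. It is only an illustration: expected_gfdl_path is a hypothetical helper, not aospy code, "ROOT/pp" is a placeholder root directory, and the January-to-December date range is assumed from the monthly file names listed above.

```python
def expected_gfdl_path(rootdir, domain, dtype_in_time, intvl_in, data_dur,
                       chunk_start_yr, name):
    """Build a path following the bracketed pattern the loader expected.

    Assumes monthly time-series files whose date range runs from January of
    the chunk's first year to December of its last year.
    """
    chunk_end_yr = chunk_start_yr + data_dur - 1
    date_range = "{:04d}01-{:04d}12".format(chunk_start_yr, chunk_end_yr)
    filename = "{}.{}.{}.nc".format(domain, date_range, name)
    return "/".join([rootdir, domain, dtype_in_time, intvl_in,
                     "{}yr".format(data_dur), filename])


# "ROOT/pp" is a placeholder root directory, not a real path.
path = expected_gfdl_path("ROOT/pp", "atmos", "ts", "monthly", 5, 1980, "hur")
print(path)  # ROOT/pp/atmos/ts/monthly/5yr/atmos.198001-198412.hur.nc
```

The downloaded files have no "5yr" directory level and put the variable name first in the file name, so a path built this way will not exist on disk.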

Am I doing something wrong? (That could totally be what's going on! :) Here is what I tried: https:/haydenbetts/aospy-run-test) Is this post-processed data from AM 2.1 out of scope for the GFDLDataLoader? If so, where can I find in-scope data?

@spencerahill
Owner Author

You're not doing anything wrong; indeed, the filenames for the data you found don't match the pattern we built the GFDLDataLoader around. Here is another example.

Presumably both of these are for model data that was ultimately intended for the CMIP archives, based on the fact that they use the CMIP standard variable names rather than GFDL's in-house standard names, i.e. 'hur' instead of 'rh' (or 'tas' instead of 't_surf'). Ultimately, this isn't the use case we're interested in; the directory structure and file-naming patterns written into the GFDLDataLoader are the ones used by GFDL's modern in-house models and are thus what we care about.

I'm sure there is some publicly available GFDL data in the proper format, but I can't find any right now; in fact many of the links from the GFDL data portal page seem broken. @spencerkclark, do you have any on hand?

Also, @spencerkclark wrote the unit tests (e.g. here) that cover the GFDLDataLoader, but AFAICT those tests generate the objects and data they need as they run. So another option is to do something similar. In fact, since this (like all new code) will require unit tests of its own, this might ultimately be the best way to proceed, i.e. you'll have to do it sooner or later.

In other words, maybe don't worry about finding test data in the wild matching the pattern: just construct it like those existing unit tests have.

Also, thanks for describing the problem very clearly, which is really helpful.

@spencerkclark
Collaborator

@haydenbetts thanks for your interest; sorry for being silent for a bit.

I agree with @spencerahill regarding thinking about this problem in a more abstract sense, i.e. without worrying about having actual example files, as that will be useful for writing tests. In this case I think you might not need to worry about filenames. For testing you might be able to create GFDLDataLoaders with and without the required input arguments and make sure an error is raised under the appropriate circumstances.
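
A rough sketch of what such a test might look like follows, using only the standard library's unittest. StrictSketchLoader is a hypothetical stand-in that raises ValueError when data_dur is omitted. The real GFDLDataLoader constructor, the exception type, and the choice between raising and warning are all still open here.

```python
import unittest


class StrictSketchLoader:
    """Hypothetical loader that raises if data_dur is omitted."""

    def __init__(self, data_direc=None, data_dur=None):
        if data_dur is None:
            raise ValueError("data_dur is required to locate GFDL files")
        self.data_direc = data_direc
        self.data_dur = data_dur


class TestRequiredAttrs(unittest.TestCase):
    def test_missing_data_dur_raises(self):
        # No files are needed: the check happens at construction time.
        with self.assertRaises(ValueError):
            StrictSketchLoader(data_direc="ROOT/pp")

    def test_complete_loader_is_created(self):
        loader = StrictSketchLoader(data_direc="ROOT/pp", data_dur=5)
        self.assertEqual(loader.data_dur, 5)


# Run the two tests without exiting the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestRequiredAttrs)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

If the final behavior is a warning instead of an exception, the first test could use assertWarns in place of assertRaises.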

Nevertheless, for your reference, I put up a small set of files in the form of a tar archive on Google Drive that fit this directory/naming structure in case you'd like to try things out in a practical setting.
