
Functions that import sample GMT remote files #1436

Closed
willschlitzer opened this issue Aug 11, 2021 · 14 comments
Labels: deprecation (Deprecating a feature), feature request (New feature wanted)
Milestone: 0.6.0

Comments

@willschlitzer
Contributor

I've recently opened some pull requests (#1386 and #1420) that import the sample remote files, and I've realized there are a lot of sample files on the GMT server that are used in the GMT examples. Should functions be added that import remote files (e.g. "ternary.txt" and "EGM96_to_36.txt" for ternary and sph2grd) whenever they are used in a GMT module example? I think it's more Pythonic to have the datasets available as a DataFrame/array than just the text file, and I think those sample files are a great resource when writing a test or making an example, but I could also see it as bad practice to be adding all of these functions that import little-used sample files.

willschlitzer added the question (Further information is requested) label Aug 11, 2021
@maxrjones
Member

Functions to load the files are nice, though I am not sure that we should export them or include them in the API docs, since there's no stability guarantee for those files (see https://docs.generic-mapping-tools.org/dev/datasets/remote-data.html#cache-file-updates).

@maxrjones
Member

xref GenericMappingTools/gmtserver-admin#100, which has some discussions about alternate options for data files used in examples.

@willschlitzer
Contributor Author

willschlitzer commented Aug 31, 2021

It looks like the consensus is that cache files should be used only in examples. Should the PyGMT examples just call the remote files directly or should there be a Python function to load the remote files first before they're used in the examples?

@maxrjones
Member

maxrjones commented Sep 3, 2021

> It looks like the consensus is that cache files should be used only in examples. Should the PyGMT examples just call the remote files directly or should there be a Python function to load the remote files first before they're used in the examples?

This is a good question. I am not sure of the best solution here, but my current thinking is that we should either have a separate namespace for the functions related to GMT's cache data, as opposed to the earth relief, age, etc. grids, or a single public function like load_cache_data with a filename parameter that can be used to specify the cache file. This way, it would be clearer that the data accessed are not 'stable' (which is currently only documented on the GMT side), and the smaller functions for transforming the data in each file for the examples would not overwhelm the API docs.
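For illustration, a minimal sketch of the single-function option, assuming the hypothetical load_cache_data name from above and reusing pygmt.which with download="c" (the way GMT cache files are normally fetched); the pd.read_csv pass-through is just one possible design:

import pandas as pd

from pygmt import which


def load_cache_data(filename, **kwargs):
    """Download a GMT cache file and read it into a pandas.DataFrame."""
    # The "@" prefix tells GMT to fetch the file from its remote server;
    # download="c" stores it in the local GMT cache directory.
    fname = which(fname=f"@{filename}", download="c")
    # Any remaining keyword arguments are passed straight to pd.read_csv.
    return pd.read_csv(fname, **kwargs)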

@willschlitzer
Contributor Author

>> It looks like the consensus is that cache files should be used only in examples. Should the PyGMT examples just call the remote files directly or should there be a Python function to load the remote files first before they're used in the examples?

> This is a good question. I am not sure of the best solution here, but my current thinking is that we should either have a separate namespace for the functions related to GMT's cache data, as opposed to the earth relief, age, etc. grids, or a single public function like load_cache_data with a filename parameter that can be used to specify the cache file. This way, it would be clearer that the data accessed are not 'stable' (which is currently only documented on the GMT side), and the smaller functions for transforming the data in each file for the examples would not overwhelm the API docs.

I think functions that take a region argument, such as load_earth_relief (and the functions in #1457 and #1471), still merit their own public functions for loading the gridded data, but there's no need for separate functions that each just read a remote file from the GMT server and return a DataFrame.

I'm not familiar with what it means to move the functions to a separate namespace. Does this mean there would be a separate class that would have sub-functions to call different datasets?

@maxrjones
Member

> I think functions that take a region argument, such as load_earth_relief (and the functions in #1457 and #1471), still merit their own public functions for loading the gridded data, but there's no need for separate functions that each just read a remote file from the GMT server and return a DataFrame.

100% agree

> I'm not familiar with what it means to move the functions to a separate namespace. Does this mean there would be a separate class that would have sub-functions to call different datasets?

Not a new class, just a different module structure. So there would still be pygmt.datasets.load_earth_relief but also pygmt.datasets.cache.load_hotspots, etc. But I expect that using a single public function (similar to https://seaborn.pydata.org/generated/seaborn.load_dataset.html) and internal function(s) (as needed) for loading and transforming the data would be the better option. I could open a request for comment PR with a potential solution, or leave this to you if you prefer.
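For concreteness, the namespace option would look like this from the user's side (a sketch; the pygmt.datasets.cache module and load_hotspots are the hypothetical names from above):

from pygmt.datasets import load_earth_relief
from pygmt.datasets.cache import load_hotspots

# Gridded data keeps its own public loader...
grid = load_earth_relief(resolution="01d")
# ...while the cache-file tables live in a separate namespace.
hotspots = load_hotspots()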

@willschlitzer
Contributor Author

> Not a new class, just a different module structure. So there would still be pygmt.datasets.load_earth_relief but also pygmt.datasets.cache.load_hotspots, etc. But I expect that using a single public function (similar to https://seaborn.pydata.org/generated/seaborn.load_dataset.html) and internal function(s) (as needed) for loading and transforming the data would be the better option. I could open a request for comment PR with a potential solution, or leave this to you if you prefer.

As far as I can tell with the seaborn model, it would look something like this:

hotspot_data = pygmt.datasets.load_dataset(name="hotspot")

Then the function would look something like:

def load_dataset(name):
    if name == "hotspot":
        return _load_hotspot()

Does that look about right?

@maxrjones
Member

>> Not a new class, just a different module structure. So there would still be pygmt.datasets.load_earth_relief but also pygmt.datasets.cache.load_hotspots, etc. But I expect that using a single public function (similar to https://seaborn.pydata.org/generated/seaborn.load_dataset.html) and internal function(s) (as needed) for loading and transforming the data would be the better option. I could open a request for comment PR with a potential solution, or leave this to you if you prefer.

> As far as I can tell with the seaborn model, it would look something like this:
>
> hotspot_data = pygmt.datasets.load_dataset(name="hotspot")
>
> Then the function would look something like:
>
> def load_dataset(name):
>     if name == "hotspot":
>         return _load_hotspot()
>
> Does that look about right?

That looks right. It would also be necessary to decide whether to deprecate datasets.load_japan_quakes, datasets.load_ocean_ridge_points, datasets.load_sample_bathymetry, datasets.load_usgs_quakes, datasets.load_fractures_compilation, and datasets.load_hotspots in favor of pygmt.datasets.load_dataset(name=dataset_name) (I would suggest yes, since it's more Pythonic to have just one way to do a task) and, if so, implement a deprecation warning.
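For reference, a deprecation shim for one of the old loaders could be as small as the sketch below. It uses only the standard library warnings module and assumes the load_dataset function discussed above; the message wording is a placeholder:

import warnings


def load_hotspots():
    """Deprecated wrapper kept for backwards compatibility."""
    warnings.warn(
        "load_hotspots() is deprecated and will be removed in a future "
        "release; use load_dataset(name='hotspots') instead.",
        category=FutureWarning,
        stacklevel=2,
    )
    return load_dataset(name="hotspots")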

@weiji14
Member

weiji14 commented Sep 18, 2021

> in favor of pygmt.datasets.load_dataset(name=dataset_name) (I would suggest yes, since it's more Pythonic to have just one way to do a task) and, if so, implement a deprecation warning.

👍 for me. The seaborn.load_dataset code implementation is worth a look at https://github.com/mwaskom/seaborn/blob/703259f2dca711205f81d2d8f9e3fa45774eb0da/seaborn/utils.py#L546-L641; this is the general structure (I've simplified it a bit):

import pandas as pd


def load_dataset(name, **kwargs):
    path = f"https://raw.githubusercontent.com/mwaskom/seaborn-data/master/{name}.csv"
    df = pd.read_csv(path, **kwargs)

    if name == "flights":
        months = df["month"].str[:3]
        df["month"] = pd.Categorical(months, months.unique())

    if name == "titanic":
        df["class"] = pd.Categorical(df["class"], ["First", "Second", "Third"])
        df["deck"] = pd.Categorical(df["deck"], list("ABCDEFG"))

    return df

So we could probably get away with one big function with lots of if-statements, rather than have too many _load_data1, _load_data2 functions being called. OR, we could construct one big dictionary where the key is the name of the dataset, and the values are the arguments to pass to pd.read_csv:

dataset_dict = {
    "japan_quakes": dict(url="@tut_quakes.ngdc", header=1, sep=r"\s+"),
    "sample_bathymetry": dict(
        url="@tut_ship.xyz",
        sep="\t",
        header=None,
        names=["longitude", "latitude", "bathymetry"],
    ),
}
fname = which(fname=dataset_dict[name]["url"], download="c")
dataframe = pd.read_csv(fname, **dataset_dict[name])

There's also xarray.tutorial.open_dataset which loads NetCDF datasets, but maybe we should separate table datasets (read using pd.read_csv) and grid datasets (read using xr.load_dataarray). Let's just handle the table datasets in this issue for now.

seisman added the deprecation (Deprecating a feature) and feature request (New feature wanted) labels and removed the question (Further information is requested) label Sep 18, 2021
seisman added this to the 0.5.0 milestone Sep 18, 2021
@maxrjones
Member

> So we could probably get away with one big function with lots of if-statements, rather than have too many _load_data1, _load_data2 functions being called. OR, we could construct one big dictionary where the key is the name of the dataset, and the values are the arguments to pass to pd.read_csv:

I like the dictionary option, basically a switch...case for Python.

> There's also xarray.tutorial.open_dataset which loads NetCDF datasets, but maybe we should separate table datasets (read using pd.read_csv) and grid datasets (read using xr.load_dataarray). Let's just handle the table datasets in this issue for now.

I agree with just working on the table datasets for now.

willschlitzer mentioned this issue Sep 19, 2021
@seisman
Member

seisman commented Sep 23, 2021

> dataset_dict = {
>     "japan_quakes": dict(url="@tut_quakes.ngdc", header=1, sep=r"\s+"),
>     "sample_bathymetry": dict(
>         url="@tut_ship.xyz",
>         sep="\t",
>         header=None,
>         names=["longitude", "latitude", "bathymetry"],
>     ),
> }
> fname = which(fname=dataset_dict[name]["url"], download="c")
> dataframe = pd.read_csv(fname, **dataset_dict[name])

Although the dictionary structure can make the code more compact, I feel that the if-statement option is easier to read, especially for new contributors who want to add examples.

BTW, the above code doesn't work, because pd.read_csv doesn't have the url parameter. If we use the dictionary option, we have to change the code, which may make it more difficult to read.
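One possible fix, sketched below with the same two entries as above, is to keep the URL out of the keyword arguments that get passed to pd.read_csv, for example by nesting them under a separate key:

import pandas as pd

from pygmt import which

dataset_dict = {
    "japan_quakes": {
        "url": "@tut_quakes.ngdc",
        "kwargs": dict(header=1, sep=r"\s+"),
    },
    "sample_bathymetry": {
        "url": "@tut_ship.xyz",
        "kwargs": dict(
            sep="\t",
            header=None,
            names=["longitude", "latitude", "bathymetry"],
        ),
    },
}


def load_dataset(name):
    # Download the cache file (or reuse the local copy), then pass only
    # the nested "kwargs" entry on to pd.read_csv.
    fname = which(fname=dataset_dict[name]["url"], download="c")
    return pd.read_csv(fname, **dataset_dict[name]["kwargs"])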

@willschlitzer
Contributor Author

I think the if-statement version is more accessible to new contributors, but I do also think dataset_dict is much more scalable if the plan is to add the capability to import most of the GMT remote files. I think if the plan is to just wrap the remote files that are used in testing/examples, we can get by without needing to create a dictionary.

weiji14 modified the milestones: 0.5.0 → 0.6.0 Sep 30, 2021
@seisman
Member

seisman commented Oct 2, 2021

> There's also xarray.tutorial.open_dataset which loads NetCDF datasets, but maybe we should separate table datasets (read using pd.read_csv) and grid datasets (read using xr.load_dataarray). Let's just handle the table datasets in this issue for now.

Please note that we already have a pygmt.load_dataarray function, which reads a grid as an xr.DataArray. Thus, pygmt.datasets.load_dataset is a bad function name to me, because this function is only used to read sample datasets, and the name is a little misleading. Maybe pygmt.datasets.load_sample_dataset is a better option.
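To see the clash, the two names side by side in a hypothetical usage example (load_sample_dataset is the suggested, not-yet-implemented name, and the grid file name is illustrative):

import pygmt

# Existing function: reads any grid file into an xarray.DataArray.
grid = pygmt.load_dataarray("some_grid.nc")

# Proposed function: reads a remote sample table into a pandas.DataFrame.
df = pygmt.datasets.load_sample_dataset(name="japan_quakes")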

@seisman
Member

seisman commented Jan 10, 2022

Can we close the issue?

seisman unpinned this issue Jan 14, 2022