Use a single function for loading any sample dataset #1685

maxrjones · 2021-12-23T23:30:12Z

Description of proposed changes

This PR partially implements the suggestions in #1436 by deprecating individual functions in favor of a single function for loading sample datasets.

It also adds a list_sample_dataframes function that lists the available dataset names to provide to load_sample_dataframe.

A couple questions:

Should we have one function that returns either an xarray.DataArray or pandas.DataFrame depending on the specific file requested or one function for tabular data samples and another for raster data samples?
Currently, the user would pass the exact name of the file on the GMT server. Should we continue with this implementation or provide different names that are more similar to the existing functions (e.g., name="japan_quakes" vs name="tut_quakes.ngdc")?
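
For illustration, the second question could be addressed with a small lookup table that maps friendlier names to the file names on the GMT server. The sketch below is an assumption about how that might look, not the PR's implementation: SAMPLE_FILES and resolve_sample_name are hypothetical names, and ValueError stands in for GMTInvalidInput so the snippet runs without PyGMT installed.

```python
# Hypothetical mapping from friendly names to file names on the GMT server.
# Only the "japan_quakes" / "tut_quakes.ngdc" pair comes from the question above.
SAMPLE_FILES = {
    "japan_quakes": "tut_quakes.ngdc",
}


def resolve_sample_name(name):
    """Return the GMT remote-file reference (leading '@') for a friendly name."""
    try:
        return "@" + SAMPLE_FILES[name]
    except KeyError:
        valid = ", ".join(sorted(SAMPLE_FILES))
        # ValueError is a stand-in for GMTInvalidInput in this sketch.
        raise ValueError(
            f"Invalid dataset name '{name}'. Choose from: {valid}"
        ) from None


print(resolve_sample_name("japan_quakes"))
```

With this layout, the user-facing names stay stable even if a file is renamed on the server, at the cost of maintaining the table.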

To do:

Deprecate other functions for loading a sample pandas.DataFrame
Deprecate functions for loading a sample xarray.DataArray
Update syntax in gallery examples and tutorials.

willschlitzer · 2021-12-24T06:47:32Z

I like the idea of having a single function to call datasets, but I think it may make load_sample_dataframe a big and cumbersome function. Do we want some non-user facing functions to handle the dataframe reading?

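import pandas as pd
from pygmt import which
from pygmt.exceptions import GMTInvalidInput


def load_sample_dataframe(name):
    # Validate the requested name against the list of available datasets
    names = list_sample_dataframes()
    if name not in names:
        raise GMTInvalidInput(f"Invalid dataset name '{name}'")

    # Download the file to the GMT cache directory (if needed) and get its path
    fname = which("@" + name, download="c")

    # Dispatch to a non-user-facing reader for the requested dataset
    if name == "tut_quakes.ngdc":
        return load_japan_quakes(fname=fname)


def load_japan_quakes(fname):
    # Read the whitespace-separated table; row 1 is consumed as the header
    # and then replaced by explicit column names
    data = pd.read_csv(fname, header=1, sep=r"\s+")
    data.columns = [
        "year",
        "month",
        "day",
        "latitude",
        "longitude",
        "depth_km",
        "magnitude",
    ]
    return data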

maxrjones · 2021-12-27T19:14:00Z

I agree that separating the parsing code would be more readable. I went with the current implementation anyway because I thought we should issue a deprecation notice if load_japan_quakes is called by the user, but not if it is called by load_sample_dataframe, and I could not find a satisfactory way to distinguish between those two cases.

willschlitzer · 2021-12-28T07:30:09Z

My solution would be to just copy and paste the load_japan_quakes function to something like load_japan_quakes_sample_data (the name isn't great; it's just a placeholder suggestion). The load_japan_quakes function gets a deprecation notice, and we'll be alerted before a release when it is time to take it out. The load_japan_quakes_sample_data function is called by load_sample_data and can eventually be renamed to load_japan_quakes without concerns about deprecation, as it isn't user-facing.

I know it's a little cumbersome, but I think it's better to have some redundant code to set the precedent that load_sample_data will be calling functions.
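
A minimal sketch of this copy-and-deprecate layout, under stated assumptions: the placeholder row is made up so the snippet runs without downloading anything, FutureWarning is an assumed warning category, and ValueError stands in for GMTInvalidInput. It is meant to show the shape of the proposal, not the code that was merged.

```python
import warnings

# Made-up stand-in row so the sketch runs without downloading the real file.
_PLACEHOLDER_ROWS = [{"year": 2000, "magnitude": 5.0}]


def load_japan_quakes_sample_data():
    """Non-user-facing copy (placeholder name from the comment above)."""
    return [dict(row) for row in _PLACEHOLDER_ROWS]


def load_japan_quakes():
    """Deprecated user-facing function: warns, then does the same work."""
    warnings.warn(
        "load_japan_quakes is deprecated; use load_sample_data instead.",
        category=FutureWarning,
        stacklevel=2,
    )
    # Deliberately duplicated body, as proposed above.
    return [dict(row) for row in _PLACEHOLDER_ROWS]


def load_sample_data(name):
    """Single entry point; it only calls the non-deprecated copy."""
    if name == "tut_quakes.ngdc":
        return load_japan_quakes_sample_data()
    raise ValueError(f"Invalid dataset name '{name}'")
```

Calling load_sample_data("tut_quakes.ngdc") emits no warning, while a direct call to load_japan_quakes() does, which is the distinction the thread is after.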

maxrjones · 2021-12-28T17:52:17Z

Sounds good, I updated the code based on your suggestions.

willschlitzer · 2021-12-29T17:06:42Z

I think your implementation works better than what I envisioned; I like that the original functions still get a returned value from load_sample_dataframe. I'm assuming you want the other functions to be updated in later PRs (before the next release)?

maxrjones · 2021-12-29T17:16:36Z

I updated these two functions first to get agreement on the structure before putting a lot of work in. I would lean slightly towards updating the other functions in other PRs to keep the PR size down and allow new additions in the meantime (e.g., MaunaLoa_CO2 for #1512), but I could update them in this PR (likely after Jan 3) if that's what others prefer.

weiji14

> I went with the current implementation anyways because I thought we should issue a deprecation notice if load_japan_quakes is called by the user, but not if it is called by load_sample_dataframe and could not find any satisfactory solutions for distinguishing between those two cases.

I feel that there should be a way to silence the warning somehow so that we don't need to make new temporary functions. Will look into this in the coming days (I'm currently reviewing this on the plane, haha).
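
One possible way to silence the warning without adding temporary functions is the standard library's warnings.catch_warnings context manager. The sketch below is only an illustration of that idea and is not what this PR ended up using; the function bodies are placeholders, FutureWarning is an assumed category, and ValueError stands in for GMTInvalidInput.

```python
import warnings


def load_japan_quakes():
    """Deprecated user-facing function (placeholder body)."""
    warnings.warn(
        "load_japan_quakes is deprecated; use load_sample_dataframe instead.",
        category=FutureWarning,
        stacklevel=2,
    )
    return "japan_quakes table"  # placeholder for the real DataFrame


def load_sample_dataframe(name):
    """Call the deprecated function internally with its warning silenced."""
    if name != "tut_quakes.ngdc":
        raise ValueError(f"Invalid dataset name '{name}'")
    with warnings.catch_warnings():
        # Ignore only FutureWarning, and only inside this block.
        warnings.simplefilter("ignore", category=FutureWarning)
        return load_japan_quakes()
```

A caveat with this approach is that catch_warnings modifies global warning state, so the Python documentation describes it as not thread-safe.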

maxrjones · 2022-01-03T16:55:37Z

OK, I pushed a revised module that avoids using new functions at the expense of being a bit trickier. Rather than removing the original function (e.g., load_japan_quakes) at the end of the deprecation cycle, we could make the function internal (leading underscore, remove from imports and API docs) and remove the kwargs/suppress warning check.
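
A minimal sketch of the keyword-argument approach described above. The keyword name suppress_warning, the FutureWarning category, and the placeholder return value are assumptions made for illustration, and ValueError stands in for GMTInvalidInput; the merged module may differ in its details.

```python
import warnings


def load_japan_quakes(**kwargs):
    """Warn unless the caller passes the (assumed) suppress_warning keyword."""
    if "suppress_warning" not in kwargs:
        warnings.warn(
            "load_japan_quakes is deprecated; use load_sample_data instead.",
            category=FutureWarning,
            stacklevel=2,
        )
    return "japan_quakes table"  # placeholder for the real DataFrame


def load_sample_data(name):
    """Single entry point; passes the keyword so no warning is emitted."""
    if name != "japan_quakes":
        raise ValueError(f"Invalid dataset name '{name}'")
    return load_japan_quakes(suppress_warning=True)
```

At the end of the deprecation cycle, the function could be renamed with a leading underscore and the keyword check dropped, which matches the plan outlined in the comment.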

seisman · 2022-01-04T04:41:26Z

> Should we have one function that returns either an xarray.DataArray or pandas.DataFrame depending on the specific file requested or one function for tabular data samples and another for raster data samples?

Personally, I prefer to have a single function load_sample_data to load everything.

maxrjones · 2022-01-04T18:33:24Z

Sounds good to me, I updated the name accordingly. I actually realized we do not have any functions yet for loading sample DataArray objects from the GMT cache, but I assume that we will in the future.

seisman · 2022-01-06T03:55:21Z

pygmt/datasets/samples.py

+ return names
+
+
+def load_sample_data(name):


I'm wondering whether we can merge the list_sample_data() function into load_sample_data(), so that we don't have to maintain two dictionaries.

For example, calling load_sample_data() without giving a name can return the name-description dict.

maxrjones · 2022-01-06

I'd prefer to keep them separate, even though that requires two dictionaries, because I think it's simpler overall for each function to have one purpose.
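
The two-dictionary layout being discussed might look roughly like the sketch below. Everything other than the names list_sample_data, load_sample_data, and japan_quakes is an assumption: the descriptions are illustrative, example_table is a hypothetical second entry, the private loaders return placeholders, and ValueError stands in for GMTInvalidInput.

```python
def _load_example_table():
    return "example table"  # hypothetical second dataset


def _load_japan_quakes():
    return "japan_quakes table"  # placeholder for the real DataFrame


def list_sample_data():
    """Return a dict mapping dataset names to short descriptions."""
    return {
        "example_table": "Hypothetical second dataset",
        "japan_quakes": "Table of earthquakes around Japan",
    }


def load_sample_data(name):
    """Load the dataset registered under the given name."""
    if name not in list_sample_data():
        raise ValueError(f"Invalid dataset name '{name}'.")
    # Second dictionary: dataset names mapped to loader functions.
    load_func = {
        "example_table": _load_example_table,
        "japan_quakes": _load_japan_quakes,
    }
    return load_func[name]()
```

The trade-off is the one raised in the review: the two dictionaries must be kept in sync by hand, but each function keeps a single purpose.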

Use a single function for loading any sample dataset

479c336

Use separate functions for reading tables

28e0597

Merge branch 'main' into load-sample-datasets

c34cfa8

seisman added this to the 0.6.0 milestone Dec 29, 2021

maxrjones changed the title from "RFC: Use a single function for loading any sample dataset" to "Use a single function for loading any sample dataset" Dec 29, 2021

weiji14 reviewed Dec 29, 2021

doc/api/index.rst (outdated)

pygmt/datasets/samples.py

pygmt/datasets/samples.py (outdated)

pygmt/datasets/samples.py (outdated)

Apply suggestions from code review

2f6e3ad

Co-authored-by: Wei Ji <[email protected]>

maxrjones added 2 commits January 3, 2022 11:53

Suppress warning with kwargs rather than new functions

4df5b37

Merge branch 'main' into load-sample-datasets

0a7fcdd

seisman added the feature (Brand new feature) and deprecation (Deprecating a feature) labels Jan 4, 2022

maxrjones added 2 commits January 4, 2022 11:44

Use only one function for all sample datasets

7dc6a31

Update warnings

9f2a32f

Deprecate other sample dataset functions

6e38361

seisman reviewed Jan 5, 2022

pygmt/datasets/samples.py (outdated)

pygmt/datasets/samples.py (outdated)

pygmt/datasets/samples.py (outdated)

pygmt/datasets/samples.py (outdated)

Apply suggestions from code review

a909a29

Co-authored-by: Dongdong Tian <[email protected]>

maxrjones added 2 commits January 5, 2022 14:09

Use dict mapping in load_sample_data

e83feb5

Merge branch 'main' into load-sample-datasets

1209764

seisman reviewed Jan 6, 2022

pygmt/datasets/samples.py (outdated)

pygmt/datasets/samples.py (outdated)

seisman reviewed Jan 6, 2022

Alphabetize keys

31807e7

seisman approved these changes Jan 7, 2022

seisman added the final review call (This PR requires final review and approval from a second reviewer) label Jan 7, 2022

maxrjones mentioned this pull request Jan 7, 2022

Add inline example for grdcut #1689

Merged

6 tasks

seisman reviewed Jan 9, 2022

pygmt/datasets/samples.py (outdated)

Apply suggestions from code review

12dfc9a

Co-authored-by: Dongdong Tian <[email protected]>

Merge branch 'main' into load-sample-datasets

c11ddeb

willschlitzer approved these changes Jan 10, 2022

seisman removed the final review call (This PR requires final review and approval from a second reviewer) label Jan 10, 2022

seisman merged commit a5a0a20 into main Jan 10, 2022

seisman deleted the load-sample-datasets branch January 10, 2022 09:00

michaelgrund added a commit that referenced this pull request Feb 12, 2022

Update syntax for loading sample datasets in gallery examples

1424559

This PR is a follow-up of #1685 and updates the syntax for loading sample datasets in all corresponding gallery examples.

michaelgrund mentioned this pull request Feb 12, 2022

Update syntax for loading sample datasets in gallery examples #1749

Merged

9 tasks

michaelgrund added a commit that referenced this pull request Feb 12, 2022

Update syntax for loading sample datasets in tutorials

c55cd9a

This PR is a follow-up of #1685 and updates the syntax for loading sample datasets in all corresponding tutorials.

michaelgrund mentioned this pull request Feb 12, 2022

Update syntax for loading sample datasets in tutorials #1750

Merged

7 tasks

sixy6e pushed a commit to sixy6e/pygmt that referenced this pull request Dec 21, 2022

Use a single function for loading any sample dataset (GenericMappingT…

f8b8c73

…ools#1685) Co-authored-by: Wei Ji <[email protected]> Co-authored-by: Dongdong Tian <[email protected]>

seisman mentioned this pull request Jan 4, 2023

Remove the deprecated load_xx functions in v0.9.0 (deprecated since v0.6.0) #2302

Closed

7 tasks
