Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use a single function for loading any sample dataset #1685

Merged
merged 15 commits into from
Jan 10, 2022
Merged

Conversation

maxrjones
Copy link
Member

Description of proposed changes

This PR partially implements the suggestions in #1436, in deprecating individual functions in favor of a single function for loading sample datasets.

It also adds a list_sample_dataframes function that lists the available dataset names to provide to load_sample_dataframe.

A couple questions:

  1. Should we have one function that returns either an xarray.DataArray or pandas.DataFrame depending on the specific file requested or one function for tabular data samples and another for raster data samples?
  2. Currently, the user would pass exact name of the file on the GMT server. Should we continue with this implementation or provide different names that are more similar to the existing functions (e.g., name="japan_quakes" vs name="tut_quakes.ngdc")?

To do:

  • Deprecate other functions for loading a sample pandas.DataFrame
  • Deprecate functions for loading a sample xarray.DataArray
  • Update syntax in gallery examples and tutorials.

Reminders

  • Run make format and make check to make sure the code follows the style guide.
  • Add tests for new features or tests that would have caught the bug that you're fixing.
  • Add new public functions/methods/classes to doc/api/index.rst.
  • Write detailed docstrings for all functions/methods.
  • If wrapping a new module, open a 'Wrap new GMT module' issue and submit reasonably-sized PRs.
  • If adding new functionality, add an example to docstrings or tutorials.

Slash Commands

You can write slash commands (/command) in the first line of a comment to perform
specific operations. Supported slash commands are:

  • /format: automatically format and lint the code
  • /test-gmt-dev: run full tests on the latest GMT development version

@willschlitzer
Copy link
Contributor

I like the idea of having a single function to call datasets, but I think it may make load_sample_dataframe a big and cumbersome function. Do we want some non-user facing functions to handle the dataframe reading?

def load_sample_dataframe(name)
    names = list_sample_dataframes()
    if name not in names:
        raise GMTInvalidInput(f"Invalid dataset name '{name}'")

    fname = which("@" + name, download="c")

    if name == "tut_quakes.ngdc":
        return load_japan_quakes(fname=fname)

def load_japan_quakes(fname):
    data = pd.read_csv(fname, header=1, sep=r"\s+")
    data.columns = [
        "year",
        "month",
        "day",
        "latitude",
        "longitude",
        "depth_km",
        "magnitude",
    ]
    return data

@maxrjones
Copy link
Member Author

I like the idea of having a single function to call datasets, but I think it may make load_sample_dataframe a big and cumbersome function. Do we want some non-user facing functions to handle the dataframe reading?

def load_sample_dataframe(name)
    names = list_sample_dataframes()
    if name not in names:
        raise GMTInvalidInput(f"Invalid dataset name '{name}'")

    fname = which("@" + name, download="c")

    if name == "tut_quakes.ngdc":
        return load_japan_quakes(fname=fname)

def load_japan_quakes(fname):
    data = pd.read_csv(fname, header=1, sep=r"\s+")
    data.columns = [
        "year",
        "month",
        "day",
        "latitude",
        "longitude",
        "depth_km",
        "magnitude",
    ]
    return data

I agree that separating the parsing code would be more readable. I went with the current implementation anyways because I thought we should issue a deprecation notice if load_japan_quakes is called by the user, but not if it is called by load_sample_dataframe and could not find any satisfactory solutions for distinguishing between those two cases.

@willschlitzer
Copy link
Contributor

I agree that separating the parsing code would be more readable. I went with the current implementation anyways because I thought we should issue a deprecation notice if load_japan_quakes is called by the user, but not if it is called by load_sample_dataframe and could not find any satisfactory solutions for distinguishing between those two cases.

My solution would be to just copy and paste the load_japan_quakes function to something like load_japan_quakes_sample_data (name isn't great; just a placeholder suggestion). The load_japan_quakes function gets a deprecation notice, and we'll be alerted before a release when it is time to be take it out. The load_japan_quakes_sample_data function is called by load_sample_data and can be eventually renamed to load_japan_quakes without concerns for deprecation, as it isn't user-facing.

I know it's a little cumbersome, but I think it's better to have some redundant code to set the precedent that load_sample_data will be calling functions.

@maxrjones
Copy link
Member Author

I know it's a little cumbersome, but I think it's better to have some redundant code to set the precedent that load_sample_data will be calling functions.

Sounds good, I updated the code based on your suggestions.

@seisman seisman added this to the 0.6.0 milestone Dec 29, 2021
@willschlitzer
Copy link
Contributor

I know it's a little cumbersome, but I think it's better to have some redundant code to set the precedent that load_sample_data will be calling functions.

Sounds good, I updated the code based on your suggestions.

I think your implementation works better than what I envisioned; I like that the original functions still get a returned value from load_sample_dataframe. I'm assuming you want the other functions to be updated on later PRs (before the next release)?

@maxrjones
Copy link
Member Author

I know it's a little cumbersome, but I think it's better to have some redundant code to set the precedent that load_sample_data will be calling functions.

Sounds good, I updated the code based on your suggestions.

I think your implementation works better than what I envisioned; I like that the original functions still get a returned value from load_sample_dataframe. I'm assuming you want the other functions to be updated on later PRs (before the next release)?

I updated these two functions first to get agreement on the structure before putting a lot of work in. I would lean slightly towards updating the other functions in other PRs to keep the PR size down and allow new additions in the meantime (e.g., MaunaLao_CO2 for #1512), but could update them in this PR (likely after Jan 3) if that's what others prefer.

@maxrjones maxrjones changed the title RFC: Use a single function for loading any sample dataset Use a single function for loading any sample dataset Dec 29, 2021
Copy link
Member

@weiji14 weiji14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with the current implementation anyways because I thought we should issue a deprecation notice if load_japan_quakes is called by the user, but not if it is called by load_sample_dataframe and could not find any satisfactory solutions for distinguishing between those two cases.

I feel that there should be a way to silence the warning somehow so that we don't need to make new temporary functions. Will look into this in the coming days (I'm currently reviewing this on the plane, haha).

doc/api/index.rst Outdated Show resolved Hide resolved
pygmt/datasets/samples.py Show resolved Hide resolved
pygmt/datasets/samples.py Outdated Show resolved Hide resolved
pygmt/datasets/samples.py Outdated Show resolved Hide resolved
@maxrjones
Copy link
Member Author

I went with the current implementation anyways because I thought we should issue a deprecation notice if load_japan_quakes is called by the user, but not if it is called by load_sample_dataframe and could not find any satisfactory solutions for distinguishing between those two cases.

I feel that there should be a way to silence the warning somehow so that we don't need to make new temporary functions. Will look into this in the coming days (I'm currently reviewing this on the plane, haha).

OK, I pushed a revised module that avoids using new functions at the expense of being a bit trickier. Rather than removing the original function (e.g., load_japan_quakes) at the end of the deprecation cycle, we could make the function internal (leading underscore, remove from imports and API docs) and remove the kwargs/suppress warning check.

@seisman seisman added feature Brand new feature deprecation Deprecating a feature labels Jan 4, 2022
@seisman
Copy link
Member

seisman commented Jan 4, 2022

  1. Should we have one function that returns either an xarray.DataArray or pandas.DataFrame depending on the specific file requested or one function for tabular data samples and another for raster data samples?

Personally, I prefer to have a single function load_sample_data to load everything.

@maxrjones
Copy link
Member Author

  1. Should we have one function that returns either an xarray.DataArray or pandas.DataFrame depending on the specific file requested or one function for tabular data samples and another for raster data samples?

Personally, I prefer to have a single function load_sample_data to load everything.

Sounds good to me, I updated the name accordingly. I actually realized we do not have any functions yet for loading sample DataArray objects from the GMT cache, but I assume that we will in the future.

pygmt/datasets/samples.py Outdated Show resolved Hide resolved
pygmt/datasets/samples.py Outdated Show resolved Hide resolved
pygmt/datasets/samples.py Outdated Show resolved Hide resolved
pygmt/datasets/samples.py Outdated Show resolved Hide resolved
pygmt/datasets/samples.py Outdated Show resolved Hide resolved
pygmt/datasets/samples.py Outdated Show resolved Hide resolved
return names


def load_sample_data(name):
Copy link
Member

@seisman seisman Jan 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking if we can merge the list_sample_data() function into load_sample_data(), so that we don't have to maintain two dictionaries.

For example, calling load_sample_data() without giving a name can return the name-description dict.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to keep them separate even though it requires two dictionaries because I think overall it's simpler to have each function have one purpose.

@seisman seisman added the final review call This PR requires final review and approval from a second reviewer label Jan 7, 2022
@maxrjones maxrjones mentioned this pull request Jan 7, 2022
6 tasks
pygmt/datasets/samples.py Outdated Show resolved Hide resolved
@seisman seisman removed the final review call This PR requires final review and approval from a second reviewer label Jan 10, 2022
@seisman seisman merged commit a5a0a20 into main Jan 10, 2022
@seisman seisman deleted the load-sample-datasets branch January 10, 2022 09:00
michaelgrund added a commit that referenced this pull request Feb 12, 2022
This PR is a follow-up of #1685 and updates the syntax for loading sample datasets in all corresponding gallery examples.
michaelgrund added a commit that referenced this pull request Feb 12, 2022
This PR is a follow-up of #1685 and updates the syntax for loading sample datasets in all corresponding tutorials.
sixy6e pushed a commit to sixy6e/pygmt that referenced this pull request Dec 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
deprecation Deprecating a feature feature Brand new feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants