
Commit

Updated data team training with intern feedback, all edits described …
kracha committed Nov 30, 2023
1 parent db17cc1 commit 1db0b56
Showing 20 changed files with 194 additions and 89 deletions.
Binary file added images/XML_series_pbject.png
Binary file added images/physical_screenshot.png
40 changes: 24 additions & 16 deletions training/01_introduction.Rmd
@@ -22,24 +22,27 @@ Read Matt Jones et al.'s paper on <a href = 'https://esajournals.onlinelibrary.w

(Please note that while the tips outlined in this article are best practices, we often do not reformat data files submitted to our repositories unless necessary. It is best to be conservative and not alter other people's data without good reason.)

You may also want to explore the DataONE <a href='https://www.dataone.org/Education' target='_blank'>education resources</a> related to data management.
You may also want to explore the DataONE <a href='https://dataoneorg.github.io/Education/' target='_blank'>education resources</a> related to data management.

## Using DataONE

**Data Observation Network for Earth** (DataONE) is a community-driven initiative that provides access to data across multiple member repositories, supporting enhanced search and discovery of Earth and environmental data.

Read more about what DataONE is <a href = 'https://www.dataone.org/what-dataone' target='_blank'>here</a> and about DataONE member node (MN) guidelines <a href = 'https://www.dataone.org/sites/all/documents/DataONE_MN_Partner_Guidelines_20170501.pdf' target='_blank'>here</a>. Please feel free to ask Jeanette any questions you have about DataONE.
Read more about what DataONE is <a href = 'https://www.dataone.org/about/' target='_blank'>here</a> and about DataONE member node (MN) guidelines <a href = 'https://dataone-operations.readthedocs.io/en/latest/MN/deployment/baseline_implementation_requirements.html' target='_blank'>here</a>. Please feel free to ask Jeanette any questions you have about DataONE.

We will be applying these concepts in the next chapter.

## Working on a remote server

All of the work that we do at NCEAS is done on our remote server, datateam.nceas.ucsb.edu. If you have never worked on a remote server before, you can think of it like working on a different computer via the internet.

We access RStudio on our server through this <a href = 'https://datateam.nceas.ucsb.edu/rstudio/' target='_blank'>link.</a> This is the same as your desktop version of RStudio with one main difference is that files are on the server. Please do all your work here. This way you can share your code with the rest of us.
We access RStudio on our server through this <a href = 'https://datateam.nceas.ucsb.edu/rstudio/' target='_blank'>link</a>. This is the same as your desktop version of RStudio, with one main difference: the files live on the server. **Please do all your work here, and bookmark this link. Do not use RStudio on your local computer.** By only using the RStudio server, it is easier to share your code with the rest of us.

### Check your understanding {.exercise}
* Open a new tab in your browser and try logging into the [remote server](https://datateam.nceas.ucsb.edu/rstudio/) using your SSH credentials.

```{block, type = "note"}
If you R session is frozen and unresponsive check out [the guide](https://help.nceas.ucsb.edu/NCEAS/Computing/rstudio_server.html) on how to fix it.
If your R session is frozen and unresponsive check out [the guide](https://help.nceas.ucsb.edu/NCEAS/Computing/rstudio_server.html) on how to fix it.
```

## A note on paths
@@ -52,14 +55,17 @@ When you write scripts, try to avoid writing relative paths (which rely on what
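A minimal sketch of the difference; both paths here are hypothetical, not real locations on the server:

```{r, eval = FALSE}
# fragile: a relative path depends on your current working directory
df <- read.csv("../data/my_data.csv")

# robust on the server: a full path works for anyone who runs the script
df <- read.csv("/home/myusername/my_project/data/my_data.csv")  # hypothetical path
```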

## A note on R

This training assumes basic knowledge of R and RStudio. If you want a quick R refresher, walk through Jenny Bryan's excellent materials [here](http://stat545.com/block002_hello-r-workspace-wd-project.html).
This training assumes basic knowledge of R and RStudio. Spend at least 30 minutes walking through Jenny Bryan's excellent materials [here](http://stat545.com/block002_hello-r-workspace-wd-project.html) for a refresher.

Throughout this training we will occasionally use the namespace syntax `package_name::function_name()` when calling a function. This syntax denotes which package a function comes from. For example `dataone::getSystemMetadata` selects the `getSystemMetadata` function from the `dataone` R package. More detailed information on namespaces can be found [here](http://r-pkgs.had.co.nz/namespace.html).
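A quick sketch of the difference (assuming `mn` and `pid` were defined earlier in your session):

```{r, eval = FALSE}
library(dataone)

# explicit namespace: works even when the package is not attached
sysmeta <- dataone::getSystemMetadata(mn, pid)

# bare call: relies on library(dataone) having been run first
sysmeta <- getSystemMetadata(mn, pid)
```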

## A note on effective troubleshooting in R
We suggest using a combination of **m**inimal **r**eproducible **e**xamples (MRE) and the package `reprex` to create **rep**roducible **ex**amples. This will allow others to better help you if we can run the code on our own computers.

A MRE is stripping down your code to only the parts that cause the bug.
One of the advantages of using the R programming language is the extensive documentation that is available for R packages. The R help operator `?` can be used to learn more about functions from all of the R packages we use. You can put the operator before the name of any function to view its documentation in RStudio: `?function`
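For example, either of these opens the documentation in the RStudio help pane:

```{r, eval = FALSE}
?read.csv                     # help for a base R function
?dataone::getSystemMetadata   # help for a function from a specific package
```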

When asking for help in the `#datateam` channel in Slack, we suggest using a combination of **m**inimal **r**eproducible **e**xamples (MRE) and the package `reprex` to create **rep**roducible **ex**amples. This will allow others to better help you if we can run the code on our own computers.

An MRE strips your code down to only the parts that cause the bug. When troubleshooting errors over Slack, send the code that returned an error **and** the error message itself.
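As a minimal sketch of the idea before the step-by-step instructions below (the buggy code is invented for illustration):

```{r, eval = FALSE}
# install.packages("reprex")  # if not already installed
library(reprex)

# wrap the smallest chunk of code that still reproduces the problem;
# reprex() runs it in a clean session and renders code + output for pasting
reprex({
  x <- c(1, 2, "three")
  mean(x)  # returns NA with a warning: argument is not numeric or logical
})
```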

How to generate a reprex:

@@ -90,7 +96,7 @@ att_list <- set_attributes(attributes)
doc_ex <- list(packageId = "id", system = "system",
dataset = list(title = "A Mimimal Valid EML Dataset",
dataset = list(title = "A Minimal Valid EML Dataset",
creator = me,
contact = me,
dataTable = list(entityName = "data table", attributeList = att_list))
@@ -103,17 +109,19 @@ The rest of the training has a series of exercises. These are meant to take you
Please note that you will be completing everything on the <a href = 'https://test.arcticdata.io' target='_blank'>test site</a> for the training. In the future, if you are unsure about doing anything with a dataset, the test site is a good place to try things out!

## Exercise 1 {.exercise}
This part of the exercise walks you through submitting data through the web form on "<a href = 'https://test.arcticdata.io' target='_blank'>test.arcticdata.io</a>"
This part of the exercise walks you through submitting data through the web form on "<a href = 'https://test.arcticdata.io' target='_blank'>test.arcticdata.io</a>". In addition to learning to use the web form, this exercise will help you practice sleuthing for information in order to provide complete metadata. Most datasets do not come with all contextual information, so you will need to skim cited literature and search Google for definitions of discipline-specific jargon. Don't be afraid to use the internet as a resource!

### Part 1
* Download the [csv](data/Loranty_2016_Environ._Res._Lett._11_095008.csv) of Table 1 from <a href = 'http://iopscience.iop.org/article/10.1088/1748-9326/11/9/095008/meta' target='_blank'>this paper.</a>
* Reformat the table to meet the guidelines outlined in the journal article on effective data management (this might be easier to do in an interactive environment like Excel).
* Note - we usually don't edit the content in data submissions so don't stress over this part too much
* Reformat the table to meet the guidelines outlined in the journal article on effective data management (this might be easier to do in an interactive environment like Excel).
+ Hint: This table is in wide format and can be made [longer](https://arcticdata.io/submit/#file-content-guidelines) (see the sketch after this list).
* Note: we usually don't edit the content in data submissions, so don't stress over this part too much.
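Here is a hedged sketch of that wide-to-long reshaping; the column names are hypothetical, not taken from the Loranty et al. table:

```{r, eval = FALSE}
library(tidyr)

# hypothetical wide table: one measurement column per site
wide <- data.frame(year   = c(2014, 2015),
                   site_A = c(1.2, 1.4),
                   site_B = c(0.9, 1.1))

# long format: one row per year-site observation
long <- pivot_longer(wide, cols = starts_with("site_"),
                     names_to = "site", values_to = "value")
```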

### Part 2
* Go to "<a href = 'https://test.arcticdata.io/#data' target='_blank'>test.arcticdata.io</a>" and submit your reformatted file with appropriate metadata that you derive from the text of the paper:
+ list yourself as the first 'Creator' so your test submission can easily be found,
+ for the purposes of this training exercise, not every single author needs to be listed with full contact details, listing the first two authors is fine,
+ directly copying and pasting sections from the paper (abstract, methods, etc.) is also fine,
+ attributes (column names) should be defined, including correct units and missing value codes.
+ submit the dataset
+ List yourself as the first 'Creator' so your test submission can easily be found.
+ For the purposes of this training exercise, not every single author needs to be listed with full contact details; listing the first two authors is fine.
+ Directly copying and pasting sections from the paper (abstract, methods, etc.) is also fine.
+ Attributes (column names) should be defined, including correct units and missing value codes.
* Click "describe" to the right of the file name in order to add file-specific information. The title and description can be edited in the "Overview" tab, while attributes are defined in the "Attributes" tab.
+ Submit the dataset and post a message to the #datateam channel with a link to your package.
17 changes: 11 additions & 6 deletions training/02_creating_a_data_package.Rmd
@@ -6,17 +6,17 @@ This chapter will teach you how to create and submit a data package to a DataONE

A data package generally consists of at least 3 components.

1. Metadata: One object is the metadata file itself. In case you are unfamiliar with metadata, metadata are information that describe data (e.g. who made the data, how were the data made, etc.). The metadata file will be in an XML format, and have the extension `.xml` (extensible markup language). We often refer to this file as the EML, which is the metadata standard that it uses. This is also what you see when you click on a page in the Arctic Data Center.
1. Metadata: One object is the metadata file itself. In case you are unfamiliar with metadata, metadata are information that describe data (e.g. who made the data, how were the data made, etc.). The metadata file will be in an XML format, and have the extension `.xml` (extensible markup language). We often refer to this file as the EML (Ecological Metadata Language), which is the metadata standard that it uses. Each dataset page in the Arctic Data Center is a direct representation of an EML document, made to look prettier for the web.

2. Data: Other objects in a package are the data files themselves. Most commonly these are data tables (`.csv`), but they can also be audio files, NetCDF files, plain text files, PDF documents, image files, etc.

3. Resource Map: The final object is the resource map. This object is a plain text file with the extension `.rdf` (<a href = 'https://www.w3.org/RDF/' target='_blank'>Resource Description Framework</a>) that defines the relationships between all of the other objects in the data package. It says things like "this metadata file describes this data file," and is critical to making a data package render correctly on the website with the metadata file and all of the data files together in the correct place. Fortunately, we rarely, if ever, have to actually look at the contents of resource maps; they are generated for us using tools in R.
3. Resource Map: The final object is the resource map. This object is a plain text file with the extension `.rdf` (<a href = 'https://www.w3.org/RDF/' target='_blank'>Resource Description Framework</a>) that defines the relationships between all of the other objects in the data package. You can think of it like a "basket" that holds the metadata file and all data files together. It says things like "this metadata file describes this data file," and is critical to making a data package render correctly on the website. Fortunately, we rarely, if ever, have to actually look at the contents of resource maps; they are generated for us using tools in R.

![From the DataONE Community Meeting (Session 7)](images/data-submission-workflow2.png)

## Packages on the Website

All of the package information is represented when you go to the landing page for a dataset. When you make changes through R those published changes will be reflected here. Although you can edit the metadata directly from the webpage but we recommend to use R in most cases.
All of the package information is represented when you go to the landing page for a dataset. In the previous section, you uploaded a data file and made edits to the metadata using the web editor. When you make changes to the metadata and data files through R, those published changes will also be reflected here.

![](images/arctic_data_center_web.png)

@@ -34,14 +34,16 @@ Different versions of a package are linked together by what we call the "version

## Upload a package

We will be using R to connect to the <a href = 'https://arcticdata.io/catalog/#data' target='_blank'>NSF Arctic Data Center (ADC)</a> data repository to push and pull edits in actual datasets. To identify yourself as an admin you will need to pass a 'token' into R. Do this by signing in to the ADC with your ORCid and password, then hovering over your name in the top right corner and clicking on "My profile", then navigating to "Settings" and "Authentication Token", copying the "Token for DataONE R", and finally pasting and running it in your *R console*.
We will be using R to connect to the <a href = 'https://arcticdata.io/catalog/#data' target='_blank'>NSF Arctic Data Center (ADC)</a> data repository to push and pull edits in actual datasets. To identify yourself as an admin you will need to pass a 'token' into R. Do this by signing in to the ADC with your ORCID and password, hovering over your name in the top right corner and clicking "My profile", navigating to "Settings" and then "Authentication Token", copying the "Token for DataONE R", and finally pasting and running it in your *R console*. The console is the bottom left window in RStudio.

```{block, type = "warning"}
**This token is your identity on these sites, please treat it as you would a password** (i.e. don't paste into scripts that will be shared). The easiest way to do this is to always run the token in the *console*. There's no need to keep it in your script since it's temporary anyway.
```

You will need to retrieve a new one after it either expires or you quit your R session.

Setting the token does not produce any output in the console. If the token is not set or is set incorrectly, you will know when an error is produced after trying to load a private dataset.

Sometimes you'll see a placeholder in scripts to remind users to get their token, such as:

```{r token, message=FALSE, eval=FALSE}
@@ -87,7 +89,9 @@ library(EML)
library(arcticdatautils)
```

For this training, we will be working exclusively on the Arctic test site, or "node." In many of the functions you will use this will be the first argument. It is often referred to in documentation as `mn`, short for member node. More information on the other nodes can be found in the reference section under Set DataONE nodes [Set DataONE nodes](https://nceas.github.io/datateam-training/reference/set-dataone-nodes.html)
For this training, we will be working exclusively on the Arctic test site, or "node." In many of the functions you will use, this will be the first argument. It is often referred to in documentation as `mn`, short for member node.

Different repositories use different member nodes. More information on the other nodes can be found in the reference section under [Set DataONE nodes](https://nceas.github.io/datateam-training/reference/set-dataone-nodes.html).

For example, if we are using the test site, set the node to the test Arctic node:
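The chunk that follows is collapsed in this diff; as a sketch of what that setup typically looks like with the `dataone` package (treat the exact node identifier as an assumption):

```{r, eval = FALSE}
library(dataone)

# connect to the STAGING environment and the test Arctic member node
d1c_test <- D1Client("STAGING", "urn:node:mnTestARCTIC")

# the member node object that many functions take as their first argument
mn <- d1c_test@mn
```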

Expand Down Expand Up @@ -147,5 +151,6 @@ myAccessRules <- data.frame(...)
packageId <- uploadDataPackage(...)
```

* View your new data set by appending the metadata PID to the end of the URL test.arcticdata.io/#view/...
* View your new data set by appending the metadata PID to the end of the URL test.arcticdata.io/view/...
* If you are successful, it should look the same as the dataset you created in Exercise 1.
* Send a message to #datateam with the exercise number and a link to your new package.
9 changes: 9 additions & 0 deletions training/03_exploring_eml.Rmd
@@ -14,5 +14,14 @@ If you aren't too familiar with lists and how to navigate them yet take a look a
```{r, child = '../workflows/explore_eml/understand_eml_schema.Rmd'}
```

### Check your understanding {.exercise}
* Find `otherEntity` within the EML schema. Which elements are required? Can `otherEntity` be a series object?

<details>
<summary>Answer</summary>
<br>
`otherEntity` requires `entityType` and `entityName` children, or alternatively will accept only `references`. It is a series object, so there can be multiple `otherEntity` elements. Along with `otherEntity` and `creator`, `dataTable` and `attribute` can also be series objects.
</details>
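As a sketch of how a series object appears once an EML document is read into R (assuming `doc` came from `EML::read_eml()`):

```{r, eval = FALSE}
# a series object arrives as a list of entities, one per otherEntity
# (a package with a single entity may omit the wrapping list)
length(doc$dataset$otherEntity)

# the required children of the first entity
doc$dataset$otherEntity[[1]]$entityName
doc$dataset$otherEntity[[1]]$entityType
```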

```{r, child = '../workflows/explore_eml/access_specific_elements.Rmd'}
```
17 changes: 13 additions & 4 deletions training/04_editing_eml.Rmd
@@ -7,6 +7,9 @@ Most of the functions you will see in this chapter will use the `arcticdatautils
```{block, type = "note"}
This chapter will be the longest of all the sections! This is a reminder to take frequent breaks when completing this section.
```
```{block, type = "note"}
When using R to edit EML documents, run each line individually by highlighting it and pressing CTRL+ENTER. Many EML functions only need to be run once, and will either produce errors or make the EML invalid if run multiple times.
```

```{r, child = '../workflows/edit_eml/edit_an_eml_element.Rmd'}
```
@@ -64,7 +67,7 @@ resource_map_pid <- ...
dp <- getDataPackage(d1c_test, identifier=resource_map_pid, lazyLoad=TRUE, quiet=FALSE)
# get metadata pid
mo <- selectMember(...)
metadataId <- selectMember(...)
# read in EML
doc <- read_eml(getObject(...))
@@ -95,10 +98,16 @@ You should see something like this if everything passes:
>attr(,"errors")
>character(0)
```{block, type = "note"}
When troubleshooting EML errors, it is helpful to run `eml_validate()` after every edit to the EML document in order to pinpoint the problematic code.
```
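For instance, a minimal check after a hypothetical edit (assuming `doc` is your EML list):

```{r, eval = FALSE}
# hypothetical edit: replace the dataset abstract
doc$dataset$abstract <- "Soil temperature measurements from five sites."

# returns TRUE if valid, otherwise FALSE with an attribute listing schema errors
eml_validate(doc)
```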


Then save your EML to a path of your choice or a temp file. You will later pass this path as an argument to update the package.

```{r, eval = F}
eml_path <- "path/to/save/eml.xml"
# Create a standardized EML name from the dataset title
eml_path <- arcticdatautils::title_to_file_name(doc$dataset$title)
write_eml(doc, eml_path)
```

@@ -111,7 +120,7 @@ After adding more metadata, we want to publish the dataset onto `test.arcticdata

* Validate your metadata using `eml_validate`.
* Use the [checklist](#final-checklist) to review your submission.
* Make edits where necessary
* Make edits where necessary (e.g. physicals)

Once `eml_validate` returns `TRUE`, go ahead and run `write_eml`, `replaceMember`, and `uploadDataPackage`. There might be a small lag before your changes appear on the website. This part of the workflow will look roughly like this:

@@ -121,7 +130,7 @@ eml_validate(...)
write_eml(...)
# replace the old metadata file with the new one in the local package
dp <- replaceMember(dp, ...)
dp <- replaceMember(dp, metadataId, replacement = eml_path)
# upload the data package
packageId <- uploadDataPackage(...)
2 changes: 1 addition & 1 deletion training/06_editing_sysmeta.Rmd
@@ -14,4 +14,4 @@ Sometimes the system doesn't recognize the file types properly. For example you

* Read the system metadata in from the data file you uploaded [previously](#exercise-4).
* Check to make sure the `fileName` and `formatId` are set correctly (the extension in `fileName` should match the `formatId`).
* Update the system metadata if necessary.
* Update the system metadata if necessary. CSVs have the formatId "text/csv".
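A sketch of that workflow with the `dataone` package; the PID and file name below are placeholders, not values from this exercise:

```{r, eval = FALSE}
library(dataone)

pid <- "urn:uuid:..."  # placeholder: the PID of your data file

# read the current system metadata from the member node
sysmeta <- getSystemMetadata(d1c_test@mn, pid)

# correct the file name and format if they disagree
sysmeta@fileName <- "my_data.csv"   # hypothetical file name
sysmeta@formatId <- "text/csv"

# push the corrected system metadata back to the node
updateSystemMetadata(d1c_test@mn, pid, sysmeta)
```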
