Variable documentation for Xena Browser's PANCAN_clinicalMatrix #14

Closed
Inquisitive-Geek opened this issue Aug 8, 2016 · 8 comments

@Inquisitive-Geek

Hi @dhimmel ,

The documentation links provided for the 3 datasets did not explain the variables involved clearly. It would be great if you could share some links around that.

Thanks,
Roshan

@dhimmel
Member

dhimmel commented Aug 8, 2016

@Inquisitive-Geek, I believe you're referring to these three downloads and corresponding links:

The links are to the Xena Browser info pages, since Xena is the team that makes the data. These pages don't provide much documentation of each column. I know Xena may have some additional documentation on various help pages.

@gwaygenomics or @jingchunzhu do you know of any documentation of what each variable means in PANCAN_clinicalMatrix?

@Inquisitive-Geek if you have questions about specific columns, then we can provide our best guess. For more authoritative documentation, I recommend messaging the UCSC Xena Browser Google Group. They've been really helpful so far and can likely give the best answers for these questions.

@dhimmel changed the title from "Variable descriptions unclear" to "Variable documentation for Xena Browser datasets" on Aug 8, 2016
@Inquisitive-Geek
Author

Please correct me wherever I am wrong, as my knowledge of genomics is nil.

Thanks. Let's start with the clinical matrix dataset. Here's what I understand about the variables whose names start with _GENOMIC_ID_TCGA_PANCAN, e.g. _GENOMIC_ID_TCGA_PANCAN_HumanMethylation27:

They seem to be some sort of flag variable denoting the gene (eg. HumanMethylation27) present in the sample. If the value is not NaN (it looks like it is the patient ID when it isn't), then the gene is present in the sample.

Also, I did not understand what _RFS, _RFS_UNIT & _RFS_IND mean. It seems like _TIME_TO_EVENT means the time it took for the cell to mutate.

@jingchunzhu

jingchunzhu commented Aug 9, 2016

Forwarded to the Google Group.

Update by @dhimmel: see the Google Group post here https://groups.google.com/forum/#!topic/ucsc-cancer-genomics-browser/Hmj3JTzOz0Q.

@jingchunzhu

jingchunzhu commented Aug 9, 2016

Identifiers:

_EVENT: event (in this case, the overall survival event)
_INTEGRATION: id used for integrating data on the xena browser and across cohort
_OS : overall survival time
_OS_IND : overall survival event
_OS_UNIT: overall survival time unit
_PANCAN_CNA_PANCAN_K8: 2012 pancan paper publication data
_PANCAN_Cluster_Cluster_PANCAN: 2012 pancan paper publication data
_PANCAN_DNAMethyl_PANCAN: 2012 pancan paper publication data
_PANCAN_RPPA_PANCAN_K8: 2012 pancan paper publication data
_PANCAN_UNC_RNAseq_PANCAN_K16: 2012 pancan paper publication data
_PANCAN_miRNA_PANCAN: 2012 pancan paper publication data
_PANCAN_mutation_PANCAN: 2012 pancan paper publication data
_PATIENT: TCGA patient id
_RFS: recurrence-free survival (Xena curated; note: I trust the overall survival data much more)
_RFS_IND: recurrence-free survival event
_RFS_UNIT: RFS time unit
_TIME_TO_EVENT: time to event (in this case, it is exactly like overall survival event)
_TIME_TO_EVENT_UNIT: time unit
_cohort: cohort name (also used as cohort id)
_primary_disease: primary_disease
_primary_site: primary organ of origin
age_at_initial_pathologic_diagnosis
gender
sampleID: sample id (same as _INTEGRATION)
sample_type: sample type
sample_type_id

Anything that starts with _GENOMIC_ID holds legacy mapping information for the original UUIDs from the TCGA DCC (which has been replaced by the GDC), so I don't think any of these mappings are going to be useful anymore, at least to the vast majority of people.

Also note that you can take a look at the dataset detail page at https://genome-cancer.soe.ucsc.edu/proj/site/xena/datapages/?dataset=TCGA.PANCAN.sampleMap/PANCAN_clinicalMatrix&host=https://tcga.xenahubs.net

Then click on the "all identifiers" link to see all the variables available:
https://genome-cancer.soe.ucsc.edu/proj/site/xena/datapages/?host=https%3A%2F%2Ftcga.xenahubs.net&dataset=TCGA.PANCAN.sampleMap%2FPANCAN_clinicalMatrix&label=Phenotypes&allIdentifiers=true

Jing
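
Editor's sketch, following the note above that the `_GENOMIC_ID` columns are legacy mappings: one way to drop them when reading the matrix. This is only an illustration, assuming the file is tab-separated with a header row. The function name `drop_legacy_columns` and the example row are made up, and the real file has many more columns.

```python
import csv
import io


def drop_legacy_columns(tsv_text):
    """Parse a tab-separated clinical matrix and drop `_GENOMIC_ID*` columns.

    Returns the kept column names and the rows restricted to those columns.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = [name for name in reader.fieldnames
            if not name.startswith("_GENOMIC_ID")]
    rows = [{name: row[name] for name in kept} for row in reader]
    return kept, rows


# Made-up excerpt for illustration only; values are not real TCGA data.
example = (
    "sampleID\t_PATIENT\t_OS\t_OS_IND\t"
    "_GENOMIC_ID_TCGA_PANCAN_HumanMethylation27\n"
    "TCGA-XX-0001-01\tTCGA-XX-0001\t420\t1\tlegacy-id-0001\n"
)
columns, rows = drop_legacy_columns(example)
print(columns)
```

The same filter could be written with pandas, if that is already a dependency, by selecting the columns whose names do not start with `_GENOMIC_ID`.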

@jingchunzhu

jingchunzhu commented Aug 11, 2016

And to follow up, '_GENOMIC_ID_TCGA_PANCAN_HumanMethylation27' denotes the genomic sample ID in the Methylation 27K dataset. TCGA gives different IDs for each sample in each dataset:
https://wiki.nci.nih.gov/display/TCGA/Working+with+TCGA+Data.

For _RFS, _RFS_UNIT, _RFS_IND and _TIME_TO_EVENT, please see this help page: http://xena.ucsc.edu/km-plot-help/. _RFS is 'recurrence free survival'.

Author: Mary Goldman
Source : https://groups.google.com/d/msg/ucsc-cancer-genomics-browser/Hmj3JTzOz0Q/9Q2b3QPoAQAJ
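
Editor's sketch of how a time column and an event-indicator column (such as `_OS` and `_OS_IND`) feed a Kaplan-Meier estimate, which is what the help page linked above plots. It assumes the indicator is 1 when the event was observed and 0 when the sample was censored. That coding is an assumption here, so check it against the Xena help page. The function name and the numbers are made up.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each time with an event.

    `events[i]` is assumed to be 1 if the event was observed at `times[i]`
    and 0 if the observation was censored at that time.
    """
    pairs = sorted(zip(times, events))
    at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        observed = 0   # events observed exactly at time t
        leaving = 0    # everyone (event or censored) leaving the risk set at t
        while i < len(pairs) and pairs[i][0] == t:
            observed += pairs[i][1]
            leaving += 1
            i += 1
        if observed > 0:
            survival *= 1 - observed / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Made-up times (in the unit given by the *_UNIT column) and indicators.
curve = kaplan_meier([5, 8, 8, 12], [1, 0, 1, 1])
for time, probability in curve:
    print(time, round(probability, 3))
```

For real analyses, a maintained survival library is a safer choice than a hand-rolled estimator.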

@dhimmel
Member

dhimmel commented Aug 16, 2016

@jingchunzhu / Mary -- are the "Sample IDs" in Xena Browser:

  1. TCGA Barcodes?
  2. TCGA UUIDs?
  3. Xena-specific identifiers?

_GENOMIC_ID_TCGA_PANCAN_HumanMethylation27 is denoting the genomic sample ID in the Methylations 27K dataset. TCGA gives different IDs for each sample in each dataset.

It now makes sense why fields like _PANCAN_mutation_PANCAN are encoded as missing / sample_id rather than binary (0 / 1).
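
Editor's sketch: if a binary flag is more convenient than the missing / sample_id encoding described above, a field can be collapsed to 0 / 1. This assumes missing values appear as an empty string, `None`, or NaN, depending on how the file was read. The helper name `has_data` is made up.

```python
def has_data(value):
    """Return 1 if the field holds an identifier, 0 if it is missing."""
    if value is None:
        return 0
    text = str(value).strip()
    # str(float("nan")) is "nan", so this also covers NaN floats.
    if text == "" or text.lower() == "nan":
        return 0
    return 1


# Made-up values for illustration only.
flags = [has_data(v) for v in ["TCGA-XX-0001-01", "", float("nan"), None]]
print(flags)
```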

@jingchunzhu

jingchunzhu commented Aug 16, 2016

are the "Sample IDs" in Xena Browser TCGA Barcodes, TCGA UUIDs, or Xena-specific identifiers?

Sample IDs in Xena Browser are TCGA Barcodes, specifically at the sample level (TCGA gives different IDs at different levels: https://wiki.nci.nih.gov/display/TCGA/Working+with+TCGA+Data). The reason to use this level is to get the best integration of the various genomics data types. If you go to a level below samples, you will have many more entities with missing dimensions, for example mutation data but no expression data. If you go to the patient level, you will have to handle primary tumor, recurrent tumor and, most often, normal samples from the same patient, so you would probably end up throwing out the normal sample data.

_GENOMIC_ID_TCGA_PANCAN_HumanMethylation27 is denoting the genomic sample ID in the Methylations 27K dataset. TCGA gives different IDs for each sample in each dataset.

I can't tell if there is a question about _GENOMIC_ID_TCGA_PANCAN_HumanMethylation27?

Source: https://groups.google.com/d/msg/ucsc-cancer-genomics-browser/Hmj3JTzOz0Q/-vJtHmN4AwAJ
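
Editor's sketch of how a sample-level barcode relates to the patient level discussed above. It assumes the usual TCGA barcode layout, where the first three dash-separated fields identify the patient and the first two characters of the fourth field are the sample type code (for example `01` for a primary solid tumor and `11` for a solid tissue normal). The function name and the example barcodes are made up.

```python
def split_sample_barcode(sample_id):
    """Split a sample-level TCGA barcode into (patient_id, sample_type_code)."""
    parts = sample_id.split("-")
    if len(parts) < 4 or parts[0] != "TCGA" or len(parts[3]) < 2:
        raise ValueError(f"not a sample-level TCGA barcode: {sample_id!r}")
    patient_id = "-".join(parts[:3])
    sample_type_code = parts[3][:2]
    return patient_id, sample_type_code


# Made-up barcodes: a tumor and a normal sample from the same patient.
tumor = split_sample_barcode("TCGA-XX-0001-01")
normal = split_sample_barcode("TCGA-XX-0001-11")
print(tumor, normal)
```

Grouping rows by the returned patient ID is one way to spot patients who contribute more than one sample.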

dhimmel added a commit to dhimmel/cancer-data that referenced this issue Aug 24, 2016
Keeps only samples with type equal to "Primary Tumor". This filters multiple
samples from the same patient, which could cause an issue for machine learning
due to dependent observations (discussed in cognoma#10). This filter reduced the
number of samples with expression and mutation from 7,705 to 7,306.

Closes cognoma#10: all variables that could help with sample selection or covariates,
that are in PANCAN_clinicalMatrix, are extracted to `data/samples.tsv`.

Relies on documentation of PANCAN_clinicalMatrix variables provided by the
Xena Browser team in cognoma#14.

Closes cognoma#17: only sample_ids with expression, mutation, and clinical data are
output to `data/`.
clairemcleod pushed a commit that referenced this issue Aug 25, 2016
* Extract sample info from PANCAN_clinicalMatrix

Keeps only samples with type equal to "Primary Tumor". This filters multiple
samples from the same patient, which could cause an issue for machine learning
due to dependent observations (discussed in #10). This filter reduced the
number of samples with expression and mutation from 7,705 to 7,306.

Closes #10: all variables that could help with sample selection or covariates,
that are in PANCAN_clinicalMatrix, are extracted to `data/samples.tsv`.

Relies on documentation of PANCAN_clinicalMatrix variables provided by the
Xena Browser team in #14.

Closes #17: only sample_ids with expression, mutation, and clinical data are
output to `data/`.

* Retain primary blood cancers

Retain cancers whose type is "Primary Blood Derived Cancer - Peripheral Blood".
See #20 (comment)
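
Editor's sketch of the kind of filter these commit messages describe: keep samples of selected types that also have expression and mutation data. This is not the code from the commits. The function name, argument names, and example rows are hypothetical, and only the `sampleID` and `sample_type` column names come from the variable list above.

```python
def select_samples(clinical_rows, expression_ids, mutation_ids,
                   keep_types=("Primary Tumor",)):
    """Return sorted sample IDs of the kept types present in all three sources."""
    selected = []
    for row in clinical_rows:
        sample_id = row["sampleID"]
        if (row["sample_type"] in keep_types
                and sample_id in expression_ids
                and sample_id in mutation_ids):
            selected.append(sample_id)
    return sorted(selected)


# Made-up rows for illustration only.
clinical_rows = [
    {"sampleID": "TCGA-XX-0001-01", "sample_type": "Primary Tumor"},
    {"sampleID": "TCGA-XX-0001-11", "sample_type": "Solid Tissue Normal"},
    {"sampleID": "TCGA-XX-0002-01", "sample_type": "Primary Tumor"},
]
expression_ids = {"TCGA-XX-0001-01", "TCGA-XX-0001-11", "TCGA-XX-0002-01"}
mutation_ids = {"TCGA-XX-0001-01", "TCGA-XX-0001-11"}

selected = select_samples(clinical_rows, expression_ids, mutation_ids)
print(selected)
```

Passing a longer `keep_types` tuple, for example one that also includes "Primary Blood Derived Cancer - Peripheral Blood", would mirror the second commit message.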
@dhimmel changed the title from "Variable documentation for Xena Browser datasets" to "Variable documentation for Xena Browser's PANCAN_clinicalMatrix" on Aug 26, 2016
@dhimmel
Member

dhimmel commented Aug 26, 2016

I don't think we have any outstanding questions related to variables in PANCAN_clinicalMatrix, so I'm going to close this issue.

For future Xena questions, we can open new issues and mention @jingchunzhu and @maryjgoldman, who are part of the Xena team and have graciously offered their support.
