Variable documentation for Xena Browser's PANCAN_clinicalMatrix #14

Inquisitive-Geek · 2016-08-08T03:49:28Z

The documentation links provided for the 3 datasets did not explain the variables involved clearly. It would be great if you could share some links around that.

Thanks,
Roshan

dhimmel · 2016-08-08T13:22:16Z

@Inquisitive-Geek, I believe you're referring to these three downloads and corresponding links:

The links are to the Xena Browser info pages, since Xena is the team that makes the data. These pages don't provide much documentation of each column. I know Xena may have some additional documentation on various help pages.

@gwaygenomics or @jingchunzhu do you know of any documentation of what each variable means in PANCAN_clinicalMatrix?

@Inquisitive-Geek if you have questions about specific columns, then we can provide our best guess. For more authoritative documentation, I recommend messaging the UCSC Xena Browser Google Group. They've been really helpful so far and can likely give the best answers to these questions.

Inquisitive-Geek · 2016-08-09T01:14:44Z

Please correct me wherever I am wrong, as my knowledge of genomics is nil.

Thanks. Let's start with the clinical matrix dataset. Here's what I understand about the variables whose names start with _GENOMIC_ID_TCGA_PANCAN, e.g. _GENOMIC_ID_TCGA_PANCAN_HumanMethylation27:

They seem to be some sort of flag variable denoting the gene (e.g. HumanMethylation27) present in the sample. If the value is not NaN (it looks like it is the patient ID when it isn't), then the gene is present in the sample.

Also, I did not understand what _RFS, _RFS_UNIT & _RFS_IND mean. It seems like _TIME_TO_EVENT means the time it took for the cell to mutate.

jingchunzhu · 2016-08-09T05:30:01Z

Forwarded to the Google Group.

Update by @dhimmel: see the Google Group post here https://groups.google.com/forum/#!topic/ucsc-cancer-genomics-browser/Hmj3JTzOz0Q.

jingchunzhu · 2016-08-09T18:17:07Z

Identifiers:

_EVENT: event (in this case, the overall survival event)
_INTEGRATION: ID used for integrating data on the Xena Browser and across cohorts
_OS: overall survival time
_OS_IND: overall survival event
_OS_UNIT: overall survival time unit
_PANCAN_CNA_PANCAN_K8: 2012 pancan paper publication data
_PANCAN_Cluster_Cluster_PANCAN: 2012 pancan paper publication data
_PANCAN_DNAMethyl_PANCAN: 2012 pancan paper publication data
_PANCAN_RPPA_PANCAN_K8: 2012 pancan paper publication data
_PANCAN_UNC_RNAseq_PANCAN_K16: 2012 pancan paper publication data
_PANCAN_miRNA_PANCAN: 2012 pancan paper publication data
_PANCAN_mutation_PANCAN: 2012 pancan paper publication data
_PATIENT: TCGA patient ID
_RFS: recurrence-free survival (Xena curated; note: I trust the overall survival data much more)
_RFS_IND: recurrence-free survival event
_RFS_UNIT: RFS time unit
_TIME_TO_EVENT: time to event (in this case, the event is overall survival)
_TIME_TO_EVENT_UNIT: time unit
_cohort: cohort name (also used as cohort ID)
_primary_disease: primary disease
_primary_site: primary organ of origin
age_at_initial_pathologic_diagnosis
gender
sampleID: sample ID (same as _INTEGRATION)
sample_type: sample type
sample_type_id
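
As a rough illustration of how the survival fields above might be combined, here is a minimal Python sketch. It assumes, without confirmation from this thread, that _OS_IND is coded 1 for an observed event and 0 for censoring, and that missing values appear as empty strings or NaN. The sample IDs and values are made up, and `survival_pair` is a hypothetical helper name.

```python
# Tokens treated as missing (an assumption about how the file encodes gaps).
MISSING = {"", "NaN", "nan", "NA"}

def survival_pair(row, time_col="_OS", event_col="_OS_IND"):
    """Return (time, event_observed) for one row, or None if either field is missing."""
    time_raw = (row.get(time_col) or "").strip()
    event_raw = (row.get(event_col) or "").strip()
    if time_raw in MISSING or event_raw in MISSING:
        return None
    # Assumption: the indicator is 1 for an observed event and 0 for censoring.
    return float(time_raw), int(float(event_raw)) == 1

# Made-up rows that only mimic the column names listed above.
rows = [
    {"sampleID": "TCGA-AA-0001-01", "_OS": "358", "_OS_IND": "1", "_OS_UNIT": "days"},
    {"sampleID": "TCGA-AA-0002-01", "_OS": "", "_OS_IND": "", "_OS_UNIT": "days"},
]
pairs = {row["sampleID"]: survival_pair(row) for row in rows}
print(pairs)
```

The same helper could be pointed at _RFS and _RFS_IND by passing different column names, keeping in mind the caveat above that the overall survival data is considered more trustworthy.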

Anything that starts with _GENOMIC_ID holds legacy mapping information for the original UUIDs from the TCGA DCC (which has been replaced by the GDC). Therefore, I don't think any of these mappings are going to be useful anymore, at least to the vast majority of people.
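
If the legacy columns are not needed, one option is to drop them right after loading. The following is a minimal sketch using only the Python standard library; the inline table is a made-up excerpt rather than real data, and `drop_legacy_columns` is a hypothetical helper name.

```python
import csv
import io

# Made-up, tab-separated excerpt; the real matrix has many more rows and columns.
SAMPLE_TSV = (
    "sampleID\t_PATIENT\t_GENOMIC_ID_TCGA_PANCAN_HumanMethylation27\tsample_type\n"
    "TCGA-AA-0001-01\tTCGA-AA-0001\tabc-123\tPrimary Tumor\n"
    "TCGA-AA-0002-01\tTCGA-AA-0002\t\tPrimary Tumor\n"
)

def drop_legacy_columns(rows, prefix="_GENOMIC_ID"):
    """Return copies of the rows without any column whose name starts with prefix."""
    return [
        {name: value for name, value in row.items() if not name.startswith(prefix)}
        for row in rows
    ]

rows = list(csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t"))
cleaned = drop_legacy_columns(rows)
print(list(cleaned[0]))
```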

Also, note that you can take a look at the dataset detail page at https://genome-cancer.soe.ucsc.edu/proj/site/xena/datapages/?dataset=TCGA.PANCAN.sampleMap/PANCAN_clinicalMatrix&host=https://tcga.xenahubs.net

Then click on the "all identifiers" link to see all the variables available:
https://genome-cancer.soe.ucsc.edu/proj/site/xena/datapages/?host=https%3A%2F%2Ftcga.xenahubs.net&dataset=TCGA.PANCAN.sampleMap%2FPANCAN_clinicalMatrix&label=Phenotypes&allIdentifiers=true

Jing

jingchunzhu · 2016-08-11T15:37:54Z

And to follow up, '_GENOMIC_ID_TCGA_PANCAN_HumanMethylation27' is denoting the genomic sample ID in the Methylations 27K dataset. TCGA gives different IDs for each sample in each dataset:
https://wiki.nci.nih.gov/display/TCGA/Working+with+TCGA+Data.

For _RFS, _RFS_UNIT, _RFS_IND and _TIME_TO_EVENT, please see this help page: http://xena.ucsc.edu/km-plot-help/. _RFS is 'recurrence free survival'.

Author: Mary Goldman
Source : https://groups.google.com/d/msg/ucsc-cancer-genomics-browser/Hmj3JTzOz0Q/9Q2b3QPoAQAJ

dhimmel · 2016-08-16T15:28:23Z

@jingchunzhu / Mary -- are the "Sample IDs" in Xena Browser:

TCGA Barcodes?
TCGA UUIDs?
Xena-specific identifiers?

_GENOMIC_ID_TCGA_PANCAN_HumanMethylation27 is denoting the genomic sample ID in the Methylations 27K dataset. TCGA gives different IDs for each sample in each dataset.

It now makes sense why fields like _PANCAN_mutation_PANCAN are encoded as missing / sample_id rather than as binary (0 / 1).
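
For anyone who does want a 0 / 1 flag from such a column, a minimal Python sketch follows. It assumes that missing cells show up as empty strings or NaN-like tokens; the rows are made up and `has_data` is a hypothetical helper.

```python
# Tokens treated as "missing" (an assumption about how the file encodes gaps).
MISSING_TOKENS = {"", "NaN", "nan", "NA"}

def has_data(value):
    """Return True when the cell holds an identifier rather than a missing marker."""
    return value is not None and value.strip() not in MISSING_TOKENS

# Made-up rows that mimic the missing / sample_id encoding.
rows = [
    {"sampleID": "TCGA-AA-0001-01", "_PANCAN_mutation_PANCAN": "TCGA-AA-0001-01"},
    {"sampleID": "TCGA-AA-0002-01", "_PANCAN_mutation_PANCAN": ""},
]
flags = {row["sampleID"]: int(has_data(row["_PANCAN_mutation_PANCAN"])) for row in rows}
print(flags)
```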

jingchunzhu · 2016-08-16T17:54:51Z

are the "Sample IDs" in Xena Browser TCGA Barcodes, TCGA UUIDs, or Xena-specific identifiers?

Sample IDs in Xena Browser are TCGA barcodes, in particular barcodes at the sample level (TCGA gives different IDs, see https://wiki.nci.nih.gov/display/TCGA/Working+with+TCGA+Data). The reason to use the sample level is to get the best integration of the various genomic data types. If you go a level below samples, you will have many more entities with missing dimensions, for example mutation data but no expression data. If you go to the patient level, then you will have to handle primary tumor, recurrent tumor and, most often, normal samples from the same patient; essentially, you will probably end up throwing out the normal sample data.
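
To make the sample-level versus patient-level distinction concrete, here is a small Python sketch. It assumes sample IDs follow the usual TCGA barcode layout, where the first three dash-separated fields identify the patient and the first two digits of the fourth field are the sample type code. The IDs are made up and `parse_sample_id` is a hypothetical helper.

```python
from collections import defaultdict

def parse_sample_id(sample_id):
    """Split a sample-level barcode into (patient_id, two-digit sample type code)."""
    parts = sample_id.split("-")
    if len(parts) < 4 or parts[0] != "TCGA":
        raise ValueError(f"not a sample-level TCGA barcode: {sample_id!r}")
    patient_id = "-".join(parts[:3])
    sample_type_code = parts[3][:2]
    return patient_id, sample_type_code

# Made-up IDs: one patient with two samples, one patient with a single sample.
sample_ids = ["TCGA-AA-0001-01", "TCGA-AA-0001-11", "TCGA-AA-0002-01"]

samples_by_patient = defaultdict(list)
for sample_id in sample_ids:
    patient_id, _ = parse_sample_id(sample_id)
    samples_by_patient[patient_id].append(sample_id)

# Patients with more than one sample are the ones that need special handling.
multi_sample_patients = sorted(p for p, s in samples_by_patient.items() if len(s) > 1)
print(multi_sample_patients)
```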

_GENOMIC_ID_TCGA_PANCAN_HumanMethylation27 is denoting the genomic sample ID in the Methylations 27K dataset. TCGA gives different IDs for each sample in each dataset.

I can't tell if there is a question about _GENOMIC_ID_TCGA_PANCAN_HumanMethylation27?

Source: https://groups.google.com/d/msg/ucsc-cancer-genomics-browser/Hmj3JTzOz0Q/-vJtHmN4AwAJ

Keeps only samples with type equal to "Primary Tumor". This filters multiple samples from the same patient, which could cause an issue for machine learning due to dependent observations (discussed in cognoma#10). This filter reduced the number of samples with expression and mutation from 7,705 to 7,306.

Closes cognoma#10: all variables that could help with sample selection or covariates, that are in PANCAN_clinicalMatrix, are extracted to `data/samples.tsv`. Relies on documentation of PANCAN_clinicalMatrix variables provided by the Xena Browser team in cognoma#14.

Closes cognoma#17: only sample_ids with expression, mutation, and clinical data are output to `data/`.

* Extract sample info from PANCAN_clinicalMatrix

  Keeps only samples with type equal to "Primary Tumor". This filters multiple samples from the same patient, which could cause an issue for machine learning due to dependent observations (discussed in #10). This filter reduced the number of samples with expression and mutation from 7,705 to 7,306.

  Closes #10: all variables that could help with sample selection or covariates, that are in PANCAN_clinicalMatrix, are extracted to `data/samples.tsv`. Relies on documentation of PANCAN_clinicalMatrix variables provided by the Xena Browser team in #14.

  Closes #17: only sample_ids with expression, mutation, and clinical data are output to `data/`.

* Retain primary blood cancers

  Retain cancers whose type is "Primary Blood Derived Cancer - Peripheral Blood". See #20 (comment)
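
The selection described in these commit messages could be sketched roughly as below. This is not the project's actual code, only a minimal standard-library illustration under stated assumptions: the clinical rows are dictionaries with sampleID and sample_type keys, the expression and mutation sample IDs are available as plain collections, and `select_samples` is a hypothetical name. All IDs are made up.

```python
# Sample types to retain, taken from the two commit descriptions.
KEEP_TYPES = {"Primary Tumor", "Primary Blood Derived Cancer - Peripheral Blood"}

def select_samples(clinical_rows, expression_ids, mutation_ids, keep_types=KEEP_TYPES):
    """Return sorted sample IDs that have a retained type and appear in all three sources."""
    clinical_ids = {
        row["sampleID"] for row in clinical_rows if row["sample_type"] in keep_types
    }
    return sorted(clinical_ids & set(expression_ids) & set(mutation_ids))

# Made-up inputs for illustration only.
clinical_rows = [
    {"sampleID": "TCGA-AA-0001-01", "sample_type": "Primary Tumor"},
    {"sampleID": "TCGA-AA-0001-11", "sample_type": "Solid Tissue Normal"},
    {"sampleID": "TCGA-AA-0002-01", "sample_type": "Primary Tumor"},
    {"sampleID": "TCGA-AA-0003-03", "sample_type": "Primary Blood Derived Cancer - Peripheral Blood"},
]
expression_ids = ["TCGA-AA-0001-01", "TCGA-AA-0001-11", "TCGA-AA-0003-03"]
mutation_ids = ["TCGA-AA-0001-01", "TCGA-AA-0002-01", "TCGA-AA-0003-03"]

selected = select_samples(clinical_rows, expression_ids, mutation_ids)
print(selected)
```

Using set intersection keeps the step order-independent: a sample is dropped if it is absent from any one of the three sources, whichever is checked first.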

dhimmel · 2016-08-26T22:10:57Z

I don't think we have any outstanding questions related to variables in PANCAN_clinicalMatrix, so I'm going to close this issue.

For future Xena questions, we can open new issues and mention @jingchunzhu and @maryjgoldman, who are part of the Xena team and have graciously offered their support.

dhimmel changed the title ~~Variable descriptions unclear~~ Variable documentation for Xena Browser datasets Aug 8, 2016

gwaybio mentioned this issue Aug 9, 2016

Current Xena PANCAN_mutation dataset is missing some samples and variables from a previous release #16

Open

dhimmel mentioned this issue Aug 24, 2016

Extract sample info from PANCAN_clinicalMatrix #20

Merged

dhimmel changed the title ~~Variable documentation for Xena Browser datasets~~ Variable documentation for Xena Browser's PANCAN_clinicalMatrix Aug 26, 2016

dhimmel closed this as completed Aug 26, 2016

dhimmel mentioned this issue Sep 29, 2016

Add disease acronyms and update covariates.tsv #27

Merged

dhimmel mentioned this issue Dec 19, 2016

Recurrence and Distant Metastasis #37

Open
