Variable documentation for Xena Browser's PANCAN_clinicalMatrix #14
Comments
@Inquisitive-Geek, I believe you're referring to these three downloads and corresponding links: The links are to the Xena Browser info pages, since Xena is the team that makes the data. These pages don't provide much documentation of each column. I know Xena may have some additional documentation on various help pages. @gwaygenomics or @jingchunzhu, do you know of any documentation of what each variable means?

@Inquisitive-Geek, if you have questions about specific columns, then we can provide our best guess. For more authoritative documentation, I recommend messaging the UCSC Xena Browser Google Group. They've been really helpful so far and can likely give the best answers for these questions.
Please correct me wherever I am wrong, as my knowledge of genomics is nil. Thanks. Let's start with the clinical matrix dataset. Here's what I understand about variables whose names start with _GENOMIC_ID_TCGA_PANCAN, e.g. _GENOMIC_ID_TCGA_PANCAN_HumanMethylation27: they seem to be some sort of flag variable denoting the gene (e.g. HumanMethylation27) present in the sample. If the value is not NaN (it looks like it is the patient ID when it isn't), then the gene is present in the sample. Also, I did not understand what _RFS, _RFS_UNIT and _RFS_IND mean. It seems like _TIME_TO_EVENT means the time it took for the cell to mutate.
Forwarded to the Google Group. Update by @dhimmel: see the Google Group post here: https://groups.google.com/forum/#!topic/ucsc-cancer-genomics-browser/Hmj3JTzOz0Q
Identifiers:

* _EVENT: the event; in this case it is the overall survival event.
* Anything starting with _GENOMIC_ID holds legacy mapping information for the original UUIDs from the TCGA DCC (which has been replaced by the GDC). I therefore don't think any of these mappings are going to be useful anymore, at least to the vast majority of people.

Also, note that you can look at the dataset detail page at https://genome-cancer.soe.ucsc.edu/proj/site/xena/datapages/?dataset=TCGA.PANCAN.sampleMap/PANCAN_clinicalMatrix&host=https://tcga.xenahubs.net and then click on the "all identifiers" link to see all the variables available.

Jing
And to follow up, '_GENOMIC_ID_TCGA_PANCAN_HumanMethylation27' denotes the genomic sample ID in the Methylation 27K dataset. TCGA gives different IDs for each sample in each dataset.

For _RFS, _RFS_UNIT, _RFS_IND and _TIME_TO_EVENT, please see this help page: http://xena.ucsc.edu/km-plot-help/ (_RFS is 'recurrence free survival').

Author: Mary Goldman
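To make the two replies above concrete, here is a minimal pandas sketch of how the `_GENOMIC_ID_*` columns and the overall-survival fields could be inspected. It is only an illustration: the three rows, the sample IDs, the `uuid-*` values, and the `sampleID` index column name are made-up assumptions, not values from PANCAN_clinicalMatrix.

```python
import io

import pandas as pd

# Hypothetical three-sample excerpt (assumption); the real download is a
# tab-separated file with many more rows and columns.
tsv = (
    "sampleID\t_EVENT\t_TIME_TO_EVENT\t_GENOMIC_ID_TCGA_PANCAN_HumanMethylation27\n"
    "TCGA-AA-0001-01\t1\t320\tuuid-a\n"
    "TCGA-AA-0002-01\t0\t1500\t\n"
    "TCGA-AA-0003-01\t\t\tuuid-c\n"
)
clinical = pd.read_table(io.StringIO(tsv), index_col=0)

# Legacy per-dataset ID mappings: a non-null value means the sample had an
# ID in that dataset. It is not a flag for a gene being present.
genomic_id_cols = [c for c in clinical.columns if c.startswith("_GENOMIC_ID_")]
samples_per_dataset = clinical[genomic_id_cols].notnull().sum()

# Overall-survival fields: keep rows where both the event indicator and the
# time to event are known.
survival = clinical[["_EVENT", "_TIME_TO_EVENT"]].dropna()
```

Counting non-null values per `_GENOMIC_ID_*` column only shows how many samples had a legacy ID in each dataset; as Jing notes, the mapped IDs themselves are unlikely to be useful now.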
@jingchunzhu / Mary -- are the "Sample IDs" in Xena Browser:
It now makes sense why fields like
Sample IDs in Xena Browser are TCGA barcodes, in particular at the sample level (TCGA gives different IDs at different levels; see https://wiki.nci.nih.gov/display/TCGA/Working+with+TCGA+Data). The reason to use the sample level is that it gives the best integration of the various genomic data types. If you go to a level below samples, you will have many more entities with missing dimensions, e.g. there is mutation data but no expression data. If you go to the patient level, then you will have to handle the primary tumor, the recurrent tumor and, mostly, normal samples from the same patient; essentially, you will probably end up throwing out the normal sample data.
I can't tell if there is a question about _GENOMIC_ID_TCGA_PANCAN_HumanMethylation27? Source: https://groups.google.com/d/msg/ucsc-cancer-genomics-browser/Hmj3JTzOz0Q/-vJtHmN4AwAJ
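As a rough illustration of the sample-versus-patient distinction described above, the sketch below splits a sample-level TCGA barcode into its patient part and its two-digit sample type code. The helper name `split_sample_barcode` and the example IDs are hypothetical. The code meanings in the comments (01 for a primary solid tumor, 11 for solid tissue normal) follow the TCGA code tables as I understand them, so check them against the TCGA documentation before relying on them.

```python
def split_sample_barcode(sample_id):
    """Split a sample-level TCGA barcode such as 'TCGA-02-0001-01'.

    Returns (patient_id, sample_type_code). Hypothetical helper for
    illustration only.
    """
    parts = sample_id.split("-")
    if len(parts) < 4 or parts[0] != "TCGA":
        raise ValueError(f"not a sample-level TCGA barcode: {sample_id!r}")
    patient_id = "-".join(parts[:3])   # project, tissue source site, participant
    sample_type_code = parts[3][:2]    # two digits; a vial letter may follow
    return patient_id, sample_type_code


# Illustrative IDs (assumption), not taken from the dataset.
sample_ids = ["TCGA-AA-0001-01", "TCGA-AA-0001-11", "TCGA-BB-0002-01"]

# Group sample type codes by patient to spot patients with several samples,
# e.g. a primary tumor (01) plus a solid tissue normal (11).
by_patient = {}
for sid in sample_ids:
    patient, code = split_sample_barcode(sid)
    by_patient.setdefault(patient, []).append(code)
```

Grouping by the patient part shows which patients contribute more than one sample, which is the situation the patient-level caveat above is about.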
Keeps only samples with type equal to "Primary Tumor". This filters out multiple samples from the same patient, which could cause an issue for machine learning due to dependent observations (discussed in cognoma#10). This filter reduced the number of samples with expression and mutation from 7,705 to 7,306. Closes cognoma#10: all variables in PANCAN_clinicalMatrix that could help with sample selection or serve as covariates are extracted to `data/samples.tsv`. Relies on documentation of PANCAN_clinicalMatrix variables provided by the Xena Browser team in cognoma#14. Closes cognoma#17: only sample_ids with expression, mutation, and clinical data are output to `data/`.
* Extract sample info from PANCAN_clinicalMatrix

  Keeps only samples with type equal to "Primary Tumor". This filters out multiple samples from the same patient, which could cause an issue for machine learning due to dependent observations (discussed in #10). This filter reduced the number of samples with expression and mutation from 7,705 to 7,306.

  Closes #10: all variables in PANCAN_clinicalMatrix that could help with sample selection or serve as covariates are extracted to `data/samples.tsv`. Relies on documentation of PANCAN_clinicalMatrix variables provided by the Xena Browser team in #14.

  Closes #17: only sample_ids with expression, mutation, and clinical data are output to `data/`.

* Retain primary blood cancers

  Retain cancers whose type is "Primary Blood Derived Cancer - Peripheral Blood". See #20 (comment)
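The selection described in these commit messages can be sketched roughly as follows. This is not the repository's actual code: the tiny tables, the sample IDs, the gene column, and the `sample_type` column name are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-ins (assumption) for the clinical, expression, and
# mutation tables, each indexed by sample ID.
clinical = pd.DataFrame(
    {"sample_type": ["Primary Tumor", "Solid Tissue Normal",
                     "Primary Tumor", "Metastatic"]},
    index=["S1-01", "S1-11", "S2-01", "S3-06"],
)
expression = pd.DataFrame({"GENE_A": [1.0, 2.0, 3.0]},
                          index=["S1-01", "S1-11", "S2-01"])
mutation = pd.DataFrame({"GENE_A": [0, 1, 0]},
                        index=["S1-01", "S2-01", "S3-06"])

# Keep primary tumors and primary blood cancers, so that each patient is
# unlikely to contribute more than one sample.
keep_types = {"Primary Tumor", "Primary Blood Derived Cancer - Peripheral Blood"}
selected = clinical[clinical["sample_type"].isin(keep_types)]

# Keep only sample IDs that have clinical, expression, and mutation data.
common = selected.index.intersection(expression.index).intersection(mutation.index)
samples = selected.loc[common].sort_index()
```

Filtering on sample type before intersecting keeps the logic easy to audit: the first step addresses the dependent-observations concern, and the second ensures every retained sample has all three data types.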
I don't think we have any outstanding questions related to variables in For future Xena questions, we can open new issues and mention
Hi @dhimmel,
The documentation links provided for the 3 datasets did not explain the variables involved clearly. It would be great if you could share some links around that.
Thanks,
Roshan