Generate comprehensive comparisons between RNAseq, Mutation, and Clinical Matrix #17

gwaybio · 2016-08-09T23:57:37Z

We need to generate a comparison between the sample IDs that exist in all three data sources. It will be good to subset the clinical matrix to only samples that are measured by RNAseq and to file a pull request with this report.
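A minimal sketch of what such a comparison could look like, assuming the sample IDs from each source have already been loaded into Python sets. The function name `compare_sample_ids` and the toy IDs are hypothetical and are not taken from the repository.

```python
from itertools import combinations


def compare_sample_ids(sources):
    """Summarize sample ID overlap across named sources.

    `sources` maps a source name to a set of sample IDs.
    Returns a dict of counts and the set of IDs shared by every source.
    """
    # Number of samples in each individual source
    report = {name: len(ids) for name, ids in sources.items()}
    # Number of samples shared by each pair of sources
    for (a, ids_a), (b, ids_b) in combinations(sources.items(), 2):
        report[f"{a} & {b}"] = len(ids_a & ids_b)
    # Samples present in every source
    shared = set.intersection(*sources.values())
    report["all"] = len(shared)
    return report, shared


# Toy IDs for illustration only (hypothetical, not real data)
sources = {
    "expression": {"S1", "S2", "S3"},
    "mutation": {"S2", "S3", "S4"},
    "clinical": {"S1", "S2", "S3", "S5"},
}
report, shared = compare_sample_ids(sources)
for key, count in report.items():
    print(f"{key}: {count}")
```

The `shared` set is what one would use to subset the clinical matrix; the `report` counts could go into the pull request description.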

gwaybio · 2016-08-10T00:04:10Z

@mike19106

dhimmel · 2016-08-10T14:07:55Z

You should be able to use the sample_ids from `data/subset/expression-matrix-all-samples.tsv`, which are the intersection of the expression and mutation samples.

Also I usually wait till the last moment to drop samples, meaning we probably could process the clinical matrix without this information?
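A rough sketch of the "drop samples last" idea, using only the standard library. The in-memory TSV text stands in for the real files, and the column names `sample_id` and `sample_type` are assumptions made for illustration.

```python
import csv
import io

# In-memory stand-ins for the real files (hypothetical toy content)
expression_tsv = "sample_id\tGENE_A\tGENE_B\nS1\t0.5\t1.2\nS2\t0.1\t3.4\n"
clinical_tsv = (
    "sample_id\tsample_type\n"
    "S1\tPrimary Tumor\n"
    "S2\tMetastatic\n"
    "S9\tPrimary Tumor\n"
)


def read_rows(text):
    """Parse tab-separated text into a list of dicts keyed by header."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


# Process the full clinical matrix first; no samples are dropped here.
clinical_rows = read_rows(clinical_tsv)

# Only at the end, keep rows whose sample_id appears in the expression matrix.
expression_ids = {row["sample_id"] for row in read_rows(expression_tsv)}
subset = [row for row in clinical_rows if row["sample_id"] in expression_ids]

print([row["sample_id"] for row in subset])
```

Keeping the intersection as a final step means the clinical processing does not depend on which expression or mutation samples happen to be available.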

Keeps only samples with type equal to "Primary Tumor". This filters out multiple samples from the same patient, which could cause an issue for machine learning due to dependent observations (discussed in cognoma#10). This filter reduced the number of samples with expression and mutation from 7,705 to 7,306.

Closes cognoma#10: all variables in PANCAN_clinicalMatrix that could help with sample selection or serve as covariates are extracted to `data/samples.tsv`. Relies on documentation of PANCAN_clinicalMatrix variables provided by the Xena Browser team in cognoma#14.

Closes cognoma#17: only sample_ids with expression, mutation, and clinical data are output to `data/`.

* Extract sample info from PANCAN_clinicalMatrix

  Keeps only samples with type equal to "Primary Tumor". This filters out multiple samples from the same patient, which could cause an issue for machine learning due to dependent observations (discussed in #10). This filter reduced the number of samples with expression and mutation from 7,705 to 7,306.

  Closes #10: all variables in PANCAN_clinicalMatrix that could help with sample selection or serve as covariates are extracted to `data/samples.tsv`. Relies on documentation of PANCAN_clinicalMatrix variables provided by the Xena Browser team in #14.

  Closes #17: only sample_ids with expression, mutation, and clinical data are output to `data/`.

* Retain primary blood cancers

  Retain cancers whose type is "Primary Blood Derived Cancer - Peripheral Blood". See #20 (comment)
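The sample-type filter described in these commits could be sketched roughly as follows. This is an illustration under assumptions, not the code from #20: the field names `sample_type` and `patient_id` and the toy records are hypothetical.

```python
from collections import Counter

# Sample types retained, as named in the commit messages above
KEPT_TYPES = {
    "Primary Tumor",
    "Primary Blood Derived Cancer - Peripheral Blood",
}

# Toy records for illustration only (hypothetical, not real data)
samples = [
    {"sample_id": "P1-01", "patient_id": "P1", "sample_type": "Primary Tumor"},
    {"sample_id": "P1-06", "patient_id": "P1", "sample_type": "Metastatic"},
    {"sample_id": "P2-01", "patient_id": "P2", "sample_type": "Primary Tumor"},
    {
        "sample_id": "P3-03",
        "patient_id": "P3",
        "sample_type": "Primary Blood Derived Cancer - Peripheral Blood",
    },
    {"sample_id": "P4-11", "patient_id": "P4", "sample_type": "Solid Tissue Normal"},
]

# Keep only the retained sample types
kept = [s for s in samples if s["sample_type"] in KEPT_TYPES]

# Check whether any patient still contributes more than one sample,
# since repeated patients would give dependent observations
per_patient = Counter(s["patient_id"] for s in kept)
duplicated = sorted(p for p, n in per_patient.items() if n > 1)

print(f"kept {len(kept)} of {len(samples)} samples")
print(f"patients with more than one kept sample: {duplicated}")
```

The duplicate check is worth running after the filter, because restricting by sample type reduces repeated patients but does not by itself guarantee one sample per patient.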

gwaybio added the task label Aug 9, 2016

dhimmel mentioned this issue Aug 24, 2016

Extract sample info from PANCAN_clinicalMatrix #20

Merged

clairemcleod closed this as completed in #20 Aug 25, 2016
