Generate comprehensive comparisons between RNAseq, Mutation, and Clinical Matrix #17

Closed
gwaybio opened this issue Aug 9, 2016 · 2 comments

gwaybio commented Aug 9, 2016

We need to compare the sample IDs across all three data sources (RNAseq, mutation, and the clinical matrix). It would be good to subset the clinical matrix to only the samples measured by RNAseq and to file a pull request with this report.
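A minimal sketch of what such a comparison could look like, using plain Python sets. The function name, the source labels, and the sample IDs are all hypothetical and only illustrate the idea; the real IDs would come from the three data files.

```python
def compare_sample_ids(sources):
    """Summarize the overlap between named collections of sample IDs.

    `sources` maps a source name to an iterable of sample IDs.
    Returns the IDs shared by every source, plus a per-source report.
    """
    id_sets = {name: set(ids) for name, ids in sources.items()}
    # IDs present in every source
    shared = set.intersection(*id_sets.values())
    report = {
        name: {"total": len(ids), "missing_elsewhere": len(ids - shared)}
        for name, ids in id_sets.items()
    }
    return shared, report


# Made-up sample IDs, for illustration only.
sources = {
    "expression": ["S1", "S2", "S3"],
    "mutation": ["S2", "S3", "S4"],
    "clinical": ["S1", "S2", "S3", "S4", "S5"],
}
shared, report = compare_sample_ids(sources)
```

Here `shared` holds the IDs found in all three sources, and `report` records, for each source, how many of its IDs are absent from at least one other source.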

gwaybio added the task label Aug 9, 2016

gwaybio commented Aug 10, 2016

@mike19106

dhimmel commented Aug 10, 2016

You should be able to use the sample_ids from data/subset/expression-matrix-all-samples.tsv, which are the intersection of the expression and mutation samples.

Also, I usually wait until the last moment to drop samples, so we could probably process the clinical matrix without this information.
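One way this might look in practice: process the full clinical matrix first, then subset it at the very end using the sample_ids read from the expression file. The sketch below assumes a tab-separated file with a `sample_id` column; that column name, the helper functions, and the in-memory stand-in data are assumptions for illustration, not the repository's actual layout.

```python
import csv
import io


def read_sample_ids(handle, column="sample_id"):
    """Read one column of sample IDs from a tab-separated file handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row[column] for row in reader]


def subset_clinical(clinical_rows, keep_ids, column="sample_id"):
    """Keep only the clinical rows whose sample ID is in keep_ids."""
    keep = set(keep_ids)
    return [row for row in clinical_rows if row[column] in keep]


# In-memory stand-in for data/subset/expression-matrix-all-samples.tsv
# (made-up contents; the column name is an assumption).
expression_tsv = io.StringIO("sample_id\tGENE_A\nS2\t1.5\nS3\t0.2\n")

# Clinical rows are processed in full first; samples are dropped last.
clinical_rows = [
    {"sample_id": "S1", "sample_type": "Primary Tumor"},
    {"sample_id": "S2", "sample_type": "Primary Tumor"},
    {"sample_id": "S3", "sample_type": "Metastatic"},
]

keep_ids = read_sample_ids(expression_tsv)
subset = subset_clinical(clinical_rows, keep_ids)
```

Deferring the subset this way keeps the clinical processing independent of the expression and mutation files until the final step.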

dhimmel added a commit to dhimmel/cancer-data that referenced this issue Aug 24, 2016
Keeps only samples with type equal to "Primary Tumor". This filters out multiple
samples from the same patient, which could cause an issue for machine learning
due to dependent observations (discussed in cognoma#10). This filter reduced the
number of samples with expression and mutation from 7,705 to 7,306.

Closes cognoma#10: all variables that could help with sample selection or covariates,
that are in PANCAN_clinicalMatrix, are extracted to `data/samples.tsv`.

Relies on documentation of PANCAN_clinicalMatrix variables provided by the
Xena Browser team in cognoma#14.

Closes cognoma#17: only sample_ids with expression, mutation, and clinical data are
output to `data/`.
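
To illustrate the dependent-observations concern from the commit message, here is a small sketch that checks whether any patient contributes more than one sample before and after keeping only "Primary Tumor" rows. The `patient_id` and `sample_type` field names and all records are hypothetical; the commit itself may derive patients differently.

```python
from collections import Counter


def patients_with_multiple_samples(samples):
    """Return the sorted patient IDs that appear in more than one sample."""
    counts = Counter(s["patient_id"] for s in samples)
    return sorted(p for p, n in counts.items() if n > 1)


# Made-up records: each patient has a primary tumor plus one other sample.
samples = [
    {"sample_id": "P1-a", "patient_id": "P1", "sample_type": "Primary Tumor"},
    {"sample_id": "P1-b", "patient_id": "P1", "sample_type": "Solid Tissue Normal"},
    {"sample_id": "P2-a", "patient_id": "P2", "sample_type": "Primary Tumor"},
    {"sample_id": "P2-b", "patient_id": "P2", "sample_type": "Metastatic"},
]

primary = [s for s in samples if s["sample_type"] == "Primary Tumor"]

before = patients_with_multiple_samples(samples)
after = patients_with_multiple_samples(primary)
```

In this toy data the filter leaves one sample per patient, so `after` is empty. On real data it is worth running the same check, since a type filter alone does not guarantee uniqueness.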
clairemcleod pushed a commit that referenced this issue Aug 25, 2016
* Extract sample info from PANCAN_clinicalMatrix

Keeps only samples with type equal to "Primary Tumor". This filters out multiple
samples from the same patient, which could cause an issue for machine learning
due to dependent observations (discussed in #10). This filter reduced the
number of samples with expression and mutation from 7,705 to 7,306.

Closes #10: all variables that could help with sample selection or covariates,
that are in PANCAN_clinicalMatrix, are extracted to `data/samples.tsv`.

Relies on documentation of PANCAN_clinicalMatrix variables provided by the
Xena Browser team in #14.

Closes #17: only sample_ids with expression, mutation, and clinical data are
output to `data/`.

* Retain primary blood cancers

Retain cancers whose type is "Primary Blood Derived Cancer - Peripheral Blood".
See #20 (comment)
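
A rough sketch of a filter that retains both sample types named in the commit messages. The two type strings come from the messages above; the `sample_type` column name, the helper function, and the records are assumptions made for illustration.

```python
# Sample types kept by the filter, as named in the commit messages.
RETAINED_TYPES = frozenset({
    "Primary Tumor",
    "Primary Blood Derived Cancer - Peripheral Blood",
})


def retain_sample_types(samples, retained=RETAINED_TYPES, column="sample_type"):
    """Keep only the samples whose type is in the retained set."""
    return [s for s in samples if s[column] in retained]


# Made-up records, for illustration only.
samples = [
    {"sample_id": "S1", "sample_type": "Primary Tumor"},
    {"sample_id": "S2", "sample_type": "Metastatic"},
    {"sample_id": "S3",
     "sample_type": "Primary Blood Derived Cancer - Peripheral Blood"},
]

kept = retain_sample_types(samples)
```

Using set membership rather than a single equality test makes it easy to add or remove a retained type later.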