Precomputing a sample × mutation-in-gene-set matrix #21

stephenshank · 2016-08-25T21:51:42Z

At the 8/23 meetup, @dhimmel expressed interest in incorporating metabolic pathway information by combining the dataset that we have and the hetnet database that was described at the first meetup. The hetnet has information on what pathways the mutated genes in the current dataset participate in.

I figured I'd open this issue to get the conversation started. Initially, I am wondering what this dataset would look like, and do we envision it being created from what we already have? And how much tweaking will the classifier of the machine learning group (for instance, that provided by @gwaygenomics) require?

gwaybio · 2016-08-25T22:26:59Z

I think this would be the next logical step for the cancer data group and, as @stephenshank mentioned, it would require some communication with the ML group.

I did some work on this issue today and am shooting to file a pull request in the ML group tomorrow afternoon.

> I am wondering what this dataset would look like, and do we envision it being created from what we already have?

From my perspective, you can think of this matrix as very similar to the gene-based mutation matrix, except that the columns will be pathways instead of gene names.

> And how much tweaking will the classifier of the machine learning group (for instance, that provided by @gwaygenomics) require?

Tweaking to the actual classifier is extremely minimal. The algorithm will simply take in a Y matrix of {0,1} values, where 1 means a mutation in any gene in the pathway. The visualizations of input data and of classifier performance on a per-tissue basis are where this approach is likely to make the most difference.
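To make the {0,1} encoding concrete, here is a minimal sketch in plain Python. The sample IDs, gene names, and pathway names are made up, and `collapse_to_gene_sets` is a hypothetical helper rather than code from this repository; a real implementation would more likely operate on the full mutation matrix with pandas.

```python
def collapse_to_gene_sets(sample_to_mutated_genes, gene_sets):
    """Return {sample: {gene_set: 0 or 1}}.

    A value of 1 means at least one gene in the set is mutated in that sample.
    """
    matrix = {}
    for sample, mutated in sample_to_mutated_genes.items():
        matrix[sample] = {
            name: int(bool(mutated & genes))  # non-empty intersection -> 1
            for name, genes in gene_sets.items()
        }
    return matrix


# Toy inputs (hypothetical): mutated genes per sample, and genes per pathway.
mutations = {
    "sample_1": {"TP53", "KRAS"},
    "sample_2": {"PTEN"},
    "sample_3": set(),
}
pathways = {
    "pathway_A": {"TP53", "PTEN"},
    "pathway_B": {"KRAS", "BRAF"},
}

pathway_matrix = collapse_to_gene_sets(mutations, pathways)
```

Each row of `pathway_matrix` is one sample and each key within a row is one pathway, so any single pathway column can serve as the classifier's target vector.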

cgreene · 2016-08-26T13:39:16Z

I think that the long-term aim of this part is to do queries to the live hetnet database to return a gene set. This way, whenever the hetnets get updated, we automatically get the improved versions. It may be best to start there (queries against the live hetnets) instead of a downloaded version.

dhimmel · 2016-08-26T21:49:53Z

> the long-term aim of this part is to do queries to the live hetnet database

Agreed, but I think there is an R&D argument for generating a sample by pathway matrix. For example, we will want to know the distribution of positive prevalence across all pathways.
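As an illustration of what "positive prevalence" could mean here, the sketch below computes, for each pathway, the fraction of samples with a 1 in that column. The toy matrix and the `positive_prevalence` helper are assumptions made for the example, not part of the existing codebase.

```python
def positive_prevalence(matrix):
    """Given {sample: {pathway: 0 or 1}}, return {pathway: fraction positive}."""
    n_samples = len(matrix)
    totals = {}
    for row in matrix.values():
        for pathway, value in row.items():
            totals[pathway] = totals.get(pathway, 0) + value
    return {pathway: count / n_samples for pathway, count in totals.items()}


# Toy sample-by-pathway matrix (hypothetical values).
toy_matrix = {
    "sample_1": {"pathway_A": 1, "pathway_B": 1, "pathway_C": 0},
    "sample_2": {"pathway_A": 1, "pathway_B": 0, "pathway_C": 0},
    "sample_3": {"pathway_A": 1, "pathway_B": 0, "pathway_C": 0},
    "sample_4": {"pathway_A": 0, "pathway_B": 0, "pathway_C": 0},
}

prevalence = positive_prevalence(toy_matrix)
# Sorting by prevalence gives a quick view of the distribution across pathways.
ranked = sorted(prevalence.items(), key=lambda item: item[1], reverse=True)
```

Pathways with prevalence near 0 or near 1 leave a classifier with very few examples of one class, which is one reason the distribution is worth checking on a cached matrix.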

@stephenshank, if you're still interested in this task, I recommend it. It will be convenient to have a cached mutation matrix for gene sets rather than genes.

You can still work with Hetionet Cypher queries to construct this dataset, as @gwaygenomics started in cognoma/machine-learning#39.

dhimmel · 2016-08-26T21:50:49Z

Also interesting is how often Hetionet returns genes that aren't in our mutation dataset.
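One way to measure this is a set difference between the genes returned for each gene set and the genes present as columns in the mutation matrix. The sketch below uses made-up gene sets and a hypothetical helper name; it does not query Hetionet itself.

```python
def genes_missing_from_dataset(gene_sets, dataset_genes):
    """Return {gene_set: genes in the set that the dataset does not contain}."""
    dataset_genes = set(dataset_genes)
    return {name: set(genes) - dataset_genes for name, genes in gene_sets.items()}


# Toy inputs (hypothetical): gene sets as returned by a query, and the
# genes that appear as columns in the mutation matrix.
returned_gene_sets = {
    "pathway_A": {"TP53", "PTEN", "GENE_X"},
    "pathway_B": {"KRAS", "BRAF"},
}
mutation_matrix_genes = {"TP53", "PTEN", "KRAS", "BRAF"}

missing = genes_missing_from_dataset(returned_gene_sets, mutation_matrix_genes)

# Overall rate, counting each returned gene once even if it is in several sets.
all_returned = set().union(*returned_gene_sets.values())
fraction_missing = len(all_returned - mutation_matrix_genes) / len(all_returned)
```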

stephenshank · 2016-09-19T12:28:14Z

@dhimmel I believe I'm ready to submit a PR for this, but had one quick question. The resulting sample-pathway matrix is about 26 MB uncompressed. I wasn't sure how big was too big to track, or if we want to track compressed files. Any suggestions would be most appreciated.

dhimmel · 2016-09-19T13:19:45Z

Can you bz2 compress the file so it's smaller? Our data/.gitignore file will then make sure the dataset isn't tracked.
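For reference, a minimal sketch of writing a bz2-compressed TSV with Python's standard library. The file name and rows are placeholders, and the temporary directory is only there to keep the example free of side effects; with pandas, `DataFrame.to_csv` with `compression='bz2'` should achieve the same thing.

```python
import bz2
import csv
import os
import tempfile

# Placeholder rows standing in for the sample-pathway matrix.
rows = [
    ["sample_id", "pathway_A", "pathway_B"],
    ["sample_1", 1, 0],
    ["sample_2", 0, 0],
]

# Hypothetical output path inside a throwaway directory.
out_path = os.path.join(tempfile.mkdtemp(), "sample-pathway-matrix.tsv.bz2")

# bz2.open in text mode compresses transparently as rows are written.
with bz2.open(out_path, "wt", newline="") as handle:
    csv.writer(handle, delimiter="\t").writerows(rows)

# Reading it back confirms the round trip (values come back as strings).
with bz2.open(out_path, "rt", newline="") as handle:
    round_trip = list(csv.reader(handle, delimiter="\t"))
```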

stephenshank · 2016-09-19T14:50:07Z

See #25.

gwaybio mentioned this issue Aug 26, 2016

Machine Learning Pathway Classifier Example cognoma/machine-learning#39

Merged

dhimmel changed the title ~~Creating a dataset that incorporates metabolic pathway information~~ Precomputing a sample × mutation-in-gene-set matrix Aug 26, 2016

dhimmel added the task label Aug 26, 2016
