Precomputing a sample × mutation-in-gene-set matrix #21

Open
stephenshank opened this issue Aug 25, 2016 · 7 comments
@stephenshank
Member

stephenshank commented Aug 25, 2016

At the 8/23 meetup, @dhimmel expressed interest in incorporating metabolic pathway information by combining the dataset that we have with the hetnet database that was described at the first meetup. The hetnet has information on which pathways the mutated genes in the current dataset participate in.

I figured I'd open this issue to get the conversation started. Initially, I am wondering what this dataset would look like, and do we envision it being created from what we already have? And how much tweaking will the classifier of the machine learning group (for instance, that provided by @gwaygenomics) require?

@gwaybio
Member

gwaybio commented Aug 25, 2016

I think this would be the next logical step for the cancer data group and, as @stephenshank mentioned, it would require some communication with the ML group.

I did some work on this issue today and am shooting to file a pull request in the ML group tomorrow afternoon.

> I am wondering what this dataset would look like, and do we envision it being created from what we already have?

From my perspective, you can think of this matrix as very similar to the gene-based mutation matrix, except that instead of gene names as columns, there will be pathways.

> And how much tweaking will the classifier of the machine learning group (for instance, that provided by @gwaygenomics) require?

The tweaking required for the actual classifier is minimal. The algorithm will simply take in a Y matrix of {0,1}, where 1 means a mutation in any gene in the pathway. The visualizations of input data and of classifier performance on a per-tissue basis are where this approach is likely to make the most difference.
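
To make the {0,1} encoding concrete, here is a minimal sketch in plain Python. The function name, the sample IDs, and the pathway names are all hypothetical and chosen only for illustration; it assumes each sample's mutations are available as a set of gene symbols, which is not necessarily how the project stores them.

```python
def gene_set_matrix(sample_to_mutated_genes, gene_sets):
    """Collapse per-sample mutated genes into a sample x gene-set 0/1 matrix.

    A cell is 1 when the sample has a mutation in ANY gene of the set.
    Returns a nested dict: {sample: {gene_set_name: 0 or 1}}.
    """
    matrix = {}
    for sample, mutated in sample_to_mutated_genes.items():
        matrix[sample] = {
            # isdisjoint is True when the two collections share no gene
            name: int(not mutated.isdisjoint(genes))
            for name, genes in gene_sets.items()
        }
    return matrix


# Toy inputs (hypothetical, not taken from the real dataset)
mutations = {
    "sample_1": {"TP53", "KRAS"},
    "sample_2": {"BRCA1"},
    "sample_3": set(),
}
pathways = {
    "pathway_A": {"TP53", "MDM2"},
    "pathway_B": {"BRCA1", "BRCA2"},
}

matrix = gene_set_matrix(mutations, pathways)
```

Any one column of the result could then serve as the Y vector for a single pathway.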

@cgreene
Member

cgreene commented Aug 26, 2016

I think that the long-term aim of this part is to do queries to the live hetnet database to return a gene set. This way, whenever the hetnets get updated, we automatically get the improved versions. It may be best to start there (queries against the live hetnets) instead of a downloaded version.

@dhimmel
Member

dhimmel commented Aug 26, 2016

> the long-term aim of this part is to do queries to the live hetnet database

Agreed, but I think there is an R&D argument for generating a sample by pathway matrix. For example, we will want to know the distribution of positive prevalence across all pathways.
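
As a rough illustration of the prevalence question, the sketch below computes the fraction of positive samples per pathway from a small sample × pathway matrix. The nested-dict layout, the function name, and the toy values are assumptions made for the example, not the project's actual format.

```python
def positive_prevalence(matrix):
    """Fraction of samples that are positive (1) for each gene set.

    matrix: {sample: {gene_set_name: 0 or 1}}
    Returns {gene_set_name: prevalence between 0 and 1}.
    """
    n_samples = len(matrix)
    totals = {}
    for row in matrix.values():
        for name, value in row.items():
            totals[name] = totals.get(name, 0) + value
    return {name: total / n_samples for name, total in totals.items()}


# Toy matrix (hypothetical values)
matrix = {
    "sample_1": {"pathway_A": 1, "pathway_B": 0},
    "sample_2": {"pathway_A": 1, "pathway_B": 1},
    "sample_3": {"pathway_A": 1, "pathway_B": 0},
    "sample_4": {"pathway_A": 0, "pathway_B": 0},
}

prevalence = positive_prevalence(matrix)
```

With the cached matrix in hand, the same per-pathway fractions could be summarized or plotted as a histogram to see how many pathways are nearly always or almost never positive.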

@stephenshank, if you're still interested in this task, I recommend it. It will be convenient to have a cached mutation matrix for gene sets rather than genes.

You can still work with Hetionet Cypher queries to construct this dataset, as @gwaygenomics started in cognoma/machine-learning#39.

@dhimmel
Member

dhimmel commented Aug 26, 2016

Also interesting is how often Hetionet returns genes that aren't in our mutation dataset.
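
One way to check this, sketched under the assumption that gene sets and the dataset's genes are both available as collections of the same kind of identifier (the names and toy values below are hypothetical):

```python
def missing_gene_report(gene_sets, dataset_genes):
    """Count, per gene set, the genes that are absent from the mutation dataset.

    Returns {gene_set_name: (n_missing, n_total)}.
    """
    dataset_genes = set(dataset_genes)
    report = {}
    for name, genes in gene_sets.items():
        genes = set(genes)
        missing = genes - dataset_genes
        report[name] = (len(missing), len(genes))
    return report


# Toy inputs (hypothetical)
pathways = {
    "pathway_A": {"TP53", "MDM2", "GENE_X"},
    "pathway_B": {"BRCA1", "BRCA2"},
}
genes_in_dataset = {"TP53", "MDM2", "BRCA1", "BRCA2", "KRAS"}

report = missing_gene_report(pathways, genes_in_dataset)
```

A mismatch in identifier type (for example symbols on one side and numeric IDs on the other) would show up here as every gene being reported missing, so it doubles as a sanity check.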

@dhimmel dhimmel changed the title Creating a dataset that incorporates metabolic pathway information Precomputing a sample × mutation-in-gene-set matrix Aug 26, 2016
@dhimmel dhimmel added the task label Aug 26, 2016
@stephenshank
Member Author

@dhimmel I believe I'm ready to submit a PR for this, but I have one quick question. The resulting sample-pathway matrix is about 26 MB uncompressed. I wasn't sure how big is too big to track, or whether we want to track compressed files. Any suggestions would be most appreciated.

@dhimmel
Member

dhimmel commented Sep 19, 2016

Can you bz2 compress the file so it's smaller? Our data/.gitignore file will then make sure the dataset isn't tracked.
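
For reference, a minimal sketch of writing a matrix straight to a bz2-compressed TSV using only the Python standard library. The file name, column names, and nested-dict layout are assumptions for illustration; the example writes to a temporary directory rather than to the repository's data folder.

```python
import bz2
import csv
import os
import tempfile


def write_matrix_bz2(path, matrix, columns):
    """Write {sample: {column: value}} as a bz2-compressed, tab-separated file."""
    with bz2.open(path, "wt", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["sample_id"] + list(columns))
        for sample, row in matrix.items():
            writer.writerow([sample] + [row[column] for column in columns])


# Toy matrix (hypothetical values)
matrix = {
    "sample_1": {"pathway_A": 1, "pathway_B": 0},
    "sample_2": {"pathway_A": 0, "pathway_B": 1},
}

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name
    path = os.path.join(tmp, "sample-by-gene-set.tsv.bz2")
    write_matrix_bz2(path, matrix, ["pathway_A", "pathway_B"])
    # Read the file back to confirm it round-trips
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        lines = handle.read().splitlines()
```

A 0/1 matrix is highly repetitive, so it should compress well, though the exact ratio depends on the data.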

@stephenshank
Member Author

See #25.
