Create sample-by-pathway dataset #25
I may have done something undesirable with git/github... for some reason, changes from my last pull request are showing up in this one. The file to look at is |
You need to rebase your pull request on top of the latest upstream master. This will be a little more difficult because you based your pull request off of your master branch. Usually you make pull requests using a new branch, so your master branch can easily stay synced with upstream. If you've configured
hetnet_results = pd.DataFrame()
with driver.session() as session:
    result = session.run(query)
    hetnet_results = ( pd.DataFrame((x.values() for x in result), columns=result.keys())
Can we call hetnet_results something more informative, such as pathway_df?
not the prettiest spacing
Very nice. Awesome to see that you were able to interact programmatically with Hetionet. I made a few comments in addition to the rebase required as discussed above.
How long did the queries take? May also be nice to do this for Biological Processes.
Also, I think we may want to export a pathway-info.tsv with identifier, name, n_mutations, and maybe some of the other Hetionet attributes such as source.
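One way the suggested pathway-info.tsv export might look is sketched below. Everything here is illustrative: the toy `pathway_info_df` and `sample_pathway_df` frames, the pathway identifiers, and the reading of `n_mutations` as "number of samples flagged for the pathway" are assumptions, not something the pull request defines.

```python
import io

import pandas as pd

# Hypothetical pathway metadata, standing in for attributes returned by the Hetionet query.
pathway_info_df = pd.DataFrame({
    "identifier": ["PW1", "PW2"],
    "name": ["Pathway one", "Pathway two"],
    "source": ["source A", "source B"],
})

# Hypothetical sample-by-pathway matrix (1 = sample has a mutated gene in the pathway).
sample_pathway_df = pd.DataFrame(
    [[1, 0], [1, 1], [0, 0]],
    index=["s1", "s2", "s3"],
    columns=["PW1", "PW2"],
)

# Assumption: n_mutations counts the samples flagged for each pathway.
n_mutations = sample_pathway_df.sum(axis=0)
pathway_info_df["n_mutations"] = pathway_info_df["identifier"].map(n_mutations)

# Write tab-separated text. A buffer is used here so the sketch has no side effects;
# passing a path such as "pathway-info.tsv" to to_csv works the same way.
buffer = io.StringIO()
pathway_info_df.to_csv(buffer, sep="\t", index=False)
tsv_text = buffer.getvalue()
```

If `n_mutations` is meant to count something else (for example mutated genes per pathway), only the line computing `n_mutations` would need to change.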
columns=pathways)
# Now populate this data frame. This is a slow Python loop, hence the progress bar. It takes a few minutes on my laptop. The idea is to loop over all gene-pathway interactions in the hetnet query. If the gene is in the Cognoma dataset, we grab the pathway id in that gene-pathway interaction. We look at Cognoma samples where that gene is labeled 1, i.e., at Cognoma samples that have a mutation in that gene, and grab the corresponding indices. Then, in the pathway matrix all samples get the associated pathway tagged as a 1, since they have a mutated gene that participates in that pathway.
I would consider the following strategy:
- melt mutation_df so you get a sample_id, entrez_gene_id, mutation dataframe.
- filter that for mutation == 1
- merge with pathway_df
- pivot_table to create a sample by pathway matrix. See this example for this step.
Hopefully that's a little bit faster and more readable. You won't be able to use the progress bar anymore.
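The melt, filter, merge, pivot_table strategy suggested above could be sketched roughly as follows. The toy data, the `pathway_id` column name, and the sample and gene identifiers are assumptions made for illustration; the real `mutation_df` and `pathway_df` in the notebook may be shaped differently.

```python
import pandas as pd

# Hypothetical sample-by-gene mutation matrix (1 = mutated), indexed by sample_id.
mutation_df = pd.DataFrame(
    [[1, 0, 0], [0, 1, 0], [0, 0, 0]],
    index=pd.Index(["s1", "s2", "s3"], name="sample_id"),
    columns=[10, 20, 30],
)

# Hypothetical gene-pathway pairs, standing in for the Hetionet query result.
pathway_df = pd.DataFrame({
    "entrez_gene_id": [10, 20, 20],
    "pathway_id": ["PW1", "PW1", "PW2"],
})

# 1. Melt the wide matrix into long (sample_id, entrez_gene_id, mutation) rows.
long_df = mutation_df.reset_index().melt(
    id_vars="sample_id", var_name="entrez_gene_id", value_name="mutation"
)
# Make the merge key dtype explicit so it matches pathway_df.
long_df["entrez_gene_id"] = long_df["entrez_gene_id"].astype(int)

# 2. Keep only the mutated sample-gene pairs.
mutated_df = long_df[long_df["mutation"] == 1]

# 3. Attach every pathway each mutated gene participates in.
merged_df = mutated_df.merge(pathway_df, on="entrez_gene_id")

# 4. Pivot to a sample-by-pathway matrix; "max" collapses multiple genes per pathway to 1.
sample_pathway_df = merged_df.pivot_table(
    index="sample_id", columns="pathway_id", values="mutation",
    aggfunc="max", fill_value=0,
)

# Samples with no mutated pathway gene drop out of the pivot, so add them back as zeros.
sample_pathway_df = sample_pathway_df.reindex(mutation_df.index, fill_value=0).astype(int)
```

The final `reindex` matters because filtering on `mutation == 1` removes samples without any relevant mutation; without it the result would have fewer rows than `mutation_df`.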
import pandas as pd
import numpy as np
import os
from ipywidgets import FloatProgress
Didn't know about this... looks cool, and nice that it's built in to notebooks.
number_of_samples = len(mutation_df)
number_of_pathways = len(pathways)
sample_pathway_df = pd.DataFrame(np.zeros((number_of_samples, number_of_pathways), dtype=np.int),
                                 index=mutation_df.index,
I like only a single indent here. May be personal preference, but this style is a space hog. Maybe move np.zeros... to a new line to improve readability.
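As an illustration of the single-indent layout described above, the constructor call might be written like this. The toy `mutation_df` and `pathways` values are placeholders, and `dtype=int` is used in this sketch instead of `np.int` because that alias was removed in NumPy 1.24.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the notebook's mutation matrix and pathway list.
mutation_df = pd.DataFrame([[1, 0], [0, 1]], index=["s1", "s2"], columns=[10, 20])
pathways = ["PW1", "PW2", "PW3"]

number_of_samples = len(mutation_df)
number_of_pathways = len(pathways)

# Each argument sits on its own line with one level of indentation.
sample_pathway_df = pd.DataFrame(
    np.zeros((number_of_samples, number_of_pathways), dtype=int),
    index=mutation_df.index,
    columns=pathways,
)
```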
First attempt, would be very grateful for some code review!