
Create sample-by-pathway dataset #25

Closed
wants to merge 5 commits into from


stephenshank
Member

First attempt, would be very grateful for some code review!

@stephenshank
Member Author

I may have done something undesirable with git/github... for some reason, changes from my last pull request are showing up in this one. The file to look at is 5.hetnet-pathways.ipynb.

@dhimmel
Member

dhimmel commented Sep 19, 2016

I may have done something undesirable with git/github...

You need to rebase your pull request on top of the latest upstream master. This will be a little more difficult because you based your pull request off of your master branch. Usually you make pull requests using a new branch, so your master branch can easily stay synced with upstream.

If you've configured cognoma/cancer-data to be upstream, then you should be able to do the following, while your master branch is checked out:

git fetch upstream
git rebase upstream/master
git push --force

@dhimmel changed the title from "Create sample-by-pathway dataset." to "Create sample-by-pathway dataset" on Sep 19, 2016
hetnet_results = pd.DataFrame()
with driver.session() as session:
    result = session.run(query)
    hetnet_results = ( pd.DataFrame((x.values() for x in result), columns=result.keys())
Member

Can we call hetnet_results something more informative such as pathway_df?

Member

not the prettiest spacing

Member

@dhimmel dhimmel left a comment

Very nice. Awesome to see that you were able to interact programmatically with Hetionet. I made a few comments in addition to the rebase required as discussed above.

How long did the queries take? May also be nice to do this for Biological Processes.

Also I think we may want to export a pathway-info.tsv with identifier, name, n_mutations, and maybe some of the other Hetionet attributes such as source.
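
A rough sketch of what that export could look like, in case it helps. Everything here is toy data: `pathway_attr_df`, the `source` values, and the output target are hypothetical stand-ins, and `n_mutations` is assumed to mean the number of samples flagged for each pathway (the real definition may differ).

```python
import io

import pandas as pd

# Hypothetical per-pathway attributes, standing in for what the Hetionet query returns.
pathway_attr_df = pd.DataFrame({
    "identifier": ["PW_A", "PW_B", "PW_C"],
    "name": ["Pathway A", "Pathway B", "Pathway C"],
    "source": ["source_1", "source_2", "source_1"],
})

# Toy sample-by-pathway matrix (1 = sample has a mutated gene in the pathway).
sample_pathway_df = pd.DataFrame(
    [[1, 1, 0], [0, 1, 0], [0, 0, 0]],
    index=pd.Index(["s1", "s2", "s3"], name="sample_id"),
    columns=["PW_A", "PW_B", "PW_C"],
)

# Assumption: n_mutations counts the samples flagged for each pathway.
n_mutations = sample_pathway_df.sum(axis=0)

# Look up the count for each identifier; pathways missing from the matrix get 0.
pathway_info_df = pathway_attr_df.assign(
    n_mutations=pathway_attr_df["identifier"].map(n_mutations).fillna(0).astype(int)
)
pathway_info_df = pathway_info_df[["identifier", "name", "n_mutations", "source"]]

# Written to an in-memory buffer here; pass a file path such as
# "pathway-info.tsv" to to_csv to write the file instead.
buffer = io.StringIO()
pathway_info_df.to_csv(buffer, sep="\t", index=False)
```

Keeping `identifier` as the first column makes it easy to join this table back onto the columns of the sample-by-pathway matrix later.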

columns=pathways)


# Now populate this data frame. This is a slow Python loop, hence the progress bar. It takes a few minutes on my laptop. The idea is to loop over all gene-pathway interactions in the hetnet query. If the gene is in the Cognoma dataset, we grab the pathway id in that gene-pathway interaction. We look at Cognoma samples where that gene is labeled 1, i.e., at Cognoma samples that have a mutation in that gene, and grab the corresponding indices. Then, in the pathway matrix all samples get the associated pathway tagged as a 1, since they have a mutated gene that participates in that pathway.
Member

@dhimmel dhimmel Sep 19, 2016

I would consider the following strategy:

  1. melt mutation_df so you get a dataframe with sample_id, entrez_gene_id, and mutation columns.
  2. filter that for mutation == 1
  3. merge with pathway_df
  4. pivot_table to create a sample by pathway matrix. See this example for this step.

Hopefully that's a little bit faster and more readable. You won't be able to use the progress bar anymore.
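
The four steps above could be sketched roughly as follows. This is a minimal illustration on toy data, not the notebook's actual code: the column names `sample_id`, `entrez_gene_id`, and `pathway_id`, and the layout of `pathway_df`, are assumptions about how the real frames are shaped.

```python
import pandas as pd

# Toy stand-in for the mutation matrix: rows are samples, columns are
# Entrez gene IDs, and a 1 means the gene is mutated in that sample.
mutation_df = pd.DataFrame(
    [[1, 0, 1], [0, 0, 1], [0, 0, 0]],
    index=pd.Index(["s1", "s2", "s3"], name="sample_id"),
    columns=pd.Index([10, 20, 30], name="entrez_gene_id"),
)

# Toy stand-in for the Hetionet result: one row per (gene, pathway) pair.
pathway_df = pd.DataFrame({
    "entrez_gene_id": [10, 10, 30, 40],
    "pathway_id": ["PW_A", "PW_B", "PW_B", "PW_C"],
})

# 1. Melt to long format: one row per (sample, gene).
long_df = mutation_df.reset_index().melt(
    id_vars="sample_id", var_name="entrez_gene_id", value_name="mutation"
)
# Make the merge key an integer column so it matches pathway_df.
long_df["entrez_gene_id"] = long_df["entrez_gene_id"].astype(int)

# 2. Keep only the mutated (sample, gene) pairs.
mutated_df = long_df[long_df["mutation"] == 1]

# 3. Attach every pathway each mutated gene participates in.
merged_df = mutated_df.merge(pathway_df, on="entrez_gene_id")

# 4. Pivot to a sample-by-pathway matrix. Taking the max keeps the matrix
#    binary even when several mutated genes hit the same pathway.
all_pathways = sorted(pathway_df["pathway_id"].unique())
sample_pathway_df = (
    merged_df.pivot_table(
        index="sample_id",
        columns="pathway_id",
        values="mutation",
        aggfunc="max",
        fill_value=0,
    )
    # Restore samples and pathways that had no mutated genes at all.
    .reindex(index=mutation_df.index, columns=all_pathways, fill_value=0)
    .astype(int)
)
```

The `reindex` step matters because `pivot_table` only emits rows and columns that appear in the merged data, so samples with no mutations (and pathways with no mutated genes) would otherwise be silently dropped.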

import pandas as pd
import numpy as np
import os
from ipywidgets import FloatProgress
Member

Didn't know about this... looks cool, and nice that it's built in to notebooks.

number_of_samples = len(mutation_df)
number_of_pathways = len(pathways)
sample_pathway_df = pd.DataFrame(np.zeros((number_of_samples, number_of_pathways), dtype=np.int),
index=mutation_df.index,
Member

I like only a single indent here. May be personal preference, but this style is a space hog. Maybe move np.zeros... to a new line to improve readability.
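
For illustration, the suggested layout might look something like this. `mutation_df` and `pathways` are toy placeholders here. One deliberate difference from the hunk above: the sketch uses `dtype=int`, because the `np.int` alias was deprecated and later removed from NumPy (as of 1.24).

```python
import numpy as np
import pandas as pd

# Toy placeholders for the notebook's mutation matrix and pathway list.
mutation_df = pd.DataFrame(
    [[1, 0], [0, 1], [0, 0]],
    index=["s1", "s2", "s3"],
    columns=[10, 20],
)
pathways = ["PW_A", "PW_B", "PW_C"]

number_of_samples = len(mutation_df)
number_of_pathways = len(pathways)

# Single indent, one argument per line, with np.zeros on its own line.
sample_pathway_df = pd.DataFrame(
    np.zeros((number_of_samples, number_of_pathways), dtype=int),
    index=mutation_df.index,
    columns=pathways,
)
```

Breaking after the opening parenthesis keeps the arguments at one indent level regardless of how long the variable name on the left is.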


2 participants