Create sample-by-pathway dataset #25

stephenshank · 2016-09-19T14:49:03Z

First attempt, would be very grateful for some code review!

stephenshank · 2016-09-19T14:51:34Z

I may have done something undesirable with git/github... for some reason, changes from my last pull request are showing up in this one. The file to look at is 5.hetnet-pathways.ipynb.

dhimmel · 2016-09-19T14:57:14Z

> I may have done something undesirable with git/github...

You need to rebase your pull request on top of the latest upstream master. This will be a little more difficult because you based your pull request off of your master branch. Usually you make pull requests using a new branch, so your master branch can easily stay synced with upstream.

If you've configured cognoma/cancer-data to be upstream, then you should be able to do the following, while your master branch is checked out:

git fetch upstream
git rebase upstream/master
git push --force

dhimmel · 2016-09-19T17:10:15Z

scripts/5.hetnet-pathways.py

+hetnet_results = pd.DataFrame()
+with driver.session() as session:
+ result = session.run(query)
+ hetnet_results = ( pd.DataFrame((x.values() for x in result), columns=result.keys())


Can we call hetnet_results something more informative such as pathway_df?

not the prettiest spacing

dhimmel

Very nice. Awesome to see that you were able to interact programmatically with Hetionet. I made a few comments in addition to the rebase required as discussed above.

How long did the queries take? May also be nice to do this for Biological Processes.

Also I think we may want to export a pathway-info.tsv with identifier, name, n_mutations, and maybe some of the other Hetionet attributes such as source.
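A minimal sketch of what such an export could look like is below. The data are toy values, `pathway_attr_df` is a hypothetical name for the pathway attributes pulled from Hetionet, and `n_mutations` is assumed here to mean the number of samples with at least one mutated gene in the pathway; a different definition may be intended.

```python
import pandas as pd

# Hypothetical pathway attributes, standing in for what a Hetionet query returns.
pathway_attr_df = pd.DataFrame({
    "identifier": ["PW_A", "PW_B"],
    "name": ["Pathway A", "Pathway B"],
    "source": ["source_x", "source_y"],
})

# Toy sample-by-pathway matrix (1 = sample has a mutated gene in the pathway).
sample_pathway_df = pd.DataFrame(
    [[1, 0], [1, 1], [0, 0]],
    index=["s1", "s2", "s3"],
    columns=["PW_A", "PW_B"],
)

# Assumption: n_mutations counts affected samples per pathway.
n_mutations = sample_pathway_df.sum(axis=0).rename("n_mutations")

# Attach the counts to the attribute table and fix the column order.
pathway_info_df = pathway_attr_df.merge(
    n_mutations, left_on="identifier", right_index=True, how="left"
)
pathway_info_df = pathway_info_df[["identifier", "name", "n_mutations", "source"]]

# Render as TSV text; pass a path such as "pathway-info.tsv" to write a file.
tsv_text = pathway_info_df.to_csv(sep="\t", index=False)
```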

dhimmel · 2016-09-19T17:15:35Z

scripts/5.hetnet-pathways.py

+ columns=pathways)
+
+
+# Now populate this data frame. This is a slow Python loop, hence the progress bar. It takes a few minutes on my laptop. The idea is to loop over all gene-pathway interactions in the hetnet query. If the gene is in the Cognoma dataset, we grab the pathway id in that gene-pathway interaction. We look at Cognoma samples where that gene is labeled 1, i.e., at Cognoma samples that have a mutation in that gene, and grab the corresponding indices. Then, in the pathway matrix all samples get the associated pathway tagged as a 1, since they have a mutated gene that participates in that pathway.


I would consider the following strategy:

1. melt mutation_df so you get a sample_id, entrez_gene_id, mutation dataframe.
2. filter that for mutation == 1
3. merge with pathway_df
4. pivot_table to create a sample by pathway matrix. See this example for this step.

Hopefully that's a little bit faster and more readable. You won't be able to use the progress bar anymore.
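The strategy above can be sketched roughly as follows. The data are toy values, and the column name `pathway_id` and the index/column names on `mutation_df` are assumptions for illustration; the real notebook may name things differently.

```python
import pandas as pd

# Toy sample-by-gene mutation matrix (1 = mutated); values are made up.
mutation_df = pd.DataFrame(
    [[1, 0, 0], [0, 1, 1], [0, 0, 0]],
    index=pd.Index(["s1", "s2", "s3"], name="sample_id"),
    columns=pd.Index([10, 20, 30], name="entrez_gene_id"),
)

# Toy gene-pathway pairs standing in for the Hetionet query result.
pathway_df = pd.DataFrame({
    "entrez_gene_id": [10, 20, 30, 30],
    "pathway_id": ["PW_A", "PW_A", "PW_B", "PW_C"],
})

# 1. Melt to long format: one row per (sample, gene) pair.
long_df = mutation_df.reset_index().melt(
    id_vars="sample_id", var_name="entrez_gene_id", value_name="mutation"
)
# Make sure the gene ID dtype matches pathway_df before merging.
long_df["entrez_gene_id"] = long_df["entrez_gene_id"].astype(int)

# 2. Keep only the mutated pairs.
mutated_df = long_df[long_df["mutation"] == 1]

# 3. Attach every pathway each mutated gene participates in.
merged_df = mutated_df.merge(pathway_df, on="entrez_gene_id")

# 4. Pivot to a sample-by-pathway matrix; "max" collapses multiple
#    mutated genes in the same pathway to a single 1.
sample_pathway_df = (
    merged_df.pivot_table(
        index="sample_id",
        columns="pathway_id",
        values="mutation",
        aggfunc="max",
        fill_value=0,
    )
    .reindex(mutation_df.index, fill_value=0)  # restore samples with no hits
    .astype(int)
)
```

The final `reindex` is there because samples with no mutated pathway gene drop out after the filter and merge, so they would otherwise be missing from the matrix.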

dhimmel · 2016-09-19T17:16:10Z

scripts/5.hetnet-pathways.py

+import pandas as pd
+import numpy as np
+import os
+from ipywidgets import FloatProgress


Didn't know about this... looks cool, and nice that it's built in to notebooks.

dhimmel · 2016-09-19T17:19:38Z

scripts/5.hetnet-pathways.py

+number_of_samples = len(mutation_df)
+number_of_pathways = len(pathways) 
+sample_pathway_df = pd.DataFrame(np.zeros((number_of_samples, number_of_pathways), dtype=np.int),
+ index=mutation_df.index,


I like only a single indent here. May be personal preference, but this style is a space hog. Maybe move np.zeros... to a new line to improve readability.
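One possible layout with a single level of indentation and `np.zeros` on its own line is sketched below. `mutation_df` and `pathways` are toy stand-ins here, and the builtin `int` is used in place of `np.int`, which recent NumPy releases no longer provide.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the notebook's objects (values are made up).
mutation_df = pd.DataFrame(
    [[1, 0], [0, 1], [0, 0]],
    index=["s1", "s2", "s3"],
    columns=[10, 20],
)
pathways = ["PW_A", "PW_B", "PW_C"]

number_of_samples = len(mutation_df)
number_of_pathways = len(pathways)

# The opening parenthesis ends the line; each argument gets one indent level.
sample_pathway_df = pd.DataFrame(
    np.zeros((number_of_samples, number_of_pathways), dtype=int),
    index=mutation_df.index,
    columns=pathways,
)
```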

stephenshank added 5 commits September 16, 2016 16:18

Encodes categorical covariate data from samples.

ca90e19

Includes script file.

ec23d42

Changes names of covariates file, removes exploratory visualizations.

5df7450

Adds script for covariates.

1fc4727

First attempt at creating sample by pathway matrix.

2a6fa09

stephenshank mentioned this pull request Sep 19, 2016

Precomputing a sample × mutation-in-gene-set matrix #21

Open

dhimmel changed the title ~~Create sample-by-pathway dataset.~~ Create sample-by-pathway dataset Sep 19, 2016

dhimmel reviewed Sep 19, 2016


dhimmel requested changes Sep 19, 2016


stephenshank closed this Oct 10, 2019
