Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add function for calculating niches #831

Open
wants to merge 71 commits into
base: main
Choose a base branch
from
Open

Add function for calculating niches #831

wants to merge 71 commits into from

Conversation

LLehner
Copy link
Member

@LLehner LLehner commented May 27, 2024

Description

Adds a function that calculates niches using different strategies. The initial function calculates niches based on neighborhood profiles similar to here.

This PR will get updated with methods discussed in #789.

@LLehner LLehner marked this pull request as draft May 27, 2024 16:24
@LLehner LLehner requested a review from timtreis May 27, 2024 16:43
@timtreis timtreis added the squidpy2.0 Everything releated to a Squidpy 2.0 release label May 29, 2024
@LLehner LLehner marked this pull request as ready for review October 10, 2024 12:44
@LLehner LLehner requested a review from giovp October 10, 2024 12:44
@LLehner
Copy link
Member Author

LLehner commented Oct 10, 2024

@giovp @timtreis PR is ready for review!

Some additional questions that came up, which you could have a look at:

  1. How should multiple slides be dealt with? All these methods should work with multiple slides, but i'm still not sure how the adjacency matrix changes when you run sq.gr.spatial_neighbors() on data with multiple slides. Is it a block diagonal matrix where each block on the diagonal is the adjacency matrix for a single slide? If yes, does it suffice to calculate the neighborhood graph once for all data or should it be done on individual slides? I noticed this can change niche results.
  2. I think some parts could be sped up by parallelization (e.g. clustering with multiple resolutions), what would you recommend there?
  3. If you run sq.calculate_niche() more than one time with flavor="neighborhood" i noticed that the first function call takes as much time as you would expect but for subsequent runs it's much (~10x) faster (e.g. first you run the method with one cluster resolution then call the method again with some other cluster resolutions). Its almost like the following function runs don't do neighborhood calculations (referring to sc.pp.neighbors here) anymore, but that shouldn't be the case, as nothing is cached and calculations happen on a new AnnData object for every function call. Also the "issue" persists even if you change the count matrix shape (by e.g. masking data).
  4. Would you include some form of logging or verbosity such that user sees what the method is currently doing? Depending on the data and settings, it the function can take a while.

@timtreis
Copy link
Member

Hey @LLehner

How should multiple slides be dealt with? All these methods should work with multiple slides, but i'm still not sure how the adjacency matrix changes when you run sq.gr.spatial_neighbors() on data with multiple slides. Is it a block diagonal matrix where each block on the diagonal is the adjacency matrix for a single slide? If yes, does it suffice to calculate the neighborhood graph once for all data or should it be done on individual slides? I noticed this can change niche results.

You mean a scenario where the user is just storing multiple slides in the same sdata object, like the point8, point16, ´point24` dataset we have? These should be fully independent since they have no biological connection.

I think some parts could be sped up by parallelization (e.g. clustering with multiple resolutions), what would you recommend there?

Do you have some rough numbers? We could give the user the option to define n_cores or sth and then use the parallel processing we already have, but we should definitely keep the option to run single-threaded so that it can be used inside tools like snakemake that take care of job distribution across cores. Otherwise, that'd cause errors the user cannot circumvent.

If you run sq.calculate_niche() more than one time with flavor="neighborhood" i noticed that the first function call takes as much time as you would expect but for subsequent runs it's much (~10x) faster (e.g. first you run the method with one cluster resolution then call the method again with some other cluster resolutions). Its almost like the following function runs don't do neighborhood calculations (referring to sc.pp.neighbors here) anymore, but that shouldn't be the case, as nothing is cached and calculations happen on a new AnnData object for every function call. Also the "issue" persists even if you change the count matrix shape (by e.g. masking data).

Could it be that the OS has some of the compiled bytecode or data in cache? 🤔

Would you include some form of logging or verbosity such that user sees what the method is currently doing? Depending on the data and settings, it the function can take a while.

What runtime are we speaking here about? I tend to be a fan of some intermediate output (if one can also shut it up, f.e. with some verbosity level)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
graph 🕸️ squidpy2.0 Everything releated to a Squidpy 2.0 release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants