Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Indel #1394

Merged
merged 28 commits into from
Apr 4, 2020
Merged

Indel #1394

merged 28 commits into from
Apr 4, 2020

Conversation

ekiefl
Copy link
Contributor

@ekiefl ekiefl commented Apr 3, 2020

PR opens up several new doors into the anvi-profile world.

indels can be stored in DB

With --profile-indels, one can store each indel. Here is the table structure:

image

percentage ID filtering during anvi-profile

With --min-percent-identity, one can filter reads based on percentage identity.

ekiefl added 17 commits April 1, 2020 14:13
in a "speed is everything" mentally. In actuality, including the reference
sequence is less than ~5% cost in speed and opens up many doors
parameters, is seen by the correct classses, is a created table during
profile db creation
is passed, which makes room for any number of read processing methods to
be applied
method to BAMFileObject. Add --min-percent-identity flag to
anvi-profile; currently applies to coverage data, not SNVs/SAAVs
@ekiefl ekiefl requested a review from meren April 3, 2020 03:04
@ekiefl
Copy link
Contributor Author

ekiefl commented Apr 3, 2020

Checking accuracy of indels

Here are the first couple in my test dataset compared against IGV

image

Here are screenshots of the regions read from top to bottom.

image

image

image

image

image

image

image

image

image

image

Verdict

Everything I have checked out so far has worked. There is obviously a shit-load we can do with this, but for now in this PR I simply want to make sure the information reported is correct and succinct.

@@ -36,7 +36,7 @@ class ComputeCoverage(object):
self.sanity_check()
self.f = self.init_output()

self.bam = pysam.Samfile(self.args.bam_file, 'rb')
self.bam = bamops.BAMFileObject(self.args.bam_file, 'rb')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this mean this previous version of this line will explode in tagged v6.2.? :) asking for a friend.

Copy link
Contributor Author

@ekiefl ekiefl Apr 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fortunately, no. Samfile is a deprecated but functioning class as of pysam 0.15.4 (EDIT: which is what I've been using)...

But I'm realizing that the requirements.txt has 0.15.2.... What should we do?

EDIT: I can't get on midway right now, but I suspect the master environment is using 0.15.2 and it is working like a charm

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything is working in v6.2 which has 0.15.4 for pysam, so we are good (about the requirements.txt: I will update it soon .. it is a very confusing situation right now. The conda package (which I care more about) has different versions set for many of those requirements).

But I run into a bug in anvi-script-get-coverage-from-bam that is irrelevant to this discussion.

cd $anvio/tests
./run_mini_test.sh
cd sandbox/test-output/

# this works perfect
anvi-script-get-coverage-from-bam -b SAMPLE-01.bam -c 204_10M_MERGED.PERFECT.gz.keep_contig_878 -o test.txt -m pos

# this results in an empty file (here method accepts contig for
# -m even though help says option #1 (-c) is not valid for this):
anvi-script-get-coverage-from-bam -b SAMPLE-01.bam -c 204_10M_MERGED.PERFECT.gz.keep_contig_878 -o test.txt -m contig

# but this results in an empty file, too:
echo "204_10M_MERGED.PERFECT.gz.keep_contig_878" > contig_list.txt
anvi-script-get-coverage-from-bam -b SAMPLE-01.bam -l contig_list.txt -o test.txt -m contig

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the catch. Fixed in 5f6632c

@meren
Copy link
Member

meren commented Apr 3, 2020

profile db version is one up, there is a migration script, I just tested on infant gut data and it seems to be working. what is the added time to profiling? you had the performance tests from the previous profiler revamping effort. did you re-run any of them?

one problem is that our online documentation fails to represent any of these additions. we will be fine, but this will impact others' ability to use or understand these improvements. I am not sure what is the best way to solve this :/

thank you Evan! great contribution, and I'm certain we will find great ways to exploit the information in the indels table.

SNVs, INDELs, SCVs, SAAVs looks just much more consistent than SNVs, indels, SCVs, SAAVs.
@meren
Copy link
Member

meren commented Apr 3, 2020

@ekiefl, please update your branch and run mini-test :)

@ekiefl
Copy link
Contributor Author

ekiefl commented Apr 3, 2020

@ekiefl, please update your branch and run mini-test :)

✖ anvi-profile encountered an error after 0:22:57.205934
Traceback (most recent call last):
  File "/Users/evan/Software/anvio/bin/anvi-profile", line 98, in <module>
    main(args)
  File "/Users/evan/Software/anvio/anvio/terminal.py", line 748, in wrapper
    program_method(*args, **kwargs)
  File "/Users/evan/Software/anvio/bin/anvi-profile", line 33, in main
    profiler.BAMProfiler(args)._run()
  File "/Users/evan/Software/anvio/anvio/profiler.py", line 283, in _run
    self.profile_single_thread()
  File "/Users/evan/Software/anvio/anvio/profiler.py", line 731, in profile_single_thread
    contig = self.process_contig(bam_file, contig_name, contig_length)
  File "/Users/evan/Software/anvio/anvio/profiler.py", line 685, in process_contig
    split.auxiliary.process(bam_file)
  File "/Users/evan/Software/anvio/anvio/contigops.py", line 210, in process
    self.run_SNVs_and_indels(bam)
  File "/Users/evan/Software/anvio/anvio/contigops.py", line 452, in run_SNVs_and_indels
    for read in read_iterator(self.split.parent, self.split.start, self.split.end, **kwargs):
  File "/Users/evan/Software/anvio/anvio/bamops.py", line 79, in fetch_and_trim
    read = Read(read)
  File "/Users/evan/Software/anvio/anvio/bamops.py", line 153, in __init__
    self.reference_sequence = np.frombuffer(read.get_reference_sequence().upper().encode('ascii'), np.uint8)
  File "pysam/libcalignedsegment.pyx", line 1841, in pysam.libcalignedsegment.AlignedSegment.get_reference_sequence
  File "pysam/libcalignedsegment.pyx", line 864, in pysam.libcalignedsegment.build_reference_sequence
ValueError: MD tag not present

What kind of diabolical .bam files are these? MD tags specify info about the reference bases at the aligned positions. I can create a pre-test in profiler.py that opens up a handful of reads from the BAM file and if none of them contain an MD tag, I will run a warning and remove --min-percent-identity. What do you think?

It would helpful if these BAM files were updated so --min-percent-identity could be tested in mini test.

@meren
Copy link
Member

meren commented Apr 3, 2020

I can create a pre-test in profiler.py that opens up a handful of reads from the BAM file and if none of them contain an MD tag, I will run a warning and remove --min-percent-identity.

I think this would be a great way to solve it. By removing it, I assumed you mean setting --min-percent-identity to None or 0.0 in cases where it can't be applicable.

It would helpful if these BAM files were updated so --min-percent-identity could be tested in mini test.

To be honest I hate these BAM files because they are ancient. But updating them will require A LOT of work, so I've been being lazy about it. But I will put an issue now and this will be addressed sooner or later.

@ekiefl
Copy link
Contributor Author

ekiefl commented Apr 3, 2020

To be honest I hate these BAM files because they are ancient. But updating them will require A LOT of work, so I've been being lazy about it. But I will put an issue now and this will be addressed sooner or later.

Yeah, sooooo much of our tests are reliant on these lil guys so I can imagine. We may hate them but they discovered an important bug :)

@meren
Copy link
Member

meren commented Apr 3, 2020

True. I'm actually sending some virtual hugs to those shitty-but-ours BAM files.

@ekiefl
Copy link
Contributor Author

ekiefl commented Apr 4, 2020

Ok I measured the time taken and database sizes for the following cases.

  1. anvi-profile in master branch

Time taken: ✓ anvi-profile took 0:30:27.722558
DB size: 2.4GB


  1. anvi-profile in indel branch

Time taken: ✓ anvi-profile took 0:31:46.993346
DB size: 2.4GB

Things are just slower. Ok, fine. I think it's because of a new nested yield in Coverage and because Read.__init__ has one more pysam attribute lookup and an if statement. Amazing that there are so many reads that this adds up to a minute.


  1. anvi-profile --profile-indels

Time taken: ✓ anvi-profile took 0:31:49.627898
DB size: 2.6GB

The cost is almost nothing to --profile-indels. And the DB size is only 200MB larger.


  1. anvi-profile --min-percent-identity 95

Time taken: ✓ anvi-profile took 0:35:12.351169
DB size: 1.5GB

The DB is smaller because there are less SNVs. The time cost for filtering is appreciable but not insane and probably not noticeable to the user.


Ok, based on these timings, I think we should get rid of --profile-indels and the INDEL calculation should piggy-back on --skip-SNV-profiling: If --skip-SNV-profiling, they don't get SNVs or INDELs. Otherwise they get both. Does this seem reasonable @meren? After this is settled I will merge.

@meren
Copy link
Member

meren commented Apr 4, 2020

Ok, based on these timings, I think we should get rid of --profile-indels and the INDEL calculation should piggy-back on --skip-SNV-profiling: If --skip-SNV-profiling, they don't get SNVs or INDELs.

I agree with this! Probably INDEL time/space requirements will change as a function of the distance between environmental populations and the genome that recruits reads from them. But I like the idea to make it the default setting regardless.

I also agree with --skip-SNV-profiling turning off INDEL profiling, but I think there should also be an independent --skip-INDEL-profiling flag. I think it is a better practice.

Other than this, I think it is ready to merge as far as I can see :)

multi-thread and --skip-SNV, so I deleted what are most likely
unneecessary delete statements
is to run INDELs. They are skipped if --skip-SNV or --skip-INDEL
@ekiefl
Copy link
Contributor Author

ekiefl commented Apr 4, 2020

Ok here goes!

@ekiefl ekiefl merged commit d79675b into master Apr 4, 2020
@ekiefl ekiefl deleted the indel branch April 4, 2020 22:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants