Indel #1394
in a "speed is everything" mentality. In actuality, including the reference sequence costs less than ~5% in speed and opens up many doors
parameters, is seen by the correct classes, is a created table during profile db creation
and initiates empty 'indels' table
is passed, which makes room for any number of read processing methods to be applied
method to BAMFileObject. Add --min-percent-identity flag to anvi-profile; currently applies to coverage data, not SNVs/SAAVs
Checking accuracy of indels
Here are the first couple in my test dataset compared against IGV.
Here are screenshots of the regions, read from top to bottom.
Verdict
Everything I have checked out so far has worked. There is obviously a shit-load we can do with this, but for now in this PR I simply want to make sure the information reported is correct and succinct.
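For anyone who wants to spot-check a read against IGV by hand, here is a minimal sketch of how indels can be read off a CIGAR string. It is not the code in this PR: the function name `indels_from_cigar`, the tuple layout, and the convention of reporting an insertion at the reference position that follows it are all assumptions made for illustration. Only the CIGAR operation semantics come from the SAM specification.

```python
import re

# One CIGAR token is a length followed by an operation code (SAM spec).
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

# Which operations advance along the reference and along the read (SAM spec).
CONSUMES_REF = set("MDN=X")
CONSUMES_READ = set("MIS=X")


def indels_from_cigar(cigar, ref_start, read_seq):
    """Return (type, ref_pos, length, inserted_sequence) for each indel.

    ref_start is the 0-based reference position of the first aligned base.
    Hypothetical convention: an insertion is reported at the reference
    position immediately following it; a deletion at its first deleted base.
    """
    ref_pos, read_pos = ref_start, 0
    indels = []
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op == "I":
            inserted = read_seq[read_pos:read_pos + length]
            indels.append(("INS", ref_pos, length, inserted))
        elif op == "D":
            indels.append(("DEL", ref_pos, length, ""))
        # Advance the two cursors according to the operation type.
        if op in CONSUMES_REF:
            ref_pos += length
        if op in CONSUMES_READ:
            read_pos += length
    return indels


example = indels_from_cigar("3M2I4M1D2M", 100, "AAACCGGGGTT")
```

The positions in `example` can then be compared against what IGV shows for the same read.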
@@ -36,7 +36,7 @@ class ComputeCoverage(object):
         self.sanity_check()
         self.f = self.init_output()
 
-        self.bam = pysam.Samfile(self.args.bam_file, 'rb')
+        self.bam = bamops.BAMFileObject(self.args.bam_file, 'rb')
Does this mean the previous version of this line will explode in the tagged v6.2? :) Asking for a friend.
Fortunately, no. Samfile is a deprecated but functioning class as of pysam 0.15.4 (EDIT: which is what I've been using)...
But I'm realizing that the requirements.txt has 0.15.2.... What should we do?
EDIT: I can't get on midway right now, but I suspect the master environment is using 0.15.2 and it is working like a charm
Everything is working in v6.2, which has 0.15.4 for pysam, so we are good (about the requirements.txt: I will update it soon .. it is a very confusing situation right now. The conda package (which I care more about) has different versions set for many of those requirements).
But I ran into a bug in anvi-script-get-coverage-from-bam that is irrelevant to this discussion.
cd $anvio/tests
./run_mini_test.sh
cd sandbox/test-output/
# this works perfect
anvi-script-get-coverage-from-bam -b SAMPLE-01.bam -c 204_10M_MERGED.PERFECT.gz.keep_contig_878 -o test.txt -m pos
# this results in an empty file (here method accepts contig for
# -m even though help says option #1 (-c) is not valid for this):
anvi-script-get-coverage-from-bam -b SAMPLE-01.bam -c 204_10M_MERGED.PERFECT.gz.keep_contig_878 -o test.txt -m contig
# but this results in an empty file, too:
echo "204_10M_MERGED.PERFECT.gz.keep_contig_878" > contig_list.txt
anvi-script-get-coverage-from-bam -b SAMPLE-01.bam -l contig_list.txt -o test.txt -m contig
Thanks for the catch. Fixed in 5f6632c
profile db version is one up, there is a migration script, I just tested on infant gut data and it seems to be working.
what is the added time to profiling? you had the performance tests from the previous profiler revamping effort. did you re-run any of them?
one problem is that our online documentation fails to represent any of these additions. we will be fine, but this will impact others' ability to use or understand these improvements. I am not sure what is the best way to solve this :/
thank you Evan! great contribution, and I'm certain we will find great ways to exploit the information in the indels table.
SNVs, INDELs, SCVs, SAAVs just looks much more consistent than SNVs, indels, SCVs, SAAVs.
@ekiefl, please update your branch and run mini-test :)
What kind of diabolical .bam files are these? MD tags specify info about the reference bases at the aligned positions. I can create a pre-test in It would be helpful if these BAM files were updated so
I think this would be a great way to solve it. By removing it, I assumed you mean setting
To be honest I hate these BAM files because they are ancient. But updating them will require A LOT of work, so I've been lazy about it. But I will open an issue now and this will be addressed sooner or later.
Yeah, sooooo much of our tests are reliant on these lil guys so I can imagine. We may hate them but they discovered an important bug :)
True. I'm actually sending some virtual hugs to those shitty-but-ours BAM files.
Ok I measured the time taken and database sizes for the following cases.
Time taken:
Time taken: Things are just slower. Ok, fine. I think it's because of a new nested yield in
Time taken: The cost is almost nothing to
Time taken: The DB is smaller because there are fewer SNVs. The time cost for filtering is appreciable but not insane and probably not noticeable to the user. Ok, based on these timings, I think we should get rid of
I agree with this! Probably INDEL time/space requirements will change as a function of the distance between environmental populations and the genome that recruits reads from them. But I like the idea to make it the default setting regardless. I also agree with Other than this, I think it is ready to merge as far as I can see :)
multi-thread and --skip-SNV, so I deleted what are most likely unnecessary delete statements
is to run INDELs. They are skipped if --skip-SNV or --skip-INDEL
Ok here goes!
PR opens up several new doors into the anvi-profile world.
indels can be stored in DB
With --profile-indels, one can store each indel. Here is the table structure:
percentage ID filtering during anvi-profile
With --min-percent-identity, one can filter reads based on percentage identity.
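To make the idea of a percent identity cutoff concrete, here is a rough sketch of one common way to score a read from its CIGAR string and its NM (edit distance) tag. This is an illustration only, not necessarily the formula anvi-profile uses. The helper names `percent_identity` and `passes_filter` are hypothetical, and the definition (aligned columns minus edit distance, divided by aligned columns, with soft and hard clips ignored) is an assumption.

```python
import re

# One CIGAR token is a length followed by an operation code (SAM spec).
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def percent_identity(cigar, edit_distance):
    """Percent identity of an alignment under an assumed definition.

    Aligned columns are counted as M, I, D, = and X operations; clipped
    bases (S, H) and skipped regions (N) are ignored. edit_distance is
    the value of the read's NM tag (mismatches plus indel bases).
    """
    aligned_columns = sum(
        int(length) for length, op in CIGAR_RE.findall(cigar) if op in "MID=X"
    )
    if aligned_columns == 0:
        return 0.0
    return 100.0 * (aligned_columns - edit_distance) / aligned_columns


def passes_filter(cigar, edit_distance, min_percent_identity):
    """True if the read would be kept at the given cutoff."""
    return percent_identity(cigar, edit_distance) >= min_percent_identity


# A 100-column alignment with a 2 bp insertion and 2 mismatches (NM = 4).
identity = percent_identity("50M2I48M", 4)
kept_at_95 = passes_filter("50M2I48M", 4, 95)
kept_at_97 = passes_filter("50M2I48M", 4, 97)
```

Under this definition the example read scores 96%, so it survives a cutoff of 95 but not 97. Other definitions (for instance, ones that ignore indel columns) would give different numbers for the same read.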