variant calling with paired-end sequencing #968

Closed
emmecola opened this issue Feb 16, 2024 · 3 comments

emmecola commented Feb 16, 2024

Dear fgbio developers

We are using fgbio for a sequencing project, and I need a clarification regarding the consensus generation. Reading from the documentation, I understand that “this tool calls each end of a pair independently, and does not jointly call bases that overlap within a pair”.

If I understand correctly, this basically means that the two ends of a read pair will generate two separate consensus sequences, even if they are tagged with the same UMI and map to the same genomic location (they may even overlap).

If that’s the case, when I call the variants on these consensus sequences, the two ends of the same read pair will count as two molecules. Correct? So, looking at the final VCF file, one cannot distinguish a variant detected in two different molecules from a variant that is detected in the two ends of a read pair. Is my reasoning correct?

Thanks!

nh13 added the question label Feb 16, 2024

nh13 commented Feb 16, 2024

We added support for consensus calling the overlapping bases in a read pair, which is turned on by default in both CallMolecularConsensusReads and CallDuplexConsensusReads: #805. A separate tool, CallOverlappingConsensusBases, can also perform this correction. The two consensus callers apply the “consensus” method (described in the CallOverlappingConsensusBases documentation) to the raw reads before building the consensus for each end of the pair independently.
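
To make the idea of reconciling overlapping bases concrete, here is a minimal Python sketch of one common scheme: when both mates agree at a position, keep the base and add the qualities; when they disagree, keep the higher-quality base with the quality difference. This is an illustration under assumptions, not fgbio's actual code. The function name `reconcile_overlap`, the tie handling (mask to `N`), and the `MIN_QUAL` floor are all hypothetical; the CallOverlappingConsensusBases documentation remains the authority on the real behavior.

```python
from typing import Tuple

MIN_QUAL = 2  # assumed floor quality for a masked base (hypothetical choice)


def reconcile_overlap(base1: str, qual1: int, base2: str, qual2: int) -> Tuple[str, int]:
    """Return a single (base, quality) for a position covered by both mates."""
    if base1 == base2:
        # Agreement: both observations support the same base.
        return base1, qual1 + qual2
    if qual1 == qual2:
        # Disagreement with equal evidence: no way to pick a winner, so mask.
        return "N", MIN_QUAL
    # Disagreement: keep the better-supported base, penalized by the other.
    if qual1 > qual2:
        return base1, qual1 - qual2
    return base2, qual2 - qual1


if __name__ == "__main__":
    print(reconcile_overlap("A", 30, "A", 20))
    print(reconcile_overlap("A", 30, "C", 10))
    print(reconcile_overlap("A", 30, "C", 30))
```

In this scheme the corrected base would be written back to both mates, so each end still yields its own consensus read afterwards.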

I hope that helps!

nh13 closed this as completed Feb 16, 2024

emmecola commented

Thank you! So the information coming from the overlapping mate read is taken into account, but in any case two separate consensus sequences will be reported in the output file. Can you confirm that my understanding is correct?

I think this is an important point to consider when calling variants on the consensus sequences: as far as I can see, the real number of distinct template molecules supporting a mutation might be lower than the number of consensus sequences carrying the mutation, simply because the same molecule is represented twice in the consensus file.

nh13 commented Feb 16, 2024

You are correct that two consensus reads (one for each end of the pair) are output. But variant callers (like GATK) do not count the same base twice when reporting a mutation from an overlapping read pair. This holds regardless of whether the read pair is raw or made from a consensus.
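
As a rough illustration of the difference between counting reads and counting fragments, the Python sketch below tallies support for an alternate allele both ways. It assumes only that the two mates of a pair share the same read name, as they do in SAM/BAM. The function `count_support` and the read names are hypothetical, and this is not how GATK or any particular caller is implemented.

```python
from typing import Iterable, Tuple


def count_support(observations: Iterable[Tuple[str, str]], alt: str) -> Tuple[int, int]:
    """Count alt-allele support per read and per fragment.

    observations: (read_name, base_at_site) for every read covering the site.
    Returns (reads carrying alt, distinct fragments carrying alt).
    """
    alt_names = [name for name, base in observations if base == alt]
    # Mates of one pair share a read name, so a set collapses them to one fragment.
    return len(alt_names), len(set(alt_names))


# Hypothetical pileup: both mates of consensus_1 overlap the site and carry T.
obs = [
    ("consensus_1", "T"),
    ("consensus_1", "T"),
    ("consensus_2", "T"),
    ("consensus_3", "C"),
]
reads, fragments = count_support(obs, "T")
print(reads, fragments)
```

Here three reads carry the alternate base but only two fragments do, which is the fragment-level count that matters for molecule support.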
