variant calling with paired-end sequencing #968

emmecola · 2024-02-16T07:44:23Z

Dear fgbio developers

We are using fgbio for a sequencing project, and I need a clarification regarding the consensus generation. Reading from the documentation, I understand that “this tool calls each end of a pair independently, and does not jointly call bases that overlap within a pair”.

If I understand correctly, this basically means that the two ends of a read pair will generate two separate consensus sequences, even if they are tagged with the same UMI and map to the same genomic location (they may even overlap).

If that’s the case, when I call the variants on these consensus sequences, the two ends of the same read pair will count as two molecules. Correct? So, looking at the final VCF file, one cannot distinguish a variant detected in two different molecules from a variant that is detected in the two ends of a read pair. Is my reasoning correct?

Thanks!

nh13 · 2024-02-16T17:42:06Z

We did add support for consensus calling the overlapping bases in a read pair which is turned on by default in both CallMolecularConsensusReads and CallDuplexConsensusReads: #805. There exists a separate tool CallOverlappingConsensusBases that can also perform this correction. The former tools use the consensus method (see the latter's documentation) on the raw reads prior to building the consensus for each read independently.

I hope that helps!

emmecola · 2024-02-16T21:46:00Z

Thank you! So the information coming from the overlapping mate read is taken into account, but in any case two separate consensus sequences will be reported in the output file. Can you confirm that my understanding is correct?

I think this is an important point to consider when calling variants on the consensus sequences: as far as I can see, the real number of distinct template molecules supporting a mutation might be lower than the number of consensus sequences carrying the mutation, simply because the same molecule is represented twice in the consensus file.

nh13 · 2024-02-16T22:47:10Z

You are correct that two read pairs are output. But variant callers (like GATK) do not count the same bace twice when reporting a mutation from an overlapping read pair. This is regardless of whether the read is a raw read pair or made from a consensus.

nh13 added the question label Feb 16, 2024

nh13 closed this as completed Feb 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

variant calling with paired-end sequencing #968

variant calling with paired-end sequencing #968

emmecola commented Feb 16, 2024 •

edited

Loading

nh13 commented Feb 16, 2024

emmecola commented Feb 16, 2024

nh13 commented Feb 16, 2024

variant calling with paired-end sequencing #968

variant calling with paired-end sequencing #968

Comments

emmecola commented Feb 16, 2024 • edited Loading

nh13 commented Feb 16, 2024

emmecola commented Feb 16, 2024

nh13 commented Feb 16, 2024

emmecola commented Feb 16, 2024 •

edited

Loading