Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tool to consensus call overlapping bases in paired end reads. Adds direct support in the consensus calling tools too. #805

Merged
merged 25 commits into from
Apr 4, 2022
Merged
Show file tree
Hide file tree
Changes from 13 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion src/main/scala/com/fulcrumgenomics/bam/Bams.scala
Original file line number Diff line number Diff line change
Expand Up @@ -504,7 +504,7 @@ object Bams extends LazyLogging {
"Input was not queryname sorted or query grouped, found: " +
s"SO:${header.getSortOrder} GO:${header.getGroupOrder}" +
Option(header.getAttribute("SS")).map(ss => f" SS:$ss").getOrElse("") +
f". Use `samtools sort -n -u in.bam | fgbio $toolName -i /dev/stdin`" +
f". Use `fgbio --compression 0 SortBam -i in.bam -o out.bam -s queryname | fgbio $toolName -i /dev/stdin`" +
path.map(p => f"\nPath: $p").getOrElse("")
)
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,7 @@

package com.fulcrumgenomics.bam

import com.fulcrumgenomics.FgBioDef.{FgBioEnum, FilePath, PathToBam, SafelyClosable}
import com.fulcrumgenomics.FgBioDef.{FgBioEnum, FilePath, PathToBam, PathToFasta, SafelyClosable}
import com.fulcrumgenomics.bam.api.{SamOrder, SamSource, SamWriter}
import com.fulcrumgenomics.cmdline.{ClpGroups, FgBioTool}
import com.fulcrumgenomics.commons.collection.ParIterator
Expand All @@ -44,68 +44,78 @@ import scala.collection.immutable
|
|## Inputs and Outputs
|
|In order to correctly correct reads by template, the input BAM must be either `queryname` or `query` grouped. The
|sort can be done in streaming fashion with:
|In order to correctly correct reads by template, the input BAM must be either `queryname` sorted or `query` grouped.
|The sort can be done in streaming fashion with:
|
|```
|samtools sort -n -u in.bam | fgbio ClipBam -i /dev/stdin ...
|fgbio --compression 0 SortBam -i in.bam -o out.bam -s queryname | fgbio CallOverlappingConsensusBases -i /dev/stdin ...
|```
|
|The output sort order may be specified with `--sort-order`. If not given, then the output will be in the same
|order as input.
|
|The reference FASTA must be given so that any existing `NM`, `UQ` and `MD` tags can be repaired.
|
|## Correction
|
|Only mapped read pairs with overlapping bases will be eligible for correction.
|
|Next, each read base from the read and its mate that map to same position in the reference will be used to create
|Each read base from the read and its mate that map to same position in the reference will be used to create
|a consensus base as follows:
|
|1. if the read and mate bases are the same, the consensus base is that base with the base quality equal to the sum
| of the two base qualities. The base quality can be the maximum base quality off the two base qualities if
| `--max-qual-on-agreement` is used.
|2. if the read and mate bases differ, then the base with the highest associated base quality will be the consensus
| call. If the read and mate have the same base quality, then the output base quality will be 2. Otherwise,
| the base quality will be the difference between the larger and smaller base quality. The
| `--only-mask-disagreements` option overrides behavior and sets all differing bases to `N` with a base quality of
| 2.
|1. If the base agree, then the chosen agreement strategy (`--agreement-strategy`) will be used.
|2. If the base disagree, then the chosen disagreement strategy (`--disagreement-strategy`) will be used.
|
|The agreement strategies are as follows:
|
|* Consensus: Call the consensus base and return a new base quality that is the sum of the two base qualities.
|* MaxQual: Call the consensus base and return a new base quality that is the maximum of the two base qualities.
|* PassThrough: Leave the bases and base qualities unchanged.
|
|In the context of disagreement strategies, masking a base will make the base an "N" with base quality phred-value "2".
|The disagreement strategies are as follows:
|
|The read and its mate will have their bases corrected and base qualities updated based on the procedure above.
""")
|* MaskBoth: Mask both bases.
|* MaskLowerQual: Mask the base with the lowest base quality, with the other base unchanged. If the base qualities
| are the same, mask both bases.
|* Consensus: Consensus call the base. If the base qualities are the same, mask both bases. Otherwise, call the
| base with the highest base quality and return a new base quality that is the difference between the
| highest and lowest base quality.
| """)
class CallOverlappingConsensusBases
( @arg(flag='i', doc="Input SAM or BAM file of aligned reads.") val input: PathToBam,
@arg(flag='o', doc="Output SAM or BAM file.") val output: PathToBam,
@arg(flag='m', doc="Output metrics file.") val metrics: FilePath,
@arg(doc="The number of threads to use while consensus calling.") val threads: Int = 1,
@arg(flag='S', doc="The sort order of the output. If not given, output will be in the same order as input if the input.")
val sortOrder: Option[SamOrder] = None,
@arg(doc="""If the read and mate bases disagree at a given reference position, true to mask (make 'N') the read and mate
|bases, otherwise pick the base with the highest base quality and return a base quality that's the difference
|between the higher and lower base qualities.""")
val onlyMaskDisagreements: Boolean = false,
@arg(doc= """If the read and mate bases agree at a given reference position, true to for the resulting base quality
|to be the maximum base quality, otherwise the sum of the base qualities.""")
val maxQualOnAgreement: Boolean = false
(@arg(flag='i', doc="Input SAM or BAM file of aligned reads.") val input: PathToBam,
@arg(flag='o', doc="Output SAM or BAM file.") val output: PathToBam,
@arg(flag='m', doc="Output metrics file.") val metrics: FilePath,
@arg(flag='r', doc="Reference sequence fasta file.") val ref: PathToFasta,
@arg(doc="The number of threads to use while consensus calling.") val threads: Int = 1,
@arg(flag='S', doc="The sort order of the output. If not given, output will be in the same order as input if the input.")
val sortOrder: Option[SamOrder] = None,
@arg(doc="The strategy to consensus call when both bases agree. See the usage for more details")
val agreementStrategy: AgreementStrategy = AgreementStrategy.Consensus,
@arg(doc="The strategy to consensus call when both bases disagree. See the usage for more details")
val disagreementStrategy: DisagreementStrategy = DisagreementStrategy.Consensus
) extends FgBioTool with LazyLogging {
Io.assertReadable(input)
Io.assertReadable(ref)
Io.assertCanWriteFile(output)

private case class ThreadData(caller: OverlappingBasesConsensusCaller, templateMetric: CallOverlappingConsensusBasesMetric, basesMetric: CallOverlappingConsensusBasesMetric)
private object ThreadData {
def apply(): ThreadData = ThreadData(
caller = new OverlappingBasesConsensusCaller(onlyMaskDisagreements=onlyMaskDisagreements, maxQualOnAgreement=maxQualOnAgreement),
templateMetric = CallOverlappingConsensusBasesMetric(tpe=CountType.Templates),
basesMetric = CallOverlappingConsensusBasesMetric(tpe=CountType.Bases)
)
}
private case class ThreadData
(caller: OverlappingBasesConsensusCaller = new OverlappingBasesConsensusCaller(agreementStrategy=agreementStrategy, disagreementStrategy=disagreementStrategy),
templateMetric: CallOverlappingConsensusBasesMetric = CallOverlappingConsensusBasesMetric(kind=CountKind.Templates),
basesMetric: CallOverlappingConsensusBasesMetric = CallOverlappingConsensusBasesMetric(kind=CountKind.Bases)
)

override def execute(): Unit = {
val source = SamSource(input)
val writer = SamWriter(output, source.header)
val outSort = sortOrder.flatMap { order => if (SamOrder(source.header).contains(order)) None else Some(order) }
val writer = Bams.nmUqMdTagRegeneratingWriter(writer=SamWriter(output, source.header.clone(), sort=outSort), ref=ref)
val progress = new ProgressLogger(logger)
val templateIterator = Bams.templateIterator(source)
val threadData = new IterableThreadLocal(() => ThreadData())

// Require queryname sorted or query grouped
Bams.requireQueryGrouped(header=source.header, toolName="CallOverlappingConsensusBases")

ParIterator(templateIterator, threads=threads)
.map { template =>
val threadDatum = threadData.get()
Expand All @@ -114,12 +124,11 @@ class CallOverlappingConsensusBases
threadDatum.templateMetric.total += 1
threadDatum.basesMetric.total += template.primaryReads.map(_.length).sum
nh13 marked this conversation as resolved.
Show resolved Hide resolved
// corrects
val stats = threadDatum.caller.call(template)
val overlappingBases = stats.r1OverlappingBases + stats.r2OverlappingBases
val correctedBases = stats.r1CorrectedBases + stats.r2CorrectedBases
if (overlappingBases > 0) {
val stats = threadDatum.caller.call(template)
val correctedBases = stats.r1CorrectedBases + stats.r2CorrectedBases
if (stats.overlappingBases > 0) {
threadDatum.templateMetric.overlapping += 1
threadDatum.basesMetric.overlapping += overlappingBases
threadDatum.basesMetric.overlapping += stats.overlappingBases
if (correctedBases > 0) {
threadDatum.templateMetric.corrected += 1
threadDatum.basesMetric.corrected += correctedBases
Expand All @@ -135,48 +144,47 @@ class CallOverlappingConsensusBases
source.safelyClose()
writer.close()

val templatesMetric = CallOverlappingConsensusBasesMetric(tpe=CountType.Templates)
val basesMetric = CallOverlappingConsensusBasesMetric(tpe=CountType.Bases)
val templatesMetric = CallOverlappingConsensusBasesMetric(kind=CountKind.Templates)
val basesMetric = CallOverlappingConsensusBasesMetric(kind=CountKind.Bases)
threadData.foreach { datum =>
templatesMetric += datum.templateMetric
basesMetric += datum.basesMetric
}

Metric.write(metrics, templatesMetric, basesMetric)
}

}

/** Collects the the number of reads or bases that were examined, had overlap, and were corrected as part of
* the [[CallOverlappingConsensusBases]] tool.
*
* @param tpe template if the counts are per template, bases if counts are in units of bases.
* @param kind template if the counts are per template, bases if counts are in units of bases.
* @param total the total number of templates (bases) examined
* @param overlapping the total number of templates (bases) that were overlapping
* @param corrected the total number of templates (bases) that were corrected.
*/
case class CallOverlappingConsensusBasesMetric
(
tpe: CountType,
kind: CountKind,
var total: Long = 0,
var overlapping: Long = 0,
var corrected: Long = 0,
) extends Metric {
def +=(other: CallOverlappingConsensusBasesMetric): CallOverlappingConsensusBasesMetric = {
require(this.tpe == other.tpe)
require(this.kind == other.kind)
this.total += other.total
this.overlapping += other.overlapping
this.corrected += other.corrected
this
}
}

sealed trait CountType extends EnumEntry
sealed trait CountKind extends EnumEntry

/** Enumeration for the type of counts in [[CallOverlappingConsensusBasesMetric]]. */
object CountType extends FgBioEnum[CountType] {
case object Templates extends CountType
case object Bases extends CountType
object CountKind extends FgBioEnum[CountKind] {
case object Templates extends CountKind
case object Bases extends CountKind

override def values: immutable.IndexedSeq[CountType] = findValues
override def values: immutable.IndexedSeq[CountKind] = findValues
}
2 changes: 1 addition & 1 deletion src/main/scala/com/fulcrumgenomics/bam/ClipBam.scala
Original file line number Diff line number Diff line change
Expand Up @@ -55,7 +55,7 @@ import scala.collection.immutable.IndexedSeq
|done in streaming fashion with, for example:
|
|```
|samtools sort -n -u in.bam | fgbio ClipBam -i /dev/stdin ...
|fgbio --compression 0 SortBam -i in.bam -o out.bam -s queryname | fgbio ClipBam -i /dev/stdin ...
nh13 marked this conversation as resolved.
Show resolved Hide resolved
|```
|
|The output sort order may be specified with `--sort-order`. If not given, then the output will be in the same
Expand Down
Loading