Skip to content

Detecting putative chimeric loci sequences

chrisjackson edited this page Sep 10, 2024 · 14 revisions

Documentation current for HybPiper version 2.3.0


In cases where sequence from a single SPAdes contig covering most of a target locus isn't recovered for a given sample/gene, HybPiper will attempt to reconstruct as much sequence as possible by stitching together multiple contigs that correspond to different regions of the locus. During this process it creates 'stitched contigs'; see the 'Coding Sequence' in figure.

In an effort to detect cases where stitched contig creation results in a chimeric final locus sequence (i.e. a sequence derived from stitching together coding sequence from different paralogs), HybPiper performs the read-mapping approach shown in the figure below:

chimera_test

Note: this check only gets performed if the option --chimeric_stitched_contig_check is provided to command hybpiper assemble.

The example illustrated involves the following steps:

  1. A pool of paired-end reads is assigned to a given sample/gene and used as input to a SPAdes assembly. Ideally this would produce a single contig corresponding to the full length of the target sequence. However, depending on read-depth and exon size for the given sample/gene, it's more likely that more than one contig is produced, and these can correspond to different regions of the gene (e.g. the 5' half and the 3' half, as shown in step ii)).

  2. SPAdes assembly generates two contigs corresponding to the 5' and 3' ends of the target sequence. These could be derived from the same paralog, or different paralogs (or indeed alleles).

  3. HybPiper generates a 'stitched contig' (details are simplified for illustrative purposes), concatenating the contigs derived from different paralogs.

  4. The pool of paired-end reads assigned to the sample/gene is mapped back to the stitched contig using BBmap.sh with default parameters of:

    pairedonly=t mappedonly=t maxindel=0 strictmaxindel=t ambiguous=toss subfilter=7

    Note that the BBmap.sh subfilter parameter is used to "Ban alignments with more than this many substitutions". This value can be changed from the default of 7 using the flag --bbmap_subfilter with the hybpiper assemble command. By default, BBmap.sh is run using 1 GB of RAM (memory) and 1 thread; these defaults can be changed using the parameters --bbmap_memory and --bbmap_threads with the hybpiper assemble command.

  5. Each mapped read pair is then filtered according to the following criteria:

    • The forward read and the reverse read must map to different SPAdes contigs within the stitched contig, and the full length each read must map entirely with the contig boundaries.
    • Each read must map entirely within a single exon within each contig (i.e. they don't overlap exon-intron boundaries within a contig).
    • One read must match with 100% identity, whereas the other read must have a minimum number of mismatches (default 5) with the stitched contig reference sequence. This mismatch threshold can be changed using the flag --chimeric_stitched_contig_edit_distance with the hybpiper assemble command.
  6. If there are a minimum number of 'discordant' read pairs remaining after filtering (default 5), the stitched contig is flagged as being a potential chimera, and a warning is written to file. The discordant read-pair threshold can be changed using the flag --chimeric_stitched_contig_discordant_reads_cutoff with the hybpiper assemble command.

Caveats, limitations and assumptions

  • Read mapping can only detect putative chimeric sequences when the exon sequences within SPAdes contigs that contribute to the stitched contig are equal to or longer than the read length. Even if a sequence is above this threshold, the shorter the sequence the less likely it is to have sufficient numbers of reads mapped to it to flag the sequence as a putative chimera.

  • When a SPAdes assembly is performed for a given sample/gene, the input reads potentially comprise a pool of reads derived from different paralogs and alleles. SPAdes can not definitively distinguish between these read sub-sets, and so even a single output contig might be a chimera of e.g. different alleles; this has the potential to result is both false positive and false negative stitched contig chimera detection when using a read-mapping approach as described.

  • In the example above, if the region of each sequence flanking the site of concatenation is identical between each paralog, and this combined length is greater than the read library length, no discordant read-pairs will be found, even if the two contributing contigs are derived from paralogs that differ beyond the flanking regions.