Results and output files

chrisjackson edited this page Sep 11, 2024 · 19 revisions

Documentation current for HybPiper version 2.2.0


1.0 hybpiper assemble

Output Directory

Optional. The parent output directory, if one was supplied using the parameter --hybpiper_output or -o.

Base Directory

The name of the base directory is specified by supplying the parameter --prefix to the hybpiper assemble command. If --prefix is not provided, it is generated from the read file names.

  • The master target file (e.g. target_file.fasta).
  • translated_target_file.fasta. A fasta file with amino-acid sequences, translated from a nucleotide target file. Note that this is only present if a nucleotide target file was supplied, but the flag --bwa was not used.
  • check_targetfile_report-<target_file_name>.txt. A text report file summarising details of the target file check performed. Note that this is only present if flag --skip_targetfile_checks is not used.
  • A BLAST (<target_file_name>.psq, etc.), DIAMOND (<target_file_name>.dmnd) or BWA database (<target_file_name>.amb, etc.).
  • A BLAST/DIAMOND (<prefix>.blastx) or BWA (<prefix>.bam) mapping results file.
  • A directory for every gene with BLAST/DIAMOND or BWA hits, e.g. gene001, gene002, etc.
  • target_tallies.txt. A text file summarizing the chosen target reference sequences for the sample run.
  • spades_initial_commands.txt. A text file listing the spades.py commands used to assemble reads from each gene.
  • gnu_parallel_log.txt. A text log file produced by GNU parallel when running SPAdes gene assemblies.
  • spades_genelist.txt. A text file listing all genes with mapped reads.
  • exonerate_genelist.txt. A text file listing all genes with assembled SPAdes contigs. Note that this file is called exonerate_genelist.txt even if BLAST was used to extract sequences (i.e. option --not_protein_coding was used).
  • genes_with_seqs.txt. A text file listing all genes for which a coding sequence was extracted via Exonerate.
  • <prefix>_chimera_check_performed.txt. A text file containing 'True' or 'False' depending on whether the option --chimeric_stitched_contig_check was provided to the command hybpiper assemble. Used by hybpiper retrieve_sequences and hybpiper paralog_retriever.
  • <prefix>_genes_with_non_terminal_stop_codons.txt. A text log file containing gene names for any gene with an output sequence containing one or more internal (i.e., non-terminal) stop codons.
  • <prefix>_genes_with_long_paralog_warnings.txt. A text file listing all genes which had multiple long-length sequences from different SPAdes contigs (putative paralogs).
  • <prefix>_genes_with_paralog_warnings_by_contig_depth.csv. A comma-separated-values file listing all genes that had a SPAdes contig depth >1 for at least 75% (default) of the length of the reference target file sequence.
  • <prefix>_genes_with_stitched_contig.csv. A comma-separated-values file with details on whether a stitched contig was created for a given gene.
  • <prefix>_genes_derived_from_putative_chimera_stitched_contig.csv. A comma-separated-values file listing all genes that might be derived from a chimeric stitched contig (i.e. comprising multiple paralogs).
  • <prefix>_hybpiper_assemble_<date_time>.log. A text log file containing many details regarding the pipeline run for the sample.
  • spades.log. A text log file containing the concatenated output of the SPAdes assembler for initial SPAdes assemblies for all genes.
  • failed_spades.txt. A text file listing all genes that had a failed initial SPAdes assembly.
  • redo_spades_commands.txt. A text file containing commands to re-run SPAdes for genes with a failed initial assembly.
  • spades_redo.log. A text log file containing the concatenated output of the SPAdes assembler for SPAdes re-runs.
  • spades_duds.txt. A text file listing all genes with failed SPAdes re-runs.
  • total_input_reads_paired.txt. A text file containing the number of paired-end reads (if supplied) in the input read files.
  • total_input_reads_single.txt. A text file containing the number of single-end reads (if supplied) in the input read files.
  • total_input_reads_unpaired.txt. A text file containing the number of unpaired reads (if supplied) in the input read files.
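
The three gene-list files above record how far each gene progressed through the pipeline, so counting their entries gives a quick per-sample summary. The following is a minimal sketch rather than part of HybPiper: the function names are hypothetical, and it assumes that each file holds one gene per line with the gene name as the first whitespace-separated field.

```python
from pathlib import Path

# File names come from the list above. The one-gene-per-line layout, with the
# gene name as the first whitespace-separated field, is an assumption.
STAGE_FILES = {
    "mapped_reads": "spades_genelist.txt",
    "assembled_contigs": "exonerate_genelist.txt",
    "extracted_sequence": "genes_with_seqs.txt",
}


def read_gene_list(path):
    """Return the gene names in a gene-list file, or [] if the file is absent."""
    path = Path(path)
    if not path.is_file():
        return []
    genes = []
    for line in path.read_text().splitlines():
        fields = line.split()
        if fields:  # skip blank lines
            genes.append(fields[0])
    return genes


def stage_counts(base_dir):
    """Count the genes recorded at each stage for one sample's base directory."""
    base_dir = Path(base_dir)
    return {
        stage: len(read_gene_list(base_dir / name))
        for stage, name in STAGE_FILES.items()
    }
```

Calling stage_counts("<prefix>") on a base directory returns a dictionary with one count per stage; a large drop between stages may indicate genes that failed assembly or extraction.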

Base Directory -> Gene Directory

The gene directories will be named according to the unique gene names present in the target file used for the run.

  • <gene_name>_interleaved.fasta. A fasta file containing all reads provided using the --readfiles or -r parameter that mapped to any target sequence for this gene. In cases where only one read of a read pair mapped, both R1 and R2 reads are included in this file. If paired-end read files were used as input, this fasta file is in interleaved format; note that this file will have the suffix interleaved.fasta even if you provide single-end reads.
  • <gene_name>_merged.fastq. A fastq file of merged reads from paired-end input. This file will only be present if the flag --merged is used with the hybpiper assemble command and paired-end reads are provided.
  • <gene_name>_unmerged.fastq. A fastq file of paired-end reads that could not be merged, in interleaved format. This file will only be present if the flag --merged is used with the hybpiper assemble command and paired-end reads are provided.
  • <gene_name>_unpaired.fasta. A fasta file containing all reads provided using the --unpaired parameter that mapped to any target sequence for this gene.
  • <gene_name>_contigs.fasta. The contigs assembled from the input reads using SPAdes.
  • <gene_name>_target.fasta. A fasta file with the amino-acid sequence of the 'best' reference target for the given gene/sample.
  • <gene_name>_<date_time>.log. The log file produced by the exonerate_hits.py module for the given gene/sample. This will only be present if the flag --keep_intermediate_files was provided to the command hybpiper assemble; default behaviour is to delete the log file after it has been re-logged to the main sample logfile in the base directory.
  • <sample_name>. A directory of Exonerate results; the directory has the same name as the sample. See below for details.
  • <gene_name>_spades. The directory produced by the SPAdes assembler for the given gene/sample. See below for details.

Base Directory -> Gene Directory -> SPAdes Directory

The SPAdes assembly directory is produced by the SPAdes assembler; in this case it will have a prefix corresponding to the given gene name, i.e. <gene_name>_spades. This directory will only be present if the flag --keep_intermediate_files was provided to the command hybpiper assemble; default behaviour is to delete the directory after processing. It contains the standard SPAdes output files and folders, as described in the SPAdes documentation.

Base Directory -> Gene Directory -> Exonerate Directory

The Exonerate directory will have the same name as the base directory (i.e. the sample name), and contains output files and folders produced by the exonerate_hits.py module.

  • exonerate_results.fasta. The output of the initial Exonerate search of the target protein against the SPAdes contigs. This file contains both Exonerate alignments, and fasta sequence for the extracted coding region.
  • exonerate_stats.tsv. A table in tab-separated-values format, containing information on SPAdes contigs with Exonerate hits against the 'best' reference target sequence, if they passed the initial global similarity filter set by --thresh.
  • exonerate_hits_trimmed.FAA. A fasta file containing amino-acid sequences of one or more Exonerate hits used to create the output gene sequence.
  • exonerate_hits_trimmed.FNA. A fasta file containing nucleotide sequences of one or more Exonerate hits used to create the output gene sequence.
  • genes_with_stitched_contig.csv. A file in comma-separated-values format, providing details on whether the given gene/sample sequence was derived from a stitched contig.
  • paralog_warning_long.txt. A text file produced if the given gene/sample had 'long' paralog warnings, listing the corresponding SPAdes contigs along with Exonerate hit details.
  • paralog_warning_by_contig_depth.txt. A text file detailing whether the given gene/sample has a paralog warning produced by sequence depth across the reference target sequence after Exonerate searches.
  • chimera_test_stitched_contig.fasta. A fasta file containing a stitched contig nucleotide sequence, used for read mapping during the chimera test.
  • chimera_test_stitched_contig.sam. A mapping file in Sequence Alignment Map (SAM) format, produced by mapping paired-end reads against the chimera_test_stitched_contig.fasta sequence.
  • putative_chimeric_stitched_contig.csv. A file in comma-separated-values format, produced if a stitched contig for the given gene/sample appears to be chimeric. Lists the sample name, gene name, and chimera warning details.
  • chimera_test_diagnostic_reads.sam. A headless mapping file in Sequence Alignment Map (SAM) format, produced by filtering the chimera_test_stitched_contig.sam file to retain read pairs diagnostic for a chimeric stitched contig.
  • sequences. A directory containing subdirectories with recovered sequences. See below for details.
  • intronerate. A directory containing intron and supercontig processing results. See below for details.
  • paralogs. A directory containing paralog sequence results, if present. See below for details.

If option --not_protein_coding is used:

This directory will contain BLASTn output files rather than Exonerate output, as follows:

  • blastn_results.xml. The output of the BLASTn search of the target sequence against the SPAdes contigs, in *.xml format (blastn -outfmt 5).
  • blast_stats.tsv. A table in tab-separated-values format, containing information on SPAdes contigs with BLASTn hits against the 'best' reference target sequence, if they passed the initial global similarity filter set by --thresh.
  • blast_hits_trimmed.FNA. A fasta file containing nucleotide sequences of one or more BLASTn hits used to create the output sequence.
  • genes_with_stitched_contig.csv. A file in comma-separated-values format, providing details on whether the given locus/sample sequence was derived from a stitched contig.
  • paralog_warning_long.txt. A text file produced if the given locus/sample had 'long' paralog warnings, listing the corresponding SPAdes contigs along with BLASTn hit details.
  • paralog_warning_by_contig_depth.txt. A text file detailing whether the given locus/sample has a paralog warning produced by sequence depth across the reference target sequence after BLASTn searches.
  • chimera_test_stitched_contig.fasta. A fasta file containing a stitched contig nucleotide sequence, used for read mapping during the chimera test.
  • chimera_test_stitched_contig.sam. A mapping file in Sequence Alignment Map (SAM) format, produced by mapping paired-end reads against the chimera_test_stitched_contig.fasta sequence.
  • putative_chimeric_stitched_contig.csv. A file in comma-separated-values format, produced if a stitched contig for the given locus/sample appears to be chimeric. Lists the sample name, gene name, and chimera warning details.
  • chimera_test_diagnostic_reads.sam. A headless mapping file in Sequence Alignment Map (SAM) format, produced by filtering the chimera_test_stitched_contig.sam file to retain read pairs diagnostic for a chimeric stitched contig.
  • sequences. A directory containing subdirectories with recovered sequences. See below for details.
  • paralogs. A directory containing paralog sequence results, if present. See below for details.

Base Directory -> Gene Directory -> Exonerate Directory -> Sequences Directory

The directory sequences contains subdirectories containing fasta files with recovered sequences, as follows:

  • FAA. A directory containing the fasta file <gene_name>.FAA with the recovered gene sequence in amino-acids.
  • FNA. A directory containing the fasta file <gene_name>.FNA with the recovered gene sequence in nucleotides.
  • intron. A directory containing the fasta files <gene_name>_introns.fasta and <gene_name>_supercontig.fasta. These files contain the recovered intron sequence and the recovered supercontig sequence (the latter containing both introns and exons), if recovered for the gene/sample. This directory will only be present if the flag --run_intronerate was provided to the command hybpiper assemble.

If option --not_protein_coding is used:

  • FNA. A directory containing the fasta file <gene_name>.FNA with the recovered locus sequence in nucleotides.
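
The nesting described so far (base directory, gene directory, Exonerate directory named after the sample, then sequences) can be expressed as a path-building helper. This is a minimal sketch, not HybPiper code: the function names are hypothetical, and it assumes the base directory name is the sample name, as stated above.

```python
from pathlib import Path


def recovered_sequence_path(base_dir, gene_name, seq_type="FNA"):
    """Build the expected path to a recovered sequence file for one gene.

    seq_type is "FNA" (nucleotides) or "FAA" (amino-acids).
    """
    if seq_type not in ("FNA", "FAA"):
        raise ValueError(f"seq_type must be 'FNA' or 'FAA', got {seq_type!r}")
    base_dir = Path(base_dir)
    # The Exonerate directory shares its name with the base directory.
    sample_name = base_dir.name
    return (
        base_dir / gene_name / sample_name / "sequences" / seq_type
        / f"{gene_name}.{seq_type}"
    )


def genes_with_recovered_sequence(base_dir, seq_type="FNA"):
    """List gene directories under base_dir that contain a recovered sequence file."""
    base_dir = Path(base_dir)
    return [
        entry.name
        for entry in sorted(base_dir.iterdir())
        if entry.is_dir()
        and recovered_sequence_path(base_dir, entry.name, seq_type).is_file()
    ]
```

For a sample with base directory sampleA, recovered_sequence_path("sampleA", "gene001") points to sampleA/gene001/sampleA/sequences/FNA/gene001.FNA.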

Base Directory -> Gene Directory -> Exonerate Directory -> Paralogs Directory

The directory paralogs contains the fasta file <gene_name>_paralogs.fasta with paralog sequences, if recovered for the gene/sample.

Base Directory -> Gene Directory -> Exonerate Directory -> Intronerate Directory

The directory intronerate will only be present if the flag --run_intronerate was provided to the command hybpiper assemble and --not_protein_coding was not used. It contains output files produced by Intronerate (the process used to recover introns and supercontigs, if present for the gene/sample).

  • intronerate_query_stripped.fasta. A fasta file containing the recovered gene sequence in amino-acid format, with any 'X' characters removed. Used as a query in Exonerate searches to generate a gff file.
  • <gene_name>_supercontig_without_Ns.fasta. A fasta file containing a supercontig (i.e. exons and introns) for the given gene/sample. Used as a target in Exonerate searches to generate a gff file.
  • <gene_name>_intronerate_supercontig_individual_contig_hits.fasta. A fasta file containing the individual SPAdes contigs used to create the supercontig sequence.
  • <gene_name>_intronerate_fasta_and_gff.txt. A text file containing both Exonerate search alignment and gff details.
  • intronerate.gff. The gff details only, extracted from the <gene_name>_intronerate_fasta_and_gff.txt file.

2.0 hybpiper stats

Parent Directory

The parent directory contains one or more Base directories corresponding to the output of hybpiper assemble for each sample. The descriptions below assume that the command hybpiper stats has been run from the parent directory.

  • seq_lengths.tsv. A table in tab-separated-values format, containing the lengths of each recovered gene sequence for each sample, along with the mean sequence length for each gene within the target file. The name of this file can be changed using the parameter --seq_lengths_filename <filename>.

  • hybpiper_stats.tsv. A table in tab-separated-values format, containing statistics on the HybPiper run. The name of this file can be changed using the parameter --stats_filename <filename>.
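
Because seq_lengths.tsv is tab-separated, it can be loaded with the standard csv module. The sketch below is illustrative only and rests on layout assumptions not stated on this page: a header row whose first cell is a row label followed by gene names, a row of mean target lengths labelled MeanLength, and one row per sample. Check the labels against your own file before relying on it; both function names are hypothetical.

```python
import csv


def read_seq_lengths(path):
    """Load a tab-separated length table as {row_label: {gene_name: length}}.

    Assumes the first row is a header (row label, then gene names) and that
    every other cell after the first column is numeric.
    """
    table = {}
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        genes = next(reader)[1:]
        for row in reader:
            if not row:  # skip blank lines
                continue
            table[row[0]] = dict(zip(genes, (float(value) for value in row[1:])))
    return table


def fraction_of_mean(table, sample, mean_label="MeanLength"):
    """Express a sample's recovered lengths as a fraction of the mean target length.

    mean_label is an assumed row label; genes with a mean of zero map to 0.0.
    """
    means = table[mean_label]
    return {
        gene: (length / means[gene] if means[gene] else 0.0)
        for gene, length in table[sample].items()
    }
```

The resulting fractions are the same kind of quantity that hybpiper recovery_heatmap depicts: recovered length relative to the mean length of the target references.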

3.0 hybpiper retrieve_sequences

Parent Directory

The parent directory contains one or more Base directories corresponding to the output of hybpiper assemble for each sample. The descriptions below assume that the command hybpiper retrieve_sequences has been run from the parent directory.

  • <gene_name>.FNA. A fasta file containing the recovered gene sequence from each sample in nucleotides (if parameter dna was supplied). A fasta file will be produced for each gene.

  • <gene_name>.FAA. A fasta file containing the recovered gene sequence from each sample in amino-acids (if parameter aa was supplied). A fasta file will be produced for each gene.

  • <gene_name>_introns.fasta. A fasta file containing the recovered gene intron sequence from each sample in nucleotides (if parameter intron was supplied). A fasta file will be produced for each gene.

  • <gene_name>_supercontig.fasta. A fasta file containing the recovered gene supercontig sequence (exons and introns) from each sample in nucleotides (if parameter supercontig was supplied). A fasta file will be produced for each gene.

Optional sequence directory

If the parameter --fasta_dir <directory_name> is provided, the directory will be created and the fasta files described above will be placed within it, rather than in the parent directory.
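
Since each per-gene fasta file holds the recovered sequence from each sample, counting the header lines in a file gives the number of samples recovered for that gene. The following is a minimal sketch with hypothetical function names; it assumes one fasta record per sample and skips any *.filtered.* files that hybpiper filter_by_length may have written to the same directory.

```python
from pathlib import Path


def count_fasta_records(path):
    """Count fasta records by counting header lines (those starting with '>')."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def samples_per_gene(fasta_dir, suffix=".FNA"):
    """Map each gene name to the number of records in its <gene_name><suffix> file."""
    counts = {}
    for path in sorted(Path(fasta_dir).glob(f"*{suffix}")):
        # Ignore length-filtered output such as <gene_name>.filtered.FNA.
        if path.name.endswith(f".filtered{suffix}"):
            continue
        gene_name = path.name[: -len(suffix)]
        counts[gene_name] = count_fasta_records(path)
    return counts
```

Pass the parent directory, or the directory given to --fasta_dir, as fasta_dir; use suffix=".FAA" for amino-acid output.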

4.0 hybpiper filter_by_length

Parent Directory

The parent directory contains one or more Base directories corresponding to the output of hybpiper assemble for each sample. The descriptions below assume that the command hybpiper filter_by_length has been run from the parent directory.

  • <gene_name>.filtered.FNA. A fasta file containing the recovered gene sequence from each sample in nucleotides (if parameter dna was supplied), filtered according to the length filtering options provided. A fasta file will be produced for each gene.

  • <gene_name>.filtered.FAA. A fasta file containing the recovered gene sequence from each sample in amino-acids (if parameter aa was supplied), filtered according to the length filtering options provided. A fasta file will be produced for each gene.

  • <gene_name>_introns.filtered.fasta. A fasta file containing the recovered gene intron sequence from each sample in nucleotides (if parameter intron was supplied), filtered according to the length filtering options provided. A fasta file will be produced for each gene.

  • <gene_name>_supercontig.filtered.fasta. A fasta file containing the recovered gene supercontig sequence (exons and introns) from each sample in nucleotides (if parameter supercontig was supplied), filtered according to the length filtering options provided. A fasta file will be produced for each gene.

Optional sequence directory

If the parameter --filtered_dir <directory_name> is provided, the directory will be created and the fasta files described above will be placed within it, rather than in the parent directory.
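
To see how many sequences the length filter removed for each gene, the filtered files can be paired with their unfiltered counterparts by name. This is a rough sketch with hypothetical function names; it relies only on the <gene_name>.FNA and <gene_name>.filtered.FNA naming described on this page and assumes one fasta record per sample.

```python
from pathlib import Path


def count_records(path):
    """Count fasta records by counting header lines (those starting with '>')."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def removed_by_filter(unfiltered_dir, filtered_dir, suffix=".FNA"):
    """Map each gene name to the number of records dropped by length filtering.

    Genes without a matching unfiltered file are skipped.
    """
    filtered_suffix = f".filtered{suffix}"
    removed = {}
    for filtered in sorted(Path(filtered_dir).glob(f"*{filtered_suffix}")):
        gene_name = filtered.name[: -len(filtered_suffix)]
        original = Path(unfiltered_dir) / f"{gene_name}{suffix}"
        if original.is_file():
            removed[gene_name] = count_records(original) - count_records(filtered)
    return removed
```

If --fasta_dir and --filtered_dir were not used, both arguments are simply the parent directory.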

5.0 hybpiper paralog_retriever

Parent Directory

The parent directory contains one or more Base directories corresponding to the output of hybpiper assemble for each sample. The descriptions below assume that the command hybpiper paralog_retriever has been run from the parent directory.

  • paralog_report.tsv. A table in tab-separated-values format, containing the number of long sequences recovered for each gene and sample (i.e. potential paralogs if > 1).

  • paralog_heatmap.png. A heatmap image file in *.png format, depicting the number of long sequences recovered for each gene and sample. The name of this file can be changed using the parameter --heatmap_filename <filename>. The format of the file can be changed using the parameter --heatmap_filetype {png,pdf,eps,tiff,svg}.

  • paralogs_above_threshold_report.txt. A text file that lists 1) The number and names of genes with paralogs in a minimum percentage of samples; 2) The number and names of samples that have paralogs in a minimum percentage of genes. By default, this percentage is set to zero, so all genes and samples with paralogs will be reported.

  • paralogs_all. A directory containing a *.fasta file for each sample/gene, containing paralog sequences if present, or the *.FNA sequence recovered by HybPiper if no paralogs were detected.

  • paralogs_no_chimeras. A directory containing a *.fasta file for each sample/gene as above, but with any putative chimeric *.FNA sequences removed. This folder will only be present if at least one of your samples had a chimera check performed during hybpiper assemble (i.e. the option --chimeric_stitched_contig_check was provided).
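
The counts in paralog_report.tsv can be scanned for values greater than 1 to list the genes with potential paralogs and the samples in which they occur. The sketch below is illustrative only: the function name is hypothetical, and it assumes a layout not specified on this page, namely a header row of gene names (after a first label cell) followed by one row of integer counts per sample.

```python
import csv


def genes_with_putative_paralogs(report_path):
    """Return {gene_name: [sample, ...]} for every count greater than 1.

    Assumes rows are samples and columns are genes, with a header row.
    """
    flagged = {}
    with open(report_path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        genes = next(reader)[1:]
        for row in reader:
            if not row:  # skip blank lines
                continue
            sample = row[0]
            for gene, value in zip(genes, row[1:]):
                if int(value) > 1:
                    flagged.setdefault(gene, []).append(sample)
    return flagged
```

Genes flagged in many samples are natural candidates for closer inspection using the fasta files in paralogs_all.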

6.0 hybpiper recovery_heatmap

Parent Directory

The parent directory contains one or more Base directories corresponding to the output of hybpiper assemble for each sample. The descriptions below assume that the command hybpiper recovery_heatmap has been run from the parent directory.

  • recovery_heatmap.png. A heatmap image file in *.png format, depicting the length of the recovered sequence for each gene and each sample, relative to the mean length of the gene sequence references in the target file. The name of this file can be changed using the parameter --heatmap_filename <filename>. The format of the file can be changed using the parameter --heatmap_filetype {png,pdf,eps,tiff,svg}.

7.0 hybpiper check_dependencies

No output files are produced by this command. Results are printed to the terminal screen.

8.0 hybpiper check_targetfile

In addition to results printed to the terminal screen, the following file is produced:

  • fix_targetfile_<date_time>.ctl. A control file in text format, logging parameters of the hybpiper check_targetfile run, as well as a list of target file sequence names for sequences with low-complexity regions. This *.ctl file is required as input for the hybpiper fix_targetfile command (see below).

9.0 hybpiper fix_targetfile

In addition to results printed to the terminal screen, the following files are produced:

  • <targetfile_name>_fixed.fasta. A fasta file containing filtered and/or fixed target sequences.
  • fix_targetfile_report.tsv. A table in tab-separated-values format, containing a list of sequences that were removed from the input target file, and a corresponding reason. Note that this list can include multiple frames for a single input sequence (suffix _frame_1, _frame_2, etc.).
  • fix_targetfile_<date_time>.log. A text log file containing details of the hybpiper fix_targetfile run.

Parent Directory -> Alignments Directory

The directory fix_targetfile_alignments will only be present if the flag --alignments was provided to the command hybpiper fix_targetfile. It contains directories with per-gene unaligned and aligned fasta files, from the trimmed/filtered targetfile. By default, this directory will not be created.

  • translated_gene_seqs_unaligned. A directory containing unaligned fasta files <gene_name>_unaligned.fasta with translated, unaligned, per-gene fixed target file sequences. Only present if the input target file contains nucleotide sequences.
  • translated_gene_seqs_aligned. A directory containing aligned fasta files <gene_name>_aligned.fasta with translated, aligned, per-gene fixed target file sequences. Only present if the input target file contains nucleotide sequences.
  • protein_gene_seqs_unaligned. A directory containing fasta files <gene_name>_unaligned.fasta with unaligned per-gene fixed target file sequences. Only present if the input target file contains protein sequences.
  • protein_gene_seqs_aligned. A directory containing fasta files <gene_name>_aligned.fasta with aligned per-gene fixed target file sequences. Only present if the input target file contains protein sequences.

Parent Directory -> Additional Sequences Directory

The directory fix_targetfile_additional_sequence_files will only be present if the flag --write_all_fasta_files was provided to the command hybpiper fix_targetfile. It contains fasta files for sequences removed from the fixed target file, grouped according to filtering categories (length threshold, low-complexity regions, etc.). By default, these files will not be written.

  • <targetfile_name>_low_complexity_regions.fasta. A fasta file containing all sequences listed as having low-complexity regions in the *.ctl file, regardless of whether they were removed from the fixed targetfile or not.
  • <targetfile_name>_short_sequences.fasta. A fasta file containing sequences shorter than the "--filter_by_length_percentage" threshold, when compared to the longest representative gene sequence.
  • <targetfile_name>_stop_codons_all_frames.fasta. A fasta file containing sequences with unexpected stop codons in all forward frames.
  • <targetfile_name>_undetermined_frame.fasta. A fasta file containing sequences with multiple candidate forward reading frames, but no reference sequence to select the 'correct' candidate. Each candidate frame is present as a unique sequence.
  • <targetfile_name>_exceeding_maximum_distance_frames_multi.fasta. A fasta file containing sequences with multiple candidate forward reading frames and a corresponding reference sequence, but all frames exceeded the maximum allowed distance threshold from the reference.
  • <targetfile_name>_exceeding_maximum_distance_frames_single.fasta. A fasta file containing sequences with a single candidate forward reading frame and a corresponding reference sequence, but the frame exceeded the maximum allowed distance threshold from the reference.