Use lowercase in DNA sequences #275

ruysan · 2023-05-30T02:50:21Z

Uppercase and lowercase are commonly used in DNA sequence data to identify distinctive features, such as coding regions, repeats, or other arbitrary regions of interest.

Currently, BioSequences converts all DNA sequences to uppercase.

Expected Behavior

Suggested change/improvement: Allow for DNA sequences with mixed upper- and lowercase.

Current Behavior

Current behaviour: All DNA is written in uppercase letters.

Possible Solution / Implementation

Context

A trivial example: the request "provide the reverse complement of this sequence"
GACGTCGCCAGAGAggcataTAACGATAtgacacagagagagcaGAGACAAGT
cannot be answered with BioSequences without losing the case information.

jakobnissen · 2023-05-30T06:32:35Z

Dear @ruysan

The lowercase/uppercase is metadata associated with the DNA sequence, and not part of the DNA sequence itself. Therefore, it should not be contained in the DNA sequence type.
There are several reasons for this:

First, one can always encode extra information in the string representation of any object. For example, suppose I have a FASTA sequence where the line length is used to encode e.g. exons/introns. Or suppose I have an array literal where I use line breaks to signify different parts of the array. The information encoded may be arbitrary, and it's not possible to design a data structure that can contain all such extra data.
Second, suppose we specialize lowercase/uppercase in BioSequences. Then, every symbol would need one extra bit of storage, which means DNA would take either 3 or 5 bits instead of 2 or 4 bits. Since it's convenient to have the number of bits be a power of two, this would round up to 4 or 8 bits, which means the memory footprint would double. The implementation would also be more complex.
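The arithmetic in the second point can be checked with a short sketch. `padded_bits` is a hypothetical helper, not part of BioSequences; it assumes one extra case bit per symbol and rounds up to the next power of two with Base's `nextpow`.

```julia
# Hypothetical helper: bits per symbol after adding one case bit,
# rounded up to the next power of two.
padded_bits(bits) = nextpow(2, bits + 1)

for bits in (2, 4)  # the 2-bit and 4-bit DNA encodings mentioned above
    padded = padded_bits(bits)
    println(bits, " bits + 1 case bit = ", bits + 1,
            " bits, padded to ", padded, " (", padded ÷ bits, "x the storage)")
end
```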

Instead, I propose you simply extract the metadata to a separate vector:

julia> using BioSequences

julia> s = "GACGTCGCCAGAGAggcataTAACGATAtgacacagagagagcaGAGACAAGT";

julia> mask = BitVector(isuppercase(i) for i in s);

julia> dna = LongDNA{2}(s)
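
As a minimal sketch of how the separate mask could answer the reverse-complement question from the original post: reverse the mask alongside the sequence, then reapply the case when printing. `apply_case` is a hypothetical helper written for this illustration, not a BioSequences function, and the mask is built here by broadcasting `isuppercase` over the characters.

```julia
using BioSequences

s = "GACGTCGCCAGAGAggcataTAACGATAtgacacagagagagcaGAGACAAGT"

# Case mask: true where the original character was uppercase.
mask = isuppercase.(collect(s))

dna = LongDNA{2}(s)

# Reverse-complement the sequence and reverse the mask in step with it.
rc = reverse_complement(dna)
rc_mask = reverse(mask)

# Hypothetical helper: print a sequence, lowercasing positions where the mask is false.
apply_case(seq, mask) = join(m ? c : lowercase(c) for (c, m) in zip(string(seq), mask))

rc_cased = apply_case(rc, rc_mask)
println(rc_cased)
```

The sequence type stays compact at 2 bits per base, and the case information costs one additional bit per base only for the sequences that need it.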

jakobnissen closed this as completed May 30, 2023
