Use lowercase in DNA sequences #275

Closed
ruysan opened this issue May 30, 2023 · 1 comment

ruysan commented May 30, 2023

Uppercase and lowercase are commonly used in DNA sequence data to identify distinctive features, such as coding regions, repeats, or other arbitrary regions of interest.

Currently all DNA sequences are modified to uppercase in BioSequences.

Expected Behavior

Suggested change/improvement: Allow for DNA sequences with mixed upper- and lowercase.

Current Behavior

Current behaviour: All DNA is written in uppercase letters.

Possible Solution / Implementation

Context

A trivial example: the request to provide the reverse complement of this sequence:
GACGTCGCCAGAGAggcataTAACGATAtgacacagagagagcaGAGACAAGT
cannot be fulfilled with BioSequences without losing the case information.

@jakobnissen
Member

Dear @ruysan

The lowercase/uppercase is metadata associated with the DNA sequence, and not part of the DNA sequence itself. Therefore, it should not be contained in the DNA sequence type.
There are several reasons for this:

  • First, one can always encode extra information in the string representation of any object. For example, suppose I have a FASTA sequence where the line length is used to encode e.g. exons/introns. Or suppose I have an array literal where I use line breaks to signify different parts of the array. The information encoded may be arbitrary, and it's not possible to design a data structure that can hold every such kind of extra data.
  • Second, suppose we specialize lowercase/uppercase in BioSequences. Then, every symbol would need one extra bit of storage, which means DNA would take either 3 or 5 bits instead of 2 or 4 bits. Since it's convenient to have the number of bits be a power of two, this would round up to 4 or 8 bits, which means the memory footprint would double. The implementation would also be more complex.
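
The storage arithmetic in the second point can be checked with a few lines of plain Julia. This is only a rough sketch: `bits_with_case` and `payload_bytes` are hypothetical helper names, not part of BioSequences, and the calculation counts bits per symbol only, ignoring any per-object overhead.

```julia
# Hypothetical helper: bits per symbol if one case bit were added and the
# total were rounded up to the next power of two.
bits_with_case(bits_per_symbol) = nextpow(2, bits_per_symbol + 1)

# Hypothetical helper: approximate payload size in bytes for n symbols.
payload_bytes(n, bits_per_symbol) = cld(n * bits_per_symbol, 8)

n = 1_000_000  # illustrative sequence length
for bits in (2, 4)
    widened = bits_with_case(bits)
    println(bits, " bits -> ", widened, " bits per symbol; ",
            payload_bytes(n, bits), " -> ", payload_bytes(n, widened), " bytes")
end
```

By comparison, a separate `BitVector` stores roughly one bit per element, so 2-bit DNA plus a case mask needs about 3 bits per symbol rather than 4.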

Instead, I propose you simply extract the metadata to a separate vector:

julia> using BioSequences

julia> s = "GACGTCGCCAGAGAggcataTAACGATAtgacacagagagagcaGAGACAAGT";

julia> mask = BitVector(isuppercase(i) for i in s);

julia> dna = LongDNA{2}(s)
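
To make the original question concrete, here is one possible way to carry the mask through a reverse complement. It is a minimal sketch, not an established API: `masked_reverse_complement` is a hypothetical helper, and it assumes the input contains only A, C, G and T so that `LongDNA{2}` can hold it.

```julia
using BioSequences

# Hypothetical helper, not part of BioSequences: reverse-complement a DNA
# string while keeping the upper/lowercase pattern attached to each base.
function masked_reverse_complement(s::AbstractString)
    mask = BitVector(isuppercase(c) for c in s)  # case metadata, one bit per base
    dna = LongDNA{2}(s)                          # assumes only A, C, G, T
    rc = reverse_complement(dna)
    rc_mask = reverse(mask)                      # the metadata is reversed with the sequence
    # Re-apply the case pattern to the printed (uppercase) sequence.
    chars = (m ? uppercase(c) : lowercase(c) for (c, m) in zip(string(rc), rc_mask))
    return join(chars)
end

s = "GACGTCGCCAGAGAggcataTAACGATAtgacacagagagagcaGAGACAAGT"
rc_str = masked_reverse_complement(s)
println(rc_str)
```

Because the mask is an ordinary vector, the same pattern should extend to other operations that reorder or slice the sequence: apply the same indexing to the mask as to the sequence.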
