Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove Kmers #121

Merged
merged 3 commits into from
Apr 29, 2021
Merged

Remove Kmers #121

merged 3 commits into from
Apr 29, 2021

Conversation

TransGirlCodes
Copy link
Member

@TransGirlCodes TransGirlCodes commented Aug 8, 2020

Types of changes

This PR implements the following changes:
(Please tick any or all of the following that are applicable)

  • ✨ New feature (A non-breaking change which adds functionality).
  • 🐛 Bug fix (A non-breaking change, which fixes an issue).
  • 💥 Breaking change (fix or feature that would cause existing functionality to change).

📋 Additional detail

This PR removes Kmers from BioSequences in prep for v3.

☑️ Checklist

  • 🎨 The changes implemented is consistent with the julia style guide.
  • 📘 I have updated and added relevant docstrings, in a manner consistent with the documentation styleguide.
  • 📘 I have added or updated relevant user and developer manuals/documentation in docs/src/.
  • 🆗 There are unit tests that cover the code changes I have made.
  • 🆗 The unit tests cover my code changes AND they pass.
  • 📝 I have added an entry to the [UNRELEASED] section of the manually curated CHANGELOG.md file for this repository.
  • 🆗 All changes should be compatible with the latest stable version of Julia.
  • 💭 I have commented liberally for any complex pieces of internal code.

@TransGirlCodes TransGirlCodes self-assigned this Aug 8, 2020
@TransGirlCodes TransGirlCodes force-pushed the better_kmers branch 5 times, most recently from 8c4f6a8 to ac1f65c Compare August 20, 2020 01:43
Copy link
Member

@CiaranOMara CiaranOMara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's quite a bit here. I've made a start. I'd need to mull over the MerIter stuff and then get back to you.

src/composition.jl Outdated Show resolved Hide resolved
src/composition.jl Outdated Show resolved Hide resolved
src/iterators/eachmer.jl Outdated Show resolved Hide resolved
src/iterators/eachmer.jl Outdated Show resolved Hide resolved
Copy link
Member

@jakobnissen jakobnissen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right @benjward , so I've looked through it now - except the SkipMerFactory, which will take a little longer time. It's actually pretty extensive. Good job!

Besides my smaller comments, I do have a few larger-scale points:

  • First, I think we should really allow kmers of any arbitrary alphabet type. It's not actually that much harder to do, and it's much easier to do it now than realize a year down the road we want it, and then redo half this work. It's not just that it should work with other alphabets on principle, but rather that bioinformatics software actually uses kmers of other alphabets. Aligners, for example, often uses kmers of reduced amino acids alphabets. Support for kmers of such homemade alphabets out of the box would be really nice.
  • Second, there is the whole BiKmer or CanonicalKmer ot whatever to call it. The issue is that kmer iteration has some conflicting requirements:
    • Some applications want absolutely maximal speed just generating forward kmers. Here, having to generate a MerIterResult is just wasted CPU power.
    • Other applications want canonical kmers, or mer positions. Here, we would have to create something like a MerIterResult.

Like in Julia in general, the solution to this should be specialization and dispatch. We define the kmer operations like pushfirst on either the kmer object and the MerIterResult object. Then we can create kmer iterators with either each(DNAKmer{K}, myseq) or each(MerIterResult{K}, myseq), and pick whichever we want.

I'm happy to help implementing this, but I'd rather not step on your toes. So I'll wait until you think you're done with want you want to do.

src/mers/transformations.jl Outdated Show resolved Hide resolved
src/mers/transformations.jl Outdated Show resolved Hide resolved
src/minhash.jl Outdated Show resolved Hide resolved
src/minhash.jl Outdated Show resolved Hide resolved
src/mers/predicates.jl Outdated Show resolved Hide resolved
src/mers/kmer.jl Outdated Show resolved Hide resolved
src/mers/kmer.jl Outdated Show resolved Hide resolved
src/mers/kmer.jl Outdated Show resolved Hide resolved
@TransGirlCodes
Copy link
Member Author

Thanks, @jakobnissen & @CiaranOMara, I will update the checklist and start working on your suggestions and comments.

@TransGirlCodes TransGirlCodes force-pushed the better_kmers branch 6 times, most recently from 73176e2 to 0dd6c43 Compare December 10, 2020 14:31
@jakobnissen jakobnissen mentioned this pull request Mar 11, 2021
5 tasks
@jakobnissen jakobnissen added this to the v3.0.0 milestone Mar 12, 2021
@TransGirlCodes TransGirlCodes changed the title Better kmers Remove Kmers Apr 28, 2021
@TransGirlCodes
Copy link
Member Author

@jakobnissen Ok the purpose of this PR is now to remove Kmers from BioSequences for v3.

Some code changes are still made, for example with the BitIndex type etc as some changes were made to accommodate the new tuple kmer style.

@TransGirlCodes
Copy link
Member Author

TransGirlCodes commented Apr 28, 2021

@jakobnissen We need to discuss what to do about DNA and RNA codons, used in translation, as they were just kmer aliases.

One solution is to implement a simple bitstype, to replace that functionality.

@codecov
Copy link

codecov bot commented Apr 28, 2021

Codecov Report

Merging #121 (ab049c8) into v3 (c67ec95) will decrease coverage by 2.27%.
The diff coverage is 50.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##               v3     #121      +/-   ##
==========================================
- Coverage   82.39%   80.12%   -2.28%     
==========================================
  Files          39       31       -8     
  Lines        2568     2179     -389     
==========================================
- Hits         2116     1746     -370     
+ Misses        452      433      -19     
Flag Coverage Δ
unittests 80.12% <50.00%> (-2.28%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/BioSequences.jl 0.00% <ø> (-50.00%) ⬇️
src/biosequence/biosequence.jl 50.00% <0.00%> (-4.55%) ⬇️
src/bit-manipulation/bit-manipulation.jl 69.23% <0.00%> (ø)
src/longsequences/counting.jl 91.30% <ø> (ø)
src/longsequences/stringliterals.jl 100.00% <ø> (ø)
src/bit-manipulation/bitindex.jl 58.97% <40.00%> (-4.92%) ⬇️
src/geneticcode.jl 59.59% <66.66%> (-13.41%) ⬇️
src/biosequence/indexing.jl 67.10% <100.00%> (-0.47%) ⬇️
src/bit-manipulation/bitpar-compiler.jl 38.73% <0.00%> (-20.14%) ⬇️
... and 8 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c67ec95...ab049c8. Read the comment docs.

@jakobnissen
Copy link
Member

Ah, right. It was kind of neat that a codon was just a kmer. Bummer. You're right, let's just implement a simple, internal-only type that wraps a UInt8, that's fine implementation-wise.

@TransGirlCodes
Copy link
Member Author

Ok genetic code and translation now works without RNA and DNA codon. I don't anticipate this to be an issue.

@TransGirlCodes
Copy link
Member Author

I think this is ready now.

@TransGirlCodes TransGirlCodes merged commit 8cf14bc into v3 Apr 29, 2021
@@ -44,7 +44,7 @@ macro aa_str(seq, flag)
return LongAminoAcidSeq(remove_newlines(seq))
elseif flag == "d"
return quote
LongAminoAcidSeq($(remove_newlines(seq)))
LongAASeq($(remove_newlines(seq)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which are we going with, LongAminoAcidSeq or LongAASeq?

@@ -66,7 +66,7 @@ end
# Non-nucleotide characters should throw
@test_throws Exception LongDNASeq("ACCNNCATTTTTTAGATXATAG")
@test_throws Exception LongRNASeq("ACCNNCATTTTTTAGATXATAG")
@test_throws Exception LongAminoAcidSeq("ATGHLMY@ZACAGNM")
@test_throws Exception LongAASeq("ATGHLMY@ZACAGNM")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The struct LongAASeq does not exist, so "ATGHLMY@ZACAGNM" is not tested.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants