Remove Kmers #121

TransGirlCodes · 2020-08-08T13:49:41Z

Types of changes

This PR implements the following changes:
(Please tick any or all of the following that are applicable)

✨ New feature (A non-breaking change which adds functionality).
🐛 Bug fix (A non-breaking change, which fixes an issue).
💥 Breaking change (fix or feature that would cause existing functionality to change).

📋 Additional detail

This PR removes Kmers from BioSequences in prep for v3.

☑️ Checklist

🎨 The changes implemented are consistent with the Julia style guide.
📘 I have updated and added relevant docstrings, in a manner consistent with the documentation style guide.
📘 I have added or updated relevant user and developer manuals/documentation in docs/src/.
🆗 There are unit tests that cover the code changes I have made.
🆗 The unit tests cover my code changes AND they pass.
📝 I have added an entry to the [UNRELEASED] section of the manually curated CHANGELOG.md file for this repository.
🆗 All changes should be compatible with the latest stable version of Julia.
💭 I have commented liberally for any complex pieces of internal code.

CiaranOMara

There's quite a bit here. I've made a start. I'd need to mull over the MerIter stuff and then get back to you.

src/composition.jl

src/iterators/eachmer.jl

jakobnissen

Right @benjward, so I've looked through it now, except the SkipMerFactory, which will take a little longer. It's actually pretty extensive. Good job!

Besides my smaller comments, I do have a few larger-scale points:

First, I think we should really allow kmers of any arbitrary alphabet type. It's not actually that much harder to do, and it's much easier to do it now than to realize a year down the road that we want it, and then redo half this work. It's not just that it should work with other alphabets on principle, but rather that bioinformatics software actually uses kmers of other alphabets. Aligners, for example, often use kmers of reduced amino acid alphabets. Support for kmers of such homemade alphabets out of the box would be really nice.
Second, there is the whole BiKmer or CanonicalKmer, or whatever we call it. The issue is that kmer iteration has some conflicting requirements:
- Some applications want absolutely maximal speed just generating forward kmers. Here, having to generate a MerIterResult is just wasted CPU power.
- Other applications want canonical kmers, or mer positions. Here, we would have to create something like a MerIterResult.

As usual in Julia, the solution to this should be specialization and dispatch. We define the kmer operations, like pushfirst, on both the kmer object and the MerIterResult object. Then we can create kmer iterators with either each(DNAKmer{K}, myseq) or each(MerIterResult{K}, myseq), and pick whichever we want.
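To make the dispatch idea concrete, here is a minimal, self-contained sketch. It is not the BioSequences implementation: `ForwardMer`, `MerResult`, `push_base`, `empty_mer`, `kmer_len`, `encode`, `mask` and this `each` are hypothetical names invented for illustration, and the sketch assumes unambiguous DNA packed 2 bits per base with K at most 32.

```julia
# Hypothetical sketch; none of these names are BioSequences API.

# Cheap result: only the forward kmer, 2 bits per base.
struct ForwardMer{K}
    bits::UInt64
end

# Richer result: start position plus forward and reverse-complement bits.
struct MerResult{K}
    position::Int
    fw::UInt64
    rv::UInt64
end

# 2-bit encoding of an unambiguous DNA character (A=0, C=1, G=2, T=3).
function encode(c::Char)
    if c == 'A'
        return UInt64(0)
    elseif c == 'C'
        return UInt64(1)
    elseif c == 'G'
        return UInt64(2)
    elseif c == 'T'
        return UInt64(3)
    end
    throw(ArgumentError("unsupported base: $c"))
end

# Bit mask covering the low 2K bits.
mask(K) = (UInt64(1) << (2 * K)) - UInt64(1)

kmer_len(::Type{ForwardMer{K}}) where {K} = K
kmer_len(::Type{MerResult{K}}) where {K} = K

# Starting values; the position is chosen so that the first full kmer reports 1.
empty_mer(::Type{ForwardMer{K}}) where {K} = ForwardMer{K}(0)
empty_mer(::Type{MerResult{K}}) where {K} = MerResult{K}(1 - K, 0, 0)

# The same operation, specialised per result type.
push_base(m::ForwardMer{K}, b::UInt64) where {K} =
    ForwardMer{K}(((m.bits << 2) | b) & mask(K))

function push_base(m::MerResult{K}, b::UInt64) where {K}
    fw = ((m.fw << 2) | b) & mask(K)
    # The complement of a 2-bit base is 3 - b; it enters the reverse strand on the left.
    rv = (m.rv >> 2) | ((UInt64(3) - b) << (2 * (K - 1)))
    return MerResult{K}(m.position + 1, fw, rv)
end

canonical(m::MerResult) = min(m.fw, m.rv)

# One generic driver; the result type passed in decides how much work is done.
function each(::Type{T}, seq::AbstractString) where {T}
    K = kmer_len(T)
    out = T[]
    m = empty_mer(T)
    for (i, c) in enumerate(seq)
        m = push_base(m, encode(c))
        i >= K && push!(out, m)
    end
    return out
end
```

With these definitions, `each(ForwardMer{2}, "ACGT")` only maintains the forward bits, while `each(MerResult{2}, "ACGT")` also tracks positions and the reverse complement, so `canonical` is available. The caller chooses the cost by choosing the type.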

I'm happy to help implement this, but I'd rather not step on your toes. So I'll wait until you think you're done with what you want to do.

src/mers/transformations.jl

src/minhash.jl

src/mers/predicates.jl

src/mers/kmer.jl

TransGirlCodes · 2020-08-26T18:35:05Z

Thanks, @jakobnissen & @CiaranOMara, I will update the checklist and start working on your suggestions and comments.

TransGirlCodes · 2021-04-28T10:35:57Z

@jakobnissen OK, the purpose of this PR is now to remove Kmers from BioSequences for v3.

Some code changes remain, for example to the BitIndex type, because they were made to accommodate the new tuple-based kmer style.

TransGirlCodes · 2021-04-28T10:38:10Z

@jakobnissen We need to discuss what to do about DNA and RNA codons, used in translation, as they were just kmer aliases.

One solution is to implement a simple bitstype, to replace that functionality.

codecov · 2021-04-28T14:32:30Z

Codecov Report

Merging #121 (ab049c8) into v3 (c67ec95) will decrease coverage by 2.27%.
The diff coverage is 50.00%.

@@            Coverage Diff             @@
##               v3     #121      +/-   ##
==========================================
- Coverage   82.39%   80.12%   -2.28%     
==========================================
  Files          39       31       -8     
  Lines        2568     2179     -389     
==========================================
- Hits         2116     1746     -370     
+ Misses        452      433      -19

| Flag | Coverage Δ | |
|---|---|---|
| unittests | `80.12% <50.00%> (-2.28%)` | ⬇️ |

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/BioSequences.jl | `0.00% <ø> (-50.00%)` | ⬇️ |
| src/biosequence/biosequence.jl | `50.00% <0.00%> (-4.55%)` | ⬇️ |
| src/bit-manipulation/bit-manipulation.jl | `69.23% <0.00%> (ø)` | |
| src/longsequences/counting.jl | `91.30% <ø> (ø)` | |
| src/longsequences/stringliterals.jl | `100.00% <ø> (ø)` | |
| src/bit-manipulation/bitindex.jl | `58.97% <40.00%> (-4.92%)` | ⬇️ |
| src/geneticcode.jl | `59.59% <66.66%> (-13.41%)` | ⬇️ |
| src/biosequence/indexing.jl | `67.10% <100.00%> (-0.47%)` | ⬇️ |
| src/bit-manipulation/bitpar-compiler.jl | `38.73% <0.00%> (-20.14%)` | ⬇️ |
| ... and 8 more | | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c67ec95...ab049c8.

jakobnissen · 2021-04-28T15:24:18Z

Ah, right. It was kind of neat that a codon was just a kmer. Bummer. You're right, let's just implement a simple, internal-only type that wraps a UInt8, that's fine implementation-wise.
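For illustration only, a type of that shape could look like the sketch below. This is an assumption about what "wraps a UInt8" might mean, not the code that was merged: `Codon`, `base_bits` and `table_index` are hypothetical names, and the layout (three 2-bit RNA bases in the low 6 bits, A=0, C=1, G=2, U=3) is chosen here purely as an example.

```julia
# Hypothetical internal-only codon type; not the merged implementation.
struct Codon
    bits::UInt8   # three 2-bit bases, first base in the highest of the 6 used bits
end

# 2-bit code of an unambiguous RNA base (A=0, C=1, G=2, U=3).
function base_bits(c::Char)
    i = findfirst(==(c), "ACGU")
    i === nothing && throw(ArgumentError("not an unambiguous RNA base: $c"))
    return UInt8(i - 1)
end

# Pack three bases into one byte.
Codon(a::Char, b::Char, c::Char) =
    Codon((base_bits(a) << 4) | (base_bits(b) << 2) | base_bits(c))

# 1-based index into a 64-entry lookup table.
table_index(c::Codon) = Int(c.bits) + 1

# Toy lookup table: only AUG is filled in; 'X' marks entries left unset.
table = fill('X', 64)
table[table_index(Codon('A', 'U', 'G'))] = 'M'
```

Because the struct has a single `UInt8` field, it is an isbits type of size one byte, and translation reduces to an array lookup via `table_index`.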

TransGirlCodes · 2021-04-28T16:18:25Z

OK, the genetic code and translation now work without RNA and DNA codons. I don't anticipate this being an issue.

TransGirlCodes · 2021-04-28T21:41:22Z

I think this is ready now.

CiaranOMara · 2021-04-29T12:22:27Z

src/longsequences/stringliterals.jl

@@ -44,7 +44,7 @@ macro aa_str(seq, flag)
 return LongAminoAcidSeq(remove_newlines(seq))
 elseif flag == "d"
 return quote
- LongAminoAcidSeq($(remove_newlines(seq)))
+ LongAASeq($(remove_newlines(seq)))


Which are we going with, LongAminoAcidSeq or LongAASeq?

CiaranOMara · 2021-04-29T12:23:52Z

test/longsequences/conversion.jl

@@ -66,7 +66,7 @@ end
 # Non-nucleotide characters should throw
 @test_throws Exception LongDNASeq("ACCNNCATTTTTTAGATXATAG")
 @test_throws Exception LongRNASeq("ACCNNCATTTTTTAGATXATAG")
- @test_throws Exception LongAminoAcidSeq("ATGHLMY@ZACAGNM")
+ @test_throws Exception LongAASeq("ATGHLMY@ZACAGNM")


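The struct LongAASeq does not exist, so the call throws an UndefVarError. @test_throws Exception still passes, but the invalid input "ATGHLMY@ZACAGNM" is never actually tested.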
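The effect can be reproduced without BioSequences. In this sketch `NoSuchSeqType` is a deliberately undefined, made-up name standing in for a missing type or alias. It assumes, as the comment above states, that the constructor name is not defined at the point the test runs.

```julia
using Test

# `NoSuchSeqType` is intentionally undefined; it stands in for a missing alias.
@testset "undefined constructor" begin
    # Passes, but only because referencing an undefined name throws UndefVarError.
    @test_throws Exception NoSuchSeqType("ATGHLMY@ZACAGNM")
    # Pinning the exception type makes the real cause visible.
    @test_throws UndefVarError NoSuchSeqType("ATGHLMY@ZACAGNM")
end
```

Both tests pass, which shows why a bare `@test_throws Exception` cannot distinguish "the input was rejected" from "the constructor does not exist". Asserting a more specific exception type is one way to guard against this.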

TransGirlCodes self-assigned this Aug 8, 2020

TransGirlCodes force-pushed the better_kmers branch 5 times, most recently from 8c4f6a8 to ac1f65c on August 20, 2020 01:43

CiaranOMara reviewed Aug 20, 2020

jakobnissen reviewed Aug 23, 2020

TransGirlCodes force-pushed the better_kmers branch 6 times, most recently from 73176e2 to 0dd6c43 on December 10, 2020 14:31

jakobnissen mentioned this pull request Feb 25, 2021

Future of subsequences #118

Closed

jakobnissen mentioned this pull request Mar 11, 2021

Fixup tests before v3 #139

Closed

5 tasks

jakobnissen added this to the v3.0.0 milestone Mar 12, 2021

TransGirlCodes changed the title ~~Better kmers~~ Remove Kmers Apr 28, 2021

TransGirlCodes requested review from jakobnissen and CiaranOMara April 28, 2021 10:34

TransGirlCodes changed the base branch from master to v3 April 28, 2021 13:12

TransGirlCodes force-pushed the better_kmers branch from 32b6f43 to 71e7811 on April 28, 2021 13:25

Remove Kmers

83689ba

TransGirlCodes force-pushed the better_kmers branch from 71e7811 to 83689ba on April 28, 2021 13:27

BioJulia deleted a comment from codecov bot Apr 28, 2021

Fix a couple of errors

63c650b

Get all tests to pass without RNACodon or DNACodon

ab049c8

TransGirlCodes merged commit 8cf14bc into v3 Apr 29, 2021

CiaranOMara reviewed Apr 29, 2021
