Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added Alignment class #41

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Added Alignment class #41

wants to merge 1 commit into from

Conversation

jacarey
Copy link

@jacarey jacarey commented May 15, 2023

Replicated the Alignment class from fgbio

@jacarey jacarey requested review from nh13 and tfenne May 15, 2023 23:10
Copy link
Member

@nh13 nh13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking really good, a number of comments to address and then I'll take one final look!

@@ -0,0 +1,232 @@
from typing import Callable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd move this to fgpyo/alignment.py and then move the tests similarly. Any reason to keep it this way?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't forget to add module docs. Can you also add an example or two of the padded string method? See examples in other modules:

Examples
~~~~~~~~
.. code-block:: python
>>> from fgpyo.read_structure import ReadStructure
>>> rs = ReadStructure.from_string("75T8B75T")
>>> [str(segment) for segment in rs]
'75T', '8B', '75T'
>>> rs[0]
ReadSegment(offset=0, length=75, kind=<SegmentType.Template: 'T'>)
>>> rs = rs.with_variable_last_segment()
>>> [str(segment) for segment in rs]
['75T', '8B', '+T']
>>> rs[-1]
ReadSegment(offset=83, length=None, kind=<SegmentType.Template: 'T'>)
>>> rs = ReadStructure.from_string("1B2M+T")
>>> [s.bases for s in rs.extract("A"*6)]
['A', 'AA', 'AAA']
>>> [s.bases for s in rs.extract("A"*5)]
['A', 'AA', 'AA']
>>> [s.bases for s in rs.extract("A"*4)]
['A', 'AA', 'A']
>>> [s.bases for s in rs.extract("A"*3)]
['A', 'AA', '']
>>> rs.template_segments()
(ReadSegment(offset=3, length=None, kind=<SegmentType.Template: 'T'>),)
>>> [str(segment) for segment in rs.template_segments()]
['+T']
>>> try:
... ReadStructure.from_string("23T2TT23T")
... except ValueError as ex:
... print(str(ex))
...
Read structure missing length information: 23T2T[T]23T

Suggested change
from typing import Callable
"""
Classes for representing Alignments
-----------------------------------
<INSERT an example or TWO HERE>
Module Contents
~~~~~~~~~~~~~~~
The module contains the following public classes:
- :class:`~fgpyo.alignment.Alignment` -- Describes the alignment between two sequences
"""
from typing import Callable

target: the target sequence
query_start: the 1-based position in the query sequence where the alignment begins
target_start: the 1-based position in the target sequence where the alignment begins
cigar: a Cigar object describing the alignment of the two sequences
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tend not to use "class" or "object" in docstrings since they we know they are classes or objects (since that's what we're document)

Suggested change
cigar: a Cigar object describing the alignment of the two sequences
cigar: the Cigar describing the alignment of the two sequences


class Alignment:
"""
A general class to describe the alignment between two sequences or partial ranges thereof
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
A general class to describe the alignment between two sequences or partial ranges thereof
Describes the alignment between two sequences.

"""
return self.target_start + self.cigar.length_on_target() - 1

def padded_string(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the main reason this class is here, so I'd highlight it in the module docs in an example

mismatch_char: str = ".",
gap_char: str = " ",
pad_char: str = "-",
) -> List[str]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prefer a tuple when we have a fixed # of items returned (or use a namedtuple/dataclass, but a tuple is best here). That allows us to unpack on assignment: q, a, t = alignment.padded_string(...)

Suggested change
) -> List[str]:
) -> Tuple[str, str, str]:

Comment on lines +138 to +142
if not (
self.query_start <= start <= self.query_end
and self.query_start <= end <= self.query_end
):
raise ValueError("start or end is outside of aligned region of target sequence")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's more useful to have the conditions separated so you know if it start or end that's invalid. And why not use assert?

Suggested change
if not (
self.query_start <= start <= self.query_end
and self.query_start <= end <= self.query_end
):
raise ValueError("start or end is outside of aligned region of target sequence")
assert self.query_start <= start <= self.query_end, f"start '{start}' is outside of aligned region of query sequence "
assert self.query_start <= end <= self.query_end, f"end '{end}' is outside of aligned region of query sequence "

Also, we carried over a bug from fgbio (it's query, not target): fulcrumgenomics/fgbio#914

Returns:
A new Alignment with updated coordinates and cigar.
"""
if not (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

elems.append(CigarElement(length, elem.operator))
curr_start += length

return Alignment(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you make this an attr class, you can do attr.evolve :)

start, end, self.target_start, lambda elem: elem.operator.consumes_reference
)

def sub(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make "private" as the docstring suggest

Suggested change
def sub(
def _sub(

@@ -332,7 +332,7 @@ class CigarOp(enum.Enum):
code (int): The :py:mod:`~pysam` cigar operator code.
character (int): The single character cigar operator.
consumes_query (bool): True if this operator consumes query bases, False otherwise.
consumes_target (bool): True if this operator consumes target bases, False otherwise.
consumes_reference (bool): True if this operator consumes target bases, False otherwise.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
consumes_reference (bool): True if this operator consumes target bases, False otherwise.
consumes_reference (bool): True if this operator consumes reference bases, False otherwise.

@nh13 nh13 added the good first issue Good for newcomers label May 20, 2024
@nh13 nh13 added the help wanted Extra attention is needed label Jul 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants