Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pipe_input target #1173

Merged
merged 5 commits into from
Oct 13, 2024
Merged

Conversation

jernejfrank
Copy link
Contributor

Addresses #1161.

Default behaviour reverts back to first parameter chaining for backwards compatibility.

Changes

  • Corrected minor bug for Applicable (passing through target when using other methods)
  • Added user facing API on_input
  • Added on_input to pipe_input for global setting

How I tested this

  • Unit tests

Notes

There are two differences to on_output:

  1. Since we do not use NodeTransformer we can have a mixture of global and local on_input settings. I made it so that the local (on the level of Applicable) adds to the global one (on the level of pipe_input). The other option would be to allow the local to override the global one. I also added a safeguard that if on_input is used, all transforms need to have a target to have it clear.
  2. The namespace convention needs slight adjustment in case we apply pipe_input to multiple parameters. Since each node is chained through the parameter name, I have now that in case the same transform is applied to two parameters, we create two nodes with the same namespace but add to them "_param_name" so that we don't have problems with same-node names in the DAG.

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Reviewed everything up to 5441199 in 33 seconds

More details
  • Looked at 838 lines of code in 3 files
  • Skipped 0 files when reviewing.
  • Skipped posting 4 drafted comments based on config settings.
1. hamilton/function_modifiers/macros.py:920
  • Draft comment:
    Ensure self.target is not None before calling extend on it. This can lead to a TypeError if self.target is None. Consider initializing self.target as an empty list if it's None.
  • Reason this comment was not posted:
    Comment did not seem useful.
2. hamilton/function_modifiers/macros.py:885
  • Draft comment:
    Initialize self.target as an empty list if on_input is None to avoid issues when extending it later.
  • Reason this comment was not posted:
    Marked as duplicate.
3. hamilton/function_modifiers/macros.py:939
  • Draft comment:
    Explicitly check if self.namespace is a string before using it in _resolve_namespace method for clarity and to avoid potential issues.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable:
    The current implementation of _resolve_namespace assumes that if self.namespace is not ... or None, it is a string. The suggestion to explicitly check if it is a string could add clarity and prevent potential issues if self.namespace is accidentally set to a non-string value. However, the current logic seems to handle the expected cases, and the suggestion might be considered overly cautious unless there is a known issue with self.namespace being set incorrectly.
    I might be overthinking the need for an explicit string check. The current logic seems to handle the expected cases, and the suggestion might be unnecessary unless there is a specific reason to expect self.namespace could be set to a non-string value.
    Explicitly checking for a string could prevent future bugs if self.namespace is accidentally set to an unexpected type, but it might be unnecessary if the current logic is sufficient.
    The comment suggests a precautionary measure that might not be necessary given the current logic. However, it could prevent potential issues if self.namespace is set incorrectly. Consider keeping the comment if there's a history of such issues.
4. hamilton/function_modifiers/macros.py:857
  • Draft comment:
    Please add documentation for the new on_input functionality in the pipe_input decorator to the Sphinx documentation under docs/. This will help users understand how to use this feature effectively.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable:
    The comment is about documentation, which is not directly related to the code changes in the diff. The rules specify not to ask for updates to the PR description or documentation unless it's part of the code changes. Since the comment is about documentation outside the code changes, it should be removed.
    I might be overlooking the importance of having external documentation for new features, but the rules are clear about not commenting on documentation unless it's part of the code changes.
    The rules are strict about not commenting on documentation unless it's part of the code changes, so the comment should be removed.
    Remove the comment as it pertains to documentation outside the scope of the code changes in the diff.

Workflow ID: wflow_iN66poHVEvrsbEW2


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

Copy link
Collaborator

@elijahbenizzy elijahbenizzy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Left some thoughts -- I think this is valuable (E.G. the first parameter is a bit of an awkward assumption), but we might want to restrict the API a bit. Also some naming/consistency stuff.

"""

def __init__(
self,
*transforms: Applicable,
namespace: NamespaceType = ...,
on_input: base.TargetType = None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I'm not sure that on_input makes sense for anything but a single string, and the semantics aren't quite the same. Some thoughts:

  1. By default, this applies to the first parameter
  2. Does ... apply to everything?

IMO you should restrict it more -- either a string, or default (which is the first parameter). Also, @load_from (which is similar) has the parameter named target. But that's confusing, and probably not the best (maybe it should be called on_input there?. Thoughts on naming?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I'm not sure that on_input makes sense for anything but a single string, and the semantics aren't quite the same. Some thoughts:

  1. By default, this applies to the first parameter

Yes, I left the default to chain the first parameter and keep the namespace as is for backwards compatibility 👍 .

  1. Does ... apply to everything?

I think we can add ..., but you could also just use the global on_input and pass in a list with the parameters.

IMO you should restrict it more -- either a string, or default (which is the first parameter). Also, @load_from (which is similar) has the parameter named target. But that's confusing, and probably not the best (maybe it should be called on_input there?. Thoughts on naming?

Ah okay, so the idea I had in mind is to leave flexibility that you can run pipe_input on more than one parameter: 3 different cases specifically

  1. You can set a global on_input and it will apply it to those input arguments, like if we want to run it on the second argument w-->_add_one --> _add_two --> 2nd arg of foo
def _add_one(v: int) -> int:
    return v + 1

def _add_two(v: int) -> int:
    return v + 2

@pipe_input(
    step(_add_one).named("a"), 
    step(_add_two).named("b"),
    on_input="w",
    namespace="global",
)
def fool(v: int, w: int) -> int:
    return w
  1. You can set a local on_input and choose which transform runs on which input argument, like v-->_add_one --> 1st arg of foo and w-->_add_two --> 2nd arg of foo.
def _add_one(v: int) -> int:
    return v + 1

def _add_two(v: int) -> int:
    return v + 2

@pipe_input(
    step(_add_one).on_input("v").named("a"), 
    step(_add_two).on_input("w").named("b"),
    namespace="local",
)
def fool(v: int, w: int) -> int:
    return v+w
  1. You can set have a mixture on_input for global and local. The local adds specific transforms on top of what the global has. Here the whole pipeline gets applied to v, but only _add_two to w
def _add_one(v: int) -> int:
    return v + 1

def _add_two(v: int) -> int:
    return v + 2

@pipe_input(
    step(_add_one).named("a"), 
    step(_add_two).on_input("w").named("b"),
    on_input="v"
    namespace="local",
)
def fool(v: int, w: int) -> int:
    return v+w
  1. The None defaults to first parameter and if someone sets on_input but not all the steps in pipe_input get hit, we error out that is unclear what the user wants and he either needs to specify locally or globally or omit.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jernejfrank -- this makes a lot of sense.

To sum up the offline conversation we had, we decided to restrict it to a single input (for now), knowing that we can turn it on for more. The reasoning is to reduce supported surface area until we get a request for this feature!

self.chain = _chain
self.namespace = namespace

if isinstance(on_input, str): # have to do extra since strings are collections in python
self.target = [on_input]
elif isinstance(on_input, Collection):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the use-case for a collection here? It's interesting, but feels a bit more complex than necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The collection was to apply it for a subset of params, here is an example of globally, but it also works locally:
Lets say we only want to pipe_input first two params

def _add_one(v: int) -> int:
    return v + 1

def _add_two(v: int) -> int:
    return v + 2

@pipe_input(
    step(_add_one).named("a"), 
    step(_add_two).named("b"),
    on_input=["v","w"],
    namespace="global",
)
def foo(v: int, w: int, z:str, .... kwargs) -> int:
    return w

total_nodes = []
total_rename_maps = {}

for param in parameters_transforms_map:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic is getting complex -- we should probably break it out into a separate function and add a unit test or two for it...

@jernejfrank
Copy link
Contributor Author

Nice! Left some thoughts -- I think this is valuable (E.G. the first parameter is a bit of an awkward assumption), but we might want to restrict the API a bit. Also some naming/consistency stuff.

The API expanded substantially to leave as much flexibility in there as possible. It also is a complete dual to pipe_output in the sense that you can target multiple parameters (like we do end nodes) and can allocate transforms individually between them -- so you could use them interchangeably / becomes a personal preference if you rather chain on output nodes or input parameter.

But I'm happy to constrain it more, like @elijahbenizzy suggested, to only allow only a single parameter and you pass it in as a string seems reasonable. In case you want more flexibility use pipe_output. There would be one case I don't see this working out, but that is probably then more indicative of a code smell than a lack of flexibility:

def _transform():
    ...

def _other_transform():
    ...

@pipe_output(step(_transform)) # this won't work
def interesting_node():
    ...

@pipe_input(
        step(_transform).on_input="interesting_node", # this should work
        step(_other_transform).on_input("foo") # leaving flexibility to also transform other params
) 
def downstream(foo, bar, baz, interesting_node,...):
    ...

def some_other_dependency(interesting_node):
# here you don't want to have the pipe transform happening just the original `interesting_node` input
    ...

@jernejfrank
Copy link
Contributor Author

Nice! Left some thoughts -- I think this is valuable (E.G. the first parameter is a bit of an awkward assumption), but we might want to restrict the API a bit. Also some naming/consistency stuff.

The API expanded substantially to leave as much flexibility in there as possible. It also is a complete dual to pipe_output in the sense that you can target multiple parameters (like we do end nodes) and can allocate transforms individually between them -- so you could use them interchangeably / becomes a personal preference if you rather chain on output nodes or input parameter.

But I'm happy to constrain it more, like @elijahbenizzy suggested, to only allow only a single parameter and you pass it in as a string seems reasonable. In case you want more flexibility use pipe_output. There would be one case I don't see this working out, but that is probably then more indicative of a code smell than a lack of flexibility:

def _transform():
    ...

def _other_transform():
    ...

@pipe_output(step(_transform)) # this won't work
def interesting_node():
    ...

@pipe_input(
        step(_transform).on_input="interesting_node", # this should work
        step(_other_transform).on_input("foo") # leaving flexibility to also transform other params
) 
def downstream(foo, bar, baz, interesting_node,...):
    ...

def some_other_dependency(interesting_node):
# here you don't want to have the pipe transform happening just the original `interesting_node` input
    ...

Maybe it would make sense to keep it like this because the subtle difference in the implementations between pipe_input and pipe_output can work hand in hand.

Specifically, for pipe_output we touch the original node by creating this chain node_raw--chain_nodes--node, whereby for pipe_input we have param--chain_node--renamed_param_in_upstream_node so we don't touch the original node, but rename the argument in the next node.

Copy link
Collaborator

@elijahbenizzy elijahbenizzy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Some small docs suggestions.

Also, let's create a ticket here. You've done so much for multiple already, we should track + link from the docs. Nice work!

@@ -345,7 +345,7 @@ def __init__(
:param _resolvers: Resolvers to use for the function
:param _name: Name of the node to be created
:param _namespace: Namespace of the node to be created -- currently only single-level namespaces are supported
:param _target: Selects which target nodes it will be appended onto. By default all.
:param _target: Selects which target nodes it will be appended onto. Default None gets resolved on decorator level.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would explain this more -- E.G. reference the decorators/how they resolve it.

.. code-block:: python

@pipe_input(
step(_add_one).on_input(["p1","p3"])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update docs -- we disabled this (for now), right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, I found a hack how to "comment out" things from rst so Sphinx doesn't generate this part of the docstring.

"""

def __init__(
self,
*transforms: Applicable,
namespace: NamespaceType = ...,
on_input: base.TargetType = None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jernejfrank -- this makes a lot of sense.

To sum up the offline conversation we had, we decided to restrict it to a single input (for now), knowing that we can turn it on for more. The reasoning is to reduce supported surface area until we get a request for this feature!

The new name makes the functionality clearer.
Pipe_input can select which parameter to apply to

- Added on_input to Applicable for user facing API
- Corrected Applicable named and namespaced to pass through target
- Global on_input for pipe_input
The implementation allows for global string input for a target.

- Functionality is there for multiple global and local parameter targets, but
  we shortcircuit for the moment with an error if the arg is not a
  string.
- At the moment all transforms in the pipe get applied to the same
  argument.
- Test and docs have commented out the multiple parameter cases for easy
  enabling should we want this.
- Added URL of the GitHub issue tracking interest to enable multiple
  params
@elijahbenizzy elijahbenizzy merged commit 97cf127 into DAGWorks-Inc:main Oct 13, 2024
24 checks passed
@jernejfrank jernejfrank deleted the feat/target_pipe_input branch October 21, 2024 03:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants