
SDN-4604: Networking: egress IP per destination proposal #1594

Open

wants to merge 1 commit into master from eip-per-dst

Conversation

martinkennelly
Contributor

/hold
/cc

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 13, 2024
@openshift-ci-robot

openshift-ci-robot commented Mar 13, 2024

@martinkennelly: This pull request references SDN-4604 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to this:

/hold
/cc

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 13, 2024
Contributor

openshift-ci bot commented Mar 13, 2024

@martinkennelly: GitHub didn't allow me to request PR reviews from the following users: martinkennelly.

Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/hold
/cc

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Contributor

openshift-ci bot commented Mar 13, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign trozet for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@martinkennelly martinkennelly force-pushed the eip-per-dst branch 6 times, most recently from 04adba5 to 1435ec2 Compare March 14, 2024 14:30
@martinkennelly
Contributor Author

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 14, 2024
With the Kubernetes label selector semantics, cluster administrators can define a set of destination networks that multiple `EgressIP`
CRs can consume.

If more than one `EgressIPTraffic` CR is selected, the networks are combined, and if a destination IP is within one
Contributor

What if the EgressIPTraffic have overlapping subnets? Do we do longest prefix match?

Contributor Author

If they have overlapping subnets, it's not a concern, because all we care about is whether the destination IP is within one of the subnets. Once one matches, SNATing occurs.

Let's say we have EIPTraffic1 with network 1.1.1.0/24 and EIPTraffic2 with network 1.1.0.0/16. The aforementioned `EgressIPTraffic` CRs are selected by an EIP, and if a destination IP falls within either network, then SNAT occurs.


What happens if the reverse "overlap" occurs where two EgressIP on the same pod have overlapping destination CIDR? It could be the same EgressIPTraffic matching two EIP, or separate EgressIPTraffic with overlapping CIDRs on the two EIP.

Contributor Author

If a pod attempts to connect to an IP in a network that is selected by multiple `EgressIP` CRs, then the behaviour is undefined, and either `EgressIP` may be selected as the source IP.

Added this to the enhancement.

CRs can consume.

If more than one `EgressIPTraffic` CR is selected, the networks are combined, and if a destination IP is within one
of the networks, then the pod's source IP will be the `EgressIP`.
Contributor

What if TrafficSelector is specified, and the traffic does not match any of the destinations. Does it get dropped? Does it SNAT out its normal node?

Contributor Author

It doesn't get dropped; SNATing to an EIP simply does not occur, but SNAT to the node IP does.


Just to confirm... Is it correct that in this case the packet would egress at the node where the pod is hosted (instead of being forwarded to one of the egress-able nodes)?

Contributor Author

Yes, it would egress the node where the pod is hosted.


If more than one `EgressIPTraffic` CR is selected, the networks are combined, and if a destination IP is within one
of the networks, then the pod's source IP will be the `EgressIP`.
If `TrafficSelector` is not specified, existing `EgressIP` behaviour is preserved.
Contributor

Is adding a TrafficSelector with 0.0.0.0/0 the same as not specifying a traffic selector?

Contributor Author

Yes, because there are no traffic constraints specified.

If more than one `EgressIPTraffic` CR is selected, the networks are combined, and if a destination IP is within one
of the networks, then the pod's source IP will be the `EgressIP`.
If `TrafficSelector` is not specified, existing `EgressIP` behaviour is preserved.

Contributor

What if someone specifies Egress IP A with a TrafficSelector and Egress IP B with no traffic selector, and they both match on the same pod. The pod sends traffic that does not match the TrafficSelector, will it still be sent with Egress IP A?

Contributor Author

martinkennelly commented Mar 25, 2024

That scenario would be undefined: either EgressIP CR can be "active", but not both, because EgressIP B would select all traffic.

Today, if 2 EgressIPs select the same set of pods, one is active and the other is a standby.

With this enhancement, we allow overlapping EgressIP CRs (i.e. CRs selecting the same set of one or more pods) only if the selected EgressIPTraffic destination networks do not overlap. If the destination networks overlap, then the behaviour is undefined.


Combining this question with the previous one (a selector of 0.0.0.0/0). I would anticipate that a common pattern would be a pod with 2 egress networks where one is the "default" and the other hosts specific functionality on a given subnet. In this scenario the user would likely set EIP-A with a specific traffic selector (eg 1.2.3.0/24) and EIP-B with the default 0.0.0.0/0 to catch everything else.

  • Would these be treated as overlapping and fall into the one-active-one-standby scenario?
  • If so, would it be possible to allow at least this narrower use of overlap where one of the EgressIPTraffic contains the default route? Or could one EgressIPTraffic have an isDefault: <true|false> specifier?

Contributor Author

I see. Thanks for this insight from the user's perspective.

This would complicate the implementation. We'd have to have multiple logical router policy priorities to implement this.

Would these be treated as overlapping and fall into the one-active-one-standby scenario?

Good question. This makes me rethink the existing one-active/one-standby behaviour: instead, if there are overlapping networks (like your example), it's simply undefined which one gets selected.
I'd love to do what you suggested, but it increases the complexity of the implementation, which I am slow to agree to.

If so, would it be possible to allow at least this narrower use of overlap where one of the EgressIPTraffic contains the default route? Or could one EgressIPTraffic have an isDefault: <true|false> specifier?

We could do this in a future RFE if it's a limitation. If we added this, we'd need to add new logical route policies in OVN and new rule priorities in iproute2 ip rules.


I have a question pending with the customer on whether this aspect is a limitation that can be worked with, or if it is seen as a serious issue. I'll follow up...


Follow-up with the customer did not raise any concerns.

enhancements/network/egress-ip-per-destinations.md (outdated thread; resolved)
```golang
// EgressIPTrafficSpec defines the desired state description of EgressIPTraffic.
// +kubebuilder:validation:MaxProperties=1
// +kubebuilder:validation:MinProperties=1
type EgressIPTrafficSpec struct {
```
Contributor

So with this CRD we can specify a list of networks. I expect some customers will want to also match by layer 4 ports. Should we include that in scope for this enhancement?

Contributor Author

I'd rather leave it out of this enhancement as it wasn't a defined goal, but it's up to you, if you feel strongly about it.


### API Extensions

Add `TrafficSelector` field to `EgressIPSpec`:
Contributor

Where is this type defined?

Contributor Author

```golang
// +kubebuilder:printcolumn:name="Age",type="date",JSONPath=".metadata.creationTimestamp"
// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object
// EgressIPTraffic defines a set of networks outside the cluster network.
type EgressIPTraffic struct {
```
Contributor

Contributor Author

Added

```golang
// TrafficSelector applies the egress IP only to the network traffic defined within the selected EgressIPTraffic(s).
// If not set, all egress traffic is selected.
// +optional
TrafficSelector metav1.LabelSelector `json:"trafficSelector"`
```
Contributor

What happens when a single traffic selector matches many egressIPSpecs?

What happens when a traffic selector doesn't match any egressIPSpecs?

Contributor Author

What happens when a single traffic selector matches many egressIPSpecs?

I'll assume you meant matches many EgressIPTraffic CRs. I've added an update to the enhancement.

What happens when a traffic selector doesn't match any egressIPSpecs?

Ditto.
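How a `TrafficSelector` picks up `EgressIPTraffic` CRs can be sketched with a simplified stand-in for Kubernetes label-selector semantics. This is illustrative only: the real field is a `metav1.LabelSelector`, which also supports `matchExpressions`; `matchesSelector` and the `network: partner` labels are hypothetical:

```go
package main

import "fmt"

// matchesSelector mimics the matchLabels half of a Kubernetes
// LabelSelector: every key/value in the selector must be present
// on the EgressIPTraffic CR's labels for the CR to be selected.
func matchesSelector(selector, labels map[string]string) bool {
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	selector := map[string]string{"network": "partner"} // trafficSelector on the EgressIP
	fmt.Println(matchesSelector(selector, map[string]string{"network": "partner"}))  // true: CR selected
	fmt.Println(matchesSelector(selector, map[string]string{"network": "internal"})) // false: CR ignored
}
```

Every CR the selector matches contributes its networks; a selector matching no CRs contributes none, which is the empty-network case discussed in this thread.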


```golang
// EgressIPSpec is a desired state description of EgressIP.
type EgressIPSpec struct {
```
Contributor

Plan to featuregate the new fields: `// +openshift:enable:FeatureGate=Example`

Contributor Author

martinkennelly commented Apr 23, 2024

Done.
@trozet, will I have to create a downstream-only ovn-k patch to add this featuregate restriction to the CRD?
yes, in CNO

Comment on lines +56 to +59
A new `TrafficSelector` field is added to the `EgressIP` spec which uses Kubernetes label selector semantics
to select a new CRD called `EgressIPTraffic` that contains a list of destination networks. This list of networks defines when an `EgressIP` should
be used as the source IP: when an application attempts to talk to a destination address that is within one of the predefined networks
defined within an `EgressIPTraffic` CR.


This allows a user to define the external networks once per network (i.e. in the EgressIPTraffic CR), which is really nice for maintenance. If a newly created subnet is reachable behind an interface, the cluster admin only needs to update the EgressIPTraffic CR which contains the reachable routes for that interface, and all EgressIPs making use of it will be updated.
One question...

  • When adding another reachable subnet via an interface, is there a difference in functionality/performance if the user creates an additional EgressIPTraffic CR (with the same label) vs adding an additional CIDR within the existing EgressIPTraffic CR?

Contributor Author

Nope, no data-path performance difference, and a negligible control-plane CPU usage increase.

enhancements/network/egress-ip-per-destinations.md (outdated thread; resolved)
```golang
// match this pod selector.
// +optional
PodSelector metav1.LabelSelector `json:"podSelector,omitempty"`
}
```


Would there be an update to the EgressIP status fields to show what EgressIPTraffic CRs are bound and/or the set of resolved destination CIDR applicable to this EIP?

Contributor Author

@trozet WDYT? I am slow to add a new status field for this because it would be optionally filled.

All traffic egressing the cluster will be SNAT'd.

If a user defines an `EgressIP` CR and populates the `TrafficSelector` field with a selector, then regardless of whether it matches any `EgressIPTraffic`
CR, if there are no defined networks among the selected `EgressIPTraffic` CRs, no cluster-external traffic is SNAT'd to the egress IP. This


Is this also the same case as the question @trozet asked above about a packet which does not fall within any CIDR in any selected EgressIPTraffic CRs? In this case there is no SNAT due to EgressIP but the packet may (would?) be SNAT to the node's interface IP as it egresses?

Contributor Author

Yes, it's sent out the primary interface and SNAT'd to that IP.

## Proposal

A new `TrafficSelector` field is added to the `EgressIP` spec which uses Kubernetes label selector semantics
to select a new CRD called `EgressIPTraffic` that contains a list of destination networks. This list of networks defines when an `EgressIP` should


I think it is implied, but to confirm, these aspects of EgressIP do not change as a result of this enhancement (either with or without a traffic selector applied):

  • Traffic egressing under an EIP (either without traffic selector or as a result of being selected) will be load balanced across multiple IP addresses defined within that EIP.
  • Placement of IP addresses from an EIP is based on egress-assignable nodes and interfaces with appropriate subnets to host the IP.

One aspect of this is that the user may design networking connectivity as desired/needed. The multiple egress networks (EIP) do not necessarily need to be connected to the same node(s).

Contributor Author

Traffic egressing under an EIP (either without traffic selector or as a result of being selected) will be load balanced across multiple IP addresses defined within that EIP.

yes

Placement of IP addresses from an EIP is based on egress-assignable nodes and interfaces with appropriate subnets to host the IP.

yes

The cluster admin labels this node as egress-able.
The cluster admin creates two `EgressIPTraffic` CRs, specifies a destination network in each, and labels each CR differently.
The cluster admin creates two `EgressIP` CRs, each containing an IP that falls within the previously added networks' ranges.
The two `EgressIP`s select the same set of pods but use different `TrafficSelector`s to select the previously created `EgressIPTraffic` CRs.


Would it be an invalid configuration if the EgressIP CRs had IP addresses (not traffic selection cidr) which overlap on the same subnet (ie potentially would egress the same interface on the egress-able nodes)?

Contributor Author

I am not following - can you elaborate?


Mostly just a theoretical case...
EIP-a has EIP 1.1.1.1 and a traffic selector of 10.10.10.0/24.
EIP-b has EIP 1.1.1.2 and a traffic selector of 11.11.11.0/24.
Both EIPs could be bound to the same interface on an egress node, but traffic going to the 10 net gets SNAT'd to 1.1.1.1 and traffic going to the 11 net gets SNAT'd to 1.1.1.2.

@martinkennelly
Contributor Author

martinkennelly commented Apr 23, 2024

@trozet @cgoncalves @deads2k @imiller0 I've updated the enhancement following your comments. PTAL.

Contributor

openshift-ci bot commented Apr 23, 2024

@martinkennelly: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@martinkennelly
Contributor Author

Debugging/introspection of this feature is not defined within this enhancement and requires definition. I will add it.

@openshift-bot
Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 26, 2024
@openshift-bot
Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 6, 2024
@openshift-ci openshift-ci bot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jun 6, 2024
@martinkennelly
Contributor Author

martinkennelly commented Jun 6, 2024

/remove-lifecycle rotten

@openshift-ci openshift-ci bot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jun 6, 2024
@openshift-bot
Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 4, 2024
@imiller0

/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 11, 2024
@openshift-bot
Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 9, 2024
@martinkennelly
Contributor Author

/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 9, 2024
@openshift-bot
Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 6, 2024
@openshift-bot
Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 14, 2024
@imiller0

/remove-lifecycle rotten

@openshift-ci openshift-ci bot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Sep 16, 2024
@openshift-bot
Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 15, 2024