Migrate reference details out of Service concept

Migrate away:
- details of virtual IP mechanism for Services
- detailed information about protocols for Services (UDP, TCP, SCTP)

Co-authored-by: Antonio Ojea <aojea@redhat.com>
Co-authored-by: Qiming Teng <tengqm@outlook.com>

Branch: pull/36675/head
Parent: 8662b026e9
Commit: 1d68a02353
@@ -145,14 +145,16 @@ spec:
    targetPort: http-web-svc
```

This works even if there is a mixture of Pods in the Service using a single
configured name, with the same network protocol available via different
port numbers. This offers a lot of flexibility for deploying and evolving
your Services. For example, you can change the port numbers that Pods expose
in the next version of your backend software, without breaking clients.

The default protocol for Services is TCP; you can also use any other
[supported protocol](#protocol-support).
The default protocol for Services is
[TCP](/docs/reference/networking/service-protocols/#protocol-tcp); you can also
use any other [supported protocol](/docs/reference/networking/service-protocols/).

As many Services need to expose more than one port, Kubernetes supports multiple
port definitions on a Service object.
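
For illustration, here is a minimal sketch of a Service that uses a named target
port and more than one port definition. The Service name, the selector label, the
port names, and the port numbers are placeholders for this sketch, not values taken
from this page:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service                     # hypothetical name
spec:
  selector:
    app.kubernetes.io/name: proxy      # assumed Pod label
  ports:
    # Traffic to port 80 of the Service is forwarded to whichever container
    # port the selected Pods expose under the name "http-web-svc".
    - name: http
      protocol: TCP
      port: 80
      targetPort: http-web-svc
    # A second port definition on the same Service object.
    - name: metrics
      protocol: TCP
      port: 9090
      targetPort: 9090
```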

@@ -316,150 +318,6 @@ This field follows standard Kubernetes label syntax. Values should either be
[IANA standard service names](https://www.iana.org/assignments/service-names) or
domain prefixed names such as `mycompany.com/my-custom-protocol`.

## Virtual IPs and service proxies

Every node in a Kubernetes cluster runs a `kube-proxy`. `kube-proxy` is
responsible for implementing a form of virtual IP for `Services` of type other
than [`ExternalName`](#externalname).

### Why not use round-robin DNS?

A question that pops up every now and then is why Kubernetes relies on
proxying to forward inbound traffic to backends. What about other
approaches? For example, would it be possible to configure DNS records that
have multiple A values (or AAAA for IPv6), and rely on round-robin name
resolution?

There are a few reasons for using proxying for Services:

* There is a long history of DNS implementations not respecting record TTLs,
  and caching the results of name lookups after they should have expired.
* Some apps do DNS lookups only once and cache the results indefinitely.
* Even if apps and libraries did proper re-resolution, the low or zero TTLs
  on the DNS records could impose a high load on DNS that then becomes
  difficult to manage.

Later in this page you can read about how various kube-proxy implementations work. Overall,
you should note that, when running `kube-proxy`, kernel level rules may be
modified (for example, iptables rules might get created), which won't get cleaned up,
in some cases until you reboot. Thus, running kube-proxy is something that should
only be done by an administrator who understands the consequences of having a
low level, privileged network proxying service on a computer. Although the `kube-proxy`
executable supports a `cleanup` function, this function is not an official feature and
thus is only available to use as-is.

### Configuration

Note that the kube-proxy starts up in different modes, which are determined by its configuration.
- The kube-proxy's configuration is done via a ConfigMap, and the ConfigMap for kube-proxy
  effectively deprecates the behavior for almost all of the flags for the kube-proxy.
- The ConfigMap for the kube-proxy does not support live reloading of configuration.
- The ConfigMap parameters for the kube-proxy cannot all be validated and verified on startup.
  For example, if your operating system doesn't allow you to run iptables commands,
  the standard kernel kube-proxy implementation will not work.
  Likewise, if you have an operating system which doesn't support `netsh`,
  it will not run in Windows userspace mode.

### User space proxy mode {#proxy-mode-userspace}

In this (legacy) mode, kube-proxy watches the Kubernetes control plane for the addition and
removal of Service and Endpoint objects. For each Service it opens a
port (randomly chosen) on the local node. Any connections to this "proxy port"
are proxied to one of the Service's backend Pods (as reported via
Endpoints). kube-proxy takes the `SessionAffinity` setting of the Service into
account when deciding which backend Pod to use.

Lastly, the user-space proxy installs iptables rules which capture traffic to
the Service's `clusterIP` (which is virtual) and `port`. The rules
redirect that traffic to the proxy port which proxies the backend Pod.

By default, kube-proxy in userspace mode chooses a backend via a round-robin algorithm.

![Services overview diagram for userspace proxy](/images/docs/services-userspace-overview.svg)

### `iptables` proxy mode {#proxy-mode-iptables}

In this mode, kube-proxy watches the Kubernetes control plane for the addition and
removal of Service and Endpoint objects. For each Service, it installs
iptables rules, which capture traffic to the Service's `clusterIP` and `port`,
and redirect that traffic to one of the Service's
backend sets. For each Endpoint object, it installs iptables rules which
select a backend Pod.

By default, kube-proxy in iptables mode chooses a backend at random.

Using iptables to handle traffic has a lower system overhead, because traffic
is handled by Linux netfilter without the need to switch between userspace and the
kernel space. This approach is also likely to be more reliable.

If kube-proxy is running in iptables mode and the first Pod that's selected
does not respond, the connection fails. This is different from userspace
mode: in that scenario, kube-proxy would detect that the connection to the first
Pod had failed and would automatically retry with a different backend Pod.

You can use Pod [readiness probes](/docs/concepts/workloads/pods/pod-lifecycle/#container-probes)
to verify that backend Pods are working OK, so that kube-proxy in iptables mode
only sees backends that test out as healthy. Doing this means you avoid
having traffic sent via kube-proxy to a Pod that's known to have failed.

![Services overview diagram for iptables proxy](/images/docs/services-iptables-overview.svg)

### IPVS proxy mode {#proxy-mode-ipvs}

{{< feature-state for_k8s_version="v1.11" state="stable" >}}

In `ipvs` mode, kube-proxy watches Kubernetes Services and Endpoints,
calls the `netlink` interface to create IPVS rules accordingly and synchronizes
IPVS rules with Kubernetes Services and Endpoints periodically.
This control loop ensures that IPVS status matches the desired
state.
When accessing a Service, IPVS directs traffic to one of the backend Pods.

The IPVS proxy mode is based on a netfilter hook function that is similar to
iptables mode, but uses a hash table as the underlying data structure and works
in the kernel space.
That means kube-proxy in IPVS mode redirects traffic with lower latency than
kube-proxy in iptables mode, with much better performance when synchronizing
proxy rules. Compared to the other proxy modes, IPVS mode also supports a
higher throughput of network traffic.

IPVS provides more options for balancing traffic to backend Pods;
these are:

* `rr`: round-robin
* `lc`: least connection (smallest number of open connections)
* `dh`: destination hashing
* `sh`: source hashing
* `sed`: shortest expected delay
* `nq`: never queue

{{< note >}}
To run kube-proxy in IPVS mode, you must make IPVS available on
the node before starting kube-proxy.

When kube-proxy starts in IPVS proxy mode, it verifies whether IPVS
kernel modules are available. If the IPVS kernel modules are not detected, then kube-proxy
falls back to running in iptables proxy mode.
{{< /note >}}

![Services overview diagram for IPVS proxy](/images/docs/services-ipvs-overview.svg)

In these proxy models, the traffic bound for the Service's IP:Port is
proxied to an appropriate backend without the clients knowing anything
about Kubernetes or Services or Pods.

If you want to make sure that connections from a particular client
are passed to the same Pod each time, you can select the session affinity based
on the client's IP addresses by setting `service.spec.sessionAffinity` to "ClientIP"
(the default is "None").
You can also set the maximum session sticky time by setting
`service.spec.sessionAffinityConfig.clientIP.timeoutSeconds` appropriately
(the default value is 10800, which works out to be 3 hours).

{{< note >}}
On Windows, setting the maximum session sticky time for Services is not supported.
{{< /note >}}

## Multi-Port Services

For some Services, you need to expose more than one port.

@@ -507,40 +365,6 @@ The IP address that you choose must be a valid IPv4 or IPv6 address from within
If you try to create a Service with an invalid clusterIP address value, the API
server will return a 422 HTTP status code to indicate that there's a problem.

## Traffic policies

### External traffic policy

You can set the `spec.externalTrafficPolicy` field to control how traffic from external sources is routed.
Valid values are `Cluster` and `Local`. Set the field to `Cluster` to route external traffic to all ready endpoints
and `Local` to only route to ready node-local endpoints. If the traffic policy is `Local` and there are no node-local
endpoints, the kube-proxy does not forward any traffic for the relevant Service.

{{< note >}}
{{< feature-state for_k8s_version="v1.22" state="alpha" >}}
If you enable the `ProxyTerminatingEndpoints`
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
for the kube-proxy, the kube-proxy checks if the node
has local endpoints and whether or not all the local endpoints are marked as terminating.
If there are local endpoints and **all** of those are terminating, then the kube-proxy ignores
any external traffic policy of `Local`. Instead, whilst the node-local endpoints remain as all
terminating, the kube-proxy forwards traffic for that Service to healthy endpoints elsewhere,
as if the external traffic policy were set to `Cluster`.
This forwarding behavior for terminating endpoints exists to allow external load balancers to
gracefully drain connections that are backed by `NodePort` Services, even when the health check
node port starts to fail. Otherwise, traffic can be lost between the time a node is still in the node pool of a load
balancer and traffic is being dropped during the termination period of a pod.
{{< /note >}}

### Internal traffic policy

{{< feature-state for_k8s_version="v1.22" state="beta" >}}

You can set the `spec.internalTrafficPolicy` field to control how traffic from internal sources is routed.
Valid values are `Cluster` and `Local`. Set the field to `Cluster` to route internal traffic to all ready endpoints
and `Local` to only route to ready node-local endpoints. If the traffic policy is `Local` and there are no node-local
endpoints, traffic is dropped by kube-proxy.

## Discovering services

Kubernetes supports 2 primary modes of finding a Service - environment

@@ -666,6 +490,12 @@ Kubernetes `ServiceTypes` allow you to specify what kind of Service you want.
to use the `ExternalName` type.
{{< /note >}}

The `type` field was designed as nested functionality - each level adds to the
previous. This is not strictly required on all cloud providers (for example: Google
Compute Engine does not need to allocate a node port to make `type: LoadBalancer` work,
but another cloud provider integration might do). Although strict nesting is not required,
the Kubernetes API design for Service requires it anyway.

You can also use [Ingress](/docs/concepts/services-networking/ingress/) to expose your Service.
Ingress is not a Service type, but it acts as the entry point for your cluster.
It lets you consolidate your routing rules into a single resource as it can expose multiple

@@ -793,6 +623,7 @@ _As an alpha feature_, you can configure a load balanced Service to
[omit](#load-balancer-nodeport-allocation) assigning a node port, provided that the
cloud provider implementation supports this.
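
As an illustration only (whether this works depends on your Kubernetes version and
on the cloud provider integration honouring it), a load balanced Service that asks
the control plane not to allocate node ports might look like this sketch; the name,
label, and ports are placeholders:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-lb-service                  # hypothetical name
spec:
  type: LoadBalancer
  # Ask Kubernetes not to allocate node ports for this Service; this only has
  # the intended effect if the cloud provider routes traffic directly to Pods.
  allocateLoadBalancerNodePorts: false
  selector:
    app.kubernetes.io/name: example    # assumed Pod label
  ports:
    - port: 80
      protocol: TCP
      targetPort: 8080
```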

{{< note >}}

On **Azure**, if you want to use a user-specified public type `loadBalancerIP`, you first need

@@ -1352,211 +1183,26 @@ spec:
- 80.11.12.10
```

## Shortcomings
## Session stickiness

Using the userspace proxy for VIPs works at small to medium scale, but will
not scale to very large clusters with thousands of Services. The
[original design proposal for portals](https://github.com/kubernetes/kubernetes/issues/1107)
has more details on this.

Using the userspace proxy obscures the source IP address of a packet accessing
a Service.
This makes some kinds of network filtering (firewalling) impossible. The iptables
proxy mode does not
obscure in-cluster source IPs, but it does still impact clients coming through
a load balancer or node-port.

The `Type` field is designed as nested functionality - each level adds to the
previous. This is not strictly required on all cloud providers (e.g. Google Compute Engine does
not need to allocate a `NodePort` to make `LoadBalancer` work, but AWS does)
but the Kubernetes API design for Service requires it anyway.

## Virtual IP implementation {#the-gory-details-of-virtual-ips}

The previous information should be sufficient for many people who want to
use Services. However, there is a lot going on behind the scenes that may be
worth understanding.

### Avoiding collisions

One of the primary philosophies of Kubernetes is that you should not be
exposed to situations that could cause your actions to fail through no fault
of your own. For the design of the Service resource, this means not making
you choose your own port number if that choice might collide with
someone else's choice. That is an isolation failure.

In order to allow you to choose a port number for your Services, we must
ensure that no two Services can collide. Kubernetes does that by allocating each
Service its own IP address from within the `service-cluster-ip-range`
CIDR range that is configured for the API server.

To ensure each Service receives a unique IP, an internal allocator atomically
updates a global allocation map in {{< glossary_tooltip term_id="etcd" >}}
prior to creating each Service. The map object must exist in the registry for
Services to get IP address assignments, otherwise creations will
fail with a message indicating an IP address could not be allocated.

In the control plane, a background controller is responsible for creating that
map (needed to support migrating from older versions of Kubernetes that used
in-memory locking). Kubernetes also uses controllers to check for invalid
assignments (e.g. due to administrator intervention) and for cleaning up allocated
IP addresses that are no longer used by any Services.

#### IP address ranges for `type: ClusterIP` Services {#service-ip-static-sub-range}

{{< feature-state for_k8s_version="v1.25" state="beta" >}}

However, there is a problem with this `ClusterIP` allocation strategy, because a user
can also [choose their own address for the service](#choosing-your-own-ip-address).
This could result in a conflict if the internal allocator selects the same IP address
for another Service.

The `ServiceIPStaticSubrange`
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/) is enabled by default in v1.25
and later, using an allocation strategy that divides the `ClusterIP` range into two bands, based on
the size of the configured `service-cluster-ip-range` by using the following formula
`min(max(16, cidrSize / 16), 256)`, described as _never less than 16 or more than 256,
with a graduated step function between them_. Dynamic IP allocations will be preferentially
chosen from the upper band, reducing risks of conflicts with the IPs
assigned from the lower band.
This allows users to use the lower band of the `service-cluster-ip-range` for their
Services with static IPs assigned with a very low risk of running into conflicts.

### Service IP addresses {#ips-and-vips}

Unlike Pod IP addresses, which actually route to a fixed destination,
Service IPs are not actually answered by a single host. Instead, kube-proxy
uses iptables (packet processing logic in Linux) to define _virtual_ IP addresses
which are transparently redirected as needed. When clients connect to the
VIP, their traffic is automatically transported to an appropriate endpoint.
The environment variables and DNS for Services are actually populated in
terms of the Service's virtual IP address (and port).

kube-proxy supports three proxy modes—userspace, iptables and IPVS—which
each operate slightly differently.

#### Userspace

As an example, consider the image processing application described above.
When the backend Service is created, the Kubernetes master assigns a virtual
IP address, for example 10.0.0.1. Assuming the Service port is 1234, the
Service is observed by all of the kube-proxy instances in the cluster.
When a proxy sees a new Service, it opens a new random port, establishes an
iptables redirect from the virtual IP address to this new port, and starts accepting
connections on it.

When a client connects to the Service's virtual IP address, the iptables
rule kicks in, and redirects the packets to the proxy's own port.
The "Service proxy" chooses a backend, and starts proxying traffic from the client to the backend.

This means that Service owners can choose any port they want without risk of
collision. Clients can connect to an IP and port, without being aware
of which Pods they are actually accessing.

#### iptables

Again, consider the image processing application described above.
When the backend Service is created, the Kubernetes control plane assigns a virtual
IP address, for example 10.0.0.1. Assuming the Service port is 1234, the
Service is observed by all of the kube-proxy instances in the cluster.
When a proxy sees a new Service, it installs a series of iptables rules which
redirect from the virtual IP address to per-Service rules. The per-Service
rules link to per-Endpoint rules which redirect traffic (using destination NAT)
to the backends.

When a client connects to the Service's virtual IP address the iptables rule kicks in.
A backend is chosen (either based on session affinity or randomly) and packets are
redirected to the backend. Unlike the userspace proxy, packets are never
copied to userspace, the kube-proxy does not have to be running for the virtual
IP address to work, and Nodes see traffic arriving from the unaltered client IP
address.

This same basic flow executes when traffic comes in through a node-port or
through a load-balancer, though in those cases the client IP does get altered.

#### IPVS

iptables operations slow down dramatically in a large scale cluster, e.g. 10,000 Services.
IPVS is designed for load balancing and based on in-kernel hash tables.
So you can achieve performance consistency with a large number of Services from IPVS-based kube-proxy.
Meanwhile, IPVS-based kube-proxy has more sophisticated load balancing algorithms
(least conns, locality, weighted, persistence).

If you want to make sure that connections from a particular client are passed to
the same Pod each time, you can configure session affinity based on the client's
IP address. Read [session affinity](/docs/reference/networking/virtual-ips/#session-affinity)
to learn more.

## API Object

Service is a top-level resource in the Kubernetes REST API. You can find more details
about the [Service API object](/docs/reference/generated/kubernetes-api/{{< param "version" >}}/#service-v1-core).

## Supported protocols {#protocol-support}

### TCP

You can use TCP for any kind of Service, and it's the default network protocol.

### UDP

You can use UDP for most Services. For type=LoadBalancer Services, UDP support
depends on the cloud provider offering this facility.

### SCTP

{{< feature-state for_k8s_version="v1.20" state="stable" >}}

When using a network plugin that supports SCTP traffic, you can use SCTP for
most Services. For type=LoadBalancer Services, SCTP support depends on the cloud
provider offering this facility. (Most do not).

#### Warnings {#caveat-sctp-overview}

##### Support for multihomed SCTP associations {#caveat-sctp-multihomed}

{{< warning >}}
The support of multihomed SCTP associations requires that the CNI plugin can support the
assignment of multiple interfaces and IP addresses to a Pod.

NAT for multihomed SCTP associations requires special logic in the corresponding kernel modules.
{{< /warning >}}

##### Windows {#caveat-sctp-windows-os}

{{< note >}}
SCTP is not supported on Windows based nodes.
{{< /note >}}

##### Userspace kube-proxy {#caveat-sctp-kube-proxy-userspace}

{{< warning >}}
The kube-proxy does not support the management of SCTP associations when it is in userspace mode.
{{< /warning >}}

### HTTP

If your cloud provider supports it, you can use a Service in LoadBalancer mode
to set up external HTTP / HTTPS reverse proxying, forwarded to the Endpoints
of the Service.

{{< note >}}
You can also use {{< glossary_tooltip term_id="ingress" >}} in place of Service
to expose HTTP/HTTPS Services.
{{< /note >}}

### PROXY protocol

If your cloud provider supports it,
you can use a Service in LoadBalancer mode to configure a load balancer outside
of Kubernetes itself, that will forward connections prefixed with
[PROXY protocol](https://www.haproxy.org/download/1.8/doc/proxy-protocol.txt).

The load balancer will send an initial series of octets describing the
incoming connection, similar to this example

```
PROXY TCP4 192.0.2.202 10.0.42.7 12345 7\r\n
```

followed by the data from the client.

## {{% heading "whatsnext" %}}

* Follow the [Connecting Applications with Services](/docs/tutorials/services/connect-applications-service/) tutorial
* Read about [Ingress](/docs/concepts/services-networking/ingress/)
* Read about [EndpointSlices](/docs/concepts/services-networking/endpoint-slices/)

For more context:
* Read [Virtual IPs and Service Proxies](/docs/reference/networking/virtual-ips/)
* Read the [API reference](/docs/reference/kubernetes-api/service-resources/service-v1/) for the Service API
* Read the [API reference](/docs/reference/kubernetes-api/service-resources/endpoints-v1/) for the Endpoints API
* Read the [API reference](/docs/reference/kubernetes-api/service-resources/endpoint-slice-v1/) for the EndpointSlice API

@@ -0,0 +1,10 @@
---
title: Networking Reference
content_type: reference
---

<!-- overview -->
This section of the Kubernetes documentation provides reference details
of Kubernetes networking.

<!-- body -->

@@ -0,0 +1,124 @@
---
title: Protocols for Services
content_type: reference
---

<!-- overview -->
If you configure a {{< glossary_tooltip text="Service" term_id="service" >}},
you can select from any network protocol that Kubernetes supports.

Kubernetes supports the following protocols with Services:

- [`SCTP`](#protocol-sctp)
- [`TCP`](#protocol-tcp) _(the default)_
- [`UDP`](#protocol-udp)

When you define a Service, you can also specify the
[application protocol](/docs/concepts/services-networking/service/#application-protocol)
that it uses.

This document details some special cases, all of them typically using TCP
as a transport protocol:

- [HTTP](#protocol-http-special) and [HTTPS](#protocol-http-special)
- [PROXY protocol](#protocol-proxy-special)
- [TLS](#protocol-tls-special) termination at the load balancer

<!-- body -->
## Supported protocols {#protocol-support}

There are 3 valid values for the `protocol` of a port for a Service:

### `SCTP` {#protocol-sctp}

{{< feature-state for_k8s_version="v1.20" state="stable" >}}

When using a network plugin that supports SCTP traffic, you can use SCTP for
most Services. For `type: LoadBalancer` Services, SCTP support depends on the cloud
provider offering this facility. (Most do not).

SCTP is not supported on nodes that run Windows.

#### Support for multihomed SCTP associations {#caveat-sctp-multihomed}

The support of multihomed SCTP associations requires that the CNI plugin can support the assignment of multiple interfaces and IP addresses to a Pod.

NAT for multihomed SCTP associations requires special logic in the corresponding kernel modules.

{{< note >}}
The kube-proxy does not support the management of SCTP associations when it is in userspace mode.
{{< /note >}}

### `TCP` {#protocol-tcp}

You can use TCP for any kind of Service, and it's the default network protocol.

### `UDP` {#protocol-udp}

You can use UDP for most Services. For `type: LoadBalancer` Services,
UDP support depends on the cloud provider offering this facility.
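
As a sketch of how the `protocol` field is set per port (the Service name, selector,
and ports below are placeholders), a single Service can mix protocols across its
port definitions:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-dns-service                 # hypothetical name
spec:
  selector:
    app.kubernetes.io/name: dns        # assumed Pod label
  ports:
    # Each port entry sets its own protocol; TCP is assumed if you omit it.
    - name: dns-udp
      protocol: UDP
      port: 53
      targetPort: 53
    - name: dns-tcp
      protocol: TCP
      port: 53
      targetPort: 53
```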

## Special cases

### HTTP {#protocol-http-special}

If your cloud provider supports it, you can use a Service set to `type: LoadBalancer` as a way
to set up external HTTP / HTTPS reverse proxying, forwarded to the EndpointSlices / Endpoints of that Service.

Typically, you set the protocol to `TCP` and add an
{{< glossary_tooltip text="annotation" term_id="annotation" >}}
(usually specific to your cloud provider) that configures the load balancer
to handle traffic at the HTTP level.
This configuration might also include serving HTTPS (HTTP over TLS) and
reverse-proxying plain HTTP to your workload.

{{< note >}}
You can also use an {{< glossary_tooltip term_id="ingress" >}} to expose
HTTP/HTTPS Services.
{{< /note >}}

You might additionally want to specify that the
[application protocol](/docs/concepts/services-networking/service/#application-protocol)
of the connection is `http` or `https`. Use `http` if the session from the
load balancer to your workload is HTTP without TLS, and use `https` if the
session from the load balancer to your workload uses TLS encryption.
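
As a sketch (the name, label, and port numbers are made up, and the provider-specific
load balancer annotation is deliberately omitted because it varies between providers),
`appProtocol` might be set like this:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-web-service                 # hypothetical name
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: web        # assumed Pod label
  ports:
    - name: https
      protocol: TCP                    # the transport protocol is still TCP
      appProtocol: https               # the session to the workload uses TLS
      port: 443
      targetPort: 8443
    - name: http
      protocol: TCP
      appProtocol: http                # plain HTTP from the load balancer to the workload
      port: 80
      targetPort: 8080
```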

### PROXY protocol {#protocol-proxy-special}

If your cloud provider supports it, you can use a Service set to `type: LoadBalancer`
to configure a load balancer outside of Kubernetes itself, that will forward connections
wrapped with the
[PROXY protocol](https://www.haproxy.org/download/2.5/doc/proxy-protocol.txt).

The load balancer then sends an initial series of octets describing the
incoming connection, similar to this example (PROXY protocol v1):

```
PROXY TCP4 192.0.2.202 10.0.42.7 12345 7\r\n
```

The data after the proxy protocol preamble are the original
data from the client. When either side closes the connection,
the load balancer also triggers a connection close and sends
any remaining data where feasible.

Typically, you define a Service with the protocol set to `TCP`.
You also set an annotation, specific to your
cloud provider, that configures the load balancer to wrap each incoming connection in the PROXY protocol.

### TLS {#protocol-tls-special}

If your cloud provider supports it, you can use a Service set to `type: LoadBalancer` as
a way to set up external reverse proxying, where the connection from client to load
balancer is TLS encrypted and the load balancer is the TLS server peer.
The connection from the load balancer to your workload can also be TLS,
or might be plain text. The exact options available to you depend on your
cloud provider or custom Service implementation.

Typically, you set the protocol to `TCP` and set an annotation
(usually specific to your cloud provider) that configures the load balancer
to act as a TLS server. You would configure the TLS identity (as server,
and possibly also as a client that connects to your workload) using
mechanisms that are specific to your cloud provider.

@@ -0,0 +1,337 @@
---
title: Virtual IPs and Service Proxies
content_type: reference
---

<!-- overview -->
Every {{< glossary_tooltip term_id="node" text="node" >}} in a Kubernetes
cluster runs a [kube-proxy](/docs/reference/command-line-tools-reference/kube-proxy/)
(unless you have deployed your own alternative component in place of `kube-proxy`).

The `kube-proxy` component is responsible for implementing a _virtual IP_
mechanism for {{< glossary_tooltip term_id="service" text="Services">}}
of `type` other than
[`ExternalName`](/docs/concepts/services-networking/service/#externalname).

A question that pops up every now and then is why Kubernetes relies on
proxying to forward inbound traffic to backends. What about other
approaches? For example, would it be possible to configure DNS records that
have multiple A values (or AAAA for IPv6), and rely on round-robin name
resolution?

There are a few reasons for using proxying for Services:

* There is a long history of DNS implementations not respecting record TTLs,
  and caching the results of name lookups after they should have expired.
* Some apps do DNS lookups only once and cache the results indefinitely.
* Even if apps and libraries did proper re-resolution, the low or zero TTLs
  on the DNS records could impose a high load on DNS that then becomes
  difficult to manage.

Later in this page you can read about how various kube-proxy implementations work.
Overall, you should note that, when running `kube-proxy`, kernel level rules may be modified
(for example, iptables rules might get created), which won't get cleaned up, in some
cases until you reboot. Thus, running kube-proxy is something that should only be done
by an administrator who understands the consequences of having a low level, privileged
network proxying service on a computer. Although the `kube-proxy` executable supports a
`cleanup` function, this function is not an official feature and thus is only available
to use as-is.

<a id="example"></a>
Some of the details in this reference refer to an example: the back end Pods for a stateless
image-processing workload, running with three replicas. Those replicas are
fungible—frontends do not care which backend they use. While the actual Pods that
compose the backend set may change, the frontend clients should not need to be aware of that,
nor should they need to keep track of the set of backends themselves.

<!-- body -->

## Proxy modes

Note that the kube-proxy starts up in different modes, which are determined by its
configuration; a minimal configuration sketch follows the list below.

- The kube-proxy's configuration is done via a ConfigMap, and the ConfigMap for
  kube-proxy effectively deprecates the behavior for almost all of the flags for
  the kube-proxy.
- The ConfigMap for the kube-proxy does not support live reloading of configuration.
- The ConfigMap parameters for the kube-proxy cannot all be validated and verified on startup.
  For example, if your operating system doesn't allow you to run iptables commands,
  the standard kernel kube-proxy implementation will not work.
  Likewise, if you have an operating system which doesn't support `netsh`,
  it will not run in Windows userspace mode.
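
For illustration, a minimal sketch of a kube-proxy configuration as it might appear
in the data of that ConfigMap; the `mode` and `scheduler` values here are example
choices, not recommendations:

```yaml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
# Select the proxy mode; on Linux this is typically "iptables" or "ipvs".
mode: "ipvs"
ipvs:
  # Only consulted when mode is "ipvs"; "rr" (round-robin) is one of the
  # schedulers listed later in this page.
  scheduler: "rr"
```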

### User space proxy mode {#proxy-mode-userspace}

{{< feature-state for_k8s_version="v1.23" state="deprecated" >}}

This (legacy) mode uses iptables to install interception rules, and then performs
traffic forwarding with the assistance of the kube-proxy tool.
The kube-proxy watches the Kubernetes control plane for the addition, modification
and removal of Service and Endpoints objects. For each Service, the kube-proxy
opens a port (randomly chosen) on the local node. Any connections to this _proxy port_
are proxied to one of the Service's backend Pods (as reported via
Endpoints). The kube-proxy takes the `sessionAffinity` setting of the Service into
account when deciding which backend Pod to use.

The user-space proxy installs iptables rules which capture traffic to the
Service's `clusterIP` (which is virtual) and `port`. Those rules redirect that traffic
to the proxy port which proxies the backend Pod.

By default, kube-proxy in userspace mode chooses a backend via a round-robin algorithm.

{{< figure src="/images/docs/services-userspace-overview.svg" title="Services overview diagram for userspace proxy" class="diagram-medium" >}}

#### Example {#packet-processing-userspace}

As an example, consider the image processing application described [earlier](#example)
in the page.
When the backend Service is created, the Kubernetes control plane assigns a virtual
IP address, for example 10.0.0.1. Assuming the Service port is 1234, the
Service is observed by all of the kube-proxy instances in the cluster.
When a proxy sees a new Service, it opens a new random port, establishes an
iptables redirect from the virtual IP address to this new port, and starts accepting
connections on it.

When a client connects to the Service's virtual IP address, the iptables
rule kicks in, and redirects the packets to the proxy's own port.
The "Service proxy" chooses a backend, and starts proxying traffic from the client to the backend.

This means that Service owners can choose any port they want without risk of
collision. Clients can connect to an IP and port, without being aware
of which Pods they are actually accessing.

#### Scaling challenges {#scaling-challenges-userspace}

Using the userspace proxy for VIPs works at small to medium scale, but will
not scale to very large clusters with thousands of Services. The
[original design proposal for portals](https://github.com/kubernetes/kubernetes/issues/1107)
has more details on this.

Using the userspace proxy obscures the source IP address of a packet accessing
a Service.
This makes some kinds of network filtering (firewalling) impossible. The iptables
proxy mode does not
obscure in-cluster source IPs, but it does still impact clients coming through
a load balancer or node-port.

### `iptables` proxy mode {#proxy-mode-iptables}

In this mode, kube-proxy watches the Kubernetes control plane for the addition and
removal of Service and Endpoints objects. For each Service, it installs
iptables rules, which capture traffic to the Service's `clusterIP` and `port`,
and redirect that traffic to one of the Service's
backend sets. For each endpoint, it installs iptables rules which
select a backend Pod.

By default, kube-proxy in iptables mode chooses a backend at random.

Using iptables to handle traffic has a lower system overhead, because traffic
is handled by Linux netfilter without the need to switch between userspace and the
kernel space. This approach is also likely to be more reliable.

If kube-proxy is running in iptables mode and the first Pod that's selected
does not respond, the connection fails. This is different from userspace
mode: in that scenario, kube-proxy would detect that the connection to the first
Pod had failed and would automatically retry with a different backend Pod.

You can use Pod [readiness probes](/docs/concepts/workloads/pods/pod-lifecycle/#container-probes)
to verify that backend Pods are working OK, so that kube-proxy in iptables mode
only sees backends that test out as healthy. Doing this means you avoid
having traffic sent via kube-proxy to a Pod that's known to have failed.
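
For example, a minimal sketch of such a readiness probe on a backend Pod; the Pod
name, label, image, path, and port are assumptions for this sketch:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: backend-pod                                # hypothetical name
  labels:
    app.kubernetes.io/name: image-processor       # assumed label matched by the Service selector
spec:
  containers:
    - name: worker
      image: registry.example/image-processor:1.0 # placeholder image
      ports:
        - containerPort: 8080
      readinessProbe:
        # While this probe fails, the Pod is not "ready" and is left out of
        # the set of backends that kube-proxy programs for the Service.
        httpGet:
          path: /healthz                           # assumed health endpoint
          port: 8080
        periodSeconds: 10
        failureThreshold: 3
```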
{{< figure src="/images/docs/services-iptables-overview.svg" title="Services overview diagram for iptables proxy" class="diagram-medium" >}}
|
||||
|
||||
#### Example {#packet-processing-iptables}
|
||||
|
||||
Again, consider the image processing application described [earlier](#example).
|
||||
When the backend Service is created, the Kubernetes control plane assigns a virtual
|
||||
IP address, for example 10.0.0.1. For this example, assume that the
|
||||
Service port is 1234.
|
||||
All of the kube-proxy instances in the cluster observe the creation of the new
|
||||
Service.
|
||||
|
||||
When kube-proxy on a node sees a new Service, it installs a series of iptables rules
|
||||
which redirect from the virtual IP address to more iptables rules, defined per Service.
|
||||
The per-Service rules link to further rules for each backend endpoint, and the per-
|
||||
endpoint rules redirect traffic (using destination NAT) to the backends.
|
||||
|
||||
When a client connects to the Service's virtual IP address the iptables rule kicks in.
|
||||
A backend is chosen (either based on session affinity or randomly) and packets are
|
||||
redirected to the backend. Unlike the userspace proxy, packets are never
|
||||
copied to userspace, the kube-proxy does not have to be running for the virtual
|
||||
IP address to work, and Nodes see traffic arriving from the unaltered client IP
|
||||
address.
|
||||
|
||||
This same basic flow executes when traffic comes in through a node-port or
|
||||
through a load-balancer, though in those cases the client IP address does get altered.
|
||||
|
||||
### IPVS proxy mode {#proxy-mode-ipvs}
|
||||
|
||||
In `ipvs` mode, kube-proxy watches Kubernetes Services and Endpoints,
|
||||
calls `netlink` interface to create IPVS rules accordingly and synchronizes
|
||||
IPVS rules with Kubernetes Services and Endpoints periodically.
|
||||
This control loop ensures that IPVS status matches the desired
|
||||
state.
|
||||
When accessing a Service, IPVS directs traffic to one of the backend Pods.
|
||||
|
||||
The IPVS proxy mode is based on netfilter hook function that is similar to
|
||||
iptables mode, but uses a hash table as the underlying data structure and works
|
||||
in the kernel space.
|
||||
That means kube-proxy in IPVS mode redirects traffic with lower latency than
|
||||
kube-proxy in iptables mode, with much better performance when synchronizing
|
||||
proxy rules. Compared to the other proxy modes, IPVS mode also supports a
|
||||
higher throughput of network traffic.
|
||||
|
||||
IPVS provides more options for balancing traffic to backend Pods;
|
||||
these are:
|
||||
|
||||
* `rr`: round-robin
|
||||
* `lc`: least connection (smallest number of open connections)
|
||||
* `dh`: destination hashing
|
||||
* `sh`: source hashing
|
||||
* `sed`: shortest expected delay
|
||||
* `nq`: never queue
|
||||
|
||||
{{< note >}}
|
||||
To run kube-proxy in IPVS mode, you must make IPVS available on
|
||||
the node before starting kube-proxy.
|
||||
|
||||
When kube-proxy starts in IPVS proxy mode, it verifies whether IPVS
|
||||
kernel modules are available. If the IPVS kernel modules are not detected, then kube-proxy
|
||||
falls back to running in iptables proxy mode.
|
||||
{{< /note >}}
|
||||
|
||||
{{< figure src="/images/docs/services-ipvs-overview.svg" title="Services overview diagram for IPVS proxy" class="diagram-medium" >}}
|
||||
|
||||

## Session affinity

In these proxy models, the traffic bound for the Service's IP:Port is
proxied to an appropriate backend without the clients knowing anything
about Kubernetes or Services or Pods.

If you want to make sure that connections from a particular client
are passed to the same Pod each time, you can select the session affinity based
on the client's IP addresses by setting `.spec.sessionAffinity` to `ClientIP`
for a Service (the default is `None`).

### Session stickiness timeout

You can also set the maximum session sticky time by setting
`.spec.sessionAffinityConfig.clientIP.timeoutSeconds` appropriately for a Service
(the default value is 10800, which works out to be 3 hours).
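
Putting the two fields together, a minimal sketch (the Service name, selector, and
ports are placeholders) might look like:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-sticky-service              # hypothetical name
spec:
  selector:
    app.kubernetes.io/name: example    # assumed Pod label
  ports:
    - port: 80
      protocol: TCP
      targetPort: 8080
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      # Keep routing a given client IP to the same Pod for up to one hour.
      timeoutSeconds: 3600
```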

{{< note >}}
On Windows, setting the maximum session sticky time for Services is not supported.
{{< /note >}}

## IP address assignment to Services

Unlike Pod IP addresses, which actually route to a fixed destination,
Service IPs are not actually answered by a single host. Instead, kube-proxy
uses packet processing logic (such as Linux iptables) to define _virtual_ IP
addresses which are transparently redirected as needed.

When clients connect to the VIP, their traffic is automatically transported to an
appropriate endpoint. The environment variables and DNS for Services are actually
populated in terms of the Service's virtual IP address (and port).

### Avoiding collisions

One of the primary philosophies of Kubernetes is that you should not be
exposed to situations that could cause your actions to fail through no fault
of your own. For the design of the Service resource, this means not making
you choose your own port number if that choice might collide with
someone else's choice. That is an isolation failure.

In order to allow you to choose a port number for your Services, we must
ensure that no two Services can collide. Kubernetes does that by allocating each
Service its own IP address from within the `service-cluster-ip-range`
CIDR range that is configured for the API server.

To ensure each Service receives a unique IP, an internal allocator atomically
updates a global allocation map in {{< glossary_tooltip term_id="etcd" >}}
prior to creating each Service. The map object must exist in the registry for
Services to get IP address assignments, otherwise creations will
fail with a message indicating an IP address could not be allocated.

In the control plane, a background controller is responsible for creating that
map (needed to support migrating from older versions of Kubernetes that used
in-memory locking). Kubernetes also uses controllers to check for invalid
assignments (e.g. due to administrator intervention) and for cleaning up allocated
IP addresses that are no longer used by any Services.

#### IP address ranges for Service virtual IP addresses {#service-ip-static-sub-range}

{{< feature-state for_k8s_version="v1.25" state="beta" >}}

Kubernetes divides the `ClusterIP` range into two bands, based on
the size of the configured `service-cluster-ip-range` by using the following formula
`min(max(16, cidrSize / 16), 256)`. That formula paraphrases as _never less than 16 or
more than 256, with a graduated step function between them_.

Kubernetes prefers to allocate dynamic IP addresses to Services by choosing from the upper band,
which means that if you want to assign a specific IP address to a `type: ClusterIP`
Service, you should manually assign an IP address from the **lower** band. That approach
reduces the risk of a conflict over allocation.

If you disable the `ServiceIPStaticSubrange`
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/) then Kubernetes
uses a single shared pool for both manually and dynamically assigned IP addresses,
that are used for `type: ClusterIP` Services.
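
As an illustration only: assuming a `service-cluster-ip-range` of `10.96.0.0/16`,
the formula above gives a lower band of 256 addresses, so a manually chosen address
such as the one in this sketch falls inside that band (the Service name, selector,
ports, and the specific address are placeholders; pick a free address from your own
cluster's range):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-static-ip-service           # hypothetical name
spec:
  # Manually chosen cluster IP; it must be an unused address inside the
  # service-cluster-ip-range configured for the API server.
  clusterIP: 10.96.0.50
  selector:
    app.kubernetes.io/name: example    # assumed Pod label
  ports:
    - port: 443
      protocol: TCP
      targetPort: 8443
```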

## Traffic policies

You can set the `.spec.internalTrafficPolicy` and `.spec.externalTrafficPolicy` fields
to control how Kubernetes routes traffic to healthy (“ready”) backends.

### External traffic policy

You can set the `.spec.externalTrafficPolicy` field to control how traffic from
external sources is routed. Valid values are `Cluster` and `Local`. Set the field
to `Cluster` to route external traffic to all ready endpoints and `Local` to only
route to ready node-local endpoints. If the traffic policy is `Local` and there
are no node-local endpoints, the kube-proxy does not forward any traffic for the
relevant Service.
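
For illustration, a sketch of a Service that only routes external traffic to
node-local endpoints (the name, label, and ports are placeholders):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-external-service            # hypothetical name
spec:
  type: LoadBalancer
  # Route external traffic only to ready endpoints on the node that received it.
  # This preserves the client source IP, but nodes without ready local endpoints
  # do not serve traffic for this Service.
  externalTrafficPolicy: Local
  selector:
    app.kubernetes.io/name: example    # assumed Pod label
  ports:
    - port: 80
      protocol: TCP
      targetPort: 8080
```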

{{< note >}}
{{< feature-state for_k8s_version="v1.22" state="alpha" >}}

If you enable the `ProxyTerminatingEndpoints`
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
for the kube-proxy, the kube-proxy checks if the node
has local endpoints and whether or not all the local endpoints are marked as terminating.
If there are local endpoints and **all** of those are terminating, then the kube-proxy ignores
any external traffic policy of `Local`. Instead, whilst the node-local endpoints remain as all
terminating, the kube-proxy forwards traffic for that Service to healthy endpoints elsewhere,
as if the external traffic policy were set to `Cluster`.

This forwarding behavior for terminating endpoints exists to allow external load balancers to
gracefully drain connections that are backed by `NodePort` Services, even when the health check
node port starts to fail. Otherwise, traffic can be lost between the time a node is
still in the node pool of a load balancer and traffic is being dropped during the
termination period of a pod.
{{< /note >}}

### Internal traffic policy

{{< feature-state for_k8s_version="v1.22" state="beta" >}}

You can set the `.spec.internalTrafficPolicy` field to control how traffic from
internal sources is routed. Valid values are `Cluster` and `Local`. Set the field to
`Cluster` to route internal traffic to all ready endpoints and `Local` to only route
to ready node-local endpoints. If the traffic policy is `Local` and there are no
node-local endpoints, traffic is dropped by kube-proxy.

## {{% heading "whatsnext" %}}

To learn more about Services,
read [Connecting Applications with Services](/docs/concepts/services-networking/connect-applications-service/).

You can also:

* Read about [Services](/docs/concepts/services-networking/service/)
* Read the [API reference](/docs/reference/kubernetes-api/service-resources/service-v1/) for the Service API