Merge pull request #36673 from sftim/20220825_favor_endpointslices

Favor EndpointSlice over Endpoints
pull/37390/head
Kubernetes Prow Robot 2022-10-19 08:45:02 -07:00 committed by GitHub
commit ce7e330c57
12 changed files with 192 additions and 106 deletions

@ -107,7 +107,7 @@ routes appropriately. It requires Get access to Node objects.
### Service controller {#authorization-service-controller}
The service controller listens to Service object Create, Update and Delete events and then configures Endpoints for those Services appropriately.
The service controller listens to Service object Create, Update and Delete events and then configures Endpoints for those Services appropriately (for EndpointSlices, the kube-controller-manager manages these on demand).
To access Services, it requires List and Watch access. To update Services, it requires Patch and Update access.
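
As a rough illustration of the access described above, the following ClusterRole sketches those permissions. The role name `example:service-controller` is hypothetical, and this is not the role that any cloud controller manager ships with; it only mirrors the verbs named in the text.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: example:service-controller  # hypothetical name, for illustration only
rules:
- apiGroups: [""]                   # Services belong to the core API group
  resources: ["services"]
  # list and watch to observe Services; patch and update to modify them
  verbs: ["list", "watch", "patch", "update"]
```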

@ -53,8 +53,8 @@ Some types of these controllers are:
* Node controller: Responsible for noticing and responding when nodes go down.
* Job controller: Watches for Job objects that represent one-off tasks, then creates
Pods to run those tasks to completion.
* Endpoints controller: Populates the Endpoints object (that is, joins Services & Pods).
* Service Account & Token controllers: Create default accounts and API access tokens for new namespaces.
* EndpointSlice controller: Populates EndpointSlice objects (to provide a link between Services and Pods).
* ServiceAccount controller: Creates default ServiceAccounts for new namespaces.
### cloud-controller-manager

@ -91,10 +91,14 @@ my-nginx ClusterIP 10.0.162.149 <none> 80/TCP 21s
```
As mentioned previously, a Service is backed by a group of Pods. These Pods are
exposed through `endpoints`. The Service's selector will be evaluated continuously
and the results will be POSTed to an Endpoints object also named `my-nginx`.
When a Pod dies, it is automatically removed from the endpoints, and new Pods
matching the Service's selector will automatically get added to the endpoints.
exposed through
{{<glossary_tooltip term_id="endpoint-slice" text="EndpointSlices">}}.
The Service's selector will be evaluated continuously and the results will be POSTed
to an EndpointSlice that is connected to the Service using a
{{< glossary_tooltip text="label" term_id="label" >}}.
When a Pod dies, it is automatically removed from the EndpointSlices that contain it
as an endpoint. New Pods that match the Service's selector will automatically get added
to an EndpointSlice for that Service.
Check the endpoints, and note that the IPs are the same as the Pods created in
the first step:
@ -115,11 +119,11 @@ Session Affinity: None
Events: <none>
```
```shell
kubectl get ep my-nginx
kubectl get endpointslices -l kubernetes.io/service-name=my-nginx
```
```
NAME ENDPOINTS AGE
my-nginx 10.244.2.5:80,10.244.3.4:80 1m
NAME ADDRESSTYPE PORTS ENDPOINTS AGE
my-nginx-7vzhx IPv4 80 10.244.2.5,10.244.3.4 21s
```
You should now be able to curl the nginx Service on `<CLUSTER-IP>:<PORT>` from

@ -186,8 +186,8 @@ the same namespace, the Pod will see its own FQDN as
A or AAAA record at that name, pointing to the Pod's IP. Both Pods "`busybox1`" and
"`busybox2`" can have their distinct A or AAAA records.
The Endpoints object can specify the `hostname` for any endpoint addresses,
along with its IP.
An {{<glossary_tooltip term_id="endpoint-slice" text="EndpointSlice">}} can specify
the DNS hostname for any endpoint addresses, along with its IP.
{{< note >}}
Because A or AAAA records are not created for Pod names, `hostname` is required for the Pod's A or AAAA

@ -23,24 +23,7 @@ Endpoints.
<!-- body -->
## Motivation
The Endpoints API has provided a simple and straightforward way of
tracking network endpoints in Kubernetes. Unfortunately as Kubernetes clusters
and {{< glossary_tooltip text="Services" term_id="service" >}} have grown to handle and
send more traffic to more backend Pods, limitations of that original API became
more visible.
Most notably, those included challenges with scaling to larger numbers of
network endpoints.
Since all network endpoints for a Service were stored in a single Endpoints
resource, those resources could get quite large. That affected the performance
of Kubernetes components (notably the master control plane) and resulted in
significant amounts of network traffic and processing when Endpoints changed.
EndpointSlices help you mitigate those issues as well as provide an extensible
platform for additional features such as topological routing.
## EndpointSlice resources {#endpointslice-resource}
## EndpointSlice API {#endpointslice-resource}
In Kubernetes, an EndpointSlice contains references to a set of network
endpoints. The control plane automatically creates EndpointSlices
@ -52,7 +35,7 @@ Service name.
The name of an EndpointSlice object must be a valid
[DNS subdomain name](/docs/concepts/overview/working-with-objects/names#dns-subdomain-names).
As an example, here's a sample EndpointSlice resource for the `example`
As an example, here's a sample EndpointSlice object that's owned by the `example`
Kubernetes Service.
```yaml
@ -85,8 +68,7 @@ flag, up to a maximum of 1000.
EndpointSlices can act as the source of truth for
{{< glossary_tooltip term_id="kube-proxy" text="kube-proxy" >}} when it comes to
how to route internal traffic. When enabled, they should provide a performance
improvement for services with large numbers of endpoints.
how to route internal traffic.
### Address types
@ -96,6 +78,10 @@ EndpointSlices support three address types:
* IPv6
* FQDN (Fully Qualified Domain Name)
Each `EndpointSlice` object represents a specific IP address type. If you have
a Service that is available via IPv4 and IPv6, there will be at least two
`EndpointSlice` objects (one for IPv4, and one for IPv6).
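
To make the one-address-type-per-slice rule concrete, here is a minimal sketch of the two EndpointSlices you might see for a dual-stack Service named `example`. The object names, port, and addresses are assumptions chosen for illustration; the control plane normally generates these objects (and their names) for you.

```yaml
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: example-ipv4            # hypothetical name
  labels:
    kubernetes.io/service-name: example
addressType: IPv4               # this slice holds IPv4 addresses only
ports:
  - name: http
    protocol: TCP
    port: 80
endpoints:
  - addresses:
      - "10.1.2.3"
---
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: example-ipv6            # hypothetical name
  labels:
    kubernetes.io/service-name: example
addressType: IPv6               # a separate slice is needed for IPv6 addresses
ports:
  - name: http
    protocol: TCP
    port: 80
endpoints:
  - addresses:
      - "2001:db8::1234:5678"
```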
### Conditions
The EndpointSlice API stores conditions about endpoints that may be useful for consumers.
@ -245,11 +231,45 @@ getting replaced.
Due to the nature of EndpointSlice changes, endpoints may be represented in more
than one EndpointSlice at the same time. This naturally occurs as changes to
different EndpointSlice objects can arrive at the Kubernetes client watch/cache
at different times. Implementations using EndpointSlice must be able to have the
endpoint appear in more than one slice. A reference implementation of how to
perform endpoint deduplication can be found in the `EndpointSliceCache`
implementation in `kube-proxy`.
different EndpointSlice objects can arrive at the Kubernetes client watch / cache
at different times.
{{< note >}}
Clients of the EndpointSlice API must be able to handle the situation where
a particular endpoint address appears in more than one slice.
You can find a reference implementation for how to perform this endpoint deduplication
as part of the `EndpointSliceCache` code within `kube-proxy`.
{{< /note >}}
## Comparison with Endpoints {#motivation}
The original Endpoints API provided a simple and straightforward way of
tracking network endpoints in Kubernetes. As Kubernetes clusters
and {{< glossary_tooltip text="Services" term_id="service" >}} grew to handle
more traffic and to send more traffic to more backend Pods, the
limitations of that original API became more visible.
Most notably, those included challenges with scaling to larger numbers of
network endpoints.
Since all network endpoints for a Service were stored in a single Endpoints
object, those Endpoints objects could get quite large. For Services that stayed
stable (the same set of endpoints over a long period of time) the impact was
less noticeable; even then, some use cases of Kubernetes weren't well served.
When a Service had a lot of backend endpoints and the workload was either
scaling frequently, or rolling out new changes frequently, each update to
the single Endpoints object for that Service meant a lot of traffic between
Kubernetes cluster components (within the control plane, and also between
nodes and the API server). This extra traffic also had a cost in terms of
CPU use.
With EndpointSlices, adding or removing a single Pod triggers the same _number_
of updates to clients that are watching for changes, but the size of those
update messages is much smaller at large scale.
EndpointSlices also enabled innovation around new features such as dual-stack
networking and topology-aware routing.
## {{% heading "whatsnext" %}}

@ -63,7 +63,8 @@ The Service abstraction enables this decoupling.
If you're able to use Kubernetes APIs for service discovery in your application,
you can query the {{< glossary_tooltip text="API server" term_id="kube-apiserver" >}}
for Endpoints, that get updated whenever the set of Pods in a Service changes.
for matching EndpointSlices. Kubernetes updates the EndpointSlices for a Service
whenever the set of Pods in a Service changes.
For non-native applications, Kubernetes offers ways to place a network port or load
balancer in between your application and the backend Pods.
@ -161,8 +162,12 @@ Each port definition can have the same `protocol`, or a different one.
### Services without selectors
Services most commonly abstract access to Kubernetes Pods thanks to the selector,
but when used with a corresponding Endpoints object and without a selector, the Service can abstract other kinds of backends,
including ones that run outside the cluster. For example:
but when used with a corresponding set of
{{<glossary_tooltip term_id="endpoint-slice" text="EndpointSlices">}}
objects and without a selector, the Service can abstract other kinds of backends,
including ones that run outside the cluster.
For example:
* You want to have an external database cluster in production, but in your
test environment you use your own databases.
@ -186,73 +191,119 @@ spec:
targetPort: 9376
```
Because this Service has no selector, the corresponding Endpoints object is not
created automatically. You can manually map the Service to the network address and port
where it's running, by adding an Endpoints object manually:
Because this Service has no selector, the corresponding EndpointSlice (and
legacy Endpoints) objects are not created automatically. You can map the Service
to the network address and port where it's running by adding an EndpointSlice
object manually. For example:
```yaml
apiVersion: v1
kind: Endpoints
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
# the name here should match the name of the Service
name: my-service
subsets:
name: my-service-1 # by convention, use the name of the Service
# as a prefix for the name of the EndpointSlice
labels:
# You should set the "kubernetes.io/service-name" label.
# Set its value to match the name of the Service
kubernetes.io/service-name: my-service
addressType: IPv4
ports:
- name: '' # empty because port 9376 is not assigned as a well-known
# port (by IANA)
appProtocol: http
protocol: TCP
port: 9376
endpoints:
- addresses:
- ip: 192.0.2.42
ports:
- port: 9376
- "10.4.5.6" # the IP addresses in this list can appear in any order
- "10.1.2.3"
```
The name of the Endpoints object must be a valid
[DNS subdomain name](/docs/concepts/overview/working-with-objects/names#dns-subdomain-names).
#### Custom EndpointSlices
When you create an [Endpoints](/docs/reference/kubernetes-api/service-resources/endpoints-v1/)
object for a Service, you set the name of the new object to be the same as that
of the Service.
When you create an [EndpointSlice](#endpointslices) object for a Service, you can
use any name for the EndpointSlice. Each EndpointSlice in a namespace must have a
unique name. You link an EndpointSlice to a Service by setting the
`kubernetes.io/service-name` {{< glossary_tooltip text="label" term_id="label" >}}
on that EndpointSlice.
{{< note >}}
The endpoint IPs _must not_ be: loopback (127.0.0.0/8 for IPv4, ::1/128 for IPv6), or
link-local (169.254.0.0/16 and 224.0.0.0/24 for IPv4, fe80::/64 for IPv6).
Endpoint IP addresses cannot be the cluster IPs of other Kubernetes Services,
The endpoint IP addresses cannot be the cluster IPs of other Kubernetes Services,
because {{< glossary_tooltip term_id="kube-proxy" >}} doesn't support virtual IPs
as a destination.
{{< /note >}}
Accessing a Service without a selector works the same as if it had a selector.
In the example above, traffic is routed to the single endpoint defined in
the YAML: `192.0.2.42:9376` (TCP).
For an EndpointSlice that you create yourself, or in your own code,
you should also pick a value to use for the [`endpointslice.kubernetes.io/managed-by`](/docs/reference/labels-annotations-taints/#endpointslicekubernetesiomanaged-by) label.
If you create your own controller code to manage EndpointSlices, consider using a
value similar to `"my-domain.example/name-of-controller"`. If you are using a third
party tool, use the name of the tool in all-lowercase and change spaces and other
punctuation to dashes (`-`).
If people are directly using a tool such as `kubectl` to manage EndpointSlices,
use a name that describes this manual management, such as `"staff"` or
`"cluster-admins"`. You should
avoid using the reserved value `"controller"`, which identifies EndpointSlices
managed by Kubernetes' own control plane.
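
As a sketch of the labelling advice above, a manually managed EndpointSlice for `my-service` could carry both labels as follows. The value `cluster-admins` is one of the illustrative names suggested in the text, and the name, port, and addresses are placeholders rather than required values.

```yaml
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: my-service-1
  labels:
    # links this EndpointSlice to the Service named "my-service"
    kubernetes.io/service-name: my-service
    # records who or what manages this object (illustrative value)
    endpointslice.kubernetes.io/managed-by: cluster-admins
addressType: IPv4
ports:
  - name: ""
    protocol: TCP
    port: 9376
endpoints:
  - addresses:
      - "10.4.5.6"
      - "10.1.2.3"
```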
{{< note >}}
The Kubernetes API server does not allow proxying to endpoints that are not mapped to
pods. Actions such as `kubectl proxy <service-name>` where the service has no
selector will fail due to this constraint. This prevents the Kubernetes API server
from being used as a proxy to endpoints the caller may not be authorized to access.
{{< /note >}}
#### Accessing a Service without a selector {#service-no-selector-access}
Accessing a Service without a selector works the same as if it had a selector.
In the [example](#services-without-selectors) for a Service without a selector, traffic is routed to one of the two endpoints defined in
the EndpointSlice manifest: a TCP connection to 10.1.2.3 or 10.4.5.6, on port 9376.
An ExternalName Service is a special case of Service that does not have
selectors and uses DNS names instead. For more information, see the
[ExternalName](#externalname) section later in this document.
### Over Capacity Endpoints
If an Endpoints resource has more than 1000 endpoints then a Kubernetes v1.22 (or later)
cluster annotates that Endpoints with `endpoints.kubernetes.io/over-capacity: truncated`.
This annotation indicates that the affected Endpoints object is over capacity and that
the endpoints controller has truncated the number of endpoints to 1000.
### EndpointSlices
{{< feature-state for_k8s_version="v1.21" state="stable" >}}
EndpointSlices are an API resource that can provide a more scalable alternative
to Endpoints. Although conceptually quite similar to Endpoints, EndpointSlices
allow for distributing network endpoints across multiple resources. By default,
an EndpointSlice is considered "full" once it reaches 100 endpoints, at which
point additional EndpointSlices will be created to store any additional
endpoints.
[EndpointSlices](/docs/concepts/services-networking/endpoint-slices/) are objects that
represent a subset (a _slice_) of the backing network endpoints for a Service.
EndpointSlices provide additional attributes and functionality which is
described in detail in [EndpointSlices](/docs/concepts/services-networking/endpoint-slices/).
Your Kubernetes cluster tracks how many endpoints each EndpointSlice represents.
If there are so many endpoints for a Service that a threshold is reached, then
Kubernetes adds another empty EndpointSlice and stores new endpoint information
there.
By default, Kubernetes makes a new EndpointSlice once the existing EndpointSlices
all contain at least 100 endpoints. Kubernetes does not make the new EndpointSlice
until an extra endpoint needs to be added.
See [EndpointSlices](/docs/concepts/services-networking/endpoint-slices/) for more
information about this API.
### Endpoints
In the Kubernetes API, an
[Endpoints](/docs/reference/kubernetes-api/service-resources/endpoints-v1/)
(the resource kind is plural) defines a list of network endpoints, typically
referenced by a Service to define which Pods the traffic can be sent to.
The EndpointSlice API is the recommended replacement for Endpoints.
#### Over-capacity endpoints
Kubernetes limits the number of endpoints that can fit in a single Endpoints
object. When there are over 1000 backing endpoints for a Service, Kubernetes
truncates the data in the Endpoints object. Because a Service can be linked
with more than one EndpointSlice, the 1000 backing endpoint limit only
affects the legacy Endpoints API.
In that case, Kubernetes selects at most 1000 possible backend endpoints to store
into the Endpoints object, and sets an
{{< glossary_tooltip text="annotation" term_id="annotation" >}} on the
Endpoints:
[`endpoints.kubernetes.io/over-capacity: truncated`](/docs/reference/labels-annotations-taints/#endpoints-kubernetes-io-over-capacity).
The control plane also removes that annotation if the number of backend Pods drops below 1000.
Traffic is still sent to backends, but any load balancing mechanism that relies on the
legacy Endpoints API only sends traffic to at most 1000 of the available backing endpoints.
The same API limit means that you cannot manually update an Endpoints to have more than 1000 endpoints.
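
For orientation, a truncated Endpoints object would look roughly like the fragment below. This is an illustrative sketch only: the control plane sets the annotation, not you, and the name, address, and port are placeholders.

```yaml
apiVersion: v1
kind: Endpoints
metadata:
  name: my-service
  annotations:
    # set by the control plane when the Service has more than 1000 backing endpoints
    endpoints.kubernetes.io/over-capacity: truncated
subsets:
  - addresses:
      - ip: 10.1.2.3
      # further addresses would follow, at most 1000 in total for the object
    ports:
      - port: 9376
```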
### Application protocol
@ -573,19 +624,22 @@ selectors defined:
### With selectors
For headless Services that define selectors, the endpoints controller creates
`Endpoints` records in the API, and modifies the DNS configuration to return
A records (IP addresses) that point directly to the `Pods` backing the `Service`.
For headless Services that define selectors, the Kubernetes control plane creates
EndpointSlice objects in the Kubernetes API, and modifies the DNS configuration to return
A or AAAA records (IPv4 or IPv6 addresses) that point directly to the Pods backing
the Service.
### Without selectors
For headless Services that do not define selectors, the endpoints controller does
not create `Endpoints` records. However, the DNS system looks for and configures
For headless Services that do not define selectors, the control plane does
not create EndpointSlice objects. However, the DNS system looks for and configures
either:
* CNAME records for [`ExternalName`](#externalname)-type Services.
* A records for any `Endpoints` that share a name with the Service, for all
other types.
* DNS CNAME records for [`type: ExternalName`](#externalname) Services.
* DNS A / AAAA records for all IP addresses of the Service's ready endpoints,
for all Service types other than `ExternalName`.
* For IPv4 endpoints, the DNS system creates A records.
* For IPv6 endpoints, the DNS system creates AAAA records.
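
A minimal sketch of the selector-less case, assuming a headless Service named `my-headless-service` (a hypothetical name) and one manually created EndpointSlice. With manifests like these, the cluster DNS would be expected to answer queries for the Service name with an A record for the ready IPv4 endpoint.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-headless-service     # hypothetical name
spec:
  clusterIP: None               # "None" makes the Service headless
  ports:
    - protocol: TCP
      port: 9376
---
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: my-headless-service-1   # hypothetical name
  labels:
    kubernetes.io/service-name: my-headless-service
addressType: IPv4               # IPv4 endpoints are published as A records
ports:
  - name: ""
    protocol: TCP
    port: 9376
endpoints:
  - addresses:
      - "10.1.2.3"
    conditions:
      ready: true               # only ready endpoints are published in DNS
```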
## Publishing Services (ServiceTypes) {#publishing-services-service-types}

@ -17,7 +17,7 @@ description: >-
_Topology Aware Hints_ enable topology aware routing by including suggestions
for how clients should consume endpoints. This approach adds metadata to enable
consumers of EndpointSlice and / or Endpoints objects, so that traffic to
consumers of EndpointSlice (or Endpoints) objects, so that traffic to
those network endpoints can be routed closer to where it originated.
For example, you can route traffic within a locality to reduce

@ -461,7 +461,7 @@ An example flow:
order. If the order of shutdowns matters, consider using a `preStop` hook to synchronize.
{{< /note >}}
1. At the same time as the kubelet is starting graceful shutdown, the control plane removes that
shutting-down Pod from Endpoints (and, if enabled, EndpointSlice) objects where these represent
shutting-down Pod from EndpointSlice (and Endpoints) objects where these represent
a {{< glossary_tooltip term_id="service" text="Service" >}} with a configured
{{< glossary_tooltip text="selector" term_id="selector" >}}.
{{< glossary_tooltip text="ReplicaSets" term_id="replica-set" >}} and other workload resources

@ -347,7 +347,7 @@ metadata:
# the rules below will be added to the "monitoring" ClusterRole.
rules:
- apiGroups: [""]
resources: ["services", "endpoints", "pods"]
resources: ["services", "endpointslices", "pods"]
verbs: ["get", "list", "watch"]
```
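
One detail worth checking when adapting the rule above: EndpointSlice is served from the `discovery.k8s.io` API group rather than the core (`""`) group, and RBAC matches a resource only within the API groups that a rule lists. A sketch of rules that keep the same verbs but name each group explicitly might look like this:

```yaml
rules:
- apiGroups: [""]                   # core API group: Services and Pods
  resources: ["services", "pods"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["discovery.k8s.io"]   # API group that serves EndpointSlice
  resources: ["endpointslices"]
  verbs: ["get", "list", "watch"]
```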
@ -702,9 +702,9 @@ When used in a <b>RoleBinding</b>, it gives full control over every resource in
If used in a <b>RoleBinding</b>, allows read/write access to most resources in a namespace,
including the ability to create roles and role bindings within the namespace.
This role does not allow write access to resource quota or to the namespace itself.
This role also does not allow write access to Endpoints in clusters created
This role also does not allow write access to EndpointSlices (or Endpoints) in clusters created
using Kubernetes v1.22+. More information is available in the
["Write Access for Endpoints" section](#write-access-for-endpoints).</td>
["Write Access for EndpointSlices and Endpoints" section](#write-access-for-endpoints).</td>
</tr>
<tr>
<td><b>edit</b></td>
@ -714,9 +714,9 @@ using Kubernetes v1.22+. More information is available in the
This role does not allow viewing or modifying roles or role bindings.
However, this role allows accessing Secrets and running Pods as any ServiceAccount in
the namespace, so it can be used to gain the API access levels of any ServiceAccount in
the namespace. This role also does not allow write access to Endpoints in
the namespace. This role also does not allow write access to EndpointSlices (or Endpoints) in
clusters created using Kubernetes v1.22+. More information is available in the
["Write Access for Endpoints" section](#write-access-for-endpoints).</td>
["Write Access for EndpointSlices and Endpoints" section](#write-access-for-endpoints).</td>
</tr>
<tr>
<td><b>view</b></td>
@ -1205,12 +1205,12 @@ In order from most secure to least secure, the approaches are:
--group=system:serviceaccounts
```
## Write access for Endpoints
## Write access for EndpointSlices and Endpoints {#write-access-for-endpoints}
Kubernetes clusters created before Kubernetes v1.22 include write access to
Endpoints in the aggregated "edit" and "admin" roles. As a mitigation for
[CVE-2021-25740](https://github.com/kubernetes/kubernetes/issues/103675), this
access is not part of the aggregated roles in clusters that you create using
EndpointSlices (and Endpoints) in the aggregated "edit" and "admin" roles.
As a mitigation for [CVE-2021-25740](https://github.com/kubernetes/kubernetes/issues/103675),
this access is not part of the aggregated roles in clusters that you create using
Kubernetes v1.22 or later.
Existing clusters that have been upgraded to Kubernetes v1.22 will not be

@ -375,11 +375,16 @@ The control plane adds this label to an Endpoints object when the owning Service
### kubernetes.io/service-name {#kubernetesioservice-name}
Example: `kubernetes.io/service-name: "nginx"`
Example: `kubernetes.io/service-name: "my-website"`
Used on: Service
Used on: EndpointSlice
Kubernetes uses this label to differentiate multiple Services. Used currently for `ELB`(Elastic Load Balancer) only.
Kubernetes associates [EndpointSlices](/docs/concepts/services-networking/endpoint-slices/) with
[Services](/docs/concepts/services-networking/service/) using this label.
This label records the {{< glossary_tooltip term_id="name" text="name">}} of the
Service that the EndpointSlice is backing. All EndpointSlices should have this label set to
the name of their associated Service.
### kubernetes.io/service-account.name
@ -490,7 +495,9 @@ Example: `endpoints.kubernetes.io/over-capacity:truncated`
Used on: Endpoints
In Kubernetes clusters v1.22 (or later), the Endpoints controller adds this annotation to an Endpoints resource if it has more than 1000 endpoints. The annotation indicates that the Endpoints resource is over capacity and the number of endpoints has been truncated to 1000.
The {{< glossary_tooltip text="control plane" term_id="control-plane" >}} adds this annotation to an [Endpoints](/docs/concepts/services-networking/service/#endpoints) object if the associated {{< glossary_tooltip term_id="service" >}} has more than 1000 backing endpoints. The annotation indicates that the Endpoints object is over capacity and the number of endpoints has been truncated to 1000.
If the number of backend endpoints falls below 1000, the control plane removes this annotation.
### batch.kubernetes.io/job-tracking

@ -144,8 +144,7 @@ In those cases, the `kube-dns` ConfigMap can be updated.
## Setting memory limits
The `node-local-dns` Pods use memory for storing cache entries and processing queries.
Since they do not watch Kubernetes objects, the cluster size or the number of Services/Endpoints
do not directly affect memory usage. Memory usage is influenced by the DNS query pattern.
Since they do not watch Kubernetes objects, the cluster size or the number of Services / EndpointSlices do not directly affect memory usage. Memory usage is influenced by the DNS query pattern.
From [CoreDNS docs](https://github.com/coredns/deployment/blob/master/kubernetes/Scaling_CoreDNS.md),
> The default cache size is 10000 entries, which uses about 30 MB when completely filled.

@ -9,6 +9,8 @@ metadata:
or Ingress implementations to expose backend IPs that would not otherwise
be accessible, and can circumvent network policies or security controls
intended to prevent/isolate access to those backends.
EndpointSlices were never included in the edit or admin roles, so there
is nothing to restore for the EndpointSlice API.
labels:
rbac.authorization.k8s.io/aggregate-to-edit: "true"
name: custom:aggregate-to-edit:endpoints # you can change this if you wish