Match the ipv6 address configuration method on the right with its definition on the left.
Concepts and resources behind networking in Kubernetes. Show
The Kubernetes network modelEvery Kubernetes imposes the following fundamental requirements on any networking implementation (barring any intentional network segmentation policies):
Note: For those platforms that support This model is not only less complex overall, but it is principally compatible with the desire for Kubernetes to enable low-friction porting of apps from VMs to containers. If your job previously ran in a VM, your VM had an IP and could talk to other VMs in your project. This is the same basic model. Kubernetes IP addresses exist at the How this is implemented is a detail of the particular container runtime in use. It is possible to request ports on the Kubernetes networking addresses four concerns:
Cluster Networking explains how to set up networking for your cluster, and also provides an overview of the technologies involved. 1 - ServiceExpose an application running in your cluster behind a single outward-facing endpoint, even when the workload is split across multiple backends. An abstract way to expose an application running on a set of Pods as a network service. With Kubernetes you don't need to modify your application to use an unfamiliar service discovery mechanism. Kubernetes gives Pods their own IP addresses and a single DNS name for a set of Pods, and can load-balance across them. MotivationKubernetes Pods are created and destroyed to match the desired state of your cluster. Pods are nonpermanent resources. If you use a Deployment to run your app, it can create and destroy Pods dynamically. Each Pod gets its own IP address, however in a Deployment, the set of Pods running in one moment in time could be different from the set of Pods running that application a moment later. This leads to a problem: if some set of Pods (call them "backends") provides functionality to other Pods (call them "frontends") inside your cluster, how do the frontends find out and keep track of which IP address to connect to, so that the frontend can use the backend part of the workload? Enter Services. Service resourcesIn Kubernetes, a Service is an abstraction which defines a logical set of Pods and a policy by which to access them (sometimes this pattern is called a micro-service). The set of Pods targeted by a Service is usually determined by a selector. To learn about other ways to define Service endpoints, see Services without selectors. For example, consider a stateless image-processing backend which is running with 3 replicas. Those replicas are fungible—frontends do not care which backend they use. While the actual Pods that compose the backend set may change, the frontend clients should not need to be aware of that, nor should they need to keep track of the set of backends themselves. The Service abstraction enables this decoupling. Cloud-native service discoveryIf you're able to use Kubernetes APIs for service discovery in your application, you can query the API server for matching EndpointSlices. Kubernetes updates the EndpointSlices for a Service whenever the set of Pods in a Service changes. For non-native applications, Kubernetes offers ways to place a network port or load balancer in between your application and the backend Pods. Defining a ServiceA Service in Kubernetes is a REST object, similar to a Pod. Like all of the REST objects, you can For example, suppose you have a set of Pods where each listens on TCP port 9376 and contains a label
This specification creates a new Service object named "my-service", which targets TCP port 9376 on any Pod with the Kubernetes assigns this Service an IP address (sometimes called the "cluster IP"), which is used by the Service proxies (see Virtual IPs and service proxies below). The controller for the Service selector continuously scans for Pods that match its selector, and then POSTs any updates to an Endpoint object also named "my-service". Port definitions in Pods have names, and you can
reference these names in the
This works even if there is a mixture of Pods in the Service using a single configured name, with the same network protocol available via different port numbers. This offers a lot of flexibility for deploying and evolving your Services. For example, you can change the port numbers that Pods expose in the next version of your backend software, without breaking clients. The default protocol for Services is TCP; you can also use any other supported protocol. As many Services need to expose more than one port, Kubernetes supports multiple port definitions on a Service object. Each port definition can have the same Services without selectorsServices most commonly abstract access to Kubernetes Pods thanks to the selector, but when used with a corresponding set of EndpointSlices objects and without a selector, the Service can abstract other kinds of backends, including ones that run outside the cluster. For example:
In any of these scenarios you can define a Service without a Pod selector. For example:
Because this Service has no selector, the corresponding EndpointSlice (and legacy Endpoints) objects are not created automatically. You can manually map the Service to the network address and port where it's running, by adding an EndpointSlice object manually. For example:
Custom EndpointSlicesWhen you create an
EndpointSlice object for a Service, you can use any name for the EndpointSlice. Each EndpointSlice in a namespace must have a unique name. You link an EndpointSlice to a Service by setting the For an EndpointSlice that you create yourself, or in your own code, you should also pick a value to use for the
Accessing a Service without a selectorAccessing a Service without a selector works the same as if it had a selector. In the example for a Service without a selector, traffic is routed to one of the two endpoints defined in the EndpointSlice manifest: a TCP connection to 10.1.2.3 or 10.4.5.6, on port 9376. An ExternalName Service is a special case of Service that does not have selectors and uses DNS names instead. For more information, see the ExternalName section later in this document. EndpointSlicesFEATURE
STATE: EndpointSlices are objects that represent a subset (a slice) of the backing network endpoints for a Service. Your Kubernetes cluster tracks how many endpoints each EndpointSlice represents. If there are so many endpoints for a Service that a threshold is reached, then Kubernetes adds another empty EndpointSlice and stores new endpoint information there. By default, Kubernetes makes a new EndpointSlice once the existing EndpointSlices all contain at least 100 endpoints. Kubernetes does not make the new EndpointSlice until an extra endpoint needs to be added. See EndpointSlices for more information about this API. EndpointsIn the Kubernetes API, an Endpoints (the resource kind is plural) defines a list of network endpoints, typically referenced by a Service to define which Pods the traffic can be sent to. The EndpointSlice API is the recommended replacement for Endpoints. Over-capacity endpointsKubernetes limits the number of endpoints that can fit in a single Endpoints object. When there are over 1000 backing endpoints for a Service, Kubernetes truncates the data in the Endpoints object. Because a Service can be linked with more than one EndpointSlice, the 1000 backing endpoint limit only affects the legacy Endpoints API. In that case, Kubernetes selects at most 1000 possible backend endpoints to store into the Endpoints object, and sets an
annotation on the Endpoints: Traffic is still sent to backends, but any load balancing mechanism that relies on the legacy Endpoints API only sends traffic to at most 1000 of the available backing endpoints. The same API limit means that you cannot manually update an Endpoints to have more than 1000 endpoints. Application protocolFEATURE STATE: The This field follows standard Kubernetes label syntax. Values should either be IANA standard service names or domain prefixed names such as Virtual IPs and service proxiesEvery node in a Kubernetes cluster runs a Why not use round-robin DNS?A question that pops up every now and then is why Kubernetes relies on proxying to forward inbound traffic to backends. What about other approaches? For example, would it be possible to configure DNS records that have multiple A values (or AAAA for IPv6), and rely on round-robin name resolution? There are a few reasons for using proxying for Services:
Later in this page you can read about how various kube-proxy implementations work. Overall, you should note that, when running ConfigurationNote that the kube-proxy starts up in different modes, which are determined by its configuration.
User space proxy modeIn this (legacy) mode, kube-proxy watches the Kubernetes control plane for the addition and removal of Service and Endpoint objects. For each Service it opens a port (randomly chosen) on the local node. Any connections to this "proxy port" are proxied to one of the Service's backend Pods (as reported via Endpoints). kube-proxy takes the Lastly, the user-space proxy installs iptables rules
which capture traffic to the Service's By default, kube-proxy in userspace mode chooses a backend via a round-robin algorithm. iptables proxy modeIn this mode, kube-proxy watches the Kubernetes control plane for the addition and removal of Service and Endpoint objects. For each Service, it installs iptables rules, which capture traffic to the Service's By default, kube-proxy in iptables mode chooses a backend at random. Using iptables to handle traffic has a lower system overhead, because traffic is handled by Linux netfilter without the need to switch between userspace and the kernel space. This approach is also likely to be more reliable. If kube-proxy is running in iptables mode and the first Pod that's selected does not respond, the connection fails. This is different from userspace mode: in that scenario, kube-proxy would detect that the connection to the first Pod had failed and would automatically retry with a different backend Pod. You can use Pod readiness probes to verify that backend Pods are working OK, so that kube-proxy in iptables mode only sees backends that test out as healthy. Doing this means you avoid having traffic sent via kube-proxy to a Pod that's known to have failed. IPVS proxy modeFEATURE STATE: In The IPVS proxy mode is based on netfilter hook function that is similar to iptables mode, but uses a hash table as the underlying data structure and works in the kernel space. That means kube-proxy in IPVS mode redirects traffic with lower latency than kube-proxy in iptables mode, with much better performance when synchronising proxy rules. Compared to the other proxy modes, IPVS mode also supports a higher throughput of network traffic. IPVS provides more options for balancing traffic to backend Pods; these are:
In these proxy models, the traffic bound for the Service's IP:Port is proxied to an appropriate backend without the clients knowing anything about Kubernetes or Services or Pods. If you want to make sure that connections from a particular client are passed to the same Pod each time, you can select the session
affinity based on the client's IP addresses by setting Multi-Port ServicesFor some Services, you need to expose more than one port. Kubernetes lets you configure multiple port definitions on a Service object. When using multiple ports for a Service, you must give all of your ports names so that these are unambiguous. For example:
Choosing your own IP addressYou can specify your own cluster IP address as part of a The IP address that you choose must be a
valid IPv4 or IPv6 address from within the Traffic policiesExternal traffic policyYou can set the Internal traffic policyFEATURE STATE: You can set the Discovering servicesKubernetes supports 2 primary modes of finding a Service - environment variables and DNS. Environment variablesWhen a Pod is run on a
Node, the kubelet adds a set of environment variables for each active Service. It adds For example, the Service
DNSYou can (and almost always should) set up a DNS service for your Kubernetes cluster using an add-on. A cluster-aware DNS server, such as CoreDNS, watches the Kubernetes API for new Services and creates a set of DNS records for each one. If DNS has been enabled throughout your cluster then all Pods should automatically be able to resolve Services by their DNS name. For example, if you have a Service called Pods in other namespaces must qualify the name as Kubernetes also supports DNS SRV (Service) records for named ports. If the The Kubernetes DNS server is the only way to access Headless ServicesSometimes you don't need load-balancing and a single Service IP. In this case, you can
create what are termed "headless" Services, by explicitly specifying You can use a headless Service to interface with other service discovery mechanisms, without being tied to Kubernetes' implementation. For headless With selectorsFor headless Services that define selectors, the Kubernetes control plane creates EndpointSlice objects in the Kubernetes API, and modifies the DNS configuration to return A or AAAA records (IPv4 or IPv6 addresses) that point directly to the Pods backing the Service. Without selectorsFor headless Services that do not define selectors, the control plane does not create EndpointSlice objects. However, the DNS system looks for and configures either:
Publishing Services (ServiceTypes)For some parts of your application (for example, frontends) you may want to expose a Service onto an external IP address, that's outside of your cluster. Kubernetes
You can also use Ingress to expose your Service. Ingress is not a Service type, but it acts as the entry point for your cluster. It lets you consolidate your routing rules into a single resource as it can expose multiple services under the same IP address. Type NodePortIf you set the Using a NodePort gives you the freedom to set up your own load balancing solution, to configure environments that are not fully supported by Kubernetes, or even to expose one or more nodes' IP addresses directly. For a node port Service, Kubernetes additionally allocates a
port (TCP, UDP or SCTP to match the protocol of the Service). Every node in the cluster configures itself to listen on that assigned port and to forward traffic to one of the ready endpoints associated with that Service. You'll be able to contact the Choosing your own portIf you
want a specific port number, you can specify a value in the Here is an example manifest for a Service of
Custom IP address configuration for |
Rule | Protocol | Port(s) | IpRange(s) | IpRange Description |
---|---|---|---|---|
Health Check | TCP | NodePort(s) (.spec.healthCheckNodePort for .spec.externalTrafficPolicy = Local )
| Subnet CIDR | kubernetes.io/rule/nlb/health= |
Client Traffic | TCP | NodePort(s) | .spec.loadBalancerSourceRanges (defaults to 0.0.0.0/0 )
| kubernetes.io/rule/nlb/client= |
MTU Discovery | ICMP | 3,4 | .spec.loadBalancerSourceRanges (defaults to 0.0.0.0/0 )
| kubernetes.io/rule/nlb/mtu= |
In order to limit which client IP's can access the Network Load Balancer, specify loadBalancerSourceRanges
.
spec:
loadBalancerSourceRanges:
- "143.231.0.0/16"
Further documentation on annotations for Elastic IPs and other common use-cases may be found in the AWS Load Balancer Controller documentation.
Other CLB annotations on Tencent Kubernetes Engine (TKE)
There are other annotations for managing Cloud Load Balancers on TKE as shown below.
metadata:
name: my-service
annotations:
# Bind Loadbalancers with specified nodes
service.kubernetes.io/qcloud-loadbalancer-backends-label: key in (value1, value2)
# ID of an existing load balancer
service.kubernetes.io/tke-existed-lbid:lb-6swtxxxx
# Custom parameters for the load balancer (LB), does not support modification of LB type yet
service.kubernetes.io/service.extensiveParameters: ""
# Custom parameters for the LB listener
service.kubernetes.io/service.listenerParameters: ""
# Specifies the type of Load balancer;
# valid values: classic (Classic Cloud Load Balancer) or application (Application Cloud Load Balancer)
service.kubernetes.io/loadbalance-type: xxxxx
# Specifies the public network bandwidth billing method;
# valid values: TRAFFIC_POSTPAID_BY_HOUR(bill-by-traffic) and BANDWIDTH_POSTPAID_BY_HOUR (bill-by-bandwidth).
service.kubernetes.io/qcloud-loadbalancer-internet-charge-type: xxxxxx
# Specifies the bandwidth value (value range: [1,2000] Mbps).
service.kubernetes.io/qcloud-loadbalancer-internet-max-bandwidth-out: "10"
# When this annotation is set,the loadbalancers will only register nodes
# with pod running on it, otherwise all nodes will be registered.
service.kubernetes.io/local-svc-only-bind-node-with-pod: true
Type ExternalName
Services of type ExternalName map a Service to a DNS name, not to a typical selector such as my-service
or cassandra
. You specify these Services with the spec.externalName
parameter.
This Service definition, for example, maps the my-service
Service in the prod
namespace to my.database.example.com
:
apiVersion: v1
kind: Service
metadata:
name: my-service
namespace: prod
spec:
type: ExternalName
externalName: my.database.example.com
When looking up the host my-service.prod.svc.cluster.local
, the cluster DNS Service returns a CNAME
record with the value my.database.example.com
. Accessing my-service
works in the same way as other Services but with the crucial difference that redirection happens at the DNS level rather than via proxying or forwarding. Should you later decide to move your database into your cluster, you can start its Pods, add appropriate selectors or endpoints, and change the Service's type
.
External IPs
If there are external IPs that route to one or more cluster nodes, Kubernetes Services can be exposed on those externalIPs
. Traffic that ingresses into the cluster with the external IP (as destination IP), on the Service port, will be routed to one of the Service endpoints. externalIPs
are not managed by Kubernetes and are the responsibility of the cluster administrator.
In the Service spec, externalIPs
can be specified along with any of the ServiceTypes
. In the example below, "my-service
" can be
accessed by clients on "80.11.12.10:80
" (externalIP:port
)
apiVersion: v1
kind: Service
metadata:
name: my-service
spec:
selector:
app.kubernetes.io/name: MyApp
ports:
- name: http
protocol: TCP
port: 80
targetPort: 9376
externalIPs:
- 80.11.12.10
Shortcomings
Using the userspace proxy for VIPs works at small to medium scale, but will not scale to very large clusters with thousands of Services. The original design proposal for portals has more details on this.
Using the userspace proxy obscures the source IP address of a packet accessing a Service. This makes some kinds of network filtering (firewalling) impossible. The iptables proxy mode does not obscure in-cluster source IPs, but it does still impact clients coming through a load balancer or node-port.
The Type
field is designed as nested functionality - each level adds to the previous. This is not strictly required on all cloud providers (e.g. Google Compute Engine does not need to allocate a NodePort
to make LoadBalancer
work, but AWS does) but the Kubernetes API design for Service
requires it anyway.
Virtual IP implementation
The previous information should be sufficient for many people who want to use Services. However, there is a lot going on behind the scenes that may be worth understanding.
Avoiding collisions
One of the primary philosophies of Kubernetes is that you should not be exposed to situations that could cause your actions to fail through no fault of your own. For the design of the Service resource, this means not making you choose your own port number if that choice might collide with someone else's choice. That is an isolation failure.
In order to allow you to choose a port number for your Services, we must ensure that no two Services can collide. Kubernetes does that by allocating each Service its own IP address from within the service-cluster-ip-range
CIDR range that is configured for the API server.
To ensure each Service receives a unique IP, an internal allocator atomically updates a global allocation map in etcd prior to creating each Service. The map object must exist in the registry for Services to get IP address assignments, otherwise creations will fail with a message indicating an IP address could not be allocated.
In the control plane, a background controller is responsible for creating that map (needed to support migrating from older versions of Kubernetes that used in-memory locking). Kubernetes also uses controllers to check for invalid assignments (eg due to administrator intervention) and for cleaning up allocated IP addresses that are no longer used by any Services.
IP address ranges for type: ClusterIP
Services
FEATURE STATE: Kubernetes v1.25 [beta]
However, there is a problem with this ClusterIP
allocation strategy, because a user can also choose their own address for the service. This could result in a conflict if the internal allocator selects the same IP address for another Service.
The ServiceIPStaticSubrange
feature gate is enabled by default in v1.25 and later, using an allocation strategy that divides the ClusterIP
range into two bands, based on the size of the configured service-cluster-ip-range
by using the following formula min(max(16, cidrSize / 16), 256)
, described as never less than 16 or more than 256, with a graduated step function between them. Dynamic IP allocations will be preferentially chosen from the upper
band, reducing risks of conflicts with the IPs assigned from the lower band. This allows users to use the lower band of the service-cluster-ip-range
for their Services with static IPs assigned with a very low risk of running into conflicts.
Service IP addresses
Unlike Pod IP addresses, which actually route to a fixed destination, Service IPs are not actually answered by a single host. Instead, kube-proxy uses iptables (packet processing logic in Linux) to define virtual IP addresses which are transparently redirected as needed. When clients connect to the VIP, their traffic is automatically transported to an appropriate endpoint. The environment variables and DNS for Services are actually populated in terms of the Service's virtual IP address (and port).
kube-proxy supports three proxy modes—userspace, iptables and IPVS—which each operate slightly differently.
Userspace
As an example, consider the image processing application described above. When the backend Service is created, the Kubernetes master assigns a virtual IP address, for example 10.0.0.1. Assuming the Service port is 1234, the Service is observed by all of the kube-proxy instances in the cluster. When a proxy sees a new Service, it opens a new random port, establishes an iptables redirect from the virtual IP address to this new port, and starts accepting connections on it.
When a client connects to the Service's virtual IP address, the iptables rule kicks in, and redirects the packets to the proxy's own port. The "Service proxy" chooses a backend, and starts proxying traffic from the client to the backend.
This means that Service owners can choose any port they want without risk of collision. Clients can connect to an IP and port, without being aware of which Pods they are actually accessing.
iptables
Again, consider the image processing application described above. When the backend Service is created, the Kubernetes control plane assigns a virtual IP address, for example 10.0.0.1. Assuming the Service port is 1234, the Service is observed by all of the kube-proxy instances in the cluster. When a proxy sees a new Service, it installs a series of iptables rules which redirect from the virtual IP address to per-Service rules. The per-Service rules link to per-Endpoint rules which redirect traffic (using destination NAT) to the backends.
When a client connects to the Service's virtual IP address the iptables rule kicks in. A backend is chosen (either based on session affinity or randomly) and packets are redirected to the backend. Unlike the userspace proxy, packets are never copied to userspace, the kube-proxy does not have to be running for the virtual IP address to work, and Nodes see traffic arriving from the unaltered client IP address.
This same basic flow executes when traffic comes in through a node-port or through a load-balancer, though in those cases the client IP does get altered.
IPVS
iptables operations slow down dramatically in large scale cluster e.g 10,000 Services. IPVS is designed for load balancing and based on in-kernel hash tables. So you can achieve performance consistency in large number of Services from IPVS-based kube-proxy. Meanwhile, IPVS-based kube-proxy has more sophisticated load balancing algorithms (least conns, locality, weighted, persistence).
API Object
Service is a top-level resource in the Kubernetes REST API. You can find more details about the Service API object.
Supported protocols
TCP
You can use TCP for any kind of Service, and it's the default network protocol.
UDP
You can use UDP for most Services. For type=LoadBalancer Services, UDP support depends on the cloud provider offering this facility.
SCTP
FEATURE STATE: Kubernetes v1.20 [stable]
When using a network plugin that supports SCTP traffic, you can use SCTP for most Services. For type=LoadBalancer Services, SCTP support depends on the cloud provider offering this facility. (Most do not).
Warnings
Support for multihomed SCTP associations
Windows
Userspace kube-proxy
HTTP
If your cloud provider supports it, you can use a Service in LoadBalancer mode to set up external HTTP / HTTPS reverse proxying, forwarded to the Endpoints of the Service.
PROXY protocol
If your cloud provider supports it, you can use a Service in LoadBalancer mode to configure a load balancer outside of Kubernetes itself, that will forward connections prefixed with PROXY protocol.
The load balancer will send an initial series of octets describing the incoming connection, similar to this example
PROXY TCP4 192.0.2.202 10.0.42.7 12345 7\r\n
followed by the data from the client.
What's next
- Read Connecting Applications with Services
- Read about Ingress
- Read about EndpointSlices
2 - Topology-aware traffic routing with topology keys
FEATURE STATE: Kubernetes v1.21 [deprecated]
Service Topology enables a service to route traffic based upon the Node topology of the cluster. For example, a service can specify that traffic be preferentially routed to endpoints that are on the same Node as the client, or in the same availability zone.
Topology-aware traffic routing
By default, traffic sent to a ClusterIP
or NodePort
Service may be routed to any backend address for the Service. Kubernetes 1.7 made it possible to route "external" traffic to the Pods running on the same Node that received the traffic. For
ClusterIP
Services, the equivalent same-node preference for routing wasn't possible; nor could you configure your cluster to favor routing to endpoints within the same zone. By setting topologyKeys
on a Service, you're able to define a policy for routing traffic based upon the Node labels for the originating and destination Nodes.
The label matching between the source and destination lets you, as a cluster operator, designate sets of Nodes that are "closer" and "farther" from one another. You can define labels to represent whatever metric makes sense for your own requirements. In public clouds, for example, you might prefer to keep network traffic within the same zone, because interzonal traffic has a cost associated with it (and intrazonal traffic typically does not). Other common needs include being able to route traffic to a local Pod managed by a DaemonSet, or directing traffic to Nodes connected to the same top-of-rack switch for the lowest latency.
Using Service Topology
If your cluster has the ServiceTopology
feature gate enabled, you can control Service traffic routing by specifying the topologyKeys
field on the Service spec. This field is a preference-order list of Node labels which will be used to sort endpoints when accessing this Service. Traffic will be directed to a Node whose value for the first
label matches the originating Node's value for that label. If there is no backend for the Service on a matching Node, then the second label will be considered, and so forth, until no labels remain.
If no match is found, the traffic will be rejected, as if there were no backends for the Service at all. That is, endpoints are chosen based on the first topology key with available backends. If this field is specified and all entries have no backends that match the topology of the client, the
service has no backends for that client and connections should fail. The special value "*"
may be used to mean "any topology". This catch-all value, if used, only makes sense as the last value in the list.
If topologyKeys
is not specified or empty, no topology constraints will be applied.
Consider a cluster with Nodes that are labeled with their hostname, zone name, and region name. Then you can set the topologyKeys
values of a service to direct traffic as follows.
- Only to
endpoints on the same node, failing if no endpoint exists on the node:
["kubernetes.io/hostname"]
. - Preferentially to endpoints on the same node, falling back to endpoints in the same zone, followed by the same region, and failing otherwise:
["kubernetes.io/hostname", "topology.kubernetes.io/zone", "topology.kubernetes.io/region"]
. This may be useful, for example, in cases where data locality is critical. - Preferentially to the same zone, but fallback on any available endpoint if none are available within this zone:
["topology.kubernetes.io/zone", "*"]
.
Constraints
-
Service topology is not compatible with
externalTrafficPolicy=Local
, and therefore a Service cannot use both of these features. It is possible to use both features in the same cluster on different Services, only not on the same Service. Valid topology keys are currently limited to
kubernetes.io/hostname
,topology.kubernetes.io/zone
, andtopology.kubernetes.io/region
, but will be generalized to other node labels in the future.Topology keys must be valid label keys and at most 16 keys may be specified.
The catch-all value,
"*"
, must be the last value in the topology keys, if it is used.
Examples
The following are common examples of using the Service Topology feature.
Only Node Local Endpoints
A Service that only routes to node local endpoints. If no endpoints exist on the node, traffic is dropped:
apiVersion: v1
kind: Service
metadata:
name: my-service
spec:
selector:
app: my-app
ports:
- protocol: TCP
port: 80
targetPort: 9376
topologyKeys:
- "kubernetes.io/hostname"
Prefer Node Local Endpoints
A Service that prefers node local Endpoints but falls back to cluster wide endpoints if node local endpoints do not exist:
apiVersion: v1
kind: Service
metadata:
name: my-service
spec:
selector:
app: my-app
ports:
- protocol: TCP
port: 80
targetPort: 9376
topologyKeys:
- "kubernetes.io/hostname"
- "*"
Only Zonal or Regional Endpoints
A Service that prefers zonal then regional endpoints. If no endpoints exist in either, traffic is dropped.
apiVersion: v1
kind: Service
metadata:
name: my-service
spec:
selector:
app: my-app
ports:
- protocol: TCP
port: 80
targetPort: 9376
topologyKeys:
- "topology.kubernetes.io/zone"
- "topology.kubernetes.io/region"
Prefer Node Local, Zonal, then Regional Endpoints
A Service that prefers node local, zonal, then regional endpoints but falls back to cluster wide endpoints.
apiVersion: v1
kind: Service
metadata:
name: my-service
spec:
selector:
app: my-app
ports:
- protocol: TCP
port: 80
targetPort: 9376
topologyKeys:
- "kubernetes.io/hostname"
- "topology.kubernetes.io/zone"
- "topology.kubernetes.io/region"
- "*"
What's next
- Read about enabling Service Topology
- Read Connecting Applications with Services
3 - Ingress
Make your HTTP (or HTTPS) network service available using a protocol-aware configuration mechanism, that understands web concepts like URIs, hostnames, paths, and more. The Ingress concept lets you map traffic to different backends based on rules you define via the Kubernetes API.
FEATURE STATE: Kubernetes v1.19 [stable]
An API object that manages external access to the services in a cluster, typically HTTP.
Ingress may provide load balancing, SSL termination and name-based virtual hosting.
Terminology
For clarity, this guide defines the following terms:
- Node: A worker machine in Kubernetes, part of a cluster.
- Cluster: A set of Nodes that run containerized applications managed by Kubernetes. For this example, and in most common Kubernetes deployments, nodes in the cluster are not part of the public internet.
- Edge router: A router that enforces the firewall policy for your cluster. This could be a gateway managed by a cloud provider or a physical piece of hardware.
- Cluster network: A set of links, logical or physical, that facilitate communication within a cluster according to the Kubernetes networking model.
- Service: A Kubernetes Service that identifies a set of Pods using label selectors. Unless mentioned otherwise, Services are assumed to have virtual IPs only routable within the cluster network.
What is Ingress?
Ingress exposes HTTP and HTTPS routes from outside the cluster to services within the cluster. Traffic routing is controlled by rules defined on the Ingress resource.
Here is a simple example where an Ingress sends all its traffic to one Service:
Figure. Ingress
An Ingress may be configured to give Services externally-reachable URLs, load balance traffic, terminate SSL / TLS, and offer name-based virtual hosting. An Ingress controller is responsible for fulfilling the Ingress, usually with a load balancer, though it may also configure your edge router or additional frontends to help handle the traffic.
An Ingress does not expose arbitrary ports or protocols. Exposing services other than HTTP and HTTPS to the internet typically uses a service of type Service.Type=NodePort or Service.Type=LoadBalancer.
Prerequisites
You must have an Ingress controller to satisfy an Ingress. Only creating an Ingress resource has no effect.
You may need to deploy an Ingress controller such as ingress-nginx. You can choose from a number of Ingress controllers.
Ideally, all Ingress controllers should fit the reference specification. In reality, the various Ingress controllers operate slightly differently.
The Ingress resource
A minimal Ingress resource example:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: minimal-ingress
annotations:
nginx.ingress.kubernetes.io/rewrite-target: /
spec:
ingressClassName: nginx-example
rules:
- http:
paths:
- path: /testpath
pathType: Prefix
backend:
service:
name: test
port:
number: 80
An Ingress needs apiVersion
, kind
, metadata
and spec
fields. The name of an Ingress object must be a valid DNS subdomain name. For general information about working with config files, see
deploying applications, configuring containers, managing resources. Ingress frequently uses annotations to configure some options depending on the Ingress
controller, an example of which is the rewrite-target annotation. Different Ingress controllers support different annotations. Review the documentation for your choice of Ingress controller to learn which annotations are supported.
The Ingress spec has all the information needed to configure a load balancer or proxy server. Most importantly, it contains a list of rules matched against all incoming requests. Ingress resource only supports rules for directing HTTP(S) traffic.
If the ingressClassName
is omitted, a
default Ingress class should be defined.
There are some ingress controllers, that work without the definition of a default IngressClass
. For example, the Ingress-NGINX controller can be configured with a flag --watch-ingress-without-class
. It is
recommended though, to specify the default IngressClass
as shown below.
Ingress rules
Each HTTP rule contains the following information:
- An optional host. In this example, no host is specified, so the rule applies to all inbound HTTP traffic through the IP address specified. If a host is provided (for example, foo.bar.com), the rules apply to that host.
- A list of paths (for example,
/testpath
), each of which has an associated backend defined with aservice.name
and aservice.port.name
orservice.port.number
. Both the host and path must match the content of an incoming request before the load balancer directs traffic to the referenced Service. - A backend is a combination of Service and port names as described in the Service doc or a custom resource backend by way of a CRD. HTTP (and HTTPS) requests to the Ingress that matches the host and path of the rule are sent to the listed backend.
A defaultBackend
is often configured in an Ingress controller
to service any requests that do not match a path in the spec.
DefaultBackend
An Ingress with no rules sends all traffic to a single default backend and .spec.defaultBackend
is the backend that should handle requests in that case. The defaultBackend
is conventionally a configuration option of the Ingress controller and is not specified in your Ingress resources. If no
.spec.rules
are specified, .spec.defaultBackend
must be specified. If defaultBackend
is not set, the handling of requests that do not match any of the rules will be up to the ingress controller (consult the documentation for your ingress controller to find out how it handles this case).
If none of the hosts or paths match the HTTP request in the Ingress objects, the traffic is routed to your default backend.
Resource backends
A Resource
backend is an
ObjectRef to another Kubernetes resource within the same namespace as the Ingress object. A Resource
is a mutually exclusive setting with Service, and will fail validation if both are specified. A common usage for a Resource
backend is to ingress data to an object storage backend with static assets.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: ingress-resource-backend
spec:
defaultBackend:
resource:
apiGroup: k8s.example.com
kind: StorageBucket
name: static-assets
rules:
- http:
paths:
- path: /icons
pathType: ImplementationSpecific
backend:
resource:
apiGroup: k8s.example.com
kind: StorageBucket
name: icon-assets
After creating the Ingress above, you can view it with the following command:
kubectl describe ingress ingress-resource-backend
Name: ingress-resource-backend
Namespace: default
Address:
Default backend: APIGroup: k8s.example.com, Kind: StorageBucket, Name: static-assets
Rules:
Host Path Backends
---- ---- --------
*
/icons APIGroup: k8s.example.com, Kind: StorageBucket, Name: icon-assets
Annotations:
Events:
Path types
Each path in an Ingress is required to have a corresponding path type. Paths that do not include an explicit pathType
will fail validation. There are three supported path types:
ImplementationSpecific
: With this path type, matching is up to the IngressClass. Implementations can treat this as a separatepathType
or treat it identically toPrefix
orExact
path types.Exact
: Matches the URL path exactly and with case sensitivity.-
Prefix
: Matches based on a URL path prefix split by/
. Matching is case sensitive and done on a path element by element basis. A path element refers to the list of labels in the path split by the/
separator. A request is a match for path p if every p is an element-wise prefix of p of the request path.
Examples
Kind | Path(s) | Request path(s) | Matches? |
---|---|---|---|
Prefix | /
| (all paths) | Yes |
Exact | /foo
| /foo
| Yes |
Exact | /foo
| /bar
| No |
Exact | /foo
| /foo/
| No |
Exact | /foo/
| /foo
| No |
Prefix | /foo
| /foo , /foo/
| Yes |
Prefix | /foo/
| /foo , /foo/
| Yes |
Prefix | /aaa/bb
| /aaa/bbb
| No |
Prefix | /aaa/bbb
| /aaa/bbb
| Yes |
Prefix | /aaa/bbb/
| /aaa/bbb
| Yes, ignores trailing slash |
Prefix | /aaa/bbb
| /aaa/bbb/
| Yes, matches trailing slash |
Prefix | /aaa/bbb
| /aaa/bbb/ccc
| Yes, matches subpath |
Prefix | /aaa/bbb
| /aaa/bbbxyz
| No, does not match string prefix |
Prefix | / , /aaa
| /aaa/ccc
| Yes, matches /aaa prefix
|
Prefix | / , /aaa , /aaa/bbb
| /aaa/bbb
| Yes, matches /aaa/bbb prefix
|
Prefix | / , /aaa , /aaa/bbb
| /ccc
| Yes, matches / prefix
|
Prefix | /aaa
| /ccc
| No, uses default backend |
Mixed | /foo (Prefix), /foo (Exact)
| /foo
| Yes, prefers Exact |
Multiple matches
In some cases, multiple paths within an Ingress will match a request. In those cases precedence will be given first to the longest matching path. If two paths are still equally matched, precedence will be given to paths with an exact path type over prefix path type.
Hostname wildcards
Hosts can be precise matches (for example “foo.bar.com
”) or a wildcard (for example “*.foo.com
”).
Precise matches require that the HTTP host
header matches the host
field. Wildcard matches require the HTTP host
header is equal to the suffix of the wildcard rule.
Host | Host header | Match? |
---|---|---|
*.foo.com
| bar.foo.com
| Matches based on shared suffix |
*.foo.com
| baz.bar.foo.com
| No match, wildcard only covers a single DNS label |
*.foo.com
| foo.com
| No match, wildcard only covers a single DNS label |
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: ingress-wildcard-host
spec:
rules:
- host: "foo.bar.com"
http:
paths:
- pathType: Prefix
path: "/bar"
backend:
service:
name: service1
port:
number: 80
- host: "*.foo.com"
http:
paths:
- pathType: Prefix
path: "/foo"
backend:
service:
name: service2
port:
number: 80
Ingress class
Ingresses can be implemented by different controllers, often with different configuration. Each Ingress should specify a class, a reference to an IngressClass resource that contains additional configuration including the name of the controller that should implement the class.
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
name: external-lb
spec:
controller: example.com/ingress-controller
parameters:
apiGroup: k8s.example.com
kind: IngressParameters
name: external-lb
The .spec.parameters
field
of an IngressClass lets you reference another resource that provides configuration related to that IngressClass.
The specific type of parameters to use depends on the ingress controller that you specify in the .spec.controller
field of the IngressClass.
IngressClass scope
Depending on your ingress controller, you may be able to use parameters that you set cluster-wide, or just for one namespace.
- Cluster
- Namespaced
The default scope for IngressClass parameters is cluster-wide.
If you set the .spec.parameters
field and don't set .spec.parameters.scope
, or if you set .spec.parameters.scope
to Cluster
, then the IngressClass refers to a cluster-scoped resource. The kind
(in combination the apiGroup
) of the parameters refers to a cluster-scoped API (possibly a custom resource),
and the name
of the parameters identifies a specific cluster scoped resource for that API.
For example:
---
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
name: external-lb-1
spec:
controller: example.com/ingress-controller
parameters:
# The parameters for this IngressClass are specified in a
# ClusterIngressParameter (API group k8s.example.net) named
# "external-config-1". This definition tells Kubernetes to
# look for a cluster-scoped parameter resource.
scope: Cluster
apiGroup: k8s.example.net
kind: ClusterIngressParameter
name: external-config-1
FEATURE STATE: Kubernetes v1.23 [stable]
If you set the .spec.parameters
field and set .spec.parameters.scope
to Namespace
, then the IngressClass refers to a namespaced-scoped resource. You must also set the namespace
field within .spec.parameters
to the namespace that
contains the parameters you want to use.
The kind
(in combination the apiGroup
) of the parameters refers to a namespaced API (for example: ConfigMap), and the name
of the parameters identifies a specific resource in the namespace you specified in namespace
.
Namespace-scoped parameters help the cluster operator delegate control over the configuration (for example: load balancer settings, API gateway definition) that is used for a workload. If you used a cluster-scoped parameter then either:
- the cluster operator team needs to approve a different team's changes every time there's a new configuration change being applied.
- the cluster operator must define specific access controls, such as RBAC roles and bindings, that let the application team make changes to the cluster-scoped parameters resource.
The IngressClass API itself is always cluster-scoped.
Here is an example of an IngressClass that refers to parameters that are namespaced:
---
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
name: external-lb-2
spec:
controller: example.com/ingress-controller
parameters:
# The parameters for this IngressClass are specified in an
# IngressParameter (API group k8s.example.com) named "external-config",
# that's in the "external-configuration" namespace.
scope: Namespace
apiGroup: k8s.example.com
kind: IngressParameter
namespace: external-configuration
name: external-config
Deprecated annotation
Before the IngressClass resource and ingressClassName
field were added in Kubernetes 1.18, Ingress classes were specified with a kubernetes.io/ingress.class
annotation on the Ingress. This annotation was never formally defined, but was widely supported by Ingress controllers.
The newer ingressClassName
field on Ingresses is a replacement for that
annotation, but is not a direct equivalent. While the annotation was generally used to reference the name of the Ingress controller that should implement the Ingress, the field is a reference to an IngressClass resource that contains additional Ingress configuration, including the name of the Ingress controller.
Default IngressClass
You can mark a particular IngressClass as default for your cluster. Setting the ingressclass.kubernetes.io/is-default-class
annotation to true
on an
IngressClass resource will ensure that new Ingresses without an ingressClassName
field specified will be assigned this default IngressClass.
There are some ingress controllers, that work without the definition of a default IngressClass
. For example, the Ingress-NGINX controller can be configured with a flag --watch-ingress-without-class
. It is
recommended though, to specify the default IngressClass
:
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
labels:
app.kubernetes.io/component: controller
name: nginx-example
annotations:
ingressclass.kubernetes.io/is-default-class: "true"
spec:
controller: k8s.io/ingress-nginx
Types of Ingress
Ingress backed by a single Service
There are existing Kubernetes concepts that allow you to expose a single Service (see alternatives). You can also do this with an Ingress by specifying a default backend with no rules.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: test-ingress
spec:
defaultBackend:
service:
name: test
port:
number: 80
If you create it using kubectl apply -f
you should be able to view the state of the Ingress you added:
kubectl get ingress test-ingress
NAME CLASS HOSTS ADDRESS PORTS AGE
test-ingress external-lb * 203.0.113.123 80 59s
Where 203.0.113.123
is the IP allocated by the Ingress controller to satisfy this Ingress.
Simple fanout
A fanout configuration routes traffic from a single IP address to more than one Service, based on the HTTP URI being requested. An Ingress allows you to keep the number of load balancers down to a minimum. For example, a setup like:
Figure. Ingress Fan Out
would require an Ingress such as:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: simple-fanout-example
spec:
rules:
- host: foo.bar.com
http:
paths:
- path: /foo
pathType: Prefix
backend:
service:
name: service1
port:
number: 4200
- path: /bar
pathType: Prefix
backend:
service:
name: service2
port:
number: 8080
When you create the Ingress with kubectl apply -f
:
kubectl describe ingress simple-fanout-example
Name: simple-fanout-example
Namespace: default
Address: 178.91.123.132
Default backend: default-http-backend:80 (10.8.2.3:8080)
Rules:
Host Path Backends
---- ---- --------
foo.bar.com
/foo service1:4200 (10.8.0.90:4200)
/bar service2:8080 (10.8.0.91:8080)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ADD 22s loadbalancer-controller default/test
The Ingress controller provisions an implementation-specific load balancer that satisfies the
Ingress, as long as the Services (service1
, service2
) exist. When it has done so, you can see the address of the load balancer at the Address field.
Name based virtual hosting
Name-based virtual hosts support routing HTTP traffic to multiple host names at the same IP address.
Figure. Ingress Name Based Virtual hosting
The following Ingress tells the backing load balancer to route requests based on the Host header.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: name-virtual-host-ingress
spec:
rules:
- host: foo.bar.com
http:
paths:
- pathType: Prefix
path: "/"
backend:
service:
name: service1
port:
number: 80
- host: bar.foo.com
http:
paths:
- pathType: Prefix
path: "/"
backend:
service:
name: service2
port:
number: 80
If you create an Ingress resource without any hosts defined in the rules, then any web traffic to the IP address of your Ingress controller can be matched without a name based virtual host being required.
For example, the following Ingress routes traffic requested for first.bar.com
to service1
, second.bar.com
to service2
, and any traffic whose request host header doesn't match first.bar.com
and second.bar.com
to service3
.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: name-virtual-host-ingress-no-third-host
spec:
rules:
- host: first.bar.com
http:
paths:
- pathType: Prefix
path: "/"
backend:
service:
name: service1
port:
number: 80
- host: second.bar.com
http:
paths:
- pathType: Prefix
path: "/"
backend:
service:
name: service2
port:
number: 80
- http:
paths:
- pathType: Prefix
path: "/"
backend:
service:
name: service3
port:
number: 80
TLS
You can secure an Ingress by specifying a Secret that contains a TLS private key and certificate. The Ingress resource only supports a single TLS port, 443, and assumes TLS termination at the ingress point (traffic to the Service and its Pods
is in plaintext). If the TLS configuration section in an Ingress specifies different hosts, they are multiplexed on the same port according to the hostname specified through the SNI TLS extension (provided the Ingress controller supports SNI). The TLS secret must contain keys named tls.crt
and tls.key
that contain the certificate and private key to use for TLS. For example:
apiVersion: v1
kind: Secret
metadata:
name: testsecret-tls
namespace: default
data:
tls.crt: base64 encoded cert
tls.key: base64 encoded key
type: kubernetes.io/tls
Referencing this secret in an Ingress tells the Ingress controller to secure the channel from the
client to the load balancer using TLS. You need to make sure the TLS secret you created came from a certificate that contains a Common Name (CN), also known as a Fully Qualified Domain Name (FQDN) for https-example.foo.com
.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: tls-example-ingress
spec:
tls:
- hosts:
- https-example.foo.com
secretName: testsecret-tls
rules:
- host: https-example.foo.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: service1
port:
number: 80
Load balancing
An Ingress controller is bootstrapped with some load balancing policy settings that it applies to all Ingress, such as the load balancing algorithm, backend weight scheme, and others. More advanced load balancing concepts (e.g. persistent sessions, dynamic weights) are not yet exposed through the Ingress. You can instead get these features through the load balancer used for a Service.
It's also worth noting that even though health checks are not exposed directly through the Ingress, there exist parallel concepts in Kubernetes such as readiness probes that allow you to achieve the same end result. Please review the controller specific documentation to see how they handle health checks (for example: nginx, or GCE).
Updating an Ingress
To update an existing Ingress to add a new Host, you can update it by editing the resource:
kubectl describe ingress test
Name: test
Namespace: default
Address: 178.91.123.132
Default backend: default-http-backend:80 (10.8.2.3:8080)
Rules:
Host Path Backends
---- ---- --------
foo.bar.com
/foo service1:80 (10.8.0.90:80)
Annotations:
nginx.ingress.kubernetes.io/rewrite-target: /
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ADD 35s loadbalancer-controller default/test
kubectl edit ingress test
This pops up an editor with the existing configuration in YAML format. Modify it to include the new Host:
spec:
rules:
- host: foo.bar.com
http:
paths:
- backend:
service:
name: service1
port:
number: 80
path: /foo
pathType: Prefix
- host: bar.baz.com
http:
paths:
- backend:
service:
name: service2
port:
number: 80
path: /foo
pathType: Prefix
..
After you save your changes, kubectl updates the resource in the API server, which tells the Ingress controller to reconfigure the load balancer.
Verify this:
kubectl describe ingress test
Name: test
Namespace: default
Address: 178.91.123.132
Default backend: default-http-backend:80 (10.8.2.3:8080)
Rules:
Host Path Backends
---- ---- --------
foo.bar.com
/foo service1:80 (10.8.0.90:80)
bar.baz.com
/foo service2:80 (10.8.0.91:80)
Annotations:
nginx.ingress.kubernetes.io/rewrite-target: /
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ADD 45s loadbalancer-controller default/test
You can achieve the same outcome by invoking kubectl replace -f
on a modified Ingress YAML file.
Failing across availability zones
Techniques for spreading traffic across failure domains differ between cloud providers. Please check the documentation of the relevant Ingress controller for details.
Alternatives
You can expose a Service in multiple ways that don't directly involve the Ingress resource:
- Use Service.Type=LoadBalancer
- Use Service.Type=NodePort
What's next
- Learn about the Ingress API
- Learn about Ingress controllers
- Set up Ingress on Minikube with the NGINX Controller
4 - Connecting Applications with Services
The Kubernetes model for connecting containers
Now that you have a continuously running, replicated application you can expose it on a network.
Kubernetes assumes that pods can communicate with other pods, regardless of which host they land on. Kubernetes gives every pod its own cluster-private IP address, so you do not need to explicitly create links between pods or map container ports to host ports. This means that containers within a Pod can all reach each other's ports on localhost, and all pods in a cluster can see each other without NAT. The rest of this document elaborates on how you can run reliable services on such a networking model.
This guide uses a simple nginx server to demonstrate proof of concept.
Exposing pods to the cluster
We did this in a previous example, but let's do it once again and focus on the networking perspective. Create an nginx Pod, and note that it has a container port specification:
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-nginx
spec:
selector:
matchLabels:
run: my-nginx
replicas: 2
template:
metadata:
labels:
run: my-nginx
spec:
containers:
- name: my-nginx
image: nginx
ports:
- containerPort: 80
This makes it accessible from any node in your cluster. Check the nodes the Pod is running on:
kubectl apply -f ./run-my-nginx.yaml
kubectl get pods -l run=my-nginx -o wide
NAME READY STATUS RESTARTS AGE IP NODE
my-nginx-3800858182-jr4a2 1/1 Running 0 13s 10.244.3.4 kubernetes-minion-905m
my-nginx-3800858182-kna2y 1/1 Running 0 13s 10.244.2.5 kubernetes-minion-ljyd
Check your pods' IPs:
kubectl get pods -l run=my-nginx -o custom-columns=POD_IP:.status.podIPs
POD_IP
[map[ip:10.244.3.4]]
[map[ip:10.244.2.5]]
You should be able to ssh into any node in your cluster and use a tool such as curl
to make queries against both
IPs. Note that the containers are not using port 80 on the node, nor are there any special NAT rules to route traffic to the pod. This means you can run multiple nginx pods on the same node all using the same containerPort
, and access them from any other pod or node in your cluster using the assigned IP address for the Service. If you want to arrange for a specific port on the host Node to be forwarded to backing Pods, you can - but the networking model should mean that you do not need to do
so.
You can read more about the Kubernetes Networking Model if you're curious.
Creating a Service
So we have pods running nginx in a flat, cluster wide, address space. In theory, you could talk to these pods directly, but what happens when a node dies? The pods die with it, and the Deployment will create new ones, with different IPs. This is the problem a Service solves.
A Kubernetes Service is an abstraction which defines a logical set of Pods running somewhere in your cluster, that all provide the same functionality. When created, each Service is assigned a unique IP address (also called clusterIP). This address is tied to the lifespan of the Service, and will not change while the Service is alive. Pods can be configured to talk to the Service, and know that communication to the Service will be automatically load-balanced out to some pod that is a member of the Service.
You can create a Service for your 2 nginx replicas with kubectl expose
:
kubectl expose deployment/my-nginx
service/my-nginx exposed
This is equivalent to kubectl apply -f
the following yaml:
apiVersion: v1
kind: Service
metadata:
name: my-nginx
labels:
run: my-nginx
spec:
ports:
- port: 80
protocol: TCP
selector:
run: my-nginx
This specification will create a Service which targets TCP port 80 on any Pod with the run: my-nginx
label, and expose it on an abstracted Service port (targetPort
: is the port the container accepts traffic on,
port
: is the abstracted Service port, which can be any port other pods use to access the Service). View Service API object to see the list of supported fields in service definition. Check your Service:
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
my-nginx ClusterIP 10.0.162.149 80/TCP 21s
As mentioned previously, a Service is backed by a group of Pods. These Pods are exposed through EndpointSlices. The Service's selector will be evaluated continuously and the results will be POSTed to an EndpointSlice that is connected to the Service using a labels. When a Pod dies, it is automatically removed from the EndpointSlices that contain it as an endpoint. New Pods that match the Service's selector will automatically get added to an EndpointSlice for that Service. Check the endpoints, and note that the IPs are the same as the Pods created in the first step:
kubectl describe svc my-nginx
Name: my-nginx
Namespace: default
Labels: run=my-nginx
Annotations:
Selector: run=my-nginx
Type: ClusterIP
IP: 10.0.162.149
Port: 80/TCP
Endpoints: 10.244.2.5:80,10.244.3.4:80
Session Affinity: None
Events:
kubectl get endpointslices -l kubernetes.io/service-name=my-nginx
NAME ADDRESSTYPE PORTS ENDPOINTS AGE
my-nginx-7vzhx IPv4 80 10.244.2.5,10.244.3.4 21s
You should now be able to curl the nginx Service on
from any node in your cluster. Note that the Service IP is completely virtual, it never hits the wire. If you're curious about how this works you can read more about the service proxy.
Accessing the Service
Kubernetes supports 2 primary modes of finding a Service - environment variables and DNS. The former works out of the box while the latter requires the CoreDNS cluster addon.
Environment Variables
When a Pod runs on a Node, the kubelet adds a set of environment variables for each active Service. This introduces an ordering problem. To see why, inspect the environment of your running nginx Pods (your Pod name will be different):
kubectl exec my-nginx-3800858182-jr4a2 -- printenv | grep SERVICE
KUBERNETES_SERVICE_HOST=10.0.0.1
KUBERNETES_SERVICE_PORT=443
KUBERNETES_SERVICE_PORT_HTTPS=443
Note there's no mention of your Service. This is because you created the replicas before the Service. Another disadvantage of doing this is that the scheduler might put both Pods on the same machine, which will take your entire Service down if it dies. We can do this the right way by killing the 2 Pods and waiting for the Deployment to recreate them. This time around the Service exists before the replicas. This will give you scheduler-level Service spreading of your Pods (provided all your nodes have equal capacity), as well as the right environment variables:
kubectl scale deployment my-nginx --replicas=0; kubectl scale deployment my-nginx --replicas=2;
kubectl get pods -l run=my-nginx -o wide
NAME READY STATUS RESTARTS AGE IP NODE
my-nginx-3800858182-e9ihh 1/1 Running 0 5s 10.244.2.7 kubernetes-minion-ljyd
my-nginx-3800858182-j4rm4 1/1 Running 0 5s 10.244.3.8 kubernetes-minion-905m
You may notice that the pods have different names, since they are killed and recreated.
kubectl exec my-nginx-3800858182-e9ihh -- printenv | grep SERVICE
KUBERNETES_SERVICE_PORT=443
MY_NGINX_SERVICE_HOST=10.0.162.149
KUBERNETES_SERVICE_HOST=10.0.0.1
MY_NGINX_SERVICE_PORT=80
KUBERNETES_SERVICE_PORT_HTTPS=443
DNS
Kubernetes offers a DNS cluster addon Service that automatically assigns dns names to other Services. You can check if it's running on your cluster:
kubectl get services kube-dns --namespace=kube-system
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kube-dns ClusterIP 10.0.0.10 53/UDP,53/TCP 8m
The
rest of this section will assume you have a Service with a long lived IP (my-nginx), and a DNS server that has assigned a name to that IP. Here we use the CoreDNS cluster addon (application name kube-dns
), so you can talk to the Service from any pod in your cluster using standard methods (e.g. gethostbyname()
). If CoreDNS isn't running, you can enable it referring to the CoreDNS README or
Installing CoreDNS. Let's run another curl application to test this:
kubectl run curl --image=radial/busyboxplus:curl -i --tty
Waiting for pod default/curl-131556218-9fnch to be running, status is Pending, pod ready: false
Hit enter for command prompt
Then, hit enter and run nslookup my-nginx
:
[ root@curl-131556218-9fnch:/ ]$ nslookup my-nginx
Server: 10.0.0.10
Address 1: 10.0.0.10
Name: my-nginx
Address 1: 10.0.162.149
Securing the Service
Till now we have only accessed the nginx server from within the cluster. Before exposing the Service to the internet, you want to make sure the communication channel is secure. For this, you will need:
- Self signed certificates for https (unless you already have an identity certificate)
- An nginx server configured to use the certificates
- A secret that makes the certificates accessible to pods
You can acquire all these from the nginx https example. This requires having go and make tools installed. If you don't want to install those, then follow the manual steps later. In short:
make keys KEY=/tmp/nginx.key CERT=/tmp/nginx.crt
kubectl create secret tls nginxsecret --key /tmp/nginx.key --cert /tmp/nginx.crt
secret/nginxsecret created
NAME TYPE DATA AGE
nginxsecret kubernetes.io/tls 2 1m
And also the configmap:
kubectl create configmap nginxconfigmap --from-file=default.conf
configmap/nginxconfigmap created
NAME DATA AGE
nginxconfigmap 1 114s
Following are the manual steps to follow in case you run into problems running make (on windows for example):
# Create a public private key pair
openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout /d/tmp/nginx.key -out /d/tmp/nginx.crt -subj "/CN=my-nginx/O=my-nginx"
# Convert the keys to base64 encoding
cat /d/tmp/nginx.crt | base64
cat /d/tmp/nginx.key | base64
Use the output from the previous commands to create a yaml file as follows. The base64 encoded value should all be on a single line.
apiVersion: "v1"
kind: "Secret"
metadata:
name: "nginxsecret"
namespace: "default"
type: kubernetes.io/tls
data:
tls.crt: "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURIekNDQWdlZ0F3SUJBZ0lKQUp5M3lQK0pzMlpJTUEwR0NTcUdTSWIzRFFFQkJRVUFNQ1l4RVRBUEJnTlYKQkFNVENHNW5hVzU0YzNaak1SRXdEd1lEVlFRS0V3aHVaMmx1ZUhOMll6QWVGdzB4TnpFd01qWXdOekEzTVRKYQpGdzB4T0RFd01qWXdOekEzTVRKYU1DWXhFVEFQQmdOVkJBTVRDRzVuYVc1NGMzWmpNUkV3RHdZRFZRUUtFd2h2CloybHVlSE4yWXpDQ0FTSXdEUVlKS29aSWh2Y05BUUVCQlFBRGdnRVBBRENDQVFvQ2dnRUJBSjFxSU1SOVdWM0IKMlZIQlRMRmtobDRONXljMEJxYUhIQktMSnJMcy8vdzZhU3hRS29GbHlJSU94NGUrMlN5ajBFcndCLzlYTnBwbQppeW1CL3JkRldkOXg5UWhBQUxCZkVaTmNiV3NsTVFVcnhBZW50VWt1dk1vLzgvMHRpbGhjc3paenJEYVJ4NEo5Ci82UVRtVVI3a0ZTWUpOWTVQZkR3cGc3dlVvaDZmZ1Voam92VG42eHNVR0M2QURVODBpNXFlZWhNeVI1N2lmU2YKNHZpaXdIY3hnL3lZR1JBRS9mRTRqakxCdmdONjc2SU90S01rZXV3R0ljNDFhd05tNnNTSzRqYUNGeGpYSnZaZQp2by9kTlEybHhHWCtKT2l3SEhXbXNhdGp4WTRaNVk3R1ZoK0QrWnYvcW1mMFgvbVY0Rmo1NzV3ajFMWVBocWtsCmdhSXZYRyt4U1FVQ0F3RUFBYU5RTUU0d0hRWURWUjBPQkJZRUZPNG9OWkI3YXc1OUlsYkROMzhIYkduYnhFVjcKTUI4R0ExVWRJd1FZTUJhQUZPNG9OWkI3YXc1OUlsYkROMzhIYkduYnhFVjdNQXdHQTFVZEV3UUZNQU1CQWY4dwpEUVlKS29aSWh2Y05BUUVGQlFBRGdnRUJBRVhTMW9FU0lFaXdyMDhWcVA0K2NwTHI3TW5FMTducDBvMm14alFvCjRGb0RvRjdRZnZqeE04Tzd2TjB0clcxb2pGSW0vWDE4ZnZaL3k4ZzVaWG40Vm8zc3hKVmRBcStNZC9jTStzUGEKNmJjTkNUekZqeFpUV0UrKzE5NS9zb2dmOUZ3VDVDK3U2Q3B5N0M3MTZvUXRUakViV05VdEt4cXI0Nk1OZWNCMApwRFhWZmdWQTRadkR4NFo3S2RiZDY5eXM3OVFHYmg5ZW1PZ05NZFlsSUswSGt0ejF5WU4vbVpmK3FqTkJqbWZjCkNnMnlwbGQ0Wi8rUUNQZjl3SkoybFIrY2FnT0R4elBWcGxNSEcybzgvTHFDdnh6elZPUDUxeXdLZEtxaUMwSVEKQ0I5T2wwWW5scE9UNEh2b2hSUzBPOStlMm9KdFZsNUIyczRpbDlhZ3RTVXFxUlU9Ci0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K"
tls.key: "LS0tLS1CRUdJTiBQUklWQVRFIEtFWS0tLS0tCk1JSUV2UUlCQURBTkJna3Foa2lHOXcwQkFRRUZBQVNDQktjd2dnU2pBZ0VBQW9JQkFRQ2RhaURFZlZsZHdkbFIKd1V5eFpJWmVEZWNuTkFhbWh4d1NpeWF5N1AvOE9ta3NVQ3FCWmNpQ0RzZUh2dGtzbzlCSzhBZi9WemFhWm9zcApnZjYzUlZuZmNmVUlRQUN3WHhHVFhHMXJKVEVGSzhRSHA3VkpMcnpLUC9QOUxZcFlYTE0yYzZ3MmtjZUNmZitrCkU1bEVlNUJVbUNUV09UM3c4S1lPNzFLSWVuNEZJWTZMMDUrc2JGQmd1Z0ExUE5JdWFubm9UTWtlZTRuMG4rTDQKb3NCM01ZUDhtQmtRQlAzeE9JNHl3YjREZXUraURyU2pKSHJzQmlIT05Xc0RadXJFaXVJMmdoY1kxeWIyWHI2UAozVFVOcGNSbC9pVG9zQngxcHJHclk4V09HZVdPeGxZZmcvbWIvNnBuOUYvNWxlQlkrZStjSTlTMkQ0YXBKWUdpCkwxeHZzVWtGQWdNQkFBRUNnZ0VBZFhCK0xkbk8ySElOTGo5bWRsb25IUGlHWWVzZ294RGQwci9hQ1Zkank4dlEKTjIwL3FQWkUxek1yall6Ry9kVGhTMmMwc0QxaTBXSjdwR1lGb0xtdXlWTjltY0FXUTM5SjM0VHZaU2FFSWZWNgo5TE1jUHhNTmFsNjRLMFRVbUFQZytGam9QSFlhUUxLOERLOUtnNXNrSE5pOWNzMlY5ckd6VWlVZWtBL0RBUlBTClI3L2ZjUFBacDRuRWVBZmI3WTk1R1llb1p5V21SU3VKdlNyblBESGtUdW1vVlVWdkxMRHRzaG9reUxiTWVtN3oKMmJzVmpwSW1GTHJqbGtmQXlpNHg0WjJrV3YyMFRrdWtsZU1jaVlMbjk4QWxiRi9DSmRLM3QraTRoMTVlR2ZQegpoTnh3bk9QdlVTaDR2Q0o3c2Q5TmtEUGJvS2JneVVHOXBYamZhRGR2UVFLQmdRRFFLM01nUkhkQ1pKNVFqZWFKClFGdXF4cHdnNzhZTjQyL1NwenlUYmtGcVFoQWtyczJxWGx1MDZBRzhrZzIzQkswaHkzaE9zSGgxcXRVK3NHZVAKOWRERHBsUWV0ODZsY2FlR3hoc0V0L1R6cEdtNGFKSm5oNzVVaTVGZk9QTDhPTm1FZ3MxMVRhUldhNzZxelRyMgphRlpjQ2pWV1g0YnRSTHVwSkgrMjZnY0FhUUtCZ1FEQmxVSUUzTnNVOFBBZEYvL25sQVB5VWs1T3lDdWc3dmVyClUycXlrdXFzYnBkSi9hODViT1JhM05IVmpVM25uRGpHVHBWaE9JeXg5TEFrc2RwZEFjVmxvcG9HODhXYk9lMTAKMUdqbnkySmdDK3JVWUZiRGtpUGx1K09IYnRnOXFYcGJMSHBzUVpsMGhucDBYSFNYVm9CMUliQndnMGEyOFVadApCbFBtWmc2d1BRS0JnRHVIUVV2SDZHYTNDVUsxNFdmOFhIcFFnMU16M2VvWTBPQm5iSDRvZUZKZmcraEppSXlnCm9RN3hqWldVR3BIc3AyblRtcHErQWlSNzdyRVhsdlhtOElVU2FsbkNiRGlKY01Pc29RdFBZNS9NczJMRm5LQTQKaENmL0pWb2FtZm1nZEN0ZGtFMXNINE9MR2lJVHdEbTRpb0dWZGIwMllnbzFyb2htNUpLMUI3MkpBb0dBUW01UQpHNDhXOTVhL0w1eSt5dCsyZ3YvUHM2VnBvMjZlTzRNQ3lJazJVem9ZWE9IYnNkODJkaC8xT2sybGdHZlI2K3VuCnc1YytZUXRSTHlhQmd3MUtpbGhFZDBKTWU3cGpUSVpnQWJ0LzVPbnlDak9OVXN2aDJjS2lrQ1Z2dTZsZlBjNkQKckliT2ZIaHhxV0RZK2Q1TGN1YSt2NzJ0RkxhenJsSlBsRzlOZHhrQ2dZRUF5elIzT3UyMDNRVVV6bUlCRkwzZAp4Wm5XZ0JLSEo3TnNxcGFWb2RjL0d5aGVycjFDZzE2MmJaSjJDV2RsZkI0VEdtUjZZdmxTZEFOOFRwUWhFbUtKCnFBLzVzdHdxNWd0WGVLOVJmMWxXK29xNThRNTBxMmk1NVdUTThoSDZhTjlaMTltZ0FGdE5VdGNqQUx2dFYxdEYKWSs4WFJkSHJaRnBIWll2NWkwVW1VbGc9Ci0tLS0tRU5EIFBSSVZBVEUgS0VZLS0tLS0K"
Now create the secrets using the file:
kubectl apply -f nginxsecrets.yaml
kubectl get secrets
NAME TYPE DATA AGE
nginxsecret kubernetes.io/tls 2 1m
Now modify your nginx replicas to start an https server using the certificate in the secret, and the Service, to expose both ports (80 and 443):
apiVersion: v1
kind: Service
metadata:
name: my-nginx
labels:
run: my-nginx
spec:
type: NodePort
ports:
- port: 8080
targetPort: 80
protocol: TCP
name: http
- port: 443
protocol: TCP
name: https
selector:
run: my-nginx
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-nginx
spec:
selector:
matchLabels:
run: my-nginx
replicas: 1
template:
metadata:
labels:
run: my-nginx
spec:
volumes:
- name: secret-volume
secret:
secretName: nginxsecret
- name: configmap-volume
configMap:
name: nginxconfigmap
containers:
- name: nginxhttps
image: bprashanth/nginxhttps:1.0
ports:
- containerPort: 443
- containerPort: 80
volumeMounts:
- mountPath: /etc/nginx/ssl
name: secret-volume
- mountPath: /etc/nginx/conf.d
name: configmap-volume
Noteworthy points about the nginx-secure-app manifest:
- It contains both Deployment and Service specification in the same file.
- The nginx server serves HTTP traffic on port 80 and HTTPS traffic on 443, and nginx Service exposes both ports.
- Each container has access to the keys through a volume mounted at
/etc/nginx/ssl
. This is set up before the nginx server is started.
kubectl delete deployments,svc my-nginx; kubectl create -f ./nginx-secure-app.yaml
At this point you can reach the nginx server from any node.
kubectl get pods -l run=my-nginx -o custom-columns=POD_IP:.status.podIPs
POD_IP
[map[ip:10.244.3.5]]
node $ curl -k https://10.244.3.5
...
Welcome to nginx!
Note
how we supplied the -k
parameter to curl in the last step, this is because we don't know anything about the pods running nginx at certificate generation time, so we have to tell curl to ignore the CName mismatch. By creating a Service we linked the CName used in the certificate with the actual DNS name used by pods during Service lookup. Let's test this from a pod (the same secret is being reused for simplicity, the pod only needs nginx.crt to access the Service):
apiVersion: apps/v1
kind: Deployment
metadata:
name: curl-deployment
spec:
selector:
matchLabels:
app: curlpod
replicas: 1
template:
metadata:
labels:
app: curlpod
spec:
volumes:
- name: secret-volume
secret:
secretName: nginxsecret
containers:
- name: curlpod
command:
- sh
- -c
- while true; do sleep 1; done
image: radial/busyboxplus:curl
volumeMounts:
- mountPath: /etc/nginx/ssl
name: secret-volume
kubectl apply -f ./curlpod.yaml
kubectl get pods -l app=curlpod
NAME READY STATUS RESTARTS AGE
curl-deployment-1515033274-1410r 1/1 Running 0 1m
kubectl exec curl-deployment-1515033274-1410r -- curl https://my-nginx --cacert /etc/nginx/ssl/tls.crt
...
Welcome to nginx!
...
Exposing the Service
For some parts of your applications you may want to expose a Service onto an external IP address. Kubernetes supports two ways of doing this: NodePorts and LoadBalancers. The Service created in the last section already used NodePort
, so your nginx HTTPS replica is ready to serve traffic on the internet if your node has a public IP.
kubectl get svc my-nginx -o yaml | grep nodePort -C 5
uid: 07191fb3-f61a-11e5-8ae5-42010af00002
spec:
clusterIP: 10.0.162.149
ports:
- name: http
nodePort: 31704
port: 8080
protocol: TCP
targetPort: 80
- name: https
nodePort: 32453
port: 443
protocol: TCP
targetPort: 443
selector:
run: my-nginx
kubectl get nodes -o yaml | grep ExternalIP -C 1
- address: 104.197.41.11
type: ExternalIP
allocatable:
--
- address: 23.251.152.56
type: ExternalIP
allocatable:
...
$ curl https://: -k
...
Welcome to nginx!
Let's now recreate the Service to use a cloud load balancer. Change the Type
of my-nginx
Service from NodePort
to LoadBalancer
:
kubectl edit svc my-nginx
kubectl get svc my-nginx
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
my-nginx LoadBalancer 10.0.162.149 xx.xxx.xxx.xxx 8080:30163/TCP 21s
curl https:// -k
...
Welcome to nginx!
The IP address in the EXTERNAL-IP
column is the one that is available on the public internet. The CLUSTER-IP
is only available inside your cluster/private cloud network.
Note that on AWS, type LoadBalancer
creates an ELB, which uses a (long) hostname, not an IP. It's too long to fit in the standard kubectl get svc
output, in fact, so you'll need to do kubectl describe service my-nginx
to see it. You'll see something like this:
kubectl describe service my-nginx
...
LoadBalancer Ingress: a320587ffd19711e5a37606cf4a74574-1142138393.us-east-1.elb.amazonaws.com
...
What's next
- Learn more about Using a Service to Access an Application in a Cluster
- Learn more about Connecting a Front End to a Back End Using a Service
- Learn more about Creating an External Load Balancer
5 - Ingress Controllers
In order for an Ingress to work in your cluster, there must be an ingress controller running. You need to select at least one ingress controller and make sure it is set up in your cluster. This page lists common ingress controllers that you can deploy.
In order for the Ingress resource to work, the cluster must have an ingress controller running.
Unlike other types of controllers which run as part
of the kube-controller-manager
binary, Ingress controllers are not started automatically with a cluster. Use this page to choose the ingress controller implementation that best fits your cluster.
Kubernetes as a project supports and maintains AWS, GCE, and nginx ingress controllers.
Additional controllers
- AKS Application Gateway Ingress Controller is an ingress controller that configures the Azure Application Gateway.
- Ambassador API Gateway is an Envoy-based ingress controller.
- Apache APISIX ingress controller is an Apache APISIX-based ingress controller.
- Avi Kubernetes Operator provides L4-L7 load-balancing using VMware NSX Advanced Load Balancer.
- BFE Ingress Controller is a BFE-based ingress controller.
- The Citrix ingress controller works with Citrix Application Delivery Controller.
- Contour is an Envoy based ingress controller.
- EnRoute is an Envoy based API gateway that can run as an ingress controller.
- Easegress IngressController is an Easegress based API gateway that can run as an ingress controller.
- F5 BIG-IP Container Ingress Services for Kubernetes lets you use an Ingress to configure F5 BIG-IP virtual servers.
- Gloo is an open-source ingress controller based on Envoy, which offers API gateway functionality.
- HAProxy Ingress is an ingress controller for HAProxy.
- The HAProxy Ingress Controller for Kubernetes is also an ingress controller for HAProxy.
- Istio Ingress is an Istio based ingress controller.
- The Kong Ingress Controller for Kubernetes is an ingress controller driving Kong Gateway.
- Kusk Gateway is an OpenAPI-driven ingress controller based on Envoy.
- The NGINX Ingress Controller for Kubernetes works with the NGINX webserver (as a proxy).
- The Pomerium Ingress Controller is based on Pomerium, which offers context-aware access policy.
- Skipper HTTP router and reverse proxy for service composition, including use cases like Kubernetes Ingress, designed as a library to build your custom proxy.
- The Traefik Kubernetes Ingress provider is an ingress controller for the Traefik proxy.
- Tyk Operator extends Ingress with Custom Resources to bring API Management capabilities to Ingress. Tyk Operator works with the Open Source Tyk Gateway & Tyk Cloud control plane.
- Voyager is an ingress controller for HAProxy.
Using multiple Ingress controllers
You may deploy any number of ingress controllers using ingress class within a cluster. Note the .metadata.name
of your ingress class resource. When you create an ingress you would need that name to specify the ingressClassName
field on your
Ingress object (refer to IngressSpec v1 reference. ingressClassName
is a replacement of the older annotation method.
If you do not specify an IngressClass for an Ingress, and your cluster has exactly one IngressClass marked as default, then Kubernetes
applies the cluster's default IngressClass to the Ingress. You mark an IngressClass as default by setting the ingressclass.kubernetes.io/is-default-class
annotation on that IngressClass, with the string value "true"
.
Ideally, all ingress controllers should fulfill this specification, but the various ingress controllers operate slightly differently.
What's next
- Learn more about Ingress.
- Set up Ingress on Minikube with the NGINX Controller.
6 - EndpointSlices
The EndpointSlice API is the mechanism that Kubernetes uses to let your Service scale to handle large numbers of backends, and allows the cluster to update its list of healthy backends efficiently.
FEATURE STATE: Kubernetes v1.21 [stable]
EndpointSlices provide a simple way to track network endpoints within a Kubernetes cluster. They offer a more scalable and extensible alternative to Endpoints.
EndpointSlice API
In Kubernetes, an EndpointSlice contains references to a set of network endpoints. The control plane automatically creates EndpointSlices for any Kubernetes Service that has a selector specified. These EndpointSlices include references to all the Pods that match the Service selector. EndpointSlices group network endpoints together by unique combinations of protocol, port number, and Service name. The name of a EndpointSlice object must be a valid DNS subdomain name.
As an example, here's a sample EndpointSlice object, that's owned by the example
Kubernetes Service.
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
name: example-abc
labels:
kubernetes.io/service-name: example
addressType: IPv4
ports:
- name: http
protocol: TCP
port: 80
endpoints:
- addresses:
- "10.1.2.3"
conditions:
ready: true
hostname: pod-1
nodeName: node-1
zone: us-west2-a
By default, the control plane creates and
manages EndpointSlices to have no more than 100 endpoints each. You can configure this with the --max-endpoints-per-slice
kube-controller-manager flag, up to a maximum of 1000.
EndpointSlices can act as the source of truth for kube-proxy when it comes to how to route internal traffic.
Address types
EndpointSlices support three address types:
- IPv4
- IPv6
- FQDN (Fully Qualified Domain Name)
Each EndpointSlice
object represents a specific IP address type. If you have a Service that is available via IPv4 and IPv6, there will be at least two EndpointSlice
objects (one for IPv4, and one for IPv6).
Conditions
The EndpointSlice API stores conditions about endpoints that may be useful for consumers. The three conditions are ready
, serving
, and terminating
.
Ready
ready
is a condition that maps to a Pod's Ready
condition. A running Pod with the Ready
condition set to True
should have this EndpointSlice condition also set to true
. For compatibility reasons, ready
is NEVER true
when a Pod is terminating. Consumers should refer to the serving
condition to inspect the readiness of terminating Pods. The only exception to this rule is for Services with spec.publishNotReadyAddresses
set to true
. Endpoints for these Services will always have the ready
condition set to true
.
Serving
FEATURE STATE: Kubernetes v1.22 [beta]
serving
is identical to the ready
condition, except it does not account for terminating states. Consumers of the EndpointSlice API should check this condition if they care about pod readiness while the pod is also terminating.
Terminating
FEATURE STATE: Kubernetes v1.22 [beta]
Terminating
is a condition that indicates whether an endpoint is terminating. For pods, this is any pod that has a deletion timestamp set.
Topology information
Each endpoint within an EndpointSlice can contain relevant topology information. The topology information includes the location of the endpoint and information about the corresponding Node and zone. These are available in the following per endpoint fields on EndpointSlices:
nodeName
- The name of the Node this endpoint is on.zone
- The zone this endpoint is in.
Management
Most often, the control plane (specifically, the endpoint slice controller) creates and manages EndpointSlice objects. There are a variety of other use cases for EndpointSlices, such as service mesh implementations, that could result in other entities or controllers managing additional sets of EndpointSlices.
To ensure that multiple entities can manage EndpointSlices without interfering with each other, Kubernetes defines the label endpointslice.kubernetes.io/managed-by
, which indicates the
entity managing an EndpointSlice. The endpoint slice controller sets endpointslice-controller.k8s.io
as the value for this label on all EndpointSlices it manages. Other entities managing EndpointSlices should also set a unique value for this label.
Ownership
In most use cases, EndpointSlices are owned by the Service that the endpoint slice object tracks endpoints for. This ownership is indicated by an owner reference on each EndpointSlice as well as a kubernetes.io/service-name
label that enables simple
lookups of all EndpointSlices belonging to a Service.
EndpointSlice mirroring
In some cases, applications create custom Endpoints resources. To ensure that these applications do not need to concurrently write to both Endpoints and EndpointSlice resources, the cluster's control plane mirrors most Endpoints resources to corresponding EndpointSlices.
The control plane mirrors Endpoints resources unless:
- the Endpoints resource has a
endpointslice.kubernetes.io/skip-mirror
label set totrue
. - the Endpoints resource has a
control-plane.alpha.kubernetes.io/leader
annotation. - the corresponding Service resource does not exist.
- the corresponding Service resource has a non-nil selector.
Individual Endpoints resources may translate into multiple EndpointSlices. This will occur if an Endpoints resource has multiple subsets or includes endpoints with multiple IP families (IPv4 and IPv6). A maximum of 1000 addresses per subset will be mirrored to EndpointSlices.
Distribution of EndpointSlices
Each EndpointSlice has a set of ports that applies to all endpoints within the resource. When named ports are used for a Service, Pods may end up with different target port numbers for the same named port, requiring different EndpointSlices. This is similar to the logic behind how subsets are grouped with Endpoints.
The control plane tries to fill EndpointSlices as full as possible, but does not actively rebalance them. The logic is fairly straightforward:
- Iterate through existing EndpointSlices, remove endpoints that are no longer desired and update matching endpoints that have changed.
- Iterate through EndpointSlices that have been modified in the first step and fill them up with any new endpoints needed.
- If there's still new endpoints left to add, try to fit them into a previously unchanged slice and/or create new ones.
Importantly, the third step prioritizes limiting EndpointSlice updates over a perfectly full distribution of EndpointSlices. As an example, if there are 10 new endpoints to add and 2 EndpointSlices with room for 5 more endpoints each, this approach will create a new EndpointSlice instead of filling up the 2 existing EndpointSlices. In other words, a single EndpointSlice creation is preferrable to multiple EndpointSlice updates.
With kube-proxy running on each Node and watching EndpointSlices, every change to an EndpointSlice becomes relatively expensive since it will be transmitted to every Node in the cluster. This approach is intended to limit the number of changes that need to be sent to every Node, even if it may result with multiple EndpointSlices that are not full.
In practice, this less than ideal distribution should be rare. Most changes processed by the EndpointSlice controller will be small enough to fit in an existing EndpointSlice, and if not, a new EndpointSlice is likely going to be necessary soon anyway. Rolling updates of Deployments also provide a natural repacking of EndpointSlices with all Pods and their corresponding endpoints getting replaced.
Duplicate endpoints
Due to the nature of EndpointSlice changes, endpoints may be represented in more than one EndpointSlice at the same time. This naturally occurs as changes to different EndpointSlice objects can arrive at the Kubernetes client watch / cache at different times.
Comparison with Endpoints
The original Endpoints API provided a simple and straightforward way of tracking network endpoints in Kubernetes. As Kubernetes clusters and Services grew to handle more traffic and to send more traffic to more backend Pods, the limitations of that original API became more visible. Most notably, those included challenges with scaling to larger numbers of network endpoints.
Since all network endpoints for a Service were stored in a single Endpoints object, those Endpoints objects could get quite large. For Services that stayed stable (the same set of endpoints over a long period of time) the impact was less noticeable; even then, some use cases of Kubernetes weren't well served.
When a Service had a lot of backend endpoints and the workload was either scaling frequently, or rolling out new changes frequently, each update to the single Endpoints object for that Service meant a lot of traffic between Kubernetes cluster components (within the control plane, and also between nodes and the API server). This extra traffic also had a cost in terms of CPU use.
With EndpointSlices, adding or removing a single Pod triggers the same number of updates to clients that are watching for changes, but the size of those update message is much smaller at large scale.
EndpointSlices also enabled innovation around new features such dual-stack networking and topology-aware routing.
What's next
- Read Connecting Applications with Services
7 - Network Policies
If you want to control traffic flow at the IP address or port level (OSI layer 3 or 4), NetworkPolicies allow you to specify rules for traffic flow within your cluster, and also between Pods and the outside world. Your cluster must use a network plugin that supports NetworkPolicy enforcement.
If you want to control traffic flow at the IP address or port level (OSI layer 3 or 4), then you might consider using Kubernetes NetworkPolicies for particular applications in your cluster. NetworkPolicies are an application-centric construct which allow you to specify how a pod is allowed to communicate with various network "entities" (we use the word "entity" here to avoid overloading the more common terms such as "endpoints" and "services", which have specific Kubernetes connotations) over the network. NetworkPolicies apply to a connection with a pod on one or both ends, and are not relevant to other connections.
The entities that a Pod can communicate with are identified through a combination of the following 3 identifiers:
- Other pods that are allowed (exception: a pod cannot block access to itself)
- Namespaces that are allowed
- IP blocks (exception: traffic to and from the node where a Pod is running is always allowed, regardless of the IP address of the Pod or the node)
When defining a pod- or namespace- based NetworkPolicy, you use a selector to specify what traffic is allowed to and from the Pod(s) that match the selector.
Meanwhile, when IP based NetworkPolicies are created, we define policies based on IP blocks (CIDR ranges).
Prerequisites
Network policies are implemented by the network plugin. To use network policies, you must be using a networking solution which supports NetworkPolicy. Creating a NetworkPolicy resource without a controller that implements it will have no effect.
The Two Sorts of Pod Isolation
There are two sorts of isolation for a pod: isolation for egress, and isolation for ingress. They concern what connections may be established. "Isolation" here is not absolute, rather it means "some restrictions apply". The alternative, "non-isolated for $direction", means that no restrictions apply in the stated direction. The two sorts of isolation (or not) are declared independently, and are both relevant for a connection from one pod to another.
By default, a pod is non-isolated for egress; all outbound connections are allowed. A pod is isolated for egress if there is any NetworkPolicy that both selects the pod and has "Egress" in its policyTypes
; we say that such a policy applies to the pod for egress. When a pod is isolated for egress, the only allowed connections from the pod are those allowed by the egress
list of some NetworkPolicy that applies to
the pod for egress. The effects of those egress
lists combine additively.
By default, a pod is non-isolated for ingress; all inbound connections are allowed. A pod is isolated for ingress if there is any NetworkPolicy that both selects the pod and has "Ingress" in its policyTypes
; we say that such a policy applies to the pod for ingress. When a pod is isolated for ingress, the only allowed connections into the pod are those from the pod's node and those allowed by the ingress
list of some
NetworkPolicy that applies to the pod for ingress. The effects of those ingress
lists combine additively.
Network policies do not conflict; they are additive. If any policy or policies apply to a given pod for a given direction, the connections allowed in that direction from that pod is the union of what the applicable policies allow. Thus, order of evaluation does not affect the policy result.
For a connection from a source pod to a destination pod to be allowed, both the egress policy on the source pod and the ingress policy on the destination pod need to allow the connection. If either side does not allow the connection, it will not happen.
The NetworkPolicy resource
See the NetworkPolicy reference for a full definition of the resource.
An example NetworkPolicy might look like this:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: test-network-policy
namespace: default
spec:
podSelector:
matchLabels:
role: db
policyTypes:
- Ingress
- Egress
ingress:
- from:
- ipBlock:
cidr: 172.17.0.0/16
except:
- 172.17.1.0/24
- namespaceSelector:
matchLabels:
project: myproject
- podSelector:
matchLabels:
role: frontend
ports:
- protocol: TCP
port: 6379
egress:
- to:
- ipBlock:
cidr: 10.0.0.0/24
ports:
- protocol: TCP
port: 5978
Mandatory Fields: As with all other Kubernetes config, a NetworkPolicy needs apiVersion
, kind
, and metadata
fields. For general information about working with config files, see Configure a Pod to Use a ConfigMap, and
Object Management.
spec: NetworkPolicy spec has all the information needed to define a particular network policy in the given namespace.
podSelector: Each NetworkPolicy includes a
podSelector
which selects the grouping of pods to which the policy applies. The example policy selects pods with the label "role=db". An empty podSelector
selects all pods in the namespace.
policyTypes: Each NetworkPolicy includes a policyTypes
list which may include either Ingress
, Egress
, or both. The policyTypes
field indicates whether or not the given policy applies to ingress traffic to selected pod, egress traffic from selected pods, or both. If no policyTypes
are specified on a NetworkPolicy then by default Ingress
will always be set and Egress
will be set if the NetworkPolicy has any egress rules.
ingress: Each NetworkPolicy may include a list of allowed ingress
rules. Each rule allows traffic which matches both the from
and ports
sections. The example policy contains a single rule, which matches traffic on a single port, from one of three sources, the first specified via an ipBlock
, the second via a namespaceSelector
and
the third via a podSelector
.
egress: Each NetworkPolicy may include a list of allowed egress
rules. Each rule allows traffic which matches both the to
and ports
sections. The example policy contains a single rule, which matches traffic on a single port to any destination in 10.0.0.0/24
.
So, the example NetworkPolicy:
isolates "role=db" pods in the "default" namespace for both ingress and egress traffic (if they weren't already isolated)
(Ingress rules) allows connections to all pods in the "default" namespace with the label "role=db" on TCP port 6379 from:
- any pod in the "default" namespace with the label "role=frontend"
- any pod in a namespace with the label "project=myproject"
- IP addresses in the ranges 172.17.0.0–172.17.0.255 and 172.17.2.0–172.17.255.255 (ie, all of 172.17.0.0/16 except 172.17.1.0/24)
(Egress rules) allows connections from any pod in the "default" namespace with the label "role=db" to CIDR 10.0.0.0/24 on TCP port 5978
See the Declare Network Policy walkthrough for further examples.
Behavior of to and from selectors
There are four kinds of selectors that can be specified in an ingress
from
section or egress
to
section:
podSelector: This selects particular Pods in the same namespace as the NetworkPolicy which should be allowed as ingress sources or egress destinations.
namespaceSelector: This selects particular namespaces for which all Pods should be allowed as ingress sources or egress destinations.
namespaceSelector and podSelector: A single to
/from
entry that specifies both namespaceSelector
and podSelector
selects particular Pods within particular
namespaces. Be careful to use correct YAML syntax; this policy:
...
ingress:
- from:
- namespaceSelector:
matchLabels:
user: alice
podSelector:
matchLabels:
role: client
...
contains a single from
element allowing connections from Pods with the label role=client
in namespaces with the label user=alice
. But this policy:
...
ingress:
- from:
- namespaceSelector:
matchLabels:
user: alice
- podSelector:
matchLabels:
role: client
...
contains two elements in the from
array, and allows connections from Pods in the local Namespace with the label role=client
, or from any Pod in any namespace with the label user=alice
.
When in doubt, use kubectl describe
to see
how Kubernetes has interpreted the policy.
ipBlock: This selects particular IP CIDR ranges to allow as ingress sources or egress destinations. These should be cluster-external IPs, since Pod IPs are ephemeral and unpredictable.
Cluster ingress and egress mechanisms often require rewriting the source or destination IP of packets. In cases where this happens, it is not defined whether this happens before or after
NetworkPolicy processing, and the behavior may be different for different combinations of network plugin, cloud provider, Service
implementation, etc.
In the case of ingress, this means that in some cases you may be able to filter incoming packets based on the actual original source IP, while in other cases, the "source IP" that the NetworkPolicy acts on may be the IP of a LoadBalancer
or of the Pod's node, etc.
For egress, this means that connections from pods to Service
IPs that get
rewritten to cluster-external IPs may or may not be subject to ipBlock
-based policies.
Default policies
By default, if no policies exist in a namespace, then all ingress and egress traffic is allowed to and from pods in that namespace. The following examples let you change the default behavior in that namespace.
Default deny all ingress traffic
You can create a "default" ingress isolation policy for a namespace by creating a NetworkPolicy that selects all pods but does not allow any ingress traffic to those pods.
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-ingress
spec:
podSelector: {}
policyTypes:
- Ingress
This ensures that even pods that aren't selected by any other NetworkPolicy will still be isolated for ingress. This policy does not affect isolation for egress from any pod.
Allow all ingress traffic
If you want to allow all incoming connections to all pods in a namespace, you can create a policy that explicitly allows that.
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-all-ingress
spec:
podSelector: {}
ingress:
- {}
policyTypes:
- Ingress
With this policy in place, no additional policy or policies can cause any incoming connection to those pods to be denied. This policy has no effect on isolation for egress from any pod.
Default deny all egress traffic
You can create a "default" egress isolation policy for a namespace by creating a NetworkPolicy that selects all pods but does not allow any egress traffic from those pods.
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-egress
spec:
podSelector: {}
policyTypes:
- Egress
This ensures that even pods that aren't selected by any other NetworkPolicy will not be allowed egress traffic. This policy does not change the ingress isolation behavior of any pod.
Allow all egress traffic
If you want to allow all connections from all pods in a namespace, you can create a policy that explicitly allows all outgoing connections from pods in that namespace.
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-all-egress
spec:
podSelector: {}
egress:
- {}
policyTypes:
- Egress
With this policy in place, no additional policy or policies can cause any outgoing connection from those pods to be denied. This policy has no effect on isolation for ingress to any pod.
Default deny all ingress and all egress traffic
You can create a "default" policy for a namespace which prevents all ingress AND egress traffic by creating the following NetworkPolicy in that namespace.
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-all
spec:
podSelector: {}
policyTypes:
- Ingress
- Egress
This ensures that even pods that aren't selected by any other NetworkPolicy will not be allowed ingress or egress traffic.
SCTP support
FEATURE STATE: Kubernetes v1.20 [stable]
As a stable feature, this is enabled by default. To disable
SCTP at a cluster level, you (or your cluster administrator) will need to disable the SCTPSupport
feature gate for the API server with --feature-gates=SCTPSupport=false,…
. When the feature gate is enabled, you can set the protocol
field of a NetworkPolicy to SCTP
.
Targeting a range of ports
FEATURE STATE: Kubernetes v1.25 [stable]
When writing a NetworkPolicy, you can target a range of ports instead of a single port.
This is achievable with the usage of the endPort
field, as the following example:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: multi-port-egress
namespace: default
spec:
podSelector:
matchLabels:
role: db
policyTypes:
- Egress
egress:
- to:
- ipBlock:
cidr: 10.0.0.0/24
ports:
- protocol: TCP
port: 32000
endPort: 32768
The above rule allows any Pod with label role=db
on the namespace default
to communicate with any IP within the range 10.0.0.0/24
over TCP, provided that the target port is between the range 32000 and 32768.
The following restrictions apply when using this field:
- The
endPort
field must be equal to or greater than theport
field. endPort
can only be defined ifport
is also defined.- Both ports must be numeric.
Targeting a Namespace by its name
FEATURE STATE: Kubernetes 1.22 [stable]
The Kubernetes control plane sets an immutable label kubernetes.io/metadata.name
on all namespaces, provided that the NamespaceDefaultLabelName
feature gate is enabled. The
value of the label is the namespace name.
While NetworkPolicy cannot target a namespace by its name with some object field, you can use the standardized label to target a specific namespace.
What you can't do with network policies (at least, not yet)
As of Kubernetes 1.25, the following functionality does not exist in the NetworkPolicy API, but you might be able to implement workarounds using Operating System components (such as SELinux, OpenVSwitch, IPTables, and so on) or Layer 7 technologies (Ingress controllers, Service Mesh implementations) or admission controllers. In case you are new to network security in Kubernetes, its worth noting that the following User Stories cannot (yet) be implemented using the NetworkPolicy API.
- Forcing internal cluster traffic to go through a common gateway (this might be best served with a service mesh or other proxy).
- Anything TLS related (use a service mesh or ingress controller for this).
- Node specific policies (you can use CIDR notation for these, but you cannot target nodes by their Kubernetes identities specifically).
- Targeting of services by name (you can, however, target pods or namespaces by their labels, which is often a viable workaround).
- Creation or management of "Policy requests" that are fulfilled by a third party.
- Default policies which are applied to all namespaces or pods (there are some third party Kubernetes distributions and projects which can do this).
- Advanced policy querying and reachability tooling.
- The ability to log network security events (for example connections that are blocked or accepted).
- The ability to explicitly deny policies (currently the model for NetworkPolicies are deny by default, with only the ability to add allow rules).
- The ability to prevent loopback or incoming host traffic (Pods cannot currently block localhost access, nor do they have the ability to block access from their resident node).
What's next
- See the Declare Network Policy walkthrough for further examples.
- See more recipes for common scenarios enabled by the NetworkPolicy resource.
8 - DNS for Services and Pods
Your workload can discover Services within your cluster using DNS; this page explains how that works.
Kubernetes creates DNS records for Services and Pods. You can contact Services with consistent DNS names instead of IP addresses.
Kubernetes DNS schedules a DNS Pod and Service on the cluster, and configures the kubelets to tell individual containers to use the DNS Service's IP to resolve DNS names.
Every Service defined in the cluster (including the DNS server itself) is assigned a DNS name. By default, a client Pod's DNS search list includes the Pod's own namespace and the cluster's default domain.
Namespaces of Services
A DNS query may return different results based on the namespace of the Pod making it. DNS queries that don't specify a namespace are limited to the Pod's namespace. Access Services in other namespaces by specifying it in the DNS query.
For example, consider a Pod in a test
namespace. A data
Service is in the prod
namespace.
A query for
data
returns no results, because it uses the Pod's test
namespace.
A query for data.prod
returns the intended result, because it specifies the namespace.
DNS queries may be expanded using the Pod's /etc/resolv.conf
. Kubelet sets this file for each Pod. For example, a query for just data
may be expanded to data.test.svc.cluster.local
. The values of the search
option are used to expand queries. To learn more about DNS queries, see
the resolv.conf
manual page.
nameserver 10.32.0.10
search .svc.cluster.local svc.cluster.local cluster.local
options ndots:5
In summary, a Pod in the test namespace can successfully resolve either data.prod
or data.prod.svc.cluster.local
.
DNS Records
What objects get DNS records?
- Services
- Pods
The following sections detail the supported DNS record types and layout that is supported. Any other layout or names or queries that happen to work are considered implementation details and are subject to change without warning. For more up-to-date specification, see Kubernetes DNS-Based Service Discovery.
Services
A/AAAA records
"Normal" (not headless) Services are assigned a DNS A or AAAA record, depending on the IP family of the Service, for a name of the form my-svc.my-namespace.svc.cluster-domain.example
. This resolves to
the cluster IP of the Service.
"Headless" (without a cluster IP) Services are also assigned a DNS A or AAAA record, depending on the IP family of the Service, for a name of the form my-svc.my-namespace.svc.cluster-domain.example
. Unlike normal Services, this resolves to the set of IPs of the Pods selected by the Service. Clients are expected to consume the set or else use standard round-robin selection from the set.
SRV records
SRV Records are created for named ports that are part of normal or
Headless Services. For each named port, the SRV record would have the form _my-port-name._my-port-protocol.my-svc.my-namespace.svc.cluster-domain.example
. For a regular Service, this resolves to the port number and the domain name: my-svc.my-namespace.svc.cluster-domain.example
. For a headless Service, this resolves to multiple answers, one for each Pod that is backing the Service, and contains the port number and the domain name of the Pod of the form auto-generated-name.my-svc.my-namespace.svc.cluster-domain.example
.
Pods
A/AAAA records
In general a Pod has the following DNS resolution:
pod-ip-address.my-namespace.pod.cluster-domain.example
.
For example, if a Pod in the default
namespace has the IP address 172.17.0.3, and the domain name for your cluster is cluster.local
, then the Pod has a DNS name:
172-17-0-3.default.pod.cluster.local
.
Any Pods exposed by a Service have the following DNS resolution available:
pod-ip-address.service-name.my-namespace.svc.cluster-domain.example
.
Pod's hostname and subdomain fields
Currently when a
Pod is created, its hostname is the Pod's metadata.name
value.
The Pod spec has an optional hostname
field, which can be used to specify the Pod's hostname. When specified, it takes precedence over the Pod's name to be the hostname of the Pod. For example, given a Pod with hostname
set to "my-host
", the Pod will have its hostname set to "my-host
".
The Pod spec also has an optional subdomain
field which can be used to specify its subdomain. For example, a Pod with hostname
set to
"foo
", and subdomain
set to "bar
", in namespace "my-namespace
", will have the fully qualified domain name (FQDN) "foo.bar.my-namespace.svc.cluster-domain.example
".
Example:
apiVersion: v1
kind: Service
metadata:
name: default-subdomain
spec:
selector:
name: busybox
clusterIP: None
ports:
- name: foo # Actually, no port is needed.
port: 1234
targetPort: 1234
---
apiVersion: v1
kind: Pod
metadata:
name: busybox1
labels:
name: busybox
spec:
hostname: busybox-1
subdomain: default-subdomain
containers:
- image: busybox:1.28
command:
- sleep
- "3600"
name: busybox
---
apiVersion: v1
kind: Pod
metadata:
name: busybox2
labels:
name: busybox
spec:
hostname: busybox-2
subdomain: default-subdomain
containers:
- image: busybox:1.28
command:
- sleep
- "3600"
name: busybox
If there exists a headless Service in the same namespace as the Pod and with the same name as the subdomain, the cluster's DNS Server also returns an A or AAAA record for the Pod's fully qualified hostname. For example, given a Pod with the hostname set to "busybox-1
" and the subdomain set to "default-subdomain
", and a headless
Service named "default-subdomain
" in the same namespace, the Pod will see its own FQDN as "busybox-1.default-subdomain.my-namespace.svc.cluster-domain.example
". DNS serves an A or AAAA record at that name, pointing to the Pod's IP. Both Pods "busybox1
" and "busybox2
" can have their distinct A or AAAA records.
An EndpointSlice can specify the DNS hostname for any endpoint addresses, along with its IP.
Pod's setHostnameAsFQDN field
FEATURE STATE: Kubernetes v1.22 [stable]
When a Pod is configured to have fully
qualified domain name (FQDN), its hostname is the short hostname. For example, if you have a Pod with the fully qualified domain name busybox-1.default-subdomain.my-namespace.svc.cluster-domain.example
, then by default the hostname
command inside that Pod returns busybox-1
and the hostname --fqdn
command returns the FQDN.
When you set setHostnameAsFQDN: true
in the Pod spec, the kubelet writes the Pod's FQDN into the hostname for that Pod's namespace. In this case, both hostname
and hostname --fqdn
return the Pod's FQDN.
Pod's DNS Policy
DNS
policies can be set on a per-Pod basis. Currently Kubernetes supports the following Pod-specific DNS policies. These policies are specified in the dnsPolicy
field of a Pod Spec.
- "
Default
": The Pod inherits the name resolution configuration from the node that the Pods run on. See related discussion for more details. - "
ClusterFirst
": Any DNS query that does not match the configured cluster domain suffix, such as "www.kubernetes.io
", is forwarded to the upstream nameserver inherited from the node. Cluster administrators may have extra stub-domain and upstream DNS servers configured. See related discussion for details on how DNS queries are handled in those cases. - "
ClusterFirstWithHostNet
": For Pods running with hostNetwork, you should explicitly set its DNS policy "ClusterFirstWithHostNet
".- Note: This is not supported on Windows. See below for details
- "
None
": It allows a Pod to ignore DNS settings from the Kubernetes environment. All DNS settings are supposed to be provided using thednsConfig
field in the Pod Spec. See Pod's DNS config subsection below.
The example below shows a Pod with its DNS policy set to "ClusterFirstWithHostNet
" because it has hostNetwork
set to true
.
apiVersion: v1
kind: Pod
metadata:
name: busybox
namespace: default
spec:
containers:
- image: busybox:1.28
command:
- sleep
- "3600"
imagePullPolicy: IfNotPresent
name: busybox
restartPolicy: Always
hostNetwork: true
dnsPolicy: ClusterFirstWithHostNet
Pod's DNS Config
FEATURE STATE: Kubernetes v1.14 [stable]
Pod's DNS Config allows users more control on the DNS settings for a Pod.
The dnsConfig
field is optional and it can work with any
dnsPolicy
settings. However, when a Pod's dnsPolicy
is set to "None
", the dnsConfig
field has to be specified.
Below are the properties a user can specify in the dnsConfig
field:
nameservers
: a list of IP addresses that will be used as DNS servers for the Pod. There can be at most 3 IP addresses specified. When the Pod'sdnsPolicy
is set to "None
", the list must contain at least one IP address, otherwise this property is optional. The servers listed will be combined to the base nameservers generated from the specified DNS policy with duplicate addresses removed.searches
: a list of DNS search domains for hostname lookup in the Pod. This property is optional. When specified, the provided list will be merged into the base search domain names generated from the chosen DNS policy. Duplicate domain names are removed. Kubernetes allows for at most 6 search domains.options
: an optional list of objects where each object may have aname
property (required) and avalue
property (optional). The contents in this property will be merged to the options generated from the specified DNS policy. Duplicate entries are removed.
The following is an example Pod with custom DNS settings:
apiVersion: v1
kind: Pod
metadata:
namespace: default
name: dns-example
spec:
containers:
- name: test
image: nginx
dnsPolicy: "None"
dnsConfig:
nameservers:
- 1.2.3.4
searches:
- ns1.svc.cluster-domain.example
- my.dns.search.suffix
options:
- name: ndots
value: "2"
- name: edns0
When the Pod above is created, the container test
gets the following contents in its /etc/resolv.conf
file:
nameserver 1.2.3.4
search ns1.svc.cluster-domain.example my.dns.search.suffix
options ndots:2 edns0
For IPv6 setup, search path and name server should be set up like this:
kubectl exec -it dns-example -- cat /etc/resolv.conf
The output is similar to this:
nameserver 2001:db8:30::a
search default.svc.cluster-domain.example svc.cluster-domain.example cluster-domain.example
options ndots:5
Expanded DNS Configuration
FEATURE STATE: Kubernetes 1.22 [alpha]
By default, for Pod's DNS Config, Kubernetes allows at most 6 search domains and a list of search domains of up to 256 characters.
If the feature gate ExpandedDNSConfig
is enabled for the kube-apiserver and the kubelet, it is allowed for Kubernetes to have at most 32 search domains and a list of search domains of up to 2048 characters.
DNS resolution on Windows nodes
- ClusterFirstWithHostNet is not supported for Pods that run on Windows nodes. Windows treats all names with a
.
as a FQDN and skips FQDN resolution. - On Windows, there are multiple DNS resolvers that can be used. As these come with slightly different behaviors, using the
Resolve-DNSName
powershell cmdlet for name query resolutions is recommended. - On Linux, you have a DNS suffix list, which is used after resolution of a name as fully qualified has failed. On Windows, you can only have 1 DNS suffix, which is the DNS suffix associated with that Pod's namespace (example:
mydns.svc.cluster.local
). Windows can resolve FQDNs, Services, or network name which can be resolved with this single suffix. For example, a Pod spawned in thedefault
namespace, will have the DNS suffixdefault.svc.cluster.local
. Inside a Windows Pod, you can resolve bothkubernetes.default.svc.cluster.local
andkubernetes
, but not the partially qualified names (kubernetes.default
orkubernetes.default.svc
).
What's next
For guidance on administering DNS configurations, check Configure DNS Service
9 - IPv4/IPv6 dual-stack
Kubernetes lets you configure single-stack IPv4 networking, single-stack IPv6 networking, or dual stack networking with both network families active. This page explains how.
FEATURE STATE: Kubernetes v1.23 [stable]
IPv4/IPv6 dual-stack networking enables the allocation of both IPv4 and IPv6 addresses to Pods and Services.
IPv4/IPv6 dual-stack networking is enabled by default for your Kubernetes cluster starting in 1.21, allowing the simultaneous assignment of both IPv4 and IPv6 addresses.
Supported Features
IPv4/IPv6 dual-stack on your Kubernetes cluster provides the following features:
- Dual-stack Pod networking (a single IPv4 and IPv6 address assignment per Pod)
- IPv4 and IPv6 enabled Services
- Pod off-cluster egress routing (eg. the Internet) via both IPv4 and IPv6 interfaces
Prerequisites
The following prerequisites are needed in order to utilize IPv4/IPv6 dual-stack Kubernetes clusters:
Kubernetes 1.20 or later
For information about using dual-stack services with earlier Kubernetes versions, refer to the documentation for that version of Kubernetes.
Provider support for dual-stack networking (Cloud provider or otherwise must be able to provide Kubernetes nodes with routable IPv4/IPv6 network interfaces)
A network plugin that supports dual-stack networking.
Configure IPv4/IPv6 dual-stack
To configure IPv4/IPv6 dual-stack, set dual-stack cluster network assignments:
- kube-apiserver:
--service-cluster-ip-range=
,
- kube-controller-manager:
--cluster-cidr=
, --service-cluster-ip-range=
, --node-cidr-mask-size-ipv4|--node-cidr-mask-size-ipv6
defaults to /24 for IPv4 and /64 for IPv6
- kube-proxy:
--cluster-cidr=
,
- kubelet:
- when there is no
--cloud-provider
the administrator can pass a comma-separated pair of IP addresses via--node-ip
to manually configure dual-stack.status.addresses
for that Node. If a Pod runs on that node in HostNetwork mode, the Pod reports these IP addresses in its.status.podIPs
field. AllpodIPs
in a node match the IP family preference defined by the.status.addresses
field for that Node.
- when there is no
Services
You can create Services which can use IPv4, IPv6, or both.
The address family of a Service defaults to the address family of the first service cluster IP range (configured via the --service-cluster-ip-range
flag to the kube-apiserver).
When you define a Service you can
optionally configure it as dual stack. To specify the behavior you want, you set the .spec.ipFamilyPolicy
field to one of the following values:
SingleStack
: Single-stack service. The control plane allocates a cluster IP for the Service, using the first configured service cluster IP range.PreferDualStack
:- Allocates IPv4 and IPv6 cluster IPs for the Service.
RequireDualStack
: Allocates Service.spec.ClusterIPs
from both IPv4 and IPv6 address ranges.- Selects the
.spec.ClusterIP
from the list of.spec.ClusterIPs
based on the address family of the first element in the.spec.ipFamilies
array.
- Selects the
If you would like to define which IP family to use for single stack or define the order of IP families for dual-stack, you can choose the address families by setting an optional field, .spec.ipFamilies
, on the Service.
You can set .spec.ipFamilies
to any of the following array values:
["IPv4"]
["IPv6"]
["IPv4","IPv6"]
(dual stack)["IPv6","IPv4"]
(dual stack)
The first
family you list is used for the legacy .spec.ClusterIP
field.
Dual-stack Service configuration scenarios
These examples demonstrate the behavior of various dual-stack Service configuration scenarios.
Dual-stack options on new Services
This Service specification does not explicitly define
.spec.ipFamilyPolicy
. When you create this Service, Kubernetes assigns a cluster IP for the Service from the first configuredservice-cluster-ip-range
and sets the.spec.ipFamilyPolicy
toSingleStack
. (Services without selectors and headless Services with selectors will behave in this same way.)apiVersion: v1 kind: Service metadata: name: my-service labels: app.kubernetes.io/name: MyApp spec: selector: app.kubernetes.io/name: MyApp ports: - protocol: TCP port: 80
This Service specification explicitly defines
PreferDualStack
in.spec.ipFamilyPolicy
. When you create this Service on a dual-stack cluster, Kubernetes assigns both IPv4 and IPv6 addresses for the service. The control plane updates the.spec
for the Service to record the IP address assignments. The field.spec.ClusterIPs
is the primary field, and contains both assigned IP addresses;.spec.ClusterIP
is a secondary field with its value calculated from.spec.ClusterIPs
.- For the
.spec.ClusterIP
field, the control plane records the IP address that is from the same address family as the first service cluster IP range. - On a single-stack cluster, the
.spec.ClusterIPs
and.spec.ClusterIP
fields both only list one address. - On a cluster with dual-stack enabled, specifying
RequireDualStack
in.spec.ipFamilyPolicy
behaves the same asPreferDualStack
.
apiVersion: v1 kind: Service metadata: name: my-service labels: app.kubernetes.io/name: MyApp spec: ipFamilyPolicy: PreferDualStack selector: app.kubernetes.io/name: MyApp ports: - protocol: TCP port: 80
- For the
This Service specification explicitly defines
IPv6
andIPv4
in.spec.ipFamilies
as well as definingPreferDualStack
in.spec.ipFamilyPolicy
. When Kubernetes assigns an IPv6 and IPv4 address in.spec.ClusterIPs
,.spec.ClusterIP
is set to the IPv6 address because that is the first element in the.spec.ClusterIPs
array, overriding the default.apiVersion: v1 kind: Service metadata: name: my-service labels: app.kubernetes.io/name: MyApp spec: ipFamilyPolicy: PreferDualStack ipFamilies: - IPv6 - IPv4 selector: app.kubernetes.io/name: MyApp ports: - protocol: TCP port: 80
Dual-stack defaults on existing Services
These examples demonstrate the default behavior when dual-stack is newly enabled on a cluster where Services already exist. (Upgrading an existing cluster to 1.21 or beyond will enable dual-stack.)
When dual-stack is enabled on a cluster, existing Services (whether
IPv4
orIPv6
) are configured by the control plane to set.spec.ipFamilyPolicy
toSingleStack
and set.spec.ipFamilies
to the address family of the existing Service. The existing Service cluster IP will be stored in.spec.ClusterIPs
.apiVersion: v1 kind: Service metadata: name: my-service labels: app.kubernetes.io/name: MyApp spec: selector: app.kubernetes.io/name: MyApp ports: - protocol: TCP port: 80
You can validate this behavior by using kubectl to inspect an existing service.
kubectl get svc my-service -o yaml
apiVersion: v1 kind: Service metadata: labels: app.kubernetes.io/name: MyApp name: my-service spec: clusterIP: 10.0.197.123 clusterIPs: - 10.0.197.123 ipFamilies: - IPv4 ipFamilyPolicy: SingleStack ports: - port: 80 protocol: TCP targetPort: 80 selector: app.kubernetes.io/name: MyApp type: ClusterIP status: loadBalancer: {}
When dual-stack is enabled on a cluster, existing headless Services with selectors are configured by the control plane to set
.spec.ipFamilyPolicy
toSingleStack
and set.spec.ipFamilies
to the address family of the first service cluster IP range (configured via the--service-cluster-ip-range
flag to the kube-apiserver) even though.spec.ClusterIP
is set toNone
.apiVersion: v1 kind: Service metadata: name: my-service labels: app.kubernetes.io/name: MyApp spec: selector: app.kubernetes.io/name: MyApp ports: - protocol: TCP port: 80
You can validate this behavior by using kubectl to inspect an existing headless service with selectors.
kubectl get svc my-service -o yaml
apiVersion: v1 kind: Service metadata: labels: app.kubernetes.io/name: MyApp name: my-service spec: clusterIP: None clusterIPs: - None ipFamilies: - IPv4 ipFamilyPolicy: SingleStack ports: - port: 80 protocol: TCP targetPort: 80 selector: app.kubernetes.io/name: MyApp
Switching Services between single-stack and dual-stack
Services can be changed from single-stack to dual-stack and from dual-stack to single-stack.
To change a Service from single-stack to dual-stack, change
.spec.ipFamilyPolicy
fromSingleStack
toPreferDualStack
orRequireDualStack
as desired. When you change this Service from single-stack to dual-stack, Kubernetes assigns the missing address family so that the Service now has IPv4 and IPv6 addresses.Edit the Service specification updating the
.spec.ipFamilyPolicy
fromSingleStack
toPreferDualStack
.Before:
spec: ipFamilyPolicy: SingleStack
After:
spec: ipFamilyPolicy: PreferDualStack
To change a Service from dual-stack to single-stack, change
.spec.ipFamilyPolicy
fromPreferDualStack
orRequireDualStack
toSingleStack
. When you change this Service from dual-stack to single-stack, Kubernetes retains only the first element in the.spec.ClusterIPs
array, and sets.spec.ClusterIP
to that IP address and sets.spec.ipFamilies
to the address family of.spec.ClusterIPs
.
Headless Services without selector
For Headless Services
without selectors and without .spec.ipFamilyPolicy
explicitly set, the .spec.ipFamilyPolicy
field defaults to RequireDualStack
.
Service type LoadBalancer
To provision a dual-stack load balancer for your Service:
- Set the
.spec.type
field toLoadBalancer
- Set
.spec.ipFamilyPolicy
field toPreferDualStack
orRequireDualStack
Egress traffic
If you want to enable egress traffic in order to reach off-cluster destinations (eg. the public Internet) from a Pod that uses non-publicly routable IPv6 addresses, you need to enable the Pod to use a publicly routed IPv6 address via a mechanism such as transparent proxying or IP masquerading. The ip-masq-agent project supports IP masquerading on dual-stack clusters.
Windows support
Kubernetes on Windows does not support single-stack "IPv6-only" networking. However, dual-stack IPv4/IPv6 networking for pods and nodes with single-family services is supported.
You can use IPv4/IPv6 dual-stack networking with l2bridge
networks.
You can read more about the different network modes for Windows within the Networking on Windows topic.
What's next
- Validate IPv4/IPv6 dual-stack networking
- Enable dual-stack networking using kubeadm
10 - Topology Aware Hints
Topology Aware Hints provides a mechanism to help keep network traffic within the zone where it originated. Preferring same-zone traffic between Pods in your cluster can help with reliability, performance (network latency and throughput), or cost.
FEATURE STATE: Kubernetes v1.23 [beta]
Topology Aware Hints enable topology aware routing by including suggestions for how clients should consume endpoints. This approach adds metadata to enable consumers of EndpointSlice (or Endpoints) objects, so that traffic to those network endpoints can be routed closer to where it originated.
For example, you can route traffic within a locality to reduce costs, or to improve network performance.
Motivation
Kubernetes clusters are increasingly deployed in multi-zone environments. Topology Aware Hints provides a mechanism to help keep traffic within the zone it originated from. This concept is commonly referred to as "Topology Aware Routing". When calculating the endpoints for a Service, the EndpointSlice controller considers the topology (region and zone) of each endpoint and populates the hints field to allocate it to a zone. Cluster components such as the kube-proxy can then consume those hints, and use them to influence how the traffic is routed (favoring topologically closer endpoints).
Using Topology Aware Hints
You can activate Topology Aware Hints for a Service by setting the service.kubernetes.io/topology-aware-hints
annotation to auto
. This
tells the EndpointSlice controller to set topology hints if it is deemed safe. Importantly, this does not guarantee that hints will always be set.
How it works
The functionality enabling this feature is split into two components: The EndpointSlice controller and the kube-proxy. This section provides a high level overview of how each component implements this feature.
EndpointSlice controller
The EndpointSlice controller is responsible for setting hints on EndpointSlices when this feature is enabled. The controller allocates a proportional amount of endpoints to each zone. This proportion is based on the allocatable CPU cores for nodes running in that zone. For example, if one zone had 2 CPU cores and another zone only had 1 CPU core, the controller would allocate twice as many endpoints to the zone with 2 CPU cores.
The following example shows what an EndpointSlice looks like when hints have been populated:
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
name: example-hints
labels:
kubernetes.io/service-name: example-svc
addressType: IPv4
ports:
- name: http
protocol: TCP
port: 80
endpoints:
- addresses:
- "10.1.2.3"
conditions:
ready: true
hostname: pod-1
zone: zone-a
hints:
forZones:
- name: "zone-a"
kube-proxy
The kube-proxy component filters the endpoints it routes to based on the hints set by the EndpointSlice controller. In most cases, this means that the kube-proxy is able to route traffic to endpoints in the same zone. Sometimes the controller allocates endpoints from a different zone to ensure more even distribution of endpoints between zones. This would result in some traffic being routed to other zones.
Safeguards
The Kubernetes control plane and the kube-proxy on each node apply some safeguard rules before using Topology Aware Hints. If these don't check out, the kube-proxy selects endpoints from anywhere in your cluster, regardless of the zone.
Insufficient number of endpoints: If there are less endpoints than zones in a cluster, the controller will not assign any hints.
Impossible to achieve balanced allocation: In some cases, it will be impossible to achieve a balanced allocation of endpoints among zones. For example, if zone-a is twice as large as zone-b, but there are only 2 endpoints, an endpoint allocated to zone-a may receive twice as much traffic as zone-b. The controller does not assign hints if it can't get this "expected overload" value below an acceptable threshold for each zone. Importantly this is not based on real-time feedback. It is still possible for individual endpoints to become overloaded.
One or more Nodes has insufficient information: If any node does not have a
topology.kubernetes.io/zone
label or is not reporting a value for allocatable CPU, the control plane does not set any topology-aware endpoint hints and so kube-proxy does not filter endpoints by zone.One or more endpoints does not have a zone hint: When this happens, the kube-proxy assumes that a transition from or to Topology Aware Hints is underway. Filtering endpoints for a Service in this state would be dangerous so the kube-proxy falls back to using all endpoints.
A zone is not represented in hints: If the kube-proxy is unable to find at least one endpoint with a hint targeting the zone it is running in, it falls to using endpoints from all zones. This is most likely to happen as you add a new zone into your existing cluster.
Constraints
Topology Aware Hints are not used when either
externalTrafficPolicy
orinternalTrafficPolicy
is set toLocal
on a Service. It is possible to use both features in the same cluster on different Services, just not on the same Service.This approach will not work well for Services that have a large proportion of traffic originating from a subset of zones. Instead this assumes that incoming traffic will be roughly proportional to the capacity of the Nodes in each zone.
The EndpointSlice controller ignores unready nodes as it calculates the proportions of each zone. This could have unintended consequences if a large portion of nodes are unready.
The EndpointSlice controller does not take into account tolerations when deploying or calculating the proportions of each zone. If the Pods backing a Service are limited to a subset of Nodes in the cluster, this will not be taken into account.
This may not work well with autoscaling. For example, if a lot of traffic is originating from a single zone, only the endpoints allocated to that zone will be handling that traffic. That could result in Horizontal Pod Autoscaler either not picking up on this event, or newly added pods starting in a different zone.
What's next
- Read Connecting Applications with Services
11 - Networking on Windows
Kubernetes supports running nodes on either Linux or Windows. You can mix both kinds of node within a single cluster. This page provides an overview to networking specific to the Windows operating system.
Container networking on Windows
Networking for Windows containers is exposed through CNI plugins. Windows containers function similarly to virtual machines in regards to networking. Each container has a virtual network adapter (vNIC) which is connected to a Hyper-V virtual switch (vSwitch). The Host Networking Service (HNS) and the Host Compute Service (HCS) work together to create containers and attach container vNICs to networks. HCS is responsible for the management of containers whereas HNS is responsible for the management of networking resources such as:
- Virtual networks (including creation of vSwitches)
- Endpoints / vNICs
- Namespaces
- Policies including packet encapsulations, load-balancing rules, ACLs, and NAT rules.
The Windows HNS and vSwitch implement namespacing and can create virtual NICs as needed for a pod
or container. However, many configurations such as DNS, routes, and metrics are stored in the Windows registry database rather than as files inside /etc
, which is how Linux stores those configurations. The Windows registry for the container is separate from that of the host, so concepts like mapping /etc/resolv.conf
from the host into a container don't have the same effect they would on Linux. These must be configured using Windows APIs run in the context of that container. Therefore CNI
implementations need to call the HNS instead of relying on file mappings to pass network details into the pod or container.
Network modes
Windows supports five different networking drivers/modes: L2bridge, L2tunnel, Overlay (Beta), Transparent, and NAT. In a heterogeneous cluster with Windows and Linux worker nodes, you need to select a networking solution that is compatible on both Windows and Linux. The following table lists the out-of-tree plugins are supported on Windows, with recommendations on when to use each CNI:
Network Driver | Description | Container Packet Modifications | Network Plugins | Network Plugin Characteristics |
---|---|---|---|---|
L2bridge | Containers are attached to an external vSwitch. Containers are attached to the underlay network, although the physical network doesn't need to learn the container MACs because they are rewritten on ingress/egress. | MAC is rewritten to host MAC, IP may be rewritten to host IP using HNS OutboundNAT policy. | win-bridge, Azure-CNI, Flannel host-gateway uses win-bridge | win-bridge uses L2bridge network mode, connects containers to the underlay of hosts, offering best performance. Requires user-defined routes (UDR) for inter-node connectivity. |
L2Tunnel | This is a special case of l2bridge, but only used on Azure. All packets are sent to the virtualization host where SDN policy is applied. | MAC rewritten, IP visible on the underlay network | Azure-CNI | Azure-CNI allows integration of containers with Azure vNET, and allows them to leverage the set of capabilities that Azure Virtual Network provides. For example, securely connect to Azure services or use Azure NSGs. See azure-cni for some examples |
Overlay | Containers are given a vNIC connected to an external vSwitch. Each overlay network gets its own IP subnet, defined by a custom IP prefix.The overlay network driver uses VXLAN encapsulation. | Encapsulated with an outer header. | win-overlay, Flannel VXLAN (uses win-overlay) | win-overlay should be used when virtual container networks are desired to be isolated from underlay of hosts (e.g. for security reasons). Allows for IPs to be re-used for different overlay networks (which have different VNID tags) if you are restricted on IPs in your datacenter. This option requires KB4489899 on Windows Server 2019. |
Transparent (special use case for ovn-kubernetes) | Requires an external vSwitch. Containers are attached to an external vSwitch which enables intra-pod communication via logical networks (logical switches and routers). | Packet is encapsulated either via GENEVE or STT tunneling to reach pods which are not on the same host. Packets are forwarded or dropped via the tunnel metadata information supplied by the ovn network controller. NAT is done for north-south communication. | ovn-kubernetes | Deploy via ansible. Distributed ACLs can be applied via Kubernetes policies. IPAM support. Load-balancing can be achieved without kube-proxy. NATing is done without using iptables/netsh. |
NAT (not used in Kubernetes) | Containers are given a vNIC connected to an internal vSwitch. DNS/DHCP is provided using an internal component called WinNAT | MAC and IP is rewritten to host MAC/IP. | nat | Included here for completeness |
As outlined above, the Flannel CNI plugin is also supported on Windows via the VXLAN network backend (Beta support ; delegates to win-overlay) and host-gateway network backend (stable support; delegates to win-bridge).
This plugin supports delegating to one of the reference CNI plugins (win-overlay, win-bridge), to work in conjunction with Flannel daemon on Windows (Flanneld) for automatic node subnet lease assignment and HNS network creation. This plugin reads in its own
configuration file (cni.conf), and aggregates it with the environment variables from the FlannelD generated subnet.env file. It then delegates to one of the reference CNI plugins for network plumbing, and sends the correct configuration containing the node-assigned subnet to the IPAM plugin (for example: host-local
).
For Node, Pod, and Service objects, the following network flows are supported for TCP/UDP traffic:
- Pod → Pod (IP)
- Pod → Pod (Name)
- Pod → Service (Cluster IP)
- Pod → Service (PQDN, but only if there are no ".")
- Pod → Service (FQDN)
- Pod → external (IP)
- Pod → external (DNS)
- Node → Pod
- Pod → Node
IP address management (IPAM)
The following IPAM options are supported on Windows:
- host-local
- azure-vnet-ipam (for azure-cni only)
- Windows Server IPAM (fallback option if no IPAM is set)
Load balancing and Services
A Kubernetes Service is an abstraction that defines a logical set of Pods and a means to access them over a network. In a cluster that includes Windows nodes, you can use the following types of Service:
NodePort
ClusterIP
LoadBalancer
ExternalName
Windows container networking differs in some important ways from Linux networking. The Microsoft documentation for Windows Container Networking provides additional details and background.
On Windows, you can use the following settings to configure Services and load balancing behavior:
Feature | Description | Minimum Supported Windows OS build | How to enable |
---|---|---|---|
Session affinity | Ensures that connections from a particular client are passed to the same Pod each time. | Windows Server 2022 | Set service.spec.sessionAffinity to "ClientIP"
|
Direct Server Return (DSR) | Load balancing mode where the IP address fixups and the LBNAT occurs at the container vSwitch port directly; service traffic arrives with the source IP set as the originating pod IP. | Windows Server 2019 | Set the following flags in kube-proxy: --feature-gates="WinDSR=true" --enable-dsr=true
|
Preserve-Destination | Skips DNAT of service traffic, thereby preserving the virtual IP of the target service in packets reaching the backend Pod. Also disables node-node forwarding. | Windows Server, version 1903 | Set "preserve-destination": "true" in service annotations and enable DSR in kube-proxy.
|
IPv4/IPv6 dual-stack networking | Native IPv4-to-IPv4 in parallel with IPv6-to-IPv6 communications to, from, and within a cluster | Windows Server 2019 | See IPv4/IPv6 dual-stack |
Client IP preservation | Ensures that source IP of incoming ingress traffic gets preserved. Also disables node-node forwarding. | Windows Server 2019 | Set service.spec.externalTrafficPolicy to "Local" and enable DSR in kube-proxy
|
Limitations
The following networking functionality is not supported on Windows nodes:
- Host networking mode
- Local NodePort access from the node itself (works for other nodes or external clients)
- More than 64 backend pods (or unique destination addresses) for a single Service
- IPv6 communication between Windows pods connected to overlay networks
- Local Traffic Policy in non-DSR mode
- Outbound
communication using the ICMP protocol via the
win-overlay
,win-bridge
, or using the Azure-CNI plugin.
Specifically, the Windows data plane (VFP) doesn't support ICMP packet transpositions, and this means:- ICMP packets directed to destinations within the same network (such as pod to pod communication via ping) work as expected;
- TCP/UDP packets work as expected;
- ICMP packets directed to pass through a remote network (e.g. pod to external internet communication via ping) cannot be transposed and thus will not be routed back to their source;
- Since TCP/UDP packets can still be transposed, you can substitute
ping
withcurl
when debugging connectivity with the outside world.
Other limitations:
- Windows reference network plugins win-bridge and win-overlay do not implement
CNI spec v0.4.0, due to a missing
CHECK
implementation. - The Flannel VXLAN CNI plugin has the following limitations on Windows:
- Node-pod connectivity is only possible for local pods with Flannel v0.12.0 (or higher).
- Flannel is restricted to using VNI 4096 and UDP port 4789. See the official Flannel VXLAN backend docs for more details on these parameters.
12 - Service Internal Traffic Policy
If two Pods in your cluster want to communicate, and both Pods are actually running on the same node, use Service Internal Traffic Policy to keep network traffic within that node. Avoiding a round trip via the cluster network can help with reliability, performance (network latency and throughput), or cost.
FEATURE STATE: Kubernetes v1.23 [beta]
Service Internal Traffic Policy enables internal traffic restrictions to only route internal traffic to endpoints within the node the traffic originated from. The "internal" traffic here refers to traffic originated from Pods in the current cluster. This can help to reduce costs and improve performance.
Using Service Internal Traffic Policy
The ServiceInternalTrafficPolicy
feature gate is a Beta feature and enabled by default. When the feature is enabled, you can enable the internal-only traffic policy for a
Service, by setting its .spec.internalTrafficPolicy
to Local
. This tells kube-proxy to only use node local endpoints for cluster internal traffic.
The following example shows what a Service looks like when you set .spec.internalTrafficPolicy
to Local
:
apiVersion: v1
kind: Service
metadata:
name: my-service
spec:
selector:
app.kubernetes.io/name: MyApp
ports:
- protocol: TCP
port: 80
targetPort: 9376
internalTrafficPolicy: Local
How it works
The kube-proxy filters the endpoints it routes to based on the spec.internalTrafficPolicy
setting. When it's set to Local
, only node local endpoints are considered. When it's Cluster
or missing, all endpoints are considered. When the feature gate ServiceInternalTrafficPolicy
is enabled, spec.internalTrafficPolicy
defaults to "Cluster".
What's next
- Read about Topology Aware Hints
- Read about Service External Traffic Policy
- Read Connecting Applications with Services