🧠 Second Brain

Kubernetes

Last updated Feb 9, 2024

It’s a platform that allows you to run and orchestrate container workloads. Kubernetes has become the de-facto standard for your cloud-native apps to (auto-) scale-out and deploys your open-source zoo fast, cloud-provider-independent. No lock-in here. You could use open-shift or OKD. With the latest version, they added the OperatorHub which you can install as of today 182 items with just a few clicks. Also, check out Managed Data Stacks which were created to mitigate exactly that.

Some more reasons for Kubernetes are the move from infrastructure as code towards infrastructure as data, specifically as YAML. All the resources in Kubernetes that include Pods, Configurations, Deployments, Volumes, etc., can simply be expressed in a YAML file. Developers quickly write applications that run across multiple operating environments. Costs can be reduced by scaling down (even to zero with, e.g. [Knative][63]) and also by using plain python or other programming languages instead of paying for a service on Azure, AWS, or Google Cloud. Its management makes it easy through its modularity and abstraction, also with the use of Containers (Docker or [Rocket][65]), you can monitor all your applications in one place.

To get hands-on with Kubernetes you can install Docker Desktop with Kubernetes included. All of my examples are built on top of it and run on any cloud as well as locally. For a more sophisticated set-up in terms of Apache Spark, I suggest reading the blog post from Data Mechanics about Setting up, Managing & Monitoring Spark on Kubernetes. If you are more of a video guy, An introduction to Apache Spark on Kubernetes contains the same content but adds still even on top of it.

As said above, if setting up Kubernetes is too hard, there are Managed Data Stacks, where you can choose existing open-source tools to pick from.

Security: Separation of Concerns as with different namespaces.

# Kubernetes Orchestration

Continuously working towards a desired state.

Everything is represented as a “Kubernetes Resources”
A Pod is the smallest “schedulable” resource (~= container)
A Manifest (YAML) defines the desired state of a resource
Kubernetes drives “reality” to the desired state
The current state is updated based on “reality”

# Kubernetes Architecture

etcd: defines and documents:
- current known state
- desired state

graph LR
  subgraph node
    kubelet["kubelet & kube-proxy"]
    containerd
    container
  end
  subgraph control_plane
    subgraph etcd
      kubernetes_resource
    end
    controllers
    kube-api
    scheduler[Default Scheduler]
  end
  subgraph yaml_file
    resource_configurations
  end
  resource_configurations --> kubectl
  kubectl --> kube-api
  controllers -->|adapts| kube-api
  scheduler -->|adapts| kube-api
  kube-api -->|informs| scheduler
  kube-api -->|informs| controllers
  kube-api -->|manages| kubernetes_resource["kubernetes resource:
- current known state
- desired state"]
  kube-api -->|informs| kubelet
  kubelet -->|updates state| kube-api
  kubelet -->|manages| containerd
  containerd --> container

# Workload Resources

graph TD
  subgraph Workload Resources
    deployment-->replicaset-->pod
    statefulset-->pod
    daemonset-->pod
    cronjob-->job-->pod
    pod[Pod]-.->container
    container[Container]
    style container stroke-dasharray: 5 5
  end

Pods - smallest schedulable unit ~= container
Deployment - declarative updates for Pods
- ReplicaSet - ensures a specified number of Pods
StatefulSet - manages stateful applications
DaemonSet - ensures a Pod on each node
CronJob - runs Jobs on a schedule
- Job - runs Pods to completion

# Deployment Patterns

# Containers deployments

When to use multiple container inside a deplyoment?

In Kubernetes, it’s common to run multiple containers within a single Pod when the containers are tightly coupled application components that need to operate together. It’s a anti-pattern to use multiple containers inside the same pod, except for below patterns such as Sidecar, Ambassaador, etc. Usually you would use a different pod deployment for a DB or a different important service.

Shared Storage: Containers in the same Pod share the same storage volumes. This can be beneficial for situations where one container writes to a shared volume and another reads from it.
Inter-process Communication: Since containers in the same Pod share the same network namespace, they can easily communicate with each other using localhost and share the same Port space.
Sidecar Pattern: A common use case is the sidecar pattern, where the main application might need an auxiliary helper that pushes logs or data elsewhere. For example, one container might serve a web application while a sidecar container pushes logs or data to an external source.
Adapter Pattern: You can use a second container to modify or adapt the data output of the main container in some way. For example, transforming output formats or adapting legacy systems to more modern requirements.
Ambassador Pattern: A container can proxy or shuttle network connections for the main container. This can be used for sharding or partitioning in distributed systems.

Init-Container is another container, but these are specified in a sepreate part of the deployment.

Here an example:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


...
  initContainers:
  - name: copy-airflow-dag-to-airflow-bd
    image: my-image:0.1.0-a.2
    command: ["/bin/sh","-c"]
    args: [
      "mkdir -p /storage/backup/dags-$(date +%Y%m%d-%H%M%S) && cp -a /storage/dags/. /storage/backup/dags-$(date +%Y%m%d-%H%M%S)/ && rm -rf /storage/dags/* && cp -a /opt/airflow/airflow_home/dags/. /storage/dags/"
    ]
    volumeMounts:
    - name: storage
      mountPath: /storage

# Services (Network)

Kubernetes provides several types of Services to expose your application inside or outside of a cluster. Let’s break them down:

ClusterIP: This is the default service type.
- Scope: Internal to the cluster.
- Purpose: Provides a single IP address and port pair which routes traffic to the underlying Pods.
- Use-case: When you want to expose your service only within the Kubernetes cluster, for example, a backend service that should not be exposed to external traffic.
NodePort: Exposes the service on each Node’s IP at a static port.
- Scope: External, using ``:` combination.
- Purpose: Allocates a port from a specified range (default: 30000-32767) on each node and forwards traffic on that port to the service.
- Use-case: Useful for development and debugging, but typically not used directly for production workloads exposed externally.
LoadBalancer: Provisions an external load balancer in a cloud provider’s infrastructure and directs external traffic to the Kubernetes service.
- Scope: External.
- Purpose: Integrates with cloud providers to automatically provision an external load balancer pointing to the NodePort and ClusterIP services.
- Use-case: When running Kubernetes in a cloud provider that supports automatic load balancer provisioning (like AWS, GCP, Azure), this is a straightforward way to expose services to external traffic.
ExternalName: Maps a service to a DNS name, rather than an IP.
- Scope: External.
- Purpose: Returns a CNAME record pointing to the specified external name.
- Use-case: Useful when you want to point a service to an external service outside the cluster without proxying traffic through Kubernetes.
Headless Service: Service without a ClusterIP.
- Scope: Internal to the cluster.
- Purpose: Allows direct pod-to-pod communication without a virtual IP in the middle.
- Use-case: Useful for stateful applications like databases where direct pod addressing is preferable.

Ingress: Ingress is not a service type, but a separate Kubernetes resource designed for HTTP and HTTPS routing to services.

Scope: External.
Purpose: Allows you to define HTTP and HTTPS routes, host-based routing, path-based routing, SSL/TLS termination, and other advanced routing features. Ingress requires an Ingress Controller (like nginx, traefik, or others) to function.
Use-case: When you want to expose multiple services under the same IP address with path- or host-based routing, and especially when you need SSL/TLS termination.

Decision Points:

If you need simple internal communication: Use ClusterIP.
For quick external exposure, especially during development: Use NodePort.
If you’re using a cloud provider that supports it and need simple external exposure: Use LoadBalancer.
To map a service to a DNS name: Use ExternalName.
For direct pod-to-pod communication: Use a Headless Service.
To expose HTTP/HTTPS applications with routing, SSL, etc.: Use Ingress.

As Kubernetes continues to evolve, there might be additional service types or routing mechanisms in the future. Always refer to the official Kubernetes documentation for the most up-to-date information.

References: YAML, DevOps engine – Kubernetes