
Final content proofreading

master
Jack Henschel 2 years ago
parent b6424380db
commit f9e038886c
  1. images/k8s_control_plane.svg (2 changes)
  2. images/wasmpa-architecture.svg (30 changes)
  3. include/01-introduction.md (13 changes)
  4. include/02-background.md (61 changes)
  5. include/03-research.md (190 changes)
  6. include/04-implementation.md (77 changes)
  7. include/05-evaluation.md (160 changes)
  8. include/06-conclusion.md (27 changes)
  9. include/appendices.md (36 changes)

@@ -914,7 +914,7 @@
x="76.082108"
y="94.042717"
style="font-style:italic;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:5.64444447px;line-height:0.75;font-family:'Roboto Condensed';-inkscape-font-specification:'Roboto Condensed, Italic';text-align:center;text-anchor:middle;fill:#666666;stroke-width:0.26458332"
-id="tspan14097-1-7-7">Authz, ...</tspan></text>
+id="tspan14097-1-7-7">Authz, Validation, ...</tspan></text>
<text
xml:space="preserve"
style="font-style:normal;font-weight:normal;font-size:8.46666622px;line-height:0.75;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.26458332"

Before: 61 KiB → After: 62 KiB

@@ -746,16 +746,16 @@
borderopacity="1.0"
inkscape:pageopacity="0.0"
inkscape:pageshadow="2"
-inkscape:zoom="0.98994949"
-inkscape:cx="440.09957"
-inkscape:cy="209.40509"
+inkscape:zoom="1.979899"
+inkscape:cx="472.2037"
+inkscape:cy="152.97741"
inkscape:document-units="mm"
inkscape:current-layer="layer1"
showgrid="true"
inkscape:connector-spacing="19"
inkscape:snap-bbox="true"
inkscape:snap-global="false"
-inkscape:window-width="1710"
+inkscape:window-width="3430"
inkscape:window-height="1410"
inkscape:window-x="1920"
inkscape:window-y="20"
@@ -774,7 +774,7 @@
<dc:format>image/svg+xml</dc:format>
<dc:type
rdf:resource="http://purl.org/dc/dcmitype/StillImage" />
-<dc:title></dc:title>
+<dc:title />
</cc:Work>
</rdf:RDF>
</metadata>
@@ -786,13 +786,13 @@
<text
xml:space="preserve"
style="font-style:normal;font-weight:normal;font-size:8.46666622px;line-height:0.75;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.26458332"
-x="158.24304"
-y="91.697769"
+x="158.52652"
+y="92.264732"
id="text14087"><tspan
sodipodi:role="line"
-x="158.24304"
-y="91.697769"
-style="font-style:italic;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:5.64444447px;line-height:0.75;font-family:'Roboto Condensed';-inkscape-font-specification:'Roboto Condensed, Italic';text-align:center;text-anchor:middle;fill:#666666;stroke-width:0.26458332"
+x="158.52652"
+y="92.264732"
+style="font-style:italic;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:6.3499999px;line-height:0.75;font-family:'Roboto Condensed';-inkscape-font-specification:'Roboto Condensed, Italic';text-align:center;text-anchor:middle;fill:#666666;stroke-width:0.26458332"
id="tspan14097">Metrics API</tspan></text>
<g
id="g1454"
@@ -909,7 +909,7 @@
sodipodi:role="line"
x="159.9819"
y="103.3988"
-style="font-style:italic;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:5.64444447px;line-height:0.75;font-family:'Roboto Condensed';-inkscape-font-specification:'Roboto Condensed, Italic';text-align:center;text-anchor:middle;fill:#666666;stroke-width:0.26458332"
+style="font-style:italic;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:6.35px;line-height:0.75;font-family:'Roboto Condensed';-inkscape-font-specification:'Roboto Condensed, Italic';text-align:center;text-anchor:middle;fill:#666666;stroke-width:0.26458332"
id="tspan14097-5">Configuration</tspan></text>
<text
xml:space="preserve"
@@ -920,7 +920,7 @@
sodipodi:role="line"
x="18.504961"
y="103.60992"
-style="font-style:italic;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:5.64444447px;line-height:0.75;font-family:'Roboto Condensed';-inkscape-font-specification:'Roboto Condensed, Italic';text-align:center;text-anchor:middle;fill:#666666;stroke-width:0.26458332"
+style="font-style:italic;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:6.35px;line-height:0.75;font-family:'Roboto Condensed';-inkscape-font-specification:'Roboto Condensed, Italic';text-align:center;text-anchor:middle;fill:#666666;stroke-width:0.26458332"
id="tspan14097-5-1">Scaling (Horizontal &amp; Vertical)</tspan></text>
<rect
style="fill:#316ce6;fill-opacity:1;stroke:#ffffff;stroke-width:0.384;stroke-linecap:butt;stroke-linejoin:round;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1"
@@ -1034,13 +1034,13 @@
<text
xml:space="preserve"
style="font-style:normal;font-weight:normal;font-size:8.46666622px;line-height:0.75;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.26458332"
-x="43.023254"
+x="41.605843"
y="85.16507"
id="text14087-4-6-3"><tspan
sodipodi:role="line"
-x="43.023254"
+x="41.605843"
y="85.16507"
-style="font-style:italic;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:5.64444447px;line-height:0.75;font-family:'Roboto Condensed';-inkscape-font-specification:'Roboto Condensed, Italic';text-align:center;text-anchor:middle;fill:#666666;stroke-width:0.26458332"
+style="font-style:italic;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:6.3499999px;line-height:0.75;font-family:'Roboto Condensed';-inkscape-font-specification:'Roboto Condensed, Italic';text-align:center;text-anchor:middle;fill:#666666;stroke-width:0.26458332"
id="tspan14097-5-1-6">Metrics</tspan></text>
</g>
</svg>

Before: 56 KiB → After: 56 KiB

@@ -12,16 +12,17 @@ It is a technique to automatically scale the application (and the services it is
In general, *scaling* refers to acquiring and releasing resources while maintaining a certain application performance level, such as response time or throughput \cite{UnderstandingDistributedSystems_2021}.
Scaling an application can be achieved in two ways: *horizontal scaling* and *vertical scaling*.
Horizontal scaling, also referred to as *scaling out*, refers to creating more instances of the same service.
-The workload is then distributed across all instances (*load balancing*), resulting in a lower workload per instance.
-An example here is adjusting the number of web servers according to the amount of visitors of the website.
+The workload is then distributed across all instances, resulting in a lower workload per instance (*load balancing*).
+An example here is adjusting the number of web servers according to the amount of incoming website requests.
Vertical scaling, also referred to as *scaling up*, refers to giving more resources (compute, memory, network, storage) to a particular instance of the service.
By giving more resources to one or multiple instances, they are able to handle more workload.
-An example for this is providing more memory resources to a database instance so that it can fit more data into memory, instead of having to load the data from disk.
+An example for this is providing more memory resources to a database instance:
+this commonly results in faster response times, because the database can fit more data in memory instead of having to load it from disk.
Due to the higher cost of cloud infrastructure compared to on-premise infrastructure, it is vital to take advantage of its elasticity and implement autoscaling.
This autoscaling needs to provision and release cloud resources without human intervention.
Overprovisioning leads to paying for unused resources, while underprovisioning causes the application performance to degrade \cite{AutoScalingWebApplicationsClouds_2018}.
-The scaling logic needs to balance between these two goals: minimizing resource usage (and thereby cost) with an acceptable service quality and minimizing service-level agreement violations at any cost by overprovisioning resources.
+The scaling logic needs to balance between these two goals: minimizing resource usage (and thereby cost) with an acceptable service quality and minimizing service-level agreement violations by provisioning a sufficient amount of resources.
To reliably and consistently achieve this in the first place, extensive monitoring of low- and high-level metrics needs to be set up.
These metrics are not only used as inputs for a scaling policy, but also to ensure that the system is not spiraling out of control (e.g., erroneously requesting more and more compute resources, thereby incurring large bills or starving other services for resources).
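The balancing act described in this hunk is, at its core, a target-tracking rule; the HPA discussed later in the thesis computes replica counts this way. A minimal sketch (the clamping bounds `min_r`/`max_r` are an added assumption to illustrate the "not spiraling out of control" safeguard):

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_r: int = 1, max_r: int = 10) -> int:
    """Target-tracking scaling rule: scale the replica count
    proportionally to the observed-metric / target-metric ratio."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp to bounds so a runaway metric cannot request unbounded resources.
    return max(min_r, min(max_r, desired))

print(desired_replicas(5, 90.0, 60.0))  # → 8 (overloaded: scale out)
print(desired_replicas(5, 20.0, 60.0))  # → 2 (underutilized: scale in)
```

Overprovisioning corresponds to a target set too low, underprovisioning to one set too high; the clamp keeps an erroneous metric from incurring large bills.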
@@ -37,7 +38,7 @@ The contributions of this thesis are the following:
* a thorough discussion of Kubernetes concepts and components relevant for autoscaling;
* an overview of generic autoscaling literature and a qualitative comparison of research proposals for Kubernetes autoscalers;
* a proposal for a novel, modular Kubernetes autoscaler with a WebAssembly sandbox;
-* the implementation of an extensive monitoring solution for a production-grade application running on Kubernetes (with Grafana, Prometheus and several metrics exporters);
+* the implementation of an extensive monitoring solution for a production-grade application running on Kubernetes (with Grafana, Prometheus and several metric exporters);
* a discussion of which types of metrics are suitable for scaling and how metrics can be used to get a holistic view of application performance;
* the implementation and fine-tuning of autoscaling policies for the target application (with HPA and KEDA);
* a quantitative evaluation of several autoscaling policies according to performance and cost criteria.
@@ -52,4 +53,4 @@ It also presents and evaluates recent research about Kubernetes autoscalers, and
Chapter \ref{implementation} documents the concrete setup of metrics-based monitoring for an application running on Kubernetes and the associated autoscaling infrastructure.
In Chapter \ref{evaluation}, this infrastructure is used to conduct quantitative experiments about the behavior, performance and cost of different autoscaling policies.
It also discusses and validates several policy optimizations.
-Finally, Chapter \ref{conclusion} provides concluding remarks and outlines future work.
+Finally, \mbox{Chapter \ref{conclusion}} provides concluding remarks and outlines future work.

@@ -32,20 +32,20 @@ In 2000, FreeBSD introduced *Jails*^[<https://docs.freebsd.org/en/books/handbook
In 2002, Oracle introduced *Zones*, also known as *Solaris Containers*, as a first-class concept in the Solaris operating system \cite{BorgOmegaKubernetes_2016}.
In 2005, *OpenVZ* was the first operating-system-level virtualization for Linux, but required a modified Linux kernel.
By 2008, the Linux kernel natively supported enough features to host *Linux Containers* (*LXC*).
-Containers allow limiting the resources (CPU, memory, filesystem, network etc.) available to a process -- or set of process -- through a Linux kernel feature called *cgroups* (control groups).
+Containers allow limiting the resources (CPU, memory, filesystem, network etc.) available to a process -- or set of processes -- through a Linux kernel feature called *cgroups* (control groups).
Like regular processes in an operating system, containers share the same kernel with all other processes on the host.
-Unlike regular processes, each container sees and has access to only its own, separate environment, which is achieved through namespace isolation.
+Unlike regular processes, each container only sees and has access to its own, separate environment, which is achieved through namespace isolation.
By combining both resource restriction and namespace isolation containers implement *operating-system-level virtualization*.
In contrast to full virtualization, where multiple kernels are running on the same host, containers have a lower resource footprint (CPU, memory and storage utilization), thereby enabling higher application performance \cite{UpdatedPerformanceComparisonVirtual_2015}.
Subsequently, this allows a higher density of applications per host and more efficient resource usage by colocating different types of applications \cite{BorgOmegaKubernetes_2016}.
Since the size of container images tends to be an order of magnitude smaller compared to virtual machine disk images \cite{HypervisorsVsLightweightVirtualization_2015}, they can easily and efficiently be shared online through container image registries.
Finally, containers can be started within seconds, as opposed virtual machines which can take minutes to initialize.
-This allows frequently adding and removing container instances without too much overhead, thus improving elasticity \cite{QuantifyingCloudElasticityContainerbased_2019}.
+This allows frequently adding and removing container instances without much overhead, thereby improving elasticity \cite{QuantifyingCloudElasticityContainerbased_2019}.
-The Docker project introduced a tool that can manage the entire container life cycle of a container: building an image from a set of instructions (*Dockerfile*); sharing this container image over the internet (*DockerHub*); as well as creating, running and deleting containers based on images.
+The Docker project introduced a tool that can manage the entire life cycle of a container: building an image from a set of instructions (*Dockerfile*); sharing this container image over the internet (*DockerHub*); as well as creating, running and deleting containers based on images.
This is what is commonly understood when referring to the modern container \cite{BorgOmegaKubernetes_2016}.
All these features mean that containers can not only be used to run applications and their components, but also to package them up in a convenient format alongside their configuration.
-Thus containers provide a higher level of abstraction for the application lifecycle, including not only starting and stopping, but also facilitating upgrades in a seamless way \cite{AutoScalingContainersImpactRelative_2017}.
+Thus, containers provide a higher level of abstraction for the application lifecycle, including not only starting and stopping, but also facilitating upgrades and replication in a seamless way \cite{AutoScalingContainersImpactRelative_2017}.
Since Docker's introduction in 2013, the *containerization* of applications has seen widespread adoption.
The monitoring company *Datadog* found in their 2018 report that 23.4% of their customers had adopted Docker, <!-- \cite{SurprisingFactsRealDocker_2018}. -->
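The cgroup mechanism this hunk describes is ultimately just files under the cgroup filesystem. A hedged sketch (the paths assume the cgroup v2 unified hierarchy, which varies by distribution, and the `mypod` group name is hypothetical):

```python
from pathlib import Path

# cgroup v2 unified hierarchy; mount point may differ per distribution.
CGROUP_ROOT = Path("/sys/fs/cgroup")

def parse_memory_max(raw: str):
    """Parse the content of a cgroup v2 `memory.max` file:
    the literal "max" means unlimited, otherwise it is a byte count."""
    raw = raw.strip()
    return None if raw == "max" else int(raw)

# A container runtime would enforce a 128 MiB cap roughly like this:
# (CGROUP_ROOT / "mypod" / "memory.max").write_text(str(128 * 1024 * 1024))
print(parse_memory_max("134217728"))  # → 134217728 (128 MiB)
```

Namespace isolation is the complementary half: the same kernel, but each process group gets its own view of PIDs, mounts and network interfaces.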
@@ -55,8 +55,8 @@ both at small (single developers and small startups) and large scales (enterpris
## Microservices
Coincidentally, containers provide a flexible abstraction for composing a collection of microservices, which is a software architecture that has increased in popularity over the last ten years \cite{BuildingMicroservicesDesigningFinegrained_2021}.
-With the micro-service architecture, a single application is decoupled into multiple, distributed services.
-Each service follows the *Single Responsibility Principle* by providing independent functionality and communicates with other services via language-agnostic *Application Programming Interfaces* (APIs) \cite{BuildingMicroservicesDesigningFinegrained_2021}.
+With the microservice architecture, a single application is decoupled into multiple, distributed services.
+Each service follows the *Single Responsibility Principle* by providing independent functionality and communicates with other services via language-agnostic *Application Programming Interfaces* (APIs).
The main advantage of microservices is organizational: each service can be developed and operated by a different development team, and therefore each team can make independent organizational decisions (such as software releases) as well as technological decisions (programming languages, frameworks etc.) \cite{KeyInfluencingFactorsKubernetes_2020}.
As a result of the shift towards microservices, the backend architecture of many applications has seen an increase in complexity as well.
Already in 2001, IBM has pointed out that the main obstacle in the IT industry is the growing software complexity \cite{VisionAutonomicComputing_2003}.
@@ -98,10 +98,10 @@ However, such an architecture comes at the cost of increased software complexity
Additionally, there is another major challenge: increased operational complexity.
While all the services may be simple to install and run with container tools such as Docker, system administrators need to operate many of these applications across a large number of machines (dozens, hundreds or even thousands).
These and related tasks are commonly referred to as *orchestration*.
-More specifically, orchestration is the management of (virtual) infrastructure required by an application during its entire lifecycle: deployment, provisioning, running, adjusting, termination.
-This is part of the vision of *autonomic computing* introduced in 2001 \cite{VisionAutonomicComputing_2003}: computing systems that can manage themselves.
+More specifically, orchestration is the management of (virtual) infrastructure required by an application during its entire lifecycle: deploying, provisioning, running, adjusting, terminating.
+This is part of the vision of *autonomic computing* introduced in 2001 \cite{VisionAutonomicComputing_2003}: software systems that can manage themselves.
-This is exactly where Kubernetes comes in: its main goal is making the orchestration of complex distributed systems easy while leveraging the density improvements containers offer \cite{BorgOmegaKubernetes_2016}.
+This is exactly where Kubernetes comes in: its main goal is making the orchestration of complex distributed systems easy while leveraging the density improvements offered by containers \cite{BorgOmegaKubernetes_2016}.
## Kubernetes
@@ -109,7 +109,7 @@ This is exactly where Kubernetes comes in: its main goal is making the orchestra
Kubernetes is an open-source framework for automating the deployment, scaling and management of distributed applications in the context of a cluster.
A *cluster* is a set of worker nodes which are orchestrated by one control plane and appear as a single unit to the outside.
-Kubernetes' initial design was based on Google's internal *Borg* and *Omega* systems, both cluster management systems that the company uses to schedule workloads across its machines in datacenters \cite{BorgOmegaKubernetes_2016}.
+Kubernetes' initial design was based on Google's internal *Borg* and *Omega* systems, both cluster management systems that the company uses to schedule workloads across machines in its datacenters \cite{BorgOmegaKubernetes_2016}.
Kubernetes was introduced in 2015 and the name is commonly abbreviated as *K8s*.
Since then, it has become a widely used platform for deploying distributed applications.
This is made apparent <!-- evident --> by the fact that all major public cloud platforms offer a managed Kubernetes service (AWS EKS, GCP GKE, Azure AKS, IBM Cloud Kubernetes, AlibabaCloud ACK).
@@ -120,19 +120,19 @@ This means the individual parts -- which are referred to as *components* -- are
Each component provides a set of services to the other components through well-defined APIs^[<https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md>].
This allows the Kubernetes architecture to be open and extensible, which is one of the explicit development goals.
-Using a microservice architecture for Kubernetes makes sense for two reasons: the software managing other applications needs to be highly available (fault tolerancy); and the software is developed in a distributed fashion by many *special interest groups*^[<https://sigs.k8s.io>] (SIG).
+Using a microservice architecture for Kubernetes makes sense for two reasons: the software managing other applications needs to be highly available (fault tolerance); and the software is developed in a distributed fashion by many *special interest groups*^[<https://sigs.k8s.io>] (SIGs).
<!-- , which are overseen by the *Steering Committee*\footnote{\url{https://github.com/kubernetes/steering}}. -->
Furthermore, it enables anyone to enhance or replace every single one of the components individually without having to modify the rest of the system.
Finally, many of the components are optional and thus do not necessarily need to be used in every environment.
-One example is the Crossplane project^[<https://crossplane.io/>], which exposes resources outside of a Kubernetes cluster (such as databases or virtual machines) through the Kubernetes API.
+One example of the extensibility is the Crossplane project^[<https://crossplane.io/>], which exposes resources outside of a Kubernetes cluster (such as databases or virtual machines) through the Kubernetes API.
The result of this extensibility is that Kubernetes itself is a complex mesh of microservices.
-It is absolutely necessary to have a firm understanding of its components to effectively operate and optimize it.
+It is absolutely necessary to have a firm understanding of its components to be able to effectively operate and optimize it.
Kubernetes refers to individual machines as *nodes* and to a set of nodes controlled by the same Kubernetes instance as a *cluster*.
To the user, Kubernetes presents a declarative interface for describing the state of *objects* in the cluster.
The most common *core objects* (supported by default) are Pods and Services.
As described before, the list of objects is extensible through Kubernetes' microservice architecture.
-Declarative means that Kubernetes continuously tries to the converge the current state of the objects towards the desired state, which is described through a *specification* (or *Spec*) in Kubernetes.
+*Declarative* means that Kubernetes continuously tries to converge the current state of the objects towards the desired state, which is defined by a *specification* (or *Spec*) in Kubernetes.
Practically speaking, when the user specifies that 5 instances of a web server should be running, Kubernetes creates 5 instances and monitors that they are available.
When one of them is no longer available, for example due to a software bug or hardware failure, Kubernetes automatically creates a new instance.
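The web-server example above can be sketched as a single reconciliation pass (names such as `web-0` and the naming scheme are purely illustrative):

```python
def reconcile(desired: int, running: set) -> tuple:
    """One pass of the reconciliation described above: compare the observed
    state with the Spec (desired replica count) and create replacements
    for any Pods that are no longer available."""
    actions = []
    i = 0
    while len(running) < desired:
        name = f"web-{i}"
        if name not in running:
            running.add(name)             # "create" the missing instance
            actions.append(f"create {name}")
        i += 1
    return running, actions

state = {"web-0", "web-1", "web-2", "web-4"}   # web-3 was lost to a node failure
state, actions = reconcile(5, state)
print(actions)  # → ['create web-3']
```

Kubernetes runs such passes continuously, so the cluster drifts back to the Spec without operator intervention.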
@@ -140,14 +140,14 @@ When one of them is no longer available, for example due to a software bug or ha
In Kubernetes, a *Pod* -- not a single container -- is the smallest deployment unit \cite{KubernetesDocumentationWorkloads_2021}.
A Pod can comprise one or more containers.
-It has an associated configuration that determines how exactly the container(s) \improve{are run} as well as their network and storage resources.
+It has an associated configuration (the *PodSpec*) that determines how exactly the container(s) are run, including attached compute, network and storage resources.
All containers within the same Pod share these resources.
Furthermore, Kubernetes supports two types of resource declarations: *requests* and *limits*.
-Requests are reservations referring to the minimum value of the given compute resource that has to be guaranteed to the Pod.
-Resource requests are used by the scheduler to decide on which worker node to place the Pod.
-Limits define the maximum amount of resources that can be used by the Pod.
-By default, the supported compute resources are CPU and memory resources \cite{KubernetesDocumentationWorkloads_2021}.
-Resources for other metrics (e.g., network or disk usage), so called *extended resources*, can be defined by installing third-party extensions^[<https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/>].
+*Resource requests* define the minimum value of a given compute resource that has to be guaranteed to the Pod.
+Requests are used by the scheduler to decide on which worker node to place the Pod.
+*Limits* define the maximum amount of resources available to the Pod.
+By default, the supported resources are CPU and memory resources \cite{KubernetesDocumentationWorkloads_2021}.
+Resource reservations and limits for other metrics (e.g., network or disk usage) can be added by installing third-party extensions --- these define so-called *extended resources*^[<https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/>].
<!-- https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#resource-units-in-kubernetes -->
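The requests/limits split can be made concrete with a small sketch (the resource values are invented, and the simplified `parse_cpu` helper is an illustration of the quantity notation, not Kubernetes API code):

```python
def parse_cpu(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity to cores ("250m" = 250 millicores)."""
    return int(quantity[:-1]) / 1000 if quantity.endswith("m") else float(quantity)

# Hypothetical container resource declaration, as it would appear in a PodSpec:
resources = {
    "requests": {"cpu": "250m", "memory": "64Mi"},   # guaranteed minimum, used for scheduling
    "limits":   {"cpu": "500m", "memory": "128Mi"},  # hard cap, enforced via cgroups
}
print(parse_cpu(resources["requests"]["cpu"]))  # → 0.25
```

The scheduler sums requests per node to decide placement, while the kubelet translates limits into cgroup settings at runtime.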
@@ -156,7 +156,7 @@ A *Deployment* is a higher-level concept that manages the ephemeral ReplicaSets
It provides many useful features to describe the desired state of an application \cite{KubernetesDocumentationWorkloads_2021}.
For example, if a node fails (hardware fault, power outage etc.), Kubernetes does not "recreate" individual Pods previously running on this node.
However, when Kubernetes detects that a Pod which is part of a Deployment is unavailable, it creates a new instance of the same Pod type (a behavior referred to as *self-healing*).
-Therefore Deployments provide a declarative orchestration interface for applications running on Kubernetes.
+Therefore, Deployments provide a declarative orchestration interface for applications running on Kubernetes.
<!-- https://kubernetes.io/docs/concepts/workloads/controllers/deployment/ -->
@@ -169,7 +169,7 @@ Therefore Deployments provide a declarative orchestration interface for applicat
A *Service* is an abstraction that gives a distinct network identity to an application running in one or more Pods (Figure \ref{fig:k8s-objects}).
This is necessary because Pods are ephemeral, meaning that they can be created or destroyed all the time -- alongside their associated IP addresses.
When an application communicates with another one through a *Service*, Kubernetes automatically forwards all requests to the *Service* to the associated *Pods*.
-Thus, a Service provides a *service discovery* mechanism (each Pod has a unique IP address, but Pods can be created and destroyed) as well as *load balancing* (by default Kubernetes uses a round-robin algorithm distribute requests across all Pods associated to a Service).
+Thus, a Service provides a mechanism for *service discovery* (each Pod has a unique IP address, but Pods can be created and destroyed) as well as *load balancing* (by default Kubernetes uses a round-robin algorithm to distribute requests across all Pods associated with a Service).
An example of a Service definition is shown in Appendix \ref{prometheus-service-discovery-with-kubernetes}.
A Service can also describe an entity outside the Kubernetes cluster, such as an external load balancer \cite{KubernetesDocumentationServices_2021}.
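The round-robin behavior described above can be sketched in a few lines (the `Service` class and Pod IPs are purely illustrative, not the Kubernetes implementation, which operates at the kernel networking layer):

```python
from itertools import cycle

class Service:
    """Toy stand-in for a Kubernetes Service: a stable identity in front of
    ephemeral Pod endpoints, distributing requests round-robin."""
    def __init__(self, endpoints):
        self._rr = cycle(endpoints)  # wraps around after the last endpoint

    def route(self) -> str:
        return next(self._rr)

svc = Service(["10.0.0.4", "10.0.0.7", "10.0.0.9"])  # hypothetical Pod IPs
print([svc.route() for _ in range(4)])
```

When a Pod dies and a replacement appears with a new IP, only the endpoint list changes; clients keep addressing the Service name.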
@@ -192,12 +192,11 @@ An extended discussion of all objects can be found in \cite{KubernetesPatterns_2
The *control plane* is the layer of components that exposes the API and interfaces to define, deploy and manage the lifecycle of Kubernetes objects \cite{KubernetesDocumentationComponents_2021}.
These components are drawn blue in Figure \ref{fig:k8s-architecture}.
The *data plane* is the layer that provides compute capacity (such as CPU, memory, network and storage resources) where objects can be scheduled.
-Such capacities are made available through the *kubelet*, which is a daemon running on each *worker node* that is responsible for the communication with the control plane.
-The kubelet continuously gathers facts about its host and the workloads running on it (e.g., CPU,
-memory, filesystem, and network usage statistics) and sends them to the control plane.
-These facts are collected with *cAdvisor*^[<https://github.com/google/cadvisor>], which is an agent for container resource usage and performance analysis.
+Such capacities are made available through the *kubelet* running on each worker node: the daemon is responsible for the communication with the control plane.
+The kubelet continuously gathers facts about its host and the workloads running on it (e.g., CPU, memory, filesystem, and network usage statistics), and sends them to the control plane.
+These statistics are collected with *cAdvisor*^[<https://github.com/google/cadvisor>], which is a tool for container resource usage and performance analysis.
-Based on the information provided by the worker nodes, the *scheduler* decides which workloads (i.e., Pods) will be \improve{run} on the worker node, subject to predefined constraints and runtime statistics.
+Based on the information provided by the worker nodes, the *scheduler* decides which workloads (i.e., Pods) will be placed on the worker node, subject to predefined constraints and runtime statistics.
The default scheduling policy is to place Pods on nodes with the most free resources, while distributing Pods from the same Deployment across different nodes.
In this way the scheduler tries to balance out resource utilization of the worker nodes \cite{KubernetesDocumentationComponents_2021}.
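The default placement policy described above (feasible nodes first, then most free resources wins) might look like this in miniature (node names and free-CPU figures are invented; the real scheduler filters and scores on many more dimensions):

```python
def pick_node(free_cpu: dict, cpu_request: float) -> str:
    """Filter nodes that can satisfy the Pod's request, then
    pick the one with the most free CPU to balance utilization."""
    feasible = {node: free for node, free in free_cpu.items() if free >= cpu_request}
    if not feasible:
        raise RuntimeError("Pod is unschedulable: no node has enough free CPU")
    return max(feasible, key=feasible.get)

# Hypothetical per-node free capacity, as reported by the kubelets:
nodes = {"node-a": 1.5, "node-b": 3.0, "node-c": 0.2}
print(pick_node(nodes, 0.5))  # → node-b
```

Spreading Pods of the same Deployment across nodes would add a second scoring term penalizing co-location, which is omitted here.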
@@ -205,8 +204,8 @@ Then, the kubelet on the corresponding worker node receives the scheduling decis
It implements the received instructions by launching and monitoring the containers through the *container runtime* (e.g., containerd^[<https://containerd.io/>] or cri-o^[<https://cri-o.io/>]).
The kubelet also manages the lifecycle of other host-specific resources, such as storage volumes and network adapters \cite{KubernetesDocumentationComponents_2021}.
-The *controller manager* (CM) implements the core functions of Kubernetes with control loops (replication controller, endpoints controller, namespace controller etc.).
-A control loop is non-terminating loop that regulates the state of a system.
+The *controller manager* (CM) implements the core functions of Kubernetes with control loops (such as the replication, endpoints and namespace controllers).
+A control loop is a non-terminating loop that regulates the state of a system.
It is commonly found in industrial control systems and robotics.
The controller manager monitors the state of the cluster (including all its objects) and tries to converge the state of the system towards the desired state, thereby implementing Kubernetes' declarative nature \cite{KubernetesDocumentationComponents_2021}.
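A control loop of the kind described here can be sketched in a few lines (bounded by `max_iterations` for illustration, whereas real controllers run indefinitely; the observe/act callables are placeholders):

```python
def control_loop(observe, act, desired: int, max_iterations: int = 100) -> None:
    """Repeatedly observe the current state and apply a corrective
    action until it matches the desired state."""
    for _ in range(max_iterations):
        current = observe()
        if current == desired:
            break                      # converged; a real loop would keep watching
        act(+1 if current < desired else -1)

state = {"replicas": 2}
control_loop(observe=lambda: state["replicas"],
             act=lambda delta: state.update(replicas=state["replicas"] + delta),
             desired=5)
print(state["replicas"])  # → 5
```

This is the same negative-feedback principle used in industrial control systems, applied to cluster objects instead of physical actuators.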
@@ -229,7 +228,7 @@ The *Metrics Server* is an efficient and scalable cluster-wide aggregator of liv
It tracks CPU and memory usage across all worker nodes, as reported by the kubelet's cAdvisor (Figure \ref{fig:k8s-architecture}).
The Metrics Server implements the Metrics API^[<https://github.com/kubernetes/metrics>] and is the successor of the deprecated Heapster^[<https://github.com/kubernetes-retired/heapster>].
Other implementations (e.g., Prometheus) and adapters can be used as an alternative source for the Metrics API.
-Notably, the Metrics API does not offer access to any historical metrics.
+Notably, the Metrics API does not offer access to any historical usage statistics.
<!-- Importantly, all components interact through clearly defined APIs so that the implementation of each component can be replaced.
An example here is Virtual Kubelet^[<https://virtual-kubelet.io/>], which acts like a regular kubelet to the cluster, while provisioning resources on third-party platforms (e.g., AWS Fargate, HashiCorp Nomad) instead of an actual worker node. -->

@@ -13,11 +13,11 @@ Finally, we survey research proposals for novel Kubernetes autoscalers and perfo
It has been extensively researched over the last decades \cite{DynamicserverQueuingSimulation_1998, EnergyAwareServerProvisioningLoad_2008}, also outside of the realm of computer science \cite{MMPPQueueCongestionbasedStaffing_2019}.
In particular the widespread adoption of the cloud computing paradigm has accelerated the pace of development and research in this field.
-With the proliferation of container orchestration frameworks over the last five years, the topic of container auto-scaling has seen particular attention.
+With the proliferation of container orchestration frameworks over the last five years, the topic of container autoscaling has seen particular attention.
In this chapter we specifically focus on autoscaling techniques for the Kubernetes framework.
It has seen the largest adoption in industry as well as academia in recent years.
A large ecosystem of open-source technologies, startups and business models has evolved around it, making it very likely to remain popular in the future.
Other orchestration frameworks are either rarely used (like Mesosphere Marathon) or gradually being phased out by their developers (like Docker Swarm)^[<https://thenewstack.io/mirantis-acquires-docker-enterprise/>].
## Autoscaling in the Cloud
The following section gives an overview of research on the topic of autoscaling.
In general, autoscaling approaches are based on the broadly recognized *MAPE-K* control loop.
MAPE-K stands for *Monitor*, *Analyse*, *Plan* and *Execute* over a *shared Knowledge base* \cite{VisionAutonomicComputing_2003}.
It is an instance of a feedback loop widely used for building self-adaptive software systems \cite{AutoScalingWebApplicationsClouds_2018}.
In the monitor phase the execution environment is observed.
The observed data (i.e., metrics) is then used in the analyse phase, which determines whether any adaptation of the system is required.
In the plan phase the appropriate actions to adapt the system are evaluated and a final selection is made, considering feasible adaptation strategies such as scaling up or down.
In the execute phase the system is changed to match the new desired state proposed in the plan phase.
The knowledge base is used as a central store of information about the entire execution environment (i.e., a database) \cite{KeyInfluencingFactorsKubernetes_2020}.
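To make the interplay of the four phases concrete, the loop can be sketched in a few lines of Python. This is a minimal illustration only: all function names are placeholders, and the knowledge base is reduced to a plain dictionary.

```python
def mape_k_loop(monitor, analyse, plan, execute, knowledge, iterations=1):
    """Minimal MAPE-K sketch: every phase reads from and writes to the
    shared knowledge base (here a plain dict)."""
    for _ in range(iterations):
        knowledge["metrics"] = monitor()  # Monitor: observe the environment
        if analyse(knowledge):            # Analyse: is adaptation required?
            action = plan(knowledge)      # Plan: select a feasible adaptation
            execute(action)               # Execute: apply the new desired state
```

In a real autoscaler the loop would run on a timer and the knowledge base would be a persistent datastore; the dictionary merely illustrates the data flow between the phases.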
In 2014, Lorido-Botran et al.\ \cite{ReviewAutoscalingTechniquesElastic_2014} published a comprehensive survey on resource estimation techniques for scaling application infrastructure.
Their survey is one of the seminal works in this field.
In 2015, Galante et al.\ \cite{AnalysisPublicCloudsElasticity_2016} published a survey on the elasticity of public clouds in the execution of scientific applications.
Their work describes several approaches, advantages and drawbacks of running scientific computations in the cloud.
Their presentation of elasticity mechanisms of public cloud providers focused on IaaS (*Infrastructure-as-a-Service*) and PaaS (*Platform-as-a-Service*) solutions.
They found that most traditional scientific applications are executed in batch mode and have difficulty adapting to dynamic changes in the infrastructure (e.g., addition or removal of one or more worker nodes).
Among other challenges, they found that the elasticity mechanisms lacked good support for scientific batch workloads, as they were more focused on server-based applications (web servers etc.).
They also recognized the lack of cloud interoperability.
In 2016, Hummaida et al.\ \cite{AdaptationCloudResourceConfiguration_2016} published a survey that focused on efficient resource reconfiguration in the cloud from the perspective of the infrastructure provider, instead of the user.
In this context, they defined *Cloud Systems Adaptation* as "*a change to provider revenue, data centre power consumption, capacity or end-user experience where decision making resulted in a reconfiguration of compute, network or storage resources*." \cite{AdaptationCloudResourceConfiguration_2016}
Another large scale survey on the challenges of autoscaling was published by Qu et al.\ in 2018 \cite{AutoScalingWebApplicationsClouds_2018}.
Additionally, they provided a detailed taxonomy of cloud application autoscaling, which we apply in our work.
Their work also focused only on autoscaling proposals for virtual-machine-based infrastructure; however, many of the concepts applied to VMs can also be applied to containers, sometimes even in better ways.
<!-- e.g., faster startup of containers compared to VMs, smaller size allows higher density -->
Also in 2018, Al-Dhuraibi et al.\ \cite{ElasticityCloudComputingState_2018} published a similar survey of state-of-the-art techniques and research challenges in autoscaling.
Their work was the first survey that included both VM- as well as container-based solutions.
While their survey included relevant container orchestration tools at the time, none of the research proposals for autoscaling were developed for Kubernetes.
Furthermore, their survey also presented a taxonomy for cloud application elasticity, which we partially base our classification on.
More recently, Radhika and Sadasivam \cite{ReviewPredictionBasedAutoscaling_2021} provided a study on proactive autoscaling techniques for heterogeneous applications.
None of the proposals evaluated in the work was developed for Kubernetes.
## Built-in Kubernetes Autoscalers
The Kubernetes authors have developed three components that enable elastic scaling of clusters: *Horizontal Pod Autoscaler* (HPA), *Vertical Pod Autoscaler* (VPA) and *Cluster Autoscaler* (CA).
*Kubernetes Event-driven Autoscaler* (KEDA) is a third-party component for Kubernetes that builds on top of HPA.
The purpose and operation of these autoscalers are described in the following sections.
### Horizontal Pod Autoscaler
<!-- https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/ -->
Kubernetes' *Horizontal Pod Autoscaler* (HPA) is one of the control loops integrated in the Controller Manager (Section \ref{kubernetes-components}).
It allows dynamically adjusting the number of Pod replicas for a particular Deployment by consuming the Metrics API \cite{KubernetesDocumentationHorizontalPod_2021}.
By default, HPA scales Pods based on relative CPU utilization.
The core algorithm of HPA is shown in Equations \ref{hpa-algorithm-1} and \ref{hpa-algorithm-2}.
The current metric value $m_i$ is retrieved for all active replicas in the set $R$ and the mean value $\bar{m}$ is calculated.
The target number of replicas $\hat{r}$ is calculated by dividing the \mbox{mean usage $\bar{m}$} by the desired usage $\hat{m}$, multiplying the result with the current number of \mbox{replicas $r$} and finally rounding up \cite{KubernetesDocumentationHorizontalPod_2021}.
<!-- \begin{equation}\label{hpa-algorithm} -->
<!-- \begin{gathered} -->
As an example, let us assume that the target memory utilization is set to 50% and that the Deployment currently runs three Pod replicas.
In this case, the mean utilization of all Pods is 70%, which is 1.4 times the target utilization.
Multiplying this number with the current number of replicas yields 4.2, which is rounded up to 5.
Thus, the HPA indicates to the Kubernetes control plane that two more replicas are needed in order to match the target resource utilization.
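The scaling rule can be expressed in a few lines of Python. This is a sketch of the published formula, not the actual HPA source code, and the per-Pod utilization values are illustrative:

```python
import math

def hpa_desired_replicas(metric_values, target, current_replicas):
    """desired = ceil(current * mean(metric) / target),
    following the HPA algorithm described above."""
    mean = sum(metric_values) / len(metric_values)
    return math.ceil(current_replicas * mean / target)

# Example from the text: mean utilization 70% at a 50% target
# with 3 running replicas yields ceil(3 * 1.4) = 5 replicas.
print(hpa_desired_replicas([60, 70, 80], 50, 3))  # -> 5
```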
Of course, a single adjustment might not be enough (due to non-linear scaling of applications and varying load), therefore the HPA runs as a control loop with a default interval of 15 seconds (*sync-period*) \cite{KubernetesDocumentationHorizontalPod_2021}.
In general, the algorithm has a bias towards scaling up faster and scaling down slower.
For example, to avoid *thrashing* (oscillations in the number of Pods) each newly created replica runs for at least one downscale period (by default 5 minutes) before it can be removed again.
The HPA supports several other settings: cluster-wide settings (e.g., scaling stabilization, scaling tolerance, Pod synchronization period) as well as Deployment-wide settings (e.g., specifying a minimum or maximum number of Pod replicas) \cite{KubernetesDocumentationHorizontalPod_2021}.
The authors of \cite{AdaptiveAIbasedAutoscalingKubernetes_2020} have developed a formal, discrete-time queuing model of HPA's algorithm, which gives an approximation of the number of Pods deployed by the autoscaler.
<!-- https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#support-for-configurable-scaling-behavior -->
Scaling decisions taken by HPA are not immediately reflected in the status of the cluster, but first need to propagate through several control plane components (Section \ref{kubernetes-components}).
In particular, Kubernetes needs to perform the following steps before a new Pod is available for handling workloads:
1. The HPA control loop is activated to calculate the new desired number of replicas.
The result is saved in the ReplicaSet object.
2. The ReplicaSet controller is activated to pick up the changes in the ReplicaSet.
It creates a new Pod object to fulfill the requirement.
3. The scheduler control loop detects that there is a Pod without an assigned node.
It selects an appropriate worker node for the new Pod, while taking into account the scheduling policy and cluster status.
The kubelet on the selected node is notified about the pending Pod.
4. The kubelet on the worker node initiates the Pod creation process.
This includes downloading container images from the registry and unpacking them, launching and initializing the containers associated with the Pod and waiting until it becomes *ready* (indicated through a *Readiness Probe*).
The VPA itself is implemented with a microservice architecture consisting of three components.
This highlights a major drawback of the current VPA implementation: its operation is disruptive.
In fact, resource adjustments are carried out by terminating the Pod and then re-scheduling it with the newly estimated resources.
This approach works for stateless services (though they may still experience service disruption), but it is a major impediment for stateful and high-performance applications \cite{ExploringPotentialNonDisruptiveVertical_2019}.
As of writing, the Kubernetes authors are working on in-place updates of Pods, which would allow the VPA to operate non-disruptively.^[<https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1287-in-place-update-pod-resources>]
\begin{figure}[ht]
\centering
Both types of samples then get a decaying weight with a half-life of 24 hours, meaning that a sample's weight is halved every 24 hours.
Then, three values are calculated (Equation \ref{eq:vpa-estimate}): the lower boundary $b_l$ (50th percentile of historic usage $H$), the target value $t$ (90th percentile of $H$) and the upper boundary $b_u$ (95th percentile of $H$).
For example, the 90th percentile describes the boundary below which the resource utilization lies for 90% of the time.
Each of these bounds is then scaled with a safety margin $m$ of 15% \cite{VerticalPodAutoscalingDefinitive_2021}. <!-- (*recommendation margin fraction*). -->
The target value $t$ is the recommended *resource request* for the Pod.
The *resource limit* is either scaled proportionally to the initial ratio between request and limit or set to a specific maximum.
For example, when the initial request is 100MB with a limit of 200MB, and VPA recommends the request to be 175MB, then the proportionally scaled limit will be 350MB (unless a LimitRange^[<https://kubernetes.io/docs/concepts/policy/limit-range/>] is specified).
Finally, the calculated upper and lower bounds are multiplied by a confidence factor $c$ which is based on the amount of collected samples (number of days of historic data), i.e., with more historic data the confidence of the estimations is higher (Equation \ref{eq:vpa-estimate}).
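The estimation procedure can be sketched as follows. This is a simplified illustration of the description above: the nearest-rank percentile stands in for VPA's weighted histograms, and the confidence factor is a placeholder parameter, not VPA's exact formula.

```python
import math

def percentile(data, p):
    """Nearest-rank percentile (simplified; VPA uses weighted histograms)."""
    s = sorted(data)
    idx = max(math.ceil(p / 100 * len(s)) - 1, 0)
    return s[idx]

def vpa_estimate(samples, margin=0.15, confidence=1.0):
    """Lower bound = 50th percentile, target = 90th, upper bound = 95th,
    each scaled by the 15% safety margin; the bounds are additionally
    multiplied by a confidence factor (placeholder)."""
    b_l = percentile(samples, 50) * (1 + margin)
    t   = percentile(samples, 90) * (1 + margin)
    b_u = percentile(samples, 95) * (1 + margin)
    return b_l * confidence, t, b_u * confidence
```

The returned target value would correspond to the recommended *resource request*; the bounds delimit when the Updater considers the current request acceptable.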
### Cluster Autoscaler
The *Cluster Autoscaler*^[<https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler>] (CA) is the third component that enables scaling on Kubernetes.
Specifically, it adjusts the size of the cluster in terms of the number of worker nodes.
It can either add nodes when the compute resources in the current cluster are insufficient, or remove nodes when there are unutilized nodes.
This is achieved through integration with the APIs for provisioning and deprovisioning virtual machines of several public cloud platforms.
Conversely, the CA decreases the size of the cluster when some nodes are consistently underutilized.
All of these decisions are made subject to several constraints the cluster administrator can set to prevent the CA from affecting the functionality of the cluster (e.g., eviction policy or Pod disruption budget).
While the interaction between the two Pod autoscalers (HPA and VPA) and the CA is crucial for successfully operating an elastic Kubernetes cluster, the CA is not relevant to the work carried out in this thesis.
Thus, a technical study of its internal mechanism is omitted here.
### Kubernetes Event-driven Autoscaler
Instead of reactively scaling based on metrics from a monitoring system, it observes the event source directly and scales based on the incoming events.
KEDA supports a wide range of event sources (including message queues, databases, streaming services and monitoring solutions)^[<https://keda.sh/docs/2.2/scalers/>], which can be cluster-internal and cluster-external.
<!-- Internally, KEDA uses a very similar metrics pipeline to the one we have manually built in previous sections. -->
<!-- However, -->
With KEDA the end user only needs to configure a simple \mbox{\textit{ScaledObject}} CRD (discussed in Section \ref{horizontal-scaling-with-keda}).
<!-- , similar to the way VPA works (Section \ref{vertical-scaling-with-vpa}). -->
\begin{figure}[ht]
Figure \ref{fig:keda-architecture} shows KEDA's two components: *agent* and *metrics API server* \cite{KEDADocumentationVersion_2021}.
The agent is responsible for scaling a Deployment between zero and one replica.
This is a workaround for the limitation that HPA does not scale Deployments to less than one replica.
The metrics API server is responsible for listening to the event source and exposing new events through Kubernetes' metrics API
(this component serves the same purpose as the metrics adapter we discuss in Section \ref{prometheus-exporters}).
Figure \ref{fig:keda-architecture} shows the autoscaling operation of KEDA.
When idle (i.e., no incoming events) KEDA scales the Deployment to zero replicas.
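The zero-to-one activation logic handled by the agent can be sketched as a simple decision function. The function name and the event-count abstraction are illustrative; KEDA's actual implementation polls scaler-specific APIs.

```python
def keda_agent_decision(pending_events, current_replicas):
    """Sketch of the zero-to-one scaling performed by KEDA's agent:
    activate the workload on incoming events, deactivate when idle.
    Scaling beyond one replica is delegated to the HPA."""
    if pending_events > 0 and current_replicas == 0:
        return 1  # wake up: deploy a single replica to process events
    if pending_events == 0 and current_replicas == 1:
        return 0  # idle: scale the Deployment down to zero
    return current_replicas  # 1..n replicas are managed by the HPA
```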
Because Deployments need to be explicitly marked as managed by KEDA, it is a flexible, opt-in mechanism.
While autoscaling has been adequately considered in the literature, the following survey provides an overview and discussion of proposals for novel Kubernetes autoscalers.
<!-- As there is a lack of overview about Kubernetes autoscalers in the literature, we address this issue by presenting and discussing high-quality research proposals for Kubernetes pod autoscalers from relevant, peer-reviewed publications. -->
The survey considers only cluster-internal scaling mechanisms (i.e., vertical and horizontal scaling of Pods); external cluster scaling is outside the scope of this study (e.g., \cite{ExperimentalEvaluationKubernetesCluster_2020,DynamicallyAdjustingScaleKubernetes_2019}).
This choice was made because the nature of autoscaling decisions between these dimensions is quite different.
The same reasoning applies to scheduling algorithms.
While there have been interesting proposals for improved Kubernetes schedulers \cite{ClientSideSchedulingBasedApplication_2017,CaravelBurstTolerantScheduling_2019,ImprovingDataCenterEfficiency_2019}, scaling and scheduling are two fundamentally different operations.
If the autoscaler works purely based on the current demand, it is considered to be *reactive*.
If the algorithm also partially predicts future demand and scales accordingly, it is considered *proactive*.
The *Scaling Method* refers to the dimension of scaling.
Horizontal scaling is characterized by adjusting the number of instances of the same type.
Vertical scaling is defined by adjusting the amount of resources allocated to a particular instance.
Ideally, both approaches should be combined \cite{AutopilotWorkloadAutoscalingGoogle_2020}.
The *Metric* column specifies which time-series values the autoscaler uses for its scaling decisions and resource estimation.
The *Workload Pattern* column indicates if the authors developed or tested their algorithm for a particular usage scenario.
*Predictable burst* refers to a cyclical usage pattern with a large difference between minimum and maximum utilization.
This is a common pattern for news and social media websites, which have large amounts of traffic during the day and very little at night.
In case of *unspecified*, the authors did not indicate a specific workload pattern.
<!-- "Application workload patterns can be categorized in three -->
Thus, the following focuses on a qualitative evaluation of the algorithms.
<!-- ### A study on performance measures for auto-scaling CPU-intensive containerized applications (2018) -->
<!-- KUBERNETES AUTOSCALER -->
Casalicchio \cite{StudyPerformanceMeasuresAutoscaling_2019} and Casalicchio and Perciballi \cite{AutoScalingContainersImpactRelative_2017} studied the relation between absolute and relative usage metrics for CPU-intensive workloads.
Relative usage measures refer to the utilization reported as a percentage of the allocated resource capacity.
These relative measures are exposed through the `cgroup` kernel primitives and are used by most container tools, such as Docker and cAdvisor.
They are commonly used because they provide an intuitive notion for horizontal scaling and defining usage quotas \cite{AutoScalingContainersImpactRelative_2017}.
For example, consider two containers with maximum CPU utilization running on the same host system.
However, if just one application is running and the host system is at 50% load, the response time from that application would be quite different.
Absolute usage measures refer to the actual, system-wide utilization of resources on the host system.
The authors studied this discrepancy and found that
the required capacity tends to be underestimated with relative usage metrics, which makes them unsuitable for determining the necessary resources needed to meet a service level objective \cite{StudyPerformanceMeasuresAutoscaling_2019}.
In particular, their findings revealed that there is a linear correlation between relative and absolute usage metrics.
Using this linear correlation allows transforming relative metrics (such as container quotas and limits) into absolute metrics.
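Such a transformation amounts to fitting a line to paired observations. The following sketch uses ordinary least squares; the function names are illustrative, and, as the authors note, the fitted coefficients are workload-specific and would need continuous re-estimation.

```python
def fit_linear(relative, absolute):
    """Ordinary least-squares fit of absolute ~ a * relative + b,
    illustrating the reported linear correlation between the two
    kinds of usage metrics."""
    n = len(relative)
    mean_x = sum(relative) / n
    mean_y = sum(absolute) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(relative, absolute))
    var = sum((x - mean_x) ** 2 for x in relative)
    a = cov / var
    return a, mean_y - a * mean_x

def relative_to_absolute(rel, a, b):
    """Translate a relative (per-container) metric into an absolute one."""
    return a * rel + b
```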
The authors argue that the results show that relative usage metrics cannot be relied upon for autoscaling decisions.
With their proposed algorithm, QoS metrics can be translated into CPU usage metrics.
It should be noted that the workloads in the study (`sysbench` and `stress-ng`) were highly artificial and solely CPU bound.
Real-world applications will certainly behave differently when utilizing CPU, memory, storage and network resources.
Furthermore, the linear coefficients required for the transformation from relative to absolute usage metrics are specific to the workload and execution environment.
Thus, they need to be constantly re-evaluated and it is unclear if this linear correlation also holds true in real-world scenarios with more diverse workloads.
<!-- "for CPU intensive workloads KHPA under-dimensions the number of -->
<!-- deployed Pods because the use of relative metrics. As -->
<!-- Assigning appropriate resource thresholds is a non-trivial task (even for simple services, as the required resources may depend on the input to the service) and can usually only be set after observing the service for a long time period. -->
<!-- Therefore, this is a task that should be taken care of by an auto-scaling component. -->
Balla et al.\ \cite{AdaptiveScalingKubernetesPods_2020} argue that solely horizontal autoscaling is insufficient for enabling a service to be fully elastic.
Therefore, they propose an autoscaler called *Libra* which performs both vertical scaling (determining the appropriate CPU limit) as well as horizontal scaling (determining the correct number of replicas).
The autoscaler starts out by determining the adequate CPU limit.
Libra deploys at least two Pods to avoid affecting the QoS too much while performing these measurements.
The *production* Pod is deployed with high CPU limits and serves 75% of the incoming traffic, while the *canary* Pod is used to find the appropriate CPU limit and serves the residual 25% of traffic.
The latter is assigned a low initial limit, which is then gradually increased by Libra, until the average number of served requests and the response time converge towards a stable value.
Afterwards, Libra updates the production Pod with the newly determined CPU limit and acts as a horizontal autoscaler.
In particular, it increases the number of Pods when the average response time is double the value determined in the previous phase or when the amount of served requests approaches 90% of the value associated to CPU limit.
For example, if the appropriate CPU limit has been determined to be 70% and it has been empirically measured that the Pod can serve 1,000 requests with that value, Libra starts adding another Pod when each of the running instances is serving more than 900 requests.
Conversely, if the requests per Pod fall below 40%, Libra removes Pod replicas \cite{AdaptiveScalingKubernetesPods_2020}.
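The threshold logic of Libra's horizontal phase can be sketched as follows. The function name and parameters are illustrative; the 90% and 40% thresholds follow the paper's description, and the additional response-time trigger is omitted for brevity.

```python
def libra_scale(avg_requests_per_pod, capacity_per_pod, replicas,
                upper=0.9, lower=0.4):
    """Sketch of Libra's horizontal phase: add a replica when the average
    load per Pod exceeds 90% of the empirically determined capacity,
    remove one when it drops below 40%."""
    if avg_requests_per_pod > upper * capacity_per_pod:
        return replicas + 1
    if avg_requests_per_pod < lower * capacity_per_pod and replicas > 1:
        return replicas - 1
    return replicas

# With a measured capacity of 1,000 requests per Pod (example above):
print(libra_scale(950, 1000, 3))  # -> 4 (above the 90% threshold)
```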
The authors' experiments showed that Kubernetes' default HPA did not scale the Deployment enough, which led to 40% lower throughput (requests per second), while Libra scaled the Deployment to double the number of Pods.
Determining the appropriate resource requests and limits for a service by deploying different kinds of Pods is an excellent idea:
it enables accurate live measurements on real-world data without a large impact on clients using the service.
Unfortunately, the conducted benchmark was quite simplistic: a web server that simply returns the string "Hello" (and does not perform any other computations or I/O operations).
Thus, from these experiments it is unclear whether the same results could be obtained for more sophisticated applications and workload scenarios.
<!-- Also, in order to implement the request routing between production and canary pods the authors used an Istio service mesh with Envoy proxies as sidecars. -->
<!-- This is a pattern known to have performance issues due to the additional hops in routing HTTP requests and other overheads. \todo{citation} -->
<!-- Finally, it is great that the authors found an approach to automatically determine the adequate CPU resources thresholds. -->
An example here would be a database, which needs to keep complex and large data structures in memory.
In contrast to VPA, which uses the peak resource consumption for estimating the new resource requirements, RUBAS bases the estimates on the median of past observations.
The authors argue that a temporary peak in resource consumption by the application should not be used for estimating future resource demands.
Their experiments showed that the non-disruptive migration can significantly reduce the execution time of applications (16% runtime improvement) when the initial resource thresholds specified by the user were too low.
In this case both VPA and RUBAS need to update the resource thresholds several times before converging on the optimal solution.
This result is particularly relevant for one-off batch jobs, not so much for long-running services.
They also found that due to resource allocation based on average utilization, RUBAS had to perform fewer migrations (stopping and restarting of Pods) compared to VPA.
Conversely, this also increased the CPU (72% compared to 82%) and memory utilization (76% compared to 86%).
Higher utilization means that tasks on the cluster will complete faster overall.
To this end, the presented architecture is unlikely to find adoption in the Kubernetes ecosystem.
<!-- ### Adaptive AI-based auto-scaling for Kubernetes (2020) -->
<!-- KUBERNETES AUTOSCALER -->
Toka et al.\ \cite{AdaptiveAIbasedAutoscalingKubernetes_2020} proposed a proactive autoscaling approach based on demand forecasting with machine learning.
Unlike most scaling algorithms which are reacting only based on the current load of the system (such as the default HPA), their autoscaling engine *HPA+* takes into account a larger window of time and future demand forecasts based on machine learning models.
<!-- Fundamentally, the authors assume a linear relation between the number of the number of pods and the served requests per second of an application. -->
<!-- Furthermore, they state that Kubernetes is most commonly used for web services, where this application profile holds true. -->
<!-- However, even simple web services usually include some form of database or disk access, which will inevitably break this linear relationship as the number of requests rises. -->
Under the hood, the HPA+ utilizes several models for scaling: an auto-regression (AR) model, a supervised Hierarchical Temporal Memory (HTM) neural network and an unsupervised Long Short-Term Memory (LSTM) neural network.
In \cite{MachineLearningbasedScalingManagement_2020} the same authors also added a fourth, reinforcement learning-based (RL) model.
<!-- After training and testing the models on their dataset, they found that the AR model produced the lowest root mean square error, the best predictive power ($R^2$) and had the fastest training time by several orders of magnitude. -->
<!-- However, the AR model is sensitive to outliers, unlike the LSTM model, which also required the most time for training by several orders of magnitude. -->
<!-- HTM provided a trade-off between robustness, computational requirements and predictive performance. -->
<!-- Additionally, this algorithm can be continuously trained based on the incoming data. -->
<!-- The RL model exhibited the worst prediction performance, despite long learning times. -->
Because each model performed poor or well depending on the particular usage pattern, the authors combined all models in a single autoscaling engine.
This engine continuously runs all models, but only the model with the best performance on the most recent input (last two minutes) is considered for scaling decisions.
While the underlying algorithms are quite complex, HPA+ packages them into a single parameter which the end user can tune: *excess* describes the trade-off between lower loss (amount of unserved client requests) and higher resource utilization.
This is achieved through resource over- and under-provisioning.
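The combination of best-recent-model selection and excess-based over-provisioning can be sketched as follows. This is an illustrative reconstruction, not HPA+'s actual code: the RMSE selection criterion, the function names and the assumed linear relation between Pods and served requests per second are simplifications introduced here.

```python
import math

def pick_forecaster(models, recent_actual, recent_predictions):
    """Return the model that performed best on the most recent window
    (here: lowest root mean square error over the last two minutes)."""
    def rmse(pred, actual):
        return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual))
    best = min(models, key=lambda name: rmse(recent_predictions[name], recent_actual))
    return models[best]

def replicas_for_forecast(forecast_rps, rps_per_pod, excess):
    """Translate a demand forecast into a Pod count, over-provisioning
    by the user-tunable `excess` fraction (lower loss vs. higher usage)."""
    return max(1, math.ceil(forecast_rps * (1 + excess) / rps_per_pod))
```

A higher `excess` trades resource efficiency for fewer unserved requests, which is exactly the single knob HPA+ exposes to the end user.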
Compared to the original HPA, the HPA+ only scaled the Pods about 3-9% more (depending on the excess parameter), but had significantly lower request loss.
In their follow-up article \cite{MachineLearningbasedScalingManagement_2020}, the authors confirmed these findings with more extensive benchmarks.
They generated synthetic data based on real-world traces of Facebook.com website visits on a university campus.
While the use of recent real-world data for their tests is commendable, it should be noted that the artificially-generated traffic pattern was much more spiky (meaning large, positive outliers) than the original input traces, which can be considered as an artifact in the data.
With their autoscaling solution, the authors tried to focus on ease of usability by introducing a parameter which controls the trade-off between resource usage and QoS level violation.
However, since the proposed models need to be trained and fitted to the application at hand, setting up such a system is a non-trivial task.
In addition, it requires a sizable amount of clean data that closely mirrors the usage scenario of the application, because not every application follows the usage patterns of a social media website.
### Microscaler
<!-- ### Microscaler: Automatic Scaling for Microservices with an Online Learning Approach (2019) -->
<!-- KUBERNETES AUTOSCALER -->
Yu et al.\ \cite{MicroscalerAutomaticScalingMicroservices_2019} presented *Microscaler*, a horizontal autoscaler which combines an online learning approach with a heuristic approach for cost-optimal scaling of microservices while maintaining desired QoS levels.
The authors introduced a criterion called *service power* to determine the need for scaling and to estimate the appropriate scale.
Service power represents the ratio between the average latency of the slowest 50 percent of requests ($P_{50}$) and the average latency of the slowest 10 percent of requests ($P_{90}$) during the last 30 seconds.
When the service power is close to or above 1 (i.e., $P_{90} \approx P_{50}$), the application can handle most of the requests within the desired QoS level.
When the service power falls significantly below 1, it means that the service quality is degraded.
Microscaler mainly considers the QoS experienced by the user, instead of QoS of individual services.
As a consequence, it scales not simply because a service has high CPU utilization or increased response time, but only when a user-facing SLA is violated.
This way, it can avoid detecting false-positive scaling events.
In their case, response time SLAs can either be violated by falling below the minimum threshold $T_{min}$ or by rising above the maximum threshold $T_{max}$.
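Read literally, the service power criterion can be computed as follows; this is one interpretation of the definition above, not Microscaler's reference implementation.

```python
def service_power(latencies_ms):
    """Ratio of the average latency of the slowest 50% of requests to the
    average latency of the slowest 10% (nominally over the last 30 seconds).
    Close to 1: healthy tail; well below 1: degraded service quality."""
    xs = sorted(latencies_ms)                     # ascending latency
    slowest_half = xs[len(xs) // 2:]              # slowest 50% of requests
    slowest_tenth = xs[-max(1, len(xs) // 10):]   # slowest 10% of requests
    p50 = sum(slowest_half) / len(slowest_half)
    p90 = sum(slowest_tenth) / len(slowest_tenth)
    return p50 / p90
```

For a uniform latency distribution the ratio is 1; a heavy tail pulls it well below 1.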
The service power criterion is then incorporated into a Bayesian Optimization approach, which allows minimizing an objective function (i.e., cost) while obeying constraints (i.e., SLA bounds).
<!-- With each lower or upper bound SLA violation, the cost model is updated. -->
Unfortunately, the results presented in their work are inconclusive:
while Microscaler converges slightly faster to the desired QoS level than other autoscaling approaches, the difference is not significant.
Also, the mathematical model carries complexity and several parameters which need to be adjusted depending on the application \cite{MicroscalerAutomaticScalingMicroservices_2019}.
<!-- For routing and generating metrics, the authors used a service-mesh architecture (Istio) inside Kubernetes. -->
<!-- This architecture provides allows scaling applications without having to modify the application source code in order to emit metrics such as response time. -->
<!-- ### Hierarchical Scaling of Microservices in Kubernetes (2020) -->
<!-- KUBERNETES AUTOSCALER -->
Rossi et al.\ \cite{HierarchicalScalingMicroservicesKubernetes_2020} presented *Multi-Level Elastic Kubernetes* (*me-kube*), an autoscaler that coordinates the horizontal scaling of microservice-based applications.
It is based on a two-layered control loop.
On the lower layer, the *Microservice Manager* controls the scaling of a single service by monitoring and analyzing metrics with a local policy.
This local policy can either be proactive (in this case reinforcement learning-based) or reactive (application metric-based).
On the higher layer, the *Application Manager* controls the scaling of an entire application (which can be composed of multiple services) by observing the application performance.
When the Microservice Manager detects a need to scale the service, it sends a *proposal* to the Application Manager.
A proposal contains the desired number of replicas for the service as well as a *score*:
it describes the estimated improvement of the proposed adaptation.
Conversely, when the Application Manager detects an SLA violation, it requests proposals from the Microservice Managers.
The Application Manager then coordinates the scaling of several microservices to avoid interfering scaling decisions (e.g., service B is scaled down because it is not receiving traffic from service A, which is under high-load).
To evaluate scaling decisions, the Application Manager considers all submitted reconfiguration proposals and chooses the one with the highest score, as it is estimated to have the highest impact on the overall application SLA.
This process is repeated iteratively until the target response time is met again or until there are no more proposals.
Once a decision has been made, the scaling action is communicated down to the Microservice Managers.
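The iterative selection loop of the Application Manager can be sketched as follows (hypothetical names; a sketch of the loop described above, not me-kube's code):

```python
def coordinate_scaling(proposals, target_met):
    """Greedily apply the highest-scoring reconfiguration proposal until
    the application-level target (e.g., response time) is met again or
    no proposals remain. `proposals` holds (service, replicas, score)
    tuples; `target_met` re-evaluates the SLA after each adaptation."""
    chosen = []
    pending = sorted(proposals, key=lambda p: p[2], reverse=True)
    while pending and not target_met(chosen):
        chosen.append(pending.pop(0))
    return chosen
```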
In their experiments the authors tested several scaling approaches.
They found that Q-Learning performed the worst in a moderately complex deployment scenario, which drastically increased the state space and thereby made it hard to learn for the model.
Kubernetes' HPA produced several policy violations, mainly because the workload was not necessarily CPU bound but HPA still scales based on CPU utilization by default.
The hierarchical (i.e., centrally coordinated) scaling approach combined with a local predictive policy (ARIMA) performed the best, since it only caused SLA violations at the very beginning of the stress test.
In summary, the central coordination of scaling decisions in microservice-based applications is a promising approach.
The authors have not documented how they deployed or configured the additional scaling components, so it is somewhat unclear what impact these aspects would have on a real-world deployment.
Moreover, an additional Microservice Manager is required for each microservice, which can have significant compute requirements, depending on the model and amount of data used for the scaling algorithm.
<!-- "We select ARIMA as it -->
<!-- is able to estimate the trend even from a few points. In -->
Chang et al.\ \cite{KubernetesBasedMonitoringPlatformDynamic_2017} proposed a generic platform for dynamic resource provisioning in Kubernetes with three key features: comprehensive monitoring, deployment flexibility and automatic operation.
All three are essential for operating a large, distributed system on top of Kubernetes (or any other orchestration platform).
Their monitoring stack is based on Heapster (for collecting low-level system metrics), Apache JMeter (for generating application load and measuring response time), InfluxDB (time-series database for storing metrics) and Grafana (as a visualization tool).
It needs to be pointed out that the authors decided to inject artificial load into the system to measure application performance, instead of collecting "native" system metrics.
In this sense, they are measuring and collecting performance data on the client side instead of on the server side.
This approach has the advantage that it captures the service level experienced by clients more accurately \cite{UnderstandingDistributedSystems_2021}.
However, it also introduces additional stress on the system, which might be undesirable when the system is already under heavy load.
The authors described their *Resource Scheduler Module* and *Pod Scaler Module*, which adjust the number of running Pods based on CPU utilization with static thresholds.
This increases productivity, because humans are not only slow and inaccurate at estimating required resources, but they also need to continuously perform this task while the software is developed and updated.
Most of the solutions looked at scaling each service individually;
only one (\mbox{me-kube} \cite{HierarchicalScalingMicroservicesKubernetes_2020}) implemented centralized (*hierarchical*) scaling of several services.
While coordinated scaling can offer significant benefits, it is difficult to generalize this approach from simple test cases to large, complex meshes of production systems.
Bauer et al.\ \cite{ChamulteonCoordinatedAutoScalingMicroServices_2019} have also investigated hierarchical scaling, but they have not implemented and validated their proposal with Kubernetes.
The autoscaling component should be implemented using the Go programming language to align with other Kubernetes components and make use of the official Kubernetes client libraries^[<https://github.com/kubernetes/client-go>].
The architecture of the autoscaling component (shown in Figure \ref{fig:wasmpa-architecture}) is similar to HPA and VPA:
it fetches metrics from the Metrics API or an external monitoring system, and is configured with Kubernetes CRD objects.
The CRD object is the interface between the user and Kubernetes cluster since it contains the workload-specific scaling configuration.
Appendix \ref{modular-kubernetes-autoscaler-crd} shows a preliminary example of this CRD.
At runtime, the autoscaling component passes the metrics and scaling configuration to a WebAssembly sandbox, which runs the core autoscaling algorithm and returns scaling results.
Afterwards, the autoscaling component performs the actions described by the results of the algorithm (e.g., increase the number of replicas to a certain value) by communicating with the Kubernetes API.
WebAssembly^[<https://webassembly.org/>] is a portable binary instruction format that can be used as a compilation target for many programming languages.
Its main features are speed, memory safety and debuggability \cite{DifferentialFuzzingWebAssembly_2020}.
Running the core scaling algorithm inside a WebAssembly sandbox has several advantages for researchers:
* the algorithm can be implemented in any programming language (Python, JavaScript, Rust etc.) and then compiled to WebAssembly bytecode;
* the code runs at near-native speed (faster than interpreted languages), allowing the implementation of complex and resource-intensive algorithms (e.g., neural networks);
* the sandbox provides simple interfaces for data input and output (i.e., no need to interact directly with the complex and evolving Kubernetes API), which simplifies development, testing and simulation.
Cluster operators gain the following advantages from the WebAssembly sandbox:
* the autoscaler can host multiple WebAssembly sandboxes, which allows different services to be scaled with individual algorithms;
* the same scaling algorithm can be used across different hosts and environments because the bytecode is agnostic in terms of processor architecture and operating system.
For similar reasons, the Kubewarden project^[<https://www.kubewarden.io/>] uses WebAssembly sandboxes to implement modular security policies in Kubernetes.
\begin{figure}[ht]
\centering
\includegraphics[width=\textwidth]{images/wasmpa-architecture.pdf}
\caption{\label{fig:wasmpa-architecture} High-level operational diagram of modular Kubernetes autoscaler with WebAssembly (WASM) sandbox}
\end{figure}
Additionally, the modularity of this autoscaler allows one important distinction from HPA and VPA:
it can combine the decision for horizontal and vertical scaling into one algorithm and one component.
HPA and VPA are separate, uncoordinated components which cannot be used to scale on the same metric
(otherwise race-conditions might occur and lead to unstable behavior^[<https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/README.md#known-limitations>]).
The proposed modular autoscaling component does not have this issue since horizontal and vertical scaling decisions are generated from the same component -- and are therefore conflict-free.
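To make the sandbox contract concrete, the following sketch shows what an algorithm body could look like from the host's perspective: plain data in, plain data out, no direct Kubernetes API access. All field names are hypothetical placeholders, and the proportional scaling rule merely stands in for an arbitrary algorithm.

```python
import json
import math

def scale_decision(payload: str) -> str:
    """Hypothetical sandbox entry point: receives metrics and the CRD
    scaling configuration as JSON, returns horizontal and vertical
    scaling results from a single, conflict-free decision."""
    req = json.loads(payload)
    cpu = req["metrics"]["cpu"]             # current utilization, 0..1
    target = req["config"]["targetCpu"]     # desired utilization from the CRD
    replicas = req["config"]["currentReplicas"]
    desired = max(1, math.ceil(replicas * cpu / target))
    return json.dumps({
        "horizontal": {"replicas": desired},
        "vertical": {},                     # same algorithm may emit both
    })
```

Because both scaling dimensions are produced by the same call, the race conditions between separate horizontal and vertical controllers cannot arise by construction.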
The implementation of this modular Kubernetes autoscaler is left as future work.
<!-- by providing a Kubernetes components that modularizes the core scaling logic and abstracts away the infrastructure work an autoscaling component needs to take care of. -->

\chapterquote{If you are not monitoring stuff, it is actually out of control.}{John Wilkes}{10.5cm}
\noindent
This chapter presents a monitoring infrastructure and autoscaling policies for a Kubernetes-based application in a production-grade environment.
It details the technical implementation necessary to identify and expose relevant metrics from the target application and execution environment;
aggregate and visualize those metrics with a modern monitoring solution;
install VPA, HPA and KEDA autoscaling components;
It aids network operators to automate security controls and maintain them in the desired state.
As the complexity of modern telecommunication infrastructure grows, it is crucial for these systems to be configured securely and remain that way.
One part of the target application is responsible for connecting to the external systems, checking their security settings and if necessary re-configuring them.
The architectural design of the application which covers this functionality is shown in Figure \ref{fig:app-architecture}.
The API server accepts commands from the user, such as *"check security settings of system A and B"*.
It then fetches the necessary connection details from the database and forwards detailed task instructions via the message queue to an executor.
It refers to observing the state of the execution environment to detect failures, trigger alerts and provide information about overall system health.
Modern monitoring systems are based on metrics.
A *metric* is a numeric value of information represented as a time-series, i.e., each value is associated with a unique timestamp.
A *service-level indicator* (SLI) is a metric which measures a specific dimension of the *quality of service* (e.g., response time, error rate).
A *service-level objective* (SLO) defines the range of acceptable values for an SLI within which the service is considered to be in a healthy state.
A *service-level agreement* (SLA) can be based on an SLO and is a formal commitment from a service provider towards its users.
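As a toy illustration of how these terms relate (the request-success sample shape and the 1% objective are made up for this example):

```python
def error_rate_sli(samples):
    """SLI: fraction of failed requests in the observation window.
    `samples` are (timestamp, success) pairs derived from a request metric."""
    failed = sum(1 for _, ok in samples if not ok)
    return failed / len(samples)

def slo_met(sli_value, objective=0.01):
    """SLO: the error-rate SLI must stay at or below the objective (1% here).
    An SLA would formalize this commitment towards users."""
    return sli_value <= objective
```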
\caption{\label{fig:monitoring-prometheus-grafana} Logical view of monitoring infrastructure with Prometheus and Grafana}
\end{figure}
Prometheus itself does not extract metrics from the system or application, but rather relies on so-called *exporters*^[<https://prometheus.io/docs/instrumenting/exporters/>].
These exporters expose relevant metrics through an HTTP endpoint in a plaintext format^[<https://prometheus.io/docs/instrumenting/exposition_formats/>], which then gets queried periodically (according to the *scrape_interval*) --- this process is referred to as *scraping*.
For many commonly used services (databases, message queues, operating systems etc.) open-source exporters already exist^[<https://exporterhub.io/>].
A *custom* exporter needs to be developed to expose metrics from a proprietary or novel application.
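The text exposition format is simple enough to sketch by hand; the following sketch renders the `esm_tasks_total` gauge that our custom exporter (described below) exposes. It is a minimal illustration of the format, not a production exporter:

```python
def render_metrics(task_counts):
    """Render one gauge family in the Prometheus plaintext exposition
    format: `# HELP`/`# TYPE` headers followed by one
    `name{label="value"} number` sample per line."""
    lines = [
        "# HELP esm_tasks_total Number of tasks by status",
        "# TYPE esm_tasks_total gauge",
    ]
    for status, count in sorted(task_counts.items()):
        lines.append(f'esm_tasks_total{{status="{status}"}} {count}')
    return "\n".join(lines) + "\n"
```

An exporter would serve this string over HTTP, and Prometheus would scrape it every *scrape_interval*.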
Since Prometheus itself does not extract metrics from an application, installing and configuring special exporters is necessary.
This section documents which exporters have been tried and deployed to collect metrics as well as the purpose they serve.
We start with low-level metrics and gradually move towards higher-level metrics.
It is important to note that only when metrics from different sources (exporters) are combined, it is possible to obtain a comprehensive view of the application behavior and performance.
The *kube-state-metrics* exporter^[<https://github.com/kubernetes/kube-state-metrics/tree/v1.9.8>] exposes details about the state of objects managed by Kubernetes' control plane.
For example, it reports the number of Deployments and Pods as well as configuration of these objects (e.g., resource requests and limits) to Prometheus.
It should be noted that these metrics only describe the state of virtual objects.
To get real-time information about the state of the Kubernetes worker nodes (CPU, memory, I/O utilization), Prometheus is configured to scrape metrics from *cAdvisor* (refer to Section \ref{kubernetes-components}).
<!-- It provides usage statistics about each container running on a node. -->
esm_tasks_total{status="running"} 42
esm_tasks_total{status="finished"} 7
\end{lstlisting}
Additionally, we implemented several other metrics with types Counter and Histogram.
A *Counter* is a monotonically increasing value, which is only reset when restarting the service \cite{PrometheusDocumentation_2021}.
For example, it can be useful for describing the total number of tasks created by the application.
A *Histogram* samples observations into buckets with pre-configured sizes \cite{PrometheusDocumentation_2021}.
It can be used to describe the duration of requests, for instance 0-10ms, 10-100ms, 100ms-1s, 1-10s etc.
A histogram provides a balance between tracking the duration of each task individually (which has high *cardinality*, i.e., expensive in terms of bandwidth and storage resources) and aggregating into mean, minimum and maximum values (which loses information about distribution and outliers).
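The bucketing scheme can be illustrated in a few lines; the bucket bounds below match the example durations above, and the cumulative `le` (less-or-equal) semantics follow Prometheus's histogram convention:

```python
def histogram_buckets(durations_s, bounds=(0.01, 0.1, 1.0, 10.0)):
    """Count observations into cumulative buckets: each bucket holds all
    observations less than or equal to its upper bound, and `+Inf`
    always counts every observation (Prometheus `le` semantics)."""
    counts = {le: sum(1 for d in durations_s if d <= le) for le in bounds}
    counts[float("inf")] = len(durations_s)
    return counts
```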
While developing this custom exporter, we followed the best practices and conventions for writing Prometheus exporters^[<https://prometheus.io/docs/instrumenting/writing_exporters/>] and metric naming^[<https://prometheus.io/docs/practices/naming/>].
The custom exporter was packaged into a container image and deployed alongside the target application.
We were able to confirm that it provides useful high-level metrics about the application by setting up a Grafana dashboard and visualizing the metrics as time-series graphs.
The exposed metrics allow reasoning about the application behavior and making appropriate scaling decisions based on the collected metrics.
\begin{figure}[ht]
\caption{\label{fig:scaling-flow} Flow of metrics used for scaling. Arrows denote the logical flow of data. Orange arrows symbolize raw HTTP metrics.}
\end{figure}
To use these custom metrics with Kubernetes HPA (Section \ref{horizontal-pod-autoscaler}), another component needs to be installed into the cluster: a *metrics adapter* (Figure \ref{fig:scaling-flow}).
This component is responsible for translating the metrics from Prometheus into a format compatible with the Kubernetes metrics API (Section \ref{kubernetes-components}).
We chose the *prometheus-adapter* project^[<https://github.com/kubernetes-sigs/prometheus-adapter>] for this purpose as our use case focuses solely on Prometheus metrics.
Another project with a similar goal is *kube-metrics-adapter*^[<https://github.com/zalando-incubator/kube-metrics-adapter>] which allows utilizing a wider range of data sources, for example InfluxDB or AWS SQS queues.
The installation of the adapter was performed with a Helm Chart and is detailed in Appendix \ref{prometheus-adapter-setup}.
In essence, the adapter is configured with a PromQL query it should execute.
It should be noted that this assumption does not always hold true.
Special attention needs to be paid to (partially) stateful services.
Nguyen and Kim \cite{HighlyScalableLoadBalancing_2020} performed an investigation of load balancing stateful applications on Kubernetes.
They found that especially distribution and load balancing of leaders throughout the cluster are important for maximizing performance when scaling horizontally.
Two other aspects need to be considered when implementing horizontal scaling on Kubernetes: microservice startup and shutdown.
These aspects are particularly important when using autoscaling, since individual Pods are frequently created and removed at all times.
If the Pods of a microservice are exposed with a Kubernetes Service (see Section \ref{kubernetes-objects}), the Pods should have *readiness probes* configured.
Based on this probe Kubernetes determines if the application is ready to handle requests (e.g., after it has finished its startup routine).
Any application running as a distributed system should implement *graceful shutdown* (or *graceful termination*):
when a Pod is shut down, Kubernetes stops routing new traffic to the replica and sends the Pod a SIGTERM signal;
the application should finish serving the outstanding requests it has accepted and terminate itself afterwards \cite{KubernetesPatterns_2019}.
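As an illustrative sketch (not the application's actual code), a worker process can implement this pattern by trapping SIGTERM and draining its current task before exiting; `fetch_task` and `process_task` are hypothetical placeholders for the application's work loop:

```python
import signal
import time

shutting_down = False

def handle_sigterm(signum, frame):
    # Kubernetes sends SIGTERM when the Pod is being removed;
    # only set a flag so in-flight work can finish first.
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

def worker_loop(fetch_task, process_task):
    # Process tasks until shutdown is requested, then return so the
    # process exits cleanly before the termination grace period ends.
    while not shutting_down:
        task = fetch_task()
        if task is not None:
            process_task(task)
        else:
            time.sleep(0.1)
```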
*Vertical scaling* (scaling up and down) refers to adjusting requested resources (compute, memory, network, storage) allocated to a service based on the actual usage.
By giving more resources to one or multiple instances, they are able to handle more workload.
While most industry practitioners only focus on scaling up (allocating more resources), the opposite is actually far more desirable: scaling down.
The *Autopilot* paper from researchers at Google shows that significant cost savings can be realized by automatically adjusting the allocated resources, i.e., vertical scaling \cite{AutopilotWorkloadAutoscalingGoogle_2020}.
Some of the research proposals discussed in Section \ref{research-proposals-for-kubernetes-autoscalers} have shown potential to be effective and cost-efficient autoscalers, but none of them offer a publicly available implementation.
For this reason, HPA (Horizontal Pod Autoscaler, Section \ref{horizontal-pod-autoscaler}), VPA (Vertical Pod Autoscaler, Section \ref{vertical-pod-autoscaler}) and KEDA (Kubernetes Event-driven Autoscaler, Section \ref{kubernetes-event-driven-autoscaler}) were chosen for autoscaling:
they are widely deployed in the industry and their implementations are battle-tested.
Furthermore, they feature a plethora of configuration options to adjust the scaling behavior.
This allows developers and administrators to fine-tune the scaling behavior to their \mbox{use cases} and goals.
These options -- as well as their effects -- will be explored in the following sections.
### Vertical Scaling with VPA
Thus, it has potential for service disruption, especially when the number of Pod replicas is low.
Nevertheless, the VPA Recommender can be a useful tool for determining appropriate resource requests and limits, as we show in the following section.
After the component is installed into the cluster, VPA needs to be instructed to monitor our application so that it can build its internal resource usage model and produce an estimate.
VPA is enabled and configured for each application running on Kubernetes individually.
<!-- In our case, we are configuring it for a *Deployment* (see Section \ref{kubernetes-objects}). -->
This is done through a special object called a *Custom Resource Definition* (*CRD*).
CRDs act just like Kubernetes core objects; however, they are not implemented by the *Controller Manager* (Section \ref{kubernetes-components}), but by an external component --- in this case, the VPA.
Listing \ref{src:vpa-crd-executor} shows the CRD for configuring VPA to monitor the `executor` Deployment (line 6-9).
VPA is instructed to only provide resource recommendations, but not change the configuration of running Pods (lines 10-11).
We configure VPA to monitor a specific container in the Pod (this avoids interference with sidecar containers) and the types of resources (lines 13-16).
CPU and memory are the only resources supported by VPA.
\clearpage
controlledResources: ["cpu", "memory"]
\end{lstlisting}
The CRD is added to the cluster with `kubectl apply -f filename.yaml` in the same namespace as the target Deployment.
Once the VPA has been able to collect metrics for a while, the resource request and limit recommendations can be retrieved as shown in Listing \ref{src:vpa-recommendation-example}.
Section \ref{vertical-pod-autoscaler} explains the meaning and calculations behind these values.
For our use case, only the upper bound and target recommendations are relevant.
A tool like *Goldilocks*^[<https://goldilocks.docs.fairwinds.com/>] can provide a dashboard for these recommendations.
### Horizontal Scaling with HPA
This section details the configuration of Kubernetes' Horizontal Pod Autoscaler (HPA) \cite{KubernetesDocumentationHorizontalPod_2021}.
As outlined in Section \ref{horizontal-pod-autoscaler}, HPA is implemented in the Controller Manager and is therefore part of every Kubernetes installation.
No installation is required; HPA only needs to be configured for each scaling target.
The goal of the HPA in our scenario is to give the application similar performance to a static overprovisioning of resources<!-- (low average queue time and time to completion) -->, while keeping the cost <!-- (replica seconds) --> at a minimum.
Mathematically, the resulting system can be described as a queuing system where tasks are produced by the application and consumed by a variable number of executors.
The most minimal horizontal scaling policy could be applied with the command `kubectl autoscale deployment executor --cpu-percent=50 --max=20`.
This would scale the number of replicas based on the average CPU load across all Pods in the Deployment.
However, as discussed at the beginning of this chapter, our workload is neither purely CPU nor memory bound, but also by the throughput of external systems.
Thus, we need to scale this Deployment with a high-level metric which we exposed in Section \ref{prometheus-exporters}.
Based on empirical observations and experiments, we identified the current queue length (used in Figure \ref{fig:hpa-scaling-v0}) as a meaningful autoscaling metric.
Thanks to the metrics adapter installed in Section \ref{prometheus-exporters}, we can configure the HPA to scale based on external metrics, as shown in \mbox{Listing \ref{src:hpa-scale-v0}}.
The object structure is similar to the CRD of the VPA.
It specifies a *target* -- the Kubernetes object which should be scaled (line 6-9) -- and based on which metric it should be scaled (line 13-16).
<!-- We are using an external metric (line 13-16) based on which the target is scaled. -->
The goal of the HPA is to make the metric value equal to the target value (line 17-19) by adjusting the number of Pods.
When the metric value is above the target, it creates more instances.
When the metric value is below the target, it removes instances.
For details about the algorithm refer to Section \ref{horizontal-pod-autoscaler}.
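The core of that algorithm from the Kubernetes documentation can be sketched in a few lines (a simplification; the tolerance band and Pod-readiness handling are omitted):

```python
import math

def desired_replicas(current_replicas: int, metric_value: float, target_value: float) -> int:
    # desiredReplicas = ceil(currentReplicas * currentMetricValue / targetValue)
    # metric above target -> add replicas; metric below target -> remove replicas
    return math.ceil(current_replicas * metric_value / target_value)

# 4 replicas, queue length 20, target 10 -> scale out to 8 replicas
print(desired_replicas(4, 20, 10))  # 8
```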
Additionally, we specify safety bounds: the Deployment must have at least 1 replica (line 10) and at most 20 replicas (line 11).
This is an important engineering practice to guard against bugs and misconfiguration (e.g., the unit of the metric value changes from seconds to milliseconds), which could lead to automatic creation of large numbers of replicas.
After applying the `HorizontalPodAutoscaler` object (shown in Listing \ref{src:hpa-scale-v0}) to the cluster, the current configuration and operation of HPA can be observed on the command line (Appendix \ref{hpa-log-messages}) as well as visually with the monitoring setup (\mbox{Figure \ref{fig:hpa-scaling-v0}}).
\begin{lstlisting}[caption=Initial HPA Scaling Policy (\texttt{hpav0}), label=src:hpa-scale-v0, language=yaml, numbers=left]
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
\begin{figure}[ht]
\centering
\includegraphics[width=\textwidth]{images/hpa-scaling-v0.png}
\caption{\label{fig:hpa-scaling-v0} Grafana screenshot of scaling behavior with initial HPA policy (\texttt{hpav0}). Task queue length in green, number of replicas in red.}
\end{figure}
### Horizontal Scaling with KEDA
As noted in Section \ref{kubernetes-event-driven-autoscaler}, KEDA significantly simplifies the setup of metric-based autoscaling.
A monitoring tool, exporters and a metrics adapter (as documented in the previous sections) are not required for using KEDA.
The installation procedure with KEDA's Helm Chart is shown in Appendix \ref{keda-setup}.
Similar to the previous section, we will use the RabbitMQ message broker as a trigger for horizontal scaling (Listing \ref{src:keda-v1}, line 14).
Specifically, KEDA is configured to scale the `executor` Deployment based on the number of messages in a specific queue (line 18-20).
Instead of the queue length, the rate of incoming messages could also be used for scaling.
Listing \ref{src:scaledobject} in Appendix \ref{keda-setup} shows the status of KEDA's `ScaledObject`, the HPA object (created by KEDA) and the Deployment after applying the CRD from \mbox{Listing \ref{src:keda-v1}}.
Of particular note is that the HPA object has a minimum Pod count of one, but the KEDA agent scales the Deployment to zero replicas anyway.
This allows saving resources when there are no tasks for the system.
In the following chapter we evaluate the effectiveness of this *scale-to-zero* behavior and its side effects.
\begin{lstlisting}[caption=ScaledObject CRD for KEDA autoscaling, label=src:keda-v1, language=yaml, numbers=left]
apiVersion: keda.sh/v1alpha1

We follow the guidelines on scientific benchmarking by Hoefler and Belli.
As outlined in the introduction, autoscaling always needs to make a trade-off between optimizing for application performance and optimizing for cost, i.e., allocated resources.
If cost is not an issue, one could simply allocate a fixed, large number of resources and always keep those running --- known as dimensioning for peak load.
However, this is not cost-effective, especially not in a world where cloud resources (such as virtual machines and containers) can be allocated and are billed by the second.
Thus, the goal is to always keep the number of allocated resources as low as possible --- with some reasonable safety margin to compensate for unforeseen deviations.
## Benchmark Setup
<!-- In order to quantify the improvements made by the autoscaling components, we also develop performance and cost benchmarks for the target application. -->
To measure the performance of the system we use one of the application-level metrics we previously exposed in Section \ref{prometheus-exporters}: the duration of configuration runs.
A configuration run refers to the configuration of a fixed set of systems being checked and updated.
This process is commonly triggered by the user through a web interface, thus it is deemed a relevant metric for measuring the performance of the application.
To measure the cost of horizontal scaling we use *replica seconds*: the number of running Pods per second integrated over the time period of the benchmark.
For example, if the benchmark lasts one minute and two replicas are running the entire time, this would result in 120 replica seconds.
Prometheus is not suitable for such time-accurate measurements \cite{PrometheusDocumentation_2021}, therefore we implement our own tool that fetches this data directly from the Kubernetes API.
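The calculation behind this metric is a simple step-function integral; the following is a sketch of what such a tool computes from sampled `(timestamp, replica_count)` pairs (the sampling details are an assumption, not our exact implementation):

```python
def replica_seconds(samples):
    # samples: chronologically ordered (timestamp, replica_count) pairs;
    # the replica count is assumed constant between adjacent samples.
    total = 0.0
    for (t0, count), (t1, _) in zip(samples, samples[1:]):
        total += count * (t1 - t0)
    return total

# Two replicas running for the whole 60-second window -> 120 replica seconds
print(replica_seconds([(0, 2), (60, 2)]))  # 120.0
```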
<!-- \begin{leftrule} -->
The entire benchmark procedure is repeated three times.
We verified the benchmark results do not contain outliers or other inconsistencies.
<!-- If there are outliers they are reported, otherwise the results are averaged where appropriate. -->
Between each benchmark the target application is completely removed from the cluster (the Kubernetes namespace is deleted) and re-installed into a new namespace.
This ensures the benchmark runs are completely isolated and no transient side effects (such as warm caches) are present.
The setup phase described above is entirely scripted to minimize potential for variation.
Configuration runs are started exactly 60 seconds apart from each other.
Each configuration run contains the same amount of work.
The *time to completion* marks the point when all configuration runs have finished processing.
The application continues to run afterwards (until a fixed timeout) to demonstrate the cost-savings made possible by autoscaling.
If the benchmark was stopped as soon as the workload was completed, only the performance (but not the cost) could be compared.
Because of the fixed benchmark duration, the cost-savings can be extrapolated to one day, week or month.
The benchmarks are performed on a Kubernetes cluster consisting of one control plane node and two worker nodes.
The worker nodes have a combined capacity of 136 CPU cores and 1134 GiB memory.
## Functional Verification
To get a baseline for the application behavior, we tested several static values for the number of replicas.
The graph in Figure \ref{fig:performance-cost-benchmark-static} shows the results with the cost (replica seconds) on the x-axis and the performance (time to completion, i.e., time until the simulated user has all desired results) on the y-axis.
Naturally, the lowest cost (4.000 replica seconds) is achieved when using 2 replicas, which also has by far the largest time to completion (2.000 seconds).
The highest cost (14.000 replica seconds) is recorded with 20 replicas, which also has the lowest time to completion (4.000 seconds).
Due to the constant workload size, the benchmarks with 10, 15 and 20 replicas have almost the same time to completion, even though the latter two use significantly more replicas and therefore replica seconds.
This means that for this workload size, more than 10 replicas are inefficient, since most of the replicas are idle (unused) during the execution.
In this scenario, the goal of the autoscaler is to optimize towards the bottom left corner (low cost and high performance), irrespective of the type and size of the given workload.
Data from the monitoring system shows that with 3 replicas the average queue duration (time before tasks are actually executed) continuously rises during the benchmark up to a value of 360 seconds.
With 20 replicas, the average queue time quickly converges to a value of 30 seconds.
This highlights the effect of not having enough executor replicas available which leads to significant delays, since each executor may only process one task at a time.
Figure \ref{fig:config-run-variance-static} shows the execution time of individual configuration runs during the benchmark.
It confirms our intuition that with an overall lower time to completion (as was the case in Figure \ref{fig:performance-cost-benchmark-static}), the execution time of individual workloads is also smaller.
Furthermore, it shows that a low number of replicas has a significant effect on the variance of the configuration runs' execution times.
While each configuration run has the same workload, with a low number of replicas a significant increase in variance (in addition to an increase of the average) can be observed.
This is explained by the fact that with few replicas, the same executor needs to run multiple workloads sequentially, which slows some of them down drastically.
\caption{\label{fig:config-run-variance-static} Execution time of individual configuration runs (3 benchmarks, 10 configuration runs per benchmark). Numbers in blue indicate the mean value.}
\end{figure}
After establishing the performance and cost characteristics of static configurations, we evaluate the characteristics of a basic autoscaling setup.
The first version of the autoscaling policy was shown in Listing \ref{src:hpa-scale-v0}.
We now evaluate the behavior of this policy, which is labeled as `hpav0` in the figures.
As Figure \ref{fig:performance-cost-benchmark-static} shows, the benchmark with this policy had a similar performance and cost as a static configuration with 10 replicas: the application used around 18.600 replica seconds during the benchmark and the total time to completion was approx. 700 seconds.
This result highlights the strengths and weaknesses of this autoscaling policy: it enabled the application to achieve the same performance as with a static configuration of 10 replicas.
As Figure \ref{fig:performance-cost-benchmark-static} shows, this is the maximum amount of performance the application is able to deliver for this particular workload.
At the same time, the autoscaling policy was just as costly as a static configuration of 10 replicas, even though for significant periods the application was running with only 1 replica.
This is explained by Figure \ref{fig:hpa-scaling-v0}: the policy overscaled the number of replicas (more than the necessary value established previously) and even reached the replica limit (`maxReplicas` from Listing \ref{src:hpa-scale-v0}).
Furthermore, the variance in execution time between different configuration runs was slightly larger than with 10 replicas.
This can be explained by the fact that the autoscaler gradually needs to scale up the Deployment at the beginning of the benchmark.
Thus, the executor Deployment does not have the optimal resources immediately available and some tasks need to wait longer in the queue.
This behavior can be observed in \mbox{Figure \ref{fig:hpa-scaling-v0}}.
In summary, the autoscaling policy has performed well (no performance loss, minor introduction of variance), but has not realized any cost-savings in our experiments due to drastic overscaling (provisioning more replicas than required for the workload).
\clearpage
## Cost Optimization
While the initial scaling policy successfully scaled the Deployment during the benchmark, we identified several aspects for improvement.
This section addresses these issues by fine-tuning the scaling policy.
* **Delayed scale-down**: the number of replicas is not reduced soon enough after the workload has finished, as is apparent from the number of queued tasks compared to the number of replicas (Figure \ref{fig:hpa-scaling-v0}).
* **Potential for premature scale-down**: if a task has a long execution time, the autoscaler might reduce the number of replicas too early because the scaling metric only depends on queued tasks.
* **Overscaling of replicas**: our previous experiments with static replica configurations have shown that provisioning 20 replicas is ineffective, since it does not increase performance (as shown in Figure \ref{fig:performance-cost-benchmark-static}).
The delayed scale-down can be tackled by adjusting the *stabilization window* of the scaling policies (also referred to as *cool-down period*).
This setting (shown in Listing \ref{src:downscale-behavior}) specifies how soon HPA starts removing replicas from the Deployment after it detects that the scaling metric is below the target value \cite{KubernetesDocumentationHorizontalPod_2021}.
The default value is 5 minutes (observable in Figure \ref{fig:hpa-scaling-v0}); we adjust the value to 1 minute (line 7).
Decreasing the downscale stabilization window can lead to thrashing (continuous creation and removal of Pods) when workload bursts are more than the specified window apart, but offers better elasticity \cite{QuantifyingCloudElasticityContainerbased_2019}.
In our case this is an acceptable trade-off because these containers start quickly, as this application component is lightweight and does not hold any internal state.
Additionally, this risk is partially mitigated by only allowing HPA to remove 50% of the active replicas per minute (Listing \ref{src:downscale-behavior}, line 4-6).
By default, HPA is allowed to deprovision all Pods <!-- (down to `minReplicas`) --> at the same time \cite{KubernetesDocumentationHorizontalPod_2021}, as it is illustrated in Figure \ref{fig:hpa-scaling-v0} (rapid decrease from 20 to 1 replica).
\begin{lstlisting}[caption=Improved Downscale Behavior for HPA, label=src:downscale-behavior, language=yaml, numbers=left]
behavior:
  scaleDown:
    policies:
    - type: Percent
      value: 50
      periodSeconds: 60
    stabilizationWindowSeconds: 60
\end{lstlisting}

The potential for premature scale-down is reduced by not only considering the tasks waiting in the queue, but also the tasks currently being processed.
Just because all tasks have been taken out of the message queue by the executors does not mean that the number of executors can be reduced, as they might still be processing the tasks.
For this purpose, HPA allows specifying multiple metrics (Listing \ref{src:improved-scaling-logic}): it calculates the desired replica count for all specified metrics individually and then scales the Deployment based on the maximum results.
Finally, the issue of overscaling replicas can be mitigated by switching to a different baseline scaling metric, shown in Listing \ref{src:improved-scaling-logic} (line 5).
This metric immediately represents the number of tasks available for the executor, rather than all future tasks as before.
This distinction is important because future tasks might have interdependencies (e.g., if task #1 fails, tasks #2 and #3 do not need to be executed).
Additionally, until now we have been using an absolute value as a target, e.g., the total number of tasks in queue.
It makes more sense to use a metric that incorporates the current number of replicas as a ratio.
This is necessary because -- with the change explained previously -- the scaling metric contains the number of available tasks, which are by definition distributed across all executors.
Instead of using the scaling metric directly, the raw value is averaged: it is divided by the number of active Pod replicas (Listing \ref{src:improved-scaling-logic}, line 7-8) \cite{KubernetesDocumentationHorizontalPod_2021}.
\begin{lstlisting}[caption=HPA scaling based on multiple metrics, label=src:improved-scaling-logic, language=yaml, numbers=left]
metrics:
- type: External
averageValue: 1
\end{lstlisting}
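A short sketch (simplified, ignoring HPA's tolerance band) shows why an `AverageValue` target makes the current replica count cancel out of the scaling formula:

```python
import math

def desired_replicas_average(raw_metric: float, average_target: float, current_replicas: int) -> int:
    # HPA divides the raw external metric across the ready Pods and then
    # applies its usual ratio formula; the replica count cancels out:
    #   ceil(replicas * (raw / replicas) / averageTarget) == ceil(raw / averageTarget)
    per_pod_value = raw_metric / current_replicas
    return math.ceil(current_replicas * per_pod_value / average_target)

# 7 available tasks with averageValue 2 -> 4 replicas, regardless of the current count
print(desired_replicas_average(7, 2, 3))   # 4
print(desired_replicas_average(7, 2, 10))  # 4
```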
To find the appropriate value for `averageValue`, we perform several benchmarks, the results of which are shown in Figures \ref{fig:performance-cost-benchmark-autoscaling} to \ref{fig:config-run-variance-autoscaling}.
The different `averageValue` settings have been labeled as `hpav1`, `hpav2`, `hpav3` and `hpav4` for the values 1, 2, 3, and 4, respectively.
These new benchmarks also include the other optimizations outlined above.
For comparison, the following figures also show the previously discussed benchmark results of the initial autoscaling policy (`hpav0`) and a statically scaled Deployment with 5 and 10 replicas.
Due to the downscaling optimization outlined above, all of the scaling policies were able to reduce the cost (Figure \ref{fig:performance-cost-benchmark-autoscaling}) by 50% as they allow scaling the Deployment down sooner.
This behavior is illustrated in Figure \ref{fig:comparison-hpa-scaling-policies}.
Additional cost savings can be realized by reducing the replica count to 0 when there are no tasks to be processed.
In our application architecture this is feasible because the executors take the tasks out of a message queue, which acts as a buffer when no executors are available (yet).
However, HPA does not support this by default \cite{KubernetesDocumentationHorizontalPod_2021}: setting the `minReplicas` field (see Listing \ref{src:hpa-scale-v0}) to 0 is only possible by changing the configuration of the Controller Manager (via a command-line argument), which requires re-deploying the cluster.
KEDA works around this HPA limitation by having an agent that scales the Deployment to zero when no events (tasks) are active.
\clearpage
\begin{figure}[ht!]
\centering
\includegraphics[width=\textwidth]{plots/mock-results/performance-cost-benchmark-autoscaling.pdf}
\caption{\label{fig:performance-cost-benchmark-autoscaling} Results of autoscaling benchmark. Scaling setting indicates static number of replicas or autoscaling policy.}
\end{figure}
\clearpage
\begin{figure}[ht!]
\centering
\includegraphics[width=\textwidth]{images/comparison-of-hpa-scaling-policies.png}
\caption{\label{fig:comparison-hpa-scaling-policies} Grafana screenshot of different horizontal scaling policies. Y-axis represents the number of active replicas, X-axis represents time.}
\end{figure}
\clearpage
Despite the significant cost reduction, all scaling policies performed nearly as well as the fastest configuration (with 10 replicas).
The effectiveness of the autoscaling policies is confirmed by comparing the average time tasks spend in the queue (Figure \ref{fig:autoscaling-average-queue}): it increases gradually from 38.1 seconds with `hpav1`, over 45.4 seconds with `hpav2` and 76.4 seconds with `hpav3`, to 104.0 seconds with `hpav4`.
A comparison against static dimensioning shows that the values are quite stable (i.e., the average value is not continuously rising) and are almost as low as the best performing static scaling configuration (30 seconds with 20 replicas).
Therefore, different parameters for `averageValue` can be used to tweak the trade-off between application performance and cost.
With our specific test scenario a value of one provides excellent performance while already realizing major cost savings.
Clearly, different user preferences (performance-cost trade-off) will require different values.
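The `averageValue` knob is set in the metric target of the HPA resource. A simplified fragment is sketched below; the Deployment and metric names are assumptions for illustration, and the policies compared above vary only the target value:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: executor-hpa           # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: executor             # assumed Deployment name
  minReplicas: 1               # HPA cannot go below 1 without cluster-level changes
  maxReplicas: 10
  metrics:
    - type: External
      external:
        metric:
          name: queued_tasks   # assumed metric name exposed by the metrics adapter
        target:
          type: AverageValue
          averageValue: "1"    # queued tasks per replica; raising this trades performance for cost
```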
\begin{figure}[ht]
\centering
\includegraphics[width=\textwidth]{images/scaling-policies-average-queue-time.png}
\caption{\label{fig:autoscaling-average-queue} Grafana screenshot of queuing behavior with different autoscaling policies. Y-axis shows average time tasks are queued in seconds.}
\end{figure}
Figure \ref{fig:config-run-variance-autoscaling} shows the impact of the `averageValue` parameter on the duration of individual configuration runs.
A larger value effectively allows more tasks to be waiting in the queue without triggering any scaling.
Since configuration runs are made up of many individual tasks, the waiting time of these tasks affects the duration of the entire configuration run.
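This impact follows from the scaling algorithm documented for HPA, $\text{desiredReplicas} = \lceil \text{currentReplicas} \cdot \text{currentMetricValue} / \text{desiredMetricValue} \rceil$. For a target of type `AverageValue`, the per-replica average cancels against the replica count, so the controller converges on

$$\text{desiredReplicas} = \left\lceil \frac{\text{queued tasks}}{\texttt{averageValue}} \right\rceil,$$

clamped to the configured minimum and maximum replica counts. For example, with `averageValue: 2` a backlog of 15 tasks settles at $\lceil 15/2 \rceil = 8$ replicas, whereas `averageValue: 1` requests 15.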
Between the autoscaling policies, the `averageValue` does not seem to have a major impact on the variance.

\begin{figure}[ht!]
\centering
\includegraphics[width=\textwidth]{plots/mock-results/config-run-variance.pdf}
\caption{\label{fig:config-run-variance-autoscaling} Execution time of individual configuration runs. Numbers at the top (in blue) indicate the mean value.}
\end{figure}
\clearpage
## Real-world test scenario
The previous sections validated the functionality of the autoscaling setup as well as evaluated several scaling metrics and parameters based on an artificial workload.
In this section we set up production-like target systems, consisting of 25 virtual machines, and configure the application to connect to these systems via SSH.
In addition, the configuration scripts used to interact with the systems are representative of tasks carried out in production environments:
each script consists of 42 checks for system security settings such as administrator access, login retry interval etc.
We repeat the benchmark scenario described in Section \ref{benchmark-setup} with adjustments for the production-like scenario:
the overall benchmark time (75 minutes), number of configuration runs (50) and maximum number of replicas (50) are increased to accommodate the heavier workload.
Thus, the results from Section \ref{benchmark-setup} cannot be compared in absolute terms to the results presented in this section, though we expect them to confirm the trends from our previous findings.
Appendix \ref{hpa-scaling-policy} (Listing \ref{src:real-autoscaling-policy}) shows the full horizontal autoscaling policy used in HPA benchmarks.
The KEDA benchmarks use the scaling policy shown in Listing \ref{src:keda-v1}.
The benchmark results shown in Figure \ref{fig:real-performance-cost-benchmark} confirm our previous experiments:
the autoscaling policy `sp-v1` (blue) as well as KEDA (green) achieve the same performance as a static replica number of 10.
This is the maximum performance the application is able to achieve in this scenario, because even with higher replica counts (e.g., 20 in Figure \ref{fig:real-performance-cost-benchmark}) the performance remains the same.
At the same time, all scaling policies are able to consistently reduce the cost during the benchmark:
`sp-v1` has 19.3% lower cost and `keda` has 18.1% lower cost while maintaining the same performance as 10 replicas.
The identical performance is explained by the fact that KEDA internally uses HPA for autoscaling, and in this case the same target metric and value were specified for both KEDA and HPA.
Logically, the `keda` autoscaling policy should have a lower cost than `sp-v1` because KEDA has the ability to scale the Deployment down to zero replicas (as opposed to `sp-v1` which has a minimum of one replica).
These cost savings did not manifest themselves in the benchmarks because most of the time there is load on the system.
Scaling the Deployment to zero has larger benefits when there are significant periods where a particular service is completely idle.
`sp-v2` has 38.6% lower cost while having worse performance than 10 replicas.
This is due to the fact that `sp-v2` allows more tasks to be in queue compared to `sp-v1` and `keda`, thereby increasing the time to completion.
\begin{figure}[ht]
\centering
\includegraphics[width=\textwidth]{plots/real-results-static/performance-cost-benchmark.pdf}
\caption{\label{fig:real-performance-cost-benchmark} Results of autoscaling benchmark in production-like scenario with static workload. Scaling setting indicates static number of replicas or autoscaling policy.}
\end{figure}
Looking at the execution time of individual configuration runs (Figure \ref{fig:real-config-run-variance}), we see that the autoscaling policies have a higher variance compared to a static over-provisioning of resources.
While the means of `sp-v1` and `keda` are just slightly elevated compared to the result of 10 static replicas, the mean of `sp-v2` is double that of `sp-v1`.
Logically this is correct, because scaling policy `sp-v2` allows twice the number of tasks in the queue.
We can also see that the variance of `keda` is slightly higher than that of `sp-v1`.
This can be explained by the additional latency that KEDA has when scaling the Deployment up from zero to one replica.
\begin{figure}[ht]
\centering
\includegraphics[width=\textwidth]{plots/real-results-static/config-run-variance.pdf}
\caption{\label{fig:real-config-run-variance} Execution time of individual configuration runs in production-like scenario with static workload. Numbers at the top (in blue) indicate the mean value.}
\end{figure}
Exemplary behaviors of the autoscaling policies during the benchmark are shown in Figure \ref{fig:real-scaling-activity-static}.
Blue shows the replica count of `sp-v1`, orange that of `sp-v2`, and green that of `keda`.