
Adjustments and improvements according to Markus' feedback (part 2)

master
Jack Henschel 2 years ago
parent c4b99205bd
commit b6424380db
Changed files:
1. include/03-research.md (4 changed lines)
2. include/04-implementation.md (33 changed lines)
3. include/05-evaluation.md (53 changed lines)
4. include/06-conclusion.md (10 changed lines)

@ -276,7 +276,7 @@ Because Deployments need to be explicitly marked as managed by KEDA, it is a fle
While autoscaling has been adequately considered in the literature, the following survey provides an overview and discussion of proposals for novel Kubernetes autoscalers.
<!-- As there is a lack of overview about Kubernetes autoscalers in the literature, we address this issue by presenting and discussing high-quality research proposals for Kubernetes pod autoscalers from relevant, peer-reviewed publications. -->
The survey considers only cluster internal scaling mechanisms (i.e., vertical and horizontal Pod scaling), external cluster scaling is outside the scope of this study (e.g., \cite{ExperimentalEvaluationKubernetesCluster_2020,DynamicallyAdjustingScaleKubernetes_2019}).
The survey considers only cluster-internal scaling mechanisms (i.e., vertical and horizontal scaling of Pods); external cluster scaling is outside the scope of this study (e.g., \cite{ExperimentalEvaluationKubernetesCluster_2020,DynamicallyAdjustingScaleKubernetes_2019}).
This choice was made because the nature of the autoscaling decisions differs considerably between these two dimensions.
The same reasoning applies to scheduling algorithms.
While there have been interesting proposals for improved Kubernetes schedulers \cite{ClientSideSchedulingBasedApplication_2017,CaravelBurstTolerantScheduling_2019,ImprovingDataCenterEfficiency_2019}, scaling and scheduling are two fundamentally different operations.
@ -322,6 +322,8 @@ Nevertheless, there are certainly advances to be made by having some amount of c
% end of renewcommand
}
\clearpage
The nomenclature in Table \ref{tab:comparison-k8s-autoscalers} mostly follows the taxonomy of \cite{AutoScalingWebApplicationsClouds_2018} and \cite{ReviewAutoscalingTechniquesElastic_2014}.
The *Architecture* column refers to the logical application architecture the autoscaler focuses on.
By default, an autoscaler scales each service individually based on a set of criteria, in which case the column is labeled *single*.

@ -191,7 +191,7 @@ A histogram provides a balance between tracking the duration of each task indivi
While developing this custom exporter, we followed the best practices and conventions for writing Prometheus exporters^[<https://prometheus.io/docs/instrumenting/writing_exporters/>] and metric naming^[<https://prometheus.io/docs/practices/naming/>].
The custom exporter was packaged into a container image and deployed alongside the target application.
We were able to confirm that it provides useful high-level metrics about the application by setting up a Grafana dashboard for these metrics and observing them.
We were able to confirm that it provides useful high-level metrics about the application by setting up a Grafana dashboard for these metrics and verifying that the displayed values match the expected application behavior.
The exposed metrics allow reasoning about the application behavior and making appropriate scaling decisions based on the collected metrics.
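As an illustration of this deployment model, the sketch below shows one way the exporter could be attached to the target application as a sidecar container. All names, image references, the port and the scrape annotations are hypothetical placeholders; the actual setup (including whether a sidecar or a separate Deployment is used) may differ.
\begin{lstlisting}[language=yaml]
apiVersion: apps/v1
kind: Deployment
metadata:
  name: executor                       # hypothetical name of the target application
spec:
  selector:
    matchLabels:
      app: executor
  template:
    metadata:
      labels:
        app: executor
      annotations:
        prometheus.io/scrape: "true"   # assumes Prometheus is set up to honor these annotations
        prometheus.io/port: "9422"
    spec:
      containers:
      - name: executor                 # the application container itself
        image: registry.example.com/executor:latest
      - name: metrics-exporter         # custom Prometheus exporter running as a sidecar
        image: registry.example.com/executor-exporter:latest
        ports:
        - name: metrics
          containerPort: 9422          # port on which /metrics is exposed
\end{lstlisting}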
\begin{figure}[ht]
@ -200,10 +200,10 @@ The exposed metrics allow reasoning about the application behavior and making ap
\caption{\label{fig:scaling-flow} Flow of metrics used for scaling. Arrows denote the logical flow of data. Orange arrows symbolize raw HTTP metrics.}
\end{figure}
To use these custom metrics with Kubernetes HPA (see Section \ref{horizontal-pod-autoscaler}), we need to install another component into the cluster: a *metrics adapter* (Figure \ref{fig:scaling-flow}).
To use these custom metrics with Kubernetes HPA (see Section \ref{horizontal-pod-autoscaler}), another component needs to be installed into the cluster: a *metrics adapter* (Figure \ref{fig:scaling-flow}).
This component is responsible for translating the metrics from Prometheus into a format compatible with the Kubernetes metrics API (Section \ref{kubernetes-components}).
We chose the *prometheus-adapter* project^[<https://github.com/kubernetes-sigs/prometheus-adapter>] for this purpose, as our use case is focused on Prometheus metrics.
Another project with a similar goal is *kube-metrics-adapter*^[<https://github.com/zalando-incubator/kube-metrics-adapter>] which allows utilizing a wider range of data sources, for example InfluxDB or AWS.
Another project with a similar goal is *kube-metrics-adapter*^[<https://github.com/zalando-incubator/kube-metrics-adapter>] which allows utilizing a wider range of data sources, for example InfluxDB or AWS SQS queues.
The installation of the adapter was performed with a Helm Chart and is detailed in Appendix \ref{prometheus-adapter-setup}.
In essence, the adapter is configured with a PromQL query it should execute.
The query can be parameterized with several labels and parameter overrides.
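As an illustration, an external-metric rule for the prometheus-adapter could look like the following sketch (expressed here as Helm values). The metric name `rabbitmq_queue_messages` and the label selectors are hypothetical and must be adapted to the metrics actually exposed in the cluster; the configuration used in this work is detailed in Appendix \ref{prometheus-adapter-setup}.
\begin{lstlisting}[language=yaml]
rules:
  external:
  - seriesQuery: 'rabbitmq_queue_messages{namespace!="",queue!=""}'  # hypothetical metric name
    resources:
      overrides:
        namespace: {resource: "namespace"}   # map the namespace label to the Kubernetes namespace
    metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
\end{lstlisting}
The `metricsQuery` template is where the PromQL query mentioned above is defined; the adapter substitutes the series name and label matchers at query time.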
@ -221,6 +221,7 @@ Assuming that the workload is automatically distributed across all instances, sc
It should be noted that this assumption does not always hold true.
Special attention needs to be paid to (partially) stateful services.
Nguyen and Kim \cite{HighlyScalableLoadBalancing_2020} investigated the load balancing of stateful applications on Kubernetes.
They found that, in particular, the distribution and load balancing of leaders throughout the cluster are important for maximizing performance.
Two other aspects need to be considered when implementing horizontal scaling on Kubernetes: microservice startup and shutdown.
@ -234,9 +235,9 @@ when a Pod is shut down, Kubernetes stops routing new traffic to the replica and
the application should finish serving the outstanding requests it has accepted and terminate afterwards \cite{KubernetesPatterns_2019}.
This behavior is particularly important when using autoscaling, since individual Pods are frequently created and removed.
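For illustration, the Kubernetes side of such a graceful shutdown can be configured with a termination grace period and, if the application does not handle SIGTERM itself, a preStop hook. The container name, image and timing values below are examples only, not the configuration used in this work.
\begin{lstlisting}[language=yaml]
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60  # allow up to 60 seconds to finish outstanding work (example value)
      containers:
      - name: executor                   # hypothetical container name
        image: registry.example.com/executor:latest
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 10"]  # give endpoint removal time to propagate (example value)
\end{lstlisting}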
*Vertical scaling* (scaling up or down) refers to adjusting requested resources (compute, memory, network, storage) allocated to a service to the actual usage.
*Vertical scaling* (scaling up or down) refers to adjusting requested resources (compute, memory, network, storage) allocated to a service based on the actual usage.
By giving more resources to one or multiple instances, they are able to handle more workload.
While \improve{most people only think about} scaling up (allocating more resources), the opposite is actually far more desirable: scaling down.
While most industry practitioners only focus on scaling up (allocating more resources), the opposite is actually far more desirable: scaling down.
The Autopilot paper from researchers at Google shows that significant cost savings can be realized by automatically adjusting the allocated resources, i.e., vertical scaling \cite{AutopilotWorkloadAutoscalingGoogle_2020}.
Some of the research proposals discussed in Section \ref{research-proposals-for-kubernetes-autoscalers} have shown potential to be effective and cost-efficient autoscalers, but none of them offer a publicly available implementation.
@ -258,7 +259,7 @@ Since only the *PodTemplate* can be modified, the Pod needs to be deleted and cr
Thus, it has potential for service disruption, especially when the number of Pod replicas is small (removing 1 out of 100 Pod replicas does not make a significant difference, but removing 1 out of 3 replicas can impact overall service health).
Nevertheless, the VPA Recommender can be a useful tool for determining appropriate resource requests and limits, as we show in the following section.
After the component has been installed into the cluster, VPA needs to be instructed to monitor our application so that it can build its internal resource usage model and produce an estimate.
After the component is installed into the cluster, VPA needs to be instructed to monitor our application so that it can build its internal resource usage model and produce an estimate.
VPA can be enabled and configured for each application running on Kubernetes individually.
<!-- In our case, we are configuring it for a *Deployment* (see Section \ref{kubernetes-objects}). -->
This is done through a special object called a *Custom Resource Definition*, or *CRD* for short.
@ -290,7 +291,7 @@ spec:
controlledResources: ["cpu", "memory"]
\end{lstlisting}
The CRD is installed into the cluster with `kubectl apply -f filename.yaml` in the same namespace as the target Deployment.
The CRD is added to the cluster with `kubectl apply -f filename.yaml` in the same namespace as the target Deployment.
Once the VPA has been able to collect metrics for a while, the resource request and limit recommendations can be retrieved as shown in Listing \ref{src:vpa-recommendation-example}.
Section \ref{vertical-pod-autoscaler} explains the meaning and calculations behind these values.
For our use case, only the upper bound and target recommendations are relevant.
@ -332,7 +333,7 @@ This section details the configuration of Kubernetes' Horizontal Pod Autoscaler
As outlined in Section \ref{horizontal-pod-autoscaler}, HPA is part of the Controller Manager and therefore part of every Kubernetes installation.
Hence, no installation is required; HPA only needs to be configured for each scaling target.
The goal of the HPA in our scenario is to give the application a similar performance to a static overprovisioning of resources<!-- (low average queue time and time to completion) -->, while keeping the cost <!-- (replica seconds) --> to a minimum.
The goal of the HPA in our scenario is to give the application similar performance to a static overprovisioning of resources<!-- (low average queue time and time to completion) -->, while keeping the cost <!-- (replica seconds) --> at a minimum.
The exact metrics for quantifying these dimensions are discussed in Chapter \ref{evaluation}.
Additionally, the autoscaler should be able to find this optimum trade-off with varying workload sizes.
Mathematically, the resulting system can be described as a queuing system where the number of workers is adjusted dynamically based on the queue length \cite{QueueingSystemOnDemandNumber_2012}.
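For reference, the proportional control rule that HPA applies in each reconciliation loop (documented upstream and described in Section \ref{horizontal-pod-autoscaler}) can be summarized as

$$\text{desiredReplicas} = \left\lceil \text{currentReplicas} \times \frac{\text{currentMetricValue}}{\text{targetMetricValue}} \right\rceil$$

where, in our scenario, the current metric value is the queue length obtained through the metrics adapter.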
@ -344,13 +345,13 @@ Thus, we need to scale this Deployment based on a higher-level metric which we h
We have identified the current queue length (used in Figure \ref{fig:hpa-scaling-v0}) as a meaningful autoscaling metric.
Thanks to the metrics adapter installed in Section \ref{prometheus-exporters}, we can configure the HPA to scale based on external metrics, as shown in Listing \ref{src:hpa-scale-v0}.
The object structure is similar to the CRD of the VPA.
It specifies a *target*: the object which should be scaled (line 6-9).
We are using an external metric (line 13-16) based on which the target is scaled.
It specifies a *target* -- the Kubernetes object which should be scaled (line 6-9) -- and the metric based on which the target is scaled (line 13-16).
<!-- We are using an external metric (line 13-16) based on which the target is scaled. -->
The goal of the horizontal autoscaler is to make the metric value equal to the target value (line 17-19) by adjusting the number of Pods.
When the metric value is above the target, it creates more instances.
When the metric value is below the target, it removes instances.
For details about the algorithm refer to Section \ref{horizontal-pod-autoscaler}.
Additionally, we also specify safety bounds: the Deployment must have at least 1 replica (line 10) and at most 20 replicas (line 11).
Additionally, we specify safety bounds: the Deployment must have at least 1 replica (line 10) and at most 20 replicas (line 11).
This is an important engineering practice to guard against bugs and misconfiguration (e.g., the scale of the metric value changes from seconds to milliseconds), which could lead to the autoscaler creating large numbers of instances.
After applying the object from Listing \ref{src:hpa-scale-v0} to the cluster, the current configuration and operation of HPA can be observed on the command line (Appendix \ref{hpa-log-messages}) as well as visually with the monitoring setup (Figure \ref{fig:hpa-scaling-v0}).
@ -392,9 +393,11 @@ The installation procedure with KEDA's Helm Chart is shown in Appendix \ref{keda
Similar to the previous section, we will use the RabbitMQ event source as a trigger for horizontal scaling (Listing \ref{src:keda-v1}, line 14).
Specifically, KEDA is configured to scale the Deployment based on the number of messages in a specific queue (line 18-20).
Instead of scaling based on queue length, the rate of messages can also be used.
Alternatively to scaling based on queue length, the rate of messages could also be used.
\clearpage
Listing \ref{src:scaledobject} in Appendix \ref{keda-setup} shows the status of KEDA's ScaledObject, HPA object (created by KEDA) and the Deployment after applying the CRD from Listing \ref{src:keda-v1}.
Of particular note is that the HPA object has a minimum Pod count of one, but the KEDA agent scales the Deployment to zero replicas anyway.
This allows saving resources when there are no tasks for the system.
\begin{lstlisting}[caption=ScaledObject CRD for KEDA autoscaling, label=src:keda-v1, language=yaml, numbers=left]
apiVersion: keda.sh/v1alpha1
@ -418,7 +421,3 @@ spec:
mode: QueueLength
value: '1'
\end{lstlisting}
Listing \ref{src:scaledobject} in Appendix \ref{keda-setup} shows the status of KEDA's ScaledObject, HPA object (created by KEDA) and the Deployment after applying the CRD from Listing \ref{src:keda-v1}.
Of particular note is that the HPA object has a minimum Pod count of one, but the KEDA agent scales the Deployment to zero replicas anyway.
This allows saving resources when there are no tasks for the system.

@ -53,8 +53,8 @@ An individual benchmark run is structured as follows:
1. Set up and initialize target application from scratch.
2. Start the benchmark timer.
3. After 60 seconds, start a configuration run; repeat this sequentially ten times (one run every 60 seconds).
4. Wait until all configuration runs finish. This marks the *time to completion*.
5. Continue running the application for 30 minutes after starting the benchmark timer. Then, stop the replica seconds counter.
4. Wait until all configuration runs finish (this marks the *time to completion*).
5. Continue running the application for 30 minutes after starting the benchmark timer. Then, stop the *replica seconds* counter.
6. Terminate the application.
The entire benchmark procedure is repeated three times.
@ -68,7 +68,7 @@ Configuration runs are started exactly 60 seconds apart from each other.
Since they take longer than 60 seconds (see Figure \ref{fig:config-run-variance-static}), the application needs to process multiple configuration runs in parallel.
Each configuration run contains the same amount of work.
The *time to completion* marks the point when all configuration runs have finished processing.
The application continues to run afterwards (until a fixed timeout) to demonstrate the cost-savings made possible through autoscaling.
The application continues to run afterwards (until a fixed timeout) to demonstrate the cost-savings made possible by autoscaling.
If the benchmark was stopped as soon as the workload was completed, only the performance (but not the cost) could be compared.
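As a sketch of one possible formalization (not a definition taken from elsewhere), the *replica seconds* cost metric can be written as

$$\text{replica seconds} = \int_{0}^{T} r(t)\,\mathrm{d}t$$

where $r(t)$ is the number of active executor replicas at time $t$ and $T$ is the fixed benchmark duration.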
The benchmarks are run on a Kubernetes cluster consisting of one control plane node and two worker nodes.
@ -106,11 +106,11 @@ In this scenario, the goal of the autoscaler is to optimize towards the bottom l
<!-- With just 3 replicas, there are many tasks queued for a long period of time (yellow line), which causes the average queue time (green line) to continuously rise, up to a value of 6 minutes. -->
<!-- With 20 replicas, the average queue time never rises above 30 seconds. -->
Data from the monitoring system shows that with 3 replicas the average queue time (time before tasks are actually executed) continuously rises during the benchmark up to a value of 6 minutes.
Data from the monitoring system shows that, with 3 replicas, the average queue duration (the time before tasks are actually executed) rises continuously during the benchmark, up to a value of 360 seconds.
With 20 replicas, the average queue time quickly converges to a value of 30 seconds.
Figure \ref{fig:config-run-variance-static} shows the execution time of individual configuration runs during the benchmark.
It confirms our intuition that with an overall lower time-to-completion (as it was the case in Figure \ref{fig:performance-cost-benchmark-static}), the time of individual workloads is also smaller.
It confirms our intuition that with an overall lower time to completion (as was the case in Figure \ref{fig:performance-cost-benchmark-static}), the execution time of individual workloads is also lower.
It also shows that a low number of replicas has a significant effect on the variance of the configuration runs' execution times.
While each configuration run has the same workload, with a low number of replicas a significant increase in variance (in addition to an increase of the average) can be observed.
This is explained by the fact that with few replicas, the same executor needs to run multiple workloads sequentially, which slows some of them down drastically.
@ -126,10 +126,11 @@ The first version of the autoscaling policy has been shown in Listing \ref{src:h
We now evaluate the behavior of this policy, which is labeled as `hpav0` in the figures.
As Figure \ref{fig:performance-cost-benchmark-static} shows, the benchmark with this policy had a similar performance and cost as a static configuration with 10 replicas: the application used around 18,600 replica seconds during the benchmark and the total time to completion was approximately 700 seconds.
However, the variance in execution time between different configuration runs was slightly larger than with 10 replicas (Figure \ref{fig:config-run-variance-static}).
This can be explained by the fact that the autoscaler gradually needs to scale up the Deployment at the beginning of the benchmark and thus does not have the optimal resources immediately available.
The behavior can be observed in Figure \ref{fig:hpa-scaling-v0}.
This can be explained by the fact that the autoscaler gradually needs to scale up the Deployment at the beginning of the benchmark.
Thus, the executor Deployment does not have the optimal resources immediately available.
This behavior can be observed in Figure \ref{fig:hpa-scaling-v0}.
This result highlights the strengths and weaknesses of this autoscaling policy: it enabled the application to achieve the same performance as with a static configuration of 10 replicas.
The result highlights the strengths and weaknesses of this autoscaling policy: it enabled the application to achieve the same performance as with a static configuration of 10 replicas.
As Figure \ref{fig:performance-cost-benchmark-static} shows, this is the maximum amount of performance the application is able to deliver for this particular workload.
At the same time, the autoscaling policy was just as costly as a static configuration of 10 replicas, even though for significant periods the application was running with only 1 replica.
This is explained by Figure \ref{fig:hpa-scaling-v0}: the policy overscaled the number of replicas (more than the necessary value established previously) and even reached the replica limit (`maxReplicas` in Listing \ref{src:hpa-scale-v0}).
@ -148,7 +149,7 @@ This section addresses these issues by fine-tuning the scaling policy.
The delayed scale-down can be tackled by adjusting the *stabilization window* of the scaling policies (also referred to as *cool-down period*).
This setting (shown in Listing \ref{src:downscale-behavior}) specifies how soon HPA starts to remove replicas from the Deployment after it has detected that the scaling metric is below the target value \cite{KubernetesDocumentationHorizontalPod_2021}.
The default value is 5 minutes (observable in Figure \ref{fig:hpa-scaling-v0}); we adjust the value to 1 minute (line 7).
Decreasing the downscale stabilization window can lead to thrashing (continuous creating and removing of Pods) when workload bursts are more than the specified window apart, but offers better elasticity \cite{QuantifyingCloudElasticityContainerbased_2019}.
Decreasing the downscale stabilization window can lead to thrashing (continuous creation and removal of Pods) when workload bursts are more than the specified window apart, but offers better elasticity \cite{QuantifyingCloudElasticityContainerbased_2019}.
In our case this is an acceptable trade-off because these containers start quickly, as this application component is lightweight and does not hold any internal state.
Additionally, this risk is partially mitigated by only allowing HPA to remove 50% of the active replicas per minute (Listing \ref{src:downscale-behavior}, line 4-6).
By default, HPA is allowed to deprovision all Pods <!-- (down to `minReplicas`) --> at the same time \cite{KubernetesDocumentationHorizontalPod_2021}, as illustrated in Figure \ref{fig:hpa-scaling-v0} (rapid decrease from 20 to 1 replica).
@ -163,7 +164,7 @@ By default, HPA is allowed to deprovision all Pods <!-- (down to `minReplicas`)
stabilizationWindowSeconds: 60
\end{lstlisting}
The potential for premature scale down is reduced by not only considering the queued tasks as a scaling metric, but also the currently running tasks.
The potential for premature scale-down is reduced by not only considering the queued tasks as a scaling metric, but also the currently running tasks.
Just because all tasks have been taken out of the message queue by the executors does not mean that the number of executors can be reduced, as they might still be processing the tasks.
For this purpose, HPA allows specifying multiple metrics (Listing \ref{src:improved-scaling-logic}): it calculates the desired replica count for all specified metrics individually and then scales the Deployment based on the maximum results.
@ -186,17 +187,17 @@ For this purpose, HPA allows specifying multiple metrics (Listing \ref{src:impro
\end{lstlisting}
Finally, the issue of overscaling replicas can be mitigated by switching to a different baseline scaling metric, shown in Listing \ref{src:improved-scaling-logic} (line 5).
This metric immediately represents the number of tasks available for the executor, not all future tasks like before.
This metric represents the number of tasks immediately available to the executors, rather than all future tasks as before.
This distinction is important because future tasks might have interdependencies (e.g., if task #1 fails, tasks #2 and #3 do not need to be executed).
Additionally, until now we have been using an absolute value as a target, e.g., the total number of tasks in queue.
It makes more sense to use a metric that incorporates the current number of replicas as a ratio.
This is necessary because -- with the change explained previously -- the scaling metric contains the number of available tasks, which are by definition distributed across all executors.
Instead of using the scaling metric directly, the raw value is averaged: it is divided by the number of active Pod replicas (Listing \ref{src:improved-scaling-logic}, line 7-8) \cite{KubernetesDocumentationHorizontalPod_2021}.
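As a sketch consistent with the upstream HPA documentation (not a formula quoted from it), using an `averageValue` target effectively reduces HPA's proportional rule (Section \ref{horizontal-pod-autoscaler}) to dividing the total metric value by the configured average:

$$\text{desiredReplicas} = \left\lceil \frac{\sum \text{metricValue}}{\texttt{averageValue}} \right\rceil$$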
To find the appropriate value(s) for `averageValue`, we performed several benchmarks, the results of which are shown Figure \ref{fig:performance-cost-benchmark-autoscaling} and \ref{fig:config-run-variance-autoscaling}.
To find the appropriate value for `averageValue`, we performed several benchmarks, the results of which are shown in Figures \ref{fig:performance-cost-benchmark-autoscaling} and \ref{fig:config-run-variance-autoscaling}.
The different `averageValue` settings have been labeled `hpav1`, `hpav2`, `hpav3` and `hpav4` for the values 1, 2, 3, and 4, respectively.
These new benchmarks also include the other optimizations outlined above.
For comparison, the figures also show the previously discussed benchmark results of the initial autoscaling policy (`hpav0`) and a statically scaled Deployment with 5 and 10 replicas.
For comparison, the following figures also show the previously discussed benchmark results of the initial autoscaling policy (`hpav0`) and a statically scaled Deployment with 5 and 10 replicas.
\begin{figure}[ht]
\centering
@ -204,7 +205,9 @@ For comparison, the figures also show the previously discussed benchmark results
\caption{\label{fig:performance-cost-benchmark-autoscaling} Results of autoscaling benchmark. Scaling setting indicates static number of replicas or autoscaling policy.}
\end{figure}
Thanks to the downscaling optimization outlined above, all of the scaling policies were able to reduce the cost (Figure \ref{fig:performance-cost-benchmark-autoscaling}) by 50% as they were allowed to scale down the Deployment sooner.
\clearpage
Due to the downscaling optimization outlined above, all of the scaling policies were able to reduce the cost (Figure \ref{fig:performance-cost-benchmark-autoscaling}) by 50%, as they scale the Deployment down sooner.
This behavior is illustrated in Figure \ref{fig:comparison-hpa-scaling-policies}.
Additional cost savings can be realized by reducing the replica count to 0 when there are no tasks to be processed.
In our application architecture this is feasible because the executors are taking the tasks out of a message queue, which acts as a buffer when no executors are available (yet).
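In KEDA, this behavior is governed by the `minReplicaCount` and `cooldownPeriod` fields of the ScaledObject; zero is already KEDA's default minimum, and the sketch below (with example values and a hypothetical object name) only makes this explicit alongside Listing \ref{src:keda-v1}.
\begin{lstlisting}[language=yaml]
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: executor-scaler        # hypothetical name
spec:
  scaleTargetRef:
    name: executor             # Deployment to scale (hypothetical name)
  minReplicaCount: 0           # allow scaling down to zero replicas when the queue is empty
  cooldownPeriod: 300          # seconds to wait after the last active trigger before scaling to zero (example value)
\end{lstlisting}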
@ -221,7 +224,7 @@ Despite the significant cost reduction, all scaling policies performed nearly as
This establishes the effectiveness of the autoscaling policies, as confirmed by comparing the average time tasks spend in queue (Figure \ref{fig:autoscaling-average-queue}): there is a gradual increase from 38.1 seconds with `hpav1`, through 45.4 seconds with `hpav2` and 76.4 seconds with `hpav3`, to 104.0 seconds with `hpav4`.
A comparison against static dimensioning shows that the values are quite stable (i.e., the average value is not continuously rising) and are almost as low as the best performing static scaling configuration (30 seconds with 20 replicas).
Therefore different parameters for `averageValue` can be used to tweak the trade-off between application performance and cost.
With our specific test scenario the value of one provides an excellent performance while already realizing major cost savings.
With our specific test scenario a value of one provides an excellent performance while already realizing major cost savings.
<!-- The ideal value depends on the workload scenario. -->
<!-- Thus, it should be empirically determined and adjusted based on user preferences. -->
Clearly, different user preferences (performance-cost trade-off) will require different values.
@ -249,26 +252,26 @@ Between the autoscaling policies, the `averageValue` does not seem to have a maj
The previous sections have validated the functionality of the autoscaling setup as well as evaluated several scaling metrics and parameters based on an artificial workload.
In this section we set up production-like target systems, consisting of 25 virtual machines, and configure the application to connect to these systems via SSH.
Also the configuration scripts used to interact with these systems are representative of tasks carried out in production environments:
each script has 42 checks for system security settings such as administrator access, login retry interval etc.
In addition, the configuration scripts used to interact with the systems are representative of tasks carried out in production environments:
each script consists of 42 checks for system security settings such as administrator access, login retry interval etc.
We repeat the benchmark scenario described in Section \ref{benchmark-setup}.
The overall benchmark time (75 minutes), number of configuration runs (50) and maximum replicas (50) are adjusted due to the increased workload.
Thus, the results from Section \ref{benchmark-setup} cannot be compared in absolute terms to the results presented in these sections, though we expect to confirm the trends from our previous findings.
Appendix \ref{hpa-scaling-policy} (Listing \ref{src:real-autoscaling-policy}) shows the full horizontal autoscaling policy that was used in HPA benchmarks.
The KEDA benchmarks used the scaling policy shown in Listing \ref{src:keda-v1}.
Appendix \ref{hpa-scaling-policy} (Listing \ref{src:real-autoscaling-policy}) shows the full horizontal autoscaling policy used in HPA benchmarks.
The KEDA benchmarks use the scaling policy shown in Listing \ref{src:keda-v1}.
The benchmark results shown in Figure \ref{fig:real-performance-cost-benchmark} confirm our previous experiments:
the autoscaling policy `sp-v1` (blue) as well as KEDA (green) achieved the same performance as a static replica number of 10.
This is the maximum performance the application is able to achieve in this scenario, because along with higher replica counts (e.g., 20 in Figure \ref{fig:real-performance-cost-benchmark}) the performance remains the same.
This is the maximum performance the application is able to achieve in this scenario, because even with higher replica counts (e.g., 20 in Figure \ref{fig:real-performance-cost-benchmark}) the performance remains the same.
<!-- 10 is the ideal replica count in this scenario because even with 20 replicas, no higher performance can be achieved. -->
At the same time, all scaling policies were able to consistently reduce the cost during the benchmark:
`sp-v1` had 19.3% lower cost and `keda` had 18.1% lower cost while maintaining the same performance as 10 replicas.
The identical performance is explained by the fact that KEDA internally uses HPA for autoscaling, and in this case the same target metric and value were specified for both KEDA and HPA.
Logically, the `keda` autoscaling policy should have a lower cost than `sp-v1` because KEDA has the ability to scale down the Deployment to zero replicas (as opposed to `sp-v1` which has a minimum of one replica).
Logically, the `keda` autoscaling policy should have a lower cost than `sp-v1` because KEDA has the ability to scale the Deployment down to zero replicas (as opposed to `sp-v1` which has a minimum of one replica).
These cost savings did not manifest in the benchmarks because there was load on the system most of the time.
Scaling the Deployment to zero has larger benefits when there are significant periods of time where a particular service is completely idle.
Scaling the Deployment to zero has larger benefits when there are significant periods where a particular service is completely idle.
`sp-v2` had 38.6% lower cost while having worse performance than 10 replicas.
This is due to the fact that `sp-v2` allows more tasks to be in queue compared to `sp-v1` and `keda`, thereby increasing the time to completion.
@ -286,7 +289,7 @@ This can be explained by the additional latency that KEDA has when scaling the D
Exemplary behaviors of the autoscaling policies during the benchmark are shown in Figure \ref{fig:real-scaling-activity-static}.
The figure shows the replica count of `sp-v1` in blue, of `sp-v2` in orange, and of `keda` in green.
The red peaks indicate when new work has been submitted for the system (their height does not have any significance).
The red peaks indicate when new work has been submitted to the system (their height does not have any significance).
From this example it is clear that scaling policy `sp-v2` suffers from *thrashing*: scaling operations are frequently made and then reverted shortly afterwards.
`sp-v1` exhibited a much smoother line, which is expected with a static workload, and took 13 minutes to converge to a stable number of replicas (11).
Similarly, the scale-up of `keda` was slightly delayed and converged to the same number of replicas after 18 minutes.
@ -304,7 +307,7 @@ The scaling policy is able to maintain overall application performance (the time
\begin{figure}[ht]
\centering
\includegraphics[width=\textwidth]{images/real-scaling-activity-static.png}
\caption{\label{fig:real-scaling-activity-static} Grafana screenshot of scaling activity during production-like scenario with static workload. Red bars indicate launch of new configuration runs.}
\caption{\label{fig:real-scaling-activity-static} Grafana screenshot of scaling activity during production-like scenario with static workload. Red bars indicate the launch of new configuration runs.}
\end{figure}
\clearpage
@ -315,7 +318,7 @@ This pattern is repeated five times, resulting in a total of 40 configuration ru
This scenario tests the elasticity of the autoscaling policies. <!-- to scale in and out. -->
Naturally, the results of these experiments are more variable than the previous ones.
Figure \ref{fig:variable-performance-cost-benchmark} shows the performance-cost trade-off and in particular the lower cost achieved by the autoscaling policies, while almost reaching the maximum performance (total time-to-completion).
Figure \ref{fig:variable-performance-cost-benchmark} shows the performance-cost trade-off and in particular the lower cost achieved by the autoscaling policies, while almost reaching the maximum performance (total time to completion).
This illustrates that the autoscaling policies `sp-v1` and `keda` are able to adjust to the varying workload and deliver competitive results.
However, Figure \ref{fig:variable-config-run-variance} highlights a minor flaw:
on average individual configuration runs are much slower compared to the best case (10 replicas) and there is a large variance in their durations (even though each configuration run contains the same amount of work).

@ -15,7 +15,7 @@ We gave an overview of the available literature on the subject of autoscaling ap
This revealed that while there have been numerous articles and surveys about VM- and container-based autoscaling, only recently have researchers started investigating Kubernetes specifically.
A comprehensive review of the algorithms and technical architectures of publicly available autoscaling components for Kubernetes (HPA, VPA, CA, KEDA) was performed to understand the technologies currently used in the industry.
Finally, a survey of research proposals for novel Kubernetes autoscalers was conducted and the proposals were evaluated qualitatively.
This research made it clear that proactive autoscaling (i.e., scaling not only based on current load, but based on future predicted load) is beneficial for aggressive scaling.
This research made it clear that proactive autoscaling (i.e., scaling not only based on current load, but based on predicted future load) is beneficial for aggressive scaling.
However, this leads to more complex algorithms (which require more time to train and potentially large amounts of data) as well as system behavior that is more opaque to cluster operators.
Thus, these two aspects need to be balanced.
@ -24,7 +24,7 @@ No conclusion has been reached about whether a service should be scaled based on
Ultimately, the choice of scaling metrics depends on the development context and application usage scenario.
For this reason, the steps necessary to expose and identify metrics relevant for scaling an application running on Kubernetes were outlined.
Unfortunately, none of the reviewed articles has a publicly available implementation.
This is problematic because it prevents us from evaluating the technical soundness of the implementation and its integration with Kubernetes.
This is problematic because it prevents evaluating the technical soundness of the implementation and its integration with Kubernetes.
In the end, it is not only the underlying algorithms that matter when setting up a production-grade system, but also how operators need to configure and interact with it.
For this reason, we proposed the design and architecture of a novel Kubernetes autoscaler:
@ -45,10 +45,10 @@ Since we provided detailed documentation about our setup, industry professionals
Finally, we performed a quantitative evaluation of several autoscaling policies.
Our findings showed that the target application is able to achieve maximum performance with the autoscaling policies, exhibiting only minor variance in performance.
At the same time, we were able to realize significant cost-savings due to downscaling during times of low load in our benchmark.
At the same time, we were able to realize significant cost-savings due to downscaling during times of low load.
Despite the benchmark results being specific to our target application, other researchers and professionals can reuse the same benchmarking procedures for any queue-based cloud application.
Furthermore, the scaling optimizations we have discussed (delayed scale-down, overscaling, etc.) are applicable to any system leveraging autoscaling.
In particular, the criteria for evaluating the performance (time-to-completion) and cost (replica seconds) dimensions are valuable for anyone carrying out performance-and-cost optimizations with container-based infrastructure.
In particular, the criteria for evaluating the performance (*time to completion*) and cost (*replica seconds*) dimensions are valuable for anyone carrying out performance-and-cost optimizations with container-based infrastructure.
Concerning future work, we think it would be valuable to compare the current implementation with an event-driven implementation.
Event-driven architectures are at the core of popular *serverless* or *functions-as-a-service* offerings of public cloud providers (e.g., AWS Lambda, GCP Cloud Run, Azure Functions).
@ -61,5 +61,5 @@ More recently, the SPEC research group conducted a general review of use-cases f
<!-- For this comparison, the same performance and cost criteria should be used. -->
Overall, this thesis provided foundational and relevant knowledge on the topic of autoscaling for researchers and industry practitioners alike.
While not all software architectures and deployment models were discussed in our work (in particular stateful applications), the reader should have gained insights for tackling the challenging tasks of dimensioning, optimizing and scaling their cloud-native applications.
While not all software architectures and deployment models were discussed in our work (in particular stateful applications), the reader should have gained insights into tackling the challenging tasks of dimensioning, optimizing and scaling their cloud-native applications.
The key takeaway is that a solid foundation of metrics (collected from several components) enables effectively dimensioning and scaling any application with state-of-the-art cloud-native solutions.
