
Implementation: write about scaling evaluation

master
Jack Henschel 2 years ago
parent 738fc45d03
commit 1de4bf6875
  1. BIN images/hpa-scaling-v1.png
  2. BIN images/replicas-20-vs-replicas-3.png
  3. include/04-implementation.md (157 lines changed)
  4. include/appendices.md (2 lines changed)


@@ -68,7 +68,7 @@ The installation routine is configured with a *values* file in YAML format.
The `prometheus.values.yaml` and `grafana.values.yaml` configuration files are available in Appendix \ref{prometheus-setup} and \ref{grafana-setup}, respectively.
It is important to keep in mind that Prometheus is fundamentally a pull-based monitoring solution.
This means the Prometheus server connects to the exporters, not the other way around (push-based).
This means the Prometheus server opens a connection to the exporters, not the other way around (push-based).
Thus, the network topology (firewalls etc.) must allow this.
In Kubernetes, this is configured through Network Policies^[<https://kubernetes.io/docs/concepts/services-networking/network-policies/>].
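For illustration, a minimal network policy permitting these scrape connections could look roughly like the sketch below; the namespace label (`name: monitoring`), the pod label (`app: esm-executor`) and the metrics port are assumptions and not the exact policy used in this setup.
\begin{lstlisting}[caption=Sketch of a Network Policy allowing Prometheus to scrape an exporter, language=yaml, numbers=none]
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-prometheus-scrape
spec:
  podSelector:
    matchLabels:
      app: esm-executor          # assumed label of the exporter pods
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: monitoring       # assumed label of the Prometheus namespace
    ports:
    - protocol: TCP
      port: 8080                 # assumed metrics port of the exporter
\end{lstlisting}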
@@ -202,7 +202,7 @@ Another project with a similar goal is *kube-metrics-adapter*^[<https://github.c
The installation of the adapter was performed with Helm Charts and is detailed in Appendix \ref{prometheus-adapter-setup}.
In essence, the adapter is configured with a PromQL query it should execute.
The query can be parameterized with several labels and parameter overrides.
The result of this query is exposed as a new metric through Kubernetes' metrics API (cf. Figure{fig:scaling-flow}).
The result of this query is exposed as a new metric through Kubernetes' metrics API (cf. Figure \ref{fig:scaling-flow}).
\todo{It might be necessary to go more into detail about this. Let's see.}
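As an illustration only (the configuration actually used is listed in Appendix \ref{prometheus-adapter-setup}), an external metric rule for the adapter could be sketched roughly as follows; the query template and label override are generic placeholders, while the metric name corresponds to the task queue metric used later.
\begin{lstlisting}[caption=Sketch of an external metric rule for the Prometheus adapter, language=yaml, numbers=none]
rules:
  external:
  - seriesQuery: 'esm_tasks_queued_total'
    resources:
      overrides:
        namespace: {resource: "namespace"}
    name:
      matches: "esm_tasks_queued_total"
      as: "esm_tasks_queued_total"
    # PromQL template executed by the adapter; the result is served
    # through the external metrics API
    metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>})'
\end{lstlisting}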
## Autoscaling Setup
@@ -243,7 +243,7 @@ These settings will be explored in the following sections.
The goal of the horizontal autoscaler in our scenario will be to give the application a similar performance to a static overprovisioning of resources (low average queue time and time to completion), while keeping the cost <!-- (replica seconds) --> to a minimum.
Additionally, the autoscaler should be able to find this optimum trade-off with varying workload sizes.
This section details the configuration of Kubernetes' Horizontal Pod Autoscaler (HPA).
This section details the configuration of Kubernetes' Horizontal Pod Autoscaler (HPA) \cite{KubernetesDocumentationHorizontalPod_2021}.
As outlined in Section \ref{horizontal-pod-autoscaler}, HPA is part of the Controller Manager and therefore part of every Kubernetes installation.
Thus, no separate installation is required; HPA only needs to be configured for each scaling target.
@@ -251,7 +251,7 @@ The most minimal horizontal scaling policy could be applied with the command `ku
This would scale the number of replicas based on the average CPU load across all pods in the deployment.
However, as discussed at the beginning of this chapter, our workload is neither purely CPU- nor memory-bound, but also limited by the throughput of external systems.
Thus, we need to scale this deployment based on a higher-level metric which we have exposed in Section \ref{prometheus-exporters}.
We have identified the current queue length (cf. Figure \ref{grafana-fig}) as a meaningful autoscaling metric.
We have identified the current queue length (cf. Figure \ref{fig:hpa-scaling-v1}) as a meaningful autoscaling metric.
Thanks to the metrics adapter installed in Section \ref{prometheus-exporters}, we can configure the HPA to scale based on external metrics, as shown in Listing \ref{src:hpa-scale-v1}.
The object structure is similar to the CRD of the VPA.
It specifies a *target*: the object which should be scaled (lines 6-9).
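The complete policy is given in Listing \ref{src:hpa-scale-v1}; purely as a sketch of its overall shape (the concrete object names are assumptions and the limits may differ), such an HPA object could look as follows:
\begin{lstlisting}[caption=Sketch of an HPA object scaling on an external metric, language=yaml, numbers=none]
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: esm-executor             # assumed name
spec:
  scaleTargetRef:                # the target: the object which should be scaled
    apiVersion: apps/v1
    kind: Deployment
    name: esm-executor           # assumed Deployment name
  minReplicas: 1
  maxReplicas: 20
  metrics:
  - type: External               # metric served by the Prometheus adapter
    external:
      metric:
        name: esm_tasks_queued_total
      target:
        type: Value
        value: "1"
\end{lstlisting}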
@@ -294,19 +294,23 @@ Deployment pods: 1 current / 1 desired
Metrics: ( current / target )
"esm_tasks_queued_total": 0 / 1
Messages:
recommended size matches current size
the HPA was able to successfully calculate a replica count from
Recommended size matches current size
The HPA was able to successfully calculate a replica count from
external metric esm_tasks_queued_total
the desired replica count is less than the minimum replica count
The desired replica count is less than the minimum replica count
[...]
New size: 5; external metric esm_tasks_queued_total above target
New size: 10; external metric esm_tasks_queued_total above target
New size: 5; All metrics below target
\end{lstlisting}
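Status output of this kind can be retrieved at any time with `kubectl describe`; assuming the HPA object is named `esm-executor`, the invocation would be along the lines of:
\begin{lstlisting}[language=bash, numbers=none]
kubectl describe hpa esm-executor --namespace <application-namespace>
\end{lstlisting}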
As Figure \ref{} shows, the number of replicas can also be visually observed with the monitoring setup.
As Figure \ref{fig:hpa-scaling-v1} shows, the number of replicas can also be visually observed with the monitoring setup.
\begin{figure}[ht]
\centering
\includegraphics[width=\textwidth]{images/hpa-scaling-v1.png}
\caption{\label{fig:hpa-scaling-v1} Scaling behavior of HPA Policy v1. Green: task queue length. Red: number of active replicas.}
\end{figure}
### Vertical Scaling with VPA
@@ -410,7 +414,7 @@ Thus, the goal is to always keep the number of allocated resources as low as pos
To measure the performance of the system we will use one of the application-level metrics we have exposed in Section \ref{prometheus-exporters}: the duration of configuration runs.
A configuration run refers to the configuration of a fixed set of systems being checked and updated.
This process is usually trigger by a user in the web interface, thus it is deemed a relevant metric for measuring the performance of the application.
This process is commonly triggered by the user through the web interface, thus it is deemed a relevant metric for measuring the performance of the application.
In order to measure the cost of horizontal scaling we will use *replica seconds*: the number of running pods integrated over the duration of the benchmark, measured in seconds.
For example, if the benchmark lasts 1 minute and we have a constant number of 2 replicas running all the time, this would result in 120 replica seconds.
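One way to approximate this value from the monitoring data is sketched below; it assumes that kube-state-metrics reports the replica count of the (here hypothetically named) `esm-executor` Deployment and simply multiplies the average replica count by the length of the 30-minute benchmark window.
\begin{lstlisting}[caption=Sketch of a replica-seconds query, language=yaml, numbers=none]
# average replica count over the benchmark window, multiplied by its length in seconds
avg_over_time(kube_deployment_status_replicas{deployment="esm-executor"}[30m]) * 1800
\end{lstlisting}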
@@ -424,52 +428,135 @@ Prometheus is not suitable for such highly time accurate measurements \cite{Prom
We set up a performance benchmark for the application with a constant workload size.
Thus, our experiment will evaluate the *strong scaling* behavior \cite{ScientificBenchmarkingParallelComputing_2015} of the application and -- by extension -- the autoscaler.
Benchmarks are repeated three times and the results are checked for outliers.
If there are not outliers the results are averaged, otherwise they are reported.
Between each benchmark the target application is completely removed from the cluster (deleting Kubernetes namespace) and re-installed.
An individual benchmark run is structured as follows:
1. Set up and initialize target application from scratch.
2. Start the benchmark timer.
3. After 60 seconds, start a configuration run. Repeat ten times.
4. Wait until all configuration runs finish. This marks the *time to completion*.
5. Continue running the application until 30 minutes after starting the benchmark timer. Then, stop the replica seconds counter.
6. Terminate the application.
This procedure is repeated three times and the results are checked for outliers.
If there are outliers they are reported, otherwise the results are averaged where appropriate.
Between each benchmark the target application is completely removed from the cluster (Kubernetes namespace is deleted) and re-installed into a new namespace.
This ensures the benchmark runs are completely isolated and no transient side effects (e.g. warm caches) are present.
The entire setup is scripted to minimize potential for variation.
The setup described above is entirely scripted to minimize potential for variation.
Each configuration run is started with an interval of 60 seconds.
Since configuration runs take longer than 60 seconds (see Figure \ref{fig:config-run-variance}), the application needs to process multiple runs in parallel.
Each configuration run contains the same amount of work.
The application continues to run after the work has finished (time to completion) in order to demonstrate the cost-savings made possible through autoscaling.
If the benchmark were stopped as soon as the workload is completed, only the performance (but not the cost) could be compared.
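For illustration, a single benchmark run could be driven by a script along the following lines; the chart name and the `trigger-config-run.sh` and `wait-for-completion.sh` helpers are hypothetical placeholders rather than the actual tooling used here.
\begin{lstlisting}[caption=Sketch of a benchmark driver script, language=bash, numbers=none]
#!/usr/bin/env bash
set -euo pipefail

NS="benchmark-$(date +%s)"                      # fresh namespace per run
kubectl create namespace "$NS"
helm install esm ./esm-chart --namespace "$NS"  # 1. set up the target application

START=$(date +%s)                               # 2. start the benchmark timer
for i in $(seq 1 10); do
  sleep 60                                      # 3. every 60 s, start a configuration run
  ./trigger-config-run.sh "$NS"                 #    (hypothetical helper)
done

./wait-for-completion.sh "$NS"                  # 4. wait for all runs: time to completion
echo "time to completion: $(( $(date +%s) - START ))s"

REMAINING=$(( START + 1800 - $(date +%s) ))     # 5. keep running until minute 30
(( REMAINING > 0 )) && sleep "$REMAINING"

kubectl delete namespace "$NS"                  # 6. terminate the application
\end{lstlisting}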
\begin{figure}[ht]
\centering
\includegraphics[width=\textwidth]{images/static-baseline-benchmark.pdf}
\caption{\label{fig:performance-cost-benchmark} Performance benchmark results of static scaling. Numbers in square brackets indicate the number of replicas.}
\end{figure}
As explained at the beginning of this chapter, the number of executors in the target application can be adjusted.
To get a baseline for the application behavior, we tested several replica values.
Figure \ref{fig:benchmark-baseline} shows the results: the number in square brackets indicate the number of replicas.
The graph shows the cost (replica seconds) on the x-axis and the time to completion (execution time) on the y-axis.
The lowest cost (4.000 replica seconds) is achieved when using 2 replicas, which also has the slowest time to completion (2.000 seconds).
To get a baseline for the application behavior, we tested several static values for replicas.
The graph in Figure \ref{fig:performance-cost-benchmark} shows the results with the cost (replica seconds) on the x-axis and the time to completion (time until the simulated user has all desired results) on the y-axis.
The lowest cost (4.000 replica seconds) is achieved when using 2 replicas, which also has by far the slowest time to completion (2.000 seconds).
The highest cost (14.000 replica seconds) is recorded with 20 replicas, which also has the lowest time to completion (4.000 seconds).
Due to the constant workload size, the benchmarks with 10 and 15 replicas had almost the same time to completion as the run with 20 replicas, while using significantly fewer replicas and therefore fewer replica seconds.
This means most of the replicas were idle (unused) during the execution.
In this scenario, the goal of the autoscaler is to optimize the towards the bottom left corner (low cost and high performance), irrespective of the given workload.
This means that for this workload size, more than ten replicas are inefficient, since most of the replicas will be idle (unused) during the execution.
In this scenario, the goal of the autoscaler is to optimize towards the bottom left corner (low cost and high performance), irrespective of the type and size of the given workload.
\begin{figure}[ht]
\centering
\includegraphics[width=\textwidth]{images/static-baseline-benchmark.pdf}
\caption{\label{fig:benchmark-baseline} Performance benchmark results of static scaling. Number in square brackets indicate number of replicas.}
\includegraphics[width=0.85\textwidth]{images/replicas-20-vs-replicas-3.png}
\caption{\label{fig:20-vs-3-replicas} Performance results of static scaling. Left: 20 Replicas, Right: 3 Replicas}
\end{figure}
The same behavior is illustrated in Figure \ref{fig:20-vs-3-replicas}, but with metrics collected from the monitoring system: on the left side is a benchmark run with 20 replicas, on the right side one with 3 replicas.
With just 3 replicas, there are many tasks queued for a long period of time (yellow line), which causes the average queue time to continuously rise up to a value of 6 minutes.
The same behavior is illustrated in Figure \ref{fig:20-vs-3-replicas}, but with metrics collected from the monitoring system: the left side shows a benchmark run with 20 replicas, the right side one with 3 replicas.
With just 3 replicas, there are many tasks queued for a long period of time (yellow line), which causes the average queue time (green line) to continuously rise, up to a value of 6 minutes.
With 20 replicas, the average queue time never rises above 30 seconds.
Figure \ref{fig:config-run-variance} shows the execution time of individual configuration runs during the benchmark.
It follows that when the overall time to completion is lower (as shown in Figure \ref{fig:performance-cost-benchmark}), the execution time of individual workloads must be smaller as well.
The plot confirms this and also shows that a low number of replicas has a significant effect on the variance of the configuration runs' execution times.
While each configuration run has the same workload, with a low number of replicas a significant increase in variance (in addition to an increase of the average) can be observed.
This is explained by the fact that with few replicas, the same executor needs to run multiple workloads sequentially, which slows down some of them drastically.
\begin{figure}[ht]
\centering
\includegraphics[width=0.85\textwidth]{images/replicas-20-vs-replicas-3.png}
\caption{\label{fig:20-vs-3-replicas} Performance results of static scaling. Left: 20 Replicas, Right: 3 Replicas}
\includegraphics[width=\textwidth]{images/config-run-variance.pdf}
\caption{\label{fig:config-run-variance} Execution time of individual configuration runs (3 benchmarks, 10 configuration runs per benchmark). Numbers in blue indicate mean.}
\end{figure}
After having established a baseline performance, we can transition into testing the performance of a basic autoscaling setup.
The first version of the autoscaling policy already has been shown in Listing \ref{src:hpa-policy-v1}.
After having established the performance and cost characteristics of static configuration, we transition into testing a basic autoscaling setup.
The first version of the autoscaling policy has been shown in Listing \ref{src:hpa-scale-v1}.
We will now evaluate the behavior of this policy.
As Figure \ref{fig:hpa-scaling-v1} shows, the benchmark with this policy behaved similarly to a static configuration with 10 replicas: the application utilized around 18600 replica seconds during the benchmark and the total time to completion was approx. 700 seconds.
\todo{}
This result highlights the strengths and weaknesses of this autoscaling policy: it enabled the application to achieve the same performance as with a static configuration of 10 replicas.
As Figure \ref{fig:performance-cost-benchmark} shows, this is the maximum amount of performance the application is able to deliver for this particular workload.
At the same time, the autoscaling policy was just as costly as a static configuration of 10 replicas, even though for significant periods the application was just running with 1 replica.
This is explained by Figure \ref{fig:hpa-scaling-v1}: the policy overscaled the number of replicas (more than the necessary value established previously) and even reached the replica limit (`maxReplicas` in Listing \ref{src:hpa-scale-v1}).
* The policy scales down too quickly: configuration runs are still active when replicas are removed.
* This is evident when correlating scale-down events with the points at which configuration runs finish, and with the execution time of the last configuration runs.
* The downscale behavior therefore needs to be examined more closely.
* A better graceful shutdown of the executors probably needs to be implemented.
* As a consequence, the average queue time is high.
In summary, the autoscaling policy has performed well (no performance loss, minor introduction of variance), but has not realized any cost-savings in our experiments.
## Optimization
How to optimize the improvement.
While the initial scaling policy presented above has successfully scaled the deployment during the benchmark, we have identified several aspects for improvement.
This section will address these issues by fine-tuning the scaling policy.
* **Delayed scale down**: the number of replicas is not reduced soon enough after the workload has finished, as is evident from the number of queued tasks vs. the number of replicas in Figure \ref{fig:hpa-scaling-v1}.
* **Potential for premature scale down**: if a task has a long execution time, the autoscaler might reduce the number of replicas too early because the target metric only considers queued tasks.
* **Overscaling of replicas**: our previous experiments with static replica configurations have shown that provisioning 20 replicas (as shown in Figure \ref{fig:performance-cost-benchmark}) is ineffective, since it does not increase performance.
The delayed scale down can be tackled by adjusting the *downscale stabilization window* of the scaling policies (shown in Listing \ref{src:downscale-behavior}).
This setting specifies how soon HPA will start scaling down the deployment after it has detected that the scaling metric is below the target \cite{KubernetesDocumentationHorizontalPod_2021}.
The default value is 5 minutes (cf. Figure \ref{fig:hpa-scaling-v1}); we adjust it to 1 minute (line 7).
Decreasing the stabilization window can lead to thrashing (continuous creating and removing of pods) when workload bursts are more than the specified window apart.
In our case, starting these containers is very fast since this application component is lightweight and does not hold any internal state.
Additionally, this risk is partially mitigated by only allowing HPA to remove 50% of the active replicas per minute (lines 4-6).
By default, HPA is allowed to deprovision all pods (down to `minReplicas`) at the same time \cite{KubernetesDocumentationHorizontalPod_2021}, as illustrated in Figure \ref{fig:hpa-scaling-v1} (jump from 20 to 1 replica).
\begin{lstlisting}[caption=Improved Downscale Behavior for HPA, label=src:downscale-behavior, language=yaml, numbers=left]
behavior:
  scaleDown:
    policies:
    - type: Percent
      value: 50
      periodSeconds: 60
    stabilizationWindowSeconds: 60
\end{lstlisting}
The potential for premature scale down is reduced by not only considering the queued tasks in the scaling metric, but also the currently running tasks.
Just because all tasks have been taken out of the message queue by the executors does not mean that the number of executors can be reduced, as they might still be processing the tasks.
Thus, the Prometheus metrics query is adjusted as shown in Listing \ref{src:improved-metrics-query} to consider all queued and active tasks.
Refer to Appendix \ref{prometheus-adapter-setup} for the original version and context.
\begin{lstlisting}[caption=Improved Prometheus metrics query used for scaling, label=src:improved-metrics-query, language=yaml, numbers=none]
sum(esm_tasks_total{status=~"queued|running"})
# equivalent to:
esm_tasks_total{status="queued"} + esm_tasks_total{status="running"}
\end{lstlisting}
Finally, the issue of overscaling replicas can be \improve{fixed} by adjusting the logic of the scaling metric.
Until now, we have been using an absolute value as a target, e.g. total number of tasks in queue.
It makes more sense to use a metric that incorporates the current number of replicas as a ratio.
This is necessary because -- with the change explained previously -- the scaling metric now also includes the number of running tasks.
Since every busy replica contributes at least one running task to the metric, a fixed absolute target could never be satisfied while replicas are processing work, and the autoscaler would keep adding replicas up to `maxReplicas`.
Instead of using the scaling metric directly, the raw value is averaged: it is divided by the number of active pod replicas (Listing \ref{src:improved-scaling-logic}) \cite{KubernetesDocumentationHorizontalPod_2021}.
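For reference, the scaling decision described in the Kubernetes documentation \cite{KubernetesDocumentationHorizontalPod_2021} can be summarized by the following formula; with an *AverageValue* target, the current metric value is first divided by the number of active replicas, so the replica count effectively cancels out:
\[
\textit{desiredReplicas} = \left\lceil \textit{currentReplicas} \times \frac{\textit{currentMetricValue} / \textit{currentReplicas}}{\textit{averageValue}} \right\rceil = \left\lceil \frac{\textit{currentMetricValue}}{\textit{averageValue}} \right\rceil
\]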
In order to find an appropriate value for `averageValue`, we performed several benchmarks.
They have been labeled as `hpav2`, `hpav3`, `hpav4` and `hpav5` for the values 2, 3, 4 and 5, respectively.
These benchmarks also include the other optimizations outlined above.
\begin{lstlisting}[caption=Improved scaling logic for HPA, label=src:improved-scaling-logic, language=yaml, numbers=left]
target:
  type: AverageValue
  averageValue: 2
\end{lstlisting}
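As a purely hypothetical example with `averageValue: 2`: if the metric reports 15 queued and running tasks, the HPA requests $\lceil 15 / 2 \rceil = 8$ replicas, regardless of how many replicas are currently active, whereas an absolute target of 1 would have driven the deployment towards `maxReplicas`.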
\todo{Discuss results of optimizations; also talk about the average queue time}
## Real-world test scenario

@@ -156,7 +156,7 @@ spec:
helm repo add prometheus-community \
https://prometheus-community.github.io/helm-charts
helm repo update
helm install -n monitoring --name prometheus-adapter --version 2.12.1 \
helm install -n monitoring prometheus-adapter --version 2.12.1 \
-f prometheus-adapter.values.yaml prometheus-community/prometheus-adapter
\end{lstlisting}
