Summary
There are many use cases for containerizing and hosting proprietary applications on AWS ECS, which is “a fully managed container orchestration service that makes it easy for you to deploy, manage, and scale containerized applications”.
Scaling on ECS is therefore a key need of any such application. This article describes the challenges with the scaling that ECS supports by default and a custom solution to alleviate them. Thanks to Edwin Essenius for the mentorship.
Problem Statement
In this scenario, the application hosted on AWS ECS processes a wide range of requests, each with unpredictable compute usage and uncertain execution time.
The core back-end architecture on AWS is based on asynchronous processing, where SQS receives the requests. The application on AWS ECS (Fargate) continuously polls SQS for incoming requests and processes them.
One of our use cases is that users can split a single large request into multiple smaller requests and send them asynchronously to the cloud for parallel processing, which reduces the overall processing time by as much as 10-12 times (from hours to minutes). ECS needs to scale quickly and accurately to serve these spikes in demand.
An ECS task needs to run the application for an uncertain amount of time, anywhere from seconds to days. Compute usage statistics are not a dependable way to tell whether a particular task is active. This means that a scale-in action can kill active tasks and jobs, which is highly undesirable.
Implementing scaling with the policies below to meet the above requirements is a challenge.
Target Tracking Scaling Policies
The link below describes the automatic scaling support in ECS.
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-auto-scaling.html
The default automatic scaling policy in ECS can increase or decrease the number of tasks that your service runs based on a target value for a specific metric. It has the following limitations in this scenario:
The scale-out action is based on spikes in compute usage, but compute usage can be unpredictable: some requests consume negligible compute yet still need to be processed in parallel with others.
The exact number of tasks needed cannot be configured with out-of-the-box basic scaling.
Default scale-out takes minutes before the application in the container picks up requests for processing.
Default scale-in takes 5-15 minutes, which wastes compute and adds unnecessary cost. More importantly, a scale-in policy based only on compute usage cannot identify and stop the tasks that are actually idle, so it ends up stopping active tasks.
On a scale-in event, ECS sends the task a SIGTERM and expects it to finish its job within the stop timeout (30 seconds by default) before sending SIGKILL, which was not possible given the nature of the application.
Step Scaling Policies
The limitations of basic target-tracking policies drove the need to set up customized step scaling policies based on CloudWatch metrics for the request count on SQS (possible with both the default metrics from ECS and custom metrics).
These policies use CloudWatch alarms, which aggregate metric data points based on the statistic for the metric. When an alarm is breached, the appropriate scaling policy is invoked.
Step scaling policies are complex and incapable of deciding the exact desired count needed. Additionally, CloudWatch alarms are costly and slow to respond.
The core issue of slow scaling (out or in) remains unresolved, and the scaling policy still cannot find the idle tasks to stop when scaling in, which is the key requirement. Even advanced customized auto-scaling policies can only “approximate” scaling needs.
The link below is an interesting read on how cluster auto scaling works, and on the complexity and math involved in implementing a scaling policy.
https://aws.amazon.com/blogs/containers/deep-dive-on-amazon-ecs-cluster-auto-scaling/
Custom Solution
Summary
With multiple limitations in what ECS supports by default, there is a need to build a custom scaling solution that updates the desired count on the AWS ECS Fargate service directly. In the scenario above, the desired count is known by querying SQS for the available message count.
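As an illustration of the message-count query, here is a minimal sketch using the boto3 SQS call get_queue_attributes. The function name, the injected client, and the settle_seconds parameter are assumptions made for this example, not part of the original solution; the value SQS returns is itself documented as approximate.

```python
import time


def get_available_messages_count(sqs_client, queue_url,
                                 settle_seconds=1.0, sleep=time.sleep):
    """Return the approximate number of visible messages on the queue."""
    # Assumption: wait briefly so the queue attributes can catch up with
    # messages that were sent just before the scaling event fired.
    sleep(settle_seconds)
    response = sqs_client.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    # SQS returns attribute values as strings.
    return int(response["Attributes"]["ApproximateNumberOfMessages"])


# Usage (hypothetical queue URL):
# import boto3
# count = get_available_messages_count(
#     boto3.client("sqs"),
#     "https://sqs.us-east-1.amazonaws.com/123456789012/requests.fifo")
```

Passing the client in as a parameter keeps the function easy to exercise without AWS access.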
Extend solution with ECS Capacity Providers
An EC2 auto-scaling group can provide capacity to ECS instead of the serverless Fargate option. This is preferred in certain scenarios, such as:
ECS Fargate does not support exceptionally large compute (up to 16 vCPUs at the time of writing).
Container image caching is another valuable feature when using ECS with EC2, especially when images are large. It allows a single EC2 instance to run multiple instances of the container while downloading the container image only once. Currently, Fargate does not offer container image caching.
An EC2 warm pool saves on instance provisioning time, unlike Fargate.
There is a need for more control over infrastructure, for example, a specific OS configuration.
To support ECS with EC2, extend the solution to simultaneously update the auto-scaling group (ASG) configured as the “capacity provider” for the ECS service. This is more complex to design and configure.
Solution Components & Configuration
AWS ECS runs the application container on Fargate or uses an auto-scaling group (ASG) as the “capacity provider”.
An EventBridge bus with a set of rules intercepts the scaling event and triggers a Lambda (the target) that runs the ECS scaling logic, referred to from here on as ecs-scaling-lambda.
The environment of ecs-scaling-lambda is prepared with the properties it needs to talk to AWS, such as the queue name, ECS cluster and service details, and min/max desired count. With ASG, it also holds attributes such as per-EC2 capacity and min/max ASG count.
Configure ecs-scaling-lambda as the ASG custom termination policy (for ECS with EC2). Per the documentation, Amazon EC2 Auto Scaling uses termination policies to prioritize which instances to terminate first when decreasing the size of your Auto Scaling group (referred to as scaling in). However, this works just as well for stopping EC2 instances to return them to the warm pool.
Design the Lambda to respond to scale-up and scale-down events. Additionally, configure the Lambda to respond to the scale-down event from the auto-scaling service with the list of EC2 instances to stop (needed when ASG is the “capacity provider”).
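The custom termination policy in the list above is set on the ASG by passing the Lambda function ARN as a termination policy. The following is a minimal sketch with boto3, assuming a hypothetical group name and ARN; the Lambda also needs a resource-based policy that allows the Auto Scaling service-linked role to invoke it, which is not shown here.

```python
def set_custom_termination_policy(autoscaling_client, asg_name, lambda_arn):
    """Point the ASG termination policy at the scaling Lambda."""
    autoscaling_client.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        # A Lambda ARN in this list makes it the custom termination policy.
        TerminationPolicies=[lambda_arn],
    )


# Usage (hypothetical names):
# import boto3
# set_custom_termination_policy(
#     boto3.client("autoscaling"),
#     "my-ecs-asg",
#     "arn:aws:lambda:us-east-1:123456789012:function:ecs-scaling-lambda")
```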
Solution Architecture
Implementation
ECS Scaling Lambda Handler
The Lambda handler function responds to these event types:
Scale up for new requests
Scale down when a task shuts down
Respond to the ASG scale-down event with the idle instance IDs to stop
def lambda_handler(event, context):
    # Read the event type; how it is parsed depends on your application.
    # When scaling in, ASG sends an event with cause SCALE_IN.
    event_type = get_event_type(event)
    if event_type == "_scale_up":
        return scale_up()
    if event_type == "_scale_down":
        return scale_down()
    if event_type == "asg_scale_down":
        # Respond to the auto-scaling service with the idle EC2 instance IDs,
        # then reset idle_ec2 in the environment to empty
        idle = list(env.idle_ec2)
        env.idle_ec2 = set()
        return {"InstanceIDs": idle}
    return None
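The handler above leaves get_event_type to the application. One possible sketch is shown below, assuming that the application events arrive through EventBridge with the type in "detail-type", and that the ASG custom termination policy invokes the Lambda directly with a top-level "Cause" of "SCALE_IN". The mapping to the three type strings is an assumption for illustration.

```python
def get_event_type(event):
    """Classify an incoming Lambda event for the scaling handler."""
    # A custom termination policy invocation carries a top-level "Cause".
    if event.get("Cause") == "SCALE_IN":
        return "asg_scale_down"
    # EventBridge events carry their type in "detail-type"
    # (assumed here to be "_scale_up" or "_scale_down").
    return event.get("detail-type")
```

Unknown events yield None, so the handler falls through to its final return.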
Scale Up
On receiving a request, the application emits a scale-up event. For example, the application may receive a request via an API or via an asynchronous trigger such as a file upload.
# Application when receiving a request
event_bridge.put_event(source, "_scale_up", event)
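The put_event call above is a wrapper whose implementation is not shown. A minimal sketch of such a wrapper on top of the boto3 EventBridge call put_events follows; the function names, the source string, and the bus name are hypothetical.

```python
import json


def build_scale_event(source, detail_type, detail, bus_name="default"):
    """Build one entry for the EventBridge put_events call."""
    return {
        "Source": source,
        "DetailType": detail_type,
        # EventBridge expects the detail as a JSON string.
        "Detail": json.dumps(detail),
        "EventBusName": bus_name,
    }


def emit_scale_event(events_client, source, detail_type, detail,
                     bus_name="default"):
    """Send the event and report whether EventBridge accepted it."""
    entry = build_scale_event(source, detail_type, detail, bus_name)
    response = events_client.put_events(Entries=[entry])
    # put_events reports per-entry failures in the response.
    return response.get("FailedEntryCount", 0) == 0


# Usage (hypothetical source and bus):
# import boto3
# emit_scale_event(boto3.client("events"), "my.app", "_scale_up",
#                  {"request_id": "r1"}, bus_name="scaling-bus")
```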
On receiving a scale-up event, ecs-scaling-lambda calculates and updates the new desired count on ECS based on the current running and pending task counts and on the pending requests in the queue.
The Lambda caps the desired count at the maximum configured in its environment (the same value as on the ECS service).
When using ASG, start EC2 instances on a scale-up event by calculating and updating the ASG’s desired count, again based on the ECS scaling status and the per-EC2 processing capability (e.g., a 32 vCPU EC2 instance can process four requests when one request uses at most 8 vCPUs). The ECS service that uses the ASG “capacity provider” is configured with the placement strategy “binpack” to maximize compute utilization and run with minimal instances.
def scale_up():
    queued = Queue.get_available_messages_count()
    if queued <= 0:
        return None
    desired, running = ECS.get_task_count()
    # Tasks already requested but not yet running will pick up queued requests
    pending = desired - running
    to_add = queued - pending
    if to_add <= 0:
        # No new tasks needed; still make sure the ASG covers the desired count
        update_asg_desired(desired)
        return desired
    # Limit to the count as configured
    tasks_desired = desired + to_add
    tasks_desired = min(tasks_desired, env.max_tasks)
    # EC2 deployments need an ASG update before the new tasks are requested
    update_asg_desired(tasks_desired)
    ECS.update_ecs_desired(tasks_desired)
    return tasks_desired
import math

def update_asg_desired(tasks_desired):
    if not env.is_asg_providing_capacity:
        return None
    # Query ASG to fetch in-service instances and its desired count
    in_service_instances, desired = AutoScaling.describe()
    # Task capacity of the ASG at its current desired count
    capacity = desired * env.ec2_capacity
    if capacity >= tasks_desired:
        return None
    # Calculate the new ASG desired count
    new_desired = math.ceil((tasks_desired - capacity) / env.ec2_capacity) + desired
    # But cap it at the maximum ASG size
    new_desired = min(new_desired, env.max_ec2)
    # Update ASG to start exactly the EC2 instances needed
    AutoScaling.set_desired_count(new_desired, desired)
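To make the arithmetic concrete, the two calculations above can be restated as pure functions and checked with numbers. This is a sketch for illustration only: the function names are hypothetical, and the ASG formula is written in the equivalent form of rounding up tasks_desired / ec2_capacity, never going below the current ASG count.

```python
import math


def ecs_desired(queued, desired, running, max_tasks):
    """New ECS desired count for a given queue depth and task state."""
    pending = desired - running          # tasks requested but not yet running
    to_add = max(queued - pending, 0)    # requests no pending task will cover
    return min(desired + to_add, max_tasks)


def asg_desired(tasks_desired, asg_current, ec2_capacity, max_ec2):
    """New ASG desired count so that all desired tasks fit on EC2."""
    needed = math.ceil(tasks_desired / ec2_capacity)
    # Scale-up only: never go below the current count, never above the cap.
    return min(max(needed, asg_current), max_ec2)


# 10 queued requests, 3 tasks desired of which 2 are running, cap of 20 tasks
tasks = ecs_desired(queued=10, desired=3, running=2, max_tasks=20)
# One EC2 runs 4 tasks (32 vCPU / 8 vCPU per request), ASG currently at 1
instances = asg_desired(tasks, asg_current=1, ec2_capacity=4, max_ec2=10)
print(tasks, instances)  # 12 3
```

In this example one pending task already covers one queued request, so nine tasks are added, and twelve tasks at four per instance need three EC2 instances.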
Scale Down
Container triggers scale down event
The container application keeps processing requests and checks whether it has been idle for too long (for example, an idle time of 30 seconds). Only the container in the task decides when it is idle. When idle, it asks ECS to shut it down, stops accepting further requests, and generates a scale-down event.
The task itself needs to query ECS for the running count before requesting shutdown, so that the minimum desired count is maintained.
# ----- Code Inside Container ----
import signal
import time

# Container starting. Define an exit handler
exit_handler = ExitHandler()
signal.signal(signal.SIGTERM, exit_handler.shutdown)
shutdown_mode = False
last_active_at = time.time()
while not exit_handler.stop:
    if shutdown_mode:
        # Do not pick up any more requests;
        # wait for the SIGTERM, which follows almost immediately
        time.sleep(0.5)
        continue
    # Main app processing logic
    if is_request_available():
        process()
        last_active_at = time.time()
    else:
        # Check how long the process has been idle
        shutdown_mode = stop_if_idle(last_active_at, timeout)
def stop_if_idle(last_active_at, timeout):
    # Keep running unless the task has been idle for too long
    if (time.time() - last_active_at) <= timeout:
        return False
    # Check with ECS whether more than the minimum number of tasks are running
    # (assumes can_i_shut_down() returns True when this task may stop)
    if not can_i_shut_down():
        return False
    # Query the ECS metadata service to get own (task) ARN, held here
    # as self.task_arn, then ask ECS to stop this task
    ecs.stop_task(
        cluster=self.cluster,
        task=self.task_arn,
        reason="Custom scale in")
    # Scale-down event for this task, which wants to stop
    EventBridge.put_event(self.service_arn,
                          "_scale_down",
                          {'timestamp': str(time.time())},
                          [self.task_arn])
    return True
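The comment about querying the ECS metadata service can be made concrete. A minimal sketch follows, assuming the task metadata endpoint version 4, whose base URL ECS injects into the container as the environment variable ECS_CONTAINER_METADATA_URI_V4; the function names are hypothetical.

```python
import json
import os
import urllib.request


def parse_task_metadata(metadata):
    """Pick the cluster and task ARN out of the /task metadata document."""
    return metadata["Cluster"], metadata["TaskARN"]


def fetch_own_task(timeout=2):
    """Ask the task metadata endpoint which task this container runs in."""
    base = os.environ["ECS_CONTAINER_METADATA_URI_V4"]
    with urllib.request.urlopen(f"{base}/task", timeout=timeout) as response:
        return parse_task_metadata(json.load(response))


# Usage inside the container:
# cluster, task_arn = fetch_own_task()
```

The two values map directly onto the cluster and task arguments of the stop_task call.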
The task stopping itself (‘self’) is the key to requesting a graceful shutdown from the ECS service.
The AWS ECS service, on receiving the stop request, sends a SIGTERM to this task, which the container process handles by exiting completely. The task then shuts down gracefully.
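The container code relies on an ExitHandler whose definition is not shown. A minimal sketch of what it could look like is given below, assuming it only needs to expose a stop flag and a signal-compatible shutdown method. Because ECS follows the SIGTERM with a SIGKILL once the stop timeout expires, the processing loop should exit promptly after the flag is set.

```python
import signal


class ExitHandler:
    """Records that a termination signal has arrived."""

    def __init__(self):
        self.stop = False

    def shutdown(self, signum, frame):
        # Signature required by signal.signal: (signal number, stack frame).
        self.stop = True


# Usage, as in the container loop:
# exit_handler = ExitHandler()
# signal.signal(signal.SIGTERM, exit_handler.shutdown)
```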
ecs-scaling-lambda responds to scale down event
ecs-scaling-lambda intercepts the scale-down event and decrements the ECS desired task count. While the ECS service stops the task by sending the SIGTERM, the decremented desired count ensures that a replacement task is not started. “Stopping the task” and “decrementing the desired count” work in conjunction.
def scale_down():
    ECS.decrement_desired_tasks()
    if env.is_asg_providing_capacity:
        scale_down_asg()
As idle tasks shut down one by one, ECS eventually runs only the minimum desired count.
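The decrement step can be sketched with the boto3 ECS calls describe_services and update_service. The function signature, the injected client, and the min_tasks floor are assumptions for this example.

```python
def decrement_desired_tasks(ecs_client, cluster, service, min_tasks):
    """Lower the service desired count by one, but not below min_tasks."""
    described = ecs_client.describe_services(cluster=cluster,
                                             services=[service])
    current = described["services"][0]["desiredCount"]
    new_desired = max(current - 1, min_tasks)
    if new_desired != current:
        ecs_client.update_service(cluster=cluster, service=service,
                                  desiredCount=new_desired)
    return new_desired


# Usage (hypothetical names):
# import boto3
# decrement_desired_tasks(boto3.client("ecs"), "my-cluster", "my-service", 1)
```

Reading and writing the count in two calls is only safe when a single invocation runs at a time.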
When ASG is used, scale it down to the needed EC2 count by finding the idle EC2 instances, asking the auto-scaling service to shut down only those, and decrementing the ASG’s desired count at the same time.
In the code snippet below, the update of the desired count on the ASG results in the auto-scaling service invoking ecs-scaling-lambda again, asking for the list of EC2 instances to stop. Configure a custom termination policy on the ASG to stop only the idle instances.
The Lambda responds with the list of idle EC2 instance IDs (see the lambda_handler definition), and the auto-scaling service either stops them to return them to the warm pool or terminates them when no warm pool is used.
def scale_down_asg():
    # Query the auto-scaling service for in-service EC2 and desired count
    in_service_instances, current_desired = AutoScaling.describe()
    # Query ECS to find the EC2 instances that are in use by tasks
    ecs_instances = ECS.get_instances_in_use() or set()
    # Find idle EC2: in service but not running any task (set difference)
    idle = set(in_service_instances) - set(ecs_instances)
    if not idle:
        return
    # ASG desired count becomes the number of EC2 in use (0 if none)
    new_asg_desired = len(ecs_instances)
    # Update the Lambda environment with the set of idle EC2 first, so they
    # are available when the auto-scaling service asks which ones to stop
    env.idle_ec2 = idle
    # Only decrement the ASG desired count.
    # This makes the AWS auto-scaling service invoke the Lambda, which
    # returns the actual instance IDs to stop (with warm pool) or terminate
    AutoScaling.set_desired_count(new_asg_desired, current_desired)
Design Considerations
To accurately read the pending request count on SQS, use a “FIFO” queue rather than a “STANDARD” queue. A FIFO queue “almost” guarantees accuracy when synchronizing the queue attributes (with slight delays of up to one second observed). The scaling Lambda therefore waits a second after the request is made before reading the queue attributes.
Run the scaling Lambda with a reserved concurrency of one to avoid concurrent updates to the ECS service from multiple scaling events received at the same time. The ecs-scaling-lambda responds very quickly to events, so a fixed concurrency of one adds negligible overhead.
The design assumes that capacity is provided 100% by either Fargate or EC2. Mixing “capacity provider” types would result in undesired behavior.
If using ASG as the “capacity provider”, use the placement strategy binpack. It leaves the least amount of unused CPU or memory and so minimizes the number of container instances in use. Additionally, start with no placement constraints. Turn off “instances protected from scale-in” on the ASG for the custom scaling to work.
The capacity provider should still have “ECS managed scaling” turned on. The reason: if scaling is “managed” by ECS, the ECS service waits for EC2 instances to come up and does not fail tasks at once for lack of available instances. Also, turn off managed termination protection for the capacity provider. Delete any lifecycle hooks on the ASG that may interfere with the custom scaling service and add overhead.
If using ASG, use an ASG warm pool (with reuse on scale-in turned on) to save time when provisioning new instances.
For improved performance, remember to reuse cached AWS connections in the Lambda and throughout your application.
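One common way to reuse AWS connections in a Lambda is to create each client once at module level, so that warm invocations share it. A minimal sketch follows; the get_client helper and its factory parameter are hypothetical, and in a real function the factory would be boto3.client.

```python
# Module-level cache: survives between invocations of a warm Lambda.
_clients = {}


def get_client(service_name, factory):
    """Return a cached client, creating it on first use only."""
    if service_name not in _clients:
        _clients[service_name] = factory(service_name)
    return _clients[service_name]


# Usage inside the handler:
# import boto3
# ecs = get_client("ecs", boto3.client)
# sqs = get_client("sqs", boto3.client)
```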
More Ideas
Before sending requests for processing, a trusted client can ask for capacity up front. Integrated with ecs-scaling-lambda, this increments the ECS desired count for the “expected” demand requested by a smart client.
With automated deployments, a new deployment would replace tasks and may shut down “active” processes, which is undesired. Abort the deployment if ECS is busy processing, by querying the ECS desired/pending count to check whether scaling is in progress and by querying CloudWatch log activity from the container.
To capture scaling metrics, persist ECS desired count update actions in a timestream database. One use case could be to view the ECS scaling status and analyze in real time how busy the system is.
Asynchronous invocation of ecs-scaling-lambda means errors may go unnoticed. Configure destinations on ecs-scaling-lambda to send out SNS email notifications on failed invocations.
Note: This article was also published by the author at https://dev.to/vibhor_agarwal/customized-scaling-of-aws-ecs-39ge