Cloudreach in the UK has a great article on Chef node de-registration. With autoscaling groups bringing machines up and down all over the place, Chef needs a way to know that they’ve been terminated and to remove their node and client records, lest things become a cluttered mess. Luckily, AWS provides some great shortcuts. We implemented the solution, then bulletproofed it. Here’s what it takes.
It starts in CloudFormation, which gives us both source control and automation. See our summary of CloudFormation and Chef. We set up two queues: one for nodes to deregister and one for any errors.
"DeregQueue" : {
  "Type" : "AWS::SQS::Queue"
},
"DeregErrorQueue" : {
  "Type" : "AWS::SQS::Queue"
}
Simple enough. Then we need to configure all autoscaling groups (ASGs) to post a message to DeregQueue whenever one of their instances terminates. We do this by way of an SNS topic with DeregQueue as its endpoint.
"DeregTopic" : {
  "Type" : "AWS::SNS::Topic",
  "Properties" : {
    "Subscription" : [
      {
        "Endpoint" : { "Fn::GetAtt" : ["DeregQueue", "Arn"] },
        "Protocol" : "sqs"
      }
    ]
  }
}
Just tell all your ASGs to send notifications to this topic on termination, and the circle is complete.
"NotificationConfiguration" : {
  "TopicARN" : { "Ref" : "DeregTopic" },
  "NotificationTypes" : ["autoscaling:EC2_INSTANCE_TERMINATE"]
},
The topic needs permission to send messages to the queue, of course, so you’ll need to add an SQS queue policy.
"DeregQueuePolicy" : {
  "Type" : "AWS::SQS::QueuePolicy",
  "Properties" : {
    "PolicyDocument" : {
      "Id" : "DeregQueuePolicy",
      "Statement" : [
        {
          "Sid" : "Allow-SendMessage-To-Queue-From-SNS-Topic",
          "Effect" : "Allow",
          "Principal" : { "AWS" : "*" },
          "Action" : ["sqs:SendMessage"],
          "Resource" : "*",
          "Condition" : {
            "ArnEquals" : {
              "aws:SourceArn" : { "Ref" : "DeregTopic" }
            }
          }
        }
      ]
    },
    "Queues" : [ { "Ref" : "DeregQueue" } ]
  }
},
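With the plumbing in place, each termination lands in DeregQueue as an SNS envelope whose Message field is itself a JSON string. Here is a minimal sketch of unwrapping it with nothing but the standard library. The sample payload is hypothetical and trimmed down to the two inner fields the processor relies on (Event and EC2InstanceId), and parse_dereg_notification is an illustrative helper name, not part of any SDK.

```ruby
require 'json'

# Hypothetical, trimmed-down SNS envelope as it might appear in the SQS
# message body. Real notifications carry more fields than shown here.
SAMPLE_SQS_BODY = JSON.generate(
  'Type'    => 'Notification',
  'Message' => JSON.generate(
    'Event'         => 'autoscaling:EC2_INSTANCE_TERMINATE',
    'EC2InstanceId' => 'i-1288786a'
  )
)

# Unwrap the SNS envelope, then parse the inner autoscaling notification.
# Returns [event, instance_id].
def parse_dereg_notification(sqs_body)
  envelope = JSON.parse(sqs_body)
  inner = JSON.parse(envelope.fetch('Message'))
  [inner['Event'], inner['EC2InstanceId']]
end

event, iid = parse_dereg_notification(SAMPLE_SQS_BODY)
puts "#{event} #{iid}"
```

The two-step parse is the part that tends to trip people up: the outer JSON belongs to SNS, the inner JSON to the autoscaling notification.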
For the processor, we chose the aws-sdk gem for Ruby, easily installed via gem install aws-sdk. The script below uses the gem’s version 1 API (the AWS:: namespace). IAM permissions should be set up ahead of time.
require 'rubygems'
require 'aws-sdk'
require 'json'
require 'time'
require 'chef'

asg_queue_url = ENV['ASG_DEREG_QUEUE_URL']
asg_error_queue_url = ENV['ASG_DEREG_ERROR_QUEUE_URL']
topic_arn = ENV['WARNING_TOPIC']

sqs = AWS::SQS.new(:access_key_id => ENV['ADMINPROC_USER'],
                   :secret_access_key => ENV['ADMINPROC_PASS'],
                   :sqs_endpoint => asg_queue_url)
sns = AWS::SNS.new(:access_key_id => ENV['RELEASE_USER'],
                   :secret_access_key => ENV['RELEASE_PASS'])

Chef::Config.from_file(ENV['HOME'] + '/.chef/knife.rb')
rest = Chef::REST.new(Chef::Config[:chef_server_url])

# poll removes each message from the queue once the block returns without raising
sqs.queues[asg_queue_url].poll do |m|
  begin
    # parse inside the begin block so a malformed message is handled like any other failure
    sns_msg = m.as_sns_message
    body = JSON.parse(sns_msg.to_h[:body])
    event = body["Event"]
    if event.include? "autoscaling:EC2_INSTANCE_TERMINATE"
      time = Time.now.utc.iso8601
      # here we assume that the ec2 instance id = node name
      iid = body["EC2InstanceId"]
      puts "#{time}: deleting node #{iid}"
      rest.delete_rest("/nodes/" + iid)
      puts "#{time}: deleting client #{iid}"
      rest.delete_rest("/clients/" + iid)
    elsif event.include? "autoscaling:TEST_NOTIFICATION"
      m.delete
    end
  # on failure:
  rescue => e
    msg = "There was a problem deregistering instance #{iid}.\n" + e.to_s + "\n" + body.to_a.sort.join("\n")
    $stderr.puts msg
    # send alert to staff
    topic = sns.topics[topic_arn]
    topic.publish(msg)
    # put the original message body in DeregErrorQueue
    error_q = sqs.queues[asg_error_queue_url]
    error_q.send_message(m.body)
    # keep moving
    next
  end
  puts ""
end
Easy peasy.
% tail -f process_asg_queue.log
2012-07-01T16:12:50Z: nodename = bas-testo-test-i-1288786a, iid = i-1288786a
knife node delete bas-testo-test-i-1288186a -y... result = true
knife client delete bas-testo-test-i-1218786a -y... result = true
2012-07-01T16:13:22Z: nodename = bas-database-database-i-1c867664, iid = i-1c867664
knife node delete bas-database-database-i-1c167664 -y... result = true
knife client delete bas-database-database-i-1c167664 -y... result = true
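The error path in the script (alert staff, park the message in DeregErrorQueue, keep moving) is easier to reason about once it is separated from AWS and Chef. The following is a rough sketch of that pattern only. handle_message and its three callables are hypothetical stand-ins for the Chef delete calls, the SNS publish, and the send to the error queue, so the flow can be exercised without any credentials.

```ruby
# Process one message. `deregister`, `alert`, and `park` are hypothetical
# callables standing in for the Chef delete calls, the SNS publish, and the
# send to DeregErrorQueue. Returns :ok on success and :failed otherwise.
def handle_message(raw_body, iid, deregister:, alert:, park:)
  deregister.call(iid)
  :ok
rescue StandardError => e
  # Tell a human first, then keep the original message for later inspection.
  alert.call("There was a problem deregistering instance #{iid}.\n#{e.message}")
  park.call(raw_body)
  :failed
end

alerts = []
parked = []
failing = ->(_iid) { raise 'chef server unreachable' }

result = handle_message('{"raw":"body"}', 'i-1288786a',
                        deregister: failing,
                        alert: ->(msg) { alerts << msg },
                        park: ->(body) { parked << body })
puts result
```

Because the failure is swallowed after being reported and parked, one bad message cannot stall the rest of the queue, which is the property the original loop gets from its rescue and next.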
All that remains is to properly daemonize. We chose to wrap this guy up in supervisord.
[program:chef_dereg]
directory=<%= node['script_dir'] %>
command=ruby <%= node['script_dir'] %>/process_asg_queue.rb
autostart=true
autorestart=true
startretries=5
stdout_logfile=<%= node['log_dir'] %>/process_asg_queue.log
redirect_stderr=true
stopsignal=INT
stdout_logfile_maxbytes=15000000
stdout_logfile_backups=45
Finally, the error handling proved very tolerant in testing, but you never know, right? So as a catch-all, set an alarm on the queue and never look at it again. Until you have to.
"DeregQueueAlarm" : {
  "Type" : "AWS::CloudWatch::Alarm",
  "Properties" : {
    "AlarmDescription" : "Alarm if queue is not being processed. The processor might have died.",
    "Namespace" : "AWS/SQS",
    "MetricName" : "ApproximateNumberOfMessagesVisible",
    "Dimensions" : [{
      "Name" : "QueueName",
      "Value" : { "Fn::GetAtt" : ["DeregQueue", "QueueName"] }
    }],
    "Statistic" : "Average",
    "Period" : "300",
    "EvaluationPeriods" : "3",
    "Threshold" : "1",
    "ComparisonOperator" : "GreaterThanOrEqualToThreshold",
    "AlarmActions" : [{
      "Ref" : "CriticalTopic"
    }]
  }
},
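To get a feel for when this alarm would actually fire, here is a simplified model of its settings: the average queue depth has to meet the threshold of 1 in each of the last 3 periods of 300 seconds, so messages need to sit unprocessed for roughly 15 minutes. This is an approximation for reasoning about timing, not a reproduction of CloudWatch’s exact evaluation rules, and alarm_state is a hypothetical helper.

```ruby
# Simplified model of the alarm above: fire when the average queue depth
# meets the threshold in each of the most recent evaluation periods.
# An approximation only; CloudWatch's real evaluation has more nuance.
def alarm_state(period_averages, threshold: 1, evaluation_periods: 3)
  recent = period_averages.last(evaluation_periods)
  return :insufficient_data if recent.length < evaluation_periods
  recent.all? { |avg| avg >= threshold } ? :alarm : :ok
end

puts alarm_state([0, 2, 2, 3])   # prints "alarm"
puts alarm_state([2, 2, 0.4])    # prints "ok"
```

A healthy processor drains the queue well inside a single period, so a sustained backlog across three periods is a reasonable sign that the daemon has died or is stuck.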