Chef Node Deregistration For Autoscaling Groups

Cloudreach in the UK has a great article on Chef node de-registration. With autoscaling groups bringing machines up and down all over the place, Chef needs a way to know that they’ve been terminated and remove them from the configs lest things become a cluttered mess. Luckily, AWS provides some great shortcuts. We implemented the solution, then bulletproofed it. Here’s what it takes.

It starts in CloudFormation for both source control and automation. See our summary of CloudFormation and Chef. We set up two queues, one for nodes to deregister and one for any errors.

"DeregQueue" : {
    "Type" : "AWS::SQS::Queue"
},
"DeregErrorQueue" : {
    "Type" : "AWS::SQS::Queue"
}

Simple enough. Then we need to configure all autoscaling groups (ASGs) to post a message to DeregQueue whenever one of their instances is terminated. We do this by way of an SNS topic with DeregQueue as its endpoint.

"DeregTopic" : {
    "Type" : "AWS::SNS::Topic",
    "Properties" : {
        "Subscription" : [
            {
                "Endpoint" : { "Fn::GetAtt" : ["DeregQueue", "Arn"]},
                "Protocol" : "sqs"
            }
        ]
    }
}

Just tell all your ASGs to send notifications to this topic on termination, and the circle is complete.

"NotificationConfiguration" : {
    "TopicARN" : { "Ref" : "DeregTopic" },
    "NotificationTypes" : ["autoscaling:EC2_INSTANCE_TERMINATE"]
},

The topic needs permission to send messages to the queue, of course, so you’ll need to add an SQS queue policy.

"DeregQueuePolicy" : {
    "Type" : "AWS::SQS::QueuePolicy",
    "Properties" : {
        "PolicyDocument":  {
            "Id":"DeregQueuePolicy",
            "Statement" : [
                {
                    "Sid":"Allow-SendMessage-To-Queue-From-SNS-Topic",
                    "Effect":"Allow",
                    "Principal" : {"AWS" : "*"},
                    "Action":["sqs:SendMessage"],
                    "Resource": "*",
                    "Condition": {
                        "ArnEquals": {
                            "aws:SourceArn": { "Ref" : "DeregTopic" }
                        }
                    }
                }
            ]
        },
        "Queues" : [ {"Ref" : "DeregQueue"} ]
    }
},

For the processor, we chose the aws-sdk for Ruby, easily accessible via gem install aws-sdk. IAM permissions should be set up ahead of time.

require 'rubygems'
require 'aws-sdk'
require 'json'
require 'time'
require 'chef'

asg_queue_url = ENV['ASG_DEREG_QUEUE_URL']
asg_error_queue_url = ENV['ASG_DEREG_ERROR_QUEUE_URL']
topic_arn = ENV['WARNING_TOPIC']
sqs = AWS::SQS.new(:access_key_id => ENV['ADMINPROC_USER'],
                   :secret_access_key => ENV['ADMINPROC_PASS'],
                   :sqs_endpoint => asg_queue_url)
sns = AWS::SNS.new(:access_key_id => ENV['RELEASE_USER'],
                   :secret_access_key => ENV['RELEASE_PASS'])
Chef::Config.from_file(ENV['HOME'] + '/.chef/knife.rb')
rest = Chef::REST.new(Chef::Config[:chef_server_url])

sqs.queues[asg_queue_url].poll do |m|
  sns_msg = m.as_sns_message
  body = JSON.parse(sns_msg.to_h[:body])
  event = body["Event"]
  begin
    if event.include? "autoscaling:EC2_INSTANCE_TERMINATE"
      time = Time.now.utc.iso8601
      # here we assume that the ec2 instance id = node name
      iid = body["EC2InstanceId"]

      puts "deleting node " + iid + "\n"
      del_node = rest.delete_rest("/nodes/" + iid)

      puts "deleting client " + iid + "\n"
      del_client = rest.delete_rest("/clients/" + iid)

    elsif event.include? "autoscaling:TEST_NOTIFICATION"
      m.delete
    end

  # on failure:
  rescue
    msg = "There was a problem deregistering instance #{iid}.\n" + $!.to_s + "\n" + body.to_a.sort.join("\n")
    $stderr.puts msg

    # send alert to staff
    topic = sns.topics[topic_arn]
    topic.publish(msg)

    # put the message in DeregErrorQueue
    error_q = sqs.queues[asg_error_queue_url]
    error_q.send_message(m.body)    # forward the original message body for later replay

    # keep moving
    next
  end
  puts ""
end

Easy peasy.

% tail -f process_asg_queue.log

2012-07-01T16:12:50Z: nodename =   bas-testo-test-i-1288786a, iid = i-1288786a
knife node delete   bas-testo-test-i-1288186a -y...
   result = true
knife client delete   bas-testo-test-i-1218786a -y...
   result = true

2012-07-01T16:13:22Z: nodename =   bas-database-database-i-1c867664, iid = i-1c867664
knife node delete   bas-database-database-i-1c167664 -y...
   result = true
knife client delete   bas-database-database-i-1c167664 -y...
   result = true
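
The core of the handler, pulling the event and instance ID out of the SNS envelope and deciding what to do with them, can be exercised without AWS or a Chef server. Here is a minimal sketch using only the Ruby standard library. The sample payload is illustrative and trimmed to the two fields the script reads, and parse_asg_notification and handle_notification are hypothetical helper names, not part of aws-sdk or Chef. The deleter and failure handler are passed in as lambdas, standing in for the rest.delete_rest calls and the alert/error-queue steps.

```ruby
require 'json'

TERMINATE = 'autoscaling:EC2_INSTANCE_TERMINATE'

# Unwrap an SNS notification delivered to SQS: the outer JSON envelope
# carries the Auto Scaling notification as a JSON string under "Message".
def parse_asg_notification(raw_sqs_body)
  envelope = JSON.parse(raw_sqs_body)
  inner = JSON.parse(envelope.fetch('Message'))
  { event: inner['Event'].to_s, instance_id: inner['EC2InstanceId'] }
end

# Decide what to do with one message. `delete` and `on_failure` are
# hypothetical stand-ins for the Chef REST deletes and the alerting steps.
def handle_notification(raw_sqs_body, delete:, on_failure:)
  note = parse_asg_notification(raw_sqs_body)
  return :ignored unless note[:event].include?(TERMINATE)

  # Same assumption as the script above: instance id == node name.
  iid = note[:instance_id]
  delete.call("/nodes/#{iid}")
  delete.call("/clients/#{iid}")
  :deregistered
rescue StandardError => e
  # Hand the error and the untouched body to the caller, then keep moving.
  on_failure.call(e, raw_sqs_body)
  :failed
end

# Illustrative payload (assumption): real notifications carry more fields.
SAMPLE = JSON.generate(
  'Type' => 'Notification',
  'Message' => JSON.generate(
    'Event' => TERMINATE,
    'EC2InstanceId' => 'i-0123456789abcdef0'
  )
)

deleted = []
failures = []
result = handle_notification(
  SAMPLE,
  delete: ->(path) { deleted << path },
  on_failure: ->(err, body) { failures << [err.message, body] }
)
```

Keeping the AWS and Chef calls behind lambdas means the happy path, the test notification, and a malformed message can all be checked on a laptop before the script ever touches a real queue.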

All that remains is to properly daemonize. We chose to wrap this guy up in supervisord.

[program:chef_dereg]
directory=<%= node['script_dir'] %>
command=ruby <%= node['script_dir'] %>/process_asg_queue.rb
autostart=true
autorestart=true
startretries=5
stdout_logfile=<%= node['log_dir'] %>/process_asg_queue.log
redirect_stderr=true
stopsignal=INT
stdout_logfile_maxbytes=15000000
stdout_logfile_backups=45
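
Since supervisord stops the program with INT (stopsignal=INT), one option is to trap that signal and let the message in hand finish before exiting, rather than dying mid-delete. Below is a minimal sketch of that pattern in core Ruby. It is not how the script above is written: run_worker is a hypothetical helper, and the block stands in for handling one message.

```ruby
# Run the given block repeatedly until INT arrives; the iteration in
# progress is always allowed to finish. Returns the number of completed
# iterations. `max_iterations` exists only so the demo terminates.
def run_worker(max_iterations: nil)
  stop = false
  # Only flip a flag inside the handler; do the real work outside it.
  previous = Signal.trap('INT') { stop = true }
  completed = 0
  until stop || (max_iterations && completed >= max_iterations)
    yield completed
    completed += 1
  end
  completed
ensure
  # Put back whatever handler was installed before we started.
  Signal.trap('INT', previous || 'DEFAULT')
end

# Demo: the third iteration signals this very process, much as
# supervisord would on `supervisorctl stop`.
count = run_worker(max_iterations: 10) do |i|
  Process.kill('INT', Process.pid) if i == 2
  sleep 0.05 # stand-in for handling one message
end
```

With the flag approach the loop exits between messages, so a stop request never interrupts a half-finished node/client delete pair.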

Finally, the error handling proved very tolerant in testing, but you never know, right? So as a catch-all, set an alarm on the queue and never look at it again. Until you have to.

        "DeregQueueAlarm": {
            "Type": "AWS::CloudWatch::Alarm",
            "Properties": {
                "AlarmDescription": "Alarm if queue is not being processed.  The processor might have died.",
                "Namespace": "AWS/SQS",
                "MetricName": "ApproximateNumberOfMessagesVisible",
                "Dimensions": [{
                    "Name": "QueueName",
                    "Value" : { "Fn::GetAtt" : ["DeregQueue", "QueueName"] }
                }],
                "Statistic": "Average",
                "Period": "300",
                "EvaluationPeriods": "3",
                "Threshold": "1",
                "ComparisonOperator": "GreaterThanOrEqualToThreshold",
                "AlarmActions": [{
                    "Ref": "CriticalTopic"
                }]
            }
        },

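Note: alarming on the INSUFFICIENT_DATA state is not feasible for SQS, since a quiet queue can stop reporting metrics and land in that state on its own.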
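
To make the alarm settings concrete: with a 300-second period, three evaluation periods, and a threshold of 1, the queue has to show a backlog for roughly 15 minutes straight before the alarm fires. The sketch below is a simplified model of that rule, not CloudWatch's actual evaluation logic (which also has to deal with missing data). alarm_state is a hypothetical helper and the datapoints are made up.

```ruby
PERIOD_SECONDS = 300
EVALUATION_PERIODS = 3
THRESHOLD = 1

# Each element is the average ApproximateNumberOfMessagesVisible for one
# period, oldest first. Simplified rule: ALARM only when every one of the
# most recent EVALUATION_PERIODS datapoints is >= THRESHOLD.
def alarm_state(period_averages)
  recent = period_averages.last(EVALUATION_PERIODS)
  return :insufficient_data if recent.size < EVALUATION_PERIODS
  recent.all? { |avg| avg >= THRESHOLD } ? :alarm : :ok
end

healthy = alarm_state([0, 2, 0, 0]) # a brief spike that the processor drained
stuck   = alarm_state([0, 1, 3, 5]) # backlog present for three periods running
minutes_to_fire = PERIOD_SECONDS * EVALUATION_PERIODS / 60
```

Under this model a single busy period never pages anyone; only a backlog that survives three consecutive periods does, which is what you would expect if the processor has died.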