One of my jobs (at least as I see it) is not only to automate infrastructure, but also to consolidate and simplify it. Too many companies I have worked at have small bits of amazing things surrounded by convoluted messes left by others. As we move more and more into public cloud infrastructure, in this case AWS, I started thinking about what I could simplify using AWS-only tools.

The old way of doing things (as set up by the engineers before me) was to have loads of cron jobs running across multiple machines for tasks like taking EBS snapshots. This isn’t a problem when everything is humming along perfectly, but as soon as something breaks, the hunt for where it broke begins. My old strategy was to run these jobs from the bastion hosts (the machines that sit at the edge of the VPC and provide access to the machines inside it). But even this could be missed by someone who doesn’t understand my particular way of doing things. It also doesn’t provide consolidated logging unless you ship the logs from every bastion host into some logging service. Enter Lambda.

If you haven’t heard of Lambda, it is worth reading about: the serverless system that Amazon built is far more capable than my simple use case requires. My specific use case for now, though, is to run cron jobs that interact with various AWS services. Using Python 3 and the Boto 3 library (boto3), I’ve already moved two of our major tasks over to Lambda: taking EBS snapshots and cleaning out orphaned ECR images. Both tasks are relatively simple, but would normally require a server to run them on a cron schedule. With Lambda you can run these two simple scripts on a schedule, billed in sub-second increments, to keep your infrastructure clean and backed up.

ECR Cleaning

The system we have set up currently builds Docker images and pushes these tagged images to an ECR repository. In the testing environment, many images are built and pushed with the same tags, so each new push takes over the tag and leaves the previous image orphaned with no tags. Since ECR charges for storage, it is wasteful to keep all these untagged images around for long. There are also limits on how large an ECR repository can grow, which we would hit eventually. The script I wrote in Python does the following: it lists every ECR repository in the account, searches each repository for images that are UNTAGGED, and deletes those images from the repository.

import boto3

client = boto3.client('ecr')

def get_repositories():

    global repositories
    repositories = []

    repo_list_client = client.get_paginator('describe_repositories')
    for response in repo_list_client.paginate():
        for repo in response['repositories']:
            repositories.append(repo['repositoryName'])

def get_image_list(repository_name):
    # Collect the digest of every untagged image in the repository.
    global image_id_list
    image_id_list = []

    # list_images returns its results in pages, so use a paginator
    # instead of a single call that could miss images.
    image_list_client = client.get_paginator('list_images')
    pages = image_list_client.paginate(
        repositoryName = repository_name,
        filter = {
            'tagStatus': 'UNTAGGED'
        }
    )

    for page in pages:
        for image in page['imageIds']:
            image_id_list.append(image['imageDigest'])

def delete_untagged_images(repository_name, image_sha256):

    delete = client.batch_delete_image(
        repositoryName = repository_name,
        imageIds = [
            {
                'imageDigest': image_sha256
            },
        ]
    )

def main():

    get_repositories()

    for repository in repositories:
        get_image_list(repository)
        for image_sha256 in image_id_list:
            print('Removing %s from %s'%(image_sha256, repository))
            delete_untagged_images(repository, image_sha256)

if __name__ == '__main__':
    main()
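
The script above makes one API call per image. Since batch_delete_image takes a list of image IDs, the same cleanup could be done in fewer calls. The following is a minimal sketch, assuming a limit of 100 image IDs per request (worth checking against the current ECR API reference). The helper names chunked and delete_in_batches are hypothetical and not part of the original script, and the client argument is expected to be a boto3.client('ecr').

```python
def chunked(items, size=100):
    # Yield consecutive slices of at most `size` items.
    for start in range(0, len(items), size):
        yield items[start:start + size]


def delete_in_batches(client, repository_name, digests):
    # Delete the given image digests in batches and return how many
    # images the API reported as deleted.
    deleted = 0
    for batch in chunked(digests):
        response = client.batch_delete_image(
            repositoryName=repository_name,
            imageIds=[{'imageDigest': digest} for digest in batch],
        )
        # The response lists deleted images under 'imageIds' and any
        # images it could not delete under 'failures'.
        deleted += len(response.get('imageIds', []))
    return deleted
```

In main(), this would replace the inner loop over image_id_list with a single call such as delete_in_batches(client, repository, image_id_list).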

EBS Snapshots

Another task that was originally being performed by way too many individual machines was taking EBS snapshots. These daily snapshots were originally scheduled by some of the hosts, on their own volumes. The problem was that, until I stumbled upon these jobs, there was no way for me to tell how the snapshots were happening. Even more troubling, the cron jobs were added by hand, so if a host had to be rebuilt, its snapshots would have stopped. My solution was to again use Python, Boto, and Lambda. This script searches all EC2 instances for the tag Backup with a value of true. If the tag is present, it takes a snapshot of the instance’s EBS volume and labels it with the instance’s Name tag.

import boto3

ec = boto3.client('ec2')
snapshots = {}

def take_snapshot(name, vol_id):

    print("Snapshot scheduled for %s on volume %s"%(name, vol_id))
    ec.create_snapshot(VolumeId=vol_id, Description=name)

def get_instances():
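    # describe_instances returns its results in pages, so walk every
    # page instead of relying on a single call.
    paginator = ec.get_paginator('describe_instances')
    pages = paginator.paginate(
        Filters = [
            {'Name': 'tag:Backup', 'Values': ['true', 'True']},
        ]
    )

    # Flatten pages -> reservations -> instances into one list.
    instances = [
        i
        for page in pages
        for r in page['Reservations']
        for i in r['Instances']
    ]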

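    for instance in instances:
        # Fall back to the instance ID when there is no Name tag, so a
        # name is never carried over from the previous instance.
        instance_name = instance['InstanceId']
        for entry in instance.get('Tags', []):
            if entry['Key'] == 'Name':
                instance_name = entry['Value']
        for ebs in instance['BlockDeviceMappings']:
            if ebs.get('Ebs', None) is None:
                continue
            ebs_id = ebs['Ebs']['VolumeId']
            # Key by volume ID so an instance with several volumes
            # keeps all of them instead of only the last one.
            snapshots[ebs_id] = instance_name


def main():
    print("Preparing list of Snapshots")
    get_instances()

    for vol_id, name in snapshots.items():
        take_snapshot(name, vol_id)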


if __name__ == '__main__':
    main()
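
Both scripts end with an if __name__ == '__main__' guard, which Lambda never triggers: Lambda imports the module and calls the handler function named in the function's configuration. Below is a minimal sketch of such an entry point for the snapshot script. The name lambda_handler is a common convention rather than something from the original scripts, and the snapshots dict and main() shown here are stand-ins for the ones defined above. Clearing snapshots first matters because module-level state can survive between invocations when Lambda reuses a warm execution environment.

```python
# Stand-ins for the module-level dict and main() from the snapshot script.
snapshots = {}


def main():
    # Hypothetical placeholder: the real main() fills `snapshots` from
    # EC2 and requests a snapshot for each entry.
    snapshots['example-key'] = 'example-value'


def lambda_handler(event, context):
    # Lambda passes the triggering event and a context object; a
    # scheduled job like this one needs neither.
    # Reset module-level state so entries left over from an earlier
    # invocation in the same warm environment are not processed again.
    snapshots.clear()
    main()
    return {'entries_processed': len(snapshots)}
```

With the function's handler setting pointing at this function, a scheduled EventBridge (CloudWatch Events) rule can invoke it daily in place of the old cron entries. The ECR script can be wrapped the same way.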