Archiving large ‘folders’ in S3

Radu Diță
8 min read · Apr 10, 2020


Abstract

Sometimes you need to archive (and maybe compress) a large number of files in S3, but S3 does not offer an out-of-the-box solution for this.

When the total file size is large this makes running a Lambda impractical, both because of the storage space required (this can be worked around using streams) and the time constraint (Lambdas can’t run for more than 15 minutes, and this cannot be worked around).

The following solution works with very large files, but at the cost of infrastructure complexity.

TL;DR;

  • Create a Docker image that uses the JS SDK and Archiver to do the archiving
  • Create a Fargate task that uses the Docker image
  • Use Lambda to trigger the Fargate Task
  • Send messages to SQS to trigger the Lambda

Overview

The whole task is split into:

  • Node.js app that does the actual archiving
  • Docker image that loads the Node.js app and runs it
  • Fargate task that uses the Docker image and passes in env vars
  • Lambda function to trigger the Fargate task
  • SQS as trigger for Lambda
  • A lot of ‘glue’ to make all of this work

Folder structure

This is the project structure we’re going to use:

Folder structure
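
Roughly, the layout looks like this (package.json is implied by the npm install step later on; the other files are named in the text):

.
├── app
│   ├── archive.js
│   └── package.json
├── Dockerfile
└── push_docker_image_to_ecr.sh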

Inside app we keep the Node.js files.

Dockerfile is, well, for the Docker image.

push_docker_image_to_ecr.sh is a helper script for pushing Docker images to AWS ECR.

Node.js app

The core of the task is the Node.js app. It archives an S3 ‘folder’ (actually any object that starts with a given prefix) and saves it in the same S3 bucket, at a desired location.

It needs the following information to do the job:

  • Bucket name
  • Source ‘folder’, prefix of files to be archived
  • Destination, S3 key where the archive will be saved
  • Exclude, a regular expression that allows for excluding files

All of these are passed as environment variables.

Show me the code

This is the actual code that does the archiving:

archive.js
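
A minimal sketch of what archive.js looks like, assuming aws-sdk v2 and archiver; the original gist may differ in details, but the streaming idea is the same:

// Sketch of archive.js: streams objects under a prefix into a zip, streamed back to S3
const AWS = require('aws-sdk');
const archiver = require('archiver');
const { PassThrough } = require('stream');

const s3 = new AWS.S3();

const BUCKET = process.env.BUCKET;           // bucket name
const KEY = process.env.KEY;                 // source 'folder' (prefix)
const DESTINATION = process.env.DESTINATION; // key of the resulting archive
const EXCLUDE = process.env.EXCLUDE;         // optional regex for files to skip

async function listKeys() {
  const keys = [];
  let ContinuationToken;
  do {
    const page = await s3
      .listObjectsV2({ Bucket: BUCKET, Prefix: KEY, ContinuationToken })
      .promise();
    page.Contents.forEach((obj) => keys.push(obj.Key));
    ContinuationToken = page.NextContinuationToken;
  } while (ContinuationToken);
  return keys;
}

async function archive() {
  const exclude = EXCLUDE ? new RegExp(EXCLUDE) : null;
  const keys = (await listKeys()).filter((k) => !exclude || !exclude.test(k));

  const zip = archiver('zip');
  const output = new PassThrough();
  zip.pipe(output);

  // Upload the archive while it is being built, so nothing touches the disk
  const upload = s3
    .upload({ Bucket: BUCKET, Key: DESTINATION, Body: output })
    .promise();

  for (const key of keys) {
    // Stream each object straight from S3 into the zip
    const source = s3.getObject({ Bucket: BUCKET, Key: key }).createReadStream();
    zip.append(source, { name: key.replace(KEY, '') });
  }

  await zip.finalize();
  await upload;
  console.log(`Archived ${keys.length} files to ${DESTINATION}`);
}

archive().catch((err) => {
  console.error(err);
  process.exit(1);
});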

This is mostly taken from this dev.to post, with the added ability to exclude files and added environment variables.

The app doesn’t download the files; it just streams them from source to destination. This allows archiving very large files without needing much disk space.

Dependencies

The app only needs two dependencies:

  • archiver
  • aws-sdk

Install the dependencies

npm install --save archiver
npm install --save aws-sdk

Testing

Because all the logic is inside a Node.js app and configuration comes from env vars, it is extremely easy to test.

  • run npm install
  • set the env vars and call node to execute the file: BUCKET=bucketname KEY=folderkey DESTINATION=archivekey node app/archive.js

Note: for this to work properly and actually produce an archive, the machine running it needs AWS credentials that allow access to the S3 bucket.

Docker

Docker’s role in all of this is pretty simple: bundle the Node.js app so we can run it on an ECS cluster.

Dockerfile
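
A minimal sketch of such a Dockerfile, assuming a stock Node base image (the base image and paths are assumptions; the original may differ):

# Sketch of a Dockerfile for the archiving app
FROM node:12

WORKDIR /usr/src/app

# Install dependencies first so this layer is cached between builds
COPY app/package*.json ./
RUN npm install

# Copy the rest of the app
COPY app/ .

CMD ["node", "archive.js"]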

To build the image run docker build -t archive .

At this point you’ll have a Docker image that you can use, but we’ll need to set up some AWS infrastructure before moving on.

AWS

This is the part where we need to configure AWS to make all of this work.

We’ll need to set up the following

  • ECR repository, Fargate cluster, ECS task definition
  • SQS where to send the archiving message
  • Lambda to receive the SQS message and trigger the Fargate Task
  • And of course some IAM Roles (Lambda Role to access SQS and start Fargate Tasks, Fargate Role to access S3)

We’ll start with creating the ECR repo so we can push the docker image.

ECR repository

ECR is a Docker repository hosted by AWS.
You’ll have to upload your Docker images to ECR so that AWS can pull them and use them as needed.

Use the AWS Console and search for ECR. Click it.

Now, on the left-hand side, open the menu and choose Repositories.

Click on Create Repository on the right-hand side and create a new repo.

Creating a new ECR repo

At this point you have to name the repo. I usually follow this naming scheme for all AWS resources: projectName-environment-{optionalId}. This makes it easy to group resources by project and to quickly see which environment they are running in. The blacked-out part is your AWS Account ID.

Now you should follow the instructions about how to push an image to ECR. I’ll not go into details regarding this.

You can also use this script, which aids in pushing a docker image to ECR. Note: you will need to log in to AWS ECR prior to running this.
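
Such a script typically boils down to tagging the image with the ECR repository URI and pushing it, roughly like this (the account ID, region and repo name are placeholders):

#!/bin/bash
# push_docker_image_to_ecr.sh (sketch) - assumes you are already logged in to ECR
ACCOUNT_ID=123456789012
REGION=eu-west-1
REPO=myproject-production-archive

docker build -t $REPO .
docker tag $REPO:latest $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO:latest
docker push $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO:latest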

Fargate

Now that we have a Docker Image on ECR we need a place to run it. One way of doing this is to create a cluster and run it using Fargate.

Cluster

To create a cluster, go to ECS and click on Create Cluster. Choose Fargate as the launch type and name the cluster. Using a naming scheme is useful here too.

IAM Role

The ECS Task will need access to the S3 bucket to be able to read, create and list objects. The best way to do this is to create an IAM Role for this specific task that allows access only to the desired bucket.

Try to refrain from giving access to all S3 resources. The role should have a policy attached similar to this. Replace bucket-name and aws-account-id with appropriate values.

ECS Task Role Policy
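
A sketch of such a policy, covering read, create and list on a single bucket (S3 bucket ARNs don’t include the account ID, so only bucket-name appears here):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::bucket-name",
        "arn:aws:s3:::bucket-name/*"
      ]
    }
  ]
}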

Task Definition

Now that we have a Docker image and an ECS Task Role we’re ready to create an ECS Task Definition.

For this we need to create a new task definition and fill in the following details:

  • Use the previously created IAM role as the Task Role, not the Task Execution Role
  • Use 4GB of memory and 2 vCPU for this task
  • Create a container that uses the Docker image we’ve previously pushed to ECR
  • Add env vars for BUCKET, KEY, DESTINATION and EXCLUDE

Creating an ECS Task Definition

Testing
At this point we can test that the app is working. Head to the cluster we’ve created and use Run new Task from the Tasks tab. Select the task definition and launch it from there.

Note: you’ll need to override the env vars

SQS

The entry point for starting an archive job will be an SQS queue.

Head to the SQS service and create a Standard Queue. You don’t need to configure anything special for it.

Just give it a good name.

Messages sent to the Q will use JSON. A message is similar to this:
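
The exact key names may differ; here they simply mirror the env vars the task expects:

{
  "bucket": "my-bucket",
  "key": "uploads/2020/",
  "destination": "archives/2020.zip",
  "exclude": "\\.tmp$"
}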

Lambda

Lambda will be the glue between SQS and the Fargate Cluster.

It will poll for messages from SQS, read the JSON message and start an ECS Task inside the cluster.

For this to work the lambda will need permissions to receive messages from the Q, delete messages from the Q, start an ECS Task and pass IAM roles to the ECS task.

IAM Role
We should create an IAM Role for this lambda that has the AWSLambdaBasicExecutionRole policy and a policy that allows access to our resources (SQS, ECS Task Definition and IAM)

What permissions should the policy have?

  • For accessing SQS: sqs:ReceiveMessage, sqs:DeleteMessage and sqs:GetQueueAttributes
  • For starting a task: ecs:RunTask
  • For passing IAM Roles: iam:PassRole. We actually need to pass two different roles, the Task Role and the Task Execution Role, so don’t forget to list both of them under Resources

The policy should be similar to this:
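
A sketch of such a policy; every ARN below is a placeholder that needs to be adjusted:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sqs:ReceiveMessage",
        "sqs:DeleteMessage",
        "sqs:GetQueueAttributes"
      ],
      "Resource": "arn:aws:sqs:eu-west-1:aws-account-id:queue-name"
    },
    {
      "Effect": "Allow",
      "Action": "ecs:RunTask",
      "Resource": "arn:aws:ecs:eu-west-1:aws-account-id:task-definition/task-definition-name:*"
    },
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": [
        "arn:aws:iam::aws-account-id:role/ecs-task-role",
        "arn:aws:iam::aws-account-id:role/ecs-task-execution-role"
      ]
    }
  ]
}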

You need to replace ecs-task-role with the name of the role you’ve created for the ECS Task.

Code
Lambda will read one message at a time from the Q, launch an ECS Task in the desired cluster and pass the Bucket, Key, Destination and Exclude values as container env vars.

One thing to note is that even if the IAM Role for the Lambda only allows for reading from a specific Q, and has access to specific IAM Roles, the code is agnostic about this. You could use a different Q as a trigger and the code doesn’t need to change. You’ll have to update the IAM role though.

This is the actual code for the lambda:
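
A sketch of the handler, assuming aws-sdk v2, lowercase message keys that mirror the env vars, and a container named archive in the task definition (all of these are assumptions and may differ from the original):

// index.js: sketch of the Lambda handler
const AWS = require('aws-sdk');
const ecs = new AWS.ECS();

exports.handler = async (event) => {
  for (const record of event.Records) {
    let message;
    try {
      message = JSON.parse(record.body);
    } catch (err) {
      // Swallow malformed messages so they are still removed from the Q
      console.error('Could not parse message', record.body);
      continue;
    }

    await ecs
      .runTask({
        cluster: process.env.CLUSTER_NAME,
        taskDefinition: process.env.TASK_DEFINITION,
        launchType: 'FARGATE',
        count: 1,
        networkConfiguration: {
          awsvpcConfiguration: {
            subnets: [process.env.SUBNET],
            securityGroups: [process.env.SECURITY_GROUP],
            assignPublicIp: 'ENABLED', // assumes a public subnet so the image can be pulled from ECR
          },
        },
        overrides: {
          containerOverrides: [
            {
              name: 'archive', // container name from the task definition (assumption)
              environment: [
                { name: 'BUCKET', value: message.bucket },
                { name: 'KEY', value: message.key },
                { name: 'DESTINATION', value: message.destination },
                { name: 'EXCLUDE', value: message.exclude || '' },
              ],
            },
          ],
        },
      })
      .promise();
  }
};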

There are a few env vars that need to be defined for this to work:

  • TASK_DEFINITION the name of the ECS Task definition
  • CLUSTER_NAME the name of the ECS cluster where to start the task
  • SUBNET the ID of the VPC subnet in which the task will run
  • SECURITY_GROUP the ID of the Security Group under which the task will run

Notice that we are catching the JSON parsing error. This is important, as otherwise the lambda would end with an uncaught exception and the message would not be removed from the Q, leading to it being processed again.

Testing
Create a test event for an SQS trigger and send an appropriate message. You can then check the cluster to see that a new job has started. You can also create a second event with a malformed JSON message to check that a job is not started.
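
A minimal test event, reusing the message format from above (a real SQS event carries more fields, but the handler sketch only reads body, which is an escaped JSON string):

{
  "Records": [
    {
      "body": "{\"bucket\":\"my-bucket\",\"key\":\"uploads/2020/\",\"destination\":\"archives/2020.zip\"}"
    }
  ]
}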

Wrap up

Now you should have a setup where you can start an archiving job for files inside an S3 bucket.

The actual code that does the job is under 100 lines. The difficult part is setting up all the infrastructure that will allow for running this job whenever you want.

Further development

There are some improvements that could be made to this setup:

  • Starting an ECS Task may fail; we should log this as a failed task and maybe retry it later
  • There is no validation done on the SQS message. We should make sure that all the required keys are present. At the moment the ECS Task will simply fail and there’s no easy way of knowing that.
