
Using AWS Lambda & Node.js to scan your S3 uploads

Learn how the Truework team leverages AWS Lambda functions, ClamAV, and Node.js to scan S3 uploads and protect its customers.
Victor Kabdebon
Truework

Link to the repository: https://github.com/truework/lambda-s3-antivirus

At Truework, we work hard to protect employee privacy. Key to this mission is helping HR teams build a strong culture of security when it comes to sharing employee data.

A common reason for employee data to be shared outside of the organization is that a third party requires information. Typically these transactions revolve around loans, mortgages, or background checks. As part of the verification, the third-party requester will send along documents such as PDFs and images.

It is not recommended to download and open attachments from parties outside of your organization. Popular corporate email providers, like Gmail, even warn explicitly against trusting attachments from unknown sources. Unfortunately, email attachments are the primary way third-party requesters transmit both signed consent forms and other documents authorizing the release of the information. It is, therefore, a constant struggle to balance good email security against these often urgent and important requests, especially because validating the information contained in the forms and employee authorization documents is one of several important steps in verifying the legitimacy of these requests and protecting employees from targeted phishing campaigns.

In our employee information request flow, requesters frequently attach documents (PDFs, Word documents, images) alongside their requests. Just like HR departments, we need to be especially careful handling these file attachments. To make sure that we don’t open or download any malicious files, we needed to build a system to scan and flag anything suspicious. We’ve laid out our approach below. We use AWS S3 for our file storage, but this solution can be adapted to other platforms.

User uploads & AWS Lambda

Uploads are infrequent and maintaining running instances waiting for this event wouldn’t be a cost-effective solution, even with small EC2 instances. AWS Lambda allows you or your team to run short-lived functions which can be triggered only when certain events happen. They can be time-based, for example, “once every 5 hours” or based on triggers from other systems, for example, “a new file has been uploaded to an S3 bucket”. AWS Lambda functions are only run when specific events are received, costing you nothing more than the storage of the executable at other times.

On top of not having to manage running instances, AWS Lambda has three characteristics that make it especially attractive for solving this problem:

  • Fine-grained time-based pricing, in increments of 100 ms at the time of writing (if it takes 250 ms to run the scan, you’ll be billed for 300 ms). It can be cheap, as you only pay for what you use.
  • Massively scalable - Since uploads generally happen in short groupings, it’s convenient to have the ability to spike to multiple instances so all scans can happen in parallel and be finished quickly.
  • Recent updates to AWS Lambda have enabled developers to use JavaScript with a modern runtime (Node.js 8.10) which makes development easier.
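To make the billing rule in the first bullet concrete, here is a minimal sketch of the rounding it describes. The function name billedDurationMs and its default increment are illustrative assumptions, not part of any AWS API.

```javascript
// Rounds a measured duration up to the next billing increment.
// `incrementMs` defaults to the 100 ms granularity described in this post.
function billedDurationMs(durationMs, incrementMs = 100) {
  if (durationMs < 0 || incrementMs <= 0) {
    throw new RangeError('durationMs must be >= 0 and incrementMs must be > 0');
  }
  return Math.ceil(durationMs / incrementMs) * incrementMs;
}

console.log(billedDurationMs(250)); // 300, the example from the list above
```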

ClamAV - Free & Open source antivirus

ClamAV is an open-source antivirus that can be used to scan files on any system. It is the de-facto solution for a lot of email servers and has been deployed in many large-scale systems. It can be installed directly from the package managers of most distributions.

It is currently not possible to install packages on the AWS Lambda instances on which your function is going to run - at least not from a package manager. Instead, the way to go is to build for the target machine (amazonlinux), package the executable and the libraries together, zip up, and upload the zip file to AWS Lambda.

We also wanted to be able to build the Lambda function from our own computers rather than having to log on to a separate machine to pull the required files. Luckily, Docker enables us to download an image of Amazon Linux and retrieve the executable directly from our preferred development environment.

A note on virus definitions for ClamAV: ClamAV recommends updating virus definitions at a regular interval. Because of some file size limitations of AWS Lambda, it is preferable to only use the .cvd files, the compressed version of the virus definitions.

We’ll present a solution that relies on two lambda functions, coded in JavaScript/Node.js 8.10:

  • A virus definition updater will run on a fixed schedule to keep the virus definitions up-to-date. It relies on CloudWatch events set to run on a regular schedule.
  • A virus scanner will be triggered by S3 upload events. This function will scan incoming files and label them based on the result of the scan.

The Lambda Functions

Setting your environment

Before you can get started, make sure you have the following installed:

  • Docker, downloaded and set up on your computer
  • A way to run bash scripts - Mac OS, Linux, and modern Windows will all work fine with their respective tools
  • Node.js and an IDE/your favorite text editor if you want to modify the Lambda function

Build script - build_lambda.sh

The build_lambda.sh script will make sure that you have a recent version of ClamAV downloaded and all the files packaged in a zip file (lambda.zip) that you can then upload to AWS Lambda.

The script has roughly the following steps:

  • Download and run an amazonlinux instance from the repository and install ClamAV

We set up a shared folder between the host machine and the docker container as such:

docker create -i -t -v ${PWD}/clamav:/home/docker --name s3-antivirus-builder amazonlinux

  • Use rpm2cpio & cpio to extract the necessary files (library and executables) from the RPM packages

rpm2cpio clamav-0*.rpm | cpio -idmv

  • Copy all the files (.js, libraries, executables, and configurations) into the folder structure and zip the result
  • Cleanup

A few words on the configuration:

DatabaseMirror database.clamav.net
CompressLocalDatabase yes
ScriptedUpdates no

As we mentioned, AWS limits how much disk space a Lambda function can use. As of the writing of this post, the limit applies to the sum of the source code and additional data. ClamAV definition files can exceed several hundred MB, which can obviously become a problem. To work around this, we disable scripted updates and force compression of the local database.

AV Definition update code

The definition update Lambda Function is straightforward:

  • Clean the /tmp/ folder of any remaining files. We observed that if your Lambda function runs at a short interval, the /tmp/ folder may still contain old files from a previous run.
  • Run freshclam to download the new definitions to the /tmp/ folder
  • Upload new antivirus definitions to S3 bucket

Virus Scan Code

The virus scan lambda function has a few more steps but is still relatively easy to follow.

  • Download definitions from the S3 bucket into the /tmp/ folder
  • Extract the S3 bucket name and S3 Key from the file upload event
  • Download the incoming file in /tmp/
  • Run ClamAV on the file
  • Tag the file in S3 with the result of the virus scan

Lambda Function Setup

Create two Lambda functions. Make sure to select a runtime of Node.js 8.10 or above, as well as a role that allows the functions to read from and write to the S3 bucket. Upload the zip file for both functions.

On the configuration screen, you should see something like this:

Virus Definitions Updater

Handler: download-definitions.lambdaHandleEvent

Trigger Events

For the definition updater, we want to run it on a regular basis. CloudWatch can generate an event based on a cron expression. We set ours to run every 6 hours:

cron(0 */6 * * ? *)
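If you would rather create the schedule from code than from the console, the parameters might look like the sketch below. It only builds the parameter objects for the putRule and putTargets operations of the CloudWatch Events client in the AWS SDK for JavaScript v2. The rule name, target Id, and ARN are hypothetical placeholders.

```javascript
// Parameters for CloudWatchEvents.putRule, using the expression shown above.
function buildScheduleRule(ruleName) {
  return {
    Name: ruleName,
    ScheduleExpression: 'cron(0 */6 * * ? *)',
    State: 'ENABLED',
  };
}

// Parameters for CloudWatchEvents.putTargets, pointing the rule at the updater function.
function buildRuleTarget(ruleName, lambdaArn) {
  return {
    Rule: ruleName,
    Targets: [{ Id: 'definitions-updater', Arn: lambdaArn }], // hypothetical target Id
  };
}

// Hypothetical names, for illustration only.
const rule = buildScheduleRule('update-av-definitions');
const target = buildRuleTarget(rule.Name, 'YOUR_UPDATER_LAMBDA_ARN_HERE');
console.log(JSON.stringify({ rule, target }, null, 2));
```

The function also needs a resource-based permission that lets CloudWatch Events invoke it. The console adds this for you when you attach the trigger there.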

Virus Scanner

Handler: antivirus.lambdaHandleEvent

Trigger Events

The virus scanner is based on S3 events. While there are multiple events you can handle, if your bucket deals with third-party uploads, the best approach is to run on “ObjectCreated”.

It should look something like this:
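For teams that prefer to configure the trigger from code, the sketch below builds the parameters for the putBucketNotificationConfiguration operation of the S3 client in the AWS SDK for JavaScript v2. The bucket name and function ARN are placeholders, and the helper name is hypothetical.

```javascript
// Builds the parameters that subscribe a Lambda function to every
// "ObjectCreated" event (put, post, copy, multipart upload) on a bucket.
function buildNotificationParams(bucket, lambdaArn) {
  return {
    Bucket: bucket,
    NotificationConfiguration: {
      LambdaFunctionConfigurations: [
        { LambdaFunctionArn: lambdaArn, Events: ['s3:ObjectCreated:*'] },
      ],
    },
  };
}

// Placeholder values, for illustration only.
const params = buildNotificationParams('YOUR_BUCKET_NAME_HERE', 'YOUR_SCANNER_LAMBDA_ARN_HERE');
console.log(JSON.stringify(params, null, 2));
```

Two things to keep in mind if you go this route: the call replaces the bucket's existing notification configuration, and S3 must separately be granted permission to invoke the function.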

Monitoring & Testing

Make sure that you see definition files being updated in your S3 bucket. For example, check that the last modified date changes between before and after the run.

If you want to trigger the Lambda functions manually (i.e., without waiting for a file to be uploaded or for a CloudWatch event), you can trigger manual events. In the upper-right corner, there’s a test button for handmade events to check that the function is running properly.

You can start from a standard event template and modify it to reflect real data, or you can handcraft very simple events:

  • For the Virus Updater Lambda function, you can enter a dummy event. No data is extracted from the event.
  • For the Virus Scan Lambda function, you will need to have an event that links back to a real and accessible bucket and object. In the code we only use two fields (bucket name and object key), so you can omit all other fields.
  • You can use the bare-bones event below; replace the “name” and “key” values with real ones.

{
  "Records": [
    {
      "s3": {
        "bucket": {
          "name": "YOUR_BUCKET_NAME_HERE"
        },
        "object": {
          "key": "YOUR_PATH_TO_A_FILE.EXTENSION"
        }
      }
    }
  ]
}

If you need to test with an example virus file meant to trigger antivirus software (and yet be harmless for your computer), you can download examples of virus files here.

Optimizations & Caveats

  • There are limitations to the size of the file that can be scanned. Because the total available space is 500 MB, the maximum size of the file you can scan is File Size = 500 MB - (Virus Definitions + Executable Size). It hasn’t been an issue for us.
  • Virus definition updates are not optimized because, as shown above, we disabled scripted updates. As a result, the definition files have to be downloaded in their entirety, which results in more egress traffic.
  • ClamAV performance is great and is on par with a lot of commercially available solutions, but it does not remove the need for antivirus software on your computer or for an additional check.
  • It is a point-in-time check: if the file contained a virus that was unknown at scan time, it might have gone through. We recommend you do a full scan of all your files from time to time.
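The size limit in the first caveat can be written as a small helper. This is a sketch only: the function names are hypothetical, and the definition and executable sizes are made-up numbers chosen to illustrate the arithmetic.

```javascript
const MB = 1024 * 1024;

// File Size = total space - (virus definitions + executable size), never below zero.
function maxScannableBytes(totalBytes, definitionsBytes, executableBytes) {
  return Math.max(0, totalBytes - (definitionsBytes + executableBytes));
}

// Returns true when an object of `contentLength` bytes fits in the remaining space.
function canScan(contentLength, budgetBytes) {
  return contentLength <= budgetBytes;
}

// Hypothetical sizes, for illustration only.
const budget = maxScannableBytes(500 * MB, 200 * MB, 50 * MB);
console.log(budget / MB); // 250
console.log(canScan(10 * MB, budget)); // true
```

In the scanner, the object size is available in the S3 event itself (the "size" field next to the object key), so oversized files can be flagged before they are downloaded.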

Credit goes to for giving the idea in the first place!
