Deploying Big Files with AWS Lambda and EFS Made Easy

Background

In some cases, we want to deploy our trained deep learning models or pre-trained models from platforms like Hugging Face to AWS Lambda for serverless inference.

While the official service for such tasks is Amazon SageMaker, it can be overly complex for simple deployment needs. SageMaker offers benefits like model management and MLOps, but there are scenarios where a simpler solution is preferred.

Deploying large models with Lambda alone is difficult because of the service's package size limits: 50 MB for direct uploads (zipped) and 250 MB unzipped. Container images stored in ECR can be up to 10 GB, but packing every file into the image can lead to slower cold starts and increased costs.

To achieve efficient deployment, I recommend using Lambda with EFS as the file system.
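As a rough illustration of the limits above, here is a minimal sketch that suggests a packaging option from the artifact size. The function name `suggest_packaging` and its return labels are hypothetical, and treating MB/GB as binary units is an assumption made for simplicity.

```python
# Size limits quoted in the text above. Binary units are an assumption here.
ZIP_UPLOAD_LIMIT = 50 * 1024**2      # zip package, direct upload
ZIP_UNZIPPED_LIMIT = 250 * 1024**2   # zip package, unzipped
IMAGE_LIMIT = 10 * 1024**3           # container image stored in ECR


def suggest_packaging(zipped_bytes: int, unzipped_bytes: int) -> str:
    """Return a rough packaging suggestion for a deployment of the given size."""
    if zipped_bytes <= ZIP_UPLOAD_LIMIT and unzipped_bytes <= ZIP_UNZIPPED_LIMIT:
        return "zip"
    if unzipped_bytes <= IMAGE_LIMIT:
        # Fits in an image, but EFS keeps the large files out of the image.
        return "image-or-efs"
    return "efs"
```

For example, a 2 GB model does not fit in a zip package, so the choice is between a container image and EFS; this post takes the EFS route.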

How it Works

EFS is a file system that Lambda can access if it is mounted properly. To achieve this, both the EFS resource and Lambda need to be within a VPC.

According to AWS best practices, only resources that need to be reachable from outside the VPC should be placed in public subnets. Typically, this means components such as a NAT gateway. Keeping everything else in private subnets adds an extra layer of network security, because critical resources are only accessible from within the VPC.

Consequently, EFS mount targets should be placed inside private subnets. Since Lambda functions attached to a VPC do not get public IPs, they should also be placed in private subnets.

Here is an overview of the steps.

1. Create a Lambda function in a VPC with a private subnet.
2. Create an EFS in the same VPC as the Lambda function, also in a private subnet.
3. Create an EFS Mount Target in the Availability Zones (AZs) where the Lambda function will be deployed.
4. Create a Security Group to enable the Lambda function to access the EFS.
5. Mount the EFS from the Lambda settings.
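The mount target step is easy to get wrong when the function runs in several subnets. The sketch below shows the check it implies: every AZ the Lambda function uses needs a mount target. The function name and the AZ names in the usage note are hypothetical.

```python
def azs_missing_mount_target(lambda_subnet_azs, mount_target_azs):
    """Return the AZs where Lambda runs but no EFS mount target exists."""
    return sorted(set(lambda_subnet_azs) - set(mount_target_azs))
```

Calling `azs_missing_mount_target(["ap-northeast-1a", "ap-northeast-1c"], ["ap-northeast-1a"])` returns `["ap-northeast-1c"]`, which means one more mount target is needed before the function can mount EFS from that AZ.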

To insert the file, I usually create a temporary EC2 instance and mount the EFS on that EC2 instance so that I can transfer my files through the EC2 instance to the EFS.
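Once the EFS is mounted on the EC2 instance, the transfer itself is an ordinary file copy. The following is a minimal sketch, assuming the file is already on the instance and the EFS is mounted at a directory you pass in. The helper names `sha256_of` and `copy_to_efs` are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large models do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_to_efs(source: Path, efs_dir: Path) -> Path:
    """Copy a file into the mounted EFS directory and verify the copy."""
    efs_dir.mkdir(parents=True, exist_ok=True)
    target = efs_dir / source.name
    shutil.copyfile(source, target)
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"checksum mismatch after copying {source} to {target}")
    return target
```

The checksum comparison is optional, but it is a cheap way to catch a truncated copy before the Lambda function tries to load the file.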

As a prerequisite, I assume that we already have a VPC with a NAT Gateway and internet access set up from the private subnet.

Sounds Hard Enough... How to build it?

There are many ways to build this infrastructure, and I generally recommend using some form of Infrastructure as Code (IaC) such as Terraform, CloudFormation, or AWS SAM.

Since we are dealing with Lambda, I will be using AWS SAM in this tutorial. If you already use Terraform in your stack, I suggest managing EFS and EC2 with Terraform and Lambda with AWS SAM.

However, for simplicity, I will be using 100% AWS SAM for this tutorial. The easiest way for us to jump-start this is by using sam init.

sam init

After that, you should see the prompt below. Choose 1 - AWS Quick Start Templates.

Which template source would you like to use?
        1 - AWS Quick Start Templates
        2 - Custom Template Location
Choice:

There are a lot of templates. Choose 14 - Lambda EFS example.

Choose an AWS Quick Start application template
        1 - Hello World Example
        2 - Data processing
        3 - Hello World Example with Powertools for AWS Lambda
        4 - Multi-step workflow
        5 - Scheduled task
        6 - Standalone function
        7 - Serverless API
        8 - Infrastructure event management
        9 - Lambda Response Streaming
        10 - Serverless Connector Hello World Example
        11 - Multi-step workflow with Connectors
        12 - GraphQLApi Hello World Example
        13 - Full Stack
        14 - Lambda EFS example
        15 - DynamoDB Example
        16 - Machine Learning

Pick your Python version.

Which runtime would you like to use?
        1 - python3.9
        2 - python3.8
        3 - python3.12
        4 - python3.11
        5 - python3.10

By this point, we should have all we need to deploy Lambda and EFS to the cloud.

The complete generated template can be found here.

SAM Init, Lambda EFS example. Generated Template

Okay, Problem solved?

Not exactly. While using sam init can give us a jump start for our development, there are two problems when using it for Lambda with EFS.

It doesn't provide a way to put data on the EFS.
The bigger problem is that we cannot exactly use it locally. While there are issues on GitHub discussing this, there is currently no way to test it locally.

In the next sections, we will try to fix the above problems.

Improve SAM Template & Add EC2

Adding an EC2 instance does not strictly require rewriting the entire template. However, the template generated by sam init can be hard to read and understand, so I've rewritten it to make clear which inputs are needed and which resources are required for EFS, Lambda, and EC2.

Also, to fix the second problem, we need our Lambda to be image-based instead of a zip file.

First, let's start with the parameters:

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: Build Lambda with EFS!

Parameters:
  VpcId:
    Type: String
    Default: your-vpc-id

  VpcCidr:
    Type: String
    Default: 12.1.0.0/16

  PublicSubnetId:
    Type: String
    Default: your-vpc-id's public subnet

  PrivateSubnetId:
    Type: String
    Default: your-vpc-id's private subnet

Resources:
  # We Fill this Later

As you can see, we need the VPC ID, VPC CIDR, one public subnet ID, and one private subnet ID. Using these inputs, we can create our resources.

Lambda Resources

Next, we define the Lambda resources:

Resources:
  #
  # Lambda
  #
  LambdaExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: [lambda.amazonaws.com]
            Action: ['sts:AssumeRole']
      Policies:
        - PolicyName: root
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - logs:CreateLogGroup
                  - logs:CreateLogStream
                  - logs:PutLogEvents
                  - ec2:CreateNetworkInterface
                  - ec2:DescribeNetworkInterfaces
                  - ec2:DeleteNetworkInterface
                Resource: '*'
              - Effect: Allow
                Action:
                  - elasticfilesystem:ClientMount
                  - elasticfilesystem:ClientWrite
                Resource: '*'

  LambdaFunction:
    Type: AWS::Serverless::Function
    Properties:
      PackageType: Image
      FunctionName: my-function-name
      Role: !GetAtt LambdaExecutionRole.Arn

      # Compute
      Timeout: 600
      MemorySize: 4096
      Architectures:
        - x86_64

      # Network
      FileSystemConfigs:
        - Arn: !GetAtt EFSAccessPoint.Arn # TODO: Implement
          LocalMountPath: /mnt/files
      VpcConfig:
        SecurityGroupIds:
          - !Ref EFSAccessSecurityGroup # TODO: Implement
        SubnetIds:
          - !Ref PrivateSubnetId
    Metadata:
      Dockerfile: Dockerfile
      DockerContext: ./src
      DockerTag: test

To create the Lambda function, we need two resources: the Lambda itself and an IAM Role for executing it (accessing EFS, putting logs, etc.).
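On the function side, the mounted EFS looks like a normal directory. Below is a minimal handler sketch that reads a file from the mount path and caches it across warm invocations. The file name `model.bin` and the `EFS_MOUNT_PATH` environment variable are assumptions for illustration; only `/mnt/files` comes from the template.

```python
import json
import os
from pathlib import Path

MODEL_FILE = "model.bin"  # hypothetical file name
_model_bytes = None       # cached so warm invocations skip the EFS read


def mount_path() -> Path:
    # Defaults to LocalMountPath from the template; the env var is a
    # hypothetical override that makes local testing easier.
    return Path(os.environ.get("EFS_MOUNT_PATH", "/mnt/files"))


def load_model() -> bytes:
    """Read the model file from EFS once and reuse it afterwards."""
    global _model_bytes
    if _model_bytes is None:
        _model_bytes = (mount_path() / MODEL_FILE).read_bytes()
    return _model_bytes


def lambda_handler(event, context):
    model = load_model()
    body = {"model_size_bytes": len(model)}
    return {"statusCode": 200, "body": json.dumps(body)}
```

A real handler would deserialize the bytes with its ML framework instead of returning the size, but the loading pattern stays the same.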

EFS Resources

Next, we create our EFS resources:

  #
  # EFS
  #
  EFSAccessSecurityGroup:
    Type: 'AWS::EC2::SecurityGroup'
    Properties:
      GroupDescription: Security Group for Lambda and EFS communication
      VpcId: !Ref VpcId
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 2049  # NFS port used by EFS
          ToPort: 2049
          CidrIp: !Ref VpcCidr

  EFSFileSystem:
    Type: AWS::EFS::FileSystem
    Properties:
      Encrypted: false

  EFSMountTarget:
    Type: AWS::EFS::MountTarget
    Properties:
      FileSystemId: !Ref EFSFileSystem
      SubnetId: !Ref PrivateSubnetId
      SecurityGroups:
        - !Ref EFSAccessSecurityGroup

  EFSAccessPoint:
    Type: AWS::EFS::AccessPoint
    Properties:
      FileSystemId: !Ref EFSFileSystem
      PosixUser:
        Uid: "1000"
        Gid: "1000"
      RootDirectory:
        CreationInfo:
          OwnerGid: "1000"
          OwnerUid: "1000"
          Permissions: "0777"

The above defines the bare minimum so that our EFS can work. We need an EFS File System, a mount target, and an access point. Additionally, we need a Security Group to allow access between EFS and Lambda.
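The security group above allows NFS traffic on port 2049 from any address inside VpcCidr. To sanity-check whether a given private IP falls under that rule, a small sketch with the standard `ipaddress` module is enough. The function name `allowed_by_vpc_cidr` is hypothetical.

```python
import ipaddress


def allowed_by_vpc_cidr(client_ip: str, vpc_cidr: str) -> bool:
    """Return True if client_ip is inside the CIDR used by the ingress rule."""
    return ipaddress.ip_address(client_ip) in ipaddress.ip_network(vpc_cidr)
```

With the default parameter from the template, `allowed_by_vpc_cidr("12.1.3.7", "12.1.0.0/16")` is `True`, while an address outside that range such as `10.0.0.5` is rejected.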

EC2 Resources

Finally, we add the EC2 instance to interact with EFS:

  #
  # EC2 Instance
  #
  EC2InstanceIAMRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - ec2.amazonaws.com
            Action:
              - 'sts:AssumeRole'
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/AmazonElasticFileSystemClientFullAccess
        - arn:aws:iam::aws:policy/AmazonEC2ReadOnlyAccess

  EC2InstanceProfile:
    Type: AWS::IAM::InstanceProfile
    Properties:
      Roles:
        - !Ref EC2InstanceIAMRole

  EC2Instance:
    Type: AWS::EC2::Instance
    Properties:
      ImageId: ami-07c589821f2b353aa # ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20231207
      InstanceType: t2.micro
      SubnetId: !Ref PublicSubnetId
      SecurityGroupIds:
        - !Ref EFSAccessSecurityGroup # the security group defined in the EFS section
      IamInstanceProfile: !Ref EC2InstanceProfile

(AMI ID might need to be updated based on the region and availability)

This allows the EC2 instance to access EFS. Additionally, you will need to mount the EFS to the EC2 instance before accessing it. You can find the mounting guide in the AWS official documentation.

The final template is available in my GitHub repository.

Bigger Problem: Local Invocation.

While the inability to test locally is not the end of the world, it is certainly a major inconvenience. Without local invocation, you need to deploy every time you want to test your code. This can take hours of your time and has certainly taken hours or even days of mine.

Worry no more, I have found a solution by reverse-engineering sam-cli.

The way it works is that when you run sam local invoke, the sam-cli builds an image, starts a container from it, opens an endpoint, calls it, and then removes the container. So to solve this, we just need to start the container ourselves with a Docker volume mounted at the same path where Lambda mounts the EFS. In this case, we defined that path as:

LocalMountPath: /mnt/files

To make this happen, we build our SAM project:

sam build --cached

Using the built image, we run the container with the volume attached to it:

docker run \
    --rm \
    -p 8000:8080 \
    --platform linux/amd64 \
    -v "$(pwd)/efs:/mnt/files" \
    -e AWS_LAMBDA_FUNCTION_MEMORY_SIZE=8192 \
    -e AWS_LAMBDA_FUNCTION_TIMEOUT=600 \
    -e AWS_LAMBDA_FUNCTION_NAME=my-function-name \
    -e AWS_ACCESS_KEY_ID=dummy \
    -e AWS_SECRET_ACCESS_KEY=dummy \
    lambdafunction:test

After that, you can invoke your function with the following command:

curl \
    -X POST \
    http://localhost:8000/2015-03-31/functions/function/invocations \
    -d '{"test":"test"}'

This approach allows you to test your Lambda function locally, saving you significant time and effort.
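If you prefer to script the invocation instead of using curl, the same request can be built with the Python standard library. This is a minimal sketch: the helper names are hypothetical, and it assumes the container from the docker run command above is listening on host port 8000 and that the handler returns a JSON-serializable value.

```python
import json
import urllib.request

# Host port 8000 matches the -p 8000:8080 mapping in the docker run command.
INVOKE_URL = "http://localhost:8000/2015-03-31/functions/function/invocations"


def build_invoke_request(event: dict, url: str = INVOKE_URL) -> urllib.request.Request:
    """Build the POST request that the curl command above sends."""
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def invoke_local(event: dict, timeout: float = 600.0):
    """Send the event to the local container and decode the JSON response."""
    request = build_invoke_request(event)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the container running, `invoke_local({"test": "test"})` sends the same payload as the curl example, which makes it easy to wrap local invocations in a test suite.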

Closing

By improving our SAM template and adding the necessary configurations, we've streamlined the deployment process for Lambda and EFS resources. We've also implemented a solution for local invocation, significantly reducing development and testing time. Now, you can store your model in EFS and read it from Lambda, enabling efficient handling of large files for serverless inference.

Thank you for following along, and happy coding!
