AWS S3 as a database with S3 Select

Have you ever wished you could pay little to no money for a database?

When I see AWS S3, I don't think of it as a database. I think of it as storage for big files like photos or videos. Then one day I read somewhere on the Internet that S3 can be used as a database, which made me want to experiment with it. In this article, I will try to implement CRUD operations on AWS S3.

Prerequisite

In this experiment, we will be using AWS S3 and Python. Here is what I prepared beforehand.

  • An AWS account
  • An S3 bucket to store the CSV file
  • A Python environment
  • A dummy CSV file filled with one thousand users' information
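
For reference, here is a minimal sketch of the settings module the code below relies on, along with the header row of the dummy CSV. The bucket name is a placeholder, so swap in your own.

# settings.py (a minimal sketch; the bucket name is a placeholder)
BUCKET_NAME = 'my-s3-as-a-db-bucket'

# user.csv starts with this header row:
# id,firstname,lastname,email,email2,profession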

Implementing CRUD

And now, let's get our hands dirty and code this thing. First of all, it's good to know what we are going to build, so I wrote a test (not really a test) before implementing it.

from s3.user import read, delete, update, create

# Read Query
print("Read Data")
print(read("WHERE firstname = 'Devina'"))

# Output
# [['1274', 'Devina', 'Terencio', 'Devina.Terencio@yopmail.com', 'Devina.Terencio@gmail.com', 'doctor']]

# Update Query
print("Update Data")
update(
  "WHERE firstname = 'Devina'",
  {
    'lastname': 'Green'
  }
)

# Output
print(read("WHERE firstname = 'Devina'"))
# [['1274', 'Devina', 'Green', 'Devina.Terencio@yopmail.com', 'Devina.Terencio@gmail.com', 'doctor']]

# Delete Item
print("Delete Item")
delete("WHERE firstname = 'Devina'")

# Output
print(read("WHERE firstname = 'Devina'"))
# []

# Create Item
print("Create Item")
create({
  'firstname': 'Devina',
  'lastname': 'Terencio',
  'email': 'Devina.Terencio@yopmail.com',
  'email2': 'Devina.Terencio@gmail.com',
  'profession': 'doctor'
})

# Output
print(read("WHERE firstname = 'Devina'"))
# [['2100', 'Devina', 'Terencio', 'Devina.Terencio@yopmail.com', 'Devina.Terencio@gmail.com', 'doctor']]

If what is printed in the terminal matches the expected output, we have successfully built our application.

Here is the GitHub link: github.com/alvinend/s3-as-a-db

Read

What do we do when we want to query some items?

  • Use S3 Select to query the CSV. This returns a response object
  • Get the payload string from the response object
  • Convert the payload string to a list

Easy!

import boto3
import settings

s3_client = boto3.client('s3')

def read(
    query,
    select='*'
):
    # Use S3 Select to Query CSV
    res = s3_client.select_object_content(
        Bucket = settings.BUCKET_NAME,
        Key = 'user.csv',
        ExpressionType = 'SQL',
        Expression = f"SELECT {select} FROM S3Object s {query}",
        InputSerialization = {
            'CompressionType': 'NONE',
            'CSV' : {
                'FileHeaderInfo' : 'USE',
                'RecordDelimiter' : '\n',
                'FieldDelimiter' : ','
            }
        },
        OutputSerialization = {
            'CSV' : {
                'RecordDelimiter' : '\n',
                'FieldDelimiter' : ','
            }
        }
    )

    records = ''

    # Concatenate every Records event in the payload stream
    for event in res['Payload']:
        if 'Records' in event:
            records += event['Records']['Payload'].decode('utf-8')

    # Split the payload into rows: "1,Name1,Address1\n2,Name2,Address2\n" -> ["1,Name1,Address1", "2,Name2,Address2", ""]
    records = records.split('\n')

    # Remove empty string elements: ["1,Name1,Address1", "2,Name2,Address2", ""] -> ["1,Name1,Address1", "2,Name2,Address2"]
    records = filter(len, records)

    # Split each row into fields: ["1,Name1,Address1", ...] -> [["1", "Name1", "Address1"], ...]
    records = list(map(lambda x: x.replace('\r', '').split(','), records))

    return records
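
As a quick sanity check, a call might look like this. The query and column names are just examples against the dummy CSV; note that every field comes back as a string.

# Hypothetical usage: select specific columns with a WHERE clause
rows = read("WHERE profession = 'doctor'", select='firstname, lastname')
# -> [['Devina', 'Terencio'], ...]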

Delete

Since S3 has no feature to manipulate CSV files in place, we must do it manually. By manually, I mean downloading the file from S3, manipulating it, and replacing the old file with the modified one.

import boto3
import pandas
import settings
from io import StringIO

s3_resource = boto3.resource('s3')
s3_client = boto3.client('s3')

def get_df():
    res = s3_client.get_object(Bucket=settings.BUCKET_NAME, Key="user.csv")

    df = pandas.read_csv(res.get("Body"))

    return df

def save_to_s3(df):
    csv_buffer = StringIO()
    df.to_csv(csv_buffer, index=False)
    s3_resource.Object(settings.BUCKET_NAME, 'user.csv').put(Body=csv_buffer.getvalue())
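
A quick round trip with these helpers might look like this sketch, which just downloads the CSV and writes it straight back. Keep in mind that this read-modify-write pattern replaces the whole object, so concurrent writers can overwrite each other's changes.

# Sketch: download, inspect, and re-upload the CSV unchanged
df = get_df()
print(len(df))   # number of rows, e.g. 1000 for the dummy file
save_to_s3(df)   # overwrites user.csv in the bucket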

We will reuse these helper methods for the update and create operations too. For delete, we perform the following steps, implemented in the code right after the list.

  1. Get the DataFrame from S3
  2. Get the IDs from an S3 Select query
  3. Filter out the selected IDs
  4. Save the DataFrame back to S3 as CSV

from s3.user.utils import get_df, save_to_s3
from s3.user.read import read

def delete(query):
    # Get Dataframe
    df = get_df()

    # Get Ids from query
    ids = read(query, select='id')
    ids = list(map(lambda x: int("".join(x)), ids))

    # Filter out Selected ID
    newdf = df[~df.id.isin(ids)]

    # Save Dataframe to S3 as CSV
    save_to_s3(newdf)
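
Because the WHERE clause is passed straight through to S3 Select, you can delete by any column, not just id. A hypothetical call:

# Hypothetical usage: delete every row matching an e-mail address
delete("WHERE email = 'Devina.Terencio@yopmail.com'")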

Update

The update operation is almost the same as delete, except that instead of filtering rows out, we update them.

  1. Get the DataFrame from S3
  2. Get the IDs from an S3 Select query
  3. Update the selected IDs with the new values
  4. Save the DataFrame back to S3 as CSV

def update(
    query,
    data_dict
):
    update_column_key = list(data_dict.keys())
    update_column_value = list(data_dict.values())

    # Get Dataframe
    df = get_df()

    # Get Ids from query
    ids = read(query, select='id')
    ids = list(map(lambda x: int("".join(x)), ids))

    # Update DF
    df.loc[df.id.isin(ids), update_column_key] = update_column_value

    # Save Dataframe to S3 as CSV
    save_to_s3(df)
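
Since the keys of data_dict map directly to CSV column names, several columns can be updated in one call. The values here are hypothetical:

# Hypothetical usage: update two columns for every matching row
update(
    "WHERE firstname = 'Devina'",
    {'lastname': 'Green', 'profession': 'nurse'}
)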

Create

Last but not least is the create operation. Although I would like to simply insert the data into pandas' DataFrame, it has no auto-increment feature. So we have to manually look up the latest stored ID, increment it, and save the new data.

  1. Get the DataFrame from S3
  2. Get the latest ID from an S3 Select query
  3. Create the new row with the incremented latest ID
  4. Save the DataFrame back to S3 as CSV

import pandas
from s3.user.read import read
from s3.user.utils import get_df, save_to_s3

def create(
    data_dict
):
    # Get Dataframe
    df = get_df()

    # Get the current maximum ID via S3 Select
    max_id = read('', select='MAX(cast(id as int))')
    max_id = int(max_id[0][0])

    # Assign the incremented ID to the new row
    data_dict['id'] = max_id + 1

    # Append the new row (DataFrame.append was removed in pandas 2.x, so use concat)
    df = pandas.concat([df, pandas.DataFrame([data_dict])], ignore_index=True)

    # Save Dataframe to S3 as CSV
    save_to_s3(df)

Closing

There you have it, S3 as a database! I know it is slower than a conventional relational database like MySQL and offers no ACID guarantees. But it is cheap when you are experimenting or prototyping something, and it is nice to have an option that does not require you to manage a database server or pay monthly hosting costs.

As always, thank you for reading this post!
