AWS S3 as a database with S3 Select
Have you ever wished you could pay little to no money to use a database?
When I see AWS S3, I don't think of it as a database. I think of it as storage for big files like photos or videos. One day I read somewhere on the Internet that S3 is a database, which made me want to experiment with it. In this article, I will try to implement CRUD operations on AWS S3.
Prerequisites
In this experiment, we will be using AWS S3 and Python. Here is what I prepared before the experiment (a minimal setup sketch follows the list).
- AWS Account
- S3 Bucket to store CSV File
- Python Environment
- A Dummy CSV File filled with one thousand users' information
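The snippets below import a small settings module and assume the dummy CSV has already been uploaded to the bucket as user.csv. Here is a minimal sketch of that setup; the bucket name and local file path are placeholders, so replace them with your own.

# settings.py -- minimal configuration sketch; the bucket name is a placeholder
BUCKET_NAME = 'my-s3-as-a-db-bucket'

# upload.py -- one-off upload of the dummy CSV into the bucket
import boto3
import settings

s3_client = boto3.client('s3')
s3_client.upload_file('user.csv', settings.BUCKET_NAME, 'user.csv')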
Implementing CRUD
And now, let's get our hands dirty and code this thing. First of all, it is good to know what we are going to build, so I wrote a test (not really) before implementing it.
from s3.user import read, delete, update, create
# Read Query
print("Read Data")
print(read("WHERE firstname = 'Devina'"))
# Output
# [['1274', 'Devina', 'Terencio', 'Devina.Terencio@yopmail.com', 'Devina.Terencio@gmail.com', 'doctor']]
# Update Query
print("Update Data")
update(
    "WHERE firstname = 'Devina'",
    {
        'lastname': 'Green'
    }
)
# Output
print(read("WHERE firstname = 'Devina'"))
# [['1274', 'Devina', 'Green', 'Devina.Terencio@yopmail.com', 'Devina.Terencio@gmail.com', 'doctor']]
# Delete Item
# print("Delete Item")
delete("WHERE firstname = 'Devina'")
# Output
print(read("WHERE firstname = 'Devina'"))
# []
# Create Item
print("Create Item")
create({
    'firstname': 'Devina',
    'lastname': 'Terencio',
    'email': 'Devina.Terencio@yopmail.com',
    'email2': 'Devina.Terencio@gmail.com',
    'profession': 'doctor'
})
# Output
print(read("WHERE firstname = 'Devina'"))
# [['2100', 'Devina', 'Terencio', 'Devina.Terencio@yopmail.com', 'Devina.Terencio@gmail.com', 'doctor']]
If what is printed in the terminal matches the expected output, we have successfully built our application.
Here is the GitHub link: github.com/alvinend/s3-as-a-db
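A note on the imports: the test script pulls read, delete, update, and create straight from s3.user, so the snippets below assume a small package that re-exports them. This layout is my guess based on the import statements, not necessarily how the repository is organized.

# s3/user/__init__.py -- assumed re-exports so `from s3.user import ...` works
from s3.user.read import read
from s3.user.write import create, update, delete  # write.py is an assumed file name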
Read
What do we do when we want to query some items?
- Use S3 Select to query the CSV. This returns a response object
- Extract the payload string from the response object
- Convert the payload string into a list of rows
Easy!
import boto3
import settings

s3_client = boto3.client('s3')

def read(
    query,
    select='*'
):
    # Use S3 Select to query the CSV
    res = s3_client.select_object_content(
        Bucket=settings.BUCKET_NAME,
        Key='user.csv',
        ExpressionType='SQL',
        Expression=f"SELECT {select} FROM S3Object s " + query,
        InputSerialization={
            'CompressionType': 'NONE',
            'CSV': {
                'FileHeaderInfo': 'Use',
                'RecordDelimiter': '\n',
                'FieldDelimiter': ','
            }
        },
        OutputSerialization={
            'CSV': {
                'RecordDelimiter': '\n',
                'FieldDelimiter': ','
            }
        }
    )

    # Collect the payload string from the response event stream.
    # Records can arrive in more than one chunk, so append instead of overwriting.
    records = ''
    for event in res['Payload']:
        if 'Records' in event:
            records += event['Records']['Payload'].decode('utf-8')

    # Split the payload into rows: "1,Name1,Address1\n2,Name2,Address2\n"
    # -> ["1,Name1,Address1", "2,Name2,Address2", ""]
    records = records.split('\n')

    # Remove the empty trailing element left by the final newline
    records = filter(len, records)

    # Split each row into its fields: [["1", "Name1", "Address1"], ["2", "Name2", "Address2"]]
    records = list(map(lambda x: x.replace('\r', '').split(','), records))

    return records
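Because the select argument is dropped straight into the S3 Select expression, read can also project specific columns instead of returning whole rows. A quick sketch using the column names from the dummy CSV (the exact output depends on your data):

from s3.user.read import read

# Fetch only the name columns of the matching rows
print(read("WHERE profession = 'doctor'", select='firstname, lastname'))
# e.g. [['Devina', 'Terencio'], ...]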
Delete
Since S3 has no feature to manipulate CSV files in place, we must do it manually. By manually, I mean downloading the file from S3, modifying it, and replacing the old file with the modified one.
import boto3
import pandas
import settings
from io import StringIO

s3_resource = boto3.resource('s3')
s3_client = boto3.client('s3')

def get_df():
    # Download the CSV from S3 and load it into a DataFrame
    res = s3_client.get_object(Bucket=settings.BUCKET_NAME, Key="user.csv")
    df = pandas.read_csv(res.get("Body"))
    return df

def save_to_s3(df):
    # Serialize the DataFrame to CSV in memory and overwrite the object in S3
    csv_buffer = StringIO()
    df.to_csv(csv_buffer, index=False)
    s3_resource.Object(settings.BUCKET_NAME, 'user.csv').put(Body=csv_buffer.getvalue())
We will use these helpers for the create and update operations too. For delete, we take these steps.
- Get Dataframe from S3
- Get Ids from S3 Select query
- Filter out selected ID
- Save Dataframe to S3 as CSV
from s3.user.utils import get_df, save_to_s3
from s3.user.read import read

def delete(query):
    # Get Dataframe
    df = get_df()

    # Get ids from query
    ids = read(query, select='id')
    ids = list(map(lambda x: int("".join(x)), ids))

    # Filter out selected ids
    newdf = df[~df.id.isin(ids)]

    # Save Dataframe to S3 as CSV
    save_to_s3(newdf)
Update
The update operation is almost the same as delete, except that instead of filtering the matching rows out, we overwrite their values.
- Get Dataframe from S3
- Get Ids from S3 Select query
- Update selected ID with new values.
- Save Dataframe to S3 as CSV
from s3.user.utils import get_df, save_to_s3
from s3.user.read import read

def update(
    query,
    data_dict
):
    update_column_key = list(data_dict.keys())
    update_column_value = list(data_dict.values())

    # Get Dataframe
    df = get_df()

    # Get ids from query
    ids = read(query, select='id')
    ids = list(map(lambda x: int("".join(x)), ids))

    # Update the matching rows in the DataFrame
    df.loc[df.id.isin(ids), update_column_key] = update_column_value

    # Save Dataframe to S3 as CSV
    save_to_s3(df)
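Since data_dict can carry any number of columns, one call can overwrite several fields of the matching rows at once. A small usage sketch (the values are just illustrative):

from s3.user import read, update

update(
    "WHERE firstname = 'Devina'",
    {
        'lastname': 'Green',
        'profession': 'teacher'
    }
)
print(read("WHERE firstname = 'Devina'"))
# e.g. [['1274', 'Devina', 'Green', ..., 'teacher']]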
Create
Last but not least is the create operation. Although I would like to simply insert the data into pandas' DataFrame, there is no auto-increment feature, so we have to look up the latest stored ID ourselves, increment it, and save the new data.
- Get Dataframe from S3
- Get latest id from S3 Select query
- Create new data with the incremented latest ID
- Save Dataframe to S3 as CSV
import pandas

from s3.user.read import read
from s3.user.utils import get_df, save_to_s3

def create(
    data_dict
):
    # Get Dataframe
    df = get_df()

    # Get the current highest id via S3 Select
    max_id = read('', select='MAX(cast(id as int))')
    max_id = int(max_id[0][0])

    # Assign the incremented id and append the new row
    # (DataFrame.append was removed in pandas 2.x, so use concat instead)
    data_dict['id'] = max_id + 1
    df = pandas.concat([df, pandas.DataFrame([data_dict])], ignore_index=True)

    # Save Dataframe to S3 as CSV
    save_to_s3(df)
Closing
There you have it, S3 as a database! I know it is slower than a conventional relational database like MySQL and is not ACID compliant. But it is cheap when you are experimenting or building a prototype, and it is nice to have an option that does not require you to manage a database or pay monthly server costs.
As always, thank you for reading this post!