Using Amazon S3 Select, you can filter the contents of an Amazon S3 object and retrieve a subset of data using SQL statements. S3 Select can reduce the amount of data that needs to be transferred from Amazon S3, lowering the cost and latency of data retrieval.
Here is an example of how to use the select_object_content method of the boto3 Python library to retrieve a portion of data stored in an S3 object:
This code will retrieve the Movie Names from the “TOP 1000 IMBD Movie” dataset having “MovieRating” greater than 9.0 from an object (Top1000IMDBmovies.csv) stored in an S3 bucket (awss3selectexample) and process it using the SQL expression
SELECT MovieName FROM s3object s WHERE s.MovieRating > ‘9.0’
The object is assumed to be a CSV file with a header row, and the response will also be in CSV format. The records in the response will be decoded using the UTF-8 encoding and printed to the console.
You can use S3 Select with the Amazon S3 REST API and the AWS SDK to select content from objects.You have to preapre your local environment to test the AWS S3 Select feature , below are few of the prerequisites.
import boto3
# Create an S3 client
session = boto3.Session(profile_name='abhi')
s3 = session.client('s3')
# Set the name of the bucket and object to retrieve
bucket = 'awss3selectexample'
key = 'Top1000IMDBmovies.csv'
# Set the SQL expression to use for processing the object
expression = "SELECT MovieName FROM s3object s WHERE s.MovieRating > '9.0'"
# Set the response format for the data
response_format = 'CSV'
# Set the encoding for the data
encoding = 'UTF-8'
# Set the desired output serialization for the data
output_serialization = {
'CSV': {
'QuoteFields': 'ASNEEDED',
'RecordDelimiter': '\n',
'FieldDelimiter': ',',
'QuoteEscapeCharacter': '"'
}
}
# Set the input serialization for the data
input_serialization = {
'CSV': {
'FileHeaderInfo': 'Use',
'RecordDelimiter': '\n',
'FieldDelimiter': ','
}
}
# Use the S3 client to retrieve the object and process it using S3 Select
response = s3.select_object_content(
Bucket=bucket,
Key=key,
Expression=expression,
ExpressionType='SQL',
InputSerialization=input_serialization,
OutputSerialization=output_serialization
)
# Get the records from the response
records = response['Payload']
# Iterate over the records and print each one
for record in records:
if 'Records' in record:
payload = record['Records']['Payload'].decode(encoding)
print(payload)
elif 'Stats' in record:
print(record['Stats'])
elif 'End' in record:
print(record['End'])
In the given dataset, only two movies have Movie Ratings greater than 9.0
In Summary, S3 Select is a valuable tool for executing SQL queries on data stored in Amazon S3. It allows you to process data stored in S3 programmatically and search for specific data within it, improving performance and reducing costs. Additionally, S3 Select can save you time and money compared to other data analysis methods.