Use the redshift-data API with an AWS Glue Python Shell job

AWS Glue is a fully managed extract, transform, and load (ETL) service for processing large datasets from various sources for analytics and data processing. When you add an AWS Glue job, you can choose the job type to be Spark, Spark Streaming, or Python shell.

For one of my use cases I wanted to try the new “redshift-data” API in an AWS Glue Python Shell script. The Amazon Redshift Data API can be used to run SQL queries on Amazon Redshift tables. To test the redshift-data API I wrote a simple AWS Glue Python Shell job to execute a SQL statement. The job failed with

UnknownServiceError: Unknown service: 'redshift-data'. Valid service names are: acm...
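
For reference, the failing job was essentially nothing more than the snippet below (a minimal sketch; the SQL statement shown is a placeholder, not the exact one from my job).

import boto3

#Create a Redshift Data API client - this is the call that fails,
#because the bundled boto3 version does not know the 'redshift-data' service
client = boto3.client('redshift-data')

#Run a simple SQL statement asynchronously through the Data API
response = client.execute_statement(
    ClusterIdentifier='dwtest',
    Database='dwtest',
    DbUser='admin',
    Sql='select current_date'   #placeholder SQL
)

print(response['Id'])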

I was quite surprised by the error, and it made me check the Boto3 version that the AWS Glue Python Shell script imports.

import boto3

print('boto3 version :', boto3.__version__)

Surprisingly, the output is –

boto3 version : 1.9.203 

The current boto3 version available is 1.16.45, whereas AWS Glue Python Shell imports version 1.9.203. I was pretty sure I wasn’t the first person to run into this issue, so I searched and came across the open issue.

User sarath-mec has provided a solution in the comment section for AWS Glue Python Shell both with and without Internet access.

Here, for my notes, I am simply pasting the solution for AWS Glue Python Shell with Internet access, with minor changes –

AWS Glue Python Shell with Internet

Add the awscli and boto3 whl files to the Python library path during Glue job execution. This option is slow, as the job has to download and install the dependencies.

  1. Download the latest whl files from boto3 files and awscli files.
  2. Upload the files to an S3 bucket.
  3. Add the S3 whl file paths to the Python library path of the job. Give the full S3 path of each whl file, separated by commas. (A programmatic sketch of this step is included after the script log below.)
  4. Add the following code snippet to load the new files.
#Additional code as part of AWS Thread https://forums.aws.amazon.com/thread.jspa?messageID=954344
import sys

#Make the newly installed packages take precedence over the bundled ones
sys.path.insert(0, '/glue/lib/installation')

#Remove the already-imported boto modules so the newer versions get loaded on re-import
keys = [k for k in sys.modules.keys() if 'boto' in k]
for k in keys:
    if 'boto' in k:
        del sys.modules[k]

import boto3

print('boto3 version :', boto3.__version__)

#Initiate the redshift-data client with the newer boto3 version
client = boto3.client('redshift-data')

#Use the client to list the databases on the Redshift cluster
response = client.list_databases(
    ClusterIdentifier='dwtest',
    Database='dwtest',
    DbUser='admin'
)

print(response)

Script log –

boto3 version : 1.16.45
{'Databases': ['dev', 'dwtest'], 'ResponseMetadata': {'RequestId': '3d3eb2d9-6ef4-4528-8881-819b78510487', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '3d3eb2d9-6ef4-4528-8881-819b78510487', 'content-type': 'application/x-amz-json-1.1', 'content-length': '36', 'date': 'Wed, 30 Dec 2020 03:49:31 GMT'}, 'RetryAttempts': 0}}
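
As a side note to step 3 above: as far as I can tell, the Python library path set in the Glue console corresponds to the --extra-py-files default argument of the job, so it can also be set programmatically when the job is created with boto3. Below is a minimal sketch under that assumption; the job name, IAM role, script location, and whl file names are hypothetical placeholders.

import boto3

glue = boto3.client('glue')

#Hypothetical job definition - the name, role, script location and whl paths are placeholders
response = glue.create_job(
    Name='redshift-data-shell-job',
    Role='MyGlueServiceRole',
    Command={
        'Name': 'pythonshell',          #Python shell job type
        'ScriptLocation': 's3://my-bucket/scripts/redshift_data_test.py',
        'PythonVersion': '3'
    },
    DefaultArguments={
        #Comma-separated S3 paths of the uploaded whl files (steps 2 and 3)
        '--extra-py-files': 's3://my-bucket/libs/boto3-1.16.45-py2.py3-none-any.whl,'
                            's3://my-bucket/libs/awscli-1.18.205-py2.py3-none-any.whl'
    },
    MaxCapacity=0.0625
)

print(response['Name'])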
