AWS Glue Job Fails with “CSV data source does not support map data type” Error

AWS Glue is a serverless ETL service for processing large datasets from a variety of sources for analytics. Recently I came across the “CSV data source does not support map data type” error in a newly created Glue job. In a nutshell, the job was performing the following steps:

  1. Read the data from S3 using create_dynamic_frame_from_options
  2. Perform some required transformations
  3. Write the transformed data to Amazon Redshift using write_dynamic_frame_from_jdbc_conf

And it was during this write step that the Glue job was failing. Let's look at it in a little more detail.

1. The PySpark script started by reading the data from the source S3 location into a Glue DynamicFrame:

datasource0 = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={
        "paths": [S3_location]
    },
    format="parquet"
)

2. The schema of the data was as follows:

datasource0.printSchema()
root
|-- id: string
|-- version: int
|-- description: string
|-- type: string
|-- status: string
|-- rel_metadata: map
|    |-- keyType: string
|    |-- valueType: string
|-- mod_metadata: map
|    |-- keyType: string
|    |-- valueType: string
|-- event_type: string
|-- created_at: long
|-- last_updated: long
|-- last_updated_by: string

3. As part of the script, I was performing some basic transformations. For example, the last_updated and created_at columns, which had the long datatype, were converted to the timestamp datatype:

df1 = datasource0.toDF()

import pyspark.sql.functions as F
import pyspark.sql.types as T

df2 = (df1
       .withColumn('created_at', F.from_unixtime(F.col('created_at')/1000).cast(T.TimestampType()))
       .withColumn('last_updated', F.from_unixtime(F.col('last_updated')/1000).cast(T.TimestampType()))
      )
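
A quick sanity check on the two columns confirms the conversion (output trimmed to the relevant fields):

df2.select('created_at', 'last_updated').printSchema()
root
 |-- created_at: timestamp (nullable = true)
 |-- last_updated: timestamp (nullable = true)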

4. The final step in the script was to convert the Spark DataFrame back into a Glue DynamicFrame and write it to the Amazon Redshift database using the write_dynamic_frame_from_jdbc_conf method of the GlueContext class:

from awsglue.dynamicframe import DynamicFrame

# Convert the Spark DataFrame back into a Glue DynamicFrame before writing
data = DynamicFrame.fromDF(df2, glueContext, "data")

datasink = glueContext.write_dynamic_frame_from_jdbc_conf(
    frame=data,
    catalog_connection="devdwdb",
    connection_options={
        "dbtable": "admin.anand_demo_table",
        "database": "devdwdb",
    },
    redshift_tmp_dir=redshift_temp_path,
    transformation_ctx="datasink"
)

The AWS Glue job was failing with the ‘CSV data source does not support map<string,string> data type.’ error during this last step of writing the DynamicFrame to the database.

To debug the issue, look at the schema of the data: the columns rel_metadata and mod_metadata are of the map datatype. Internally, AWS Glue handles the write_dynamic_frame_from_jdbc_conf method for Redshift by writing the Glue DynamicFrame data into multiple CSV files, along with a manifest file, in the S3 location specified by the temporary directory parameter of the Glue job. These CSV files are then loaded into the Redshift database via the COPY command, using the manifest file. Since CSV does not support the map datatype, the Glue job was failing.
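
Note that the error itself comes from Spark's CSV writer, not from Redshift. You can reproduce it outside of Glue with a plain DataFrame write (the S3 path below is just a placeholder):

# Minimal reproduction of the underlying Spark limitation:
# writing a DataFrame that contains a map column to CSV fails the same way
df2.write.format("csv").save("s3://my-temp-bucket/csv-repro/")
# pyspark.sql.utils.AnalysisException: CSV data source does not support
# map<string,string> data type.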

The way I handled this issue was to use the “to_json” PySpark function, which converts a map column into a JSON string:

import pyspark.sql.functions as F

# logger was set up earlier in the script, e.g. via glueContext.get_logger()
logger.info("Convert the map columns to JSON strings")
df3 = (df2
       .withColumn("rel_metadata", F.to_json(F.col("rel_metadata")))
       .withColumn("mod_metadata", F.to_json(F.col("mod_metadata")))
      )
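
After the conversion, the two columns show up as plain strings, and the DynamicFrame for the write step can be rebuilt from df3 (a sketch of how I wired it back into the job above):

df3.select('rel_metadata', 'mod_metadata').printSchema()
root
 |-- rel_metadata: string (nullable = true)
 |-- mod_metadata: string (nullable = true)

# Rebuild the DynamicFrame from df3 so the write step picks up the JSON strings
data = DynamicFrame.fromDF(df3, glueContext, "data")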

With this change, the Glue job ran successfully and the data was loaded into Amazon Redshift. To query the map columns (now stored as JSON strings), you can use the JSON_EXTRACT_PATH_TEXT function in the Redshift database:

select JSON_EXTRACT_PATH_TEXT(mod_metadata, 'QCN') as QCN from admin.anand_demo_table limit 3;
+------------------------------+
|                qcn           |
+------------------------------+
| qualified-class-name         |
| qualified-class-name         |
| qualified-class-name         |
+------------------------------+
(3 rows)

Hope this helps!
