Parquet Timestamps and Athena Queries

In this blog I will walk you through how timestamps are stored in Parquet format versions 1.0 and 2.0, how the timestamp column is displayed in Athena for each version, and how you can cast the timestamp column in Athena to view readable timestamps for version 2.0 Parquet files.

Using an AWS Glue crawler, I crawled a few Parquet files stored in S3 that were created by the RDS snapshot export to S3 feature. After the crawler completed and added the new table, I queried the data with AWS Athena, and the timestamp column displayed the data as below –

To dive deeper into the timestamp column values, I downloaded the Parquet file and used parquet-cli to look at the schema metadata.

parq part-00000-18b2ce27-14eb-4975-ac08-873db788ceb5-c000.gz.parquet -s
# Schema
 <pyarrow._parquet.ParquetSchema object at 0x11cf5c320>
 required group field_id=0 schema {
   optional int64 field_id=1 updated_at (Timestamp(isAdjustedToUTC=true, timeUnit=microseconds, is_from_converted_type=true, force_set_converted_type=false));
   optional int32 field_id=2 component_id;
   optional int64 field_id=3 created_at (Timestamp(isAdjustedToUTC=true, timeUnit=microseconds, is_from_converted_type=true, force_set_converted_type=false));
   optional int32 field_id=4 attempts;
   optional int32 field_id=5 id;
 }

You can also use PyArrow to view the same information.

import pyarrow.parquet as pq

# Open the Parquet file downloaded from S3
parquet_file = pq.ParquetFile('part-00000-18b2ce27-14eb-4975-ac08-873db788ceb5-c000.gz.parquet')

# Print the schema and the file metadata
parquet_file.schema
parquet_file.metadata
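
If you want to check a specific column programmatically, here is a minimal sketch continuing the session above (the column index 0 assumes the schema shown earlier, where updated_at is the first column):

# Inspect the physical and logical type of the first column (updated_at)
col = parquet_file.schema.column(0)
print(col.physical_type)   # INT64
print(col.logical_type)    # Timestamp(isAdjustedToUTC=true, timeUnit=microseconds, ...)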

If you notice, the timestamp columns are stored as int64. This is how timestamps are stored in the newer Parquet format version 2.0. The older Parquet version 1.0 uses int96-based storage for timestamps. Let's create a file with version 1.0 using PyArrow –

# Read the first row group from the previously opened file
table = parquet_file.read_row_group(0)
# Write the table with use_deprecated_int96_timestamps set to True for int96 timestamps (Parquet version 1.0)
pq.write_table(table, 'example.parquet', use_deprecated_int96_timestamps=True, flavor='spark')
# Read the newly created file
parquet_file = pq.ParquetFile('example.parquet')
# View the schema
parquet_file.schema
 <pyarrow._parquet.ParquetSchema object at 0x10ef60588>
 required group field_id=0 schema {
   optional int96 field_id=1 updated_at;
   optional int32 field_id=2 component_id;
   optional int96 field_id=3 created_at;
   optional int32 field_id=4 attempts;
   optional int32 field_id=5 id;
 }
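
Before crawling it, a quick sanity check (a minimal sketch, continuing the PyArrow session above and reusing a column name from the schema) confirms that re-encoding to int96 did not change the stored values:

# Read the int96-encoded file back and spot-check a timestamp value
table_v1 = pq.read_table('example.parquet')
print(table_v1.column('updated_at')[0])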

Now let's crawl this new version 1.0 Parquet file and look at the timestamp data in Athena.

Voila!

Now, given that the original files in S3 are in the newer Parquet format version 2.0, I used the SQL below in Athena to CAST the timestamp columns –

SELECT
 id
 , CAST("from_unixtime"(CAST(("to_unixtime"("updated_at") / 1000) AS bigint)) AS timestamp) "updated_at"
 , CAST("from_unixtime"(CAST(("to_unixtime"("created_at") / 1000.0) AS double)) AS timestamp) "created_at"
 FROM database.table_name  order by 1 limit 10;

If you look carefully, the millisecond value for updated_at is .000 whereas for created_at it is a non-zero value. The difference is due to the way the columns are cast: the updated_at column lost its precision because it is cast as “bigint”.
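
The same truncation can be illustrated outside Athena with a short Python sketch; the raw value below is a hypothetical stand-in for what to_unixtime("updated_at") returns on this data:

# Hypothetical value standing in for to_unixtime(updated_at) on this data
raw = 1589808687123
# Casting to bigint truncates the fractional seconds, so the milliseconds become .000
print(int(raw / 1000))    # 1589808687
# Keeping it as a double preserves the fractional seconds, i.e. the milliseconds
print(raw / 1000.0)       # 1589808687.123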
