Pandas – ValueError: If using all scalar values, you must pass an index

Reading a JSON file using Pandas read_json can fail with “ValueError: If using all scalar values, you must pass an index”. Let's see with an example –

cat a.json
{
  "creator": "CaptainAmerica",
  "last_modifier": "NickFury",
  "title": "Captain America: The First Avenger",
  "view_count": 12000
}
>>> import pandas as pd
>>> import glob
>>> for f in glob.glob('*.json'):
...     print(f)
...
b.json
c.json
a.json
>>> pd.read_json('a.json')
Traceback (most recent call last):
  File "", line 1, in 
  .......
  raise ValueError('If using all scalar values, you must pass'
ValueError: If using all scalar values, you must pass an index

Pandas expects each value to be a list or dict, not a scalar. For example –

cat al.json
{
  "creator": ["CaptainAmerica"],
  "last_modifier": ["NickFury"],
  "title": ["Captain America: The First Avenger"],
  "view_count": [12000]
}

>>> pd.read_json('al.json')
          creator last_modifier                               title  \
0  CaptainAmerica      NickFury  Captain America: The First Avenger

   view_count
0       12000
>>>

A Pandas Series is a one-dimensional labeled array capable of holding any data type, whereas a DataFrame is a two-dimensional object whose columns can hold potentially different types. read_json returns a DataFrame by default (typ='frame').

Converting to a Series

>>> pd.read_json('a.json', typ='series')
creator                              CaptainAmerica
last_modifier                              NickFury
title            Captain America: The First Avenger
view_count                                    12000
dtype: object
>>>

>>> df = pd.DataFrame([pd.read_json('a.json',  typ='series')])
>>> df
          creator last_modifier                               title  \
0  CaptainAmerica      NickFury  Captain America: The First Avenger

   view_count
0       12000
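
Alternatively – a minimal sketch using only the standard json module, assuming the same a.json – you can load the scalar-valued JSON into a dict and wrap it in a list, so Pandas builds a one-row DataFrame:

import json
import pandas as pd

# Wrapping the dict in a list turns the scalar values into a single row
with open('a.json') as f:
    df = pd.DataFrame([json.load(f)])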

S3 – fatal error: An error occurred (404) when calling the HeadObject operation

Make sure to use the --recursive parameter.

[desktop: test]:${PWD}> aws s3 cp s3://demo-beta/dw/user/dt=2019-07-30/ /tmp/dw/
fatal error: An error occurred (404) when calling the HeadObject operation: Key "dw/user/dt=2019-07-30/" does not exist

[desktop: test]:${PWD}> aws s3 cp s3://demo-beta/dw/user/dt=2019-07-30/ /tmp/dw/ --recursive
download: s3://demo-beta/dw/user/dt=2019-07-30/part-00002-fd866c-238-489-a44-739f1d04-c000.snappy.parquet to ../../../tmp/dw/part-00002-fd866c-238-489-a44-739f1d04-c000.snappy.parquet

From the documentation –

--recursive (boolean) Command is performed on all files or objects under the specified directory or prefix.
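
For context, here is a small boto3 sketch (reusing the bucket and prefix from above purely for illustration) of why the 404 happens – S3 has no real directories, only object keys, so a HeadObject call against a bare prefix finds nothing, while listing the prefix does:

import boto3

s3 = boto3.client('s3')

# s3.head_object(Bucket='demo-beta', Key='dw/user/dt=2019-07-30/') would raise
# a 404 ClientError: no object has that exact key, it is only a prefix.

# Listing the keys under the prefix and downloading each one is roughly
# what aws s3 cp --recursive does.
resp = s3.list_objects_v2(Bucket='demo-beta', Prefix='dw/user/dt=2019-07-30/')
for obj in resp.get('Contents', []):
    filename = obj['Key'].rsplit('/', 1)[-1]
    s3.download_file('demo-beta', obj['Key'], '/tmp/dw/' + filename)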

Python – sort() vs sorted(list)

You can compare lists using sort() or sorted(list), but be careful with sort() –

>>> c = [('d',4), ('c',3), ('a',1), ('b', 2)]
>>> a = [('a',1), ('b', 2), ('c',3), ('d',4)]
>>> a.sort() == c.sort()
True
>>>
>>> a = [('a',1), ('b', 2), ('c',3), ('d',4)]
>>> b = [('b',2), ('c', 3), ('a',1)]
>>>
>>> a.sort() == b.sort()
True

>>> a = [('a',1), ('b', 2), ('c',3), ('d',4)]
>>> b = [('b',2), ('c', 3), ('a',1)]
>>>
>>> sorted(a) == sorted(b)
False
>>>
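
The catch is that list.sort() sorts in place and returns None, so a.sort() == b.sort() only compares None == None, which is always True no matter what the lists contain. sorted() returns a new sorted list, so the comparison is meaningful. A quick demonstration –

>>> a = [('b', 2), ('a', 1)]
>>> print(a.sort())
None
>>> a
[('a', 1), ('b', 2)]
>>>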

https://stackoverflow.com/questions/22442378/what-is-the-difference-between-sortedlist-vs-list-sort

Python – str.maketrans()

While working on some Python code, I needed to remove the single/double quotes and open/close parentheses from a string of the below format —

>>> text = """with summary as (select '
...  'p.col1,p.col2,p.col3, ROW_NUMBER() '
...  'OVER(PARTITION BY p.col1,p.col3 ORDER BY '
...  'p.col2) AS rk from (select * from (select '
...  'col2, col1, col3, '
...  'sum(col4) as col6 from '
...  '"demo"."tab1" a join '
...  "(select lpad(col5, 12, '0') as col5 from demo1.tab11) ta on "
...  '(a.col1 = ta.col5) where col4 >= 0 and '
...  "(col3 like 'A%' or col3 in ('complete', 'random', "
...  "'mobile')) group by col2, "
...  'col1, col3) where col6>=1 order by '
...  'col2) p)\n'
...  '\n'
...  'SELECT s.*\n'
...  '  FROM summary s where s.rk=1 order by '
...  's.col1,s.col2')"""
>>>

Below is the desired output format of the same string —

>>> text
'with summary as select  p.col1,p.col2,p.col3, ROW_NUMBER  OVERPARTITION BY p.col1,p.col3 ORDER BY  p.col2 AS rk from select * from select  col2, col1, col3,  sumcol4 as col6 from  demo.tab1 a join  select lpadcol5, 12, 0 as col5 from demo1.tab11 ta on  a.col1 = ta.col5 where col4 >= 0 and  col3 like A% or col3 in complete, random,  mobile group by col2,  col1, col3 where col6>=1 order by  col2 p  SELECT s.*   FROM summary s where s.rk=1 order by  s.col1,s.col2'

To achieve this, one of the simplest ways I could find is str.translate combined with str.maketrans —

text = text.translate(str.maketrans({"'":None,'"':None, "(":None, ")":None, "\n":None}))
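
An equivalent, slightly shorter form – assuming the same set of characters to strip – passes them as the third argument of str.maketrans, which maps every character in that string to None:

text = text.translate(str.maketrans('', '', '\'"()\n'))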

Reference – https://docs.python.org/3.5/library/stdtypes.html?highlight=maketrans#str.translate

Expanding array to multiple rows – Athena

A single row in an Athena table is stored as —

select id, course, date from demo.course_tab where id='1234567892'

 	id	        course	                                                                                date
1	1234567892	[95c3c1bc5873, 2e345b2eb678, 027b02599f4a, 8695a580520b, 5d453355d415, cdcc7682070b]	2019-06-13

The datatype of the course column is array(string). Now, how can you get the output in the below format –

 	id	        course          date

1	1234567892	95c3c1bc5873	2019-06-13
2	1234567892	2e345b2eb678	2019-06-13
3	1234567892	027b02599f4a	2019-06-13
4	1234567892	8695a580520b	2019-06-13
5	1234567892	5d453355d415	2019-06-13
6	1234567892	cdcc7682070b	2019-06-13

To get the desired output in Athena, simply use CROSS JOIN in conjunction with the UNNEST operator —

SELECT id, course_id, date FROM demo.course_tab cross join unnest(course) as t(course_id) where id = '1234567892'

Reference –
https://docs.aws.amazon.com/athena/latest/ug/flattening-arrays.html

Named tuple to JSON – Python

In pgdb – the PostgreSQL DB API – the cursor used to manage the context of a fetch operation returns a list of named tuples. These named tuples have field names matching the column names of the database query.

An example of a row from the list of named tuples –


Row(log_time=datetime.datetime(2019, 3, 20, 5, 41, 29, 888000, tzinfo=<UTC>), user_name='admin', connection_from='72.20.208.64:21132', command_tag='INSERT', message='AUDIT: SESSION,1,1,WRITE,INSERT,TABLE,user.demodml,"insert into user.demodml (id) values (1),(2),(3),(4),(5),(6),(7),(8),(9),(11);",', application_name='psql')

To convert these named tuple rows to JSON, the easiest way is _asdict(), which returns a new dictionary mapping the field names to their corresponding values.


# scanForDml holds the list of named tuples returned by the cursor fetch
for row in scanForDml:
    violation.append(json.dumps(row._asdict(), default=jsondatetimeconverter))

Converting the datetime.datetime value into JSON throws a TypeError. To avoid this situation, you can use the below function –


Reference – https://code-maven.com/serialize-datetime-object-as-json-in-python

import datetime

def jsondatetimeconverter(o):
    """To avoid TypeError: datetime.datetime(...) is not JSON serializable"""

    if isinstance(o, datetime.datetime):
        return o.__str__()
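
Putting it together – a minimal, self-contained sketch, with an assumed namedtuple standing in for a real pgdb row (the actual rows come from the cursor) –

import json
from collections import namedtuple

# Assumed stand-in for a pgdb result row, for illustration only
Row = namedtuple('Row', ['log_time', 'user_name'])
row = Row(log_time=datetime.datetime(2019, 3, 20, 5, 41, 29), user_name='admin')

print(json.dumps(row._asdict(), default=jsondatetimeconverter))
# {"log_time": "2019-03-20 05:41:29", "user_name": "admin"}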

Final converted JSON data in list –


['{"log_time": "2019-03-20 05:41:29.888000+00:00", "user_name": "admin", "connection_from": "72.20.208.64:21132", "command_tag": "INSERT", "message": "AUDIT: SESSION,1,1,WRITE,INSERT,TABLE,user.demodml,\\"insert into user.demodml (id) values (1),(2),(3),(4),(5),(6),(7),(8),(9),(11);\\",", "application_name": "psql"}']