S3 bucket size using AWS CLI

To get the total object count and total size for a folder within an S3 bucket recursively –


aws s3 ls --summarize --human-readable --recursive s3://content-dev/mywork/dumb | grep 'Total'

Total Objects: 1922
   Total Size: 297.4 GiB
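
The same numbers can also be computed programmatically. Below is a minimal Python sketch using boto3 (the bucket name and prefix are taken from the example above; configured AWS credentials for boto3 are assumed) –


# Sketch: total object count and size under an S3 prefix using boto3.
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

total_objects = 0
total_bytes = 0
for page in paginator.paginate(Bucket="content-dev", Prefix="mywork/dumb"):
    for obj in page.get("Contents", []):
        total_objects += 1
        total_bytes += obj["Size"]

print("Total Objects:", total_objects)
print("Total Size: %.1f GiB" % (total_bytes / (1024 ** 3)))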


Named tuple to JSON – Python

In pgdb – the PostgreSQL DB-API module – the cursor that manages the context of a fetch operation returns a list of named tuples. These named tuples have field names matching the column names of the database query.

An example of a row from the list of named tuples –


Row(log_time=datetime.datetime(2019, 3, 20, 5, 41, 29, 888000, tzinfo=), user_name='admin', connection_from='72.20.208.64:21132', command_tag='INSERT', message='AUDIT: SESSION,1,1,WRITE,INSERT,TABLE,user.demodml,"insert into user.demodml (id) values (1),(2),(3),(4),(5),(6),(7),(8),(9),(11);",', application_name='psql')

To convert such a named tuple row to JSON, the easiest way is to use _asdict(), which returns a new dictionary mapping the field names to their corresponding values.


for row in scanForDml:
    violation.append(json.dumps(row._asdict(), default=jsondatetimeconverter))

Serializing the datetime.datetime value to JSON throws a TypeError. To avoid this, you can use the function below –


Reference - https://code-maven.com/serialize-datetime-object-as-json-in-python

def jsondatetimeconverter(o):
    """To avoid TypeError: datetime.datetime(...) is not JSON serializable"""

    if isinstance(o, datetime.datetime):
        return o.__str__()

The final converted JSON data in the list –


['{"log_time": "2019-03-20 05:41:29.888000+00:00", "user_name": "admin", "connection_from": "72.20.208.64:21132", "command_tag": "INSERT", "message": "AUDIT: SESSION,1,1,WRITE,INSERT,TABLE,user.demodml,\\"insert into user.demodml (id) values (1),(2),(3),(4),(5),(6),(7),(8),(9),(11);\\",", "application_name": "psql"}']

PostgreSQL – LWLock:multixact_offset

On one of the PostgreSQL databases, active sessions were waiting on IO:SLRURead and LWLock:multixact_offset, causing application latency.

[Screenshot: active sessions waiting on IO:SLRURead and LWLock:multixact_offset]

As per the PostgreSQL documentation, SLRURead means the session is waiting for a read of an SLRU page, and multixact_offset means it is waiting for I/O on a multixact offset buffer.

Diagnosing the issue took some time, so let me cut to the chase –

1. Identified the SQLs and the relations (and their sizes) for which these wait events were high. These tables had ~1500 DMLs per 30 minutes.


\i seg_size.sql
Enter the relation name:xy_abc_queues
+---------------+---------+
|    relname    |  size   |
+---------------+---------+
| xy_abc_queues | 1658 MB |
+---------------+---------+
(1 row)

Time: 24.340 ms
\i seg_size.sql
Enter the relation name:xy_abc_queue_sla
+------------------+--------+
|     relname      |  size  |
+------------------+--------+
| xy_abc_queue_sla | 856 MB |
+------------------+--------+
(1 row)

2. Reviewed last_autoanalyze and autovacuum_count. Both tables were getting analyzed and vacuumed twice every 30 minutes.
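
For reference, a minimal sketch of this check (psycopg2 and the connection details are my assumptions; the query against pg_stat_user_tables is standard) –


# Sketch: check autovacuum/autoanalyze activity for the two tables.
import psycopg2

con = psycopg2.connect(host="localhost", dbname="postgres", user="admin")  # assumed connection details
cur = con.cursor()
cur.execute("""
    SELECT relname, last_autovacuum, last_autoanalyze, autovacuum_count, autoanalyze_count
    FROM pg_stat_user_tables
    WHERE relname IN ('xy_abc_queues', 'xy_abc_queue_sla');
""")
for row in cur.fetchall():
    print(row)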

3. Installed the pgstattuple extension to obtain tuple-level stats.

4. Relation stats

SELECT * FROM pgstattuple('owner.xy_abc_queues');
+-[ RECORD 1 ]-------+------------+
| table_len          | 1738113024 |
| tuple_count        | 6880       |
| tuple_len          | 8947718    |
| tuple_percent      | 0.51       |
| dead_tuple_count   | 741053     |
| dead_tuple_len     | 1378505651 |
| dead_tuple_percent | 79.31      |
| free_space         | 337899240  |
| free_percent       | 19.44      |
+--------------------+------------+

Time: 219706.769 ms (03:39.707)


SELECT * FROM pgstattuple('owner.xy_abc_queue_sla');
+-[ RECORD 1 ]-------+-----------+
| table_len          | 897990656 |
| tuple_count        | 43623     |
| tuple_len          | 49133804  |
| tuple_percent      | 5.47      |
| dead_tuple_count   | 738004    |
| dead_tuple_len     | 604300048 |
| dead_tuple_percent | 67.29     |
| free_space         | 232187892 |
| free_percent       | 25.86     |
+--------------------+-----------+

Time: 475798.157 ms (07:55.798)

5. Vacuum should get rid of these dead tuples so that the space can be reused. Even though the tables were being auto-vacuumed, they still had a high dead tuple percentage.

6. One of the reasons for autovacuum not freeing dead tuples is a long-running transaction.

7. Reviewed pg_stat_activity and noticed PIDs with transactions open for over a month.

8. Killed the PIDs using pg_terminate_backend(pid).
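
A minimal sketch of steps 7 and 8 (psycopg2, the connection details and the 30-day threshold are my assumptions; pg_stat_activity and pg_terminate_backend are standard) –


# Sketch: find backends whose transaction has been open for more than 30 days,
# review them, and terminate them.
import psycopg2

con = psycopg2.connect(host="localhost", dbname="postgres", user="admin")  # assumed connection details
con.autocommit = True
cur = con.cursor()
cur.execute("""
    SELECT pid, usename, state, xact_start, query
    FROM pg_stat_activity
    WHERE xact_start < now() - interval '30 days';
""")
for pid, usename, state, xact_start, query in cur.fetchall():
    print(pid, usename, state, xact_start)
    cur.execute("SELECT pg_terminate_backend(%s);", (pid,))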

9. Executed vacuum manually.


vacuum analyze verbose owner.xy_abc_queue_sla;

vacuum analyze verbose owner.xy_abc_queues;

SELECT * FROM pgstattuple('owner.xy_abc_queues');
+-[ RECORD 1 ]-------+------------+
| table_len          | 1738113024 |
| tuple_count        | 3562       |
| tuple_len          | 2514281    |
| tuple_percent      | 0.14       |
| dead_tuple_count   | 5906       |
| dead_tuple_len     | 11542204   |
| dead_tuple_percent | 0.66       |
| free_space         | 1713851592 |
| free_percent       | 98.6       |
+--------------------+------------+


SELECT * FROM pgstattuple('owner.xy_abc_queue_sla');
+-[ RECORD 1 ]-------+-----------+
| table_len          | 897990656 |
| tuple_count        | 42127     |
| tuple_len          | 48074726  |
| tuple_percent      | 5.35      |
| dead_tuple_count   | 5059      |
| dead_tuple_len     | 3900324   |
| dead_tuple_percent | 0.43      |
| free_space         | 837034048 |
| free_percent       | 93.21     |
+--------------------+-----------+

After this, the IO:SLRURead and LWLock:multixact_offset wait events reduced and application latency was back to normal.

 

Python List

This blog post is about appending elements to a list in Python.

Suppose we have a simple list “x”. We will look at different ways to append elements to it.

x = [1, 2, 3]

The “append” method appends only a single element –

>>> x
[1, 2, 3]
>>> x.append(4)
>>> x
[1, 2, 3, 4]
>>>

>>> x.append(5, 6, 7)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: append() takes exactly one argument (3 given)
>>>

How do you append multiple elements to the list? One way is to use slice assignment –

>>> x[len(x):] = [5, 6, 7]
>>> x
[1, 2, 3, 4, 5, 6, 7]
>>>

Another way to append multiple elements is to create a new list and use the “+” operator –

>>> y = [8, 9, 10]
>>> x = x + y
>>> x
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>>

Another way, which I learned today, is to use the “extend” method –

>>> z = [ 11, 12, 13]
>>> x.extend(z)
>>> x
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
>>>

In case you want to add a new element between two existing elements, use the “insert” method – it inserts the element before the given index.

>>> x.insert(5, "Hi")
>>> x
[1, 2, 3, 4, 5, 'Hi', 6, 7, 8, 9, 10, 11, 12, 13]
>>>

PostgreSQL – CPU Utilization and Index

One of the production Aurora PostgreSQL instances, running on a db.r4.16xlarge instance (64 vCPU and 488 GB memory), was reporting high CPU utilization spiking up to 100%.

[Screenshot: CPU utilization spiking up to 100%]

With such issues, one of the first things to do is look for the SQLs with a high shared buffers hit count. I have built a small tool called pgsnap, which is somewhat similar to the AWR repository in Oracle, maintaining SQL stat history. So, with pg_stat_statements and hist_pg_stat_statements (that’s what I call it), I was able to identify the SQL.

select col1, col2, col3, col4, col5 from cltn_errs redis0_ where redis0_.sbmn_id=123456;

Let’s look at the execution plan –

Gather  (cost=1000.00..2070031.26 rows=2696 width=262) (actual time=17475.126..18771.216 rows=1 loops=1)                                                       
   Output: col1, col2, col3, col4, col5                                   
   Workers Planned: 7                                                                                                                                           
   Workers Launched: 0                                                                                                                                          
   Buffers: shared hit=3945515                                                                                                                            
   ->  Parallel Seq Scan on demo.cltn_errs redis0_  (cost=0.00..2068761.66 rows=385 width=262) (actual time=17474.807..18770.895 rows=1 loops=1) 
         Output: col1, col2, col3, col4, col5                              
         Filter: (redis0_.sbmn_id = '123456'::numeric)                                                                                        
         Rows Removed by Filter: 52390761                                                                                                                       
         Buffers: shared hit=3945515                                                                                                                            
 Planning time: 0.652 ms  
 Execution time: 18771.384 ms      

The problem was obvious!! A full table scan of table cltn_errs (a ~15 GB table). So, I restored the latest snapshot, created an index on the “sbmn_id” column, and the execution plan changed to –

Index Scan using idx_sbmn_id on demo.cltn_errs redis0_  (cost=0.57..182.80 rows=3086 width=272) (actual time=0.031..0.032 rows=1 loops=1) 
   Output: col1, col2, col3, col4, col5                                           
   Index Cond: (redis0_.sbmn_id = '123456'::numeric)                                                                                                 
   Buffers: shared hit=5                                                                                                                                 
 Planning time: 0.573 ms                                                                                                                                               
 Execution time: 0.085 ms        

Wow!! After the index, Buffers: shared hit and the total execution time show a magnificent improvement, more than 100x. So, with this testing done, I created the index on Prod, and after this little change the CPU utilization graph had to say it all.

[Screenshot: CPU utilization back to normal after creating the index]
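
For reference, a sketch of the index creation. The index and column names come from the plan above; running the DDL through psycopg2 and using CONCURRENTLY are my assumptions (I first tested on the restored snapshot, as described) –


# Sketch: create the index the new plan uses (idx_sbmn_id on demo.cltn_errs.sbmn_id).
# CONCURRENTLY avoids blocking writes on a busy table, but it cannot run inside a
# transaction block, hence autocommit.
import psycopg2

con = psycopg2.connect(host="localhost", dbname="postgres", user="admin")  # assumed connection details
con.autocommit = True
cur = con.cursor()
cur.execute("CREATE INDEX CONCURRENTLY idx_sbmn_id ON demo.cltn_errs (sbmn_id);")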

Python – Flatten List of Lists

itertools is one of the most powerful modules in Python. Today I had a requirement to flatten a list of lists, and itertools made it so easy.

My list —

>>> val = [['a','b'],'c',['d','e','f']]

Required Result

['a', 'b', 'c', 'd', 'e', 'f']

How do you do it? itertools to the rescue —

>>> from itertools import chain
>>> list(chain.from_iterable(val))
['a', 'b', 'c', 'd', 'e', 'f']

So simple!!

Note: this works cleanly here because 'c' is a single-character string – chain.from_iterable iterates every element, so a multi-character string would be split into its individual characters.