The curious case of High IOPS in AWS RDS
Not so long back I was working on a project where we got complaints from the client that the infrastructure monthly cost is too high and they asked us to see if there is anything that we can do to reduce it.
On inspection of the monthly billing we found that the RDS (Postgres) component cost was actually contributing to a major part of the cost. On further inspection we found that it was due to the high IOPS (50K/second).
The production system was live with only two customer sites with entry/exit information and so the data load on the system was still not so huge. Thus it was very strange to us as to why the RDS was recording such a high IOPS.
When we started our investigation on the high IOPS of RDS something that drew our attention was a periodic spike in IOPS every 15 mins.
When we drilled down to these times to see what queries are run during these times then we found the ones below:-
This helped us in diverting our attention into these AUTOVACUUM commands.
So what is an autovacuum — Introduced in PostgreSQL 8.1, the AUTOVACUUM daemon is an optional feature that automatically vacuums the database so that you don’t have to manually run the VACUUM statement. The AUTOVACUUM daemon is enabled in the default configuration. The VACUUM statement is used to reclaim storage by removing obsolete data or tuples from the PostgreSQL database.
To check the table statistics about the number of dead tuples we ran this query —
SELECT relname, n_live_tup, n_dead_tup, trunc(100*n_dead_tup/(n_live_tup+1))::float "ratio%",
to_char(last_autovacuum, 'YYYY-MM-DD HH24:MI:SS') as autovacuum_date,
to_char(last_autoanalyze, 'YYYY-MM-DD HH24:MI:SS') as autoanalyze_date
FROM pg_stat_all_tables
ORDER BY last_autovacuum;
The above query gave the following output —
So we could see a high number of dead tuples and that the tables were bloated. To alleviate this bloat we ran the “pg_repack” command which took around 10 hours to run.
pg_repack is a PostgreSQL extension which lets you remove bloat from tables and indexes, and optionally restore the physical order of clustered indexes. Unlike CLUSTER and VACUUM FULL it works online, without holding an exclusive lock on the processed tables during processing.
After running the pg_repack command we ran the stat query again and now we see a better ratio.
Also we could see since the command ended the IOPS went down to 0–1.5/secs from 50k/seconds
So we could conclude that the autovacuum became the root cause of IOPS spikes and pg_repack is a manual step performed when autovacuum didn’t work as expected.
Hopefully, this might help other people if they face same kind of a situation in their projects.