Understanding pg_repack: Eliminate PostgreSQL Table Bloat Without Downtime

December 3, 2025

Database bloat is a common challenge in PostgreSQL production environments. As tables undergo frequent updates and deletes, dead tuples accumulate and degrade performance. While VACUUM FULL can reclaim this space, it requires extensive locking that brings your application to a halt. This is where pg_repack shines: a PostgreSQL extension that eliminates bloat while keeping your database fully operational.

What is pg_repack?

pg_repack is a PostgreSQL extension designed to rebuild tables and indexes without blocking concurrent read and write operations. Unlike built-in maintenance commands, it operates as an external tool that connects to your database from an EC2 instance or local machine.

The mechanism is elegant: pg_repack creates a fresh copy of your bloated table, tracks any changes happening during the rebuild, then swaps the old table with the optimized version. The critical advantage is that it only requires exclusive locks for brief moments at the beginning and end of the process.

Why Use pg_repack?

Standard PostgreSQL maintenance tools have significant limitations. VACUUM FULL reclaims space but locks your table for the entire operation, making it unsuitable for production environments. CLUSTER and REINDEX have similar drawbacks.

pg_repack addresses these issues with three key benefits:

  • Minimal downtime: Tables remain accessible for reads and writes throughout the repacking process
  • Brief locking period: Exclusive locks are held only during initial setup and final swap operations
  • Operational flexibility: Run maintenance during business hours without impacting users

This makes pg_repack essential for high-availability systems where traditional maintenance windows are impractical or impossible.

How pg_repack Works

Understanding the repacking process helps you use the tool effectively. Here's what happens under the hood:

First, pg_repack creates a new empty table with the same schema as your original. It then copies all current rows from the bloated table to this fresh structure. While this bulk copy runs, a trigger captures any INSERT, UPDATE, or DELETE operations happening on the original table.

These concurrent changes are stored in a log table. Once the initial copy completes, pg_repack applies the logged changes to bring the new table up to date. Finally, it briefly acquires an exclusive lock to swap table names, making the new optimized table live.

The entire process runs without blocking normal database operations. Your application continues serving requests while pg_repack works in the background.
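The copy-log-replay-swap cycle above can be sketched in plain SQL. This is an illustration only: the real pg_repack implements it in C with internal repack.* objects and swaps the table's underlying storage files rather than renaming; the orders table with an id primary key is a hypothetical example.

```sql
-- 1. New table with the same structure as the bloated one
CREATE TABLE orders_new (LIKE orders INCLUDING ALL);

-- 2. Log table plus trigger to capture concurrent writes
CREATE TABLE orders_log (op text, pk bigint);

CREATE FUNCTION log_orders_change() RETURNS trigger AS $$
BEGIN
  INSERT INTO orders_log
  VALUES (TG_OP, CASE WHEN TG_OP = 'DELETE' THEN OLD.id ELSE NEW.id END);
  RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_capture
  AFTER INSERT OR UPDATE OR DELETE ON orders
  FOR EACH ROW EXECUTE FUNCTION log_orders_change();

-- 3. Bulk copy while the trigger records ongoing activity
INSERT INTO orders_new SELECT * FROM orders;

-- 4. (Replay rows from orders_log here.) Then briefly lock and swap;
--    renaming is the closest plain-SQL analogy to pg_repack's
--    relfilenode swap.
BEGIN;
LOCK TABLE orders IN ACCESS EXCLUSIVE MODE;
ALTER TABLE orders RENAME TO orders_old;
ALTER TABLE orders_new RENAME TO orders;
COMMIT;
```

Only step 4 takes an exclusive lock, which is why the swap window stays short regardless of table size.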

Installing and Using pg_repack

Before running pg_repack, you need to install the extension on your PostgreSQL server. For AWS RDS or Aurora, the extension is available but must be enabled. For self-hosted PostgreSQL, install via your package manager or build from source.
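On managed services, enabling the extension typically comes down to creating it in each database you plan to repack; note that the pg_repack client binary must match the server-side extension version. A sketch, using the repack.version() helper that the extension provides:

```sql
-- Enable the extension in the target database
CREATE EXTENSION pg_repack;

-- Check the server-side version; the pg_repack client must match it
SELECT repack.version();
```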

Once installed, you can repack tables or indexes using command-line syntax:

# Repack a specific table
pg_repack -h your-host.com -d mydatabase --table orders -k

# Repack an index
pg_repack -h your-host.com -d mydatabase --index idx_orders_customer

The -k flag (short for --no-superuser-check) skips pg_repack's superuser check at connection time. It is required on managed services such as Amazon RDS and Aurora, where you connect with a privileged role that is not a true superuser.

For database-wide maintenance, you can repack all tables in a database:

# Repack all tables in the database
pg_repack -h your-host.com -d mydatabase

This is useful for scheduled maintenance but requires more time and resources than targeting specific tables.

Important Limitations

pg_repack has several requirements and constraints to be aware of:

  • Primary keys required: Tables must have a primary key or unique index. pg_repack uses this to identify rows during the rebuild
  • No temporary tables: The tool cannot operate on temporary tables
  • No GiST clustering: pg_repack cannot cluster tables by GiST indexes; ordered repacks need a B-tree index
  • DDL blocking: Schema changes on the target table are blocked during repacking

These limitations are important for planning your maintenance strategy. If a table lacks a primary key, you'll need to add one before running pg_repack.
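To find tables that would fail the primary-key requirement up front, you can check the system catalogs. A sketch that scans the public schema (adjust the schema name for your setup):

```sql
-- Tables with neither a primary key nor a unique index
SELECT c.relname
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname = 'public'
  AND NOT EXISTS (
    SELECT 1
    FROM pg_index i
    WHERE i.indrelid = c.oid
      AND (i.indisprimary OR i.indisunique)
  );
```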

Handling Invalid Indexes

One common issue is pg_repack's invalid-index check (surfaced as the error-on-invalid-index behavior). An invalid index is the leftover of a failed or interrupted CREATE INDEX CONCURRENTLY build; pg_repack refuses to process a table that has one, because rebuilding around a half-built index could lead to data inconsistencies.

To identify invalid indexes on a table, run this query:

SELECT indexrelid::regclass, indisvalid
FROM pg_index
WHERE indrelid = 'orders'::regclass
  AND NOT indisvalid;

If you find invalid indexes, drop them before proceeding:

DROP INDEX CONCURRENTLY idx_invalid_name;

After removing invalid indexes, pg_repack will run successfully. You can rebuild the indexes afterward using CREATE INDEX CONCURRENTLY.
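For example, to recreate the dropped index from the earlier snippet without blocking writes (customer_id is an assumed column name, inferred from the index name):

```sql
CREATE INDEX CONCURRENTLY idx_orders_customer ON orders (customer_id);

-- Confirm the rebuild produced a valid index
SELECT indisvalid
FROM pg_index
WHERE indexrelid = 'idx_orders_customer'::regclass;
```

If a concurrent build fails, it leaves another invalid index behind, so checking indisvalid after each rebuild is worth the extra query.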

Best Practices

To get the most from pg_repack, follow these guidelines:

  • Monitor bloat levels regularly using queries against pg_stat_user_tables to identify candidates for repacking
  • Target specific tables rather than repacking entire databases unless necessary
  • Run pg_repack during lower-traffic periods when possible to reduce resource contention
  • Ensure adequate disk space - pg_repack holds the old and new copies of the table (and its indexes) simultaneously, so budget roughly twice the table's current on-disk size
  • Test the operation on development or staging environments first to estimate duration and resource usage
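For the first point, a simple way to rank repack candidates is the dead-tuple ratio from pg_stat_user_tables:

```sql
SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 1)
         AS dead_pct
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```

Tables with a persistently high dead_pct despite regular autovacuum runs are the strongest candidates for repacking.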

For large tables, consider the --no-order flag to skip clustering by index, which speeds up the process when logical order isn't critical.

When to Choose pg_repack Over Alternatives

pg_repack isn't always the answer. Use VACUUM regularly for routine maintenance - it prevents bloat from accumulating in the first place. VACUUM FULL is acceptable during scheduled maintenance windows when brief downtime is tolerable.

Choose pg_repack when you need to reclaim significant bloat on production tables that must remain online. It's particularly valuable for high-traffic tables in 24/7 applications where traditional maintenance windows don't exist.

For tables that rarely change, bloat may not be an issue at all. Focus pg_repack efforts on frequently updated tables where dead tuples accumulate quickly.

Monitoring and Verification

After running pg_repack, verify the operation succeeded by checking table sizes:

SELECT
  schemaname,
  tablename,
  pg_size_pretty(pg_total_relation_size(format('%I.%I', schemaname, tablename)::regclass)) AS size
FROM pg_tables
WHERE tablename = 'orders';

Compare the size before and after repacking to confirm bloat was eliminated. You should see a significant reduction for heavily bloated tables.

Monitor system resources during repacking to ensure it doesn't overwhelm your database server. pg_repack is I/O intensive and can impact query performance if resources are constrained.

Conclusion

pg_repack is an essential tool for maintaining PostgreSQL databases in production environments. By eliminating table and index bloat without blocking concurrent operations, it enables effective maintenance without sacrificing availability.

Understanding its capabilities and limitations allows you to incorporate pg_repack into your database maintenance strategy effectively. Regular monitoring identifies bloat early, and targeted repacking keeps your database performing optimally without impacting users.