PostgreSQL is my default database for everything. It's reliable, feature-rich, and handles most workloads well. But I've made enough mistakes with it to fill a book. Here are the lessons that cost me the most time and the most sleep.

Index What You Query, Not What You Think You'll Query

Early in my career, I'd add indexes based on gut feeling. "This column will probably get queried a lot." That approach led to tables with six indexes that slowed down writes, while the actual slow queries had no index at all.

Now I follow a simple rule: don't add indexes until you have real queries to optimize. Use EXPLAIN ANALYZE on your actual queries and look for sequential scans on large tables. That's where you need an index. Not before.
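As a sketch of that workflow (table and column names here are hypothetical):

```sql
-- Run the real query and see what the planner actually did.
EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 42;

-- If the plan shows "Seq Scan on orders" with a high row count,
-- that's the cue to index the filtered column. CONCURRENTLY avoids
-- blocking writes while the index builds (note it can't run inside
-- a transaction block).
CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id);
```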

One gotcha that bit me hard: column order matters in a composite index. An index on (user_id, created_at) speeds up queries that filter by user_id alone, or by user_id AND created_at. It won't help a query that filters only by created_at: the leftmost columns of the index have to appear in your query's WHERE clause. I ran with a composite index that was completely useless for a month before I realized the column order was wrong.
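A minimal illustration, assuming a hypothetical events table:

```sql
CREATE INDEX idx_events_user_created ON events (user_id, created_at);

-- Can use the index: the leftmost column (user_id) is in the WHERE clause.
SELECT * FROM events WHERE user_id = 7;
SELECT * FROM events
WHERE user_id = 7 AND created_at > now() - interval '1 day';

-- Cannot use the index for this filter: user_id is absent, so the
-- planner falls back to a sequential scan (or needs a separate index
-- on created_at).
SELECT * FROM events WHERE created_at > now() - interval '1 day';
```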

The N+1 Problem Will Ruin Your Day

If you use an ORM, you've probably written N+1 queries without realizing it. Load a list of 100 users, then for each user, load their profile. That's 101 queries. Your local database handles this in milliseconds because there's no network latency. In production, with network round trips and connection overhead, that same code takes seconds.

I discovered our API had N+1 queries when response times spiked from 200ms to 4 seconds as the dataset grew. The fix was eager loading (JOIN the related data in a single query), but finding all the N+1 patterns took a week of auditing.
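In SQL terms, the fix collapses the per-row lookups into one join. The users/profiles tables here are hypothetical stand-ins for whatever your ORM generates:

```sql
-- N+1 pattern: one query for the list, then one per row.
--   SELECT * FROM users LIMIT 100;
--   SELECT * FROM profiles WHERE user_id = $1;   -- executed 100 times

-- Eager-loaded version: one round trip for everything.
SELECT u.id, u.email, p.bio, p.avatar_url
FROM users u
LEFT JOIN profiles p ON p.user_id = u.id
LIMIT 100;
```

Most ORMs expose this as an option on the query (e.g. `includes`, `select_related`, or `joinedload`, depending on the framework).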

My prevention strategy: enable query logging in development with log_min_duration_statement = 0 (logs all queries) and watch for patterns. If a single page load generates more than 10 queries, something is probably wrong. Tools like pgBadger can analyze your logs and highlight the worst offenders.
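A development-only sketch of the relevant postgresql.conf settings (the log_line_prefix shown is one format pgBadger can parse; adjust to taste):

```
# postgresql.conf -- development only; logging every statement is expensive
log_min_duration_statement = 0               # 0 = log every statement with its duration
log_line_prefix = '%t [%p]: user=%u,db=%d '  # timestamp, pid, user, database
```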

Connection Pooling Is Not Optional

PostgreSQL forks a new backend process for each connection, which is heavyweight compared with the thread-per-connection model many other databases use. If your application opens a connection for every request, you'll hit the default limit of 100 connections quickly under load. Even if you raise the limit, each connection consumes server memory, and at some point your server runs out.

I learned this when a traffic spike brought down a production database. The application was creating and destroying connections on every request. Under normal load this was fine. Under 10x load, we exhausted connections and the database stopped accepting new ones. Everything went down.

The solution is connection pooling. PgBouncer sits between your app and the database, maintaining a pool of persistent server connections and multiplexing your application's connections through them. For most deployments, PgBouncer in transaction mode is the right choice, provided your application doesn't rely on session state such as session-level prepared statements or advisory locks. It reduced our connection count from hundreds to about 20, and the database has been stable since.
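A minimal pgbouncer.ini sketch (paths, names, and pool sizes are illustrative):

```ini
[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction    ; a server connection is reused between transactions
default_pool_size = 20     ; server connections per database/user pair
```

The application then connects to port 6432 instead of 5432; nothing else about it changes.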

Migrations Need to Be Backwards Compatible

This one took me too long to internalize. If you're deploying with zero downtime (rolling deploys, blue-green, etc.), there's a window where the old application code and the new database schema coexist. If your migration renames a column, the old code will break because it's looking for the old column name.

Safe migration pattern: add the new column, deploy code that writes to both columns, backfill the new column, deploy code that reads from the new column, then drop the old column. It's three deploys instead of one, but nothing breaks.
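Sketched in SQL for a hypothetical rename of name to full_name (the app deploys happen between the migration steps):

```sql
-- Migration 1: add the new column. Old code simply ignores it.
ALTER TABLE users ADD COLUMN full_name text;

-- Deploy app code that writes to BOTH name and full_name.

-- Migration 2: backfill existing rows (batch this on a large table).
UPDATE users SET full_name = name WHERE full_name IS NULL;

-- Deploy app code that reads only full_name.

-- Migration 3: drop the old column once nothing references it.
ALTER TABLE users DROP COLUMN name;
```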

Also, never run ALTER TABLE ... ADD COLUMN ... DEFAULT ... on a large table in PostgreSQL versions before 11. In those versions, adding a column with a default rewrites the entire table, holding an exclusive lock for the duration. On a table with millions of rows, that lock can last minutes. PostgreSQL 11 fixed this for non-volatile defaults: the default value is stored in the catalog instead of rewriting the table, making the operation effectively instant. But check your version first.
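On pre-11 versions, the same end state can be reached without the long lock. A rough sketch, again with a hypothetical orders table:

```sql
-- Instant: no default, so no table rewrite.
ALTER TABLE orders ADD COLUMN status text;

-- Also instant: the default applies only to rows inserted from now on.
ALTER TABLE orders ALTER COLUMN status SET DEFAULT 'pending';

-- Backfill old rows in batches to keep each lock short.
UPDATE orders SET status = 'pending'
WHERE status IS NULL AND id BETWEEN 1 AND 100000;
-- ...repeat for subsequent id ranges until done...
```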

VACUUM Is Your Friend

PostgreSQL uses MVCC (multi-version concurrency control), which means UPDATE and DELETE don't actually remove old row versions. They create new versions and mark the old ones as dead. VACUUM reclaims the space from dead rows.

Autovacuum handles this automatically for most cases. But if you have a table with heavy write volume, autovacuum might not keep up. I've seen tables bloat to 10x their actual data size because autovacuum wasn't running aggressively enough.

Monitor table bloat with pg_stat_user_tables. If n_dead_tup keeps climbing and last_autovacuum shows the table hasn't been vacuumed recently, you need to tune the autovacuum settings for that table. PostgreSQL supports per-table autovacuum parameters, which is useful for hot tables that need more frequent cleanup.
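A sketch of both the monitoring query and a per-table override (the thresholds are illustrative starting points, not recommendations):

```sql
-- Which tables are accumulating dead tuples, and when were they
-- last autovacuumed?
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;

-- Vacuum a hot table at ~1% dead rows instead of the default 20%.
ALTER TABLE events SET (
  autovacuum_vacuum_scale_factor = 0.01,
  autovacuum_vacuum_threshold = 1000
);
```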

Use EXPLAIN ANALYZE, Not EXPLAIN

EXPLAIN shows you what the query planner thinks will happen. EXPLAIN ANALYZE actually runs the query and shows you what really happened. The difference can be significant. The planner might estimate 100 rows but the actual result is 100,000, which means it chose the wrong execution plan.

Always use EXPLAIN (ANALYZE, BUFFERS) for the most useful output. The BUFFERS option shows how many blocks were served from the shared buffer cache versus read in from outside it, which tells you whether your working set fits in memory or is hitting disk. If frequent queries show substantial reads rather than hits, you probably need more shared_buffers or more RAM.
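For example (table name hypothetical):

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42;

-- In the output, look for a line like:
--   Buffers: shared hit=120 read=4500
-- "hit" blocks came from shared_buffers; "read" blocks came from the
-- OS cache or disk. A frequent query doing mostly reads means the
-- working set isn't staying in PostgreSQL's cache.
```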

These lessons all came from production incidents. Every one of them could have been avoided if I'd known what to look for. PostgreSQL is an excellent database, but it rewards understanding over assumptions.