Understanding Database Indexes
Database indexes are data structures that improve the speed of data retrieval operations. Think of an index as a table of contents for your database - instead of scanning every page (row), the database can quickly jump to the relevant data. However, indexes come with trade-offs: they speed up reads but slow down writes and consume storage space.
Effective indexing is crucial for application performance, especially as data grows. Poor indexing can lead to slow queries, timeouts, and poor user experience. This guide covers index types, when to use them, and optimization strategies. For database fundamentals, see our Databases guide.
Types of Indexes
B-Tree Indexes
Most common index type. Efficient for equality and range queries. Used by default in PostgreSQL and MySQL for most data types. Supports sorted data retrieval.
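As a sketch (PostgreSQL/MySQL syntax; the orders table and column are illustrative), a single B-tree index serves equality, range, and sorted-output queries alike:
-- B-tree index on a timestamp column
CREATE INDEX idx_order_date ON orders(order_date);
-- Can serve:
-- WHERE order_date = '2024-01-15'
-- WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31'
-- ORDER BY order_date DESC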
Hash Indexes
Fast for equality lookups but don't support range queries. Ideal for exact match queries. Used in PostgreSQL for certain data types and in-memory databases.
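For example (PostgreSQL syntax; the sessions table is hypothetical), a hash index suits a value that is only ever looked up by exact match:
-- Hash index: equality lookups only
CREATE INDEX idx_session_token ON sessions USING HASH (token);
-- Can serve: WHERE token = 'abc123'
-- Cannot serve: range predicates or ORDER BY token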
Composite Indexes
Indexes on multiple columns. Order matters - leftmost columns are used first. Useful for queries filtering on multiple columns.
Partial Indexes
Index only the subset of rows that satisfies a condition. Smaller and faster to maintain than a full index when the condition matches common query patterns.
Covering Indexes
Include all columns needed by a query, allowing the database to satisfy queries entirely from the index without accessing the table.
When to Create Indexes
Primary Keys and Foreign Keys
Primary keys are automatically indexed. Foreign keys should typically be indexed to speed up joins and the lookups triggered when parent rows are updated or deleted. Note that PostgreSQL does not index foreign keys automatically (MySQL's InnoDB does), so create them explicitly where needed.
-- Index foreign key
CREATE INDEX idx_user_id ON orders(user_id);
-- Composite index for common query patterns
CREATE INDEX idx_user_status ON orders(user_id, status);
Frequently Queried Columns
Index columns used frequently in WHERE clauses, JOIN conditions, and ORDER BY clauses. Analyze query patterns to identify candidates. Use database query analyzers to find slow queries.
However, avoid over-indexing. Each index adds overhead to INSERT, UPDATE, and DELETE operations. Balance read performance with write performance based on your application's needs. For query optimization, see our Databases documentation.
High Selectivity Columns
Index columns with high selectivity (many unique values). Low-selectivity columns (like boolean flags) may not benefit from indexes unless combined with other columns in composite indexes.
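As an illustration (hypothetical schema), a boolean flag is a poor index candidate on its own, but works well inside a composite index or as a partial-index condition:
-- Low value alone: only two distinct values
CREATE INDEX idx_is_active ON users(is_active);
-- Better: combine with a selective column
CREATE INDEX idx_active_email ON users(is_active, email);
-- Or index only the interesting subset (PostgreSQL partial index)
CREATE INDEX idx_active_users ON users(email) WHERE is_active;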
Index Design Strategies
Composite Index Column Order
Order matters in composite indexes. Place the most selective column first, or order by query frequency. The leftmost prefix rule applies - a composite index on (a, b, c) can be used for queries on (a), (a, b), or (a, b, c), but not (b) or (c) alone.
-- Good: user_id is more selective
CREATE INDEX idx_user_status ON orders(user_id, status);
-- Can be used for:
-- WHERE user_id = ?
-- WHERE user_id = ? AND status = ?
-- But NOT for: WHERE status = ?
Covering Indexes
Include frequently accessed columns in indexes to create covering indexes. This allows index-only scans, avoiding table lookups entirely.
-- Covering index includes all queried columns (INCLUDE requires PostgreSQL 11+)
CREATE INDEX idx_user_covering ON orders(user_id)
INCLUDE (order_date, total_amount);
-- Query can be satisfied entirely from index
SELECT order_date, total_amount
FROM orders
WHERE user_id = 123;
Partial Indexes
Create indexes on subsets of data when queries frequently filter by specific conditions. This reduces index size and maintenance overhead.
-- Index only active orders
CREATE INDEX idx_active_orders ON orders(user_id)
WHERE status = 'active';
-- Much smaller than full index
-- Faster for queries filtering active orders
Index Maintenance
Index Fragmentation
Over time, indexes become fragmented due to updates and deletes. Regularly rebuild or reorganize indexes to maintain performance. Monitor index usage and remove unused indexes.
-- PostgreSQL: REINDEX
REINDEX INDEX idx_user_id;
-- MySQL: OPTIMIZE TABLE
OPTIMIZE TABLE orders;
-- Monitor index usage
SELECT * FROM pg_stat_user_indexes
WHERE idx_scan = 0;
Index Statistics
Keep statistics updated so the query planner can make optimal decisions. Most databases update statistics automatically, but you may need to update them manually after large data changes.
-- PostgreSQL: ANALYZE
ANALYZE orders;
-- MySQL: ANALYZE TABLE
ANALYZE TABLE orders;
Query Optimization
Using EXPLAIN
Use EXPLAIN (or EXPLAIN ANALYZE) to understand how queries are executed and whether indexes are being used. Look for sequential scans that could benefit from indexes.
-- Analyze query execution plan
EXPLAIN ANALYZE
SELECT * FROM orders
WHERE user_id = 123;
-- Look for:
-- Index Scan (good)
-- Seq Scan (may need index)
-- Index Only Scan (best - covering index)
Avoiding Index Anti-Patterns
Functions on Indexed Columns
Avoid applying functions to indexed columns in WHERE clauses. This prevents index usage. Instead, restructure queries or use function-based indexes.
-- Bad: a plain index on name cannot be used
WHERE UPPER(name) = 'JOHN'
-- Good, if a case-sensitive match is acceptable: plain index can be used
WHERE name = 'john'
-- Otherwise, create a function-based (expression) index
CREATE INDEX idx_upper_name ON users(UPPER(name));
-- Now WHERE UPPER(name) = 'JOHN' can use the index
LIKE Patterns
Leading wildcards prevent B-tree index usage. Use full-text search indexes for word matching, or trigram indexes (PostgreSQL's pg_trgm extension) for arbitrary substring matching.
-- Bad: Index cannot be used
WHERE name LIKE '%john%'
-- Better: Leading characters allow index usage
WHERE name LIKE 'john%'
-- Best for word search: full-text index
CREATE INDEX idx_name_fts ON users USING GIN(to_tsvector('english', name));
-- For substring search: trigram index (requires CREATE EXTENSION pg_trgm)
CREATE INDEX idx_name_trgm ON users USING GIN(name gin_trgm_ops);
Indexing Best Practices
Measure Before Optimizing
Use query profiling tools to identify slow queries. Don't create indexes blindly - measure actual performance impact.
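In PostgreSQL, the pg_stat_statements extension (it must be enabled via shared_preload_libraries) is a common starting point; the column names below are from PostgreSQL 13+:
-- Find the queries consuming the most total time
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;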
Monitor Index Usage
Regularly check which indexes are actually used. Remove unused indexes to reduce write overhead and storage.
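Once an index is confirmed unused (check replicas and periodic jobs too), drop it; PostgreSQL can do this without blocking writes:
-- Drop without taking a write-blocking lock (PostgreSQL;
-- cannot run inside a transaction block)
DROP INDEX CONCURRENTLY idx_unused_example;  -- index name is illustrative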
Consider Write Patterns
Balance read and write performance. High-write tables may need fewer indexes. Use read replicas for read-heavy workloads.
Test in Production-Like Environment
Index performance varies with data volume and distribution. Test with realistic data volumes before deploying to production.