外观
PostgreSQL 关键指标检查项目
连接和会话指标
1. 当前连接数
检查方法:
sql
SELECT
count(*) AS total_connections,
state,
count(*) FILTER (WHERE state = 'active') AS active_connections,
count(*) FILTER (WHERE state = 'idle') AS idle_connections,
count(*) FILTER (WHERE state = 'idle in transaction') AS idle_in_transaction
FROM pg_stat_activity;阈值建议:
- 活跃连接数不应超过
max_connections的80% - 空闲事务连接数应保持在较低水平(< 10%)
- 总连接数不应接近
max_connections
2. 连接来源分布
检查方法:
sql
SELECT
client_addr,
client_hostname,
application_name,
count(*) AS connection_count
FROM pg_stat_activity
GROUP BY client_addr, client_hostname, application_name
ORDER BY connection_count DESC;检查要点:
- 识别异常的连接来源
- 检查应用程序连接池配置
- 确认连接数分布是否合理
3. 连接超时配置
检查方法:
sql
SELECT
name,
setting,
unit,
short_desc
FROM pg_settings
WHERE name IN ('idle_in_transaction_session_timeout', 'statement_timeout', 'idle_session_timeout');建议配置:
idle_in_transaction_session_timeout:10分钟statement_timeout:5分钟idle_session_timeout:30分钟
内存使用指标
1. 共享内存使用
检查方法:
sql
SELECT
name,
setting,
unit,
short_desc
FROM pg_settings
WHERE name IN ('shared_buffers', 'work_mem', 'maintenance_work_mem', 'effective_cache_size');建议配置:
shared_buffers:系统内存的25%work_mem:根据并发数调整,建议64MB-256MBmaintenance_work_mem:系统内存的5%-10%effective_cache_size:系统内存的75%
2. 内存使用统计
检查方法:
sql
-- 查看内存上下文使用情况
SELECT
name,
allocated,
freed,
allocated - freed AS net_allocated
FROM pg_stat_bgwriter;检查要点:
- 监控内存泄漏情况
- 确认内存分配是否合理
- 检查后台写入器性能
3. 工作内存使用
检查方法:
sql
-- 查看工作内存使用情况
SELECT
datname,
usename,
query,
state,
now() - query_start AS duration,
work_mem,
temp_files,
temp_bytes
FROM pg_stat_activity
WHERE temp_files > 0
ORDER BY temp_bytes DESC;检查要点:
- 识别使用大量临时文件的查询
- 调整
work_mem参数 - 优化查询计划
磁盘和I/O指标
1. 磁盘空间使用
检查方法:
sql
-- 查看数据库大小
SELECT
datname,
pg_size_pretty(pg_database_size(datname)) AS database_size
FROM pg_database
ORDER BY pg_database_size(datname) DESC;
-- 查看表大小
SELECT
schemaname,
relname AS table_name,
pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
pg_size_pretty(pg_relation_size(relid)) AS table_size,
pg_size_pretty(pg_indexes_size(relid)) AS index_size
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC;阈值建议:
- 磁盘使用率不应超过80%
- 单表大小不应超过200GB(建议分区)
- 索引大小不应超过表大小的50%
2. I/O统计
检查方法:
sql
-- 查看表I/O统计
SELECT
schemaname,
relname AS table_name,
seq_scan AS sequential_scans,
seq_tup_read AS sequential_tuples_read,
idx_scan AS index_scans,
idx_tup_fetch AS index_tuples_fetched,
n_tup_ins AS tuples_inserted,
n_tup_upd AS tuples_updated,
n_tup_del AS tuples_deleted
FROM pg_stat_user_tables
ORDER BY seq_scan DESC;
-- 查看索引I/O统计
SELECT
schemaname,
relname AS table_name,
indexrelname AS index_name,
idx_blks_read AS index_blocks_read,
idx_blks_hit AS index_blocks_hit,
CASE WHEN idx_blks_read + idx_blks_hit > 0
THEN idx_blks_hit / (idx_blks_read + idx_blks_hit)::numeric * 100
ELSE 0
END AS index_hit_rate
FROM pg_stat_user_indexes i
JOIN pg_statio_user_indexes s ON i.indexrelid = s.indexrelid
ORDER BY index_hit_rate ASC;检查要点:
- 索引命中率应高于90%
- 减少全表扫描
- 优化高I/O表的查询
3. 检查点活动
检查方法:
sql
SELECT
name,
setting,
unit,
short_desc
FROM pg_settings
WHERE name IN ('checkpoint_timeout', 'max_wal_size', 'min_wal_size', 'checkpoint_completion_target');
-- 查看检查点统计
SELECT
checkpoints_timed,
checkpoints_req,
checkpoint_write_time,
checkpoint_sync_time
FROM pg_stat_bgwriter;建议配置:
checkpoint_timeout:30分钟max_wal_size:16GBcheckpoint_completion_target:0.9- 要求检查点请求次数不应超过定时检查点次数的20%
事务和锁指标
1. 事务统计
检查方法:
sql
-- 查看事务提交和回滚统计
SELECT
datname,
xact_commit AS transactions_committed,
xact_rollback AS transactions_rolled_back,
CASE WHEN xact_commit + xact_rollback > 0
THEN xact_rollback / (xact_commit + xact_rollback)::numeric * 100
ELSE 0
END AS rollback_rate
FROM pg_stat_database
ORDER BY rollback_rate DESC;阈值建议:
- 事务回滚率应低于5%
- 高回滚率可能表示应用程序问题
2. 锁等待事件
检查方法:
sql
-- 查看锁等待情况
SELECT
blocking.pid AS blocking_pid,
blocking.query AS blocking_query,
blocked.pid AS blocked_pid,
blocked.query AS blocked_query,
blocked.wait_event_type,
blocked.wait_event,
now() - blocked.query_start AS wait_duration
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking ON blocked.waiting_pid = blocking.pid
WHERE blocked.waiting = true;检查要点:
- 识别长期阻塞的事务
- 分析锁等待原因
- 优化查询减少锁竞争
3. 死锁统计
检查方法:
sql
-- 查看死锁统计
SELECT
datname,
deadlocks
FROM pg_stat_database;
-- 查看死锁日志(需要配置log_deadlocks = on)
SELECT * FROM pg_log WHERE message LIKE '%deadlock%' ORDER BY log_time DESC LIMIT 10;检查要点:
- 死锁数量应保持为0
- 分析死锁原因并优化
- 确保应用程序访问资源的顺序一致
查询性能指标
1. 慢查询统计
检查方法:
sql
-- 查看慢查询日志配置
SELECT
name,
setting,
unit,
short_desc
FROM pg_settings
WHERE name IN ('log_min_duration_statement', 'log_statement');
-- 查看当前正在执行的慢查询
SELECT
pid,
datname,
usename,
query,
state,
now() - query_start AS duration
FROM pg_stat_activity
WHERE state = 'active'
AND now() - query_start > interval '1 second'
ORDER BY duration DESC;建议配置:
log_min_duration_statement:100毫秒log_statement:ddl
2. 查询计划效率
检查方法:
sql
-- 使用EXPLAIN分析查询计划
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'user@example.com';
-- 查看缓存命中率
SELECT
name,
setting,
unit,
short_desc
FROM pg_settings
WHERE name LIKE '%cache%';检查要点:
- 优化全表扫描和嵌套循环
- 确保索引被正确使用
- 调整
random_page_cost和seq_page_cost
3. 临时文件使用
检查方法:
sql
-- 查看临时文件使用情况
SELECT
datname,
temp_files AS total_temp_files,
pg_size_pretty(temp_bytes) AS total_temp_size
FROM pg_stat_database
ORDER BY temp_bytes DESC;
-- 查看当前使用临时文件的查询
SELECT
pid,
datname,
usename,
query,
temp_files,
temp_bytes
FROM pg_stat_activity
WHERE temp_files > 0
ORDER BY temp_bytes DESC;检查要点:
- 减少大型排序和哈希操作
- 调整
work_mem参数 - 优化查询计划
WAL和复制指标
1. WAL生成速率
检查方法:
sql
-- 查看WAL写入统计
SELECT
wal_written,
wal_buffers_full,
wal_write_time
FROM pg_stat_bgwriter;
-- 查看WAL配置
SELECT
name,
setting,
unit,
short_desc
FROM pg_settings
WHERE name IN ('wal_level', 'wal_buffers', 'synchronous_commit');建议配置:
wal_level:replicawal_buffers:16MBsynchronous_commit:on(生产环境)
2. 复制延迟
检查方法:
sql
-- 查看复制状态(主库)
SELECT
client_addr,
state,
sent_lsn,
write_lsn,
flush_lsn,
replay_lsn,
write_lag,
flush_lag,
replay_lag
FROM pg_stat_replication;
-- 查看复制状态(从库)
SELECT
application_name,
state,
sync_state,
sent_lsn,
write_lsn,
flush_lsn,
replay_lsn,
write_lag,
flush_lag,
replay_lag
FROM pg_stat_wal_receiver;阈值建议:
- 复制延迟应小于1秒
- 确保复制状态正常
- 监控复制中断情况
3. 复制槽状态
检查方法:
sql
-- 查看复制槽
SELECT
slot_name,
plugin,
slot_type,
active,
restart_lsn,
confirmed_flush_lsn,
pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS wal_lag_bytes
FROM pg_replication_slots;检查要点:
- 复制槽应处于活跃状态
- WAL延迟应保持在合理水平
- 清理不再使用的复制槽
系统资源指标
1. CPU使用率
检查方法:
sql
-- 查看PostgreSQL进程CPU使用
SELECT
pid,
usename,
application_name,
state,
now() - query_start AS duration,
query
FROM pg_stat_activity
ORDER BY now() - query_start DESC;检查要点:
- 识别CPU密集型查询
- 优化高CPU使用率的查询
- 考虑水平扩展或升级硬件
2. 系统负载
检查方法:
- 使用系统命令:
top、uptime、vmstat - 结合Prometheus+Grafana监控
阈值建议:
- 系统负载不应超过CPU核心数的1.5倍
- 监控长时间高负载情况
检查频率建议
| 指标类别 | 检查频率 | 告警阈值 |
|---|---|---|
| 连接和会话 | 每5分钟 | 活跃连接>80% |
| 内存使用 | 每小时 | 内存使用率>90% |
| 磁盘和I/O | 每30分钟 | 磁盘使用率>80% |
| 事务和锁 | 每10分钟 | 锁等待>30秒 |
| 查询性能 | 每15分钟 | 慢查询>1秒 |
| WAL和复制 | 每5分钟 | 复制延迟>1秒 |
| 系统资源 | 每5分钟 | CPU使用率>90% |
常见问题(FAQ)
Q1:如何设置PostgreSQL关键指标告警?
A1:推荐使用Prometheus+Grafana组合:
- 安装PostgreSQL exporter收集指标
- 配置Prometheus抓取指标
- 在Grafana中创建监控面板
- 设置告警规则(如连接数过高、复制延迟大等)
Q2:哪些指标最能反映PostgreSQL的健康状态?
A2:核心健康指标包括:
- 连接使用率
- 复制延迟
- 锁等待时间
- 慢查询数量
- 磁盘空间使用率
- 事务回滚率
Q3:如何减少PostgreSQL的内存使用?
A3:优化方法:
- 调整
shared_buffers、work_mem等参数 - 优化查询减少临时文件
- 清理空闲连接
- 考虑使用连接池
Q4:如何优化高I/O的PostgreSQL服务器?
A4:优化建议:
- 使用更快的存储设备(SSD、NVMe)
- 优化查询减少I/O
- 增加
shared_buffers - 调整检查点配置
- 使用表分区
Q5:如何监控PostgreSQL的慢查询?
A5:监控方法:
- 配置
log_min_duration_statement记录慢查询 - 使用pgBadger分析慢查询日志
- 在Grafana中设置慢查询告警
- 使用pg_stat_statements查看查询统计
Q6:如何检查PostgreSQL的索引使用情况?
A6:检查方法:
sql
SELECT
relname AS table_name,
indexrelname AS index_name,
idx_scan AS index_scans,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC;Q7:如何评估PostgreSQL的性能瓶颈?
A7:评估步骤:
- 检查系统资源(CPU、内存、磁盘)
- 分析查询计划
- 监控锁和等待事件
- 检查索引使用情况
- 分析慢查询日志
Q8:如何优化PostgreSQL的事务性能?
A8:优化建议:
- 保持事务简短
- 避免长事务
- 使用合适的隔离级别
- 减少锁竞争
- 优化索引设计
