外观
KingBaseES常用监控指标详解
KingBaseES提供了丰富的监控指标,用于监控数据库的性能、状态和健康状况。这些指标主要通过动态性能视图(Dynamic Performance Views)和系统视图获取。本文将详细介绍KingBaseES的常用监控指标及其使用方法。
连接与会话指标
连接数
指标说明:监控数据库的当前连接数和最大连接数,用于评估数据库的连接负载。连接数过高可能导致数据库性能下降,甚至无法接受新连接。
获取方式:
sql
-- 查看当前连接数
SELECT count(*) AS current_connections FROM sys_stat_activity;
-- 查看最大连接数配置
SHOW max_connections;
-- 查看连接数使用百分比
SELECT
count(*) AS current_connections,
setting AS max_connections,
round(count(*)::numeric / setting::numeric * 100, 2) AS usage_percent
FROM sys_stat_activity,
sys_settings
WHERE name = 'max_connections'
GROUP BY setting;
-- 查看各数据库连接分布
SELECT
datname,
count(*) AS connection_count
FROM sys_stat_activity
WHERE datname IS NOT NULL
GROUP BY datname
ORDER BY connection_count DESC;生产环境监控脚本:
bash
#!/bin/bash
# KingBaseES 连接数监控脚本
# 配置
KB_BIN=/opt/Kingbase/ES/V8R6/Server/bin
KB_HOST=127.0.0.1
KB_PORT=54321
KB_USER=monitor_user
KB_PASS=secure_password
ALERT_THRESHOLD=80
# 获取当前连接数和最大连接数
CONNECTIONS=$($KB_BIN/ksql -h $KB_HOST -p $KB_PORT -U $KB_USER -d template1 -t -c "SELECT count(*) FROM sys_stat_activity;")
MAX_CONNECTIONS=$($KB_BIN/ksql -h $KB_HOST -p $KB_PORT -U $KB_USER -d template1 -t -c "SHOW max_connections;")
# 计算使用率
USAGE=$(echo "scale=2; $CONNECTIONS / $MAX_CONNECTIONS * 100" | bc)
# 输出结果
echo "$(date +'%Y-%m-%d %H:%M:%S') - KingBaseES Connections: $CONNECTIONS/$MAX_CONNECTIONS ($USAGE%)"
# 告警
if (( $(echo "$USAGE > $ALERT_THRESHOLD" | bc -l) )); then
echo "WARNING: Connection usage exceeds $ALERT_THRESHOLD%"
# 发送告警邮件或短信
# mail -s "KingBaseES Connection Alert" dba@example.com <<< "Connection usage: $USAGE%"
fi监控建议:
- 连接数使用率超过80%时,应考虑增加max_connections参数或优化应用连接池配置
- 关注异常连接,如长时间空闲的连接(idle超过1小时)
- 定期清理无效连接,设置合理的连接超时参数
- 监控各数据库连接分布,识别连接热点
- 结合应用连接池配置,优化连接复用
会话状态分布
指标说明:监控不同状态的会话分布,了解数据库的工作负载类型和会话行为。会话状态分布可以帮助识别异常情况,如长时间空闲的事务、大量活跃会话等。
获取方式:
sql
-- 查看会话状态分布
SELECT
state,
count(*) AS count
FROM sys_stat_activity
WHERE state IS NOT NULL
GROUP BY state
ORDER BY count DESC;
-- 查看长时间idle in transaction的会话
SELECT
pid,
datname,
usename,
client_addr,
query_start,
now() - query_start AS duration,
query
FROM sys_stat_activity
WHERE state = 'idle in transaction'
AND now() - query_start > interval '5 minutes'
ORDER BY duration DESC;
-- 查看活跃会话的详细信息
SELECT
pid,
datname,
usename,
client_addr,
query_start,
now() - query_start AS duration,
wait_event_type,
wait_event,
query
FROM sys_stat_activity
WHERE state = 'active'
ORDER BY duration DESC;状态说明:
- active:正在执行SQL命令的活跃会话,直接反映数据库当前工作负载
- idle:空闲会话,等待客户端发出命令,过多可能表示连接池配置不合理
- idle in transaction:在事务中但当前空闲的会话,可能导致锁持有时间过长
- fastpath function call:正在执行fastpath函数调用,通常是内部操作
- disabled:已禁用的会话,很少出现
生产环境监控脚本:
bash
#!/bin/bash
# KingBaseES 会话状态监控脚本
# 配置
KB_BIN=/opt/Kingbase/ES/V8R6/Server/bin
KB_HOST=127.0.0.1
KB_PORT=54321
KB_USER=monitor_user
KB_PASS=secure_password
# 获取会话状态分布
echo "$(date +'%Y-%m-%d %H:%M:%S') - 会话状态分布:"
$KB_BIN/ksql -h $KB_HOST -p $KB_PORT -U $KB_USER -d template1 -t -c "
SELECT
state,
count(*) AS count
FROM sys_stat_activity
WHERE state IS NOT NULL
GROUP BY state
ORDER BY count DESC;
"
# 检查长时间idle in transaction会话
LONG_IDLE_TX=$($KB_BIN/ksql -h $KB_HOST -p $KB_PORT -U $KB_USER -d template1 -t -c "
SELECT count(*)
FROM sys_stat_activity
WHERE state = 'idle in transaction'
AND now() - query_start > interval '5 minutes';")
if [ $LONG_IDLE_TX -gt 0 ]; then
echo "WARNING: 发现 $LONG_IDLE_TX 个长时间idle in transaction会话,可能导致锁问题"
# 发送告警
fi
# 检查活跃会话数量
ACTIVE_SESSIONS=$($KB_BIN/ksql -h $KB_HOST -p $KB_PORT -U $KB_USER -d template1 -t -c "
SELECT count(*)
FROM sys_stat_activity
WHERE state = 'active';")
if [ $ACTIVE_SESSIONS -gt 50 ]; then
echo "INFO: 活跃会话数量较多($ACTIVE_SESSIONS),建议关注数据库负载"
fi生产环境监控建议:
- 配置会话状态监控告警,当idle in transaction会话超过10个或持续时间超过5分钟时触发告警
- 结合应用日志,分析大量idle会话产生的原因,优化连接池配置
- 监控活跃会话数量,与CPU、磁盘I/O等指标关联分析,识别性能瓶颈
- 定期清理长时间空闲的会话,可使用以下SQL:sql
SELECT pg_terminate_backend(pid) FROM sys_stat_activity WHERE state = 'idle in transaction' AND now() - query_start > interval '10 minutes'; - 对于频繁出现的idle in transaction会话,建议在应用代码中添加事务超时机制
性能指标
CPU使用率
指标说明:监控数据库进程的CPU使用率,评估数据库的CPU负载。CPU是数据库性能的关键资源,使用率过高可能导致查询延迟增加、连接超时等问题。
获取方式:
sql
-- 通过sys_os_usage视图查看CPU使用率
SELECT
cpu_usage_percent,
user_cpu_usage_percent,
system_cpu_usage_percent
FROM sys_os_usage;
-- 查看各进程的CPU使用情况
SELECT
pid,
usename,
datname,
query_start,
now() - query_start AS duration,
state,
query
FROM sys_stat_activity
WHERE state = 'active'
ORDER BY now() - query_start DESC;
-- 查看系统级CPU使用率历史(如果有相关视图)
SELECT
sample_time,
cpu_usage_percent,
user_cpu_usage_percent,
system_cpu_usage_percent
FROM sys_os_usage_history
WHERE sample_time > now() - interval '1 hour'
ORDER BY sample_time;生产环境监控脚本:
bash
#!/bin/bash
# KingBaseES CPU使用率监控脚本
# 配置
KB_BIN=/opt/Kingbase/ES/V8R6/Server/bin
KB_HOST=127.0.0.1
KB_PORT=54321
KB_USER=monitor_user
KB_PASS=secure_password
ALERT_THRESHOLD=85
# 获取CPU使用率
CPU_USAGE=$($KB_BIN/ksql -h $KB_HOST -p $KB_PORT -U $KB_USER -d template1 -t -c "
SELECT cpu_usage_percent
FROM sys_os_usage;")
USER_CPU=$($KB_BIN/ksql -h $KB_HOST -p $KB_PORT -U $KB_USER -d template1 -t -c "
SELECT user_cpu_usage_percent
FROM sys_os_usage;")
SYS_CPU=$($KB_BIN/ksql -h $KB_HOST -p $KB_PORT -U $KB_USER -d template1 -t -c "
SELECT system_cpu_usage_percent
FROM sys_os_usage;")
# 输出结果
echo "$(date +'%Y-%m-%d %H:%M:%S') - CPU Usage: $CPU_USAGE% (User: $USER_CPU%, System: $SYS_CPU%)"
# 告警逻辑
if (( $(echo "$CPU_USAGE > $ALERT_THRESHOLD" | bc -l) )); then
echo "WARNING: CPU usage exceeds $ALERT_THRESHOLD%"
# 查看当前活跃的CPU密集型查询
echo "当前活跃的CPU密集型查询:"
$KB_BIN/ksql -h $KB_HOST -p $KB_PORT -U $KB_USER -d template1 -t -c "
SELECT
pid,
usename,
datname,
round((now() - query_start)::numeric, 2) AS duration_seconds,
query
FROM sys_stat_activity
WHERE state = 'active'
ORDER BY now() - query_start DESC
LIMIT 5;
"
# 发送告警邮件
# mail -s "KingBaseES CPU Alert" dba@example.com <<< "CPU usage: $CPU_USAGE%"
fi生产环境监控建议:
- 配置CPU使用率告警,当持续超过85%达5分钟时触发告警
- 区分user CPU和system CPU使用率:
- user CPU过高:表示数据库正在执行大量计算密集型查询
- system CPU过高:表示数据库在处理I/O、锁管理等系统级操作
- 结合SQL执行统计,识别CPU密集型查询并优化
- 对于持续高CPU使用率,考虑:
- 优化SQL查询(添加索引、重写查询等)
- 增加CPU资源
- 垂直拆分或水平拆分数据库
- 考虑读写分离架构
- 使用性能分析工具(如pg_stat_statements、pgbadger)深入分析CPU使用情况
CPU瓶颈排查步骤:
- 确认CPU使用率确实来自KingBaseES进程
- 查看活跃会话,识别CPU密集型查询
- 分析查询执行计划,找出优化点
- 检查索引使用情况,添加必要的索引
- 考虑调整数据库参数(如work_mem、shared_buffers等)
- 如无法优化,考虑硬件升级或架构调整
内存使用率
指标说明:监控数据库的内存使用情况,包括共享内存、work_mem、维护内存等。内存是影响数据库性能的关键资源,使用率过高可能导致系统频繁使用swap,严重影响性能。
获取方式:
sql
-- 通过sys_memory_usage视图查看内存使用率
SELECT
total_memory,
used_memory,
free_memory,
round(used_memory::numeric / total_memory::numeric * 100, 2) AS usage_percent
FROM sys_memory_usage;
-- 查看内存相关配置参数
SELECT
name,
setting,
unit,
short_desc
FROM sys_settings
WHERE name IN ('shared_buffers', 'work_mem', 'maintenance_work_mem', 'effective_cache_size', 'max_connections');
-- 查看共享缓冲区使用情况
SELECT
name,
setting,
unit
FROM sys_settings
WHERE name LIKE '%buffer%';
-- 查看当前会话的内存使用情况
SELECT
pid,
usename,
datname,
current_query,
usesysid,
xact_start,
query_start,
state
FROM sys_stat_activity
ORDER BY query_start DESC;生产环境监控脚本:
bash
#!/bin/bash
# KingBaseES 内存使用率监控脚本
# 配置
KB_BIN=/opt/Kingbase/ES/V8R6/Server/bin
KB_HOST=127.0.0.1
KB_PORT=54321
KB_USER=monitor_user
KB_PASS=secure_password
ALERT_THRESHOLD=90
# 获取内存使用率
MEMORY_INFO=$($KB_BIN/ksql -h $KB_HOST -p $KB_PORT -U $KB_USER -d template1 -t -c "
SELECT total_memory, used_memory, free_memory, round(used_memory::numeric / total_memory::numeric * 100, 2) AS usage_percent
FROM sys_memory_usage;")
# 解析内存信息
TOTAL_MEM=$(echo $MEMORY_INFO | awk '{print $1}')
USED_MEM=$(echo $MEMORY_INFO | awk '{print $2}')
FREE_MEM=$(echo $MEMORY_INFO | awk '{print $3}')
USAGE_PERCENT=$(echo $MEMORY_INFO | awk '{print $4}')
# 输出结果
echo "$(date +'%Y-%m-%d %H:%M:%S') - Memory Usage: $USED_MEM/$TOTAL_MEM ($USAGE_PERCENT%)"
# 告警逻辑
if (( $(echo "$USAGE_PERCENT > $ALERT_THRESHOLD" | bc -l) )); then
echo "WARNING: Memory usage exceeds $ALERT_THRESHOLD%"
# 查看内存相关配置
echo "当前内存配置:"
$KB_BIN/ksql -h $KB_HOST -p $KB_PORT -U $KB_USER -d template1 -t -c "
SELECT
name,
setting,
unit,
short_desc
FROM sys_settings
WHERE name IN ('shared_buffers', 'work_mem', 'maintenance_work_mem', 'effective_cache_size', 'max_connections');
"
# 查看活跃会话
echo "当前活跃会话:"
$KB_BIN/ksql -h $KB_HOST -p $KB_PORT -U $KB_USER -d template1 -t -c "
SELECT
pid,
usename,
datname,
round((now() - query_start)::numeric, 2) AS duration_seconds,
state,
left(query, 100) AS query_sample
FROM sys_stat_activity
WHERE state = 'active'
ORDER BY now() - query_start DESC
LIMIT 5;
"
# 发送告警邮件
# mail -s "KingBaseES Memory Alert" dba@example.com <<< "Memory usage: $USAGE_PERCENT%"
fi生产环境监控建议:
- 配置内存使用率告警,当持续超过90%达10分钟时触发告警
- 合理配置内存参数:
- shared_buffers:通常设置为系统内存的25%-40%
- work_mem:根据查询复杂度和并发数调整,避免设置过大导致OOM
- maintenance_work_mem:通常设置为系统内存的5%-10%
- effective_cache_size:通常设置为系统内存的50%-75%
- 监控swap使用情况,swap使用率过高表示内存不足
- 定期分析内存使用趋势,预测内存增长需求
- 对于内存密集型应用,考虑:
- 增加系统内存
- 优化查询,减少内存消耗
- 调整连接池配置,减少并发连接数
- 考虑使用内存优化技术(如列存储、内存表等)
内存瓶颈排查步骤:
- 确认内存使用率是否确实过高
- 查看内存相关配置参数,检查是否配置合理
- 分析活跃会话,识别内存密集型查询
- 检查swap使用情况,确认是否存在内存压力
- 考虑调整内存配置或增加内存资源
- 优化查询,减少内存消耗
内存泄漏检测:
- 监控内存使用趋势,如持续增长而不释放,可能存在内存泄漏
- 使用内存上下文监控(V8 R7新增特性):sql
SELECT name, allocated_size, used_size FROM sys_memory_contexts ORDER BY allocated_size DESC LIMIT 10; - 定期重启数据库(作为临时解决方案)
- 升级到最新版本,修复已知的内存泄漏问题
磁盘I/O指标
指标说明:监控数据库的磁盘读写情况,包括读写速率、延迟和表空间I/O分布。磁盘I/O是数据库性能的关键瓶颈之一,高I/O延迟可能导致查询响应时间增加、事务处理缓慢等问题。
获取方式:
sql
-- 通过sys_disk_usage视图查看磁盘I/O情况
SELECT
read_bytes_per_second,
write_bytes_per_second,
read_latency_ms,
write_latency_ms
FROM sys_disk_usage;
-- 查看表空间I/O统计
SELECT
spcname,
read_count,
write_count,
read_bytes,
write_bytes
FROM sys_tablespace_io_stats;
-- 查看表级I/O统计
SELECT
relname,
read_count,
write_count,
read_bytes,
write_bytes
FROM sys_stat_user_tables
ORDER BY read_bytes + write_bytes DESC
LIMIT 10;
-- 查看索引I/O统计
SELECT
indexrelname,
idx_scan,
idx_tup_read,
idx_tup_fetch
FROM sys_stat_user_indexes
ORDER BY idx_scan DESC
LIMIT 10;生产环境监控脚本:
bash
#!/bin/bash
# KingBaseES 磁盘I/O监控脚本
# 配置
KB_BIN=/opt/Kingbase/ES/V8R6/Server/bin
KB_HOST=127.0.0.1
KB_PORT=54321
KB_USER=monitor_user
KB_PASS=secure_password
READ_LATENCY_THRESHOLD=50 # 读取延迟阈值(毫秒)
WRITE_LATENCY_THRESHOLD=50 # 写入延迟阈值(毫秒)
# 获取磁盘I/O指标
DISK_IO=$($KB_BIN/ksql -h $KB_HOST -p $KB_PORT -U $KB_USER -d template1 -t -c "
SELECT
read_bytes_per_second,
write_bytes_per_second,
read_latency_ms,
write_latency_ms
FROM sys_disk_usage;")
# 解析I/O指标
READ_BPS=$(echo $DISK_IO | awk '{print $1}')
WRITE_BPS=$(echo $DISK_IO | awk '{print $2}')
READ_LATENCY=$(echo $DISK_IO | awk '{print $3}')
WRITE_LATENCY=$(echo $DISK_IO | awk '{print $4}')
# 输出结果
echo "$(date +'%Y-%m-%d %H:%M:%S') - Disk I/O:"
echo " Read: $READ_BPS B/s, Latency: $READ_LATENCY ms"
echo " Write: $WRITE_BPS B/s, Latency: $WRITE_LATENCY ms"
# 检查读写延迟
ALERT=false
ALERT_MSG=""
if (( $(echo "$READ_LATENCY > $READ_LATENCY_THRESHOLD" | bc -l) )); then
ALERT=true
ALERT_MSG+="Read latency high: $READ_LATENCY ms\n"
fi
if (( $(echo "$WRITE_LATENCY > $WRITE_LATENCY_THRESHOLD" | bc -l) )); then
ALERT=true
ALERT_MSG+="Write latency high: $WRITE_LATENCY ms\n"
fi
if [ "$ALERT" = true ]; then
echo "WARNING: High disk I/O latency detected!"
echo "$ALERT_MSG"
# 查看I/O密集的表空间
echo "Top I/O tablespaces:"
$KB_BIN/ksql -h $KB_HOST -p $KB_PORT -U $KB_USER -d template1 -t -c "
SELECT
spcname,
round(read_bytes / 1024 / 1024, 2) AS read_mb,
round(write_bytes / 1024 / 1024, 2) AS write_mb,
round((read_bytes + write_bytes) / 1024 / 1024, 2) AS total_mb
FROM sys_tablespace_io_stats
ORDER BY total_mb DESC
LIMIT 5;
"
# 查看I/O密集的表
echo "Top I/O tables:"
$KB_BIN/ksql -h $KB_HOST -p $KB_PORT -U $KB_USER -d template1 -t -c "
SELECT
relname,
round(read_bytes / 1024 / 1024, 2) AS read_mb,
round(write_bytes / 1024 / 1024, 2) AS write_mb
FROM sys_stat_user_tables
ORDER BY read_bytes + write_bytes DESC
LIMIT 5;
"
# 发送告警邮件
# mail -s "KingBaseES Disk I/O Alert" dba@example.com <<< "$ALERT_MSG"
fi生产环境监控建议:
- 配置磁盘I/O告警,当读写延迟超过50ms达5分钟时触发告警
- 关注以下指标:
- 读写速率(read_bytes_per_second, write_bytes_per_second)
- 读写延迟(read_latency_ms, write_latency_ms)
- 表空间I/O分布
- 表级和索引级I/O统计
- 针对高I/O负载,考虑以下优化措施:
- 优化SQL查询,减少全表扫描
- 添加或优化索引,减少I/O操作
- 调整数据库参数(如shared_buffers, effective_cache_size)
- 使用更快的存储设备(如SSD、NVMe)
- 考虑表分区,分散I/O负载
- 调整WAL配置,优化写入性能
- 对于写入密集型应用,考虑:
- 调整WAL刷盘策略(synchronous_commit, wal_writer_delay)
- 使用批量插入替代单条插入
- 考虑使用异步提交
磁盘I/O瓶颈排查步骤:
- 确认磁盘I/O延迟确实过高
- 查看表空间I/O分布,识别热点表空间
- 查看表级和索引级I/O统计,识别热点表和索引
- 分析相关SQL查询,优化I/O密集型操作
- 考虑存储设备升级或配置调整
- 监控优化效果,持续调整
常见I/O优化技巧:
- 为频繁查询的列添加索引
- 优化查询,避免不必要的全表扫描
- 定期清理和优化表(VACUUM, ANALYZE)
- 调整shared_buffers大小,增加内存缓存
- 考虑使用表分区,分散I/O负载
- 对于写入密集型应用,考虑使用SSD存储
SQL性能指标
慢查询数量
指标说明:监控慢查询的数量和详细信息,是评估SQL性能的重要指标。慢查询会占用大量数据库资源,影响整体性能,必须及时发现和优化。
获取方式:
sql
-- 查看慢查询配置
SHOW slow_query_log;
SHOW long_query_time;
SHOW log_min_duration_statement;
-- 查看当前正在执行的慢查询
SELECT
pid,
datname,
usename,
client_addr,
query_start,
now() - query_start AS duration,
wait_event_type,
wait_event,
query
FROM sys_stat_activity
WHERE state = 'active'
AND now() - query_start > interval '1 second'
ORDER BY duration DESC;
-- 查看最近执行的慢查询(V8 R7新增功能)
SELECT
log_time,
user_name,
database_name,
query_duration,
query
FROM sys_slow_query_log
ORDER BY log_time DESC
LIMIT 10;
-- 统计单位时间内的慢查询数量
SELECT
date_trunc('minute', log_time) AS minute,
count(*) AS slow_query_count
FROM sys_slow_query_log
WHERE log_time > now() - interval '1 hour'
GROUP BY minute
ORDER BY minute;生产环境慢查询配置:
sql
-- 修改慢查询配置(需要重启数据库)
ALTER SYSTEM SET log_min_duration_statement = '500ms'; -- 记录执行时间超过500ms的查询
ALTER SYSTEM SET slow_query_log = on;
ALTER SYSTEM SET log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h ';
ALTER SYSTEM SET log_statement = 'none'; -- 仅记录慢查询
ALTER SYSTEM SET log_duration = on;
ALTER SYSTEM SET log_lock_waits = on;生产环境监控脚本:
bash
#!/bin/bash
# KingBaseES 慢查询监控脚本
# 配置
KB_BIN=/opt/Kingbase/ES/V8R6/Server/bin
KB_HOST=127.0.0.1
KB_PORT=54321
KB_USER=monitor_user
KB_PASS=secure_password
SLOW_QUERY_THRESHOLD=5 # 5分钟内慢查询数量阈值
# 获取5分钟内的慢查询数量
SLOW_COUNT=$($KB_BIN/ksql -h $KB_HOST -p $KB_PORT -U $KB_USER -d template1 -t -c "
SELECT count(*)
FROM sys_slow_query_log
WHERE log_time > now() - interval '5 minutes';")
# 输出结果
echo "$(date +'%Y-%m-%d %H:%M:%S') - Slow Queries in last 5 minutes: $SLOW_COUNT"
# 检查慢查询数量
if [ $SLOW_COUNT -gt $SLOW_QUERY_THRESHOLD ]; then
echo "WARNING: High number of slow queries detected!"
# 查看最近的慢查询
echo "Recent slow queries:"
$KB_BIN/ksql -h $KB_HOST -p $KB_PORT -U $KB_USER -d template1 -t -c "
SELECT
log_time,
user_name,
round(query_duration::numeric, 2) AS duration_ms,
left(query, 150) AS query_sample
FROM sys_slow_query_log
WHERE log_time > now() - interval '5 minutes'
ORDER BY query_duration DESC
LIMIT 5;
"
# 查看当前正在执行的慢查询
echo "Current long-running queries:"
$KB_BIN/ksql -h $KB_HOST -p $KB_PORT -U $KB_USER -d template1 -t -c "
SELECT
pid,
usename,
datname,
round((now() - query_start)::numeric * 1000, 2) AS duration_ms,
left(query, 150) AS query_sample
FROM sys_stat_activity
WHERE state = 'active'
AND now() - query_start > interval '1 second'
ORDER BY duration_ms DESC;
"
# 发送告警邮件
# mail -s "KingBaseES Slow Query Alert" dba@example.com <<< "Slow queries in last 5 minutes: $SLOW_COUNT"
fi生产环境监控建议:
- 配置合适的慢查询阈值:
- 对于OLTP应用,建议设置为500ms-1s
- 对于OLAP应用,建议设置为5s-10s
- 启用慢查询日志,定期分析慢查询
- 配置慢查询数量告警,当5分钟内超过5个慢查询时触发告警
- 使用pgbadger等工具分析慢查询日志,生成可视化报告
- 结合sys_stat_statements视图,分析历史SQL执行统计
慢查询优化步骤:
- 识别慢查询(通过慢查询日志或sys_stat_activity视图)
- 分析查询执行计划(使用EXPLAIN ANALYZE)
- 优化查询:
- 添加或优化索引
- 重写查询,避免不必要的连接和子查询
- 调整查询条件,使用更高效的过滤方式
- 优化表结构:
- 拆分大表
- 优化数据类型
- 定期清理和维护表
- 调整数据库参数(如work_mem, effective_cache_size)
- 监控优化效果,持续调整
常见慢查询原因:
- 缺少合适的索引
- 全表扫描或全索引扫描
- 复杂的连接查询
- 大量数据排序或分组
- 锁等待或死锁
- 磁盘I/O瓶颈
- 内存不足
SQL执行统计
指标说明:通过sys_stat_statements视图监控SQL执行统计信息,包括执行次数、总时间、平均时间、内存使用等。这是分析历史SQL性能的重要工具,可以帮助识别需要优化的SQL语句。
获取方式:
sql
-- 查看Top 10总耗时SQL
SELECT
queryid,
query,
calls,
round(total_time::numeric, 2) AS total_time_ms,
round(mean_time::numeric, 2) AS mean_time_ms,
rows,
round(rows::numeric / calls::numeric, 2) AS rows_per_call,
shared_blks_hit,
shared_blks_read,
temp_blks_written
FROM sys_stat_statements
ORDER BY total_time DESC
LIMIT 10;
-- 查看Top 10平均耗时SQL
SELECT
queryid,
query,
calls,
round(total_time::numeric, 2) AS total_time_ms,
round(mean_time::numeric, 2) AS mean_time_ms,
rows
FROM sys_stat_statements
ORDER BY mean_time DESC
LIMIT 10;
-- 查看Top 10高频SQL
SELECT
queryid,
query,
calls,
round(total_time::numeric, 2) AS total_time_ms,
round(mean_time::numeric, 2) AS mean_time_ms,
rows
FROM sys_stat_statements
ORDER BY calls DESC
LIMIT 10;
-- 查看Top 10消耗内存多的SQL
SELECT
queryid,
query,
calls,
round(mean_time::numeric, 2) AS mean_time_ms,
temp_blks_written,
round(temp_blks_written::numeric / calls::numeric, 2) AS temp_blks_per_call
FROM sys_stat_statements
WHERE temp_blks_written > 0
ORDER BY temp_blks_written DESC
LIMIT 10;
-- 查看Top 10 I/O密集SQL
SELECT
queryid,
query,
calls,
shared_blks_read,
shared_blks_hit,
round(shared_blks_hit::numeric / (shared_blks_read + shared_blks_hit + 1)::numeric * 100, 2) AS cache_hit_rate
FROM sys_stat_statements
WHERE shared_blks_read > 0
ORDER BY shared_blks_read DESC
LIMIT 10;生产环境监控脚本:
bash
#!/bin/bash
# KingBaseES SQL执行统计监控脚本
# 配置
KB_BIN=/opt/Kingbase/ES/V8R6/Server/bin
KB_HOST=127.0.0.1
KB_PORT=54321
KB_USER=monitor_user
KB_PASS=secure_password
REPORT_FILE=/var/log/kingbase/sql_stat_report_$(date +%Y%m%d_%H%M%S).txt
# 生成SQL执行统计报告
echo "KingBaseES SQL执行统计报告" > $REPORT_FILE
echo "生成时间: $(date)" >> $REPORT_FILE
echo "=====================================" >> $REPORT_FILE
# 1. Top 10总耗时SQL
echo "\n1. Top 10总耗时SQL:" >> $REPORT_FILE
echo "=====================================" >> $REPORT_FILE
$KB_BIN/ksql -h $KB_HOST -p $KB_PORT -U $KB_USER -d template1 -t -c "
SELECT
queryid,
round(total_time::numeric, 2) AS total_time_ms,
calls,
round(mean_time::numeric, 2) AS mean_time_ms,
left(query, 200) AS query_sample
FROM sys_stat_statements
ORDER BY total_time DESC
LIMIT 10;
" >> $REPORT_FILE
# 2. Top 10平均耗时SQL
echo "\n2. Top 10平均耗时SQL:" >> $REPORT_FILE
echo "=====================================" >> $REPORT_FILE
$KB_BIN/ksql -h $KB_HOST -p $KB_PORT -U $KB_USER -d template1 -t -c "
SELECT
queryid,
round(mean_time::numeric, 2) AS mean_time_ms,
calls,
round(total_time::numeric, 2) AS total_time_ms,
left(query, 200) AS query_sample
FROM sys_stat_statements
ORDER BY mean_time DESC
LIMIT 10;
" >> $REPORT_FILE
# 3. Top 10高频SQL
echo "\n3. Top 10高频SQL:" >> $REPORT_FILE
echo "=====================================" >> $REPORT_FILE
$KB_BIN/ksql -h $KB_HOST -p $KB_PORT -U $KB_USER -d template1 -t -c "
SELECT
queryid,
calls,
round(total_time::numeric, 2) AS total_time_ms,
round(mean_time::numeric, 2) AS mean_time_ms,
left(query, 200) AS query_sample
FROM sys_stat_statements
ORDER BY calls DESC
LIMIT 10;
" >> $REPORT_FILE
# 4. Top 10 I/O密集SQL
echo "\n4. Top 10 I/O密集SQL:" >> $REPORT_FILE
echo "=====================================" >> $REPORT_FILE
$KB_BIN/ksql -h $KB_HOST -p $KB_PORT -U $KB_USER -d template1 -t -c "
SELECT
queryid,
shared_blks_read,
shared_blks_hit,
round(shared_blks_hit::numeric / (shared_blks_read + shared_blks_hit + 1)::numeric * 100, 2) AS cache_hit_rate,
left(query, 200) AS query_sample
FROM sys_stat_statements
WHERE shared_blks_read > 0
ORDER BY shared_blks_read DESC
LIMIT 10;
" >> $REPORT_FILE
# 5. Top 10消耗内存多的SQL
echo "\n5. Top 10消耗内存多的SQL:" >> $REPORT_FILE
echo "=====================================" >> $REPORT_FILE
$KB_BIN/ksql -h $KB_HOST -p $KB_PORT -U $KB_USER -d template1 -t -c "
SELECT
queryid,
temp_blks_written,
round(temp_blks_written::numeric / calls::numeric, 2) AS temp_blks_per_call,
left(query, 200) AS query_sample
FROM sys_stat_statements
WHERE temp_blks_written > 0
ORDER BY temp_blks_written DESC
LIMIT 10;
" >> $REPORT_FILE
# 发送报告(可选)
# mail -s "KingBaseES SQL执行统计报告" dba@example.com < $REPORT_FILE
echo "SQL执行统计报告已生成:$REPORT_FILE"生产环境监控建议:
- 定期生成SQL执行统计报告,建议每天或每周一次
- 关注以下类型的SQL:
- 总耗时高的SQL:对整体性能影响较大
- 平均耗时高的SQL:可能存在性能问题
- 调用次数多的SQL:优化后收益显著
- I/O密集型SQL:可能需要添加索引
- 内存消耗多的SQL:可能需要调整work_mem参数
- 结合慢查询日志,深入分析SQL性能问题
- 定期重置sys_stat_statements统计(谨慎操作):sql
SELECT pg_stat_statements_reset();
sys_stat_statements配置建议:
sql
-- 在kingbase.conf中配置
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.max = 10000 -- 最多保存10000条SQL记录
pg_stat_statements.track = all -- 跟踪所有SQL
pg_stat_statements.track_utility = on -- 跟踪工具命令
pg_stat_statements.save = on -- 重启后保存统计信息SQL优化建议:
- 为频繁查询的列添加合适的索引
- 重写复杂查询,简化连接和子查询
- 避免在WHERE子句中使用函数或计算表达式
- 使用合适的数据类型,避免类型转换
- 考虑使用物化视图加速复杂查询
- 对于批量操作,考虑使用批量插入或更新
常见性能问题SQL特征:
- 全表扫描(Seq Scan)
- 嵌套循环连接(Nested Loop Join)处理大量数据
- 大量数据排序(Sort)或分组(Group)
- 高临时文件写入(temp_blks_written)
- 低缓存命中率(shared_blks_hit / (shared_blks_read + shared_blks_hit))
事务与锁指标
事务统计
指标说明:监控数据库的事务提交和回滚情况,评估事务负载和质量。高回滚率可能表示应用存在逻辑问题或锁争用严重。
获取方式:
sql
-- 查看数据库事务统计
SELECT
datname,
xact_commit,
xact_rollback,
round(xact_rollback::numeric / nullif(xact_commit + xact_rollback, 0)::numeric * 100, 2) AS rollback_ratio
FROM sys_stat_database
WHERE datname IS NOT NULL;
-- 查看事务提交和回滚的趋势(V8 R7新增功能)
SELECT
date_trunc('minute', commit_time) AS minute,
count(*) AS transaction_count,
sum(CASE WHEN is_rollback THEN 1 ELSE 0 END) AS rollback_count,
round(sum(CASE WHEN is_rollback THEN 1 ELSE 0 END)::numeric / count(*)::numeric * 100, 2) AS rollback_ratio
FROM sys_transaction_log
WHERE commit_time > now() - interval '1 hour'
GROUP BY minute
ORDER BY minute;
-- 查看当前活跃事务
SELECT
pid,
datname,
usename,
client_addr,
xact_start,
now() - xact_start AS duration,
state,
query
FROM sys_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY duration DESC;生产环境监控脚本:
bash
#!/bin/bash
# KingBaseES 事务统计监控脚本
# 配置
KB_BIN=/opt/Kingbase/ES/V8R6/Server/bin
KB_HOST=127.0.0.1
KB_PORT=54321
KB_USER=monitor_user
KB_PASS=secure_password
ROLLBACK_THRESHOLD=10 # 回滚率阈值(百分比)
# 获取事务统计信息
echo "$(date +'%Y-%m-%d %H:%M:%S') - 事务统计信息:"
# 1. 数据库事务统计
$KB_BIN/ksql -h $KB_HOST -p $KB_PORT -U $KB_USER -d template1 -t -c "
SELECT
datname,
xact_commit,
xact_rollback,
round(xact_rollback::numeric / nullif(xact_commit + xact_rollback, 0)::numeric * 100, 2) AS rollback_ratio
FROM sys_stat_database
WHERE datname IS NOT NULL;
"
# 2. 当前活跃事务
ACTIVE_TX=$($KB_BIN/ksql -h $KB_HOST -p $KB_PORT -U $KB_USER -d template1 -t -c "
SELECT count(*) FROM sys_stat_activity WHERE xact_start IS NOT NULL;")
echo "\n当前活跃事务数:$ACTIVE_TX"
# 3. 长时间活跃事务
LONG_TX=$($KB_BIN/ksql -h $KB_HOST -p $KB_PORT -U $KB_USER -d template1 -t -c "
SELECT count(*) FROM sys_stat_activity WHERE xact_start IS NOT NULL AND now() - xact_start > interval '5 minutes';")
if [ $LONG_TX -gt 0 ]; then
echo "\nWARNING: 发现 $LONG_TX 个长时间活跃事务(超过5分钟):"
$KB_BIN/ksql -h $KB_HOST -p $KB_PORT -U $KB_USER -d template1 -t -c "
SELECT
pid,
usename,
datname,
round((now() - xact_start)::numeric, 2) AS duration_seconds,
state,
left(query, 100) AS query_sample
FROM sys_stat_activity
WHERE xact_start IS NOT NULL AND now() - xact_start > interval '5 minutes'
ORDER BY duration_seconds DESC;
"
fi
# 4. 回滚率检查
ROLLBACK_RATIO=$($KB_BIN/ksql -h $KB_HOST -p $KB_PORT -U $KB_USER -d template1 -t -c "
SELECT
round(sum(xact_rollback)::numeric / nullif(sum(xact_commit + xact_rollback), 0)::numeric * 100, 2)
FROM sys_stat_database;")
if (( $(echo "$ROLLBACK_RATIO > $ROLLBACK_THRESHOLD" | bc -l) )); then
echo "\nWARNING: 回滚率过高:$ROLLBACK_RATIO%(阈值:$ROLLBACK_THRESHOLD%)"
# 发送告警邮件
# mail -s "KingBaseES Transaction Rollback Alert" dba@example.com <<< "Rollback ratio: $ROLLBACK_RATIO%"
fi生产环境监控建议:
- 配置回滚率告警,当超过10%时触发告警
- 监控长时间活跃事务,当超过5分钟时触发告警
- 关注事务提交和回滚的趋势变化
- 结合锁监控,分析高回滚率的原因
事务优化建议:
- 保持事务短小,减少锁持有时间
- 避免在事务中执行不必要的操作
- 使用合适的事务隔离级别
- 实现事务超时机制
- 优化锁竞争,减少死锁发生
常见事务问题:
- 长时间运行的事务导致锁积累
- 事务隔离级别设置过高
- 应用逻辑问题导致大量回滚
- 死锁导致事务失败
锁等待情况
指标说明:监控数据库的锁等待情况,识别锁争用问题。锁等待是数据库性能问题的常见原因,会导致查询延迟增加、事务超时等问题。
获取方式:
sql
-- 查看锁等待情况
SELECT
wait_event_type,
wait_event,
count(*) AS count
FROM sys_stat_activity
WHERE wait_event IS NOT NULL
GROUP BY wait_event_type, wait_event
ORDER BY count DESC;
-- 查看详细的锁信息
SELECT
locktype,
database,
relation::regclass AS table_name,
mode,
granted,
pid,
usename,
query
FROM sys_locks
WHERE NOT granted;
-- 查看锁等待关系
SELECT
blocked_locks.pid AS blocked_pid,
blocked_activity.usename AS blocked_user,
blocking_locks.pid AS blocking_pid,
blocking_activity.usename AS blocking_user,
blocked_activity.query AS blocked_query,
blocking_activity.query AS blocking_query,
blocked_locks.mode AS blocked_mode,
blocking_locks.mode AS blocking_mode,
blocked_locks.relation::regclass AS table_name
FROM sys_locks blocked_locks
JOIN sys_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN sys_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
AND blocking_locks.pid != blocked_locks.pid
JOIN sys_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
-- 查看持有锁时间长的事务
SELECT
pid,
usename,
datname,
xact_start,
now() - xact_start AS transaction_duration,
query
FROM sys_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY now() - xact_start DESC
LIMIT 10;
-- 查看死锁信息(V8 R7新增功能)
SELECT
deadlock_id,
deadlock_time,
deadlock_graph
FROM sys_deadlock_history
ORDER BY deadlock_time DESC
LIMIT 5;生产环境监控脚本:
bash
#!/bin/bash
# KingBaseES 锁等待监控脚本
# 配置
KB_BIN=/opt/Kingbase/ES/V8R6/Server/bin
KB_HOST=127.0.0.1
KB_PORT=54321
KB_USER=monitor_user
KB_PASS=secure_password
LOCK_WAIT_THRESHOLD=5 # 锁等待数量阈值
# 获取当前时间
echo "$(date +'%Y-%m-%d %H:%M:%S') - 锁等待监控报告"
echo "====================================="
# 1. 查看锁等待情况
LOCK_WAIT_COUNT=$($KB_BIN/ksql -h $KB_HOST -p $KB_PORT -U $KB_USER -d template1 -t -c "
SELECT count(*) FROM sys_stat_activity WHERE wait_event IS NOT NULL;")
echo "当前锁等待数量:$LOCK_WAIT_COUNT"
# 2. 如果锁等待数量超过阈值,查看详细信息
if [ $LOCK_WAIT_COUNT -gt $LOCK_WAIT_THRESHOLD ]; then
echo "\nWARNING: 锁等待数量超过阈值($LOCK_WAIT_THRESHOLD)!"
# 查看锁等待事件分布
echo "\n锁等待事件分布:"
$KB_BIN/ksql -h $KB_HOST -p $KB_PORT -U $KB_USER -d template1 -t -c "
SELECT
wait_event_type,
wait_event,
count(*) AS count
FROM sys_stat_activity
WHERE wait_event IS NOT NULL
GROUP BY wait_event_type, wait_event
ORDER BY count DESC;
"
# 查看锁等待关系
echo "\n锁等待关系:"
$KB_BIN/ksql -h $KB_HOST -p $KB_PORT -U $KB_USER -d template1 -t -c "
SELECT
blocked_locks.pid AS blocked_pid,
blocked_activity.usename AS blocked_user,
blocking_locks.pid AS blocking_pid,
blocking_activity.usename AS blocking_user,
blocked_locks.mode AS blocked_mode,
blocking_locks.mode AS blocking_mode,
blocked_locks.relation::regclass AS table_name,
left(blocked_activity.query, 100) AS blocked_query,
left(blocking_activity.query, 100) AS blocking_query
FROM sys_locks blocked_locks
JOIN sys_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN sys_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
AND blocking_locks.pid != blocked_locks.pid
JOIN sys_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
"
# 查看持有锁时间长的事务
echo "\n持有锁时间长的事务:"
$KB_BIN/ksql -h $KB_HOST -p $KB_PORT -U $KB_USER -d template1 -t -c "
SELECT
pid,
usename,
datname,
round((now() - xact_start)::numeric, 2) AS transaction_duration_seconds,
left(query, 100) AS query_sample
FROM sys_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY now() - xact_start DESC
LIMIT 5;
"
# 发送告警邮件
# mail -s "KingBaseES Lock Wait Alert" dba@example.com <<< "Lock wait count: $LOCK_WAIT_COUNT"
fi
# 3. 检查死锁历史(V8 R7)
DEADLOCK_COUNT=$($KB_BIN/ksql -h $KB_HOST -p $KB_PORT -U $KB_USER -d template1 -t -c "
SELECT count(*) FROM sys_deadlock_history WHERE deadlock_time > now() - interval '1 hour';")
if [ $DEADLOCK_COUNT -gt 0 ]; then
echo "\nWARNING: 过去1小时内发生了 $DEADLOCK_COUNT 次死锁!"
$KB_BIN/ksql -h $KB_HOST -p $KB_PORT -U $KB_USER -d template1 -t -c "
SELECT
deadlock_time,
deadlock_graph
FROM sys_deadlock_history
WHERE deadlock_time > now() - interval '1 hour'
ORDER BY deadlock_time DESC;
"
fi生产环境监控建议:
- 配置锁等待数量告警,当超过5个锁等待时触发告警
- 监控持有锁时间长的事务,当超过5分钟时触发告警
- 定期检查死锁历史,分析死锁原因
- 使用pg_stat_statements结合锁监控,识别导致锁争用的SQL
锁问题解决方法:
- 识别锁等待关系:使用sys_locks和sys_stat_activity视图查看锁等待关系
- 终止阻塞事务:对于长时间阻塞的事务,可以考虑终止:sql
SELECT pg_terminate_backend(blocking_pid); - 优化持有锁时间长的SQL:
- 保持事务短小
- 避免在事务中执行不必要的操作
- 优化查询,减少锁持有时间
- 调整事务隔离级别:
- 考虑使用较低的事务隔离级别(如READ COMMITTED)
- 对于只读查询,考虑使用REPEATABLE READ
- 优化并发控制:
- 使用行级锁代替表级锁
- 考虑使用乐观锁
- 实现合理的重试机制
常见锁类型和等待事件:
- 行锁:最常见的锁类型,保护表中的行
- 表锁:保护整个表,会导致严重的并发问题
- 共享锁(S):用于读取操作,多个事务可以同时持有
- 排他锁(X):用于写入操作,同一时间只能有一个事务持有
- 等待事件:
relation: relation:等待表级锁tuple: tuple:等待行级锁transactionid: transactionid:等待事务ID锁virtualxid: virtualxid:等待虚拟事务ID锁
锁优化最佳实践:
- 为频繁更新的列添加合适的索引,减少锁范围
- 避免长时间运行的事务
- 实现合理的并发控制策略
- 监控和分析锁等待情况,及时优化
- 考虑使用分区表,分散锁竞争
存储指标
表空间使用率
指标说明:监控表空间的使用情况,评估存储容量。表空间使用率过高可能导致数据库无法写入新数据,必须及时监控和处理。
获取方式:
sql
-- 查看表空间使用率
SELECT
spcname,
size_mb,
used_mb,
free_mb,
round(used_mb::numeric / size_mb::numeric * 100, 2) AS usage_percent
FROM sys_tablespace_usage;
-- 查看表空间详细信息
SELECT
spcname,
pg_size_pretty(pg_tablespace_size(spcname)) AS size,
pg_size_pretty(pg_tablespace_size(spcname) - pg_tablespace_size('pg_default')) AS used,
round((pg_tablespace_size(spcname) - pg_tablespace_size('pg_default'))::numeric / pg_tablespace_size(spcname)::numeric * 100, 2) AS usage_percent
FROM sys_tablespace
WHERE spcname NOT LIKE 'pg_%';
-- 查看表空间增长趋势(V8 R7新增功能)
SELECT
date_trunc('day', sample_time) AS day,
spcname,
avg(usage_percent) AS avg_usage_percent,
max(usage_percent) AS max_usage_percent
FROM sys_tablespace_usage_history
WHERE sample_time > now() - interval '30 days'
GROUP BY day, spcname
ORDER BY day, spcname;生产环境监控脚本:
bash
#!/bin/bash
# KingBaseES 表空间使用率监控脚本
# 配置
KB_BIN=/opt/Kingbase/ES/V8R6/Server/bin
KB_HOST=127.0.0.1
KB_PORT=54321
KB_USER=monitor_user
KB_PASS=secure_password
ALERT_THRESHOLD=80 # 表空间使用率阈值(百分比)
# 获取表空间使用率
echo "$(date +'%Y-%m-%d %H:%M:%S') - 表空间使用率监控报告"
echo "====================================="
TABLESPACES=$($KB_BIN/ksql -h $KB_HOST -p $KB_PORT -U $KB_USER -d template1 -t -c "
SELECT
spcname,
size_mb,
used_mb,
free_mb,
round(used_mb::numeric / size_mb::numeric * 100, 2) AS usage_percent
FROM sys_tablespace_usage;
")
echo "$TABLESPACES"
# 检查高使用率表空间
HIGH_USAGE_TABLESPACES=$($KB_BIN/ksql -h $KB_HOST -p $KB_PORT -U $KB_USER -d template1 -t -c "
SELECT
spcname,
round(used_mb::numeric / size_mb::numeric * 100, 2) AS usage_percent
FROM sys_tablespace_usage
WHERE round(used_mb::numeric / size_mb::numeric * 100, 2) > $ALERT_THRESHOLD;
")
if [ -n "$HIGH_USAGE_TABLESPACES" ]; then
echo "\nWARNING: 以下表空间使用率超过 $ALERT_THRESHOLD%:"
echo "$HIGH_USAGE_TABLESPACES"
# 查看高使用率表空间中的大表
echo "\n高使用率表空间中的大表:"
$KB_BIN/ksql -h $KB_HOST -p $KB_PORT -U $KB_USER -d template1 -t -c "
SELECT
spcname,
relname,
round(size_mb::numeric, 2) AS size_mb
FROM sys_tables_size
WHERE spcname IN (
SELECT spcname
FROM sys_tablespace_usage
WHERE round(used_mb::numeric / size_mb::numeric * 100, 2) > $ALERT_THRESHOLD
)
ORDER BY size_mb DESC
LIMIT 10;
"
# 发送告警邮件
# mail -s "KingBaseES Tablespace Alert" dba@example.com <<< "High usage tablespaces: $HIGH_USAGE_TABLESPACES"
fi生产环境监控建议:
- 配置表空间使用率告警,当超过80%时触发告警
- 监控表空间增长趋势,提前规划存储容量
- 定期清理不再使用的数据,释放存储空间
- 对于增长迅速的表空间,考虑:
- 增加存储容量
- 表分区
- 数据归档
表空间扩容方法:
- 添加新的数据文件:sql
ALTER TABLESPACE tbs_name ADD DATAFILE '/path/to/new_file.dbf' SIZE 10G; - 扩展现有数据文件:sql
ALTER TABLESPACE tbs_name RESIZE DATAFILE '/path/to/existing_file.dbf' SIZE 20G; - 自动扩展数据文件:sql
ALTER TABLESPACE tbs_name AUTOEXTEND ON NEXT 1G MAXSIZE 100G;
常见表空间问题:
- 系统表空间满:可能导致数据库无法启动,需要紧急处理
- 用户表空间满:影响用户数据写入
- 临时表空间满:影响排序、分组等操作
- UNDO表空间满:影响事务处理
表大小统计
指标说明:监控表的大小和增长情况,识别大表。
获取方式:
sql
-- 查看Top 10大表
SELECT
relname,
relkind,
size_mb,
table_size_mb,
index_size_mb,
toast_size_mb
FROM sys_tables_size
WHERE relkind = 'r'
ORDER BY size_mb DESC
LIMIT 10;监控建议:
- 关注大表的增长趋势
- 考虑对大表进行分区或归档
- 优化大表的查询和索引
复制指标
主备复制状态
指标说明:监控主备复制的状态和延迟情况,评估高可用性。
获取方式:
sql
-- 主节点查看复制状态
SELECT
application_name,
client_addr,
state,
sent_lsn,
write_lsn,
flush_lsn,
replay_lsn,
write_lag,
flush_lag,
replay_lag
FROM sys_stat_replication;
-- 备节点查看复制状态
SELECT
pid,
status,
receive_start_lsn,
receive_lsn,
replay_lsn,
last_msg_send_time,
last_msg_receipt_time,
latest_end_lsn,
latest_end_time
FROM sys_stat_wal_receiver;
-- 备节点查看复制延迟
SELECT
now() - pg_last_xact_replay_timestamp() AS replication_delay
FROM dual;监控建议:
- 复制延迟过大可能影响数据一致性和故障切换
- 监控复制状态,确保备库正常接收和应用WAL日志
- 配置合适的复制参数,如wal_sender_timeout、wal_receiver_timeout等
WAL日志统计
指标说明:监控WAL日志的生成和归档情况,评估WAL负载。
获取方式:
sql
-- 查看WAL生成速率
SELECT
wal_rate_mb_per_minute,
archive_rate_mb_per_minute,
wal_pending_archive_mb
FROM sys_wal_usage;
-- 查看WAL归档状态
SELECT
wal_file_name,
archive_status
FROM sys_archive_status;监控建议:
- WAL生成速率过高可能表示写入负载过大
- 大量pending归档文件可能表示归档过程存在问题
- 配置合适的wal_keep_segments或使用复制槽,防止WAL文件被过早删除
系统资源指标
内存使用明细
指标说明:监控数据库的内存使用明细,包括shared_buffers、work_mem等。
获取方式:
sql
-- 查看内存使用明细
SELECT
name,
setting,
unit,
short_desc
FROM sys_settings
WHERE name IN (
'shared_buffers',
'work_mem',
'maintenance_work_mem',
'effective_cache_size',
'max_connections'
);
-- 查看内存上下文使用情况
SELECT
name,
allocated_size,
used_size
FROM sys_memory_contexts
ORDER BY allocated_size DESC
LIMIT 10;监控建议:
- 根据服务器内存大小配置合适的shared_buffers(通常为系统内存的25%-40%)
- 合理配置work_mem,避免内存溢出
- 监控内存上下文,识别内存泄漏
系统负载
指标说明:监控系统的整体负载情况。
获取方式:
sql
-- 查看系统负载
SELECT
load_average_1min,
load_average_5min,
load_average_15min
FROM sys_system_load;监控建议:
- 系统负载持续超过CPU核心数时,应考虑增加资源或优化负载
- 关注1分钟、5分钟、15分钟负载的变化趋势
动态性能视图汇总
常用动态性能视图
| 视图名称 | 描述 |
|---|---|
| sys_stat_activity | 会话和连接信息 |
| sys_stat_database | 数据库级统计信息 |
| sys_stat_user_tables | 用户表统计信息 |
| sys_stat_user_indexes | 用户索引统计信息 |
| sys_stat_statements | SQL执行统计信息 |
| sys_locks | 锁信息 |
| sys_lock_waits | 锁等待信息 |
| sys_os_usage | 操作系统资源使用情况 |
| sys_memory_usage | 内存使用情况 |
| sys_disk_usage | 磁盘I/O使用情况 |
| sys_tablespace_usage | 表空间使用情况 |
| sys_tables_size | 表大小统计 |
| sys_stat_replication | 主备复制状态(主节点) |
| sys_stat_wal_receiver | WAL接收状态(备节点) |
| sys_wal_usage | WAL使用情况 |
| sys_archive_status | WAL归档状态 |
视图查询示例
sql
-- 综合查询数据库状态
SELECT
d.datname,
d.xact_commit + d.xact_rollback AS total_transactions,
d.blks_read + d.blks_hit AS total_blocks,
round(d.blks_hit::numeric / (d.blks_read + d.blks_hit)::numeric * 100, 2) AS cache_hit_ratio,
a.current_connections,
s.setting AS max_connections,
round(a.current_connections::numeric / s.setting::numeric * 100, 2) AS connection_usage_percent
FROM sys_stat_database d,
(SELECT count(*) AS current_connections FROM sys_stat_activity) a,
(SELECT setting FROM sys_settings WHERE name = 'max_connections') s
WHERE d.datname = 'test';监控最佳实践
监控频率
| 指标类型 | 建议监控频率 |
|---|---|
| 连接与会话 | 1-5分钟 |
| 性能指标 | 1-5分钟 |
| SQL性能 | 5-15分钟 |
| 事务与锁 | 5-15分钟 |
| 存储指标 | 30分钟-1小时 |
| 复制指标 | 1-5分钟 |
| 系统资源 | 5-15分钟 |
告警阈值设置
| 指标 | 建议告警阈值 |
|---|---|
| 连接使用率 | > 80% |
| CPU使用率 | > 85% 持续5分钟 |
| 内存使用率 | > 90% 持续10分钟 |
| 表空间使用率 | > 80% |
| 复制延迟 | > 30秒 |
| 锁等待时间 | > 10秒 |
| 慢查询数量 | > 10个 持续5分钟 |
监控工具推荐
- KingBaseES Manager(KEM):KingBase官方提供的监控和管理工具
- Prometheus + Grafana:开源监控解决方案,通过KingBaseES Exporter采集指标
- Zabbix:开源监控系统,支持KingBaseES监控
- 自定义脚本:使用Shell、Python等脚本定期采集和分析指标
V8 R6与V8 R7版本差异
KingBaseES V8 R7 在监控指标方面进行了显著增强,新增了更多细粒度的监控视图和指标,提供了更全面的性能监控能力。以下是主要差异:
1. 动态性能视图增强
| 视图类型 | V8 R6 | V8 R7 |
|---|---|---|
| 连接与会话 | 基础的sys_stat_activity视图 | 增强的sys_stat_activity,新增wait_event_type和wait_event字段 |
| 内存监控 | 基础的sys_memory_usage视图 | 新增sys_memory_contexts视图,提供细粒度内存上下文监控 |
| 复制监控 | 基础的sys_stat_replication和sys_stat_wal_receiver | 新增sys_replication_stats视图,提供复制性能统计 |
| 存储监控 | 基础的sys_tablespace_usage和sys_tables_size | 新增sys_tablespace_usage_history,提供表空间使用率历史趋势 |
| SQL性能 | 基础的sys_stat_statements | 增强的sys_stat_statements,新增更多统计字段 |
| 锁监控 | 基础的sys_locks和sys_lock_waits | 新增sys_deadlock_history视图,提供死锁历史记录 |
| 事务监控 | 基础的sys_stat_database | 新增sys_transaction_log视图,提供事务提交和回滚的详细记录 |
| WAL监控 | 基础的sys_wal_usage | 新增sys_wal_stats视图,提供更详细的WAL生成和归档统计 |
2. 新增监控指标
内存监控
- 内存上下文监控:V8 R7 新增了sys_memory_contexts视图,提供细粒度的内存使用明细,包括:
- 内存上下文名称
- 分配的内存大小
- 使用的内存大小
- 内存上下文层级
复制监控
- 复制性能统计:V8 R7 新增了sys_replication_stats视图,提供:
- 复制延迟趋势
- WAL 传输速率
- 复制槽状态
- 备库应用WAL的速率
存储监控
- 表空间使用率历史:V8 R7 新增了sys_tablespace_usage_history视图,提供表空间使用率的历史趋势数据,可用于预测存储增长。
SQL性能监控
- 增强的SQL统计:V8 R7 增强了sys_stat_statements视图,新增了:
- 临时文件写入统计
- 共享缓冲区命中和读取统计
- 索引使用统计
- 更详细的执行时间分布
锁监控
- 死锁历史记录:V8 R7 新增了sys_deadlock_history视图,记录死锁发生的时间、参与的事务和锁信息,便于分析死锁原因。
3. 监控功能增强
慢查询捕获
- V8 R6:基本的慢查询日志功能
- V8 R7:增强的慢查询捕获,支持:
- 更详细的慢查询日志格式
- 慢查询的执行计划记录
- 慢查询的锁等待信息
- 慢查询的资源使用统计
性能采样
- V8 R6:基本的性能统计
- V8 R7:新增性能采样功能,可定期采样数据库性能指标,用于趋势分析。
告警功能
- V8 R6:无内置告警功能
- V8 R7:新增内置告警功能,可配置告警规则和通知方式。
4. 版本迁移注意事项
从 V8 R6 迁移到 V8 R7
- 原有监控脚本大部分可以直接使用,但需要注意:
- 部分视图名称可能发生变化
- 新增视图可以用于更细粒度的监控
- 监控工具需要更新到支持 V8 R7 的版本
监控配置调整
- 建议利用 V8 R7 新增的监控视图,增强监控粒度
- 考虑使用新增的告警功能,替代第三方告警工具
- 利用性能采样功能,实现更全面的性能趋势分析
5. 版本特定监控建议
V8 R6
- 重点监控基础指标:连接数、CPU、内存、磁盘I/O
- 使用sys_stat_statements分析SQL性能
- 定期分析慢查询日志
V8 R7
- 利用新增的内存上下文监控,识别内存泄漏
- 使用sys_deadlock_history分析死锁问题
- 利用sys_tablespace_usage_history预测存储增长
- 使用sys_transaction_log分析事务性能
- 配置内置告警功能,实现实时监控告警
常见问题(FAQ)
Q1: 如何查看数据库的当前连接数?
A: 使用sys_stat_activity视图查看当前连接数:
sql
-- 查看当前总连接数
SELECT count(*) AS current_connections FROM sys_stat_activity;
-- 查看各状态连接分布
SELECT
state,
count(*) AS count
FROM sys_stat_activity
WHERE state IS NOT NULL
GROUP BY state
ORDER BY count DESC;
-- 查看各数据库连接分布
SELECT
datname,
count(*) AS connection_count
FROM sys_stat_activity
WHERE datname IS NOT NULL
GROUP BY datname
ORDER BY connection_count DESC;Q2: 如何识别和优化慢查询?
A: 可以通过以下方式识别和优化慢查询:
识别慢查询:
- 配置慢查询日志:sql
ALTER SYSTEM SET log_min_duration_statement = '500ms'; ALTER SYSTEM SET slow_query_log = on; - 实时查看长时间运行的SQL:sql
SELECT pid, usename, datname, round((now() - query_start)::numeric * 1000, 2) AS duration_ms, query FROM sys_stat_activity WHERE state = 'active' AND now() - query_start > interval '1 second' ORDER BY duration_ms DESC; - 分析历史SQL执行统计:sql
SELECT queryid, round(total_time::numeric, 2) AS total_time_ms, calls, round(mean_time::numeric, 2) AS mean_time_ms, left(query, 100) AS query_sample FROM sys_stat_statements ORDER BY total_time DESC LIMIT 10;
优化慢查询:
- 分析执行计划:sql
EXPLAIN ANALYZE SELECT * FROM large_table WHERE condition_column = 'value'; - 添加合适的索引:sql
CREATE INDEX idx_large_table_condition ON large_table(condition_column); - 重写复杂查询,简化连接和子查询
Q3: 如何查看表空间使用率和大表?
A: 查看表空间使用率:
sql
SELECT
spcname,
size_mb,
used_mb,
free_mb,
round(used_mb::numeric / size_mb::numeric * 100, 2) AS usage_percent
FROM sys_tablespace_usage;查看大表:
sql
SELECT
relname,
relkind,
round(size_mb::numeric, 2) AS size_mb,
table_size_mb,
index_size_mb
FROM sys_tables_size
WHERE relkind = 'r'
ORDER BY size_mb DESC
LIMIT 10;Q4: 如何监控和解决锁等待问题?
A: 监控锁等待:
sql
-- 查看锁等待情况
SELECT
wait_event_type,
wait_event,
count(*) AS count
FROM sys_stat_activity
WHERE wait_event IS NOT NULL
GROUP BY wait_event_type, wait_event
ORDER BY count DESC;
-- 查看锁等待关系
SELECT
blocked_locks.pid AS blocked_pid,
blocking_locks.pid AS blocking_pid,
blocked_activity.usename AS blocked_user,
blocking_activity.usename AS blocking_user,
blocked_locks.mode AS blocked_mode,
blocking_locks.mode AS blocking_mode,
blocked_locks.relation::regclass AS table_name,
left(blocked_activity.query, 100) AS blocked_query,
left(blocking_activity.query, 100) AS blocking_query
FROM sys_locks blocked_locks
JOIN sys_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN sys_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
AND blocking_locks.pid != blocked_locks.pid
JOIN sys_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;解决锁等待:
- 终止阻塞事务:sql
SELECT pg_terminate_backend(blocking_pid); - 优化持有锁时间长的SQL
- 调整事务隔离级别
- 实现合理的并发控制策略
Q5: 如何监控主备复制延迟?
A: 在主库上查看复制状态:
sql
SELECT
application_name,
client_addr,
state,
sent_lsn,
write_lsn,
flush_lsn,
replay_lsn,
write_lag,
flush_lag,
replay_lag
FROM sys_stat_replication;在备库上查看复制延迟:
sql
-- 查看复制延迟
SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay;
-- 查看WAL接收状态
SELECT
pid,
status,
receive_start_lsn,
receive_lsn,
replay_lsn,
last_msg_send_time,
last_msg_receipt_time,
latest_end_lsn,
latest_end_time
FROM sys_stat_wal_receiver;Q6: 如何查看和优化内存使用情况?
A: 查看内存使用情况:
sql
-- 查看系统内存使用
SELECT
total_memory,
used_memory,
free_memory,
round(used_memory::numeric / total_memory::numeric * 100, 2) AS usage_percent
FROM sys_memory_usage;
-- 查看内存相关配置
SELECT
name,
setting,
unit,
short_desc
FROM sys_settings
WHERE name IN ('shared_buffers', 'work_mem', 'maintenance_work_mem', 'effective_cache_size');优化内存使用:
- 调整shared_buffers大小(通常为系统内存的25%-40%)
- 根据查询复杂度调整work_mem
- 调整maintenance_work_mem(通常为系统内存的5%-10%)
- 优化SQL查询,减少内存消耗
Q7: 如何监控和优化磁盘I/O?
A: 监控磁盘I/O:
sql
-- 查看磁盘I/O情况
SELECT
read_bytes_per_second,
write_bytes_per_second,
read_latency_ms,
write_latency_ms
FROM sys_disk_usage;
-- 查看表空间I/O统计
SELECT
spcname,
read_count,
write_count,
read_bytes,
write_bytes
FROM sys_tablespace_io_stats
ORDER BY read_bytes + write_bytes DESC
LIMIT 5;优化磁盘I/O:
- 为频繁查询的列添加索引,减少I/O操作
- 优化查询,避免全表扫描
- 调整shared_buffers大小,增加内存缓存
- 使用更快的存储设备(如SSD、NVMe)
- 考虑表分区,分散I/O负载
Q8: 如何使用sys_stat_statements分析SQL性能?
A: 启用和配置sys_stat_statements:
sql
-- 在kingbase.conf中配置
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.max = 10000
pg_stat_statements.track = all
pg_stat_statements.save = on分析SQL性能:
sql
-- 查看总耗时最高的SQL
SELECT
queryid,
round(total_time::numeric, 2) AS total_time_ms,
calls,
round(mean_time::numeric, 2) AS mean_time_ms,
rows,
left(query, 200) AS query_sample
FROM sys_stat_statements
ORDER BY total_time DESC
LIMIT 10;
-- 查看平均耗时最高的SQL
SELECT
queryid,
round(mean_time::numeric, 2) AS mean_time_ms,
calls,
left(query, 200) AS query_sample
FROM sys_stat_statements
ORDER BY mean_time DESC
LIMIT 10;
-- 查看调用次数最多的SQL
SELECT
queryid,
calls,
round(total_time::numeric, 2) AS total_time_ms,
left(query, 200) AS query_sample
FROM sys_stat_statements
ORDER BY calls DESC
LIMIT 10;Q9: 如何查看和解决高CPU使用率问题?
A: 查看CPU使用率:
sql
SELECT
cpu_usage_percent,
user_cpu_usage_percent,
system_cpu_usage_percent
FROM sys_os_usage;解决高CPU使用率:
- 识别CPU密集型查询:sql
SELECT pid, usename, datname, round((now() - query_start)::numeric * 1000, 2) AS duration_ms, query FROM sys_stat_activity WHERE state = 'active' ORDER BY duration_ms DESC LIMIT 5; - 优化CPU密集型查询
- 增加CPU资源
- 考虑使用读写分离架构
Q10: 如何配置监控告警?
A: 配置监控告警可以通过以下方式:
使用Prometheus + Grafana:
- 配置Prometheus告警规则
- 使用Alertmanager发送告警
- 在Grafana中配置告警面板
使用Zabbix:
- 创建Zabbix监控项
- 配置触发器和告警动作
- 配置通知方式
使用自定义脚本:
- 编写监控脚本,定期检查指标
- 配置告警条件
- 集成邮件、短信等通知方式
V8 R7内置告警功能:
sql-- 配置告警规则 CREATE ALARM RULE cpu_high ON sys_os_usage WHERE cpu_usage_percent > 85 FOR 5 MINUTES NOTIFY BY email TO 'dba@example.com';
Q11: 如何查看死锁历史记录?
A: V8 R7 新增了sys_deadlock_history视图,可以查看死锁历史记录:
sql
SELECT
deadlock_id,
deadlock_time,
deadlock_graph
FROM sys_deadlock_history
ORDER BY deadlock_time DESC
LIMIT 5;Q12: 如何监控表空间增长趋势?
A: V8 R7 新增了sys_tablespace_usage_history视图,可以查看表空间使用率的历史趋势:
sql
SELECT
date_trunc('day', sample_time) AS day,
spcname,
avg(usage_percent) AS avg_usage_percent,
max(usage_percent) AS max_usage_percent
FROM sys_tablespace_usage_history
WHERE sample_time > now() - interval '30 days'
GROUP BY day, spcname
ORDER BY day, spcname;Q13: 如何重置sys_stat_statements统计信息?
A: 可以使用pg_stat_statements_reset()函数重置统计信息:
sql
-- 重置所有统计信息
SELECT pg_stat_statements_reset();
-- 重置特定查询的统计信息
SELECT pg_stat_statements_reset(queryid)
FROM sys_stat_statements
WHERE query LIKE '%specific_query%';Q14: 如何查看事务提交和回滚情况?
A: 查看数据库事务统计:
sql
SELECT
datname,
xact_commit,
xact_rollback,
round(xact_rollback::numeric / nullif(xact_commit + xact_rollback, 0)::numeric * 100, 2) AS rollback_ratio
FROM sys_stat_database
WHERE datname IS NOT NULL;V8 R7 还可以查看事务提交和回滚的详细记录:
sql
SELECT
date_trunc('minute', commit_time) AS minute,
count(*) AS transaction_count,
sum(CASE WHEN is_rollback THEN 1 ELSE 0 END) AS rollback_count,
round(sum(CASE WHEN is_rollback THEN 1 ELSE 0 END)::numeric / count(*)::numeric * 100, 2) AS rollback_ratio
FROM sys_transaction_log
WHERE commit_time > now() - interval '1 hour'
GROUP BY minute
ORDER BY minute;Q15: 如何选择合适的监控工具?
A: 选择监控工具应考虑以下因素:
KingBaseES Manager(KEM):
- 官方提供的监控和管理工具
- 适合KingBaseES数据库的专业监控
- 提供图形化界面和告警功能
Prometheus + Grafana:
- 开源监控解决方案
- 灵活的告警配置
- 强大的数据可视化能力
- 适合大规模集群监控
Zabbix:
- 企业级开源监控系统
- 支持多种监控对象
- 提供完善的告警机制
- 适合综合监控场景
自定义脚本:
- 灵活定制监控逻辑
- 适合特定监控需求
- 轻量级,资源消耗低
云监控服务:
- 如阿里云云监控、腾讯云云监控
- 适合云环境中的KingBaseES监控
- 提供托管式监控服务
## 总结
KingBaseES提供了丰富的监控指标,通过动态性能视图和系统视图可以全面监控数据库的性能、状态和健康状况。DBA应根据实际需求选择合适的监控指标,设置合理的告警阈值,定期分析监控数据,及时发现和解决问题,确保数据库的稳定和高效运行。
在实际监控中,建议结合多种监控工具和方法,建立全面的监控体系,包括实时监控、历史趋势分析和告警机制,以提高数据库的可用性和性能。