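High Availability for PostgreSQL with Patroni

Patroni is a high-availability solution for PostgreSQL that provides automatic failure detection, automatic failover, configuration management, and cluster monitoring. It relies on a distributed consensus store such as etcd, Consul, or ZooKeeper to keep the PostgreSQL cluster highly available and its data consistent. Written from a working DBA's perspective, this article covers Patroni's architecture, installation and configuration, failover mechanism, and best practices.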

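Patroni Architecture and How It Works

Core Components

Component	Description	Version support
Patroni daemon	Runs on every PostgreSQL node; monitors state and manages the local instance	All versions
Distributed consensus store (DCS)	Stores cluster configuration and state; etcd, Consul, and ZooKeeper are supported	All versions
PostgreSQL instances	The primary and the replicas in the cluster	10+
Load balancer (optional)	For example HAProxy or pgpool-II; manages client connections	All versions
Monitoring system (optional)	Prometheus + Grafana; monitors cluster state	1.6+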

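How It Works

Cluster initialization: the first Patroni node to start initializes the cluster, creates the primary, and registers it in the DCS
Node join: nodes started later read the cluster information from the DCS and join as replicas
Health monitoring: each Patroni daemon periodically checks its local PostgreSQL instance and writes the state to the DCS
Failure detection: the primary holds a leader key with a TTL in the DCS; when it stops renewing the key, the key expires and the other nodes treat the primary as failed
Automatic failover: when the primary fails, a replica is elected and promoted, and the DCS is updated
Cluster reconfiguration: the remaining replicas reconfigure themselves to follow the new primary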
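
The leader key with a TTL is the heart of the steps above. The following is a minimal sketch of that idea, not Patroni's implementation: `ToyDCS` is a hypothetical in-memory stand-in for etcd, and time is passed in explicitly so the behaviour is easy to follow.

```python
from typing import Optional


class ToyDCS:
    """Hypothetical in-memory stand-in for a DCS: one leader key with a TTL."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self.leader: Optional[str] = None
        self.expires_at = 0.0

    def acquire_or_renew(self, node: str, now: float) -> bool:
        """Take the key if it is free or expired; renew it if `node` already holds it."""
        if self.leader is None or now >= self.expires_at or self.leader == node:
            self.leader = node
            self.expires_at = now + self.ttl
            return True
        return False


dcs = ToyDCS(ttl=30)
print(dcs.acquire_or_renew("pg1", now=0))    # True: pg1 bootstraps and becomes leader
print(dcs.acquire_or_renew("pg2", now=10))   # False: the key is still held by pg1
print(dcs.acquire_or_renew("pg1", now=20))   # True: pg1 renews, key valid until t=50
print(dcs.acquire_or_renew("pg2", now=55))   # True: pg1 stopped renewing, key expired
```

A real DCS performs the "take if free" step atomically (compare-and-set), which is what prevents two nodes from both believing they are the leader.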

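Failover Flow

Failure detection: Patroni detects that the primary has failed
Leader election: a new leader is elected from the available replicas
Promotion: the elected replica is promoted to primary
Reconfiguration: the replication settings of the other replicas are updated
Client redirection: the load balancer's health checks pick up the new primary and client requests are routed to it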
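
The election step can be pictured as filtering and ranking the replicas. The sketch below is a simplification under stated assumptions: the `Replica` class and `pick_candidate` function are hypothetical, and real Patroni nodes race for the leader key rather than being ranked by a central function. It only illustrates how the `nofailover` tag and `maximum_lag_on_failover` exclude candidates.

```python
from dataclasses import dataclass
from typing import List, Optional

MB = 1024 * 1024


@dataclass
class Replica:
    name: str
    wal_position: int          # WAL position in bytes
    nofailover: bool = False   # mirrors the nofailover tag
    healthy: bool = True


def pick_candidate(replicas: List[Replica], last_leader_position: int,
                   maximum_lag_on_failover: int) -> Optional[Replica]:
    """Return the eligible replica with the most WAL, or None if nobody qualifies."""
    eligible = [
        r for r in replicas
        if r.healthy and not r.nofailover
        and last_leader_position - r.wal_position <= maximum_lag_on_failover
    ]
    return max(eligible, key=lambda r: r.wal_position, default=None)


replicas = [
    Replica("pg2", wal_position=100 * MB - 512 * 1024),  # 0.5 MB behind
    Replica("pg3", wal_position=100 * MB - 2 * MB),      # 2 MB behind, over the limit
]
winner = pick_candidate(replicas, last_leader_position=100 * MB,
                        maximum_lag_on_failover=1048576)
print(winner.name if winner else "no eligible replica")  # pg2
```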

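Environment Preparation and Planning

Hardware Requirements (recommended for production)

Component	Recommendation	Notes
CPU	8-16 cores	Handles the database load and the Patroni process
Memory	16-32 GB	Enough memory to reduce I/O pressure
Storage	500 GB NVMe SSD	High IOPS for good replication performance
Network	10 GbE	Low latency for DCS traffic and replication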

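Software Version Compatibility

Patroni version	Supported PostgreSQL versions	Supported DCS versions
2.0-2.1	PostgreSQL 10-14	etcd 3.4+, Consul 1.9+, ZooKeeper 3.5+
2.2-2.3	PostgreSQL 10-15	etcd 3.4+, Consul 1.10+, ZooKeeper 3.5+
2.4+	PostgreSQL 10-16	etcd 3.5+, Consul 1.11+, ZooKeeper 3.6+

Confirm the exact pairings against the release notes of the Patroni version you deploy.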

Example Node Plan

Node role	Hostname	IP address	Installed software	Notes
etcd node 1	etcd1	192.168.1.10	etcd 3.5	Distributed consensus store
etcd node 2	etcd2	192.168.1.11	etcd 3.5	Distributed consensus store
etcd node 3	etcd3	192.168.1.12	etcd 3.5	Distributed consensus store
PostgreSQL node 1	pg1	192.168.1.20	PostgreSQL 15, Patroni 2.4	Primary/replica
PostgreSQL node 2	pg2	192.168.1.21	PostgreSQL 15, Patroni 2.4	Primary/replica
PostgreSQL node 3	pg3	192.168.1.22	PostgreSQL 15, Patroni 2.4	Primary/replica
Load balancer node	lb	192.168.1.30	HAProxy 2.6	Client connection management

Installation and Configuration

1. Install and Configure the etcd Cluster

Install etcd

bash

# Run on all etcd nodes
ETCD_VERSION="3.5.9"
wget https://github.com/etcd-io/etcd/releases/download/v${ETCD_VERSION}/etcd-v${ETCD_VERSION}-linux-amd64.tar.gz
tar xvf etcd-v${ETCD_VERSION}-linux-amd64.tar.gz
cp etcd-v${ETCD_VERSION}-linux-amd64/etcd* /usr/local/bin/

# Create the etcd user and directories
useradd -r -s /sbin/nologin etcd
mkdir -p /var/lib/etcd /etc/etcd
chown -R etcd:etcd /var/lib/etcd /etc/etcd

Configure etcd (etcd1 node)

Create /etc/etcd/etcd.conf.yml. On etcd2 and etcd3, change name and the node's own IP addresses accordingly:

yaml

name: etcd1
data-dir: /var/lib/etcd
listen-client-urls: http://192.168.1.10:2379,http://127.0.0.1:2379
advertise-client-urls: http://192.168.1.10:2379
listen-peer-urls: http://192.168.1.10:2380
initial-advertise-peer-urls: http://192.168.1.10:2380
initial-cluster: etcd1=http://192.168.1.10:2380,etcd2=http://192.168.1.11:2380,etcd3=http://192.168.1.12:2380
initial-cluster-state: new
initial-cluster-token: etcd-cluster-1
enable-v2: false
logger: zap
log-level: info
snapshot-count: 10000
auto-compaction-mode: periodic
auto-compaction-retention: "1h"

systemd Service Configuration

Create /etc/systemd/system/etcd.service:

ini

[Unit]
Description=etcd key-value store
Documentation=https://etcd.io/docs/
After=network.target

[Service]
User=etcd
Group=etcd
Type=notify
ExecStart=/usr/local/bin/etcd --config-file /etc/etcd/etcd.conf.yml
Restart=always
RestartSec=10s
LimitNOFILE=40000

[Install]
WantedBy=multi-user.target

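Start and Verify the etcd Cluster

bash

# Start etcd on all three nodes at about the same time; with Type=notify,
# systemctl start on the first node blocks until a quorum has formed
systemctl daemon-reload
systemctl start etcd
systemctl enable etcd

# Verify the cluster status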
etcdctl --endpoints=http://192.168.1.10:2379,http://192.168.1.11:2379,http://192.168.1.12:2379 endpoint health
etcdctl --endpoints=http://192.168.1.10:2379 member list
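
The same health check can be scripted for monitoring jobs. This is a small sketch, assuming etcd's HTTP `/health` endpoint on the client port, which answers with a JSON body such as `{"health":"true","reason":""}`; the function names are hypothetical.

```python
import json
import urllib.request


def parse_health(body: str) -> bool:
    """etcd reports health as the string "true" or "false" in the JSON body."""
    return json.loads(body).get("health") == "true"


def check_endpoints(endpoints, timeout=3.0):
    """Return {endpoint: healthy}; unreachable or unhealthy members map to False."""
    result = {}
    for ep in endpoints:
        try:
            with urllib.request.urlopen(f"{ep}/health", timeout=timeout) as resp:
                result[ep] = parse_health(resp.read().decode())
        except (OSError, ValueError):  # connection errors, HTTP errors, bad JSON
            result[ep] = False
    return result


# Example call against the three nodes planned above:
# check_endpoints(["http://192.168.1.10:2379",
#                  "http://192.168.1.11:2379",
#                  "http://192.168.1.12:2379"])
```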

2. Install and Configure PostgreSQL

Install PostgreSQL 15

bash

# Run on all PostgreSQL nodes
yum install -y https://download.postgresql.org/pub/repos/yum/reporpms/EL-$(rpm -E %{rhel})-x86_64/pgdg-redhat-repo-latest.noarch.rpm
yum install -y postgresql15 postgresql15-server postgresql15-contrib postgresql15-devel

# Create the data directory
mkdir -p /var/lib/pgsql/15/data
chown -R postgres:postgres /var/lib/pgsql/15/data

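Note: do not start the PostgreSQL service manually; Patroni manages it.

3. Install and Configure Patroni

Install Patroni

bash

# Run on all PostgreSQL nodes
pip3 install --upgrade pip
# Quote the extra so the shell does not interpret the brackets
pip3 install 'patroni[etcd3]' psycopg2-binary

# Verify the installation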
patroni --version
patronictl --version

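Create the Patroni Configuration File

Create /etc/patroni.yml (pg1 node):

yaml

scope: postgres-cluster
name: pg1  # must be unique on every node
# ttl, loop_wait, retry_timeout and maximum_lag_on_failover are dynamic (DCS)
# settings; they are defined once under bootstrap.dcs below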

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 192.168.1.20:5432
  data_dir: /var/lib/pgsql/15/data
  bin_dir: /usr/pgsql-15/bin
  pgpass: /tmp/pgpass0
  authentication:
    replication:
      username: replicator
      password: rep-pass-2025
    superuser:
      username: postgres
      password: postgres-pass-2025
    rewind:
      username: rewind_user
      password: rewind-pass-2025
  parameters:
    unix_socket_directories: '/var/run/postgresql'
    wal_level: replica
    hot_standby: on
    max_connections: 500
    max_wal_senders: 10
    max_replication_slots: 10
    wal_keep_size: 16GB
    shared_buffers: 4GB
    work_mem: 16MB
    maintenance_work_mem: 1GB
    effective_cache_size: 12GB
    checkpoint_completion_target: 0.9
    min_wal_size: 1GB
    max_wal_size: 4GB
    random_page_cost: 1.1  # recommended value for SSD storage
  # pg_rewind credentials are taken from authentication.rewind above
  # Replica creation via pgBackRest (optional)
  # create_replica_methods:
  #   - pgbackrest
  #   - basebackup
  # pgbackrest:
  #   command: pgbackrest --stanza=main restore
  #   keep_data: true
  #   no_params: true

# DCS configuration (etcd v3 API, matching the patroni[etcd3] extra)
etcd3:
  hosts: 192.168.1.10:2379,192.168.1.11:2379,192.168.1.12:2379
  # etcd authentication (optional)
  # username: etcd_user
  # password: etcd_pass
  # protocol: https
# REST API configuration
restapi:
  listen: 0.0.0.0:8008
  connect_address: 192.168.1.20:8008
  # Prometheus metrics are served at /metrics on this port; no extra setting is needed

# Bootstrap configuration
bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: true
      use_slots: true
  initdb:
  - encoding: UTF8
  - locale: en_US.UTF-8
  - data-checksums
  pg_hba:
  - host all all 0.0.0.0/0 md5
  - host replication replicator 0.0.0.0/0 md5
  - host all rewind_user 0.0.0.0/0 md5
  users:
    rewind_user:
      password: rewind-pass-2025
      options:
        - createrole
        - replication
        - login
    # Application user (optional)
    app_user:
      password: app-pass-2025
      options:
        - login

tags:
  nofailover: false
  noloadbalance: false
  clonefrom: true
  nosync: false

Note: on the other nodes, change name, postgresql.connect_address, and restapi.connect_address to that node's own name and IP.
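
Because only a few keys differ between nodes, they can be generated from the node plan instead of being edited by hand. The sketch below is illustrative only: `NODES` and `node_overrides` are hypothetical names, and the dotted keys are just a readable way to name the nested YAML settings.

```python
# Node plan from the table earlier in this article
NODES = {"pg1": "192.168.1.20", "pg2": "192.168.1.21", "pg3": "192.168.1.22"}


def node_overrides(name: str, ip: str) -> dict:
    """Settings that must differ on every node; everything else is shared."""
    return {
        "name": name,
        "postgresql.connect_address": f"{ip}:5432",
        "restapi.connect_address": f"{ip}:8008",
    }


for name, ip in NODES.items():
    print(name, node_overrides(name, ip))
```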

Create the Patroni systemd Service

Create /etc/systemd/system/patroni.service:

ini

[Unit]
Description=Runners to orchestrate a high-availability PostgreSQL
After=syslog.target network.target etcd.service

[Service]
Type=simple
User=postgres
Group=postgres
ExecStart=/usr/local/bin/patroni /etc/patroni.yml
ExecReload=/bin/kill -s HUP $MAINPID
KillMode=process
TimeoutSec=30
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

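Start and Verify the Patroni Cluster

bash

# Start the Patroni service
systemctl daemon-reload
systemctl start patroni
systemctl enable patroni

# Check the service status
systemctl status patroni

# Show the cluster status (run on any node)
patronictl -c /etc/patroni.yml list

Expected output (the exact columns vary with the Patroni version):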

+ Cluster: postgres-cluster (7000000000000000001) ----+-----------+-----------+-----------------+----------------------+
| Member | Host              | Role    | State   | TL | LSN       | Lag in MB | Pending restart | Tags                 |
+--------+-------------------+---------+---------+----+-----------+-----------+-----------------+----------------------+
| pg1    | 192.168.1.20:5432 | Leader  | running |  1 | 0/4000000 |         0 | *               | clonefrom: true      |
|        |                   |         |         |    |           |           |                 | nofailover: false    |
|        |                   |         |         |    |           |           |                 | noloadbalance: false |
|        |                   |         |         |    |           |           |                 | nosync: false        |
| pg2    | 192.168.1.21:5432 | Replica | running |  1 | 0/4000000 |         0 | *               | clonefrom: true      |
|        |                   |         |         |    |           |           |                 | nofailover: false    |
|        |                   |         |         |    |           |           |                 | noloadbalance: false |
|        |                   |         |         |    |           |           |                 | nosync: false        |
| pg3    | 192.168.1.22:5432 | Replica | running |  1 | 0/4000000 |         0 | *               | clonefrom: true      |
|        |                   |         |         |    |           |           |                 | nofailover: false    |
|        |                   |         |         |    |           |           |                 | noloadbalance: false |
|        |                   |         |         |    |           |           |                 | nosync: false        |
+--------+-------------------+---------+---------+----+-----------+-----------+-----------------+----------------------+

4. Configure HAProxy Load Balancing

Install HAProxy

bash

# Run on the load balancer node
yum install -y haproxy

Configure HAProxy

Edit /etc/haproxy/haproxy.cfg:

ini

global
    log         127.0.0.1 local2
    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    maxconn     4000
    user        haproxy
    group       haproxy
    daemon
    stats socket /var/lib/haproxy/stats

defaults
    mode                    tcp
    log                     global
    option                  tcplog
    option                  dontlognull
    option                  redispatch
    retries                 3
    timeout connect         5s
    timeout client          1m
    timeout server          1m
    maxconn                 3000

# Patroni REST API health checks
listen patroni-api
    bind 0.0.0.0:8080
    mode http
    stats enable
    stats uri /haproxy?stats
    stats realm HAProxy Stats
    stats auth admin:admin-pass-2025
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server pg1 192.168.1.20:8008 check port 8008
    server pg2 192.168.1.21:8008 check port 8008
    server pg3 192.168.1.22:8008 check port 8008

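# PostgreSQL primary: read-write connections
listen postgres-write
    bind 0.0.0.0:5432
    mode tcp
    option tcplog
    balance roundrobin
    # /primary returns 200 only on the leader (older Patroni releases use /master)
    option httpchk GET /primary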
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server pg1 192.168.1.20:5432 check port 8008
    server pg2 192.168.1.21:5432 check port 8008 backup
    server pg3 192.168.1.22:5432 check port 8008 backup

# PostgreSQL replicas: read-only connections
listen postgres-read
    bind 0.0.0.0:5433
    mode tcp
    option tcplog
    balance roundrobin
    option httpchk GET /replica
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server pg1 192.168.1.20:5432 check port 8008 backup
    server pg2 192.168.1.21:5432 check port 8008
    server pg3 192.168.1.22:5432 check port 8008
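
The HAProxy checks rely on the Patroni REST API answering HTTP 200 on the primary endpoint (`/primary`, with `/master` as the older alias) only on the leader, and 200 on `/replica` only on a running replica. The sketch below reproduces that classification; `http_status` and `role_of` are hypothetical helper names, and the status-fetching function is injectable so the logic can be exercised without a cluster.

```python
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 3.0) -> int:
    """Return the HTTP status code for url, or 0 if the node is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:  # non-2xx answers such as 503
        return exc.code
    except OSError:                        # refused, timed out, DNS failure
        return 0


def role_of(host: str, get_status=http_status) -> str:
    """Classify a node the way the two HAProxy health checks would."""
    if get_status(f"http://{host}:8008/primary") == 200:
        return "primary"
    if get_status(f"http://{host}:8008/replica") == 200:
        return "replica"
    return "unavailable"


# Example: role_of("192.168.1.20") against a live cluster
```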

Start HAProxy

bash

systemctl start haproxy
systemctl enable haproxy

# Check the status
systemctl status haproxy

Monitoring and Alerting

1. Prometheus + Grafana Monitoring

Configure Prometheus Scraping

yaml

scrape_configs:
  # PostgreSQL instances (postgres_exporter)
  - job_name: 'postgresql'
    static_configs:
      - targets: ['192.168.1.20:9187', '192.168.1.21:9187', '192.168.1.22:9187']
    scrape_interval: 15s

  # Patroni
  - job_name: 'patroni'
    static_configs:
      - targets: ['192.168.1.20:8008', '192.168.1.21:8008', '192.168.1.22:8008']
    metrics_path: /metrics
    scrape_interval: 15s

  # etcd
  - job_name: 'etcd'
    static_configs:
      - targets: ['192.168.1.10:2379', '192.168.1.11:2379', '192.168.1.12:2379']
    scrape_interval: 15s
    metrics_path: /metrics

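Key Metrics

Metric type	Key metric	Alert condition
Patroni state	patroni_primary / patroni_replica (patroni_master in older releases)	Primary role changes
	patroni_postgres_running	Instance is not running
	patroni_postgres_timeline	Timeline increases (a failover or switchover occurred)
Replication	patroni_xlog_location (primary) minus patroni_xlog_replayed_location (replica)	> 10 MB
	patroni_xlog_location (primary) minus patroni_xlog_received_location (replica)	> 10 MB
PostgreSQL performance	pg_stat_database_xact_commit	Abnormal change in commit rate
	pg_stat_database_xact_rollback	Rollback rate > 5%
	pg_stat_bgwriter_checkpoint_sync_time	Checkpoint sync time > 1s
System resources	node_cpu_seconds_total	CPU usage > 80%
	node_memory_MemAvailable_bytes	Available memory < 20%
	node_filesystem_avail_bytes	Free disk space < 10%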
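
Two of the thresholds in the table are derived values rather than raw metrics: replication lag is a difference between WAL positions, and the rollback rate is a ratio of counter increases over a time window. A minimal sketch of that arithmetic follows; the `evaluate` function and its argument names are hypothetical, and in practice the same expressions would live in Prometheus alert rules.

```python
MB = 1024 * 1024


def evaluate(lag_bytes: int, commits_delta: int, rollbacks_delta: int) -> list:
    """Apply the 10 MB lag and 5% rollback-rate thresholds to one sample window."""
    alerts = []
    if lag_bytes > 10 * MB:
        alerts.append("replication lag > 10MB")
    total = commits_delta + rollbacks_delta
    if total and rollbacks_delta / total > 0.05:
        alerts.append("rollback rate > 5%")
    return alerts


print(evaluate(lag_bytes=11 * MB, commits_delta=90, rollbacks_delta=10))
print(evaluate(lag_bytes=0, commits_delta=100, rollbacks_delta=0))
```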

2. Log Monitoring

Centralize logs with the ELK Stack or Loki + Grafana:

bash

# Forward logs to a central server with rsyslog (@@ means TCP)
echo '*.* @@log-server:514' >> /etc/rsyslog.conf
systemctl restart rsyslog

Alternatively, collect the logs with Filebeat (filebeat.yml):

yaml

filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /var/lib/pgsql/15/data/log/*.log
    - /var/log/messages
  tags: ["postgresql"]

- type: log
  enabled: true
  paths:
    - /var/log/patroni/*.log
  tags: ["patroni"]

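Automated Management and Maintenance

1. Automated Backup Script

bash

#!/bin/bash
# auto_backup.sh
# The replicator password is expected in ~/.pgpass or PGPASSWORD
set -euo pipefail

BACKUP_DIR="/backup/postgres"
DATE=$(date +%Y%m%d_%H%M%S)
PATRONI_CONFIG="/etc/patroni.yml"

# Find the primary: locate the Host column in the header, then read it from the Leader row
MASTER_HOST=$(patronictl -c "$PATRONI_CONFIG" list | awk -F'|' '
    /Member/ { for (i = 1; i <= NF; i++) if ($i ~ /Host/) col = i; next }
    col && /Leader/ { gsub(/ /, "", $col); print $col }' | cut -d: -f1)
if [ -z "$MASTER_HOST" ]; then
    echo "No leader found, backup aborted" >&2
    exit 1
fi

# Take a base backup (compressed tar format)
mkdir -p "$BACKUP_DIR"
pg_basebackup -h "$MASTER_HOST" -U replicator -D "${BACKUP_DIR}/basebackup_${DATE}" -F t -z -P

# Keep the last 7 days of backups (find -delete cannot remove non-empty directories)
find "$BACKUP_DIR" -maxdepth 1 -name "basebackup_*" -type d -mtime +7 -exec rm -rf {} +

# Record the backup in the log
echo "Backup completed: ${BACKUP_DIR}/basebackup_${DATE} on ${MASTER_HOST}" >> "${BACKUP_DIR}/backup_log.txt"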

2. Periodic Check Script

bash

#!/bin/bash
# check_patroni_cluster.sh

PATRONI_CONFIG="/etc/patroni.yml"
LOG_FILE="/var/log/patroni/check_cluster.log"
mkdir -p "$(dirname "$LOG_FILE")"

# Fetch the cluster status
CLUSTER_STATUS=$(patronictl -c "$PATRONI_CONFIG" list)

# Check that a leader exists
if ! echo "$CLUSTER_STATUS" | grep -q Leader; then
    echo "$(date): ERROR - No leader found in cluster!" >> "$LOG_FILE"
    exit 1
fi

# Check that enough replicas are present
REPLICA_COUNT=$(echo "$CLUSTER_STATUS" | grep -c Replica)
if [ "$REPLICA_COUNT" -lt 2 ]; then
    echo "$(date): WARNING - Less than 2 replicas available!" >> "$LOG_FILE"
fi

# Check replication lag: find the "Lag in MB" column in the header, then read it per replica
LAG_INFO=$(echo "$CLUSTER_STATUS" | awk -F'|' '
    /Lag in MB/ { for (i = 1; i <= NF; i++) if ($i ~ /Lag in MB/) col = i; next }
    col && /Replica/ { gsub(/ /, "", $col); print $col }')
for lag in $LAG_INFO; do
    # Skip empty or non-numeric values such as "unknown"
    if [[ "$lag" =~ ^[0-9]+$ ]] && [ "$lag" -gt 5 ]; then
        echo "$(date): WARNING - Replica lag exceeds 5MB: $lag" >> "$LOG_FILE"
    fi
done

echo "$(date): Cluster check completed" >> "$LOG_FILE"
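
Parsing the pretty-printed table is fragile, so a structured alternative may be worth considering. The sketch below applies the same three checks to the output of `patronictl list -f json`. The key names (`Member`, `Role`, `Lag in MB`) are an assumption based on the column headers and should be confirmed against the Patroni version in use; `check_cluster` is a hypothetical function name.

```python
import json


def check_cluster(members: list, min_replicas: int = 2, max_lag_mb: int = 5) -> list:
    """Return a list of problems found in the parsed member list."""
    problems = []
    if not any(m.get("Role") == "Leader" for m in members):
        problems.append("ERROR: no leader found in cluster")
    replicas = [m for m in members if m.get("Role") == "Replica"]
    if len(replicas) < min_replicas:
        problems.append(f"WARNING: fewer than {min_replicas} replicas available")
    for m in replicas:
        lag = m.get("Lag in MB")
        # Lag may be missing or a string such as "unknown"; only compare numbers
        if isinstance(lag, (int, float)) and lag > max_lag_mb:
            problems.append(f"WARNING: {m.get('Member')} lag is {lag} MB")
    return problems


# Sample input shaped like the assumed JSON output
sample = json.loads("""
[
  {"Member": "pg1", "Host": "192.168.1.20", "Role": "Leader", "State": "running", "TL": 1},
  {"Member": "pg2", "Host": "192.168.1.21", "Role": "Replica", "State": "running", "TL": 1, "Lag in MB": 0},
  {"Member": "pg3", "Host": "192.168.1.22", "Role": "Replica", "State": "running", "TL": 1, "Lag in MB": 12}
]
""")
print(check_cluster(sample))
```

In a cron job, the member list would come from `json.load(sys.stdin)` with `patronictl -c /etc/patroni.yml list -f json` piped in.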

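3. Scale-Out Script

bash

#!/bin/bash
# add_new_node.sh
# Run on pg1 as root. Assumes PostgreSQL and Patroni are already installed
# on the new node and that SSH access to it is available.

NEW_NODE_IP=$1
NEW_NODE_NAME=$2
PATRONI_CONFIG_TEMPLATE="/etc/patroni.yml"

if [ -z "$NEW_NODE_IP" ] || [ -z "$NEW_NODE_NAME" ]; then
    echo "Usage: $0 <new_node_ip> <new_node_name>"
    exit 1
fi

# Generate the new node's configuration from pg1's file (IP and node name)
cp "$PATRONI_CONFIG_TEMPLATE" "/tmp/patroni_${NEW_NODE_NAME}.yml"
sed -i -e "s/192\.168\.1\.20/$NEW_NODE_IP/g" \
       -e "s/^name: .*/name: $NEW_NODE_NAME/" "/tmp/patroni_${NEW_NODE_NAME}.yml"

# Copy the configuration to the new node
scp "/tmp/patroni_${NEW_NODE_NAME}.yml" "${NEW_NODE_IP}:/etc/patroni.yml"
scp /etc/systemd/system/patroni.service "${NEW_NODE_IP}:/etc/systemd/system/"

# Start Patroni on the new node
ssh "${NEW_NODE_IP}" "
chown postgres:postgres /etc/patroni.yml
chmod 600 /etc/patroni.yml
systemctl daemon-reload
systemctl start patroni
systemctl enable patroni
"

# Verify that the new node has joined
patronictl -c "$PATRONI_CONFIG_TEMPLATE" list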

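Failover and Recovery Drills

1. Simulate a Primary Failure

bash

# First identify the current primary
patronictl -c /etc/patroni.yml list

# On the primary, stop the Patroni service
systemctl stop patroni

# Watch the failover from another node
watch -n 1 patronictl -c /etc/patroni.yml list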

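2. Simulate a Network Partition

bash

# Use iptables to cut the primary off from the DCS.
# Run on the primary. All three etcd members must be blocked; otherwise
# Patroni keeps working through whichever member is still reachable.
iptables -A OUTPUT -d 192.168.1.10 -j DROP  # block traffic to etcd1
iptables -A OUTPUT -d 192.168.1.11 -j DROP  # block traffic to etcd2
iptables -A OUTPUT -d 192.168.1.12 -j DROP  # block traffic to etcd3

# Watch the failover from another node
watch -n 1 patronictl -c /etc/patroni.yml list

# Restore the network
iptables -D OUTPUT -d 192.168.1.10 -j DROP
iptables -D OUTPUT -d 192.168.1.11 -j DROP
iptables -D OUTPUT -d 192.168.1.12 -j DROP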

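3. Node Recovery Procedure

bash

# 1. Repair the failed node
# 2. Start the Patroni service
systemctl start patroni

# 3. Verify that the node has rejoined
patronictl -c /etc/patroni.yml list

# 4. With use_pg_rewind enabled, Patroni runs pg_rewind automatically when needed.
#    If the node still cannot rejoin, rebuild it from the current primary:
patronictl -c /etc/patroni.yml reinit postgres-cluster $NODE_NAME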

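Feature Differences Across Versions

PostgreSQL version	Patroni version	Main differences
10-12	1.6-2.0	Basic failover, pg_rewind support
13-14	2.1-2.2	Richer monitoring metrics, improved rewind support
15	2.3-2.4	Support for PostgreSQL 15 features, improved WAL management
16	2.4+	Support for parallel logical replication, improved performance monitoring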

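Best Practices

1. Architecture Design

DCS deployment: at least 3 nodes; 5 nodes are recommended for better fault tolerance
PostgreSQL nodes: 3-5 nodes are recommended (split-brain protection itself comes from the DCS quorum)
Network: use separate network planes for DCS traffic and replication
Storage: use SSD/NVMe storage and place WAL on a separate volume
Load balancing: deploy at least 2 load balancer nodes and make them highly available with Keepalived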
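
The recommendation of 3 or 5 DCS nodes follows from majority-quorum arithmetic: a cluster of n members needs n // 2 + 1 of them to agree, so an even member count adds cost without adding fault tolerance. A short worked example (the function names are illustrative):

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting nodes."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can fail while a majority is still available."""
    return members - quorum(members)


for n in (3, 4, 5):
    print(n, quorum(n), tolerated_failures(n))
# 3 2 1
# 4 3 1
# 5 3 2
```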

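2. Configuration Tuning

Set ttl sensibly: adjust ttl to the network latency; 30-60 seconds is recommended
Replication lag threshold: set maximum_lag_on_failover according to business requirements; 1-10 MB is recommended
Enable pg_rewind: speeds up the recovery of a failed node
Configure replication slots: prevents WAL from being recycled too early
Tune PostgreSQL parameters: adjust memory, connection limits, and similar settings to the hardware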
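
When adjusting ttl, the other two timing settings have to move with it. The Patroni documentation states the rule loop_wait + 2 * retry_timeout <= ttl; the small check below applies it to candidate values (the function name is hypothetical).

```python
def timing_is_consistent(ttl: int, loop_wait: int, retry_timeout: int) -> bool:
    """Check the documented rule: loop_wait + 2 * retry_timeout <= ttl."""
    return loop_wait + 2 * retry_timeout <= ttl


print(timing_is_consistent(ttl=30, loop_wait=10, retry_timeout=10))  # True: 10 + 20 <= 30
print(timing_is_consistent(ttl=20, loop_wait=10, retry_timeout=10))  # False: 30 > 20
```

The values used in this article (ttl 30, loop_wait 10, retry_timeout 10) sit exactly on the boundary, so raising loop_wait or retry_timeout alone would break the rule.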

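3. Security Hardening

DCS security: enable authentication and encrypted communication
PostgreSQL security:
- Use strong passwords
- Configure a strict pg_hba.conf
- Enable SSL/TLS connections
- Rotate passwords regularly
Patroni security:
- Restrict which IP addresses can reach the REST API
- Configure authentication (optional)
- Protect the configuration file (chmod 600 /etc/patroni.yml)

4. Monitoring and Alerting

Alert on key metrics: primary role changes, abnormal instance state, excessive replication lag
Regular health checks: check the cluster state automatically every day
Centralized logging: collect and analyze PostgreSQL and Patroni logs
Performance trend analysis: track long-term trends to forecast capacity needs

5. Backup and Recovery

Layered backup strategy: regular full backups plus WAL archiving
Off-site backups: store backups in a different region
Regular restore tests: restore from backup at least once per quarter
PITR readiness: have everything needed for point-in-time recovery in place

6. Routine Maintenance

Regular updates: keep Patroni and PostgreSQL up to date
Failure drills: run a failover drill once per quarter
Documentation: keep architecture diagrams and runbooks current
Capacity planning: plan resource expansion according to business growth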

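Common Problems and Solutions

1. A Node Cannot Join the Cluster

Symptom: a new node starts but does not join the cluster

Solution:

Check network connectivity: make sure the nodes can reach each other
Check the DCS connection: verify that the etcd cluster is healthy
Check the configuration file: make sure scope and the authentication settings are correct
Check the logs: look for errors in /var/log/patroni/patroni.log (if file logging is configured in the log section of patroni.yml; otherwise use journalctl -u patroni)

2. Failover Fails

Symptom: after the primary fails, no new primary is elected

Solution:

Check the DCS state: make sure the etcd cluster is healthy
Check the replicas: make sure at least one replica is available
Check replication lag: make sure the replica's lag is within maximum_lag_on_failover
Check the Patroni logs: find out why the failover did not complete

3. Replication Lag Is Too High

Symptom: replication lag on a replica keeps growing

Solution:

Check the load on the primary: look for heavy writes or long-running transactions
Check network latency: look for a network bottleneck
Adjust wal_keep_size: retain more WAL
Tune the replica: adjust the replica's PostgreSQL parameters
Consider faster storage

4. pg_rewind Fails

Symptom: pg_rewind fails while a failed node is recovering

Solution:

Make sure the rewind user and password are configured correctly
Check that the required WAL is complete
If it still fails, reinitialize the node (clone it from another node)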

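Comparison with Other High-Availability Solutions

Solution	Advantages	Disadvantages	Best suited for
Patroni	Automatic failover, configuration management, monitoring integration	Requires an external DCS	Medium to large clusters with strict availability requirements
repmgr	Simple to use, no external dependencies	Comparatively limited feature set	Small clusters, quick deployment
pg_auto_failover	Automatic configuration, simple to use	Less flexible	Small to medium clusters
pgpool-II	Built-in load balancing and connection pooling	Risk of a single point of failure	Workloads that need read/write splitting

Summary

A Patroni-based PostgreSQL high-availability architecture is a capable and reliable solution for production environments of any size. With sound deployment and configuration it delivers automatic failure detection, automatic failover, and centralized configuration management, keeping the PostgreSQL cluster highly available and its data consistent.

In a real deployment, pay attention to the following:

Keep the DCS cluster itself highly available
Configure Patroni parameters sensibly and adjust them to business requirements
Build thorough monitoring and alerting
Run failover drills regularly
Use a load balancer to make client connections more reliable
Apply strict security measures
Define a complete backup and recovery strategy

Following these best practices yields a stable, reliable PostgreSQL cluster that meets the business's availability requirements.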
