TiDB Cluster Scale-Out Operations

Preparing for Scale-Out

Scale-Out Scenario Analysis

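| Scale-out type | When to use | Complexity | Impact |
| --- | --- | --- | --- |
| TiDB | High-concurrency query workloads, high CPU utilization | Low | No service interruption |
| TiKV | Insufficient storage capacity, concentrated read/write hotspots | Medium | Possible performance impact during data migration |
| PD | Large clusters (>100 nodes), high PD load | Low | No service interruption |
| TiFlash | Growing analytical query demand, insufficient TiFlash resources | Medium | Possible performance impact during data synchronization |
| Monitoring components | Growing monitoring data volume, slower monitoring queries | Low | No service interruption |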

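Scale-Out Planning

Capacity planning
- Calculate how many nodes need to be added
- Decide on the hardware configuration of the new nodes
- Plan storage capacity and IOPS requirements
Network planning
- Make sure the new nodes can reach the existing cluster over the network
- Configure firewall rules
- Plan network bandwidth
Resource planning
- Make sure the new nodes have enough CPU and memory
- Configure operating system parameters
- Install the required dependencies
Schedule planning
- Scale out during off-peak business hours
- Estimate the time window the scale-out needs
- Prepare a rollback plan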
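
The capacity-planning step can be roughed out with a small calculation. The sketch below is not part of TiUP; `tikv_nodes_needed` is a hypothetical helper, and it assumes sizes in GB, evenly spread data, and no compression, so treat the result as a starting point rather than a sizing guarantee.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of TiUP): estimate the total number of TiKV
# nodes needed to hold the data at a target disk utilization.
# Assumptions: sizes are in GB, data is spread evenly, compression is ignored.
# Usage: tikv_nodes_needed <data_gb> <replicas> <node_capacity_gb> <target_util_percent>
tikv_nodes_needed() {
  local data_gb=$1 replicas=$2 node_gb=$3 util_pct=$4
  local required=$(( data_gb * replicas * 100 ))   # scaled by 100 to stay in integers
  local usable=$(( node_gb * util_pct ))           # usable GB per node, same scale
  echo $(( (required + usable - 1) / usable ))     # ceiling division
}

# Example inputs (made up): 6000 GB of data, 3 replicas, 2000 GB disks, 70% target.
total=$(tikv_nodes_needed 6000 3 2000 70)
echo "total TiKV nodes needed: $total"
```

Subtracting the current node count from the result gives the number of nodes to add.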

Environment Checks

bash

# 1. Check the state of the existing cluster
tiup cluster display [cluster-name]

# 2. Check the operating system version of the new node
ssh [new-node-ip] "cat /etc/os-release"

# 3. Check the hardware configuration of the new node
ssh [new-node-ip] "lscpu | grep -E 'Model name|CPU\(s\)'"
ssh [new-node-ip] "free -h"
ssh [new-node-ip] "lsblk -f"

# 4. Check network connectivity in both directions
ping -c 3 [new-node-ip]
ssh [new-node-ip] "ping -c 3 [existing-node-ip]"

# 5. Set up passwordless SSH login, then verify it
ssh-copy-id -i ~/.ssh/id_rsa.pub <user>@[new-node-ip]
ssh <user>@[new-node-ip] "echo 'SSH connection successful'"

# 6. Check the TiUP version
tiup --version

Data Backup

bash

# Run a full backup to S3-compatible storage
# (BR takes the credentials as parameters in the storage URI)
tiup br backup full --pd [pd-ip]:2379 --storage "s3://[bucket]/[backup-path]?access-key=[access-key]&secret-access-key=[secret-key]" --s3.endpoint [endpoint]

# Or run a backup to local storage
tiup br backup full --pd [pd-ip]:2379 --storage local:///path/to/backup

Scaling Out TiKV Nodes

Scale-Out Steps

Edit the topology file

yaml

# topology.yaml
# A scale-out file lists only the nodes being added. Nodes that are already
# in the cluster (PD 192.168.1.101-103, TiKV 192.168.1.104-106,
# TiDB 192.168.1.109-110) are left out, because re-listing them causes
# port and directory conflicts. Global settings are inherited from the cluster.

tikv_servers:
  # New TiKV nodes
  - host: 192.168.1.107
  - host: 192.168.1.108

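Run the scale-out

bash

# Check the new nodes against the existing cluster and try to fix failed items
tiup cluster check [cluster-name] topology.yaml --cluster --apply --user tidb

# Run the scale-out
tiup cluster scale-out [cluster-name] topology.yaml --user tidb

Check scale-out progress

```bash
tiup cluster display [cluster-name]
```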
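
The scale-out file used in the steps above can also be generated instead of edited by hand. This is a minimal sketch; `gen_scale_out` is a hypothetical helper, and it emits only `host` entries, so any per-node ports or directories would still need to be added manually.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print a scale-out topology fragment for one component.
# Usage: gen_scale_out <section> <host>...
gen_scale_out() {
  local section=$1; shift
  printf '%s:\n' "$section"
  local host
  for host in "$@"; do
    printf '  - host: %s\n' "$host"
  done
}

# Print the fragment for the two new TiKV nodes from the example above.
gen_scale_out tikv_servers 192.168.1.107 192.168.1.108
```

Redirecting the output to a file (for example `> topology.yaml`) gives a file that can be passed to `tiup cluster check` and `tiup cluster scale-out`.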

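Monitoring During Scale-Out

Monitor data migration progress

bash

# Use PD Control to list the running Region migration operators
pd-ctl -u http://[pd-ip]:2379 operator show

# Check the storage capacity and usage of each TiKV store
pd-ctl -u http://[pd-ip]:2379 store

Monitor cluster performance
- Watch the "Cluster Info" page in TiDB Dashboard
- Keep an eye on "Region Status" and "Store Status"
- Monitor CPU, memory, and disk I/O utilization on every node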
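
One way to turn the store output into a single progress number is to compare Region counts across stores. The sketch below assumes the per-store Region counts have already been read from the `pd-ctl ... store` output; `region_spread_pct` is a hypothetical helper, and the counts in the example are made up.

```bash
#!/usr/bin/env bash
# Hypothetical helper: given per-store Region counts, print the gap between
# the fullest and the emptiest store as a percentage of the fullest.
# Usage: region_spread_pct <count>...
region_spread_pct() {
  local min=$1 max=$1 n
  for n in "$@"; do
    if (( n < min )); then min=$n; fi
    if (( n > max )); then max=$n; fi
  done
  if (( max == 0 )); then
    echo 0
    return
  fi
  echo $(( (max - min) * 100 / max ))   # integer percent, rounded down
}

# Three existing stores plus two freshly added ones that are still filling up.
region_spread_pct 1200 1180 1210 400 380
```

A value that keeps shrinking toward zero suggests that rebalancing is still making progress.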

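Post-Scale-Out Tuning

bash

# Record the current scheduling values before changing anything
pd-ctl -u http://[pd-ip]:2379 config show

# Adjust the PD scheduling limits to speed up rebalancing
# (example values; defaults differ between PD versions, so only raise a limit
# above what config show reports)
pd-ctl -u http://[pd-ip]:2379 config set schedule.leader-schedule-limit 4
pd-ctl -u http://[pd-ip]:2379 config set schedule.region-schedule-limit 20
pd-ctl -u http://[pd-ip]:2379 config set schedule.replica-schedule-limit 6

# After rebalancing completes, restore the values recorded above
# pd-ctl -u http://[pd-ip]:2379 config set schedule.leader-schedule-limit [original-value]
# pd-ctl -u http://[pd-ip]:2379 config set schedule.region-schedule-limit [original-value]
# pd-ctl -u http://[pd-ip]:2379 config set schedule.replica-schedule-limit [original-value]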

Scaling Out TiDB Nodes

Scale-Out Steps

Edit the topology file

yaml

# topology.yaml
# A scale-out file lists only the nodes being added. The existing TiDB nodes
# (192.168.1.109-110) and the PD and TiKV nodes are left out, because
# re-listing them causes port and directory conflicts.
# Global settings are inherited from the cluster.

tidb_servers:
  # New TiDB nodes
  - host: 192.168.1.111
  - host: 192.168.1.112

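Run the scale-out

bash

# Check the new nodes against the existing cluster and try to fix failed items
tiup cluster check [cluster-name] topology.yaml --cluster --apply --user tidb

# Run the scale-out
tiup cluster scale-out [cluster-name] topology.yaml --user tidb

Check the scale-out result

```bash
tiup cluster display [cluster-name]
```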

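Load Balancer Configuration

Configure the front-end load balancer

- Update the Nginx configuration to include the new TiDB nodes
- Reload the Nginx configuration

nginx

# nginx.conf
# TiDB speaks the MySQL protocol over TCP, so this upstream belongs inside
# a stream {} block rather than an http {} block.
upstream tidb {
    least_conn;
    server 192.168.1.109:4000;
    server 192.168.1.110:4000;
    server 192.168.1.111:4000;  # new node
    server 192.168.1.112:4000;  # new node
}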

Verify load balancing

bash

# Validate the configuration, then reload Nginx
nginx -t
nginx -s reload

# Verify that every TiDB node accepts connections
for ip in 192.168.1.109 192.168.1.110 192.168.1.111 192.168.1.112; do
    mysql -h "$ip" -P 4000 -u root -e "SELECT VERSION();"
done

Scaling Out PD Nodes

Scale-Out Steps

Edit the topology file

yaml

# topology.yaml
# A scale-out file lists only the node being added. The existing PD nodes
# (192.168.1.101-103) and the TiKV and TiDB nodes are left out.
# Global settings are inherited from the cluster.
# PD relies on majority voting, so an odd number of members is generally
# preferred; going from 3 to 4 adds capacity but not fault tolerance.

pd_servers:
  # New PD node
  - host: 192.168.1.113

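Run the scale-out

bash

# Check the new node against the existing cluster and try to fix failed items
tiup cluster check [cluster-name] topology.yaml --cluster --apply --user tidb

# Run the scale-out
tiup cluster scale-out [cluster-name] topology.yaml --user tidb

Verify the PD cluster state

bash

# List the PD cluster members
pd-ctl -u http://[pd-ip]:2379 member

# Check the health of each PD member
pd-ctl -u http://[pd-ip]:2379 health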

Scaling Out TiFlash Nodes

Scale-Out Steps

Edit the topology file

yaml

# topology.yaml
# A scale-out file lists only the node being added. The existing TiFlash node
# (192.168.1.114) and the PD, TiKV, and TiDB nodes are left out.
# Global settings are inherited from the cluster.

tiflash_servers:
  # New TiFlash node
  - host: 192.168.1.115
    tcp_port: 9000
    http_port: 8123
    flash_service_port: 3930
    flash_proxy_port: 20170
    flash_proxy_status_port: 20292
    metrics_port: 8234

Run the scale-out

bash

# Check the new node against the existing cluster and try to fix failed items
tiup cluster check [cluster-name] topology.yaml --cluster --apply --user tidb

# Run the scale-out
tiup cluster scale-out [cluster-name] topology.yaml --user tidb

Check TiFlash status

bash

# Show the status of the TiFlash nodes
tiup cluster display [cluster-name] -R tiflash

# Show TiFlash replica information
mysql -h <tidb-ip> -P 4000 -u root -e "SELECT * FROM information_schema.tiflash_replica;"

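Adjusting TiFlash Replicas

sql

-- Add TiFlash replicas for an existing table
ALTER TABLE `database`.`table` SET TIFLASH REPLICA 2;

-- Check TiFlash replica synchronization progress
-- (schema and table names are string values here, so they take single quotes)
SELECT * FROM information_schema.tiflash_replica WHERE TABLE_SCHEMA = 'database' AND TABLE_NAME = 'table';

-- Change the number of TiFlash replicas
ALTER TABLE `database`.`table` SET TIFLASH REPLICA 1;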

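Scaling Out Monitoring Components

Scaling Out Prometheus

Edit the topology file

yaml

# topology.yaml
# A scale-out file lists only the node being added. The existing monitoring
# and Grafana node (192.168.1.116) is left out.
monitoring_servers:
  # New monitoring node
  - host: 192.168.1.117

Run the scale-out

bash

# Check the new node against the existing cluster and try to fix failed items
tiup cluster check [cluster-name] topology.yaml --cluster --apply --user tidb

# Run the scale-out
tiup cluster scale-out [cluster-name] topology.yaml --user tidb

Configure Prometheus federation (optional)
- Configure a primary Prometheus to pull data from several child Prometheus instances
- Suited to monitoring large clusters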

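Scaling Out Grafana

Edit the topology file

yaml

# topology.yaml
# A scale-out file lists only the node being added. The existing Grafana
# node (192.168.1.116) is left out.
grafana_servers:
  # New Grafana node
  - host: 192.168.1.117

Run the scale-out

bash

# Check the new node against the existing cluster and try to fix failed items
tiup cluster check [cluster-name] topology.yaml --cluster --apply --user tidb

# Run the scale-out
tiup cluster scale-out [cluster-name] topology.yaml --user tidb

Configure Grafana high availability
- Store the Grafana configuration in a shared database
- Put a load balancer in front of the Grafana nodes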

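Post-Scale-Out Verification

Cluster State Verification

bash

# Show the overall cluster state
tiup cluster display [cluster-name]

# Check the nodes of each role (the monitoring role is named prometheus)
for component in pd tikv tidb tiflash prometheus grafana; do
    echo "=== Checking $component status ==="
    tiup cluster display [cluster-name] -R "$component"
    echo ""
done

# Verify that TiDB accepts connections
mysql -h <tidb-ip> -P 4000 -u root -e "SELECT version();"

# Verify how data is distributed across TiKV stores
pd-ctl -u http://[pd-ip]:2379 store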

Functional Verification

sql

-- Verify basic DDL and DML operations
CREATE DATABASE IF NOT EXISTS expand_test;
USE expand_test;
CREATE TABLE test_table (id INT PRIMARY KEY, name VARCHAR(255));
INSERT INTO test_table VALUES (1, 'test');
UPDATE test_table SET name = 'updated' WHERE id = 1;
DELETE FROM test_table WHERE id = 1;
DROP TABLE test_table;
DROP DATABASE expand_test;

-- Verify TiFlash queries (if TiFlash was scaled out)
SELECT /*+ READ_FROM_STORAGE(TIFLASH[`table`]) */ COUNT(*) FROM `database`.`table`;

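Performance Verification

bash

# Run a performance test with Sysbench
# (the test tables must first be created by running the same command with "prepare")
sysbench oltp_read_write --mysql-host=<tidb-ip> --mysql-port=4000 --mysql-user=root --mysql-db=test --tables=10 --table-size=1000000 --threads=16 --time=60 run

# Check the load on each node
for ip in [node-ip-list]; do
    echo "=== Node $ip load ==="
    ssh "$ip" "uptime"
    echo ""
done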

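Scale-Out Best Practices

Choosing When to Scale Out

Off-peak hours
- Scale out in the early morning or at the weekend
- Avoid business peaks and important events
Advance notice
- Notify the business teams ahead of time
- Set up an emergency response process

Operational Best Practices

Scale out in batches
- Add 2-4 nodes at a time
- Avoid adding many nodes at once, which can destabilize the cluster
Monitoring first
- Watch the cluster state closely during the scale-out
- Set alert thresholds so anomalies are caught early
Parameter tuning
- Adjust the scheduling parameters after scaling out
- Tune the TiKV storage parameters
- Adjust the monitoring retention policy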
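
The batching advice can be sketched as a small helper that splits the list of new hosts into groups, each of which then gets its own scale-out file and run. `batch_hosts` is a hypothetical helper, and the 192.168.1.120-122 addresses are made up for the example.

```bash
#!/usr/bin/env bash
# Hypothetical helper: split a host list into batches of at most <size> hosts
# and print one space-separated batch per line.
# Usage: batch_hosts <size> <host>...
batch_hosts() {
  local size=$1; shift
  local batch=() host
  for host in "$@"; do
    batch+=("$host")
    if (( ${#batch[@]} == size )); then
      echo "${batch[*]}"
      batch=()
    fi
  done
  # Print the final, partially filled batch if there is one.
  if (( ${#batch[@]} > 0 )); then
    echo "${batch[*]}"
  fi
}

# Five new hosts, two per batch.
batch_hosts 2 192.168.1.107 192.168.1.108 192.168.1.120 192.168.1.121 192.168.1.122
```

Waiting for rebalancing to settle between batches keeps the extra scheduling load bounded.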

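Troubleshooting Common Problems

Scale-out fails
- Review the operation history with tiup cluster audit, and read the debug log whose path TiUP prints when a command fails
- Check SSH connectivity and permissions
- Validate the format of the topology file
Data rebalancing is slow
- Adjust the PD scheduling parameters
- Check network bandwidth
- Verify that the nodes have enough resources
Node status is abnormal
- Check the node logs
- Verify the hardware resources
- Check network connectivity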
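
Spotting abnormal nodes in a long `tiup cluster display` listing can be scripted with a simple filter. This is a sketch only: `unhealthy_rows` is a hypothetical helper, the status words it matches are an assumption about the display output, and the sample rows are made up.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `tiup cluster display` output on stdin and print
# the rows that contain a status word which usually needs attention.
# The status words below are an assumption about the display output.
unhealthy_rows() {
  grep -Ew 'Down|Offline|Tombstone|Disconnected' || true
}

# Demo on made-up sample rows; in practice pipe the real command into it:
#   tiup cluster display [cluster-name] | unhealthy_rows
unhealthy_rows <<'SAMPLE'
192.168.1.104:20160  tikv  192.168.1.104  20160/20180  linux/x86_64  Up
192.168.1.107:20160  tikv  192.168.1.107  20160/20180  linux/x86_64  Down
SAMPLE
```

The nodes it reports are the ones whose logs, hardware, and network are worth checking first.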

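Automating Scale-Out

Automating Scale-Out with TiUP

bash

# 1. Write the scale-out script
cat > expand_cluster.sh << 'EOF'
#!/bin/bash
# Stop at the first failing step so a failed check never leads into a scale-out
set -euo pipefail

CLUSTER_NAME="tidb-cluster"
SSH_USER="tidb"
TOPOLOGY_FILE="topology.yaml"   # lists only the nodes being added

# Show the current cluster state
echo "Checking cluster status..."
tiup cluster display "$CLUSTER_NAME"

# Check the new nodes against the existing cluster
printf '\nChecking topology file...\n'
tiup cluster check "$CLUSTER_NAME" "$TOPOLOGY_FILE" --cluster --apply --user "$SSH_USER"

# Run the scale-out without the interactive confirmation
printf '\nStarting scale-out operation...\n'
tiup cluster scale-out "$CLUSTER_NAME" "$TOPOLOGY_FILE" --user "$SSH_USER" --yes

# Show the cluster state after the scale-out
printf '\nVerifying scale-out result...\n'
tiup cluster display "$CLUSTER_NAME"
EOF

# 2. Make the script executable
chmod +x expand_cluster.sh

# 3. Run the scale-out script
./expand_cluster.sh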

Automating Scale-Out with Ansible

Install Ansible

```bash
pip install ansible
```

Write the Ansible playbook

yaml

# expand_tidb.yaml
- name: Expand TiDB Cluster
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Check cluster status
      command: tiup cluster display {{ cluster_name }}
      register: cluster_status
      changed_when: false  # read-only command

    - name: Display cluster status
      debug:
        msg: "{{ cluster_status.stdout }}"

    - name: Check topology file
      command: tiup cluster check {{ cluster_name }} {{ topology_file }} --cluster --apply --user {{ user }}

    - name: Execute scale-out
      command: tiup cluster scale-out {{ cluster_name }} {{ topology_file }} --user {{ user }} --yes

    - name: Verify scale-out result
      command: tiup cluster display {{ cluster_name }}
      register: result
      changed_when: false  # read-only command

    - name: Display result
      debug:
        msg: "{{ result.stdout }}"

Run the playbook

bash

ansible-playbook expand_tidb.yaml -e "cluster_name=tidb-cluster topology_file=topology.yaml user=tidb"

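Frequently Asked Questions (FAQ)

Q1: How long does data migration take when scaling out TiKV nodes?

A1: Migration time depends on the following factors:

- The amount of existing data
- Network bandwidth
- The number of nodes being added
- The PD scheduling parameter settings

As a rough guide, migrating 1 TB of data takes about 1-2 hours, though the actual time varies with the factors above. Adjusting the PD scheduling parameters can speed up migration, at the possible cost of cluster performance while it runs.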
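
The 1-2 hour figure can be sanity-checked with simple arithmetic. In this sketch `migration_minutes` is a hypothetical helper, and the sustained throughput values of 150 MB/s and 300 MB/s are assumptions chosen to bracket that range, not measured numbers.

```bash
#!/usr/bin/env bash
# Hypothetical helper: estimate migration time from data size and an assumed
# sustained migration throughput. Integer arithmetic, rounded down.
# Usage: migration_minutes <data_gb> <throughput_mb_per_s>
migration_minutes() {
  local data_gb=$1 mb_per_s=$2
  echo $(( data_gb * 1024 / mb_per_s / 60 ))
}

# 1 TB (1024 GB) at two assumed throughput levels.
echo "at 150 MB/s: $(migration_minutes 1024 150) minutes"
echo "at 300 MB/s: $(migration_minutes 1024 300) minutes"
```

Plugging in the throughput actually observed during the first minutes of a migration gives a more realistic estimate for the rest.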

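Q2: What should I do if a node fails to start during scale-out?

A2: Work through these steps:

- Check the node log: grep -i error /tidb-deploy/[component]-[port]/log/[component].log
- Check the node configuration: cat /tidb-deploy/[component]-[port]/conf/[component].toml
- Verify that the dependencies are installed: ssh [node-ip] "which [dependency]"
- Check whether the port is already in use: ssh [node-ip] "netstat -tlnp | grep [port]"
- Try starting the node manually: tiup cluster start [cluster-name] -N [node-ip]:[port]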

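Q3: What should I do if load is unbalanced across nodes after scale-out?

A3: Ways to address unbalanced load:

- Adjust the PD scheduling parameters to speed up rebalancing
- Check whether the nodes have the same hardware configuration
- Check for hotspot data
- For TiDB nodes, check the load balancer configuration
- For TiKV nodes, consider adjusting how Region replicas are distributed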

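Q4: Do client connection settings need to change when scaling out PD nodes?

A4: No. Applications connect to TiDB servers rather than to PD, and new PD nodes are discovered automatically, so connection settings stay the same. For tools that take PD addresses directly, listing several PD node addresses improves availability.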

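Q5: How can data synchronization be sped up after scaling out TiFlash nodes?

A5: Ways to speed up TiFlash data synchronization:

- Set the TiFlash replica count for the table: ALTER TABLE `table` SET TIFLASH REPLICA 2;
- Make sure the TiKV nodes have enough I/O capacity for synchronization
- Monitor synchronization progress: SELECT * FROM information_schema.tiflash_replica;
- For large tables, consider adding TiFlash replicas in batches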
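
Adding replicas in batches can be scripted by generating the statements for a few tables at a time. The sketch below assumes "in batches" means issuing the replica statement for a small group of tables, then waiting for synchronization before the next group; `tiflash_replica_sql` is a hypothetical helper and the table names are made up.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print one ALTER TABLE ... SET TIFLASH REPLICA statement
# per table. Tables are given as <database>.<table>.
# Usage: tiflash_replica_sql <replica_count> <db.table>...
tiflash_replica_sql() {
  local count=$1; shift
  local t
  for t in "$@"; do
    # ${t%%.*} is the database part, ${t#*.} is the table part.
    printf 'ALTER TABLE `%s`.`%s` SET TIFLASH REPLICA %s;\n' "${t%%.*}" "${t#*.}" "$count"
  done
}

# First batch: two made-up tables, one replica each.
tiflash_replica_sql 1 shop.orders shop.order_items
```

The printed statements can be piped into the `mysql` client, and the next batch started once `information_schema.tiflash_replica` shows the current one as available.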

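Q6: Can a scale-out be cancelled while it is running?

A6: Scale-out of TiDB, PD, and monitoring components cannot be cancelled directly, but the operation can be stopped safely without affecting the existing cluster. For TiKV and TiFlash, cancelling during data migration is not recommended because it can leave data unevenly distributed. If the operation must be stopped, data migration can be paused by adjusting the PD scheduling parameters.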
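
If new nodes have to be taken back out after a scale-out has finished, one option is to scale them in again with `tiup cluster scale-in`. The sketch below only prints the command so it can be reviewed first; `scale_in_cmd` is a hypothetical helper, and the port 20160 assumes the TiKV nodes were deployed with the default port.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print (do not run) the scale-in command that would
# remove the given nodes from the cluster.
# Usage: scale_in_cmd <cluster-name> <host:port>...
scale_in_cmd() {
  local cluster=$1; shift
  local nodes
  nodes=$(IFS=,; echo "$*")   # join the node IDs with commas
  echo "tiup cluster scale-in $cluster --node $nodes"
}

# The two TiKV nodes added earlier, assuming the default port 20160.
scale_in_cmd tidb-cluster 192.168.1.107:20160 192.168.1.108:20160
```

Removing TiKV nodes triggers another round of data migration, so the same monitoring applies as during the scale-out.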
