(High-Concurrency Probing) Part 4: Redis Cluster Deployment Continued (Disaster Recovery)

Preface

The previous posts deployed a Redis cluster of 4 master/slave pairs plus 1 master/slave pair for session backup. If a node in the Redis cluster goes down, how do we keep the service available? This post adds sentinel services on the session servers and tests the cluster's disaster recovery.
The consolidated network addresses of the environment:

"Name": "rm", "172.1.13.11/16", (session主,cl集群的sentinel——1)
"Name": "rs", "172.1.13.12/16", (session从,cl集群的sentinel——2)
"Name": "clm1", "172.1.50.11/16",
"Name": "clm2", "172.1.50.12/16",
"Name": "clm3", "172.1.50.13/16",
"Name": "cls1", "172.1.30.11/16",
"Name": "cls2", "172.1.30.12/16",
"Name": "cls3", "172.1.30.13/16",
"Name": "rbt1", "172.1.12.13/16",
"Name": "rbt2", "172.1.12.14/16",
"Name": "p1", "172.1.1.11/16",
"Name": "p2", "172.1.1.12/16",
"Name": "p3", "172.1.1.13/16",
"Name": "mm", "172.1.11.11/16", (对外端口,主)
"Name": "mm", "172.1.12.12/16", (对外端口,从)
"Name": "n1", "172.1.0.2/16", (对外端口,内网ip随机)
"Name": "n2", "172.1.0.3/16", (对外端口,内网ip随机)

1. Adding sentinels outside the cluster

a. Reorganizing the cluster

All cluster connections use the internal network, as a real deployment would. Modify the container start commands and config files: drop the cluster's public port mappings and the cli connection password. This many nodes is also awkward to manage, so shrink the cluster a bit.
Adjust it to a password-free cluster of 3 master/slave pairs by removing clm4 and cls4:

/ # redis-cli --cluster del-node 172.1.30.21:6379 c2b42a6c35ab6afb1f360280f9545b3d1761725e
>>> Removing node c2b42a6c35ab6afb1f360280f9545b3d1761725e from cluster 172.1.30.21:6379
>>> Sending CLUSTER FORGET messages to the cluster...
>>> SHUTDOWN the node.
/ # redis-cli --cluster del-node 172.1.50.21:6379 6d1b7a14a6d0be55a5fcb9266358bd1a42244d47
>>> Removing node 6d1b7a14a6d0be55a5fcb9266358bd1a42244d47 from cluster 172.1.50.21:6379
[ERR] Node 172.1.50.21:6379 is not empty! Reshard data away and try again.
#The slots must be emptied first (rebalance with weight=0)
/ # redis-cli --cluster rebalance 172.1.50.21:6379 --cluster-weight 6d1b7a14a6d0be55a5fcb9266358bd1a42244d47=0
Moving 2186 slots from 172.1.50.21:6379 to 172.1.30.11:6379
###
Moving 2185 slots from 172.1.50.21:6379 to 172.1.50.11:6379
###
Moving 2185 slots from 172.1.50.21:6379 to 172.1.50.12:6379
###
/ # redis-cli --cluster del-node 172.1.50.21:6379 6d1b7a14a6d0be55a5fcb9266358bd1a42244d47
>>> Removing node 6d1b7a14a6d0be55a5fcb9266358bd1a42244d47 from cluster 172.1.50.21:6379
>>> Sending CLUSTER FORGET messages to the cluster...
>>> SHUTDOWN the node.

Scale-down succeeded.
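To double-check the resulting 3-master topology, the cluster consistency check can be run from any surviving node (a sketch; any remaining address works as the entry point):

/ # redis-cli --cluster check 172.1.50.11:6379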

b. Setting the config options and starting the sentinel service

Modify the container start commands for rm/rs:

docker run --name rm \
    --restart=always \
    --network=mybridge --ip=172.1.13.11 \
    -v /root/tmp/dk/redis/data:/data \
    -v /root/tmp/dk/redis/redis.conf:/etc/redis/redis.conf \
    -v /root/tmp/dk/redis/sentinel.conf:/etc/redis/sentinel.conf \
    -d cffycls/redis5:1.7
docker run --name rs \
    --restart=always \
    --network=mybridge --ip=172.1.13.12 \
    -v /root/tmp/dk/redis_slave/data:/data \
    -v /root/tmp/dk/redis_slave/redis.conf:/etc/redis/redis.conf \
    -v /root/tmp/dk/redis_slave/sentinel.conf:/etc/redis/sentinel.conf \
    -d cffycls/redis5:1.7

Referring to 《redis集群实现(六) 容灾与宕机恢复》 and 《Redis及其Sentinel配置项详细说明》, modify the config file:

#Directory for any data the sentinel produces
dir /data/sentinel

#<master-name> <ip> <redis-port> <quorum>
#Monitor name, ip, port, and the minimum number of sentinels that must agree
sentinel monitor mymaster1 172.1.50.11 6379 2
sentinel monitor mymaster2 172.1.50.12 6379 2
sentinel monitor mymaster3 172.1.50.13 6379 2

#sentinel down-after-milliseconds <master-name> <milliseconds>
#Monitor name and the timeout after which the node is considered down
# Default is 30 seconds.
sentinel down-after-milliseconds mymaster1 30000
sentinel down-after-milliseconds mymaster2 30000
sentinel down-after-milliseconds mymaster3 30000

#sentinel parallel-syncs <master-name> <numslaves>
#Monitor name; set to 1 so that only one slave at a time is unable to serve requests during a resync
sentinel parallel-syncs mymaster1 1
sentinel parallel-syncs mymaster2 1
sentinel parallel-syncs mymaster3 1

#Default values
# Default is 3 minutes.
sentinel failover-timeout mymaster1 180000
sentinel failover-timeout mymaster2 180000
sentinel failover-timeout mymaster3 180000

Create the corresponding folders (xx/data/sentinel) for both containers, as sketched below, then restart the two containers.
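A minimal sketch of the host-side preparation, assuming the volume paths from the docker run commands above (the dir /data/sentinel setting lives under the mounted /data):

mkdir -p /root/tmp/dk/redis/data/sentinel
mkdir -p /root/tmp/dk/redis_slave/data/sentinel

Then enter rm and start the sentinel: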

/ # redis-sentinel /etc/redis/sentinel.conf
... ... 
14:X 11 Jul 2019 18:25:24.418 # +monitor master mymaster3 172.1.50.13 6379 quorum 2
14:X 11 Jul 2019 18:25:24.419 # +monitor master mymaster1 172.1.50.11 6379 quorum 2
14:X 11 Jul 2019 18:25:24.419 # +monitor master mymaster2 172.1.50.12 6379 quorum 2
14:X 11 Jul 2019 18:25:24.421 * +slave slave 172.1.30.12:6379 172.1.30.12 6379 @ mymaster1 172.1.50.11 6379
14:X 11 Jul 2019 18:25:24.425 * +slave slave 172.1.30.13:6379 172.1.30.13 6379 @ mymaster2 172.1.50.12 6379
14:X 11 Jul 2019 18:26:14.464 # +sdown master mymaster3 172.1.50.13 6379 

As the saying goes, "there is no need to monitor the slaves; monitor the master and its slaves are added to sentinel automatically." Still, the +sdown looked wrong. Inspection showed that sentinel.conf had been modified: redis rewrites it automatically at runtime. Rerunning the command picked up the now-stabilized config: /data/sentinel had gained quotes, and the monitor section had been rewritten to this:

sentinel monitor mymaster2 172.1.50.12 6379 2
sentinel config-epoch mymaster2 0
sentinel leader-epoch mymaster2 0
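What a running sentinel currently believes about a master and its slaves can also be queried directly (a sketch, assuming the default sentinel port 26379):

/ # redis-cli -p 26379 sentinel master mymaster1
/ # redis-cli -p 26379 sentinel slaves mymaster1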

c. Handling sentinel anomalies

As seen above, every subsequent sentinel start logs +sdown (which does not match the cluster's actual state). Looking inside the cluster shows:

819ad37676cc77b6691d0e74258c9f8b2d163121 172.1.50.13:6379@16379 slave cd2d78f87dd8a696dc127f762a168129ab91d9c6 0 1562843221035 10 connected
775bf0b33a34898a6a33bee85299982aae0d8a72 172.1.30.13:6379@16379 slave f02ee958993c79b63ffbef5238bb65b3cf552418 0 1562843220030 12 connected
ee0dcbbcc3634ca6e5d079835695bfe822ce17e6 172.1.50.11:6379@16379 myself,master - 0 1562843219000 11 connected 2185-5460 5462-7646
b69937a22d69d71596167104a3c2a9b8e308622c 172.1.30.12:6379@16379 slave ee0dcbbcc3634ca6e5d079835695bfe822ce17e6 0 1562843218000 11 connected
f02ee958993c79b63ffbef5238bb65b3cf552418 172.1.50.12:6379@16379 master - 0 1562843218000 12 connected 7647-13107
cd2d78f87dd8a696dc127f762a168129ab91d9c6 172.1.30.11:6379@16379 master - 0 1562843219029 10 connected 0-2184 5461 13108-16383

It turns out 50.13 and 30.13 had swapped master and slave roles, probably left over from the earlier clm4/cls4 shutdown.
The first thought was to restore the initial layout, so the plan was to switch this master/slave pair back first.

-- Approach 1: manually stop the 30.11 master
Result: 30.12 (cls2) became master, a ghost entry appeared [:0@0 slave,noaddr], and 50.11 stayed a slave. Failure.
Because the cluster was created with redis-cli --cluster create, there is no fixed master/slave pairing between specific nodes; the PHP client could still fetch the same data as before.
-- Approach 2: keep going and manually stop every slave node outside the plan
docker stop cls1 cls2 cls3
Wait until all remaining nodes become masters (refreshing the web page shows no change), at which point:

b69937a22d69d71596167104a3c2a9b8e308622c 172.1.30.12:6379@16379 master,fail - 1562850125182 1562850123000 14 connected
5a95cbf53f635b1bd28dad6f25ed1e093bc5a2ba :0@0 slave,noaddr - 1562833027365 1562833027365 9 disconnected
775bf0b33a34898a6a33bee85299982aae0d8a72 172.1.30.13:6379@16379 slave,fail f02ee958993c79b63ffbef5238bb65b3cf552418 1562850125182 1562850121174 12 connected
f02ee958993c79b63ffbef5238bb65b3cf552418 172.1.50.12:6379@16379 myself,master - 0 1562850409000 12 connected 7647-13107
819ad37676cc77b6691d0e74258c9f8b2d163121 172.1.50.13:6379@16379 master - 0 1562850408000 15 connected 0-2184 5461 13108-16383
ee0dcbbcc3634ca6e5d079835695bfe822ce17e6 172.1.50.11:6379@16379 master - 0 1562850410051 16 connected 2185-5460 5462-7646
cd2d78f87dd8a696dc127f762a168129ab91d9c6 172.1.30.11:6379@16379 master,fail - 1562849524911 1562849524503 10 connected

OK. Restart with docker start cls1 cls2 cls3 and look again: the expected state is reached, with every 50.x node promoted to master. Back on the sentinels, kill redis-sentinel and restart it; the startup errors are gone.
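In hindsight, a gentler way to swap a single master/slave pair is a manual failover issued to the replica that should be promoted, rather than stopping containers (a sketch, assuming the addresses above; CLUSTER FAILOVER must be sent to the replica, not the master):

#promote 172.1.50.13 over its current master
/ # redis-cli -h 172.1.50.13 -p 6379 cluster failover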

2. Outage testing

a. Initial test: stopping the master nodes

Start a second sentinel, this time on the session slave (rs), for side-by-side observation:

16:X 11 Jul 2019 21:19:47.255 # +monitor master mymaster3 172.1.50.13 6379 quorum 2
16:X 11 Jul 2019 21:19:47.256 # +monitor master mymaster1 172.1.50.11 6379 quorum 2
16:X 11 Jul 2019 21:19:47.256 # +monitor master mymaster2 172.1.50.12 6379 quorum 2
16:X 11 Jul 2019 21:19:47.260 * +slave slave 172.1.30.11:6379 172.1.30.11 6379 @ mymaster3 172.1.50.13 6379
16:X 11 Jul 2019 21:19:47.264 * +slave slave 172.1.30.12:6379 172.1.30.12 6379 @ mymaster1 172.1.50.11 6379
16:X 11 Jul 2019 21:19:47.267 * +slave slave 172.1.30.13:6379 172.1.30.13 6379 @ mymaster2 172.1.50.12 6379
16:X 11 Jul 2019 21:19:48.252 * +sentinel sentinel 6b0995ba08e950c69848e3b2ffaf468bb6662626 172.1.13.11 26379 @ mymaster3 172.1.50.13 6379
16:X 11 Jul 2019 21:19:48.258 * +sentinel sentinel 6b0995ba08e950c69848e3b2ffaf468bb6662626 172.1.13.11 26379 @ mymaster1 172.1.50.11 6379
16:X 11 Jul 2019 21:19:48.261 * +sentinel sentinel 6b0995ba08e950c69848e3b2ffaf468bb6662626 172.1.13.11 26379 @ mymaster2 172.1.50.12 6379

#Outage: docker stop clm1, then observe:
16:X 11 Jul 2019 21:23:11.259 # +sdown master mymaster1 172.1.50.11 6379
16:X 11 Jul 2019 21:23:11.327 # +new-epoch 1
16:X 11 Jul 2019 21:23:11.329 # +vote-for-leader 6b0995ba08e950c69848e3b2ffaf468bb6662626 1
16:X 11 Jul 2019 21:23:12.370 # +odown master mymaster1 172.1.50.11 6379 #quorum 2/2
16:X 11 Jul 2019 21:23:12.371 # Next failover delay: I will not start a failover before Thu Jul 11 21:29:11 2019
16:X 11 Jul 2019 21:24:00.417 # +config-update-from sentinel 6b0995ba08e950c69848e3b2ffaf468bb6662626 172.1.13.11 26379 @ mymaster1 172.1.50.11 6379
16:X 11 Jul 2019 21:24:00.418 # +switch-master mymaster1 172.1.50.11 6379 172.1.30.12 6379
16:X 11 Jul 2019 21:24:00.418 * +slave slave 172.1.50.11:6379 172.1.50.11 6379 @ mymaster1 172.1.30.12 6379
16:X 11 Jul 2019 21:24:30.484 # +sdown slave 172.1.50.11:6379 172.1.50.11 6379 @ mymaster1 172.1.30.12 6379

It is clear that sentinels persist the current state back into their config file (masters and slaves do the same; the command help even lists a command dedicated to exporting the config). The web side stayed normal.
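That command is presumably CONFIG REWRITE, which rewrites the config file from the in-memory configuration (a sketch against one of the nodes above):

/ # redis-cli -h 172.1.30.12 config rewrite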
Continue with docker stop clm2: the familiar line "Next failover delay: I will not start a failover" shows up on rm, the config file is updated, and a web refresh reports Host is unreachable: 172.1.50.11:6379 but still returns results.
Continue further with docker stop clm3: the same line shows up on rs and the config file is updated. The web page can no longer fetch data, showing
Host is unreachable: 172.1.50.11:6379 Host is unreachable: 172.1.50.12:6379 Host is unreachable: 172.1.50.13:6379. The cluster itself, however, remains healthy.

b. Web access test: data operations on the PHP side

require "../vendor/autoload.php";
//使用swoole时可以保持在线,从缓存当中读取,根据集群状态更新
$servers = ['172.1.50.11:6379', '172.1.50.12:6379', '172.1.50.13:6379','172.1.30.11:6379', '172.1.30.12:6379', '172.1.30.13:6379'];

//Find how the slots are distributed across all nodes
$rs = [];
$slotNodes = [];
foreach ($servers as $addr){
    $server=explode(':',$addr);
    try{
        $r = new Redis();
        $r->connect($server[0], (int) $server[1], 0.2);
        $slotInfo = $r->rawCommand('cluster','slots');
        foreach ($slotInfo as $ix => $value){
            $slotNodes[$value[2][0].':'.$value[2][1].' '.($ix+1)]=[$value[0], $value[1]];
        }
        $rs[$addr] = $r;
        //Per-node shard and slot-range information
        foreach ($slotNodes as $slot => $value){
            $addr = explode(' ', $slot)[0];
            if(!isset($rs[$addr])){
                $server = explode(':', $addr);
                $r = new Redis();
                $r->connect($server[0], (int) $server[1]);
                $rs[$addr] = $r;
            }
        }
        break;
    }catch (\RedisException $e){
        echo $e->getMessage(). ': '. $addr;
        continue;
    }
}
echo '<pre>';
//print_r($rs);

//Compute slots and test batched reads
$crc = new \Predis\Cluster\Hash\CRC16();
$getAddr = function ($key) use (&$slotNodes, &$crc, &$rs) {
    $code = $crc->hash($key) % 16384;
    foreach ($slotNodes as $addr => $boundry){
        if( $code>=$boundry[0] && $code<=$boundry[1] ){
            $host =explode(' ', $addr)[0];
            //print_r(['OK: '. $addr => $boundry, $host, $rs]);
            return $addr. ' = '. $rs[$host]->get($key);
        }
    }
};

$result=[];
for($i=10; $i<30; $i++){
    $key = 'set-'.$i;
    $result[$key] = $getAddr($key);
}
print_r($result);

foreach ($rs as $r){
    $r->close();
}

Result of taking the 50.x nodes down one by one until all were gone (the cluster holds 10,000 entries):

Operation timed out: 172.1.50.11:6379Operation timed out: 172.1.50.12:6379Operation timed out: 172.1.50.13:6379
Array
(
    [set-10] => 172.1.30.11:6379 6 = bc1c1134c6b9da41dce82bb7b50d6fa5
    [set-11] => 172.1.30.13:6379 1 = 78e23ac793c7ce7a7ec498f46c7a0ee0
    [set-12] => 172.1.30.12:6379 3 = 90191fa0ba4d3ee127c5bc2295a524c7
    [set-13] => 172.1.30.12:6379 2 = bb626b73081c69ae737a4f0b66af376f
    [set-14] => 172.1.30.11:6379 6 = c7b5a610b9aa9640a277ec0d19336aea
    [set-15] => 172.1.30.13:6379 1 = ef2a7c6c2ebc01c937551f59ce1be516
    [set-16] => 172.1.30.12:6379 3 = d9cb45c5fe69875f9c3cea47f3d7c81d
    [set-17] => 172.1.30.12:6379 2 = 2ecc3cb21debbc6d24c07f18c036c66f
    [set-18] => 172.1.30.11:6379 6 = e6186afca37c42ccb828fdc94fb34be8
    [set-19] => 172.1.30.13:6379 1 = 50400663e0ab9eea2cd0e3389f8e9007
    [set-20] => 172.1.30.13:6379 1 = 46e1162866db417d987b64bc89690da3
    [set-21] => 172.1.30.11:6379 6 = 08dbce4c73e6ba90e3f54da890e63ba3
    [set-22] => 172.1.30.11:6379 4 = df10958fd828d505c9a91d97c8641355
    [set-23] => 172.1.30.12:6379 3 = e8a9615af5b2ed5360987e5ed9d49cea
    [set-24] => 172.1.30.13:6379 1 = 00cd8741e8828a1ddb7a272e89b64aeb
    [set-25] => 172.1.30.11:6379 6 = c808e68289fb9dcd93f19629e2dd7795
    [set-26] => 172.1.30.11:6379 4 = d85eeb441f895ff7cac12ecf7c08313b
    [set-27] => 172.1.30.12:6379 3 = cbf006ee0c96b4d585cfbed7d0edecd0
    [set-28] => 172.1.30.13:6379 1 = 01b0268951256097595c714bb90c3b8c
    [set-29] => 172.1.30.11:6379 6 = 853fa04d88512a0dfe4f27e72c08b125
)

As expected. After docker start clm1 clm2 clm3, the 30.x nodes are now all masters.

c. Combined test: batch outages

Test 1: all master nodes down simultaneously
Take all current 30.x master nodes offline at once. Both sentinels kept printing election logs, and after roughly 10 minutes no leader had been elected. Starting one or even all of the 30.x nodes did not help; cluster info inside a container reported the cluster as failed. The cluster could not reassemble.

127.0.0.1:6379> cluster nodes
cd2d78f87dd8a696dc127f762a168129ab91d9c6 172.1.30.11:6379@16379 master - 0 1562854957611 21 connected 0-2184 5461 13108-16383
b69937a22d69d71596167104a3c2a9b8e308622c 172.1.30.12:6379@16379 master,fail? - 1562854249393 1562854248493 18 connected 2185-5460 5462-7646
819ad37676cc77b6691d0e74258c9f8b2d163121 172.1.50.13:6379@16379 slave cd2d78f87dd8a696dc127f762a168129ab91d9c6 0 1562854958615 21 connected
f02ee958993c79b63ffbef5238bb65b3cf552418 172.1.50.12:6379@16379 slave 775bf0b33a34898a6a33bee85299982aae0d8a72 0 1562854956000 19 connected
775bf0b33a34898a6a33bee85299982aae0d8a72 172.1.30.13:6379@16379 master,fail? - 1562854249393 1562854247000 19 connected 7647-13107
ee0dcbbcc3634ca6e5d079835695bfe822ce17e6 172.1.50.11:6379@16379 myself,slave b69937a22d69d71596167104a3c2a9b8e308622c 0 1562854955000 16 connected

Restarting individual nodes has no effect; only after starting all of them (restoring every old node) does the cluster return to normal.

Test 2: two nodes down
Same as above: the cluster fails to rebuild. This comes down to the Raft-style election mechanism, which requires more than half the votes: fewer than half of the master nodes may be down, or no new master can be elected and cluster service cannot recover. If this happens, restart the affected servers to get back to a state where the election can complete, then wait for it to finish.
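Concretely, with 3 masters a majority is 2, so at most 1 master may be down at any time. Whether the cluster still considers itself healthy can be checked from any reachable node (a sketch):

/ # redis-cli -h 172.1.50.11 cluster info | grep cluster_state
#cluster_state:ok means a majority of masters is alive; cluster_state:fail means too many are down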

Test 3: batch master/slave switchover
Shut down the 30.x (clsx) nodes one at a time, checking at either sentinel that a new master has been elected after each stop (if not, no new master can be elected and the previous step has to be undone first), until only the 50.x (clmx) nodes remain, at which point every remaining node is a master.
The sentinel's view of the completed switchover:

43:X 11 Jul 2019 22:55:15.208 # +sdown master mymaster2 172.1.30.13 6379
43:X 11 Jul 2019 22:55:15.326 # +new-epoch 22
43:X 11 Jul 2019 22:55:15.329 # +vote-for-leader 82710666110f7241e0d4aa6fa445fb95790fac86 22
43:X 11 Jul 2019 22:55:16.268 # +odown master mymaster2 172.1.30.13 6379 #quorum 2/2
43:X 11 Jul 2019 22:55:16.269 # Next failover delay: I will not start a failover before Thu Jul 11 23:01:15 2019
43:X 11 Jul 2019 22:55:16.414 # +config-update-from sentinel 82710666110f7241e0d4aa6fa445fb95790fac86 172.1.13.12 26379 @ mymaster2 172.1.30.13 6379
43:X 11 Jul 2019 22:55:16.414 # +switch-master mymaster2 172.1.30.13 6379 172.1.50.12 6379
43:X 11 Jul 2019 22:55:16.415 * +slave slave 172.1.30.13:6379 172.1.30.13 6379 @ mymaster2 172.1.50.12 6379
43:X 11 Jul 2019 22:57:32.364 # +sdown slave 172.1.30.13:6379 172.1.30.13 6379 @ mymaster2 172.1.50.12 6379

Restart with docker start cls1 cls2 cls3; the nodes rejoin automatically as slaves, and the master/slave switchover is complete.
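The restored roles can be confirmed from any node (a sketch):

/ # redis-cli -h 172.1.50.11 cluster nodes | grep slave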

Summary

a. Data migration (scaling out and in)

All operations use the cluster subcommands of redis-cli --cluster: add-node then rebalance to scale out; rebalance with weight=0 then del-node to scale in.
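A minimal scale-out sketch, mirroring the scale-in performed above (172.1.50.21 stands in for a hypothetical new node; any existing node works as the entry point):

#add the new empty master to the cluster
/ # redis-cli --cluster add-node 172.1.50.21:6379 172.1.50.11:6379
#rebalance skips empty masters unless told to use them
/ # redis-cli --cluster rebalance 172.1.50.11:6379 --cluster-use-empty-masters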

b. Outages

When nodes go down, the sentinels handle election, switchover, and marking nodes down automatically. Watch the sentinel logs (a logfile can be set in the config; above, they were watched from the cli). If the service stays unavailable, restart the affected machines and check the election; if the election cannot complete, more than half of the machines are +down.
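A sketch of sending sentinel output to a file instead of the console, assuming the dir setting from the config above:

#in sentinel.conf; an empty string (the default) logs to stdout
logfile "/data/sentinel/sentinel.log"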
