合 PostgreSQL高可用之repmgr(1主2从+1witness)+Pgpool-II实现主从切换+读写分离+负载均衡
Tags: PG高可用Pgpool-IIpgpoolAdmin负载均衡读写分离repmgr主从流复制主从切换witness
五、检查repmgr集群
1 2 3 4 5 | repmgr -f /pg13/pg13/repmgr.conf cluster matrix repmgr -f /pg13/pg13/repmgr.conf cluster crosscheck repmgr -f /pg13/pg13/repmgr.conf node status repmgr -f /pg13/pg13/repmgr.conf node check |
一些结果:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 | [pg13@lhrrepmgr64362 ~]$ repmgr -f /pg13/pg13/repmgr.conf cluster matrix INFO: connecting to database Name | ID | 1 | 2 | 3 | 4 ----------------+----+---+---+---+--- lhrrepmgr64361 | 1 | * | * | * | * lhrrepmgr64362 | 2 | * | * | * | * lhrrepmgr64363 | 3 | * | * | * | * lhrrepmgr64364 | 4 | * | * | * | * [pg13@lhrrepmgr64361 ~]$ repmgr -f /pg13/pg13/repmgr.conf node status Node "lhrrepmgr64361": PostgreSQL version: 13.2 Total data size: 37 MB Conninfo: host=172.72.6.61 user=repmgr password=lhr dbname=repmgr connect_timeout=2 Role: primary WAL archiving: enabled Archive command: test ! -f /pg13/archive/%f && cp %p /pg13/archive/%f WALs pending archiving: 0 pending files Replication connections: 2 (of maximal 10) Replication slots: 0 physical (of maximal 10; 0 missing) Replication lag: n/a [pg13@lhrrepmgr64361 ~]$ [pg13@lhrrepmgr64361 ~]$ repmgr -f /pg13/pg13/repmgr.conf node check Node "lhrrepmgr64361": Server role: OK (node is primary) Replication lag: OK (N/A - node is primary) WAL archiving: OK (0 pending archive ready files) Upstream connection: OK (N/A - node is primary) Downstream servers: OK (2 of 2 downstream nodes attached) Replication slots: OK (node has no physical replication slots) Missing physical replication slots: OK (node has no missing physical replication slots) Configured data directory: OK (configured "data_directory" is "/pg13/pgdata") [pg13@lhrrepmgr64362 ~]$ repmgr -f /pg13/pg13/repmgr.conf node status Node "lhrrepmgr64362": PostgreSQL version: 13.2 Total data size: 37 MB Conninfo: host=172.72.6.62 user=repmgr password=lhr dbname=repmgr connect_timeout=2 Role: standby WAL archiving: disabled (on standbys "archive_mode" must be set to "always" to be effective) Archive command: test ! -f /pg13/archive/%f && cp %p /pg13/archive/%f WALs pending archiving: 0 pending files Replication connections: 0 (of maximal 10) Replication slots: 0 physical (of maximal 10; 0 missing) Upstream node: lhrrepmgr64361 (ID: 1) Replication lag: 0 seconds Last received LSN: 0/8001480 Last replayed LSN: 0/8001480 [pg13@lhrrepmgr64364 ~]$ repmgr -f /pg13/pg13/repmgr.conf node status Node "lhrrepmgr64364": PostgreSQL version: 13.2 Total data size: 29 MB Conninfo: host=172.72.6.64 user=repmgr password=lhr dbname=repmgr connect_timeout=2 Role: witness WAL archiving: enabled Archive command: test ! -f /pg13/archive/%f && cp %p /pg13/archive/%f WALs pending archiving: 0 pending files Replication connections: 0 (of maximal 10) Replication slots: 0 physical (of maximal 10; 0 missing) Replication lag: n/a |
六、主从切换
6.1、switchover正常主从切换
switchover切换即主变备,备变主。
官网:https://repmgr.org/docs/current/repmgr-standby-switchover.html
6.1.1、把备库62变为主库
在62上操作:
1 2 3 4 5 6 7 8 | repmgr -f /pg13/pg13/repmgr.conf cluster show repmgr -f /pg13/pg13/repmgr.conf standby switchover --siblings-follow --dry-run --force-rewind repmgr -f /pg13/pg13/repmgr.conf standby switchover --siblings-follow --force-rewind repmgr -f /pg13/pg13/repmgr.conf cluster show -- 可以debug打印详细的切换过程 repmgr -f /pg13/pg13/repmgr.conf standby switchover --siblings-follow --force-rewind --log-level DEBUG --verbose |
--siblings-follow 表示所有的从库同步源自动改成最新的主库节点
--force-rewind 如果repmgr检测到需要执行pg_rewind(同步)的时候,在执行pg_rewind之前,在新主节点执行checkpoint
执行过程:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 | [pg13@lhrrepmgr64362 ~]$ repmgr -f /pg13/pg13/repmgr.conf cluster show ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string ----+----------------+---------+-----------+----------------+----------+----------+----------+--------------------------------------------------------------------------- 1 | lhrrepmgr64361 | primary | * running | | default | 100 | 3 | host=172.72.6.61 user=repmgr password=lhr dbname=repmgr connect_timeout=2 2 | lhrrepmgr64362 | standby | running | lhrrepmgr64361 | default | 100 | 3 | host=172.72.6.62 user=repmgr password=lhr dbname=repmgr connect_timeout=2 3 | lhrrepmgr64363 | standby | running | lhrrepmgr64361 | default | 100 | 3 | host=172.72.6.63 user=repmgr password=lhr dbname=repmgr connect_timeout=2 4 | lhrrepmgr64364 | witness | * running | lhrrepmgr64361 | default | 0 | n/a | host=172.72.6.64 user=repmgr password=lhr dbname=repmgr connect_timeout=2 [pg13@lhrrepmgr64362 ~]$ [pg13@lhrrepmgr64362 ~]$ repmgr -f /pg13/pg13/repmgr.conf standby switchover --siblings-follow --dry-run --force-rewind NOTICE: checking switchover on node "lhrrepmgr64362" (ID: 2) in --dry-run mode INFO: prerequisites for using pg_rewind are met INFO: SSH connection to host "172.72.6.61" succeeded INFO: able to execute "repmgr" on remote host "172.72.6.61" INFO: all sibling nodes are reachable via SSH INFO: 3 walsenders required, 10 available INFO: demotion candidate is able to make replication connection to promotion candidate INFO: 0 pending archive files INFO: replication lag on this standby is 0 seconds NOTICE: local node "lhrrepmgr64362" (ID: 2) would be promoted to primary; current primary "lhrrepmgr64361" (ID: 1) would be demoted to standby INFO: following shutdown command would be run on node "lhrrepmgr64361": "/pg13/pg13/bin/pg_ctl -D '/pg13/pgdata' -W -m fast stop" INFO: parameter "shutdown_check_timeout" is set to 60 seconds INFO: prerequisites for executing STANDBY SWITCHOVER are met [pg13@lhrrepmgr64362 ~]$ [pg13@lhrrepmgr64362 ~]$ repmgr -f /pg13/pg13/repmgr.conf standby switchover --siblings-follow --force-rewind NOTICE: executing switchover on node "lhrrepmgr64362" (ID: 2) NOTICE: local node "lhrrepmgr64362" (ID: 2) will be promoted to primary; current primary "lhrrepmgr64361" (ID: 1) will be demoted to standby NOTICE: stopping current primary node "lhrrepmgr64361" (ID: 1) NOTICE: issuing CHECKPOINT on node "lhrrepmgr64361" (ID: 1) DETAIL: executing server command "/pg13/pg13/bin/pg_ctl -D '/pg13/pgdata' -W -m fast stop" INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout") INFO: checking for primary shutdown; 2 of 60 attempts ("shutdown_check_timeout") NOTICE: current primary has been cleanly shut down at location 0/9000028 NOTICE: promoting standby to primary DETAIL: promoting server "lhrrepmgr64362" (ID: 2) using pg_promote() NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete NOTICE: STANDBY PROMOTE successful DETAIL: server "lhrrepmgr64362" (ID: 2) was successfully promoted to primary NOTICE: issuing CHECKPOINT on node "lhrrepmgr64362" (ID: 2) ERROR: unable to execute CHECKPOINT INFO: local node 1 can attach to rejoin target node 2 DETAIL: local node's recovery point: 0/9000028; rejoin target node's fork point: 0/90000A0 NOTICE: setting node 1 upstream to node 2 WARNING: unable to ping "host=172.72.6.61 user=repmgr password=lhr dbname=repmgr connect_timeout=2" DETAIL: PQping() returned "PQPING_NO_RESPONSE" NOTICE: starting server using "/pg13/pg13/bin/pg_ctl -w -D '/pg13/pgdata' start" NOTICE: NODE REJOIN successful DETAIL: node 1 is now attached to node 2 NOTICE: node "lhrrepmgr64362" (ID: 2) promoted to primary, node "lhrrepmgr64361" (ID: 1) demoted to standby NOTICE: executing STANDBY FOLLOW on 2 of 2 siblings INFO: STANDBY FOLLOW successfully executed on all reachable sibling nodes NOTICE: switchover was successful DETAIL: node "lhrrepmgr64362" is now primary and node "lhrrepmgr64361" is attached as standby NOTICE: STANDBY SWITCHOVER has completed successfully [pg13@lhrrepmgr64362 ~]$ [pg13@lhrrepmgr64362 ~]$ repmgr -f /pg13/pg13/repmgr.conf cluster show ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string ----+----------------+---------+-----------+----------------+----------+----------+----------+--------------------------------------------------------------------------- 1 | lhrrepmgr64361 | standby | running | lhrrepmgr64362 | default | 100 | 3 | host=172.72.6.61 user=repmgr password=lhr dbname=repmgr connect_timeout=2 2 | lhrrepmgr64362 | primary | * running | | default | 100 | 4 | host=172.72.6.62 user=repmgr password=lhr dbname=repmgr connect_timeout=2 3 | lhrrepmgr64363 | standby | running | lhrrepmgr64362 | default | 100 | 3 | host=172.72.6.63 user=repmgr password=lhr dbname=repmgr connect_timeout=2 4 | lhrrepmgr64364 | witness | * running | lhrrepmgr64362 | default | 0 | n/a | host=172.72.6.64 user=repmgr password=lhr dbname=repmgr connect_timeout=2 C:\Users\lhrxxt>psql -U repmgr -h192.168.66.35 -p64361 -d repmgr Password for user repmgr: psql (13.2) Type "help" for help. repmgr=# select * from nodes; node_id | upstream_node_id | active | node_name | type | location | priority | conninfo | repluser | slot_name | config_file ---------+------------------+--------+----------------+---------+----------+----------+---------------------------------------------------------------------------+----------+-----------+------------------------ 2 | | t | lhrrepmgr64362 | primary | default | 100 | host=172.72.6.62 user=repmgr password=lhr dbname=repmgr connect_timeout=2 | repmgr | | /pg13/pg13/repmgr.conf 1 | 2 | t | lhrrepmgr64361 | standby | default | 100 | host=172.72.6.61 user=repmgr password=lhr dbname=repmgr connect_timeout=2 | repmgr | | /pg13/pg13/repmgr.conf 3 | 2 | t | lhrrepmgr64363 | standby | default | 100 | host=172.72.6.63 user=repmgr password=lhr dbname=repmgr connect_timeout=2 | repmgr | | /pg13/pg13/repmgr.conf 4 | 2 | t | lhrrepmgr64364 | witness | default | 0 | host=172.72.6.64 user=repmgr password=lhr dbname=repmgr connect_timeout=2 | repmgr | | /pg13/pg13/repmgr.conf (4 rows) |
可以看到,主库变为lhrrepmgr64362,而lhrrepmgr64361变为了从库。
6.1.2、把新备库61切换为主库
1 2 3 4 5 6 7 8 | -- 在61上做操作 repmgr -f /pg13/pg13/repmgr.conf cluster show repmgr -f /pg13/pg13/repmgr.conf standby switchover --siblings-follow --dry-run --force-rewind repmgr -f /pg13/pg13/repmgr.conf standby switchover --siblings-follow --force-rewind -- 可以debug打印详细的切换过程 repmgr -f /pg13/pg13/repmgr.conf standby switchover --siblings-follow --force-rewind --log-level DEBUG --verbose |
执行过程:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 | [pg13@lhrrepmgr64361 ~]$ repmgr -f /pg13/pg13/repmgr.conf cluster show ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string ----+----------------+---------+-----------+----------------+----------+----------+----------+--------------------------------------------------------------------------- 1 | lhrrepmgr64361 | standby | running | lhrrepmgr64362 | default | 100 | 4 | host=172.72.6.61 user=repmgr password=lhr dbname=repmgr connect_timeout=2 2 | lhrrepmgr64362 | primary | * running | | default | 100 | 4 | host=172.72.6.62 user=repmgr password=lhr dbname=repmgr connect_timeout=2 3 | lhrrepmgr64363 | standby | running | lhrrepmgr64362 | default | 100 | 4 | host=172.72.6.63 user=repmgr password=lhr dbname=repmgr connect_timeout=2 4 | lhrrepmgr64364 | witness | * running | lhrrepmgr64362 | default | 0 | n/a | host=172.72.6.64 user=repmgr password=lhr dbname=repmgr connect_timeout=2 [pg13@lhrrepmgr64361 ~]$ [pg13@lhrrepmgr64361 ~]$ repmgr -f /pg13/pg13/repmgr.conf standby switchover --siblings-follow --dry-run --force-rewind NOTICE: checking switchover on node "lhrrepmgr64361" (ID: 1) in --dry-run mode INFO: prerequisites for using pg_rewind are met INFO: SSH connection to host "172.72.6.62" succeeded INFO: able to execute "repmgr" on remote host "172.72.6.62" INFO: all sibling nodes are reachable via SSH INFO: 3 walsenders required, 10 available INFO: demotion candidate is able to make replication connection to promotion candidate INFO: 0 pending archive files INFO: replication lag on this standby is 0 seconds NOTICE: local node "lhrrepmgr64361" (ID: 1) would be promoted to primary; current primary "lhrrepmgr64362" (ID: 2) would be demoted to standby INFO: following shutdown command would be run on node "lhrrepmgr64362": "/pg13/pg13/bin/pg_ctl -D '/pg13/pgdata' -W -m fast stop" INFO: parameter "shutdown_check_timeout" is set to 60 seconds INFO: prerequisites for executing STANDBY SWITCHOVER are met [pg13@lhrrepmgr64361 ~]$ [pg13@lhrrepmgr64361 ~]$ repmgr -f /pg13/pg13/repmgr.conf standby switchover --siblings-follow --force-rewind NOTICE: executing switchover on node "lhrrepmgr64361" (ID: 1) NOTICE: local node "lhrrepmgr64361" (ID: 1) will be promoted to primary; current primary "lhrrepmgr64362" (ID: 2) will be demoted to standby NOTICE: stopping current primary node "lhrrepmgr64362" (ID: 2) NOTICE: issuing CHECKPOINT on node "lhrrepmgr64362" (ID: 2) DETAIL: executing server command "/pg13/pg13/bin/pg_ctl -D '/pg13/pgdata' -W -m fast stop" INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout") INFO: checking for primary shutdown; 2 of 60 attempts ("shutdown_check_timeout") NOTICE: current primary has been cleanly shut down at location 0/A000028 NOTICE: promoting standby to primary DETAIL: promoting server "lhrrepmgr64361" (ID: 1) using pg_promote() NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete NOTICE: STANDBY PROMOTE successful DETAIL: server "lhrrepmgr64361" (ID: 1) was successfully promoted to primary NOTICE: issuing CHECKPOINT on node "lhrrepmgr64361" (ID: 1) ERROR: unable to execute CHECKPOINT INFO: local node 2 can attach to rejoin target node 1 DETAIL: local node's recovery point: 0/A000028; rejoin target node's fork point: 0/A0000A0 NOTICE: setting node 2's upstream to node 1 WARNING: unable to ping "host=172.72.6.62 user=repmgr password=lhr dbname=repmgr connect_timeout=2" DETAIL: PQping() returned "PQPING_NO_RESPONSE" NOTICE: starting server using "/pg13/pg13/bin/pg_ctl -w -D '/pg13/pgdata' start" NOTICE: NODE REJOIN successful DETAIL: node 2 is now attached to node 1 NOTICE: node "lhrrepmgr64361" (ID: 1) promoted to primary, node "lhrrepmgr64362" (ID: 2) demoted to standby NOTICE: executing STANDBY FOLLOW on 2 of 2 siblings INFO: STANDBY FOLLOW successfully executed on all reachable sibling nodes NOTICE: switchover was successful DETAIL: node "lhrrepmgr64361" is now primary and node "lhrrepmgr64362" is attached as standby NOTICE: STANDBY SWITCHOVER has completed successfully [pg13@lhrrepmgr64361 ~]$ repmgr -f /pg13/pg13/repmgr.conf cluster show ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string ----+----------------+---------+-----------+----------------+----------+----------+----------+--------------------------------------------------------------------------- 1 | lhrrepmgr64361 | primary | * running | | default | 100 | 5 | host=172.72.6.61 user=repmgr password=lhr dbname=repmgr connect_timeout=2 2 | lhrrepmgr64362 | standby | running | lhrrepmgr64361 | default | 100 | 4 | host=172.72.6.62 user=repmgr password=lhr dbname=repmgr connect_timeout=2 3 | lhrrepmgr64363 | standby | running | lhrrepmgr64361 | default | 100 | 4 | host=172.72.6.63 user=repmgr password=lhr dbname=repmgr connect_timeout=2 4 | lhrrepmgr64364 | witness | * running | lhrrepmgr64361 | default | 0 | n/a | host=172.72.6.64 user=repmgr password=lhr dbname=repmgr connect_timeout=2 |
6.2、failover切换
6.2.1、failover异常手工主从切换
主库出现故障,然后直接将从库提升为主库:
1 2 3 4 5 6 7 8 9 10 11 12 13 | -- 主库61宕机 pg_ctl -m fast stop -- 从库63提升为主库 repmgr -f /pg13/pg13/repmgr.conf cluster show repmgr -f /pg13/pg13/repmgr.conf --siblings-follow standby promote repmgr -f /pg13/pg13/repmgr.conf cluster show -- 等原主库61修复之后,由于新提升的主库63数据已经很新了,所以只能把61作为备库来运行 pg_ctl stop repmgr -f /pg13/pg13/repmgr.conf node rejoin -d 'host=172.72.6.63 user=repmgr dbname=repmgr connect_timeout=2' --force-rewind --dry-run --verbose repmgr -f /pg13/pg13/repmgr.conf node rejoin -d 'host=172.72.6.63 user=repmgr dbname=repmgr connect_timeout=2' --force-rewind --verbose |
执行过程:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 | -- 主库61宕机 [pg13@lhrrepmgr64361 ~]$ pg_ctl -m fast stop waiting for server to shut down.... done server stopped [pg13@lhrrepmgr64361 ~]$ repmgr -f /pg13/pg13/repmgr.conf cluster show ERROR: connection to database failed DETAIL: could not connect to server: Connection refused Is the server running on host "172.72.6.61" and accepting TCP/IP connections on port 5432? DETAIL: attempted to connect using: user=repmgr password=lhr connect_timeout=2 dbname=repmgr host=172.72.6.61 fallback_application_name=repmgr options=-csearch_path= -- 从库63提升为主库 [pg13@lhrrepmgr64363 ~]$ repmgr -f /pg13/pg13/repmgr.conf cluster show ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string ----+----------------+---------+---------------+------------------+----------+----------+----------+--------------------------------------------------------------------------- 1 | lhrrepmgr64361 | primary | ? unreachable | ? | default | 100 | | host=172.72.6.61 user=repmgr password=lhr dbname=repmgr connect_timeout=2 2 | lhrrepmgr64362 | standby | running | ? lhrrepmgr64361 | default | 100 | 5 | host=172.72.6.62 user=repmgr password=lhr dbname=repmgr connect_timeout=2 3 | lhrrepmgr64363 | standby | running | ? lhrrepmgr64361 | default | 100 | 5 | host=172.72.6.63 user=repmgr password=lhr dbname=repmgr connect_timeout=2 4 | lhrrepmgr64364 | witness | * running | ? lhrrepmgr64361 | default | 0 | n/a | host=172.72.6.64 user=repmgr password=lhr dbname=repmgr connect_timeout=2 WARNING: following issues were detected - unable to connect to node "lhrrepmgr64361" (ID: 1) - node "lhrrepmgr64361" (ID: 1) is registered as an active primary but is unreachable - unable to connect to node "lhrrepmgr64362" (ID: 2)'s upstream node "lhrrepmgr64361" (ID: 1) - unable to determine if node "lhrrepmgr64362" (ID: 2) is attached to its upstream node "lhrrepmgr64361" (ID: 1) - unable to connect to node "lhrrepmgr64363" (ID: 3)'s upstream node "lhrrepmgr64361" (ID: 1) - unable to determine if node "lhrrepmgr64363" (ID: 3) is attached to its upstream node "lhrrepmgr64361" (ID: 1) - unable to connect to node "lhrrepmgr64364" (ID: 4)'s upstream node "lhrrepmgr64361" (ID: 1)' HINT: execute with --verbose option to see connection error messages [pg13@lhrrepmgr64363 ~]$ repmgr -f /pg13/pg13/repmgr.conf --siblings-follow standby promote NOTICE: promoting standby to primary DETAIL: promoting server "lhrrepmgr64363" (ID: 3) using pg_promote() NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete NOTICE: STANDBY PROMOTE successful DETAIL: server "lhrrepmgr64363" (ID: 3) was successfully promoted to primary NOTICE: executing STANDBY FOLLOW on 2 of 2 siblings INFO: STANDBY FOLLOW successfully executed on all reachable sibling nodes [pg13@lhrrepmgr64363 ~]$ repmgr -f /pg13/pg13/repmgr.conf cluster show ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string ----+----------------+---------+-----------+----------------+----------+----------+----------+--------------------------------------------------------------------------- 1 | lhrrepmgr64361 | primary | - failed | ? | default | 100 | | host=172.72.6.61 user=repmgr password=lhr dbname=repmgr connect_timeout=2 2 | lhrrepmgr64362 | standby | running | lhrrepmgr64363 | default | 100 | 5 | host=172.72.6.62 user=repmgr password=lhr dbname=repmgr connect_timeout=2 3 | lhrrepmgr64363 | primary | * running | | default | 100 | 6 | host=172.72.6.63 user=repmgr password=lhr dbname=repmgr connect_timeout=2 4 | lhrrepmgr64364 | witness | * running | lhrrepmgr64363 | default | 0 | n/a | host=172.72.6.64 user=repmgr password=lhr dbname=repmgr connect_timeout=2 WARNING: following issues were detected - unable to connect to node "lhrrepmgr64361" (ID: 1) HINT: execute with --verbose option to see connection error messages |
修复好61后,再启动61数据库,作为63的备库:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 | [pg13@lhrrepmgr64361 pg13]$ repmgr -f /pg13/pg13/repmgr.conf node rejoin -d 'host=172.72.6.63 user=repmgr dbname=repmgr connect_timeout=2' --force-rewind --verbose NOTICE: using provided configuration file "/pg13/pg13/repmgr.conf" NOTICE: pg_rewind execution required for this node to attach to rejoin target node 3 DETAIL: rejoin target server timeline 6 forked off current database system timeline 5 before current recovery point 1/1C000028 INFO: prerequisites for using pg_rewind are met INFO: 0 files copied to "/tmp/repmgr-config-archive-lhrrepmgr64361" NOTICE: executing pg_rewind DETAIL: pg_rewind command is "/pg13/pg13/bin/pg_rewind -D '/pg13/pgdata' --source-server='host=172.72.6.63 user=repmgr password=lhr dbname=repmgr connect_timeout=2'" ERROR: pg_rewind execution failed DETAIL: pg_rewind: servers diverged at WAL location 1/190000A0 on timeline 5 pg_rewind: rewinding from last common checkpoint at 1/19000028 on timeline 5 pg_rewind: error: could not open file "/pg13/pgdata/pg_wal/00000005000000010000001A": No such file or directory pg_rewind: fatal: could not read WAL record at 1/1A000000 [pg13@lhrrepmgr64361 pg13]$ cp ./archive/00000005000000010000001A /pg13/pgdata/pg_wal/00000005000000010000001A [pg13@lhrrepmgr64361 pg13]$ repmgr -f /pg13/pg13/repmgr.conf node rejoin -d 'host=172.72.6.63 user=repmgr dbname=repmgr connect_timeout=2' --force-rewind --verbose NOTICE: using provided configuration file "/pg13/pg13/repmgr.conf" NOTICE: pg_rewind execution required for this node to attach to rejoin target node 3 DETAIL: rejoin target server timeline 6 forked off current database system timeline 5 before current recovery point 1/1C000028 INFO: prerequisites for using pg_rewind are met INFO: 0 files copied to "/tmp/repmgr-config-archive-lhrrepmgr64361" NOTICE: executing pg_rewind DETAIL: pg_rewind command is "/pg13/pg13/bin/pg_rewind -D '/pg13/pgdata' --source-server='host=172.72.6.63 user=repmgr password=lhr dbname=repmgr connect_timeout=2'" NOTICE: 0 files copied to /pg13/pgdata INFO: directory "/tmp/repmgr-config-archive-lhrrepmgr64361" deleted NOTICE: setting node 1's upstream to node 3' WARNING: unable to ping "host=172.72.6.61 user=repmgr password=lhr dbname=repmgr connect_timeout=2" DETAIL: PQping() returned "PQPING_NO_RESPONSE" NOTICE: starting server using "/pg13/pg13/bin/pg_ctl -w -D '/pg13/pgdata' start" INFO: node "lhrrepmgr64361" (ID: 1) is pingable INFO: node "lhrrepmgr64361" (ID: 1) has attached to its upstream node NOTICE: NODE REJOIN successful DETAIL: node 1 is now attached to node 3 [pg13@lhrrepmgr64361 pg13]$ repmgr -f /pg13/pg13/repmgr.conf cluster show ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string ----+----------------+---------+-----------+----------------+----------+----------+----------+--------------------------------------------------------------------------- 1 | lhrrepmgr64361 | standby | running | lhrrepmgr64363 | default | 100 | 5 | host=172.72.6.61 user=repmgr password=lhr dbname=repmgr connect_timeout=2 2 | lhrrepmgr64362 | standby | running | lhrrepmgr64363 | default | 100 | 6 | host=172.72.6.62 user=repmgr password=lhr dbname=repmgr connect_timeout=2 3 | lhrrepmgr64363 | primary | * running | | default | 100 | 6 | host=172.72.6.63 user=repmgr password=lhr dbname=repmgr connect_timeout=2 4 | lhrrepmgr64364 | witness | * running | lhrrepmgr64363 | default | 0 | n/a | host=172.72.6.64 user=repmgr password=lhr dbname=repmgr connect_timeout=2 |
6.2.2、failover异常自动主从切换
生产建议配置自动failover切换,要配置自动failover,则需要做以下事情,在所有节点均配置:
1、在所有节点都配置:echo "shared_preload_libraries='repmgr' " >>/pg13/pgdata/postgresq.conf
,配置后重启PG。该步骤我之前已配置过。
2、在所有节点都配置文件/pg13/pg13/repmgr.conf
,增加内容包括:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 | cat >> /pg13/pg13/repmgr.conf << "EOF" monitoring_history=yes monitor_interval_secs=5 failover=automatic reconnect_attempts=6 reconnect_interval=5 promote_command='repmgr standby promote -f /pg13/pg13/repmgr.conf --log-to-file' follow_command='repmgr standby follow -f /pg13/pg13/repmgr.conf --log-to-file --upstream-node-id=%n' log_level=INFO log_status_interval=10 log_file='/pg13/pg13/repmgr.log' EOF cat >> /etc/logrotate.conf <<"EOF" /pg13/pg13/repmgr.log { missingok compress rotate 30 daily dateext create 0600 pg13 pg13 } EOF |
3、在所有节点都启动repmgrd进程
1 2 3 4 5 6 7 8 9 | -- 启动 repmgrd -f /pg13/pg13/repmgr.conf --pid-file /tmp/repmgrd.pid --daemonize -- 建议加到开机自动启动:/etc/rc.local echo "repmgrd -f /pg13/pg13/repmgr.conf --pid-file /tmp/repmgrd.pid --daemonize" >> /etc/rc.local chmod +x /etc/rc.d/rc.local -- 停止 kill -9 `cat /tmp/repmgrd.pid` |
日志: