On this article i will show the failover testing postgresql replication with repmgr, repmgr already configure automatic failover, reference from this article

Begin failover

Stop service postgresql on master node

$ pg_ctl -D /var/lib/postgresql/data -m fast stop

See log on slave node, like bellow. The log will show error cause master down until standby promote as master

[2017-08-09 04:05:14] [DEBUG] get_upstream_connection(): conninfo is "host=172.17.0.2 user=repmgr dbname=repmgr"
[2017-08-09 04:05:14] [DEBUG] connecting to: 'host=172.17.0.2 user=repmgr dbname=repmgr fallback_application_name='repmgr''
[2017-08-09 04:05:14] [ERROR] connection to database failed: FATAL:  the database system is shutting down

[2017-08-09 04:05:14] [ERROR] unable to connect to upstream node: FATAL:  the database system is shutting down

[2017-08-09 04:05:14] [DEBUG] connecting to: 'host=172.17.0.2 user=repmgr dbname=repmgr fallback_application_name='repmgr''
[2017-08-09 04:05:14] [ERROR] connection to database failed: FATAL:  the database system is shutting down

[2017-08-09 04:05:14] [WARNING] connection to master has been lost, trying to recover... 60 seconds before failover decision
[2017-08-09 04:05:24] [WARNING] connection to master has been lost, trying to recover... 50 seconds before failover decision
[2017-08-09 04:05:34] [WARNING] connection to master has been lost, trying to recover... 40 seconds before failover decision
[2017-08-09 04:05:44] [WARNING] connection to master has been lost, trying to recover... 30 seconds before failover decision
[2017-08-09 04:05:54] [WARNING] connection to master has been lost, trying to recover... 20 seconds before failover decision
[2017-08-09 04:06:04] [WARNING] connection to master has been lost, trying to recover... 10 seconds before failover decision
[2017-08-09 04:06:14] [ERROR] unable to reconnect to master (timeout 60 seconds)...
[2017-08-09 04:06:14] [DEBUG] get_node_record():
SELECT id, type, upstream_node_id, name, conninfo,        slot_name, priority, active  FROM "repmgr_db_cluster".repl_nodes  WHERE cluster = 'db_cluster'    AND id = 1
[2017-08-09 04:06:14] [DEBUG] failure detected on master node (1); attempting to promote a standby
[2017-08-09 04:06:14] [DEBUG] 2 active nodes registered
[2017-08-09 04:06:14] [DEBUG] node=1 conninfo="host=172.17.0.2 user=repmgr dbname=repmgr" type=master
[2017-08-09 04:06:14] [DEBUG] node=2 conninfo="host=node2 user=repmgr dbname=repmgr" type=standby
[2017-08-09 04:06:14] [DEBUG] connecting to: 'host=node2 user=repmgr dbname=repmgr fallback_application_name='repmgr''
[2017-08-09 04:06:14] [DEBUG] set_config():
SET synchronous_commit TO 'local'
[2017-08-09 04:06:14] [DEBUG] total nodes counted: registered=2, visible=1
[2017-08-09 04:06:14] [DEBUG] checking node 1...
[2017-08-09 04:06:14] [DEBUG] checking node 2...
[2017-08-09 04:06:14] [DEBUG] connecting to: 'host=node2 user=repmgr dbname=repmgr fallback_application_name='repmgr''
[2017-08-09 04:06:14] [DEBUG] set_config():
SET synchronous_commit TO 'local'
[2017-08-09 04:06:14] [DEBUG] LSN of node 2 is: 0/6000098
[2017-08-09 04:06:14] [DEBUG] connecting to: 'host=node2 user=repmgr dbname=repmgr fallback_application_name='repmgr''
[2017-08-09 04:06:14] [DEBUG] set_config():
SET synchronous_commit TO 'local'
[2017-08-09 04:06:14] [DEBUG] LSN of node 2 is: 0/6000098
[2017-08-09 04:06:14] [DEBUG] best candidate node id is 2
[2017-08-09 04:06:19] [NOTICE] this node is the best candidate to be the new master, promoting...
[2017-08-09 04:06:19] [DEBUG] promote command is: "/opt/pgsql-9.6/bin/repmgr standby promote -f /etc/repmgr/repmgr.conf"
INFO: connecting to standby database
DEBUG: connecting to: 'host=node2 user=repmgr dbname=repmgr fallback_application_name='repmgr''
INFO: retrieving node list for cluster 'db_cluster'
DEBUG: connecting to: 'host=172.17.0.2 user=repmgr dbname=repmgr fallback_application_name='repmgr''
ERROR: connection to database failed: could not connect to server: Connection refused
	Is the server running on host "172.17.0.2" and accepting
	TCP/IP connections on port 5432?

DEBUG: connecting to: 'host=node2 user=repmgr dbname=repmgr fallback_application_name='repmgr''
NOTICE: promoting standby
NOTICE: promoting server using '/opt/pgsql-9.6/bin/pg_ctl -D /var/lib/postgres/data promote'
INFO: reconnecting to promoted server
DEBUG: connecting to: 'host=node2 user=repmgr dbname=repmgr fallback_application_name='repmgr''
DEBUG: setting node 2 as master and marking existing master as failed
NOTICE: STANDBY PROMOTE successful
[2017-08-09 04:06:21] [DEBUG] connecting to: 'host=node2 user=repmgr dbname=repmgr fallback_application_name='repmgr''
[2017-08-09 04:06:21] [DEBUG] set_config():
SET synchronous_commit TO 'local'
[2017-08-09 04:06:21] [DEBUG] get_node_record():
SELECT id, type, upstream_node_id, name, conninfo,        slot_name, priority, active  FROM "repmgr_db_cluster".repl_nodes  WHERE cluster = 'db_cluster'    AND id = 2
[2017-08-09 04:06:21] [DEBUG] create_event_record():
 INSERT INTO "repmgr_db_cluster".repl_events (              node_id,              event,              successful,              details             )       VALUES ($1, $2, $3, $4)    RETURNING event_timestamp 
[2017-08-09 04:06:21] [DEBUG] create_event_record(): Event timestamp is "2017-08-09 04:06:21.880298+00"
[2017-08-09 04:06:23] [DEBUG] standby check loop will terminate
[2017-08-09 04:06:23] [INFO] checking cluster configuration with schema 'repmgr_db_cluster'
[2017-08-09 04:06:23] [INFO] checking node 2 in cluster 'db_cluster'
[2017-08-09 04:06:23] [INFO] reloading configuration file
[2017-08-09 04:06:23] [INFO] configuration has not changed
[2017-08-09 04:06:23] [INFO] starting continuous master connection check

Bring old master as new standby

Before run standby clone, please run this query on new master

repmgr=# select * from repmgr_db_cluster.repl_nodes ;
id | type | upstream_node_id | cluster | name | conninfo
| slot_name | priority | active
----+--------+------------------+------------+-------+--------------------------------------
-----+---------------+----------+--------
1 | master | | db_cluster | node1 | host=node1 user=repmgr dbname=re
pmgr | repmgr_slot_1 | 100 | f
2 | master | | db_cluster | node2 | host=node2 user=repmgr dbname=repmgr
| repmgr_slot_2 | 100 | t

Delete old master from repmgr_db_cluster.repl_nodes

repmgr=# delete from repmgr_db_cluster.repl_nodes where id =1;
DELETE 1

Next run repmgr to clone data from master

$ /opt/pgsql-9.5/bin/repmgr -h node2 -U repmgr -d repmgr -D /var/lib/postgresql/data -f /etc/repmgr/repmgr.conf standby clone
NOTICE: destination directory '/var/lib/postgres/data' provided
INFO: connecting to upstream node
INFO: Successfully connected to upstream node. Current installation size is 41 MB
INFO: checking and correcting permissions on existing directory /var/lib/postgres/data/ ...
DEBUG: standby clone:   WITH dd AS (     SELECT setting AS data_directory      FROM pg_catalog.pg_settings      WHERE name = 'data_directory'   )     SELECT DISTINCT(sourcefile),            regexp_replace(sourcefile, '^.*\/', '') AS filename,            sourcefile ~ ('^' || dd.data_directory) AS in_data_dir       FROM dd, pg_catalog.pg_settings ps      WHERE sourcefile IS NOT NULL   ORDER BY 1 
DEBUG: standby clone:   WITH dd AS (     SELECT setting AS data_directory      FROM pg_catalog.pg_settings      WHERE name = 'data_directory'   )     SELECT ps.setting,            regexp_replace(setting, '^.*\/', '') AS filename,            ps.setting ~ ('^' || dd.data_directory) AS in_data_dir       FROM dd, pg_catalog.pg_settings ps      WHERE ps.name IN ('hba_file', 'ident_file')   ORDER BY 1 
DEBUG: create_replication_slot(): Creating slot 'repmgr_slot_1' on master
NOTICE: starting backup (using pg_basebackup)...
HINT: this may take some time; consider using the -c/--fast-checkpoint option
INFO: executing: '/opt/pgsql-9.6/bin/pg_basebackup -l "repmgr base backup"  -D /var/lib/postgres/data/ -h node2 -p 5432 -U repmgr -X stream -S repmgr_slot_1 '
DEBUG: create_recovery_file(): creating '/var/lib/postgres/data/recovery.conf'...
DEBUG: recovery.conf: standby_mode = 'on'
DEBUG: recovery.conf: primary_conninfo = 'user=repmgr port=5432 sslmode=disable sslcompression=1 host=node2 application_name=node1'
DEBUG: recovery.conf: recovery_target_timeline = 'latest'
DEBUG: recovery.conf: primary_slot_name = repmgr_slot_1
NOTICE: standby clone (using pg_basebackup) complete
NOTICE: you can now start your PostgreSQL server
HINT: for example : pg_ctl -D /var/lib/postgres/data/ start
HINT: After starting the server, you need to register this standby with "repmgr standby register"

Start service postgresql and register server as new standby

$ pg_ctl -D /var/lib/postgres/data/ start       
server starting
$ /opt/pgsql-9.5/bin/repmgr -h node2 -U repmgr -d repmgr -D /var/lib/postgres/data/ -f /etc/repmgr/repmgr.conf standby register

WARNING: master connection parameters not required when executing STANDBY REGISTER
WARNING: destination directory not required when executing STANDBY REGISTER
INFO: connecting to standby database
DEBUG: connecting to: 'host=node1 user=repmgr dbname=repmgr fallback_application_name='repmgr''
INFO: connecting to master database
INFO: retrieving node list for cluster 'db_cluster'
DEBUG: connecting to: 'host=node2 user=repmgr dbname=repmgr fallback_application_name='repmgr''
DEBUG: get_master_connection(): current master node is 2
INFO: registering the standby
INFO: standby registration complete
NOTICE: standby node correctly registered for cluster db_cluster with id 1 (conninfo: host=node1 user=repmgr dbname=repmgr)

Failover testing done

CMIIW

Advertisements