<chapter id="repmgrd-overview" xreflabel="repmgrd overview">
 <title>repmgrd overview</title>

 <indexterm>
   <primary>repmgrd</primary>
   <secondary>overview</secondary>
 </indexterm>

 <para>
  &repmgrd; ("<literal>replication manager daemon</literal>")
  is a management and monitoring daemon which runs
  on each node in a replication cluster. It can automate actions such as
  failover and updating standbys to follow the new primary, as well as
  provide monitoring information about the state of each standby.
 </para>
 <para>
  &repmgrd; is designed to be straightforward to set up
  and does not require additional external infrastructure.
 </para>
 <para>
  Functionality provided by &repmgrd; includes:
  <itemizedlist spacing="compact" mark="bullet">

   <listitem>
    <simpara>
     wide range of <link linkend="repmgrd-basic-configuration">configuration options</link>
    </simpara>
   </listitem>

   <listitem>
    <simpara>
     option to execute custom scripts ("<link linkend="event-notifications">event notifications</link>")
     at different points in the failover sequence
    </simpara>
   </listitem>

   <listitem>
    <simpara>
     ability to <link linkend="repmgrd-pausing">pause repmgrd</link>
     operation on all nodes with a
     <link linkend="repmgr-service-pause"><command>single command</command></link>
    </simpara>
   </listitem>

   <listitem>
    <simpara>
     optional <link linkend="repmgrd-witness-server">witness server</link>
    </simpara>
   </listitem>

   <listitem>
    <simpara>
     "location" configuration option to restrict
     potential promotion candidates to a single location
     (e.g. when nodes are spread over multiple data centres)
    </simpara>
   </listitem>

   <listitem>
    <simpara>
     <link linkend="connection-check-type">choice of method</link> to determine node availability
     (PostgreSQL ping, query execution or new connection)
    </simpara>
   </listitem>

   <listitem>
    <simpara>
     retention of monitoring statistics (optional)
    </simpara>
   </listitem>

  </itemizedlist>
 </para>

 <sect1 id="repmgrd-demonstration">

  <title>repmgrd demonstration</title>
  <para>
   To demonstrate automatic failover, set up a 3-node replication cluster (one primary
   and two standbys streaming directly from the primary) so that the cluster looks
   something like this:
   <programlisting>
$ repmgr -f /etc/repmgr.conf cluster show --compact
 ID | Name  | Role    | Status    | Upstream | Location | Prio.
----+-------+---------+-----------+----------+----------+-------
 1  | node1 | primary | * running |          | default  | 100
 2  | node2 | standby |   running | node1    | default  | 100
 3  | node3 | standby |   running | node1    | default  | 100</programlisting>
  </para>

  <tip>
   <para>
    See section <link linkend="repmgrd-automatic-failover-configuration">Required configuration for automatic failover</link>
    for an example of minimal <filename>repmgr.conf</filename> file settings suitable for use with &repmgrd;.
   </para>
  </tip>
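  <para>
   As a rough guide to what those settings involve: automatic failover requires
   <literal>failover=automatic</literal>, plus a <literal>promote_command</literal>
   and <literal>follow_command</literal> for &repmgrd; to execute when action is
   needed. The following sketch illustrates this; the binary and file paths shown
   are examples only and will vary between installations:
   <programlisting>
node_id=2
node_name='node2'
conninfo='host=node2 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/postgresql/data'

failover='automatic'
promote_command='/usr/bin/repmgr standby promote -f /etc/repmgr.conf --log-to-file'
follow_command='/usr/bin/repmgr standby follow -f /etc/repmgr.conf --log-to-file --upstream-node-id=%n'</programlisting>
  </para>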
  <para>
   Start &repmgrd; on each standby and verify that it's running by examining the
   log output, which at log level <literal>INFO</literal> will look like this:
   <programlisting>
[2019-08-15 07:14:42] [NOTICE] repmgrd (repmgrd 5.0) starting up
[2019-08-15 07:14:42] [INFO] connecting to database "host=node2 dbname=repmgr user=repmgr connect_timeout=2"
INFO: set_repmgrd_pid(): provided pidfile is /var/run/repmgr/repmgrd-12.pid
[2019-08-15 07:14:42] [NOTICE] starting monitoring of node "node2" (ID: 2)
[2019-08-15 07:14:42] [INFO] monitoring connection to upstream node "node1" (ID: 1)</programlisting>
  </para>
  <para>
   Each &repmgrd; should also have recorded its successful startup as an event:
   <programlisting>
$ repmgr -f /etc/repmgr.conf cluster event --event=repmgrd_start
 Node ID | Name  | Event         | OK | Timestamp           | Details
---------+-------+---------------+----+---------------------+--------------------------------------------------------
 3       | node3 | repmgrd_start | t  | 2019-08-15 07:14:42 | monitoring connection to upstream node "node1" (ID: 1)
 2       | node2 | repmgrd_start | t  | 2019-08-15 07:14:41 | monitoring connection to upstream node "node1" (ID: 1)
 1       | node1 | repmgrd_start | t  | 2019-08-15 07:14:39 | monitoring cluster primary "node1" (ID: 1)</programlisting>
  </para>
  <para>
   Now stop the current primary server, e.g. with:
   <programlisting>
pg_ctl -D /var/lib/postgresql/data -m immediate stop</programlisting>
  </para>
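  <para>
   How quickly &repmgrd; concludes that the primary has failed is governed by the
   <literal>reconnect_attempts</literal> and <literal>reconnect_interval</literal>
   configuration parameters. The log extract below reflects settings along the
   following lines (values shown for illustration; they are not the defaults):
   <programlisting>
reconnect_attempts=3
reconnect_interval=5</programlisting>
  </para>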
  <para>
   Stopping the primary in <literal>immediate</literal> mode forces it to shut down
   straight away, aborting all processes and transactions, and will cause a flurry of
   activity in the &repmgrd; log files as each &repmgrd; detects the failure of the
   primary and a failover decision is made. This is an extract from the log of a
   standby server (<literal>node2</literal>) which has promoted itself to primary
   after the failure of the original primary (<literal>node1</literal>):
   <programlisting>
[2019-08-15 07:27:50] [WARNING] unable to connect to upstream node "node1" (ID: 1)
[2019-08-15 07:27:50] [INFO] checking state of node 1, 1 of 3 attempts
[2019-08-15 07:27:50] [INFO] sleeping 5 seconds until next reconnection attempt
[2019-08-15 07:27:55] [INFO] checking state of node 1, 2 of 3 attempts
[2019-08-15 07:27:55] [INFO] sleeping 5 seconds until next reconnection attempt
[2019-08-15 07:28:00] [INFO] checking state of node 1, 3 of 3 attempts
[2019-08-15 07:28:00] [WARNING] unable to reconnect to node 1 after 3 attempts
[2019-08-15 07:28:00] [INFO] primary and this node have the same location ("default")
[2019-08-15 07:28:00] [INFO] local node's last receive lsn: 0/900CBF8
[2019-08-15 07:28:00] [INFO] node 3 last saw primary node 12 second(s) ago
[2019-08-15 07:28:00] [INFO] last receive LSN for sibling node "node3" (ID: 3) is: 0/900CBF8
[2019-08-15 07:28:00] [INFO] node "node3" (ID: 3) has same LSN as current candidate "node2" (ID: 2)
[2019-08-15 07:28:00] [INFO] visible nodes: 2; total nodes: 2; no nodes have seen the primary within the last 4 seconds
[2019-08-15 07:28:00] [NOTICE] promotion candidate is "node2" (ID: 2)
[2019-08-15 07:28:00] [NOTICE] this node is the winner, will now promote itself and inform other nodes
[2019-08-15 07:28:00] [INFO] promote_command is:
  "/usr/pgsql-12/bin/repmgr -f /etc/repmgr/12/repmgr.conf standby promote"
NOTICE: promoting standby to primary
DETAIL: promoting server "node2" (ID: 2) using "/usr/pgsql-12/bin/pg_ctl -w -D '/var/lib/pgsql/12/data' promote"
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "node2" (ID: 2) was successfully promoted to primary
[2019-08-15 07:28:01] [INFO] 3 followers to notify
[2019-08-15 07:28:01] [NOTICE] notifying node "node3" (ID: 3) to follow node 2
INFO: node 3 received notification to follow node 2
[2019-08-15 07:28:01] [INFO] switching to primary monitoring mode
[2019-08-15 07:28:01] [NOTICE] monitoring cluster primary "node2" (ID: 2)</programlisting>
  </para>
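  <para>
   At this point it's also worth verifying that &repmgrd; is still running on each
   surviving node, e.g. with <command>repmgr service status</command> (in older
   repmgr releases this command was <command>repmgr daemon status</command>),
   which summarises the state of &repmgrd; on every node:
   <programlisting>
$ repmgr -f /etc/repmgr.conf service status</programlisting>
  </para>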
  <para>
   The cluster status will now look like this, with the original primary (<literal>node1</literal>)
   marked as inactive, and standby <literal>node3</literal> now following the new primary
   (<literal>node2</literal>):
   <programlisting>
$ repmgr -f /etc/repmgr.conf cluster show --compact
 ID | Name  | Role    | Status    | Upstream | Location | Prio.
----+-------+---------+-----------+----------+----------+-------
 1  | node1 | primary | - failed  |          | default  | 100
 2  | node2 | primary | * running |          | default  | 100
 3  | node3 | standby |   running | node2    | default  | 100</programlisting>
  </para>
  <para>
   <link linkend="repmgr-cluster-event"><command>repmgr cluster event</command></link> will display a summary of
   what happened to each server during the failover:
   <programlisting>
$ repmgr -f /etc/repmgr.conf cluster event
 Node ID | Name  | Event                    | OK | Timestamp           | Details
---------+-------+--------------------------+----+---------------------+--------------------------------------------------------------
 3       | node3 | repmgrd_failover_follow  | t  | 2019-08-15 07:38:03 | node 3 now following new upstream node 2
 3       | node3 | standby_follow           | t  | 2019-08-15 07:38:02 | standby attached to upstream node "node2" (ID: 2)
 2       | node2 | repmgrd_reload           | t  | 2019-08-15 07:38:01 | monitoring cluster primary "node2" (ID: 2)
 2       | node2 | repmgrd_failover_promote | t  | 2019-08-15 07:38:01 | node 2 promoted to primary; old primary 1 marked as failed
 2       | node2 | standby_promote          | t  | 2019-08-15 07:38:01 | server "node2" (ID: 2) was successfully promoted to primary</programlisting>
  </para>

 </sect1>
</chapter>