<chapter id="repmgrd-overview" xreflabel="repmgrd overview">
  <title>repmgrd overview</title>

  <indexterm>
    <primary>repmgrd</primary>
    <secondary>overview</secondary>
  </indexterm>

  <para>
    &repmgrd; (&quot;<literal>replication manager daemon</literal>&quot;)
    is a management and monitoring daemon which runs
    on each node in a replication cluster. It can automate actions such as
    failover and updating standbys to follow the new primary, and can
    provide monitoring information about the state of each standby.
  </para>
  <para>
    &repmgrd; is designed to be straightforward to set up
    and does not require additional external infrastructure.
  </para>
  <para>
    Functionality provided by &repmgrd; includes:
    <itemizedlist spacing="compact" mark="bullet">

       <listitem>
         <simpara>
           wide range of <link linkend="repmgrd-basic-configuration">configuration options</link>
         </simpara>
       </listitem>

       <listitem>
         <simpara>
           option to execute custom scripts (&quot;<link linkend="event-notifications">event notifications</link>&quot;)
           at different points in the failover sequence
         </simpara>
       </listitem>

       <listitem>
         <simpara>
           ability to <link linkend="repmgrd-pausing">pause repmgrd</link>
           operation on all nodes with a
           <link linkend="repmgr-service-pause"><command>single command</command></link>
           (see the example following this list)
         </simpara>
       </listitem>

       <listitem>
         <simpara>
           optional <link linkend="repmgrd-witness-server">witness server</link>
         </simpara>
       </listitem>

       <listitem>
         <simpara>
           &quot;location&quot; configuration option to restrict
           potential promotion candidates to a single location
           (e.g. when nodes are spread over multiple data centres)
         </simpara>
       </listitem>

       <listitem>
         <simpara>
           <link linkend="connection-check-type">choice of method</link> to determine node availability
           (PostgreSQL ping, query execution or new connection)
         </simpara>
       </listitem>

       <listitem>
         <simpara>
           retention of monitoring statistics (optional)
         </simpara>
       </listitem>

    </itemizedlist>

  </para>
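
  <para>
    For example, to pause &repmgrd; on all nodes before planned maintenance, and to resume it
    afterwards, commands along the following lines can be executed from any node (illustrative
    only; the configuration file path depends on your installation, and on older repmgr versions
    these commands may be named <command>repmgr daemon pause</command> and
    <command>repmgr daemon unpause</command>):
    <programlisting>
    $ repmgr -f /etc/repmgr.conf service pause
    $ repmgr -f /etc/repmgr.conf service unpause</programlisting>
  </para>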

  <sect1 id="repmgrd-demonstration">

    <title>repmgrd demonstration</title>
    <para>
  To demonstrate automatic failover, set up a 3-node replication cluster (one primary
  and two standbys streaming directly from the primary) so that the cluster looks
  something like this:
  <programlisting>
    $ repmgr -f /etc/repmgr.conf cluster show --compact
     ID | Name  | Role    | Status    | Upstream | Location | Prio.
    ----+-------+---------+-----------+----------+----------+-------
     1  | node1 | primary | * running |          | default  | 100
     2  | node2 | standby |   running | node1    | default  | 100
     3  | node3 | standby |   running | node1    | default  | 100</programlisting>
 </para>
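 <para>
  The steps for building such a cluster are outside the scope of this demonstration; as a rough
  sketch only (hostnames, the data directory and the <filename>repmgr.conf</filename> location
  shown here are placeholders), the primary is registered with repmgr, and each standby is cloned
  from it, started and registered:
  <programlisting>
    $ repmgr -f /etc/repmgr.conf primary register                            # on node1
    $ repmgr -h node1 -U repmgr -d repmgr -f /etc/repmgr.conf standby clone  # on node2 and node3
    $ pg_ctl -D /var/lib/postgresql/data start                               # on node2 and node3
    $ repmgr -f /etc/repmgr.conf standby register                            # on node2 and node3</programlisting>
 </para>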

 <tip>
   <para>
     See section <link linkend="repmgrd-automatic-failover-configuration">Required configuration for automatic failover</link>
     for an example of minimal <filename>repmgr.conf</filename> file settings suitable for use with &repmgrd;.
   </para>
 </tip>
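 <para>
   For orientation only, a node's <filename>repmgr.conf</filename> suitable for automatic failover
   will typically contain settings along these lines (all values shown are illustrative; see the
   section linked above for authoritative details):
   <programlisting>
    node_id=2
    node_name='node2'
    conninfo='host=node2 user=repmgr dbname=repmgr connect_timeout=2'
    data_directory='/var/lib/postgresql/data'

    failover=automatic
    promote_command='/usr/bin/repmgr standby promote -f /etc/repmgr.conf --log-to-file'
    follow_command='/usr/bin/repmgr standby follow -f /etc/repmgr.conf --log-to-file --upstream-node-id=%n'</programlisting>
 </para>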
 <para>
  Start &repmgrd; on each node and verify that it's running by examining the
  log output, which at log level <literal>INFO</literal> will look like this:
  <programlisting>
    [2019-08-15 07:14:42] [NOTICE] repmgrd (repmgrd 5.0) starting up
    [2019-08-15 07:14:42] [INFO] connecting to database "host=node2 dbname=repmgr user=repmgr connect_timeout=2"
    INFO:  set_repmgrd_pid(): provided pidfile is /var/run/repmgr/repmgrd-12.pid
    [2019-08-15 07:14:42] [NOTICE] starting monitoring of node "node2" (ID: 2)
    [2019-08-15 07:14:42] [INFO] monitoring connection to upstream node "node1" (ID: 1)</programlisting>
 </para>
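 <para>
  How &repmgrd; is started depends on how repmgr was installed; packaged installations usually
  provide a service file for use with the system's service manager, while a direct invocation
  might look something like this (the configuration file path is illustrative):
  <programlisting>
    $ repmgrd -f /etc/repmgr.conf</programlisting>
 </para>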
 <para>
  Each &repmgrd; should also have recorded its successful startup as an event:
  <programlisting>
    $ repmgr -f /etc/repmgr.conf cluster event --event=repmgrd_start
     Node ID | Name  | Event         | OK | Timestamp           | Details
    ---------+-------+---------------+----+---------------------+--------------------------------------------------------
     3       | node3 | repmgrd_start | t  | 2019-08-15 07:14:42 | monitoring connection to upstream node "node1" (ID: 1)
     2       | node2 | repmgrd_start | t  | 2019-08-15 07:14:41 | monitoring connection to upstream node "node1" (ID: 1)
     1       | node1 | repmgrd_start | t  | 2019-08-15 07:14:39 | monitoring cluster primary "node1" (ID: 1)</programlisting>
 </para>
 <para>
  Now stop the current primary server, e.g. with:
  <programlisting>
    pg_ctl -D /var/lib/postgresql/data -m immediate stop</programlisting>
 </para>
 <para>
  This will force the primary to shut down straight away, aborting all processes
  and transactions. It will also cause a flurry of activity in the &repmgrd; log
  files as each &repmgrd; detects the failure of the primary and a failover
  decision is made. The following extract is from the log of the standby (<literal>node2</literal>)
  which promoted itself to primary after the failure of the original primary (<literal>node1</literal>):
  <programlisting>
    [2019-08-15 07:27:50] [WARNING] unable to connect to upstream node "node1" (ID: 1)
    [2019-08-15 07:27:50] [INFO] checking state of node 1, 1 of 3 attempts
    [2019-08-15 07:27:50] [INFO] sleeping 5 seconds until next reconnection attempt
    [2019-08-15 07:27:55] [INFO] checking state of node 1, 2 of 3 attempts
    [2019-08-15 07:27:55] [INFO] sleeping 5 seconds until next reconnection attempt
    [2019-08-15 07:28:00] [INFO] checking state of node 1, 3 of 3 attempts
    [2019-08-15 07:28:00] [WARNING] unable to reconnect to node 1 after 3 attempts
    [2019-08-15 07:28:00] [INFO] primary and this node have the same location ("default")
    [2019-08-15 07:28:00] [INFO] local node's last receive lsn: 0/900CBF8
    [2019-08-15 07:28:00] [INFO] node 3 last saw primary node 12 second(s) ago
    [2019-08-15 07:28:00] [INFO] last receive LSN for sibling node "node3" (ID: 3) is: 0/900CBF8
    [2019-08-15 07:28:00] [INFO] node "node3" (ID: 3) has same LSN as current candidate "node2" (ID: 2)
    [2019-08-15 07:28:00] [INFO] visible nodes: 2; total nodes: 2; no nodes have seen the primary within the last 4 seconds
    [2019-08-15 07:28:00] [NOTICE] promotion candidate is "node2" (ID: 2)
    [2019-08-15 07:28:00] [NOTICE] this node is the winner, will now promote itself and inform other nodes
    [2019-08-15 07:28:00] [INFO] promote_command is:
      "/usr/pgsql-12/bin/repmgr -f /etc/repmgr/12/repmgr.conf standby promote"
    NOTICE: promoting standby to primary
    DETAIL: promoting server "node2" (ID: 2) using "/usr/pgsql-12/bin/pg_ctl  -w -D '/var/lib/pgsql/12/data' promote"
    NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
    NOTICE: STANDBY PROMOTE successful
    DETAIL: server "node2" (ID: 2) was successfully promoted to primary
    [2019-08-15 07:28:01] [INFO] 3 followers to notify
    [2019-08-15 07:28:01] [NOTICE] notifying node "node3" (ID: 3) to follow node 2
    INFO:  node 3 received notification to follow node 2
    [2019-08-15 07:28:01] [INFO] switching to primary monitoring mode
    [2019-08-15 07:28:01] [NOTICE] monitoring cluster primary "node2" (ID: 2)</programlisting>
 </para>
 <para>
  The cluster status will now look like this, with the original primary (<literal>node1</literal>)
  marked as inactive, and standby <literal>node3</literal> now following the new primary
  (<literal>node2</literal>):
  <programlisting>
    $ repmgr -f /etc/repmgr.conf cluster show --compact
     ID | Name  | Role    | Status    | Upstream | Location | Prio.
    ----+-------+---------+-----------+----------+----------+-------
     1  | node1 | primary | - failed  |          | default  | 100
     2  | node2 | primary | * running |          | default  | 100
     3  | node3 | standby |   running | node2    | default  | 100</programlisting>

 </para>
 <para>
   <link linkend="repmgr-cluster-event"><command>repmgr cluster event</command></link> will display a summary of
   what happened to each server during the failover:
  <programlisting>
    $ repmgr -f /etc/repmgr.conf cluster event
     Node ID | Name  | Event                      | OK | Timestamp           | Details
    ---------+-------+----------------------------+----+---------------------+-------------------------------------------------------------
     3       | node3 | repmgrd_failover_follow    | t  | 2019-08-15 07:38:03 | node 3 now following new upstream node 2
     3       | node3 | standby_follow             | t  | 2019-08-15 07:38:02 | standby attached to upstream node "node2" (ID: 2)
     2       | node2 | repmgrd_reload             | t  | 2019-08-15 07:38:01 | monitoring cluster primary "node2" (ID: 2)
     2       | node2 | repmgrd_failover_promote   | t  | 2019-08-15 07:38:01 | node 2 promoted to primary; old primary 1 marked as failed
     2       | node2 | standby_promote            | t  | 2019-08-15 07:38:01 | server "node2" (ID: 2) was successfully promoted to primary</programlisting>
 </para>

  </sect1>
</chapter>