1<refentry id="repmgr-node-rejoin">
2
3  <indexterm>
4    <primary>repmgr node rejoin</primary>
5  </indexterm>
6
7  <refmeta>
8    <refentrytitle>repmgr node rejoin</refentrytitle>
9  </refmeta>
10
11  <refnamediv>
12    <refname>repmgr node rejoin</refname>
13    <refpurpose>rejoin a dormant (stopped) node to the replication cluster</refpurpose>
14  </refnamediv>
15
16  <refsect1>
17    <title>Description</title>
18    <para>
19      Enables a dormant (stopped) node to be rejoined to the replication cluster.
20    </para>
21    <para>
22      This can optionally use <application>pg_rewind</application> to re-integrate
23      a node which has diverged from the rest of the cluster, typically a failed primary.
24    </para>
25
26    <tip>
27      <para>
28        If the node is running and needs to be attached to the current primary, use
29        <xref linkend="repmgr-standby-follow"/>.
30      </para>
31      <para>
        Note that <xref linkend="repmgr-standby-follow"/> can only be used for standbys which have not diverged
        from the rest of the cluster.
34      </para>
35    </tip>
36  </refsect1>
37
38
39  <refsect1>
40    <title>Usage</title>
41
42    <para>
43      <programlisting>
44      repmgr node rejoin -d '$conninfo'</programlisting>
45
46      where <literal>$conninfo</literal> is the PostgreSQL <literal>conninfo</literal> string of the
47      <emphasis>current</emphasis> primary node (or that of any reachable node in the cluster, but
48      <emphasis>not</emphasis> the local node). This is so that &repmgr; can fetch up-to-date information
49      about the current state of the cluster.
50    </para>
51    <para>
      <filename>repmgr.conf</filename> for the stopped node <emphasis>must</emphasis> be supplied explicitly if not
      otherwise available.
54    </para>
55  </refsect1>
56
57  <refsect1>
58
59    <title>Options</title>
60    <variablelist>
61
62      <varlistentry>
63        <term><option>--dry-run</option></term>
64        <listitem>
65          <para>
66            Check prerequisites but don't actually execute the rejoin.
67          </para>
68        </listitem>
69      </varlistentry>
70
71      <varlistentry>
72        <term><option>--force-rewind[=/path/to/pg_rewind]</option></term>
73        <listitem>
74          <para>
75            Execute <application>pg_rewind</application>.
76          </para>
77          <para>
78            It is only necessary to provide the <application>pg_rewind</application> path
79            if using PostgreSQL 9.4, and <application>pg_rewind</application>
80            is not installed in the PostgreSQL <filename>bin</filename> directory.
81          </para>
82        </listitem>
83      </varlistentry>
84
85      <varlistentry>
86        <term><option>--config-files</option></term>
87        <listitem>
88          <para>
            Comma-separated list of configuration files to retain after
            executing <application>pg_rewind</application>.
91          </para>
92          <para>
93            Currently <application>pg_rewind</application> will overwrite
94            the local node's configuration files with the files from the source node,
95            so it's advisable to use this option to ensure they are kept.
96          </para>
97        </listitem>
98      </varlistentry>
99
100
101      <varlistentry>
102        <term><option>--config-archive-dir</option></term>
103        <listitem>
104          <para>
105            Directory to temporarily store configuration files specified with
106            <option>--config-files</option>; default: <filename>/tmp</filename>.
107          </para>
108        </listitem>
109      </varlistentry>
110
111
112      <varlistentry>
113        <term><option>-W/--no-wait</option></term>
114        <listitem>
115          <para>
            Don't wait for the node to rejoin the cluster.
117          </para>
118          <para>
119            If this option is supplied, &repmgr; will restart the node but
120            not wait for it to connect to the primary.
121          </para>
122        </listitem>
123      </varlistentry>
124
125    </variablelist>
126  </refsect1>
127
128  <refsect1>
129    <title>Configuration file settings</title>
130
131    <para>
132      <itemizedlist spacing="compact" mark="bullet">
133       <listitem>
134         <simpara>
135           <literal>node_rejoin_timeout</literal>:
136		   the maximum length of time (in seconds) to wait for
137		   the node to reconnect to the replication cluster (defaults to
138		   the value set in <literal>standby_reconnect_timeout</literal>,
139		   60 seconds).
140		 </simpara>
141         <simpara>
142           Note that <literal>standby_reconnect_timeout</literal> must be
143           set to a value equal to or greater than
144           <literal>node_rejoin_timeout</literal>.
145         </simpara>
146	   </listitem>
147	  </itemizedlist>
148	</para>
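    <para>
      As an illustrative sketch only, a <filename>repmgr.conf</filename> fragment which
      satisfies this constraint might look like the following. The values shown are
      assumptions chosen for the example, not recommendations:
      <programlisting>
    # hypothetical values; standby_reconnect_timeout must not be smaller than node_rejoin_timeout
    standby_reconnect_timeout=120
    node_rejoin_timeout=60</programlisting>
    </para>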
149
150  </refsect1>
151
152  <refsect1 id="repmgr-node-rejoin-events">
153    <title>Event notifications</title>
154    <para>
155      A <literal>node_rejoin</literal> <link linkend="event-notifications">event notification</link> will be generated.
156    </para>
157  </refsect1>
158  <refsect1>
159    <title>Exit codes</title>
160    <para>
161      One of the following exit codes will be emitted by <command>repmgr node rejoin</command>:
162    </para>
163
164    <variablelist>
165
166      <varlistentry>
167        <term><option>SUCCESS (0)</option></term>
168        <listitem>
169          <para>
170            The node rejoin succeeded; or if <option>--dry-run</option> was provided,
171            no issues were detected which would prevent the node rejoin.
172          </para>
173        </listitem>
174      </varlistentry>
175
176      <varlistentry>
177        <term><option>ERR_BAD_CONFIG (1)</option></term>
178        <listitem>
179          <para>
180            A configuration issue was detected which prevented &repmgr; from
181            continuing with the node rejoin.
182          </para>
183        </listitem>
184      </varlistentry>
185
186      <varlistentry>
187        <term><option>ERR_NO_RESTART (4)</option></term>
188        <listitem>
189          <para>
190            The node could not be restarted.
191          </para>
192        </listitem>
193      </varlistentry>
194
195      <varlistentry>
196        <term><option>ERR_REJOIN_FAIL (24)</option></term>
197        <listitem>
198          <para>
199            The node rejoin operation failed.
200          </para>
201        </listitem>
202      </varlistentry>
203
204    </variablelist>
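    <para>
      A wrapper script can map these exit codes back to their symbolic names, for example
      when logging the outcome of a rejoin attempt. The following is a minimal sketch; the
      function name <literal>describe_rejoin_exit</literal> is hypothetical and not part
      of &repmgr;:
      <programlisting>
    # Hypothetical helper: translate a "repmgr node rejoin" exit status
    # into the symbolic name listed above.
    describe_rejoin_exit() {
        case "$1" in
            0)  echo "SUCCESS" ;;
            1)  echo "ERR_BAD_CONFIG" ;;
            4)  echo "ERR_NO_RESTART" ;;
            24) echo "ERR_REJOIN_FAIL" ;;
            *)  echo "UNKNOWN" ;;
        esac
    }

    # Possible usage (conninfo string taken from the examples in this document):
    #   repmgr node rejoin -f /etc/repmgr.conf -d 'host=node3 dbname=repmgr user=repmgr' --dry-run
    #   status=$?
    #   echo "repmgr node rejoin: $(describe_rejoin_exit "$status")"</programlisting>
    </para>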
205
206  </refsect1>
207
208  <refsect1>
209    <title>Notes</title>
210    <para>
211      Currently <command>repmgr node rejoin</command> can only be used to attach
212      a standby to the current primary, not another standby.
213    </para>
214    <para>
      The node's PostgreSQL instance must have been shut down cleanly. If this was not the
216      case, it will need to be started up until it has reached a consistent recovery point,
217      then shut down cleanly.
218    </para>
219    <para>
      In PostgreSQL 13 and later, this will be done automatically
      if the <option>--force-rewind</option> option is provided (even if an actual rewind
      is not necessary).
223    </para>
224    <para>
225      With PostgreSQL 12 and earlier, PostgreSQL will need to
226      be started and shut down manually; see below for the best way to do this.
227    </para>
228    <tip>
229      <para>
230        If <application>PostgreSQL</application> is started in single-user mode and
        input is directed from <filename>/dev/null</filename>, it will perform recovery
232        then immediately quit, and will then be in a state suitable for use by
233        <application>pg_rewind</application>.
234        <programlisting>
235          rm -f /var/lib/pgsql/data/recovery.conf
236          postgres --single -D /var/lib/pgsql/data/ &lt; /dev/null</programlisting>
237      </para>
238      <para>
        Note that <filename>standby.signal</filename> (PostgreSQL 11 and earlier:
        <filename>recovery.conf</filename>) <emphasis>must</emphasis> be removed
        from the data directory for PostgreSQL to be able to start in single-user
        mode.
243      </para>
244    </tip>
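    <para>
      Because the file to remove depends on the PostgreSQL version, a script performing this
      step may want to choose it automatically. The following is a minimal sketch, assuming
      the major version is read from the <filename>PG_VERSION</filename> file in the data
      directory; the function name <literal>recovery_marker</literal> is hypothetical:
      <programlisting>
    # Hypothetical helper: print the name of the file which must be removed
    # before starting PostgreSQL in single-user mode.
    # $1: contents of PG_VERSION, e.g. "9.6", "11" or "13"
    recovery_marker() {
        major=${1%%.*}    # keep only the part before the first dot
        if [ "$major" -ge 12 ]; then
            echo "standby.signal"
        else
            echo "recovery.conf"
        fi
    }</programlisting>
    </para>
    <para>
      It could then be combined with the single-user mode invocation shown above
      (the data directory path is an assumption):
      <programlisting>
    PGDATA=/var/lib/pgsql/data
    rm -f "$PGDATA/$(recovery_marker "$(cat "$PGDATA/PG_VERSION")")"
    postgres --single -D "$PGDATA" &lt; /dev/null</programlisting>
    </para>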
245
246  </refsect1>
247
248  <refsect1 id="repmgr-node-rejoin-pg-rewind" xreflabel="Using pg_rewind">
249
250   <title>Using <command>pg_rewind</command></title>
251
252   <indexterm>
253      <primary>pg_rewind</primary>
254      <secondary>using with "repmgr node rejoin"</secondary>
255    </indexterm>
256
257    <para>
258      <command>repmgr node rejoin</command> can optionally use <command>pg_rewind</command> to re-integrate a
259      node which has diverged from the rest of the cluster, typically a failed primary.
260      <command>pg_rewind</command> is available in PostgreSQL 9.5 and later as part of the core distribution,
261      and can be installed from external sources for PostgreSQL 9.4.
262    </para>
263    <note>
264      <para>
265        <command>pg_rewind</command> <emphasis>requires</emphasis> that either
266        <varname>wal_log_hints</varname> is enabled, or that
267        data checksums were enabled when the cluster was initialized. See the
268        <ulink url="https://www.postgresql.org/docs/current/app-pgrewind.html"><command>pg_rewind</command> documentation</ulink> for details.
269      </para>
270    </note>
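    <para>
      On a stopped node, one way to check these prerequisites is to inspect the output of
      <application>pg_controldata</application>. The following is a minimal sketch, assuming
      <application>pg_controldata</application> emits its messages in English and reports the
      fields <literal>wal_log_hints setting</literal> and <literal>Data page checksum version</literal>;
      the function name <literal>rewind_prereqs_met</literal> is hypothetical:
      <programlisting>
    # Hypothetical helper: read pg_controldata output on standard input and
    # succeed if wal_log_hints is on or data checksums are enabled.
    rewind_prereqs_met() {
        awk -F: '
            /^wal_log_hints setting/      { gsub(/[ \t]/, "", $2); if ($2 == "on") ok = 1 }
            /^Data page checksum version/ { gsub(/[ \t]/, "", $2); if ($2 != "0")  ok = 1 }
            END { exit (ok ? 0 : 1) }
        '
    }

    # Possible usage (the data directory path is an assumption):
    #   pg_controldata /var/lib/pgsql/data | rewind_prereqs_met || echo "pg_rewind prerequisites not met"</programlisting>
    </para>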
271
272    <para>
273      We strongly recommend familiarizing yourself with <command>pg_rewind</command> before attempting
274      to use it with &repmgr;, as while it is an extremely useful tool, it is <emphasis>not</emphasis>
275      a &quot;magic bullet&quot; which can resolve all problematic replication situations.
276    </para>
277
278    <para>
279      A typical use-case for <command>pg_rewind</command> is when a scenario like the following
280      is encountered:
281      <programlisting>
    $ repmgr node rejoin -f /etc/repmgr.conf -d 'host=node3 dbname=repmgr user=repmgr' \
        --config-files=postgresql.local.conf,postgresql.conf --verbose --dry-run
284    INFO: replication connection to the rejoin target node was successful
285    INFO: local and rejoin target system identifiers match
286    DETAIL: system identifier is 6652184002263212600
287    ERROR: this node cannot attach to rejoin target node 3
288    DETAIL: rejoin target server's timeline 2 forked off current database system timeline 1 before current recovery point 0/610D710
289    HINT: use --force-rewind to execute pg_rewind</programlisting>
290
291      Here, <literal>node3</literal> was promoted to a primary while the local node was
292      still attached to the previous primary; this can potentially happen during e.g. a
293      network split. <command>pg_rewind</command> can re-sync the local node with <literal>node3</literal>,
294      removing the need for a full reclone.
295    </para>
296
297    <para>
298      To have <command>repmgr node rejoin</command> use <command>pg_rewind</command>,
299      pass the command line option <literal>--force-rewind</literal>, which will tell &repmgr;
300      to execute <command>pg_rewind</command> to ensure the node can be rejoined successfully.
301    </para>
302
303    <refsect2 id="repmgr-node-rejoin-pg-rewind-config-files" xreflabel="pg_rewind and configuration files">
304
305      <title><command>pg_rewind</command> and configuration file retention</title>
306
307      <indexterm>
308        <primary>pg_rewind</primary>
309        <secondary>configuration file retention</secondary>
310      </indexterm>
311
312      <para>
313        Be aware that if <command>pg_rewind</command> is executed and actually performs a
314        rewind operation, any configuration files in the PostgreSQL data directory will be
315        overwritten with those from the source server.
316      </para>
317      <para>
318        To prevent this happening, provide a comma-separated list of files to retain
        using the <option>--config-files</option> command line option; the specified files
320        will be archived in a temporary directory (whose parent directory can be specified with
321        <option>--config-archive-dir</option>, default: <filename>/tmp</filename>)
322        and restored once the rewind operation is complete.
323      </para>
324    </refsect2>
325
326    <refsect2 id="repmgr-node-rejoin-pg-rewind-example" xreflabel="example using repmgr node rejoin and pg_rewind">
327
328      <title>Example using <command>repmgr node rejoin</command> and <command>pg_rewind</command></title>
329
330      <indexterm>
331        <primary>pg_rewind</primary>
332        <secondary>configuration file retention</secondary>
333      </indexterm>
334
335
336      <para>
        The following example first runs <command>repmgr node rejoin</command> with
        <option>--dry-run</option>, then actually executes the command:
339        <programlisting>
340    $ repmgr node rejoin -f /etc/repmgr.conf -d 'host=node3 dbname=repmgr user=repmgr' \
341        --config-files=postgresql.local.conf,postgresql.conf --verbose --force-rewind --dry-run
342    INFO: replication connection to the rejoin target node was successful
343    INFO: local and rejoin target system identifiers match
344    DETAIL: system identifier is 6652460429293670710
345    NOTICE: pg_rewind execution required for this node to attach to rejoin target node 3
346    DETAIL: rejoin target server's timeline 2 forked off current database system timeline 1 before current recovery point 0/610D710
347    INFO: prerequisites for using pg_rewind are met
348    INFO: file "postgresql.local.conf" would be copied to "/tmp/repmgr-config-archive-node2/postgresql.local.conf"
349    INFO: file "postgresql.replication-setup.conf" would be copied to "/tmp/repmgr-config-archive-node2/postgresql.replication-setup.conf"
350    INFO: pg_rewind would now be executed
351    DETAIL: pg_rewind command is:
352      pg_rewind -D '/var/lib/postgresql/data' --source-server='host=node3 dbname=repmgr user=repmgr'
353    INFO: prerequisites for executing NODE REJOIN are met</programlisting>
354
355        <note>
356          <para>
357            If <option>--force-rewind</option> is used with the <option>--dry-run</option> option,
358            this checks the prerequisites for using <application>pg_rewind</application>, but is
359            not an absolute guarantee that actually executing <application>pg_rewind</application>
360            will succeed. See also section <xref linkend="repmgr-node-rejoin-caveats"/> below.
361          </para>
362
363        </note>
364
365        <programlisting>
366    $ repmgr node rejoin -f /etc/repmgr.conf -d 'host=node3 dbname=repmgr user=repmgr' \
367        --config-files=postgresql.local.conf,postgresql.conf --verbose --force-rewind
368    NOTICE: pg_rewind execution required for this node to attach to rejoin target node 3
369    DETAIL: rejoin target server's timeline 2 forked off current database system timeline 1 before current recovery point 0/610D710
370    NOTICE: executing pg_rewind
371    DETAIL: pg_rewind command is "pg_rewind -D '/var/lib/postgresql/data' --source-server='host=node3 dbname=repmgr user=repmgr'"
372    NOTICE: 2 files copied to /var/lib/postgresql/data
373    NOTICE: setting node 2's upstream to node 3
374    NOTICE: starting server using "pg_ctl -l /var/log/postgres/startup.log -w -D '/var/lib/pgsql/data' start"
375    NOTICE: NODE REJOIN successful
376    DETAIL: node 2 is now attached to node 3</programlisting>
377      </para>
378    </refsect2>
379  </refsect1>
380
381  <refsect1 id="repmgr-node-rejoin-caveats" xreflabel="Caveats">
382
383   <title>Caveats when using <command>repmgr node rejoin</command></title>
384
385   <indexterm>
386     <primary>repmgr node rejoin</primary>
387     <secondary>caveats</secondary>
388   </indexterm>
389
390   <para>
391     <command>repmgr node rejoin</command> attempts to determine whether it will succeed by
392     comparing the timelines and relative WAL positions of the local node (rejoin candidate) and primary
393     (rejoin target). This is particularly important if planning to use <application>pg_rewind</application>,
394     which currently (as of PostgreSQL 12) may appear to succeed (or indicate there is no action
395     needed) but potentially allow an impossible action, such as trying to rejoin a standby to a
396     primary which is behind the standby. &repmgr; will prevent this situation from occurring.
397   </para>
398   <para>
399     Currently it is <emphasis>not</emphasis> possible to detect a situation where the rejoin target
400     is a standby which has been &quot;promoted&quot; by removing <filename>recovery.conf</filename>
401     (PostgreSQL 12 and later: <filename>standby.signal</filename>) and restarting it.
402     In this case there will be no information about the point the rejoin target diverged
403     from the current standby; the rejoin operation will fail and
404     the current standby's PostgreSQL log will contain entries with the text
405     &quot;<literal>record with incorrect prev-link</literal>&quot;.
406   </para>
407   <para>
408     In PostgreSQL 9.5 and earlier, it is <emphasis>not</emphasis> possible to use
409     <application>pg_rewind</application> to attach to a target node with a lower
410     timeline than the local node.
411   </para>
412   <para>
413     We strongly recommend running <command>repmgr node rejoin</command> with the
414     <option>--dry-run</option> option first. Additionally it might be a good idea
415     to execute the <application>pg_rewind</application> command displayed by
416     &repmgr; with the <application>pg_rewind</application> <option>--dry-run</option>
417     option. Note that <application>pg_rewind</application> does not indicate that it
418     is running in <option>--dry-run</option> mode.
419   </para>
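   <para>
     The recommendation to run with <option>--dry-run</option> first can be enforced in a
     wrapper script. The following is a minimal sketch; the function name
     <literal>rejoin_with_dry_run</literal> and the <literal>REPMGR</literal> variable are
     hypothetical, and only options described in this document are passed to &repmgr;:
     <programlisting>
    # Path to the repmgr binary; can be overridden by the caller.
    REPMGR=${REPMGR:-repmgr}

    # Hypothetical helper: only attempt the rejoin if the dry run succeeds.
    # $1: conninfo string of the rejoin target; remaining arguments are passed through.
    rejoin_with_dry_run() {
        conninfo=$1
        shift
        if "$REPMGR" node rejoin -d "$conninfo" "$@" --dry-run; then
            "$REPMGR" node rejoin -d "$conninfo" "$@"
        else
            echo "dry run failed; not attempting rejoin"
            return 1
        fi
    }

    # Possible usage:
    #   rejoin_with_dry_run 'host=node3 dbname=repmgr user=repmgr' -f /etc/repmgr.conf --force-rewind</programlisting>
     As noted above, a successful dry run is not an absolute guarantee that the actual
     rejoin will succeed.
   </para>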
420
421   <warning>
422     <para>
423       In all current PostgreSQL versions (as of September 2020), <application>pg_rewind</application>
424       contains a corner-case bug which affects standbys in a very specific situation.
425     </para>
426     <para>
427       This situation occurs when a standby was shut down <emphasis>before</emphasis> its
428       primary node, and an attempt is made to attach this standby to another primary
429       in the same cluster (following a &quot;split brain&quot; situation where the standby
430       was connected to the wrong primary). In this case, &repmgr; will correctly determine
431       that <application>pg_rewind</application> should be executed, however
432       <application>pg_rewind</application> incorrectly decides that no action is necessary.
433     </para>
434     <para>
435       In this situation, &repmgr; will report something like:
436<programlisting>
437    NOTICE: pg_rewind execution required for this node to attach to rejoin target node 1
438    DETAIL: rejoin target server's timeline 3 forked off current database system timeline 2 before current recovery point 0/7019C10</programlisting>
439       but when executed, <application>pg_rewind</application> will report:
440<programlisting>
441    pg_rewind: servers diverged at WAL location 0/7015540 on timeline 2
442    pg_rewind: no rewind required</programlisting>
443       and if an attempt is made to attach the standby to the new primary, PostgreSQL logs on the standby
444       will contain errors like:
445<programlisting>
446    [2020-09-07 15:01:41 UTC]    LOG:  00000: replication terminated by primary server
447    [2020-09-07 15:01:41 UTC]    DETAIL:  End of WAL reached on timeline 2 at 0/7015540.
448    [2020-09-07 15:01:41 UTC]    LOG:  00000: new timeline 3 forked off current database system timeline 2 before current recovery point 0/7019C10</programlisting>
449     </para>
450     <para>
451       Currently it is not possible to resolve this situation using <application>pg_rewind</application>.
452       A <ulink url="https://www.postgresql.org/message-id/flat/CABvVfJU-LDWvoz4-Yow3Ay5LZYTuPD7eSjjE4kGyNZpXC6FrVQ@mail.gmail.com">patch</ulink>
453       has been submitted and will hopefully be included in a forthcoming PostgreSQL minor release.
454     </para>
455     <para>
456       As a workaround, start the primary server the standby was previously attached to,
457       and ensure the standby can be attached to it. If <application>pg_rewind</application> was actually executed,
458       it will have copied in the <filename>.history</filename> file from the target primary server; this must
459       be removed. <command>repmgr node rejoin</command> can then be used to attach the standby to the original
       primary. Ensure any changes pending on the primary have propagated to the standby. Then shut down the primary
461       server <emphasis>first</emphasis>, before shutting down the standby. It should then be possible to
462       use <command>repmgr node rejoin</command> to attach the standby to the new primary.
463     </para>
464   </warning>
465
466  </refsect1>
467
468  <refsect1>
469    <title>See also</title>
470    <para>
471     <xref linkend="repmgr-standby-follow"/>
472    </para>
473  </refsect1>
474</refentry>
475