<refentry id="repmgr-node-rejoin">

 <indexterm>
  <primary>repmgr node rejoin</primary>
 </indexterm>

 <refmeta>
  <refentrytitle>repmgr node rejoin</refentrytitle>
 </refmeta>

 <refnamediv>
  <refname>repmgr node rejoin</refname>
  <refpurpose>rejoin a dormant (stopped) node to the replication cluster</refpurpose>
 </refnamediv>

 <refsect1>
  <title>Description</title>
  <para>
   Enables a dormant (stopped) node to be rejoined to the replication cluster.
  </para>
  <para>
   This can optionally use <application>pg_rewind</application> to re-integrate
   a node which has diverged from the rest of the cluster, typically a failed primary.
  </para>

  <tip>
   <para>
    If the node is running and needs to be attached to the current primary, use
    <xref linkend="repmgr-standby-follow"/>.
   </para>
   <para>
    Note that <xref linkend="repmgr-standby-follow"/> can only be used for standbys which have not diverged
    from the rest of the cluster.
   </para>
  </tip>
 </refsect1>


 <refsect1>
  <title>Usage</title>

  <para>
   <programlisting>
    repmgr node rejoin -d '$conninfo'</programlisting>

   where <literal>$conninfo</literal> is the PostgreSQL <literal>conninfo</literal> string of the
   <emphasis>current</emphasis> primary node (or that of any reachable node in the cluster, but
   <emphasis>not</emphasis> the local node). This is so that &repmgr; can fetch up-to-date information
   about the current state of the cluster.
  </para>
  <para>
   <filename>repmgr.conf</filename> for the stopped node <emphasis>must</emphasis> be supplied explicitly if not
   otherwise available.
  </para>
 </refsect1>

 <refsect1>

  <title>Options</title>
  <variablelist>

   <varlistentry>
    <term><option>--dry-run</option></term>
    <listitem>
     <para>
      Check prerequisites but don't actually execute the rejoin.
     </para>
    </listitem>
   </varlistentry>

   <varlistentry>
    <term><option>--force-rewind[=/path/to/pg_rewind]</option></term>
    <listitem>
     <para>
      Execute <application>pg_rewind</application>.
     </para>
     <para>
      It is only necessary to provide the <application>pg_rewind</application> path
      if using PostgreSQL 9.4, and <application>pg_rewind</application>
      is not installed in the PostgreSQL <filename>bin</filename> directory.
     </para>
    </listitem>
   </varlistentry>

   <varlistentry>
    <term><option>--config-files</option></term>
    <listitem>
     <para>
      Comma-separated list of configuration files to retain after
      executing <application>pg_rewind</application>.
     </para>
     <para>
      Currently <application>pg_rewind</application> will overwrite
      the local node's configuration files with the files from the source node,
      so it's advisable to use this option to ensure they are kept.
     </para>
    </listitem>
   </varlistentry>


   <varlistentry>
    <term><option>--config-archive-dir</option></term>
    <listitem>
     <para>
      Directory to temporarily store configuration files specified with
      <option>--config-files</option>; default: <filename>/tmp</filename>.
     </para>
    </listitem>
   </varlistentry>


   <varlistentry>
    <term><option>-W/--no-wait</option></term>
    <listitem>
     <para>
      Don't wait for the node to rejoin the cluster.
     </para>
     <para>
      If this option is supplied, &repmgr; will restart the node but
      not wait for it to connect to the primary.
     </para>
    </listitem>
   </varlistentry>

  </variablelist>
 </refsect1>

 <refsect1>
  <title>Configuration file settings</title>

  <para>
   <itemizedlist spacing="compact" mark="bullet">
    <listitem>
     <simpara>
      <literal>node_rejoin_timeout</literal>:
      the maximum length of time (in seconds) to wait for
      the node to reconnect to the replication cluster (defaults to
      the value set in <literal>standby_reconnect_timeout</literal>,
      60 seconds).
     </simpara>
     <simpara>
      Note that <literal>standby_reconnect_timeout</literal> must be
      set to a value equal to or greater than
      <literal>node_rejoin_timeout</literal>.
     </simpara>
    </listitem>
   </itemizedlist>
  </para>

 </refsect1>

 <refsect1 id="repmgr-node-rejoin-events">
  <title>Event notifications</title>
  <para>
   A <literal>node_rejoin</literal> <link linkend="event-notifications">event notification</link> will be generated.
  </para>
 </refsect1>
 <refsect1>
  <title>Exit codes</title>
  <para>
   One of the following exit codes will be emitted by <command>repmgr node rejoin</command>:
  </para>

  <variablelist>

   <varlistentry>
    <term><option>SUCCESS (0)</option></term>
    <listitem>
     <para>
      The node rejoin succeeded; or if <option>--dry-run</option> was provided,
      no issues were detected which would prevent the node rejoin.
     </para>
    </listitem>
   </varlistentry>

   <varlistentry>
    <term><option>ERR_BAD_CONFIG (1)</option></term>
    <listitem>
     <para>
      A configuration issue was detected which prevented &repmgr; from
      continuing with the node rejoin.
     </para>
    </listitem>
   </varlistentry>

   <varlistentry>
    <term><option>ERR_NO_RESTART (4)</option></term>
    <listitem>
     <para>
      The node could not be restarted.
     </para>
    </listitem>
   </varlistentry>

   <varlistentry>
    <term><option>ERR_REJOIN_FAIL (24)</option></term>
    <listitem>
     <para>
      The node rejoin operation failed.
     </para>
    </listitem>
   </varlistentry>

  </variablelist>

 </refsect1>

 <refsect1>
  <title>Notes</title>
  <para>
   Currently <command>repmgr node rejoin</command> can only be used to attach
   a standby to the current primary, not another standby.
  </para>
  <para>
   The node's PostgreSQL instance must have been shut down cleanly. If this was not the
   case, it will need to be started up until it has reached a consistent recovery point,
   then shut down cleanly.
  </para>
  <para>
   In PostgreSQL 13 and later, this will be done automatically
   if the <option>--force-rewind</option> option is provided (even if an actual rewind
   is not necessary).
  </para>
  <para>
   With PostgreSQL 12 and earlier, PostgreSQL will need to
   be started and shut down manually; see below for the best way to do this.
  </para>
  <tip>
   <para>
    If <application>PostgreSQL</application> is started in single-user mode and
    input is directed from <filename>/dev/null</filename>, it will perform recovery
    then immediately quit, and will then be in a state suitable for use by
    <application>pg_rewind</application>.
    <programlisting>
    # PostgreSQL 11 and earlier; for PostgreSQL 12, remove standby.signal instead
    rm -f /var/lib/pgsql/data/recovery.conf
    # perform recovery, then exit as soon as end of input is reached
    postgres --single -D /var/lib/pgsql/data/ &lt; /dev/null</programlisting>
   </para>
   <para>
    Note that <filename>standby.signal</filename> (PostgreSQL 11 and earlier:
    <filename>recovery.conf</filename>) <emphasis>must</emphasis> be removed
    from the data directory for PostgreSQL to be able to start in
    single-user mode.
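   </para>
   <para>
    Whether this manual step is needed at all can be checked from the control file
    of the stopped instance. The following is a minimal shell sketch, not part of
    &repmgr;: the function names <literal>cluster_state</literal> and
    <literal>is_clean_shutdown</literal> are illustrative, and it assumes that
    <application>pg_controldata</application> is in the path and reports a
    <literal>Database cluster state</literal> line.
<programlisting>
    # Sketch only: helper names are hypothetical, not repmgr commands.
    # Print the cluster state recorded in the control file of data directory $1.
    cluster_state() {
        # LC_ALL=C keeps the pg_controldata labels untranslated
        LC_ALL=C pg_controldata "$1" | sed -n 's/^Database cluster state: *//p'
    }

    # Return 0 if the instance in data directory $1 was shut down cleanly,
    # either as a primary ("shut down") or a standby ("shut down in recovery").
    is_clean_shutdown() {
        case "$(cluster_state "$1")" in
            "shut down"|"shut down in recovery") return 0 ;;
            *) return 1 ;;
        esac
    }</programlisting>
    For example, <command>is_clean_shutdown /var/lib/pgsql/data</command> returns a
    non-zero status if the instance still needs to be started and shut down cleanly
    as described above.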
   </para>
  </tip>

 </refsect1>

 <refsect1 id="repmgr-node-rejoin-pg-rewind" xreflabel="Using pg_rewind">

  <title>Using <command>pg_rewind</command></title>

  <indexterm>
   <primary>pg_rewind</primary>
   <secondary>using with "repmgr node rejoin"</secondary>
  </indexterm>

  <para>
   <command>repmgr node rejoin</command> can optionally use <command>pg_rewind</command> to re-integrate a
   node which has diverged from the rest of the cluster, typically a failed primary.
   <command>pg_rewind</command> is available in PostgreSQL 9.5 and later as part of the core distribution,
   and can be installed from external sources for PostgreSQL 9.4.
  </para>
  <note>
   <para>
    <command>pg_rewind</command> <emphasis>requires</emphasis> that either
    <varname>wal_log_hints</varname> is enabled, or that
    data checksums were enabled when the cluster was initialized. See the
    <ulink url="https://www.postgresql.org/docs/current/app-pgrewind.html"><command>pg_rewind</command> documentation</ulink> for details.
   </para>
  </note>

  <para>
   We strongly recommend familiarizing yourself with <command>pg_rewind</command> before attempting
   to use it with &repmgr;, as while it is an extremely useful tool, it is <emphasis>not</emphasis>
   a "magic bullet" which can resolve all problematic replication situations.
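  </para>
  <para>
   The <varname>wal_log_hints</varname> and data checksum requirements noted above
   can be checked on a stopped node before attempting a rewind. The following is a
   minimal shell sketch, not part of &repmgr;: the function name
   <literal>rewind_prereqs_met</literal> is illustrative, and it assumes that
   <application>pg_controldata</application> is in the path and reports the
   <literal>wal_log_hints setting</literal> and
   <literal>Data page checksum version</literal> lines.
<programlisting>
    # Sketch only: "rewind_prereqs_met" is a hypothetical helper name.
    # Return 0 if the control file of data directory $1 shows that either
    # wal_log_hints is on or data checksums are enabled.
    rewind_prereqs_met() {
        # LC_ALL=C keeps the pg_controldata labels untranslated
        control=$(LC_ALL=C pg_controldata "$1")
        if echo "$control" | grep -Eq '^wal_log_hints setting: +on$'; then
            return 0
        fi
        # a non-zero checksum version means data checksums are enabled
        if echo "$control" | grep -Eq '^Data page checksum version: +[1-9]'; then
            return 0
        fi
        return 1
    }</programlisting>
   This only inspects the local control file; running
   <command>repmgr node rejoin</command> with <option>--dry-run</option> remains
   the way to have &repmgr; itself check the prerequisites.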
  </para>

  <para>
   A typical use-case for <command>pg_rewind</command> is when a scenario like the following
   is encountered:
   <programlisting>
    $ repmgr node rejoin -f /etc/repmgr.conf -d 'host=node3 dbname=repmgr user=repmgr' \
        --config-files=postgresql.local.conf,postgresql.conf --verbose --dry-run
    INFO: replication connection to the rejoin target node was successful
    INFO: local and rejoin target system identifiers match
    DETAIL: system identifier is 6652184002263212600
    ERROR: this node cannot attach to rejoin target node 3
    DETAIL: rejoin target server's timeline 2 forked off current database system timeline 1 before current recovery point 0/610D710
    HINT: use --force-rewind to execute pg_rewind</programlisting>

   Here, <literal>node3</literal> was promoted to a primary while the local node was
   still attached to the previous primary; this can potentially happen during e.g. a
   network split. <command>pg_rewind</command> can re-sync the local node with <literal>node3</literal>,
   removing the need for a full reclone.
  </para>

  <para>
   To have <command>repmgr node rejoin</command> use <command>pg_rewind</command>,
   pass the command line option <literal>--force-rewind</literal>, which will tell &repmgr;
   to execute <command>pg_rewind</command> to ensure the node can be rejoined successfully.
  </para>

  <refsect2 id="repmgr-node-rejoin-pg-rewind-config-files" xreflabel="pg_rewind and configuration files">

   <title><command>pg_rewind</command> and configuration file retention</title>

   <indexterm>
    <primary>pg_rewind</primary>
    <secondary>configuration file retention</secondary>
   </indexterm>

   <para>
    Be aware that if <command>pg_rewind</command> is executed and actually performs a
    rewind operation, any configuration files in the PostgreSQL data directory will be
    overwritten with those from the source server.
   </para>
   <para>
    To prevent this happening, provide a comma-separated list of files to retain
    using the <option>--config-files</option> command line option; the specified files
    will be archived in a temporary directory (whose parent directory can be specified with
    <option>--config-archive-dir</option>, default: <filename>/tmp</filename>)
    and restored once the rewind operation is complete.
   </para>
  </refsect2>

  <refsect2 id="repmgr-node-rejoin-pg-rewind-example" xreflabel="example using repmgr node rejoin and pg_rewind">

   <title>Example using <command>repmgr node rejoin</command> and <command>pg_rewind</command></title>

   <indexterm>
    <primary>pg_rewind</primary>
    <secondary>configuration file retention</secondary>
   </indexterm>


   <para>
    Example, first using <option>--dry-run</option>, then actually executing the
    <literal>node rejoin</literal> command.
    <programlisting>
    $ repmgr node rejoin -f /etc/repmgr.conf -d 'host=node3 dbname=repmgr user=repmgr' \
        --config-files=postgresql.local.conf,postgresql.conf --verbose --force-rewind --dry-run
    INFO: replication connection to the rejoin target node was successful
    INFO: local and rejoin target system identifiers match
    DETAIL: system identifier is 6652460429293670710
    NOTICE: pg_rewind execution required for this node to attach to rejoin target node 3
    DETAIL: rejoin target server's timeline 2 forked off current database system timeline 1 before current recovery point 0/610D710
    INFO: prerequisites for using pg_rewind are met
    INFO: file "postgresql.local.conf" would be copied to "/tmp/repmgr-config-archive-node2/postgresql.local.conf"
    INFO: file "postgresql.replication-setup.conf" would be copied to "/tmp/repmgr-config-archive-node2/postgresql.replication-setup.conf"
    INFO: pg_rewind would now be executed
    DETAIL: pg_rewind command is:
      pg_rewind -D '/var/lib/postgresql/data' --source-server='host=node3 dbname=repmgr user=repmgr'
    INFO: prerequisites for executing NODE REJOIN are met</programlisting>

    <note>
     <para>
      If <option>--force-rewind</option> is used with the <option>--dry-run</option> option,
      this checks the prerequisites for using <application>pg_rewind</application>, but is
      not an absolute guarantee that actually executing <application>pg_rewind</application>
      will succeed. See also section <xref linkend="repmgr-node-rejoin-caveats"/> below.
     </para>

    </note>

    <programlisting>
    $ repmgr node rejoin -f /etc/repmgr.conf -d 'host=node3 dbname=repmgr user=repmgr' \
        --config-files=postgresql.local.conf,postgresql.conf --verbose --force-rewind
    NOTICE: pg_rewind execution required for this node to attach to rejoin target node 3
    DETAIL: rejoin target server's timeline 2 forked off current database system timeline 1 before current recovery point 0/610D710
    NOTICE: executing pg_rewind
    DETAIL: pg_rewind command is "pg_rewind -D '/var/lib/postgresql/data' --source-server='host=node3 dbname=repmgr user=repmgr'"
    NOTICE: 2 files copied to /var/lib/postgresql/data
    NOTICE: setting node 2's upstream to node 3
    NOTICE: starting server using "pg_ctl -l /var/log/postgres/startup.log -w -D '/var/lib/pgsql/data' start"
    NOTICE: NODE REJOIN successful
    DETAIL: node 2 is now attached to node 3</programlisting>
   </para>
  </refsect2>
 </refsect1>

 <refsect1 id="repmgr-node-rejoin-caveats" xreflabel="Caveats">

  <title>Caveats when using <command>repmgr node rejoin</command></title>

  <indexterm>
   <primary>repmgr node rejoin</primary>
   <secondary>caveats</secondary>
  </indexterm>

  <para>
   <command>repmgr node rejoin</command> attempts to determine whether it will succeed by
   comparing the timelines and relative WAL positions of the local node (rejoin candidate) and primary
   (rejoin target).
   This is particularly important if planning to use <application>pg_rewind</application>,
   which currently (as of PostgreSQL 12) may appear to succeed (or indicate there is no action
   needed) but potentially allow an impossible action, such as trying to rejoin a standby to a
   primary which is behind the standby. &repmgr; will prevent this situation from occurring.
  </para>
  <para>
   Currently it is <emphasis>not</emphasis> possible to detect a situation where the rejoin target
   is a standby which has been "promoted" by removing <filename>recovery.conf</filename>
   (PostgreSQL 12 and later: <filename>standby.signal</filename>) and restarting it.
   In this case there will be no information about the point the rejoin target diverged
   from the current standby; the rejoin operation will fail and
   the current standby's PostgreSQL log will contain entries with the text
   "<literal>record with incorrect prev-link</literal>".
  </para>
  <para>
   In PostgreSQL 9.5 and earlier, it is <emphasis>not</emphasis> possible to use
   <application>pg_rewind</application> to attach to a target node with a lower
   timeline than the local node.
  </para>
  <para>
   We strongly recommend running <command>repmgr node rejoin</command> with the
   <option>--dry-run</option> option first. Additionally it might be a good idea
   to execute the <application>pg_rewind</application> command displayed by
   &repmgr; with the <application>pg_rewind</application> <option>--dry-run</option>
   option. Note that <application>pg_rewind</application> does not indicate that it
   is running in <option>--dry-run</option> mode.
  </para>

  <warning>
   <para>
    In all current PostgreSQL versions (as of September 2020), <application>pg_rewind</application>
    contains a corner-case bug which affects standbys in a very specific situation.
   </para>
   <para>
    This situation occurs when a standby was shut down <emphasis>before</emphasis> its
    primary node, and an attempt is made to attach this standby to another primary
    in the same cluster (following a "split brain" situation where the standby
    was connected to the wrong primary). In this case, &repmgr; will correctly determine
    that <application>pg_rewind</application> should be executed; however,
    <application>pg_rewind</application> incorrectly decides that no action is necessary.
   </para>
   <para>
    In this situation, &repmgr; will report something like:
<programlisting>
    NOTICE: pg_rewind execution required for this node to attach to rejoin target node 1
    DETAIL: rejoin target server's timeline 3 forked off current database system timeline 2 before current recovery point 0/7019C10</programlisting>
    but when executed, <application>pg_rewind</application> will report:
<programlisting>
    pg_rewind: servers diverged at WAL location 0/7015540 on timeline 2
    pg_rewind: no rewind required</programlisting>
    and if an attempt is made to attach the standby to the new primary, PostgreSQL logs on the standby
    will contain errors like:
<programlisting>
    [2020-09-07 15:01:41 UTC] LOG: 00000: replication terminated by primary server
    [2020-09-07 15:01:41 UTC] DETAIL: End of WAL reached on timeline 2 at 0/7015540.
    [2020-09-07 15:01:41 UTC] LOG: 00000: new timeline 3 forked off current database system timeline 2 before current recovery point 0/7019C10</programlisting>
   </para>
   <para>
    Currently it is not possible to resolve this situation using <application>pg_rewind</application>.
    A <ulink url="https://www.postgresql.org/message-id/flat/CABvVfJU-LDWvoz4-Yow3Ay5LZYTuPD7eSjjE4kGyNZpXC6FrVQ@mail.gmail.com">patch</ulink>
    has been submitted and will hopefully be included in a forthcoming PostgreSQL minor release.
   </para>
   <para>
    As a workaround, start the primary server the standby was previously attached to,
    and ensure the standby can be attached to it. If <application>pg_rewind</application> was actually executed,
    it will have copied in the <filename>.history</filename> file from the target primary server; this must
    be removed. <command>repmgr node rejoin</command> can then be used to attach the standby to the original
    primary. Ensure any changes pending on the primary have propagated to the standby. Then shut down the primary
    server <emphasis>first</emphasis>, before shutting down the standby. It should then be possible to
    use <command>repmgr node rejoin</command> to attach the standby to the new primary.
   </para>
  </warning>

 </refsect1>

 <refsect1>
  <title>See also</title>
  <para>
   <xref linkend="repmgr-standby-follow"/>
  </para>
 </refsect1>
</refentry>