1 /*-
2  * See the file LICENSE for redistribution information.
3  *
4  * Copyright (c) 2002, 2014 Oracle and/or its affiliates.  All rights reserved.
5  *
6  */
7 
8 package com.sleepycat.je.log;
9 
10 import com.sleepycat.je.utilint.DbLsn;
11 
12 /**
13  * Specifies whether to log an entry provisionally.
14  *
15  * Provisional log entries:
16  *
17  * What are provisional log entries?
18  *
19  *    Provisional log entries are those tagged with the provisional attribute
20  *    in the log entry header. The provisional attribute can be applied to any
21  *    type of log entry, and is implemented in
22  *    com.sleepycat.je.log.LogEntryHeader as two stolen bits in the 8 bit
23  *    version field.
24  *
25  * When is the provisional attribute used?
26  *
27  *    The provisional attribute is used only during recovery. It very simply
28  *    indicates that recovery will ignore and skip over this log entry.
29  *
30  * When is the provisional attribute set?
31  *
32  *    The provisional attribute started out as a way to create atomicity among
33  *    different log entries. Child pointers in the JE Btree are physical LSNs,
34  *    so each Btree node's children must be logged before it in the log. On
35  *    the other hand, one fundamental assumption of the JE log is that each
36  *    Btree node found in the log can be replayed and placed in the in-memory
37  *    tree. To do so, each Btree node must have a parent node that will house
38  *    it. The grouping of multiple log entries into one atomic group is often
39  *    used to fulfiil this requirement.
40  *
41  *     * Atomic log entries:
42  *
43  *           + When a btree is first created, we bootstrap tree creation by
44  *           logging the first BIN provisionally, then creating a parent IN
45  *           which is the Btree root IN, which points to this first BIN.
46  *
47  *           + When we split a Btree node, we create a new IN, which is the
48  *           sibling of the split node. We log the old sibling and the new
49  *           sibling provisionally, and then log the parent, so that any
50  *           crashes in the middle of this triumvirate which result in the
51  *           failure to log the parent will skip over the orphaned siblings.
52  *
53  *           + Splitting the Btree root is just a special case of a split.
54  *
55  *           + Creating a duplicate subtree to hang in the middle of a btree is
56  *           just a special case of a split and btree first creation.
57  *
58  *     * Entries not meant to be recovered
59  *
60  *           Temp DBs are not meant to be recovered and we log their LN
61  *           and IN nodes in a very lax fashion, purely as a way of evicting
62  *           them out of the cache temporarily. There is no guarantee that a
63  *           consistent set has been logged to disk. We skip over them for both
64  *           recovery performance and the "each-node-must-have-a-parent" rule.
65  *
66  *     * Durable deferred-write entries
67  *
68  *           Deferred-write INs are logged provisionally for the same reasons
69  *           as for temp DBs (above): for recovery performance and the
70  *           "each-node-must-have-a-parent" rule.
71  *
72  *           Deferred-write LNs are logged non-provisionally to support
73  *           obsolete LSN counting.  It would be nice to log them provisionally
74  *           for recovery performance and to allow LN deletion without logging;
75  *           however, that is not currently practical if obsolete counting is
76  *           to be supported.  See [#16864].
77  *
78  *     * Checkpoint performance
79  *
80  *           When we flush a series of nodes, it's a waste to replay nodes
81  *           which are referenced by higher levels. For example, if we
82  *           checkpoint this btree:
83  *
84  *           INA -> INb -> BINc (dirty)-> LNd
85  *
86  *           we log them in this order:
87  *
88  *           BINc
89  *           INb
90  *
91  *           And there's no point to replaying BINc, because it's referenced by
92  *           INb.  We skip over BINc, which we do by logging it provisionally.
93  *
94  *           In addition, BEFORE_CKPT_END is used to improve cleaner
95  *           performance by keeping utilization information up-to-date during
96  *           the checkpoint.  See below for details.
97  *
98  *     * Log cleaning - removing references to deleted files.
99  *
100  *       When we delete a file for log cleaning we guarantee that no active log
101  *       entries refer to any log entry in the deleted file. Suppose our
102  *       checkpoint looks like this:
103  *
104  *         5/100 LNa
105  *         5/200 Ckpt start
106  *         5/300 INs
107  *         ...
108  *         5/500 Ckpt end
109  *         ...
110  *         5/800 last entry in log
111  *
112  *       Because we do not delete a file until the Ckpt end after processing
113  *       (cleaning) it, nothing from 5/500 to 5/800 can refer to a file deleted
114  *       due to the Ckpt end in 5/500.
115  *
116  *       BEFORE_CKPT_END is motivated in part (see below for a complete
117  *       description) by the fact that while log entries between 5/100
118  *       (first active lsn) and 5/500 (ckpt end) will not in of themselves
119  *       contain a LSN for a cleaned, deleted file, the act of processing them
120  *       during recovery could require fetching a node from a deleted file. For
121  *       example, the IN at 5/300 could have an in-memory parent which has a
122  *       reference to an older, cleaned version of that IN.  Treating the span
123  *       between 5/200 and 5/500 as provisional is both optimal, because only
124  *       the high level INs need to be processed, and required, in order not to
125  *       fetch from a cleaned file.
126  *
127  *       The correctness issue is described in [#16037] comment 151, where we
128  *       attempted to log non-provisionally below maxFlushLevel.  It is
129  *       repeated below.
130  *
131  *       IN-A
132  *        \
133  *        IN-B
134  *         \
135  *         IN-C
136  *          \
137  *          BIN-D
138  *
139  *       1/100 CkptStart
140  *       1/200 BIN-D provisional
141  *       1/300 IN-C non-provisional
142  *       2/100 IN-B non-provisional
143  *       2/200 IN-A non-provisional
144  *       2/300 MapLN refers to IN-A
145  *       2/400 CkptEnd
146  *       5/100 cleaner processes file 1
147  *             BIN-D and IN-C are dirty
148  *       5/200 CkptStart
149  *       5/300 BIN-D provisional
150  *       5/400 IN-C non-provisional
151  *       5/500 IN-B non-provisional (must log one extra level)
152  *             IN-A is not logged
153  *             MapLN still refers to IN-A at 2/200
154  *       5/600 CkptEnd
155  *             file 1 is deleted
156  *       6/100 Start recovery
157  *
158  *       Note that only the bottom level BINs are logged provisionally because
159  *       we're logging level 2 and up non-provisionally in this experiment.
160  *
161  *       Recovery replays IN-C at 5/400 because it is non-provisional.
162  *
163  *       When it does the tree lookup (getParentINForChildIN) it uses the root
164  *       IN-A at 2/200. This search fetches IN-B at 2/100 and then fails
165  *       fetching IN-C at 1/300 because file 1 has been deleted.
166  *
167  *       In reality we log provisionally below maxFlushLevel, so that IN-C at
168  *       5/400 is not replayed. IN-B at 5/500 is at the maxFlushLevel and is
169  *       non-provisional and is replayed.  The search succeeds because nothing
170  *       in file 1 needs to be fetched to find the parent.
171  *
172  *       TODO: Could we instead replay INs in reverse order?
173  *       Then IN-B at 5/500 would be replayed first. Unfortunately this would
174  *       probably break something else. For example, the utilization tracking
175  *       replay for INs currently depends on reading forward.  This is worth
176  *       exploring, however, since reducing logging during checkpoints would be
177  *       extremely beneficial.
178  *
179  * Provisional.BEFORE_CKPT_END
180  * ---------------------------
181  * This property was added to solve a specific problem that occurs in earlier
182  * versions of JE:  When a long checkpoint runs, the BINs are not counted
183  * obsolete until after the entire BIN level has been logged.  Specifically,
184  * they are counted obsolete when their ancestor is logged non-provisionally.
185  * Most INs logged by a checkpoint are BINs.  This means that during a very
186  * long checkpoint, cleaning of the files containing those old BINs is delayed,
187  * and more importantly the calculated utilization is much higher than it
188  * actually is.  The correction in utilization does not occur until the end of
189  * the checkpoint, when the higher level INs are logged.  This manifests as a
190  * lull in cleaning during the checkpoint, because calculated utilization is
191  * artificially high, and a spike in cleaning at the end of the checkpoint.  In
192  * some cases, the cleaner cannot recover from the backlog that is created by
193  * the spike.
194  *
195  * The provisional property effects obsolete counting as follows:
196  *
197  *  + If an IN is logged with Provisional.YES, the old version of the IN is not
198  *    counted obsolete immediately.  Instead, the offset of the old version of
199  *    the IN is added to a list in its parent IN.  The offsets migrate upward
200  *    in the tree in this manner until an ancestor IN is logged
201  *    non-provisionally.
202  *
203  *  + If an IN is logged with Provisional.NO or BEFORE_CKPT_END, the old
204  *    version of the IN is counted obsolete immediately (and offsets
205  *    accumulated from provisional child INs are counted).  This means
206  *    that the obsolete offset is added to the UtilizationTracker, and may be
207  *    flushed in a FileSummaryLN any time after that.  At the latest, it is
208  *    flushed at the end of the checkpoint.
209  *
210  * Because subtree logging is now used for checkpoints and the parent IN of
211  * each logged sub-tree is logged with BEFORE_CKPT_END, the prior version of
212  * all INs in the sub-tree will be counted obsolete at that time.  This keeps
213  * the calculated utilization accurate throughout the checkpoint, and prevents
214  * the large per-checkpoint lull and spike in log cleaning.
215  *
216  * For the intermediate levels, Provisional.BEFORE_CKPT_END must be used rather
217  * than Provisional.NO, which is reserved for the highest level only.  During
218  * recovery, the Provisional values are treated as follows (this is from the
219  * Provisional javadoc):
220  *  + NO:  The entry is non-provisional and is always processed by recovery.
221  *  + YES: The entry is provisional and is never processed by recovery.
222  *  + BEFORE_CKPT_END: The entry is provisional (not processed by recovery) if
223  *    it occurs before the CkptEnd in the recovery interval, or is
224  *    non-provisional (is processed) if it occurs after CkptEnd.
225  *
226  * The key to BEFORE_CKPT_END is that it is treated as provisional if a CkptEnd
227  * is logged, i.e., if we do not crash before completing the checkpoint.
228  * Because the checkpoint completed, we may have deleted log files that
229  * would be necessary to replay the IN.  So we cannot safely replay it.
230  *
231  * Note the difference in the treatment of BEFORE_CKPT_END for obsolete
232  * counting and recovery:
233  *  + For obsolete counting, BEFORE_CKPT_END is treated as non-provisional.
234  *  + For recovery, when the IN occurs before CkptEnd, BEFORE_CKPT_END is
235  *    treated as provisional.
236  * This difference is the reason for the existence of BEFORE_CKPT_END.
237  */
238 public enum Provisional {
239 
240     /**
241      * The entry is non-provisional and is always processed by recovery.
242      */
243     NO,
244 
245     /**
246      * The entry is provisional and is never processed by recovery.
247      */
248     YES,
249 
250     /**
251      * The entry is provisional (not processed by recovery) if it occurs before
252      * the CkptEnd in the recovery interval, or is non-provisional (is
253      * processed) if it occurs after CkptEnd.
254      */
255     BEFORE_CKPT_END;
256 
257     /**
258      * Determines whether a given log entry should be processed during
259      * recovery.
260      */
isProvisional(long logEntryLsn, long ckptEndLsn)261     public boolean isProvisional(long logEntryLsn, long ckptEndLsn) {
262         assert logEntryLsn != DbLsn.NULL_LSN;
263         switch (this) {
264         case NO:
265             return false;
266         case YES:
267             return true;
268         case BEFORE_CKPT_END:
269             return ckptEndLsn != DbLsn.NULL_LSN &&
270                    DbLsn.compareTo(logEntryLsn, ckptEndLsn) < 0;
271         default:
272             assert false;
273             return false;
274         }
275     }
276 }
277