1 /*- 2 * See the file LICENSE for redistribution information. 3 * 4 * Copyright (c) 2002, 2014 Oracle and/or its affiliates. All rights reserved. 5 * 6 */ 7 8 package com.sleepycat.je.log; 9 10 import com.sleepycat.je.utilint.DbLsn; 11 12 /** 13 * Specifies whether to log an entry provisionally. 14 * 15 * Provisional log entries: 16 * 17 * What are provisional log entries? 18 * 19 * Provisional log entries are those tagged with the provisional attribute 20 * in the log entry header. The provisional attribute can be applied to any 21 * type of log entry, and is implemented in 22 * com.sleepycat.je.log.LogEntryHeader as two stolen bits in the 8 bit 23 * version field. 24 * 25 * When is the provisional attribute used? 26 * 27 * The provisional attribute is used only during recovery. It very simply 28 * indicates that recovery will ignore and skip over this log entry. 29 * 30 * When is the provisional attribute set? 31 * 32 * The provisional attribute started out as a way to create atomicity among 33 * different log entries. Child pointers in the JE Btree are physical LSNs, 34 * so each Btree node's children must be logged before it in the log. On 35 * the other hand, one fundamental assumption of the JE log is that each 36 * Btree node found in the log can be replayed and placed in the in-memory 37 * tree. To do so, each Btree node must have a parent node that will house 38 * it. The grouping of multiple log entries into one atomic group is often 39 * used to fulfiil this requirement. 40 * 41 * * Atomic log entries: 42 * 43 * + When a btree is first created, we bootstrap tree creation by 44 * logging the first BIN provisionally, then creating a parent IN 45 * which is the Btree root IN, which points to this first BIN. 46 * 47 * + When we split a Btree node, we create a new IN, which is the 48 * sibling of the split node. We log the old sibling and the new 49 * sibling provisionally, and then log the parent, so that any 50 * crashes in the middle of this triumvirate which result in the 51 * failure to log the parent will skip over the orphaned siblings. 52 * 53 * + Splitting the Btree root is just a special case of a split. 54 * 55 * + Creating a duplicate subtree to hang in the middle of a btree is 56 * just a special case of a split and btree first creation. 57 * 58 * * Entries not meant to be recovered 59 * 60 * Temp DBs are not meant to be recovered and we log their LN 61 * and IN nodes in a very lax fashion, purely as a way of evicting 62 * them out of the cache temporarily. There is no guarantee that a 63 * consistent set has been logged to disk. We skip over them for both 64 * recovery performance and the "each-node-must-have-a-parent" rule. 65 * 66 * * Durable deferred-write entries 67 * 68 * Deferred-write INs are logged provisionally for the same reasons 69 * as for temp DBs (above): for recovery performance and the 70 * "each-node-must-have-a-parent" rule. 71 * 72 * Deferred-write LNs are logged non-provisionally to support 73 * obsolete LSN counting. It would be nice to log them provisionally 74 * for recovery performance and to allow LN deletion without logging; 75 * however, that is not currently practical if obsolete counting is 76 * to be supported. See [#16864]. 77 * 78 * * Checkpoint performance 79 * 80 * When we flush a series of nodes, it's a waste to replay nodes 81 * which are referenced by higher levels. For example, if we 82 * checkpoint this btree: 83 * 84 * INA -> INb -> BINc (dirty)-> LNd 85 * 86 * we log them in this order: 87 * 88 * BINc 89 * INb 90 * 91 * And there's no point to replaying BINc, because it's referenced by 92 * INb. We skip over BINc, which we do by logging it provisionally. 93 * 94 * In addition, BEFORE_CKPT_END is used to improve cleaner 95 * performance by keeping utilization information up-to-date during 96 * the checkpoint. See below for details. 97 * 98 * * Log cleaning - removing references to deleted files. 99 * 100 * When we delete a file for log cleaning we guarantee that no active log 101 * entries refer to any log entry in the deleted file. Suppose our 102 * checkpoint looks like this: 103 * 104 * 5/100 LNa 105 * 5/200 Ckpt start 106 * 5/300 INs 107 * ... 108 * 5/500 Ckpt end 109 * ... 110 * 5/800 last entry in log 111 * 112 * Because we do not delete a file until the Ckpt end after processing 113 * (cleaning) it, nothing from 5/500 to 5/800 can refer to a file deleted 114 * due to the Ckpt end in 5/500. 115 * 116 * BEFORE_CKPT_END is motivated in part (see below for a complete 117 * description) by the fact that while log entries between 5/100 118 * (first active lsn) and 5/500 (ckpt end) will not in of themselves 119 * contain a LSN for a cleaned, deleted file, the act of processing them 120 * during recovery could require fetching a node from a deleted file. For 121 * example, the IN at 5/300 could have an in-memory parent which has a 122 * reference to an older, cleaned version of that IN. Treating the span 123 * between 5/200 and 5/500 as provisional is both optimal, because only 124 * the high level INs need to be processed, and required, in order not to 125 * fetch from a cleaned file. 126 * 127 * The correctness issue is described in [#16037] comment 151, where we 128 * attempted to log non-provisionally below maxFlushLevel. It is 129 * repeated below. 130 * 131 * IN-A 132 * \ 133 * IN-B 134 * \ 135 * IN-C 136 * \ 137 * BIN-D 138 * 139 * 1/100 CkptStart 140 * 1/200 BIN-D provisional 141 * 1/300 IN-C non-provisional 142 * 2/100 IN-B non-provisional 143 * 2/200 IN-A non-provisional 144 * 2/300 MapLN refers to IN-A 145 * 2/400 CkptEnd 146 * 5/100 cleaner processes file 1 147 * BIN-D and IN-C are dirty 148 * 5/200 CkptStart 149 * 5/300 BIN-D provisional 150 * 5/400 IN-C non-provisional 151 * 5/500 IN-B non-provisional (must log one extra level) 152 * IN-A is not logged 153 * MapLN still refers to IN-A at 2/200 154 * 5/600 CkptEnd 155 * file 1 is deleted 156 * 6/100 Start recovery 157 * 158 * Note that only the bottom level BINs are logged provisionally because 159 * we're logging level 2 and up non-provisionally in this experiment. 160 * 161 * Recovery replays IN-C at 5/400 because it is non-provisional. 162 * 163 * When it does the tree lookup (getParentINForChildIN) it uses the root 164 * IN-A at 2/200. This search fetches IN-B at 2/100 and then fails 165 * fetching IN-C at 1/300 because file 1 has been deleted. 166 * 167 * In reality we log provisionally below maxFlushLevel, so that IN-C at 168 * 5/400 is not replayed. IN-B at 5/500 is at the maxFlushLevel and is 169 * non-provisional and is replayed. The search succeeds because nothing 170 * in file 1 needs to be fetched to find the parent. 171 * 172 * TODO: Could we instead replay INs in reverse order? 173 * Then IN-B at 5/500 would be replayed first. Unfortunately this would 174 * probably break something else. For example, the utilization tracking 175 * replay for INs currently depends on reading forward. This is worth 176 * exploring, however, since reducing logging during checkpoints would be 177 * extremely beneficial. 178 * 179 * Provisional.BEFORE_CKPT_END 180 * --------------------------- 181 * This property was added to solve a specific problem that occurs in earlier 182 * versions of JE: When a long checkpoint runs, the BINs are not counted 183 * obsolete until after the entire BIN level has been logged. Specifically, 184 * they are counted obsolete when their ancestor is logged non-provisionally. 185 * Most INs logged by a checkpoint are BINs. This means that during a very 186 * long checkpoint, cleaning of the files containing those old BINs is delayed, 187 * and more importantly the calculated utilization is much higher than it 188 * actually is. The correction in utilization does not occur until the end of 189 * the checkpoint, when the higher level INs are logged. This manifests as a 190 * lull in cleaning during the checkpoint, because calculated utilization is 191 * artificially high, and a spike in cleaning at the end of the checkpoint. In 192 * some cases, the cleaner cannot recover from the backlog that is created by 193 * the spike. 194 * 195 * The provisional property effects obsolete counting as follows: 196 * 197 * + If an IN is logged with Provisional.YES, the old version of the IN is not 198 * counted obsolete immediately. Instead, the offset of the old version of 199 * the IN is added to a list in its parent IN. The offsets migrate upward 200 * in the tree in this manner until an ancestor IN is logged 201 * non-provisionally. 202 * 203 * + If an IN is logged with Provisional.NO or BEFORE_CKPT_END, the old 204 * version of the IN is counted obsolete immediately (and offsets 205 * accumulated from provisional child INs are counted). This means 206 * that the obsolete offset is added to the UtilizationTracker, and may be 207 * flushed in a FileSummaryLN any time after that. At the latest, it is 208 * flushed at the end of the checkpoint. 209 * 210 * Because subtree logging is now used for checkpoints and the parent IN of 211 * each logged sub-tree is logged with BEFORE_CKPT_END, the prior version of 212 * all INs in the sub-tree will be counted obsolete at that time. This keeps 213 * the calculated utilization accurate throughout the checkpoint, and prevents 214 * the large per-checkpoint lull and spike in log cleaning. 215 * 216 * For the intermediate levels, Provisional.BEFORE_CKPT_END must be used rather 217 * than Provisional.NO, which is reserved for the highest level only. During 218 * recovery, the Provisional values are treated as follows (this is from the 219 * Provisional javadoc): 220 * + NO: The entry is non-provisional and is always processed by recovery. 221 * + YES: The entry is provisional and is never processed by recovery. 222 * + BEFORE_CKPT_END: The entry is provisional (not processed by recovery) if 223 * it occurs before the CkptEnd in the recovery interval, or is 224 * non-provisional (is processed) if it occurs after CkptEnd. 225 * 226 * The key to BEFORE_CKPT_END is that it is treated as provisional if a CkptEnd 227 * is logged, i.e., if we do not crash before completing the checkpoint. 228 * Because the checkpoint completed, we may have deleted log files that 229 * would be necessary to replay the IN. So we cannot safely replay it. 230 * 231 * Note the difference in the treatment of BEFORE_CKPT_END for obsolete 232 * counting and recovery: 233 * + For obsolete counting, BEFORE_CKPT_END is treated as non-provisional. 234 * + For recovery, when the IN occurs before CkptEnd, BEFORE_CKPT_END is 235 * treated as provisional. 236 * This difference is the reason for the existence of BEFORE_CKPT_END. 237 */ 238 public enum Provisional { 239 240 /** 241 * The entry is non-provisional and is always processed by recovery. 242 */ 243 NO, 244 245 /** 246 * The entry is provisional and is never processed by recovery. 247 */ 248 YES, 249 250 /** 251 * The entry is provisional (not processed by recovery) if it occurs before 252 * the CkptEnd in the recovery interval, or is non-provisional (is 253 * processed) if it occurs after CkptEnd. 254 */ 255 BEFORE_CKPT_END; 256 257 /** 258 * Determines whether a given log entry should be processed during 259 * recovery. 260 */ isProvisional(long logEntryLsn, long ckptEndLsn)261 public boolean isProvisional(long logEntryLsn, long ckptEndLsn) { 262 assert logEntryLsn != DbLsn.NULL_LSN; 263 switch (this) { 264 case NO: 265 return false; 266 case YES: 267 return true; 268 case BEFORE_CKPT_END: 269 return ckptEndLsn != DbLsn.NULL_LSN && 270 DbLsn.compareTo(logEntryLsn, ckptEndLsn) < 0; 271 default: 272 assert false; 273 return false; 274 } 275 } 276 } 277