/*
 * CDDL HEADER START
 *
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 *
 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
 * or http://www.opensolaris.org/os/licensing.
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each
 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 *
 * CDDL HEADER END
 */

/*
 * Copyright (c) 2018, 2019 by Delphix. All rights reserved.
 */

#include <sys/dmu_objset.h>
#include <sys/metaslab.h>
#include <sys/metaslab_impl.h>
#include <sys/spa.h>
#include <sys/spa_impl.h>
#include <sys/spa_log_spacemap.h>
#include <sys/vdev_impl.h>
#include <sys/zap.h>

/*
 * Log Space Maps
 *
 * Log space maps are an optimization in ZFS metadata allocations for pools
 * whose workloads are primarily random-writes. Random-write workloads are also
 * typically random-free, meaning that they are freeing from locations scattered
 * throughout the pool. This means that each TXG we will have to append some
 * FREE records to almost every metaslab. With log space maps, we hold their
 * changes in memory and log them all together in one pool-wide space map
 * on-disk for persistence. As more blocks are accumulated in the log space
 * maps and more unflushed changes are accounted in memory, we flush a selected
 * group of metaslabs every TXG to relieve memory pressure and potential
 * overheads when loading the pool. Flushing a metaslab to disk relieves memory
 * as we flush any unflushed changes from memory to disk (i.e. the metaslab's
 * space map) and saves import time by making old log space maps obsolete and
 * eventually destroying them. [A log space map is said to be obsolete when all
 * its entries have made it to their corresponding metaslab space maps].
 *
 * == On disk data structures used ==
 *
 * - The pool has a new feature flag and a new entry in the MOS. The feature
 *   is activated when we create the first log space map and remains active
 *   for the lifetime of the pool. The new entry in the MOS Directory [refer
 *   to DMU_POOL_LOG_SPACEMAP_ZAP] is populated with a ZAP whose key-value
 *   pairs are of the form <key: txg, value: log space map object for that txg>.
 *   This entry is our on-disk reference of the log space maps that exist in
 *   the pool for each TXG and it is used during import to load all the
 *   metaslab unflushed changes in memory. To see how this structure is first
 *   created and later populated refer to spa_generate_syncing_log_sm(). To see
 *   how it is used during import time refer to spa_ld_log_sm_metadata().
 *
 * - Each vdev has a new entry in its vdev_top_zap (see field
 *   VDEV_TOP_ZAP_MS_UNFLUSHED_PHYS_TXGS) which holds the msp_unflushed_txg of
 *   each metaslab in this vdev. This field is the on-disk counterpart of the
 *   in-memory field ms_unflushed_txg which tells us from which TXG and onwards
 *   the metaslab hasn't had its changes flushed. During import, we use this
 *   to ignore any entries in the space map log that are for this metaslab but
 *   from a TXG before msp_unflushed_txg. At that point, we also populate its
 *   in-memory counterpart and from there both fields are updated every time
 *   we flush that metaslab.
 *
 * - A space map is created every TXG and, during that TXG, it is used to log
 *   all incoming changes (the log space map). When created, the log space map
 *   is referenced in memory by spa_syncing_log_sm and its object ID is inserted
 *   to the space map ZAP mentioned above. The log space map is closed at the
 *   end of the TXG and will be destroyed when it becomes fully obsolete. We
 *   know when a log space map has become obsolete by looking at the oldest
 *   (and smallest) ms_unflushed_txg in the pool. If the value of that is bigger
 *   than the log space map's TXG, then it means that there is no metaslab that
 *   doesn't have the changes from that log and we can therefore destroy it.
 *   [see spa_cleanup_old_sm_logs()].
 *
 * == Important in-memory structures ==
 *
 * - The per-spa field spa_metaslabs_by_flushed sorts all the metaslabs in
 *   the pool by their ms_unflushed_txg field. It is primarily used for three
 *   reasons. First of all, it is used during flushing where we try to flush
 *   metaslabs in-order from the oldest-flushed to the most recently flushed
 *   every TXG. Secondly, it helps us to look up the ms_unflushed_txg of the
 *   oldest flushed metaslab to distinguish which log space maps have become
 *   obsolete and which ones are still relevant. Finally it tells us which
 *   metaslabs have unflushed changes in a pool where this feature was just
 *   enabled, as we don't immediately add all of the pool's metaslabs but we
 *   add them over time as they go through metaslab_sync().
 *   The reason that we do that is to ease these pools into the behavior of
 *   the flushing algorithm (described later on).
 *
 * - The per-spa field spa_sm_logs_by_txg can be thought of as the in-memory
 *   counterpart of the space map ZAP mentioned above. It's an AVL tree whose
 *   nodes represent the log space maps in the pool. This in-memory
 *   representation of log space maps in the pool sorts the log space maps by
 *   the TXG in which they were created (which is also the TXG of their
 *   unflushed changes). It also contains the following extra information for
 *   each space map:
 *   [1] The number of metaslabs that were last flushed on that TXG. This is
 *       important because if that counter is zero and this is the oldest
 *       log then it means that it is also obsolete.
 *   [2] The number of blocks of that space map. This field is used by the
 *       block heuristic of our flushing algorithm (described later on).
 *       It represents how many blocks of metadata changes ZFS had to write
 *       to disk for that TXG.
 *
 * - The per-spa field spa_log_summary is a list of entries that summarizes
 *   the metaslab and block counts of all the nodes of the spa_sm_logs_by_txg
 *   AVL tree mentioned above. The reason this exists is that our flushing
 *   algorithm (described later) tries to estimate how many metaslabs to flush
 *   in each TXG by iterating over all the log space maps and looking at their
 *   block counts.
 *   Summarizing that information means that we don't have to iterate through
 *   each space map, minimizing the runtime overhead of the flushing algorithm
 *   which would be induced in syncing context. In terms of implementation the
 *   log summary is used as a queue:
 *   * we modify or pop entries from its head when we flush metaslabs
 *   * we modify or append entries to its tail when we sync changes.
 *
 * - Each metaslab has two new range trees that hold its unflushed changes,
 *   ms_unflushed_allocs and ms_unflushed_frees. These are always disjoint.
 *
 * == Flushing algorithm ==
 *
 * The decision of how many metaslabs to flush on a given TXG is guided by
 * two heuristics:
 *
 * [1] The memory heuristic -
 * We keep track of the memory used by the unflushed trees from all the
 * metaslabs [see sus_memused of spa_unflushed_stats] and we ensure that it
 * stays below a certain threshold which is determined by an arbitrary hard
 * limit and an arbitrary percentage of the system's memory [see
 * spa_log_exceeds_memlimit()]. When we see that the memory usage of the
 * unflushed changes is passing that threshold, we flush metaslabs, which
 * empties their unflushed range trees, reducing the memory used.
 *
 * [2] The block heuristic -
 * We try to keep the total number of blocks in the log space maps in check
 * so the log doesn't grow indefinitely and we don't induce a lot of overhead
 * when loading the pool.
 * At the same time we don't want to flush a lot of metaslabs too often as
 * this would defeat the purpose of the log space map. As a result we set a
 * limit on the number of blocks that we think it's acceptable for the log
 * space maps to have and try not to cross it.
 * [see sus_blocklimit from spa_unflushed_stats].
 *
 * In order to stay below the block limit every TXG we have to estimate how
 * many metaslabs we need to flush based on the current rate of incoming blocks
 * and our history of log space map blocks. The main idea here is to answer
 * the question of how many metaslabs we need to flush in order to get rid
 * of at least X log space map blocks. We can answer this question
 * by iterating backwards from the oldest log space map to the newest one
 * and looking at their metaslab and block counts. At this point the log summary
 * mentioned above comes in handy as it reduces the number of things that we
 * have to iterate over (even though it may reduce the precision of our
 * estimates due to its aggregation of data). So with that in mind, we project
 * the incoming rate of the current TXG into the future and attempt to
 * approximate how many metaslabs we would need to flush from now in order to
 * avoid exceeding our block limit at different points in the future (granted
 * that we would keep flushing the same number of metaslabs for every TXG).
 * Then we take the maximum number from all these estimates to be on the safe
 * side.
 * For the exact implementation details of the algorithm refer to
 * spa_estimate_metaslabs_to_flush.
 */

/*
 * This is used as the block size for the space maps used for the
 * log space map feature. These space maps benefit from a bigger
 * block size as we expect to be writing a lot of data to them at
 * once.
 */
unsigned long zfs_log_sm_blksz = 1ULL << 17;

/*
 * Percentage of the overall system's memory that ZFS allows to be
 * used for unflushed changes (e.g. the sum of size of all the nodes
 * in the unflushed trees).
 *
 * Note that this value is calculated over 1000000 for finer granularity
 * (thus the _ppm suffix; reads as "parts per million"). As an example,
 * the default of 1000 allows 0.1% of memory to be used.
 */
unsigned long zfs_unflushed_max_mem_ppm = 1000;

/*
 * Specific hard-limit in memory that ZFS allows to be used for
 * unflushed changes.
 */
unsigned long zfs_unflushed_max_mem_amt = 1ULL << 30;

/*
 * The following tunable determines the number of blocks that can be used for
 * the log space maps. It is expressed as a percentage of the total number of
 * metaslabs in the pool (i.e. the default of 400 means that the number of log
 * blocks is capped at 4 times the number of metaslabs).
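 *
 * As an illustrative example (the metaslab count is hypothetical and not
 * taken from a real pool): with the default of 400, a pool with 10000
 * metaslabs gets a calculated limit of 10000 * 400 / 100 = 40000 log blocks.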
 *
 * This value exists to tune our flushing algorithm, with higher values
 * flushing metaslabs less often (doing fewer I/Os) per TXG versus lower values
 * flushing metaslabs more aggressively with the upside of saving overheads
 * when loading the pool. Another factor in this tradeoff is that flushing
 * less often can potentially lead to better utilization of the metaslab space
 * map's block size as we accumulate more changes per flush.
 *
 * This tunable indirectly controls the flush rate (metaslabs flushed per
 * txg), which is why expressing it as a percentage of the number of
 * metaslabs in the pool makes sense here.
 *
 * As a rule of thumb we default this tunable to 400% based on the following:
 *
 * 1] Assuming a constant flush rate and a constant incoming rate of log blocks
 *    it is reasonable to expect that the amount of obsolete entries changes
 *    linearly from txg to txg (e.g. the oldest log should have the most
 *    obsolete entries, and the most recent one the least). With this we could
 *    say that, at any given time, about half of the entries in the whole space
 *    map log are obsolete. Thus for every two entries for a metaslab in the
 *    log space map, only one of them is valid and actually makes it to the
 *    metaslab's space map.
 *    [factor of 2]
 * 2] Each entry in the log space map is guaranteed to be two words while
 *    entries in metaslab space maps are generally single-word.
 *    [an extra factor of 2 - 400% overall]
 * 3] Even if [1] and [2] are slightly less than 2 each, we haven't taken into
 *    account any consolidation of segments from the log space map to the
 *    unflushed range trees nor their history (e.g. a segment being allocated,
 *    then freed, then allocated again means 3 log space map entries but 0
 *    metaslab space map entries). Depending on the workload, we've seen ~1.8
 *    non-obsolete log space map entries per metaslab entry, for a total of
 *    ~600%. Since most of these estimates though are workload dependent, we
 *    default to 400% to be conservative.
 *
 *    Thus we could say that even in the worst
 *    case of [1] and [2], the factor should end up being 4.
 *
 * That said, regardless of the number of metaslabs in the pool we need to
 * provide upper and lower bounds for the log block limit.
 * [see zfs_unflushed_log_block_{min,max}]
 */
unsigned long zfs_unflushed_log_block_pct = 400;

/*
 * If the number of metaslabs is small and our incoming rate is high, we could
 * get into a situation where we are flushing all our metaslabs every TXG. Thus
 * we always allow at least this many log blocks.
 */
unsigned long zfs_unflushed_log_block_min = 1000;

/*
 * If the log becomes too big, the import time of the pool can take a hit in
 * terms of performance.
 * Thus we have a hard limit on the size of the log in terms of blocks.
 */
unsigned long zfs_unflushed_log_block_max = (1ULL << 18);

/*
 * Max # of rows allowed for the log_summary. The tradeoff here is accuracy and
 * stability of the flushing algorithm (longer summary) vs its runtime overhead
 * (smaller summary is faster to traverse).
 */
unsigned long zfs_max_logsm_summary_length = 10;

/*
 * Tunable that sets the lower bound on the metaslabs to flush every TXG.
 *
 * Setting this to 0 has no effect since if the pool is idle we won't even be
 * creating log space maps and therefore we won't be flushing. On the other
 * hand if the pool has any incoming workload our block heuristic will start
 * flushing metaslabs anyway.
 *
 * The point of this tunable is to be used in extreme cases where we really
 * want to flush more metaslabs than our adaptable heuristic plans to flush.
 */
unsigned long zfs_min_metaslabs_to_flush = 1;

/*
 * Tunable that specifies how far into the past we want to look when trying to
 * estimate the incoming log blocks for the current TXG.
 *
 * Setting this too high may not only increase runtime but also minimize the
 * effect of the incoming rates from the most recent TXGs as we take the
 * average over all the blocks that we walk
 * [see spa_estimate_incoming_log_blocks].
 */
unsigned long zfs_max_log_walking = 5;

/*
 * This tunable exists solely for testing purposes. It ensures that the log
 * spacemaps are not flushed and destroyed during export in order for the
 * relevant log spacemap import code paths to be tested (effectively simulating
 * a crash).
 */
int zfs_keep_log_spacemaps_at_export = 0;

static uint64_t
spa_estimate_incoming_log_blocks(spa_t *spa)
{
	ASSERT3U(spa_sync_pass(spa), ==, 1);
	uint64_t steps = 0, sum = 0;
	for (spa_log_sm_t *sls = avl_last(&spa->spa_sm_logs_by_txg);
	    sls != NULL && steps < zfs_max_log_walking;
	    sls = AVL_PREV(&spa->spa_sm_logs_by_txg, sls)) {
		if (sls->sls_txg == spa_syncing_txg(spa)) {
			/*
			 * skip the log created in this TXG as this would
			 * make our estimations inaccurate.
			 */
			continue;
		}
		sum += sls->sls_nblocks;
		steps++;
	}
	return ((steps > 0) ?
	    DIV_ROUND_UP(sum, steps) : 0);
}

uint64_t
spa_log_sm_blocklimit(spa_t *spa)
{
	return (spa->spa_unflushed_stats.sus_blocklimit);
}

void
spa_log_sm_set_blocklimit(spa_t *spa)
{
	if (!spa_feature_is_active(spa, SPA_FEATURE_LOG_SPACEMAP)) {
		ASSERT0(spa_log_sm_blocklimit(spa));
		return;
	}

	uint64_t calculated_limit =
	    (spa_total_metaslabs(spa) * zfs_unflushed_log_block_pct) / 100;
	spa->spa_unflushed_stats.sus_blocklimit = MIN(MAX(calculated_limit,
	    zfs_unflushed_log_block_min), zfs_unflushed_log_block_max);
}

uint64_t
spa_log_sm_nblocks(spa_t *spa)
{
	return (spa->spa_unflushed_stats.sus_nblocks);
}

/*
 * Ensure that the in-memory log space map structures and the summary
 * have the same block and metaslab counts.
 */
static void
spa_log_summary_verify_counts(spa_t *spa)
{
	ASSERT(spa_feature_is_active(spa, SPA_FEATURE_LOG_SPACEMAP));

	if ((zfs_flags & ZFS_DEBUG_LOG_SPACEMAP) == 0)
		return;

	uint64_t ms_in_avl = avl_numnodes(&spa->spa_metaslabs_by_flushed);

	uint64_t ms_in_summary = 0, blk_in_summary = 0;
	for (log_summary_entry_t *e = list_head(&spa->spa_log_summary);
	    e; e = list_next(&spa->spa_log_summary, e)) {
		ms_in_summary += e->lse_mscount;
		blk_in_summary += e->lse_blkcount;
	}

	uint64_t ms_in_logs = 0, blk_in_logs = 0;
	for (spa_log_sm_t *sls = avl_first(&spa->spa_sm_logs_by_txg);
	    sls; sls = AVL_NEXT(&spa->spa_sm_logs_by_txg, sls)) {
		ms_in_logs += sls->sls_mscount;
		blk_in_logs += sls->sls_nblocks;
	}

	VERIFY3U(ms_in_logs, ==, ms_in_summary);
	VERIFY3U(ms_in_logs, ==, ms_in_avl);
	VERIFY3U(blk_in_logs, ==, blk_in_summary);
	VERIFY3U(blk_in_logs, ==, spa_log_sm_nblocks(spa));
}

static boolean_t
summary_entry_is_full(spa_t *spa, log_summary_entry_t *e)
{
	uint64_t blocks_per_row = MAX(1,
	    DIV_ROUND_UP(spa_log_sm_blocklimit(spa),
	    zfs_max_logsm_summary_length));
	return (blocks_per_row <= e->lse_blkcount);
}

/*
 * Update the log summary information to reflect the fact that a metaslab
 * was flushed or destroyed (e.g. due to device removal or pool export/destroy).
 *
 * We typically flush the oldest flushed metaslab so the first (and oldest)
 * entry of the summary is updated. However if that metaslab is getting loaded
 * we may flush the second oldest one which may be part of an entry later in
 * the summary. Moreover, if we call into this function from metaslab_fini()
 * the metaslabs probably won't be ordered by ms_unflushed_txg. Thus we ask
 * for a txg as an argument so we can locate the appropriate summary entry for
 * the metaslab.
 */
void
spa_log_summary_decrement_mscount(spa_t *spa, uint64_t txg)
{
	/*
	 * We don't track summary data for read-only pools and this function
	 * can be called from metaslab_fini(). In that case return immediately.
	 */
	if (!spa_writeable(spa))
		return;

	log_summary_entry_t *target = NULL;
	for (log_summary_entry_t *e = list_head(&spa->spa_log_summary);
	    e != NULL; e = list_next(&spa->spa_log_summary, e)) {
		if (e->lse_start > txg)
			break;
		target = e;
	}

	if (target == NULL || target->lse_mscount == 0) {
		/*
		 * We didn't find a summary entry for this metaslab.
		 * We must be at the teardown of a spa_load() attempt that got
		 * an error while reading the log space maps.
		 */
		VERIFY3S(spa_load_state(spa), ==, SPA_LOAD_ERROR);
		return;
	}

	target->lse_mscount--;
}

/*
 * Update the log summary information to reflect the fact that we destroyed
 * old log space maps. Since we can only destroy the oldest log space maps,
 * we decrement the block count of the oldest summary entry and potentially
 * destroy it when that count hits 0.
 *
 * This function is called after a metaslab is flushed and typically that
 * metaslab is the oldest flushed, which means that this function will
 * typically decrement the block count of the first entry of the summary and
 * potentially free it if the block count gets to zero (its metaslab count
 * should be zero too at that point).
 *
 * There are certain scenarios though that don't work exactly like that so we
 * need to account for them:
 *
 * Scenario [1]: It is possible that after we flushed the oldest flushed
 * metaslab and we destroyed the oldest log space map, more recent logs had 0
 * metaslabs pointing to them so we got rid of them too.
 * This can happen due to metaslabs being destroyed through device removal, or
 * because the oldest flushed metaslab was loading but we kept flushing more
 * recently flushed metaslabs due to the memory pressure of unflushed changes.
 * Because of that, we always iterate from the beginning of the summary and if
 * blocks_gone is bigger than the block_count of the current entry we free that
 * entry (we expect its metaslab count to be zero), we decrement blocks_gone
 * and move on to the next entry, repeating this procedure until blocks_gone
 * gets decremented to 0. Doing this also works for the typical case mentioned
 * above.
 *
 * Scenario [2]: The oldest flushed metaslab isn't necessarily accounted by
 * the first (and oldest) entry in the summary. If the first few entries of
 * the summary were only accounting metaslabs from a device that was just
 * removed, then the current oldest flushed metaslab could be accounted by an
 * entry somewhere in the middle of the summary. Moreover flushing that
 * metaslab will destroy all the log space maps older than its ms_unflushed_txg
 * because they became obsolete after the removal. Thus, iterating as we did
 * for scenario [1] works out for this case too.
 *
 * Scenario [3]: At times we decide to flush all the metaslabs in the pool
 * in one TXG (either because we are exporting the pool or because our flushing
 * heuristics decided to do so).
 * When that happens all the log space maps get destroyed except the one
 * created for the current TXG which doesn't have any log blocks yet. As log
 * space maps get destroyed with every metaslab that we flush, entries in the
 * summary are also destroyed. This brings a weird corner-case when we flush
 * the last metaslab and the log space map of the current TXG is in the same
 * summary entry with other log space maps that are older. When that happens
 * we are eventually left with this one last summary entry whose blocks are
 * gone (blocks_gone equals the entry's block count) but its metaslab count is
 * non-zero (because it accounts all the metaslabs in the pool as they all got
 * flushed). Under this scenario we can't free this last summary entry as it's
 * referencing all the metaslabs in the pool and its block count will get
 * incremented at the end of this sync (when we close the syncing log space
 * map). Thus we just decrement its current block count and leave it alone. In
 * the case that the pool gets exported, its metaslab count will be decremented
 * over time as we call metaslab_fini() for all the metaslabs in the pool and
 * the entry will be freed at spa_unload_log_sm_metadata().
 */
void
spa_log_summary_decrement_blkcount(spa_t *spa, uint64_t blocks_gone)
{
	for (log_summary_entry_t *e = list_head(&spa->spa_log_summary);
	    e != NULL; e = list_head(&spa->spa_log_summary)) {
		if (e->lse_blkcount > blocks_gone) {
			/*
			 * Assert that we stopped at an entry that is not
			 * obsolete.
			 */
			ASSERT(e->lse_mscount != 0);

			e->lse_blkcount -= blocks_gone;
			blocks_gone = 0;
			break;
		} else if (e->lse_mscount == 0) {
			/* remove obsolete entry */
			blocks_gone -= e->lse_blkcount;
			list_remove(&spa->spa_log_summary, e);
			kmem_free(e, sizeof (log_summary_entry_t));
		} else {
			/* Verify that this is scenario [3] mentioned above. */
			VERIFY3U(blocks_gone, ==, e->lse_blkcount);

			/*
			 * Assert that this is scenario [3] further by ensuring
			 * that this is the only entry in the summary.
			 */
			VERIFY3P(e, ==, list_tail(&spa->spa_log_summary));
			ASSERT3P(e, ==, list_head(&spa->spa_log_summary));

			blocks_gone = e->lse_blkcount = 0;
			break;
		}
	}

	/*
	 * Ensure that there is no way we are trying to remove more blocks
	 * than the # of blocks in the summary.
	 */
	ASSERT0(blocks_gone);
}

void
spa_log_sm_decrement_mscount(spa_t *spa, uint64_t txg)
{
	spa_log_sm_t target = { .sls_txg = txg };
	spa_log_sm_t *sls = avl_find(&spa->spa_sm_logs_by_txg,
	    &target, NULL);

	if (sls == NULL) {
		/*
		 * We must be at the teardown of a spa_load() attempt that
		 * got an error while reading the log space maps.
		 */
		VERIFY3S(spa_load_state(spa), ==, SPA_LOAD_ERROR);
		return;
	}

	ASSERT(sls->sls_mscount > 0);
	sls->sls_mscount--;
}

void
spa_log_sm_increment_current_mscount(spa_t *spa)
{
	spa_log_sm_t *last_sls = avl_last(&spa->spa_sm_logs_by_txg);
	ASSERT3U(last_sls->sls_txg, ==, spa_syncing_txg(spa));
	last_sls->sls_mscount++;
}

static void
summary_add_data(spa_t *spa, uint64_t txg, uint64_t metaslabs_flushed,
    uint64_t nblocks)
{
	log_summary_entry_t *e = list_tail(&spa->spa_log_summary);

	if (e == NULL || summary_entry_is_full(spa, e)) {
		e = kmem_zalloc(sizeof (log_summary_entry_t), KM_SLEEP);
		e->lse_start = txg;
		list_insert_tail(&spa->spa_log_summary, e);
	}

	ASSERT3U(e->lse_start, <=,
	    txg);
	e->lse_mscount += metaslabs_flushed;
	e->lse_blkcount += nblocks;
}

static void
spa_log_summary_add_incoming_blocks(spa_t *spa, uint64_t nblocks)
{
	summary_add_data(spa, spa_syncing_txg(spa), 0, nblocks);
}

void
spa_log_summary_add_flushed_metaslab(spa_t *spa)
{
	summary_add_data(spa, spa_syncing_txg(spa), 1, 0);
}

/*
 * This function attempts to estimate how many metaslabs we
 * should flush to satisfy our block heuristic for the log spacemap
 * for the upcoming TXGs.
 *
 * Specifically, it first tries to estimate the number of incoming
 * blocks in this TXG. Then by projecting that incoming rate to
 * future TXGs and using the log summary, it figures out how many
 * flushes we would need to do for future TXGs individually to
 * stay below our block limit and returns the maximum number of
 * flushes from those estimates.
 */
static uint64_t
spa_estimate_metaslabs_to_flush(spa_t *spa)
{
	ASSERT(spa_feature_is_active(spa, SPA_FEATURE_LOG_SPACEMAP));
	ASSERT3U(spa_sync_pass(spa), ==, 1);
	ASSERT(spa_log_sm_blocklimit(spa) != 0);

	/*
	 * This variable contains the incoming rate that will be projected
	 * and used for our flushing estimates in the future.
	 */
	uint64_t incoming = spa_estimate_incoming_log_blocks(spa);

	/*
	 * At any point in time this variable tells us how many
	 * TXGs in the future we are so we can make our estimations.
	 */
	uint64_t txgs_in_future = 1;

	/*
	 * This variable tells us how much room we have until we hit
	 * our limit. When it goes negative, it means that we've exceeded
	 * our limit and we need to flush.
	 *
	 * Note that since we start at the first TXG in the future (i.e.
	 * txgs_in_future starts from 1) we already decrement this
	 * variable by the incoming rate.
	 */
	int64_t available_blocks =
	    spa_log_sm_blocklimit(spa) - spa_log_sm_nblocks(spa) - incoming;

	/*
	 * This variable tells us the total number of flushes needed to
	 * keep the log size within the limit when we reach txgs_in_future.
	 */
	uint64_t total_flushes = 0;

	/* Holds the current maximum of our estimates so far. */
	uint64_t max_flushes_pertxg =
	    MIN(avl_numnodes(&spa->spa_metaslabs_by_flushed),
	    zfs_min_metaslabs_to_flush);

	/*
	 * For our estimations we only look as far in the future
	 * as the summary allows us.
	 */
	for (log_summary_entry_t *e = list_head(&spa->spa_log_summary);
	    e; e = list_next(&spa->spa_log_summary, e)) {

		/*
		 * If there is still room before we exceed our limit
		 * then keep skipping TXGs accumulating more blocks
		 * based on the incoming rate until we exceed it.
		 */
		if (available_blocks >= 0) {
			uint64_t skip_txgs = (available_blocks / incoming) + 1;
			available_blocks -= (skip_txgs * incoming);
			txgs_in_future += skip_txgs;
			ASSERT3S(available_blocks, >=, -incoming);
		}

		/*
		 * At this point we're far enough into the future where
		 * the limit was just exceeded and we flush metaslabs
		 * based on the current entry in the summary, updating
		 * our available_blocks.
		 */
		ASSERT3S(available_blocks, <, 0);
		available_blocks += e->lse_blkcount;
		total_flushes += e->lse_mscount;

		/*
		 * Keep the running maximum of the total_flushes that
		 * we've done so far over the number of TXGs in the
		 * future that we are. The idea here is to estimate
		 * the average number of flushes that we should do
		 * every TXG so that when we are that many TXGs in the
		 * future we stay under the limit.
		 */
		max_flushes_pertxg = MAX(max_flushes_pertxg,
		    DIV_ROUND_UP(total_flushes, txgs_in_future));
		ASSERT3U(avl_numnodes(&spa->spa_metaslabs_by_flushed), >=,
		    max_flushes_pertxg);
	}
	return (max_flushes_pertxg);
}

uint64_t
spa_log_sm_memused(spa_t *spa)
{
	return (spa->spa_unflushed_stats.sus_memused);
}

static boolean_t
spa_log_exceeds_memlimit(spa_t *spa)
{
	if (spa_log_sm_memused(spa) > zfs_unflushed_max_mem_amt)
		return (B_TRUE);

	uint64_t system_mem_allowed = ((physmem * PAGESIZE) *
	    zfs_unflushed_max_mem_ppm) / 1000000;
	if (spa_log_sm_memused(spa) > system_mem_allowed)
		return (B_TRUE);

	return (B_FALSE);
}

boolean_t
spa_flush_all_logs_requested(spa_t *spa)
{
	return (spa->spa_log_flushall_txg != 0);
}

void
spa_flush_metaslabs(spa_t *spa, dmu_tx_t *tx)
{
	uint64_t txg = dmu_tx_get_txg(tx);

	if (spa_sync_pass(spa) != 1)
		return;

	if (!spa_feature_is_active(spa, SPA_FEATURE_LOG_SPACEMAP))
		return;

	/*
	 * If we don't have any metaslabs with
	 * unflushed changes, return immediately.
	 */
	if (avl_numnodes(&spa->spa_metaslabs_by_flushed) == 0)
		return;

	/*
	 * During SPA export we leave a few empty TXGs to go by [see
	 * spa_final_dirty_txg() to understand why]. For this specific
	 * case, it is important to not flush any metaslabs as that
	 * would dirty this TXG.
	 *
	 * That said, during one of these dirty TXGs that is less than or
	 * equal to spa_final_dirty_txg(), spa_unload() will request that
	 * we try to flush all the metaslabs for that TXG before
	 * exporting the pool, thus we ensure that we didn't get a
	 * request to flush everything before we attempt to return
	 * immediately.
	 */
	if (spa->spa_uberblock.ub_rootbp.blk_birth < txg &&
	    !dmu_objset_is_dirty(spa_meta_objset(spa), txg) &&
	    !spa_flush_all_logs_requested(spa))
		return;

	/*
	 * We need to generate a log space map before flushing because this
	 * will set up the in-memory data (i.e. node in spa_sm_logs_by_txg)
	 * for this TXG's flushed metaslab count (aka sls_mscount which is
	 * manipulated in many ways down the metaslab_flush() codepath).
	 *
	 * That is not to say that we may generate a log space map when we
	 * don't need it.
	 * If we are flushing metaslabs, that means that we
	 * were going to write changes to disk anyway, so even if we were
	 * not flushing, a log space map would have been created anyway in
	 * metaslab_sync().
	 */
	spa_generate_syncing_log_sm(spa, tx);

	/*
	 * This variable tells us how many metaslabs we want to flush based
	 * on the block-heuristic of our flushing algorithm (see block comment
	 * of log space map feature). We also decrement this as we flush
	 * metaslabs and attempt to destroy old log space maps.
	 */
	uint64_t want_to_flush;
	if (spa_flush_all_logs_requested(spa)) {
		ASSERT3S(spa_state(spa), ==, POOL_STATE_EXPORTED);
		want_to_flush = avl_numnodes(&spa->spa_metaslabs_by_flushed);
	} else {
		want_to_flush = spa_estimate_metaslabs_to_flush(spa);
	}

	ASSERT3U(avl_numnodes(&spa->spa_metaslabs_by_flushed), >=,
	    want_to_flush);

	/* Used purely for verification purposes */
	uint64_t visited = 0;

	/*
	 * Ideally we would only iterate through spa_metaslabs_by_flushed
	 * using only one variable (curr). We can't do that because
	 * metaslab_flush() mutates the position of curr in the AVL when
	 * it flushes that metaslab by moving it to the end of the tree.
	 * Thus we always keep track of the original next node of the
	 * current node (curr) in another variable (next).
	 */
	metaslab_t *next = NULL;
	for (metaslab_t *curr = avl_first(&spa->spa_metaslabs_by_flushed);
	    curr != NULL; curr = next) {
		next = AVL_NEXT(&spa->spa_metaslabs_by_flushed, curr);

		/*
		 * If this metaslab has been flushed this txg then we've done
		 * a full circle over the metaslabs.
		 */
		if (metaslab_unflushed_txg(curr) == txg)
			break;

		/*
		 * If we are done flushing for the block heuristic and the
		 * unflushed changes don't exceed the memory limit just stop.
		 */
		if (want_to_flush == 0 && !spa_log_exceeds_memlimit(spa))
			break;

		mutex_enter(&curr->ms_sync_lock);
		mutex_enter(&curr->ms_lock);
		boolean_t flushed = metaslab_flush(curr, tx);
		mutex_exit(&curr->ms_lock);
		mutex_exit(&curr->ms_sync_lock);

		/*
		 * If we failed to flush a metaslab (because it was loading),
		 * then we are done with the block heuristic as it's not
		 * possible to destroy any log space maps once you've skipped
		 * a metaslab. In that case we just set our counter to 0 but
		 * we continue looping in case there is still memory pressure
		 * due to unflushed changes. Note that, flushing a metaslab
		 * that is not the oldest flushed in the pool, will never
		 * destroy any log space maps [see spa_cleanup_old_sm_logs()].
		 */
		if (!flushed) {
			want_to_flush = 0;
		} else if (want_to_flush > 0) {
			want_to_flush--;
		}

		visited++;
	}
	ASSERT3U(avl_numnodes(&spa->spa_metaslabs_by_flushed), >=, visited);
}

/*
 * Close the log space map for this TXG and update the block counts
 * for the log's in-memory structure and the summary.
 */
void
spa_sync_close_syncing_log_sm(spa_t *spa)
{
	if (spa_syncing_log_sm(spa) == NULL)
		return;
	ASSERT(spa_feature_is_active(spa, SPA_FEATURE_LOG_SPACEMAP));

	spa_log_sm_t *sls = avl_last(&spa->spa_sm_logs_by_txg);
	ASSERT3U(sls->sls_txg, ==, spa_syncing_txg(spa));

	sls->sls_nblocks = space_map_nblocks(spa_syncing_log_sm(spa));
	spa->spa_unflushed_stats.sus_nblocks += sls->sls_nblocks;

	/*
	 * Note that we can't assert that sls_mscount is not 0,
	 * because there is the case where the first metaslab
	 * in spa_metaslabs_by_flushed is loading and we were
	 * not able to flush any metaslabs in the current TXG.
	 */
	ASSERT(sls->sls_nblocks != 0);

	spa_log_summary_add_incoming_blocks(spa, sls->sls_nblocks);
	spa_log_summary_verify_counts(spa);

	space_map_close(spa->spa_syncing_log_sm);
	spa->spa_syncing_log_sm = NULL;

	/*
	 * At this point we tried to flush as many metaslabs as we
	 * can as the pool is getting exported. Reset the "flush all"
	 * so the last few TXGs before closing the pool can be empty
	 * (i.e. not dirty).
	 */
	if (spa_flush_all_logs_requested(spa)) {
		ASSERT3S(spa_state(spa), ==, POOL_STATE_EXPORTED);
		spa->spa_log_flushall_txg = 0;
	}
}

void
spa_cleanup_old_sm_logs(spa_t *spa, dmu_tx_t *tx)
{
	objset_t *mos = spa_meta_objset(spa);

	uint64_t spacemap_zap;
	int error = zap_lookup(mos, DMU_POOL_DIRECTORY_OBJECT,
	    DMU_POOL_LOG_SPACEMAP_ZAP, sizeof (spacemap_zap), 1, &spacemap_zap);
	if (error == ENOENT) {
		ASSERT(avl_is_empty(&spa->spa_sm_logs_by_txg));
		return;
	}
	VERIFY0(error);

	metaslab_t *oldest = avl_first(&spa->spa_metaslabs_by_flushed);
	uint64_t oldest_flushed_txg = metaslab_unflushed_txg(oldest);

	/*
	 * Free all log space maps older than the oldest_flushed_txg.
	 */
	for (spa_log_sm_t *sls = avl_first(&spa->spa_sm_logs_by_txg);
	    sls && sls->sls_txg < oldest_flushed_txg;
	    sls = avl_first(&spa->spa_sm_logs_by_txg)) {
		ASSERT0(sls->sls_mscount);
		avl_remove(&spa->spa_sm_logs_by_txg, sls);
		space_map_free_obj(mos, sls->sls_sm_obj, tx);
		VERIFY0(zap_remove_int(mos, spacemap_zap, sls->sls_txg, tx));
		spa->spa_unflushed_stats.sus_nblocks -= sls->sls_nblocks;
		kmem_free(sls, sizeof (spa_log_sm_t));
	}
}

static spa_log_sm_t *
spa_log_sm_alloc(uint64_t sm_obj, uint64_t txg)
{
	spa_log_sm_t *sls = kmem_zalloc(sizeof (*sls), KM_SLEEP);
	sls->sls_sm_obj = sm_obj;
	sls->sls_txg = txg;
	return (sls);
}

void
spa_generate_syncing_log_sm(spa_t *spa, dmu_tx_t *tx)
{
	uint64_t txg = dmu_tx_get_txg(tx);
	objset_t *mos = spa_meta_objset(spa);

	if (spa_syncing_log_sm(spa) != NULL)
		return;

	if (!spa_feature_is_enabled(spa, SPA_FEATURE_LOG_SPACEMAP))
		return;

	uint64_t spacemap_zap;
	int error = zap_lookup(mos, DMU_POOL_DIRECTORY_OBJECT,
	    DMU_POOL_LOG_SPACEMAP_ZAP, sizeof (spacemap_zap), 1, &spacemap_zap);
	if (error == ENOENT) {
		ASSERT(avl_is_empty(&spa->spa_sm_logs_by_txg));

		error = 0;
		spacemap_zap = zap_create(mos,
		    DMU_OTN_ZAP_METADATA, DMU_OT_NONE, 0, tx);
		VERIFY0(zap_add(mos, DMU_POOL_DIRECTORY_OBJECT,
		    DMU_POOL_LOG_SPACEMAP_ZAP, sizeof (spacemap_zap), 1,
		    &spacemap_zap, tx));
		spa_feature_incr(spa, SPA_FEATURE_LOG_SPACEMAP, tx);
	}
	VERIFY0(error);

	uint64_t sm_obj;
	ASSERT3U(zap_lookup_int_key(mos, spacemap_zap, txg, &sm_obj),
	    ==, ENOENT);
	sm_obj = space_map_alloc(mos, zfs_log_sm_blksz, tx);
	VERIFY0(zap_add_int_key(mos, spacemap_zap, txg, sm_obj, tx));
	avl_add(&spa->spa_sm_logs_by_txg, spa_log_sm_alloc(sm_obj, txg));

	/*
	 * We pass UINT64_MAX as the space map's representation size
	 * and SPA_MINBLOCKSHIFT as the shift, to make the space map
	 * accept any sorts of segments since there's no real advantage
	 * to being more restrictive (given that we're already going
	 * to be using 2-word entries).
	 */
	VERIFY0(space_map_open(&spa->spa_syncing_log_sm, mos, sm_obj,
	    0, UINT64_MAX, SPA_MINBLOCKSHIFT));

	/*
	 * If the log space map feature was just enabled, the blocklimit
	 * has not yet been set.
	 */
	if (spa_log_sm_blocklimit(spa) == 0)
		spa_log_sm_set_blocklimit(spa);
}

/*
 * Find all the log space maps stored in the space map ZAP and sort
 * them by their TXG in spa_sm_logs_by_txg.
 */
static int
spa_ld_log_sm_metadata(spa_t *spa)
{
	int error;
	uint64_t spacemap_zap;

	ASSERT(avl_is_empty(&spa->spa_sm_logs_by_txg));

	error = zap_lookup(spa_meta_objset(spa), DMU_POOL_DIRECTORY_OBJECT,
	    DMU_POOL_LOG_SPACEMAP_ZAP, sizeof (spacemap_zap), 1, &spacemap_zap);
	if (error == ENOENT) {
		/* the space map ZAP doesn't exist yet */
		return (0);
	} else if (error != 0) {
		spa_load_failed(spa, "spa_ld_log_sm_metadata(): failed at "
		    "zap_lookup(DMU_POOL_DIRECTORY_OBJECT) [error %d]",
		    error);
		return (error);
	}

	zap_cursor_t zc;
	zap_attribute_t za;
	for (zap_cursor_init(&zc, spa_meta_objset(spa), spacemap_zap);
	    (error = zap_cursor_retrieve(&zc, &za)) == 0;
	    zap_cursor_advance(&zc)) {
		uint64_t log_txg = zfs_strtonum(za.za_name, NULL);
		spa_log_sm_t *sls =
		    spa_log_sm_alloc(za.za_first_integer, log_txg);
		avl_add(&spa->spa_sm_logs_by_txg, sls);
	}
	zap_cursor_fini(&zc);
	if (error != ENOENT) {
		spa_load_failed(spa, "spa_ld_log_sm_metadata(): failed at "
		    "zap_cursor_retrieve(spacemap_zap) [error %d]",
		    error);
		return (error);
	}

	for (metaslab_t *m = avl_first(&spa->spa_metaslabs_by_flushed);
	    m; m = AVL_NEXT(&spa->spa_metaslabs_by_flushed, m)) {
		spa_log_sm_t target = { .sls_txg = metaslab_unflushed_txg(m) };
		spa_log_sm_t *sls = avl_find(&spa->spa_sm_logs_by_txg,
		    &target, NULL);

		/*
		 * At this point if sls is NULL it means that a bug occurred
		 * in ZFS the last time the pool was open or earlier in the
		 * import code path. In general, we would have placed a
		 * VERIFY() here or in this case just let the kernel panic
		 * with NULL pointer dereference when incrementing sls_mscount,
		 * but since this is the import code path we can be a bit more
		 * lenient. Thus, for DEBUG bits we always cause a panic, while
		 * in production we log the error and just fail the import.
		 */
		ASSERT(sls != NULL);
		if (sls == NULL) {
			spa_load_failed(spa, "spa_ld_log_sm_metadata(): bug "
			    "encountered: could not find log spacemap for "
			    "TXG %llu [error %d]",
			    (u_longlong_t)metaslab_unflushed_txg(m), ENOENT);
			return (ENOENT);
		}
		sls->sls_mscount++;
	}

	return (0);
}

typedef struct spa_ld_log_sm_arg {
	spa_t *slls_spa;
	uint64_t slls_txg;
} spa_ld_log_sm_arg_t;

static int
spa_ld_log_sm_cb(space_map_entry_t *sme, void *arg)
{
	uint64_t
	    offset = sme->sme_offset;
	uint64_t size = sme->sme_run;
	uint32_t vdev_id = sme->sme_vdev;

	spa_ld_log_sm_arg_t *slls = arg;
	spa_t *spa = slls->slls_spa;

	vdev_t *vd = vdev_lookup_top(spa, vdev_id);

	/*
	 * If the vdev has been removed (i.e. it is indirect or a hole)
	 * skip this entry. The contents of this vdev have already moved
	 * elsewhere.
	 */
	if (!vdev_is_concrete(vd))
		return (0);

	metaslab_t *ms = vd->vdev_ms[offset >> vd->vdev_ms_shift];
	ASSERT(!ms->ms_loaded);

	/*
	 * If we have already flushed entries for this TXG to this
	 * metaslab's space map, then ignore it. Note that we flush
	 * before processing any allocations/frees for that TXG, so
	 * the metaslab's space map only has entries from *before*
	 * the unflushed TXG.
	 */
	if (slls->slls_txg < metaslab_unflushed_txg(ms))
		return (0);

	switch (sme->sme_type) {
	case SM_ALLOC:
		range_tree_remove_xor_add_segment(offset, offset + size,
		    ms->ms_unflushed_frees, ms->ms_unflushed_allocs);
		break;
	case SM_FREE:
		range_tree_remove_xor_add_segment(offset, offset + size,
		    ms->ms_unflushed_allocs, ms->ms_unflushed_frees);
		break;
	default:
		panic("invalid maptype_t");
		break;
	}
	return (0);
}

static int
spa_ld_log_sm_data(spa_t *spa)
{
	int error = 0;

	/*
	 * If we are not going to do any writes there is no need
	 * to read the log space maps.
	 */
	if (!spa_writeable(spa))
		return (0);

	ASSERT0(spa->spa_unflushed_stats.sus_nblocks);
	ASSERT0(spa->spa_unflushed_stats.sus_memused);

	hrtime_t read_logs_starttime = gethrtime();
	/* this is a no-op when we don't have space map logs */
	for (spa_log_sm_t *sls = avl_first(&spa->spa_sm_logs_by_txg);
	    sls; sls = AVL_NEXT(&spa->spa_sm_logs_by_txg, sls)) {
		space_map_t *sm = NULL;
		error = space_map_open(&sm, spa_meta_objset(spa),
		    sls->sls_sm_obj, 0, UINT64_MAX, SPA_MINBLOCKSHIFT);
		if (error != 0) {
			spa_load_failed(spa, "spa_ld_log_sm_data(): failed at "
			    "space_map_open(obj=%llu) [error %d]",
			    (u_longlong_t)sls->sls_sm_obj, error);
			goto out;
		}

		struct spa_ld_log_sm_arg vla = {
			.slls_spa = spa,
			.slls_txg = sls->sls_txg
		};
		error = space_map_iterate(sm, space_map_length(sm),
		    spa_ld_log_sm_cb, &vla);
		if (error != 0) {
			space_map_close(sm);
			spa_load_failed(spa, "spa_ld_log_sm_data(): failed "
			    "at space_map_iterate(obj=%llu) [error %d]",
			    (u_longlong_t)sls->sls_sm_obj, error);
			goto out;
		}

		ASSERT0(sls->sls_nblocks);
		sls->sls_nblocks = space_map_nblocks(sm);
		spa->spa_unflushed_stats.sus_nblocks += sls->sls_nblocks;
		summary_add_data(spa, sls->sls_txg,
		    sls->sls_mscount, sls->sls_nblocks);

		space_map_close(sm);
	}
	hrtime_t read_logs_endtime = gethrtime();
	spa_load_note(spa,
	    "read %llu log space maps (%llu total blocks - blksz = %llu bytes) "
	    "in %lld ms", (u_longlong_t)avl_numnodes(&spa->spa_sm_logs_by_txg),
	    (u_longlong_t)spa_log_sm_nblocks(spa),
	    (u_longlong_t)zfs_log_sm_blksz,
	    (longlong_t)((read_logs_endtime - read_logs_starttime) / 1000000));

out:
	/*
	 * Now that the metaslabs contain their unflushed changes:
	 * [1] recalculate their actual allocated space
	 * [2] recalculate their weights
	 * [3] sum up the memory usage of their unflushed range trees
	 * [4] optionally load them, if debug_load is set
	 *
	 * Note that even in the case where we get here because of an
	 * error (e.g. error != 0), we still want to update the fields
	 * below in order to have a proper teardown in spa_unload().
	 */
	for (metaslab_t *m = avl_first(&spa->spa_metaslabs_by_flushed);
	    m != NULL; m = AVL_NEXT(&spa->spa_metaslabs_by_flushed, m)) {
		mutex_enter(&m->ms_lock);
		m->ms_allocated_space = space_map_allocated(m->ms_sm) +
		    range_tree_space(m->ms_unflushed_allocs) -
		    range_tree_space(m->ms_unflushed_frees);

		vdev_t *vd = m->ms_group->mg_vd;
		metaslab_space_update(vd, m->ms_group->mg_class,
		    range_tree_space(m->ms_unflushed_allocs), 0, 0);
		metaslab_space_update(vd, m->ms_group->mg_class,
		    -range_tree_space(m->ms_unflushed_frees), 0, 0);

		ASSERT0(m->ms_weight & METASLAB_ACTIVE_MASK);
		metaslab_recalculate_weight_and_sort(m);

		spa->spa_unflushed_stats.sus_memused +=
		    metaslab_unflushed_changes_memused(m);

		if (metaslab_debug_load && m->ms_sm != NULL) {
			VERIFY0(metaslab_load(m));
			metaslab_set_selected_txg(m, 0);
		}
		mutex_exit(&m->ms_lock);
	}

	return (error);
}

static int
spa_ld_unflushed_txgs(vdev_t *vd)
{
	spa_t *spa = vd->vdev_spa;
	objset_t *mos = spa_meta_objset(spa);

	if (vd->vdev_top_zap == 0)
		return (0);

	uint64_t object = 0;
	int error = zap_lookup(mos,
	    vd->vdev_top_zap,
	    VDEV_TOP_ZAP_MS_UNFLUSHED_PHYS_TXGS,
	    sizeof (uint64_t), 1, &object);
	if (error == ENOENT)
		return (0);
	else if (error != 0) {
		spa_load_failed(spa, "spa_ld_unflushed_txgs(): failed at "
		    "zap_lookup(vdev_top_zap=%llu) [error %d]",
		    (u_longlong_t)vd->vdev_top_zap, error);
		return (error);
	}

	for (uint64_t m = 0; m < vd->vdev_ms_count; m++) {
		metaslab_t *ms = vd->vdev_ms[m];
		ASSERT(ms != NULL);

		metaslab_unflushed_phys_t entry;
		uint64_t entry_size = sizeof (entry);
		uint64_t entry_offset = ms->ms_id * entry_size;

		error = dmu_read(mos, object,
		    entry_offset, entry_size, &entry, 0);
		if (error != 0) {
			spa_load_failed(spa, "spa_ld_unflushed_txgs(): "
			    "failed at dmu_read(obj=%llu) [error %d]",
			    (u_longlong_t)object, error);
			return (error);
		}

		ms->ms_unflushed_txg = entry.msp_unflushed_txg;
		if (ms->ms_unflushed_txg != 0) {
			mutex_enter(&spa->spa_flushed_ms_lock);
			avl_add(&spa->spa_metaslabs_by_flushed, ms);
			mutex_exit(&spa->spa_flushed_ms_lock);
		}
	}
	return (0);
}

/*
 * Read all the log space map entries into their respective
 * metaslab unflushed trees and keep them sorted by TXG in the
 * SPA's metadata. In addition, set up all the metadata for the
 * memory and the block heuristics.
 */
int
spa_ld_log_spacemaps(spa_t *spa)
{
	int error;

	spa_log_sm_set_blocklimit(spa);

	for (uint64_t c = 0; c < spa->spa_root_vdev->vdev_children; c++) {
		vdev_t *vd = spa->spa_root_vdev->vdev_child[c];
		error = spa_ld_unflushed_txgs(vd);
		if (error != 0)
			return (error);
	}

	error = spa_ld_log_sm_metadata(spa);
	if (error != 0)
		return (error);

	/*
	 * Note: we don't actually expect anything to change at this point
	 * but we grab the config lock so we don't fail any assertions
	 * when using vdev_lookup_top().
	 */
	spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
	error = spa_ld_log_sm_data(spa);
	spa_config_exit(spa, SCL_CONFIG, FTAG);

	return (error);
}

/* BEGIN CSTYLED */
ZFS_MODULE_PARAM(zfs, zfs_, unflushed_max_mem_amt, ULONG, ZMOD_RW,
    "Specific hard-limit in memory that ZFS allows to be used for "
    "unflushed changes");

ZFS_MODULE_PARAM(zfs, zfs_, unflushed_max_mem_ppm, ULONG, ZMOD_RW,
    "Percentage of the overall system memory that ZFS allows to be "
    "used for unflushed changes (value is calculated over 1000000 for "
    "finer granularity)");

ZFS_MODULE_PARAM(zfs, zfs_, unflushed_log_block_max, ULONG, ZMOD_RW,
    "Hard limit (upper-bound) in the size of the space map log "
    "in terms of blocks.");

ZFS_MODULE_PARAM(zfs, zfs_, unflushed_log_block_min, ULONG, ZMOD_RW,
    "Lower-bound limit for the maximum amount of blocks allowed in "
    "log spacemap (see zfs_unflushed_log_block_max)");

ZFS_MODULE_PARAM(zfs, zfs_, unflushed_log_block_pct, ULONG, ZMOD_RW,
    "Tunable used to determine the number of blocks that can be used for "
    "the spacemap log, expressed as a percentage of the total number of "
    "metaslabs in the pool (e.g. 400 means the number of log blocks is "
    "capped at 4 times the number of metaslabs)");

ZFS_MODULE_PARAM(zfs, zfs_, max_log_walking, ULONG, ZMOD_RW,
    "The number of past TXGs that the flushing algorithm of the log "
    "spacemap feature uses to estimate incoming log blocks");

ZFS_MODULE_PARAM(zfs, zfs_, max_logsm_summary_length, ULONG, ZMOD_RW,
    "Maximum number of rows allowed in the summary of the spacemap log");

ZFS_MODULE_PARAM(zfs, zfs_, min_metaslabs_to_flush, ULONG, ZMOD_RW,
    "Minimum number of metaslabs to flush per dirty TXG");

ZFS_MODULE_PARAM(zfs, zfs_, keep_log_spacemaps_at_export, INT, ZMOD_RW,
    "Prevent the log spacemaps from being flushed and destroyed "
    "during pool export/destroy");
/* END CSTYLED */