////
/**
 *
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
////


[[performance]]
= Apache HBase Performance Tuning
:doctype: book
:numbered:
:toc: left
:icons: font
:experimental:

[[perf.os]]
== Operating System

[[perf.os.ram]]
=== Memory

RAM, RAM, RAM.
Don't starve HBase.

[[perf.os.64]]
=== 64-bit

Use a 64-bit platform (and 64-bit JVM).

[[perf.os.swap]]
=== Swapping

Watch out for swapping.
Set `swappiness` to 0.
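
For example, on a typical Linux host (a hedged sketch; paths and tooling vary by distribution), you might set this at runtime and persist it across reboots:

----
# apply immediately (as root)
sysctl vm.swappiness=0

# persist across reboots
echo "vm.swappiness = 0" >> /etc/sysctl.conf
----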

[[perf.os.cpu]]
=== CPU
Make sure you have set up your Hadoop to use native, hardware checksumming.
See the Hadoop native library documentation (`hadoop.native.lib`).

[[perf.network]]
== Network

Perhaps the most important factor in avoiding network issues degrading Hadoop and HBase performance is the switching hardware that is used.
Decisions made early in the scope of the project can cause major problems when you double or triple the size of your cluster (or more).

Important items to consider:

* Switching capacity of the device
* Number of systems connected
* Uplink capacity

[[perf.network.1switch]]
=== Single Switch

The single most important factor in this configuration is that the switching capacity of the hardware is capable of handling the traffic which can be generated by all systems connected to the switch.
Some lower-priced commodity hardware has less switching capacity than a fully populated switch can demand.

[[perf.network.2switch]]
=== Multiple Switches

Multiple switches are a potential pitfall in the architecture.
The most common configuration of lower priced hardware is a simple 1Gbps uplink from one switch to another.
This often overlooked pinch point can easily become a bottleneck for cluster communication.
Especially with MapReduce jobs that are both reading and writing a lot of data, the communication across this uplink can become saturated.

Mitigation of this issue is fairly simple and can be accomplished in multiple ways:

* Use appropriate hardware for the scale of the cluster which you're attempting to build.
* Use larger single-switch configurations, i.e. a single 48-port switch as opposed to two 24-port switches.
* Configure port trunking for uplinks to utilize multiple interfaces and increase cross-switch bandwidth.

[[perf.network.multirack]]
=== Multiple Racks

Multiple rack configurations carry the same potential issues as multiple switches, and can suffer performance degradation from two main areas:

* Poor switch capacity performance
* Insufficient uplink to another rack

If the switches in your rack have appropriate switching capacity to handle all the hosts at full speed, the next most likely issue will be caused by homing more of your cluster across racks.
The easiest way to avoid issues when spanning multiple racks is to use port trunking to create a bonded uplink to other racks.
The downside of this method, however, is the overhead of ports that could otherwise be used for hosts.
For example, creating an 8Gbps port channel from rack A to rack B by using 8 of your 24 ports to communicate between racks gives you a poor ROI; using too few ports, however, can mean you're not getting the most out of your cluster.

Using 10GbE links between racks will greatly increase performance, and assuming your switches support a 10GbE uplink or allow for an expansion card, it will let you save your ports for machines rather than uplinks.

[[perf.network.ints]]
=== Network Interfaces

Are all the network interfaces functioning correctly? Are you sure? See the Troubleshooting Case Study in <<casestudies.slownode>>.

[[perf.network.call_me_maybe]]
=== Network Consistency and Partition Tolerance
The link:http://en.wikipedia.org/wiki/CAP_theorem[CAP Theorem] states that a distributed system can maintain at most two of the following three characteristics:
- *C*onsistency -- all nodes see the same data.
- *A*vailability -- every request receives a response about whether it succeeded or failed.
- *P*artition tolerance -- the system continues to operate even if some of its components become unavailable to the others.

Where a decision has to be made, HBase favors consistency and partition tolerance. Coda Hale explains why partition tolerance is so important in http://codahale.com/you-cant-sacrifice-partition-tolerance/.

Robert Yokota used an automated testing framework called link:https://aphyr.com/tags/jepsen[Jepsen] to test HBase's partition tolerance in the face of network partitions, using techniques modeled after Aphyr's link:https://aphyr.com/posts/281-call-me-maybe-carly-rae-jepsen-and-the-perils-of-network-partitions[Call Me Maybe] series. The results, available as a link:https://rayokota.wordpress.com/2015/09/30/call-me-maybe-hbase/[blog post] and an link:https://rayokota.wordpress.com/2015/09/30/call-me-maybe-hbase-addendum/[addendum], show that HBase performs correctly.

[[jvm]]
== Java

[[gc]]
=== The Garbage Collector and Apache HBase

[[gcpause]]
==== Long GC pauses

In his presentation, link:http://www.slideshare.net/cloudera/hbase-hug-presentation[Avoiding Full GCs with MemStore-Local Allocation Buffers], Todd Lipcon describes two cases of stop-the-world garbage collections common in HBase, especially during loading: CMS failure modes and old generation heap fragmentation.

To address the first, start the CMS earlier than default by adding `-XX:CMSInitiatingOccupancyFraction` and setting it down from defaults.
Start at 60 or 70 percent (the lower you bring down the threshold, the more GCing is done, the more CPU used). To address the second fragmentation issue, Todd added an experimental facility, MSLAB (MemStore-Local Allocation Buffer), that must be explicitly enabled in Apache HBase 0.90.x (it is on by default in Apache HBase 0.92.x and later). Set `hbase.hregion.memstore.mslab.enabled` to true in your `Configuration`.
See the cited slides for background and detail.
The latest JVMs do better with regard to fragmentation, so make sure you are running a recent release.
Read down in the message, link:http://osdir.com/ml/hotspot-gc-use/2011-11/msg00002.html[Identifying concurrent mode failures caused by fragmentation].
Be aware that when enabled, each MemStore instance will occupy at least one MSLAB chunk of memory.
If you have thousands of regions or lots of regions each with many column families, this allocation of MSLAB may be responsible for a good portion of your heap allocation and in an extreme case cause you to OOME.
In this case, disable MSLAB, lower the amount of memory it uses, or host fewer regions per server.
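
As a hedged illustration (the flag values are starting points, not recommendations), the CMS settings might go into _hbase-env.sh_ and the MSLAB switch into _hbase-site.xml_ like so:

----
# hbase-env.sh: start CMS collections at 70% old-gen occupancy
export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly"
----

[source,xml]
----
<!-- hbase-site.xml: enable MSLAB (already the default on 0.92 and later) -->
<property>
  <name>hbase.hregion.memstore.mslab.enabled</name>
  <value>true</value>
</property>
----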

If you have a write-heavy workload, check out link:https://issues.apache.org/jira/browse/HBASE-8163[HBASE-8163 MemStoreChunkPool: An improvement for JAVA GC when using MSLAB].
It describes configurations to lower the amount of young GC during write-heavy loadings.
If you do not have HBASE-8163 installed, and you are trying to improve your young GC times, one trick to consider -- courtesy of our Liang Xie -- is to set the GC config `-XX:PretenureSizeThreshold` in _hbase-env.sh_ to be just smaller than the size of `hbase.hregion.memstore.mslab.chunksize` so MSLAB allocations happen directly in the tenured space rather than first in the young generation.
These MSLAB allocations are likely to make it to the old generation anyway; rather than paying the price of copies between the s0 and s1 survivor spaces followed by the copy up from young to old gen after the MSLABs have achieved sufficient tenure, you save a bit of YGC churn by allocating in the old gen directly.

Another source of long GC pauses can be the JVM's own logging.
See link:https://engineering.linkedin.com/blog/2016/02/eliminating-large-jvm-gc-pauses-caused-by-background-io-traffic[Eliminating Large JVM GC Pauses Caused by Background IO Traffic].

For more information about GC logs, see <<trouble.log.gc>>.

Consider also enabling the off-heap Block Cache.
This has been shown to mitigate GC pause times.
See <<block.cache>>.

[[perf.configurations]]
== HBase Configurations

See <<recommended_configurations>>.

[[perf.compactions.and.splits]]
=== Managing Compactions

For larger systems, managing compactions and splits (see <<compaction>>) may be something you want to consider.

[[perf.handlers]]
=== `hbase.regionserver.handler.count`

See <<hbase.regionserver.handler.count>>.

[[perf.hfile.block.cache.size]]
=== `hfile.block.cache.size`

See <<hfile.block.cache.size>>.
A memory setting for the RegionServer process.

[[blockcache.prefetch]]
=== Prefetch Option for Blockcache

link:https://issues.apache.org/jira/browse/HBASE-9857[HBASE-9857] adds a new option to prefetch HFile contents when opening the BlockCache, if a column family or RegionServer property is set.
This option is available for HBase 0.98.3 and later.
The purpose is to warm the BlockCache as rapidly as possible after the cache is opened, using in-memory table data, and not counting the prefetching as cache misses.
This is great for fast reads, but is not a good idea if the data to be preloaded will not fit into the BlockCache.
It is useful for tuning the IO impact of prefetching versus the time before all data blocks are in cache.

To enable prefetching on a given column family, you can use HBase Shell or use the API.

.Enable Prefetch Using HBase Shell
====
----
hbase> create 'MyTable', { NAME => 'myCF', PREFETCH_BLOCKS_ON_OPEN => 'true' }
----
====

.Enable Prefetch Using the API
====
[source,java]
----

// ...
HTableDescriptor tableDesc = new HTableDescriptor(TableName.valueOf("myTable"));
HColumnDescriptor cfDesc = new HColumnDescriptor("myCF");
cfDesc.setPrefetchBlocksOnOpen(true);
tableDesc.addFamily(cfDesc);
// ...
----
====

See the API documentation for
link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/io/hfile/CacheConfig.html[CacheConfig].

[[perf.rs.memstore.size]]
=== `hbase.regionserver.global.memstore.size`

See <<hbase.regionserver.global.memstore.size>>.
This memory setting is often adjusted for the RegionServer process depending on needs.

[[perf.rs.memstore.size.lower.limit]]
=== `hbase.regionserver.global.memstore.size.lower.limit`

See <<hbase.regionserver.global.memstore.size.lower.limit>>.
This memory setting is often adjusted for the RegionServer process depending on needs.

[[perf.hstore.blockingstorefiles]]
=== `hbase.hstore.blockingStoreFiles`

See <<hbase.hstore.blockingStoreFiles>>.
If there is blocking in the RegionServer logs, increasing this can help.

[[perf.hregion.memstore.block.multiplier]]
=== `hbase.hregion.memstore.block.multiplier`

See <<hbase.hregion.memstore.block.multiplier>>.
If there is enough RAM, increasing this can help.

[[hbase.regionserver.checksum.verify.performance]]
=== `hbase.regionserver.checksum.verify`

Have HBase write the checksum into the data block and avoid having to do a separate checksum seek whenever you read.

See <<hbase.regionserver.checksum.verify>>, <<hbase.hstore.bytes.per.checksum>> and <<hbase.hstore.checksum.algorithm>>. For more information see the release note on link:https://issues.apache.org/jira/browse/HBASE-5074[HBASE-5074 support checksums in HBase block cache].

=== Tuning `callQueue` Options

link:https://issues.apache.org/jira/browse/HBASE-11355[HBASE-11355] introduces several callQueue tuning mechanisms which can increase performance.
See the JIRA for some benchmarking information.

To increase the number of callqueues, set `hbase.ipc.server.num.callqueue` to a value greater than `1`.
To split the callqueue into separate read and write queues, set `hbase.ipc.server.callqueue.read.ratio` to a value between `0` and `1`.
This factor weights the queues toward writes (if below .5) or reads (if above .5). Another way to say this is that the factor determines what percentage of the split queues are used for reads.
The following examples illustrate some of the possibilities, and a configuration sketch follows the list.
Note that you always have at least one write queue, no matter what setting you use.

* The default value of `0` does not split the queue.
* A value of `.3` uses 30% of the queues for reading and 70% for writing.
  Given a value of `10` for `hbase.ipc.server.num.callqueue`, 3 queues would be used for reads and 7 for writes.
* A value of `.5` uses the same number of read queues and write queues.
  Given a value of `10` for `hbase.ipc.server.num.callqueue`, 5 queues would be used for reads and 5 for writes.
* A value of `.6` uses 60% of the queues for reading and 40% for writing.
  Given a value of `10` for `hbase.ipc.server.num.callqueue`, 7 queues would be used for reads and 3 for writes.
* A value of `1.0` uses one queue to process write requests, and all other queues process read requests.
  A value higher than `1.0` has the same effect as a value of `1.0`.
  Given a value of `10` for `hbase.ipc.server.num.callqueue`, 9 queues would be used for reads and 1 for writes.
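
For instance, a minimal _hbase-site.xml_ sketch that splits ten call queues roughly 70/30 toward reads might look like the following (the values are illustrative, not recommendations):

[source,xml]
----
<property>
  <name>hbase.ipc.server.num.callqueue</name>
  <value>10</value>
</property>
<property>
  <!-- above .5 weights the split toward read queues; see the list above -->
  <name>hbase.ipc.server.callqueue.read.ratio</name>
  <value>0.7</value>
</property>
----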

You can also split the read queues so that separate queues are used for short reads (from Get operations) and long reads (from Scan operations), by setting the `hbase.ipc.server.callqueue.scan.ratio` option.
This option is a factor between 0 and 1, which determines the ratio of read queues used for Gets and Scans.
More queues are used for Gets if the value is below `.5` and more are used for scans if the value is above `.5`.
No matter what setting you use, at least one read queue is used for Get operations.

* A value of `0` does not split the read queue.
* A value of `.3` uses 70% of the read queues for Gets and 30% for Scans.
  Given a value of `20` for `hbase.ipc.server.num.callqueue` and a value of `.5` for `hbase.ipc.server.callqueue.read.ratio`, 10 queues would be used for reads, out of those 10, 7 would be used for Gets and 3 for Scans.
* A value of `.5` uses half the read queues for Gets and half for Scans.
  Given a value of `20` for `hbase.ipc.server.num.callqueue` and a value of `.5` for `hbase.ipc.server.callqueue.read.ratio`, 10 queues would be used for reads, out of those 10, 5 would be used for Gets and 5 for Scans.
* A value of `.6` uses 30% of the read queues for Gets and 70% for Scans.
  Given a value of `20` for `hbase.ipc.server.num.callqueue` and a value of `.5` for `hbase.ipc.server.callqueue.read.ratio`, 10 queues would be used for reads, out of those 10, 3 would be used for Gets and 7 for Scans.
* A value of `1.0` uses all but one of the read queues for Scans.
  Given a value of `20` for `hbase.ipc.server.num.callqueue` and a value of `.5` for `hbase.ipc.server.callqueue.read.ratio`, 10 queues would be used for reads, out of those 10, 1 would be used for Gets and 9 for Scans.
You can use the new option `hbase.ipc.server.callqueue.handler.factor` to programmatically tune the number of queues:

* A value of `0` uses a single shared queue between all the handlers.
* A value of `1` uses a separate queue for each handler.
* A value between `0` and `1` tunes the number of queues against the number of handlers.
  For instance, a value of `.5` shares one queue between each two handlers.
+
Having more queues, such as in a situation where you have one queue per handler, reduces contention when adding a task to a queue or selecting it from a queue.
The trade-off is that if you have some queues with long-running tasks, a handler may end up waiting to execute from that queue rather than processing another queue which has waiting tasks.


For these values to take effect on a given RegionServer, the RegionServer must be restarted.
These parameters are intended for testing purposes and should be used carefully.

[[perf.zookeeper]]
== ZooKeeper

See <<zookeeper>> for information on configuring ZooKeeper, and see the part about having a dedicated disk.

[[perf.schema]]
== Schema Design

[[perf.number.of.cfs]]
=== Number of Column Families

See <<number.of.cfs>>.

[[perf.schema.keys]]
=== Key and Attribute Lengths

See <<keysize>>.
See also <<perf.compression.however>> for compression caveats.

[[schema.regionsize]]
=== Table RegionSize

The regionsize can be set on a per-table basis via `setMaxFileSize` on link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HTableDescriptor.html[HTableDescriptor] in the event where certain tables require different regionsizes than the configured default regionsize.
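
For example, a hedged sketch using the Java API (the table and family names, the `admin` handle, and the 20 GB value are placeholders):

[source,java]
----
HTableDescriptor tableDesc = new HTableDescriptor(TableName.valueOf("myTable"));
tableDesc.addFamily(new HColumnDescriptor("myCF"));
// override the cluster-wide default maximum region size for this table only
tableDesc.setMaxFileSize(20L * 1024 * 1024 * 1024);  // 20 GB before a region splits
admin.createTable(tableDesc);
----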

See <<ops.capacity.regions>> for more information.

[[schema.bloom]]
=== Bloom Filters

A Bloom filter, named for its creator, Burton Howard Bloom, is a data structure which is designed to predict whether a given element is a member of a set of data.
A positive result from a Bloom filter is not always accurate, but a negative result is guaranteed to be accurate.
Bloom filters are designed to be "accurate enough" for sets of data which are so large that conventional hashing mechanisms would be impractical.
For more information about Bloom filters in general, refer to http://en.wikipedia.org/wiki/Bloom_filter.

In terms of HBase, Bloom filters provide a lightweight in-memory structure to reduce the number of disk reads for a given Get operation (Bloom filters do not work with Scans) to only the StoreFiles likely to contain the desired Row.
The potential performance gain increases with the number of parallel reads.

The Bloom filters themselves are stored in the metadata of each HFile and never need to be updated.
When an HFile is opened because a region is deployed to a RegionServer, the Bloom filter is loaded into memory.

HBase includes some tuning mechanisms for folding the Bloom filter to reduce the size and keep the false positive rate within a desired range.

Bloom filters were introduced in link:https://issues.apache.org/jira/browse/HBASE-1200[HBASE-1200].
Since HBase 0.96, row-based Bloom filters are enabled by default.
(link:https://issues.apache.org/jira/browse/HBASE-8450[HBASE-8450])

For more information on Bloom filters in relation to HBase, see <<blooms>> or the following Quora discussion: link:http://www.quora.com/How-are-bloom-filters-used-in-HBase[How are bloom filters used in HBase?].

[[bloom.filters.when]]
==== When To Use Bloom Filters

Since HBase 0.96, row-based Bloom filters are enabled by default.
You may choose to disable them or to change some tables to use row+column Bloom filters, depending on the characteristics of your data and how it is loaded into HBase.

To determine whether Bloom filters could have a positive impact, check the value of `blockCacheHitRatio` in the RegionServer metrics.
If Bloom filters are enabled, the value of `blockCacheHitRatio` should increase, because the Bloom filter is filtering out blocks that are definitely not needed.

You can choose to enable Bloom filters for a row or for a row+column combination.
If you generally scan entire rows, the row+column combination will not provide any benefit.
A row-based Bloom filter can operate on a row+column Get, but not the other way around.
However, if you have a large number of column-level Puts, such that a row may be present in every StoreFile, a row-based filter will always return a positive result and provide no benefit.
Unless you have one column per row, row+column Bloom filters require more space, in order to store more keys.
Bloom filters work best when the size of each data entry is at least a few kilobytes in size.

Overhead will be reduced when your data is stored in a few larger StoreFiles, to avoid extra disk IO during low-level scans to find a specific row.

Bloom filters need to be rebuilt upon deletion, so may not be appropriate in environments with a large number of deletions.

==== Enabling Bloom Filters

Bloom filters are enabled on a Column Family.
You can do this by using the HBase Shell or by calling the `setBloomFilterType` method of HColumnDescriptor in the API.
Valid values are `NONE`, `ROW` (the default since HBase 0.96), or `ROWCOL`.
See <<bloom.filters.when>> for more information on `ROW` versus `ROWCOL`.
See also the API documentation for link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html[HColumnDescriptor].

The following example creates a table and enables a ROWCOL Bloom filter on the `colfam1` column family.

----
hbase> create 'mytable',{NAME => 'colfam1', BLOOMFILTER => 'ROWCOL'}
----
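
A hedged equivalent using the Java API (names are placeholders; `BloomType` lives in `org.apache.hadoop.hbase.regionserver`):

[source,java]
----
HTableDescriptor tableDesc = new HTableDescriptor(TableName.valueOf("mytable"));
HColumnDescriptor cfDesc = new HColumnDescriptor("colfam1");
cfDesc.setBloomFilterType(BloomType.ROWCOL);  // NONE, ROW, or ROWCOL
tableDesc.addFamily(cfDesc);
admin.createTable(tableDesc);
----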

==== Configuring Server-Wide Behavior of Bloom Filters

You can configure the following settings in the _hbase-site.xml_.

[cols="1,1,1", options="header"]
|===
| Parameter
| Default
| Description

| io.hfile.bloom.enabled
| yes
| Set to no to kill bloom filters server-wide if something goes wrong

| io.hfile.bloom.error.rate
| .01
| The average false positive rate for bloom filters. Folding is used to
                  maintain the false positive rate. Expressed as a decimal representation of a
                  percentage.

| io.hfile.bloom.max.fold
| 7
| The guaranteed maximum fold rate. Changing this setting should not be
                  necessary and is not recommended.

| io.storefile.bloom.max.keys
| 128000000
| For default (single-block) Bloom filters, this specifies the maximum number of keys.

| io.storefile.delete.family.bloom.enabled
| true
| Master switch to enable Delete Family Bloom filters and store them in the StoreFile.

| io.storefile.bloom.block.size
| 65536
| Target Bloom block size. Bloom filter blocks of approximately this size
                  are interleaved with data blocks.

| hfile.block.bloom.cacheonwrite
| false
| Enables cache-on-write for inline blocks of a compound Bloom filter.
|===

[[schema.cf.blocksize]]
=== ColumnFamily BlockSize

The blocksize can be configured for each ColumnFamily in a table, and defaults to 64k.
Larger cell values require larger blocksizes.
There is an inverse relationship between blocksize and the resulting StoreFile indexes (i.e., if the blocksize is doubled then the resulting indexes should be roughly halved).

See link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html[HColumnDescriptor] and <<store>> for more information.

[[cf.in.memory]]
=== In-Memory ColumnFamilies

ColumnFamilies can optionally be defined as in-memory.
Data is still persisted to disk, just like any other ColumnFamily.
In-memory blocks have the highest priority in the <<block.cache>>, but it is not a guarantee that the entire table will be in memory.
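
A hedged sketch of flagging a family as in-memory via the Java API (the family name is a placeholder):

[source,java]
----
HColumnDescriptor cfDesc = new HColumnDescriptor("myCF");
cfDesc.setInMemory(true);  // blocks from this family get in-memory priority in the block cache
----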

See link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html[HColumnDescriptor] for more information.

[[perf.compression]]
=== Compression

Production systems should use compression with their ColumnFamily definitions.
See <<compression>> for more information.
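
For instance, a hedged sketch enabling compression on a family (SNAPPY is just an example; the codec must be installed and tested on your cluster):

[source,java]
----
HColumnDescriptor cfDesc = new HColumnDescriptor("myCF");
cfDesc.setCompressionType(Compression.Algorithm.SNAPPY);  // requires the native Snappy libraries
----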

[[perf.compression.however]]
==== However...

Compression deflates data _on disk_.
When it's in-memory (e.g., in the MemStore) or on the wire (e.g., transferring between RegionServer and Client) it's inflated.
So while using ColumnFamily compression is a best practice, it's not going to completely eliminate the impact of over-sized keys, over-sized ColumnFamily names, or over-sized column names.

See <<keysize>> for schema design tips, and <<keyvalue>> for more information on how HBase stores data internally.

[[perf.general]]
== HBase General Patterns

[[perf.general.constants]]
=== Constants

When people get started with HBase they have a tendency to write code that looks like this:

[source,java]
----
Get get = new Get(rowkey);
Result r = table.get(get);
byte[] b = r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr"));  // returns current version of value
----

But especially when inside loops (and MapReduce jobs), converting the columnFamily and column-names to byte-arrays repeatedly is surprisingly expensive.
It's better to use constants for the byte-arrays, like this:

[source,java]
----
public static final byte[] CF = Bytes.toBytes("cf");
public static final byte[] ATTR = Bytes.toBytes("attr");
...
Get get = new Get(rowkey);
Result r = table.get(get);
byte[] b = r.getValue(CF, ATTR);  // returns current version of value
----

[[perf.writing]]
== Writing to HBase

[[perf.batch.loading]]
=== Batch Loading

Use the bulk load tool if you can.
See <<arch.bulk.load>>.
Otherwise, pay attention to the below.

[[precreate.regions]]
===  Table Creation: Pre-Creating Regions

Tables in HBase are initially created with one region by default.
For bulk imports, this means that all clients will write to the same region until it is large enough to split and become distributed across the cluster.
A useful pattern to speed up the bulk import process is to pre-create empty regions.
Be somewhat conservative in this, because too many regions can actually degrade performance.

There are two different approaches to pre-creating splits.
The first approach is to rely on the default `Admin` strategy (which is implemented in `Bytes.split`)...

[source,java]
----

byte[] startKey = ...;      // your lowest key
byte[] endKey = ...;        // your highest key
int numberOfRegions = ...;  // # of regions to create
admin.createTable(table, startKey, endKey, numberOfRegions);
----

And the other approach is to define the splits yourself...

[source,java]
----
byte[][] splits = ...;   // create your own splits
admin.createTable(table, splits);
----

See <<rowkey.regionsplits>> for issues related to understanding your keyspace and pre-creating regions.
See <<manual_region_splitting_decisions,manual region splitting decisions>> for discussion on manually pre-splitting regions.

[[def.log.flush]]
===  Table Creation: Deferred Log Flush

The default behavior for Puts using the Write Ahead Log (WAL) is that `WAL` edits will be written immediately.
If deferred log flush is used, WAL edits are kept in memory until the flush period.
The benefit is aggregated and asynchronous `WAL` writes, but the potential downside is that if the RegionServer goes down, the yet-to-be-flushed edits are lost.
This is safer, however, than not using WAL at all with Puts.

Deferred log flush can be configured on tables via link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HTableDescriptor.html[HTableDescriptor].
The default value of `hbase.regionserver.optionallogflushinterval` is 1000ms.
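
In clients that expose the `Durability` enum, this can be expressed per table; a hedged sketch (the table name is a placeholder):

[source,java]
----
HTableDescriptor tableDesc = new HTableDescriptor(TableName.valueOf("myTable"));
// buffer WAL edits and sync them asynchronously instead of on every Put
tableDesc.setDurability(Durability.ASYNC_WAL);
----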

[[perf.hbase.client.autoflush]]
=== HBase Client: AutoFlush

When performing a lot of Puts, make sure that setAutoFlush is set to false on your link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html[Table] instance.
Otherwise, the Puts will be sent one at a time to the RegionServer.
Puts added via `table.put(Put)` and `table.put(List<Put>)` wind up in the same write buffer.
If `autoFlush = false`, these messages are not sent until the write-buffer is filled.
To explicitly flush the messages, call `flushCommits`.
Calling `close` on the `Table` instance will invoke `flushCommits`.
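
A hedged sketch, assuming the older `HTable` client where `setAutoFlush` and `flushCommits` live (`conf` and `puts` are placeholders):

[source,java]
----
HTable table = new HTable(conf, "myTable");
table.setAutoFlush(false);     // buffer Puts client-side
for (Put put : puts) {
  table.put(put);              // accumulates in the write buffer
}
table.flushCommits();          // explicit flush; close() also flushes
table.close();
----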

[[perf.hbase.client.putwal]]
=== HBase Client: Turn off WAL on Puts

A frequent request is to disable the WAL to increase performance of Puts.
This is only appropriate for bulk loads, as it puts your data at risk by removing the protection of the WAL in the event of a region server crash.
Bulk loads can be re-run in the event of a crash, with little risk of data loss.

WARNING: If you disable the WAL for anything other than bulk loads, your data is at risk.

In general, it is best to use WAL for Puts, and where loading throughput is a concern, to use bulk loading techniques instead.
For normal Puts, you are not likely to see a performance improvement which would outweigh the risk.
To disable the WAL, see <<wal.disable>>.
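
If, after weighing the risk, you still want to skip the WAL for specific Puts, a hedged sketch using the `Durability` enum (`rowkey`, `CF`, `ATTR`, `value` and `table` are placeholders):

[source,java]
----
Put put = new Put(rowkey);
put.addColumn(CF, ATTR, value);
put.setDurability(Durability.SKIP_WAL);  // this edit is NOT protected by the WAL
table.put(put);
----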

[[perf.hbase.client.regiongroup]]
=== HBase Client: Group Puts by RegionServer

In addition to using the writeBuffer, grouping `Put`s by RegionServer can reduce the number of client RPC calls per writeBuffer flush.
A utility `HTableUtil` that does this is currently in the MASTER branch; you can either copy it or implement your own version if you are still on 0.90.x or earlier.

[[perf.hbase.write.mr.reducer]]
=== MapReduce: Skip The Reducer

When writing a lot of data to an HBase table from a MR job (e.g., with link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html[TableOutputFormat]), and specifically where Puts are being emitted from the Mapper, skip the Reducer step.
When a Reducer step is used, all of the output (Puts) from the Mapper will get spooled to disk, then sorted/shuffled to other Reducers that will most likely be off-node.
It's far more efficient to just write directly to HBase.

For summary jobs where HBase is used as a source and a sink, writes will come from the Reducer step (e.g., summarize values then write out the result). This is a different processing problem from the above case.

[[perf.one.region]]
=== Anti-Pattern: One Hot Region

If all your data is being written to one region at a time, then re-read the section on processing timeseries data.

Also, if you are pre-splitting regions and all your data is _still_ winding up in a single region even though your keys aren't monotonically increasing, confirm that your keyspace actually works with the split strategy.
There are a variety of reasons that regions may appear "well split" but won't work with your data.
As the HBase client communicates directly with the RegionServers, the region a particular row key maps to can be obtained via link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html#getRegionLocation(byte%5B%5D)[Table.getRegionLocation].

See <<precreate.regions>>, as well as <<perf.configurations>>.

[[perf.reading]]
== Reading from HBase

The mailing list can help if you are having performance issues.
For example, here is a good general thread on what to look at addressing read-time issues: link:http://search-hadoop.com/m/qOo2yyHtCC1[HBase Random Read latency > 100ms]

[[perf.hbase.client.caching]]
=== Scan Caching

If HBase is used as an input source for a MapReduce job, for example, make sure that the input link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html[Scan] instance to the MapReduce job has `setCaching` set to something greater than the default (which is 1). Using the default value means that the map-task will make a call back to the region-server for every record processed.
Setting this value to 500, for example, will transfer 500 rows at a time to the client to be processed.
There is a cost/benefit to having the cache value be large, because it costs more memory for both client and RegionServer, so bigger isn't always better.
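
For example, a hedged sketch:

[source,java]
----
Scan scan = new Scan();
scan.setCaching(500);  // rows transferred per RPC to the scanner; the default is far lower
----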

[[perf.hbase.client.caching.mr]]
==== Scan Caching in MapReduce Jobs

Scan settings in MapReduce jobs deserve special attention.
Timeouts can result (e.g., UnknownScannerException) in Map tasks if it takes longer to process a batch of records than the scanner timeout allows before the client goes back to the RegionServer for the next set of data.
This problem can occur because there is non-trivial processing occurring per row.
If you process rows quickly, set caching higher.
If you process rows more slowly (e.g., lots of transformations per row, writes), then set caching lower.

Timeouts can also happen in a non-MapReduce use case (i.e., single threaded HBase client doing a Scan), but the processing that is often performed in MapReduce jobs tends to exacerbate this issue.

[[perf.hbase.client.selection]]
=== Scan Attribute Selection

Whenever a Scan is used to process large numbers of rows (and especially when used as a MapReduce source), be aware of which attributes are selected.
If `scan.addFamily` is called then _all_ of the attributes in the specified ColumnFamily will be returned to the client.
If only a small number of the available attributes are to be processed, then only those attributes should be specified in the input scan because attribute over-selection is a non-trivial performance penalty over large datasets.
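
For example, a hedged sketch selecting a single column rather than the whole family (`CF` and `ATTR` constants as in <<perf.general.constants>>):

[source,java]
----
Scan scan = new Scan();
// scan.addFamily(CF);       // would return every column in the family -- usually too much
scan.addColumn(CF, ATTR);    // returns only the attribute actually processed
----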

[[perf.hbase.client.seek]]
=== Avoid scan seeks

When columns are selected explicitly with `scan.addColumn`, HBase will schedule seek operations to seek between the selected columns.
When rows have few columns and each column has only a few versions this can be inefficient.
A seek operation is generally slower if it does not seek at least past 5-10 columns/versions or 512-1024 bytes.

In order to opportunistically look ahead a few columns/versions to see if the next column/version can be found that way before a seek operation is scheduled, a new attribute `Scan.HINT_LOOKAHEAD` can be set on the Scan object.
The following code instructs the RegionServer to attempt two iterations of next before a seek is scheduled:

[source,java]
----
Scan scan = new Scan();
scan.addColumn(...);
scan.setAttribute(Scan.HINT_LOOKAHEAD, Bytes.toBytes(2));
table.getScanner(scan);
----

[[perf.hbase.mr.input]]
=== MapReduce - Input Splits

For MapReduce jobs that use HBase tables as a source, if there is a pattern where the "slow" map tasks seem to have the same Input Split (i.e., the RegionServer serving the data), see the Troubleshooting Case Study in <<casestudies.slownode>>.

[[perf.hbase.client.scannerclose]]
=== Close ResultScanners

This isn't so much about improving performance but rather _avoiding_ performance problems.
If you forget to close link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/ResultScanner.html[ResultScanners] you can cause problems on the RegionServers.
Always have ResultScanner processing enclosed in try/finally blocks.

[source,java]
----
Scan scan = new Scan();
// set attrs...
ResultScanner rs = table.getScanner(scan);
try {
  for (Result r = rs.next(); r != null; r = rs.next()) {
    // process result...
  }
} finally {
  rs.close();  // always close the ResultScanner!
}
table.close();
----

[[perf.hbase.client.blockcache]]
=== Block Cache

link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html[Scan] instances can be set to use the block cache in the RegionServer via the `setCacheBlocks` method.
For input Scans to MapReduce jobs, this should be `false`.
For frequently accessed rows, it is advisable to use the block cache.
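
A hedged sketch for the MapReduce-input case:

[source,java]
----
Scan scan = new Scan();
scan.setCacheBlocks(false);  // don't churn the RegionServer block cache with a one-off full scan
----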

Cache more data by moving your Block Cache off-heap.
See <<offheap.blockcache>>.

[[perf.hbase.client.rowkeyonly]]
=== Optimal Loading of Row Keys

When performing a table link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html[scan] where only the row keys are needed (no families, qualifiers, values or timestamps), add a FilterList with a `MUST_PASS_ALL` operator to the scanner using `setFilter`.
The filter list should include both a link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FirstKeyOnlyFilter.html[FirstKeyOnlyFilter] and a link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/KeyOnlyFilter.html[KeyOnlyFilter].
Using this filter combination will result in a worst case scenario of a RegionServer reading a single value from disk and minimal network traffic to the client for a single row.
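
A hedged sketch of that filter combination:

[source,java]
----
Scan scan = new Scan();
FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
filters.addFilter(new FirstKeyOnlyFilter());  // stop after the first cell of each row
filters.addFilter(new KeyOnlyFilter());       // drop the values, keep only the keys
scan.setFilter(filters);
----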

[[perf.hbase.read.dist]]
=== Concurrency: Monitor Data Spread

When performing a high number of concurrent reads, monitor the data spread of the target tables.
If the target table(s) have too few regions then the reads could likely be served from too few nodes.

See <<precreate.regions>>, as well as <<perf.configurations>>.

[[blooms]]
=== Bloom Filters

Enabling Bloom Filters can save you having to go to disk and can help improve read latencies.

link:http://en.wikipedia.org/wiki/Bloom_filter[Bloom filters] were developed in link:https://issues.apache.org/jira/browse/HBASE-1200[HBASE-1200 Add bloomfilters].
For a description of the development process -- why static blooms rather than dynamic -- and for an overview of the unique properties that pertain to blooms in HBase, as well as possible future directions, see the _Development Process_ section of the document link:https://issues.apache.org/jira/secure/attachment/12444007/Bloom_Filters_in_HBase.pdf[BloomFilters in HBase] attached to link:https://issues.apache.org/jira/browse/HBASE-1200[HBASE-1200].
The bloom filters described here are actually version two of blooms in HBase.
In versions up to 0.19.x, HBase had a dynamic bloom option based on work done by the link:http://www.one-lab.org/[European Commission One-Lab Project 034819].
The core of the HBase bloom work was later pulled up into Hadoop to implement org.apache.hadoop.io.BloomMapFile.
Version 1 of HBase blooms never worked that well.
Version 2 is a rewrite from scratch, though again it starts with the one-lab work.

See also <<schema.bloom>>.

[[bloom_footprint]]
==== Bloom StoreFile footprint

Bloom filters add an entry to the `StoreFile` general `FileInfo` data structure and then two extra entries to the `StoreFile` metadata section.

===== BloomFilter in the `StoreFile` `FileInfo` data structure

`FileInfo` has a `BLOOM_FILTER_TYPE` entry which is set to `NONE`, `ROW` or `ROWCOL`.

===== BloomFilter entries in `StoreFile` metadata

`BLOOM_FILTER_META` holds Bloom Size, Hash Function used, etc.
It's small in size and is cached on `StoreFile.Reader` load.

`BLOOM_FILTER_DATA` is the actual bloomfilter data.
Obtained on-demand.
Stored in the LRU cache, if it is enabled (it is enabled by default).

[[config.bloom]]
==== Bloom Filter Configuration

===== `io.hfile.bloom.enabled` global kill switch

`io.hfile.bloom.enabled` in `Configuration` serves as the kill switch in case something goes wrong.
Default = `true`.

===== `io.hfile.bloom.error.rate`

`io.hfile.bloom.error.rate` = average false positive rate.
Default = 1%. Decreasing the rate by ½ (e.g. to .5%) costs about one additional bit per bloom entry.

===== `io.hfile.bloom.max.fold`

`io.hfile.bloom.max.fold` = guaranteed minimum fold rate.
Most people should leave this alone.
Default = 7, or can collapse to at least 1/128th of original size.
See the _Development Process_ section of the document link:https://issues.apache.org/jira/secure/attachment/12444007/Bloom_Filters_in_HBase.pdf[BloomFilters in HBase] for more on what this option means.

=== Hedged Reads

Hedged reads are a feature of HDFS, introduced in link:https://issues.apache.org/jira/browse/HDFS-5776[HDFS-5776].
Normally, a single thread is spawned for each read request.
However, if hedged reads are enabled, the client waits some configurable amount of time, and if the read does not return, the client spawns a second read request, against a different block replica of the same data.
Whichever read returns first is used, and the other read request is discarded.
Hedged reads can be helpful for times where a rare slow read is caused by a transient error such as a failing disk or flaky network connection.

Because an HBase RegionServer is an HDFS client, you can enable hedged reads in HBase, by adding the following properties to the RegionServer's hbase-site.xml and tuning the values to suit your environment.

.Configuration for Hedged Reads
* `dfs.client.hedged.read.threadpool.size` - the number of threads dedicated to servicing hedged reads.
  If this is set to 0 (the default), hedged reads are disabled.
* `dfs.client.hedged.read.threshold.millis` - the number of milliseconds to wait before spawning a second read thread.

.Hedged Reads Configuration Example
====
[source,xml]
----
<property>
  <name>dfs.client.hedged.read.threadpool.size</name>
  <value>20</value>  <!-- 20 threads -->
</property>
<property>
  <name>dfs.client.hedged.read.threshold.millis</name>
  <value>10</value>  <!-- 10 milliseconds -->
</property>
----
====

Use the following metrics to tune the settings for hedged reads on your cluster.
See <<hbase_metrics>> for more information.

.Metrics for Hedged Reads
* hedgedReadOps - the number of times hedged read threads have been triggered.
  This could indicate that read requests are often slow, or that hedged reads are triggered too quickly.
* hedgeReadOpsWin - the number of times the hedged read thread was faster than the original thread.
  This could indicate that a given RegionServer is having trouble servicing requests.

[[perf.deleting]]
== Deleting from HBase

[[perf.deleting.queue]]
=== Using HBase Tables as Queues

HBase tables are sometimes used as queues.
In this case, special care must be taken to regularly perform major compactions on tables used in this manner.
As is documented in <<datamodel>>, marking rows as deleted creates additional StoreFiles which then need to be processed on reads.
Tombstones only get cleaned up with major compactions.

See also <<compaction>> and link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Admin.html#majorCompact%28java.lang.String%29[Admin.majorCompact].

[[perf.deleting.rpc]]
=== Delete RPC Behavior

Be aware that `Table.delete(Delete)` doesn't use the writeBuffer.
It will execute a RegionServer RPC with each invocation.
For a large number of deletes, consider `Table.delete(List)`.
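
A hedged sketch of batching deletes (`rowsToDelete` and `table` are placeholders):

[source,java]
----
List<Delete> deletes = new ArrayList<>();
for (byte[] row : rowsToDelete) {
  deletes.add(new Delete(row));
}
table.delete(deletes);  // one batched call instead of one RPC per Delete
----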

See
+++<a href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html#delete%28org.apache.hadoop.hbase.client.Delete%29">hbase.client.Delete</a>+++.

[[perf.hdfs]]
== HDFS

Because HBase runs on <<arch.hdfs>>, it is important to understand how it works and how it affects HBase.

[[perf.hdfs.curr]]
=== Current Issues With Low-Latency Reads

The original use-case for HDFS was batch processing.
As such, low-latency reads were historically not a priority.
With the increased adoption of Apache HBase this is changing, and several improvements are already in development.
See the link:https://issues.apache.org/jira/browse/HDFS-1599[Umbrella Jira Ticket for HDFS Improvements for HBase].

[[perf.hdfs.configs.localread]]
=== Leveraging local data

Since Hadoop 1.0.0 (also 0.22.1, 0.23.1, CDH3u3 and HDP 1.0) via link:https://issues.apache.org/jira/browse/HDFS-2246[HDFS-2246], it is possible for the DFSClient to take a "short circuit" and read directly from the disk instead of going through the DataNode when the data is local.
What this means for HBase is that the RegionServers can read directly off their machine's disks instead of having to open a socket to talk to the DataNode, the former being generally much faster.
See JD's link:http://files.meetup.com/1350427/hug_ebay_jdcryans.pdf[Performance Talk].
Also see the link:http://search-hadoop.com/m/zV6dKrLCVh1[HBase, mail # dev - read short circuit] thread for more discussion around short circuit reads.

How to enable "short circuit" reads depends on your version of Hadoop.
The original shortcircuit read patch was much improved upon in Hadoop 2 in link:https://issues.apache.org/jira/browse/HDFS-347[HDFS-347].
See http://blog.cloudera.com/blog/2013/08/how-improved-short-circuit-local-reads-bring-better-performance-and-security-to-hadoop/ for details on the difference between the old and new implementations.
See link:http://archive.cloudera.com/cdh4/cdh/4/hadoop/hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html[Hadoop shortcircuit reads configuration page] for how to enable the latter, better version of shortcircuit.
For example, here is a minimal config enabling short-circuit reads, added to _hbase-site.xml_:

[source,xml]
----
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
  <description>
    This configuration parameter turns on short-circuit local reads.
  </description>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <value>/home/stack/sockets/short_circuit_read_socket_PORT</value>
  <description>
    Optional.  This is a path to a UNIX domain socket that will be used for
    communication between the DataNode and local HDFS clients.
    If the string "_PORT" is present in this path, it will be replaced by the
    TCP port of the DataNode.
  </description>
</property>
----

Be careful about permissions for the directory that hosts the shared domain socket; the DFSClient will complain if it is open to users other than the HBase user.

If you are running on an old Hadoop, one that is without link:https://issues.apache.org/jira/browse/HDFS-347[HDFS-347] but that has link:https://issues.apache.org/jira/browse/HDFS-2246[HDFS-2246], you must set two configurations.
First, the hdfs-site.xml needs to be amended.
Set the property `dfs.block.local-path-access.user` to be the _only_ user that can use the shortcut.
This has to be the user that started HBase.
Then in hbase-site.xml, set `dfs.client.read.shortcircuit` to be `true`.

Services -- at least the HBase RegionServers -- will need to be restarted in order to pick up the new configurations.

.dfs.client.read.shortcircuit.buffer.size
[NOTE]
====
The default for this value is too high when running on a highly trafficked HBase.
In HBase, if this value has not been set, we set it down from the default of 1M to 128k (since HBase 0.98.0 and 0.96.1). See link:https://issues.apache.org/jira/browse/HBASE-8143[HBASE-8143 HBase on Hadoop 2 with local short circuit reads (ssr) causes OOM]. The Hadoop DFSClient in HBase will allocate a direct byte buffer of this size for _each_ block it has open; given HBase keeps its HDFS files open all the time, this can add up quickly.
====

[[perf.hdfs.comp]]
=== Performance Comparisons of HBase vs. HDFS

A fairly common question on the dist-list is why HBase isn't as performant as HDFS files in a batch context (e.g., as a MapReduce source or sink). The short answer is that HBase is doing a lot more than HDFS (e.g., reading the KeyValues, returning the most current row or specified timestamps, etc.), and as such HBase is 4-5 times slower than HDFS in this processing context.
There is room for improvement and this gap will, over time, be reduced, but HDFS will always be faster in this use-case.

[[perf.ec2]]
== Amazon EC2

Performance questions are common on Amazon EC2 environments because it is a shared environment.
You will not see the same throughput as a dedicated server.
In terms of running tests on EC2, run them several times for the same reason (i.e., it's a shared environment and you don't know what else is happening on the server).

If you are running on EC2 and post performance questions on the dist-list, please state this fact up-front, because EC2 issues are practically a separate class of performance issues.

[[perf.hbase.mr.cluster]]
== Collocating HBase and MapReduce

It is often recommended to have different clusters for HBase and MapReduce.
A better qualification of this is: don't collocate an HBase that serves live requests with a heavy MR workload.
OLTP and OLAP-optimized systems have conflicting requirements and one will lose to the other, usually the former.
For example, short latency-sensitive disk reads will have to wait in line behind longer reads that are trying to squeeze out as much throughput as possible.
MR jobs that write to HBase will also generate flushes and compactions, which will in turn invalidate blocks in the <<block.cache>>.

If you need to process the data from your live HBase cluster in MR, you can ship the deltas with <<copy.table>> or use replication to get the new data in real time on the OLAP cluster.
In the worst case, if you really need to collocate both, set MR to use fewer Map and Reduce slots than you'd normally configure, possibly just one.

When HBase is used for OLAP operations, it's preferable to set it up in a hardened way like configuring the ZooKeeper session timeout higher and giving more memory to the MemStores (the argument being that the Block Cache won't be used much since the workloads are usually long scans).

[[perf.casestudy]]
== Case Studies

For Performance and Troubleshooting Case Studies, see <<casestudies>>.

ifdef::backend-docbook[]
[index]
== Index
// Generated automatically by the DocBook toolchain.
endif::backend-docbook[]