1////
2/**
3 *
4 * Licensed to the Apache Software Foundation (ASF) under one
5 * or more contributor license agreements.  See the NOTICE file
6 * distributed with this work for additional information
7 * regarding copyright ownership.  The ASF licenses this file
8 * to you under the Apache License, Version 2.0 (the
9 * "License"); you may not use this file except in compliance
10 * with the License.  You may obtain a copy of the License at
11 *
12 *     http://www.apache.org/licenses/LICENSE-2.0
13 *
14 * Unless required by applicable law or agreed to in writing, software
15 * distributed under the License is distributed on an "AS IS" BASIS,
16 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
17 * See the License for the specific language governing permissions and
18 * limitations under the License.
19 */
20////
21
22[[ops_mgt]]
23= Apache HBase Operational Management
24:doctype: book
25:numbered:
26:toc: left
27:icons: font
28:experimental:
29
30This chapter will cover operational tools and practices required of a running Apache HBase cluster.
31The subject of operations is related to the topics of <<trouble>>, <<performance>>, and <<configuration>> but is a distinct topic in itself.
32
33[[tools]]
34== HBase Tools and Utilities
35
36HBase provides several tools for administration, analysis, and debugging of your cluster.
37The entry-point to most of these tools is the _bin/hbase_ command, though some tools are available in the _dev-support/_ directory.
38
To see usage instructions for the _bin/hbase_ command, run it with no arguments, or with the `-h` argument.
40These are the usage instructions for HBase 0.98.x.
41Some commands, such as `version`, `pe`, `ltt`, `clean`, are not available in previous versions.
42
43----
44$ bin/hbase
45Usage: hbase [<options>] <command> [<args>]
46Options:
47  --config DIR    Configuration direction to use. Default: ./conf
48  --hosts HOSTS   Override the list in 'regionservers' file
49
50Commands:
51Some commands take arguments. Pass no args or -h for usage.
52  shell           Run the HBase shell
53  hbck            Run the hbase 'fsck' tool
54  wal             Write-ahead-log analyzer
55  hfile           Store file analyzer
56  zkcli           Run the ZooKeeper shell
57  upgrade         Upgrade hbase
58  master          Run an HBase HMaster node
59  regionserver    Run an HBase HRegionServer node
60  zookeeper       Run a Zookeeper server
61  rest            Run an HBase REST server
62  thrift          Run the HBase Thrift server
63  thrift2         Run the HBase Thrift2 server
64  clean           Run the HBase clean up script
65  classpath       Dump hbase CLASSPATH
66  mapredcp        Dump CLASSPATH entries required by mapreduce
67  pe              Run PerformanceEvaluation
68  ltt             Run LoadTestTool
69  version         Print the version
70  CLASSNAME       Run the class named CLASSNAME
71----
72
73Some of the tools and utilities below are Java classes which are passed directly to the _bin/hbase_ command, as referred to in the last line of the usage instructions.
74Others, such as `hbase shell` (<<shell>>), `hbase upgrade` (<<upgrading>>), and `hbase thrift` (<<thrift>>), are documented elsewhere in this guide.
75
76=== Canary
77
The Canary class can help you canary-test the health of your HBase cluster, at the granularity of every column family of every region, or at the granularity of RegionServers.
79To see the usage, use the `--help` parameter.
80
81----
82$ ${HBASE_HOME}/bin/hbase canary -help
83
84Usage: bin/hbase org.apache.hadoop.hbase.tool.Canary [opts] [table1 [table2]...] | [regionserver1 [regionserver2]..]
85 where [opts] are:
86   -help          Show this help and exit.
87   -regionserver  replace the table argument to regionserver,
88      which means to enable regionserver mode
89   -daemon        Continuous check at defined intervals.
90   -interval <N>  Interval between checks (sec)
91   -e             Use region/regionserver as regular expression
92      which means the region/regionserver is regular expression pattern
93   -f <B>         stop whole program if first error occurs, default is true
94   -t <N>         timeout for a check, default is 600000 (milliseconds)
95   -writeSniffing enable the write sniffing in canary
96   -treatFailureAsError treats read / write failure as error
97   -writeTable    The table used for write sniffing. Default is hbase:canary
98   -D<configProperty>=<value> assigning or override the configuration params
99----
100
This tool returns non-zero exit codes so that it can be integrated with other monitoring tools, such as Nagios.
102The error code definitions are:
103
104[source,java]
105----
106private static final int USAGE_EXIT_CODE = 1;
107private static final int INIT_ERROR_EXIT_CODE = 2;
108private static final int TIMEOUT_ERROR_EXIT_CODE = 3;
109private static final int ERROR_EXIT_CODE = 4;
110----
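
For example, a minimal monitoring wrapper might run a single check and act on the exit code; the alerting command below is only a placeholder.

[source,bash]
----
#!/usr/bin/env bash
# Run one canary check with a 60 second timeout, then map the exit code
# to an alert. Replace the echo with your monitoring system's alert hook.
"${HBASE_HOME}/bin/hbase" canary -t 60000
status=$?
if [ "${status}" -ne 0 ]; then
  echo "HBase canary failed with exit code ${status}"
fi
exit "${status}"
----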
111
Here is a scenario to use for some examples.
There are two tables, test-01 and test-02, each with two column families, cf1 and cf2, deployed across three RegionServers, as shown in the following table.
115
116[cols="1,1,1", options="header"]
117|===
118| RegionServer
119| test-01
120| test-02
121| rs1 | r1 | r2
122| rs2 | r2 |
123| rs3 | r2 | r1
124|===
125
The following examples are based on this scenario.
127
128==== Canary test for every column family (store) of every region of every table
129
130----
131$ ${HBASE_HOME}/bin/hbase canary
132
13/12/09 03:26:32 INFO tool.Canary: read from region test-01,,1386230156732.0e3c7d77ffb6361ea1b996ac1042ca9a. column family cf1 in 2ms
13413/12/09 03:26:32 INFO tool.Canary: read from region test-01,,1386230156732.0e3c7d77ffb6361ea1b996ac1042ca9a. column family cf2 in 2ms
13513/12/09 03:26:32 INFO tool.Canary: read from region test-01,0004883,1386230156732.87b55e03dfeade00f441125159f8ca87. column family cf1 in 4ms
13613/12/09 03:26:32 INFO tool.Canary: read from region test-01,0004883,1386230156732.87b55e03dfeade00f441125159f8ca87. column family cf2 in 1ms
137...
13813/12/09 03:26:32 INFO tool.Canary: read from region test-02,,1386559511167.aa2951a86289281beee480f107bb36ee. column family cf1 in 5ms
13913/12/09 03:26:32 INFO tool.Canary: read from region test-02,,1386559511167.aa2951a86289281beee480f107bb36ee. column family cf2 in 3ms
14013/12/09 03:26:32 INFO tool.Canary: read from region test-02,0004883,1386559511167.cbda32d5e2e276520712d84eaaa29d84. column family cf1 in 31ms
14113/12/09 03:26:32 INFO tool.Canary: read from region test-02,0004883,1386559511167.cbda32d5e2e276520712d84eaaa29d84. column family cf2 in 8ms
142----
143
As you can see, table test-01 has two regions and two column families, so the Canary tool picks 4 small pieces of data from 4 (2 regions * 2 stores) different stores.
This is the default behavior of the tool.
146
147==== Canary test for every column family (store) of every region of specific table(s)
148
149You can also test one or more specific tables.
150
151----
152$ ${HBASE_HOME}/bin/hbase canary test-01 test-02
153----
154
155==== Canary test with RegionServer granularity
156
This will pick one small piece of data from each RegionServer. You can also pass RegionServer names as arguments to canary-test specific RegionServers.
158
159----
160$ ${HBASE_HOME}/bin/hbase canary -regionserver
161
16213/12/09 06:05:17 INFO tool.Canary: Read from table:test-01 on region server:rs2 in 72ms
16313/12/09 06:05:17 INFO tool.Canary: Read from table:test-02 on region server:rs3 in 34ms
16413/12/09 06:05:17 INFO tool.Canary: Read from table:test-01 on region server:rs1 in 56ms
165----
166
167==== Canary test with regular expression pattern
168
169This will test both table test-01 and test-02.
170
171----
172$ ${HBASE_HOME}/bin/hbase canary -e test-0[1-2]
173----
174
==== Run canary test in daemon mode
176
The canary runs repeatedly, with the interval between checks set by the `-interval` option (default: 6 seconds).
The daemon stops itself and returns a non-zero error code if any error occurs, because the default value of the `-f` option is true.
179
180----
181$ ${HBASE_HOME}/bin/hbase canary -daemon
182----
183
The following runs the canary repeatedly with the interval given by `-interval`, and does not stop itself even if errors occur during the test, because `-f false` is passed.
185
186----
187$ ${HBASE_HOME}/bin/hbase canary -daemon -interval 50000 -f false
188----
189
==== Force timeout if canary test is stuck
191
192In some cases the request is stuck and no response is sent back to the client. This can happen with dead RegionServers which the master has not yet noticed.
193Because of this we provide a timeout option to kill the canary test and return a non-zero error code.
The timeout is specified in milliseconds; this run explicitly sets it to 600000 milliseconds (600 seconds), which is also the default value.
195
196----
197$ ${HBASE_HOME}/bin/hbase canary -t 600000
198----
199
200==== Enable write sniffing in canary
201
By default, the canary tool only checks read operations; this makes it hard to find problems in the write path. To enable write sniffing, run the canary with the `-writeSniffing` option.
When write sniffing is enabled, the canary tool creates an HBase table and makes sure the regions of the table are distributed across all region servers. In each sniffing period, the canary tries to put data into these regions to check the write availability of each region server.

207----
208$ ${HBASE_HOME}/bin/hbase canary -writeSniffing
209----
210
The default write table is `hbase:canary`; a different table can be specified with the `-writeTable` option.

212----
213$ ${HBASE_HOME}/bin/hbase canary -writeSniffing -writeTable ns:canary
214----
215
The default value size of each put is 10 bytes. You can change it with the configuration key
`hbase.canary.write.value.size`.
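
For example, write sniffing can be combined with the generic `-D` option shown in the usage output to change the value size; the value of 100 bytes here is only illustrative.

----
$ ${HBASE_HOME}/bin/hbase canary -writeSniffing -Dhbase.canary.write.value.size=100
----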
218
219==== Treat read / write failure as error
220
By default, the canary tool only logs read failures (due to, for example, a RetriesExhaustedException), while still returning a normal exit code. To treat read/write failures as errors, run the canary with the `-treatFailureAsError` option. When enabled, a read/write failure results in an error exit code.

225----
$ ${HBASE_HOME}/bin/hbase canary -treatFailureAsError
227----
228
229==== Running Canary in a Kerberos-enabled Cluster
230
231To run Canary in a Kerberos-enabled cluster, configure the following two properties in _hbase-site.xml_:
232
233* `hbase.client.keytab.file`
234* `hbase.client.kerberos.principal`
235
236Kerberos credentials are refreshed every 30 seconds when Canary runs in daemon mode.
237
238To configure the DNS interface for the client, configure the following optional properties in _hbase-site.xml_.
239
240* `hbase.client.dns.interface`
241* `hbase.client.dns.nameserver`
242
243.Canary in a Kerberos-Enabled Cluster
244====
245This example shows each of the properties with valid values.
246
247[source,xml]
248----
249<property>
250  <name>hbase.client.kerberos.principal</name>
251  <value>hbase/_HOST@YOUR-REALM.COM</value>
252</property>
253<property>
254  <name>hbase.client.keytab.file</name>
255  <value>/etc/hbase/conf/keytab.krb5</value>
256</property>
257<!-- optional params -->
<property>
259  <name>hbase.client.dns.interface</name>
260  <value>default</value>
261</property>
262<property>
263  <name>hbase.client.dns.nameserver</name>
264  <value>default</value>
265</property>
266----
267====
268
269[[health.check]]
270=== Health Checker
271
272You can configure HBase to run a script periodically and if it fails N times (configurable), have the server exit.
See link:https://issues.apache.org/jira/browse/HBASE-7351[HBASE-7351 Periodic health check script] for configurations and details.
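
As a sketch of what such a script might look like, the following checks a data volume and reports a problem; the mount point is a placeholder, and the exact output convention and configuration keys the health checker expects are described in HBASE-7351.

[source,bash]
----
#!/usr/bin/env bash
# Hypothetical node health script: flag the node if its data volume is
# missing or more than 95% full. Assumes the Hadoop-style convention that
# output beginning with "ERROR" marks the node as unhealthy.
used=$(df -P /hbase-data 2>/dev/null | awk 'NR==2 {gsub("%",""); print $5}')
if [ -z "${used}" ] || [ "${used}" -gt 95 ]; then
  echo "ERROR data volume missing or more than 95% full"
else
  echo "OK"
fi
----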
274
275=== Driver
276
277Several frequently-accessed utilities are provided as `Driver` classes, and executed by the _bin/hbase_ command.
278These utilities represent MapReduce jobs which run on your cluster.
279They are run in the following way, replacing _UtilityName_ with the utility you want to run.
280This command assumes you have set the environment variable `HBASE_HOME` to the directory where HBase is unpacked on your server.
281
282----
283
284${HBASE_HOME}/bin/hbase org.apache.hadoop.hbase.mapreduce.UtilityName
285----
286
287The following utilities are available:
288
289`LoadIncrementalHFiles`::
290  Complete a bulk data load.
291
292`CopyTable`::
293  Export a table from the local cluster to a peer cluster.
294
295`Export`::
296  Write table data to HDFS.
297
298`Import`::
299  Import data written by a previous `Export` operation.
300
301`ImportTsv`::
302  Import data in TSV format.
303
304`RowCounter`::
305  Count rows in an HBase table.
306
307`replication.VerifyReplication`::
308  Compare the data from tables in two different clusters.
309  WARNING: It doesn't work for incrementColumnValues'd cells since the timestamp is changed.
310  Note that this command is in a different package than the others.
311
312Each command except `RowCounter` accepts a single `--help` argument to print usage instructions.
313
314[[hbck]]
315=== HBase `hbck`
316
To run `hbck` against your HBase cluster, run `$ ./bin/hbase hbck`. At the end of the command's output it prints `OK` or `INCONSISTENCY`.
If your cluster reports inconsistencies, pass `-details` to see more detail emitted.
If inconsistencies are reported, run `hbck` a few more times, because an inconsistency may be transient (e.g. the cluster is starting up or a region is splitting).
Passing `-fix` may correct the inconsistency (this is an experimental feature).
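
For example, to see detail on any inconsistencies, and then to attempt an experimental repair:

----
$ ./bin/hbase hbck -details
$ ./bin/hbase hbck -fix
----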
321
322For more information, see <<hbck.in.depth>>.
323
324[[hfile_tool2]]
325=== HFile Tool
326
327See <<hfile_tool>>.
328
329=== WAL Tools
330
331[[hlog_tool]]
332==== `FSHLog` tool
333
334The main method on `FSHLog` offers manual split and dump facilities.
Pass it WALs or the product of a split, the content of the _recovered.edits_ directory.
337
338You can get a textual dump of a WAL file content by doing the following:
339
340----
341 $ ./bin/hbase org.apache.hadoop.hbase.regionserver.wal.FSHLog --dump hdfs://example.org:8020/hbase/.logs/example.org,60020,1283516293161/10.10.21.10%3A60020.1283973724012
342----
343
The return code will be non-zero if there are any issues with the file, so you can test the health of a file by redirecting `STDOUT` to `/dev/null` and testing the program's return code.
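
For example, to check a file in a script, discarding the dump and inspecting only the exit code:

----
 $ ./bin/hbase org.apache.hadoop.hbase.regionserver.wal.FSHLog --dump hdfs://example.org:8020/hbase/.logs/example.org,60020,1283516293161/10.10.21.10%3A60020.1283973724012 > /dev/null
 $ echo $?
----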
345
346Similarly you can force a split of a log file directory by doing:
347
348----
349 $ ./bin/hbase org.apache.hadoop.hbase.regionserver.wal.FSHLog --split hdfs://example.org:8020/hbase/.logs/example.org,60020,1283516293161/
350----
351
352[[hlog_tool.prettyprint]]
353===== WAL Pretty Printer
354
355The WAL Pretty Printer is a tool with configurable options to print the contents of a WAL.
356You can invoke it via the HBase cli with the 'wal' command.
357
358----
359 $ ./bin/hbase wal hdfs://example.org:8020/hbase/.logs/example.org,60020,1283516293161/10.10.21.10%3A60020.1283973724012
360----
361
362.WAL Printing in older versions of HBase
363[NOTE]
364====
365Prior to version 2.0, the WAL Pretty Printer was called the `HLogPrettyPrinter`, after an internal name for HBase's write ahead log.
366In those versions, you can print the contents of a WAL using the same configuration as above, but with the 'hlog' command.
367
368----
369 $ ./bin/hbase hlog hdfs://example.org:8020/hbase/.logs/example.org,60020,1283516293161/10.10.21.10%3A60020.1283973724012
370----
371====
372
373[[compression.tool]]
374=== Compression Tool
375
376See <<compression.test,compression.test>>.
377
378[[copy.table]]
379=== CopyTable
380
CopyTable is a utility that can copy part or all of a table, either to the same cluster or to another cluster.
382The target table must first exist.
383The usage is as follows:
384
385----
386
387$ ./bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --help
389Usage: CopyTable [general options] [--starttime=X] [--endtime=Y] [--new.name=NEW] [--peer.adr=ADR] <tablename>
390
391Options:
392 rs.class     hbase.regionserver.class of the peer cluster,
393              specify if different from current cluster
394 rs.impl      hbase.regionserver.impl of the peer cluster,
395 startrow     the start row
396 stoprow      the stop row
397 starttime    beginning of the time range (unixtime in millis)
398              without endtime means from starttime to forever
399 endtime      end of the time range.  Ignored if no starttime specified.
400 versions     number of cell versions to copy
401 new.name     new table's name
402 peer.adr     Address of the peer cluster given in the format
403              hbase.zookeeer.quorum:hbase.zookeeper.client.port:zookeeper.znode.parent
404 families     comma-separated list of families to copy
405              To copy from cf1 to cf2, give sourceCfName:destCfName.
406              To keep the same name, just give "cfName"
407 all.cells    also copy delete markers and deleted cells
408
409Args:
410 tablename    Name of the table to copy
411
412Examples:
413 To copy 'TestTable' to a cluster that uses replication for a 1 hour window:
414 $ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=1265875194289 --endtime=1265878794289 --peer.adr=server1,server2,server3:2181:/hbase --families=myOldCf:myNewCf,cf2,cf3 TestTable
415
416For performance consider the following general options:
417  It is recommended that you set the following to >=100. A higher value uses more memory but
418  decreases the round trip time to the server and may increase performance.
419    -Dhbase.client.scanner.caching=100
420  The following should always be set to false, to prevent writing data twice, which may produce
421  inaccurate results.
422    -Dmapred.map.tasks.speculative.execution=false
423----
424
425.Scanner Caching
426[NOTE]
427====
Caching for the input Scan is configured via `hbase.client.scanner.caching` in the job configuration.
429====
430
431.Versions
432[NOTE]
433====
By default, the CopyTable utility only copies the latest version of row cells unless `--versions=n` is explicitly specified in the command.
435====
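
For example, a copy of the three most recent versions of every cell of `TestTable` into a new table on the same cluster might look like the following; the destination table name is only a placeholder.

----
$ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --versions=3 --new.name=TestTableCopy TestTable
----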
436
See Jonathan Hsieh's link:http://www.cloudera.com/blog/2012/06/online-hbase-backups-with-copytable-2/[Online HBase Backups with CopyTable] blog post for more on `CopyTable`.
439
440[[export]]
441=== Export
442
Export is a utility that will dump the contents of a table to HDFS as a sequence file.
444Invoke via:
445
446----
447$ bin/hbase org.apache.hadoop.hbase.mapreduce.Export <tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]
448----
449
450NOTE: To see usage instructions, run the command with no options. Available options include
451specifying column families and applying filters during the export.
452
453By default, the `Export` tool only exports the newest version of a given cell, regardless of the number of versions stored. To export more than one version, replace *_<versions>_* with the desired number of versions.
454
455Note: caching for the input Scan is configured via `hbase.client.scanner.caching` in the job configuration.
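
For example, a hypothetical export of up to three versions of every cell in `TestTable` to the HDFS directory _/export/TestTable_ (both names are placeholders):

----
$ bin/hbase org.apache.hadoop.hbase.mapreduce.Export TestTable /export/TestTable 3
----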
456
457[[import]]
458=== Import
459
460Import is a utility that will load data that has been exported back into HBase.
461Invoke via:
462
463----
464$ bin/hbase org.apache.hadoop.hbase.mapreduce.Import <tablename> <inputdir>
465----
466
467NOTE: To see usage instructions, run the command with no options.
468
To import files exported from HBase 0.94 into a 0.96 or later cluster, you need to set the system property `hbase.import.version` when running the import command, as below:
470
471----
472$ bin/hbase -Dhbase.import.version=0.94 org.apache.hadoop.hbase.mapreduce.Import <tablename> <inputdir>
473----
474
475[[importtsv]]
476=== ImportTsv
477
478ImportTsv is a utility that will load data in TSV format into HBase.
It has two distinct usages: loading data from TSV format in HDFS into HBase via Puts, and preparing StoreFiles to be loaded via `completebulkload`.
480
481To load data via Puts (i.e., non-bulk loading):
482
483----
484$ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c <tablename> <hdfs-inputdir>
485----
486
487To generate StoreFiles for bulk-loading:
488
489[source,bourne]
490----
491$ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c -Dimporttsv.bulk.output=hdfs://storefile-outputdir <tablename> <hdfs-data-inputdir>
492----
493
494These generated StoreFiles can be loaded into HBase via <<completebulkload,completebulkload>>.
495
496[[importtsv.options]]
497==== ImportTsv Options
498
499Running `ImportTsv` with no arguments prints brief usage information:
500
501----
502
503Usage: importtsv -Dimporttsv.columns=a,b,c <tablename> <inputdir>
504
505Imports the given input directory of TSV data into the specified table.
506
507The column names of the TSV data must be specified using the -Dimporttsv.columns
508option. This option takes the form of comma-separated column names, where each
509column name is either a simple column family, or a columnfamily:qualifier. The special
510column name HBASE_ROW_KEY is used to designate that this column should be used
511as the row key for each imported record. You must specify exactly one column
512to be the row key, and you must specify a column name for every column that exists in the
513input data.
514
515By default importtsv will load data directly into HBase. To instead generate
516HFiles of data to prepare for a bulk data load, pass the option:
517  -Dimporttsv.bulk.output=/path/for/output
518  Note: the target table will be created with default column family descriptors if it does not already exist.
519
520Other options that may be specified with -D include:
521  -Dimporttsv.skip.bad.lines=false - fail if encountering an invalid line
522  '-Dimporttsv.separator=|' - eg separate on pipes instead of tabs
523  -Dimporttsv.timestamp=currentTimeAsLong - use the specified timestamp for the import
524  -Dimporttsv.mapper.class=my.Mapper - A user-defined Mapper to use instead of org.apache.hadoop.hbase.mapreduce.TsvImporterMapper
525----
526
527[[importtsv.example]]
528==== ImportTsv Example
529
530For example, assume that we are loading data into a table called 'datatsv' with a ColumnFamily called 'd' with two columns "c1" and "c2".
531
532Assume that an input file exists as follows:
533----
534
535row1	c1	c2
536row2	c1	c2
537row3	c1	c2
538row4	c1	c2
539row5	c1	c2
540row6	c1	c2
541row7	c1	c2
542row8	c1	c2
543row9	c1	c2
544row10	c1	c2
545----
546
547For ImportTsv to use this input file, the command line needs to look like this:
548
549----
550
551 HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-VERSION.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,d:c1,d:c2 -Dimporttsv.bulk.output=hdfs://storefileoutput datatsv hdfs://inputfile
552----
553
554\... and in this example the first column is the rowkey, which is why the HBASE_ROW_KEY is used.
555The second and third columns in the file will be imported as "d:c1" and "d:c2", respectively.
556
557[[importtsv.warning]]
558==== ImportTsv Warning
559
If you are preparing a lot of data for bulk loading, make sure the target HBase table is pre-split appropriately.
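
One way to pre-split is from the HBase shell at table-creation time; the following sketch reuses the `datatsv` table and `d` column family from the example above, with purely illustrative split points.

----
$ echo "create 'datatsv', 'd', SPLITS => ['row3', 'row6', 'row9']" | bin/hbase shell
----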
561
562[[importtsv.also]]
563==== See Also
564
565For more information about bulk-loading HFiles into HBase, see <<arch.bulk.load,arch.bulk.load>>
566
567[[completebulkload]]
568=== CompleteBulkLoad
569
570The `completebulkload` utility will move generated StoreFiles into an HBase table.
571This utility is often used in conjunction with output from <<importtsv,importtsv>>.
572
573There are two ways to invoke this utility, with explicit classname and via the driver:
574
575.Explicit Classname
576----
577$ bin/hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles <hdfs://storefileoutput> <tablename>
578----
579
580.Driver
581----
582HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-VERSION.jar completebulkload <hdfs://storefileoutput> <tablename>
583----
584
585[[completebulkload.warning]]
586==== CompleteBulkLoad Warning
587
588Data generated via MapReduce is often created with file permissions that are not compatible with the running HBase process.
589Assuming you're running HDFS with permissions enabled, those permissions will need to be updated before you run CompleteBulkLoad.
590
591For more information about bulk-loading HFiles into HBase, see <<arch.bulk.load,arch.bulk.load>>.
592
593=== WALPlayer
594
595WALPlayer is a utility to replay WAL files into HBase.
596
597The WAL can be replayed for a set of tables or all tables, and a timerange can be provided (in milliseconds). The WAL is filtered to this set of tables.
598The output can optionally be mapped to another set of tables.
599
600WALPlayer can also generate HFiles for later bulk importing, in that case only a single table and no mapping can be specified.
601
602Invoke via:
603
604----
$ bin/hbase org.apache.hadoop.hbase.mapreduce.WALPlayer [options] <wal inputdir> <tables> [<tableMappings>]
606----
607
608For example:
609
610----
611$ bin/hbase org.apache.hadoop.hbase.mapreduce.WALPlayer /backuplogdir oldTable1,oldTable2 newTable1,newTable2
612----
613
614WALPlayer, by default, runs as a mapreduce job.
To NOT run WALPlayer as a mapreduce job on your cluster, force it to run all in the local process by adding the flag `-Dmapreduce.jobtracker.address=local` on the command line.
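
For example, the replay above, forced to run entirely in the local process:

----
$ bin/hbase org.apache.hadoop.hbase.mapreduce.WALPlayer -Dmapreduce.jobtracker.address=local /backuplogdir oldTable1,oldTable2 newTable1,newTable2
----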
616
617[[rowcounter]]
618=== RowCounter and CellCounter
619
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html[RowCounter] is a mapreduce job to count all the rows of a table.
621This is a good utility to use as a sanity check to ensure that HBase can read all the blocks of a table if there are any concerns of metadata inconsistency.
622It will run the mapreduce all in a single process but it will run faster if you have a MapReduce cluster in place for it to exploit.
623
624----
625$ bin/hbase org.apache.hadoop.hbase.mapreduce.RowCounter <tablename> [<column1> <column2>...]
626----
627
628RowCounter only counts one version per cell.
629
630Note: caching for the input Scan is configured via `hbase.client.scanner.caching` in the job configuration.
631
HBase ships another diagnostic mapreduce job called link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/CellCounter.html[CellCounter].
Like RowCounter, it is a sanity-checking utility, but it gathers more fine-grained statistics about your table.
The statistics gathered by CellCounter include:
635
636* Total number of rows in the table.
637* Total number of CFs across all rows.
638* Total qualifiers across all rows.
639* Total occurrence of each CF.
640* Total occurrence of each qualifier.
641* Total number of versions of each qualifier.
642
643The program allows you to limit the scope of the run.
644Provide a row regex or prefix to limit the rows to analyze.
645Use `hbase.mapreduce.scan.column.family` to specify scanning a single column family.
646
647----
648$ bin/hbase org.apache.hadoop.hbase.mapreduce.CellCounter <tablename> <outputDir> [regex or prefix]
649----
650
651Note: just like RowCounter, caching for the input Scan is configured via `hbase.client.scanner.caching` in the job configuration.
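
For example, a run limited to a single column family using the configuration key mentioned above; the table name, output directory, and column family are placeholders.

----
$ bin/hbase org.apache.hadoop.hbase.mapreduce.CellCounter -Dhbase.mapreduce.scan.column.family=d datatsv /cellcounter-output
----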
652
653=== mlockall
654
It is possible to optionally pin your servers in physical memory, making them less likely to be swapped out in oversubscribed environments, by having the servers call link:http://linux.die.net/man/2/mlockall[mlockall] on startup.
See link:https://issues.apache.org/jira/browse/HBASE-4391[HBASE-4391 Add ability to start RS as root and call mlockall] for how to build the optional library and have it run on startup.
658
659[[compaction.tool]]
660=== Offline Compaction Tool
661
662See the usage for the
663link:http://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/regionserver/CompactionTool.html[CompactionTool].
664Run it like:
665
666[source, bash]
667----
668$ ./bin/hbase org.apache.hadoop.hbase.regionserver.CompactionTool
669----
670
671=== `hbase clean`
672
673The `hbase clean` command cleans HBase data from ZooKeeper, HDFS, or both.
674It is appropriate to use for testing.
675Run it with no options for usage instructions.
676The `hbase clean` command was introduced in HBase 0.98.
677
678----
679
680$ bin/hbase clean
681Usage: hbase clean (--cleanZk|--cleanHdfs|--cleanAll)
682Options:
683        --cleanZk   cleans hbase related data from zookeeper.
684        --cleanHdfs cleans hbase related data from hdfs.
685        --cleanAll  cleans hbase related data from both zookeeper and hdfs.
686----
687
688=== `hbase pe`
689
690The `hbase pe` command is a shortcut provided to run the `org.apache.hadoop.hbase.PerformanceEvaluation` tool, which is used for testing.
691The `hbase pe` command was introduced in HBase 0.98.4.
692
693The PerformanceEvaluation tool accepts many different options and commands.
694For usage instructions, run the command with no options.
695
696To run PerformanceEvaluation prior to HBase 0.98.4, issue the command `hbase org.apache.hadoop.hbase.PerformanceEvaluation`.
697
698The PerformanceEvaluation tool has received many updates in recent HBase releases, including support for namespaces, support for tags, cell-level ACLs and visibility labels, multiget support for RPC calls, increased sampling sizes, an option to randomly sleep during testing, and ability to "warm up" the cluster before testing starts.
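
For example, a hypothetical small write test run in a single local process; the `--nomapred` flag and the `randomWrite` command with a client count of 1 are assumptions based on recent versions of the tool.

----
$ bin/hbase pe --nomapred randomWrite 1
----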
699
700=== `hbase ltt`
701
702The `hbase ltt` command is a shortcut provided to run the `org.apache.hadoop.hbase.util.LoadTestTool` utility, which is used for testing.
703The `hbase ltt` command was introduced in HBase 0.98.4.
704
705You must specify either `-write` or `-update-read` as the first option.
706For general usage instructions, pass the `-h` option.
707
To run LoadTestTool prior to HBase 0.98.4, issue the command `hbase org.apache.hadoop.hbase.util.LoadTestTool`.
710
The LoadTestTool has received many updates in recent HBase releases, including support for namespaces, support for tags, cell-level ACLs and visibility labels, testing security-related features, ability to specify the number of regions per server, tests for multi-get RPC calls, and tests relating to replication.
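
For example, a hypothetical write workload against a test table; the option names and the colon-separated write specification (columns per key, data size in bytes, writer threads) are assumptions based on recent versions of the tool.

----
$ bin/hbase ltt -write 1:10:10 -num_keys 1000000 -tn load_test
----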
712
713[[ops.regionmgt]]
714== Region Management
715
716[[ops.regionmgt.majorcompact]]
717=== Major Compaction
718
719Major compactions can be requested via the HBase shell or link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Admin.html#majorCompact%28java.lang.String%29[Admin.majorCompact].
720
721Note: major compactions do NOT do region merges.
722See <<compaction,compaction>> for more information about compactions.
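
For example, a major compaction of a single table can be requested from the shell non-interactively; the table name is a placeholder.

----
$ echo "major_compact 'myTable'" | ./bin/hbase shell
----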
723
724[[ops.regionmgt.merge]]
725=== Merge
726
727Merge is a utility that can merge adjoining regions in the same table (see org.apache.hadoop.hbase.util.Merge).
728
729[source,bourne]
730----
731$ bin/hbase org.apache.hadoop.hbase.util.Merge <tablename> <region1> <region2>
732----
733
734If you feel you have too many regions and want to consolidate them, Merge is the utility you need.
Merge must be run when the cluster is offline.
See the link:http://ofps.oreilly.com/titles/9781449396107/performance.html[O'Reilly HBase Book] for an example of its usage.
738
739You will need to pass 3 parameters to this application.
740The first one is the table name.
741The second one is the fully qualified name of the first region to merge, like "table_name,\x0A,1342956111995.7cef47f192318ba7ccc75b1bbf27a82b.". The third one is the fully qualified name for the second region to merge.
742
743Additionally, there is a Ruby script attached to link:https://issues.apache.org/jira/browse/HBASE-1621[HBASE-1621] for region merging.
744
745[[node.management]]
746== Node Management
747
748[[decommission]]
749=== Node Decommission
750
751You can stop an individual RegionServer by running the following script in the HBase directory on the particular node:
752
753----
754$ ./bin/hbase-daemon.sh stop regionserver
755----
756
757The RegionServer will first close all regions and then shut itself down.
758On shutdown, the RegionServer's ephemeral node in ZooKeeper will expire.
The master will notice the RegionServer gone and will treat it as a 'crashed' server; it will reassign the regions the RegionServer was carrying.
760
761.Disable the Load Balancer before Decommissioning a node
762[NOTE]
763====
764If the load balancer runs while a node is shutting down, then there could be contention between the Load Balancer and the Master's recovery of the just decommissioned RegionServer.
765Avoid any problems by disabling the balancer first.
766See <<lb,lb>> below.
767====
768
769.Kill Node Tool
770[NOTE]
771====
In hbase-2.0, in the bin directory, we added a script named _considerAsDead.sh_ that can be used to mark a RegionServer as dead.
Hardware issues can be detected by specialized monitoring tools before the ZooKeeper timeout has expired, and _considerAsDead.sh_ lets you act on that early detection.
774It deletes all the znodes of the server, starting the recovery process.
775Plug in the script into your monitoring/fault detection tools to initiate faster failover.
776Be careful how you use this disruptive tool.
777Copy the script if you need to make use of it in a version of hbase previous to hbase-2.0.
778====
779
780A downside to the above stop of a RegionServer is that regions could be offline for a good period of time.
781Regions are closed in order.
If there are many regions on the server, the first region to close may not be back online until all regions close and the master notices the RegionServer's znode is gone.
In Apache HBase 0.90.2, we added a facility for having a node gradually shed its load and then shut itself down: the _graceful_stop.sh_ script.
785Here is its usage:
786
787----
788$ ./bin/graceful_stop.sh
Usage: graceful_stop.sh [--config <conf-dir>] [--restart] [--reload] [--thrift] [--rest] <hostname>
790 thrift      If we should stop/start thrift before/after the hbase stop/start
791 rest        If we should stop/start rest before/after the hbase stop/start
792 restart     If we should restart after graceful stop
793 reload      Move offloaded regions back on to the stopped server
794 debug       Move offloaded regions back on to the stopped server
795 hostname    Hostname of server we are to stop
796----
797
To decommission a loaded RegionServer, run `$ ./bin/graceful_stop.sh HOSTNAME`, where `HOSTNAME` is the host carrying the RegionServer you want to decommission.
800
801.On `HOSTNAME`
802[NOTE]
803====
804The `HOSTNAME` passed to _graceful_stop.sh_ must match the hostname that hbase is using to identify RegionServers.
805Check the list of RegionServers in the master UI for how HBase is referring to servers.
806It's usually hostname but can also be FQDN.
Whatever HBase is using, this is what you should pass to the _graceful_stop.sh_ decommission script.
If you pass an IP address, the script is not yet smart enough to make a hostname (or FQDN) of it, so it will fail when it checks whether the server is currently running; the graceful unloading of regions will not run.
809====
810
811The _graceful_stop.sh_ script will move the regions off the decommissioned RegionServer one at a time to minimize region churn.
It verifies that each region is deployed in its new location before it moves the next region, and so on, until the decommissioned server is carrying zero regions.
At this point, _graceful_stop.sh_ tells the RegionServer to `stop`.
814The master will at this point notice the RegionServer gone but all regions will have already been redeployed and because the RegionServer went down cleanly, there will be no WAL logs to split.
815
816[[lb]]
817.Load Balancer
818[NOTE]
819====
820It is assumed that the Region Load Balancer is disabled while the `graceful_stop` script runs (otherwise the balancer and the decommission script will end up fighting over region deployments). Use the shell to disable the balancer:
821
822[source]
823----
824hbase(main):001:0> balance_switch false
825true
8260 row(s) in 0.3590 seconds
827----
828
829This turns the balancer OFF.
830To reenable, do:
831
832[source]
833----
834hbase(main):001:0> balance_switch true
835false
8360 row(s) in 0.3590 seconds
837----
838
839The `graceful_stop` will check the balancer and if enabled, will turn it off before it goes to work.
840If it exits prematurely because of error, it will not have reset the balancer.
Hence, it is better to manage the balancer yourself, apart from `graceful_stop`, re-enabling it after you are done with `graceful_stop`.
842====
843
844[[draining.servers]]
845==== Decommissioning several Regions Servers concurrently
846
847If you have a large cluster, you may want to decommission more than one machine at a time by gracefully stopping multiple RegionServers concurrently.
848To gracefully drain multiple regionservers at the same time, RegionServers can be put into a "draining" state.
849This is done by marking a RegionServer as a draining node by creating an entry in ZooKeeper under the _hbase_root/draining_ znode.
850This znode has format `name,port,startcode` just like the regionserver entries under _hbase_root/rs_ znode.
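
As a sketch, the draining entry can be created with the `zkcli` tool shipped with HBase; the parent znode below assumes the default `/hbase` root, and the server name, port, and startcode are placeholders copied from the corresponding entry under the _rs_ znode.

----
$ bin/hbase zkcli create /hbase/draining/rs1.example.com,60020,1431001212020 ""
----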
851
852Without this facility, decommissioning multiple nodes may be non-optimal because regions that are being drained from one region server may be moved to other regionservers that are also draining.
853Marking RegionServers to be in the draining state prevents this from happening.
See this link:http://inchoate-clatter.blogspot.com/2012/03/hbase-ops-automation.html[blog post] for more details.
856
857[[bad.disk]]
858==== Bad or Failing Disk
859
It is good to have <<dfs.datanode.failed.volumes.tolerated,dfs.datanode.failed.volumes.tolerated>> set if you have a decent number of disks per machine, for the case where a disk plain dies.
861But usually disks do the "John Wayne" -- i.e.
862take a while to go down spewing errors in _dmesg_ -- or for some reason, run much slower than their companions.
863In this case you want to decommission the disk.
864You have two options.
You can link:http://wiki.apache.org/hadoop/FAQ#I_want_to_make_a_large_cluster_smaller_by_taking_out_a_bunch_of_nodes_simultaneously._How_can_this_be_done.3F[decommission the datanode], or, less disruptively (in that only the bad disk's data will be re-replicated), you can stop the datanode, unmount the bad volume (you can't unmount a volume while the datanode is using it), and then restart the datanode (presuming you have set dfs.datanode.failed.volumes.tolerated > 0). The regionserver will throw some errors in its logs as it recalibrates where to get its data from -- it will likely roll its WAL log too -- but in general, apart from some latency spikes, it should keep on chugging.
867
868.Short Circuit Reads
869[NOTE]
870====
If you are doing short-circuit reads, you will have to move the regions off the regionserver before you stop the datanode. When short-circuit reading, the regionserver already has the block files open, so even though the files are chmod'd so that the regionserver cannot otherwise access them, it will be able to keep reading the file blocks from the bad disk even though the datanode is down.
872Move the regions back after you restart the datanode.
873====
874
875[[rolling]]
876=== Rolling Restart
877
878Some cluster configuration changes require either the entire cluster, or the RegionServers, to be restarted in order to pick up the changes.
879In addition, rolling restarts are supported for upgrading to a minor or maintenance release, and to a major release if at all possible.
See the release notes for the release you want to upgrade to, to find out about limitations to the ability to perform a rolling upgrade.
881
882There are multiple ways to restart your cluster nodes, depending on your situation.
883These methods are detailed below.
884
885==== Using the `rolling-restart.sh` Script
886
887HBase ships with a script, _bin/rolling-restart.sh_, that allows you to perform rolling restarts on the entire cluster, the master only, or the RegionServers only.
888The script is provided as a template for your own script, and is not explicitly tested.
889It requires password-less SSH login to be configured and assumes that you have deployed using a tarball.
890The script requires you to set some environment variables before running it.
891Examine the script and modify it to suit your needs.
892
893._rolling-restart.sh_ General Usage
894====
895----
896
897$ ./bin/rolling-restart.sh --help
898Usage: rolling-restart.sh [--config <hbase-confdir>] [--rs-only] [--master-only] [--graceful] [--maxthreads xx]
899----
900====
901
902Rolling Restart on RegionServers Only::
903  To perform a rolling restart on the RegionServers only, use the `--rs-only` option.
904  This might be necessary if you need to reboot the individual RegionServer or if you make a configuration change that only affects RegionServers and not the other HBase processes.
905
906Rolling Restart on Masters Only::
907  To perform a rolling restart on the active and backup Masters, use the `--master-only` option.
908  You might use this if you know that your configuration change only affects the Master and not the RegionServers, or if you need to restart the server where the active Master is running.
909
910Graceful Restart::
911  If you specify the `--graceful` option, RegionServers are restarted using the _bin/graceful_stop.sh_ script, which moves regions off a RegionServer before restarting it.
912  This is safer, but can delay the restart.
913
914Limiting the Number of Threads::
915  To limit the rolling restart to using only a specific number of threads, use the `--maxthreads` option.
916
917[[rolling.restart.manual]]
918==== Manual Rolling Restart
919
920To retain more control over the process, you may wish to manually do a rolling restart across your cluster.
This method uses the _graceful_stop.sh_ script described in <<decommission,decommission>>.
922In this method, you can restart each RegionServer individually and then move its old regions back into place, retaining locality.
923If you also need to restart the Master, you need to do it separately, and restart the Master before restarting the RegionServers using this method.
924The following is an example of such a command.
925You may need to tailor it to your environment.
926This script does a rolling restart of RegionServers only.
927It disables the load balancer before moving the regions.
928
929----
930
$ for i in `cat conf/regionservers|sort`; do ./bin/graceful_stop.sh --restart --reload --debug $i; done &> /tmp/log.txt &
932----
933
934Monitor the output of the _/tmp/log.txt_ file to follow the progress of the script.
935
936==== Logic for Crafting Your Own Rolling Restart Script
937
938Use the following guidelines if you want to create your own rolling restart script.
939
940. Extract the new release, verify its configuration, and synchronize it to all nodes of your cluster using `rsync`, `scp`, or another secure synchronization mechanism.
941. Use the hbck utility to ensure that the cluster is consistent.
942+
943----
944
945$ ./bin/hbck
946----
947+
948Perform repairs if required.
949See <<hbck,hbck>> for details.
950
951. Restart the master first.
952  You may need to modify these commands if your new HBase directory is different from the old one, such as for an upgrade.
953+
954----
955
956$ ./bin/hbase-daemon.sh stop master; ./bin/hbase-daemon.sh start master
957----
958
959. Gracefully restart each RegionServer, using a script such as the following, from the Master.
960+
961----
962
963$ for i in `cat conf/regionservers|sort`; do ./bin/graceful_stop.sh --restart --reload --debug $i; done &> /tmp/log.txt &
964----
965+
If you are running Thrift or REST servers, pass the `--thrift` or `--rest` options.
For other available options, run the `bin/graceful_stop.sh --help` command.
968+
969It is important to drain HBase regions slowly when restarting multiple RegionServers.
970Otherwise, multiple regions go offline simultaneously and must be reassigned to other nodes, which may also go offline soon.
971This can negatively affect performance.
972You can inject delays into the script above, for instance, by adding a Shell command such as `sleep`.
973To wait for 5 minutes between each RegionServer restart, modify the above script to the following:
974+
975----
976
977$ for i in `cat conf/regionservers|sort`; do ./bin/graceful_stop.sh --restart --reload --debug $i & sleep 5m; done &> /tmp/log.txt &
978----
979
980. Restart the Master again, to clear out the dead servers list and re-enable the load balancer.
981. Run the `hbck` utility again, to be sure the cluster is consistent.
982
983[[adding.new.node]]
984=== Adding a New Node
985
Adding a new regionserver in HBase is essentially free. Simply start it with `$ ./bin/hbase-daemon.sh start regionserver` and it will register itself with the master.
987Ideally you also started a DataNode on the same machine so that the RS can eventually start to have local files.
988If you rely on ssh to start your daemons, don't forget to add the new hostname in _conf/regionservers_ on the master.
989
990At this point the region server isn't serving data because no regions have moved to it yet.
991If the balancer is enabled, it will start moving regions to the new RS.
992On a small/medium cluster this can have a very adverse effect on latency as a lot of regions will be offline at the same time.
993It is thus recommended to disable the balancer the same way it's done when decommissioning a node and move the regions manually (or even better, using a script that moves them one by one).
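
A minimal sketch of that manual approach from the shell follows; the encoded region name and target server name are placeholders, and a real script would loop over the regions you want to move.

----
$ echo "balance_switch false" | ./bin/hbase shell
$ echo "move 'ENCODED_REGIONNAME', 'newnode.example.com,60020,1431001212020'" | ./bin/hbase shell
----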
994
995The moved regions will all have 0% locality and won't have any blocks in cache so the region server will have to use the network to serve requests.
Apart from resulting in higher latency, it may also saturate your network card's capacity.
997For practical purposes, consider that a standard 1GigE NIC won't be able to read much more than _100MB/s_.
In this case, or if you are in an OLAP environment and require locality, it is recommended to major compact the moved regions.
999
1000[[hbase_metrics]]
1001== HBase Metrics
1002
1003HBase emits metrics which adhere to the link:http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/metrics/package-summary.html[Hadoop metrics] API.
Starting with HBase 0.95footnote:[The Metrics system was redone in HBase 0.96. See Migration to the New Metrics Hotness – Metrics2 by Elliot Clark for details.], HBase is configured to emit a default set of metrics with a default sampling period of every 10 seconds.
1007You can use HBase metrics in conjunction with Ganglia.
1008You can also filter which metrics are emitted and extend the metrics framework to capture custom metrics appropriate for your environment.
1009
1010=== Metric Setup
1011
1012For HBase 0.95 and newer, HBase ships with a default metrics configuration, or [firstterm]_sink_.
1013This includes a wide variety of individual metrics, and emits them every 10 seconds by default.
1014To configure metrics for a given region server, edit the _conf/hadoop-metrics2-hbase.properties_ file.
1015Restart the region server for the changes to take effect.
1016
1017To change the sampling rate for the default sink, edit the line beginning with `*.period`.
1018To filter which metrics are emitted or to extend the metrics framework, see http://hadoop.apache.org/docs/current/api/org/apache/hadoop/metrics2/package-summary.html
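
For example, a sketch that changes the default sink period to 30 seconds; it assumes the stock file contains an uncommented `*.period=10` line, so check the file (and keep a backup) before scripting the change.

[source,bash]
----
# Set the metrics2 default sink period to 30 seconds (keeps a .bak copy).
sed -i.bak 's/^\*\.period=.*/*.period=30/' conf/hadoop-metrics2-hbase.properties
----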
1019
1020.HBase Metrics and Ganglia
1021[NOTE]
1022====
1023By default, HBase emits a large number of metrics per region server.
1024Ganglia may have difficulty processing all these metrics.
1025Consider increasing the capacity of the Ganglia server or reducing the number of metrics emitted by HBase.
1026See link:http://hadoop.apache.org/docs/current/api/org/apache/hadoop/metrics2/package-summary.html#filtering[Metrics Filtering].
1027====
1028
1029=== Disabling Metrics
1030
1031To disable metrics for a region server, edit the _conf/hadoop-metrics2-hbase.properties_ file and comment out any uncommented lines.
1032Restart the region server for the changes to take effect.
1033
1034[[discovering.available.metrics]]
1035=== Discovering Available Metrics
1036
1037Rather than listing each metric which HBase emits by default, you can browse through the available metrics, either as a JSON output or via JMX.
1038Different metrics are exposed for the Master process and each region server process.
1039
1040.Procedure: Access a JSON Output of Available Metrics
1041. After starting HBase, access the region server's web UI, at pass:[http://REGIONSERVER_HOSTNAME:60030] by default (or port 16030 in HBase 1.0+).
1042. Click the [label]#Metrics Dump# link near the top.
1043  The metrics for the region server are presented as a dump of the JMX bean in JSON format.
  This will dump out all metrics names and their values; for scripted access, see the `curl` example after this procedure.
1045  To include metrics descriptions in the listing -- this can be useful when you are exploring what is available -- add a query string of `?description=true` so your URL becomes pass:[http://REGIONSERVER_HOSTNAME:60030/jmx?description=true].
1046  Not all beans and attributes have descriptions.
1047. To view metrics for the Master, connect to the Master's web UI instead (defaults to pass:[http://localhost:60010] or port 16010 in HBase 1.0+) and click its [label]#Metrics
1048  Dump# link.
1049  To include metrics descriptions in the listing -- this can be useful when you are exploring what is available -- add a query string of `?description=true` so your URL becomes pass:[http://REGIONSERVER_HOSTNAME:60010/jmx?description=true].
1050  Not all beans and attributes have descriptions.
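
For scripted access to the same information, the JMX servlet can be fetched with a tool like `curl`; the hostname and port below are placeholders for your own region server.

----
$ curl "http://REGIONSERVER_HOSTNAME:60030/jmx?description=true"
----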
1051
1052
1053You can use many different tools to view JMX content by browsing MBeans.
1054This procedure uses `jvisualvm`, which is an application usually available in the JDK.
1055
1056.Procedure: Browse the JMX Output of Available Metrics
1057. Start HBase, if it is not already running.
. Run the `jvisualvm` command on a host with a GUI display.
1059  You can launch it from the command line or another method appropriate for your operating system.
1060. Be sure the [label]#VisualVM-MBeans# plugin is installed. Browse to *Tools -> Plugins*. Click [label]#Installed# and check whether the plugin is listed.
1061  If not, click [label]#Available Plugins#, select it, and click btn:[Install].
1062  When finished, click btn:[Close].
1063. To view details for a given HBase process, double-click the process in the [label]#Local# sub-tree in the left-hand panel.
1064  A detailed view opens in the right-hand panel.
1065  Click the [label]#MBeans# tab which appears as a tab in the top of the right-hand panel.
1066. To access the HBase metrics, navigate to the appropriate sub-bean:
* Master:
* RegionServer:
1069
1070. The name of each metric and its current value is displayed in the [label]#Attributes# tab.
1071  For a view which includes more details, including the description of each attribute, click the [label]#Metadata# tab.
1072
1073=== Units of Measure for Metrics
1074
1075Different metrics are expressed in different units, as appropriate.
1076Often, the unit of measure is in the name (as in the metric `shippedKBs`). Otherwise, use the following guidelines.
1077When in doubt, you may need to examine the source for a given metric.
1078
1079* Metrics that refer to a point in time are usually expressed as a timestamp.
1080* Metrics that refer to an age (such as `ageOfLastShippedOp`) are usually expressed in milliseconds.
1081* Metrics that refer to memory sizes are in bytes.
1082* Sizes of queues (such as `sizeOfLogQueue`) are expressed as the number of items in the queue.
1083  Determine the size by multiplying by the block size (default is 64 MB in HDFS).
1084* Metrics that refer to things like the number of a given type of operations (such as `logEditsRead`) are expressed as an integer.
1085
1086[[master_metrics]]
1087=== Most Important Master Metrics
1088
1089Note: Counts are usually over the last metrics reporting interval.
1090
1091hbase.master.numRegionServers::
1092  Number of live regionservers
1093
1094hbase.master.numDeadRegionServers::
1095  Number of dead regionservers
1096
1097hbase.master.ritCount ::
1098  The number of regions in transition
1099
1100hbase.master.ritCountOverThreshold::
1101  The number of regions that have been in transition longer than a threshold time (default: 60 seconds)
1102
1103hbase.master.ritOldestAge::
1104  The age of the longest region in transition, in milliseconds
1105
1106[[rs_metrics]]
1107=== Most Important RegionServer Metrics
1108
1109Note: Counts are usually over the last metrics reporting interval.
1110
1111hbase.regionserver.regionCount::
1112  The number of regions hosted by the regionserver
1113
1114hbase.regionserver.storeFileCount::
1115  The number of store files on disk currently managed by the regionserver
1116
1117hbase.regionserver.storeFileSize::
1118  Aggregate size of the store files on disk
1119
1120hbase.regionserver.hlogFileCount::
1121  The number of write ahead logs not yet archived
1122
1123hbase.regionserver.totalRequestCount::
1124  The total number of requests received
1125
1126hbase.regionserver.readRequestCount::
1127  The number of read requests received
1128
1129hbase.regionserver.writeRequestCount::
1130  The number of write requests received
1131
1132hbase.regionserver.numOpenConnections::
1133  The number of open connections at the RPC layer
1134
1135hbase.regionserver.numActiveHandler::
1136  The number of RPC handlers actively servicing requests
1137
1138hbase.regionserver.numCallsInGeneralQueue::
1139  The number of currently enqueued user requests
1140
1141hbase.regionserver.numCallsInReplicationQueue::
1142  The number of currently enqueued operations received from replication
1143
1144hbase.regionserver.numCallsInPriorityQueue::
1145  The number of currently enqueued priority (internal housekeeping) requests
1146
1147hbase.regionserver.flushQueueLength::
1148  Current depth of the memstore flush queue.
1149  If increasing, we are falling behind with clearing memstores out to HDFS.
1150
1151hbase.regionserver.updatesBlockedTime::
1152  Number of milliseconds updates have been blocked so the memstore can be flushed
1153
1154hbase.regionserver.compactionQueueLength::
1155  Current depth of the compaction request queue.
1156  If increasing, we are falling behind with storefile compaction.
1157
1158hbase.regionserver.blockCacheHitCount::
1159  The number of block cache hits
1160
1161hbase.regionserver.blockCacheMissCount::
1162  The number of block cache misses
1163
1164hbase.regionserver.blockCacheExpressHitPercent ::
1165  The percent of the time that requests with the cache turned on hit the cache
1166
1167hbase.regionserver.percentFilesLocal::
1168  Percent of store file data that can be read from the local DataNode, 0-100
1169
1170hbase.regionserver.<op>_<measure>::
1171  Operation latencies, where <op> is one of Append, Delete, Mutate, Get, Replay, Increment; and where <measure> is one of min, max, mean, median, 75th_percentile, 95th_percentile, 99th_percentile
1172
1173hbase.regionserver.slow<op>Count ::
1174  The number of operations we thought were slow, where <op> is one of the list above
1175
1176hbase.regionserver.GcTimeMillis::
1177  Time spent in garbage collection, in milliseconds
1178
1179hbase.regionserver.GcTimeMillisParNew::
1180  Time spent in garbage collection of the young generation, in milliseconds
1181
1182hbase.regionserver.GcTimeMillisConcurrentMarkSweep::
1183  Time spent in garbage collection of the old generation, in milliseconds
1184
1185hbase.regionserver.authenticationSuccesses::
1186  Number of client connections where authentication succeeded
1187
1188hbase.regionserver.authenticationFailures::
1189  Number of client connection authentication failures
1190
1191hbase.regionserver.mutationsWithoutWALCount ::
1192  Count of writes submitted with a flag indicating they should bypass the write ahead log
1193
1194[[ops.monitoring]]
1195== HBase Monitoring
1196
1197[[ops.monitoring.overview]]
1198=== Overview
1199
1200The following metrics are arguably the most important to monitor for each RegionServer for "macro monitoring", preferably with a system like link:http://opentsdb.net/[OpenTSDB].
1201If your cluster is having performance issues it's likely that you'll see something unusual with this group.
1202
1203HBase::
1204  * See <<rs_metrics,rs metrics>>
1205
1206OS::
1207  * IO Wait
1208  * User CPU
1209
1210Java::
1211  * GC
1212
1213For more information on HBase metrics, see <<hbase_metrics,hbase metrics>>.
1214
1215[[ops.slow.query]]
1216=== Slow Query Log
1217
1218The HBase slow query log consists of parseable JSON structures describing the properties of those client operations (Gets, Puts, Deletes, etc.) that either took too long to run, or produced too much output.
1219The thresholds for "too long to run" and "too much output" are configurable, as described below.
1220The output is produced inline in the main region server logs so that it is easy to discover further details from context with other logged events.
1221It is also prepended with identifying tags `(responseTooSlow)`, `(responseTooLarge)`, `(operationTooSlow)`, and `(operationTooLarge)` in order to enable easy filtering with grep, in case the user desires to see only slow queries.
1222
1223==== Configuration
1224
1225There are two configuration knobs that can be used to adjust the thresholds for when queries are logged.
1226
1227* `hbase.ipc.warn.response.time` Maximum number of milliseconds that a query can be run without being logged.
1228  Defaults to 10000, or 10 seconds.
1229  Can be set to -1 to disable logging by time.
1230* `hbase.ipc.warn.response.size` Maximum byte size of response that a query can return without being logged.
1231  Defaults to 100 megabytes.
1232  Can be set to -1 to disable logging by size.
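
The following _hbase-site.xml_ sketch sets both properties; the values shown (5 seconds and 50 MB) are illustrative, not recommendations.

[source,xml]
----
<!-- Illustrative: log queries that take longer than 5 seconds -->
<property>
  <name>hbase.ipc.warn.response.time</name>
  <value>5000</value>
</property>
<!-- Illustrative: log responses larger than 50 MB (value is in bytes) -->
<property>
  <name>hbase.ipc.warn.response.size</name>
  <value>52428800</value>
</property>
----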
1233
1234==== Metrics
1235
The slow query log exposes two metrics to JMX.
1237
* `hadoop.regionserver_rpc_slowResponse` A global metric reflecting the durations of all responses that triggered logging.
1239* `hadoop.regionserver_rpc_methodName.aboveOneSec` A metric reflecting the durations of all responses that lasted for more than one second.
1240
1241==== Output
1242
The output is tagged with the operation, e.g. `(operationTooSlow)`, if the call was a client operation, such as a Put, Get, or Delete, for which detailed fingerprint information is exposed.
If not, it is tagged `(responseTooSlow)` and still produces parseable JSON output, but with less verbose information, covering only the duration and size of the RPC itself.
`TooLarge` is substituted for `TooSlow` if the response size triggered the logging; `TooLarge` is used even when both size and duration triggered logging.
1245
1246==== Example
1247
1248
1249[source]
1250----
12512011-09-08 10:01:25,824 WARN org.apache.hadoop.ipc.HBaseServer: (operationTooSlow): {"tables":{"riley2":{"puts":[{"totalColumns":11,"families":{"actions":[{"timestamp":1315501284459,"qualifier":"0","vlen":9667580},{"timestamp":1315501284459,"qualifier":"1","vlen":10122412},{"timestamp":1315501284459,"qualifier":"2","vlen":11104617},{"timestamp":1315501284459,"qualifier":"3","vlen":13430635}]},"row":"cfcd208495d565ef66e7dff9f98764da:0"}],"families":["actions"]}},"processingtimems":956,"client":"10.47.34.63:33623","starttimems":1315501284456,"queuetimems":0,"totalPuts":1,"class":"HRegionServer","responsesize":0,"method":"multiPut"}
1252----
1253
1254Note that everything inside the "tables" structure is output produced by MultiPut's fingerprint, while the rest of the information is RPC-specific, such as processing time and client IP/port.
1255Other client operations follow the same pattern and the same general structure, with necessary differences due to the nature of the individual operations.
1256In the case that the call is not a client operation, that detailed fingerprint information will be completely absent.
1257
This particular example indicates that the likely cause of slowness is simply a very large (on the order of 100MB) multiput, as shown by the "vlen," or value length, fields of each put in the multiPut.
1259
1260=== Block Cache Monitoring
1261
1262Starting with HBase 0.98, the HBase Web UI includes the ability to monitor and report on the performance of the block cache.
To view the block cache reports, see the Block Cache section of the RegionServer's web UI.
1264Following are a few examples of the reporting capabilities.
1265
1266.Basic Info
1267image::bc_basic.png[]
1268
1269.Config
1270image::bc_config.png[]
1271
1272.Stats
1273image::bc_stats.png[]
1274
1275.L1 and L2
1276image::bc_l1.png[]
1277
1278This is not an exhaustive list of all the screens and reports available.
1279Have a look in the Web UI.
1280
1281== Cluster Replication
1282
1283NOTE: This information was previously available at
1284link:http://hbase.apache.org#replication[Cluster Replication].
1285
1286HBase provides a cluster replication mechanism which allows you to keep one cluster's state synchronized with that of another cluster, using the write-ahead log (WAL) of the source cluster to propagate the changes.
1287Some use cases for cluster replication include:
1288
1289* Backup and disaster recovery
1290* Data aggregation
1291* Geographic data distribution
1292* Online data ingestion combined with offline data analytics
1293
1294NOTE: Replication is enabled at the granularity of the column family.
1295Before enabling replication for a column family, create the table and all column families to be replicated, on the destination cluster.
1296
1297=== Replication Overview
1298
1299Cluster replication uses a source-push methodology.
1300An HBase cluster can be a source (also called master or active, meaning that it is the originator of new data), a destination (also called slave or passive, meaning that it receives data via replication), or can fulfill both roles at once.
1301Replication is asynchronous, and the goal of replication is eventual consistency.
When the source receives an edit to a column family with replication enabled, that edit is propagated to all destination clusters using the WAL for that column family on the RegionServer managing the relevant region.
1303
1304When data is replicated from one cluster to another, the original source of the data is tracked via a cluster ID which is part of the metadata.
1305In HBase 0.96 and newer (link:https://issues.apache.org/jira/browse/HBASE-7709[HBASE-7709]), all clusters which have already consumed the data are also tracked.
1306This prevents replication loops.
1307
1308The WALs for each region server must be kept in HDFS as long as they are needed to replicate data to any slave cluster.
1309Each region server reads from the oldest log it needs to replicate and keeps track of its progress processing WALs inside ZooKeeper to simplify failure recovery.
1310The position marker which indicates a slave cluster's progress, as well as the queue of WALs to process, may be different for every slave cluster.
1311
1312The clusters participating in replication can be of different sizes.
1313The master cluster relies on randomization to attempt to balance the stream of replication on the slave clusters.
1314It is expected that the slave cluster has storage capacity to hold the replicated data, as well as any data it is responsible for ingesting.
1315If a slave cluster does run out of room, or is inaccessible for other reasons, it throws an error and the master retains the WAL and retries the replication at intervals.
1316
1317.Terminology Changes
1318[NOTE]
1319====
1320Previously, terms such as [firstterm]_master-master_, [firstterm]_master-slave_, and [firstterm]_cyclical_ were used to describe replication relationships in HBase.
1321These terms added confusion, and have been abandoned in favor of discussions about cluster topologies appropriate for different scenarios.
1322====
1323
1324.Cluster Topologies
1325* A central source cluster might propagate changes out to multiple destination clusters, for failover or due to geographic distribution.
1326* A source cluster might push changes to a destination cluster, which might also push its own changes back to the original cluster.
1327* Many different low-latency clusters might push changes to one centralized cluster for backup or resource-intensive data analytics jobs.
1328  The processed data might then be replicated back to the low-latency clusters.
1329
1330Multiple levels of replication may be chained together to suit your organization's needs.
1331The following diagram shows a hypothetical scenario.
1332Use the arrows to follow the data paths.
1333
1334.Example of a Complex Cluster Replication Configuration
1335image::hbase_replication_diagram.jpg[]
1336
1337HBase replication borrows many concepts from the [firstterm]_statement-based replication_ design used by MySQL.
1338Instead of SQL statements, entire WALEdits (consisting of multiple cell inserts coming from Put and Delete operations on the clients) are replicated in order to maintain atomicity.
1339
1340=== Managing and Configuring Cluster Replication
1341.Cluster Configuration Overview
1342
1343. Configure and start the source and destination clusters.
1344  Create tables with the same names and column families on both the source and destination clusters, so that the destination cluster knows where to store data it will receive.
. All hosts in the source and destination clusters must be able to reach each other over the network.
. If both clusters use the same ZooKeeper cluster, you must use a different `zookeeper.znode.parent`, because they cannot write to the same znode path.
1347. Check to be sure that replication has not been disabled. `hbase.replication` defaults to `true`.
1348. On the source cluster, in HBase Shell, add the destination cluster as a peer, using the `add_peer` command.
1349. On the source cluster, in HBase Shell, enable the table replication, using the `enable_table_replication` command.
1350. Check the logs to see if replication is taking place. If so, you will see messages like the following, coming from the ReplicationSource.
1351----
1352LOG.info("Replicating "+clusterId + " -> " + peerClusterId);
1353----
1354
1355.Cluster Management Commands
1356add_peer <ID> <CLUSTER_KEY>::
1357  Adds a replication relationship between two clusters. +
1358  * ID -- a unique string, which must not contain a hyphen.
1359  * CLUSTER_KEY: composed using the following template, with appropriate place-holders: `hbase.zookeeper.quorum:hbase.zookeeper.property.clientPort:zookeeper.znode.parent`
1360list_peers:: list all replication relationships known by this cluster
1361enable_peer <ID>::
1362  Enable a previously-disabled replication relationship
1363disable_peer <ID>::
1364  Disable a replication relationship. HBase will no longer send edits to that peer cluster, but it still keeps track of all the new WALs that it will need to replicate if and when it is re-enabled.
1365remove_peer <ID>::
1366  Disable and remove a replication relationship. HBase will no longer send edits to that peer cluster or keep track of WALs.
1367enable_table_replication <TABLE_NAME>::
1368  Enable the table replication switch for all its column families. If the table is not found in the destination cluster then it will create one with the same name and column families.
1369disable_table_replication <TABLE_NAME>::
1370  Disable the table replication switch for all its column families.
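
The following HBase Shell sketch strings these commands together. The peer ID `1`, the ZooKeeper quorum hosts, and the table name `mytable` are placeholders; substitute values appropriate for your clusters.

----
# Add the destination cluster as peer '1', using its cluster key
hbase> add_peer '1', 'zk1.example.com,zk2.example.com,zk3.example.com:2181:/hbase'
hbase> list_peers

# Start replicating a table, then temporarily pause and resume the peer
hbase> enable_table_replication 'mytable'
hbase> disable_peer '1'
hbase> enable_peer '1'

# Tear the relationship down when it is no longer needed
hbase> remove_peer '1'
----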
1371
1372=== Verifying Replicated Data
1373
The `VerifyReplication` MapReduce job, which is included in HBase, performs a systematic comparison of replicated data between two different clusters. Run the VerifyReplication job on the master cluster, supplying it with the peer ID and table name to use for validation. You can limit the verification further by specifying a time range or specific families. The job's short name is `verifyrep`. To run the job, use a command like the following:

[source,bash]
1377----
1378$ HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` "${HADOOP_HOME}/bin/hadoop" jar "${HBASE_HOME}/hbase-server-VERSION.jar" verifyrep --starttime=<timestamp> --stoptime=<timestamp> --families=<myFam> <ID> <tableName>
1379----
1380+
1381The `VerifyReplication` command prints out `GOODROWS` and `BADROWS` counters to indicate rows that did and did not replicate correctly.
1382
1383=== Detailed Information About Cluster Replication
1384
1385.Replication Architecture Overview
1386image::replication_overview.png[]
1387
1388==== Life of a WAL Edit
1389
1390A single WAL edit goes through several steps in order to be replicated to a slave cluster.
1391
1392. An HBase client uses a Put or Delete operation to manipulate data in HBase.
. The region server writes the request to the WAL in a way that allows it to be replayed if it is not written successfully.
1394. If the changed cell corresponds to a column family that is scoped for replication, the edit is added to the queue for replication.
1395. In a separate thread, the edit is read from the log, as part of a batch process.
1396  Only the KeyValues that are eligible for replication are kept.
1397  Replicable KeyValues are part of a column family whose schema is scoped GLOBAL, are not part of a catalog such as `hbase:meta`, did not originate from the target slave cluster, and have not already been consumed by the target slave cluster.
1398. The edit is tagged with the master's UUID and added to a buffer.
1399  When the buffer is filled, or the reader reaches the end of the file, the buffer is sent to a random region server on the slave cluster.
1400. The region server reads the edits sequentially and separates them into buffers, one buffer per table.
1401  After all edits are read, each buffer is flushed using link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html[Table], HBase's normal client.
  The master's UUID and the UUIDs of slaves which have already consumed the data are preserved in the edits when they are applied, in order to prevent replication loops.
1403. In the master, the offset for the WAL that is currently being replicated is registered in ZooKeeper.
1404
If the slave cluster is unavailable or does not respond, the process differs:

. The first three steps, where the edit is inserted, are identical.
. Again in a separate thread, the region server reads, filters, and attempts to ship the log edits in the same way as above.
  However, the slave region server does not answer the RPC call.
1408. The master sleeps and tries again a configurable number of times.
. If the slave region server is still not available, the master selects a new subset of region servers to replicate to, and tries again to send the buffer of edits.
1410. Meanwhile, the WALs are rolled and stored in a queue in ZooKeeper.
1411  Logs that are [firstterm]_archived_ by their region server, by moving them from the region server's log directory to a central log directory, will update their paths in the in-memory queue of the replicating thread.
1412. When the slave cluster is finally available, the buffer is applied in the same way as during normal processing.
1413  The master region server will then replicate the backlog of logs that accumulated during the outage.
1414
1415.Spreading Queue Failover Load
1416When replication is active, a subset of region servers in the source cluster is responsible for shipping edits to the sink.
1417This responsibility must be failed over like all other region server functions should a process or node crash.
1418The following configuration settings are recommended for maintaining an even distribution of replication activity over the remaining live servers in the source cluster:
1419
1420* Set `replication.source.maxretriesmultiplier` to `300`.
1421* Set `replication.source.sleepforretries` to `1` (1 second). This value, combined with the value of `replication.source.maxretriesmultiplier`, causes the retry cycle to last about 5 minutes.
1422* Set `replication.sleep.before.failover` to `30000` (30 seconds) in the source cluster site configuration.
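
Expressed as an _hbase-site.xml_ sketch on the source cluster, using the recommended values above:

[source,xml]
----
<property>
  <name>replication.source.maxretriesmultiplier</name>
  <value>300</value>
</property>
<property>
  <name>replication.source.sleepforretries</name>
  <value>1</value>
</property>
<property>
  <name>replication.sleep.before.failover</name>
  <value>30000</value>
</property>
----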
1423
1424[[cluster.replication.preserving.tags]]
1425.Preserving Tags During Replication
1426By default, the codec used for replication between clusters strips tags, such as cell-level ACLs, from cells.
1427To prevent the tags from being stripped, you can use a different codec which does not strip them.
1428Configure `hbase.replication.rpc.codec` to use `org.apache.hadoop.hbase.codec.KeyValueCodecWithTags`, on both the source and sink RegionServers involved in the replication.
1429This option was introduced in link:https://issues.apache.org/jira/browse/HBASE-10322[HBASE-10322].
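
A minimal _hbase-site.xml_ sketch, to be applied on both the source and sink clusters:

[source,xml]
----
<property>
  <name>hbase.replication.rpc.codec</name>
  <value>org.apache.hadoop.hbase.codec.KeyValueCodecWithTags</value>
</property>
----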
1430
1431==== Replication Internals
1432
1433Replication State in ZooKeeper::
1434  HBase replication maintains its state in ZooKeeper.
1435  By default, the state is contained in the base node _/hbase/replication_.
1436  This node contains two child nodes, the `Peers` znode and the `RS`                znode.
1437
1438The `Peers` Znode::
1439  The `peers` znode is stored in _/hbase/replication/peers_ by default.
1440  It consists of a list of all peer replication clusters, along with the status of each of them.
1441  The value of each peer is its cluster key, which is provided in the HBase Shell.
1442  The cluster key contains a list of ZooKeeper nodes in the cluster's quorum, the client port for the ZooKeeper quorum, and the base znode for HBase in HDFS on that cluster.
1443
1444The `RS` Znode::
1445  The `rs` znode contains a list of WAL logs which need to be replicated.
1446  This list is divided into a set of queues organized by region server and the peer cluster the region server is shipping the logs to.
1447  The rs znode has one child znode for each region server in the cluster.
1448  The child znode name is the region server's hostname, client port, and start code.
1449  This list includes both live and dead region servers.
1450
1451==== Choosing Region Servers to Replicate To
1452
When a master cluster region server initiates a replication source to a slave cluster, it first connects to the slave's ZooKeeper ensemble using the provided cluster key. It then scans the _rs/_ directory to discover all the available sinks (region servers that are accepting incoming streams of edits to replicate) and randomly chooses a subset of them using a configured ratio which has a default value of 10%. For example, if a slave cluster has 150 machines, 15 will be chosen as potential recipients for edits that this master cluster region server sends.
1454Because this selection is performed by each master region server, the probability that all slave region servers are used is very high, and this method works for clusters of any size.
1455For example, a master cluster of 10 machines replicating to a slave cluster of 5 machines with a ratio of 10% causes the master cluster region servers to choose one machine each at random.
1456
1457A ZooKeeper watcher is placed on the _${zookeeper.znode.parent}/rs_ node of the slave cluster by each of the master cluster's region servers.
1458This watch is used to monitor changes in the composition of the slave cluster.
1459When nodes are removed from the slave cluster, or if nodes go down or come back up, the master cluster's region servers will respond by selecting a new pool of slave region servers to replicate to.
1460
1461==== Keeping Track of Logs
1462
1463Each master cluster region server has its own znode in the replication znodes hierarchy.
1464It contains one znode per peer cluster (if 5 slave clusters, 5 znodes are created), and each of these contain a queue of WALs to process.
1465Each of these queues will track the WALs created by that region server, but they can differ in size.
1466For example, if one slave cluster becomes unavailable for some time, the WALs should not be deleted, so they need to stay in the queue while the others are processed.
1467See <<rs.failover.details,rs.failover.details>> for an example.
1468
1469When a source is instantiated, it contains the current WAL that the region server is writing to.
1470During log rolling, the new file is added to the queue of each slave cluster's znode just before it is made available.
This ensures that all the sources are aware that a new log exists before the region server is able to append edits into it, but this operation is now more expensive.
1472The queue items are discarded when the replication thread cannot read more entries from a file (because it reached the end of the last block) and there are other files in the queue.
1473This means that if a source is up to date and replicates from the log that the region server writes to, reading up to the "end" of the current file will not delete the item in the queue.
1474
1475A log can be archived if it is no longer used or if the number of logs exceeds `hbase.regionserver.maxlogs` because the insertion rate is faster than regions are flushed.
1476When a log is archived, the source threads are notified that the path for that log changed.
1477If a particular source has already finished with an archived log, it will just ignore the message.
1478If the log is in the queue, the path will be updated in memory.
If the log is currently being replicated, the change will be done atomically so that the reader doesn't attempt to open the file when it has already been moved.
Because moving a file is a NameNode operation, if the reader is currently reading the log, it won't generate any exception.
1481
1482==== Reading, Filtering and Sending Edits
1483
1484By default, a source attempts to read from a WAL and ship log entries to a sink as quickly as possible.
Speed is limited by the filtering of log entries: only KeyValues that are scoped GLOBAL and that do not belong to catalog tables will be retained.
1486Speed is also limited by total size of the list of edits to replicate per slave, which is limited to 64 MB by default.
1487With this configuration, a master cluster region server with three slaves would use at most 192 MB to store data to replicate.
1488This does not account for the data which was filtered but not garbage collected.
1489
Once the maximum size of edits has been buffered or the reader reaches the end of the WAL, the source thread stops reading and chooses at random a sink to replicate to (from the list that was generated by keeping only a subset of slave region servers). It directly issues an RPC to the chosen region server and waits for the method to return.
1491If the RPC was successful, the source determines whether the current file has been emptied or it contains more data which needs to be read.
1492If the file has been emptied, the source deletes the znode in the queue.
1493Otherwise, it registers the new offset in the log's znode.
1494If the RPC threw an exception, the source will retry 10 times before trying to find a different sink.
1495
1496==== Cleaning Logs
1497
1498If replication is not enabled, the master's log-cleaning thread deletes old logs using a configured TTL.
1499This TTL-based method does not work well with replication, because archived logs which have exceeded their TTL may still be in a queue.
1500The default behavior is augmented so that if a log is past its TTL, the cleaning thread looks up every queue until it finds the log, while caching queues it has found.
1501If the log is not found in any queues, the log will be deleted.
1502The next time the cleaning process needs to look for a log, it starts by using its cached list.
1503
1504[[rs.failover.details]]
1505==== Region Server Failover
1506
1507When no region servers are failing, keeping track of the logs in ZooKeeper adds no value.
1508Unfortunately, region servers do fail, and since ZooKeeper is highly available, it is useful for managing the transfer of the queues in the event of a failure.
1509
1510Each of the master cluster region servers keeps a watcher on every other region server, in order to be notified when one dies (just as the master does). When a failure happens, they all race to create a znode called `lock` inside the dead region server's znode that contains its queues.
1511The region server that creates it successfully then transfers all the queues to its own znode, one at a time since ZooKeeper does not support renaming queues.
1512After queues are all transferred, they are deleted from the old location.
1513The znodes that were recovered are renamed with the ID of the slave cluster appended with the name of the dead server.
1514
1515Next, the master cluster region server creates one new source thread per copied queue, and each of the source threads follows the read/filter/ship pattern.
1516The main difference is that those queues will never receive new data, since they do not belong to their new region server.
1517When the reader hits the end of the last log, the queue's znode is deleted and the master cluster region server closes that replication source.
1518
1519Given a master cluster with 3 region servers replicating to a single slave with id `2`, the following hierarchy represents what the znodes layout could be at some point in time.
The region servers' znodes all contain a `peers` znode which contains a single queue.
1521The znode names in the queues represent the actual file names on HDFS in the form `address,port.timestamp`.
1522
1523----
1524
1525/hbase/replication/rs/
1526  1.1.1.1,60020,123456780/
1527    2/
1528      1.1.1.1,60020.1234  (Contains a position)
1529      1.1.1.1,60020.1265
1530  1.1.1.2,60020,123456790/
1531    2/
1532      1.1.1.2,60020.1214  (Contains a position)
1533      1.1.1.2,60020.1248
1534      1.1.1.2,60020.1312
1535  1.1.1.3,60020,    123456630/
1536    2/
1537      1.1.1.3,60020.1280  (Contains a position)
1538----
1539
1540Assume that 1.1.1.2 loses its ZooKeeper session.
1541The survivors will race to create a lock, and, arbitrarily, 1.1.1.3 wins.
1542It will then start transferring all the queues to its local peers znode by appending the name of the dead server.
1543Right before 1.1.1.3 is able to clean up the old znodes, the layout will look like the following:
1544
1545----
1546
1547/hbase/replication/rs/
1548  1.1.1.1,60020,123456780/
1549    2/
1550      1.1.1.1,60020.1234  (Contains a position)
1551      1.1.1.1,60020.1265
1552  1.1.1.2,60020,123456790/
1553    lock
1554    2/
1555      1.1.1.2,60020.1214  (Contains a position)
1556      1.1.1.2,60020.1248
1557      1.1.1.2,60020.1312
1558  1.1.1.3,60020,123456630/
1559    2/
1560      1.1.1.3,60020.1280  (Contains a position)
1561
1562    2-1.1.1.2,60020,123456790/
1563      1.1.1.2,60020.1214  (Contains a position)
1564      1.1.1.2,60020.1248
1565      1.1.1.2,60020.1312
1566----
1567
1568Some time later, but before 1.1.1.3 is able to finish replicating the last WAL from 1.1.1.2, it dies too.
1569Some new logs were also created in the normal queues.
1570The last region server will then try to lock 1.1.1.3's znode and will begin transferring all the queues.
1571The new layout will be:
1572
1573----
1574
1575/hbase/replication/rs/
1576  1.1.1.1,60020,123456780/
1577    2/
1578      1.1.1.1,60020.1378  (Contains a position)
1579
1580    2-1.1.1.3,60020,123456630/
1581      1.1.1.3,60020.1325  (Contains a position)
1582      1.1.1.3,60020.1401
1583
1584    2-1.1.1.2,60020,123456790-1.1.1.3,60020,123456630/
1585      1.1.1.2,60020.1312  (Contains a position)
1586  1.1.1.3,60020,123456630/
1587    lock
1588    2/
1589      1.1.1.3,60020.1325  (Contains a position)
1590      1.1.1.3,60020.1401
1591
1592    2-1.1.1.2,60020,123456790/
1593      1.1.1.2,60020.1312  (Contains a position)
1594----
1595
1596=== Replication Metrics
1597
1598The following metrics are exposed at the global region server level and (since HBase 0.95) at the peer level:
1599
1600`source.sizeOfLogQueue`::
1601  number of WALs to process (excludes the one which is being processed) at the Replication source
1602
1603`source.shippedOps`::
1604  number of mutations shipped
1605
1606`source.logEditsRead`::
1607  number of mutations read from WALs at the replication source
1608
1609`source.ageOfLastShippedOp`::
1610  age of last batch that was shipped by the replication source
1611
1612=== Replication Configuration Options
1613
1614[cols="1,1,1", options="header"]
1615|===
1616| Option
1617| Description
1618| Default
1619
1620| zookeeper.znode.parent
1621| The name of the base ZooKeeper znode used for HBase
1622| /hbase
1623
1624| zookeeper.znode.replication
1625| The name of the base znode used for replication
1626| replication
1627
1628| zookeeper.znode.replication.peers
1629| The name of the peer znode
1630| peers
1631
1632| zookeeper.znode.replication.peers.state
1633| The name of peer-state znode
1634| peer-state
1635
1636| zookeeper.znode.replication.rs
1637| The name of the rs znode
1638| rs
1639
1640| hbase.replication
1641| Whether replication is enabled or disabled on a given
1642                cluster
1643| false
1644
| replication.sleep.before.failover
1646| How many milliseconds a worker should sleep before attempting to replicate
1647                a dead region server's WAL queues.
1648|
1649
1650| replication.executor.workers
1651| The number of region servers a given region server should attempt to
1652                  failover simultaneously.
1653| 1
1654|===
1655
1656=== Monitoring Replication Status
1657
You can use the HBase Shell command `status 'replication'` to monitor the replication status on your cluster. The command has three variations:

* `status 'replication'` -- prints the status of each source and its sinks, sorted by hostname.
* `status 'replication', 'source'` -- prints the status for each replication source, sorted by hostname.
* `status 'replication', 'sink'` -- prints the status for each replication sink, sorted by hostname.
1662
1663== Running Multiple Workloads On a Single Cluster
1664
1665HBase provides the following mechanisms for managing the performance of a cluster
handling multiple workloads:

. <<quota>>
1668. <<request_queues>>
1669. <<multiple-typed-queues>>
1670
1671[[quota]]
1672=== Quotas
1673HBASE-11598 introduces quotas, which allow you to throttle requests based on
1674the following limits:
1675
. <<request-quotas,The number or size of requests (read, write, or read+write) in a given timeframe>>
1677. <<namespace_quotas,The number of tables allowed in a namespace>>
1678
1679These limits can be enforced for a specified user, table, or namespace.
1680
1681.Enabling Quotas
1682
Quotas are disabled by default. To enable the feature, set the `hbase.quota.enabled`
property to `true` in the _hbase-site.xml_ file for all cluster nodes.
1685
1686.General Quota Syntax
. THROTTLE_TYPE can be expressed as READ, WRITE, or the default type (read + write).
. Timeframes can be expressed in the following units: `sec`, `min`, `hour`, `day`
1689. Request sizes can be expressed in the following units: `B` (bytes), `K` (kilobytes),
1690`M` (megabytes), `G` (gigabytes), `T` (terabytes), `P` (petabytes)
1691. Numbers of requests are expressed as an integer followed by the string `req`
1692. Limits relating to time are expressed as req/time or size/time. For instance `10req/day`
1693or `100P/hour`.
1694. Numbers of tables or regions are expressed as integers.
1695
1696[[request-quotas]]
1697.Setting Request Quotas
1698You can set quota rules ahead of time, or you can change the throttle at runtime. The change
1699will propagate after the quota refresh period has expired. This expiration period
1700defaults to 5 minutes. To change it, modify the `hbase.quota.refresh.period` property
1701in `hbase-site.xml`. This property is expressed in milliseconds and defaults to `300000`.
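
For example, a minimal _hbase-site.xml_ sketch that enables quotas and shortens the refresh period; the one-minute refresh value is illustrative, not a recommendation.

[source,xml]
----
<property>
  <name>hbase.quota.enabled</name>
  <value>true</value>
</property>
<!-- Illustrative: refresh quota settings every 60 seconds instead of the default 5 minutes -->
<property>
  <name>hbase.quota.refresh.period</name>
  <value>60000</value>
</property>
----

The shell examples below show how to set and list throttle quotas at runtime.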
1702
1703----
1704# Limit user u1 to 10 requests per second
1705hbase> set_quota TYPE => THROTTLE, USER => 'u1', LIMIT => '10req/sec'
1706
1707# Limit user u1 to 10 read requests per second
1708hbase> set_quota TYPE => THROTTLE, THROTTLE_TYPE => READ, USER => 'u1', LIMIT => '10req/sec'
1709
1710# Limit user u1 to 10 M per day everywhere
1711hbase> set_quota TYPE => THROTTLE, USER => 'u1', LIMIT => '10M/day'
1712
1713# Limit user u1 to 10 M write size per sec
1714hbase> set_quota TYPE => THROTTLE, THROTTLE_TYPE => WRITE, USER => 'u1', LIMIT => '10M/sec'
1715
1716# Limit user u1 to 5k per minute on table t2
1717hbase> set_quota TYPE => THROTTLE, USER => 'u1', TABLE => 't2', LIMIT => '5K/min'
1718
1719# Limit user u1 to 10 read requests per sec on table t2
1720hbase> set_quota TYPE => THROTTLE, THROTTLE_TYPE => READ, USER => 'u1', TABLE => 't2', LIMIT => '10req/sec'
1721
1722# Remove an existing limit from user u1 on namespace ns2
1723hbase> set_quota TYPE => THROTTLE, USER => 'u1', NAMESPACE => 'ns2', LIMIT => NONE
1724
1725# Limit all users to 10 requests per hour on namespace ns1
1726hbase> set_quota TYPE => THROTTLE, NAMESPACE => 'ns1', LIMIT => '10req/hour'
1727
1728# Limit all users to 10 T per hour on table t1
1729hbase> set_quota TYPE => THROTTLE, TABLE => 't1', LIMIT => '10T/hour'
1730
1731# Remove all existing limits from user u1
1732hbase> set_quota TYPE => THROTTLE, USER => 'u1', LIMIT => NONE
1733
1734# List all quotas for user u1 in namespace ns2
hbase> list_quotas USER => 'u1', NAMESPACE => 'ns2'
1736
1737# List all quotas for namespace ns2
1738hbase> list_quotas NAMESPACE => 'ns2'
1739
1740# List all quotas for table t1
1741hbase> list_quotas TABLE => 't1'
1742
1743# list all quotas
1744hbase> list_quotas
1745----
1746
1747You can also place a global limit and exclude a user or a table from the limit by applying the
`GLOBAL_BYPASS` property.

----
1750hbase> set_quota NAMESPACE => 'ns1', LIMIT => '100req/min'               # a per-namespace request limit
1751hbase> set_quota USER => 'u1', GLOBAL_BYPASS => true                     # user u1 is not affected by the limit
1752----
1753
1754[[namespace_quotas]]
1755.Setting Namespace Quotas
1756
1757You can specify the maximum number of tables or regions allowed in a given namespace, either
1758when you create the namespace or by altering an existing namespace, by setting the
`hbase.namespace.quota.maxtables` property (to limit tables) or the `hbase.namespace.quota.maxregions` property (to limit regions) on the namespace.
1760
1761.Limiting Tables Per Namespace
1762----
1763# Create a namespace with a max of 5 tables
1764hbase> create_namespace 'ns1', {'hbase.namespace.quota.maxtables'=>'5'}
1765
1766# Alter an existing namespace to have a max of 8 tables
1767hbase> alter_namespace 'ns2', {METHOD => 'set', 'hbase.namespace.quota.maxtables'=>'8'}
1768
1769# Show quota information for a namespace
1770hbase> describe_namespace 'ns2'
1771
1772# Alter an existing namespace to remove a quota
1773hbase> alter_namespace 'ns2', {METHOD => 'unset', NAME=>'hbase.namespace.quota.maxtables'}
1774----
1775
1776.Limiting Regions Per Namespace
1777----
1778# Create a namespace with a max of 10 regions
hbase> create_namespace 'ns1', {'hbase.namespace.quota.maxregions'=>'10'}
1780
1781# Show quota information for a namespace
1782hbase> describe_namespace 'ns1'
1783
# Alter an existing namespace to have a max of 20 regions
1785hbase> alter_namespace 'ns2', {METHOD => 'set', 'hbase.namespace.quota.maxregions'=>'20'}
1786
1787# Alter an existing namespace to remove a quota
1788hbase> alter_namespace 'ns2', {METHOD => 'unset', NAME=> 'hbase.namespace.quota.maxregions'}
1789----
1790
1791[[request_queues]]
1792=== Request Queues
1793If no throttling policy is configured, when the RegionServer receives multiple requests,
1794they are now placed into a queue waiting for a free execution slot (HBASE-6721).
1795The simplest queue is a FIFO queue, where each request waits for all previous requests in the queue
1796to finish before running. Fast or interactive queries can get stuck behind large requests.
1797
1798If you are able to guess how long a request will take, you can reorder requests by
1799pushing the long requests to the end of the queue and allowing short requests to preempt
1800them. Eventually, you must still execute the large requests and prioritize the new
1801requests behind them. The short requests will be newer, so the result is not terrible,
1802but still suboptimal compared to a mechanism which allows large requests to be split
1803into multiple smaller ones.
1804
1805HBASE-10993 introduces such a system for deprioritizing long-running scanners. There
1806are two types of queues, `fifo` and `deadline`. To configure the type of queue used,
1807configure the `hbase.ipc.server.callqueue.type` property in `hbase-site.xml`. There
1808is no way to estimate how long each request may take, so de-prioritization only affects
1809scans, and is based on the number of “next” calls a scan request has made. An assumption
1810is made that when you are doing a full table scan, your job is not likely to be interactive,
1811so if there are concurrent requests, you can delay long-running scans up to a limit tunable by
1812setting the `hbase.ipc.server.queue.max.call.delay` property. The slope of the delay is calculated
1813by a simple square root of `(numNextCall * weight)` where the weight is
1814configurable by setting the `hbase.ipc.server.scan.vtime.weight` property.
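
As a hedged example, the following _hbase-site.xml_ sketch switches to the `deadline` queue; the delay and weight values are placeholders to tune for your workload, not recommendations.

[source,xml]
----
<property>
  <name>hbase.ipc.server.callqueue.type</name>
  <value>deadline</value>
</property>
<!-- Placeholder values: tune for your workload -->
<property>
  <name>hbase.ipc.server.queue.max.call.delay</name>
  <value>5000</value>
</property>
<property>
  <name>hbase.ipc.server.scan.vtime.weight</name>
  <value>1.0</value>
</property>
----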
1815
1816[[multiple-typed-queues]]
1817=== Multiple-Typed Queues
1818
1819You can also prioritize or deprioritize different kinds of requests by configuring
1820a specified number of dedicated handlers and queues. You can segregate the scan requests
1821in a single queue with a single handler, and all the other available queues can service
1822short `Get` requests.
1823
1824You can adjust the IPC queues and handlers based on the type of workload, using static
1825tuning options. This approach is an interim first step that will eventually allow
1826you to change the settings at runtime, and to dynamically adjust values based on the load.
1827
1828.Multiple Queues
1829
To avoid contention and separate different kinds of requests, configure the
`hbase.ipc.server.callqueue.handler.factor` property, which allows you to increase the number of
queues and control how many handlers can share the same queue.
1834
1835Using more queues reduces contention when adding a task to a queue or selecting it
1836from a queue. You can even configure one queue per handler. The trade-off is that
1837if some queues contain long-running tasks, a handler may need to wait to execute from that queue
1838rather than stealing from another queue which has waiting tasks.
1839
1840.Read and Write Queues
1841With multiple queues, you can now divide read and write requests, giving more priority
1842(more queues) to one or the other type. Use the `hbase.ipc.server.callqueue.read.ratio`
1843property to choose to serve more reads or more writes.
1844
1845.Get and Scan Queues
1846Similar to the read/write split, you can split gets and scans by tuning the `hbase.ipc.server.callqueue.scan.ratio`
1847property to give more priority to gets or to scans. A scan ratio of `0.1` will give
1848more queue/handlers to the incoming gets, which means that more gets can be processed
1849at the same time and that fewer scans can be executed at the same time. A value of
1850`0.9` will give more queue/handlers to scans, so the number of scans executed will
1851increase and the number of gets will decrease.
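
Putting these knobs together, the following _hbase-site.xml_ sketch uses one queue per two handlers, splits queues evenly between reads and writes, and dedicates most read queues to gets; the ratios are illustrative only.

[source,xml]
----
<!-- Illustrative ratios: tune for your workload -->
<property>
  <name>hbase.ipc.server.callqueue.handler.factor</name>
  <value>0.5</value>
</property>
<property>
  <name>hbase.ipc.server.callqueue.read.ratio</name>
  <value>0.5</value>
</property>
<property>
  <name>hbase.ipc.server.callqueue.scan.ratio</name>
  <value>0.1</value>
</property>
----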
1852
1853
1854[[ops.backup]]
1855== HBase Backup
1856
1857There are two broad strategies for performing HBase backups: backing up with a full cluster shutdown, and backing up on a live cluster.
1858Each approach has pros and cons.
1859
1860For additional information, see link:http://blog.sematext.com/2011/03/11/hbase-backup-options/[HBase Backup
1861        Options] over on the Sematext Blog.
1862
1863[[ops.backup.fullshutdown]]
1864=== Full Shutdown Backup
1865
Some environments can tolerate a periodic full shutdown of their HBase cluster, for example if it is being used as a back-end analytic capacity and not serving front-end web-pages.
The benefit is that the NameNode/Master and RegionServers are down, so there is no chance of missing any in-flight changes to either StoreFiles or metadata.
1868The obvious con is that the cluster is down.
1869The steps include:
1870
1871[[ops.backup.fullshutdown.stop]]
==== Stop HBase

Stop the cluster, for example with the _bin/stop-hbase.sh_ script, and verify that all Master and RegionServer processes have exited before taking the backup.
1875
1876[[ops.backup.fullshutdown.distcp]]
1877==== Distcp
1878
Distcp could be used to copy the contents of the HBase directory in HDFS either to another directory on the same cluster, or to a different cluster.
1880
1881Note: Distcp works in this situation because the cluster is down and there are no in-flight edits to files.
1882Distcp-ing of files in the HBase directory is not generally recommended on a live cluster.
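
A hedged sketch of such a copy; the NameNode addresses and the backup path are placeholders, and the command assumes the cluster is fully shut down:

[source,bash]
----
# Copy the HBase root directory to a backup location on another cluster.
# Run this only while HBase is completely stopped.
$ hadoop distcp hdfs://namenode1:8020/hbase hdfs://backup-namenode:8020/hbase-backup-20150101
----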
1883
1884[[ops.backup.fullshutdown.restore]]
1885==== Restore (if needed)
1886
The backup of the hbase directory from HDFS is copied onto the 'real' hbase directory via distcp.
The act of copying these files creates new HDFS metadata, so a restore of the NameNode edits from the time of the HBase backup isn't required, because this is a restore (via distcp) of a specific HDFS directory (i.e., the HBase part), not of the entire HDFS file-system.
1889
1890[[ops.backup.live.replication]]
1891=== Live Cluster Backup - Replication
1892
1893This approach assumes that there is a second cluster.
1894See the HBase page on link:http://hbase.apache.org/book.html#replication[replication] for more information.
1895
1896[[ops.backup.live.copytable]]
1897=== Live Cluster Backup - CopyTable
1898
1899The <<copy.table,copytable>> utility could either be used to copy data from one table to another on the same cluster, or to copy data to another table on another cluster.
1900
1901Since the cluster is up, there is a risk that edits could be missed in the copy process.
1902
1903[[ops.backup.live.export]]
1904=== Live Cluster Backup - Export
1905
1906The <<export,export>> approach dumps the content of a table to HDFS on the same cluster.
1907To restore the data, the <<import,import>> utility would be used.
1908
1909Since the cluster is up, there is a risk that edits could be missed in the export process.
1910
1911[[ops.snapshots]]
1912== HBase Snapshots
1913
1914HBase Snapshots allow you to take a snapshot of a table without too much impact on Region Servers.
Snapshot, clone, and restore operations don't involve data copying.
Also, exporting a snapshot to another cluster doesn't impact the RegionServers.
1917
Prior to version 0.94.6, the only way to back up or clone a table was to use CopyTable/ExportTable, or to copy all the hfiles in HDFS after disabling the table.
The disadvantages of these methods are that you can degrade RegionServer performance (CopyTable/ExportTable), or that you need to disable the table, which means no reads or writes; this is usually unacceptable.
1920
1921[[ops.snapshots.configuration]]
1922=== Configuration
1923
To turn on snapshot support, just set the `hbase.snapshot.enabled` property to true.
(Snapshots are enabled by default in 0.95+ and off by default in 0.94.6+.)
1926
[source,xml]
1928----
1929
1930  <property>
1931    <name>hbase.snapshot.enabled</name>
1932    <value>true</value>
1933  </property>
1934----
1935
1936[[ops.snapshots.takeasnapshot]]
1937=== Take a Snapshot
1938
1939You can take a snapshot of a table regardless of whether it is enabled or disabled.
1940The snapshot operation doesn't involve any data copying.
1941
1942----
1943
1944$ ./bin/hbase shell
1945hbase> snapshot 'myTable', 'myTableSnapshot-122112'
1946----
1947
1948.Take a Snapshot Without Flushing
1949The default behavior is to perform a flush of data in memory before the snapshot is taken.
1950This means that data in memory is included in the snapshot.
1951In most cases, this is the desired behavior.
However, if your set-up can tolerate data in memory being excluded from the snapshot, you can use the `SKIP_FLUSH` option of the `snapshot` command to disable flushing while taking the snapshot.
1953
1954----
1955hbase> snapshot 'mytable', 'snapshot123', {SKIP_FLUSH => true}
1956----
1957
WARNING: There is no way to determine or predict whether a concurrent insert or update will be included in a given snapshot, whether flushing is enabled or disabled.
1959A snapshot is only a representation of a table during a window of time.
1960The amount of time the snapshot operation will take to reach each Region Server may vary from a few seconds to a minute, depending on the resource load and speed of the hardware or network, among other factors.
1961There is also no way to know whether a given insert or update is in memory or has been flushed.
1962
1963[[ops.snapshots.list]]
1964=== Listing Snapshots
1965
1966List all snapshots taken (by printing the names and relative information).
1967
1968----
1969
1970$ ./bin/hbase shell
1971hbase> list_snapshots
1972----
1973
1974[[ops.snapshots.delete]]
1975=== Deleting Snapshots
1976
1977You can remove a snapshot, and the files retained for that snapshot will be removed if no longer needed.
1978
1979----
1980
1981$ ./bin/hbase shell
1982hbase> delete_snapshot 'myTableSnapshot-122112'
1983----
1984
1985[[ops.snapshots.clone]]
1986=== Clone a table from snapshot
1987
1988From a snapshot you can create a new table (clone operation) with the same data that you had when the snapshot was taken.
The clone operation doesn't involve data copies, and a change to the cloned table doesn't impact the snapshot or the original table.
1990
1991----
1992
1993$ ./bin/hbase shell
1994hbase> clone_snapshot 'myTableSnapshot-122112', 'myNewTestTable'
1995----
1996
1997[[ops.snapshots.restore]]
1998=== Restore a snapshot
1999
2000The restore operation requires the table to be disabled, and the table will be restored to the state at the time when the snapshot was taken, changing both data and schema if required.
2001
2002----
2003
2004$ ./bin/hbase shell
2005hbase> disable 'myTable'
2006hbase> restore_snapshot 'myTableSnapshot-122112'
2007----
2008
NOTE: Since replication works at the log level and snapshots at the file-system level, after a restore, the replicas will be in a different state from the master.
2010If you want to use restore, you need to stop replication and redo the bootstrap.
2011
2012In case of partial data-loss due to misbehaving client, instead of a full restore that requires the table to be disabled, you can clone the table from the snapshot and use a Map-Reduce job to copy the data that you need, from the clone to the main one.
2013
2014[[ops.snapshots.acls]]
2015=== Snapshots operations and ACLs
2016
2017If you are using security with the AccessController Coprocessor (See <<hbase.accesscontrol.configuration,hbase.accesscontrol.configuration>>), only a global administrator can take, clone, or restore a snapshot, and these actions do not capture the ACL rights.
2018This means that restoring a table preserves the ACL rights of the existing table, while cloning a table creates a new table that has no ACL rights until the administrator adds them.
2019
2020[[ops.snapshots.export]]
2021=== Export to another cluster
2022
2023The ExportSnapshot tool copies all the data related to a snapshot (hfiles, logs, snapshot metadata) to another cluster.
The tool executes a Map-Reduce job, similar to distcp, to copy files between the two clusters, and since it works at the file-system level the HBase cluster does not have to be online.
2025
To copy a snapshot called MySnapshot to an HBase cluster srv2 (hdfs://srv2:8082/hbase) using 16 mappers:
2027
2028[source,bourne]
2029----
2030$ bin/hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot MySnapshot -copy-to hdfs://srv2:8082/hbase -mappers 16
2031----
2032
2033.Limiting Bandwidth Consumption
2034You can limit the bandwidth consumption when exporting a snapshot, by specifying the `-bandwidth` parameter, which expects an integer representing megabytes per second.
2035The following example limits the above example to 200 MB/sec.
2036
2037[source,bourne]
2038----
2039$ bin/hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot MySnapshot -copy-to hdfs://srv2:8082/hbase -mappers 16 -bandwidth 200
2040----
2041
2042[[ops.capacity]]
2043== Capacity Planning and Region Sizing
2044
2045There are several considerations when planning the capacity for an HBase cluster and performing the initial configuration.
2046Start with a solid understanding of how HBase handles data internally.
2047
2048[[ops.capacity.nodes]]
2049=== Node count and hardware/VM configuration
2050
2051[[ops.capacity.nodes.datasize]]
2052==== Physical data size
2053
2054Physical data size on disk is distinct from logical size of your data and is affected by the following:
2055
* Increased by HBase overhead
** See <<keyvalue,keyvalue>> and <<keysize,keysize>>.
  At least 24 bytes per key-value (cell), can be more.
  Small keys/values means more relative overhead.
** KeyValue instances are aggregated into blocks, which are indexed.
  Indexes also have to be stored.
  Blocksize is configurable on a per-ColumnFamily basis.
  See <<regions.arch,regions.arch>>.
2065
2066* Decreased by <<compression,compression>> and data block encoding, depending on data.
2067  See also link:http://search-hadoop.com/m/lL12B1PFVhp1[this thread].
2068  You might want to test what compression and encoding (if any) make sense for your data.
2069* Increased by size of region server <<wal,wal>> (usually fixed and negligible - less than half of RS memory size, per RS).
2070* Increased by HDFS replication - usually x3.
2071
2072Aside from the disk space necessary to store the data, one RS may not be able to serve arbitrarily large amounts of data due to some practical limits on region count and size (see <<ops.capacity.regions,ops.capacity.regions>>).
2073
2074[[ops.capacity.nodes.throughput]]
2075==== Read/Write throughput
2076
2077Number of nodes can also be driven by required throughput for reads and/or writes.
2078The throughput one can get per node depends a lot on data (esp.
2079key/value sizes) and request patterns, as well as node and system configuration.
2080Planning should be done for peak load if it is likely that the load would be the main driver of the increase of the node count.
2081PerformanceEvaluation and <<ycsb,ycsb>> tools can be used to test single node or a test cluster.
2082
For writes, usually 5-15Mb/s per RS can be expected, since every region server has only one active WAL.
2084There's no good estimate for reads, as it depends vastly on data, requests, and cache hit rate. <<perf.casestudy,perf.casestudy>> might be helpful.
2085
2086[[ops.capacity.nodes.gc]]
2087==== JVM GC limitations
2088
2089RS cannot currently utilize very large heap due to cost of GC.
2090There's also no good way of running multiple RS-es per server (other than running several VMs per machine). Thus, ~20-24Gb or less memory dedicated to one RS is recommended.
2091GC tuning is required for large heap sizes.
2092See <<gcpause,gcpause>>, <<trouble.log.gc,trouble.log.gc>> and elsewhere (TODO: where?)
2093
2094[[ops.capacity.regions]]
2095=== Determining region count and size
2096
Generally, fewer regions make for a smoother-running cluster (you can always manually split the big regions later (if necessary) to spread the data, or request load, over the cluster); 20-200 regions per RS is a reasonable range.
The number of regions cannot be configured directly (unless you go for fully <<disable.splitting,disable.splitting>>); adjust the region size to achieve the target region count, given the table size.
2099
2100When configuring regions for multiple tables, note that most region settings can be set on a per-table basis via link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HTableDescriptor.html[HTableDescriptor], as well as shell commands.
2101These settings will override the ones in `hbase-site.xml`.
2102That is useful if your tables have different workloads/use cases.
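
For example, a hedged HBase Shell sketch that sets a 10 GB maximum region size on a single hypothetical table `mytable`, overriding the cluster-wide `hbase.hregion.max.filesize` for that table only:

----
hbase> alter 'mytable', MAX_FILESIZE => '10737418240'
----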
2103
Also note that in the discussion of region sizes here, _HDFS replication factor is not (and should not be) taken into account, whereas other factors <<ops.capacity.nodes.datasize,ops.capacity.nodes.datasize>> should be._ So, if your data is compressed and replicated 3 ways by HDFS, "9 Gb region" means 9 Gb of compressed data.
2106HDFS replication factor only affects your disk usage and is invisible to most HBase code.
2107
2108==== Viewing the Current Number of Regions
2109
2110You can view the current number of regions for a given table using the HMaster UI.
2111In the [label]#Tables# section, the number of online regions for each table is listed in the [label]#Online Regions# column.
2112This total only includes the in-memory state and does not include disabled or offline regions.
If you do not want to use the HMaster UI, you can determine the number of regions by counting the number of subdirectories under the /hbase/<table>/ directory in HDFS, or by running the `bin/hbase hbck` command.
2114Each of these methods may return a slightly different number, depending on the status of each region.
2115
2116[[ops.capacity.regions.count]]
2117==== Number of regions per RS - upper bound
2118
In production scenarios, where you have a lot of data, you are normally concerned with the maximum number of regions you can have per server. <<too_many_regions,too many regions>> has technical discussion on the subject.
2120Basically, the maximum number of regions is mostly determined by memstore memory usage.
2121Each region has its own memstores; these grow up to a configurable size; usually in 128-256 MB range, see <<hbase.hregion.memstore.flush.size,hbase.hregion.memstore.flush.size>>.
2122One memstore exists per column family (so there's only one per region if there's one CF in the table). The RS dedicates some fraction of total memory to its memstores (see <<hbase.regionserver.global.memstore.size,hbase.regionserver.global.memstore.size>>). If this memory is exceeded (too much memstore usage), it can cause undesirable consequences such as unresponsive server or compaction storms.
2123A good starting point for the number of regions per RS (assuming one table) is:
2124
2125[source]
2126----
2127((RS memory) * (total memstore fraction)) / ((memstore size)*(# column families))
2128----
2129
2130This formula is pseudo-code.
2131Here are two formulas using the actual tunable parameters, first for HBase 0.98+ and second for HBase 0.94.x.
2132
2133HBase 0.98.x::
2134----
2135((RS Xmx) * hbase.regionserver.global.memstore.size) / (hbase.hregion.memstore.flush.size * (# column families))
2136----
2137HBase 0.94.x::
2138----
((RS Xmx) * hbase.regionserver.global.memstore.upperLimit) / (hbase.hregion.memstore.flush.size * (# column families))
2140----
2141
If a given RegionServer has 16 GB of RAM, with default settings, the formula works out to 16384*0.4/128 ~ 51 regions per RS as a starting point.
2143The formula can be extended to multiple tables; if they all have the same configuration, just use the total number of families.
2144
2145This number can be adjusted; the formula above assumes all your regions are filled at approximately the same rate.
2146If only a fraction of your regions are going to be actively written to, you can divide the result by that fraction to get a larger region count.
2147Then, even if all regions are written to, all region memstores are not filled evenly, and eventually jitter appears even if they are (due to limited number of concurrent flushes). Thus, one can have as many as 2-3 times more regions than the starting point; however, increased numbers carry increased risk.
2148
2149For write-heavy workload, memstore fraction can be increased in configuration at the expense of block cache; this will also allow one to have more regions.
2150
2151[[ops.capacity.regions.mincount]]
2152==== Number of regions per RS - lower bound
2153
2154HBase scales by having regions across many servers.
2155Thus if you have 2 regions for 16GB data, on a 20 node machine your data will be concentrated on just a few machines - nearly the entire cluster will be idle.
2156This really can't be stressed enough, since a common problem is loading 200MB data into HBase and then wondering why your awesome 10 node cluster isn't doing anything.
2157
2158On the other hand, if you have a very large amount of data, you may also want to go for a larger number of regions to avoid having regions that are too large.
2159
2160[[ops.capacity.regions.size]]
2161==== Maximum region size
2162
2163For large tables in production scenarios, maximum region size is mostly limited by compactions - very large compactions, esp.
2164major, can degrade cluster performance.
2165Currently, the recommended maximum region size is 10-20Gb, and 5-10Gb is optimal.
2166For older 0.90.x codebase, the upper-bound of regionsize is about 4Gb, with a default of 256Mb.
2167
2168The size at which the region is split into two is generally configured via <<hbase.hregion.max.filesize,hbase.hregion.max.filesize>>; for details, see <<arch.region.splits,arch.region.splits>>.

If you cannot estimate the size of your tables well when starting off, it's probably best to stick to the default region size, perhaps going smaller for hot tables (or manually splitting hot regions to spread the load over the cluster), or going with larger region sizes if your cell sizes tend to be largish (100k and up).

In HBase 0.98, an experimental stripe compactions feature was added that allows for larger regions, especially for log data.
See <<ops.stripe,ops.stripe>>.

[[ops.capacity.regions.total]]
==== Total data size per region server

According to the above numbers for region size and number of regions per region server, an optimistic estimate of 10 GB x 100 regions per RS gives up to 1 TB served per region server, which is in line with some of the reported multi-PB use cases.
However, it is important to think about the data vs cache size ratio at the RS level.
With 1 TB of data per server and a 10 GB block cache, only 1% of the data will be cached, which may barely cover all block indices.
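
To make the ratio concrete, here is a small Java sketch of the arithmetic; all values are the illustrative figures from the paragraph above, not measurements.

[source,java]
----
// Illustrative arithmetic only.
public class CacheRatioEstimate {
  public static void main(String[] args) {
    double dataPerServerGb = 10.0 * 100;  // 10 GB regions x 100 regions = ~1 TB per RegionServer
    double blockCacheGb = 10.0;           // assumed block cache size on the RegionServer
    double cachedFraction = blockCacheGb / dataPerServerGb;
    // Prints roughly 1.0%, i.e. only ~1% of the on-disk data fits in the cache.
    System.out.printf("~%.1f%% of the data fits in the block cache%n", cachedFraction * 100);
  }
}
----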

[[ops.capacity.config]]
=== Initial configuration and tuning

First, see <<important_configurations,important configurations>>.
Note that some configurations, more than others, depend on specific scenarios.
Pay special attention to:

* <<hbase.regionserver.handler.count,hbase.regionserver.handler.count>> - request handler thread count, vital for high-throughput workloads.
* <<config.wals,config.wals>> - the blocking number of WAL files depends on your memstore configuration and should be set accordingly to prevent potential blocking when doing a high volume of writes.

Then, there are some considerations when setting up your cluster and tables.

[[ops.capacity.config.compactions]]
==== Compactions

Depending on read/write volume and latency requirements, optimal compaction settings may be different.
See <<compaction,compaction>> for some details.

When provisioning for large data sizes, however, it's good to keep in mind that compactions can affect write throughput.
Thus, for write-intensive workloads, you may opt for less frequent compactions and more store files per region.
The minimum number of files for compactions (`hbase.hstore.compaction.min`) can be set to a higher value; <<hbase.hstore.blockingStoreFiles,hbase.hstore.blockingStoreFiles>> should also be increased, as more files might accumulate in that case.
You may also consider manually managing compactions: see <<managed.compactions,managed.compactions>>.
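
These properties are normally set cluster-wide in the RegionServer configuration, but on recent versions they can also be overridden for a single write-heavy table through the table descriptor.
The following is a hedged sketch using the Java client API; the specific values are examples only, not recommendations.

[source,java]
----
import java.io.IOException;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;

public class CompactionTuningExample {
  // Relaxes compaction triggering for one write-heavy table by overriding the
  // cluster-wide settings at the table level. Values are examples, not recommendations.
  static void tuneCompactionsForWrites(Admin admin, TableName table) throws IOException {
    HTableDescriptor desc = admin.getTableDescriptor(table);
    desc.setConfiguration("hbase.hstore.compaction.min", "5");       // wait for more files before compacting
    desc.setConfiguration("hbase.hstore.blockingStoreFiles", "20");  // tolerate more store files before blocking writes
    // Depending on the version, the table may need to be disabled first
    // (or online schema updates enabled) for the change to take effect.
    admin.modifyTable(table, desc);
  }
}
----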

[[ops.capacity.config.presplit]]
==== Pre-splitting the table

Based on the target number of regions per RS (see <<ops.capacity.regions.count,ops.capacity.regions.count>>) and the number of RSes, one can pre-split the table at creation time.
This both avoids some costly splitting as the table starts to fill up and ensures that the table starts out already distributed across many servers.

If the table is expected to grow large enough to justify that, at least one region per RS should be created.
It is not recommended to split immediately into the full target number of regions (e.g. 50 * number of RSes); rather, a low intermediate value can be chosen.
For multiple tables, it is recommended to be conservative with presplitting (e.g. pre-split 1 region per RS at most), especially if you don't know how much each table will grow.
If you split too much, you may end up with too many regions, with some tables having too many small regions.

For a pre-splitting howto, see <<manual_region_splitting_decisions,manual region splitting decisions>> and <<precreate.regions,precreate.regions>>.
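
For illustration, here is a minimal sketch of pre-splitting at creation time through the Java client API.
The table name, column family, key range, and region count are assumptions for the example; the split boundaries must match your own row-key layout.

[source,java]
----
import java.io.IOException;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.util.Bytes;

public class PresplitTableExample {
  // Creates a table pre-split into roughly one region per RegionServer,
  // assuming hex-prefixed row keys spread evenly over [00000000, ffffffff).
  static void createPresplitTable(Admin admin, int numRegionServers) throws IOException {
    HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("usertable")); // hypothetical name
    desc.addFamily(new HColumnDescriptor("cf"));                                  // hypothetical family
    int numRegions = Math.max(3, numRegionServers); // this createTable variant requires at least three regions
    admin.createTable(desc, Bytes.toBytes("00000000"), Bytes.toBytes("ffffffff"), numRegions);
  }
}
----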

[[table.rename]]
== Table Rename

In versions 0.90.x of HBase and earlier, we had a simple script that would rename the HDFS table directory and then edit the hbase:meta table, replacing all mentions of the old table name with the new one.
The script was called `./bin/rename_table.rb`.
The script was deprecated and removed, mostly because it was unmaintained and the operation it performed was brutal.

As of HBase 0.94.x, you can use the snapshot facility to rename a table.
Here is how you would do it using the HBase shell:

----
hbase shell> disable 'tableName'
hbase shell> snapshot 'tableName', 'tableSnapshot'
hbase shell> clone_snapshot 'tableSnapshot', 'newTableName'
hbase shell> delete_snapshot 'tableSnapshot'
hbase shell> drop 'tableName'
----

Or, in code, it would be as follows:

[source,java]
----
void rename(Admin admin, TableName oldTableName, TableName newTableName) throws IOException {
  String snapshotName = randomName();
  admin.disableTable(oldTableName);
  admin.snapshot(snapshotName, oldTableName);
  admin.cloneSnapshot(snapshotName, newTableName);
  admin.deleteSnapshot(snapshotName);
  admin.deleteTable(oldTableName);
}
----