1.. _pxb.xtrabackup.table-statistics.analyzing:
2
3================================================================================
4 Analyzing Table Statistics
5================================================================================
6
7The |xtrabackup| binary can analyze InnoDB data files in read-only mode to give
8statistics about them. To do this, you should use the :option:`--stats`
9option. You can combine this with the :option:`--tables` option to limit the
10files to examine. It also uses the :option:`--use-memory` option.
11
12You can perform the analysis on a running server, with some chance of errors due
13to the data being changed during analysis. Or, you can analyze a backup copy of
14the database. Either way, to use the statistics feature, you need a clean copy
15of the database including correctly sized log files, so you need to execute with
16:option:`--prepare` twice to use this functionality on a backup.
17
18The result of running on a backup might look like the following: ::
19
20  <INDEX STATISTICS>
21    table: test/table1, index: PRIMARY, space id: 12, root page 3
22    estimated statistics in dictionary:
23      key vals: 25265338, leaf pages 497839, size pages 498304
24    real statistics:
25       level 2 pages: pages=1, data=5395 bytes, data/pages=32%
26       level 1 pages: pages=415, data=6471907 bytes, data/pages=95%
27          leaf pages: recs=25958413, pages=497839, data=7492026403 bytes, data/pages=91%
28
29This can be interpreted as follows:
30
31* The first line simply shows the table and index name and its internal
32  identifiers. If you see an index named ``GEN_CLUST_INDEX``, that is the
33  table's clustered index, automatically created because you did not explicitly
34  create a ``PRIMARY KEY``.
35* The estimated statistics in dictionary information is similar to the data
36  that's gathered through ``ANALYZE TABLE`` inside of |InnoDB| to be stored as
37  estimated cardinality statistics and passed to the query optimizer.
38* The real statistics information is the result of scanning the data pages and
39  computing exact information about the index.
40* ``The level <X> pages``: output means that the line shows information about
41  pages at that level in the index tree. The larger ``<X>`` is, the farther it
42  is from the leaf pages, which are level 0. The first line is the root page.
43* The ``leaf pages`` output shows the leaf pages, of course. This is where the
44  table's data is stored.
45* The ``external pages``: output (not shown) shows large external pages that
46  hold values too long to fit in the row itself, such as long ``BLOB`` and
47  ``TEXT`` values.
48* The ``recs`` is the real number of records (rows) in leaf pages.
49* The ``pages`` is the page count.
50* The ``data`` is the total size of the data in the pages, in bytes.
51* The ``data/pages`` is calculated as (``data`` / (``pages`` * ``PAGE_SIZE``)) *
52  100%. It will never reach 100% because of space reserved for page headers and
53  footers.
54
55A more detailed example is posted as a MySQL Performance Blog `post
56<http://www.mysqlperformanceblog.com/2009/09/14/statistics-of-innodb-tables-and-indexes-available-in-xtrabackup/>`_.
57
58Script to Format Output
59================================================================================
60
61The following script can be used to summarize and tabulate the output of the
62statistics information: ::
63
64    tabulate-xtrabackup-stats.pl
65
66    #!/usr/bin/env perl
67    use strict;
68    use warnings FATAL => 'all';
69    my $script_version = "0.1";
70
71    my $PG_SIZE = 16_384; # InnoDB defaults to 16k pages, change if needed.
72    my ($cur_idx, $cur_tbl);
73    my (%idx_stats, %tbl_stats);
74    my ($max_tbl_len, $max_idx_len) = (0, 0);
75    while ( my $line = <> ) {
76       if ( my ($t, $i) = $line =~ m/table: (.*), index: (.*), space id:/ ) {
77          $t =~ s!/!.!;
78          $cur_tbl = $t;
79          $cur_idx = $i;
80          if ( length($i) > $max_idx_len ) {
81             $max_idx_len = length($i);
82          }
83          if ( length($t) > $max_tbl_len ) {
84             $max_tbl_len = length($t);
85          }
86       }
87       elsif ( my ($kv, $lp, $sp) = $line =~ m/key vals: (\d+), \D*(\d+), \D*(\d+)/ ) {
88          @{$idx_stats{$cur_tbl}->{$cur_idx}}{qw(est_kv est_lp est_sp)} = ($kv, $lp, $sp);
89          $tbl_stats{$cur_tbl}->{est_kv} += $kv;
90          $tbl_stats{$cur_tbl}->{est_lp} += $lp;
91          $tbl_stats{$cur_tbl}->{est_sp} += $sp;
92       }
93       elsif ( my ($l, $pages, $bytes) = $line =~ m/(?:level (\d+)|leaf) pages:.*pages=(\d+), data=(\d+) bytes/ ) {
94          $l ||= 0;
95          $idx_stats{$cur_tbl}->{$cur_idx}->{real_pages} += $pages;
96          $idx_stats{$cur_tbl}->{$cur_idx}->{real_bytes} += $bytes;
97          $tbl_stats{$cur_tbl}->{real_pages} += $pages;
98          $tbl_stats{$cur_tbl}->{real_bytes} += $bytes;
99       }
100    }
101
102    my $hdr_fmt = "%${max_tbl_len}s %${max_idx_len}s %9s %10s %10s\n";
103    my @headers = qw(TABLE INDEX TOT_PAGES FREE_PAGES PCT_FULL);
104    printf $hdr_fmt, @headers;
105
106    my $row_fmt = "%${max_tbl_len}s %${max_idx_len}s %9d %10d %9.1f%%\n";
107    foreach my $t ( sort keys %tbl_stats ) {
108       my $tbl = $tbl_stats{$t};
109       printf $row_fmt, $t, "", $tbl->{est_sp}, $tbl->{est_sp} - $tbl->{real_pages},
110          $tbl->{real_bytes} / ($tbl->{real_pages} * $PG_SIZE) * 100;
111       foreach my $i ( sort keys %{$idx_stats{$t}} ) {
112          my $idx = $idx_stats{$t}->{$i};
113          printf $row_fmt, $t, $i, $idx->{est_sp}, $idx->{est_sp} - $idx->{real_pages},
114             $idx->{real_bytes} / ($idx->{real_pages} * $PG_SIZE) * 100;
115       }
116    }
117
118.. rubric:: Sample Script Output
119
120The output of the above Perl script, when run against the sample shown in the
121previously mentioned blog post, will appear as follows: ::
122
123            TABLE           INDEX TOT_PAGES FREE_PAGES   PCT_FULL
124  art.link_out104                    832383      38561      86.8%
125  art.link_out104         PRIMARY    498304         49      91.9%
126  art.link_out104       domain_id     49600       6230      76.9%
127  art.link_out104     domain_id_2     26495       3339      89.1%
128  art.link_out104 from_message_id     28160        142      96.3%
129  art.link_out104    from_site_id     38848       4874      79.4%
130  art.link_out104   revert_domain    153984      19276      71.4%
131  art.link_out104    site_message     36992       4651      83.4%
132
133The columns are the table and index, followed by the total number of pages in
134that index, the number of pages not actually occupied by data, and the number of
135bytes of real data as a percentage of the total size of the pages of real
136data. The first line in the above output, in which the ``INDEX`` column is
137empty, is a summary of the entire table.
138