README
HTML::TableParser

HTML::TableParser uses HTML::Parser to extract data from an HTML table.
The data is returned via a series of user-defined callback functions or
methods. Specific tables may be selected either by matching a unique
table id or by matching against the column names. Multiple (even nested)
tables may be parsed in a document in one pass.

  Table Identification

Each table is given a unique id, relative to its parent, based upon its
order and nesting. The first top level table has id 1, the second 2,
etc. The first table nested in table 1 has id 1.1, the second 1.2, etc.
The first table nested in table 1.1 has id 1.1.1, etc. These, as well as
the tables' column names, may be used to identify which tables to parse.
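
The numbering scheme can be sketched without the module. The helper
below is hypothetical (table_ids is not part of HTML::TableParser) and
works on a made-up stream of "open" and "close" events standing in for
<table> and </table> tags; it only illustrates how ids follow order and
nesting.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of HTML::TableParser: assign ids to a
# stream of table "open"/"close" events using the scheme described above.
sub table_ids {
    my @events = @_;
    my @path;             # ids of the currently open tables, outermost first
    my @count = (0);      # number of tables seen so far at each open level
    my @ids;
    for my $ev (@events) {
        if ( $ev eq 'open' ) {
            $count[-1]++;                 # next sibling at this level
            push @path, $count[-1];
            push @ids, join '.', @path;   # e.g. 1, 1.1, 1.1.1
            push @count, 0;               # counter for tables nested in it
        }
        else {                            # 'close'
            pop @count;
            pop @path;
        }
    }
    return @ids;
}

# Table 1 contains 1.1 (which contains 1.1.1) and 1.2; table 2 follows.
my @ids = table_ids(
    qw( open open open close close open close close open close ) );
print "$_\n" for @ids;
```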

  Data Extraction

As the parser traverses a selected table, it will pass data to
user-provided callback functions or methods after it has digested
particular structures in the table. All callbacks are passed the table
id (as described above) and a reference to any table-specific
user-provided data. All except new are also passed the line number in
the HTML source where the table was found.

Table Start
        The start callback is invoked when a matched table has been
        found.

Table End
        The end callback is invoked after a matched table has been
        parsed.

Header  The hdr callback is invoked after the table header has been read
        in. Some tables do not use the <th> tag to indicate a header, so
        this function may not be called. It is passed the column names.

Row     The row callback is invoked after a row in the table has been
        read. It is passed the column data.

Warn    The warn callback is invoked when a non-fatal error occurs
        during parsing. Fatal errors croak.

New     This is the class method to call to create a new object when
        HTML::TableParser is supposed to create new objects upon table
        start.

  Callback API

Callbacks may be functions, methods, or a mixture of both. If methods
are used, an object must be passed to the constructor. (More on that
later.)

The callbacks are invoked as follows:

    start( $tbl_id, $line_no, $udata );

    end( $tbl_id, $line_no, $udata );

    hdr( $tbl_id, $line_no, \@col_names, $udata );

    row( $tbl_id, $line_no, \@data, $udata );

    warn( $tbl_id, $line_no, $message, $udata );

    new( $tbl_id, $udata );
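
A minimal sketch of function callbacks using the signatures above. The
request keys (id, hdr, row, udata) and the attribute hash passed as the
second constructor argument are taken from the module's documented
interface rather than from this README, so check them against the POD
of your installed version. The sample HTML and the names on_hdr, on_row
and %seen are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TableParser;

# Made-up input: one top level table, so its id is 1.
my $html = <<'HTML';
<table>
  <tr><th>Name</th><th>Qty</th></tr>
  <tr><td>apple</td><td>3</td></tr>
  <tr><td>pear</td><td>5</td></tr>
</table>
HTML

# User data handed back to every callback as the last argument.
my %seen = ( cols => [], rows => [] );

# hdr( $tbl_id, $line_no, \@col_names, $udata )
sub on_hdr {
    my ( $tbl_id, $line_no, $col_names, $udata ) = @_;
    $udata->{cols} = [@$col_names];          # copy the column names
}

# row( $tbl_id, $line_no, \@data, $udata )
sub on_row {
    my ( $tbl_id, $line_no, $data, $udata ) = @_;
    push @{ $udata->{rows} }, [@$data];      # copy the row data
}

# One request: match table 1 and route its header and rows.
my @reqs = (
    { id => 1, hdr => \&on_hdr, row => \&on_row, udata => \%seen },
);

my $p = HTML::TableParser->new( \@reqs,
    { Decode => 1, Trim => 1, Chomp => 1 } );
$p->parse($html);    # parse() and eof() are inherited from HTML::Parser
$p->eof;

print join( "\t", @{ $seen{cols} } ), "\n";
print join( "\t", @$_ ), "\n" for @{ $seen{rows} };
```

The arrays are copied inside the callbacks so that the collected data
does not depend on whether the parser reuses them between calls.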

  Data Cleanup

There are several cleanup operations that may be performed
automatically:

Chomp   chomp() the data.

Decode  Run the data through HTML::Entities::decode.

DecodeNBSP
        Normally HTML::Entities::decode changes a non-breaking space
        into a character which doesn't seem to be matched by Perl's
        whitespace regexp. Setting this attribute changes the HTML
        "nbsp" character to a plain ol' blank.

Trim    Remove leading and trailing white space.
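
The effect of Chomp, DecodeNBSP and Trim on a single cell can be
sketched in plain Perl. This is a hypothetical stand-in (clean_cell is
not part of the module and is not its implementation); entity decoding
is left out, and the input is assumed to already contain the decoded
non-breaking space character "\xA0".

```perl
use strict;
use warnings;

# Hypothetical stand-in for the Chomp, DecodeNBSP and Trim steps; this is
# not the module's own code, and entity decoding (Decode) is left out.
sub clean_cell {
    my ($text) = @_;
    chomp $text;              # Chomp: drop one trailing newline
    $text =~ tr/\xA0/ /;      # DecodeNBSP: non-breaking space to blank
    $text =~ s/^\s+//;        # Trim: leading white space
    $text =~ s/\s+$//;        # Trim: trailing white space
    return $text;
}

my $cleaned = clean_cell("\xA0 42 km/s \xA0\n");
print "[$cleaned]\n";
```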

  Data Organization

Column names are derived from cells delimited by the <th> and </th>
tags. Some tables have header cells which span one or more columns or
rows to make things look nice. HTML::TableParser determines the actual
number of columns used and provides column names for each column,
repeating names for spanned columns and concatenating spanned rows and
columns. For example, if the table header looks like this:

    +----+--------+----------+-------------+-------------------+
    |    |        | Eq J2000 |             | Velocity/Redshift |
    | No | Object |----------| Object Type |-------------------|
    |    |        | RA | Dec |             | km/s | z   | Qual |
    +----+--------+----------+-------------+-------------------+

The columns will be:

    No
    Object
    Eq J2000 RA
    Eq J2000 Dec
    Object Type
    Velocity/Redshift km/s
    Velocity/Redshift z
    Velocity/Redshift Qual

Row data are derived from cells delimited by the <td> and </td> tags.
Cells which span more than one column or row are handled correctly, i.e.
the values are duplicated in the appropriate places.
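
One way to arrive at the column names in the example above is to lay
the header cells out on a grid and join the text found in each column.
The sketch below is an illustration only: column_names is a
hypothetical helper, each cell is written by hand as
[ text, colspan, rowspan ], and the module's own algorithm may differ.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of HTML::TableParser. Each header row is
# a list of cells written as [ $text, $colspan, $rowspan ].
sub column_names {
    my @rows = @_;
    my @grid;    # $grid[$r][$c] = text that row $r contributes to column $c
    for my $r ( 0 .. $#rows ) {
        my $c = 0;
        for my $cell ( @{ $rows[$r] } ) {
            my ( $text, $colspan, $rowspan ) = @$cell;
            $colspan ||= 1;
            $rowspan ||= 1;
            # Skip slots already claimed by a cell spanning down from above.
            $c++ while defined $grid[$r][$c];
            for my $dr ( 0 .. $rowspan - 1 ) {
                for my $dc ( 0 .. $colspan - 1 ) {
                    # Repeat the text across spanned columns, but record it
                    # only once for spanned rows so it is not doubled.
                    $grid[ $r + $dr ][ $c + $dc ] = $dr == 0 ? $text : '';
                }
            }
            $c += $colspan;
        }
    }
    my $ncols = 0;
    for my $row (@grid) { $ncols = @$row if @$row > $ncols }
    return map {
        my $c = $_;
        join ' ', grep { defined($_) && length($_) } map { $_->[$c] } @grid;
    } 0 .. $ncols - 1;
}

my @names = column_names(
    [
        [ 'No',                1, 2 ],
        [ 'Object',            1, 2 ],
        [ 'Eq J2000',          2, 1 ],
        [ 'Object Type',       1, 2 ],
        [ 'Velocity/Redshift', 3, 1 ],
    ],
    [ ['RA'], ['Dec'], ['km/s'], ['z'], ['Qual'] ],
);
print "$_\n" for @names;
```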

INSTALLATION

This is a Perl module distribution. It should be installed with whichever
tool you use to manage your installation of Perl, e.g. any of

    cpanm .
    cpan .
    cpanp -i .

Consult http://www.cpan.org/modules/INSTALL.html for further instruction.
Should you wish to install this module manually, the procedure is

    perl Makefile.PL
    make
    make test
    make install

COPYRIGHT AND LICENSE

This software is Copyright (c) 2018 by Smithsonian Astrophysical
Observatory.

This is free software, licensed under:

    The GNU General Public License, Version 3, June 2007