HTML::TableParser

HTML::TableParser uses HTML::Parser to extract data from an HTML table.
The data is returned via a series of user-defined callback functions or
methods. Specific tables may be selected either by matching a unique
table id or by matching against the column names. Multiple (even nested)
tables may be parsed in a document in one pass.

  Table Identification

Each table is given a unique id, relative to its parent, based upon its
order and nesting. The first top-level table has id 1, the second 2,
etc. The first table nested in table 1 has id 1.1, the second 1.2, etc.
The first table nested in table 1.1 has id 1.1.1, etc. These, as well as
the tables' column names, may be used to identify which tables to parse.
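
To make the numbering concrete, here is a minimal sketch that assigns
ids from nesting alone. It does not use HTML::TableParser: the
table_ids function and its input (a list of 'open'/'close' events, one
pair per table) are hypothetical and exist only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: walk a stream of 'open'/'close' events (one pair
# per <table>...</table>) and return the id each table would receive.
sub table_ids {
    my @events = @_;
    my @path;           # ids of the tables currently open, outermost first
    my @count = (0);    # per nesting level: number of tables seen so far
    my @ids;

    for my $ev (@events) {
        if ( $ev eq 'open' ) {
            my $n = ++$count[-1];    # next sibling number at this level
            push @path, $n;
            push @ids, join '.', @path;
            push @count, 0;          # fresh counter for nested tables
        }
        elsif ( $ev eq 'close' ) {
            pop @path;
            pop @count;
        }
    }
    return @ids;
}

# Table 1 holds two tables, the first of which holds one more; a second
# top-level table follows.
my @ids = table_ids(
    qw( open open open close close open close close open close ) );
print "$_\n" for @ids;
```

For the event list above the sketch prints 1, 1.1, 1.1.1, 1.2 and 2,
one per line.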

  Data Extraction

As the parser traverses a selected table, it will pass data to
user-provided callback functions or methods after it has digested
particular structures in the table. All functions are passed the table
id (as described above), the line number in the HTML source where the
table was found, and a reference to any table-specific user-provided
data.

Table Start
        The start callback is invoked when a matched table has been
        found.

Table End
        The end callback is invoked after a matched table has been
        parsed.

Header  The hdr callback is invoked after the table header has been read
        in. Some tables do not use the <th> tag to indicate a header, so
        this function may not be called. It is passed the column names.

Row     The row callback is invoked after a row in the table has been
        read. It is passed the column data.

Warn    The warn callback is invoked when a non-fatal error occurs
        during parsing. Fatal errors croak.

New     This is the class method to call to create a new object when
        HTML::TableParser is supposed to create new objects upon table
        start.

  Callback API

Callbacks may be functions or methods or a mixture of both. In the
latter case, an object must be passed to the constructor. (More on that
later.)

The callbacks are invoked as follows:

  start( $tbl_id, $line_no, $udata );

  end( $tbl_id, $line_no, $udata );

  hdr( $tbl_id, $line_no, \@col_names, $udata );

  row( $tbl_id, $line_no, \@data, $udata );

  warn( $tbl_id, $line_no, $message, $udata );

  new( $tbl_id, $udata );
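
A minimal sketch of function callbacks in use, modeled on the module's
documented synopsis. It assumes HTML::TableParser is installed and that
requests are passed to the constructor as a list of hashes with id,
hdr, row and udata keys, followed by a hash of cleanup attributes. The
HTML snippet and the %seen bookkeeping are made up for the example.

```perl
use strict;
use warnings;
use HTML::TableParser;

# Illustrative input; any HTML document containing a table would do.
my $html = <<'HTML';
<table>
  <tr><th>No</th><th>Object</th></tr>
  <tr><td>1</td><td>M31</td></tr>
  <tr><td>2</td><td>M33</td></tr>
</table>
HTML

my %seen = ( names => [], rows => [] );    # filled in by the callbacks

# Each callback receives the table id, the source line number, its
# payload (if any), and the user data given in the request.
sub hdr {
    my ( $tbl_id, $line_no, $col_names, $udata ) = @_;
    @{ $udata->{names} } = @$col_names;
}

sub row {
    my ( $tbl_id, $line_no, $data, $udata ) = @_;
    push @{ $udata->{rows} }, [@$data];    # copy, in case the array is reused
}

# One request: match the first top-level table (id 1).
my @reqs = ( { id => 1, hdr => \&hdr, row => \&row, udata => \%seen } );

my $p = HTML::TableParser->new( \@reqs,
    { Decode => 1, Trim => 1, Chomp => 1 } );
$p->parse($html);
$p->eof;

print join( "\t", @{ $seen{names} } ), "\n";
print join( "\t", @$_ ), "\n" for @{ $seen{rows} };
```

Because HTML::TableParser builds on HTML::Parser, a file on disk can be
handed to parse_file instead of calling parse and eof on a string.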

  Data Cleanup

There are several cleanup operations that may be performed
automatically:

Chomp   chomp() the data.

Decode  Run the data through HTML::Entities::decode.

DecodeNBSP
        Normally HTML::Entities::decode changes a non-breaking space
        into a character which doesn't seem to be matched by Perl's
        whitespace regexp. Setting this attribute changes the HTML
        "nbsp" character to a plain ol' blank.

Trim    Remove leading and trailing white space.
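
As a rough illustration of what these options amount to for a single
cell value, consider the sketch below. The cleanup function is
hypothetical and is not the module's implementation; in particular, the
order of the steps and the coupling of DecodeNBSP to Decode are
assumptions. It relies only on decode_entities from HTML::Entities.

```perl
use strict;
use warnings;
use HTML::Entities qw( decode_entities );

# Hypothetical stand-in for the cleanup step; the keys of %opt mirror
# the attribute names listed above.
sub cleanup {
    my ( $text, %opt ) = @_;

    chomp $text if $opt{Chomp};

    if ( $opt{Decode} ) {
        $text = decode_entities($text);

        # &nbsp; decodes to U+00A0; optionally turn it into a plain space
        $text =~ tr/\xA0/ / if $opt{DecodeNBSP};
    }

    $text =~ s/^\s+|\s+$//g if $opt{Trim};
    return $text;
}

my $cell = "&nbsp;R&amp;D &lt;dept&gt;  \n";
print '[',
    cleanup( $cell, Chomp => 1, Decode => 1, DecodeNBSP => 1, Trim => 1 ),
    "]\n";    # prints "[R&D <dept>]"
```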

  Data Organization

Column names are derived from cells delimited by the <th> and </th>
tags. Some tables have header cells which span one or more columns or
rows to make things look nice. HTML::TableParser determines the actual
number of columns used and provides column names for each column,
repeating names for spanned columns and concatenating spanned rows and
columns. For example, if the table header looks like this:

 +----+--------+----------+-------------+-------------------+
 |    |        | Eq J2000 |             | Velocity/Redshift |
 | No | Object |----------| Object Type |-------------------|
 |    |        | RA | Dec |             | km/s |  z  | Qual |
 +----+--------+----------+-------------+-------------------+

The columns will be:

  No
  Object
  Eq J2000 RA
  Eq J2000 Dec
  Object Type
  Velocity/Redshift km/s
  Velocity/Redshift z
  Velocity/Redshift Qual

Row data are derived from cells delimited by the <td> and </td> tags.
Cells which span more than one column or row are handled correctly, i.e.
the values are duplicated in the appropriate places.
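
The header flattening described above can be sketched in plain Perl.
This is an illustration of the idea, not the module's code: the
column_names function and its input format (one list of
[ text, colspan, rowspan ] triples per header row) are hypothetical,
and the example header is assumed to consist of two rows, with No,
Object and Object Type each spanning both.

```perl
use strict;
use warnings;

# Hypothetical helper: each header row is a list of
# [ text, colspan, rowspan ] cells, in document order.
sub column_names {
    my @rows = @_;
    my @grid;    # $grid[$r][$c]: text that row $r contributes to column $c

    for my $r ( 0 .. $#rows ) {
        $grid[$r] ||= [];
        my $c = 0;
        for my $cell ( @{ $rows[$r] } ) {
            my ( $text, $colspan, $rowspan ) = @$cell;

            # skip slots already claimed by a rowspan from an earlier row
            $c++ while defined $grid[$r][$c];

            for my $dr ( 0 .. $rowspan - 1 ) {
                for my $dc ( 0 .. $colspan - 1 ) {
                    # repeat the name across spanned columns, but count
                    # it only once down spanned rows
                    $grid[ $r + $dr ][ $c + $dc ] = $dr ? '' : $text;
                }
            }
            $c += $colspan;
        }
    }

    my $ncols = 0;
    for my $row (@grid) { $ncols = @$row if @$row > $ncols }

    # join the non-empty contributions of every row, column by column
    return map {
        my $c = $_;
        join ' ', grep { defined($_) && length($_) } map { $_->[$c] } @grid;
    } 0 .. $ncols - 1;
}

my @names = column_names(
    [   [ 'No',                1, 2 ],
        [ 'Object',            1, 2 ],
        [ 'Eq J2000',          2, 1 ],
        [ 'Object Type',       1, 2 ],
        [ 'Velocity/Redshift', 3, 1 ],
    ],
    [   [ 'RA',   1, 1 ],
        [ 'Dec',  1, 1 ],
        [ 'km/s', 1, 1 ],
        [ 'z',    1, 1 ],
        [ 'Qual', 1, 1 ],
    ],
);
print "$_\n" for @names;
```

Run on the two header rows above, the sketch prints the same eight
column names as listed earlier in this section.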

INSTALLATION

This is a Perl module distribution. It should be installed with whichever
tool you use to manage your installation of Perl, e.g. any of

  cpanm .
  cpan  .
  cpanp -i .

Consult http://www.cpan.org/modules/INSTALL.html for further instruction.
Should you wish to install this module manually, the procedure is

  perl Makefile.PL
  make
  make test
  make install

COPYRIGHT AND LICENSE

This software is Copyright (c) 2018 by Smithsonian Astrophysical
Observatory.

This is free software, licensed under:

  The GNU General Public License, Version 3, June 2007