NAME
    LaTeX::TOM - A module for parsing, analyzing, and manipulating LaTeX
    documents.

SYNOPSIS
    use LaTeX::TOM;

    $parser = LaTeX::TOM->new;

    # Parse a file into a LaTeX::TOM::Tree.
    $document = $parser->parseFile('mypaper.tex');

    # Serialize the tree back to LaTeX.
    $latex = $document->toLaTeX;

    # Find all TEXT nodes that contain a given string.
    $specialnodes = $document->getNodesByCondition(sub {
        my $node = shift;
        return (
            $node->getNodeType eq 'TEXT'
            && $node->getNodeText =~ /magic string/
        );
    });

    # Find all COMMAND nodes whose name ends in "section".
    $sections = $document->getNodesByCondition(sub {
        my $node = shift;
        return (
            $node->getNodeType eq 'COMMAND'
            && $node->getCommandName =~ /section$/
        );
    });

    $indexme = $document->indexableText;

    $document->print;

DESCRIPTION
    This module provides a parser that parses and partially interprets
    LaTeX documents and returns a tree-based representation of what it
    finds. This tree is a `LaTeX::TOM::Tree'. The tree contains
    `LaTeX::TOM::Node' nodes.

    This module should be especially useful to anyone who wants to
    process LaTeX documents in ways that require extracting plain-text
    information, or altering the plain-text components (or,
    alternatively, the math-text components).

COMPONENTS
  LaTeX::TOM::Parser
    The parser recognizes three parameters upon creation. The
    parameters, in order, are:

    parse error handling (= 0 || 1 || 2)
        Determines what happens when a parse error is encountered. `0'
        results in a warning. `1' results in a die. `2' results in
        silence. Note that particular groupings in LaTeX (e.g.
        `\newcommand' definitions and the like) contain text that is not
        valid TeX or LaTeX by itself, so you nearly always need this
        parameter to be `0' or `2' to completely parse the document.

    read inputs flag (= 0 || 1)
        This flag determines whether a scan for `\input' and
        `\input'-like commands is performed, and whether the files they
        refer to are parsed and added to the parent parse tree. `0'
        means no, `1' means yes. Note that this happens recursively if
        it is turned on. Bibliographies (.bbl files) are also detected
        and included.

    apply mappings flag (= 0 || 1)
        This flag determines whether (most) user-defined mappings are
        applied. This means `\def', `\newcommand', and `\newenvironment'
        definitions. Applying them is critical for properly analyzing
        the content of the document, as the analysis must be phrased in
        terms of the semantics of the original TeX and LaTeX commands,
        not ad hoc user macros. So, for instance, do not expect
        plain-text extraction to work properly with this option off.

    The parser returns a `LaTeX::TOM::Tree' ($document in the SYNOPSIS).
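
A minimal sketch of passing the three parameters, assuming they are given
positionally to `LaTeX::TOM->new' in the order listed above. The variable
names and the sample input string are illustrative only.

```perl
use strict;
use warnings;
use LaTeX::TOM;

# Assumption: the three parameters are passed positionally, in the
# order in which they are documented above.
my $parse_errors   = 2;    # 0 = warn, 1 = die, 2 = stay silent
my $read_inputs    = 0;    # do not follow \input-like commands
my $apply_mappings = 1;    # apply \def, \newcommand, \newenvironment

my $parser = LaTeX::TOM->new($parse_errors, $read_inputs, $apply_mappings);

# Illustrative input; any LaTeX source string would do.
my $document = $parser->parse('Some \textbf{bold} text.');

# The result is a LaTeX::TOM::Tree, which can be turned back into LaTeX.
print $document->toLaTeX, "\n";
```

Silencing parse errors (`2') is chosen here only to keep the output
quiet; `0' is the better choice while debugging a document.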

  LaTeX::TOM::Node
    Nodes may be of the following types:

    TEXT
        `TEXT' nodes can be thought of as representing the plain-text
        portions of the LaTeX document. This includes math and anything
        else that is not a recognized TeX or LaTeX command, or
        user-defined command. In practice, `TEXT' nodes may also contain
        commands whose semantics this parser does not yet recognize.

    COMMAND
        A `COMMAND' node represents a TeX command. It always has a child
        tree, though that tree might be empty if the command operates on
        zero parameters. An example of a command is

            \textbf{blah}

        This would parse into a `COMMAND' node for `textbf', which would
        have a subtree containing the `TEXT' node with text ``blah''.

    ENVIRONMENT
        Similarly, TeX environments parse into `ENVIRONMENT' nodes,
        which have metadata about the environment, along with a subtree
        representing what is contained in the environment. For example,

            \begin{equation}
              r = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}
            \end{equation}

        would parse into an `ENVIRONMENT' node of the class ``equation''
        with a child tree containing the result of parsing
        ``r = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}''.

    GROUP
        A `GROUP' is like an anonymous `COMMAND'. Since you can put
        whatever you want in curly braces (`{}') in TeX in order to make
        semantically isolated regions, this separation is preserved by
        the parser. A `GROUP' is just the subtree of the parsed contents
        of plain curly braces.

        It is important to note that currently only the first `GROUP' in
        a series of `GROUP's following a LaTeX command will actually be
        parsed into a `COMMAND' node. The reason is that, for the
        initial purposes of this module, it was not necessary to
        recognize additional `GROUP's as additional parameters to the
        `COMMAND'. However, this is something that this module really
        should do eventually. Currently, if you want all the parameters
        to a multi-parameter command, you'll need to pick out all the
        following `GROUP' nodes yourself.

        Eventually this will become something like a list which is
        stored in the `COMMAND' node, much like XML::DOM's treatment of
        attributes. These are, in a sense, apart from the rest of the
        document tree. Then `GROUP' nodes will become much more rare.

    COMMENT
        A `COMMENT' node is very similar to a `TEXT' node, except it is
        specifically for lines beginning with `%' (the TeX comment
        delimiter) or the right-hand portion of a line that has `%' at
        some internal point.
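
To see which node types a given input produces, one can flatten the tree
and print each node's type. This is a sketch only: the sample input is
made up, the constructor arguments are assumed to be positional, and the
exact split into nodes depends on the parser.

```perl
use strict;
use warnings;
use LaTeX::TOM;

# Assumption: positional constructor arguments (silent errors, no
# \input scanning, no mappings).
my $parser = LaTeX::TOM->new(2, 0, 0);

# Made-up sample input that should exercise several node types.
my $tree = $parser->parse(<<'TEX');
% a comment line
Plain text with \textbf{bold words} and {a bare group}.
\begin{equation}
x = y + 1
\end{equation}
TEX

# getAllNodes flattens the tree into one arrayref of nodes.
for my $node (@{ $tree->getAllNodes }) {
    my $type = $node->getNodeType;

    # Pick the detail that is meaningful for this node type.
    my $detail =
          $type eq 'COMMAND'     ? $node->getCommandName
        : $type eq 'ENVIRONMENT' ? $node->getEnvironmentClass
        : $type eq 'GROUP'       ? '{...}'
        :                          $node->getNodeText;    # TEXT, COMMENT
    $detail = '' unless defined $detail;
    $detail =~ s/\s+/ /g;    # collapse newlines for one-line output

    printf "%-12s math=%d  %s\n",
        $type, ($node->getNodeMathFlag ? 1 : 0), $detail;
}
```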

  LaTeX::TOM::Tree
    As mentioned before, the Tree is the return result of a parse.

    The tree is nothing more than an arrayref of Nodes, some of which
    may contain their own trees. This is useful knowledge at this point,
    since the user isn't provided with a full suite of convenient
    tree-modification methods. However, Trees do already have some very
    convenient methods, described in the next section.
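
Reading the nested structure does not require touching the underlying
arrayref. The sketch below walks a tree depth-first using only the
documented accessors `getFirstNode', `getNextSibling' and `getChildTree'.
The `walk' helper and the sample input are hypothetical, not part of the
module.

```perl
use strict;
use warnings;
use LaTeX::TOM;

# Hypothetical helper: visit every node of $tree depth-first, calling
# $visit->($node, $depth) for each one.
sub walk {
    my ($tree, $visit, $depth) = @_;
    $depth ||= 0;

    for (my $node = $tree->getFirstNode;
         defined $node;
         $node = $node->getNextSibling)
    {
        $visit->($node, $depth);

        # Only COMMAND, ENVIRONMENT and GROUP nodes carry a child tree.
        my $type = $node->getNodeType;
        next
            unless $type eq 'COMMAND'
            || $type eq 'ENVIRONMENT'
            || $type eq 'GROUP';

        my $child = $node->getChildTree;
        walk($child, $visit, $depth + 1) if defined $child;
    }
    return;
}

# Assumption: positional constructor arguments.
my $parser = LaTeX::TOM->new(2, 0, 0);
my $tree   = $parser->parse('Outer \emph{inner {deepest} text} tail.');

# Collect an indented outline of node types.
my @outline;
walk($tree, sub {
    my ($node, $depth) = @_;
    push @outline, ('  ' x $depth) . $node->getNodeType;
});
print "$_\n" for @outline;
```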

METHODS
    In this section all of the methods for each of the components are
    listed and described.

  LaTeX::TOM
    new
        Instantiate a new parser object.

  LaTeX::TOM::Parser
    The methods for the parser (aside from the constructor, discussed
    above) are:

    parseFile (filename)
        Read in the contents of *filename* and parse them, returning a
        `LaTeX::TOM::Tree'.

    parse (string)
        Parse the string *string* and return a `LaTeX::TOM::Tree'.

  LaTeX::TOM::Tree
    This section contains methods for the Trees returned by the parser.

    copy
        Duplicate a tree into new memory.

    print
        A debug print of the structure of the tree.

    plainText
        Returns an arrayref of strings holding the text of all `TEXT'
        nodes whose `getNodePlainTextFlag' is `1', in an in-order
        traversal.

    indexableText
        Like `plainText', but goes one step further: it cleans all of
        the returned text and concatenates it into a single string. That
        string can be considered to carry all of the standard
        information retrieval value of the document, making it useful
        for indexing.

    toLaTeX
        Return a string representing the LaTeX encoded by the tree. This
        is especially useful to get a normal document again, after
        modifying nodes of the tree.

    getTopLevelNodes
        Return a list of `LaTeX::TOM::Node' objects at the top level of
        the Tree.

    getAllNodes
        Return an arrayref with all nodes of the tree. This "flattens"
        the tree.

    getCommandNodesByName (name)
        Return an arrayref with all `COMMAND' nodes in the tree which
        have a name matching *name*.

    getEnvironmentsByName (name)
        Return an arrayref with all `ENVIRONMENT' nodes in the tree
        which have a class matching *name*.

    getNodesByCondition (code reference)
        This is a catch-all search method which can be used to pull out
        nodes that match pretty much any Perl expression, without
        manually having to traverse the tree. *code reference* is a Perl
        code reference which receives as its first argument the node of
        the tree that is currently being scrutinized and is expected to
        return a boolean value. See the SYNOPSIS for examples.

    getFirstNode
        Returns the first node of the tree. This is useful if you want
        to walk the tree yourself, starting with the first node.
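
The three search methods above can be combined as in the following
sketch. The sample document is made up, and the constructor arguments
are assumed to be positional.

```perl
use strict;
use warnings;
use LaTeX::TOM;

# Assumption: positional constructor arguments.
my $parser = LaTeX::TOM->new(2, 0, 0);

# Made-up sample document.
my $tree = $parser->parse(<<'TEX');
\section{Introduction}
We study \emph{trees}.
\begin{quote}
A quoted remark.
\end{quote}
\subsection{Details}
More text.
TEX

# Each search method returns an arrayref of matching nodes.
my $emph     = $tree->getCommandNodesByName('emph');
my $quotes   = $tree->getEnvironmentsByName('quote');
my $headings = $tree->getNodesByCondition(sub {
    my $node = shift;
    return $node->getNodeType eq 'COMMAND'
        && $node->getCommandName =~ /section$/;
});

printf "%d emph command(s), %d quote environment(s), %d heading(s)\n",
    scalar @$emph, scalar @$quotes, scalar @$headings;

# A heading's title lives in the COMMAND node's child tree.
for my $heading (@$headings) {
    print $heading->getCommandName, ': ',
        $heading->getChildTree->toLaTeX, "\n";
}
```

`getNodesByCondition' is the most general of the three; the two
by-name methods are shortcuts for the common cases.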

  LaTeX::TOM::Node
    This section contains the methods for nodes of the parsed Trees.

    getNodeType
        Returns the type, one of `TEXT', `COMMAND', `ENVIRONMENT',
        `GROUP', or `COMMENT', as described above.

    getNodeText
        Applicable for `TEXT' or `COMMENT' nodes; this returns the
        document text they contain. This is undef for other node types.

    setNodeText
        Set the node text, also for `TEXT' and `COMMENT' nodes.

    getNodeStartingPosition
        Get the starting character position in the document of this
        node. For `TEXT' and `COMMENT' nodes, this will be where the
        text begins. For `ENVIRONMENT', `COMMAND', or `GROUP' nodes,
        this will be the position of the *last* character of the opening
        identifier.

    getNodeEndingPosition
        Same as `getNodeStartingPosition', but for the last character.
        For `GROUP', `ENVIRONMENT', or `COMMAND' nodes, this will be the
        *first* character of the closing identifier.

    getNodeOuterStartingPosition
        Same as `getNodeStartingPosition', but for `GROUP',
        `ENVIRONMENT', or `COMMAND' nodes, this returns the *first*
        character of the opening identifier.

    getNodeOuterEndingPosition
        Same as `getNodeEndingPosition', but for `GROUP', `ENVIRONMENT',
        or `COMMAND' nodes, this returns the *last* character of the
        closing identifier.

    getNodeMathFlag
        This applies to any node type. It is `1' if the node sets, or is
        contained within, a math mode region, and `0' otherwise. `TEXT'
        nodes which have this flag as `1' can be assumed to be the
        actual mathematics contained in the document.

    getNodePlainTextFlag
        This applies only to `TEXT' nodes. It is `1' if the node is
        non-math and is visible (in other words, will end up being a
        part of the output document). One would only want to index
        `TEXT' nodes with this property, for information retrieval
        purposes.

    getEnvironmentClass
        This applies only to `ENVIRONMENT' nodes. Returns what class of
        environment the node represents (the `X' in `\begin{X}' and
        `\end{X}').

    getCommandName
        This applies only to `COMMAND' nodes. Returns the name of the
        command (the `X' in `\X{...}').

    getChildTree
        This applies only to `COMMAND', `ENVIRONMENT', and `GROUP'
        nodes: it returns the `LaTeX::TOM::Tree' which is ``under'' the
        calling node.

    getFirstChild
        This applies only to `COMMAND', `ENVIRONMENT', and `GROUP'
        nodes: it returns the first node from the first level of the
        child subtree.

    getLastChild
        Same as `getFirstChild', but for the last node of the first
        level.

    getPreviousSibling
        Return the prior node on the same level of the tree.

    getNextSibling
        Same as `getPreviousSibling', but for the following node.

    getParent
        Get the parent node of this node in the tree.

    getNextGroupNode
        This is an interesting function, and kind of a hack because of
        the way the parser makes the current tree. It gives you the next
        sibling that is a `GROUP' node, and returns `undef' once it hits
        the end of the tree level, a `TEXT' node which doesn't match
        `/^\s*$/', or a `COMMAND' node.

        This is useful for finding all `GROUP'ed parameters after a
        `COMMAND' node (see comments for `GROUP' in the `COMPONENTS' /
        `LaTeX::TOM::Node' section). You can just have a while loop that
        calls this method until it gets `undef', and you'll know you've
        found all the parameters to a command.

        Note: `TEXT' nodes matching `/^\s*\[[0-9]+\]$/' (optional
        parameter groups) are treated as if they were blank. This may be
        undesirable.
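
The while loop described for `getNextGroupNode' might look like the
sketch below. It assumes that the first parameter ends up in the
`COMMAND' node's child tree and the remaining ones in following `GROUP'
siblings, as described in the COMPONENTS section. The `\multicolumn'
input is only an example of a multi-parameter command.

```perl
use strict;
use warnings;
use LaTeX::TOM;

# Assumption: positional constructor arguments.
my $parser = LaTeX::TOM->new(2, 0, 0);
my $tree   = $parser->parse('\multicolumn{2}{c}{Total}');

my ($command) = @{ $tree->getCommandNodesByName('multicolumn') };
die "no \\multicolumn command found\n" unless defined $command;

# The first parameter is the COMMAND node's own child tree ...
my @parameters = ($command->getChildTree->toLaTeX);

# ... and the remaining ones are the GROUP nodes that follow it.
my $group = $command;
while (defined($group = $group->getNextGroupNode)) {
    push @parameters, $group->getChildTree->toLaTeX;
}

printf "parameter %d: %s\n", $_ + 1, $parameters[$_] for 0 .. $#parameters;
```

Calling the method on each returned `GROUP' in turn, rather than
repeatedly on the `COMMAND' node, is what advances the loop.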

CAVEATS
    Due to the lack of tree-modification methods, this module is
    currently mostly useful for minor modifications to the parsed
    document, for instance, altering the text of `TEXT' nodes but not
    deleting the nodes. Of course, the user can still make larger
    changes by breaking abstraction and directly modifying the Tree.
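
A minor modification of this kind can be sketched as follows: the text
of each `TEXT' node is rewritten in place and the document is then
regenerated with `toLaTeX'. The input string and the substitution are
made up, and the constructor arguments are assumed to be positional.

```perl
use strict;
use warnings;
use LaTeX::TOM;

# Assumption: positional constructor arguments.
my $parser = LaTeX::TOM->new(2, 0, 0);
my $tree   = $parser->parse('The colour of \emph{colour} theory.');

# Collect every TEXT node, wherever it sits in the tree.
my $text_nodes = $tree->getNodesByCondition(sub {
    my $node = shift;
    return $node->getNodeType eq 'TEXT';
});

# Rewrite the text of each node without adding or removing nodes.
for my $node (@$text_nodes) {
    my $text = $node->getNodeText;
    $text =~ s/\bcolour\b/color/g;
    $node->setNodeText($text);
}

# Regenerate the LaTeX source from the modified tree.
my $latex = $tree->toLaTeX;
print "$latex\n";
```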

    Also note that the parsing is not complete. This module was not
    written with the intention of being able to produce output documents
    the way ``latex'' does. The intent was instead to be able to analyze
    and modify the document on a logical level with regard to the
    content; it doesn't care about the document formatting and
    outputting side of TeX/LaTeX.

    There is much work still to be done. See the TODO list in the TOM.pm
    source.

BUGS
    Probably plenty. However, this module has performed fairly well on a
    set of ~1000 research publications from the Computing Research
    Repository, so I deemed it ``good enough'' to use for purposes
    similar to mine.

    Please let the maintainer know of parser errors if you discover any.

CREDITS
    Thanks to the following people (in order of appearance), who have
    contributed valuable suggestions and patches:

        Otakar Smrz
        Moritz Lenz
        James Bowlin
        Jesse S. Bangs

AUTHORS
    Written by Aaron Krowne <akrowne@vt.edu>

    Maintained by Steven Schubiger <schubiger@cpan.org>

LICENSE
    This program is free software; you may redistribute it and/or modify
    it under the same terms as Perl itself.

    See http://dev.perl.org/licenses/