Name          Date         Size      #Lines  LOC
..            03-May-2022  -
lib/LaTeX/    23-Dec-2011  -         3,126   1,463
t/            23-Dec-2011  -         536     396
Build.PL      23-Dec-2011  517       24      18
Changes       23-Dec-2011  7.2 KiB   268     150
INSTALL       23-Dec-2011  285       15      11
MANIFEST      23-Dec-2011  700       39      38
META.json     23-Dec-2011  1.4 KiB   60      59
META.yml      23-Dec-2011  818       35      34
Makefile.PL   23-Dec-2011  493       17      15
README        23-Dec-2011  13.4 KiB  350     270
TODO          23-Dec-2011  2.5 KiB   56      39

README

NAME
    LaTeX::TOM - A module for parsing, analyzing, and manipulating LaTeX
    documents.

SYNOPSIS
     use LaTeX::TOM;

     my $parser = LaTeX::TOM->new;

     my $document = $parser->parseFile('mypaper.tex');

     # Serialize the tree back to LaTeX source
     my $latex = $document->toLaTeX;

     # TEXT nodes whose text contains a given string
     my $specialnodes = $document->getNodesByCondition(sub {
         my $node = shift;
         return (
           $node->getNodeType eq 'TEXT'
             && $node->getNodeText =~ /magic string/
         );
     });

     # COMMAND nodes whose name ends in "section"
     my $sections = $document->getNodesByCondition(sub {
         my $node = shift;
         return (
           $node->getNodeType eq 'COMMAND'
             && $node->getCommandName =~ /section$/
         );
     });

     # Cleaned, concatenated plain text (see indexableText under METHODS)
     my $indexme = $document->indexableText;

     # Debug print of the tree structure
     $document->print;

DESCRIPTION
    This module provides a parser which parses and interprets (though not
    fully) LaTeX documents and returns a tree-based representation of what
    it finds. This tree is a `LaTeX::TOM::Tree'. The tree contains
    `LaTeX::TOM::Node' nodes.

    This module should be especially useful to anyone who wants to do
    processing of LaTeX documents that requires extraction of plain-text
    information, or altering of the plain-text components (or alternatively,
    the math-text components).

COMPONENTS
  LaTeX::TOM::Parser
    The parser recognizes three positional parameters upon creation. The
    parameters, in order, are

    parse error handling (= 0 || 1 || 2)
        Determines what happens when a parse error is encountered. `0'
        results in a warning. `1' results in a die. `2' results in silence.
        Note that particular groupings in LaTeX (e.g. `\newcommand'
        definitions and the like) contain invalid TeX or LaTeX, so you
        nearly always need this parameter to be `0' or `2' to completely
        parse the document.

    read inputs flag (= 0 || 1)
        This flag determines whether a scan for `\input' and `\input'-like
        commands is performed, and the files they reference are parsed and
        added to the parent parse tree. `0' means no, `1' means do it. Note
        that this will happen recursively if it is turned on. Also,
        bibliographies (.bbl files) are detected and included.

    apply mappings flag (= 0 || 1)
        This flag determines whether (most) user-defined mappings are
        applied. This means `\def', `\newcommand', and `\newenvironment'
        definitions. This is critical for properly analyzing the content of
        the document, as this must be phrased in terms of the semantics of
        the original TeX and LaTeX commands, not ad hoc user macros. So, for
        instance, do not expect plain-text extraction to work properly with
        this option off.

    The parser returns a `LaTeX::TOM::Tree' ($document in the SYNOPSIS).
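
    As a minimal sketch of the parameter order described above, the
    following passes all three values positionally. The sample input
    string is made up for illustration, and the chosen values are only one
    plausible combination, not a recommendation from the module itself.

```perl
use strict;
use warnings;
use LaTeX::TOM;

# Arguments are positional, in the order documented above:
#   1. parse error handling: 2 = stay silent on parse errors
#   2. read inputs flag:     0 = do not follow \input-like commands
#   3. apply mappings flag:  1 = apply \def, \newcommand, \newenvironment
my $parser = LaTeX::TOM->new(2, 0, 1);

# The sample document below is invented for this sketch.
my $document = $parser->parse('Hello, \textbf{world}.');

# Serialize the parsed tree back to LaTeX source.
print $document->toLaTeX, "\n";
```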

  LaTeX::TOM::Node
    Nodes may be of the following types:

    TEXT
        `TEXT' nodes can be thought of as representing the plain-text
        portions of the LaTeX document. This includes math and anything else
        that is not a recognized TeX or LaTeX command, or user-defined
        command. In reality, `TEXT' nodes may also contain commands whose
        semantics this parser does not yet recognize.

    COMMAND
        A `COMMAND' node represents a TeX command. It always has a child
        tree, though the tree might be empty if the command operates on
        zero parameters. An example of a command is

         \textbf{blah}

        This would parse into a `COMMAND' node for `textbf', which would
        have a subtree containing the `TEXT' node with text ``blah''.

    ENVIRONMENT
        Similarly, TeX environments parse into `ENVIRONMENT' nodes, which
        have metadata about the environment, along with a subtree
        representing what is contained in the environment. For example,

         \begin{equation}
           r = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}
         \end{equation}

        would parse into an `ENVIRONMENT' node of the class ``equation''
        with a child tree containing the result of parsing `r = \frac{-b
        \pm \sqrt{b^2 - 4ac}}{2a}'.

    GROUP
        A `GROUP' is like an anonymous `COMMAND'. Since you can put whatever
        you want in curly braces (`{}') in TeX in order to make semantically
        isolated regions, this separation is preserved by the parser. A
        `GROUP' is just the subtree of the parsed contents of plain
        curly braces.

        It is important to note that currently only the first `GROUP' in a
        series of `GROUP's following a LaTeX command will actually be parsed
        into a `COMMAND' node. The reason is that, for the initial purposes
        of this module, it was not necessary to recognize additional
        `GROUP's as additional parameters to the `COMMAND'. However, this is
        something that this module really should do eventually. Currently if
        you want all the parameters to a multi-parameter command, you'll
        need to pick out all the following `GROUP' nodes yourself.

        Eventually the parameters will become something like a list which is
        stored in the `COMMAND' node, much like XML::DOM's treatment of
        attributes. These are, in a sense, apart from the rest of the
        document tree. Then `GROUP' nodes will become much more rare.

    COMMENT
        A `COMMENT' node is very similar to a `TEXT' node, except it is
        specifically for lines beginning with `%' (the TeX comment
        delimiter) or the right-hand portion of a line that has `%' at
        some internal point.
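
    The `GROUP' behaviour described above can be sketched as follows:
    the first parameter lives in the `COMMAND' node's child tree, and the
    remaining ones are picked up from the sibling `GROUP' nodes with
    getNextGroupNode (see METHODS). `\mycmd' is a hypothetical command
    invented for this example, and the sketch assumes every parameter
    contains a single `TEXT' node.

```perl
use strict;
use warnings;
use LaTeX::TOM;

# \mycmd is a hypothetical three-parameter command, used only for illustration.
# Error handling 2 keeps the parser silent about anything it cannot parse.
my $document = LaTeX::TOM->new(2)->parse('\mycmd{one}{two}{three} trailing text');

my ($command) = @{ $document->getCommandNodesByName('mycmd') };

# The first parameter is the COMMAND node's own child tree.
my @params = ($command->getFirstChild->getNodeText);

# The remaining parameters are sibling GROUP nodes; getNextGroupNode
# yields undef once no further GROUP follows.
my $group = $command->getNextGroupNode;
while (defined $group) {
    push @params, $group->getFirstChild->getNodeText;
    $group = $group->getNextGroupNode;
}

print join(', ', @params), "\n";
```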

  LaTeX::TOM::Tree
    As mentioned before, the Tree is the return result of a parse.

    The tree is nothing more than an arrayref of Nodes, some of which may
    contain their own trees. This is useful knowledge at this point, since
    the user isn't provided with a full suite of convenient
    tree-modification methods. However, Trees do already have some very
    convenient methods, described in the next section.

METHODS
    In this section all of the methods for each of the components are listed
    and described.

  LaTeX::TOM
    new
        Instantiate a new parser object. The parameters it accepts are
        described under COMPONENTS / `LaTeX::TOM::Parser'.

  LaTeX::TOM::Parser
    The methods for the parser (aside from the constructor, discussed above)
    are:

    parseFile (filename)
        Read in the contents of *filename* and parse them, returning a
        `LaTeX::TOM::Tree'.

    parse (string)
        Parse the string *string* and return a `LaTeX::TOM::Tree'.
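
    A small sketch of both entry points, assuming nothing beyond the two
    methods listed above. The temporary file (created with the core
    File::Temp module) and the sample source are made up for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use LaTeX::TOM;

# Sample LaTeX source, invented for this sketch.
my $source = "\\section{Intro}\nHello, \\textbf{world}.\n";

# Write the sample source to a temporary .tex file so parseFile has input.
my ($fh, $filename) = tempfile(SUFFIX => '.tex', UNLINK => 1);
print {$fh} $source;
close $fh or die "close failed: $!";

# parseFile reads from disk, parse works on an in-memory string.
my $from_file   = LaTeX::TOM->new->parseFile($filename);
my $from_string = LaTeX::TOM->new->parse($source);

# Both calls return a LaTeX::TOM::Tree; print gives a debug dump of it.
$from_file->print;
```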

  LaTeX::TOM::Tree
    This section contains methods for the Trees returned by the parser.

    copy
        Duplicate a tree into new memory.

    print
        A debug print of the structure of the tree.

    plainText
        Returns an arrayref which is a list of strings representing the text
        of all `TEXT' nodes whose `getNodePlainTextFlag' is `1', in an
        inorder traversal.

    indexableText
        A method like the above but which goes one step further; it cleans
        all of the returned text and concatenates it into a single string
        which one could consider to hold all of the standard
        information-retrieval value of the document, making it useful for
        indexing.

    toLaTeX
        Return a string representing the LaTeX encoded by the tree. This is
        especially useful to get a normal document again, after modifying
        nodes of the tree.

    getTopLevelNodes
        Return a list of `LaTeX::TOM::Node' objects at the top level of the
        Tree.

    getAllNodes
        Return an arrayref with all nodes of the tree. This "flattens" the
        tree.

    getCommandNodesByName (name)
        Return an arrayref with all `COMMAND' nodes in the tree which have a
        name matching *name*.

    getEnvironmentsByName (name)
        Return an arrayref with all `ENVIRONMENT' nodes in the tree which
        have a class matching *name*.

    getNodesByCondition (code reference)
        This is a catch-all search method which can be used to pull out
        nodes that match pretty much any perl expression, without manually
        having to traverse the tree. *code reference* is a perl code
        reference which receives as its first argument the node of the tree
        that is currently scrutinized and is expected to return a boolean
        value. See the SYNOPSIS for examples.

    getFirstNode
        Returns the first node of the tree. This is useful if you want to
        walk the tree yourself, starting with the first node.
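
    The search and extraction methods above might be combined as in the
    following sketch. The sample document is invented, and the sketch
    assumes each `\section' title is a single `TEXT' node.

```perl
use strict;
use warnings;
use LaTeX::TOM;

# Sample document, made up for this sketch.
my $document = LaTeX::TOM->new(2)->parse(<<'TEX');
\section{Introduction}
Some introductory text.
\begin{equation}
  E = mc^2
\end{equation}
\section{Results}
Some results.
TEX

# Both lookups return arrayrefs of nodes.
my $sections  = $document->getCommandNodesByName('section');
my $equations = $document->getEnvironmentsByName('equation');

for my $section (@$sections) {
    # The section title is the first node of the COMMAND's child tree.
    print "section: ", $section->getFirstChild->getNodeText, "\n";
}
print "equation environments: ", scalar(@$equations), "\n";

my $plain = $document->plainText;      # arrayref of strings
my $index = $document->indexableText;  # one cleaned, concatenated string
```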

  LaTeX::TOM::Node
    This section contains the methods for nodes of the parsed Trees.

    getNodeType
        Returns the type, one of `TEXT', `COMMAND', `ENVIRONMENT', `GROUP',
        or `COMMENT', as described above.

    getNodeText
        Applicable for `TEXT' or `COMMENT' nodes; this returns the document
        text they contain. This is undef for other node types.

    setNodeText
        Set the node text, also for `TEXT' and `COMMENT' nodes.

    getNodeStartingPosition
        Get the starting character position in the document of this node.
        For `TEXT' and `COMMENT' nodes, this will be where the text begins.
        For `ENVIRONMENT', `COMMAND', or `GROUP' nodes, this will be the
        position of the *last* character of the opening identifier.

    getNodeEndingPosition
        Same as above, but for the last character. For `GROUP',
        `ENVIRONMENT', or `COMMAND' nodes, this will be the *first*
        character of the closing identifier.

    getNodeOuterStartingPosition
        Same as getNodeStartingPosition, but for `GROUP', `ENVIRONMENT', or
        `COMMAND' nodes, this returns the *first* character of the opening
        identifier.

    getNodeOuterEndingPosition
        Same as getNodeEndingPosition, but for `GROUP', `ENVIRONMENT', or
        `COMMAND' nodes, this returns the *last* character of the closing
        identifier.

    getNodeMathFlag
        This applies to any node type. It is `1' if the node sets, or is
        contained within, a math mode region, and `0' otherwise. `TEXT'
        nodes which have this flag as `1' can be assumed to be the actual
        mathematics contained in the document.

    getNodePlainTextFlag
        This applies only to `TEXT' nodes. It is `1' if the node is non-math
        and is visible (in other words, will end up being a part of the
        output document). One would only want to index `TEXT' nodes with
        this property, for information retrieval purposes.

    getEnvironmentClass
        This applies only to `ENVIRONMENT' nodes. Returns what class of
        environment the node represents (the `X' in `\begin{X}' and
        `\end{X}').

    getCommandName
        This applies only to `COMMAND' nodes. Returns the name of the
        command (the `X' in `\X{...}').

    getChildTree
        This applies only to `COMMAND', `ENVIRONMENT', and `GROUP' nodes: it
        returns the `LaTeX::TOM::Tree' which is ``under'' the calling node.

    getFirstChild
        This applies only to `COMMAND', `ENVIRONMENT', and `GROUP' nodes: it
        returns the first node from the first level of the child subtree.

    getLastChild
        Same as above, but for the last node of the first level.

    getPreviousSibling
        Return the prior node on the same level of the tree.

    getNextSibling
        Same as above, but for the following node.

    getParent
        Get the parent node of this node in the tree.

    getNextGroupNode
        This is an interesting function, and kind of a hack because of the
        way the parser makes the current tree. It returns the next sibling
        that is a `GROUP' node. The search stops, returning `undef', when it
        hits the end of the tree level, a `TEXT' node which doesn't match
        `/^\s*$/', or a `COMMAND' node.

        This is useful for finding all `GROUP'ed parameters after a
        `COMMAND' node (see comments for `GROUP' in the `COMPONENTS' /
        `LaTeX::TOM::Node' section). You can just have a while loop that
        calls this method until it gets `undef', and you'll know you've
        found all the parameters to a command.

        Note: this may be bad, but `TEXT' nodes matching `/^\s*\[[0-9]+\]$/'
        (optional parameter groups) are treated as if they were blank.
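
    As an illustration of the node flags, the sketch below separates the
    mathematics from the remaining text using getNodeMathFlag. The sample
    document is invented; getNodePlainTextFlag could be tested in the same
    loop to restrict the second list to visible text.

```perl
use strict;
use warnings;
use LaTeX::TOM;

# Sample document, made up for this sketch.
my $document = LaTeX::TOM->new(2)->parse(<<'TEX');
Mass and energy are related by
\begin{equation}
  E = mc^2
\end{equation}
as is well known.
TEX

my (@math, @other);
for my $node (@{ $document->getAllNodes }) {
    # Only TEXT nodes carry document text; skip structural nodes.
    next unless $node->getNodeType eq 'TEXT';

    if ($node->getNodeMathFlag) {
        push @math, $node->getNodeText;    # text inside a math region
    }
    else {
        push @other, $node->getNodeText;   # everything else
    }
}

print "math:  $_\n" for @math;
print "other: $_\n" for @other;
```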

CAVEATS
    Due to the lack of tree-modification methods, currently this module is
    mostly useful for minor modifications to the parsed document, for
    instance, altering the text of `TEXT' nodes but not deleting the nodes.
    Of course, the user can still make structural changes by breaking
    abstraction and directly modifying the Tree.
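
    The kind of minor modification meant here might look like the
    following sketch, which rewrites the text of `TEXT' nodes in place and
    serializes the result. The sample input and the substitution are made
    up for illustration.

```perl
use strict;
use warnings;
use LaTeX::TOM;

# Sample input, invented for this sketch.
my $document = LaTeX::TOM->new(2)->parse('The colour of \textbf{the colour wheel}.');

for my $node (@{ $document->getAllNodes }) {
    # Only TEXT nodes are edited; no node is added or removed.
    next unless $node->getNodeType eq 'TEXT';

    (my $text = $node->getNodeText) =~ s/colour/color/g;
    $node->setNodeText($text);
}

# Turn the modified tree back into LaTeX source.
my $latex = $document->toLaTeX;
print $latex, "\n";
```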

    Also note that the parsing is not complete. This module was not written
    with the intention of being able to produce output documents the way
    ``latex'' does. The intent was instead to be able to analyze and modify
    the document on a logical level with regard to the content; it doesn't
    care about the document formatting and outputting side of TeX/LaTeX.

    There is much work still to be done. See the TODO list in the TOM.pm
    source.

BUGS
    Probably plenty. However, this module has performed fairly well on a set
    of ~1000 research publications from the Computing Research Repository,
    so I deemed it ``good enough'' to use for purposes similar to mine.

    Please let the maintainer know of parser errors if you discover any.

CREDITS
    Thanks to the following people (in order of appearance), who have
    contributed valuable suggestions and patches:

     Otakar Smrz
     Moritz Lenz
     James Bowlin
     Jesse S. Bangs

AUTHORS
    Written by Aaron Krowne <akrowne@vt.edu>

    Maintained by Steven Schubiger <schubiger@cpan.org>

LICENSE
    This program is free software; you may redistribute it and/or modify it
    under the same terms as Perl itself.

    See http://dev.perl.org/licenses/