1===============================================================
2Tutorial for building tools using LibTooling and LibASTMatchers
3===============================================================
4
5This document is intended to show how to build a useful source-to-source
6translation tool based on Clang's `LibTooling <LibTooling.html>`_. It is
7explicitly aimed at people who are new to Clang, so all you should need
8is a working knowledge of C++ and the command line.
9
10In order to work on the compiler, you need some basic knowledge of the
11abstract syntax tree (AST). To this end, the reader is incouraged to
12skim the :doc:`Introduction to the Clang
13AST <IntroductionToTheClangAST>`
14
15Step 0: Obtaining Clang
16=======================
17
18As Clang is part of the LLVM project, you'll need to download LLVM's
19source code first. Both Clang and LLVM are maintained as Subversion
20repositories, but we'll be accessing them through the git mirror. For
21further information, see the `getting started
22guide <http://llvm.org/docs/GettingStarted.html>`_.
23
24.. code-block:: console
25
26      mkdir ~/clang-llvm && cd ~/clang-llvm
27      git clone http://llvm.org/git/llvm.git
28      cd llvm/tools
29      git clone http://llvm.org/git/clang.git
30      cd clang/tools
31      git clone http://llvm.org/git/clang-tools-extra.git extra
32
33Next you need to obtain the CMake build system and Ninja build tool. You
34may already have CMake installed, but current binary versions of CMake
35aren't built with Ninja support.
36
37.. code-block:: console
38
39      cd ~/clang-llvm
40      git clone https://github.com/martine/ninja.git
41      cd ninja
42      git checkout release
43      ./bootstrap.py
44      sudo cp ninja /usr/bin/
45
46      cd ~/clang-llvm
47      git clone git://cmake.org/stage/cmake.git
48      cd cmake
49      git checkout next
50      ./bootstrap
51      make
52      sudo make install
53
54Okay. Now we'll build Clang!
55
56.. code-block:: console
57
58      cd ~/clang-llvm
59      mkdir build && cd build
60      cmake -G Ninja ../llvm -DLLVM_BUILD_TESTS=ON  # Enable tests; default is off.
61      ninja
62      ninja check       # Test LLVM only.
63      ninja clang-test  # Test Clang only.
64      ninja install
65
66And we're live.
67
68All of the tests should pass, though there is a (very) small chance that
69you can catch LLVM and Clang out of sync. Running ``'git svn rebase'``
70in both the llvm and clang directories should fix any problems.
71
72Finally, we want to set Clang as its own compiler.
73
74.. code-block:: console
75
76      cd ~/clang-llvm/build
77      ccmake ../llvm
78
79The second command will bring up a GUI for configuring Clang. You need
80to set the entry for ``CMAKE_CXX_COMPILER``. Press ``'t'`` to turn on
81advanced mode. Scroll down to ``CMAKE_CXX_COMPILER``, and set it to
82``/usr/bin/clang++``, or wherever you installed it. Press ``'c'`` to
83configure, then ``'g'`` to generate CMake's files.
84
85Finally, run ninja one last time, and you're done.
86
87Step 1: Create a ClangTool
88==========================
89
90Now that we have enough background knowledge, it's time to create the
91simplest productive ClangTool in existence: a syntax checker. While this
92already exists as ``clang-check``, it's important to understand what's
93going on.
94
95First, we'll need to create a new directory for our tool and tell CMake
96that it exists. As this is not going to be a core clang tool, it will
97live in the ``tools/extra`` repository.
98
99.. code-block:: console
100
101      cd ~/clang-llvm/llvm/tools/clang
102      mkdir tools/extra/loop-convert
103      echo 'add_subdirectory(loop-convert)' >> tools/extra/CMakeLists.txt
104      vim tools/extra/loop-convert/CMakeLists.txt
105
106CMakeLists.txt should have the following contents:
107
108::
109
110      set(LLVM_LINK_COMPONENTS support)
111      set(LLVM_USED_LIBS clangTooling clangBasic clangAST)
112
113      add_clang_executable(loop-convert
114        LoopConvert.cpp
115        )
116      target_link_libraries(loop-convert
117        clangTooling
118        clangBasic
119        clangASTMatchers
120        )
121
122With that done, Ninja will be able to compile our tool. Let's give it
123something to compile! Put the following into
124``tools/extra/loop-convert/LoopConvert.cpp``. A detailed explanation of
125why the different parts are needed can be found in the `LibTooling
126documentation <LibTooling.html>`_.
127
128.. code-block:: c++
129
130      // Declares clang::SyntaxOnlyAction.
131      #include "clang/Frontend/FrontendActions.h"
132      #include "clang/Tooling/CommonOptionsParser.h"
133      #include "clang/Tooling/Tooling.h"
134      // Declares llvm::cl::extrahelp.
135      #include "llvm/Support/CommandLine.h"
136
137      using namespace clang::tooling;
138      using namespace llvm;
139
140      // CommonOptionsParser declares HelpMessage with a description of the common
141      // command-line options related to the compilation database and input files.
142      // It's nice to have this help message in all tools.
143      static cl::extrahelp CommonHelp(CommonOptionsParser::HelpMessage);
144
145      // A help message for this specific tool can be added afterwards.
146      static cl::extrahelp MoreHelp("\nMore help text...");
147
148      int main(int argc, const char **argv) {
149        CommonOptionsParser OptionsParser(argc, argv);
150        ClangTool Tool(OptionsParser.getCompilations(),
151                       OptionsParser.getSourcePathList());
152        return Tool.run(newFrontendActionFactory<clang::SyntaxOnlyAction>());
153      }
154
155And that's it! You can compile our new tool by running ninja from the
156``build`` directory.
157
158.. code-block:: console
159
160      cd ~/clang-llvm/build
161      ninja
162
163You should now be able to run the syntax checker, which is located in
164``~/clang-llvm/build/bin``, on any source file. Try it!
165
166.. code-block:: console
167
168      cat "int main() { return 0; }" > test.cpp
169      bin/loop-convert test.cpp --
170
171Note the two dashes after we specify the source file. The additional
172options for the compiler are passed after the dashes rather than loading
173them from a compilation database - there just aren't any options needed
174right now.
175
176Intermezzo: Learn AST matcher basics
177====================================
178
179Clang recently introduced the :doc:`ASTMatcher
180library <LibASTMatchers>` to provide a simple, powerful, and
181concise way to describe specific patterns in the AST. Implemented as a
182DSL powered by macros and templates (see
183`ASTMatchers.h <../doxygen/ASTMatchers_8h_source.html>`_ if you're
184curious), matchers offer the feel of algebraic data types common to
185functional programming languages.
186
187For example, suppose you wanted to examine only binary operators. There
188is a matcher to do exactly that, conveniently named ``binaryOperator``.
189I'll give you one guess what this matcher does:
190
191.. code-block:: c++
192
193      binaryOperator(hasOperatorName("+"), hasLHS(integerLiteral(equals(0))))
194
195Shockingly, it will match against addition expressions whose left hand
196side is exactly the literal 0. It will not match against other forms of
1970, such as ``'\0'`` or ``NULL``, but it will match against macros that
198expand to 0. The matcher will also not match against calls to the
199overloaded operator ``'+'``, as there is a separate ``operatorCallExpr``
200matcher to handle overloaded operators.
201
202There are AST matchers to match all the different nodes of the AST,
203narrowing matchers to only match AST nodes fulfilling specific criteria,
204and traversal matchers to get from one kind of AST node to another. For
205a complete list of AST matchers, take a look at the `AST Matcher
206References <LibASTMatchersReference.html>`_
207
208All matcher that are nouns describe entities in the AST and can be
209bound, so that they can be referred to whenever a match is found. To do
210so, simply call the method ``bind`` on these matchers, e.g.:
211
212.. code-block:: c++
213
214      variable(hasType(isInteger())).bind("intvar")
215
216Step 2: Using AST matchers
217==========================
218
219Okay, on to using matchers for real. Let's start by defining a matcher
220which will capture all ``for`` statements that define a new variable
221initialized to zero. Let's start with matching all ``for`` loops:
222
223.. code-block:: c++
224
225      forStmt()
226
227Next, we want to specify that a single variable is declared in the first
228portion of the loop, so we can extend the matcher to
229
230.. code-block:: c++
231
232      forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl()))))
233
234Finally, we can add the condition that the variable is initialized to
235zero.
236
237.. code-block:: c++
238
239      forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl(
240        hasInitializer(integerLiteral(equals(0))))))))
241
242It is fairly easy to read and understand the matcher definition ("match
243loops whose init portion declares a single variable which is initialized
244to the integer literal 0"), but deciding that every piece is necessary
245is more difficult. Note that this matcher will not match loops whose
246variables are initialized to ``'\0'``, ``0.0``, ``NULL``, or any form of
247zero besides the integer 0.
248
249The last step is giving the matcher a name and binding the ``ForStmt``
250as we will want to do something with it:
251
252.. code-block:: c++
253
254      StatementMatcher LoopMatcher =
255        forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl(
256          hasInitializer(integerLiteral(equals(0)))))))).bind("forLoop");
257
258Once you have defined your matchers, you will need to add a little more
259scaffolding in order to run them. Matchers are paired with a
260``MatchCallback`` and registered with a ``MatchFinder`` object, then run
261from a ``ClangTool``. More code!
262
263Add the following to ``LoopConvert.cpp``:
264
265.. code-block:: c++
266
267      #include "clang/ASTMatchers/ASTMatchers.h"
268      #include "clang/ASTMatchers/ASTMatchFinder.h"
269
270      using namespace clang;
271      using namespace clang::ast_matchers;
272
273      StatementMatcher LoopMatcher =
274        forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl(
275          hasInitializer(integerLiteral(equals(0)))))))).bind("forLoop");
276
277      class LoopPrinter : public MatchFinder::MatchCallback {
278      public :
279        virtual void run(const MatchFinder::MatchResult &Result) {
280          if (const ForStmt *FS = Result.Nodes.getNodeAs<clang::ForStmt>("forLoop"))
281            FS->dump();
282        }
283      };
284
285And change ``main()`` to:
286
287.. code-block:: c++
288
289      int main(int argc, const char **argv) {
290        CommonOptionsParser OptionsParser(argc, argv);
291        ClangTool Tool(OptionsParser.getCompilations(),
292                       OptionsParser.getSourcePathList());
293
294        LoopPrinter Printer;
295        MatchFinder Finder;
296        Finder.addMatcher(LoopMatcher, &Printer);
297
298        return Tool.run(newFrontendActionFactory(&Finder));
299      }
300
301Now, you should be able to recompile and run the code to discover for
302loops. Create a new file with a few examples, and test out our new
303handiwork:
304
305.. code-block:: console
306
307      cd ~/clang-llvm/llvm/llvm_build/
308      ninja loop-convert
309      vim ~/test-files/simple-loops.cc
310      bin/loop-convert ~/test-files/simple-loops.cc
311
312Step 3.5: More Complicated Matchers
313===================================
314
315Our simple matcher is capable of discovering for loops, but we would
316still need to filter out many more ourselves. We can do a good portion
317of the remaining work with some cleverly chosen matchers, but first we
318need to decide exactly which properties we want to allow.
319
320How can we characterize for loops over arrays which would be eligible
321for translation to range-based syntax? Range based loops over arrays of
322size ``N`` that:
323
324-  start at index ``0``
325-  iterate consecutively
326-  end at index ``N-1``
327
328We already check for (1), so all we need to add is a check to the loop's
329condition to ensure that the loop's index variable is compared against
330``N`` and another check to ensure that the increment step just
331increments this same variable. The matcher for (2) is straightforward:
332require a pre- or post-increment of the same variable declared in the
333init portion.
334
335Unfortunately, such a matcher is impossible to write. Matchers contain
336no logic for comparing two arbitrary AST nodes and determining whether
337or not they are equal, so the best we can do is matching more than we
338would like to allow, and punting extra comparisons to the callback.
339
340In any case, we can start building this sub-matcher. We can require that
341the increment step be a unary increment like this:
342
343.. code-block:: c++
344
345      hasIncrement(unaryOperator(hasOperatorName("++")))
346
347Specifying what is incremented introduces another quirk of Clang's AST:
348Usages of variables are represented as ``DeclRefExpr``'s ("declaration
349reference expressions") because they are expressions which refer to
350variable declarations. To find a ``unaryOperator`` that refers to a
351specific declaration, we can simply add a second condition to it:
352
353.. code-block:: c++
354
355      hasIncrement(unaryOperator(
356        hasOperatorName("++"),
357        hasUnaryOperand(declRefExpr())))
358
359Furthermore, we can restrict our matcher to only match if the
360incremented variable is an integer:
361
362.. code-block:: c++
363
364      hasIncrement(unaryOperator(
365        hasOperatorName("++"),
366        hasUnaryOperand(declRefExpr(to(varDecl(hasType(isInteger())))))))
367
368And the last step will be to attach an identifier to this variable, so
369that we can retrieve it in the callback:
370
371.. code-block:: c++
372
373      hasIncrement(unaryOperator(
374        hasOperatorName("++"),
375        hasUnaryOperand(declRefExpr(to(
376          varDecl(hasType(isInteger())).bind("incrementVariable"))))))
377
378We can add this code to the definition of ``LoopMatcher`` and make sure
379that our program, outfitted with the new matcher, only prints out loops
380that declare a single variable initialized to zero and have an increment
381step consisting of a unary increment of some variable.
382
383Now, we just need to add a matcher to check if the condition part of the
384``for`` loop compares a variable against the size of the array. There is
385only one problem - we don't know which array we're iterating over
386without looking at the body of the loop! We are again restricted to
387approximating the result we want with matchers, filling in the details
388in the callback. So we start with:
389
390.. code-block:: c++
391
392      hasCondition(binaryOperator(hasOperatorName("<"))
393
394It makes sense to ensure that the left-hand side is a reference to a
395variable, and that the right-hand side has integer type.
396
397.. code-block:: c++
398
399      hasCondition(binaryOperator(
400        hasOperatorName("<"),
401        hasLHS(declRefExpr(to(varDecl(hasType(isInteger()))))),
402        hasRHS(expr(hasType(isInteger())))))
403
404Why? Because it doesn't work. Of the three loops provided in
405``test-files/simple.cpp``, zero of them have a matching condition. A
406quick look at the AST dump of the first for loop, produced by the
407previous iteration of loop-convert, shows us the answer:
408
409::
410
411      (ForStmt 0x173b240
412        (DeclStmt 0x173afc8
413          0x173af50 "int i =
414            (IntegerLiteral 0x173afa8 'int' 0)")
415        <<>>
416        (BinaryOperator 0x173b060 '_Bool' '<'
417          (ImplicitCastExpr 0x173b030 'int'
418            (DeclRefExpr 0x173afe0 'int' lvalue Var 0x173af50 'i' 'int'))
419          (ImplicitCastExpr 0x173b048 'int'
420            (DeclRefExpr 0x173b008 'const int' lvalue Var 0x170fa80 'N' 'const int')))
421        (UnaryOperator 0x173b0b0 'int' lvalue prefix '++'
422          (DeclRefExpr 0x173b088 'int' lvalue Var 0x173af50 'i' 'int'))
423        (CompoundStatement ...
424
425We already know that the declaration and increments both match, or this
426loop wouldn't have been dumped. The culprit lies in the implicit cast
427applied to the first operand (i.e. the LHS) of the less-than operator,
428an L-value to R-value conversion applied to the expression referencing
429``i``. Thankfully, the matcher library offers a solution to this problem
430in the form of ``ignoringParenImpCasts``, which instructs the matcher to
431ignore implicit casts and parentheses before continuing to match.
432Adjusting the condition operator will restore the desired match.
433
434.. code-block:: c++
435
436      hasCondition(binaryOperator(
437        hasOperatorName("<"),
438        hasLHS(ignoringParenImpCasts(declRefExpr(
439          to(varDecl(hasType(isInteger())))))),
440        hasRHS(expr(hasType(isInteger())))))
441
442After adding binds to the expressions we wished to capture and
443extracting the identifier strings into variables, we have array-step-2
444completed.
445
446Step 4: Retrieving Matched Nodes
447================================
448
449So far, the matcher callback isn't very interesting: it just dumps the
450loop's AST. At some point, we will need to make changes to the input
451source code. Next, we'll work on using the nodes we bound in the
452previous step.
453
454The ``MatchFinder::run()`` callback takes a
455``MatchFinder::MatchResult&`` as its parameter. We're most interested in
456its ``Context`` and ``Nodes`` members. Clang uses the ``ASTContext``
457class to represent contextual information about the AST, as the name
458implies, though the most functionally important detail is that several
459operations require an ``ASTContext*`` parameter. More immediately useful
460is the set of matched nodes, and how we retrieve them.
461
462Since we bind three variables (identified by ConditionVarName,
463InitVarName, and IncrementVarName), we can obtain the matched nodes by
464using the ``getNodeAs()`` member function.
465
466In ``LoopConvert.cpp`` add
467
468.. code-block:: c++
469
470      #include "clang/AST/ASTContext.h"
471
472Change ``LoopMatcher`` to
473
474.. code-block:: c++
475
476      StatementMatcher LoopMatcher =
477          forStmt(hasLoopInit(declStmt(
478                      hasSingleDecl(varDecl(hasInitializer(integerLiteral(equals(0))))
479                                        .bind("initVarName")))),
480                  hasIncrement(unaryOperator(
481                      hasOperatorName("++"),
482                      hasUnaryOperand(declRefExpr(
483                          to(varDecl(hasType(isInteger())).bind("incVarName")))))),
484                  hasCondition(binaryOperator(
485                      hasOperatorName("<"),
486                      hasLHS(ignoringParenImpCasts(declRefExpr(
487                          to(varDecl(hasType(isInteger())).bind("condVarName"))))),
488                      hasRHS(expr(hasType(isInteger())))))).bind("forLoop");
489
490And change ``LoopPrinter::run`` to
491
492.. code-block:: c++
493
494      void LoopPrinter::run(const MatchFinder::MatchResult &Result) {
495        ASTContext *Context = Result.Context;
496        const ForStmt *FS = Result.Nodes.getStmtAs<ForStmt>("forLoop");
497        // We do not want to convert header files!
498        if (!FS || !Context->getSourceManager().isFromMainFile(FS->getForLoc()))
499          return;
500        const VarDecl *IncVar = Result.Nodes.getNodeAs<VarDecl>("incVarName");
501        const VarDecl *CondVar = Result.Nodes.getNodeAs<VarDecl>("condVarName");
502        const VarDecl *InitVar = Result.Nodes.getNodeAs<VarDecl>("initVarName");
503
504        if (!areSameVariable(IncVar, CondVar) || !areSameVariable(IncVar, InitVar))
505          return;
506        llvm::outs() << "Potential array-based loop discovered.\n";
507      }
508
509Clang associates a ``VarDecl`` with each variable to represent the variable's
510declaration. Since the "canonical" form of each declaration is unique by
511address, all we need to do is make sure neither ``ValueDecl`` (base class of
512``VarDecl``) is ``NULL`` and compare the canonical Decls.
513
514.. code-block:: c++
515
516      static bool areSameVariable(const ValueDecl *First, const ValueDecl *Second) {
517        return First && Second &&
518               First->getCanonicalDecl() == Second->getCanonicalDecl();
519      }
520
521If execution reaches the end of ``LoopPrinter::run()``, we know that the
522loop shell that looks like
523
524.. code-block:: c++
525
526      for (int i= 0; i < expr(); ++i) { ... }
527
528For now, we will just print a message explaining that we found a loop.
529The next section will deal with recursively traversing the AST to
530discover all changes needed.
531
532As a side note, it's not as trivial to test if two expressions are the same,
533though Clang has already done the hard work for us by providing a way to
534canonicalize expressions:
535
536.. code-block:: c++
537
538      static bool areSameExpr(ASTContext *Context, const Expr *First,
539                              const Expr *Second) {
540        if (!First || !Second)
541          return false;
542        llvm::FoldingSetNodeID FirstID, SecondID;
543        First->Profile(FirstID, *Context, true);
544        Second->Profile(SecondID, *Context, true);
545        return FirstID == SecondID;
546      }
547
548This code relies on the comparison between two
549``llvm::FoldingSetNodeIDs``. As the documentation for
550``Stmt::Profile()`` indicates, the ``Profile()`` member function builds
551a description of a node in the AST, based on its properties, along with
552those of its children. ``FoldingSetNodeID`` then serves as a hash we can
553use to compare expressions. We will need ``areSameExpr`` later. Before
554you run the new code on the additional loops added to
555test-files/simple.cpp, try to figure out which ones will be considered
556potentially convertible.
557