1===============================================================
2Tutorial for building tools using LibTooling and LibASTMatchers
3===============================================================
4
5This document is intended to show how to build a useful source-to-source
6translation tool based on Clang's `LibTooling <LibTooling.html>`_. It is
7explicitly aimed at people who are new to Clang, so all you should need
8is a working knowledge of C++ and the command line.
9
10In order to work on the compiler, you need some basic knowledge of the
11abstract syntax tree (AST). To this end, the reader is encouraged to
12skim the :doc:`Introduction to the Clang
13AST <IntroductionToTheClangAST>`
14
15Step 0: Obtaining Clang
16=======================
17
18As Clang is part of the LLVM project, you'll need to download LLVM's
19source code first. Both Clang and LLVM are in the same git repository,
20under different directories. For further information, see the `getting
21started guide <https://llvm.org/docs/GettingStarted.html>`_.
22
23.. code-block:: console
24
25      cd ~/clang-llvm
26      git clone https://github.com/llvm/llvm-project.git
27
28Next you need to obtain the CMake build system and Ninja build tool.
29
30.. code-block:: console
31
32      cd ~/clang-llvm
33      git clone https://github.com/martine/ninja.git
34      cd ninja
35      git checkout release
36      ./bootstrap.py
37      sudo cp ninja /usr/bin/
38
39      cd ~/clang-llvm
40      git clone git://cmake.org/stage/cmake.git
41      cd cmake
42      git checkout next
43      ./bootstrap
44      make
45      sudo make install
46
47Okay. Now we'll build Clang!
48
49.. code-block:: console
50
51      cd ~/clang-llvm
52      mkdir build && cd build
53      cmake -G Ninja ../llvm -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra" -DLLVM_BUILD_TESTS=ON  # Enable tests; default is off.
54      ninja
55      ninja check       # Test LLVM only.
56      ninja clang-test  # Test Clang only.
57      ninja install
58
59And we're live.
60
61All of the tests should pass.
62
63Finally, we want to set Clang as its own compiler.
64
65.. code-block:: console
66
67      cd ~/clang-llvm/build
68      ccmake ../llvm
69
70The second command will bring up a GUI for configuring Clang. You need
71to set the entry for ``CMAKE_CXX_COMPILER``. Press ``'t'`` to turn on
72advanced mode. Scroll down to ``CMAKE_CXX_COMPILER``, and set it to
73``/usr/bin/clang++``, or wherever you installed it. Press ``'c'`` to
74configure, then ``'g'`` to generate CMake's files.
75
76Finally, run ninja one last time, and you're done.
77
78Step 1: Create a ClangTool
79==========================
80
81Now that we have enough background knowledge, it's time to create the
82simplest productive ClangTool in existence: a syntax checker. While this
83already exists as ``clang-check``, it's important to understand what's
84going on.
85
86First, we'll need to create a new directory for our tool and tell CMake
87that it exists. As this is not going to be a core clang tool, it will
88live in the ``clang-tools-extra`` repository.
89
90.. code-block:: console
91
92      cd ~/clang-llvm
93      mkdir clang-tools-extra/loop-convert
94      echo 'add_subdirectory(loop-convert)' >> clang-tools-extra/CMakeLists.txt
95      vim clang-tools-extra/loop-convert/CMakeLists.txt
96
97CMakeLists.txt should have the following contents:
98
99::
100
101      set(LLVM_LINK_COMPONENTS support)
102
103      add_clang_executable(loop-convert
104        LoopConvert.cpp
105        )
106      target_link_libraries(loop-convert
107        PRIVATE
108        clangAST
109        clangASTMatchers
110        clangBasic
111        clangFrontend
112        clangSerialization
113        clangTooling
114        )
115
116With that done, Ninja will be able to compile our tool. Let's give it
117something to compile! Put the following into
118``clang-tools-extra/loop-convert/LoopConvert.cpp``. A detailed explanation of
119why the different parts are needed can be found in the `LibTooling
120documentation <LibTooling.html>`_.
121
122.. code-block:: c++
123
124      // Declares clang::SyntaxOnlyAction.
125      #include "clang/Frontend/FrontendActions.h"
126      #include "clang/Tooling/CommonOptionsParser.h"
127      #include "clang/Tooling/Tooling.h"
128      // Declares llvm::cl::extrahelp.
129      #include "llvm/Support/CommandLine.h"
130
131      using namespace clang::tooling;
132      using namespace llvm;
133
134      // Apply a custom category to all command-line options so that they are the
135      // only ones displayed.
136      static llvm::cl::OptionCategory MyToolCategory("my-tool options");
137
138      // CommonOptionsParser declares HelpMessage with a description of the common
139      // command-line options related to the compilation database and input files.
140      // It's nice to have this help message in all tools.
141      static cl::extrahelp CommonHelp(CommonOptionsParser::HelpMessage);
142
143      // A help message for this specific tool can be added afterwards.
144      static cl::extrahelp MoreHelp("\nMore help text...\n");
145
146      int main(int argc, const char **argv) {
147        auto ExpectedParser = CommonOptionsParser::create(argc, argv, MyToolCategory);
148        if (!ExpectedParser) {
149          // Fail gracefully for unsupported options.
150          llvm::errs() << ExpectedParser.takeError();
151          return 1;
152        }
153        CommonOptionsParser& OptionsParser = ExpectedParser.get();
154        ClangTool Tool(OptionsParser.getCompilations(),
155                       OptionsParser.getSourcePathList());
156        return Tool.run(newFrontendActionFactory<clang::SyntaxOnlyAction>().get());
157      }
158
159And that's it! You can compile our new tool by running ninja from the
160``build`` directory.
161
162.. code-block:: console
163
164      cd ~/clang-llvm/build
165      ninja
166
167You should now be able to run the syntax checker, which is located in
168``~/clang-llvm/build/bin``, on any source file. Try it!
169
170.. code-block:: console
171
172      echo "int main() { return 0; }" > test.cpp
173      bin/loop-convert test.cpp --
174
175Note the two dashes after we specify the source file. The additional
176options for the compiler are passed after the dashes rather than loading
177them from a compilation database - there just aren't any options needed
178right now.
179
180Intermezzo: Learn AST matcher basics
181====================================
182
183Clang recently introduced the :doc:`ASTMatcher
184library <LibASTMatchers>` to provide a simple, powerful, and
185concise way to describe specific patterns in the AST. Implemented as a
186DSL powered by macros and templates (see
187`ASTMatchers.h <../doxygen/ASTMatchers_8h_source.html>`_ if you're
188curious), matchers offer the feel of algebraic data types common to
189functional programming languages.
190
191For example, suppose you wanted to examine only binary operators. There
192is a matcher to do exactly that, conveniently named ``binaryOperator``.
193I'll give you one guess what this matcher does:
194
195.. code-block:: c++
196
197      binaryOperator(hasOperatorName("+"), hasLHS(integerLiteral(equals(0))))
198
199Shockingly, it will match against addition expressions whose left hand
200side is exactly the literal 0. It will not match against other forms of
2010, such as ``'\0'`` or ``NULL``, but it will match against macros that
202expand to 0. The matcher will also not match against calls to the
203overloaded operator ``'+'``, as there is a separate ``operatorCallExpr``
204matcher to handle overloaded operators.
205
206There are AST matchers to match all the different nodes of the AST,
207narrowing matchers to only match AST nodes fulfilling specific criteria,
208and traversal matchers to get from one kind of AST node to another. For
209a complete list of AST matchers, take a look at the `AST Matcher
210References <LibASTMatchersReference.html>`_
211
212All matcher that are nouns describe entities in the AST and can be
213bound, so that they can be referred to whenever a match is found. To do
214so, simply call the method ``bind`` on these matchers, e.g.:
215
216.. code-block:: c++
217
218      variable(hasType(isInteger())).bind("intvar")
219
220Step 2: Using AST matchers
221==========================
222
223Okay, on to using matchers for real. Let's start by defining a matcher
224which will capture all ``for`` statements that define a new variable
225initialized to zero. Let's start with matching all ``for`` loops:
226
227.. code-block:: c++
228
229      forStmt()
230
231Next, we want to specify that a single variable is declared in the first
232portion of the loop, so we can extend the matcher to
233
234.. code-block:: c++
235
236      forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl()))))
237
238Finally, we can add the condition that the variable is initialized to
239zero.
240
241.. code-block:: c++
242
243      forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl(
244        hasInitializer(integerLiteral(equals(0))))))))
245
246It is fairly easy to read and understand the matcher definition ("match
247loops whose init portion declares a single variable which is initialized
248to the integer literal 0"), but deciding that every piece is necessary
249is more difficult. Note that this matcher will not match loops whose
250variables are initialized to ``'\0'``, ``0.0``, ``NULL``, or any form of
251zero besides the integer 0.
252
253The last step is giving the matcher a name and binding the ``ForStmt``
254as we will want to do something with it:
255
256.. code-block:: c++
257
258      StatementMatcher LoopMatcher =
259        forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl(
260          hasInitializer(integerLiteral(equals(0)))))))).bind("forLoop");
261
262Once you have defined your matchers, you will need to add a little more
263scaffolding in order to run them. Matchers are paired with a
264``MatchCallback`` and registered with a ``MatchFinder`` object, then run
265from a ``ClangTool``. More code!
266
267Add the following to ``LoopConvert.cpp``:
268
269.. code-block:: c++
270
271      #include "clang/ASTMatchers/ASTMatchers.h"
272      #include "clang/ASTMatchers/ASTMatchFinder.h"
273
274      using namespace clang;
275      using namespace clang::ast_matchers;
276
277      StatementMatcher LoopMatcher =
278        forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl(
279          hasInitializer(integerLiteral(equals(0)))))))).bind("forLoop");
280
281      class LoopPrinter : public MatchFinder::MatchCallback {
282      public :
283        virtual void run(const MatchFinder::MatchResult &Result) {
284          if (const ForStmt *FS = Result.Nodes.getNodeAs<clang::ForStmt>("forLoop"))
285            FS->dump();
286        }
287      };
288
289And change ``main()`` to:
290
291.. code-block:: c++
292
293      int main(int argc, const char **argv) {
294        auto ExpectedParser = CommonOptionsParser::create(argc, argv, MyToolCategory);
295        if (!ExpectedParser) {
296          // Fail gracefully for unsupported options.
297          llvm::errs() << ExpectedParser.takeError();
298          return 1;
299        }
300        CommonOptionsParser& OptionsParser = ExpectedParser.get();
301        ClangTool Tool(OptionsParser.getCompilations(),
302                       OptionsParser.getSourcePathList());
303
304        LoopPrinter Printer;
305        MatchFinder Finder;
306        Finder.addMatcher(LoopMatcher, &Printer);
307
308        return Tool.run(newFrontendActionFactory(&Finder).get());
309      }
310
311Now, you should be able to recompile and run the code to discover for
312loops. Create a new file with a few examples, and test out our new
313handiwork:
314
315.. code-block:: console
316
317      cd ~/clang-llvm/llvm/llvm_build/
318      ninja loop-convert
319      vim ~/test-files/simple-loops.cc
320      bin/loop-convert ~/test-files/simple-loops.cc
321
322Step 3.5: More Complicated Matchers
323===================================
324
325Our simple matcher is capable of discovering for loops, but we would
326still need to filter out many more ourselves. We can do a good portion
327of the remaining work with some cleverly chosen matchers, but first we
328need to decide exactly which properties we want to allow.
329
330How can we characterize for loops over arrays which would be eligible
331for translation to range-based syntax? Range based loops over arrays of
332size ``N`` that:
333
334-  start at index ``0``
335-  iterate consecutively
336-  end at index ``N-1``
337
338We already check for (1), so all we need to add is a check to the loop's
339condition to ensure that the loop's index variable is compared against
340``N`` and another check to ensure that the increment step just
341increments this same variable. The matcher for (2) is straightforward:
342require a pre- or post-increment of the same variable declared in the
343init portion.
344
345Unfortunately, such a matcher is impossible to write. Matchers contain
346no logic for comparing two arbitrary AST nodes and determining whether
347or not they are equal, so the best we can do is matching more than we
348would like to allow, and punting extra comparisons to the callback.
349
350In any case, we can start building this sub-matcher. We can require that
351the increment step be a unary increment like this:
352
353.. code-block:: c++
354
355      hasIncrement(unaryOperator(hasOperatorName("++")))
356
357Specifying what is incremented introduces another quirk of Clang's AST:
358Usages of variables are represented as ``DeclRefExpr``'s ("declaration
359reference expressions") because they are expressions which refer to
360variable declarations. To find a ``unaryOperator`` that refers to a
361specific declaration, we can simply add a second condition to it:
362
363.. code-block:: c++
364
365      hasIncrement(unaryOperator(
366        hasOperatorName("++"),
367        hasUnaryOperand(declRefExpr())))
368
369Furthermore, we can restrict our matcher to only match if the
370incremented variable is an integer:
371
372.. code-block:: c++
373
374      hasIncrement(unaryOperator(
375        hasOperatorName("++"),
376        hasUnaryOperand(declRefExpr(to(varDecl(hasType(isInteger())))))))
377
378And the last step will be to attach an identifier to this variable, so
379that we can retrieve it in the callback:
380
381.. code-block:: c++
382
383      hasIncrement(unaryOperator(
384        hasOperatorName("++"),
385        hasUnaryOperand(declRefExpr(to(
386          varDecl(hasType(isInteger())).bind("incrementVariable"))))))
387
388We can add this code to the definition of ``LoopMatcher`` and make sure
389that our program, outfitted with the new matcher, only prints out loops
390that declare a single variable initialized to zero and have an increment
391step consisting of a unary increment of some variable.
392
393Now, we just need to add a matcher to check if the condition part of the
394``for`` loop compares a variable against the size of the array. There is
395only one problem - we don't know which array we're iterating over
396without looking at the body of the loop! We are again restricted to
397approximating the result we want with matchers, filling in the details
398in the callback. So we start with:
399
400.. code-block:: c++
401
402      hasCondition(binaryOperator(hasOperatorName("<"))
403
404It makes sense to ensure that the left-hand side is a reference to a
405variable, and that the right-hand side has integer type.
406
407.. code-block:: c++
408
409      hasCondition(binaryOperator(
410        hasOperatorName("<"),
411        hasLHS(declRefExpr(to(varDecl(hasType(isInteger()))))),
412        hasRHS(expr(hasType(isInteger())))))
413
414Why? Because it doesn't work. Of the three loops provided in
415``test-files/simple.cpp``, zero of them have a matching condition. A
416quick look at the AST dump of the first for loop, produced by the
417previous iteration of loop-convert, shows us the answer:
418
419::
420
421      (ForStmt 0x173b240
422        (DeclStmt 0x173afc8
423          0x173af50 "int i =
424            (IntegerLiteral 0x173afa8 'int' 0)")
425        <<>>
426        (BinaryOperator 0x173b060 '_Bool' '<'
427          (ImplicitCastExpr 0x173b030 'int'
428            (DeclRefExpr 0x173afe0 'int' lvalue Var 0x173af50 'i' 'int'))
429          (ImplicitCastExpr 0x173b048 'int'
430            (DeclRefExpr 0x173b008 'const int' lvalue Var 0x170fa80 'N' 'const int')))
431        (UnaryOperator 0x173b0b0 'int' lvalue prefix '++'
432          (DeclRefExpr 0x173b088 'int' lvalue Var 0x173af50 'i' 'int'))
433        (CompoundStatement ...
434
435We already know that the declaration and increments both match, or this
436loop wouldn't have been dumped. The culprit lies in the implicit cast
437applied to the first operand (i.e. the LHS) of the less-than operator,
438an L-value to R-value conversion applied to the expression referencing
439``i``. Thankfully, the matcher library offers a solution to this problem
440in the form of ``ignoringParenImpCasts``, which instructs the matcher to
441ignore implicit casts and parentheses before continuing to match.
442Adjusting the condition operator will restore the desired match.
443
444.. code-block:: c++
445
446      hasCondition(binaryOperator(
447        hasOperatorName("<"),
448        hasLHS(ignoringParenImpCasts(declRefExpr(
449          to(varDecl(hasType(isInteger())))))),
450        hasRHS(expr(hasType(isInteger())))))
451
452After adding binds to the expressions we wished to capture and
453extracting the identifier strings into variables, we have array-step-2
454completed.
455
456Step 4: Retrieving Matched Nodes
457================================
458
459So far, the matcher callback isn't very interesting: it just dumps the
460loop's AST. At some point, we will need to make changes to the input
461source code. Next, we'll work on using the nodes we bound in the
462previous step.
463
464The ``MatchFinder::run()`` callback takes a
465``MatchFinder::MatchResult&`` as its parameter. We're most interested in
466its ``Context`` and ``Nodes`` members. Clang uses the ``ASTContext``
467class to represent contextual information about the AST, as the name
468implies, though the most functionally important detail is that several
469operations require an ``ASTContext*`` parameter. More immediately useful
470is the set of matched nodes, and how we retrieve them.
471
472Since we bind three variables (identified by ConditionVarName,
473InitVarName, and IncrementVarName), we can obtain the matched nodes by
474using the ``getNodeAs()`` member function.
475
476In ``LoopConvert.cpp`` add
477
478.. code-block:: c++
479
480      #include "clang/AST/ASTContext.h"
481
482Change ``LoopMatcher`` to
483
484.. code-block:: c++
485
486      StatementMatcher LoopMatcher =
487          forStmt(hasLoopInit(declStmt(
488                      hasSingleDecl(varDecl(hasInitializer(integerLiteral(equals(0))))
489                                        .bind("initVarName")))),
490                  hasIncrement(unaryOperator(
491                      hasOperatorName("++"),
492                      hasUnaryOperand(declRefExpr(
493                          to(varDecl(hasType(isInteger())).bind("incVarName")))))),
494                  hasCondition(binaryOperator(
495                      hasOperatorName("<"),
496                      hasLHS(ignoringParenImpCasts(declRefExpr(
497                          to(varDecl(hasType(isInteger())).bind("condVarName"))))),
498                      hasRHS(expr(hasType(isInteger())))))).bind("forLoop");
499
500And change ``LoopPrinter::run`` to
501
502.. code-block:: c++
503
504      void LoopPrinter::run(const MatchFinder::MatchResult &Result) {
505        ASTContext *Context = Result.Context;
506        const ForStmt *FS = Result.Nodes.getNodeAs<ForStmt>("forLoop");
507        // We do not want to convert header files!
508        if (!FS || !Context->getSourceManager().isWrittenInMainFile(FS->getForLoc()))
509          return;
510        const VarDecl *IncVar = Result.Nodes.getNodeAs<VarDecl>("incVarName");
511        const VarDecl *CondVar = Result.Nodes.getNodeAs<VarDecl>("condVarName");
512        const VarDecl *InitVar = Result.Nodes.getNodeAs<VarDecl>("initVarName");
513
514        if (!areSameVariable(IncVar, CondVar) || !areSameVariable(IncVar, InitVar))
515          return;
516        llvm::outs() << "Potential array-based loop discovered.\n";
517      }
518
519Clang associates a ``VarDecl`` with each variable to represent the variable's
520declaration. Since the "canonical" form of each declaration is unique by
521address, all we need to do is make sure neither ``ValueDecl`` (base class of
522``VarDecl``) is ``NULL`` and compare the canonical Decls.
523
524.. code-block:: c++
525
526      static bool areSameVariable(const ValueDecl *First, const ValueDecl *Second) {
527        return First && Second &&
528               First->getCanonicalDecl() == Second->getCanonicalDecl();
529      }
530
531If execution reaches the end of ``LoopPrinter::run()``, we know that the
532loop shell that looks like
533
534.. code-block:: c++
535
536      for (int i= 0; i < expr(); ++i) { ... }
537
538For now, we will just print a message explaining that we found a loop.
539The next section will deal with recursively traversing the AST to
540discover all changes needed.
541
542As a side note, it's not as trivial to test if two expressions are the same,
543though Clang has already done the hard work for us by providing a way to
544canonicalize expressions:
545
546.. code-block:: c++
547
548      static bool areSameExpr(ASTContext *Context, const Expr *First,
549                              const Expr *Second) {
550        if (!First || !Second)
551          return false;
552        llvm::FoldingSetNodeID FirstID, SecondID;
553        First->Profile(FirstID, *Context, true);
554        Second->Profile(SecondID, *Context, true);
555        return FirstID == SecondID;
556      }
557
558This code relies on the comparison between two
559``llvm::FoldingSetNodeIDs``. As the documentation for
560``Stmt::Profile()`` indicates, the ``Profile()`` member function builds
561a description of a node in the AST, based on its properties, along with
562those of its children. ``FoldingSetNodeID`` then serves as a hash we can
563use to compare expressions. We will need ``areSameExpr`` later. Before
564you run the new code on the additional loops added to
565test-files/simple.cpp, try to figure out which ones will be considered
566potentially convertible.
567