1===============================================================
2Tutorial for building tools using LibTooling and LibASTMatchers
3===============================================================
4
5This document is intended to show how to build a useful source-to-source
6translation tool based on Clang's `LibTooling <LibTooling.html>`_. It is
7explicitly aimed at people who are new to Clang, so all you should need
8is a working knowledge of C++ and the command line.
9
10In order to work on the compiler, you need some basic knowledge of the
11abstract syntax tree (AST). To this end, the reader is encouraged to
12skim the :doc:`Introduction to the Clang
13AST <IntroductionToTheClangAST>`
14
15Step 0: Obtaining Clang
16=======================
17
18As Clang is part of the LLVM project, you'll need to download LLVM's
19source code first. Both Clang and LLVM are in the same git repository,
20under different directories. For further information, see the `getting
21started guide <https://llvm.org/docs/GettingStarted.html>`_.
22
23.. code-block:: console
24
25      cd ~/clang-llvm
26      git clone https://github.com/llvm/llvm-project.git
27
28Next you need to obtain the CMake build system and Ninja build tool.
29
30.. code-block:: console
31
32      cd ~/clang-llvm
33      git clone https://github.com/martine/ninja.git
34      cd ninja
35      git checkout release
36      ./bootstrap.py
37      sudo cp ninja /usr/bin/
38
39      cd ~/clang-llvm
40      git clone git://cmake.org/stage/cmake.git
41      cd cmake
42      git checkout next
43      ./bootstrap
44      make
45      sudo make install
46
47Okay. Now we'll build Clang!
48
49.. code-block:: console
50
51      cd ~/clang-llvm
52      mkdir build && cd build
53      cmake -G Ninja ../llvm -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra" -DLLVM_BUILD_TESTS=ON  # Enable tests; default is off.
54      ninja
55      ninja check       # Test LLVM only.
56      ninja clang-test  # Test Clang only.
57      ninja install
58
59And we're live.
60
61All of the tests should pass.
62
63Finally, we want to set Clang as its own compiler.
64
65.. code-block:: console
66
67      cd ~/clang-llvm/build
68      ccmake ../llvm
69
70The second command will bring up a GUI for configuring Clang. You need
71to set the entry for ``CMAKE_CXX_COMPILER``. Press ``'t'`` to turn on
72advanced mode. Scroll down to ``CMAKE_CXX_COMPILER``, and set it to
73``/usr/bin/clang++``, or wherever you installed it. Press ``'c'`` to
74configure, then ``'g'`` to generate CMake's files.
75
76Finally, run ninja one last time, and you're done.
77
78Step 1: Create a ClangTool
79==========================
80
81Now that we have enough background knowledge, it's time to create the
82simplest productive ClangTool in existence: a syntax checker. While this
83already exists as ``clang-check``, it's important to understand what's
84going on.
85
86First, we'll need to create a new directory for our tool and tell CMake
87that it exists. As this is not going to be a core clang tool, it will
88live in the ``clang-tools-extra`` repository.
89
90.. code-block:: console
91
92      cd ~/clang-llvm
93      mkdir clang-tools-extra/loop-convert
94      echo 'add_subdirectory(loop-convert)' >> clang-tools-extra/CMakeLists.txt
95      vim clang-tools-extra/loop-convert/CMakeLists.txt
96
97CMakeLists.txt should have the following contents:
98
99::
100
101      set(LLVM_LINK_COMPONENTS support)
102
103      add_clang_executable(loop-convert
104        LoopConvert.cpp
105        )
106      target_link_libraries(loop-convert
107        PRIVATE
108        clangTooling
109        clangBasic
110        clangASTMatchers
111        )
112
113With that done, Ninja will be able to compile our tool. Let's give it
114something to compile! Put the following into
115``clang-tools-extra/loop-convert/LoopConvert.cpp``. A detailed explanation of
116why the different parts are needed can be found in the `LibTooling
117documentation <LibTooling.html>`_.
118
119.. code-block:: c++
120
121      // Declares clang::SyntaxOnlyAction.
122      #include "clang/Frontend/FrontendActions.h"
123      #include "clang/Tooling/CommonOptionsParser.h"
124      #include "clang/Tooling/Tooling.h"
125      // Declares llvm::cl::extrahelp.
126      #include "llvm/Support/CommandLine.h"
127
128      using namespace clang::tooling;
129      using namespace llvm;
130
131      // Apply a custom category to all command-line options so that they are the
132      // only ones displayed.
133      static llvm::cl::OptionCategory MyToolCategory("my-tool options");
134
135      // CommonOptionsParser declares HelpMessage with a description of the common
136      // command-line options related to the compilation database and input files.
137      // It's nice to have this help message in all tools.
138      static cl::extrahelp CommonHelp(CommonOptionsParser::HelpMessage);
139
140      // A help message for this specific tool can be added afterwards.
141      static cl::extrahelp MoreHelp("\nMore help text...\n");
142
143      int main(int argc, const char **argv) {
144        CommonOptionsParser OptionsParser(argc, argv, MyToolCategory);
145        ClangTool Tool(OptionsParser.getCompilations(),
146                       OptionsParser.getSourcePathList());
147        return Tool.run(newFrontendActionFactory<clang::SyntaxOnlyAction>().get());
148      }
149
150And that's it! You can compile our new tool by running ninja from the
151``build`` directory.
152
153.. code-block:: console
154
155      cd ~/clang-llvm/build
156      ninja
157
158You should now be able to run the syntax checker, which is located in
159``~/clang-llvm/build/bin``, on any source file. Try it!
160
161.. code-block:: console
162
163      echo "int main() { return 0; }" > test.cpp
164      bin/loop-convert test.cpp --
165
166Note the two dashes after we specify the source file. The additional
167options for the compiler are passed after the dashes rather than loading
168them from a compilation database - there just aren't any options needed
169right now.
170
171Intermezzo: Learn AST matcher basics
172====================================
173
174Clang recently introduced the :doc:`ASTMatcher
175library <LibASTMatchers>` to provide a simple, powerful, and
176concise way to describe specific patterns in the AST. Implemented as a
177DSL powered by macros and templates (see
178`ASTMatchers.h <../doxygen/ASTMatchers_8h_source.html>`_ if you're
179curious), matchers offer the feel of algebraic data types common to
180functional programming languages.
181
182For example, suppose you wanted to examine only binary operators. There
183is a matcher to do exactly that, conveniently named ``binaryOperator``.
184I'll give you one guess what this matcher does:
185
186.. code-block:: c++
187
188      binaryOperator(hasOperatorName("+"), hasLHS(integerLiteral(equals(0))))
189
190Shockingly, it will match against addition expressions whose left hand
191side is exactly the literal 0. It will not match against other forms of
1920, such as ``'\0'`` or ``NULL``, but it will match against macros that
193expand to 0. The matcher will also not match against calls to the
194overloaded operator ``'+'``, as there is a separate ``operatorCallExpr``
195matcher to handle overloaded operators.
196
197There are AST matchers to match all the different nodes of the AST,
198narrowing matchers to only match AST nodes fulfilling specific criteria,
199and traversal matchers to get from one kind of AST node to another. For
200a complete list of AST matchers, take a look at the `AST Matcher
201References <LibASTMatchersReference.html>`_
202
203All matcher that are nouns describe entities in the AST and can be
204bound, so that they can be referred to whenever a match is found. To do
205so, simply call the method ``bind`` on these matchers, e.g.:
206
207.. code-block:: c++
208
209      variable(hasType(isInteger())).bind("intvar")
210
211Step 2: Using AST matchers
212==========================
213
214Okay, on to using matchers for real. Let's start by defining a matcher
215which will capture all ``for`` statements that define a new variable
216initialized to zero. Let's start with matching all ``for`` loops:
217
218.. code-block:: c++
219
220      forStmt()
221
222Next, we want to specify that a single variable is declared in the first
223portion of the loop, so we can extend the matcher to
224
225.. code-block:: c++
226
227      forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl()))))
228
229Finally, we can add the condition that the variable is initialized to
230zero.
231
232.. code-block:: c++
233
234      forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl(
235        hasInitializer(integerLiteral(equals(0))))))))
236
237It is fairly easy to read and understand the matcher definition ("match
238loops whose init portion declares a single variable which is initialized
239to the integer literal 0"), but deciding that every piece is necessary
240is more difficult. Note that this matcher will not match loops whose
241variables are initialized to ``'\0'``, ``0.0``, ``NULL``, or any form of
242zero besides the integer 0.
243
244The last step is giving the matcher a name and binding the ``ForStmt``
245as we will want to do something with it:
246
247.. code-block:: c++
248
249      StatementMatcher LoopMatcher =
250        forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl(
251          hasInitializer(integerLiteral(equals(0)))))))).bind("forLoop");
252
253Once you have defined your matchers, you will need to add a little more
254scaffolding in order to run them. Matchers are paired with a
255``MatchCallback`` and registered with a ``MatchFinder`` object, then run
256from a ``ClangTool``. More code!
257
258Add the following to ``LoopConvert.cpp``:
259
260.. code-block:: c++
261
262      #include "clang/ASTMatchers/ASTMatchers.h"
263      #include "clang/ASTMatchers/ASTMatchFinder.h"
264
265      using namespace clang;
266      using namespace clang::ast_matchers;
267
268      StatementMatcher LoopMatcher =
269        forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl(
270          hasInitializer(integerLiteral(equals(0)))))))).bind("forLoop");
271
272      class LoopPrinter : public MatchFinder::MatchCallback {
273      public :
274        virtual void run(const MatchFinder::MatchResult &Result) {
275          if (const ForStmt *FS = Result.Nodes.getNodeAs<clang::ForStmt>("forLoop"))
276            FS->dump();
277        }
278      };
279
280And change ``main()`` to:
281
282.. code-block:: c++
283
284      int main(int argc, const char **argv) {
285        CommonOptionsParser OptionsParser(argc, argv, MyToolCategory);
286        ClangTool Tool(OptionsParser.getCompilations(),
287                       OptionsParser.getSourcePathList());
288
289        LoopPrinter Printer;
290        MatchFinder Finder;
291        Finder.addMatcher(LoopMatcher, &Printer);
292
293        return Tool.run(newFrontendActionFactory(&Finder).get());
294      }
295
296Now, you should be able to recompile and run the code to discover for
297loops. Create a new file with a few examples, and test out our new
298handiwork:
299
300.. code-block:: console
301
302      cd ~/clang-llvm/llvm/llvm_build/
303      ninja loop-convert
304      vim ~/test-files/simple-loops.cc
305      bin/loop-convert ~/test-files/simple-loops.cc
306
307Step 3.5: More Complicated Matchers
308===================================
309
310Our simple matcher is capable of discovering for loops, but we would
311still need to filter out many more ourselves. We can do a good portion
312of the remaining work with some cleverly chosen matchers, but first we
313need to decide exactly which properties we want to allow.
314
315How can we characterize for loops over arrays which would be eligible
316for translation to range-based syntax? Range based loops over arrays of
317size ``N`` that:
318
319-  start at index ``0``
320-  iterate consecutively
321-  end at index ``N-1``
322
323We already check for (1), so all we need to add is a check to the loop's
324condition to ensure that the loop's index variable is compared against
325``N`` and another check to ensure that the increment step just
326increments this same variable. The matcher for (2) is straightforward:
327require a pre- or post-increment of the same variable declared in the
328init portion.
329
330Unfortunately, such a matcher is impossible to write. Matchers contain
331no logic for comparing two arbitrary AST nodes and determining whether
332or not they are equal, so the best we can do is matching more than we
333would like to allow, and punting extra comparisons to the callback.
334
335In any case, we can start building this sub-matcher. We can require that
336the increment step be a unary increment like this:
337
338.. code-block:: c++
339
340      hasIncrement(unaryOperator(hasOperatorName("++")))
341
342Specifying what is incremented introduces another quirk of Clang's AST:
343Usages of variables are represented as ``DeclRefExpr``'s ("declaration
344reference expressions") because they are expressions which refer to
345variable declarations. To find a ``unaryOperator`` that refers to a
346specific declaration, we can simply add a second condition to it:
347
348.. code-block:: c++
349
350      hasIncrement(unaryOperator(
351        hasOperatorName("++"),
352        hasUnaryOperand(declRefExpr())))
353
354Furthermore, we can restrict our matcher to only match if the
355incremented variable is an integer:
356
357.. code-block:: c++
358
359      hasIncrement(unaryOperator(
360        hasOperatorName("++"),
361        hasUnaryOperand(declRefExpr(to(varDecl(hasType(isInteger())))))))
362
363And the last step will be to attach an identifier to this variable, so
364that we can retrieve it in the callback:
365
366.. code-block:: c++
367
368      hasIncrement(unaryOperator(
369        hasOperatorName("++"),
370        hasUnaryOperand(declRefExpr(to(
371          varDecl(hasType(isInteger())).bind("incrementVariable"))))))
372
373We can add this code to the definition of ``LoopMatcher`` and make sure
374that our program, outfitted with the new matcher, only prints out loops
375that declare a single variable initialized to zero and have an increment
376step consisting of a unary increment of some variable.
377
378Now, we just need to add a matcher to check if the condition part of the
379``for`` loop compares a variable against the size of the array. There is
380only one problem - we don't know which array we're iterating over
381without looking at the body of the loop! We are again restricted to
382approximating the result we want with matchers, filling in the details
383in the callback. So we start with:
384
385.. code-block:: c++
386
387      hasCondition(binaryOperator(hasOperatorName("<"))
388
389It makes sense to ensure that the left-hand side is a reference to a
390variable, and that the right-hand side has integer type.
391
392.. code-block:: c++
393
394      hasCondition(binaryOperator(
395        hasOperatorName("<"),
396        hasLHS(declRefExpr(to(varDecl(hasType(isInteger()))))),
397        hasRHS(expr(hasType(isInteger())))))
398
399Why? Because it doesn't work. Of the three loops provided in
400``test-files/simple.cpp``, zero of them have a matching condition. A
401quick look at the AST dump of the first for loop, produced by the
402previous iteration of loop-convert, shows us the answer:
403
404::
405
406      (ForStmt 0x173b240
407        (DeclStmt 0x173afc8
408          0x173af50 "int i =
409            (IntegerLiteral 0x173afa8 'int' 0)")
410        <<>>
411        (BinaryOperator 0x173b060 '_Bool' '<'
412          (ImplicitCastExpr 0x173b030 'int'
413            (DeclRefExpr 0x173afe0 'int' lvalue Var 0x173af50 'i' 'int'))
414          (ImplicitCastExpr 0x173b048 'int'
415            (DeclRefExpr 0x173b008 'const int' lvalue Var 0x170fa80 'N' 'const int')))
416        (UnaryOperator 0x173b0b0 'int' lvalue prefix '++'
417          (DeclRefExpr 0x173b088 'int' lvalue Var 0x173af50 'i' 'int'))
418        (CompoundStatement ...
419
420We already know that the declaration and increments both match, or this
421loop wouldn't have been dumped. The culprit lies in the implicit cast
422applied to the first operand (i.e. the LHS) of the less-than operator,
423an L-value to R-value conversion applied to the expression referencing
424``i``. Thankfully, the matcher library offers a solution to this problem
425in the form of ``ignoringParenImpCasts``, which instructs the matcher to
426ignore implicit casts and parentheses before continuing to match.
427Adjusting the condition operator will restore the desired match.
428
429.. code-block:: c++
430
431      hasCondition(binaryOperator(
432        hasOperatorName("<"),
433        hasLHS(ignoringParenImpCasts(declRefExpr(
434          to(varDecl(hasType(isInteger())))))),
435        hasRHS(expr(hasType(isInteger())))))
436
437After adding binds to the expressions we wished to capture and
438extracting the identifier strings into variables, we have array-step-2
439completed.
440
441Step 4: Retrieving Matched Nodes
442================================
443
444So far, the matcher callback isn't very interesting: it just dumps the
445loop's AST. At some point, we will need to make changes to the input
446source code. Next, we'll work on using the nodes we bound in the
447previous step.
448
449The ``MatchFinder::run()`` callback takes a
450``MatchFinder::MatchResult&`` as its parameter. We're most interested in
451its ``Context`` and ``Nodes`` members. Clang uses the ``ASTContext``
452class to represent contextual information about the AST, as the name
453implies, though the most functionally important detail is that several
454operations require an ``ASTContext*`` parameter. More immediately useful
455is the set of matched nodes, and how we retrieve them.
456
457Since we bind three variables (identified by ConditionVarName,
458InitVarName, and IncrementVarName), we can obtain the matched nodes by
459using the ``getNodeAs()`` member function.
460
461In ``LoopConvert.cpp`` add
462
463.. code-block:: c++
464
465      #include "clang/AST/ASTContext.h"
466
467Change ``LoopMatcher`` to
468
469.. code-block:: c++
470
471      StatementMatcher LoopMatcher =
472          forStmt(hasLoopInit(declStmt(
473                      hasSingleDecl(varDecl(hasInitializer(integerLiteral(equals(0))))
474                                        .bind("initVarName")))),
475                  hasIncrement(unaryOperator(
476                      hasOperatorName("++"),
477                      hasUnaryOperand(declRefExpr(
478                          to(varDecl(hasType(isInteger())).bind("incVarName")))))),
479                  hasCondition(binaryOperator(
480                      hasOperatorName("<"),
481                      hasLHS(ignoringParenImpCasts(declRefExpr(
482                          to(varDecl(hasType(isInteger())).bind("condVarName"))))),
483                      hasRHS(expr(hasType(isInteger())))))).bind("forLoop");
484
485And change ``LoopPrinter::run`` to
486
487.. code-block:: c++
488
489      void LoopPrinter::run(const MatchFinder::MatchResult &Result) {
490        ASTContext *Context = Result.Context;
491        const ForStmt *FS = Result.Nodes.getNodeAs<ForStmt>("forLoop");
492        // We do not want to convert header files!
493        if (!FS || !Context->getSourceManager().isWrittenInMainFile(FS->getForLoc()))
494          return;
495        const VarDecl *IncVar = Result.Nodes.getNodeAs<VarDecl>("incVarName");
496        const VarDecl *CondVar = Result.Nodes.getNodeAs<VarDecl>("condVarName");
497        const VarDecl *InitVar = Result.Nodes.getNodeAs<VarDecl>("initVarName");
498
499        if (!areSameVariable(IncVar, CondVar) || !areSameVariable(IncVar, InitVar))
500          return;
501        llvm::outs() << "Potential array-based loop discovered.\n";
502      }
503
504Clang associates a ``VarDecl`` with each variable to represent the variable's
505declaration. Since the "canonical" form of each declaration is unique by
506address, all we need to do is make sure neither ``ValueDecl`` (base class of
507``VarDecl``) is ``NULL`` and compare the canonical Decls.
508
509.. code-block:: c++
510
511      static bool areSameVariable(const ValueDecl *First, const ValueDecl *Second) {
512        return First && Second &&
513               First->getCanonicalDecl() == Second->getCanonicalDecl();
514      }
515
516If execution reaches the end of ``LoopPrinter::run()``, we know that the
517loop shell that looks like
518
519.. code-block:: c++
520
521      for (int i= 0; i < expr(); ++i) { ... }
522
523For now, we will just print a message explaining that we found a loop.
524The next section will deal with recursively traversing the AST to
525discover all changes needed.
526
527As a side note, it's not as trivial to test if two expressions are the same,
528though Clang has already done the hard work for us by providing a way to
529canonicalize expressions:
530
531.. code-block:: c++
532
533      static bool areSameExpr(ASTContext *Context, const Expr *First,
534                              const Expr *Second) {
535        if (!First || !Second)
536          return false;
537        llvm::FoldingSetNodeID FirstID, SecondID;
538        First->Profile(FirstID, *Context, true);
539        Second->Profile(SecondID, *Context, true);
540        return FirstID == SecondID;
541      }
542
543This code relies on the comparison between two
544``llvm::FoldingSetNodeIDs``. As the documentation for
545``Stmt::Profile()`` indicates, the ``Profile()`` member function builds
546a description of a node in the AST, based on its properties, along with
547those of its children. ``FoldingSetNodeID`` then serves as a hash we can
548use to compare expressions. We will need ``areSameExpr`` later. Before
549you run the new code on the additional loops added to
550test-files/simple.cpp, try to figure out which ones will be considered
551potentially convertible.
552