1=============================================================== 2Tutorial for building tools using LibTooling and LibASTMatchers 3=============================================================== 4 5This document is intended to show how to build a useful source-to-source 6translation tool based on Clang's `LibTooling <LibTooling.html>`_. It is 7explicitly aimed at people who are new to Clang, so all you should need 8is a working knowledge of C++ and the command line. 9 10In order to work on the compiler, you need some basic knowledge of the 11abstract syntax tree (AST). To this end, the reader is encouraged to 12skim the :doc:`Introduction to the Clang 13AST <IntroductionToTheClangAST>` 14 15Step 0: Obtaining Clang 16======================= 17 18As Clang is part of the LLVM project, you'll need to download LLVM's 19source code first. Both Clang and LLVM are in the same git repository, 20under different directories. For further information, see the `getting 21started guide <https://llvm.org/docs/GettingStarted.html>`_. 22 23.. code-block:: console 24 25 cd ~/clang-llvm 26 git clone https://github.com/llvm/llvm-project.git 27 28Next you need to obtain the CMake build system and Ninja build tool. 29 30.. code-block:: console 31 32 cd ~/clang-llvm 33 git clone https://github.com/martine/ninja.git 34 cd ninja 35 git checkout release 36 ./bootstrap.py 37 sudo cp ninja /usr/bin/ 38 39 cd ~/clang-llvm 40 git clone git://cmake.org/stage/cmake.git 41 cd cmake 42 git checkout next 43 ./bootstrap 44 make 45 sudo make install 46 47Okay. Now we'll build Clang! 48 49.. code-block:: console 50 51 cd ~/clang-llvm 52 mkdir build && cd build 53 cmake -G Ninja ../llvm -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra" -DLLVM_BUILD_TESTS=ON # Enable tests; default is off. 54 ninja 55 ninja check # Test LLVM only. 56 ninja clang-test # Test Clang only. 57 ninja install 58 59And we're live. 60 61All of the tests should pass. 62 63Finally, we want to set Clang as its own compiler. 64 65.. code-block:: console 66 67 cd ~/clang-llvm/build 68 ccmake ../llvm 69 70The second command will bring up a GUI for configuring Clang. You need 71to set the entry for ``CMAKE_CXX_COMPILER``. Press ``'t'`` to turn on 72advanced mode. Scroll down to ``CMAKE_CXX_COMPILER``, and set it to 73``/usr/bin/clang++``, or wherever you installed it. Press ``'c'`` to 74configure, then ``'g'`` to generate CMake's files. 75 76Finally, run ninja one last time, and you're done. 77 78Step 1: Create a ClangTool 79========================== 80 81Now that we have enough background knowledge, it's time to create the 82simplest productive ClangTool in existence: a syntax checker. While this 83already exists as ``clang-check``, it's important to understand what's 84going on. 85 86First, we'll need to create a new directory for our tool and tell CMake 87that it exists. As this is not going to be a core clang tool, it will 88live in the ``clang-tools-extra`` repository. 89 90.. code-block:: console 91 92 cd ~/clang-llvm 93 mkdir clang-tools-extra/loop-convert 94 echo 'add_subdirectory(loop-convert)' >> clang-tools-extra/CMakeLists.txt 95 vim clang-tools-extra/loop-convert/CMakeLists.txt 96 97CMakeLists.txt should have the following contents: 98 99:: 100 101 set(LLVM_LINK_COMPONENTS support) 102 103 add_clang_executable(loop-convert 104 LoopConvert.cpp 105 ) 106 target_link_libraries(loop-convert 107 PRIVATE 108 clangAST 109 clangASTMatchers 110 clangBasic 111 clangFrontend 112 clangSerialization 113 clangTooling 114 ) 115 116With that done, Ninja will be able to compile our tool. Let's give it 117something to compile! Put the following into 118``clang-tools-extra/loop-convert/LoopConvert.cpp``. A detailed explanation of 119why the different parts are needed can be found in the `LibTooling 120documentation <LibTooling.html>`_. 121 122.. code-block:: c++ 123 124 // Declares clang::SyntaxOnlyAction. 125 #include "clang/Frontend/FrontendActions.h" 126 #include "clang/Tooling/CommonOptionsParser.h" 127 #include "clang/Tooling/Tooling.h" 128 // Declares llvm::cl::extrahelp. 129 #include "llvm/Support/CommandLine.h" 130 131 using namespace clang::tooling; 132 using namespace llvm; 133 134 // Apply a custom category to all command-line options so that they are the 135 // only ones displayed. 136 static llvm::cl::OptionCategory MyToolCategory("my-tool options"); 137 138 // CommonOptionsParser declares HelpMessage with a description of the common 139 // command-line options related to the compilation database and input files. 140 // It's nice to have this help message in all tools. 141 static cl::extrahelp CommonHelp(CommonOptionsParser::HelpMessage); 142 143 // A help message for this specific tool can be added afterwards. 144 static cl::extrahelp MoreHelp("\nMore help text...\n"); 145 146 int main(int argc, const char **argv) { 147 auto ExpectedParser = CommonOptionsParser::create(argc, argv, MyToolCategory); 148 if (!ExpectedParser) { 149 // Fail gracefully for unsupported options. 150 llvm::errs() << ExpectedParser.takeError(); 151 return 1; 152 } 153 CommonOptionsParser& OptionsParser = ExpectedParser.get(); 154 ClangTool Tool(OptionsParser.getCompilations(), 155 OptionsParser.getSourcePathList()); 156 return Tool.run(newFrontendActionFactory<clang::SyntaxOnlyAction>().get()); 157 } 158 159And that's it! You can compile our new tool by running ninja from the 160``build`` directory. 161 162.. code-block:: console 163 164 cd ~/clang-llvm/build 165 ninja 166 167You should now be able to run the syntax checker, which is located in 168``~/clang-llvm/build/bin``, on any source file. Try it! 169 170.. code-block:: console 171 172 echo "int main() { return 0; }" > test.cpp 173 bin/loop-convert test.cpp -- 174 175Note the two dashes after we specify the source file. The additional 176options for the compiler are passed after the dashes rather than loading 177them from a compilation database - there just aren't any options needed 178right now. 179 180Intermezzo: Learn AST matcher basics 181==================================== 182 183Clang recently introduced the :doc:`ASTMatcher 184library <LibASTMatchers>` to provide a simple, powerful, and 185concise way to describe specific patterns in the AST. Implemented as a 186DSL powered by macros and templates (see 187`ASTMatchers.h <../doxygen/ASTMatchers_8h_source.html>`_ if you're 188curious), matchers offer the feel of algebraic data types common to 189functional programming languages. 190 191For example, suppose you wanted to examine only binary operators. There 192is a matcher to do exactly that, conveniently named ``binaryOperator``. 193I'll give you one guess what this matcher does: 194 195.. code-block:: c++ 196 197 binaryOperator(hasOperatorName("+"), hasLHS(integerLiteral(equals(0)))) 198 199Shockingly, it will match against addition expressions whose left hand 200side is exactly the literal 0. It will not match against other forms of 2010, such as ``'\0'`` or ``NULL``, but it will match against macros that 202expand to 0. The matcher will also not match against calls to the 203overloaded operator ``'+'``, as there is a separate ``operatorCallExpr`` 204matcher to handle overloaded operators. 205 206There are AST matchers to match all the different nodes of the AST, 207narrowing matchers to only match AST nodes fulfilling specific criteria, 208and traversal matchers to get from one kind of AST node to another. For 209a complete list of AST matchers, take a look at the `AST Matcher 210References <LibASTMatchersReference.html>`_ 211 212All matcher that are nouns describe entities in the AST and can be 213bound, so that they can be referred to whenever a match is found. To do 214so, simply call the method ``bind`` on these matchers, e.g.: 215 216.. code-block:: c++ 217 218 variable(hasType(isInteger())).bind("intvar") 219 220Step 2: Using AST matchers 221========================== 222 223Okay, on to using matchers for real. Let's start by defining a matcher 224which will capture all ``for`` statements that define a new variable 225initialized to zero. Let's start with matching all ``for`` loops: 226 227.. code-block:: c++ 228 229 forStmt() 230 231Next, we want to specify that a single variable is declared in the first 232portion of the loop, so we can extend the matcher to 233 234.. code-block:: c++ 235 236 forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl())))) 237 238Finally, we can add the condition that the variable is initialized to 239zero. 240 241.. code-block:: c++ 242 243 forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl( 244 hasInitializer(integerLiteral(equals(0)))))))) 245 246It is fairly easy to read and understand the matcher definition ("match 247loops whose init portion declares a single variable which is initialized 248to the integer literal 0"), but deciding that every piece is necessary 249is more difficult. Note that this matcher will not match loops whose 250variables are initialized to ``'\0'``, ``0.0``, ``NULL``, or any form of 251zero besides the integer 0. 252 253The last step is giving the matcher a name and binding the ``ForStmt`` 254as we will want to do something with it: 255 256.. code-block:: c++ 257 258 StatementMatcher LoopMatcher = 259 forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl( 260 hasInitializer(integerLiteral(equals(0)))))))).bind("forLoop"); 261 262Once you have defined your matchers, you will need to add a little more 263scaffolding in order to run them. Matchers are paired with a 264``MatchCallback`` and registered with a ``MatchFinder`` object, then run 265from a ``ClangTool``. More code! 266 267Add the following to ``LoopConvert.cpp``: 268 269.. code-block:: c++ 270 271 #include "clang/ASTMatchers/ASTMatchers.h" 272 #include "clang/ASTMatchers/ASTMatchFinder.h" 273 274 using namespace clang; 275 using namespace clang::ast_matchers; 276 277 StatementMatcher LoopMatcher = 278 forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl( 279 hasInitializer(integerLiteral(equals(0)))))))).bind("forLoop"); 280 281 class LoopPrinter : public MatchFinder::MatchCallback { 282 public : 283 virtual void run(const MatchFinder::MatchResult &Result) { 284 if (const ForStmt *FS = Result.Nodes.getNodeAs<clang::ForStmt>("forLoop")) 285 FS->dump(); 286 } 287 }; 288 289And change ``main()`` to: 290 291.. code-block:: c++ 292 293 int main(int argc, const char **argv) { 294 auto ExpectedParser = CommonOptionsParser::create(argc, argv, MyToolCategory); 295 if (!ExpectedParser) { 296 // Fail gracefully for unsupported options. 297 llvm::errs() << ExpectedParser.takeError(); 298 return 1; 299 } 300 CommonOptionsParser& OptionsParser = ExpectedParser.get(); 301 ClangTool Tool(OptionsParser.getCompilations(), 302 OptionsParser.getSourcePathList()); 303 304 LoopPrinter Printer; 305 MatchFinder Finder; 306 Finder.addMatcher(LoopMatcher, &Printer); 307 308 return Tool.run(newFrontendActionFactory(&Finder).get()); 309 } 310 311Now, you should be able to recompile and run the code to discover for 312loops. Create a new file with a few examples, and test out our new 313handiwork: 314 315.. code-block:: console 316 317 cd ~/clang-llvm/llvm/llvm_build/ 318 ninja loop-convert 319 vim ~/test-files/simple-loops.cc 320 bin/loop-convert ~/test-files/simple-loops.cc 321 322Step 3.5: More Complicated Matchers 323=================================== 324 325Our simple matcher is capable of discovering for loops, but we would 326still need to filter out many more ourselves. We can do a good portion 327of the remaining work with some cleverly chosen matchers, but first we 328need to decide exactly which properties we want to allow. 329 330How can we characterize for loops over arrays which would be eligible 331for translation to range-based syntax? Range based loops over arrays of 332size ``N`` that: 333 334- start at index ``0`` 335- iterate consecutively 336- end at index ``N-1`` 337 338We already check for (1), so all we need to add is a check to the loop's 339condition to ensure that the loop's index variable is compared against 340``N`` and another check to ensure that the increment step just 341increments this same variable. The matcher for (2) is straightforward: 342require a pre- or post-increment of the same variable declared in the 343init portion. 344 345Unfortunately, such a matcher is impossible to write. Matchers contain 346no logic for comparing two arbitrary AST nodes and determining whether 347or not they are equal, so the best we can do is matching more than we 348would like to allow, and punting extra comparisons to the callback. 349 350In any case, we can start building this sub-matcher. We can require that 351the increment step be a unary increment like this: 352 353.. code-block:: c++ 354 355 hasIncrement(unaryOperator(hasOperatorName("++"))) 356 357Specifying what is incremented introduces another quirk of Clang's AST: 358Usages of variables are represented as ``DeclRefExpr``'s ("declaration 359reference expressions") because they are expressions which refer to 360variable declarations. To find a ``unaryOperator`` that refers to a 361specific declaration, we can simply add a second condition to it: 362 363.. code-block:: c++ 364 365 hasIncrement(unaryOperator( 366 hasOperatorName("++"), 367 hasUnaryOperand(declRefExpr()))) 368 369Furthermore, we can restrict our matcher to only match if the 370incremented variable is an integer: 371 372.. code-block:: c++ 373 374 hasIncrement(unaryOperator( 375 hasOperatorName("++"), 376 hasUnaryOperand(declRefExpr(to(varDecl(hasType(isInteger()))))))) 377 378And the last step will be to attach an identifier to this variable, so 379that we can retrieve it in the callback: 380 381.. code-block:: c++ 382 383 hasIncrement(unaryOperator( 384 hasOperatorName("++"), 385 hasUnaryOperand(declRefExpr(to( 386 varDecl(hasType(isInteger())).bind("incrementVariable")))))) 387 388We can add this code to the definition of ``LoopMatcher`` and make sure 389that our program, outfitted with the new matcher, only prints out loops 390that declare a single variable initialized to zero and have an increment 391step consisting of a unary increment of some variable. 392 393Now, we just need to add a matcher to check if the condition part of the 394``for`` loop compares a variable against the size of the array. There is 395only one problem - we don't know which array we're iterating over 396without looking at the body of the loop! We are again restricted to 397approximating the result we want with matchers, filling in the details 398in the callback. So we start with: 399 400.. code-block:: c++ 401 402 hasCondition(binaryOperator(hasOperatorName("<")) 403 404It makes sense to ensure that the left-hand side is a reference to a 405variable, and that the right-hand side has integer type. 406 407.. code-block:: c++ 408 409 hasCondition(binaryOperator( 410 hasOperatorName("<"), 411 hasLHS(declRefExpr(to(varDecl(hasType(isInteger()))))), 412 hasRHS(expr(hasType(isInteger()))))) 413 414Why? Because it doesn't work. Of the three loops provided in 415``test-files/simple.cpp``, zero of them have a matching condition. A 416quick look at the AST dump of the first for loop, produced by the 417previous iteration of loop-convert, shows us the answer: 418 419:: 420 421 (ForStmt 0x173b240 422 (DeclStmt 0x173afc8 423 0x173af50 "int i = 424 (IntegerLiteral 0x173afa8 'int' 0)") 425 <<>> 426 (BinaryOperator 0x173b060 '_Bool' '<' 427 (ImplicitCastExpr 0x173b030 'int' 428 (DeclRefExpr 0x173afe0 'int' lvalue Var 0x173af50 'i' 'int')) 429 (ImplicitCastExpr 0x173b048 'int' 430 (DeclRefExpr 0x173b008 'const int' lvalue Var 0x170fa80 'N' 'const int'))) 431 (UnaryOperator 0x173b0b0 'int' lvalue prefix '++' 432 (DeclRefExpr 0x173b088 'int' lvalue Var 0x173af50 'i' 'int')) 433 (CompoundStatement ... 434 435We already know that the declaration and increments both match, or this 436loop wouldn't have been dumped. The culprit lies in the implicit cast 437applied to the first operand (i.e. the LHS) of the less-than operator, 438an L-value to R-value conversion applied to the expression referencing 439``i``. Thankfully, the matcher library offers a solution to this problem 440in the form of ``ignoringParenImpCasts``, which instructs the matcher to 441ignore implicit casts and parentheses before continuing to match. 442Adjusting the condition operator will restore the desired match. 443 444.. code-block:: c++ 445 446 hasCondition(binaryOperator( 447 hasOperatorName("<"), 448 hasLHS(ignoringParenImpCasts(declRefExpr( 449 to(varDecl(hasType(isInteger())))))), 450 hasRHS(expr(hasType(isInteger()))))) 451 452After adding binds to the expressions we wished to capture and 453extracting the identifier strings into variables, we have array-step-2 454completed. 455 456Step 4: Retrieving Matched Nodes 457================================ 458 459So far, the matcher callback isn't very interesting: it just dumps the 460loop's AST. At some point, we will need to make changes to the input 461source code. Next, we'll work on using the nodes we bound in the 462previous step. 463 464The ``MatchFinder::run()`` callback takes a 465``MatchFinder::MatchResult&`` as its parameter. We're most interested in 466its ``Context`` and ``Nodes`` members. Clang uses the ``ASTContext`` 467class to represent contextual information about the AST, as the name 468implies, though the most functionally important detail is that several 469operations require an ``ASTContext*`` parameter. More immediately useful 470is the set of matched nodes, and how we retrieve them. 471 472Since we bind three variables (identified by ConditionVarName, 473InitVarName, and IncrementVarName), we can obtain the matched nodes by 474using the ``getNodeAs()`` member function. 475 476In ``LoopConvert.cpp`` add 477 478.. code-block:: c++ 479 480 #include "clang/AST/ASTContext.h" 481 482Change ``LoopMatcher`` to 483 484.. code-block:: c++ 485 486 StatementMatcher LoopMatcher = 487 forStmt(hasLoopInit(declStmt( 488 hasSingleDecl(varDecl(hasInitializer(integerLiteral(equals(0)))) 489 .bind("initVarName")))), 490 hasIncrement(unaryOperator( 491 hasOperatorName("++"), 492 hasUnaryOperand(declRefExpr( 493 to(varDecl(hasType(isInteger())).bind("incVarName")))))), 494 hasCondition(binaryOperator( 495 hasOperatorName("<"), 496 hasLHS(ignoringParenImpCasts(declRefExpr( 497 to(varDecl(hasType(isInteger())).bind("condVarName"))))), 498 hasRHS(expr(hasType(isInteger())))))).bind("forLoop"); 499 500And change ``LoopPrinter::run`` to 501 502.. code-block:: c++ 503 504 void LoopPrinter::run(const MatchFinder::MatchResult &Result) { 505 ASTContext *Context = Result.Context; 506 const ForStmt *FS = Result.Nodes.getNodeAs<ForStmt>("forLoop"); 507 // We do not want to convert header files! 508 if (!FS || !Context->getSourceManager().isWrittenInMainFile(FS->getForLoc())) 509 return; 510 const VarDecl *IncVar = Result.Nodes.getNodeAs<VarDecl>("incVarName"); 511 const VarDecl *CondVar = Result.Nodes.getNodeAs<VarDecl>("condVarName"); 512 const VarDecl *InitVar = Result.Nodes.getNodeAs<VarDecl>("initVarName"); 513 514 if (!areSameVariable(IncVar, CondVar) || !areSameVariable(IncVar, InitVar)) 515 return; 516 llvm::outs() << "Potential array-based loop discovered.\n"; 517 } 518 519Clang associates a ``VarDecl`` with each variable to represent the variable's 520declaration. Since the "canonical" form of each declaration is unique by 521address, all we need to do is make sure neither ``ValueDecl`` (base class of 522``VarDecl``) is ``NULL`` and compare the canonical Decls. 523 524.. code-block:: c++ 525 526 static bool areSameVariable(const ValueDecl *First, const ValueDecl *Second) { 527 return First && Second && 528 First->getCanonicalDecl() == Second->getCanonicalDecl(); 529 } 530 531If execution reaches the end of ``LoopPrinter::run()``, we know that the 532loop shell that looks like 533 534.. code-block:: c++ 535 536 for (int i= 0; i < expr(); ++i) { ... } 537 538For now, we will just print a message explaining that we found a loop. 539The next section will deal with recursively traversing the AST to 540discover all changes needed. 541 542As a side note, it's not as trivial to test if two expressions are the same, 543though Clang has already done the hard work for us by providing a way to 544canonicalize expressions: 545 546.. code-block:: c++ 547 548 static bool areSameExpr(ASTContext *Context, const Expr *First, 549 const Expr *Second) { 550 if (!First || !Second) 551 return false; 552 llvm::FoldingSetNodeID FirstID, SecondID; 553 First->Profile(FirstID, *Context, true); 554 Second->Profile(SecondID, *Context, true); 555 return FirstID == SecondID; 556 } 557 558This code relies on the comparison between two 559``llvm::FoldingSetNodeIDs``. As the documentation for 560``Stmt::Profile()`` indicates, the ``Profile()`` member function builds 561a description of a node in the AST, based on its properties, along with 562those of its children. ``FoldingSetNodeID`` then serves as a hash we can 563use to compare expressions. We will need ``areSameExpr`` later. Before 564you run the new code on the additional loops added to 565test-files/simple.cpp, try to figure out which ones will be considered 566potentially convertible. 567