1=encoding utf8 2 3=for comment 4Consistent formatting of this file is achieved with: 5 perl ./Porting/podtidy pod/perlinterp.pod 6 7=head1 NAME 8 9perlinterp - An overview of the Perl interpreter 10 11=head1 DESCRIPTION 12 13This document provides an overview of how the Perl interpreter works at 14the level of C code, along with pointers to the relevant C source code 15files. 16 17=head1 ELEMENTS OF THE INTERPRETER 18 19The work of the interpreter has two main stages: compiling the code 20into the internal representation, or bytecode, and then executing it. 21L<perlguts/Compiled code> explains exactly how the compilation stage 22happens. 23 24Here is a short breakdown of perl's operation: 25 26=head2 Startup 27 28The action begins in F<perlmain.c>. (or F<miniperlmain.c> for miniperl) 29This is very high-level code, enough to fit on a single screen, and it 30resembles the code found in L<perlembed>; most of the real action takes 31place in F<perl.c> 32 33F<perlmain.c> is generated by C<ExtUtils::Miniperl> from 34F<miniperlmain.c> at make time, so you should make perl to follow this 35along. 36 37First, F<perlmain.c> allocates some memory and constructs a Perl 38interpreter, along these lines: 39 40 1 PERL_SYS_INIT3(&argc,&argv,&env); 41 2 42 3 if (!PL_do_undump) { 43 4 my_perl = perl_alloc(); 44 5 if (!my_perl) 45 6 exit(1); 46 7 perl_construct(my_perl); 47 8 PL_perl_destruct_level = 0; 48 9 } 49 50Line 1 is a macro, and its definition is dependent on your operating 51system. Line 3 references C<PL_do_undump>, a global variable - all 52global variables in Perl start with C<PL_>. This tells you whether the 53current running program was created with the C<-u> flag to perl and 54then F<undump>, which means it's going to be false in any sane context. 55 56Line 4 calls a function in F<perl.c> to allocate memory for a Perl 57interpreter. It's quite a simple function, and the guts of it looks 58like this: 59 60 my_perl = (PerlInterpreter*)PerlMem_malloc(sizeof(PerlInterpreter)); 61 62Here you see an example of Perl's system abstraction, which we'll see 63later: C<PerlMem_malloc> is either your system's C<malloc>, or Perl's 64own C<malloc> as defined in F<malloc.c> if you selected that option at 65configure time. 66 67Next, in line 7, we construct the interpreter using perl_construct, 68also in F<perl.c>; this sets up all the special variables that Perl 69needs, the stacks, and so on. 70 71Now we pass Perl the command line options, and tell it to go: 72 73 if (!perl_parse(my_perl, xs_init, argc, argv, (char **)NULL)) 74 perl_run(my_perl); 75 76 exitstatus = perl_destruct(my_perl); 77 78 perl_free(my_perl); 79 80C<perl_parse> is actually a wrapper around C<S_parse_body>, as defined 81in F<perl.c>, which processes the command line options, sets up any 82statically linked XS modules, opens the program and calls C<yyparse> to 83parse it. 84 85=head2 Parsing 86 87The aim of this stage is to take the Perl source, and turn it into an 88op tree. We'll see what one of those looks like later. Strictly 89speaking, there's three things going on here. 90 91C<yyparse>, the parser, lives in F<perly.c>, although you're better off 92reading the original YACC input in F<perly.y>. (Yes, Virginia, there 93B<is> a YACC grammar for Perl!) The job of the parser is to take your 94code and "understand" it, splitting it into sentences, deciding which 95operands go with which operators and so on. 96 97The parser is nobly assisted by the lexer, which chunks up your input 98into tokens, and decides what type of thing each token is: a variable 99name, an operator, a bareword, a subroutine, a core function, and so 100on. The main point of entry to the lexer is C<yylex>, and that and its 101associated routines can be found in F<toke.c>. Perl isn't much like 102other computer languages; it's highly context sensitive at times, it 103can be tricky to work out what sort of token something is, or where a 104token ends. As such, there's a lot of interplay between the tokeniser 105and the parser, which can get pretty frightening if you're not used to 106it. 107 108As the parser understands a Perl program, it builds up a tree of 109operations for the interpreter to perform during execution. The 110routines which construct and link together the various operations are 111to be found in F<op.c>, and will be examined later. 112 113=head2 Optimization 114 115Now the parsing stage is complete, and the finished tree represents the 116operations that the Perl interpreter needs to perform to execute our 117program. Next, Perl does a dry run over the tree looking for 118optimisations: constant expressions such as C<3 + 4> will be computed 119now, and the optimizer will also see if any multiple operations can be 120replaced with a single one. For instance, to fetch the variable 121C<$foo>, instead of grabbing the glob C<*foo> and looking at the scalar 122component, the optimizer fiddles the op tree to use a function which 123directly looks up the scalar in question. The main optimizer is C<peep> 124in F<op.c>, and many ops have their own optimizing functions. 125 126=head2 Running 127 128Now we're finally ready to go: we have compiled Perl byte code, and all 129that's left to do is run it. The actual execution is done by the 130C<runops_standard> function in F<run.c>; more specifically, it's done 131by these three innocent looking lines: 132 133 while ((PL_op = PL_op->op_ppaddr(aTHX))) { 134 PERL_ASYNC_CHECK(); 135 } 136 137You may be more comfortable with the Perl version of that: 138 139 PERL_ASYNC_CHECK() while $Perl::op = &{$Perl::op->{function}}; 140 141Well, maybe not. Anyway, each op contains a function pointer, which 142stipulates the function which will actually carry out the operation. 143This function will return the next op in the sequence - this allows for 144things like C<if> which choose the next op dynamically at run time. The 145C<PERL_ASYNC_CHECK> makes sure that things like signals interrupt 146execution if required. 147 148=for apidoc_section $embedding 149=for apidoc Amh|void|PERL_ASYNC_CHECK 150 151The actual functions called are known as PP code, and they're spread 152between four files: F<pp_hot.c> contains the "hot" code, which is most 153often used and highly optimized, F<pp_sys.c> contains all the 154system-specific functions, F<pp_ctl.c> contains the functions which 155implement control structures (C<if>, C<while> and the like) and F<pp.c> 156contains everything else. These are, if you like, the C code for Perl's 157built-in functions and operators. 158 159Note that each C<pp_> function is expected to return a pointer to the 160next op. Calls to perl subs (and eval blocks) are handled within the 161same runops loop, and do not consume extra space on the C stack. For 162example, C<pp_entersub> and C<pp_entertry> just push a C<CXt_SUB> or 163C<CXt_EVAL> block struct onto the context stack, which contain the address 164of the op following the sub call or eval. They then return the first op 165of that sub or eval block, and so execution continues of that sub or 166block. Later, a C<pp_leavesub> or C<pp_leavetry> op pops the C<CXt_SUB> 167or C<CXt_EVAL>, retrieves the return op from it, and returns it. 168 169=head2 Exception handing 170 171Perl's exception handing (i.e. C<die> etc.) is built on top of the 172low-level C<setjmp()>/C<longjmp()> C-library functions. These basically 173provide a way to capture the current PC and SP registers of the CPU and 174later restore them: i.e. a C<longjmp()> continues at the point in code 175where a previous C<setjmp()> was done, with anything further up on the C 176stack being lost. (This is why code should always save values using 177C<SAVE_I<FOO>> rather than in auto variables.) 178 179=for apidoc_section $exceptions 180=for apidoc Amh|void|JMPENV_PUSH|int v 181=for apidoc Amh|void|JMPENV_JUMP|int v 182=for apidoc Amnh|OP *|PL_restartop 183 184The perl core wraps C<setjmp()> and C<longjmp()> in the macros 185C<JMPENV_PUSH> and C<JMPENV_JUMP>. The push operation, as well as setting 186a C<setjump()>, stores some temporary state in a struct local to the 187current function (allocated by C<dJMPENV>). In particular, it stores a 188pointer to the previous C<JMPENV> struct, and updates C<PL_top_env> to 189point to the newest one, forming a chain of C<JMPENV> states. Both the 190push and jump can output debugging information under C<perl -Dl>. 191 192A basic rule of the perl internals is that all interpreter exits are 193achieved via a C<JMPENV_JUMP()>. In particular: 194 195=over 196 197=item * level 2: perl-level exit() and internals my_exit() 198 199These unwind all stacks, then perform a JMPENV_JUMP(2). 200 201=item * level 3: perl-level die() and internals croak() 202 203If currently within an eval, these pop the context stack back to the 204nearest C<CXt_EVAL> frame, set C<$@> as appropriate, set C<PL_restartop> 205to the op which follows the eval associated with that frame, then perform 206a JMPENV_JUMP(3). 207 208Otherwise, the error message is printed to C<STDERR>, then it is treated 209as an exit: unwind all stacks and perform a JMPENV_JUMP(2). 210 211=item * level 1: unused 212 213JMPENV_JUMP(1) is currently unused except in perl_run(). 214 215=item * level 0: normal return. 216 217The zero value is for a normal return from JMPENV_PUSH() 218 219=back 220 221So the perl interpreter expects that, at all times, there is a suitable 222C<JMPENV_PUSH> set up (and at a suitable location within the CPU call 223stack) that can catch and process a 2- or 3-valued jump; and in the case 224of a 3, start a new runops loop to execute C<PL_restartop> and all 225remaining ops (as will be explained shortly). 226 227The entry points to the perl interpreter all provide such a facility. For 228example, perl_parse(), perl_run() and C<call_sv(cv, G_EVAL)> all contain 229something similar in outline to: 230 231 { 232 dJMPENV; 233 JMPENV_PUSH(ret); 234 switch (ret) { 235 case 0: /* normal return from JMPENV_PUSH() */ 236 redo_body: 237 CALLRUNOPS(aTHX); 238 break; 239 case 2: /* caught longjmp(2) - exit / die */ 240 break; 241 case 3: /* caught longjmp(3) - eval { die } */ 242 PL_op = PL_restartop; 243 goto redo_body; 244 } 245 246 JMPENV_POP; 247 } 248 249A runops loop such as Perl_runops_standard() (as set up by CALLRUNOPS()) 250is, at its heart, just a simple: 251 252 while ((PL_op = PL_op->op_ppaddr(aTHX))) { 1; } 253 254which calls the pp() function associated with each op, relying on that to 255return a pointer to the next op to be executed. 256 257As well as setting catches at the entry points to the perl interpreter, 258you might expect perl to also do a JMPENV_PUSH() in places like 259pp_entertry(), just before some trappable ops are executed. In fact perl 260doesn't normally do this. The drawback with doing it is that with nested 261or recursive code such as: 262 263 sub foo { my ($i) = @_; return if $i < 0; eval { foo(--$i) } } 264 265Then the C stack would quickly overflow with pairs of entries like 266 267 ... 268 #N+3 Perl_runops() 269 #N+2 Perl_pp_entertry() 270 #N+1 Perl_runops() 271 #N Perl_pp_entertry() 272 ... 273 274Instead, perl puts its guards at the I<callers> of runops loops. Then as 275many nested subroutine calls and evals may be called as you like, all 276within the one runops loop. If an exception occurs, control passes back to 277the caller of the loop, which just immediately restarts a new loop with 278C<PL_restartop> being the next op to call. 279 280So in normal operation where there are several nested evals, there 281will be multiple C<CXt_EVAL> context stack entries, but only a single 282runops loop, guarded by a single C<JMPENV_PUSH>. Each caught eval will pop 283the next C<CXt_EVAL> off the stack, set C<PL_restartop>, then longjmp() 284back to perl_run() and continue. 285 286However, ops are sometimes executed within an inner runops loop, such as 287in a tie, sort, or overload code. In this case, something like 288 289 sub FETCH { eval { die }; .... } 290 291would, unless handled specially, cause a longjmp() right back to the guard 292in perl_run(), popping I<both> the runops loops - which is clearly 293incorrect. One way to avoid this is for the tie code to do a 294C<JMPENV_PUSH> before executing C<FETCH> in the inner runops loop, but for 295efficiency reasons, perl in fact just temporarily sets a flag using 296C<CATCH_SET(TRUE)>. This flag warns any subsequent C<require>, 297C<entereval> or C<entertry> ops that the caller is no longer promising to 298catch any raised exceptions on their behalf. 299 300These ops check this flag, and if true, they (via docatch()) do a 301C<JMPENV_PUSH> and start a new runops loop to execute the code, rather 302than doing it with the current loop. 303 304As a consequence, on exit from the eval block in the C<FETCH> above, 305execution of the code following the block is still carried on in the 306inner loop (i.e. the one established by the pp_entertry()). To avoid 307confusion, if a further exception is then raised, docatch() compares the 308C<JMPENV> level of the C<CXt_EVAL> with C<PL_top_env> and if they differ, 309just re-throws the exception. In this way any inner loops get popped, 310and the exception will be dealt with properly by the level which is 311expecting it. 312 313Here's an example. 314 315 1: eval { tie @a, 'A' }; 316 2: sub A::TIEARRAY { 317 3: eval { die }; 318 4: die; 319 5: } 320 321To run this code, perl_run() is called, which does a JMPENV_PUSH(), 322then enters a runops loop. This loop executes the C<entereval> and C<tie> 323ops on line 1, with the C<entereval> pushing a C<CXt_EVAL> onto the context 324stack. 325 326The pp_tie() does a C<CATCH_SET(TRUE)>, then starts a second runops 327loop to execute the body of TIEARRAY(). When the loop executes the 328C<entertry> op on line 3, CATCH_GET() is true, so pp_entertry() calls 329docatch() which does a C<JMPENV_PUSH> and starts a third runops loop, 330which restarts the pp_entertry(), then executes the C<die> op. At this 331point the C call stack looks like this: 332 333 #10 Perl_pp_die() 334 #9 Perl_runops() # runops loop 3 335 #8 S_docatch() # JMPENV level 2 336 #7 Perl_pp_entertry() 337 #6 Perl_runops() # runops loop 2 338 #5 Perl_call_sv() 339 #4 Perl_pp_tie() 340 #3 Perl_runops() # runops loop 1 341 #2 S_run_body() 342 #1 perl_run() # JMPENV level 1 343 #0 main() 344 345and the context and data stacks, as shown by C<perl -Dstv>, look like: 346 347 STACK 0: MAIN 348 CX 0: BLOCK => 349 CX 1: EVAL => AV() PV("A"\0) 350 retop=leave 351 STACK 1: MAGIC 352 CX 0: SUB => 353 retop=(null) 354 CX 1: EVAL => * 355 retop=nextstate 356 357The die() pops the first C<CXt_EVAL> off the context stack, sets 358C<PL_restartop> from it, does a C<JMPENV_JUMP(3)>, and control returns 359to the C<JMPENV> level set in docatch(). This then starts another 360third-level runops level, which executes the C<nextstate>, C<pushmark> and 361C<die> ops from line 4. At the point that the second pp_die() is called, 362the C call stack looks exactly like that above, even though we are no 363longer within an inner eval. However, the context stack now looks like 364this, i.e. with the top CXt_EVAL popped: 365 366 STACK 0: MAIN 367 CX 0: BLOCK => 368 CX 1: EVAL => AV() PV("A"\0) 369 retop=leave 370 STACK 1: MAGIC 371 CX 0: SUB => 372 retop=(null) 373 374The die() on line 4 pops the context stack back down to the C<CXt_EVAL>, 375leaving it as: 376 377 STACK 0: MAIN 378 CX 0: BLOCK => 379 380As usual, C<PL_restartop> is extracted from the C<CXt_EVAL>, and a 381JMPENV_JUMP(3) done, which pops the C stack back to the docatch(): 382 383 #8 S_docatch() # JMPENV level 2 384 #7 Perl_pp_entertry() 385 #6 Perl_runops() # runops loop 2 386 #5 Perl_call_sv() 387 #4 Perl_pp_tie() 388 #3 Perl_runops() # runops loop 1 389 #2 S_run_body() 390 #1 perl_run() # JMPENV level 1 391 #0 main() 392 393In this case, because the C<JMPENV> level recorded in the C<CXt_EVAL> 394differs from the current one, docatch() just does a JMPENV_JUMP(3) 395to re-throw the exception, and the C stack unwinds to: 396 397 #1 perl_run() # JMPENV level 1 398 #0 main() 399 400Because C<PL_restartop> is non-null, run_body() starts a new runops 401loop, and execution continues. 402 403 404=head2 INTERNAL VARIABLE TYPES 405 406You should by now have had a look at L<perlguts>, which tells you about 407Perl's internal variable types: SVs, HVs, AVs and the rest. If not, do 408that now. 409 410These variables are used not only to represent Perl-space variables, 411but also any constants in the code, as well as some structures 412completely internal to Perl. The symbol table, for instance, is an 413ordinary Perl hash. Your code is represented by an SV as it's read into 414the parser; any program files you call are opened via ordinary Perl 415filehandles, and so on. 416 417The core L<Devel::Peek|Devel::Peek> module lets us examine SVs from a 418Perl program. Let's see, for instance, how Perl treats the constant 419C<"hello">. 420 421 % perl -MDevel::Peek -e 'Dump("hello")' 422 1 SV = PV(0xa041450) at 0xa04ecbc 423 2 REFCNT = 1 424 3 FLAGS = (POK,READONLY,pPOK) 425 4 PV = 0xa0484e0 "hello"\0 426 5 CUR = 5 427 6 LEN = 6 428 429Reading C<Devel::Peek> output takes a bit of practise, so let's go 430through it line by line. 431 432Line 1 tells us we're looking at an SV which lives at C<0xa04ecbc> in 433memory. SVs themselves are very simple structures, but they contain a 434pointer to a more complex structure. In this case, it's a PV, a 435structure which holds a string value, at location C<0xa041450>. Line 2 436is the reference count; there are no other references to this data, so 437it's 1. 438 439Line 3 are the flags for this SV - it's OK to use it as a PV, it's a 440read-only SV (because it's a constant) and the data is a PV internally. 441Next we've got the contents of the string, starting at location 442C<0xa0484e0>. 443 444Line 5 gives us the current length of the string - note that this does 445B<not> include the null terminator. Line 6 is not the length of the 446string, but the length of the currently allocated buffer; as the string 447grows, Perl automatically extends the available storage via a routine 448called C<SvGROW>. 449 450You can get at any of these quantities from C very easily; just add 451C<Sv> to the name of the field shown in the snippet, and you've got a 452macro which will return the value: C<SvCUR(sv)> returns the current 453length of the string, C<SvREFCOUNT(sv)> returns the reference count, 454C<SvPV(sv, len)> returns the string itself with its length, and so on. 455More macros to manipulate these properties can be found in L<perlguts>. 456 457Let's take an example of manipulating a PV, from C<sv_catpvn>, in 458F<sv.c> 459 460 1 void 461 2 Perl_sv_catpvn(pTHX_ SV *sv, const char *ptr, STRLEN len) 462 3 { 463 4 STRLEN tlen; 464 5 char *junk; 465 466 6 junk = SvPV_force(sv, tlen); 467 7 SvGROW(sv, tlen + len + 1); 468 8 if (ptr == junk) 469 9 ptr = SvPVX(sv); 470 10 Move(ptr,SvPVX(sv)+tlen,len,char); 471 11 SvCUR(sv) += len; 472 12 *SvEND(sv) = '\0'; 473 13 (void)SvPOK_only_UTF8(sv); /* validate pointer */ 474 14 SvTAINT(sv); 475 15 } 476 477This is a function which adds a string, C<ptr>, of length C<len> onto 478the end of the PV stored in C<sv>. The first thing we do in line 6 is 479make sure that the SV B<has> a valid PV, by calling the C<SvPV_force> 480macro to force a PV. As a side effect, C<tlen> gets set to the current 481value of the PV, and the PV itself is returned to C<junk>. 482 483In line 7, we make sure that the SV will have enough room to 484accommodate the old string, the new string and the null terminator. If 485C<LEN> isn't big enough, C<SvGROW> will reallocate space for us. 486 487Now, if C<junk> is the same as the string we're trying to add, we can 488grab the string directly from the SV; C<SvPVX> is the address of the PV 489in the SV. 490 491Line 10 does the actual catenation: the C<Move> macro moves a chunk of 492memory around: we move the string C<ptr> to the end of the PV - that's 493the start of the PV plus its current length. We're moving C<len> bytes 494of type C<char>. After doing so, we need to tell Perl we've extended 495the string, by altering C<CUR> to reflect the new length. C<SvEND> is a 496macro which gives us the end of the string, so that needs to be a 497C<"\0">. 498 499Line 13 manipulates the flags; since we've changed the PV, any IV or NV 500values will no longer be valid: if we have C<$x=10; $x.="6";> we don't 501want to use the old IV of 10. C<SvPOK_only_utf8> is a special 502UTF-8-aware version of C<SvPOK_only>, a macro which turns off the IOK 503and NOK flags and turns on POK. The final C<SvTAINT> is a macro which 504launders tainted data if taint mode is turned on. 505 506AVs and HVs are more complicated, but SVs are by far the most common 507variable type being thrown around. Having seen something of how we 508manipulate these, let's go on and look at how the op tree is 509constructed. 510 511=head1 OP TREES 512 513First, what is the op tree, anyway? The op tree is the parsed 514representation of your program, as we saw in our section on parsing, 515and it's the sequence of operations that Perl goes through to execute 516your program, as we saw in L</Running>. 517 518An op is a fundamental operation that Perl can perform: all the 519built-in functions and operators are ops, and there are a series of ops 520which deal with concepts the interpreter needs internally - entering 521and leaving a block, ending a statement, fetching a variable, and so 522on. 523 524The op tree is connected in two ways: you can imagine that there are 525two "routes" through it, two orders in which you can traverse the tree. 526First, parse order reflects how the parser understood the code, and 527secondly, execution order tells perl what order to perform the 528operations in. 529 530The easiest way to examine the op tree is to stop Perl after it has 531finished parsing, and get it to dump out the tree. This is exactly what 532the compiler backends L<B::Terse|B::Terse>, L<B::Concise|B::Concise> 533and CPAN module <B::Debug do. 534 535Let's have a look at how Perl sees C<$x = $y + $z>: 536 537 % perl -MO=Terse -e '$x=$y+$z' 538 1 LISTOP (0x8179888) leave 539 2 OP (0x81798b0) enter 540 3 COP (0x8179850) nextstate 541 4 BINOP (0x8179828) sassign 542 5 BINOP (0x8179800) add [1] 543 6 UNOP (0x81796e0) null [15] 544 7 SVOP (0x80fafe0) gvsv GV (0x80fa4cc) *y 545 8 UNOP (0x81797e0) null [15] 546 9 SVOP (0x8179700) gvsv GV (0x80efeb0) *z 547 10 UNOP (0x816b4f0) null [15] 548 11 SVOP (0x816dcf0) gvsv GV (0x80fa460) *x 549 550Let's start in the middle, at line 4. This is a BINOP, a binary 551operator, which is at location C<0x8179828>. The specific operator in 552question is C<sassign> - scalar assignment - and you can find the code 553which implements it in the function C<pp_sassign> in F<pp_hot.c>. As a 554binary operator, it has two children: the add operator, providing the 555result of C<$y+$z>, is uppermost on line 5, and the left hand side is 556on line 10. 557 558Line 10 is the null op: this does exactly nothing. What is that doing 559there? If you see the null op, it's a sign that something has been 560optimized away after parsing. As we mentioned in L</Optimization>, the 561optimization stage sometimes converts two operations into one, for 562example when fetching a scalar variable. When this happens, instead of 563rewriting the op tree and cleaning up the dangling pointers, it's 564easier just to replace the redundant operation with the null op. 565Originally, the tree would have looked like this: 566 567 10 SVOP (0x816b4f0) rv2sv [15] 568 11 SVOP (0x816dcf0) gv GV (0x80fa460) *x 569 570That is, fetch the C<a> entry from the main symbol table, and then look 571at the scalar component of it: C<gvsv> (C<pp_gvsv> in F<pp_hot.c>) 572happens to do both these things. 573 574The right hand side, starting at line 5 is similar to what we've just 575seen: we have the C<add> op (C<pp_add>, also in F<pp_hot.c>) add 576together two C<gvsv>s. 577 578Now, what's this about? 579 580 1 LISTOP (0x8179888) leave 581 2 OP (0x81798b0) enter 582 3 COP (0x8179850) nextstate 583 584C<enter> and C<leave> are scoping ops, and their job is to perform any 585housekeeping every time you enter and leave a block: lexical variables 586are tidied up, unreferenced variables are destroyed, and so on. Every 587program will have those first three lines: C<leave> is a list, and its 588children are all the statements in the block. Statements are delimited 589by C<nextstate>, so a block is a collection of C<nextstate> ops, with 590the ops to be performed for each statement being the children of 591C<nextstate>. C<enter> is a single op which functions as a marker. 592 593That's how Perl parsed the program, from top to bottom: 594 595 Program 596 | 597 Statement 598 | 599 = 600 / \ 601 / \ 602 $x + 603 / \ 604 $y $z 605 606However, it's impossible to B<perform> the operations in this order: 607you have to find the values of C<$y> and C<$z> before you add them 608together, for instance. So, the other thread that runs through the op 609tree is the execution order: each op has a field C<op_next> which 610points to the next op to be run, so following these pointers tells us 611how perl executes the code. We can traverse the tree in this order 612using the C<exec> option to C<B::Terse>: 613 614 % perl -MO=Terse,exec -e '$x=$y+$z' 615 1 OP (0x8179928) enter 616 2 COP (0x81798c8) nextstate 617 3 SVOP (0x81796c8) gvsv GV (0x80fa4d4) *y 618 4 SVOP (0x8179798) gvsv GV (0x80efeb0) *z 619 5 BINOP (0x8179878) add [1] 620 6 SVOP (0x816dd38) gvsv GV (0x80fa468) *x 621 7 BINOP (0x81798a0) sassign 622 8 LISTOP (0x8179900) leave 623 624This probably makes more sense for a human: enter a block, start a 625statement. Get the values of C<$y> and C<$z>, and add them together. 626Find C<$x>, and assign one to the other. Then leave. 627 628The way Perl builds up these op trees in the parsing process can be 629unravelled by examining F<toke.c>, the lexer, and F<perly.y>, the YACC 630grammar. Let's look at the code that constructs the tree for C<$x = $y + 631$z>. 632 633First, we'll look at the C<Perl_yylex> function in the lexer. We want to 634look for C<case 'x'>, where x is the first character of the operator. 635(Incidentally, when looking for the code that handles a keyword, you'll 636want to search for C<KEY_foo> where "foo" is the keyword.) Here is the code 637that handles assignment (there are quite a few operators beginning with 638C<=>, so most of it is omitted for brevity): 639 640 1 case '=': 641 2 s++; 642 ... code that handles == => etc. and pod ... 643 3 pl_yylval.ival = 0; 644 4 OPERATOR(ASSIGNOP); 645 646We can see on line 4 that our token type is C<ASSIGNOP> (C<OPERATOR> is a 647macro, defined in F<toke.c>, that returns the token type, among other 648things). And C<+>: 649 650 1 case '+': 651 2 { 652 3 const char tmp = *s++; 653 ... code for ++ ... 654 4 if (PL_expect == XOPERATOR) { 655 ... 656 5 Aop(OP_ADD); 657 6 } 658 ... 659 7 } 660 661Line 4 checks what type of token we are expecting. C<Aop> returns a token. 662If you search for C<Aop> elsewhere in F<toke.c>, you will see that it 663returns an C<ADDOP> token. 664 665Now that we know the two token types we want to look for in the parser, 666let's take the piece of F<perly.y> we need to construct the tree for 667C<$x = $y + $z> 668 669 1 term : term ASSIGNOP term 670 2 { $$ = newASSIGNOP(OPf_STACKED, $1, $2, $3); } 671 3 | term ADDOP term 672 4 { $$ = newBINOP($2, 0, scalar($1), scalar($3)); } 673 674If you're not used to reading BNF grammars, this is how it works: 675You're fed certain things by the tokeniser, which generally end up in 676upper case. C<ADDOP> and C<ASSIGNOP> are examples of "terminal symbols", 677because you can't get any simpler than 678them. 679 680The grammar, lines one and three of the snippet above, tells you how to 681build up more complex forms. These complex forms, "non-terminal 682symbols" are generally placed in lower case. C<term> here is a 683non-terminal symbol, representing a single expression. 684 685The grammar gives you the following rule: you can make the thing on the 686left of the colon if you see all the things on the right in sequence. 687This is called a "reduction", and the aim of parsing is to completely 688reduce the input. There are several different ways you can perform a 689reduction, separated by vertical bars: so, C<term> followed by C<=> 690followed by C<term> makes a C<term>, and C<term> followed by C<+> 691followed by C<term> can also make a C<term>. 692 693So, if you see two terms with an C<=> or C<+>, between them, you can 694turn them into a single expression. When you do this, you execute the 695code in the block on the next line: if you see C<=>, you'll do the code 696in line 2. If you see C<+>, you'll do the code in line 4. It's this 697code which contributes to the op tree. 698 699 | term ADDOP term 700 { $$ = newBINOP($2, 0, scalar($1), scalar($3)); } 701 702What this does is creates a new binary op, and feeds it a number of 703variables. The variables refer to the tokens: C<$1> is the first token 704in the input, C<$2> the second, and so on - think regular expression 705backreferences. C<$$> is the op returned from this reduction. So, we 706call C<newBINOP> to create a new binary operator. The first parameter 707to C<newBINOP>, a function in F<op.c>, is the op type. It's an addition 708operator, so we want the type to be C<ADDOP>. We could specify this 709directly, but it's right there as the second token in the input, so we 710use C<$2>. The second parameter is the op's flags: 0 means "nothing 711special". Then the things to add: the left and right hand side of our 712expression, in scalar context. 713 714The functions that create ops, which have names like C<newUNOP> and 715C<newBINOP>, call a "check" function associated with each op type, before 716returning the op. The check functions can mangle the op as they see fit, 717and even replace it with an entirely new one. These functions are defined 718in F<op.c>, and have a C<Perl_ck_> prefix. You can find out which 719check function is used for a particular op type by looking in 720F<regen/opcodes>. Take C<OP_ADD>, for example. (C<OP_ADD> is the token 721value from the C<Aop(OP_ADD)> in F<toke.c> which the parser passes to 722C<newBINOP> as its first argument.) Here is the relevant line: 723 724 add addition (+) ck_null IfsT2 S S 725 726The check function in this case is C<Perl_ck_null>, which does nothing. 727Let's look at a more interesting case: 728 729 readline <HANDLE> ck_readline t% F? 730 731And here is the function from F<op.c>: 732 733 1 OP * 734 2 Perl_ck_readline(pTHX_ OP *o) 735 3 { 736 4 PERL_ARGS_ASSERT_CK_READLINE; 737 5 738 6 if (o->op_flags & OPf_KIDS) { 739 7 OP *kid = cLISTOPo->op_first; 740 8 if (kid->op_type == OP_RV2GV) 741 9 kid->op_private |= OPpALLOW_FAKE; 742 10 } 743 11 else { 744 12 OP * const newop 745 13 = newUNOP(OP_READLINE, 0, newGVOP(OP_GV, 0, 746 14 PL_argvgv)); 747 15 op_free(o); 748 16 return newop; 749 17 } 750 18 return o; 751 19 } 752 753One particularly interesting aspect is that if the op has no kids (i.e., 754C<readline()> or C<< <> >>) the op is freed and replaced with an entirely 755new one that references C<*ARGV> (lines 12-16). 756 757=head1 STACKS 758 759When perl executes something like C<addop>, how does it pass on its 760results to the next op? The answer is, through the use of stacks. Perl 761has a number of stacks to store things it's currently working on, and 762we'll look at the three most important ones here. 763 764=head2 Argument stack 765 766Arguments are passed to PP code and returned from PP code using the 767argument stack, C<ST>. The typical way to handle arguments is to pop 768them off the stack, deal with them how you wish, and then push the 769result back onto the stack. This is how, for instance, the cosine 770operator works: 771 772 NV value; 773 value = POPn; 774 value = Perl_cos(value); 775 XPUSHn(value); 776 777We'll see a more tricky example of this when we consider Perl's macros 778below. C<POPn> gives you the NV (floating point value) of the top SV on 779the stack: the C<$x> in C<cos($x)>. Then we compute the cosine, and 780push the result back as an NV. The C<X> in C<XPUSHn> means that the 781stack should be extended if necessary - it can't be necessary here, 782because we know there's room for one more item on the stack, since 783we've just removed one! The C<XPUSH*> macros at least guarantee safety. 784 785Alternatively, you can fiddle with the stack directly: C<SP> gives you 786the first element in your portion of the stack, and C<TOP*> gives you 787the top SV/IV/NV/etc. on the stack. So, for instance, to do unary 788negation of an integer: 789 790 SETi(-TOPi); 791 792Just set the integer value of the top stack entry to its negation. 793 794Argument stack manipulation in the core is exactly the same as it is in 795XSUBs - see L<perlxstut>, L<perlxs> and L<perlguts> for a longer 796description of the macros used in stack manipulation. 797 798=head2 Mark stack 799 800I say "your portion of the stack" above because PP code doesn't 801necessarily get the whole stack to itself: if your function calls 802another function, you'll only want to expose the arguments aimed for 803the called function, and not (necessarily) let it get at your own data. 804The way we do this is to have a "virtual" bottom-of-stack, exposed to 805each function. The mark stack keeps bookmarks to locations in the 806argument stack usable by each function. For instance, when dealing with 807a tied variable, (internally, something with "P" magic) Perl has to 808call methods for accesses to the tied variables. However, we need to 809separate the arguments exposed to the method to the argument exposed to 810the original function - the store or fetch or whatever it may be. 811Here's roughly how the tied C<push> is implemented; see C<av_push> in 812F<av.c>: 813 814 1 PUSHMARK(SP); 815 2 EXTEND(SP,2); 816 3 PUSHs(SvTIED_obj((SV*)av, mg)); 817 4 PUSHs(val); 818 5 PUTBACK; 819 6 ENTER; 820 7 call_method("PUSH", G_SCALAR|G_DISCARD); 821 8 LEAVE; 822 823Let's examine the whole implementation, for practice: 824 825 1 PUSHMARK(SP); 826 827Push the current state of the stack pointer onto the mark stack. This 828is so that when we've finished adding items to the argument stack, Perl 829knows how many things we've added recently. 830 831 2 EXTEND(SP,2); 832 3 PUSHs(SvTIED_obj((SV*)av, mg)); 833 4 PUSHs(val); 834 835We're going to add two more items onto the argument stack: when you 836have a tied array, the C<PUSH> subroutine receives the object and the 837value to be pushed, and that's exactly what we have here - the tied 838object, retrieved with C<SvTIED_obj>, and the value, the SV C<val>. 839 840=for apidoc_section $magic 841=for apidoc Amh||SvTIED_obj|SV *sv|MAGIC *mg 842 843 5 PUTBACK; 844 845Next we tell Perl to update the global stack pointer from our internal 846variable: C<dSP> only gave us a local copy, not a reference to the 847global. 848 849 6 ENTER; 850 7 call_method("PUSH", G_SCALAR|G_DISCARD); 851 8 LEAVE; 852 853C<ENTER> and C<LEAVE> localise a block of code - they make sure that 854all variables are tidied up, everything that has been localised gets 855its previous value returned, and so on. Think of them as the C<{> and 856C<}> of a Perl block. 857 858To actually do the magic method call, we have to call a subroutine in 859Perl space: C<call_method> takes care of that, and it's described in 860L<perlcall>. We call the C<PUSH> method in scalar context, and we're 861going to discard its return value. The call_method() function removes 862the top element of the mark stack, so there is nothing for the caller 863to clean up. 864 865=head2 Save stack 866 867C doesn't have a concept of local scope, so perl provides one. We've 868seen that C<ENTER> and C<LEAVE> are used as scoping braces; the save 869stack implements the C equivalent of, for example: 870 871 { 872 local $foo = 42; 873 ... 874 } 875 876See L<perlguts/"Localizing changes"> for how to use the save stack. 877 878=head1 MILLIONS OF MACROS 879 880One thing you'll notice about the Perl source is that it's full of 881macros. Some have called the pervasive use of macros the hardest thing 882to understand, others find it adds to clarity. Let's take an example, 883a stripped-down version the code which implements the addition operator: 884 885 1 PP(pp_add) 886 2 { 887 3 dSP; dATARGET; 888 4 tryAMAGICbin_MG(add_amg, AMGf_assign|AMGf_numeric); 889 5 { 890 6 dPOPTOPnnrl_ul; 891 7 SETn( left + right ); 892 8 RETURN; 893 9 } 894 10 } 895 896Every line here (apart from the braces, of course) contains a macro. 897The first line sets up the function declaration as Perl expects for PP 898code; line 3 sets up variable declarations for the argument stack and 899the target, the return value of the operation. Line 4 tries to see 900if the addition operation is overloaded; if so, the appropriate 901subroutine is called. 902 903Line 6 is another variable declaration - all variable declarations 904start with C<d> - which pops from the top of the argument stack two NVs 905(hence C<nn>) and puts them into the variables C<right> and C<left>, 906hence the C<rl>. These are the two operands to the addition operator. 907Next, we call C<SETn> to set the NV of the return value to the result 908of adding the two values. This done, we return - the C<RETURN> macro 909makes sure that our return value is properly handled, and we pass the 910next operator to run back to the main run loop. 911 912Most of these macros are explained in L<perlapi>, and some of the more 913important ones are explained in L<perlxs> as well. Pay special 914attention to L<perlguts/Background and MULTIPLICITY> for 915information on the C<[pad]THX_?> macros. 916 917=head1 FURTHER READING 918 919For more information on the Perl internals, please see the documents 920listed at L<perl/Internals and C Language Interface>. 921