xref: /openbsd/gnu/usr.bin/perl/pod/perlinterp.pod (revision 5486feef)
1=encoding utf8
2
3=for comment
4Consistent formatting of this file is achieved with:
5  perl ./Porting/podtidy pod/perlinterp.pod
6
7=head1 NAME
8
9perlinterp - An overview of the Perl interpreter
10
11=head1 DESCRIPTION
12
13This document provides an overview of how the Perl interpreter works at
14the level of C code, along with pointers to the relevant C source code
15files.
16
17=head1 ELEMENTS OF THE INTERPRETER
18
19The work of the interpreter has two main stages: compiling the code
20into the internal representation, or bytecode, and then executing it.
21L<perlguts/Compiled code> explains exactly how the compilation stage
22happens.
23
24Here is a short breakdown of perl's operation:
25
26=head2 Startup
27
28The action begins in F<perlmain.c>. (or F<miniperlmain.c> for miniperl)
29This is very high-level code, enough to fit on a single screen, and it
30resembles the code found in L<perlembed>; most of the real action takes
31place in F<perl.c>
32
33F<perlmain.c> is generated by C<ExtUtils::Miniperl> from
34F<miniperlmain.c> at make time, so you should make perl to follow this
35along.
36
37First, F<perlmain.c> allocates some memory and constructs a Perl
38interpreter, along these lines:
39
40    1 PERL_SYS_INIT3(&argc,&argv,&env);
41    2
42    3 if (!PL_do_undump) {
43    4     my_perl = perl_alloc();
44    5     if (!my_perl)
45    6         exit(1);
46    7     perl_construct(my_perl);
47    8     PL_perl_destruct_level = 0;
48    9 }
49
50Line 1 is a macro, and its definition is dependent on your operating
51system. Line 3 references C<PL_do_undump>, a global variable - all
52global variables in Perl start with C<PL_>. This tells you whether the
53current running program was created with the C<-u> flag to perl and
54then F<undump>, which means it's going to be false in any sane context.
55
56Line 4 calls a function in F<perl.c> to allocate memory for a Perl
57interpreter. It's quite a simple function, and the guts of it looks
58like this:
59
60 my_perl = (PerlInterpreter*)PerlMem_malloc(sizeof(PerlInterpreter));
61
62Here you see an example of Perl's system abstraction, which we'll see
63later: C<PerlMem_malloc> is either your system's C<malloc>, or Perl's
64own C<malloc> as defined in F<malloc.c> if you selected that option at
65configure time.
66
67Next, in line 7, we construct the interpreter using perl_construct,
68also in F<perl.c>; this sets up all the special variables that Perl
69needs, the stacks, and so on.
70
71Now we pass Perl the command line options, and tell it to go:
72
73 if (!perl_parse(my_perl, xs_init, argc, argv, (char **)NULL))
74     perl_run(my_perl);
75
76 exitstatus = perl_destruct(my_perl);
77
78 perl_free(my_perl);
79
80C<perl_parse> is actually a wrapper around C<S_parse_body>, as defined
81in F<perl.c>, which processes the command line options, sets up any
82statically linked XS modules, opens the program and calls C<yyparse> to
83parse it.
84
85=head2 Parsing
86
87The aim of this stage is to take the Perl source, and turn it into an
88op tree. We'll see what one of those looks like later. Strictly
89speaking, there's three things going on here.
90
91C<yyparse>, the parser, lives in F<perly.c>, although you're better off
92reading the original YACC input in F<perly.y>. (Yes, Virginia, there
93B<is> a YACC grammar for Perl!) The job of the parser is to take your
94code and "understand" it, splitting it into sentences, deciding which
95operands go with which operators and so on.
96
97The parser is nobly assisted by the lexer, which chunks up your input
98into tokens, and decides what type of thing each token is: a variable
99name, an operator, a bareword, a subroutine, a core function, and so
100on. The main point of entry to the lexer is C<yylex>, and that and its
101associated routines can be found in F<toke.c>. Perl isn't much like
102other computer languages; it's highly context sensitive at times, it
103can be tricky to work out what sort of token something is, or where a
104token ends. As such, there's a lot of interplay between the tokeniser
105and the parser, which can get pretty frightening if you're not used to
106it.
107
108As the parser understands a Perl program, it builds up a tree of
109operations for the interpreter to perform during execution. The
110routines which construct and link together the various operations are
111to be found in F<op.c>, and will be examined later.
112
113=head2 Optimization
114
115Now the parsing stage is complete, and the finished tree represents the
116operations that the Perl interpreter needs to perform to execute our
117program. Next, Perl does a dry run over the tree looking for
118optimisations: constant expressions such as C<3 + 4> will be computed
119now, and the optimizer will also see if any multiple operations can be
120replaced with a single one. For instance, to fetch the variable
121C<$foo>, instead of grabbing the glob C<*foo> and looking at the scalar
122component, the optimizer fiddles the op tree to use a function which
123directly looks up the scalar in question. The main optimizer is C<peep>
124in F<op.c>, and many ops have their own optimizing functions.
125
126=head2 Running
127
128Now we're finally ready to go: we have compiled Perl byte code, and all
129that's left to do is run it. The actual execution is done by the
130C<runops_standard> function in F<run.c>; more specifically, it's done
131by these three innocent looking lines:
132
133    while ((PL_op = PL_op->op_ppaddr(aTHX))) {
134        PERL_ASYNC_CHECK();
135    }
136
137You may be more comfortable with the Perl version of that:
138
139    PERL_ASYNC_CHECK() while $Perl::op = &{$Perl::op->{function}};
140
141Well, maybe not. Anyway, each op contains a function pointer, which
142stipulates the function which will actually carry out the operation.
143This function will return the next op in the sequence - this allows for
144things like C<if> which choose the next op dynamically at run time. The
145C<PERL_ASYNC_CHECK> makes sure that things like signals interrupt
146execution if required.
147
148=for apidoc_section $embedding
149=for apidoc Amh|void|PERL_ASYNC_CHECK
150
151The actual functions called are known as PP code, and they're spread
152between four files: F<pp_hot.c> contains the "hot" code, which is most
153often used and highly optimized, F<pp_sys.c> contains all the
154system-specific functions, F<pp_ctl.c> contains the functions which
155implement control structures (C<if>, C<while> and the like) and F<pp.c>
156contains everything else. These are, if you like, the C code for Perl's
157built-in functions and operators.
158
159Note that each C<pp_> function is expected to return a pointer to the
160next op. Calls to perl subs (and eval blocks) are handled within the
161same runops loop, and do not consume extra space on the C stack. For
162example, C<pp_entersub> and C<pp_entertry> just push a C<CXt_SUB> or
163C<CXt_EVAL> block struct onto the context stack, which contain the address
164of the op following the sub call or eval. They then return the first op
165of that sub or eval block, and so execution continues of that sub or
166block. Later, a C<pp_leavesub> or C<pp_leavetry> op pops the C<CXt_SUB>
167or C<CXt_EVAL>, retrieves the return op from it, and returns it.
168
169=head2 Exception handing
170
171Perl's exception handing (i.e. C<die> etc.) is built on top of the
172low-level C<setjmp()>/C<longjmp()> C-library functions. These basically
173provide a way to capture the current PC and SP registers of the CPU and
174later restore them: i.e. a C<longjmp()> continues at the point in code
175where a previous C<setjmp()> was done, with anything further up on the C
176stack being lost. (This is why code should always save values using
177C<SAVE_I<FOO>> rather than in auto variables.)
178
179=for apidoc_section $exceptions
180=for apidoc Amh|void|JMPENV_PUSH|int v
181=for apidoc Amh|void|JMPENV_JUMP|int v
182=for apidoc Amnh|OP *|PL_restartop
183
184The perl core wraps C<setjmp()> and C<longjmp()> in the macros
185C<JMPENV_PUSH> and C<JMPENV_JUMP>. The push operation, as well as setting
186a C<setjump()>, stores some temporary state in a struct local to the
187current function (allocated by C<dJMPENV>). In particular, it stores a
188pointer to the previous C<JMPENV> struct, and updates C<PL_top_env> to
189point to the newest one, forming a chain of C<JMPENV> states. Both the
190push and jump can output debugging information under C<perl -Dl>.
191
192A basic rule of the perl internals is that all interpreter exits are
193achieved via a C<JMPENV_JUMP()>. In particular:
194
195=over
196
197=item * level 2: perl-level exit() and internals my_exit()
198
199These unwind all stacks, then perform a JMPENV_JUMP(2).
200
201=item * level 3: perl-level die() and internals croak()
202
203If currently within an eval, these pop the context stack back to the
204nearest C<CXt_EVAL> frame, set C<$@> as appropriate, set C<PL_restartop>
205to the op which follows the eval associated with that frame, then perform
206a JMPENV_JUMP(3).
207
208Otherwise, the error message is printed to C<STDERR>, then it is treated
209as an exit: unwind all stacks and perform a JMPENV_JUMP(2).
210
211=item * level 1: unused
212
213JMPENV_JUMP(1) is currently unused except in perl_run().
214
215=item * level 0: normal return.
216
217The zero value is for a normal return from JMPENV_PUSH()
218
219=back
220
221So the perl interpreter expects that, at all times, there is a suitable
222C<JMPENV_PUSH> set up (and at a suitable location within the CPU call
223stack) that can catch and process a 2- or 3-valued jump; and in the case
224of a 3, start a new runops loop to execute C<PL_restartop> and all
225remaining ops (as will be explained shortly).
226
227The entry points to the perl interpreter all provide such a facility. For
228example, perl_parse(),  perl_run() and  C<call_sv(cv, G_EVAL)> all contain
229something similar in outline to:
230
231    {
232        dJMPENV;
233        JMPENV_PUSH(ret);
234        switch (ret) {
235        case 0:                     /* normal return from JMPENV_PUSH() */
236          redo_body:
237            CALLRUNOPS(aTHX);
238            break;
239        case 2:                     /* caught longjmp(2) - exit / die */
240            break;
241        case 3:                     /* caught longjmp(3) - eval { die } */
242            PL_op = PL_restartop;
243            goto redo_body;
244        }
245
246        JMPENV_POP;
247    }
248
249A runops loop such as Perl_runops_standard() (as set up by CALLRUNOPS())
250is, at its heart, just a simple:
251
252    while ((PL_op = PL_op->op_ppaddr(aTHX))) { 1; }
253
254which calls the pp() function associated with each op, relying on that to
255return a pointer to the next op to be executed.
256
257As well as setting catches at the entry points to the perl interpreter,
258you might expect perl to also do a JMPENV_PUSH() in places like
259pp_entertry(), just before some trappable ops are executed. In fact perl
260doesn't normally do this. The drawback with doing it is that with nested
261or recursive code such as:
262
263    sub foo { my ($i) = @_; return if $i < 0; eval { foo(--$i) } }
264
265Then the C stack would quickly overflow with pairs of entries like
266
267    ...
268    #N+3 Perl_runops()
269    #N+2 Perl_pp_entertry()
270    #N+1 Perl_runops()
271    #N   Perl_pp_entertry()
272    ...
273
274Instead, perl puts its guards at the I<callers> of runops loops. Then as
275many nested subroutine calls and evals may be called as you like, all
276within the one runops loop. If an exception occurs, control passes back to
277the caller of the loop, which just immediately restarts a new loop with
278C<PL_restartop> being the next op to call.
279
280So in normal operation where there are several nested evals, there
281will be multiple C<CXt_EVAL> context stack entries, but only a single
282runops loop, guarded by a single C<JMPENV_PUSH>. Each caught eval will pop
283the next C<CXt_EVAL> off the stack, set C<PL_restartop>, then longjmp()
284back to perl_run() and continue.
285
286However, ops are sometimes executed within an inner runops loop, such as
287in a tie, sort, or overload code. In this case, something like
288
289    sub FETCH { eval { die }; .... }
290
291would, unless handled specially, cause a longjmp() right back to the guard
292in perl_run(), popping I<both> the runops loops - which is clearly
293incorrect.  One way to avoid this is for the tie code to do a
294C<JMPENV_PUSH> before executing C<FETCH> in the inner runops loop, but for
295efficiency reasons, perl in fact just temporarily sets a flag using
296C<CATCH_SET(TRUE)>. This flag warns any subsequent C<require>,
297C<entereval> or C<entertry> ops that the caller is no longer promising to
298catch any raised exceptions on their behalf.
299
300These ops check this flag, and if true, they (via docatch()) do a
301C<JMPENV_PUSH> and start a new runops loop to execute the code, rather
302than doing it with the current loop.
303
304As a consequence, on exit from the eval block in the C<FETCH> above,
305execution of the code following the block is still carried on in the
306inner loop (i.e. the one established by the pp_entertry()). To avoid
307confusion, if a further exception is then raised, docatch() compares the
308C<JMPENV> level of the C<CXt_EVAL> with C<PL_top_env> and if they differ,
309just re-throws the exception. In this way any inner loops get popped,
310and the exception will be dealt with properly by the level which is
311expecting it.
312
313Here's an example.
314
315    1: eval { tie @a, 'A' };
316    2: sub A::TIEARRAY {
317    3:     eval { die };
318    4:     die;
319    5: }
320
321To run this code, perl_run() is called, which does a JMPENV_PUSH(),
322then enters a runops loop. This loop executes the C<entereval> and C<tie>
323ops on line 1, with the C<entereval> pushing a C<CXt_EVAL> onto the context
324stack.
325
326The pp_tie() does a C<CATCH_SET(TRUE)>, then starts a second runops
327loop to execute the body of TIEARRAY(). When the loop executes the
328C<entertry> op on line 3, CATCH_GET() is true, so pp_entertry() calls
329docatch() which does a C<JMPENV_PUSH> and starts a third runops loop,
330which restarts the pp_entertry(), then executes the C<die> op. At this
331point the C call stack looks like this:
332
333    #10 Perl_pp_die()
334    #9  Perl_runops()      # runops loop 3
335    #8  S_docatch()        # JMPENV level 2
336    #7  Perl_pp_entertry()
337    #6  Perl_runops()      # runops loop 2
338    #5  Perl_call_sv()
339    #4  Perl_pp_tie()
340    #3  Perl_runops()      # runops loop 1
341    #2  S_run_body()
342    #1  perl_run()         # JMPENV level 1
343    #0  main()
344
345and the context and data stacks, as shown by C<perl -Dstv>, look like:
346
347    STACK 0: MAIN
348      CX 0: BLOCK  =>
349      CX 1: EVAL   => AV()  PV("A"\0)
350      retop=leave
351    STACK 1: MAGIC
352      CX 0: SUB    =>
353      retop=(null)
354      CX 1: EVAL   => *
355    retop=nextstate
356
357The die() pops the first C<CXt_EVAL> off the context stack, sets
358C<PL_restartop> from it, does a C<JMPENV_JUMP(3)>, and control returns
359to the C<JMPENV> level set in docatch(). This then starts another
360third-level runops level, which executes the C<nextstate>, C<pushmark> and
361C<die> ops from line 4. At the point that the second pp_die() is called,
362the C call stack looks exactly like that above, even though we are no
363longer within an inner eval. However, the context stack now looks like
364this, i.e. with the top CXt_EVAL popped:
365
366    STACK 0: MAIN
367      CX 0: BLOCK  =>
368      CX 1: EVAL   => AV()  PV("A"\0)
369      retop=leave
370    STACK 1: MAGIC
371      CX 0: SUB    =>
372      retop=(null)
373
374The die() on line 4 pops the context stack back down to the C<CXt_EVAL>,
375leaving it as:
376
377    STACK 0: MAIN
378      CX 0: BLOCK  =>
379
380As usual, C<PL_restartop> is extracted from the C<CXt_EVAL>, and a
381JMPENV_JUMP(3) done, which pops the C stack back to the docatch():
382
383    #8  S_docatch()        # JMPENV level 2
384    #7  Perl_pp_entertry()
385    #6  Perl_runops()      # runops loop 2
386    #5  Perl_call_sv()
387    #4  Perl_pp_tie()
388    #3  Perl_runops()      # runops loop 1
389    #2  S_run_body()
390    #1  perl_run()         # JMPENV level 1
391    #0  main()
392
393In  this case, because the C<JMPENV> level recorded in the C<CXt_EVAL>
394differs from the current one, docatch() just does a JMPENV_JUMP(3)
395to re-throw the exception, and the C stack unwinds to:
396
397    #1  perl_run()         # JMPENV level 1
398    #0  main()
399
400Because C<PL_restartop> is non-null, run_body() starts a new runops
401loop, and execution continues.
402
403
404=head2 INTERNAL VARIABLE TYPES
405
406You should by now have had a look at L<perlguts>, which tells you about
407Perl's internal variable types: SVs, HVs, AVs and the rest. If not, do
408that now.
409
410These variables are used not only to represent Perl-space variables,
411but also any constants in the code, as well as some structures
412completely internal to Perl. The symbol table, for instance, is an
413ordinary Perl hash. Your code is represented by an SV as it's read into
414the parser; any program files you call are opened via ordinary Perl
415filehandles, and so on.
416
417The core L<Devel::Peek|Devel::Peek> module lets us examine SVs from a
418Perl program. Let's see, for instance, how Perl treats the constant
419C<"hello">.
420
421      % perl -MDevel::Peek -e 'Dump("hello")'
422    1 SV = PV(0xa041450) at 0xa04ecbc
423    2   REFCNT = 1
424    3   FLAGS = (POK,READONLY,pPOK)
425    4   PV = 0xa0484e0 "hello"\0
426    5   CUR = 5
427    6   LEN = 6
428
429Reading C<Devel::Peek> output takes a bit of practise, so let's go
430through it line by line.
431
432Line 1 tells us we're looking at an SV which lives at C<0xa04ecbc> in
433memory. SVs themselves are very simple structures, but they contain a
434pointer to a more complex structure. In this case, it's a PV, a
435structure which holds a string value, at location C<0xa041450>. Line 2
436is the reference count; there are no other references to this data, so
437it's 1.
438
439Line 3 are the flags for this SV - it's OK to use it as a PV, it's a
440read-only SV (because it's a constant) and the data is a PV internally.
441Next we've got the contents of the string, starting at location
442C<0xa0484e0>.
443
444Line 5 gives us the current length of the string - note that this does
445B<not> include the null terminator. Line 6 is not the length of the
446string, but the length of the currently allocated buffer; as the string
447grows, Perl automatically extends the available storage via a routine
448called C<SvGROW>.
449
450You can get at any of these quantities from C very easily; just add
451C<Sv> to the name of the field shown in the snippet, and you've got a
452macro which will return the value: C<SvCUR(sv)> returns the current
453length of the string, C<SvREFCOUNT(sv)> returns the reference count,
454C<SvPV(sv, len)> returns the string itself with its length, and so on.
455More macros to manipulate these properties can be found in L<perlguts>.
456
457Let's take an example of manipulating a PV, from C<sv_catpvn>, in
458F<sv.c>
459
460     1  void
461     2  Perl_sv_catpvn(pTHX_ SV *sv, const char *ptr, STRLEN len)
462     3  {
463     4      STRLEN tlen;
464     5      char *junk;
465
466     6      junk = SvPV_force(sv, tlen);
467     7      SvGROW(sv, tlen + len + 1);
468     8      if (ptr == junk)
469     9          ptr = SvPVX(sv);
470    10      Move(ptr,SvPVX(sv)+tlen,len,char);
471    11      SvCUR(sv) += len;
472    12      *SvEND(sv) = '\0';
473    13      (void)SvPOK_only_UTF8(sv);          /* validate pointer */
474    14      SvTAINT(sv);
475    15  }
476
477This is a function which adds a string, C<ptr>, of length C<len> onto
478the end of the PV stored in C<sv>. The first thing we do in line 6 is
479make sure that the SV B<has> a valid PV, by calling the C<SvPV_force>
480macro to force a PV. As a side effect, C<tlen> gets set to the current
481value of the PV, and the PV itself is returned to C<junk>.
482
483In line 7, we make sure that the SV will have enough room to
484accommodate the old string, the new string and the null terminator. If
485C<LEN> isn't big enough, C<SvGROW> will reallocate space for us.
486
487Now, if C<junk> is the same as the string we're trying to add, we can
488grab the string directly from the SV; C<SvPVX> is the address of the PV
489in the SV.
490
491Line 10 does the actual catenation: the C<Move> macro moves a chunk of
492memory around: we move the string C<ptr> to the end of the PV - that's
493the start of the PV plus its current length. We're moving C<len> bytes
494of type C<char>. After doing so, we need to tell Perl we've extended
495the string, by altering C<CUR> to reflect the new length. C<SvEND> is a
496macro which gives us the end of the string, so that needs to be a
497C<"\0">.
498
499Line 13 manipulates the flags; since we've changed the PV, any IV or NV
500values will no longer be valid: if we have C<$x=10; $x.="6";> we don't
501want to use the old IV of 10. C<SvPOK_only_utf8> is a special
502UTF-8-aware version of C<SvPOK_only>, a macro which turns off the IOK
503and NOK flags and turns on POK. The final C<SvTAINT> is a macro which
504launders tainted data if taint mode is turned on.
505
506AVs and HVs are more complicated, but SVs are by far the most common
507variable type being thrown around. Having seen something of how we
508manipulate these, let's go on and look at how the op tree is
509constructed.
510
511=head1 OP TREES
512
513First, what is the op tree, anyway? The op tree is the parsed
514representation of your program, as we saw in our section on parsing,
515and it's the sequence of operations that Perl goes through to execute
516your program, as we saw in L</Running>.
517
518An op is a fundamental operation that Perl can perform: all the
519built-in functions and operators are ops, and there are a series of ops
520which deal with concepts the interpreter needs internally - entering
521and leaving a block, ending a statement, fetching a variable, and so
522on.
523
524The op tree is connected in two ways: you can imagine that there are
525two "routes" through it, two orders in which you can traverse the tree.
526First, parse order reflects how the parser understood the code, and
527secondly, execution order tells perl what order to perform the
528operations in.
529
530The easiest way to examine the op tree is to stop Perl after it has
531finished parsing, and get it to dump out the tree. This is exactly what
532the compiler backends L<B::Terse|B::Terse>, L<B::Concise|B::Concise>
533and CPAN module <B::Debug do.
534
535Let's have a look at how Perl sees C<$x = $y + $z>:
536
537     % perl -MO=Terse -e '$x=$y+$z'
538     1  LISTOP (0x8179888) leave
539     2      OP (0x81798b0) enter
540     3      COP (0x8179850) nextstate
541     4      BINOP (0x8179828) sassign
542     5          BINOP (0x8179800) add [1]
543     6              UNOP (0x81796e0) null [15]
544     7                  SVOP (0x80fafe0) gvsv  GV (0x80fa4cc) *y
545     8              UNOP (0x81797e0) null [15]
546     9                  SVOP (0x8179700) gvsv  GV (0x80efeb0) *z
547    10          UNOP (0x816b4f0) null [15]
548    11              SVOP (0x816dcf0) gvsv  GV (0x80fa460) *x
549
550Let's start in the middle, at line 4. This is a BINOP, a binary
551operator, which is at location C<0x8179828>. The specific operator in
552question is C<sassign> - scalar assignment - and you can find the code
553which implements it in the function C<pp_sassign> in F<pp_hot.c>. As a
554binary operator, it has two children: the add operator, providing the
555result of C<$y+$z>, is uppermost on line 5, and the left hand side is
556on line 10.
557
558Line 10 is the null op: this does exactly nothing. What is that doing
559there? If you see the null op, it's a sign that something has been
560optimized away after parsing. As we mentioned in L</Optimization>, the
561optimization stage sometimes converts two operations into one, for
562example when fetching a scalar variable. When this happens, instead of
563rewriting the op tree and cleaning up the dangling pointers, it's
564easier just to replace the redundant operation with the null op.
565Originally, the tree would have looked like this:
566
567    10          SVOP (0x816b4f0) rv2sv [15]
568    11              SVOP (0x816dcf0) gv  GV (0x80fa460) *x
569
570That is, fetch the C<a> entry from the main symbol table, and then look
571at the scalar component of it: C<gvsv> (C<pp_gvsv> in F<pp_hot.c>)
572happens to do both these things.
573
574The right hand side, starting at line 5 is similar to what we've just
575seen: we have the C<add> op (C<pp_add>, also in F<pp_hot.c>) add
576together two C<gvsv>s.
577
578Now, what's this about?
579
580     1  LISTOP (0x8179888) leave
581     2      OP (0x81798b0) enter
582     3      COP (0x8179850) nextstate
583
584C<enter> and C<leave> are scoping ops, and their job is to perform any
585housekeeping every time you enter and leave a block: lexical variables
586are tidied up, unreferenced variables are destroyed, and so on. Every
587program will have those first three lines: C<leave> is a list, and its
588children are all the statements in the block. Statements are delimited
589by C<nextstate>, so a block is a collection of C<nextstate> ops, with
590the ops to be performed for each statement being the children of
591C<nextstate>. C<enter> is a single op which functions as a marker.
592
593That's how Perl parsed the program, from top to bottom:
594
595                        Program
596                           |
597                       Statement
598                           |
599                           =
600                          / \
601                         /   \
602                        $x   +
603                            / \
604                          $y   $z
605
606However, it's impossible to B<perform> the operations in this order:
607you have to find the values of C<$y> and C<$z> before you add them
608together, for instance. So, the other thread that runs through the op
609tree is the execution order: each op has a field C<op_next> which
610points to the next op to be run, so following these pointers tells us
611how perl executes the code. We can traverse the tree in this order
612using the C<exec> option to C<B::Terse>:
613
614     % perl -MO=Terse,exec -e '$x=$y+$z'
615     1  OP (0x8179928) enter
616     2  COP (0x81798c8) nextstate
617     3  SVOP (0x81796c8) gvsv  GV (0x80fa4d4) *y
618     4  SVOP (0x8179798) gvsv  GV (0x80efeb0) *z
619     5  BINOP (0x8179878) add [1]
620     6  SVOP (0x816dd38) gvsv  GV (0x80fa468) *x
621     7  BINOP (0x81798a0) sassign
622     8  LISTOP (0x8179900) leave
623
624This probably makes more sense for a human: enter a block, start a
625statement. Get the values of C<$y> and C<$z>, and add them together.
626Find C<$x>, and assign one to the other. Then leave.
627
628The way Perl builds up these op trees in the parsing process can be
629unravelled by examining F<toke.c>, the lexer, and F<perly.y>, the YACC
630grammar. Let's look at the code that constructs the tree for C<$x = $y +
631$z>.
632
633First, we'll look at the C<Perl_yylex> function in the lexer. We want to
634look for C<case 'x'>, where x is the first character of the operator.
635(Incidentally, when looking for the code that handles a keyword, you'll
636want to search for C<KEY_foo> where "foo" is the keyword.) Here is the code
637that handles assignment (there are quite a few operators beginning with
638C<=>, so most of it is omitted for brevity):
639
640     1    case '=':
641     2        s++;
642              ... code that handles == => etc. and pod ...
643     3        pl_yylval.ival = 0;
644     4        OPERATOR(ASSIGNOP);
645
646We can see on line 4 that our token type is C<ASSIGNOP> (C<OPERATOR> is a
647macro, defined in F<toke.c>, that returns the token type, among other
648things). And C<+>:
649
650     1     case '+':
651     2         {
652     3             const char tmp = *s++;
653                   ... code for ++ ...
654     4             if (PL_expect == XOPERATOR) {
655                       ...
656     5                 Aop(OP_ADD);
657     6             }
658                   ...
659     7         }
660
661Line 4 checks what type of token we are expecting. C<Aop> returns a token.
662If you search for C<Aop> elsewhere in F<toke.c>, you will see that it
663returns an C<ADDOP> token.
664
665Now that we know the two token types we want to look for in the parser,
666let's take the piece of F<perly.y> we need to construct the tree for
667C<$x = $y + $z>
668
669    1 term    :   term ASSIGNOP term
670    2                { $$ = newASSIGNOP(OPf_STACKED, $1, $2, $3); }
671    3         |   term ADDOP term
672    4                { $$ = newBINOP($2, 0, scalar($1), scalar($3)); }
673
674If you're not used to reading BNF grammars, this is how it works:
675You're fed certain things by the tokeniser, which generally end up in
676upper case. C<ADDOP> and C<ASSIGNOP> are examples of "terminal symbols",
677because you can't get any simpler than
678them.
679
680The grammar, lines one and three of the snippet above, tells you how to
681build up more complex forms. These complex forms, "non-terminal
682symbols" are generally placed in lower case. C<term> here is a
683non-terminal symbol, representing a single expression.
684
685The grammar gives you the following rule: you can make the thing on the
686left of the colon if you see all the things on the right in sequence.
687This is called a "reduction", and the aim of parsing is to completely
688reduce the input. There are several different ways you can perform a
689reduction, separated by vertical bars: so, C<term> followed by C<=>
690followed by C<term> makes a C<term>, and C<term> followed by C<+>
691followed by C<term> can also make a C<term>.
692
693So, if you see two terms with an C<=> or C<+>, between them, you can
694turn them into a single expression. When you do this, you execute the
695code in the block on the next line: if you see C<=>, you'll do the code
696in line 2. If you see C<+>, you'll do the code in line 4. It's this
697code which contributes to the op tree.
698
699            |   term ADDOP term
700            { $$ = newBINOP($2, 0, scalar($1), scalar($3)); }
701
702What this does is creates a new binary op, and feeds it a number of
703variables. The variables refer to the tokens: C<$1> is the first token
704in the input, C<$2> the second, and so on - think regular expression
705backreferences. C<$$> is the op returned from this reduction. So, we
706call C<newBINOP> to create a new binary operator. The first parameter
707to C<newBINOP>, a function in F<op.c>, is the op type. It's an addition
708operator, so we want the type to be C<ADDOP>. We could specify this
709directly, but it's right there as the second token in the input, so we
710use C<$2>. The second parameter is the op's flags: 0 means "nothing
711special". Then the things to add: the left and right hand side of our
712expression, in scalar context.
713
714The functions that create ops, which have names like C<newUNOP> and
715C<newBINOP>, call a "check" function associated with each op type, before
716returning the op. The check functions can mangle the op as they see fit,
717and even replace it with an entirely new one. These functions are defined
718in F<op.c>, and have a C<Perl_ck_> prefix. You can find out which
719check function is used for a particular op type by looking in
720F<regen/opcodes>.  Take C<OP_ADD>, for example. (C<OP_ADD> is the token
721value from the C<Aop(OP_ADD)> in F<toke.c> which the parser passes to
722C<newBINOP> as its first argument.) Here is the relevant line:
723
724    add             addition (+)            ck_null         IfsT2   S S
725
726The check function in this case is C<Perl_ck_null>, which does nothing.
727Let's look at a more interesting case:
728
729    readline        <HANDLE>                ck_readline     t%      F?
730
731And here is the function from F<op.c>:
732
733     1 OP *
734     2 Perl_ck_readline(pTHX_ OP *o)
735     3 {
736     4     PERL_ARGS_ASSERT_CK_READLINE;
737     5
738     6     if (o->op_flags & OPf_KIDS) {
739     7          OP *kid = cLISTOPo->op_first;
740     8          if (kid->op_type == OP_RV2GV)
741     9              kid->op_private |= OPpALLOW_FAKE;
742    10     }
743    11     else {
744    12         OP * const newop
745    13             = newUNOP(OP_READLINE, 0, newGVOP(OP_GV, 0,
746    14                                               PL_argvgv));
747    15         op_free(o);
748    16         return newop;
749    17     }
750    18     return o;
751    19 }
752
753One particularly interesting aspect is that if the op has no kids (i.e.,
754C<readline()> or C<< <> >>) the op is freed and replaced with an entirely
755new one that references C<*ARGV> (lines 12-16).
756
757=head1 STACKS
758
759When perl executes something like C<addop>, how does it pass on its
760results to the next op? The answer is, through the use of stacks. Perl
761has a number of stacks to store things it's currently working on, and
762we'll look at the three most important ones here.
763
764=head2 Argument stack
765
766Arguments are passed to PP code and returned from PP code using the
767argument stack, C<ST>. The typical way to handle arguments is to pop
768them off the stack, deal with them how you wish, and then push the
769result back onto the stack. This is how, for instance, the cosine
770operator works:
771
772      NV value;
773      value = POPn;
774      value = Perl_cos(value);
775      XPUSHn(value);
776
777We'll see a more tricky example of this when we consider Perl's macros
778below. C<POPn> gives you the NV (floating point value) of the top SV on
779the stack: the C<$x> in C<cos($x)>. Then we compute the cosine, and
780push the result back as an NV. The C<X> in C<XPUSHn> means that the
781stack should be extended if necessary - it can't be necessary here,
782because we know there's room for one more item on the stack, since
783we've just removed one! The C<XPUSH*> macros at least guarantee safety.
784
785Alternatively, you can fiddle with the stack directly: C<SP> gives you
786the first element in your portion of the stack, and C<TOP*> gives you
787the top SV/IV/NV/etc. on the stack. So, for instance, to do unary
788negation of an integer:
789
790     SETi(-TOPi);
791
792Just set the integer value of the top stack entry to its negation.
793
794Argument stack manipulation in the core is exactly the same as it is in
795XSUBs - see L<perlxstut>, L<perlxs> and L<perlguts> for a longer
796description of the macros used in stack manipulation.
797
798=head2 Mark stack
799
800I say "your portion of the stack" above because PP code doesn't
801necessarily get the whole stack to itself: if your function calls
802another function, you'll only want to expose the arguments aimed for
803the called function, and not (necessarily) let it get at your own data.
804The way we do this is to have a "virtual" bottom-of-stack, exposed to
805each function. The mark stack keeps bookmarks to locations in the
806argument stack usable by each function. For instance, when dealing with
807a tied variable, (internally, something with "P" magic) Perl has to
808call methods for accesses to the tied variables. However, we need to
809separate the arguments exposed to the method to the argument exposed to
810the original function - the store or fetch or whatever it may be.
811Here's roughly how the tied C<push> is implemented; see C<av_push> in
812F<av.c>:
813
814     1	PUSHMARK(SP);
815     2	EXTEND(SP,2);
816     3	PUSHs(SvTIED_obj((SV*)av, mg));
817     4	PUSHs(val);
818     5	PUTBACK;
819     6	ENTER;
820     7	call_method("PUSH", G_SCALAR|G_DISCARD);
821     8	LEAVE;
822
823Let's examine the whole implementation, for practice:
824
825     1	PUSHMARK(SP);
826
827Push the current state of the stack pointer onto the mark stack. This
828is so that when we've finished adding items to the argument stack, Perl
829knows how many things we've added recently.
830
831     2	EXTEND(SP,2);
832     3	PUSHs(SvTIED_obj((SV*)av, mg));
833     4	PUSHs(val);
834
835We're going to add two more items onto the argument stack: when you
836have a tied array, the C<PUSH> subroutine receives the object and the
837value to be pushed, and that's exactly what we have here - the tied
838object, retrieved with C<SvTIED_obj>, and the value, the SV C<val>.
839
840=for apidoc_section $magic
841=for apidoc Amh||SvTIED_obj|SV *sv|MAGIC *mg
842
843     5	PUTBACK;
844
845Next we tell Perl to update the global stack pointer from our internal
846variable: C<dSP> only gave us a local copy, not a reference to the
847global.
848
849     6	ENTER;
850     7	call_method("PUSH", G_SCALAR|G_DISCARD);
851     8	LEAVE;
852
853C<ENTER> and C<LEAVE> localise a block of code - they make sure that
854all variables are tidied up, everything that has been localised gets
855its previous value returned, and so on. Think of them as the C<{> and
856C<}> of a Perl block.
857
858To actually do the magic method call, we have to call a subroutine in
859Perl space: C<call_method> takes care of that, and it's described in
860L<perlcall>. We call the C<PUSH> method in scalar context, and we're
861going to discard its return value. The call_method() function removes
862the top element of the mark stack, so there is nothing for the caller
863to clean up.
864
865=head2 Save stack
866
867C doesn't have a concept of local scope, so perl provides one. We've
868seen that C<ENTER> and C<LEAVE> are used as scoping braces; the save
869stack implements the C equivalent of, for example:
870
871    {
872        local $foo = 42;
873        ...
874    }
875
876See L<perlguts/"Localizing changes"> for how to use the save stack.
877
878=head1 MILLIONS OF MACROS
879
880One thing you'll notice about the Perl source is that it's full of
881macros. Some have called the pervasive use of macros the hardest thing
882to understand, others find it adds to clarity. Let's take an example,
883a stripped-down version the code which implements the addition operator:
884
885   1  PP(pp_add)
886   2  {
887   3      dSP; dATARGET;
888   4      tryAMAGICbin_MG(add_amg, AMGf_assign|AMGf_numeric);
889   5      {
890   6        dPOPTOPnnrl_ul;
891   7        SETn( left + right );
892   8        RETURN;
893   9      }
894  10  }
895
896Every line here (apart from the braces, of course) contains a macro.
897The first line sets up the function declaration as Perl expects for PP
898code; line 3 sets up variable declarations for the argument stack and
899the target, the return value of the operation. Line 4 tries to see
900if the addition operation is overloaded; if so, the appropriate
901subroutine is called.
902
903Line 6 is another variable declaration - all variable declarations
904start with C<d> - which pops from the top of the argument stack two NVs
905(hence C<nn>) and puts them into the variables C<right> and C<left>,
906hence the C<rl>. These are the two operands to the addition operator.
907Next, we call C<SETn> to set the NV of the return value to the result
908of adding the two values. This done, we return - the C<RETURN> macro
909makes sure that our return value is properly handled, and we pass the
910next operator to run back to the main run loop.
911
912Most of these macros are explained in L<perlapi>, and some of the more
913important ones are explained in L<perlxs> as well. Pay special
914attention to L<perlguts/Background and MULTIPLICITY> for
915information on the C<[pad]THX_?> macros.
916
917=head1 FURTHER READING
918
919For more information on the Perl internals, please see the documents
920listed at L<perl/Internals and C Language Interface>.
921