@c Copyright (C) 2012-2019 Jordi Gutiérrez Hermoso
@c
@c This file is part of Octave.
@c
@c Octave is free software: you can redistribute it and/or modify it
@c under the terms of the GNU General Public License as published by
@c the Free Software Foundation, either version 3 of the License, or
@c (at your option) any later version.
@c
@c Octave is distributed in the hope that it will be useful, but
@c WITHOUT ANY WARRANTY; without even the implied warranty of
@c MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
@c GNU General Public License for more details.
@c
@c You should have received a copy of the GNU General Public License
@c along with Octave; see the file COPYING.  If not, see
@c <https://www.gnu.org/licenses/>.

@node Vectorization and Faster Code Execution
@chapter Vectorization and Faster Code Execution
@cindex vectorization
@cindex vectorize

Vectorization is a programming technique that uses vector operations
instead of element-by-element loop-based operations.  Besides frequently
producing more succinct Octave code, vectorization also allows for better
optimization in the subsequent implementation.  The optimizations may occur
either in Octave's own Fortran, C, or C++ internal implementation, or even at a
lower level depending on the compiler and external numerical libraries used to
build Octave.  The ultimate goal is to make use of your hardware's vector
instructions if possible or to perform other optimizations in software.

Vectorization is not a concept unique to Octave, but it is particularly
important because Octave is a matrix-oriented language.  Vectorized
Octave code will see a dramatic speed up (10X--100X) in most cases.

This chapter discusses vectorization and other techniques for writing faster
code.

@menu
* Basic Vectorization::        Basic techniques for code optimization
* Broadcasting::               Broadcasting operations
* Function Application::       Applying functions to arrays, cells, and structs
* Accumulation::               Accumulation functions
* JIT Compiler::               Just-In-Time Compiler for loops
* Miscellaneous Techniques::   Other techniques for speeding up code
* Examples::
@end menu

@node Basic Vectorization
@section Basic Vectorization

To a very good first approximation, the goal in vectorization is to
write code that avoids loops and uses whole-array operations.  As a
trivial example, consider

@example
@group
for i = 1:n
  for j = 1:m
    c(i,j) = a(i,j) + b(i,j);
  endfor
endfor
@end group
@end example

@noindent
compared to the much simpler

@example
c = a + b;
@end example

@noindent
This isn't merely easier to write; it is also internally much easier to
optimize.  Octave delegates this operation to an underlying
implementation which, among other optimizations, may use special vector
hardware instructions or could conceivably even perform the additions in
parallel.  In general, if the code is vectorized, the underlying
implementation has more freedom about the assumptions it can make
in order to achieve faster execution.

This is especially important for loops with "cheap" bodies.  Often it
suffices to vectorize just the innermost loop to obtain acceptable
performance.  A general rule of thumb is that the "order" of the
vectorized body should be greater than or equal to the "order" of the
enclosing loop.
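
As an illustration of this rule of thumb (the variables @code{x},
@code{w}, and @code{s} here are hypothetical), a doubly nested loop
computing a weighted sum over each row can often be reduced to a single
loop whose body is vectorized:

@example
@group
## the inner loop over columns is replaced by a vectorized expression
for i = 1:n
  s(i) = sum (x(i,:) .* w);
endfor
@end group
@end example

@noindent
In this particular case the outer loop can be eliminated as well, since
the whole computation is just a matrix-vector product,
@code{s = (x * w.').';}.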

As a less trivial example, instead of

@example
@group
for i = 1:n-1
  a(i) = b(i+1) - b(i);
endfor
@end group
@end example

@noindent
write

@example
a = b(2:n) - b(1:n-1);
@end example

This shows an important general concept about using arrays for indexing
instead of looping over an index variable.  @xref{Index Expressions}.
Also use boolean indexing generously.  If a condition needs to be
tested, that condition can also be written as a boolean index.  For
instance, instead of

@example
@group
for i = 1:n
  if (a(i) > 5)
    a(i) -= 20;
  endif
endfor
@end group
@end example

@noindent
write

@example
a(a>5) -= 20;
@end example

@noindent
which exploits the fact that @code{a > 5} produces a boolean index.

Use elementwise vector operators whenever possible to avoid looping
(operators like @code{.*} and @code{.^}).  @xref{Arithmetic Ops}.

Also exploit broadcasting in these elementwise operators to avoid both
looping and unnecessary intermediate memory allocations.
@xref{Broadcasting}.

Use built-in and library functions if possible.  Built-in and compiled
functions are very fast.  Even with an m-file library function, chances
are good that it is already optimized, or will be optimized more in a
future release.

For instance, even better than

@example
a = b(2:n) - b(1:n-1);
@end example

@noindent
is

@example
a = diff (b);
@end example

Most Octave functions are written with vector and array arguments in
mind.  If you find yourself writing a loop with a very simple operation,
chances are that such a function already exists.  The following functions
occur frequently in vectorized code:

@itemize @bullet
@item
Index manipulation

@itemize
@item
find

@item
sub2ind

@item
ind2sub

@item
sort

@item
unique

@item
lookup

@item
ifelse / merge
@end itemize

@item
Repetition

@itemize
@item
repmat

@item
repelems
@end itemize

@item
Vectorized arithmetic

@itemize
@item
sum

@item
prod

@item
cumsum

@item
cumprod

@item
sumsq

@item
diff

@item
dot

@item
cummax

@item
cummin
@end itemize

@item
Shape of higher dimensional arrays

@itemize
@item
reshape

@item
resize

@item
permute

@item
squeeze

@item
deal
@end itemize

@end itemize
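
As a small illustration (a sketch, not drawn from the standard examples),
several of the functions above combine naturally.  The lengths of the
runs of consecutive equal elements of a vector can be computed with
@code{find} and @code{diff} alone, without an explicit loop:

@example
@group
v = [3 3 3 7 7 1 1 1 1];
starts = [1, find(diff (v) != 0) + 1];  # where each run begins
lens = diff ([starts, numel(v) + 1])    # length of each run
@result{} 3   2   4
@end group
@end example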

@node Broadcasting
@section Broadcasting
@cindex broadcast
@cindex broadcasting
@cindex @nospell{BSX}
@cindex recycling
@cindex SIMD

Broadcasting refers to how Octave binary operators and functions behave
when their matrix or array operands or arguments differ in size.  Since
version 3.6.0, Octave automatically broadcasts vectors, matrices,
and arrays when using elementwise binary operators and functions.
Broadly speaking, the smaller array is ``broadcast'' across the larger
one until the two have compatible shapes.  The rule is that corresponding
array dimensions must either

@enumerate
@item
be equal, or

@item
one of them must be 1.
@end enumerate

@noindent
If all dimensions are equal, no broadcasting occurs and ordinary
element-by-element arithmetic takes place.  If the number of dimensions
differs, missing trailing dimensions are treated as 1.  When one of the
dimensions is 1, the array with that singleton dimension gets copied
along that dimension until it matches the dimension of the other array.
For example, consider

@example
@group
x = [1 2 3;
     4 5 6;
     7 8 9];

y = [10 20 30];

x + y
@end group
@end example

@noindent
Without broadcasting, @code{x + y} would be an error because the dimensions
do not agree.  However, with broadcasting it is as if the following
operation were performed:

@example
@group
x = [1 2 3
     4 5 6
     7 8 9];

y = [10 20 30
     10 20 30
     10 20 30];

x + y
@result{}    11   22   33
      14   25   36
      17   28   39
@end group
@end example

@noindent
That is, the smaller array of size @code{[1 3]} gets copied along the
singleton dimension (the number of rows) until it is @code{[3 3]}.  No
actual copying takes place, however.  The internal implementation reuses
elements along the necessary dimension in order to achieve the desired
effect without copying in memory.

Both arrays can be broadcast across each other simultaneously; for
example, all pairwise differences of the elements of a vector with
itself:

@example
@group
y - y'
@result{}    0   10   20
    -10    0   10
    -20  -10    0
@end group
@end example

@noindent
Here the vectors of size @code{[1 3]} and @code{[3 1]} both get
broadcast into matrices of size @code{[3 3]} before ordinary matrix
subtraction takes place.

A special case of broadcasting that may be familiar is when all
dimensions of the array being broadcast are 1, i.e., the array is a
scalar.  Thus for example, operations like @code{x - 42} and @code{max
(x, 2)} are basic examples of broadcasting.

For a higher-dimensional example, suppose @code{img} is an RGB image of
size @code{[m n 3]} and we wish to multiply each color plane by a
different scalar.  The following code accomplishes this with
broadcasting:

@example
img .*= permute ([0.8, 0.9, 1.2], [1, 3, 2]);
@end example

@noindent
Note the use of @code{permute} to match the dimensions of the
@code{[0.8, 0.9, 1.2]} vector with @code{img}.

For functions that are not written with broadcasting semantics,
@code{bsxfun} can be useful for coercing them to broadcast.

@DOCSTRING(bsxfun)
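
For example, on Octave versions that predate automatic broadcasting (or
in code that must not rely on it), @code{bsxfun} with the built-in
@code{@@minus} handle can subtract the column means from a matrix:

@example
@group
A = [1 2; 3 4; 5 6];
centered = bsxfun (@@minus, A, mean (A))
@result{}   -2  -2
      0   0
      2   2
@end group
@end example

@noindent
With automatic broadcasting this is simply @code{A - mean (A)}.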

Broadcasting is applied only if one of the two broadcasting conditions
holds.  As usual, broadcasting does not apply when two corresponding
dimensions differ and neither is 1:

@example
@group
x = [1 2 3
     4 5 6];
y = [10 20
     30 40];
x + y
@end group
@end example

@noindent
This will produce an error about nonconformant arguments.

Besides common arithmetic operations, several functions of two arguments
also broadcast.  The full list of functions and operators that broadcast
is

@example
      plus      +  .+
      minus     -  .-
      times     .*
      rdivide   ./
      ldivide   .\
      power     .^  .**
      lt        <
      le        <=
      eq        ==
      gt        >
      ge        >=
      ne        !=  ~=
      and       &
      or        |
      atan2
      hypot
      max
      min
      mod
      rem
      xor

      +=  -=  .+=  .-=  .*=  ./=  .\=  .^=  .**=  &=  |=
@end example

Beware of resorting to broadcasting if a simpler operation will suffice.
For matrices @var{a} and @var{b}, consider the following:

@example
@var{c} = sum (permute (@var{a}, [1, 3, 2]) .* permute (@var{b}, [3, 2, 1]), 3);
@end example

@noindent
This operation broadcasts the two matrices with permuted dimensions
across each other during elementwise multiplication in order to obtain a
larger 3-D array, and this array is then summed along the third dimension.
A moment of thought will prove that this operation is simply the much
faster ordinary matrix multiplication, @code{@var{c} = @var{a}*@var{b};}.

A note on terminology: ``broadcasting'' is the term popularized by the
NumPy numerical environment in the Python programming language.  In other
programming languages and environments, broadcasting may also be known
as @emph{binary singleton expansion} (@nospell{BSX}, in @sc{matlab}, and the
origin of the name of the @code{bsxfun} function), @emph{recycling} (R
programming language), @emph{single-instruction multiple data} (SIMD),
or @emph{replication}.

@subsection Broadcasting and Legacy Code

The new broadcasting semantics almost never affect code that worked
in previous versions of Octave.  Consequently, code inherited from
@sc{matlab} that worked in previous versions of Octave should continue
to work without change.  The only exception is code such as

@example
@group
try
  c = a.*b;
catch
  c = a.*a;
end_try_catch
@end group
@end example

@noindent
that may have relied on matrices of different sizes producing an error.
Because such an operation is now valid Octave syntax, it will no longer
produce an error.  Instead, the following code should be used:

@example
@group
if (isequal (size (a), size (b)))
  c = a .* b;
else
  c = a .* a;
endif
@end group
@end example


@node Function Application
@section Function Application
@cindex map
@cindex apply
@cindex function application

As a general rule, functions should already be written with matrix
arguments in mind and should consider whole matrix operations in a
vectorized manner.  Sometimes, writing functions in this way appears
difficult or impossible for various reasons.  For those situations,
Octave provides facilities for applying a function to each element of an
array, cell, or struct.

@DOCSTRING(arrayfun)

@DOCSTRING(spfun)

@DOCSTRING(cellfun)

@DOCSTRING(structfun)

Consistent with earlier advice, seek to use Octave built-in functions whenever
possible for the best performance.  This advice applies especially to the four
functions above.  For example, when adding two arrays together
element-by-element one could use a handle to the built-in addition function
@code{@@plus} or define an anonymous function @code{@@(x,y) x + y}.  However,
the anonymous function is 60% slower than the handle to the built-in function.
@xref{Operator Overloading}, for a list of basic functions which might be used
in place of anonymous ones.
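
The difference is easy to measure with @code{tic} and @code{toc}; the
exact ratio will vary with the machine and Octave version:

@example
@group
x = rand (1, 1e5);  y = rand (1, 1e5);
tic; arrayfun (@@plus, x, y); toc         # handle to built-in plus
tic; arrayfun (@@(a,b) a + b, x, y); toc  # anonymous function, slower
@end group
@end example

@noindent
Of course, for this particular operation the vectorized expression
@code{x + y} is faster than either @code{arrayfun} call.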

@node Accumulation
@section Accumulation

Accumulation functions are useful whenever a computation can be phrased
as categorizing the elements of an array according to a set of indices.

@DOCSTRING(accumarray)

@DOCSTRING(accumdim)
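
For example, @code{accumarray} can total values by group in a single
call; here the subscript vector assigns each value to one of two groups:

@example
@group
subs = [1; 2; 1; 2];    # group index for each value
vals = [10; 20; 30; 40];
accumarray (subs, vals) # sum the values in each group
@result{}   40
      60
@end group
@end example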

@node JIT Compiler
@section JIT Compiler

Vectorization is the preferred technique for eliminating loops and speeding up
code.  Nevertheless, it is not always possible to replace every loop.  In such
situations it may be worth trying Octave's @strong{experimental} Just-In-Time
(JIT) compiler.

A JIT compiler works by analyzing the body of a loop, translating the Octave
statements into another language, compiling the new code segment into an
executable, and then running the executable and collecting any results.  The
process is not simple and there is a significant amount of work to perform for
each step.  It can still make sense, however, if the number of loop iterations
is large.  Because Octave is an interpreted language, every time through a
loop Octave must parse the statements in the loop body before executing them.
With a JIT compiler this is done just once when the body is translated to
another language.

The JIT compiler is a very new feature in Octave and not all valid Octave
statements can currently be accelerated.  However, if no other technique
is available it may be worth benchmarking the code with JIT enabled.  The
function @code{jit_enable} is used to turn compilation on or off.  The
function @code{jit_startcnt} sets the threshold for acceleration.  Loops
with iteration counts above @code{jit_startcnt} will be accelerated.  The
functions @code{jit_failcnt} and @code{debug_jit} are not likely to be of use
to anyone not working directly on the implementation of the JIT compiler.
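
For example, to benchmark a loop-heavy script with JIT acceleration one
might write (the threshold value here is arbitrary):

@example
@group
jit_enable (true);    # enable the experimental JIT compiler
jit_startcnt (1000);  # accelerate loops of at least 1000 iterations
@end group
@end example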

@DOCSTRING(jit_enable)

@DOCSTRING(jit_startcnt)

@DOCSTRING(jit_failcnt)

@DOCSTRING(debug_jit)

@node Miscellaneous Techniques
@section Miscellaneous Techniques
@cindex execution speed
@cindex speedups
@cindex optimization

Here are some other ways of improving the execution speed of Octave
programs.

@itemize @bullet

@item Avoid computing costly intermediate results multiple times.
Octave currently does not eliminate common subexpressions.  Also, certain
internal computation results are cached for variables.  For instance, if
a matrix variable is used multiple times as an index, checking the
indices (and internal conversion to integers) is only done once.

@item Be aware of lazy copies (copy-on-write).
@cindex copy-on-write
@cindex COW
@cindex memory management
When a copy of an object is created, the data is not immediately copied, but
rather shared.  The actual copying is postponed until the copied data needs to
be modified.  For example:

@example
@group
a = zeros (1000); # create a 1000x1000 matrix
b = a; # no copying done here
b(1) = 1; # copying done here
@end group
@end example

Lazy copying applies to whole Octave objects such as matrices, cells,
and structs, and also to individual cell or struct elements (not array
elements).

Additionally, index expressions also use lazy copying when Octave can
determine that the indexed portion is contiguous in memory.  For example:

@example
@group
a = zeros (1000); # create a 1000x1000 matrix
b = a(:,10:100);  # no copying done here
b = a(10:100,:);  # copying done here
@end group
@end example

This applies to arrays (matrices), cell arrays, and structs indexed
using @samp{()}.  Index expressions generating comma-separated lists can also
benefit from shallow copying in some cases.  In particular, when @var{a} is a
struct array, expressions like @code{@{a.x@}, @{a(:,2).x@}} will use lazy
copying, so that data can be shared between a struct array and a cell array.

Most index expressions do not live longer than their parent objects.
In rare cases, however, a lazily copied slice outlasts its parent, in
which case it becomes orphaned and occupies more memory than needed.
To remedy this in most real cases, Octave checks for orphaned lazy
slices in certain situations, namely when a value is stored into a
"permanent" location such as a named variable or a cell or struct
element, and economizes them if possible.  For example:

@example
@group
a = zeros (1000); # create a 1000x1000 matrix
b = a(:,10:100);  # lazy slice
a = []; # the original "a" array is still allocated
c@{1@} = b; # b is reallocated at this point
@end group
@end example

@item Avoid deep recursion.
Function calls to m-file functions carry a relatively significant overhead, so
rewriting a recursion as a loop often helps.  Also, note that the maximum level
of recursion is limited.
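
For instance, a recursive sum over a vector (shown purely for
illustration) incurs one m-file function call per element, while the
equivalent loop incurs none; the built-in @code{sum} is faster still:

@example
@group
## recursive version: deep recursion, one call per element
function s = rsum (x)
  if (isempty (x))
    s = 0;
  else
    s = x(1) + rsum (x(2:end));
  endif
endfunction

## loop version: no call overhead and no recursion limit
s = 0;
for i = 1:numel (x)
  s += x(i);
endfor
@end group
@end example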

@item Avoid resizing matrices unnecessarily.
When building a single result matrix from a series of calculations, set the
size of the result matrix first, then insert values into it.  Write

@example
@group
result = zeros (big_n, big_m)
for i = over:and_over
  ridx = @dots{}
  cidx = @dots{}
  result(ridx, cidx) = new_value ();
endfor
@end group
@end example

@noindent
instead of

@example
@group
result = [];
for i = ever:and_ever
  result = [ result, new_value() ];
endfor
@end group
@end example

Sometimes the number of items cannot be computed in advance, and
stack-like operations are needed.  When elements are repeatedly
inserted or removed from the end of an array, Octave detects this as
stack usage and attempts to use a smarter memory management strategy by
pre-allocating the array in bigger chunks.  This strategy is also applied
to cell and struct arrays.

@example
@group
a = [];
while (condition)
  @dots{}
  a(end+1) = value; # "push" operation
  @dots{}
  a(end) = []; # "pop" operation
  @dots{}
endwhile
@end group
@end example

@item Avoid calling @code{eval} or @code{feval} excessively.
Parsing input or looking up the name of a function in the symbol table are
relatively expensive operations.

If you are using @code{eval} merely as an exception handling mechanism, and not
because you need to execute some arbitrary text, use the @code{try}
statement instead.  @xref{The try Statement}.
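
For instance, instead of using the two-argument form of @code{eval} to
catch errors (the function @code{risky} here is hypothetical), write:

@example
@group
## instead of:  eval ("y = risky (x);", "y = 0;");
try
  y = risky (x);
catch
  y = 0;
end_try_catch
@end group
@end example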

@item Use @code{ignore_function_time_stamp} when appropriate.
If you are calling lots of functions, and none of them will need to change
during your run, set the variable @code{ignore_function_time_stamp} to
@qcode{"all"}.  This will stop Octave from checking the time stamp of a
function file to see if it has been updated while the program is being run.
@end itemize

@node Examples
@section Examples

The following are examples of vectorization questions asked by actual
users of Octave and their solutions.

@c FIXME: We need a lot more examples here.

@itemize @bullet
@item
For a vector @code{A}, the following loop

@example
@group
n = length (A) - 1;
B = zeros (n, 2);
for i = 1:n
  ## this will be two columns, the first is the difference and
  ## the second the mean of the two elements used for the diff.
  B(i,:) = [A(i+1)-A(i), (A(i+1) + A(i))/2];
endfor
@end group
@end example

@noindent
can be turned into the following one-liner:

@example
B = [diff(A)(:), 0.5*(A(1:end-1)+A(2:end))(:)]
@end example

Note the use of colon indexing to flatten an intermediate result into
a column vector.  This is a common vectorization trick.
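
@item
As a further simple illustration of boolean indexing, a loop that sums
only the positive elements of a vector @code{A},

@example
@group
s = 0;
for i = 1:length (A)
  if (A(i) > 0)
    s += A(i);
  endif
endfor
@end group
@end example

@noindent
reduces to a boolean index and a single call to @code{sum}:

@example
s = sum (A(A > 0));
@end example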

@end itemize