@c Copyright (C) 2012-2019 Jordi Gutiérrez Hermoso
@c
@c This file is part of Octave.
@c
@c Octave is free software: you can redistribute it and/or modify it
@c under the terms of the GNU General Public License as published by
@c the Free Software Foundation, either version 3 of the License, or
@c (at your option) any later version.
@c
@c Octave is distributed in the hope that it will be useful, but
@c WITHOUT ANY WARRANTY; without even the implied warranty of
@c MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
@c GNU General Public License for more details.
@c
@c You should have received a copy of the GNU General Public License
@c along with Octave; see the file COPYING.  If not, see
@c <https://www.gnu.org/licenses/>.

@node Vectorization and Faster Code Execution
@chapter Vectorization and Faster Code Execution
@cindex vectorization
@cindex vectorize

Vectorization is a programming technique that uses vector operations
instead of element-by-element loop-based operations.  Besides frequently
producing more succinct Octave code, vectorization also allows for better
optimization in the subsequent implementation.  The optimizations may occur
either in Octave's own Fortran, C, or C++ internal implementation, or even at a
lower level depending on the compiler and external numerical libraries used to
build Octave.  The ultimate goal is to make use of your hardware's vector
instructions if possible or to perform other optimizations in software.

Vectorization is not a concept unique to Octave, but it is particularly
important because Octave is a matrix-oriented language.  Vectorized
Octave code will see a dramatic speed up (10X--100X) in most cases.

This chapter discusses vectorization and other techniques for writing faster
code.

@menu
* Basic Vectorization::        Basic techniques for code optimization
* Broadcasting::               Broadcasting operations
* Function Application::       Applying functions to arrays, cells, and structs
* Accumulation::               Accumulation functions
* JIT Compiler::               Just-In-Time Compiler for loops
* Miscellaneous Techniques::   Other techniques for speeding up code
* Examples::
@end menu

@node Basic Vectorization
@section Basic Vectorization

To a very good first approximation, the goal in vectorization is to
write code that avoids loops and uses whole-array operations.  As a
trivial example, consider

@example
@group
for i = 1:n
  for j = 1:m
    c(i,j) = a(i,j) + b(i,j);
  endfor
endfor
@end group
@end example

@noindent
compared to the much simpler

@example
c = a + b;
@end example

@noindent
This isn't merely easier to write; it is also internally much easier to
optimize.  Octave delegates this operation to an underlying
implementation which, among other optimizations, may use special vector
hardware instructions or could conceivably even perform the additions in
parallel.  In general, if the code is vectorized, the underlying
implementation has more freedom about the assumptions it can make
in order to achieve faster execution.

This is especially important for loops with "cheap" bodies.  Often it
suffices to vectorize just the innermost loop to get acceptable
performance.  A general rule of thumb is that the "order" of the
vectorized body should be greater than or equal to the "order" of the
enclosing loop.

As a less trivial example, instead of

@example
@group
for i = 1:n-1
  a(i) = b(i+1) - b(i);
endfor
@end group
@end example

@noindent
write

@example
a = b(2:n) - b(1:n-1);
@end example

This shows an important general concept about using arrays for indexing
instead of looping over an index variable.  @xref{Index Expressions}.
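
The shifted-index technique above generalizes to other stencil-like
computations.  As a further illustration (a constructed example, not a
library function), a three-point moving average of a vector @code{b} can
be written without a loop by adding shifted slices of the vector:

@example
@group
n = numel (b);
## each output element averages b(i), b(i+1), and b(i+2)
avg3 = (b(1:n-2) + b(2:n-1) + b(3:n)) / 3;
@end group
@end example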
Also use boolean indexing generously.  If a condition needs to be tested,
this condition can also be written as a boolean index.  For instance,
instead of

@example
@group
for i = 1:n
  if (a(i) > 5)
    a(i) -= 20;
  endif
endfor
@end group
@end example

@noindent
write

@example
a(a>5) -= 20;
@end example

@noindent
which exploits the fact that @code{a > 5} produces a boolean index.

Use elementwise vector operators whenever possible to avoid looping
(operators like @code{.*} and @code{.^}).  @xref{Arithmetic Ops}.

Also exploit broadcasting in these elementwise operators both to avoid
looping and unnecessary intermediate memory allocations.
@xref{Broadcasting}.

Use built-in and library functions if possible.  Built-in and compiled
functions are very fast.  Even with an m-file library function, chances
are good that it is already optimized, or will be optimized more in a
future release.

For instance, even better than

@example
a = b(2:n) - b(1:n-1);
@end example

@noindent
is

@example
a = diff (b);
@end example

Most Octave functions are written with vector and array arguments in
mind.  If you find yourself writing a loop with a very simple operation,
chances are that such a function already exists.
The following functions
occur frequently in vectorized code:

@itemize @bullet
@item
Index manipulation

@itemize
@item
find

@item
sub2ind

@item
ind2sub

@item
sort

@item
unique

@item
lookup

@item
ifelse / merge
@end itemize

@item
Repetition

@itemize
@item
repmat

@item
repelems
@end itemize

@item
Vectorized arithmetic

@itemize
@item
sum

@item
prod

@item
cumsum

@item
cumprod

@item
sumsq

@item
diff

@item
dot

@item
cummax

@item
cummin
@end itemize

@item
Shape of higher dimensional arrays

@itemize
@item
reshape

@item
resize

@item
permute

@item
squeeze

@item
deal
@end itemize

@end itemize

@node Broadcasting
@section Broadcasting
@cindex broadcast
@cindex broadcasting
@cindex @nospell{BSX}
@cindex recycling
@cindex SIMD

Broadcasting refers to how Octave binary operators and functions behave
when their matrix or array operands or arguments differ in size.  Since
version 3.6.0, Octave automatically broadcasts vectors, matrices,
and arrays when using elementwise binary operators and functions.
Broadly speaking, smaller arrays are ``broadcast'' across the larger
one, until they have a compatible shape.  The rule is that corresponding
array dimensions must either

@enumerate
@item
be equal, or

@item
one of them must be 1.
@end enumerate

@noindent
In case all dimensions are equal, no broadcasting occurs and ordinary
element-by-element arithmetic takes place.  For arrays of higher
dimensions, if the number of dimensions isn't the same, then missing
trailing dimensions are treated as 1.
When one of the dimensions is 1,
the array with that singleton dimension gets copied along that dimension
until it matches the dimension of the other array.  For example, consider

@example
@group
x = [1 2 3;
     4 5 6;
     7 8 9];

y = [10 20 30];

x + y
@end group
@end example

@noindent
Without broadcasting, @code{x + y} would be an error because the dimensions
do not agree.  However, with broadcasting it is as if the following
operation were performed:

@example
@group
x = [1 2 3
     4 5 6
     7 8 9];

y = [10 20 30
     10 20 30
     10 20 30];

x + y
@result{} 11 22 33
   14 25 36
   17 28 39
@end group
@end example

@noindent
That is, the smaller array of size @code{[1 3]} gets copied along the
singleton dimension (the number of rows) until it is @code{[3 3]}.  No
actual copying takes place, however.  The internal implementation reuses
elements along the necessary dimension in order to achieve the desired
effect without copying in memory.

Both arrays can be broadcast across each other, for example, all
pairwise differences of the elements of a vector with itself:

@example
@group
y - y'
@result{}   0  10  20
   -10   0  10
   -20 -10   0
@end group
@end example

@noindent
Here the vectors of size @code{[1 3]} and @code{[3 1]} both get
broadcast into matrices of size @code{[3 3]} before ordinary matrix
subtraction takes place.

A special case of broadcasting that may be familiar is when all
dimensions of the array being broadcast are 1, i.e., the array is a
scalar.  Thus for example, operations like @code{x - 42} and
@code{max (x, 2)} are basic examples of broadcasting.

For a higher-dimensional example, suppose @code{img} is an RGB image of
size @code{[m n 3]} and we wish to multiply each color by a different
scalar.
The following code accomplishes this with broadcasting,

@example
img .*= permute ([0.8, 0.9, 1.2], [1, 3, 2]);
@end example

@noindent
Note the usage of @code{permute} to match the dimensions of the
@code{[0.8, 0.9, 1.2]} vector with @code{img}.

For functions that are not written with broadcasting semantics,
@code{bsxfun} can be useful for coercing them to broadcast.

@DOCSTRING(bsxfun)

Broadcasting is only applied if either of the two broadcasting
conditions holds.  As usual, however, broadcasting does not apply when two
dimensions differ and neither is 1:

@example
@group
x = [1 2 3
     4 5 6];
y = [10 20
     30 40];
x + y
@end group
@end example

@noindent
This will produce an error about nonconformant arguments.

Besides common arithmetic operations, several functions of two arguments
also broadcast.  The full list of functions and operators that broadcast
is

@example
      plus      +  .+
      minus     -  .-
      times     .*
      rdivide   ./
      ldivide   .\
      power     .^  .**
      lt        <
      le        <=
      eq        ==
      gt        >
      ge        >=
      ne        !=  ~=
      and       &
      or        |
      atan2
      hypot
      max
      min
      mod
      rem
      xor

      +=  -=  .+=  .-=  .*=  ./=  .\=  .^=  .**=  &=  |=
@end example

Beware of resorting to broadcasting if a simpler operation will suffice.
For matrices @var{a} and @var{b}, consider the following:

@example
@var{c} = sum (permute (@var{a}, [1, 3, 2]) .* permute (@var{b}, [3, 2, 1]), 3);
@end example

@noindent
This operation broadcasts the two matrices with permuted dimensions
across each other during elementwise multiplication in order to obtain a
larger 3-D array, and this array is then summed along the third dimension.
A moment of thought will prove that this operation is simply the much
faster ordinary matrix multiplication, @code{@var{c} = @var{a}*@var{b};}.
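
When an operation should broadcast explicitly, or when porting code that
must state its broadcasting intent, the same result can be obtained with
@code{bsxfun}.  For instance, the pairwise-difference example above can
be written equivalently as:

@example
@group
y = [10 20 30];
d1 = y - y';                    # automatic broadcasting
d2 = bsxfun (@@minus, y, y');    # explicit broadcasting
isequal (d1, d2)
@result{} 1
@end group
@end example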

A note on terminology: ``broadcasting'' is the term popularized by the
NumPy numerical environment in the Python programming language.  In other
programming languages and environments, broadcasting may also be known
as @emph{binary singleton expansion} (@nospell{BSX}, in @sc{matlab}, and the
origin of the name of the @code{bsxfun} function), @emph{recycling} (R
programming language), @emph{single-instruction multiple data} (SIMD),
or @emph{replication}.

@subsection Broadcasting and Legacy Code

The new broadcasting semantics almost never affect code that worked
in previous versions of Octave.  Consequently, all code inherited from
@sc{matlab} that worked in previous versions of Octave should still work
without change in Octave.  The only exception is code such as

@example
@group
try
  c = a.*b;
catch
  c = a.*a;
end_try_catch
@end group
@end example

@noindent
that may have relied on matrices of different size producing an error.
Because such an operation is now valid Octave syntax, this will no longer
produce an error.  Instead, the following code should be used:

@example
@group
if (isequal (size (a), size (b)))
  c = a .* b;
else
  c = a .* a;
endif
@end group
@end example


@node Function Application
@section Function Application
@cindex map
@cindex apply
@cindex function application

As a general rule, functions should already be written with matrix
arguments in mind and should consider whole matrix operations in a
vectorized manner.  Sometimes, writing functions in this way appears
difficult or impossible for various reasons.  For those situations,
Octave provides facilities for applying a function to each element of an
array, cell, or struct.

@DOCSTRING(arrayfun)

@DOCSTRING(spfun)

@DOCSTRING(cellfun)

@DOCSTRING(structfun)

Consistent with earlier advice, seek to use Octave built-in functions whenever
possible for the best performance.  This advice applies especially to the four
functions above.  For example, when adding two arrays together
element-by-element one could use a handle to the built-in addition function
@code{@@plus} or define an anonymous function @code{@@(x,y) x + y}.  But, the
anonymous function is 60% slower than the first method.
@xref{Operator Overloading}, for a list of basic functions which might be used
in place of anonymous ones.

@node Accumulation
@section Accumulation

Whenever it is possible to categorize the elements of an array by index
when performing a computation, accumulation functions can be useful.

@DOCSTRING(accumarray)

@DOCSTRING(accumdim)

@node JIT Compiler
@section JIT Compiler

Vectorization is the preferred technique for eliminating loops and speeding up
code.  Nevertheless, it is not always possible to replace every loop.  In such
situations it may be worth trying Octave's @strong{experimental} Just-In-Time
(JIT) compiler.

A JIT compiler works by analyzing the body of a loop, translating the Octave
statements into another language, compiling the new code segment into an
executable, and then running the executable and collecting any results.  The
process is not simple and there is a significant amount of work to perform for
each step.  It can still make sense, however, if the number of loop iterations
is large.  Because Octave is an interpreted language, every time through a
loop Octave must parse the statements in the loop body before executing them.
With a JIT compiler this is done just once when the body is translated to
another language.

The JIT compiler is a very new feature in Octave and not all valid Octave
statements can currently be accelerated.  However, if no other technique
is available it may be worth benchmarking the code with JIT enabled.  The
function @code{jit_enable} is used to turn compilation on or off.  The
function @code{jit_startcnt} sets the threshold for acceleration.  Loops
with iteration counts above @code{jit_startcnt} will be accelerated.  The
functions @code{jit_failcnt} and @code{debug_jit} are not likely to be of use
to anyone not working directly on the implementation of the JIT compiler.

@DOCSTRING(jit_enable)

@DOCSTRING(jit_startcnt)

@DOCSTRING(jit_failcnt)

@DOCSTRING(debug_jit)

@node Miscellaneous Techniques
@section Miscellaneous Techniques
@cindex execution speed
@cindex speedups
@cindex optimization

Here are some other ways of improving the execution speed of Octave
programs.

@itemize @bullet

@item Avoid computing costly intermediate results multiple times.
Octave currently does not eliminate common subexpressions.  Also, certain
internal computation results are cached for variables.  For instance, if
a matrix variable is used multiple times as an index, checking the
indices (and internal conversion to integers) is only done once.

@item Be aware of lazy copies (copy-on-write).
@cindex copy-on-write
@cindex COW
@cindex memory management
When a copy of an object is created, the data is not immediately copied, but
rather shared.  The actual copying is postponed until the copied data needs to
be modified.  For example:

@example
@group
a = zeros (1000); # create a 1000x1000 matrix
b = a;            # no copying done here
b(1) = 1;         # copying done here
@end group
@end example

Lazy copying applies to whole Octave objects such as matrices, cells,
and structs, and also to individual cell or struct elements (not array
elements).

Additionally, index expressions also use lazy copying when Octave can
determine that the indexed portion is contiguous in memory.  For example:

@example
@group
a = zeros (1000); # create a 1000x1000 matrix
b = a(:,10:100);  # no copying done here
b = a(10:100,:);  # copying done here
@end group
@end example

This applies to arrays (matrices), cell arrays, and structs indexed
using @samp{()}.  Index expressions generating comma-separated lists can also
benefit from shallow copying in some cases.  In particular, when @var{a} is a
struct array, expressions like @code{@{a.x@}, @{a(:,2).x@}} will use lazy
copying, so that data can be shared between a struct array and a cell array.

Most indexing expressions do not live longer than their parent
objects.  In rare cases, however, a lazily copied slice outlasts its
parent, in which case it becomes orphaned, still occupying more memory
than needed.  To provide a remedy that works in most real cases,
Octave checks for orphaned lazy slices in certain situations, when a
value is stored into a "permanent" location, such as a named variable or
a cell or struct element, and possibly economizes them.  For example:

@example
@group
a = zeros (1000); # create a 1000x1000 matrix
b = a(:,10:100);  # lazy slice
a = [];           # the original "a" array is still allocated
c@{1@} = b;         # b is reallocated at this point
@end group
@end example

@item Avoid deep recursion.
Function calls to m-file functions carry a relatively significant overhead, so
rewriting a recursion as a loop often helps.  Also, note that the maximum level
of recursion is limited.

@item Avoid resizing matrices unnecessarily.
When building a single result matrix from a series of calculations, set the
size of the result matrix first, then insert values into it.
Write

@example
@group
result = zeros (big_n, big_m);
for i = over:and_over
  ridx = @dots{}
  cidx = @dots{}
  result(ridx, cidx) = new_value ();
endfor
@end group
@end example

@noindent
instead of

@example
@group
result = [];
for i = ever:and_ever
  result = [ result, new_value() ];
endfor
@end group
@end example

Sometimes the number of items cannot be computed in advance, and
stack-like operations are needed.  When elements are being repeatedly
inserted or removed from the end of an array, Octave detects it as stack
usage and attempts to use a smarter memory management strategy by
pre-allocating the array in bigger chunks.  This strategy is also applied
to cell and struct arrays.

@example
@group
a = [];
while (condition)
  @dots{}
  a(end+1) = value; # "push" operation
  @dots{}
  a(end) = [];      # "pop" operation
  @dots{}
endwhile
@end group
@end example

@item Avoid calling @code{eval} or @code{feval} excessively.
Parsing input or looking up the name of a function in the symbol table are
relatively expensive operations.

If you are using @code{eval} merely as an exception handling mechanism, and not
because you need to execute some arbitrary text, use the @code{try}
statement instead.  @xref{The try Statement}.

@item Use @code{ignore_function_time_stamp} when appropriate.
If you are calling lots of functions, and none of them will need to change
during your run, set the variable @code{ignore_function_time_stamp} to
@qcode{"all"}.  This will stop Octave from checking the time stamp of a
function file to see if it has been updated while the program is being run.
@end itemize

@node Examples
@section Examples

The following are examples of vectorization questions asked by actual
users of Octave and their solutions.

@c FIXME: We need a lot more examples here.

@itemize @bullet
@item
For a vector @code{A}, the following loop

@example
@group
n = length (A) - 1;
B = zeros (n, 2);
for i = 1:n
  ## this will be two columns, the first is the difference and
  ## the second the mean of the two elements used for the diff.
  B(i,:) = [A(i+1)-A(i), (A(i+1) + A(i))/2];
endfor
@end group
@end example

@noindent
can be turned into the following one-liner:

@example
B = [diff(A)(:), 0.5*(A(1:end-1)+A(2:end))(:)]
@end example

Note the usage of colon indexing to flatten an intermediate result into
a column vector.  This is a common vectorization trick.

@end itemize
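
As one further constructed illustration (not drawn from a user question),
a counting loop over a vector @code{idx} of positive integer labels, such
as

@example
@group
counts = zeros (max (idx), 1);
for i = 1:numel (idx)
  counts(idx(i)) += 1;
endfor
@end group
@end example

@noindent
can be replaced by a single call to the accumulation function described
earlier:

@example
counts = accumarray (idx(:), 1);
@end example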