# How to update the PCRE version used by Erlang

## The basic changes to the PCRE library

To work with the Erlang VM, PCRE has been changed in three important ways:

1. The main execution machine in pcre\_exec has been modified so that
matching can be interrupted and restarted. This functionality utilizes
the code that implements recursion by allocating explicit
"stack-frames" in heap space, which basically means that all local
variables in the loop are part of a struct which is kept in "malloced"
memory on the heap and there are no real stack variables that need to
be pushed on the C stack in the case of recursive calls. This is a
technique we also use inside the VM to avoid building large C
stacks. In PCRE this is enabled by the NO\_RECURSE define, so that is a
prerequisite for the ERLANG\_INTEGRATION define which also adds labels
at restart points and counts "reductions".

2. All visible symbols in PCRE get the erts\_ prefix, so that NIFs
and such using a "real" pcre library do not get confused (or 're'
gets confused when a "real" pcre library gets loaded into the VM
process).

3. All irrelevant functionality has been stripped from the library,
for example UTF-16 support, JIT and DFA execution. Basically the
source files handling this are removed, together with any build
support from the PCRE project. We have our own makefiles etc.

## Setting up an environment for the work

I work with four temporary directories when doing this (the examples
are from the updating of pcre-7.6 to pcre-8.33):

    ~/tmp/pcre> ls
    epcre-7.6 epcre-8.33 pcre-7.6 pcre-8.33

I've unpacked the plain pcre sources in pcre-* and will work with our
patched sources in the epcre-* directories.

Make sure your ERL_TOP contains a *built* version of Erlang (and that
you have made a branch).

First unpack the pcre libraries (which will create the pcre-*
directories) and then copy our code to the old epcre directory:

    ~/tmp/pcre> tar jxf $ERL_TOP/erts/emulator/pcre/pcre-7.6.tar.bz2
    ~/tmp/pcre> tar jxf ~/Downloads/pcre-8.33.tar.bz2
    ~/tmp/pcre> mkdir epcre-7.6 epcre-8.33
    ~/tmp/pcre> cd epcre-7.6/
    ~/tmp/pcre/epcre-7.6> cp -r $ERL_TOP/erts/emulator/pcre/* .
    ~/tmp/pcre/epcre-7.6> rm pcre-7.6.tar.bz2

Leave the obj directory, you may need the libepcre.a file...

If you find it easier, you can revert the commit in Git that adds the
erts\_ prefix to the previous version before continuing work, but as
that is a quite small diff in newer versions of PCRE, it is probably
not worth it. Still, you will find the erts\_ prefix being a separate
commit when integrating 8.33, so if you're nice, you will do the same
for the person coming after you...

## Generating a diff for our changes to PCRE

Before you generate a diff (that, in an ideal world, would be used to
automatically patch the newer version of pcre, which will probably
only work for minor PCRE updates), you need to configure the old pcre.

    ~/tmp/pcre/epcre-7.6> cd ../pcre-7.6
    ~/tmp/pcre/pcre-7.6> ./configure --enable-utf8 --enable-unicode-properties --disable-shared --disable-stack-for-recursion

Note that for newer versions, the configure flag '--enable-utf8'
should be replaced with '--enable-utf'.

So we now generate a diff:

    ~/tmp/pcre/pcre-7.6> cd ../epcre-7.6
    ~/tmp/pcre/epcre-7.6> (for x in *.[ch]; do if [ -f ../pcre-7.6/$x ]; then diff -c ../pcre-7.6/$x $x; fi; done ) > ../epcre-7.6_clean.diff

### What the diff means

Let's now walk through the relevant parts of the diff.
Some of the
differences might come from patches that probably are already in the
new version. For example, in our 7.6 we had a security patch which
added the define WORK\_SIZE\_CHECK and used it in some places. Those can
probably safely be ignored, but to be on the safe side, check what's
already integrated in the new version.

The interesting part is in pcre\_exec.c. You will see things like

    #ifdef ERLANG_INTEGRATION
    ...
    #endif

or

    #if defined(ERLANG_INTEGRATION)
    ...
    #endif

and a lot of

    COST_CHK(1);

or

    COST(min);

and

    /* LOOP_COUNT: Ok */
    /* LOOP_COUNT: CHK */
    /* LOOP_COUNT: COST */

scattered over the main loop. Those mean the following:

* COST(int x) - consume reductions proportional to the integer
  parameter, but no need for interruption here (it's like
  bump\_reductions without trapping). The loop they apply to also has a
  'LOOP\_COUNT: COST' comment at its head.

* COST\_CHK(int x) - like COST(x), but also check that the reduction
  counter does not reach zero. If it does, leave the execution loop to
  be restarted at a later point. No real stack variables can be live
  here. Note that variables like 'max' and 'min' are *not* real stack
  variables, the NO\_RECURSE setting has taken care of that. 'i' is a
  stack variable that's explicitly saved when trapping, so that will
  also be correct when returning from a trap. So will 'c', 'rrc' and
  flags like 'utf8', 'minimize' and 'possessive'. Those can also be
  regarded as "non C-stack variables". The loop where they reside also
  has a 'LOOP\_COUNT: CHK' comment.

* /* LOOP\_COUNT: Ok */ - means that I have checked the loop and it
  only runs a deterministic set of iterations regardless of input, or
  it has a call to RRECURSE in its body, so we need not add more cost
  than the normal reduction counting for each instruction demands.
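
To make the difference between COST and COST\_CHK concrete, here is a
minimal, self-contained sketch of the idea in C. It is not the code
from pcre\_exec.c: the struct `sketch_frame`, the function `sum_bytes`,
the explicit restart id passed to the macro and the label name
`L_RESTART_1` are all made up for the illustration, and the real
macros differ in detail. It only assumes what is described above: a
counter, a limit, locals saved in a heap struct and a label to resume
at.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the heap-allocated frame (not PCRE's heapframe). */
typedef struct {
    int where;          /* 0 = fresh start, otherwise id of the restart point */
    size_t i;           /* loop index, saved when trapping */
    unsigned long sum;  /* partial result, saved when trapping */
} sketch_frame;

enum { SKETCH_DONE = 0, SKETCH_TRAP = 1, SKETCH_ERROR = -1 };

/* COST only counts work. COST_CHK counts and, when the budget is used up,
   records where to resume and leaves the function. The label after the
   check is where execution continues on restart. */
#define COST(n) (loop_count += (n))
#define COST_CHK(n, id)                 \
    do {                                \
        COST(n);                        \
        if (loop_count > loop_limit) {  \
            f->where = (id);            \
            goto loop_count_break;      \
        }                               \
    L_RESTART_##id: ;                   \
    } while (0)

int sum_bytes(sketch_frame *f, const unsigned char *buf, size_t len,
              long loop_limit)
{
    long loop_count = 0;
    size_t i = 0;
    unsigned long sum = 0;

    assert(f != NULL && buf != NULL);

    if (f->where != 0) {
        /* Restarting: restore the saved locals, then jump to the label. */
        i = f->i;
        sum = f->sum;
        switch (f->where) {
        case 1: goto L_RESTART_1;
        default: return SKETCH_ERROR;
        }
    }

    for (i = 0; i < len; i++) {  /* LOOP_COUNT: CHK */
        COST_CHK(1, 1);
        sum += buf[i];
    }
    f->where = 0;
    f->sum = sum;
    return SKETCH_DONE;

loop_count_break:
    /* Trapping: save the locals that are not already in the frame. */
    f->i = i;
    f->sum = sum;
    return SKETCH_TRAP;
}
```

Calling `sum_bytes` again and again with the same frame, until it no
longer returns `SKETCH_TRAP`, gives the same sum as an uninterrupted
loop. The point of the sketch is that any local that is live across
the check ('i' and 'sum' here) must be saved and restored, which is
exactly what you need to verify for every COST\_CHK you add.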

The thing is that each loop in the function 'match' should be marked
with one of these comments. If no comment is present after you have
patched the new release (if you successfully manage to do it
automatically), the loop may belong to a new regexp instruction that
has been added since the last release.

You will need to manually go through the main 'match' loop after
upgrading to verify that there are no unhandled loops in the regexp
machine loop (!).

The COST\_CHK macro works like this:

1. Add to the loop count.
2. If loop count > limit:
    1. Store the line (+100) in the Xwhere member of the frame structure
    2. Goto LOOP\_COUNT\_BREAK, which ultimately returns from the function
3. Insert a label, which is named L\_LOOP\_COUNT\_\<line number>

The LOOP\_COUNT\_BREAK code will create an extra "stack frame" on the
heap allocated stack used if NO\_RECURSE is set, and will store the few
locals that are not already in the ordinary stack frame there (like
'c' and 'i').

When we continue execution (after a trap up to the main Erlang
scheduler), we will jump to LOOP\_COUNT\_RETURN, which will restore
the local variables and will jump to the labels.
The jump code looks
like this in the C source:

    switch (frame->Xwhere)
      {
    #include "pcre_exec_loop_break_cases.inc"
      default:
        DPRINTF(("jump error in pcre match: label %d non-existent\n", frame->Xwhere));
        return PCRE_ERROR_INTERNAL;
      }

pcre\_exec\_loop\_break\_cases.inc is generated during the build by
pcre.mk and will look like:

    case 791: goto L_LOOP_COUNT_691;
    case 1892: goto L_LOOP_COUNT_1792;
    case 1999: goto L_LOOP_COUNT_1899;

etc

So, simply put, all C-stack variables are saved when we have consumed
our reductions, we return from the function and, as there is no real
recursion, we immediately fall out into the re:run BIF, which with the
help of a magic binary keeps track of the heap allocated stack for the
regexp machine. When we return from trapping out to the scheduler, all
vital data is restored and we continue from exactly the same state as
we left. What's needed is to patch this into the new pcre\_exec and
check all new instructions to determine what might need updating in
terms of COST, COST\_CHK etc.

Well, that's *almost* everything, because there is of course more...

The actual interface function, 'pcre\_exec', needs the same treatment
as the actual regexp machine loop, that is we need to store all local
variables between restarts. Unfortunately the NO\_RECURSE setting does
not do this, we need to do it ourselves. So there's quite a diff in
that function too, where a big struct is declared, containing every
local variable in that function, together with either local copies
that are swapped in and out, or macros that directly access the heap
allocated struct. The struct is called `PcreExecContext`.

If a context is present, we are restarting and therefore restore
everything.
If we are restarting we can also skip all initialization
code in the function and jump more or less directly to the
RESTART\_INTERRUPTED label and the call to 'match', which is the actual
regexp machine loop.

There are a few places in pcre\_exec where we need to do some
housekeeping, you will see code like:

    if ((extra_data->flags & PCRE_EXTRA_LOOP_LIMIT) != 0)
      {
      *extra_data->loop_counter_return =
        (extra_data->loop_limit - md->loop_limit);
      }

Make sure, after updating, that this housekeeping is done whenever we
do not reach the call to 'match'.

So, now we in theory know what to do, so let's do it:

But...

## File changes in the new version of PCRE

First we need to go through what's changed in the new library
version. Files may have new names, functions may have moved and so on.

Start by building the new library:

    ~/tmp/pcre> cd pcre-8.33/
    ~/tmp/pcre/pcre-8.33> ./configure --enable-utf --enable-unicode-properties --disable-shared --disable-stack-for-recursion
    ~/tmp/pcre/pcre-8.33> make

In the make process, you will probably notice most files that are
used, but you can bet that's not all...

To begin with you will need a default table for Latin-1 characters, so:

    ~/tmp/pcre/pcre-8.33> cc -DHAVE_CONFIG_H -o dftables dftables.c
    ~/tmp/pcre/pcre-8.33> LANG=sv_SE ./dftables -L ../epcre-8.33/pcre_latin_1_table.c

Compare it to the pcre\_latin\_1\_table.c in the old version, they
should not differ in any significant way. If they do, it might be
that you do not have the `sv_SE` locale installed on your machine.

You can test whether it's installed with `locale -a | grep sv_SE$`, and
install it with `sudo locale-gen sv_SE && sudo update-locale` if needed.
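
If you want to see why a missing locale changes the generated table,
the following small C sketch may help. It assumes that the table
generator classifies characters with the standard `<ctype.h>`
functions under the selected locale (check dftables.c and
pcre\_maketables.c in your version to confirm). The function
`count_high_alpha` is a hypothetical helper written for this note, it
is not part of PCRE.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>

/* Returns how many bytes in the range 128..255 are alphabetic under the
   named locale, or -1 if that locale cannot be selected on this machine.
   The LC_CTYPE category is reset to "C" before returning. */
int count_high_alpha(const char *locale_name)
{
    int c;
    int count = 0;

    assert(locale_name != NULL);

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;

    for (c = 128; c < 256; c++) {
        if (isalpha(c))
            count++;
    }

    setlocale(LC_CTYPE, "C");
    return count;
}
```

In the plain "C" locale no byte above 127 is alphabetic, so a table
generated without a working Latin-1 locale will lack all the accented
letters. With a working `sv_SE` Latin-1 locale the count should be
well above zero, and a result of -1 for "sv\_SE" suggests the locale is
not installed (note that some C libraries accept any locale name, so
treat the -1 check as a hint only).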

A good starting point is then to try to find all files in the new
version of the library that have (probably) the same names as the
ones in our distribution:

    ~/tmp/pcre/pcre-8.33> cd ../epcre-7.6/
    ~/tmp/pcre/epcre-7.6> for x in *.[ch]; do if [ '!' -f ../pcre-8.33/$x ]; then echo $x; else cp ../pcre-8.33/$x ../epcre-8.33/; fi; done

This will output a list of files not found in the new distro. Let's
look at the list from the example upgrade:

    local_config.h
    make_latin1_table.c
    pcre_info.c
    pcre_latin_1_table.c
    pcre_make_latin1_default.c
    pcre_try_flipped.c
    pcre_ucp_searchfuncs.c
    ucpinternal.h
    ucptable.h

* local\_config.h - OK, that's our child, it contains PCRE-specific
  configure results (i.e. the #defines that result from our
  parameters to configure, like NO\_RECURSE etc). Just copy it and
  edit it according to what specific settings you can find in the
  generated config.h from the real library build. In our example case,
  the #define SUPPORT\_UTF8 should be renamed to #define SUPPORT\_UTF
  and #define VERSION "7.6" should be changed to #define VERSION
  "8.33"...

* make\_latin1\_table.c - it was renamed to dftables.c, so we copy
  that instead.

* pcre\_info.c - It was simply removed from the library. Good, because
  it was useless... So just ignore.

* pcre\_latin\_1\_table.c - No problem, we generated a new one in the
  earlier stage.

* pcre\_make\_latin1\_default.c - No longer used, a hack that's not
  needed with dftables. Ignored.

* pcre\_try\_flipped.c - This functionality has been removed from
  pcre\_exec, you cannot compile on one endianness and execute on
  another any more :( Ignored.

* pcre\_ucp\_searchfuncs.c, ucpinternal.h, ucptable.h - this
  functionality is moved to pcre\_ucd.c, copy that one instead.

OK, now go the other way and look at what was actually built for the
new version of pcre:

    ~/tmp/pcre/epcre-7.6> cd ../pcre-8.33/
    ~/tmp/pcre/pcre-8.33> nm ./.libs/libpcre.a | egrep 'lib.*.o:'

The output for this release was:

    libpcre_la-pcre_byte_order.o:
    libpcre_la-pcre_compile.o:
    libpcre_la-pcre_config.o:
    libpcre_la-pcre_dfa_exec.o:
    libpcre_la-pcre_exec.o:
    libpcre_la-pcre_fullinfo.o:
    libpcre_la-pcre_get.o:
    libpcre_la-pcre_globals.o:
    libpcre_la-pcre_jit_compile.o:
    libpcre_la-pcre_maketables.o:
    libpcre_la-pcre_newline.o:
    libpcre_la-pcre_ord2utf8.o:
    libpcre_la-pcre_refcount.o:
    libpcre_la-pcre_string_utils.o:
    libpcre_la-pcre_study.o:
    libpcre_la-pcre_tables.o:
    libpcre_la-pcre_ucd.o:
    libpcre_la-pcre_valid_utf8.o:
    libpcre_la-pcre_version.o:
    libpcre_la-pcre_xclass.o:
    libpcre_la-pcre_chartables.o:

Libtool has changed the object names, but we can fix that and see what
sources we have already decided should exist:

    ~/tmp/pcre/pcre-8.33> NAMES=`nm ./.libs/libpcre.a | egrep 'lib.*.o:'| sed 's,libpcre_la-,,' | sed 's,.o:$,,'`
    ~/tmp/pcre/pcre-8.33> for x in $NAMES; do if [ '!' -f ../epcre-8.33/$x.c ]; then echo $x; fi; done

And the list contained:

    pcre_byte_order
    pcre_jit_compile
    pcre_string_utils

pcre\_jit\_compile is actually needed, even though we have not enabled
jit, and the other two contain functionality needed, so just copy the
sources...

    ~/tmp/pcre/pcre-8.33> for x in $NAMES; do if [ '!' -f ../epcre-8.33/$x.c ]; then cp $x.c ../epcre-8.33/; fi; done

## Test build of stripped down version of new PCRE

Time to do a test build. Copy and edit the pcre.mk makefile and try to
get something that builds...

I made a wrapper Makefile, hacked pcre.mk a little and made a few
changes to a few files, namely added:

    #ifdef ERLANG_INTEGRATION
    #include "local_config.h"
    #endif

to pcre\_config.c and pcre\_internal.h. Also pcre.mk needs to get the
new files added and the old files removed, directory names need to be
changed and the wrapper can define most of the variables. My wrapper
Makefile looked like this:

    EPCRE_LIB = ./obj/libepcre.a
    PCRE_GENINC = ./pcre_exec_loop_break_cases.inc
    PCRE_OBJDIR = ./obj
    V_AR = ar
    V_CC = gcc
    CFLAGS = -g -O2 -DHAVE_CONFIG_H -I/ldisk/pan/git/otp/erts/x86_64-unknown-linux-gnu
    gen_verbose =
    PCRE_DIR=.
    include pcre.mk

The corresponding variables were removed, together with the
dependencies, from pcre.mk. Note that you will need to put things back
in order in pcre.mk after all testing is done. Once a 'make' is
successful, you can generate new dependencies:

    ~/tmp/pcre/epcre-8.33> gcc -MM -c -g -O2 -DHAVE_CONFIG_H -I/ldisk/pan/git/otp/erts/x86_64-unknown-linux-gnu -DERLANG_INTEGRATION *.c | grep -v $ERL_TOP

Well, then you have to add $(PCRE\_OBJDIR)/ to each object and
$(PCRE\_DIR)/ to each header. I did it manually, it's just a couple of
files. Now your pcre.mk is fairly up to date and it's time to start
patching in the changes...

## Actually patching in the changes to the C code

### Fixing the functionality (interruptible pcre\_run etc)

Begin with only pcre\_exec.c, that's the important part:

    ~/tmp/pcre/epcre-8.33> cd ../epcre-7.6/
    ~/tmp/pcre/epcre-7.6> diff -c ../pcre-7.6/pcre_exec.c ./pcre_exec.c > ../epcre_exec.c_7.6.diff
    ~/tmp/pcre/epcre-7.6> cd ../epcre-8.33

Now - if you are lucky, you can patch the new pcre\_exec with the
patch command from the diff, but that may not be the case...
Even if:

    ~/tmp/pcre/epcre-8.33> patch -p0 < ../epcre_exec.c_7.6.diff

works like a charm, you still have to go through the main loop and see
that all do, while and for loops in the code contain COST\_CHK or at
least COST, or, if it's a small loop (over, say, one UTF character),
mark it as OK with a comment.

You should also check for other changes, like new local variables in
the pcre\_exec code etc.

What will probably happen is that the majority of hunks
fail. pcre\_exec is the main file for PCRE, one that is constantly
optimized and where every new feature ends up. You will probably see
so many failed hunks that you feel like giving up, but do not
despair, it's just a matter of patience and hard work:

* First, fix the 'pcre\_exec' function.

    * Change the struct PcreExecContext to reflect the local variables
      in this version of the code.

    * Add/update the defines that make local variables in the code
      actually stay in an allocated "exec\_context" and be sure to
      initialize the "pseudo-stack-variables" in the same way as in
      the declarations for the original version of the code.

    * The macros SWAPIN and SWAPOUT should be for variables that are
      used a lot and that we do not want to always access through the
      struct. Also a few parameters are saved by SWAPIN and SWAPOUT.

    * What might be tricky is to get things deallocated in a proper
      way. There is a function that's called from the BIF code to
      clean up an exec\_context, be especially observant about how the
      stack in the 'match' function is allocated! The first frame is
      supposed to be on the C stack, but in our case is allocated in
      the exec\_context. The rest of the frames are allocated but
      never freed, not until the match is done.

      The variable 'frame' in the 'match' function is stored in our
      additional field of the 'md' structure, that is the stack top,
      but not necessarily the uppermost frame (due to reuse of old
      frames, which is supposed to be an optimization...).

    * The housekeeping of the "reduction counter" in the extra\_data
      struct needs to be added to all places where we break out of the
      main loop of pcre\_exec. Look for 'break' and you will see the
      places. Make sure to update
      '\*extra\_data->loop\_counter\_return' whenever you leave this
      function. It all boils down to some code that loops over the
      call for match and returns PCRE\_ERROR\_LOOP\_LIMIT and gets
      jumped back to when the BIF is restarted. You will see it in
      your diff and you will find a similar place in the new version
      where you put basically the same code.

    * Fixing pcre\_exec takes about an hour of concentrated work, it
      could be worse...

* Next, go for the match function. It's simpler in some ways but
  harder in others. The elimination of the C stack is already there,
  you just need to modify it a little:

    * In the RRETURN macro for NO\_RECURSE, add updating of
      md->loop\_limit before returning. You can see how it's done in
      the diff.

    * RMATCH can be left as it is, at least it could in earlier
      versions. Note however that you should mimic the allocation
      strategies of RMATCH and RRETURN in the code at another place
      later... The principle of the labels HEAP\_RECURSE and
      HEAP\_RETURN is mimicked by our code in LOOP\_COUNT\_BREAK and
      LOOP\_COUNT\_RETURN. You'll see later...

    * COST and COST\_CHK, together with the jump to the
      LOOP\_COUNT\_RETURN label, are in the beginning of the function
      'match'. It's a block of macros and the declaration of our local
      variables loop\_count and loop\_limit. We patch in the code for
      that, but may need to adapt it to new variable names etc.
      It's important to handle the 'frames' variable correctly, dig it
      out of the 'md' struct when we are restarting, but initialize it
      as is done in normal NO\_RECURSE code otherwise. Note that the
      COST\_CHK macro reuses the Xwhere field of the frame struct, it
      is not needed when trapping.

    * The LOOP\_COUNT\_BREAK and the LOOP\_COUNT\_RETURN code can now
      be added. Make sure to check both how a new stack frame should be
      properly allocated by mimicking the code in RMATCH, and how (if)
      it should be freed by mimicking RRETURN. Also check which
      variables need to be saved. They are properly pointed out in
      8.33 with the comment 'These variables do not need to be
      preserved over recursion' and appear in the beginning of the
      function. Find variables of similar type in the frame structure
      and reuse them. In 8.33 there are eight such variables. The
      LOOP\_COUNT\_BREAK and LOOP\_COUNT\_RETURN code is placed at the
      end of the function 'match'. If you are reading the diff, you
      need to scroll past all the COST\_CHK calls, i.e. past the whole
      regexp machine loop.

    * Now take the time to add things like debug macros to the top of
      the file and one single COST\_CHK (preferably the one right
      after for(;;) in 'match'), and see if you can compile. You will
      probably need to add some fields in the structures in pcre.h,
      see from a larger diff what you need there and iterate until you
      can compile.

    * So, what's left is to add all the COST and COST\_CHK macros,
      plus marking all harmless loops as OK. There are a few rules
      here:

        * Mark *every* loop with the comment 'LOOP\_COUNT: xxxx',
          where xxxx is either 'Ok', 'COST' or 'CHK'. There are 175
          'LOOP\_COUNT:' comments in 8.33.

        * Loops marked 'Ok' need no macro, either because they are so
          short (like over a UTF character) or because they contain
          an RMATCH macro, in which case they will be accounted for
          anyway.

        * Loops marked 'COST' will have an associated 'COST(N)' macro,
          either before, if we know the number of iterations, or
          within. Reductions are counted, but we will not
          interrupt. This is typically in what is expected to be
          medium long loops or at places where interruption is hard
          (like where we have local variables that are alive). The
          selection between 'COST' and 'COST\_CHK' is hard. 'COST' is
          much cheaper and usually enough, but when in doubt about the
          loop length, try to use 'COST\_CHK', while making very sure
          there are no live block-local variables that need to be
          saved over the trap. There are 49 'COST' macros in 8.33.

        * Loops marked 'CHK' shall contain a 'COST\_CHK(N)'
          macro. This macro both counts reductions and may result in
          an interrupt and a return to Erlang space. It is expensive
          and it is vital to ensure that there are no unexpected local
          variables that live past the macro. Most variables are in
          the pseudo stack frame, but some regexp instructions declare
          temporaries inside blocks. Make sure they are not expected
          to be alive after a COST\_CHK if they are not in the
          'heapframe' structure. If they are, you need to move them to
          the 'heapframe' structure, conditionally under '#if
          defined(ERLANG\_INTEGRATION)'. In 8.33 the variables 'lgb'
          and 'rgb' are preserved in this way. There are 54
          'COST\_CHK's in 8.33.

        * I've marked a few block-local variables with warnings, but
          look thoroughly through the main loop to detect any new
          ones.

        * Be careful when it comes to freeing the context from Erlang
          (the function erts\_pcre\_free\_restart\_data). Whatever is
          done there has to work *both* when the context is freed in
          the middle of an operation (because of trapping) and when
          some things have been freed by a successful return.
          Specifically, make sure to set md->offset\_vector to
          NULL whenever it's freed (in the rest of the code) and
          construct release\_match\_heapframes so that it can be
          called multiple times for the same heapframe (set the next
          pointer in the "static" frame, i.e. the one allocated in the
          md, to NULL after freeing).

    * Adding the costs to the main loop takes less than one work day,
      keep calm and continue...

OK, now you are done with pcre\_exec (or at least, you think
so). The rest is simpler. You have probably already handled 'pcre.h'
and 'pcre\_internal.h' to add fields to the structures etc. Looking at
a diff from an earlier version, you will see what's left. In upgrading
to 8.33, the following things were left to do after pcre\_exec was
fixed. Remember that you can generate a diff with:

    ~/tmp/pcre/epcre-8.33> cd ../epcre-7.6/
    ~/tmp/pcre/epcre-7.6> (for x in *.[ch]; do if [ -f ../pcre-7.6/$x ]; then diff -c ../pcre-7.6/$x $x; fi; done) > ../epcre-7.6.diff

Open the diff in your favorite editor and remove whatever changes you
have already made, like everything that has to do with pcre\_exec.c
and probably a large part of pcre.h/pcre\_internal.h.

The expected result is a diff that either contains only the
'%ExternalCopyright%' comments or contains them and the addition of
the erts\_ prefix, depending on whether you reverted the prefix change
(using 'git revert') before starting to work. With a little luck, the
patch of the remaining stuff should be possible to apply
automatically. If anything fails, just add it manually.

### Fixing the erts\_ prefix

The erts\_ prefix is mostly implemented by adding '#if
defined(ERLANG\_INTEGRATION)' to a lot of function headers, inside the
COMPILE\_PCRE8 part. You then also change the PRIV and PUBL macros
in pcre\_internal.h.
Typical diffs look like:

      #if defined COMPILE_PCRE8
    + #if defined(ERLANG_INTEGRATION)
    + #ifndef PUBL
    + #define PUBL(name) erts_pcre_##name
    + #endif
    + #ifndef PRIV
    + #define PRIV(name) _erts_pcre_##name
    + #endif
    + #else
      #ifndef PUBL
      #define PUBL(name) pcre_##name
      #endif
      #ifndef PRIV
      #define PRIV(name) _pcre_##name
      #endif
    + #endif

and

      #if defined COMPILE_PCRE8
    + #if defined(ERLANG_INTEGRATION)
    + PCRE_EXP_DECL int erts_pcre_pattern_to_host_byte_order(pcre *argument_re,
    +   erts_pcre_extra *extra_data, const unsigned char *tables)
    + #else
      PCRE_EXP_DECL int pcre_pattern_to_host_byte_order(pcre *argument_re,
        pcre_extra *extra_data, const unsigned char *tables)
    + #endif

Note that some data types, like pcre\_extra, are accessed with the PUBL
macro, so they need to explicitly get the prefix added. pcre.h is a
pig, as it declares prototypes for all functions regardless of
compilation mode, so there is quite a lot of '#if
defined(ERLANG\_INTEGRATION)' to add there.

Anyway, now try to patch, using a diff where you have removed the
changes you made manually (probably to pcre\_exec.c), but make sure to
save your work (temporary git repository?) before, so you can revert
any disasters...

    ~/tmp/pcre/epcre-7.6> cd ../epcre-8.33/
    ~/tmp/pcre/epcre-8.33> patch -p0 < ../epcre-7.6_clean2.diff

Some hunks may certainly still fail, read through the .rej file and
fix them.
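
If the effect of the PUBL and PRIV macros in this section is unclear,
the following stand-alone C sketch shows the token pasting at work. It
only copies the shape of the macros from the diff above. The function
`demo_version_major`, the helpers `publ_symbol_name` and
`priv_symbol_name` and the `STR` macros are hypothetical and exist
only for this illustration, they are not part of PCRE or of our patch.

```c
#include <assert.h>
#include <string.h>

/* Defined here only so that the sketch builds the prefixed variant. */
#define ERLANG_INTEGRATION 1

#if defined(ERLANG_INTEGRATION)
#define PUBL(name) erts_pcre_##name
#define PRIV(name) _erts_pcre_##name
#else
#define PUBL(name) pcre_##name
#define PRIV(name) _pcre_##name
#endif

/* Two-step stringification, so that PUBL/PRIV are expanded before quoting. */
#define STR_(x) #x
#define STR(x) STR_(x)

/* Hypothetical public function, declared through PUBL like the real ones. */
int PUBL(demo_version_major)(void)
{
    return 8;
}

/* The symbol name the compiler actually sees for the function above. */
const char *publ_symbol_name(void)
{
    return STR(PUBL(demo_version_major));
}

/* The name a private helper called 'demo_helper' would get (name only). */
const char *priv_symbol_name(void)
{
    return STR(PRIV(demo_helper));
}
```

With ERLANG\_INTEGRATION defined, the function is emitted as
`erts_pcre_demo_version_major`, which is why a "real" pcre library
loaded into the same process cannot clash with it. Names that are
written out literally in headers (such as the prototypes in pcre.h)
do not go through the macro, which is why they need the explicit
'#if defined(ERLANG\_INTEGRATION)' variants shown in the second diff.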

### ExternalCopyright

Now you should check that the 'ExternalCopyright' comment is present
in all source files:

    ~/tmp/pcre/epcre-8.33> for x in *.[ch]; do if grep ExternalCopyright $x > /dev/null; then true; else echo $x; fi; done

In this upgrade (from 7.6 to 8.33) we certainly had some new and
renamed files:

    dftables.c
    pcre_byte_order.c
    pcre_chartables.c
    pcre_jit_compile.c
    pcre_latin_1_table.c
    pcre_string_utils.c
    pcre_ucd.c

Go through them manually and add the 'ExternalCopyright' comment.

## Integrate with Erlang

Now you are done with most of the tedious work. It's time to move this
into your branch of the Erlang source tree, remove old files and add
new ones, plus add the tar file with the original pcre dist. Remember
to fix your hacked version of pcre.mk and then try to build
Erlang. You might need to update 'erl\_bif\_re.c' to reflect any
changes in the PCRE library. When it builds, run the test suites.

Make sure to rename any files that have new names and remove any files
that are no longer present before copying in the new versions from
your temporary directory. In our example we remove 'pcre\_info.c',
'pcre\_make\_latin1\_default.c', 'pcre\_try\_flipped.c',
'ucpinternal.h' and 'ucptable.h'. We rename 'make\_latin1\_table.c' to
'dftables.c' and 'pcre\_ucp\_searchfuncs.c' to 'pcre\_ucd.c'.

After copying in the sources, we can try to build. Do not forget to
fix whatever you did in pcre.mk to make it build locally.

## Update test suites

The next step is to integrate the updated PCRE tests into our test suites.

Copy testoutput[1-9] from the testdata directory of your new version
of pcre to the re\_SUITE\_data in stdlib's test suites. Run the
test suites and remove any bugs.
Usually the bugs come from the fact
that the PCRE test suites get better and from our implementation of
global matching, which may have bugs outside of the PCRE library. The
test suite 'pcre' is the one that runs these tests. Also copy
testoutput11-8 to testoutput10, the testoutput10 file in pcre is
nowadays for the DFA, which we do not use.

The next step is to regenerate re\_testoutput1\_replacement\_test. How
to do that is in a comment in the beginning of the file. The key
module is run\_pcre\_tests.erl, which both drives the pcre tests and
generates re\_testoutput1\_replacement\_test.erl. Watch during the
generation that you do not get too many of the "Fishy character"
messages. If there are more than, say, 20, you will probably need to
address the UTF-8 issues in the Perl execution. As it is now, we skip
non-Latin-1 characters in this test. You will need to run iconv on the
generated module to make it UTF-8 before running tests. Try to use a
perl version that is as new as possible.

The exact same procedure goes for the re\_testoutput1\_split\_test.erl.

Make a note about the perl version used in the commit updating the
replace and split test files.

Note that the perl version you are using may not be completely
compatible with the PCRE version you are upgrading to. If this is the
case you might get failures when running the replace and split tests.
If you get failures, you need to inspect them and decide what
to do. If there are only a small number of failures you will probably
end up preferring the behavior of PCRE, and manually changing these
tests. Do these changes in a separate commit so it is easy to see
what differed.

Also add copyright headers to the files after converting them to UTF-8.

After ironing out the rest of the bugs, you should be done with the
code.

## Update documentation

Now it's time for the documentation, which is fairly
straightforward. Diff the pcrepattern man pages from the old and new
PCRE distros and update the re.xml file accordingly. It may help to
have the generated HTML file from the new version to cut and paste
from, but as you will notice, there are quite a few changes from HTML
to XML. All lists are reformatted, the `<pre>` tags are made into
either `<code>` or `<quote>` etc. Also the `<P>` tags are
converted to lowercase and all mentioned options and function calls
are converted to their Erlang counterpart. Really awesome work that
requires thorough reading of all new text. For the upgrade from 7.6
to 8.33, the update of the pcrepattern part of our manual page took
about eight hours.

## Update Licence

Copy the LICENCE file to `erts/emulator/pcre/LICENCE` and update
the `[PCRE]` section in `system/COPYRIGHT` with the content of
the `LICENCE` file.

## Add new relevant options to re

Then, when all this is done, you should add any new relevant options
from the PCRE library to the code (erl\_bif\_re.c), the specs and
the Erlang function 'copt/1' (re.erl), and the manual page
(re.xml). Make sure the options are really relevant to add to the
Erlang API, check if they are compile or run-time options (or both) and
add them to the 'parse\_options' function of erl\_bif\_re.c. Adding an
option that is just passed through to PCRE is pretty simple, at least
"code wise".

Now you are done. Run all test suites on all machines and you will be happy.

## Final notes

To avoid the work of a major upgrade, it is probably worth it to keep
pace with the changes to PCRE. The upgrade from 7.6 to 8.33,
including tracking down bugs etc, took me a total of two weeks.
If
smaller diffs from the PCRE development were integrated in a more
incremental fashion, it would be much easier each time and you would
have the PCRE library up to date. PCRE should probably be updated for
each major release, instead of every five years...