# How to update the PCRE version used by Erlang

## The basic changes to the PCRE library

To work with the Erlang VM, PCRE has been changed in three important ways:

1. The main execution machine in pcre\_exec has been modified so that
matching can be interrupted and restarted. This functionality utilizes
the code that implements recursion by allocating explicit
"stack-frames" in heap space, which basically means that all local
variables in the loop are part of a struct which is kept in "malloced"
memory on the heap and there are no real stack variables that need to
be pushed on the C stack in the case of recursive calls. This is a
technique we also use inside the VM to avoid building large C
stacks. In PCRE this is enabled by the NO\_RECURSE define, so that is a
prerequisite for the ERLANG\_INTEGRATION define which also adds labels
at restart points and counts "reductions".

2. All visible symbols in PCRE get the erts_ prefix, so that NIFs
and such using a "real" pcre library do not get confused (or 're'
gets confused when a "real" pcre library gets loaded into the VM
process).

3. All irrelevant functionality has been stripped from the library,
which means for example UTF16 support, jit, DFA execution
etc. Basically the source files handling this are removed, together
with any build support from the PCRE project. We have our own
makefiles etc.

## Setting up an environment for the work

I work with four temporary directories when doing this (the examples
are from the updating of pcre-7.6 to pcre-8.33):

    ~/tmp/pcre> ls
    epcre-7.6  epcre-8.33  pcre-7.6  pcre-8.33

I've unpacked the plain pcre sources in pcre-* and will work with our
patched sources in the epcre-* directories.

Make sure your ERL_TOP contains a *built* version of Erlang (and that
you have made a branch).

First unpack the pcre libraries (which will create the pcre-*
directories) and then copy our code to the old epcre directory:

    ~/tmp/pcre> tar jxf $ERL_TOP/erts/emulator/pcre/pcre-7.6.tar.bz2
    ~/tmp/pcre> tar jxf ~/Downloads/pcre-8.33.tar.bz2
    ~/tmp/pcre> mkdir epcre-7.6  epcre-8.33
    ~/tmp/pcre> cd epcre-7.6/
    ~/tmp/pcre/epcre-7.6> cp -r $ERL_TOP/erts/emulator/pcre/* .
    ~/tmp/pcre/epcre-7.6> rm pcre-7.6.tar.bz2

Leave the obj directory, you may need the libepcre.a file...

If you find it easier, you can revert the commit in Git that adds the
erts_ prefix to the previous version before continuing work, but as
that is a quite small diff in newer versions of PCRE, it is probably
not worth it. Still, you will find that the erts_ prefix is a separate
commit when integrating 8.33, so if you're nice, you will do the same
for the person coming after you...

## Generating a diff for our changes to PCRE

Before you generate a diff (that, in an ideal world, would be used to
automatically patch the newer version of pcre, which will probably
only work for minor PCRE updates), you need to configure the old pcre.

    ~/tmp/pcre/epcre-7.6> cd ../pcre-7.6
    ~/tmp/pcre/pcre-7.6> ./configure --enable-utf8 --enable-unicode-properties --disable-shared --disable-stack-for-recursion

Note that for newer versions, the configure flag '--enable-utf8'
should be replaced with '--enable-utf'.

So we now generate a diff:

    ~/tmp/pcre/pcre-7.6> cd ../epcre-7.6
    ~/tmp/pcre/epcre-7.6> (for x in *.[ch]; do if [ -f ../pcre-7.6/$x ]; then diff -c ../pcre-7.6/$x $x; fi; done ) > ../epcre-7.6_clean.diff

### What the diff means

Let's now walk through the relevant parts of the diff. Some of the
differences might come from patches that probably are already in the
new version. For example, in our 7.6 we had a security patch which
added the define WORK_SIZE_CHECK and used it in some places. Those can
probably safely be ignored, but to be on the safe side, check what's
already integrated in the new version.

The interesting part is in pcre_exec.c. You will see things like

    #ifdef ERLANG_INTEGRATION
    ...
    #endif

or

    #if defined(ERLANG_INTEGRATION)
    ...
    #endif

and a lot of

    COST_CHK(1);

or

    COST(min);

and

    /* LOOP_COUNT: Ok */
    /* LOOP_COUNT: CHK */
    /* LOOP_COUNT: COST */

scattered over the main loop. Those mean the following:

* COST(int x) - consume reductions proportional to the integer
  parameter, but no need for interruption here (it's like
  bump_reductions without trapping). The loop they apply to also has a
  'LOOP_COUNT: COST' comment at its head.

* COST\_CHK(int x) - like COST(x), but also check that the reduction
  counter does not reach zero. If it does, leave the execution loop to
  be restarted at a later point. No real stack variables can be live
  here. Note that variables like 'max' and 'min' are *not* real stack
  variables, the NO\_RECURSE setting has taken care of that. 'i' is a
  stack variable that's explicitly saved when trapping, so that will
  also be correct when returning from a trap. So will 'c', 'rrc' and
  flags like 'utf8', 'minimize' and 'possessive'. Those can also be
  regarded as "non C-stack variables". The loop where they reside also
  has a 'LOOP\_COUNT: CHK' comment.

* /* LOOP_COUNT: Ok */ - means that I have checked the loop and it
  only runs a deterministic set of iterations regardless of input, or
  it has a call to RMATCH in its body, so we need not add more cost
  than the normal reduction counting that occurs for each
  instruction.

The thing is that each loop in the function 'match' should be marked
with one of these comments. If no comment is present after you have
patched the new release (if you successfully manage to do it
automatically), the loop may belong to a new regexp instruction that
has been added since the last release.

You will need to manually go through the main 'match' loop after
upgrading to verify that there are no unhandled loops in the regexp
machine loop (!).

The COST\_CHK macro works like this:

1. Add to the loop count.
2. If loop count > limit:
    1. Store the line (+100) in the Xwhere member of the frame structure
    2. Goto LOOP\_COUNT\_BREAK, which ultimately returns from the function
3. Insert a label, which is named `L_LOOP_COUNT_<line number>`
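
The steps above can be sketched in isolation. This is only an illustration under simplifying assumptions, not the code in pcre\_exec.c: `sketch_frame`, `sketch_sum` and `SKETCH_TRAPPED` are made-up names, and the macro takes an explicit restart id, whereas the real macro derives it from the source line (+100) and has its switch cases generated at build time.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_TRAPPED (-1)   /* hypothetical "out of reductions" result */

/* Hypothetical stand-in for the heap-allocated frame. */
typedef struct {
    int  Xwhere;   /* restart id, 0 means "fresh start" */
    long i;        /* loop variable saved across a trap */
    long sum;      /* accumulator, also the result on success */
} sketch_frame;

/* Count reductions only; never interrupts. */
#define COST(N) (loop_count += (N))

/* Count reductions, trap when the limit is exceeded, then place a label. */
#define COST_CHK(N, ID)                  \
    do {                                 \
        loop_count += (N);               \
        if (loop_count > loop_limit) {   \
            frame->Xwhere = (ID);        \
            goto LOOP_COUNT_BREAK;       \
        }                                \
    } while (0);                         \
    L_LOOP_COUNT_##ID:

/* Sums 0..n-1, giving up after loop_limit reductions per call. */
int sketch_sum(sketch_frame *frame, long n, long loop_limit)
{
    long loop_count = 0;
    long i = 0, sum = 0;

    assert(frame != NULL);
    if (frame->Xwhere != 0) {
        /* LOOP_COUNT_RETURN: restore saved locals, jump to the label. */
        i = frame->i;
        sum = frame->sum;
        switch (frame->Xwhere) {
        case 1: goto L_LOOP_COUNT_1;
        default: return -2;           /* unknown restart point */
        }
    }

    COST(1);                          /* fixed start-up cost, no trap */
    for (i = 0; i < n; i++) {         /* LOOP_COUNT: CHK */
        COST_CHK(1, 1);
        sum += i;
    }
    frame->Xwhere = 0;
    frame->sum = sum;
    return 0;

LOOP_COUNT_BREAK:
    /* Save the locals that are not already in the frame. */
    frame->i = i;
    frame->sum = sum;
    return SKETCH_TRAPPED;
}
```

A caller keeps calling `sketch_sum` with the same frame for as long as it returns `SKETCH_TRAPPED`; the result is in `frame->sum` once it returns 0.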

The LOOP\_COUNT\_BREAK code will create an extra "stack frame" on the
heap-allocated stack used if NO\_RECURSE is set, and will store the few
locals that are not already in the ordinary stack frame there (like
'c' and 'i').

When we continue execution (after a trap up to the main Erlang
scheduler), we will jump to LOOP\_COUNT\_RETURN, which will restore
the local variables and will jump to the labels. The jump code looks
like this in the C source:

    switch (frame->Xwhere)
      {
    #include "pcre_exec_loop_break_cases.inc"
      default:
        DPRINTF(("jump error in pcre match: label %d non-existent\n", frame->Xwhere));
        return PCRE_ERROR_INTERNAL;
      }

The file pcre\_exec\_loop\_break\_cases.inc is generated during the
build by pcre.mk and will look like:

    case 791: goto L_LOOP_COUNT_691;
    case 1892: goto L_LOOP_COUNT_1792;
    case 1999: goto L_LOOP_COUNT_1899;

etc
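
How pcre.mk produces this file is not shown here. Purely to illustrate the relation between a COST\_CHK line and its case (label = line number, case value = line number + 100), the following sketch scans a source text and emits the corresponding lines. `sketch_emit_cases` is a made-up helper, not part of the build.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: for every line of src containing "COST_CHK(",
 * append "case <line+100>: goto L_LOOP_COUNT_<line>;" to out.
 * Returns the number of cases written, or -1 if out is too small. */
int sketch_emit_cases(const char *src, char *out, size_t outsz)
{
    static const char needle[] = "COST_CHK(";
    const size_t nlen = sizeof needle - 1;
    const char *p = src;
    size_t used = 0;
    int line = 1, emitted = 0;

    assert(src != NULL && out != NULL);
    if (outsz == 0)
        return -1;
    out[0] = '\0';

    while (*p != '\0') {
        const char *eol = strchr(p, '\n');
        size_t len = (eol != NULL) ? (size_t)(eol - p) : strlen(p);
        size_t k;

        for (k = 0; k + nlen <= len; k++) {
            if (strncmp(p + k, needle, nlen) == 0) {
                int n = snprintf(out + used, outsz - used,
                                 "case %d: goto L_LOOP_COUNT_%d;\n",
                                 line + 100, line);
                if (n < 0 || (size_t)n >= outsz - used)
                    return -1;
                used += (size_t)n;
                emitted++;
                break;            /* one case per source line */
            }
        }
        if (eol == NULL)
            break;
        p = eol + 1;
        line++;
    }
    return emitted;
}
```

Note that this naive scan would also match the line that defines the macro itself, so a real generator has to be more selective.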

So, simply put, all C-stack variables are saved when we have consumed
our reductions, we return from the function and, as there is no real
recursion, we immediately fall out into the re:run BIF, which with the
help of a magic binary keeps track of the heap-allocated stack for the
regexp machine. When we return from trapping out to the scheduler, all
vital data is restored and we continue from exactly the same state as
we left. What's needed is to patch this into the new pcre_exec and
check all new instructions to determine what might need updating in
terms of COST, COST\_CHK etc.

Well, that's *almost* everything, because there is of course more...

The actual interface function, 'pcre\_exec', needs the same treatment
as the actual regexp machine loop, that is, we need to store all local
variables between restarts. Unfortunately the NO\_RECURSE setting does
not do this, so we need to do it ourselves. So there's quite a diff in
that function too, where a big struct is declared, containing every
local variable in that function, together with either local copies
that are swapped in and out, or macros that directly access the
heap-allocated struct. The struct is called `PcreExecContext`.

If a context is present, we are restarting and therefore restore
everything. If we are restarting we can also skip all initialization
code in the function and jump more or less directly to the
RESTART_INTERRUPTED label and the call to 'match', which is the actual
regexp machine loop.

There are a few places in pcre_exec where we need to do some
housekeeping. You will see code like:

    if ((extra_data->flags & PCRE_EXTRA_LOOP_LIMIT) != 0)
      {
      *extra_data->loop_counter_return =
        (extra_data->loop_limit - md->loop_limit);
      }

Make sure, after updating, that this housekeeping is done whenever we
do not reach the call to 'match'.
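
The shape of such a restartable interface function can be sketched as follows. This is a toy under stated assumptions, not the real pcre\_exec: all `sketch_` names and the two error codes are made up, the "matching" is just counting characters, and the context holds only two locals. It does show the three ingredients described above: a heap-allocated context, a jump past initialization on restart, and the reduction housekeeping on every exit.

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_ERROR_LOOP_LIMIT (-1)  /* stand-in for PCRE_ERROR_LOOP_LIMIT */
#define SKETCH_ERROR_NOMEMORY   (-2)

/* Every local that must survive a trap lives here. */
typedef struct {
    size_t pos;
    unsigned long hits;
} sketch_exec_context;

typedef struct {
    unsigned long loop_limit;            /* reductions granted per call */
    unsigned long *loop_counter_return;  /* out: reductions consumed */
    sketch_exec_context **restart_data;  /* *restart_data is NULL on a fresh call */
} sketch_extra;

#define SWAPIN()  do { pos = ctx->pos; hits = ctx->hits; } while (0)
#define SWAPOUT() do { ctx->pos = pos; ctx->hits = hits; } while (0)

/* Counts occurrences of wanted in subject, one reduction per character. */
int sketch_exec(const char *subject, char wanted,
                sketch_extra *extra, unsigned long *result)
{
    sketch_exec_context *ctx;
    unsigned long remaining;
    size_t pos = 0;
    unsigned long hits = 0;

    assert(subject != NULL && extra != NULL && result != NULL);
    ctx = *extra->restart_data;
    remaining = extra->loop_limit;

    if (ctx != NULL) {                 /* restarting: skip initialization */
        SWAPIN();
        goto RESTART_INTERRUPTED;
    }
    ctx = malloc(sizeof *ctx);
    if (ctx == NULL) {
        *extra->loop_counter_return = 0;   /* housekeeping on early exit */
        return SKETCH_ERROR_NOMEMORY;
    }

RESTART_INTERRUPTED:
    while (subject[pos] != '\0') {
        if (remaining == 0) {          /* out of reductions: trap */
            SWAPOUT();
            *extra->restart_data = ctx;
            *extra->loop_counter_return = extra->loop_limit - remaining;
            return SKETCH_ERROR_LOOP_LIMIT;
        }
        remaining--;
        if (subject[pos] == wanted)
            hits++;
        pos++;
    }
    *extra->loop_counter_return = extra->loop_limit - remaining;
    *extra->restart_data = NULL;
    free(ctx);
    *result = hits;
    return 0;
}
```

The caller must pass the same subject on every restart and a non-zero `loop_limit`, otherwise no progress is made between traps.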

So, now we in theory know what to do, so let's do it:

But...

## File changes in the new version of PCRE

First we need to go through what's changed in the new library
version. Files may have new names, functions may have moved and so on.

Start by building the new library:

    ~/tmp/pcre> cd pcre-8.33/
    ~/tmp/pcre/pcre-8.33> ./configure --enable-utf --enable-unicode-properties --disable-shared --disable-stack-for-recursion
    ~/tmp/pcre/pcre-8.33> make

In the make process, you will probably notice most files that are
used, but you can bet that's not all...

To begin with you will need a default table for Latin-1 characters, so:

    ~/tmp/pcre/pcre-8.33> cc -DHAVE_CONFIG_H -o dftables dftables.c
    ~/tmp/pcre/pcre-8.33> LANG=sv_SE ./dftables -L ../epcre-8.33/pcre_latin_1_table.c

Compare it to the pcre\_latin\_1\_table.c in the old version, they
should not differ in any significant way. If they do, it might be
that you do not have the `sv_SE` locale installed on your machine.

You can test whether it's installed with `locale -a | grep sv_SE$`, and
install with `sudo locale-gen sv_SE && sudo update-locale` if needed.

A good starting point is then to try to find all files in the new
version of the library that have (probably) the same names as the
ones in our distribution:

    ~/tmp/pcre/pcre-8.33> cd ../epcre-7.6/
    ~/tmp/pcre/epcre-7.6> for x in *.[ch]; do if [ '!' -f ../pcre-8.33/$x ]; then echo $x; else cp ../pcre-8.33/$x ../epcre-8.33/; fi; done

This will output a list of files not found in the new distro. Let's
look at the list from the example upgrade:

    local_config.h
    make_latin1_table.c
    pcre_info.c
    pcre_latin_1_table.c
    pcre_make_latin1_default.c
    pcre_try_flipped.c
    pcre_ucp_searchfuncs.c
    ucpinternal.h
    ucptable.h

* local\_config.h - OK, that's our child, it contains PCRE-specific
  configure-results (i.e. the #defines that are results from our
  parameters to configure, like NO\_RECURSE etc). Just copy it and
  edit it according to what specific settings you can find in the
  generated config.h from the real library build. In our example case,
  the #define SUPPORT\_UTF8 should be renamed to #define SUPPORT\_UTF
  and #define VERSION "7.6" should be changed to #define VERSION
  "8.33"...

* make\_latin1\_table.c - it was renamed to dftables.c, so we copy
  that instead.

* pcre\_info.c - It was simply removed from the library. Good, because
  it was useless... So just ignore.

* pcre\_latin\_1\_table.c - No problem, we generated a new one in the
  earlier stage.

* pcre\_make\_latin1\_default.c - No longer used, a hack that's not
  needed with dftables. Ignored.

* pcre\_try\_flipped.c - This functionality has been removed from
  pcre\_exec, you cannot compile on one endianness and execute on
  another any more :( Ignored.

* pcre\_ucp\_searchfuncs.c, ucpinternal.h, ucptable.h - this
  functionality is moved to pcre\_ucd.c, copy that one instead.

OK, now go the other way and look at what was actually built for the new version of pcre:

    ~/tmp/pcre/epcre-7.6> cd ../pcre-8.33/
    ~/tmp/pcre/pcre-8.33> nm ./.libs/libpcre.a | egrep 'lib.*.o:'

The output for this release was:

    libpcre_la-pcre_byte_order.o:
    libpcre_la-pcre_compile.o:
    libpcre_la-pcre_config.o:
    libpcre_la-pcre_dfa_exec.o:
    libpcre_la-pcre_exec.o:
    libpcre_la-pcre_fullinfo.o:
    libpcre_la-pcre_get.o:
    libpcre_la-pcre_globals.o:
    libpcre_la-pcre_jit_compile.o:
    libpcre_la-pcre_maketables.o:
    libpcre_la-pcre_newline.o:
    libpcre_la-pcre_ord2utf8.o:
    libpcre_la-pcre_refcount.o:
    libpcre_la-pcre_string_utils.o:
    libpcre_la-pcre_study.o:
    libpcre_la-pcre_tables.o:
    libpcre_la-pcre_ucd.o:
    libpcre_la-pcre_valid_utf8.o:
    libpcre_la-pcre_version.o:
    libpcre_la-pcre_xclass.o:
    libpcre_la-pcre_chartables.o:

Libtool has changed the object names, but we can fix that and see what
sources we have already decided should exist:

    ~/tmp/pcre/pcre-8.33> NAMES=`nm ./.libs/libpcre.a | egrep 'lib.*.o:'| sed 's,libpcre_la-,,' | sed 's,.o:$,,'`
    ~/tmp/pcre/pcre-8.33> for x in $NAMES; do if [ '!' -f ../epcre-8.33/$x.c ]; then echo $x; fi; done

And the list contained:

    pcre_byte_order
    pcre_jit_compile
    pcre_string_utils

pcre\_jit\_compile is actually needed, even though we have not enabled
jit, and the other two contain functionality needed, so just copy the
sources...

    ~/tmp/pcre/pcre-8.33> for x in $NAMES; do if [ '!' -f ../epcre-8.33/$x.c ]; then cp $x.c ../epcre-8.33/; fi; done

## Test build of stripped down version of new PCRE

Time to do a test build. Copy and edit the pcre.mk makefile and try to
get something that builds...

I made a wrapper Makefile, hacked pcre.mk a little and made a few
changes to a few files, namely added:

    #ifdef ERLANG_INTEGRATION
    #include "local_config.h"
    #endif

to pcre\_config.c and pcre\_internal.h. Also pcre.mk needs to get the
new files added and the old files removed, directory names need to be
changed and the wrapper can define most. My wrapper Makefile looked
like this:

    EPCRE_LIB = ./obj/libepcre.a
    PCRE_GENINC = ./pcre_exec_loop_break_cases.inc
    PCRE_OBJDIR = ./obj
    V_AR = ar
    V_CC = gcc
    CFLAGS = -g -O2 -DHAVE_CONFIG_H -I/ldisk/pan/git/otp/erts/x86_64-unknown-linux-gnu
    gen_verbose =
    PCRE_DIR=.
    include pcre.mk

The corresponding variables were removed, together with the
dependencies, from pcre.mk. Note that you will need to put things back
in order in pcre.mk after all testing is done. Once a 'make' is
successful, you can generate new dependencies:

    ~/tmp/pcre/epcre-8.33> gcc -MM -c -g -O2 -DHAVE_CONFIG_H -I/ldisk/pan/git/otp/erts/x86_64-unknown-linux-gnu -DERLANG_INTEGRATION *.c | grep -v $ERL_TOP

Well, then you have to add $(PCRE\_OBJDIR)/ to each object and
$(PCRE\_DIR)/ to each header. I did it manually, it's just a couple of
files. Now your pcre.mk is fairly up to date and it's time to start
patching in the changes...

## Actually patching in the changes to the C code

### Fixing the functionality (interruptible pcre\_run etc)

Begin with only pcre\_exec.c, that's the important part:

    ~/tmp/pcre/epcre-8.33> cd ../epcre-7.6/
    ~/tmp/pcre/epcre-7.6> diff -c ../pcre-7.6/pcre_exec.c ./pcre_exec.c > ../epcre_exec.c_7.6.diff
    ~/tmp/pcre/epcre-7.6> cd ../epcre-8.33

Now - if you are lucky, you can patch the new pcre\_exec with the
patch command from the diff, but that may not be the case... Even if:

    ~/tmp/pcre/epcre-8.33> patch -p0 < ../epcre_exec.c_7.6.diff

works like a charm, you still have to go through the main loop and see
that all do, while and for loops in the code contain COST\_CHK or at
least COST, or, if it's a small loop (over, say, one UTF character),
mark it as OK with a comment.

You should also check for other changes, like new local variables in
the pcre\_exec code etc.

What will probably happen is that the majority of hunks
fail. pcre\_exec is the main file for PCRE, one that is constantly
optimized and where every new feature ends up. You will probably see
so many failed hunks that you feel like giving up, but do not
despair, it's just a matter of patience and hard work:

* First, fix the 'pcre\_exec' function.

    * Change the struct PcreExecContext to reflect the local variables
      in this version of the code.

    * Add/update the defines that make local variables in the code
      actually stay in an allocated "exec\_context" and be sure to
      initialize the "pseudo-stack-variables" in the same way as in
      the declarations for the original version of the code.

    * The macros SWAPIN and SWAPOUT should be for variables that are
      used a lot and that we do not want to always access through the
      struct. Also a few parameters are saved by SWAPIN and SWAPOUT.

    * What might be tricky is to get things deallocated in a proper
      way. There is a function that's called from the BIF code to
      clean up an exec\_context; be especially observant about how the
      stack in the 'match' function is allocated! The first frame is
      supposed to be on the C stack, but in our case is allocated in
      the exec\_context. The rest of the frames are allocated but
      never freed, not until the match is done.

      The variable 'frame' in the 'match' function is stored in our
      additional field of the 'md' structure, that is the stack top,
      but not necessarily the uppermost frame (due to reuse of old
      frames, which is supposed to be an optimization...).

    * The housekeeping of the "reduction counter" in the extra\_data
      struct needs to be added to all places where we break out of the
      main loop of pcre\_exec. Look for 'break' and you will see the
      places. Make sure to update
      '*extra\_data->loop\_counter\_return' whenever you leave this
      function. It all boils down to some code that loops over the
      call for match and returns PCRE\_ERROR\_LOOP\_LIMIT and gets
      jumped back to when the BIF is restarted. You will see it in
      your diff and you will find a similar place in the new version
      where you put basically the same code.

    * Fixing pcre\_exec takes about an hour of concentrated work, it
      could be worse...

* Next, go for the match function. It's simpler in some ways but
  harder in others. The elimination of the C stack is already there,
  you just need to modify it a little:

    * In the RRETURN macro for NO\_RECURSE, add updating of
      md->loop\_limit before returning. You can see how it's done in
      the diff.

    * RMATCH can be left as it is, at least it could in earlier
      versions. Note however that you should mimic the allocation
      strategies of RMATCH and RRETURN in the code at another place
      later... The principle of the labels HEAP\_RECURSE and
      HEAP\_RETURN is mimicked by our code in LOOP\_COUNT\_BREAK and
      LOOP\_COUNT\_RETURN. You'll see later...

    * COST and COST\_CHK, together with the jump to the
      LOOP\_COUNT\_RETURN label, are at the beginning of the function
      'match'. It's a block of macros and declarations of our local
      variables loop\_count and loop\_limit. We patch in the code for
      that, but may need to adapt it to new variable names etc. It's
      important to handle the 'frame' variable correctly, dig it out
      of the 'md' struct when we are restarting, but initialize it as
      is done in normal NO\_RECURSE code otherwise. Note that the
      COST\_CHK macro reuses the Xwhere field of the frame struct, it
      is not needed when trapping.

    * The LOOP\_COUNT\_BREAK and the LOOP\_COUNT\_RETURN code can now
      be added. Make sure to check both how a new stack frame should be
      properly allocated by mimicking the code in RMATCH, and how (if)
      it should be freed by mimicking RRETURN. Also check which
      variables need to be saved. They are properly pointed out in
      8.33 with the comment 'These variables do not need to be
      preserved over recursion' and appear at the beginning of the
      function. Find variables of similar type in the frame structure
      and reuse them. In 8.33 there are eight such variables. The
      LOOP\_COUNT\_BREAK and LOOP\_COUNT\_RETURN code is placed at the
      end of the function 'match'. If you are reading the diff, you
      need to scroll past all the COST\_CHK calls, i.e. past the whole
      regexp machine loop.

    * Now take the time to add things like debug macros to the top of
      the file and one single COST\_CHK (preferably the one right
      after for(;;) in 'match'), and see if you can compile. You will
      probably need to add some fields in the structures in pcre.h,
      see from a larger diff what you need there and iterate until you
      can compile.

    * So, what's left is to add all the COST and COST\_CHK macros,
      plus marking all harmless loops as OK. There are a few rules
      here:

        * Mark *every* loop with the comment 'LOOP\_COUNT: xxxx',
          where xxxx is either 'Ok', 'COST' or 'CHK'. There are 175
          'LOOP\_COUNT:' comments in 8.33.

        * Loops marked 'Ok' need no macro, either because they are so
          short (like over a UTF character) or because they contain
          an RMATCH macro, in which case they will be accounted for
          anyway.

        * Loops marked 'COST' will have an associated 'COST(N)' macro,
          either before, if we know the number of iterations, or
          within. Reductions are counted, but we will not
          interrupt. This is typically in what is expected to be
          medium long loops or at places where interruption is hard
          (like where we have local variables that are alive). The
          selection between 'COST' and 'COST\_CHK' is hard. 'COST' is
          much cheaper and usually enough, but when in doubt about the
          loop length, try to use 'COST\_CHK', while making very sure
          there are no live block-local variables that need to be
          saved over the trap. There are 49 'COST' macros in 8.33.

        * Loops marked 'CHK' shall contain a 'COST\_CHK(N)'
          macro. This macro both counts reductions and may result in
          an interrupt and a return to Erlang space. It is expensive
          and it is vital to ensure that there are no unexpected local
          variables that live past the macro. Most variables are in
          the pseudo stack frame, but some regexp instructions declare
          temporaries inside blocks. Make sure they are not expected
          to be alive after a COST\_CHK if they are not in the
          'heapframe' structure. If they are, you need to
          conditionally move them to the 'heapframe' #if
          defined(ERLANG\_INTEGRATION). In 8.33 the variables 'lgb'
          and 'rgb' are preserved in this way. There are 54
          'COST\_CHK's in 8.33.

        * I've marked a few block-local variables with warnings, but
          look thoroughly through the main loop to detect any new
          ones.

        * Be careful when it comes to freeing the context from Erlang
          (the function erts\_pcre\_free\_restart\_data). Whatever is
          done there has to work *both* when the context is freed in
          the middle of an operation (because of trapping) and when
          some things have been freed by a successful
          return. Specifically, make sure to set md->offset\_vector to
          NULL whenever it's freed (in the rest of the code) and
          construct release\_match\_heapframes so that it can be
          called multiple times for the same heapframe (set the next
          pointer in the "static" frame, i.e. the one allocated in the
          md, to NULL after freeing).

    * To add the costs to the main loop takes less than one work day,
      keep calm and continue...
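
The advice above about freeing the context can be made concrete with a small sketch. It assumes a simplified layout: the `sketch_` types and functions are made up, and only the two properties mentioned in the text are modelled (a first frame that is embedded rather than separately allocated, and an offset vector that may already have been freed).

```c
#include <assert.h>
#include <stdlib.h>

typedef struct sketch_heapframe {
    struct sketch_heapframe *Xnextframe;
} sketch_heapframe;

typedef struct {
    sketch_heapframe first_frame;  /* the "static" frame, never freed itself */
    int *offset_vector;
} sketch_match_data;

/* Frees every frame after the first; safe to call repeatedly. */
void sketch_release_heapframes(sketch_heapframe *frame_base)
{
    sketch_heapframe *next;

    assert(frame_base != NULL);
    next = frame_base->Xnextframe;
    while (next != NULL) {
        sketch_heapframe *old = next;
        next = next->Xnextframe;
        free(old);
    }
    frame_base->Xnextframe = NULL;   /* makes a second call a no-op */
}

/* Works both mid-match (after a trap) and after a completed match. */
void sketch_free_restart_data(sketch_match_data *md)
{
    assert(md != NULL);
    free(md->offset_vector);         /* free(NULL) is a no-op */
    md->offset_vector = NULL;
    sketch_release_heapframes(&md->first_frame);
}
```

Resetting both pointers to NULL after freeing is what makes it harmless for the cleanup to run once on a successful return and once more when the context itself is released.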

OK, now you are done with pcre\_exec (or at least, you think
so). The rest is simpler. You have probably already handled 'pcre.h'
and 'pcre\_internal.h' to add fields to the structures etc. Looking at
a diff from an earlier version, you will see what's left. In upgrading
to 8.33, the following things were left to do after pcre\_exec was
fixed. Remember that you can generate a diff with:

    ~/tmp/pcre/epcre-8.33> cd ../epcre-7.6/
    ~/tmp/pcre/epcre-7.6> (for x in *.[ch]; do if [ -f ../pcre-7.6/$x ]; then diff -c ../pcre-7.6/$x $x; fi; done) > ../epcre-7.6.diff

Open the diff in your favorite editor and remove whatever changes you
have already made, like everything that has to do with pcre\_exec.c
and probably a large part of pcre.h/pcre\_internal.h.

The expected result is a diff that either contains only the
'%ExternalCopyright%' comments or contains them and the addition of
the erts\_ prefix, depending on whether you reverted the prefix change
(using 'git revert') before starting to work. With a little luck, the
patch of the remaining stuff should be possible to apply
automatically. If anything fails, just add it manually.

### Fixing the erts\_ prefix

The erts\_ prefix is mostly implemented by adding '#if
defined(ERLANG\_INTEGRATION)' to a lot of function headers, inside the
COMPILE\_PCRE8 part. You then also change the PRIV and PUBL macros
in pcre\_internal.h. Typical diffs look like:

      #if defined COMPILE_PCRE8
    + #if defined(ERLANG_INTEGRATION)
    + #ifndef PUBL
    + #define PUBL(name) erts_pcre_##name
    + #endif
    + #ifndef PRIV
    + #define PRIV(name) _erts_pcre_##name
    + #endif
    + #else
      #ifndef PUBL
      #define PUBL(name) pcre_##name
      #endif
      #ifndef PRIV
      #define PRIV(name) _pcre_##name
      #endif
    + #endif

and

      #if defined COMPILE_PCRE8
    + #if defined(ERLANG_INTEGRATION)
    + PCRE_EXP_DECL int erts_pcre_pattern_to_host_byte_order(pcre *argument_re,
    +   erts_pcre_extra *extra_data, const unsigned char *tables)
    + #else
      PCRE_EXP_DECL int pcre_pattern_to_host_byte_order(pcre *argument_re,
        pcre_extra *extra_data, const unsigned char *tables)
    + #endif

Note that some data types, like pcre\_extra, are accessed with the PUBL
macro, so they need to explicitly get the prefix added. pcre.h is a
pig, as it declares prototypes for all functions regardless of
compilation mode, so there is quite a lot of '#if
defined(ERLANG\_INTEGRATION)' to add there.
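
To see what the two macros do to a name, here is a small standalone sketch. The macro bodies are copied from the diff above; everything prefixed `sketch_` or `SKETCH_` is made up for the illustration, and ERLANG\_INTEGRATION is defined in the file only because there is no build system around it.

```c
#include <assert.h>
#include <string.h>

#define ERLANG_INTEGRATION 1   /* normally set by the build, assumed here */

#if defined(ERLANG_INTEGRATION)
#define PUBL(name) erts_pcre_##name
#define PRIV(name) _erts_pcre_##name
#else
#define PUBL(name) pcre_##name
#define PRIV(name) _pcre_##name
#endif

/* Two-step stringification so the argument is macro-expanded first. */
#define SKETCH_STR(x)  #x
#define SKETCH_XSTR(x) SKETCH_STR(x)

/* Defined as erts_pcre_sketch_answer when ERLANG_INTEGRATION is set. */
int PUBL(sketch_answer)(void)
{
    return 42;
}

/* Returns the symbol name that PRIV(sketch_table) expands to. */
const char *sketch_priv_name(void)
{
    return SKETCH_XSTR(PRIV(sketch_table));
}

/* Aborts if the macros do not produce the prefixed names. */
void sketch_check_prefix(void)
{
    assert(strcmp(sketch_priv_name(), "_erts_pcre_sketch_table") == 0);
    assert(strcmp(SKETCH_XSTR(PUBL(sketch_answer)),
                  "erts_pcre_sketch_answer") == 0);
}
```

Code that spells a name out directly instead of going through PUBL or PRIV is not affected by the macros, which is why such places need the explicit '#if defined(ERLANG\_INTEGRATION)' treatment.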

Anyway, now try to patch, using a diff where you have removed the
changes you made manually (probably to pcre\_exec.c), but make sure to
save your work (temporary git repository?) first, so you can revert
any disasters...

    ~/tmp/pcre/epcre-7.6> cd ../epcre-8.33/
    ~/tmp/pcre/epcre-8.33> patch -p0 < ../epcre-7.6_clean2.diff

Some hunks may certainly still fail; read through the .rej files and
fix them.

### ExternalCopyright

Now you should check that the 'ExternalCopyright' comment is present
in all source files:

    ~/tmp/pcre/epcre-8.33> for x in *.[ch]; do if grep ExternalCopyright $x > /dev/null; then true; else echo $x; fi; done

In this upgrade (from 7.6 to 8.33) we certainly had some new and
renamed files:

    dftables.c
    pcre_byte_order.c
    pcre_chartables.c
    pcre_jit_compile.c
    pcre_latin_1_table.c
    pcre_string_utils.c
    pcre_ucd.c

Go through them manually and add the 'ExternalCopyright' comment.

## Integrate with Erlang

Now you are done with most of the tedious work. It's time to move this
into your branch of the Erlang source tree, remove old files and add
new ones, plus add the tar file with the original pcre dist. Remember
to fix your hacked version of pcre.mk and then try to build
Erlang. You might need to update 'erl\_bif\_re.c' to reflect any
changes in the PCRE library. When it builds, run the test suites.

Make sure to rename any files that have new names and remove any files
that are no longer present before copying in the new versions from
your temporary directory. In our example we remove 'pcre\_info.c',
'pcre\_make\_latin1\_default.c', 'pcre\_try\_flipped.c',
'ucpinternal.h' and 'ucptable.h'. We rename 'make\_latin1\_table.c' to
'dftables.c' and 'pcre\_ucp\_searchfuncs.c' to 'pcre\_ucd.c'.

After copying in the sources, we can try to build. Do not forget to
fix whatever you did in pcre.mk to make it build locally.

## Update test suites

The next step is to integrate the updated PCRE tests into our test suites.

Copy testoutput[1-9] from the testdata directory of your new version
of pcre to the re\_SUITE\_data in stdlib's test suites. Run the
test suites and remove any bugs. Usually the bugs come from the fact
that the PCRE test suites get better and from our implementation of
global matching, which may have bugs outside of the PCRE library. The
test suite 'pcre' is the one that runs these tests. Also copy
testoutput11-8 to testoutput10; the testoutput10 file in pcre is
nowadays for the DFA, which we do not use.

The next step is to regenerate re\_testoutput1\_replacement\_test. How
to do that is in a comment at the beginning of the file. The key
module is run\_pcre\_tests.erl, which both drives the pcre tests and
generates re\_testoutput1\_replacement\_test.erl. Watch during the
generation that you do not get too many of the "Fishy character"
messages; if there are more than, say, 20, you will probably need to
address the UTF8 issues in the Perl execution. As it is now, we skip
non latin1 characters in this test. You will need to run iconv on the
generated module to make it UTF-8 before running tests. Try to use a
perl version that is as new as possible.

The exact same procedure goes for the re\_testoutput1\_split\_test.erl.

Make a note about the perl version used in the commit updating the
replace and split test files.

Note that the perl version you are using may not be completely
compatible with the PCRE version you are upgrading to. If this is the
case you might get failures when running the replace and split tests.
If you get failures, you need to inspect the failures and decide what
to do. If there are only a small number of failures you will probably
end up preferring the behavior of PCRE, and manually changing these
tests. Do these changes in a separate commit so it is easy to see
what differed.

Also add copyright headers to the files after converting them to UTF-8.

After ironing out the rest of the bugs, you should be done with the
code.

## Update documentation

Now it's time for the documentation, which is fairly
straightforward. Diff the pcrepattern man pages from the old and new
PCRE distros and update the re.xml file accordingly. It may help to
have the generated HTML file from the new version to cut and paste
from, but as you will notice, it's quite a few changes from HTML to
XML. All lists are reformatted, the &lt;pre&gt; tags are made into
either &lt;code&gt; or &lt;quote&gt; etc. Also the &lt;P&gt; tags are
converted to lowercase and all mentioned options and function calls
are converted to their Erlang counterpart. Really awesome work that
requires thorough reading of all new text. For the upgrade from 7.6
to 8.33, the update of the pcrepattern part of our manual page took
about eight hours.

## Update Licence

Copy the LICENCE file to `erts/emulator/pcre/LICENCE` and update
the `[PCRE]` section in `system/COPYRIGHT` with the content of
the `LICENCE` file.

## Add new relevant options to re

Then, when all this is done, you should add any new relevant options
from the PCRE library to the code (erl\_bif\_re.c), the specs and
the Erlang function 'copt/1' (re.erl), and the manual page
(re.xml). Make sure the options are really relevant to add to the
Erlang API, check if they are compile or run-time options (or both) and
add them to the 'parse\_options' function of erl\_bif\_re.c. Adding an
option that is just passed through to PCRE is pretty simple, at least
"code wise".

Now you are done. Run all test suites on all machines and you will be happy.

## Final notes

To avoid the work of a major upgrade, it is probably worth it to keep
pace with the changes to PCRE. The upgrade from 7.6 to 8.33,
including tracking down bugs etc, took me a total of two weeks. If
smaller diffs from the PCRE development were integrated in a more
incremental fashion, it would be much easier each time and you would
have the PCRE library up to date. PCRE should probably be updated for
each major release, instead of every five years...