# Automated regression tests for librdkafka


## Supported test environments

While the standard test suite works well on OSX and Windows,
the full test suite (which must be run for PRs and releases) will
only run on recent Linux distros due to its use of ASAN, Kerberos, etc.

## Automated broker cluster setup using trivup

A local broker cluster can be set up using
[trivup](https://github.com/edenhill/trivup), which is a Python package
available on PyPI.
These self-contained clusters are used to run the librdkafka test suite
on a number of different broker versions or with specific broker configs.

trivup will download the specified Kafka version into its root directory.
The root directory is also used for cluster instances, where Kafka will
write messages, logs, etc.
The trivup root directory is by default `tmp` in the current directory but
may be specified by setting the `TRIVUP_ROOT` environment variable
to an alternate directory, e.g., `TRIVUP_ROOT=$HOME/trivup make full`.

First install trivup:

    $ pip3 install trivup

Bring up a Kafka cluster (with the specified version) and start an interactive
shell; when the shell is exited the cluster is brought down and deleted.

    $ ./interactive_broker_version.py 2.3.0   # Broker version

In the trivup shell, run the test suite:

    $ make


If you'd rather use an existing cluster, you may omit trivup and
provide a `test.conf` file that specifies the brokers and possibly other
librdkafka configuration properties:

    $ cp test.conf.example test.conf
    $ $EDITOR test.conf
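
For reference, a minimal `test.conf` could look like the sketch below,
assuming a single broker reachable on `localhost:9092`; any other librdkafka
configuration property can be added in the same `key=value` format:

    # Brokers to connect to
    bootstrap.servers=localhost:9092
    # Optionally enable client debugging for all tests, e.g.:
    #debug=broker,protocol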



## Run specific tests

To run tests:

    # Run tests in parallel (quicker, but harder to troubleshoot)
    $ make

    # Run a condensed test suite (quickest)
    # This is what is run on CI builds.
    $ make quick

    # Run tests in sequence
    $ make run_seq

    # Run specific test
    $ TESTS=0004 make

    # Run test(s) with helgrind, valgrind, gdb
    $ TESTS=0009 ./run-test.sh valgrind|helgrind|gdb

All tests in the 0000-0999 series are run automatically with `make`.

Tests 1000-1999 are subject to specific non-standard setups or broker
configuration; these tests are run with `TESTS=1nnn make`.
See comments in the test's source file for specific requirements.

To insert test results into an SQLite database, make sure the `sqlite3`
utility is installed, then add this to `test.conf`:

    test.sql.command=sqlite3 rdktests


## Adding a new test

The simplest way to add a new test is to copy one of the recent
(higher `0nnn-..` number) tests to the next free
`0nnn-<what-is-tested>` file.

If possible and practical, try to use the C++ API in your test as that will
cover both the C and C++ APIs and thus provide better test coverage.
Do note that the C++ test framework is not as feature-rich as the C one,
so if you need message verification, etc., you're better off with a C test.

After creating your test file, it needs to be added in a couple of places:

 * Add to [tests/CMakeLists.txt](tests/CMakeLists.txt)
 * Add to [win32/tests/tests.vcxproj](win32/tests/tests.vcxproj)
 * Add to both locations in [tests/test.c](tests/test.c) - search for an
   existing test number to see what needs to be done.

You don't need to add the test to the Makefile; it is picked up automatically.
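
If you are unsure what a test file should look like, here is a minimal
sketch. The test number, name and helper usage are illustrative assumptions
only; copy an existing `0nnn-*.c` file and consult `test.h` for the actual
conventions and available helpers.

    /* 0142-my_feature.c - hypothetical test number and name, for
     * illustration only. */
    #include "test.h"

    /* The entry point is named main_<testnumber>_<name> and is declared
     * in tests/test.c (see an existing test for the exact declaration). */
    int main_0142_my_feature(int argc, char **argv) {
            rd_kafka_conf_t *conf;
            rd_kafka_t *rk;
            char errstr[512];

            /* Helper assumed from test.h: sets up a client configuration
             * from test.conf and the TEST_* environment variables. */
            test_conf_init(&conf, NULL, 30);

            rk = rd_kafka_new(RD_KAFKA_PRODUCER, conf, errstr, sizeof(errstr));
            TEST_ASSERT(rk != NULL, "Failed to create producer: %s", errstr);

            TEST_SAY("Created producer %s\n", rd_kafka_name(rk));

            rd_kafka_destroy(rk);

            return 0;
    }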

Some additional guidelines:
 * If your test depends on a minimum broker version, make sure to specify it
   in test.c using `TEST_BRKVER()` (see 0091 as an example).
 * If your test can run without an active cluster, flag the test
   with `TEST_F_LOCAL`.
 * If your test runs for a long time or produces/consumes a lot of messages
   it might not be suitable for running on CI (which should run quickly
   and is bound by both time and resources). In this case it is preferred
   that you modify your test to run quicker and/or with fewer messages
   when the `test_quick` variable is true.
 * There are plenty of helper wrappers in test.c for common librdkafka
   functions that make tests easier to write by not having to deal with
   errors, etc.
 * Fail fast: use `TEST_ASSERT()` et al. The sooner an error is detected,
   the better, since it makes troubleshooting easier.
 * Use `TEST_SAY()` et al. to inform the developer what your test is doing,
   making it easier to troubleshoot upon failure, but try to keep output
   down to reasonable levels. There is a `TEST_LEVEL` environment variable
   that can be used with `TEST_SAYL()` to only emit certain printouts
   if the test level is increased. The default test level is 2.
 * The test runner will automatically adjust timeouts (that it knows about)
   if running under valgrind, on CI, or in similar environments where
   execution speed may be slower.
   To make sure your test remains sturdy in these types of environments,
   use the `tmout_multip(milliseconds)` macro when passing timeout
   values to non-test functions, e.g., `rd_kafka_poll(rk, tmout_multip(3000))`.
 * If your test file contains multiple separate sub-tests, use the
   `SUB_TEST()`, `SUB_TEST_QUICK()` and `SUB_TEST_PASS()` macros from inside
   the test functions to help differentiate test failures (see the sketch
   below).

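As an illustration of the last two points, continuing the hypothetical
0142 example above, each logical sub-test could be wrapped like this
(the sub-test name and the 3000 ms timeout are arbitrary examples):

    /* Each sub-test is bracketed by SUB_TEST*() markers so that
     * failures are attributed to the right sub-test in the report. */
    static void do_test_produce_something(rd_kafka_t *rk) {
            SUB_TEST_QUICK();

            /* Pass timeouts through tmout_multip() so they are scaled
             * automatically under valgrind, CI and other slow environments. */
            rd_kafka_poll(rk, tmout_multip(3000));

            SUB_TEST_PASS();
    }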


## Test scenarios

A test scenario defines the cluster configuration used by tests.
The majority of tests use the "default" scenario, which matches the
Apache Kafka default broker configuration (topic auto creation enabled, etc.).

If a test relies on cluster configuration that is mutually exclusive with
the default configuration, an alternate scenario must be defined in
`scenarios/<scenario>.json`, a configuration object which is passed
to [trivup](https://github.com/edenhill/trivup).

Try to reuse an existing test scenario as far as possible to speed up
test times, since each new scenario will require a new cluster incarnation.


## A guide to testing, verifying, and troubleshooting librdkafka


### Creating a development build

The [dev-conf.sh](../dev-conf.sh) script configures and builds librdkafka and
the test suite for development use, enabling extra runtime
checks (`ENABLE_DEVEL`, `rd_dassert()`, etc.), disabling optimization
(to get accurate stack traces and line numbers), enabling ASAN, etc.

    # Reconfigure librdkafka for development use and rebuild.
    $ ./dev-conf.sh

**NOTE**: Performance tests and benchmarks should not use a development build.

### Controlling the test framework

A test run may be dynamically set up using a number of environment variables.
These environment variables work for all the different ways of invoking the
tests, be it `make`, `run-test.sh`, `until-fail.sh`, etc.

 * `TESTS=0nnn` - only run a single test identified by its full number, e.g.
                  `TESTS=0102 make`. (Yes, the var should have been called TEST)
 * `SUBTESTS=...` - only run sub-tests (tests that are using `SUB_TEST()`)
                      that contain this string.
 * `TESTS_SKIP=...` - skip these tests.
 * `TEST_DEBUG=...` - this will automatically set the `debug` config property
                      of all instantiated clients to this value.
                      E.g., `TEST_DEBUG=broker,protocol TESTS=0001 make`
 * `TEST_LEVEL=n` - controls the `TEST_SAY()` output level, a higher number
                      yields more test output. The default level is 2.
 * `RD_UT_TEST=name` - only run unit tests containing `name`; should be used
                          with `TESTS=0000`.
                          See [../src/rdunittest.c](../src/rdunittest.c) for
                          unit test names.


Let's say that you run the full test suite and get a failure in test 0061,
which is a consumer test. You want to quickly reproduce the issue
and figure out what is wrong, so limit the tests to just 0061 and provide
the relevant debug options (typically `cgrp,fetch` for consumers):

    $ TESTS=0061 TEST_DEBUG=cgrp,fetch make

If the test does not fail you've found an intermittent issue; this is where
[until-fail.sh](until-fail.sh) comes into play: run the test until it fails.

    # bare means to run the test without valgrind
    $ TESTS=0061 TEST_DEBUG=cgrp,fetch ./until-fail.sh bare

### How to run tests

The standard way to run the test suite is by firing up a trivup cluster
in an interactive shell:

    $ ./interactive_broker_version.py 2.3.0   # Broker version


And then running the test suite in parallel:

    $ make


Run one test at a time:

    $ make run_seq


Run a single test:

    $ TESTS=0034 make


Run the test suite with valgrind (see instructions below):

    $ ./run-test.sh valgrind   # memory checking

or with helgrind (the valgrind thread checker):

    $ ./run-test.sh helgrind   # thread checking

To run the tests in gdb:

**NOTE**: gdb support is flaky on OSX due to signing issues.

    $ ./run-test.sh gdb
    (gdb) run

    # wait for test to crash, or interrupt with Ctrl-C

    # backtrace of current thread
    (gdb) bt
    # move up or down a stack frame
    (gdb) up
    (gdb) down
    # select specific stack frame
    (gdb) frame 3
    # show code at location
    (gdb) list

    # print variable content
    (gdb) p rk.rk_conf.group_id
    (gdb) p *rkb

    # continue execution (if interrupted)
    (gdb) cont

    # single-step one source line
    (gdb) step

    # restart
    (gdb) run

    # see all threads
    (gdb) info threads

    # see backtraces of all threads
    (gdb) thread apply all bt

    # exit gdb
    (gdb) exit

If a test crashes and produces a core file (make sure your shell has
`ulimit -c unlimited` set!), do:

    # On Linux
    $ LD_LIBRARY_PATH=../src:../src-cpp gdb ./test-runner <core-file>
    (gdb) bt

    # On OSX
    $ DYLD_LIBRARY_PATH=../src:../src-cpp gdb ./test-runner /cores/core.<pid>
    (gdb) bt


To run all tests repeatedly until one fails (a good way of finding
intermittent failures, race conditions, etc.):

    $ ./until-fail.sh bare  # bare is to run the test without valgrind,
                            # may also be one or more of the modes supported
                            # by run-test.sh:
                            #  bare valgrind helgrind gdb, etc..

To run a single test repeatedly with valgrind until failure:

    $ TESTS=0103 ./until-fail.sh valgrind


### Finding memory leaks, memory corruption, etc.

There are two ways of verifying that there are no memory leaks, out-of-bounds
memory accesses, use after free, etc.: ASAN or valgrind.

#### ASAN - AddressSanitizer

The first option is AddressSanitizer (ASAN), build-time instrumentation
provided by clang and gcc that inserts memory checks into the built library.

To enable ASAN, run `./dev-conf.sh asan` from the librdkafka root directory.
This script will rebuild librdkafka and the test suite with ASAN enabled.
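
For example, a typical ASAN session (assuming a trivup shell as set up
earlier in this document) boils down to:

    # From the librdkafka root directory:
    $ ./dev-conf.sh asan

    # Then run the test suite as usual from tests/ in the trivup shell:
    $ make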

Then run tests as usual. Memory access issues will be reported on stderr
in real time as they happen (and the test will eventually fail), while
memory leaks will be reported on stderr when the test run exits successfully,
i.e., when no tests failed.

Test failures will typically cause the current test to exit hard without
cleaning up, in which case there will be a large number of reported memory
leaks; these shall be ignored. The memory leak report is only relevant
when the test suite passes.

**NOTE**: The OSX version of ASAN does not provide memory leak detection,
          so you will need to run the test suite on Linux (native or in Docker).

**NOTE**: ASAN, TSAN and valgrind are mutually exclusive.

#### Valgrind - memory checker

Valgrind is a powerful virtual machine that intercepts all memory accesses
of an unmodified program, reporting memory access violations, use after free,
memory leaks, etc.

Valgrind provides additional checks over ASAN and is mostly useful
for troubleshooting crashes, memory issues and leaks when ASAN falls short.

To use valgrind, make sure librdkafka and the test suite are built without
ASAN or TSAN (it must be a clean build without any other instrumentation),
then simply run:

    $ ./run-test.sh valgrind

Valgrind will report to stderr, just like ASAN.


**NOTE**: Valgrind only runs on Linux.

**NOTE**: ASAN, TSAN and valgrind are mutually exclusive.

### TSAN - Thread and locking issues

librdkafka uses a number of internal threads which communicate and share state
through op queues, condition variables, mutexes and atomics.

While the docstrings in the librdkafka source code specify what locking is
required, it is very hard to manually verify that the correct locks
are acquired, and in the correct order (to avoid deadlocks).

TSAN, ThreadSanitizer, is of great help here. As with ASAN, TSAN is a
build-time option: run `./dev-conf.sh tsan` to rebuild with TSAN.

Run the test suite as usual, preferably in parallel. TSAN will output
thread errors to stderr and eventually fail the test run.
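
The workflow mirrors the ASAN one above (again assuming a trivup shell):

    # From the librdkafka root directory:
    $ ./dev-conf.sh tsan

    # Then run the test suite in parallel from tests/:
    $ make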

If you're having threading issues and TSAN does not provide enough information
to sort it out, you can also try running the test with helgrind, which
is valgrind's thread checker (`./run-test.sh helgrind`).


**NOTE**: ASAN, TSAN and valgrind are mutually exclusive.

### Resource usage thresholds (experimental)

**NOTE**: This is an experimental feature, some form of system-specific
          calibration will be needed.

If the `-R` option is passed to the `test-runner`, or the `make rusage`
target is used, the test framework will monitor each test's resource usage
and fail the test if the default or test-specific thresholds are exceeded.

Per-test thresholds are specified in test.c using the `_THRES()` macro.

Currently monitored resources are:
 * `utime` - User CPU time in seconds (default 1.0s)
 * `stime` - System/Kernel CPU time in seconds (default 0.5s).
 * `rss` - RSS (memory) usage (default 10.0 MB)
 * `ctxsw` - Number of voluntary context switches, e.g. syscalls (default 10000).

Upon successful test completion a log line will be emitted with a resource
usage summary, e.g.:

    Test resource usage summary: 20.161s (32.3%) User CPU time, 12.976s (20.8%) Sys CPU time, 0.000MB RSS memory increase, 4980 Voluntary context switches

The User and Sys CPU thresholds are based on observations from running the
test suite on an Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz (8 cores),
which defines the baseline system.

Since no two development environments are identical, a manual CPU calibration
value can be passed as `-R<C>`, where `C` is the CPU calibration for
the local system compared to the baseline system.
The CPU threshold will be multiplied by the CPU calibration value (default 1.0),
thus a value less than 1.0 means the local system is faster than the
baseline system, and a value larger than 1.0 means the local system is
slower than the baseline system.
I.e., if you are on an i5 system, pass `-R2.0` to allow higher CPU usage,
or `-R0.8` if your system is faster than the baseline system.
The CPU calibration value may also be set with the
`TEST_CPU_CALIBRATION=1.5` environment variable.

In an ideal future, the test suite would be able to auto-calibrate.
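
Put together, a resource-usage run could look like this (the 2.0 calibration
value is just an example for a slower-than-baseline machine):

    # Run the tests with resource usage monitoring:
    $ make rusage

    # Same, but with a manual CPU calibration factor:
    $ TEST_CPU_CALIBRATION=2.0 make rusage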


**NOTE**: The resource usage threshold checks will run tests in sequence,
          not parallel, to be able to effectively measure per-test usage.

## PR and release verification

Prior to pushing your PR you must verify that your code change has not
introduced any regressions or new issues, which requires running the test
suite in multiple different modes:

 * PLAINTEXT, SSL transports
 * All SASL mechanisms (PLAIN, GSSAPI, SCRAM, OAUTHBEARER)
 * Idempotence enabled for all tests
 * With memory checking
 * With thread checking
 * Compatibility with older broker versions

These tests must also be run for each release candidate that is created.

    $ make release-test

This will take approximately 30 minutes.

**NOTE**: Run this on Linux (for ASAN and Kerberos tests to work properly), not OSX.

## Test mode specifics

The following sections rely on trivup being installed.


### Compatibility tests with multiple broker versions

To ensure compatibility across all supported broker versions, the entire
test suite is run in a trivup-based cluster, one test run for each
relevant broker version.

    $ ./broker_version_tests.py


### SASL tests

Testing SASL requires a bit of configuration on the brokers; to automate
this, the entire test suite is run on trivup-based clusters.

    $ ./sasl_tests.py


### Full test suite(s) run

To run all tests, including the broker version and SASL tests, etc., use:

    $ make full

**NOTE**: `make full` is a subset of the more complete `make release-test` target.


### Idempotent Producer tests

To run the entire test suite with `enable.idempotence=true` enabled, use
`make idempotent_seq` or `make idempotent_par` for sequential or
parallel testing.
Some tests are skipped or slightly modified when idempotence is enabled.

## Manual testing notes

The following tests are currently performed manually; they should be
implemented as automated tests.

### LZ4 interop

    $ ./interactive_broker_version.py -c ./lz4_manual_test.py 0.8.2.2 0.9.0.1 2.3.0

Check the output and follow the instructions.



## Test numbers

 * Automated tests: 0000-0999
 * Manual tests:    8000-8999