# Automated regression tests for librdkafka


## Supported test environments

While the standard test suite works well on OSX and Windows,
the full test suite (which must be run for PRs and releases) will
only run on recent Linux distros due to its use of ASAN, Kerberos, etc.


## Automated broker cluster setup using trivup

A local broker cluster can be set up using
[trivup](https://github.com/edenhill/trivup), which is a Python package
available on PyPI.
These self-contained clusters are used to run the librdkafka test suite
on a number of different broker versions or with specific broker configs.

trivup will download the specified Kafka version into its root directory;
the root directory is also used for cluster instances, where Kafka will
write messages, logs, etc.
The trivup root directory is by default `tmp` in the current directory but
may be changed by setting the `TRIVUP_ROOT` environment variable
to an alternate directory, e.g., `TRIVUP_ROOT=$HOME/trivup make full`.

First install trivup:

    $ pip3 install trivup

Bring up a Kafka cluster (with the specified version) and start an
interactive shell; when the shell is exited the cluster is brought down
and deleted.

    $ ./interactive_broker_version.py 2.3.0   # Broker version

In the trivup shell, run the test suite:

    $ make


If you'd rather use an existing cluster, you may omit trivup and
provide a `test.conf` file that specifies the brokers and possibly other
librdkafka configuration properties:

    $ cp test.conf.example test.conf
    $ $EDITOR test.conf



## Run specific tests

To run tests:

    # Run tests in parallel (quicker, but harder to troubleshoot)
    $ make

    # Run a condensed test suite (quickest).
    # This is what is run on CI builds.
    $ make quick

    # Run tests in sequence
    $ make run_seq

    # Run a specific test
    $ TESTS=0004 make

    # Run test(s) with helgrind, valgrind or gdb
    $ TESTS=0009 ./run-test.sh valgrind|helgrind|gdb


All tests in the 0000-0999 series are run automatically with `make`.

Tests 1000-1999 require specific non-standard setups or broker
configuration; these tests are run with `TESTS=1nnn make`.
See the comments in each test's source file for specific requirements.

To insert test results into an SQLite database, make sure the `sqlite3`
utility is installed, then add this to `test.conf`:

    test.sql.command=sqlite3 rdktests



## Adding a new test

The simplest way to add a new test is to copy one of the recent
(higher `0nnn-..` number) tests to the next free
`0nnn-<what-is-tested>` file.

If possible and practical, try to use the C++ API in your test as that will
cover both the C and C++ APIs and thus provide better test coverage.
Do note that the C++ test framework is not as feature rich as the C one,
so if you need message verification, etc., you're better off with a C test.

After creating your test file it needs to be added in a couple of places:

 * Add it to [tests/CMakeLists.txt](tests/CMakeLists.txt)
 * Add it to [win32/tests/tests.vcxproj](win32/tests/tests.vcxproj)
 * Add it to both locations in [tests/test.c](tests/test.c) - search for an
   existing test number to see what needs to be done.

You don't need to add the test to the Makefile; it is picked up automatically.

Some additional guidelines:
 * If your test depends on a minimum broker version, make sure to specify it
   in test.c using `TEST_BRKVER()` (see 0091 as an example).
 * If your test can run without an active cluster, flag the test
   with `TEST_F_LOCAL`.
 * If your test runs for a long time or produces/consumes a lot of messages
   it might not be suitable for running on CI (which should run quickly
   and is bound by both time and resources). In that case it is preferred
   that you modify your test to run quicker and/or with fewer messages
   when the `test_quick` variable is true.
 * There are plenty of helper wrappers in test.c for common librdkafka
   functions that make tests easier to write by not having to deal with
   errors, etc.
 * Fail fast; use `TEST_ASSERT()` et al. The sooner an error is detected
   the better, since it makes troubleshooting easier.
 * Use `TEST_SAY()` et al. to inform the developer what your test is doing,
   making it easier to troubleshoot upon failure, but try to keep the output
   down to reasonable levels. There is a `TEST_LEVEL` environment variable
   that can be used with `TEST_SAYL()` to only emit certain printouts
   if the test level is increased. The default test level is 2.
 * The test runner will automatically adjust timeouts (that it knows about)
   if running under valgrind, on CI, or in a similar environment where
   execution speed may be slower.
   To make sure your test remains sturdy in these types of environments,
   use the `tmout_multip(milliseconds)` macro when passing timeout
   values to non-test functions, e.g., `rd_kafka_poll(rk, tmout_multip(3000))`.
 * If your test file contains multiple separate sub-tests, use
   `SUB_TEST()`, `SUB_TEST_QUICK()` and `SUB_TEST_PASS()` from inside
   the test functions to help differentiate test failures.


## Test scenarios

A test scenario defines the cluster configuration used by tests.
The majority of tests use the "default" scenario, which matches the
Apache Kafka default broker configuration (topic auto creation enabled, etc.).
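For illustration only, an alternate scenario that disables topic auto
creation might look like the sketch below. The key name is an assumption
about trivup's broker configuration options and has not been verified
against the trivup source:

```json
{
  "auto_create_topics": "false"
}
```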

If a test relies on cluster configuration that is mutually exclusive with
the default configuration, an alternate scenario must be defined in
`scenarios/<scenario>.json`, a configuration object that is passed to
[trivup](https://github.com/edenhill/trivup).

Try to reuse an existing test scenario as far as possible to speed up
test times, since each new scenario will require a new cluster incarnation.


## A guide to testing, verifying, and troubleshooting librdkafka


### Creating a development build

The [dev-conf.sh](../dev-conf.sh) script configures and builds librdkafka and
the test suite for development use, enabling extra runtime
checks (`ENABLE_DEVEL`, `rd_dassert()`, etc.), disabling optimization
(to get accurate stack traces and line numbers), enabling ASAN, etc.

    # Reconfigure librdkafka for development use and rebuild.
    $ ./dev-conf.sh

**NOTE**: Performance tests and benchmarks should not use a development build.


### Controlling the test framework

A test run may be dynamically set up using a number of environment variables.
These environment variables work for all the different ways of invoking the
tests, be it `make`, `run-test.sh`, `until-fail.sh`, etc.

 * `TESTS=0nnn` - only run a single test identified by its full number, e.g.
   `TESTS=0102 make`. (Yes, the variable should have been called TEST.)
 * `SUBTESTS=...` - only run sub-tests (tests that are using `SUB_TEST()`)
   that contain this string.
 * `TESTS_SKIP=...` - skip these tests.
 * `TEST_DEBUG=...` - this will automatically set the `debug` config property
   of all instantiated clients to this value.
   E.g., `TEST_DEBUG=broker,protocol TESTS=0001 make`
 * `TEST_LEVEL=n` - controls the `TEST_SAY()` output level; a higher number
   yields more test output. The default level is 2.
 * `RD_UT_TEST=name` - only run unit tests containing `name`; should be used
   with `TESTS=0000`.
   See [../src/rdunittest.c](../src/rdunittest.c) for unit test names.


Let's say you run the full test suite and get a failure in test 0061,
which is a consumer test. You want to quickly reproduce the issue
and figure out what is wrong, so limit the tests to just 0061 and provide
the relevant debug options (typically `cgrp,fetch` for consumers):

    $ TESTS=0061 TEST_DEBUG=cgrp,fetch make

If the test does not fail you've found an intermittent issue; this is where
[until-fail.sh](until-fail.sh) comes into play. Run the test until it fails:

    # bare means to run the test without valgrind
    $ TESTS=0061 TEST_DEBUG=cgrp,fetch ./until-fail.sh bare


### How to run tests

The standard way to run the test suite is by firing up a trivup cluster
in an interactive shell:

    $ ./interactive_broker_version.py 2.3.0   # Broker version


And then running the test suite in parallel:

    $ make


Run one test at a time:

    $ make run_seq


Run a single test:

    $ TESTS=0034 make


Run the test suite with valgrind (see instructions below):

    $ ./run-test.sh valgrind   # memory checking

or with helgrind (the valgrind thread checker):

    $ ./run-test.sh helgrind   # thread checking


To run the tests in gdb:

**NOTE**: gdb support is flaky on OSX due to signing issues.

    $ ./run-test.sh gdb
    (gdb) run

    # Wait for the test to crash, or interrupt it with Ctrl-C

    # Backtrace of the current thread
    (gdb) bt
    # Move up or down a stack frame
    (gdb) up
    (gdb) down
    # Select a specific stack frame
    (gdb) frame 3
    # Show code at location
    (gdb) list

    # Print variable contents
    (gdb) p rk.rk_conf.group_id
    (gdb) p *rkb

    # Continue execution (if interrupted)
    (gdb) cont

    # Single-step one instruction
    (gdb) step

    # Restart
    (gdb) run

    # See all threads
    (gdb) info threads

    # See backtraces of all threads
    (gdb) thread apply all bt

    # Exit gdb
    (gdb) exit


If a test crashes and produces a core file (make sure your shell has
`ulimit -c unlimited` set!), do:

    # On Linux
    $ LD_LIBRARY_PATH=../src:../src-cpp gdb ./test-runner <core-file>
    (gdb) bt

    # On OSX
    $ DYLD_LIBRARY_PATH=../src:../src-cpp gdb ./test-runner /cores/core.<pid>
    (gdb) bt


To run all tests repeatedly until one fails (a good way of finding
intermittent failures, race conditions, etc.):

    $ ./until-fail.sh bare   # bare is to run the test without valgrind,
                             # may also be one or more of the modes supported
                             # by run-test.sh:
                             # bare valgrind helgrind gdb, etc.

To run a single test repeatedly with valgrind until failure:

    $ TESTS=0103 ./until-fail.sh valgrind



### Finding memory leaks, memory corruption, etc.

There are two ways of verifying that there are no memory leaks, out-of-bounds
memory accesses, use after free, etc.: ASAN or valgrind.

#### ASAN - AddressSanitizer

The first option is AddressSanitizer; this is build-time instrumentation
provided by clang and gcc that inserts memory checks into the built library.

To enable AddressSanitizer (ASAN), run `./dev-conf.sh asan` from the
librdkafka root directory.
This script will rebuild librdkafka and the test suite with ASAN enabled.

Then run the tests as usual. Memory access issues will be reported on stderr
in real time as they happen (and the test will eventually fail), while
memory leaks are reported on stderr when the test run exits successfully,
i.e., when no tests failed.

Test failures will typically cause the current test to exit hard without
cleaning up, in which case there will be a large number of reported memory
leaks; these shall be ignored. The memory leak report is only relevant
when the test suite passes.

**NOTE**: The OSX version of ASAN does not provide memory leak detection;
          you will need to run the test suite on Linux (native or in Docker).

**NOTE**: ASAN, TSAN and valgrind are mutually exclusive.


#### Valgrind - memory checker

Valgrind is a powerful virtual machine that intercepts all memory accesses
of an unmodified program, reporting memory access violations, use after free,
memory leaks, etc.

Valgrind provides additional checks over ASAN and is mostly useful
for troubleshooting crashes, memory issues and leaks when ASAN falls short.

To use valgrind, make sure librdkafka and the test suite are built without
ASAN or TSAN; it must be a clean build without any other instrumentation.
Then simply run:

    $ ./run-test.sh valgrind

Valgrind will report to stderr, just like ASAN.


**NOTE**: Valgrind only runs on Linux.

**NOTE**: ASAN, TSAN and valgrind are mutually exclusive.


### TSAN - Thread and locking issues

librdkafka uses a number of internal threads which communicate and share state
through op queues, condition variables, mutexes and atomics.

While the docstrings in the librdkafka source code specify what locking is
required, it is very hard to manually verify that the correct locks
are acquired, and in the correct order (to avoid deadlocks).

TSAN, ThreadSanitizer, is of great help here. As with ASAN, TSAN is a
build-time option: run `./dev-conf.sh tsan` to rebuild with TSAN.

Run the test suite as usual, preferably in parallel. TSAN will output
thread errors to stderr and eventually fail the test run.

If you're having threading issues and TSAN does not provide enough information
to sort them out, you can also try running the test with helgrind,
valgrind's thread checker (`./run-test.sh helgrind`).


**NOTE**: ASAN, TSAN and valgrind are mutually exclusive.


### Resource usage thresholds (experimental)

**NOTE**: This is an experimental feature; some form of system-specific
          calibration will be needed.

If the `-R` option is passed to the `test-runner`, or the `make rusage`
target is used, the test framework will monitor each test's resource usage
and fail the test if the default or test-specific thresholds are exceeded.

Per-test thresholds are specified in test.c using the `_THRES()` macro.

Currently monitored resources are:
 * `utime` - User CPU time in seconds (default 1.0s).
 * `stime` - System/kernel CPU time in seconds (default 0.5s).
 * `rss` - RSS (memory) usage (default 10.0 MB).
 * `ctxsw` - Number of voluntary context switches, e.g. syscalls
   (default 10000).

Upon successful test completion a log line will be emitted with a resource
usage summary, e.g.:

    Test resource usage summary: 20.161s (32.3%) User CPU time, 12.976s (20.8%) Sys CPU time, 0.000MB RSS memory increase, 4980 Voluntary context switches

The User and Sys CPU thresholds are based on observations from running the
test suite on an Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz (8 cores),
which defines the baseline system.

Since no two development environments are identical, a manual CPU calibration
value can be passed as `-R<C>`, where `C` is the CPU calibration of
the local system compared to the baseline system.
The CPU threshold will be multiplied by the CPU calibration value
(default 1.0); thus a value less than 1.0 means the local system is faster
than the baseline system, and a value larger than 1.0 means the local
system is slower than the baseline system.
E.g., if you are on a slower i5 system, pass `-R2.0` to allow higher CPU
usage, or `-R0.8` if your system is faster than the baseline system.
The CPU calibration value may also be set with the
`TEST_CPU_CALIBRATION=1.5` environment variable.

In an ideal future, the test suite would be able to auto-calibrate.


**NOTE**: The resource usage threshold checks will run tests in sequence,
          not in parallel, to be able to effectively measure per-test usage.


# PR and release verification

Prior to pushing your PR you must verify that your code change has not
introduced any regressions or new issues; this requires running the test
suite in multiple different modes:

 * PLAINTEXT and SSL transports
 * All SASL mechanisms (PLAIN, GSSAPI, SCRAM, OAUTHBEARER)
 * Idempotence enabled for all tests
 * With memory checking
 * With thread checking
 * Compatibility with older broker versions

These tests must also be run for each release candidate that is created.

    $ make release-test

This will take approximately 30 minutes.

**NOTE**: Run this on Linux (for the ASAN and Kerberos tests to work
          properly), not OSX.


# Test mode specifics

The following sections rely on trivup being installed.


### Compatibility tests with multiple broker versions

To ensure compatibility across all supported broker versions, the entire
test suite is run in a trivup-based cluster, with one test run for each
relevant broker version.

    $ ./broker_version_tests.py


### SASL tests

Testing SASL requires a bit of configuration on the brokers; to automate
this, the entire test suite is run on trivup-based clusters.

    $ ./sasl_tests.py



### Full test suite(s) run

To run all tests, including the broker version and SASL tests, etc., use:

    $ make full

**NOTE**: `make full` is a sub-set of the more complete `make release-test`
target.


### Idempotent Producer tests

To run the entire test suite with `enable.idempotence=true`, use
`make idempotent_seq` or `make idempotent_par` for sequential or
parallel testing.
Some tests are skipped or slightly modified when idempotence is enabled.


## Manual testing notes

The following tests are currently performed manually; they should be
implemented as automatic tests.

### LZ4 interop

    $ ./interactive_broker_version.py -c ./lz4_manual_test.py 0.8.2.2 0.9.0.1 2.3.0

Check the output and follow the instructions.




## Test numbers

Automated tests: 0000-0999
Manual tests: 8000-8999