# $Id$

Note: this only applies to locking using test-and-set and fcntl calls;
pthreads were added after this was written.

Resource locking routines: lock based on a DB_MUTEX. All this gunk
(including trying to make assembly code portable) is necessary because
System V semaphores require system calls for uncontested locks, and we
don't want to make two system calls per resource lock.

First, this is how it works. The DB_MUTEX structure contains a resource
test-and-set lock (tsl), a file offset, a pid for debugging, and
statistics information.

If HAVE_MUTEX_FCNTL is NOT defined (that is, we know how to do
test-and-sets for this compiler/architecture combination), we try to
lock the resource tsl some number of times (based on the number of
processors). If we can't acquire the mutex that way, we use a system
call to sleep for 1ms, 2ms, 4ms, etc. (The time is bounded at 10ms for
mutexes backing logical locks and 25ms for data structures, just in
case.) Using the timer backoff makes two assumptions: that mutexes are
held for brief periods (never over system calls or I/O), and that
mutexes are not hotly contested.

If HAVE_MUTEX_FCNTL is defined, we use a file descriptor to do byte
locking on a file at a specified offset. In this case, ALL of the
locking is done in the kernel. Because file descriptors are allocated
per process, we have to provide the file descriptor as part of the lock
call. We still have to do timer backoff because we need to be able to
block ourselves, that is, the lock manager causes a process to wait by
having it acquire a mutex and then attempt to re-acquire the same
mutex. There's no way to use kernel locking to block yourself: if you
hold a lock and attempt to re-acquire it, the attempt will succeed.
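The following sketch shows roughly how the two acquisition paths just
described fit together. It is not the DB source: SKETCH_MUTEX, the
function names, and the tuning constants are invented for illustration,
and a C11 atomic_flag stands in for the per-architecture assembly
test-and-set.

    #include <fcntl.h>
    #include <stdatomic.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define SPINS_PER_PROCESSOR 50      /* illustrative, not DB's tuning */
    #define BACKOFF_CAP_USECS   10000   /* text says 10ms or 25ms; one cap */

    typedef struct {
        atomic_flag tsl;        /* resource test-and-set lock; clear it with
                                 * atomic_flag_clear() before first use */
        off_t off;              /* byte offset in the lock file (fcntl case) */
        pid_t pid;              /* holder's pid, for debugging */
        unsigned long set_wait, set_nowait;     /* statistics */
    } SKETCH_MUTEX;

    /* Test-and-set case: HAVE_MUTEX_FCNTL is NOT defined. */
    int
    sketch_mutex_lock(SKETCH_MUTEX *mp, int nprocs)
    {
        useconds_t usecs = 1000;        /* start the backoff at 1ms */
        int nspins, waited = 0;

        for (;;) {
            /* Try the resource tsl some number of times, scaled by the
             * number of processors. */
            for (nspins = nprocs * SPINS_PER_PROCESSOR; nspins > 0; --nspins)
                if (!atomic_flag_test_and_set(&mp->tsl)) {
                    mp->pid = getpid();
                    /* Statistics only; racy, per the note at the end. */
                    if (waited)
                        ++mp->set_wait;
                    else
                        ++mp->set_nowait;
                    return (0);
                }

            /* Couldn't get it: sleep 1ms, 2ms, 4ms, ..., capped. */
            waited = 1;
            (void)usleep(usecs);
            if ((usecs <<= 1) > BACKOFF_CAP_USECS)
                usecs = BACKOFF_CAP_USECS;
        }
    }

    /* Either case: releasing the mutex is just clearing the flag(s). */
    void
    sketch_mutex_unlock(SKETCH_MUTEX *mp)
    {
        mp->pid = 0;
        atomic_flag_clear(&mp->tsl);
    }

    /*
     * fcntl case: HAVE_MUTEX_FCNTL is defined. The kernel byte lock only
     * protects the test of the pid flag; the flag itself marks ownership,
     * which is why the backoff is still needed and why a process can
     * "block itself" by waiting on a mutex it already holds.
     */
    int
    sketch_mutex_lock_fcntl(SKETCH_MUTEX *mp, int fd)
    {
        struct flock fl;
        useconds_t usecs = 1000;
        int locked;

        fl.l_whence = SEEK_SET;
        fl.l_start = mp->off;           /* this mutex's byte in the file */
        fl.l_len = 1;
        for (;;) {
            locked = 0;
            fl.l_type = F_WRLCK;        /* lock the byte in the kernel... */
            if (fcntl(fd, F_SETLKW, &fl) != 0)
                return (-1);
            if (mp->pid == 0) {         /* ...test and maybe set the flag... */
                mp->pid = getpid();
                locked = 1;
            }
            fl.l_type = F_UNLCK;        /* ...and drop the kernel lock. */
            (void)fcntl(fd, F_SETLK, &fl);
            if (locked)
                return (0);

            (void)usleep(usecs);
            if ((usecs <<= 1) > BACKOFF_CAP_USECS)
                usecs = BACKOFF_CAP_USECS;
        }
    }

Note how, in the fcntl path, the kernel lock is held only across the
test of the ownership flag and never across the sleep; that is what
lets a process queue behind a mutex it already holds.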
Next, let's talk about why it doesn't work the way a reasonable person
would think it should work.

Ideally, we'd have the ability to try to lock the resource tsl and, if
that fails, increment a counter of waiting processes, then block in the
kernel until the tsl is released. The process holding the resource tsl
would see the wait counter when it went to release the resource tsl,
and would wake up any waiting processes after releasing the lock. This
would actually require both another tsl (call it the mutex tsl) and
synchronization between the call that blocks in the kernel and the
actual resource tsl. The mutex tsl would be used to protect accesses to
the DB_MUTEX itself. Locking the mutex tsl would be done by a busy
loop, which is safe because processes would never block holding that
tsl (all they would do is try to obtain the resource tsl and set/check
the wait count). The problem in this model is that the blocking call
into the kernel requires a blocking semaphore, i.e., one whose normal
state is locked.

The only portable forms of locking under UNIX are fcntl(2) on a file
descriptor/offset, and System V semaphores. Neither of these locking
methods is sufficient to solve the problem.

The problem with fcntl locking is that only the process that obtained
the lock can release it. Remember, we want the normal state of the
kernel semaphore to be locked. So, suppose the creator of the DB_MUTEX
initializes the lock to "locked", a second process then acquires the
resource tsl, and a third process has to block waiting for that tsl.
When the second process wants to wake up the third process, it can't,
because it isn't the holder of the kernel lock. For the second process
to be the holder of the lock, we would have to make a system call per
uncontested lock, which is what we were trying to get away from in the
first place.

There are some hybrid schemes, such as signaling the holder of the
lock, or using a different blocking offset depending on which process
is holding the lock, but they get complicated fairly quickly. I'm open
to suggestions, but I'm not holding my breath.

Regardless, we fall back to fcntl locking when we don't have any other
choice, because it doesn't have the limitations found in System V
semaphores, and because the normal state of the kernel object in that
case is unlocked, so the process releasing the lock is also the holder
of the lock.

The System V semaphore design has a number of other limitations that
make it inappropriate for this task. Namely:

First, the semaphore key name space is separate from the file system
name space (although there are methods for using file names to create
semaphore keys). If we use a well-known key, there's no reason to
believe that any particular key will not already be in use, either by
another instance of the DB application or by some other application, in
which case the DB application will fail. If we create a key, then we
have to use a file system name to rendezvous and pass around the key.

Second, System V semaphores traditionally have compile-time,
system-wide limits on the number of semaphore keys that you can have.
Typically, that number is far too low for any practical purpose. Since
semaphores permit more than a single slot per semaphore key, we could
try to get around that limit by using multiple slots, but that means
the file we're using for rendezvous has to contain slot information as
well as semaphore key information, and we're going to be
reading/writing it on every db_mutex_t init or destroy operation.
Anyhow, similar compile-time, system-wide limits on the number of slots
per semaphore key kick in, and you're right back where you started.

My fantasy is that once POSIX.1 standard mutexes are in widespread use,
we can switch to them. My guess is that it won't happen, because those
mutexes are only required to work for threads within a single process,
not for independent processes.

Note: there are races in the statistics code, but since it's only
statistics, I didn't bother fixing them. (The fix requires a mutex tsl;
when/if this code is changed to do rational locking (see above), the
statistics update code should also be changed to acquire/release the
mutex tsl.)
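For reference, the POSIX.1 direction the "fantasy" paragraph above
hopes for looks roughly like the sketch below: a pthread mutex placed
in memory that is mapped into every process and marked
PTHREAD_PROCESS_SHARED. That attribute is exactly the optional piece
the paragraph worries about, and the function name here is illustrative
only, not part of DB.

    /*
     * Illustrative only. The pthread_mutex_t must live in memory that
     * is mapped into every process (e.g., a shared, mmap'd region),
     * and PTHREAD_PROCESS_SHARED is an optional POSIX feature.
     */
    #include <pthread.h>

    int
    sketch_shared_mutex_init(pthread_mutex_t *mp)
    {
        pthread_mutexattr_t attr;
        int ret;

        if ((ret = pthread_mutexattr_init(&attr)) != 0)
            return (ret);

        /* The optional piece: without it, pthread mutexes are only
         * guaranteed to work between the threads of a single process. */
        if ((ret = pthread_mutexattr_setpshared(&attr,
            PTHREAD_PROCESS_SHARED)) == 0)
            ret = pthread_mutex_init(mp, &attr);

        (void)pthread_mutexattr_destroy(&attr);
        return (ret);
    }

An environment built this way would call something like this once when
the shared region is created, and plain pthread_mutex_lock/unlock calls
would then replace the tsl and fcntl machinery sketched earlier.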