CDDL HEADER START

The contents of this file are subject to the terms of the
Common Development and Distribution License, Version 1.0 only
(the "License").  You may not use this file except in compliance
with the License.

You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
or http://www.opensolaris.org/os/licensing.
See the License for the specific language governing permissions
and limitations under the License.

When distributing Covered Code, include this CDDL HEADER in each
file and include the License file at usr/src/OPENSOLARIS.LICENSE.
If applicable, add the following below this CDDL HEADER, with the
fields enclosed by brackets "[]" replaced with your own identifying
information: Portions Copyright [yyyy] [name of copyright owner]

CDDL HEADER END

Copyright 2004 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.

Architectural Overview for the DHCP agent
Peter Memishian
ident	"%Z%%M%	%I%	%E% SMI"
27
INTRODUCTION
============

The Solaris DHCP agent (dhcpagent) is an RFC2131-compliant DHCP client
implementation.  The major forces shaping its design were:

	* Must be capable of managing multiple network interfaces.
	* Must consume little CPU, since it will always be running.
	* Must have a small memory footprint, since it will always be
	  running.
	* Must not rely on any shared libraries, since it must run
	  before all filesystems have been mounted.

When a DHCP agent implementation is only required to control a single
interface on a machine, the problem is expressed well as a simple
state-machine, as shown in RFC2131.  However, when a DHCP agent is
responsible for managing more than one interface at a time, the
problem becomes much more complicated, especially when threads cannot
be used to attack the problem (since the filesystems containing the
thread libraries may not be available when the agent starts).
Instead, the problem must be solved using an event-driven model, which,
while tried-and-true, is subtle and easy to get wrong.  Indeed, much
of the agent's code is there to manage the complexity of programming
in an asynchronous event-driven paradigm.
52
THE BASICS
==========

The DHCP agent consists of roughly 20 source files, most with a
companion header file.  While the largest source file is around 700
lines, most are much shorter.  The source files can largely be broken
up into three groups:

	* Source files which, along with their companion header files,
	  define an abstract "object" that is used by other parts of
	  the system.  Examples include "timer_queue.c", which along
	  with "timer_queue.h" provides a Timer Queue object for use
	  by the rest of the agent, and "async.c", which along with
	  "async.h" defines an interface for managing asynchronous
	  transactions within the agent.

	* Source files which implement a given state of the agent; for
	  instance, there is a "request.c" which comprises all of
	  the procedural "work" which must be done while in the
	  REQUESTING state of the agent.  By encapsulating states in
	  files, it becomes easier to debug errors in the
	  client/server protocol and adapt the agent to new
	  constraints, since all the relevant code is in one place.

	* Source files which, along with their companion header files,
	  encapsulate a given task or related set of tasks.  The
	  difference between this and the first group is that the
	  interfaces exported from these files do not operate on
	  an "object", but rather perform a specific task.  Examples
	  include "dlpi_io.c", which provides a useful interface
	  to DLPI-related I/O operations.

OVERVIEW
========

Here we discuss the essential objects and subtle aspects of the
DHCP agent implementation.  Note that there is of course much more
that is not discussed here, but after this overview you should be able
to fend for yourself in the source code.
92
Event Handlers and Timer Queues
-------------------------------

The most important object in the agent is the event handler, whose
interface is in libinetutil.h and whose implementation is in
libinetutil.  The event handler is essentially an object-oriented
wrapper around poll(2): other components of the agent can register to
be called back when specific events on file descriptors happen -- for
instance, to wait for requests to arrive on its IPC socket, the agent
registers a callback function (accept_event()) that will be called
whenever a new connection arrives on the file descriptor associated
with the IPC socket.  When the agent initially begins in main(), it
registers a number of events with the event handler, and then calls
iu_handle_events(), which proceeds to wait for events to happen --
this function does not return until the agent is shut down via a
signal.

When the registered events occur, the callback functions are invoked,
which in turn might lead to additional callbacks being registered --
this is the classic event-driven model.  (As an aside, note that
programming in an event-driven model means that callbacks cannot
block, or else the agent will become unresponsive.)
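The registration-and-dispatch pattern described above can be sketched
in a few lines.  This is a Python illustration, not the libinetutil
code: the class `EventHandler' and its methods register(),
unregister(), stop() and handle_events() are hypothetical stand-ins
for the iu_*() interfaces, and the loop ends when a callback calls
stop() rather than when a signal arrives.

```python
import os
import select


class EventHandler:
    """Hypothetical poll(2) wrapper: maps file descriptors to callbacks."""

    def __init__(self):
        self._poller = select.poll()
        self._callbacks = {}        # fd -> (callback, arg)
        self._running = False

    def register(self, fd, events, callback, arg=None):
        """Ask to have callback(eh, fd, revents, arg) run on these events."""
        self._poller.register(fd, events)
        self._callbacks[fd] = (callback, arg)

    def unregister(self, fd):
        self._poller.unregister(fd)
        del self._callbacks[fd]

    def stop(self):
        self._running = False

    def handle_events(self):
        """Dispatch callbacks until stop() is called or nothing is registered."""
        self._running = True
        while self._running and self._callbacks:
            for fd, revents in self._poller.poll():
                entry = self._callbacks.get(fd)
                if entry is not None:   # a callback may have unregistered it
                    callback, arg = entry
                    callback(self, fd, revents, arg)
```

A callback registered this way must return promptly; anything that
would block has to be turned into another registered event instead.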
115
A special kind of "event" is a timeout.  Since there are many timers
which must be maintained for each DHCP-controlled interface (such as a
lease expiration timer, time-to-first-renewal (t1) timer, and so
forth), an object-oriented abstraction for timers called a "timer
queue" is provided, whose interface is in libinetutil.h with a
corresponding implementation in libinetutil.  The timer queue allows
callback functions to be "scheduled" to run after a certain amount of
time has passed.

The event handler and timer queue objects work hand-in-hand: the event
handler is passed a pointer to a timer queue in iu_handle_events() --
from there, it can use the iu_earliest_timer() routine to find the
timer which will next fire, and use this to set its timeout value in
its call to poll(2).  If poll(2) returns due to a timeout, the event
handler calls iu_expire_timers() to run all timers that have expired
(note that more than one may have expired if, for example, multiple
timers were set to expire at the same time).
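The hand-off between the two objects can be sketched as follows.  This
is a Python illustration under assumptions: `TimerQueue',
schedule(), earliest(), expire() and poll_timeout_ms() are
hypothetical names that only mirror the roles of the timer queue,
iu_earliest_timer() and iu_expire_timers(); the current time is passed
in explicitly so the behavior is easy to follow.

```python
import heapq
import itertools


class TimerQueue:
    """Hypothetical timer queue keyed on absolute expiry time in seconds."""

    def __init__(self):
        self._heap = []                 # (expiry, seq, callback, arg)
        self._seq = itertools.count()   # tie-breaker: equal expiries stay FIFO

    def schedule(self, now, delay, callback, arg=None):
        heapq.heappush(self._heap,
                       (now + delay, next(self._seq), callback, arg))

    def earliest(self, now):
        """Seconds until the next timer fires, or None if none are queued."""
        if not self._heap:
            return None
        return max(0, self._heap[0][0] - now)

    def expire(self, now):
        """Run every callback whose expiry has passed; return how many ran."""
        fired = 0
        while self._heap and self._heap[0][0] <= now:
            _, _, callback, arg = heapq.heappop(self._heap)
            callback(arg)
            fired += 1
        return fired


def poll_timeout_ms(tq, now):
    """Timeout to hand to poll(): milliseconds, or None to block forever."""
    remaining = tq.earliest(now)
    return None if remaining is None else int(remaining * 1000)
```

An event loop built this way would pass poll_timeout_ms() to its
poll call and then call expire() with the new current time, which
runs every timer that came due, not just the first.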
133
Although it is possible to instantiate more than one timer queue or
event handler object, it doesn't make a lot of sense -- these objects
are really "singletons".  Accordingly, the agent has two global
variables, `eh' and `tq', which store pointers to the global event
handler and timer queue.

Network Interfaces
------------------

For each network interface managed by the agent, there is a set of
associated state that describes both its general properties (such as
the maximum MTU) and its DHCP-related state (such as when it acquired
a lease).  This state is stored in a structure called an `ifslist',
which is a poor name (since it suggests implementation artifacts but
not purpose) but has historical precedent.  Another way to think about
an `ifslist' is that it provides all of the context necessary to
perform DHCP on a given interface: the state the interface is in, the
last DHCP packet received on that interface, and so forth.  As one can
imagine, the `ifslist' structure is quite complicated and the rules
governing access to its fields are equally convoluted -- see the
comments in interface.h for more information.

One point that was brushed over in the preceding discussion of event
handlers and timer queues was context.  Recall that the event-driven
nature of the agent requires that functions cannot block, lest they
starve out others and impact the observed responsiveness of the agent.
As an example, consider the process of extending a lease: the agent
must send a REQUEST packet and wait for an ACK or NAK packet in
response.  This is done by sending a REQUEST and then registering a
callback with the event handler that waits for an ACK or NAK packet to
arrive on the file descriptor associated with the interface.  Note,
however, that when the ACK or NAK does arrive and the callback
function is called, it must know which interface this packet is for
(it must get back its context).  This could be handled through an
ad-hoc mapping of file descriptors to interfaces, but a cleaner
approach is to have the event handler's register function
(iu_register_event()) take in an opaque context pointer, which will
then be passed back to the callback.  In the agent, this context
pointer is always the `ifslist', but for reasons of decoupling and
generality, the timer queue and event handler objects allow a generic
(void *) context argument.

Note that there is nothing that guarantees the pointer passed into
iu_register_event() or iu_schedule_timer() will still be valid when
the callback is called back (for instance, the memory may have been
freed in the meantime).  To solve this problem, ifslists are reference
counted.  For more details on how the reference count scheme is
implemented, see the closing comments in interface.h regarding memory
management.
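One way such a reference count can protect a callback's context is
sketched below in Python.  This is not the scheme documented in
interface.h: the class `Interface' and the helpers
register_callback() and dispatch() are hypothetical, and the policy
shown (a pending callback owns one reference and is skipped if it
turns out to hold the last one) is an assumption made for
illustration.

```python
class Interface:
    """Hypothetical stand-in for an `ifslist' with a hold count."""

    def __init__(self, name):
        self.name = name
        self.holds = 1          # the creator's reference
        self.freed = False

    def hold(self):
        self.holds += 1

    def release(self):
        """Drop one reference; return True if the object is still live."""
        self.holds -= 1
        if self.holds == 0:
            self.freed = True   # C code would free the memory here
        return not self.freed


def register_callback(pending, ifp, callback):
    """Queue a callback; the queue entry owns a reference to its context."""
    ifp.hold()
    pending.append((callback, ifp))


def dispatch(pending):
    """Run the oldest queued callback unless its context has gone away."""
    callback, ifp = pending.pop(0)
    if ifp.release():           # False: we just dropped the last reference
        callback(ifp)
```

Because the queue entry holds its own reference, the context stays
valid until the callback has had its chance to run, even if the rest
of the agent has already let go of the interface.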
183
Transactions
------------

Many operations performed via DHCP must be performed in groups -- for
instance, acquiring a lease requires several steps: sending a
DISCOVER, collecting OFFERs, selecting an OFFER, sending a REQUEST,
and receiving an ACK, assuming everything goes well.  Note, however,
that due to the event-driven model the agent operates in, these
operations are not inherently "grouped" -- instead, the agent sends a
DISCOVER, goes back into the main event loop, waits for events
(perhaps even requests on the IPC channel to begin acquiring a lease
on another interface), eventually checks to see if an acceptable OFFER
has come in, and so forth.  To some degree, the notion of the current
state of an interface (SELECTING, REQUESTING, etc) helps control the
potential chaos of the event-driven model (for instance, if while the
agent is waiting for an OFFER on a given interface, an IPC event comes
in requesting that the interface be RELEASED, the agent knows to send
back an error since the interface must be in at least the BOUND state
before a RELEASE can be performed).

However, states are not enough -- for instance, suppose that the agent
begins trying to renew a lease -- this is done by sending a REQUEST
packet and waiting for an ACK or NAK, which might never come.  If,
while waiting for the ACK or NAK, the user sends a request to renew
the lease as well, then if the agent were to send another REQUEST,
things could get quite complicated (and this is only the beginning of
this rathole).  To protect against this, two objects exist:
`async_action' and `ipc_action'.  These objects are related, but
independent of one another; the more essential object is the
`async_action', which we will discuss first.

In short, an `async_action' represents a pending transaction (aka
asynchronous action), of which each interface can have at most one.
The `async_action' structure is embedded in the `ifslist' structure,
which is fine since there can be at most one pending transaction per
interface.  Typical "asynchronous transactions" are START, EXTEND, and
INFORM, since each consists of a sequence of packets that must be done
without interruption.  Note that not all DHCP operations are
"asynchronous" -- for instance, a RELEASE operation is synchronous
(not asynchronous) since after the RELEASE is sent no reply is
expected from the DHCP server.  Also, note that there can be
synchronous operations intermixed with asynchronous operations,
although it's not recommended.

When the agent realizes it must perform an asynchronous transaction,
it first calls async_pending() to see if there is already one pending;
if so, the new transaction must fail (the details of failure depend on
how the transaction was initiated, which is described in more detail
later when the `ipc_action' object is discussed).  If there is no
pending asynchronous transaction, async_start() is called to begin
one.

When the transaction is complete, async_finish() must be called to
complete the asynchronous action on that interface.  If the
transaction is unable to complete within a certain amount of time
(more on this later), async_timeout() is invoked, which attempts to
cancel the asynchronous action with async_cancel().  If the
transaction is not cancellable it is left pending, although this means
that no future asynchronous actions can be performed on the interface
until the transaction voluntarily calls async_finish().  While this
may seem suboptimal, cancellation here is quite analogous to thread
cancellation, which is generally considered a difficult problem.
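The one-transaction-per-interface rule can be sketched as a small
state holder.  This Python illustration only mirrors the roles of
async_pending(), async_start(), async_finish() and async_cancel(); the
class `AsyncAction' and its `cancellable' flag are hypothetical, and
how the agent actually decides whether a transaction can be cancelled
is not shown here.

```python
class AsyncAction:
    """Hypothetical per-interface record of the single pending transaction."""

    def __init__(self):
        self.kind = None            # e.g. "START", "EXTEND", "INFORM"
        self.cancellable = True

    def pending(self):
        return self.kind is not None

    def start(self, kind, cancellable=True):
        """Begin a transaction; refuse if one is already pending."""
        if self.pending():
            return False
        self.kind, self.cancellable = kind, cancellable
        return True

    def finish(self):
        """Mark the transaction complete so another one may start."""
        self.kind = None

    def cancel(self):
        """Cancel if allowed; otherwise leave the transaction pending."""
        if self.pending() and self.cancellable:
            self.finish()
            return True
        return False
```

Note how a transaction that cannot be cancelled keeps blocking
start() until it calls finish() itself, which is the trade-off the
text above describes.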
246
The notion of asynchronous transactions is complicated by the fact
that they may originate from both inside and outside of the agent.
For instance, a user initiates an asynchronous START transaction by
running `ifconfig hme0 dhcp start', but the agent will internally need
to perform asynchronous EXTEND transactions to extend the lease before
it expires.  This leads us into the `ipc_action' object.

An `ipc_action' represents the IPC-related pieces of an asynchronous
transaction that was started as a result of a user request.  Only
IPC-generated asynchronous transactions have a valid `ipc_action'
object.  Note that since there can be at most one asynchronous action
per interface, there can also be at most one `ipc_action' per
interface (this means it can also conveniently be embedded inside the
`ifslist' structure).

One of the main purposes of the `ipc_action' object is to time out
user requests.  This is not the same as timing out the transaction;
for instance, when the user specifies a timeout value as an argument
to ifconfig, they are specifying an `ipc_action' timeout; in other
words, how long they are willing to wait for the command to complete.
However, even after the command times out for the user, the
asynchronous transaction continues until async_timeout() occurs.

It is worth understanding these timeouts since the relationship is
subtle but powerful.  The `async_action' timer specifies how long the
agent will try to perform the transaction; the `ipc_action' timer
specifies how long the user is willing to wait for the action to
complete.  If, when the `async_action' timer fires and async_timeout()
is called, there is no associated `ipc_action' (either because the
transaction was not initiated by a user or because the user already
timed out), then async_cancel() proceeds as described previously.  If,
on the other hand, the user is still waiting for the transaction to
complete, then async_timeout() is rescheduled and the transaction is
left pending.  While this behavior might seem odd, it adheres to the
principle of least surprise: when a user is willing to wait for a
transaction to complete, the agent should try for as long as they're
willing to wait.  On the other hand, if the agent were to take that
stance with its internal transactions, it would block out
user-requested operations if the internal transaction never completed
(perhaps because the server never sent an ACK in response to our lease
extension REQUEST).
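The decision made when the `async_action' timer fires can be written
out as a small function.  This is a Python sketch of the rule just
described, not the agent's async_timeout(): the function name, its
parameters and the returned strings are all hypothetical.

```python
def on_async_timeout(ipc_action_pending, cancel):
    """Decide what happens when the transaction timer fires.

    ipc_action_pending: True while a user is still waiting on the result.
    cancel: callable that tries to cancel; returns True on success.
    """
    if ipc_action_pending:
        # Keep trying for as long as the user is willing to wait.
        return "rescheduled"
    # Nobody is waiting: try to cancel so later requests are not blocked.
    return "cancelled" if cancel() else "left pending"
```

With this rule, an internal EXTEND that never gets an answer is
eventually cancelled, while a user-initiated START keeps running
until the user's own timeout has passed.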
289
The API provided for the `ipc_action' object is quite similar to the
one for the `async_action' object: when an IPC request comes in for an
operation requiring asynchronous operation, ipc_action_start() is
called.  When the request completes, ipc_action_finish() is called.
If the user times out before the request completes, then
ipc_action_timeout() is called.

Packet Management
-----------------

Another complicated area is packet management: building, manipulating,
sending and receiving packets.  These operations are all encapsulated
behind a dozen or so interfaces (see packet.h) that abstract the
unimportant details away from the rest of the agent code.  In order to
send a DHCP packet, code first calls init_pkt(), which returns a
dhcp_pkt_t initialized suitably for transmission.  Note that currently
init_pkt() returns a dhcp_pkt_t that is actually allocated as part of
the `ifslist', but this may change in the future.  After calling
init_pkt(), the add_pkt_opt*() functions are used to add options to
the DHCP packet.  Finally, send_pkt() can be used to transmit the
packet to a given IP address.

The send_pkt() function is actually quite complicated; for one, it
must internally use either DLPI or sockets depending on the state of
the interface; for two, it handles the details of packet timeout and
retransmission.  The last argument to send_pkt() is a pointer to a
"stop function".  If this argument is passed as NULL, then the packet
will only be sent once (it won't be retransmitted).  Otherwise, the
stop function will be called back before each retransmission.  The
return value from this function indicates whether to continue
retransmission or not, which allows the send_pkt() caller to control
the retransmission policy without having to deal with the
retransmission mechanism.  See init_reboot.c for an example of this in
action.
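The split between retransmission policy and mechanism can be sketched
as follows.  This Python illustration is not the agent's send_pkt():
it collapses the timer-driven retransmission into a plain loop, the
`max_sends' cap exists only to keep the sketch bounded, and it assumes
that a stop function returning True means "stop retransmitting" (the
text above does not say which value means what).

```python
def send_pkt(transmit, stop_func=None, max_sends=8):
    """Transmit once, then retransmit until stop_func(n_sent) returns True.

    transmit: callable invoked with the 1-based transmission count.
    Returns the total number of transmissions.
    """
    n_sent = 1
    transmit(n_sent)
    if stop_func is None:
        return n_sent               # no stop function: send exactly once
    # The policy (stop_func) is consulted before every retransmission.
    while n_sent < max_sends and not stop_func(n_sent):
        n_sent += 1
        transmit(n_sent)
    return n_sent


def stop_after_three(n_sent):
    """Example policy: give up once three packets have gone out."""
    return n_sent >= 3
```

The caller only supplies the policy; when and how the packet is
re-sent stays inside the sending code.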
324
The recv_pkt() function is simpler but still complicated by the fact
that one may want to receive several different types of packets at
once; for instance, after sending a REQUEST, either an ACK or a NAK is
acceptable.  Also, before calling recv_pkt(), the caller must know
that there is data to be read from the socket (this can be
accomplished by using the event handler), otherwise recv_pkt() will
block, which is clearly not acceptable.

Time
----

The notion of time is an exceptionally subtle area.  You will notice
five ways that time is represented in the source: as lease_t's,
uint32_t's, time_t's, hrtime_t's, and monosec_t's.  Each of these
types serves a slightly different function.

The `lease_t' type is the simplest to understand; it is the unit of
time in the CD_{LEASE,T1,T2}_TIME options in a DHCP packet, as defined
by RFC2131.  This is defined as a positive number of seconds (relative
to some fixed point in time) or the value `-1' (DHCP_PERM) which
represents infinity (i.e., a permanent lease).  The lease_t should be
used either when dealing with actual DHCP packets that are sent on the
wire or for variables which follow the exact definition given in the
RFC.
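The special treatment of the permanent-lease value can be sketched as
follows.  This Python illustration assumes the lease time travels as a
32-bit unsigned value in network byte order, so `-1' appears as
0xffffffff; the helper names parse_lease() and lease_expiry() are
hypothetical.

```python
import struct

DHCP_PERM = 0xFFFFFFFF      # `-1' seen as an unsigned 32-bit lease value


def parse_lease(raw):
    """Decode a 4-byte lease-time option value (network byte order)."""
    (lease,) = struct.unpack("!I", raw)
    return lease


def lease_expiry(acquired, lease):
    """Absolute expiry in seconds, or None for a permanent lease."""
    if lease == DHCP_PERM:
        return None             # infinity: never schedule an expiry timer
    return acquired + lease
```

Treating DHCP_PERM as an ordinary offset would make a permanent lease
look like one that merely lasts a very long time, which is why it has
to be checked before any arithmetic.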
349
The `uint32_t' type is also used to represent a relative time in
seconds.  However, here the value `-1' is not special and of course
this type is not tied to any definition given in RFC2131.  Use this
for representing "offsets" from another point in time that are not
DHCP lease times.

The `time_t' type is the natural Unix type for representing time since
the epoch.  Unfortunately, it is affected by stime(2) or adjtime(2)
and since the DHCP client is used during system installation (and thus
when time is typically being configured), the time_t cannot be used in
general to represent an absolute time since the epoch.  For instance,
if a time_t were used to keep track of when a lease began, and then a
minute later stime(2) was called to adjust the system clock forward a
year, then the lease would appear to have expired a year ago even
though it has only been a minute.  For this reason, time_t's should
only be used either when wall time must be displayed (such as in a
DHCP_STATUS IPC transaction) or when a time meaningful across reboots
must be obtained (such as when caching an ACK packet at system
shutdown).

The `hrtime_t' type returned from gethrtime() works around the
limitations of the time_t in that it is not affected by stime(2) or
adjtime(2), with the disadvantage that it represents time from some
arbitrary time in the past and in nanoseconds.  The timer queue code
deals with hrtime_t's directly since that particular piece of code is
meant to be fairly independent of the rest of the DHCP client.

However, dealing with nanoseconds is error-prone when all the other
time types are in seconds.  As a result, yet another time type, the
`monosec_t', was created to represent a monotonically increasing time
in seconds, and is really no more than (hrtime_t / NANOSEC).  Note
that this unit is typically used where time_t's would've traditionally
been used.  The function monosec() in util.c returns the current
monosec, and monosec_to_time() can convert a given monosec to wall
time, using the system's current notion of time.
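The relationship between the monotonic and wall-clock views can be
sketched as follows.  This is a Python analogue, not the code in
util.c: it uses Python's monotonic clock in place of gethrtime(), and
the conversion shown (current wall time minus the monotonic age of the
value) is an assumption about how such a conversion can be done.

```python
import time

NANOSEC = 1_000_000_000


def monosec():
    """Monotonic time in whole seconds; unaffected by wall-clock changes."""
    return time.monotonic_ns() // NANOSEC


def monosec_to_time(m):
    """Convert a monosec value to wall time using the current clock."""
    age = monosec() - m             # how long ago (in seconds) m occurred
    return int(time.time()) - age
```

Durations such as "how long has this lease been held" are computed
from monosec values only; the conversion to wall time happens at the
last moment, when a time has to be shown to a user.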
385
One additional limitation of the `hrtime_t' and `monosec_t' types is
that they are unaware of the passage of time across checkpoint/resume
events (e.g., those generated by sys-suspend(1M)).  For example, if
gethrtime() returns time T, and then the machine is suspended for 2
hours, and then gethrtime() is called again, the time returned is not
T + (2 * 60 * 60 * NANOSEC), but rather approximately still T.

To work around this (and other checkpoint/resume related problems),
when a system is resumed, the DHCP client makes the pessimistic
assumption that all finite leases have expired while the machine was
suspended and must be obtained again.  This is known as "refreshing"
the leases, and is handled by refresh_ifslist().

It might appear that a more intelligent approach would be to record
the time(2) when the system is suspended, compare that against the
time(2) when the system is resumed, and use the delta between them to
decide which leases have expired.  Sadly, this cannot be done since,
through at least Solaris 8, it is not possible for userland programs
to be notified of system suspend events.

Configuration
-------------

For the most part, the DHCP client only *retrieves* configuration data
from the DHCP server, leaving the configuration to scripts (such as
boot scripts), which themselves use dhcpinfo(1) to retrieve the data
from the DHCP client.  This is desirable because it keeps the mechanism
of retrieving the configuration data decoupled from the policy of using
the data.

However, unless used in "inform" mode, the DHCP client *does* configure
each interface enough to allow it to communicate with other hosts.
Specifically, the DHCP client configures the interface's IP address,
netmask, and broadcast address using the information provided by the
server.  Further, for physical interfaces, any provided default routes
are also configured.  Since logical interfaces cannot be stored in the
kernel routing table, and in most cases, logical interfaces share a
default route with their associated physical interface, the DHCP client
does not automatically add or remove default routes when leases are
acquired or expired on logical interfaces.
426
Event Scripting
---------------

The DHCP client supports user program invocations on DHCP events.  The
supported events are BOUND, EXTEND, EXPIRE, DROP and RELEASE.  The user
program runs asynchronously to the DHCP client so that the main event
loop stays active to process other events, including events triggered
by the user program (for example, when it invokes dhcpinfo).

The user program execution is part of the transaction of a DHCP command.
For example, if the user program is not enabled, the transaction of the
DHCP command START is considered over when an ACK is received and the
interface is configured successfully.  If the user program is enabled,
it is invoked after the interface is configured successfully, and the
transaction is considered over only when the user program exits.  The
event scripting implementation makes use of the asynchronous operations
discussed in the "Transactions" section.

An upper bound of 58 seconds is imposed on how long the user program
can run.  If the user program does not exit after 55 seconds, the signal
SIGTERM is sent to it.  If it still does not exit after an additional 3
seconds, the signal SIGKILL is sent to it.  Since the event handler is
a wrapper around poll(), the DHCP client cannot directly observe the
completion of the user program.  Instead, the DHCP client creates a
child "helper" process to synchronously monitor the user program (this
process is also used to send the aforementioned signals to the process,
if necessary).  The DHCP client and the helper process share a pipe
which is included in the set of poll descriptors monitored by the DHCP
client's event handler.  When the user program exits, the helper process
passes the user program exit status to the DHCP client through the pipe,
informing the DHCP client that the user program has finished.  When the
DHCP client is asked to shut down, it will wait for any running instances
of the user program to complete.
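The helper's job (wait, escalate from SIGTERM to SIGKILL, report the
status through a pipe) can be sketched as follows.  This is a Python
illustration under assumptions: the function monitor() and its
parameters are hypothetical, the status is written as a native int
purely for the sketch, and in the agent this logic runs in a separate
child process rather than being called directly.

```python
import os
import struct
import subprocess


def monitor(argv, status_fd, term_after=55.0, kill_grace=3.0):
    """Run argv, enforce the time limits, and report the exit status.

    The status written to status_fd is the exit code, or the negated
    signal number if the program was killed by a signal.
    """
    proc = subprocess.Popen(argv)
    try:
        status = proc.wait(timeout=term_after)
    except subprocess.TimeoutExpired:
        proc.terminate()                    # SIGTERM: ask politely
        try:
            status = proc.wait(timeout=kill_grace)
        except subprocess.TimeoutExpired:
            proc.kill()                     # SIGKILL: cannot be ignored
            status = proc.wait()
    # The read end of this pipe is what the event loop polls on.
    os.write(status_fd, struct.pack("i", status))
    return status
```

Because the result arrives as readable data on a pipe, the main event
loop can treat "user program finished" like any other file descriptor
event instead of blocking in a wait call.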