CDDL HEADER START

The contents of this file are subject to the terms of the
Common Development and Distribution License (the "License").
You may not use this file except in compliance with the License.

You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
or http://www.opensolaris.org/os/licensing.
See the License for the specific language governing permissions
and limitations under the License.

When distributing Covered Code, include this CDDL HEADER in each
file and include the License file at usr/src/OPENSOLARIS.LICENSE.
If applicable, add the following below this CDDL HEADER, with the
fields enclosed by brackets "[]" replaced with your own identifying
information: Portions Copyright [yyyy] [name of copyright owner]

CDDL HEADER END

Copyright 2007 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.

Architectural Overview for the DHCP agent
Peter Memishian
ident	"%Z%%M%	%I%	%E% SMI"

INTRODUCTION
============

The Solaris DHCP agent (dhcpagent) is a DHCP client implementation
compliant with RFCs 2131, 3315, and others.  The major forces shaping
its design were:

	* Must be capable of managing multiple network interfaces.
	* Must consume little CPU, since it will always be running.
	* Must have a small memory footprint, since it will always be
	  running.
	* Must not rely on any shared libraries outside of /lib, since
	  it must run before all filesystems have been mounted.

When a DHCP agent implementation is only required to control a single
interface on a machine, the problem is expressed well as a simple
state machine, as shown in RFC 2131.  However, when a DHCP agent is
responsible for managing more than one interface at a time, the
problem becomes much more complicated.

This can be resolved using threads or with an event-driven model.
Given that DHCP's behavior can be expressed concisely as a state
machine, the event-driven model is the closest match.

While tried-and-true, that model is subtle and easy to get wrong.
Indeed, much of the agent's code is there to manage the complexity of
programming in an asynchronous event-driven paradigm.

THE BASICS
==========

The DHCP agent consists of roughly 30 source files, most with a
companion header file.  While the largest source file is around 1700
lines, most are much shorter.  The source files can largely be broken
up into three groups:

	* Source files that, along with their companion header files,
	  define an abstract "object" that is used by other parts of
	  the system.  Examples include "packet.c", which along with
	  "packet.h" provides a Packet object for use by the rest of
	  the agent; and "async.c", which along with "async.h" defines
	  an interface for managing asynchronous transactions within
	  the agent.

	* Source files that implement a given state of the agent; for
	  instance, there is a "request.c" which comprises all of
	  the procedural "work" which must be done while in the
	  REQUESTING state of the agent.  By encapsulating states in
	  files, it becomes easier to debug errors in the
	  client/server protocol and adapt the agent to new
	  constraints, since all the relevant code is in one place.

	* Source files that, along with their companion header files,
	  encapsulate a given task or related set of tasks.  The
	  difference between this and the first group is that the
	  interfaces exported from these files do not operate on
	  an "object", but rather perform a specific task.  Examples
	  include "dlpi_io.c", which provides a useful interface
	  to DLPI-related I/O operations.

OVERVIEW
========

Here we discuss the essential objects and subtle aspects of the
DHCP agent implementation.  Note that there is of course much more
that is not discussed here, but after this overview you should be able
to fend for yourself in the source code.

For details on the DHCPv6 aspects of the design, and how this relates
to the implementation present in previous releases of Solaris, see the
README.v6 file.

Event Handlers and Timer Queues
-------------------------------

The most important object in the agent is the event handler, whose
interface is in libinetutil.h and whose implementation is in
libinetutil.  The event handler is essentially an object-oriented
wrapper around poll(2): other components of the agent can register to
be called back when specific events on file descriptors happen -- for
instance, to wait for requests to arrive on its IPC socket, the agent
registers a callback function (accept_event()) that will be called
back whenever a new connection arrives on the file descriptor
associated with the IPC socket.  When the agent initially begins in
main(), it registers a number of events with the event handler, and
then calls iu_handle_events(), which proceeds to wait for events to
happen -- this function does not return until the agent is shut down
via a signal.

When the registered events occur, the callback functions are invoked,
which in turn might lead to additional callbacks being registered --
this is the classic event-driven model.  (As an aside, note that
programming in an event-driven model means that callbacks cannot
block, or else the agent will become unresponsive.)
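As a rough illustration of the pattern described above, the following
sketch wraps poll(2) in a tiny callback registry.  It is not the
libinetutil implementation: every name in it (sketch_eh_t,
sketch_register_event(), sketch_handle_events_once(), and the fixed
table size) is hypothetical and chosen only for this example.

```c
#include <poll.h>

#define	SKETCH_MAX_EVENTS	8	/* arbitrary cap for this sketch */

/* A callback receives the ready descriptor, its events, and a context. */
typedef void sketch_callback_t(int fd, short revents, void *arg);

typedef struct {
	struct pollfd		fds[SKETCH_MAX_EVENTS];
	sketch_callback_t	*callbacks[SKETCH_MAX_EVENTS];
	void			*args[SKETCH_MAX_EVENTS];
	nfds_t			count;
} sketch_eh_t;

static void
sketch_eh_init(sketch_eh_t *eh)
{
	eh->count = 0;
}

/* Ask to be called back when `events' occur on `fd'; -1 if the table is full. */
static int
sketch_register_event(sketch_eh_t *eh, int fd, short events,
    sketch_callback_t *cb, void *arg)
{
	if (eh->count == SKETCH_MAX_EVENTS)
		return (-1);
	eh->fds[eh->count].fd = fd;
	eh->fds[eh->count].events = events;
	eh->fds[eh->count].revents = 0;
	eh->callbacks[eh->count] = cb;
	eh->args[eh->count] = arg;
	eh->count++;
	return (0);
}

/*
 * One pass of the event loop: wait in poll(2) for at most `timeout_ms'
 * milliseconds, then run the callback of every descriptor that is ready.
 * Returns the number of callbacks run, or -1 if poll(2) failed.
 */
static int
sketch_handle_events_once(sketch_eh_t *eh, int timeout_ms)
{
	int	ran = 0;
	nfds_t	i;

	if (poll(eh->fds, eh->count, timeout_ms) == -1)
		return (-1);
	for (i = 0; i < eh->count; i++) {
		if (eh->fds[i].revents == 0)
			continue;
		eh->callbacks[i](eh->fds[i].fd, eh->fds[i].revents,
		    eh->args[i]);
		ran++;
	}
	return (ran);
}

/* Example callback: counts invocations in the int passed as context. */
static void
sketch_count_event(int fd, short revents, void *arg)
{
	(void) fd;
	(void) revents;
	(*(int *)arg)++;
}
```

A real event loop would call the equivalent of
sketch_handle_events_once() repeatedly until told to stop, and a
callback that blocked would stall every other registered event, which
is why callbacks must return promptly.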

A special kind of "event" is a timeout.  Since there are many timers
which must be maintained for each DHCP-controlled interface (such as a
lease expiration timer, time-to-first-renewal (t1) timer, and so
forth), an object-oriented abstraction to timers called a "timer
queue" is provided, whose interface is in libinetutil.h with a
corresponding implementation in libinetutil.  The timer queue allows
callback functions to be "scheduled" for callback after a certain
amount of time has passed.

The event handler and timer queue objects work hand-in-hand: the event
handler is passed a pointer to a timer queue in iu_handle_events() --
from there, it can use the iu_earliest_timer() routine to find the
timer which will next fire, and use this to set its timeout value in
its call to poll(2).  If poll(2) returns due to a timeout, the event
handler calls iu_expire_timers() to run the callbacks of all timers
that have expired (note that more than one may have expired if, for
example, multiple timers were set to expire at the same time).
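The interplay between the timer queue and the poll(2) timeout can be
sketched as follows.  This is a simplified stand-in, not the
libinetutil code: the sketch_* names, the fixed-size table, and the
use of caller-supplied millisecond timestamps are all assumptions made
for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define	SKETCH_MAX_TIMERS	8	/* arbitrary cap for this sketch */

typedef void sketch_timer_cb_t(void *arg);

typedef struct {
	int64_t			expire_ms;	/* absolute expiry time */
	sketch_timer_cb_t	*callback;	/* NULL marks a free slot */
	void			*arg;
} sketch_timer_t;

typedef struct {
	sketch_timer_t	timers[SKETCH_MAX_TIMERS];
} sketch_tq_t;

static void
sketch_tq_init(sketch_tq_t *tq)
{
	int i;

	for (i = 0; i < SKETCH_MAX_TIMERS; i++)
		tq->timers[i].callback = NULL;
}

/* Schedule `cb(arg)' to run `delay_ms' after `now_ms'; -1 if full. */
static int
sketch_schedule_timer(sketch_tq_t *tq, int64_t now_ms, int64_t delay_ms,
    sketch_timer_cb_t *cb, void *arg)
{
	int i;

	for (i = 0; i < SKETCH_MAX_TIMERS; i++) {
		if (tq->timers[i].callback == NULL) {
			tq->timers[i].expire_ms = now_ms + delay_ms;
			tq->timers[i].callback = cb;
			tq->timers[i].arg = arg;
			return (i);
		}
	}
	return (-1);
}

/*
 * Milliseconds until the next timer fires, suitable as a poll(2)
 * timeout: -1 (wait forever) if no timer is pending, 0 if one is due.
 */
static int
sketch_earliest_timer(const sketch_tq_t *tq, int64_t now_ms)
{
	int64_t	best = -1;
	int	i;

	for (i = 0; i < SKETCH_MAX_TIMERS; i++) {
		int64_t left;

		if (tq->timers[i].callback == NULL)
			continue;
		left = tq->timers[i].expire_ms - now_ms;
		if (left < 0)
			left = 0;
		if (best == -1 || left < best)
			best = left;
	}
	return ((int)best);
}

/* Run and remove every timer due at `now_ms'; returns how many ran. */
static int
sketch_expire_timers(sketch_tq_t *tq, int64_t now_ms)
{
	int ran = 0;
	int i;

	for (i = 0; i < SKETCH_MAX_TIMERS; i++) {
		sketch_timer_cb_t *cb = tq->timers[i].callback;

		if (cb == NULL || tq->timers[i].expire_ms > now_ms)
			continue;
		tq->timers[i].callback = NULL;	/* free slot before calling */
		cb(tq->timers[i].arg);
		ran++;
	}
	return (ran);
}

/* Example callback: counts expirations in the int passed as context. */
static void
sketch_bump(void *arg)
{
	(*(int *)arg)++;
}
```

An event loop built this way would pass the result of
sketch_earliest_timer() to poll(2) and call sketch_expire_timers()
whenever poll(2) returns 0.  Two timers scheduled for the same instant
both run in a single call, which matches the "more than one may have
expired" case noted above.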

Although it is possible to instantiate more than one timer queue or
event handler object, it doesn't make a lot of sense -- these objects
are really "singletons".  Accordingly, the agent has two global
variables, `eh' and `tq', which store pointers to the global event
handler and timer queue.

Network Interfaces
------------------

For each network interface managed by the agent, there is a set of
associated state that describes both its general properties (such as
the maximum MTU) and its connections to DHCP-related state (the
protocol state machines).  This state is stored in a pair of
structures called `dhcp_pif_t' (the IP physical interface layer or
PIF) and `dhcp_lif_t' (the IP logical interface layer or LIF).  Each
dhcp_pif_t represents a single physical interface, such as "hme0", for
a given IP protocol version (4 or 6), and has a list of dhcp_lif_t
structures representing the logical interfaces (such as "hme0:1") in
use by the agent.

This split is important because of differences between IPv4 and IPv6.
For IPv4, each DHCP state machine manages a single IP address and
associated configuration data.  This corresponds to a single logical
interface, which must be specified by the user.  For IPv6, however,
each DHCP state machine manages a group of addresses, and is
associated with a DUID value rather than with just an interface.

Thus, DHCPv6 behaves more like in.ndpd in its creation of "ADDRCONF"
interfaces.  The agent automatically plumbs logical interfaces when
needed and removes them when the addresses expire.

The state for a given session is stored separately in `dhcp_smach_t'.
This state machine then points to the main LIF used for I/O, and to a
list of `dhcp_lease_t' structures representing individual leases, and
each of those points to a list of LIFs corresponding to the individual
addresses being managed.
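The relationships among these structures can be pictured with the
following simplified sketch.  The type and field names are
hypothetical and are not the definitions found in interface.h; the
PIF's own list of LIFs is omitted to keep the example short, and only
the smach -> lease -> LIF chain is modeled.

```c
#include <stddef.h>

struct sketch_pif;

/* Logical interface (LIF): one per address, e.g. "hme0:1". */
typedef struct sketch_lif {
	const char		*name;
	struct sketch_pif	*pif;	/* owning physical interface */
	struct sketch_lif	*next;	/* next LIF on the same lease */
} sketch_lif_t;

/* Physical interface (PIF): one per link and IP version. */
typedef struct sketch_pif {
	const char	*name;		/* e.g. "hme0" */
	int		isv6;		/* 0 for IPv4, 1 for IPv6 */
} sketch_pif_t;

/* Lease: groups the LIFs whose addresses it covers. */
typedef struct sketch_lease {
	sketch_lif_t		*lifs;
	struct sketch_lease	*next;
} sketch_lease_t;

/* State machine: one DHCP session. */
typedef struct sketch_smach {
	sketch_lif_t	*main_lif;	/* LIF used for I/O */
	sketch_lease_t	*leases;
} sketch_smach_t;

/* Walk smach -> leases -> LIFs and count the addresses being managed. */
static size_t
sketch_count_addrs(const sketch_smach_t *sm)
{
	size_t n = 0;
	const sketch_lease_t *lease;
	const sketch_lif_t *lif;

	for (lease = sm->leases; lease != NULL; lease = lease->next)
		for (lif = lease->lifs; lif != NULL; lif = lif->next)
			n++;
	return (n);
}
```

In this picture a DHCPv4 state machine would have a single lease with
a single LIF, while a DHCPv6 state machine could have several LIFs
hanging off each lease.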

One point that was brushed over in the preceding discussion of event
handlers and timer queues was context.  Recall that the event-driven
nature of the agent requires that functions cannot block, lest they
starve out others and impact the observed responsiveness of the agent.
As an example, consider the process of extending a lease: the agent
must send a REQUEST packet and wait for an ACK or NAK packet in
response.  This is done by sending a REQUEST and then returning to the
event handler that waits for an ACK or NAK packet to arrive on the
file descriptor associated with the interface.  Note, however, that
when the ACK or NAK does arrive and the callback function is called
back, it must know which state machine the packet is for (it must get
back its context).  This could be handled through an ad-hoc mapping of
file descriptors to state machines, but a cleaner approach is to have
the event handler's register function (iu_register_event()) take in an
opaque context pointer, which will then be passed back to the
callback.  In the agent, the context pointer used depends on the
nature of the event: events on LIFs use the dhcp_lif_t pointer, events
on the state machine use dhcp_smach_t, and so on.

Note that there is nothing that guarantees the pointer passed into
iu_register_event() or iu_schedule_timer() will still be valid when
the callback is called back (for instance, the memory may have been
freed in the meantime).  To solve this problem, all of the data
structures used in this way are reference counted.  For more details
on how the reference count scheme is implemented, see the closing
comments in interface.h regarding memory management.
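The general idea behind reference counting a callback context is
sketched below.  It is a generic hold/release pattern with
hypothetical names, not the scheme documented in interface.h: a
reference is taken when the object is handed out as a callback
context and dropped when the callback has run, so the memory survives
even if its original owner lets go first.

```c
#include <stdlib.h>

typedef struct {
	unsigned int	hold_count;	/* number of outstanding references */
	int		*freed;		/* demo aid: set to 1 on free */
} sketch_obj_t;

/* Allocate an object holding one reference on behalf of its creator. */
static sketch_obj_t *
sketch_obj_create(int *freed)
{
	sketch_obj_t *obj = malloc(sizeof (*obj));

	if (obj != NULL) {
		obj->hold_count = 1;
		obj->freed = freed;
	}
	return (obj);
}

/* Take an extra reference, e.g. just before registering a callback. */
static void
sketch_obj_hold(sketch_obj_t *obj)
{
	obj->hold_count++;
}

/* Drop one reference; free on the last one.  Returns 1 if freed. */
static int
sketch_obj_release(sketch_obj_t *obj)
{
	if (--obj->hold_count > 0)
		return (0);
	*obj->freed = 1;
	free(obj);
	return (1);
}
```

With this pattern, the code that registers a callback calls
sketch_obj_hold() first, and the callback (or the code that cancels
it) calls sketch_obj_release() when done.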

Transactions
------------

Many operations performed via DHCP must be performed in groups -- for
instance, acquiring a lease requires several steps: sending a
DISCOVER, collecting OFFERs, selecting an OFFER, sending a REQUEST,
and receiving an ACK, assuming everything goes well.  Note, however,
that due to the event-driven model the agent operates in, these
operations are not inherently "grouped" -- instead, the agent sends a
DISCOVER, goes back into the main event loop, waits for events
(perhaps even requests on the IPC channel to begin acquiring a lease
on another state machine), eventually checks to see if an acceptable
OFFER has come in, and so forth.  To some degree, the notion of the
state machine's current state (SELECTING, REQUESTING, etc.) helps
control the potential chaos of the event-driven model (for instance,
if while the agent is waiting for an OFFER on a given state machine,
an IPC event comes in requesting that the leases be RELEASED, the
agent knows to send back an error since the state machine must be in
at least the BOUND state before a RELEASE can be performed).

However, states are not enough -- for instance, suppose that the agent
begins trying to renew a lease.  This is done by sending a REQUEST
packet and waiting for an ACK or NAK, which might never come.  If,
while waiting for the ACK or NAK, the user sends a request to renew
the lease as well, then if the agent were to send another REQUEST,
things could get quite complicated (and this is only the beginning of
this rathole).  To protect against this, two objects exist:
`async_action' and `ipc_action'.  These objects are related, but
independent of one another; the more essential object is the
`async_action', which we will discuss first.

In short, an `async_action' represents a pending transaction (aka
asynchronous action), of which each state machine can have at most
one.  The `async_action' structure is embedded in the `dhcp_smach_t'
structure, which is fine since there can be at most one pending
transaction per state machine.  Typical "asynchronous transactions"
are START, EXTEND, and INFORM, since each consists of a sequence of
packets that must be done without interruption.  Note that not all
DHCP operations are "asynchronous" -- for instance, a DHCPv4 RELEASE
operation is synchronous (not asynchronous) since after the RELEASE is
sent no reply is expected from the DHCP server, but DHCPv6 Release is
asynchronous, as all DHCPv6 messages are transactional.  Some
operations, such as status query, are synchronous and do not affect
the system state, and thus do not require sequencing.

When the agent realizes it must perform an asynchronous transaction,
it calls async_start() to open the transaction.  If one is already
pending, then the new transaction must fail (the details of failure
depend on how the transaction was initiated, which is described in
more detail later when the `ipc_action' object is discussed).  If
there is no pending asynchronous transaction, the operation succeeds.

When the transaction is complete, either async_finish() or
async_cancel() must be called to complete or cancel the asynchronous
action on that state machine.  If the transaction is unable to
complete within a certain amount of time (more on this later), a timer
should be used to cancel the operation.
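The "at most one pending transaction per state machine" rule can be
illustrated with a small sketch.  The txn_* names and the reduced
state below are hypothetical; the real interface in async.h carries
more information (for example, who initiated the transaction), and
this example only models the open/finish/cancel discipline.

```c
#include <stdbool.h>

typedef enum {
	TXN_NONE,	/* no transaction pending */
	TXN_START,
	TXN_EXTEND,
	TXN_INFORM
} txn_cmd_t;

typedef struct {
	txn_cmd_t	pending;
} txn_async_t;

typedef struct {
	txn_async_t	async;	/* embedded: at most one per state machine */
} txn_smach_t;

/* Open a transaction; fails if one is already pending. */
static bool
txn_start(txn_smach_t *sm, txn_cmd_t cmd)
{
	if (sm->async.pending != TXN_NONE)
		return (false);
	sm->async.pending = cmd;
	return (true);
}

/* Mark the pending transaction as successfully completed. */
static void
txn_finish(txn_smach_t *sm)
{
	sm->async.pending = TXN_NONE;
}

/* Abandon the pending transaction; returns whether there was one. */
static bool
txn_cancel(txn_smach_t *sm)
{
	bool was_pending = (sm->async.pending != TXN_NONE);

	sm->async.pending = TXN_NONE;
	return (was_pending);
}
```

Because the pending slot lives inside the state machine structure, no
allocation is needed to open a transaction, and a second attempt while
one is outstanding is rejected by a simple check.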

The notion of asynchronous transactions is complicated by the fact
that they may originate from both inside and outside of the agent.
For instance, a user initiates an asynchronous START transaction by
running `ifconfig hme0 dhcp start', but the agent will internally
need to perform asynchronous EXTEND transactions to extend the lease
before it expires.  Note that user-initiated actions always have
priority over internal actions: the former will cancel the latter, if
necessary.

This leads us into the `ipc_action' object.  An `ipc_action'
represents the IPC-related pieces of an asynchronous transaction that
was started as a result of a user request, as well as the `BUSY' state
of the administrative interface.  Only IPC-generated asynchronous
transactions have a valid `ipc_action' object.  Note that since there
can be at most one asynchronous action per state machine, there can
also be at most one `ipc_action' per state machine (this means it can
also conveniently be embedded inside the `dhcp_smach_t' structure).

One of the main purposes of the `ipc_action' object is to time out
user requests.  When the user specifies a timeout value as an argument
to ifconfig, that value is the `ipc_action' timeout; in other words,
how long the user is willing to wait for the command to complete.
When this time expires, the ipc_action is terminated, as well as the
asynchronous operation.

The API provided for the `ipc_action' object is quite similar to the
one for the `async_action' object: when an IPC request comes in for an
operation requiring asynchronous operation, ipc_action_start() is
called.  When the request completes, ipc_action_finish() is called.
If the user times out before the request completes, then
ipc_action_timeout() is called.

Packet Management
-----------------

Another complicated area is packet management: building, manipulating,
sending and receiving packets.  These operations are all encapsulated
behind a dozen or so interfaces (see packet.h) that abstract the
unimportant details away from the rest of the agent code.  In order to
send a DHCP packet, code first calls init_pkt(), which returns a
dhcp_pkt_t initialized suitably for transmission.  Note that currently
init_pkt() returns a dhcp_pkt_t that is actually allocated as part of
the `dhcp_smach_t', but this may change in the future.  After calling
init_pkt(), the add_pkt_opt*() functions are used to add options to
the DHCP packet.  Finally, send_pkt() and send_pkt_v6() can be used to
transmit the packet to a given IP address.

The send_pkt() function is actually quite complicated; for one, it
must internally use either DLPI or sockets depending on the machine
state; for another, it handles the details of packet timeout and
retransmission.  The last argument to send_pkt() is a pointer to a
"stop function."  If this argument is passed as NULL, then the packet
will only be sent once (it won't be retransmitted).  Otherwise, the
stop function will be called back before each retransmission.  The
callback may alter dsm_send_timeout if necessary to place a cap on the
next timeout; this is done for DHCPv6 in stop_init_reboot() in order
to implement the CNF_MAX_RD constraint.

The return value from this function indicates whether to continue
retransmission or not, which allows the send_pkt() caller to control
the retransmission policy without making it have to deal with the
retransmission mechanism.  See request.c for an example of this in
action.
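The separation of retransmission policy from mechanism can be sketched
as below.  This is not the send_pkt() implementation: the names, the
convention that a true return means "stop", and the hard_limit safety
cap are all assumptions of the example, and the timer wait between
transmissions is replaced by a plain loop.

```c
#include <stdbool.h>
#include <stddef.h>

/* Policy hook: return true to stop retransmitting. */
typedef bool rexmit_stop_t(void *ctx, unsigned int n_sent);

/*
 * Mechanism: "send" once, then keep retransmitting until the stop
 * function says to stop or `hard_limit' transmissions have been made.
 * Returns the total number of transmissions.
 */
static unsigned int
rexmit_send(rexmit_stop_t *stop, void *ctx, unsigned int hard_limit)
{
	unsigned int sent = 1;		/* the initial transmission */

	if (stop == NULL)
		return (sent);		/* NULL: never retransmit */
	while (sent < hard_limit && !stop(ctx, sent))
		sent++;			/* a real agent waits on a timer here */
	return (sent);
}

/* Example policy: stop once *ctx transmissions have gone out. */
static bool
rexmit_stop_after(void *ctx, unsigned int n_sent)
{
	return (n_sent >= *(unsigned int *)ctx);
}
```

The caller chooses the policy simply by choosing which stop function
to pass in, while the sending code never needs to know why
retransmission ended.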

The recv_pkt() function is simpler but still complicated by the fact
that one may want to receive several different types of packets at
once and in different ways (DLPI or sockets).  The caller registers an
event handler on the file descriptor, and then calls recv_pkt() to
read in the packet along with meta information about the message (the
sender and interface identifier).

For IPv6, packet reception is done with a single socket, using
IPV6_PKTINFO to determine the actual destination address and receiving
interface.  Packets are then matched against the state machines on the
given interface through the transaction ID.

The same facility exists for inbound IPv4 packets, but because there's
no IP_PKTINFO processing on output yet in Solaris, and because IPv4
still relies on DLPI, DHCP packets are handled on a per-LIF (when
bound) and per-PIF (when unbound) basis.  Eventually, when IP_PKTINFO
is available for IPv4, the per-LIF sockets can go away.  If it ever
becomes possible to send and receive IP packets without having an IP
address configured on an interface, then the DLPI streams can go as
well.

Time
----

The notion of time is an exceptionally subtle area.  You will notice
five ways that time is represented in the source: as lease_t's,
uint32_t's, time_t's, hrtime_t's, and monosec_t's.  Each of these
types serves a slightly different function.

The `lease_t' type is the simplest to understand; it is the unit of
time in the CD_{LEASE,T1,T2}_TIME options in a DHCP packet, as defined
by RFC 2131.  This is defined as a positive number of seconds
(relative to some fixed point in time) or the value `-1' (DHCP_PERM)
which represents infinity (i.e., a permanent lease).  The lease_t
should be used either when dealing with actual DHCP packets that are
sent on the wire or for variables which follow the exact definition
given in the RFC.
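The special treatment of the "infinite" value can be illustrated with
a short sketch.  The names here are hypothetical, and the example
assumes the lease time is held in an unsigned 32-bit quantity, so that
`-1' is the all-ones value 0xffffffff that RFC 2131 reserves for
infinity.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t wire_lease_t;		/* seconds, as carried in the packet */
#define	WIRE_LEASE_PERM	((wire_lease_t)-1)	/* 0xffffffff: never expires */

/*
 * True once `now' has reached the end of a lease that began at `start'.
 * Both times are in seconds on the same monotonic clock.
 */
static bool
wire_lease_expired(int64_t start, wire_lease_t lease, int64_t now)
{
	if (lease == WIRE_LEASE_PERM)
		return (false);		/* permanent leases never expire */
	return (now - start >= (int64_t)lease);
}
```

Checking for the permanent value before doing any arithmetic avoids
treating 0xffffffff as an ordinary (very long) duration.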

The `uint32_t' type is also used to represent a relative time in
seconds.  However, here the value `-1' is not special and of course
this type is not tied to any definition given in RFC 2131.  Use this
for representing "offsets" from another point in time that are not
DHCP lease times.

The `time_t' type is the natural Unix type for representing time since
the epoch.  Unfortunately, it is affected by stime(2) or adjtime(2)
and since the DHCP client is used during system installation (and thus
when time is typically being configured), the time_t cannot be used in
general to represent an absolute time since the epoch.  For instance,
if a time_t were used to keep track of when a lease began, and then a
minute later stime(2) was called to adjust the system clock forward a
year, then the lease would appear to have expired a year ago even
though it has only been a minute.  For this reason, time_t's should
only be used either when wall time must be displayed (such as in a
DHCP_STATUS IPC transaction) or when a time meaningful across reboots
must be obtained (such as when caching an ACK packet at system
shutdown).

The `hrtime_t' type returned from gethrtime() works around the
limitations of the time_t in that it is not affected by stime(2) or
adjtime(2), with the disadvantage that it represents time from some
arbitrary time in the past and in nanoseconds.  The timer queue code
deals with hrtime_t's directly since that particular piece of code is
meant to be fairly independent of the rest of the DHCP client.

However, dealing with nanoseconds is error-prone when all the other
time types are in seconds.  As a result, yet another time type, the
`monosec_t', was created to represent a monotonically increasing time
in seconds, and is really no more than (hrtime_t / NANOSEC).  Note
that this unit is typically used where time_t's would've traditionally
been used.  The function monosec() in util.c returns the current
monosec, and monosec_to_time() can convert a given monosec to wall
time, using the system's current notion of time.
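A portable approximation of these conversions is sketched below.  It
is not the code in util.c: it uses the POSIX
clock_gettime(CLOCK_MONOTONIC) call as a stand-in for gethrtime(), and
all sketch_* names and types are hypothetical.

```c
#define	_POSIX_C_SOURCE	200809L	/* for clock_gettime(); precedes includes */

#include <stdint.h>
#include <time.h>

#define	SKETCH_NANOSEC	1000000000LL

typedef int64_t	sketch_hrtime_t;	/* nanoseconds since an arbitrary point */
typedef int64_t	sketch_monosec_t;	/* seconds since that same point */

/* Monotonic time in nanoseconds, or -1 if the clock is unavailable. */
static sketch_hrtime_t
sketch_gethrtime(void)
{
	struct timespec ts;

	if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0)
		return (-1);
	return ((sketch_hrtime_t)ts.tv_sec * SKETCH_NANOSEC + ts.tv_nsec);
}

/* Truncate a nanosecond count to whole seconds. */
static sketch_monosec_t
sketch_hrtime_to_monosec(sketch_hrtime_t hr)
{
	return (hr / SKETCH_NANOSEC);
}

/* Current monotonic time in seconds. */
static sketch_monosec_t
sketch_monosec(void)
{
	return (sketch_hrtime_to_monosec(sketch_gethrtime()));
}

/*
 * Convert a monotonic timestamp to wall time by measuring how far it
 * lies from "now" and applying that offset to the current wall clock.
 */
static time_t
sketch_monosec_to_time(sketch_monosec_t abs_monosec)
{
	return (time(NULL) - (time_t)(sketch_monosec() - abs_monosec));
}
```

Because the conversion to wall time is computed from the current
clock each time it is needed, a later change to the system clock
shifts the displayed value but never the underlying monotonic
timestamp.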

One additional limitation of the `hrtime_t' and `monosec_t' types is
that they are unaware of the passage of time across checkpoint/resume
events (e.g., those generated by sys-suspend(1M)).  For example, if
gethrtime() returns time T, and then the machine is suspended for 2
hours, and then gethrtime() is called again, the time returned is not
T + (2 * 60 * 60 * NANOSEC), but rather approximately still T.

To work around this (and other checkpoint/resume related problems),
when a system is resumed, the DHCP client makes the pessimistic
assumption that all finite leases have expired while the machine was
suspended and must be obtained again.  This is known as "refreshing"
the leases, and is handled by refresh_smachs().

It might appear that a more intelligent approach would be to record
the time(2) when the system is suspended, compare that against the
time(2) when the system is resumed, and use the delta between them to
decide which leases have expired.  Sadly, this cannot be done since
through at least Solaris 10, it is not possible for userland programs
to be notified of system suspend events.

Configuration
-------------

For the most part, the DHCP client only *retrieves* configuration data
from the DHCP server, leaving the configuration to scripts (such as
boot scripts), which themselves use dhcpinfo(1) to retrieve the data
from the DHCP client.  This is desirable because it keeps the mechanism
of retrieving the configuration data decoupled from the policy of using
the data.

However, unless used in "inform" mode, the DHCP client *does*
configure each IP interface enough to allow it to communicate with
other hosts.  Specifically, the DHCP client configures the interface's
IP address, netmask, and broadcast address using the information
provided by the server.  Further, for IPv4 logical interface 0
("hme0"), any provided default routes are also configured.

For IPv6, only the IP addresses are set.  The netmask (prefix) is then
set automatically by in.ndpd, and routes are discovered in the usual
way by router discovery or routing protocols.  DHCPv6 doesn't set
routes.

Since logical interfaces cannot be specified as output interfaces in
the kernel forwarding table, and in most cases, logical interfaces
share a default route with their associated physical interface, the
DHCP client does not automatically add or remove default routes when
IPv4 leases are acquired or expired on logical interfaces.

Event Scripting
---------------

The DHCP client supports user program invocations on DHCP events.  The
supported events are BOUND, EXTEND, EXPIRE, DROP, RELEASE, and INFORM
for DHCPv4, and BOUND6, EXTEND6, EXPIRE6, DROP6, LOSS6, RELEASE6, and
INFORM6 for DHCPv6.  The user program runs asynchronously to the DHCP
client so that the main event loop stays active to process other
events, including events triggered by the user program (for example,
when it invokes dhcpinfo).

The user program execution is part of the transaction of a DHCP command.
For example, if the user program is not enabled, the transaction of the
DHCP command START is considered over when an ACK is received and the
interface is configured successfully.  If the user program is enabled,
it is invoked after the interface is configured successfully, and the
transaction is considered over only when the user program exits.  The
event scripting implementation makes use of the asynchronous operations
discussed in the "Transactions" section.

An upper bound of 58 seconds is imposed on how long the user program
can run.  If the user program does not exit after 55 seconds, the signal
SIGTERM is sent to it.  If it still does not exit after an additional 3
seconds, the signal SIGKILL is sent to it.  Since the event handler is
a wrapper around poll(), the DHCP client cannot directly observe the
completion of the user program.  Instead, the DHCP client creates a
child "helper" process to synchronously monitor the user program (this
process is also used to send the aforementioned signals to the process,
if necessary).  The DHCP client and the helper process share a pipe
which is included in the set of poll descriptors monitored by the DHCP
client's event handler.  When the user program exits, the helper process
passes the user program exit status to the DHCP client through the pipe,
informing the DHCP client that the user program has finished.  When the
DHCP client is asked to shut down, it will wait for any running instances
of the user program to complete.