Name                           Date         Size      #Lines   LOC

..                             03-May-2022  -
hdrs/                          03-May-2022  -         22,761   16,445
http/                          03-May-2022  -         55,178   38,771
http2/                         03-May-2022  -         151,588  147,927
http3/                         03-May-2022  -         9,245    6,561
logging/                       03-May-2022  -         19,097   12,738
private/                       29-Oct-2021  -         118      40
shared/                        03-May-2022  -         1,545    1,128
CacheControl.cc                29-Oct-2021  12.9 KiB  447      317
CacheControl.h                 29-Oct-2021  3.6 KiB   131      65
ControlBase.cc                 29-Oct-2021  19.8 KiB  861      658
ControlBase.h                  29-Oct-2021  3.1 KiB   111      55
ControlMatcher.cc              29-Oct-2021  26.3 KiB  979      625
ControlMatcher.h               29-Oct-2021  10.7 KiB  358      208
HostStatus.h                   29-Oct-2021  6.5 KiB   206      141
IPAllow.cc                     29-Oct-2021  18.1 KiB  604      510
IPAllow.h                      29-Oct-2021  9.5 KiB   340      214
InkAPIInternal.h               29-Oct-2021  9.3 KiB   393      261
Main.h                         29-Oct-2021  1.2 KiB   46       7
Makefile.am                    29-Oct-2021  4.1 KiB   180      143
Makefile.in                    03-May-2022  41.6 KiB  1,252    1,079
Milestones.h                   29-Oct-2021  2.8 KiB   105      47
ParentConsistentHash.cc        29-Oct-2021  14.7 KiB  404      311
ParentConsistentHash.h         29-Oct-2021  2.2 KiB   67       28
ParentRoundRobin.cc            29-Oct-2021  8 KiB     208      152
ParentRoundRobin.h             29-Oct-2021  1.7 KiB   53       20
ParentSelection.cc             29-Oct-2021  59.4 KiB  1,933    1,441
ParentSelection.h              29-Oct-2021  13.4 KiB  477      330
ParentSelectionStrategy.cc     29-Oct-2021  4.8 KiB   137      71
Plugin.cc                      29-Oct-2021  9.3 KiB   366      274
Plugin.h                       29-Oct-2021  2.7 KiB   97       36
PluginVC.cc                    29-Oct-2021  36.4 KiB  1,370    931
PluginVC.h                     29-Oct-2021  7.1 KiB   279      165
PoolableSession.h              29-Oct-2021  4.8 KiB   195      132
ProtocolProbeSessionAccept.cc  29-Oct-2021  7.2 KiB   215      152
ProtocolProbeSessionAccept.h   29-Oct-2021  2.3 KiB   71       28
ProxySession.cc                29-Oct-2021  7.3 KiB   309      239
ProxySession.h                 29-Oct-2021  7.3 KiB   284      174
ProxyTransaction.cc            29-Oct-2021  5.4 KiB   231      170
ProxyTransaction.h             29-Oct-2021  7.6 KiB   277      189
README-stats.otl               29-Oct-2021  20.8 KiB  527      472
RegressionSM.cc                29-Oct-2021  6.4 KiB   278      229
RegressionSM.h                 29-Oct-2021  2.5 KiB   79       37
ReverseProxy.cc                29-Oct-2021  5.2 KiB   192      111
ReverseProxy.h                 29-Oct-2021  1.8 KiB   61       17
Show.h                         29-Oct-2021  3.8 KiB   165      111
StatPages.cc                   29-Oct-2021  6 KiB     288      212
StatPages.h                    29-Oct-2021  3.5 KiB   122      51
Transform.cc                   29-Oct-2021  31.1 KiB  1,051    685
Transform.h                    29-Oct-2021  3 KiB     108      50
TransformInternal.h            29-Oct-2021  3.6 KiB   140      89
example_alarm_bin.sh           29-Oct-2021  2.1 KiB   72       37
example_prep.sh                29-Oct-2021  1.1 KiB   33       9
README-stats.otl

---------------------------------------------------------------------------------
                                    STATS
---------------------------------------------------------------------------------
The following is a list of problems associated with the
existing stat system:
1) atomicity issues
2) consistency and coupling issues
3) clearing
4) persistence
5) aggregation and derivation
6) relationship with the manager

Specifically, some of the stats reported seemed incorrect.
In the cases where individual stats were incorrect (e.g. the
number of bytes transferred), it is suspected that atomicity
issues played a role. For example, if the stat variable was
64 bits long, were all 64 bits being read and written atomically?
Is it possible that while the lower/upper 32 bits were being
accessed, the other 32 bits were being changed?

For stats that were computed from other stats (e.g. hit rate
is the ratio of cache hits to total cache lookups), it is
suspected that there were coupling problems. For example, if the
number of cache hits is read, and the number of cache lookups is
updated before it too can be read, the computed hit rate may
appear incorrect.

Some stats are interrelated with other stats (e.g. the number of
client requests and the number of cache lookups). Inconsistencies
between such stats may show up if, for example, one of the values
is read and the other is allowed to change before it is read as
well. A single client request may then appear to produce two
cache lookups.

These issues can be aggravated by allowing administrators to
clear stat values. This led us to introduce persistence:
disallowing the clearing of stats. While this does deal with the
problems associated with clearing, it does not address the coupling,
inconsistency or atomicity problems. Those remain.

The remaining issues, aggregation and interaction with the manager,
did not suffer from any fundamental design problems. They were more
implementation-related issues.

A new stat system which deals correctly with the issues of
atomicity, consistency and coupling will automatically address
the clearing and persistence problems. Atomicity can be handled
very simply by either of two techniques (sketched below):
1) make ALL accesses to stat variables atomic operations, or
2) protect ALL accesses to stat variables by placing them
   within critical regions.

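As a hedged illustration (not taken from the Traffic Server sources; the
stat name is made up), the two techniques might look like this in C++:

    #include <atomic>
    #include <cstdint>
    #include <mutex>

    // Technique 1: every access to the stat is an atomic operation.
    std::atomic<uint64_t> bytes_from_origin{0};

    void add_bytes_atomic(uint64_t n)
    {
      bytes_from_origin.fetch_add(n, std::memory_order_relaxed);
    }

    // Technique 2: every access goes through a critical region.
    std::mutex stat_mutex;
    uint64_t bytes_from_origin_locked = 0;

    void add_bytes_locked(uint64_t n)
    {
      std::lock_guard<std::mutex> guard(stat_mutex);
      bytes_from_origin_locked += n;
    }

Either way, a 64-bit read or write can no longer be torn into separate
32-bit halves.
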
The inconsistency and coupling problem is two-fold in nature.
First there is the problem of grouping: related stats (related
either indirectly, as illustrated by the inconsistency example, or
directly, as illustrated by the coupling example) may be grouped
together. If access to either one of them is controlled, it is
guaranteed that the other is not concurrently modified. This
prevents values from changing while related values are being
accessed. It does not, however, solve the problem of maintaining
"transactional" consistency. Look at the number of requests /
number of cache lookups example again. Suppose the stat for the
number of requests is updated, and then the stat for the number of
cache lookups is updated, with both updates inside critical regions.
If the values of both stats are read in the window between the two
updates, the inconsistency problem remains, even though the stats
are grouped together (accessed through the same mutex, for example).
It can be solved only if we introduce the notion of a transaction:
all stats which belong to the same transaction must be updated
together. The values of these stats, at any time, will then be
consistent. This is a more difficult problem, because it introduces
grouping not only in space but in time as well. Unfortunately, an
implementation requires the notion of "local" and "global" stats.
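A minimal sketch of the grouping idea, using illustrative names
(StatGroup, requests, lookups, hits) rather than anything from the
actual code:

    #include <cstdint>
    #include <mutex>

    // All stats belonging to the same kind of transaction live in one
    // group and are read and written under a single lock.
    struct StatGroup {
      std::mutex lock;
      uint64_t requests = 0;
      uint64_t lookups  = 0;
      uint64_t hits     = 0;
    };

    // Update every related stat for a finished transaction in one
    // critical region, so readers never see a half-applied transaction.
    void record_transaction(StatGroup &g, bool did_lookup, bool was_hit)
    {
      std::lock_guard<std::mutex> guard(g.lock);
      ++g.requests;
      if (did_lookup) ++g.lookups;
      if (was_hit)    ++g.hits;
    }

    // A consistent snapshot: the hit rate is computed from values read
    // together under the same lock the writers hold.
    double hit_rate(StatGroup &g)
    {
      std::lock_guard<std::mutex> guard(g.lock);
      return g.lookups ? static_cast<double>(g.hits) / g.lookups : 0.0;
    }

Because both counters are read under the lock, a reader can never observe
a transaction that has bumped one of them but not the other.
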

Now, stats fall into three categories:
1) dynamic stats: e.g. the number of current origin server
   connections,
2) incremental stats: e.g. the number of bytes from the origin server,
3) transactional stats: e.g. the number of requests served.
Stats of type 1) just need to be accessed atomically. They do not
need to be cleared. Stats of type 2) also need to be accessed atomically,
and may need to be cleared. They are not related to other stats, so they
do not need to be grouped. Stats of type 3) need to be accessed atomically,
but they are grouped and so must be accessed together in both space and time.

The three types of stats can be implemented with three different mechanisms,
or a single mechanism can be defined which handles all three types. The latter
approach is more elegant, but it raises questions about performance.

Since transactional stats have to be accessed together, it is logical to
control access to them through a single lock. A stat structure, then, can
be defined to have a lock which controls access. For individual stats which
require just atomic access, however, it may be better to allow direct
access, without getting a lock. This may reduce lock contention.
We ran two experiments to investigate the performance issue.
The existing stat system allows individual atomic access to all stats.
We compared the performance of this system with a system where, instead of
atomic access, every access requires taking a single lock. The results from
inkbench are as follows.

[keep_alive = 4, 100% hit rate, 4 client machines, 40 simultaneous users]
atomic: 526 ops/sec
mutex:  499 ops/sec

Changing the setup to proxy-only, which reduces the number of
simultaneous accesses, gave:

[keep_alive = 4, proxy-only, 4 client machines, 40 simultaneous users]
atomic: 284 ops/sec
mutex:  282 ops/sec

We also ran the test with stats turned off, which gave unexpected results.
It is unclear, however, how completely we really turned the stats off (there
were even compile-time problems, and many of the stat reading/writing
routines were still being called, etc.).

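For a rough sense of the same trade-off on current hardware, a
single-threaded micro-benchmark along the following lines could be used.
This is not inkbench, the iteration count is arbitrary, and contention
effects only appear once multiple threads are involved:

    #include <atomic>
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <mutex>

    int main()
    {
      constexpr int N = 10'000'000;   // arbitrary iteration count

      std::atomic<uint64_t> atomic_stat{0};
      std::mutex            lock;
      uint64_t              locked_stat = 0;

      auto t0 = std::chrono::steady_clock::now();
      for (int i = 0; i < N; ++i) {
        atomic_stat.fetch_add(1, std::memory_order_relaxed);
      }
      auto t1 = std::chrono::steady_clock::now();
      for (int i = 0; i < N; ++i) {
        std::lock_guard<std::mutex> guard(lock);
        ++locked_stat;
      }
      auto t2 = std::chrono::steady_clock::now();

      using ms = std::chrono::duration<double, std::milli>;
      std::printf("atomic: %.1f ms  mutex: %.1f ms  (checks: %llu %llu)\n",
                  ms(t1 - t0).count(), ms(t2 - t1).count(),
                  (unsigned long long)atomic_stat.load(),
                  (unsigned long long)locked_stat);
      return 0;
    }

A multi-threaded variant would be needed to reproduce the contention that
the inkbench numbers above reflect.
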
In addition to the tests mentioned above, we spoke to Prof. Dirk
Grunwald. He said that the most expensive operations are the memory
barriers. On the Alphas, atomic operations are implemented with
load-locked, store-conditional and memory-barrier instructions. I
think each atomic operation is implemented with two memory barriers,
while a lock and a release have one memory barrier each. So the number
of memory barriers is much higher with atomic updates. Both atomic
updates and mutex-based updates have to pay similar cache-loading costs
(if we assume that the loads for the test-and-set and the loads for the
reads will miss, the stores will obviously hit). The difference, if any,
will come from the contention for the lock while all of the group
elements are being updated, and that has to be compared with the cost of
one set of memory barriers for every stat that is updated atomically.
Context switching while holding the lock may also result in a performance
hit, but it may be possible to address this by doing "lazy" updates of
global structures.

The consensus, then, is to have a single mechanism for access to stats:
mutexes. This, combined with a grouping scheme, solves the atomicity,
consistency, coupling, clearing and persistence problems.

---------------------------------------------------------------------------------
Brian's Rant:
-------------
In any case, I want to change the stats system once to address all the issues
of clearing, persistence, monotonicity, coupled computations,
etc.  Let's be sure we all understand the design issues and do
the right thing one time.  We've gone through lots of band-aids in the past
which got us nowhere, so let's get the design right.
---------------------------------------------------------------------------------
Issues:
-------
1) Clearing
2) Persistence
3) Monotonicity
4) Coupled computations
5) Inconsistencies between related stats

- Want to be able to clear stats
- Do not want to see inconsistencies between stats

Stats which increase and decrease (e.g. current open connections) should
not be clearable.

Persistent stats should not be clearable.

Related stats should be updated and cleared together.

Divide stats into two groups:
1) dynamic stats - not clearable; read and written singly, atomically
2) grouped stats - must be updated together
---------------------------------------------------------------------------------
Types Of Stats:
---------------
* event system stats (one way to maintain these is sketched below)
	** average idle time per thread
	** variance of idle time per thread
	** average event exec rate per thread
	** variance of event exec rate per thread
	** average execution time per event
	** variance of execution time per event
	** average failed mutexes per event
	** variance of failed mutexes per event
	** average time lag per event
	** variance of time lag per event

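The per-thread average/variance pairs above could be maintained
incrementally with Welford's online algorithm; this sketch uses made-up
names and is not taken from the event system:

    #include <cstdint>

    // Online mean/variance for a stream of samples (e.g. per-event
    // execution times), using Welford's algorithm.
    struct MeanVariance {
      uint64_t n    = 0;
      double   mean = 0.0;
      double   m2   = 0.0;   // sum of squared deviations from the mean

      void add_sample(double x)
      {
        ++n;
        double delta = x - mean;
        mean += delta / n;
        m2   += delta * (x - mean);
      }

      double variance() const { return n > 1 ? m2 / (n - 1) : 0.0; }
    };

Each thread can keep its own MeanVariance instance and fold in a sample
as every event completes, avoiding cross-thread synchronization.
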
* socks processor stats
	** connections unsuccessful
	** connections successful
	** connections currently open

---------------------------------------------------------------------------------
* cache stats
	*** total size in bytes
	*** # cache operations in progress
	*** Read bandwidth from the disk
	*** Write bandwidth to the disk
	*** RAM cache hit-rate.
	*** Read operations per second.
	*** Write operations per second.
	*** Update operations per second.
	*** Delete operations per second.
	*** Operations per second.

---------------------------------------------------------------------------------
* hostdb/dns stats
	** hostdb
		*** total entries
		*** total number of lookups
		*** total number of hits
		*** total number of misses
		*** hit rate??? (or do we just compute this)
	// to help tuning
		*** average TTL remaining for hosts looked up
		*** total number re-DNS because of re-DNS on reload
		*** total number re-DNS because TTL expired
	** dns
		*** total number of lookups
		*** total number of hits
		*** total number of misses
		*** average service time
	// to help tuning
		*** total number of retries
		*** total number which failed with too many retries
---------------------------------------------------------------------------------
* network stats
>	** incoming connections stats
  shouldn't these stats be by protocol???
>		*** average b/w for each connection
  at the net level, I don't always know when I should be expecting bytes.
  for example, a keepalive connection may expect more bytes and then not
  get them.  When do I start and stop the timer???  This needs some thought.
>		*** average latency for each connection
  same as above.  perhaps
		*** latency to first byte on an accept
  however, for "latency to first byte of a response" I would need to know when
  the request was done, but that is a protocol-level issue.
>		*** number of documents/connection (keep-alive)
  this is a protocol issue (http)
>		*** number aborted
  I could record how many received a do_io(VIO::ABORT) for what that
  is worth... maybe this is also a protocol issue
		*** high/low watermark number of simultaneous connections
		*** high/low watermark number of connection time
  is this last one useful??
>	** outgoing connections stats
>		*** average b/w for each connection
>		*** average latency for each connection
>		*** number of documents/connection (keep-alive)
>		*** number aborted
  see above
		*** high/low watermark number of simultaneous connections
		*** high/low watermark number of connection time
>	** incoming connections stats
  what does this mean??
>		*** average b/w for each connection
>		*** average latency for each connection
>		*** number aborted
	see above
>		*** high/low watermark number of simultaneous connections
>		*** high/low watermark number of connection time

---------------------------------------------------------------------------------
* Http Stats
	** protocol stats
		*** number of HTTP requests (need for ua->ts, ts->os)
			**** GETs
			**** HEADs
			**** TRACEs
			**** OPTIONs
			**** POSTs
			**** DELETEs
			**** CONNECTs
		*** number of invalid requests
		*** number of broken client connections
		*** number of proxied requests
			**** user-specified
			**** method-specified
			**** config specified (config file)
		*** number of requests with Cookies
		*** number of cache lookups
			**** number of alternates
		*** number of cache hits
			(cache hit, miss, etc. stats should be categorized
			the same way they are categorized in WUTS (squid logs))
			**** fresh hits
			**** stale hits
				***** heuristically stale
				***** expired/max-aged
		*** number of cache writes
		*** number of cache updates
		*** number of cache deletes
		*** number of valid responses
		*** number of invalid responses
		*** number of retried requests
		*** number of broken server connections

		*** For each response code, count the number of responses (need for ua->ts, ts->os)
		*** histogram of HTTP version (need for ua->ts, ts->os)
		*** number of responses with Expires
		*** number of responses with Last-Modified
		*** number of responses with Age
		*** number of responses indicating server clock skew
		*** number of responses preventing cache store
			**** Set-Cookie
			**** Cache-Control: no-store
		*** number of responses proxied directly
	** Timeouts and connection errors:
		*** accept timeout (ua -> proxy only)
		*** background fill timeout (os -> proxy only)
		*** normal inactive timeout (both ua -> proxy and proxy -> os)
		*** normal activity timeout (both ua -> proxy and proxy -> os)
		*** keep-alive idle timeout (both ua -> proxy and proxy -> os)
	** Connections and transaction times:
		*** average request count per connection (both ua and os).
		*** average connection lifetime (both ua and os).
		*** average connection utilization (connection time / transaction count).
		*** average transaction time (from transaction start until transaction end).
		*** average transaction time histogram per document size
		    (see the bucket sketch after this list); sizes are:
		    <= 100 bytes, <= 1K, <= 3K, <= 5K, <= 10K, <= 1M, <= infinity
		*** average transaction processing time (think time).
		*** average transaction processing time (think time) histogram per document size.
	** Transfer rate:
		*** bytes per second, ua connection and os connection.
		*** bytes per second, ua connection and os connection, histogram per document size.
	** Cache Stats
		*** cache lookup time, hit
		*** cache lookup time, miss
		*** cache open read time
		*** cache open write time
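A hedged sketch of the size-bucketed histograms mentioned above (the
bucket limits follow the list; every name here is illustrative, not from
the code base):

    #include <array>
    #include <atomic>
    #include <cstddef>
    #include <cstdint>

    // Document-size buckets as listed above: <= 100 B, <= 1 KB, <= 3 KB,
    // <= 5 KB, <= 10 KB, <= 1 MB, and everything larger.
    constexpr std::array<uint64_t, 6> kBucketLimits = {
      100, 1024, 3 * 1024, 5 * 1024, 10 * 1024, 1024 * 1024};

    struct TimeBySizeHistogram {
      std::array<std::atomic<uint64_t>, 7> total_ms{};  // summed transaction time
      std::array<std::atomic<uint64_t>, 7> count{};     // transactions per bucket

      static std::size_t bucket(uint64_t doc_bytes)
      {
        for (std::size_t i = 0; i < kBucketLimits.size(); ++i) {
          if (doc_bytes <= kBucketLimits[i]) {
            return i;
          }
        }
        return kBucketLimits.size();   // the "<= infinity" bucket
      }

      void record(uint64_t doc_bytes, uint64_t elapsed_ms)
      {
        std::size_t b = bucket(doc_bytes);
        total_ms[b].fetch_add(elapsed_ms, std::memory_order_relaxed);
        count[b].fetch_add(1, std::memory_order_relaxed);
      }
    };

Dividing total_ms[b] by count[b] at read time yields the per-bucket
average transaction time.
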
---------------------------------------------------------------------------------
* logging stats
// bytes moved
STAT(Sum,   log2_stat_bytes_buffered),
STAT(Sum,   log2_stat_bytes_written_to_disk),
STAT(Sum,   log2_stat_bytes_sent_to_network),
STAT(Sum,   log2_stat_bytes_received_from_network),
// I/O
STAT(Count, log2_stat_log_files_open),			UI
STAT(Count, log2_stat_log_files_space_used),		UI
// events
STAT(Count, log2_stat_event_log_error),			UI
STAT(Count, log2_stat_event_log_access),		UI
STAT(Count, log2_stat_event_log_access_fail),
STAT(Count, log2_stat_event_log_access_skip),		UI

---------------------------------------------------------------------------------
------------
Here's my version of the 20-stats list for the UI. I gathered all of the
stats proposed at our meeting last Thursday, added requests from Tait
Kirkham and Brian Totty, then added a few based on pilot site questions
and customers' monitoring scripts. Not all of these additional requests
are necessarily possible to measure. I sorted the proposed statistics
into groups based on the kind of information they provide to the proxy
administrator, then picked the ones that seemed most important to me for
the top-20 list. From the top-level statistics page there should be links
to pages with more detail on each operational area.

A few notes:
"Average" statistics should be calculated over an interval configurable as
1 minute, 5 minutes or 15 minutes. This includes the cache hit rate. The
cache detail info page should probably also show a value for the overall
average. Numbers and measurements in the user interface are presented for
the convenience of the customer, and should be organized to suit customer
needs and interests. Statistics from the operating system, the HTTP state
machine, and anywhere else may be combined to create the info presented
in the UI.
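A minimal sketch of an interval-averaged stat of the kind described above,
assuming a timer calls the sampling function once per configured interval
(60, 300 or 900 seconds); the names are illustrative:

    #include <atomic>
    #include <cstdint>

    // Raw totals grow continuously; a single sampler thread turns the
    // delta since the previous sample into the interval's average.
    struct IntervalHitRate {
      std::atomic<uint64_t> hits{0};
      std::atomic<uint64_t> lookups{0};

      uint64_t last_hits    = 0;   // touched only by the sampler
      uint64_t last_lookups = 0;

      // Called once per configured interval (e.g. every 60, 300 or 900 s).
      double sample_hit_rate()
      {
        uint64_t h  = hits.load(std::memory_order_relaxed);
        uint64_t l  = lookups.load(std::memory_order_relaxed);
        uint64_t dh = h - last_hits;
        uint64_t dl = l - last_lookups;
        last_hits    = h;
        last_lookups = l;
        return dl ? static_cast<double>(dh) / dl : 0.0;
      }
    };

The raw counters stay monotonic, so only the displayed average is
interval-scoped; that also sidesteps the clearing questions raised earlier
in this file.
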
-----------------------------------------------------------

  Statistics in the Traffic Manager UI
  Top "20" Recommended List

Client Responsiveness
avg/max total transaction time -- request received to last byte sent.
number of client aborts
network bytes/sec/client

Proxy Activity
total/max active client connections
total/max active server connections
total/max active cache connections
total/max active parent proxy connections
average transaction rate
total transaction count per protocol {dns, http, rtsp}
total bytes served to clients
total bytes from origin servers
total bytes from parent proxies
bandwidth savings: %  -- bandwidth per interface, per route options?

Cache
cache utilization %
avg cache hit rate
cache revalidates

-----------------------------------------------------------------------------------------------------------------

  Statistics in the Traffic Manager
  The Whole List

Client Responsiveness
avg/max time to first byte -- request received to first byte sent.
avg/max total transaction time -- request received to last byte sent.
avg transaction time by type of transaction:
     cache write
     non-cached
     client revalidate
     proxy revalidate
     proxy refresh
number of client aborts
network bytes/sec/client

Proxy Activity
total/max active client connections
total/max active server connections
total/max active cache connections
number of active connections by connection status: ESTABLISHED, CLOSE_WAIT,
TIME_WAIT, ...
total transaction count per protocol {dns, http, rtsp}
total/max active parent proxy connections
total bytes served to clients
total bytes from origin servers
total bytes from parent proxies
bandwidth savings: bytes / %
number of TCP retransmits per client
ps-like status report of Traffic Server operations

System Utilization
proxy network utilization %
cpu utilization %
disk utilization %
memory (RAM) utilization %

Cache
total cache free space
cache utilization %
avg cache hit rate
cache revalidates:
     client-specified
     expired (expire/max-age)
     expired (heuristic)
cache miss time
cache hit time

Logging
total logging space used
---------------------------------------------------------------------------------
Subject: Re: Simple UI for TS100
Date: Thu, 26 Mar 1998 11:48:15 -0800

From a Support standpoint, I do have some reservations about hacking
information and function out of the Traffic Manager for TS Lite. Less
information in the Monitor makes it more difficult to observe Traffic Server
operation. Less functionality in Configure means more manual editing of
config files and a greater chance of admin errors. These factors could greatly
increase our support problems and costs for "Lite" customers.

Perhaps the TS Lite UI should just remove all cluster-related information
and controls, and material related to any other features that simply will
not be offered on TS Lite, like the maximum network connection limit. Other
counters and controls that relate to functions and features that are
operational in TS Lite could be left alone. They're already present in TS
Classic, so it costs us nothing extra.

-----------------------------------

Within the framework of Adam's proposal, the "Monitor" manager reduces to a
single screen. Great! With only one screen, gotta be sure that the key
top-level statistics are visible. I suggest that cache free space is not
really a very interesting value for most sites. Bandwidth savings is the
value we really want to emphasize, so it must not disappear. So the Dashboard
should have its current "More Info" stats plus
    cache size
    DNS Hit rate
    Bandwidth savings
    tps count (the tps meter is nice, but can't be read very precisely)
The More/Less detail option itself can disappear. The word "cluster" should
not occur anywhere in the TS Lite UI.

In Configure: Server, the node on/off switch and the cluster restart button
are redundant in TS Lite. We probably ought to lose the cluster button. A
configurable Traffic Server name also seems useless when you can't have a
cluster.

For Configure: Logging, perhaps custom log formatting should not be offered
for TS Lite?

With reduced UI configuration capability, it may be appropriate to remove
the Configure: Snapshots screen.

Adam Beguelin wrote:

> For TS100 we need a simplified UI.  Here's a first cut at how we should
> simplify the UI.
>
> For most of the removed pages we will simply use the installed defaults.
>
> Comments?
>
>         Adam
>
> On the dashboard add these stats:
>  o Cache Free Space
>  o Cache Size
>  o DNS Hit rate
>
> Remove these monitor pages:
>  o node
>  o Graphs
>  o Protocols
>  o Cache
>  o Other
>
> Remove these sections under configure:
>  o Server/Throttling of Network Connections
>  o Protocols/HTTP timeouts (Leave anon and IP)
>  o Protocols/SSL
>  o Cache/Storage
>  o Cache/Garbage Collection
>  o Cache/Freshness
>  o Cache/Variable Content
>  o HostDB page
---------------------------------------------------------------------------------
---------------------------------------------------------------------------------
---------------------------------------------------------------------------------