| Name | Date | Size |
|------|------|------|
| .. | 03-May-2022 | - |
| collect/ | 21-Oct-2021 | - |
| environment/ | 21-Oct-2021 | - |
| examples/minimal/ | 21-Oct-2021 | - |
| images/ | 03-May-2022 | - |
| internal/ | 21-Oct-2021 | - |
| monotime/ | 21-Oct-2021 | - |
| present/ | 21-Oct-2021 | - |
| AUTHORS | 21-Oct-2021 | 19 |
| LICENSE | 21-Oct-2021 | 10 KiB |
| README.md | 21-Oct-2021 | 20.8 KiB |
| callers.go | 21-Oct-2021 | 2.2 KiB |
| callers_test.go | 21-Oct-2021 | 590 |
| cas_safe.go | 21-Oct-2021 | 1.8 KiB |
| cas_unsafe.go | 21-Oct-2021 | 1.8 KiB |
| counter.go | 21-Oct-2021 | 3.2 KiB |
| ctx.go | 21-Oct-2021 | 9.6 KiB |
| ctx_test.go | 21-Oct-2021 | 1.9 KiB |
| dist.go | 21-Oct-2021 | 2 KiB |
| distgen.go.m4 | 21-Oct-2021 | 4.6 KiB |
| doc.go | 21-Oct-2021 | 19.2 KiB |
| durdist.go | 21-Oct-2021 | 4.7 KiB |
| error_names.go | 21-Oct-2021 | 3.2 KiB |
| error_names_ae.go | 21-Oct-2021 | 687 |
| error_names_syscall.go | 21-Oct-2021 | 733 |
| floatdist.go | 21-Oct-2021 | 4.5 KiB |
| func.go | 21-Oct-2021 | 1.9 KiB |
| func_test.go | 21-Oct-2021 | 948 |
| funcset.go | 21-Oct-2021 | 1.8 KiB |
| funcstats.go | 21-Oct-2021 | 5.6 KiB |
| go.mod | 21-Oct-2021 | 112 |
| go.sum | 21-Oct-2021 | 504 |
| id.go | 21-Oct-2021 | 1.3 KiB |
| intdist.go | 21-Oct-2021 | 4.5 KiB |
| meter.go | 21-Oct-2021 | 4.8 KiB |
| registry.go | 21-Oct-2021 | 7.4 KiB |
| rng.go | 21-Oct-2021 | 3.8 KiB |
| rng_test.go | 21-Oct-2021 | 688 |
| scope.go | 21-Oct-2021 | 9.6 KiB |
| span.go | 21-Oct-2021 | 4.2 KiB |
| spanbag.go | 21-Oct-2021 | 1.4 KiB |
| spinlock.go | 21-Oct-2021 | 873 |
| stats.go | 21-Oct-2021 | 2.4 KiB |
| struct.go | 21-Oct-2021 | 1.8 KiB |
| tags.go | 21-Oct-2021 | 3.9 KiB |
| tags_test.go | 21-Oct-2021 | 2.3 KiB |
| task.go | 21-Oct-2021 | 2.3 KiB |
| timer.go | 21-Oct-2021 | 2.2 KiB |
| trace.go | 21-Oct-2021 | 4.4 KiB |
| transform.go | 21-Oct-2021 | 2.5 KiB |
| val.go | 21-Oct-2021 | 5.8 KiB |

README.md

# ![monkit](https://raw.githubusercontent.com/spacemonkeygo/monkit/master/images/logo.png)

Package monkit is a flexible code instrumenting and data collection library.

See documentation at https://godoc.org/github.com/spacemonkeygo/monkit/v3

Software is hard. Like, really hard.
[Just the worst](http://www.stilldrinking.org/programming-sucks). Sometimes it
feels like we've constructed a field where the whole point is to see how
tangled we can get ourselves before seeing if we can get tangled up more while
trying to get untangled.

Many software engineering teams are coming to realize (some more slowly than
others) that collecting data over time about how their systems are functioning
is a superpower you can't turn back from. Some teams are calling this
[Telemetry](http://techblog.netflix.com/2014/12/introducing-atlas-netflixs-primary.html),
[Observability](https://blog.twitter.com/2013/observability-at-twitter), or
describing it more basically through subcomponents such as
[distributed tracing](http://research.google.com/pubs/pub36356.html),
[time-series data](https://influxdata.com/), or even just
[metrics](http://metrics.dropwizard.io/). We've been calling it monitoring, but
geez, I suppose if trends continue and you want to do this yourself, your first
step should be to open a thesaurus and pick an unused term.

I'm not here to tell you about our whole platform. Instead, I'm here to
explain a redesign of a Go library for instrumenting your Go programs that we
rather quietly launched a few years ago. If you are already using version 1 of
our [old library](https://github.com/spacemonkeygo/monitor), we're sorry, but
we rewrote it from scratch and renamed it to monkit. This one (this one!) is
better - you should switch!

I'm going to try to sell you on this library as fast as I can.

## Example usage

```go
package main

import (
  "context"
  "fmt"
  "log"
  "math/rand"
  "net/http"
  "time"

  "github.com/spacemonkeygo/monkit/v3"
  "github.com/spacemonkeygo/monkit/v3/environment"
  "github.com/spacemonkeygo/monkit/v3/present"
)

var mon = monkit.Package()

func main() {
  environment.Register(monkit.Default)

  go http.ListenAndServe("127.0.0.1:9000", present.HTTP(monkit.Default))

  for {
    time.Sleep(time.Second)
    log.Println(DoStuff(context.Background()))
  }
}

func DoStuff(ctx context.Context) (err error) {
  defer mon.Task()(&ctx)(&err)

  result, err := ComputeThing(ctx, 1, 2)
  if err != nil {
    return err
  }

  fmt.Println(result)
  return
}

func ComputeThing(ctx context.Context, arg1, arg2 int) (res int, err error) {
  defer mon.Task()(&ctx)(&err)

  timer := mon.Timer("subcomputation").Start()
  res = arg1 + arg2
  timer.Stop()

  if res == 3 {
    mon.Event("hit 3")
  }

  mon.BoolVal("was-4").Observe(res == 4)
  mon.IntVal("res").Observe(int64(res))
  mon.DurationVal("took").Observe(time.Second + time.Duration(rand.Intn(int(10*time.Second))))
  mon.Counter("calls").Inc(1)
  mon.Gauge("arg1", func() float64 { return float64(arg1) })
  mon.Meter("arg2").Mark(arg2)

  return arg1 + arg2, nil
}
```

## Metrics

We've got tools that capture distribution information (including quantiles)
about int64, float64, and bool types. We have tools that capture data about
events (we've got meters for deltas, rates, etc). We have rich tools for
capturing information about tasks and functions, and literally anything that
can generate a name and a number.

Almost as importantly, the amount of boilerplate and code you have to
write to get these features is minimal. Data that's hard to measure
probably won't get measured.

This data can be collected and sent to [Graphite](http://graphite.wikidot.com/)
or any other time-series database.
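
To give a feel for the Graphite end of that pipeline, here's a minimal sketch
that renders "a name and a number" as one line of Graphite's plaintext format
(`<path> <value> <timestamp>`). The `graphiteLine` helper and its
space-to-underscore rule are assumptions made for this sketch, not part of
monkit, and pulling the stats out of monkit isn't shown.

```go
package main

import (
	"fmt"
	"strings"
)

// graphiteLine renders one stat as "<path> <value> <unix timestamp>\n".
// The format is space-delimited, so this sketch (an assumption, not monkit
// behavior) replaces spaces in the stat name with underscores.
func graphiteLine(name string, value float64, unix int64) string {
	path := strings.ReplaceAll(name, " ", "_")
	return fmt.Sprintf("%s %f %d\n", path, value, unix)
}

func main() {
	// The timestamp is hard-coded so the output is reproducible.
	fmt.Print(graphiteLine("env.os.fds", 120, 1500000000))
	fmt.Print(graphiteLine("server.store.success times avg", 1.899133, 1500000000))
}
```

In a real collector you'd write these lines to a TCP connection to your
Graphite host instead of printing them.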

Here's a selection of live stats from one of our storage nodes:

```
env.os.fds      120.000000
env.os.proc.stat.Minflt 81155.000000
env.os.proc.stat.Cminflt        11789.000000
env.os.proc.stat.Majflt 10.000000
env.os.proc.stat.Cmajflt        6.000000
...

env.process.control     1.000000
env.process.crc 3819014369.000000
env.process.uptime      163225.292925
env.runtime.goroutines  52.000000
env.runtime.memory.Alloc        2414080.000000
...

env.rusage.Maxrss       26372.000000
...

sm/flud/csl/client.(*CSLClient).Verify.current  0.000000
sm/flud/csl/client.(*CSLClient).Verify.success  788.000000
sm/flud/csl/client.(*CSLClient).Verify.error volume missing     91.000000
sm/flud/csl/client.(*CSLClient).Verify.error dial error 1.000000
sm/flud/csl/client.(*CSLClient).Verify.panics   0.000000
sm/flud/csl/client.(*CSLClient).Verify.success times min        0.102214
sm/flud/csl/client.(*CSLClient).Verify.success times avg        1.899133
sm/flud/csl/client.(*CSLClient).Verify.success times max        8.601230
sm/flud/csl/client.(*CSLClient).Verify.success times recent     2.673128
sm/flud/csl/client.(*CSLClient).Verify.failure times min        0.682881
sm/flud/csl/client.(*CSLClient).Verify.failure times avg        3.936571
sm/flud/csl/client.(*CSLClient).Verify.failure times max        6.102318
sm/flud/csl/client.(*CSLClient).Verify.failure times recent     2.208020
sm/flud/csl/server.store.avg    710800.000000
sm/flud/csl/server.store.count  271.000000
sm/flud/csl/server.store.max    3354194.000000
sm/flud/csl/server.store.min    467.000000
sm/flud/csl/server.store.recent 1661376.000000
sm/flud/csl/server.store.sum    192626890.000000
...
```

## Call graphs

This library generates call graphs of your live process for you.

These call graphs aren't created through sampling. They're full pictures of all
of the interesting functions you've annotated, along with quantile information
about their successes, failures, how often they panic, return an error (if so
instrumented), how many are currently running, etc.

The data can be returned in dot format, in json, in text, and can be about
just the functions that are currently executing, or all the functions the
monitoring system has ever seen.

Here's another example of one of our production nodes:

![callgraph](https://raw.githubusercontent.com/spacemonkeygo/monkit/master/images/callgraph2.png)

## Trace graphs

This library generates trace graphs of your live process for you directly,
without requiring standing up some tracing system such as Zipkin (though you
can do that too).

Inspired by [Google's Dapper](http://research.google.com/pubs/pub36356.html)
and [Twitter's Zipkin](http://zipkin.io), we have process-internal trace
graphs, triggerable by a number of different methods.

You get this trace information for free whenever you use
[Go contexts](https://blog.golang.org/context) and function monitoring. The
output formats are svg and json.

Additionally, the library supports trace observation plugins, and we've written
[a plugin that sends this data to Zipkin](http://github.com/spacemonkeygo/monkit-zipkin).

![trace](https://raw.githubusercontent.com/spacemonkeygo/monkit/master/images/trace.png)

## History

Before our crazy
[Go rewrite of everything](https://www.spacemonkey.com/blog/posts/go-space-monkey)
(and before we had even seen Google's Dapper paper), we were a Python shop, and
all of our "interesting" functions were decorated with a helper that collected
timing information and sent it to Graphite.

When we transliterated to Go, we wanted to preserve that functionality, so the
first version of our monitoring package was born.

Over time it started to get janky, especially as we found Zipkin and started
adding tracing functionality to it. We rewrote all of our Go code to use Google
contexts, and then realized we could get call graph information. We decided a
refactor and then an all-out rethinking of our monitoring package was best,
and so now we have this library.

## Aside about contexts

Sometimes you really want callstack contextual information without having to
pass arguments through everything on the call stack. In other languages, many
people implement this with thread-local storage.

Example: let's say you have written a big system that responds to user
requests. All of your libraries log using your log library. During initial
development everything is easy to debug, since there's low user load, but now
you've scaled and there's OVER TEN USERS and it's kind of hard to tell what log
lines were caused by what. Wouldn't it be nice to add request ids to all of the
log lines kicked off by that request? Then you could grep for all log lines
caused by a specific request id. Geez, it would suck to have to pass all
contextual debugging information through all of your callsites.

Google solved this problem by always passing a `context.Context` interface
through from call to call. A `Context` is basically just a mapping of arbitrary
keys to arbitrary values that users can add new values for. This way if you
decide to add a request context, you can add it to your `Context` and then all
callsites that descend from that place will have the new data in their contexts.
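
Here's a minimal sketch of the request id idea using only the standard
library's `context` package. The names `ctxKey`, `withRequestID`, `logLine`,
`lookupUser`, and `handleRequest` are made up for illustration.

```go
package main

import (
	"context"
	"fmt"
)

// ctxKey is an unexported key type, so other packages can't collide with it.
type ctxKey struct{}

// withRequestID returns a child context that carries the given request id.
func withRequestID(ctx context.Context, id string) context.Context {
	return context.WithValue(ctx, ctxKey{}, id)
}

// logLine prefixes msg with the request id stored in ctx, if there is one.
func logLine(ctx context.Context, msg string) string {
	if id, ok := ctx.Value(ctxKey{}).(string); ok {
		return "[req " + id + "] " + msg
	}
	return msg
}

// lookupUser is a hypothetical callee somewhere deep in the call stack. It
// never sees the request id as an argument, only the context.
func lookupUser(ctx context.Context) string {
	return logLine(ctx, "looking up user")
}

// handleRequest tags the context once, at the entry point.
func handleRequest(id string) string {
	ctx := withRequestID(context.Background(), id)
	return lookupUser(ctx)
}

func main() {
	fmt.Println(handleRequest("42")) // prints "[req 42] looking up user"
}
```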

It is admittedly very verbose to add contexts to every function call.
Painfully so. I hope to write more about it in the future, but [Google also
wrote up their thoughts about it](https://blog.golang.org/context), which you
can go read. For now, just swallow your disgust and let's keep moving.

## Motivating program

Let's make a super simple [Varnish](https://www.varnish-cache.org/) clone.
Open up gedit! (Okay just kidding, open whatever text editor you want.)

For this motivating program, we won't even add the caching, though there are
comments for where to add it if you'd like. For now, let's just make a
barebones system that will proxy HTTP requests. We'll call it VLite, but
maybe we should call it VReallyLite.

```go
package main

import (
  "flag"
  "net/http"
  "net/http/httputil"
  "net/url"
)

type VLite struct {
  target *url.URL
  proxy  *httputil.ReverseProxy
}

func NewVLite(target *url.URL) *VLite {
  return &VLite{
    target: target,
    proxy:  httputil.NewSingleHostReverseProxy(target),
  }
}

func (v *VLite) Proxy(w http.ResponseWriter, r *http.Request) {
  r.Host = v.target.Host // let the proxied server get the right vhost
  v.proxy.ServeHTTP(w, r)
}

func (v *VLite) ServeHTTP(w http.ResponseWriter, r *http.Request) {
  // here's where you'd put caching logic
  v.Proxy(w, r)
}

func main() {
  target := flag.String(
    "proxy",
    "http://hasthelargehadroncolliderdestroyedtheworldyet.com/",
    "server to cache")
  flag.Parse()
  targetURL, err := url.Parse(*target)
  if err != nil {
    panic(err)
  }
  panic(http.ListenAndServe(":8080", NewVLite(targetURL)))
}
```

Build and run this and open `localhost:8080` in your browser. If you use the
default proxy target, it should inform you that the world hasn't been
destroyed yet.

## Adding basic instrumentation

The first thing you'll want to do is add the small amount of boilerplate to
make the instrumentation we're going to add to your process observable later.

Import the basic monkit packages:

```go
"github.com/spacemonkeygo/monkit/v3"
"github.com/spacemonkeygo/monkit/v3/environment"
"github.com/spacemonkeygo/monkit/v3/present"
```

and then register environmental statistics and kick off a goroutine in your
main function to serve debug requests:

```go
environment.Register(monkit.Default)
go http.ListenAndServe("localhost:9000", present.HTTP(monkit.Default))
```

Rebuild, and then check out `localhost:9000/stats` (or
`localhost:9000/stats/json`, if you prefer) in your browser!

## Request contexts

Remember what I said about [Google's contexts](https://blog.golang.org/context)?
It might seem a bit overkill for such a small project, but it's time to add
them.

To help out here, I've created a library that constructs contexts for you
for incoming HTTP requests. Nothing that's about to happen requires my
[webhelp library](https://godoc.org/github.com/jtolds/webhelp), but here is the
code now refactored to receive and pass contexts through our two per-request
calls.

```go
package main

import (
  "context"
  "flag"
  "net/http"
  "net/http/httputil"
  "net/url"

  "github.com/jtolds/webhelp"
  "github.com/spacemonkeygo/monkit/v3"
  "github.com/spacemonkeygo/monkit/v3/environment"
  "github.com/spacemonkeygo/monkit/v3/present"
)

type VLite struct {
  target *url.URL
  proxy  *httputil.ReverseProxy
}

func NewVLite(target *url.URL) *VLite {
  return &VLite{
    target: target,
    proxy:  httputil.NewSingleHostReverseProxy(target),
  }
}

func (v *VLite) Proxy(ctx context.Context, w http.ResponseWriter, r *http.Request) {
  r.Host = v.target.Host // let the proxied server get the right vhost
  v.proxy.ServeHTTP(w, r)
}

func (v *VLite) HandleHTTP(ctx context.Context, w webhelp.ResponseWriter, r *http.Request) error {
  // here's where you'd put caching logic
  v.Proxy(ctx, w, r)
  return nil
}

func main() {
  target := flag.String(
    "proxy",
    "http://hasthelargehadroncolliderdestroyedtheworldyet.com/",
    "server to cache")
  flag.Parse()
  targetURL, err := url.Parse(*target)
  if err != nil {
    panic(err)
  }
  environment.Register(monkit.Default)
  go http.ListenAndServe("localhost:9000", present.HTTP(monkit.Default))
  panic(webhelp.ListenAndServe(":8080", NewVLite(targetURL)))
}
```

You can create a new context for a request however you want. One reason to use
something like webhelp is that the cancelation feature of Contexts is hooked
up to the HTTP request getting canceled.
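
If you'd rather not pull in a helper library, here's one possible sketch of
the "however you want" route using only the standard library. It relies on
`(*http.Request).Context()`, which net/http cancels when the client connection
closes or the handler returns. The `ctxHandler` type and the `hello` and
`probe` helpers are hypothetical names for this sketch, not webhelp or monkit
API.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// ctxHandler is a hypothetical handler signature that takes a context first
// and reports failures through a returned error.
type ctxHandler func(ctx context.Context, w http.ResponseWriter, r *http.Request) error

// ServeHTTP adapts ctxHandler to http.Handler by passing along the request's
// own context.
func (h ctxHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	if err := h(r.Context(), w, r); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
	}
}

// hello bails out early if the request was already canceled.
func hello(ctx context.Context, w http.ResponseWriter, r *http.Request) error {
	if err := ctx.Err(); err != nil {
		return err
	}
	_, err := fmt.Fprint(w, "hello")
	return err
}

// probe runs a handler against an in-memory request and returns the status
// code and body.
func probe(h http.Handler) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest("GET", "/", nil))
	return rec.Code, rec.Body.String()
}

func main() {
	code, body := probe(ctxHandler(hello))
	fmt.Println(code, body) // prints "200 hello"
}
```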

## Monitor some requests

Let's start to get statistics about how many requests we receive! First, this
package (main) will need to get a monitoring Scope. Add this global definition
right after all your imports, much like you'd create a logger with many logging
libraries:

```go
var mon = monkit.Package()
```

Now, make the error return value of HandleHTTP named (so, `(err error)`), and
add this defer line as the very first instruction of HandleHTTP:

```go
func (v *VLite) HandleHTTP(ctx context.Context, w webhelp.ResponseWriter, r *http.Request) (err error) {
  defer mon.Task()(&ctx)(&err)
```

Let's also add the same line (albeit modified for the lack of error) to
Proxy, replacing `&err` with `nil`:

```go
func (v *VLite) Proxy(ctx context.Context, w http.ResponseWriter, r *http.Request) {
  defer mon.Task()(&ctx)(nil)
```

You should now have something like:

```go
package main

import (
  "context"
  "flag"
  "net/http"
  "net/http/httputil"
  "net/url"

  "github.com/jtolds/webhelp"
  "github.com/spacemonkeygo/monkit/v3"
  "github.com/spacemonkeygo/monkit/v3/environment"
  "github.com/spacemonkeygo/monkit/v3/present"
)

var mon = monkit.Package()

type VLite struct {
  target *url.URL
  proxy  *httputil.ReverseProxy
}

func NewVLite(target *url.URL) *VLite {
  return &VLite{
    target: target,
    proxy:  httputil.NewSingleHostReverseProxy(target),
  }
}

func (v *VLite) Proxy(ctx context.Context, w http.ResponseWriter, r *http.Request) {
  defer mon.Task()(&ctx)(nil)
  r.Host = v.target.Host // let the proxied server get the right vhost
  v.proxy.ServeHTTP(w, r)
}

func (v *VLite) HandleHTTP(ctx context.Context, w webhelp.ResponseWriter, r *http.Request) (err error) {
  defer mon.Task()(&ctx)(&err)
  // here's where you'd put caching logic
  v.Proxy(ctx, w, r)
  return nil
}

func main() {
  target := flag.String(
    "proxy",
    "http://hasthelargehadroncolliderdestroyedtheworldyet.com/",
    "server to cache")
  flag.Parse()
  targetURL, err := url.Parse(*target)
  if err != nil {
    panic(err)
  }
  environment.Register(monkit.Default)
  go http.ListenAndServe("localhost:9000", present.HTTP(monkit.Default))
  panic(webhelp.ListenAndServe(":8080", NewVLite(targetURL)))
}
```

We'll unpack what's going on here, but for now:

 * Rebuild and restart!
 * Trigger a full refresh at `localhost:8080` to make sure your new HTTP
   handler runs
 * Visit `localhost:9000/stats` and then `localhost:9000/funcs`

For this new funcs dataset, if you want a graph, you can download a dot
graph at `localhost:9000/funcs/dot` and json information from
`localhost:9000/funcs/json`.

You should see something like:

```
[3693964236144930897] main.(*VLite).HandleHTTP
  parents: entry
  current: 0, highwater: 1, success: 2, errors: 0, panics: 0
  success times:
    0.00: 63.930436ms
    0.10: 70.482159ms
    0.25: 80.309745ms
    0.50: 96.689054ms
    0.75: 113.068363ms
    0.90: 122.895948ms
    0.95: 126.17181ms
    1.00: 129.447675ms
    avg: 96.689055ms
  failure times:
    0.00: 0
    0.10: 0
    0.25: 0
    0.50: 0
    0.75: 0
    0.90: 0
    0.95: 0
    1.00: 0
    avg: 0
```

with a similar report for the Proxy method, or a graph like:

![handlehttp](https://raw.githubusercontent.com/spacemonkeygo/monkit/master/images/handlehttp.png)

This data reports the overall callgraph of execution for known traces, along
with how many of each function are currently running, the most running
concurrently (the highwater), how many were successful along with quantile
timing information, how many errors there were (with quantile timing
information if applicable), and how many panics there were. Since the Proxy
method isn't capturing a returned err value, and since HandleHTTP always
returns nil, this example won't ever have failures.
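
To make "current" and "highwater" concrete, here's a toy sketch of how such
counters could be kept. The `runCounter` type is a stand-in invented for this
example. monkit's own bookkeeping may well be structured differently.

```go
package main

import (
	"fmt"
	"sync"
)

// runCounter is a hypothetical stand-in for per-function bookkeeping.
type runCounter struct {
	mu        sync.Mutex
	current   int // invocations running right now
	highwater int // most invocations ever running at the same time
}

// enter records the start of an invocation and returns a func that records
// its end.
func (c *runCounter) enter() (exit func()) {
	c.mu.Lock()
	c.current++
	if c.current > c.highwater {
		c.highwater = c.current
	}
	c.mu.Unlock()
	return func() {
		c.mu.Lock()
		c.current--
		c.mu.Unlock()
	}
}

func main() {
	var c runCounter
	exitA := c.enter()
	exitB := c.enter() // two invocations overlap here
	exitA()
	exitB()
	fmt.Println(c.current, c.highwater) // prints "0 2"
}
```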

If you're wondering about the success count being higher than you expected,
keep in mind your browser probably requested a favicon.ico.

Cool, eh?

## How it works

```go
defer mon.Task()(&ctx)(&err)
```

is an interesting line of code - there are three function calls. Per the Go
spec, every call except the very last one runs when the defer statement
executes (here, at the start of the function). Only the last call is deferred
until the function returns.
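
You can see that evaluation order with a tiny standard-library experiment.
The `stage1` function below is a made-up stand-in with the same
call-returning-a-call shape as `mon.Task()`. It just records when each stage
runs.

```go
package main

import "fmt"

// events records the order in which the stages run.
var events []string

// stage1 stands in for mon.Task(): it runs right away and returns stage two.
func stage1() func(name string) func() {
	events = append(events, "stage1")
	return func(name string) func() {
		// Stands in for the (&ctx) call: also runs right away.
		events = append(events, "stage2 "+name)
		return func() {
			// Stands in for the (&err) call: runs when the caller returns.
			events = append(events, "stage3 "+name)
		}
	}
}

func work() {
	defer stage1()("work")()
	events = append(events, "body")
}

func main() {
	work()
	fmt.Println(events) // prints "[stage1 stage2 work body stage3 work]"
}
```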

The first function call, mon.Task(), creates or looks up a wrapper around a
Func. You could get this yourself by requesting mon.Func() inside of the
appropriate function or mon.FuncNamed(). Both mon.Task() and mon.Func()
are inspecting runtime.Caller to determine the name of the function. Because
this is a heavy operation, you can actually store the result of mon.Task() and
reuse it some other way if you prefer, so instead of

```go
func MyFunc(ctx context.Context) (err error) {
  defer mon.Task()(&ctx)(&err)
  // ... function body ...
  return nil
}
```

you could instead use

```go
var myFuncMon = mon.Task()

func MyFunc(ctx context.Context) (err error) {
  defer myFuncMon(&ctx)(&err)
  // ... function body ...
  return nil
}
```

which is more performant every time after the first time. runtime.Caller only
gets called once.

Careful! Don't use the same myFuncMon in different functions unless you want to
screw up your statistics!

The second function call starts all the various stopwatches and bookkeeping to
keep track of the function. It also mutates the context pointer it's given to
extend the context with information about what current span (in Zipkin
parlance) is active. Notably, you *can* pass nil for the context if you really
don't want a context. You just lose callgraph information.

The last function call stops all the stopwatches and makes a note of any
observed errors or panics (it repanics after observing them).
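
To make those three stages concrete, here's a deliberately tiny toy version
built only on the standard library. It is emphatically not monkit's
implementation: `toyFunc`, `spanKey`, and the rest are invented for this
sketch, it isn't safe for concurrent use, and it skips quantiles entirely. It
only shows the shape: stage two starts a stopwatch and extends the context
through the pointer, and stage three records the outcome and repanics.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// spanKey is the context key for the name of the currently active span.
type spanKey struct{}

// toyFunc is a hypothetical stand-in for per-function bookkeeping.
type toyFunc struct {
	name                    string
	successes, errs, panics int
	parents                 []string
	total                   time.Duration
}

// task mirrors the shape of mon.Task(): it returns stage two, which in turn
// returns stage three.
func (f *toyFunc) task() func(ctx *context.Context) func(err *error) {
	return func(ctx *context.Context) func(err *error) {
		// Stage two: start the stopwatch and extend the caller's context.
		start := time.Now()
		if ctx != nil {
			parent, ok := (*ctx).Value(spanKey{}).(string)
			if !ok {
				parent = "entry"
			}
			f.parents = append(f.parents, parent)
			*ctx = context.WithValue(*ctx, spanKey{}, f.name)
		}
		return func(err *error) {
			// Stage three: runs deferred, so recover() can observe a panic.
			f.total += time.Since(start)
			if r := recover(); r != nil {
				f.panics++
				panic(r) // repanic after taking note
			}
			if err != nil && *err != nil {
				f.errs++
			} else {
				f.successes++
			}
		}
	}
}

var outer, inner = &toyFunc{name: "outer"}, &toyFunc{name: "inner"}

func innerFn(ctx context.Context, fail bool) (err error) {
	defer inner.task()(&ctx)(&err)
	if fail {
		return errors.New("boom")
	}
	return nil
}

func outerFn(ctx context.Context, fail bool) (err error) {
	defer outer.task()(&ctx)(&err)
	return innerFn(ctx, fail) // ctx already names "outer" as the active span
}

// run starts a fresh call chain from an empty context.
func run(fail bool) error { return outerFn(context.Background(), fail) }

func main() {
	run(false)
	run(true)
	// prints "1 1 [entry entry] [outer outer]"
	fmt.Println(outer.successes, outer.errs, outer.parents, inner.parents)
}
```

Because stage two rewrote `ctx` through the pointer, `innerFn` sees "outer" as
its parent without anyone passing that along explicitly.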

## Tracing

Turns out, we don't even need to change our program anymore to get rich tracing
information!

Open your browser and go to `localhost:9000/trace/svg?regex=HandleHTTP`. It
won't load, and in fact, it's waiting for you to open another tab and refresh
`localhost:8080` again. Once you retrigger the actual application behavior,
the trace regex will capture a trace starting on the first function that
matches the supplied regex, and return an svg. Go back to your first tab, and
you should see a relatively uninteresting but super promising svg.

Let's make the trace more interesting. Add a

```go
time.Sleep(200 * time.Millisecond)
```

to your HandleHTTP method, rebuild, and restart. Load `localhost:8080`, then
start a new request to your trace URL, then reload `localhost:8080` again. Flip
back to your trace, and you should see that the Proxy method only takes a
portion of the time of HandleHTTP!

![trace](https://cdn.rawgit.com/spacemonkeygo/monkit/master/images/trace.svg)

There are multiple ways to select a trace. You can select by regex using the
preselect method (default), which first evaluates the regex on all known
functions for sanity checking. Sometimes, however, the function you want to
trace may not yet be known to monkit, in which case you'll want
to turn preselection off. If you get the error "Bad Request: regex preselect
matches 0 functions," you either have a bad regex or you're in this case.

Another way to select a trace is by providing a trace id, which we'll get to
next!

Make sure to check out what the addition of the time.Sleep call did to the
other reports.

## Plugins

It's easy to write plugins for monkit! Check out our first one that exports
data to [Zipkin](http://zipkin.io/)'s Scribe API:

 * https://github.com/spacemonkeygo/monkit-zipkin

We plan to have more (for HTrace, OpenTracing, etc, etc), soon!

## License

Copyright (C) 2016 Space Monkey, Inc.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.