1// Copyright 2009 The Go Authors. All rights reserved.
2// Use of this source code is governed by a BSD-style
3// license that can be found in the LICENSE file.
4
5// Garbage collector (GC).
6//
7// The GC runs concurrently with mutator threads, is type accurate (aka precise), allows multiple
8// GC thread to run in parallel. It is a concurrent mark and sweep that uses a write barrier. It is
9// non-generational and non-compacting. Allocation is done using size segregated per P allocation
10// areas to minimize fragmentation while eliminating locks in the common case.
11//
12// The algorithm decomposes into several steps.
13// This is a high level description of the algorithm being used. For an overview of GC a good
14// place to start is Richard Jones' gchandbook.org.
15//
16// The algorithm's intellectual heritage includes Dijkstra's on-the-fly algorithm, see
17// Edsger W. Dijkstra, Leslie Lamport, A. J. Martin, C. S. Scholten, and E. F. M. Steffens. 1978.
18// On-the-fly garbage collection: an exercise in cooperation. Commun. ACM 21, 11 (November 1978),
19// 966-975.
20// For journal quality proofs that these steps are complete, correct, and terminate see
21// Hudson, R., and Moss, J.E.B. Copying Garbage Collection without stopping the world.
22// Concurrency and Computation: Practice and Experience 15(3-5), 2003.
23//
24// 1. GC performs sweep termination.
25//
26//    a. Stop the world. This causes all Ps to reach a GC safe-point.
27//
28//    b. Sweep any unswept spans. There will only be unswept spans if
29//    this GC cycle was forced before the expected time.
30//
31// 2. GC performs the mark phase.
32//
33//    a. Prepare for the mark phase by setting gcphase to _GCmark
34//    (from _GCoff), enabling the write barrier, enabling mutator
35//    assists, and enqueueing root mark jobs. No objects may be
36//    scanned until all Ps have enabled the write barrier, which is
37//    accomplished using STW.
38//
39//    b. Start the world. From this point, GC work is done by mark
40//    workers started by the scheduler and by assists performed as
41//    part of allocation. The write barrier shades both the
42//    overwritten pointer and the new pointer value for any pointer
43//    writes (see mbarrier.go for details). Newly allocated objects
44//    are immediately marked black.
45//
46//    c. GC performs root marking jobs. This includes scanning all
47//    stacks, shading all globals, and shading any heap pointers in
48//    off-heap runtime data structures. Scanning a stack stops a
49//    goroutine, shades any pointers found on its stack, and then
50//    resumes the goroutine.
51//
52//    d. GC drains the work queue of grey objects, scanning each grey
53//    object to black and shading all pointers found in the object
54//    (which in turn may add those pointers to the work queue).
55//
56//    e. Because GC work is spread across local caches, GC uses a
57//    distributed termination algorithm to detect when there are no
58//    more root marking jobs or grey objects (see gcMarkDone). At this
59//    point, GC transitions to mark termination.
60//
61// 3. GC performs mark termination.
62//
63//    a. Stop the world.
64//
65//    b. Set gcphase to _GCmarktermination, and disable workers and
66//    assists.
67//
68//    c. Perform housekeeping like flushing mcaches.
69//
70// 4. GC performs the sweep phase.
71//
72//    a. Prepare for the sweep phase by setting gcphase to _GCoff,
73//    setting up sweep state and disabling the write barrier.
74//
75//    b. Start the world. From this point on, newly allocated objects
76//    are white, and allocating sweeps spans before use if necessary.
77//
78//    c. GC does concurrent sweeping in the background and in response
79//    to allocation. See description below.
80//
81// 5. When sufficient allocation has taken place, replay the sequence
82// starting with 1 above. See discussion of GC rate below.
83
84// Concurrent sweep.
85//
86// The sweep phase proceeds concurrently with normal program execution.
87// The heap is swept span-by-span both lazily (when a goroutine needs another span)
88// and concurrently in a background goroutine (this helps programs that are not CPU bound).
89// At the end of STW mark termination all spans are marked as "needs sweeping".
90//
91// The background sweeper goroutine simply sweeps spans one-by-one.
92//
93// To avoid requesting more OS memory while there are unswept spans, when a
94// goroutine needs another span, it first attempts to reclaim that much memory
95// by sweeping. When a goroutine needs to allocate a new small-object span, it
96// sweeps small-object spans for the same object size until it frees at least
97// one object. When a goroutine needs to allocate large-object span from heap,
98// it sweeps spans until it frees at least that many pages into heap. There is
99// one case where this may not suffice: if a goroutine sweeps and frees two
100// nonadjacent one-page spans to the heap, it will allocate a new two-page
101// span, but there can still be other one-page unswept spans which could be
102// combined into a two-page span.
103//
104// It's critical to ensure that no operations proceed on unswept spans (that would corrupt
105// mark bits in GC bitmap). During GC all mcaches are flushed into the central cache,
106// so they are empty. When a goroutine grabs a new span into mcache, it sweeps it.
107// When a goroutine explicitly frees an object or sets a finalizer, it ensures that
108// the span is swept (either by sweeping it, or by waiting for the concurrent sweep to finish).
109// The finalizer goroutine is kicked off only when all spans are swept.
110// When the next GC starts, it sweeps all not-yet-swept spans (if any).
111
112// GC rate.
113// Next GC is after we've allocated an extra amount of memory proportional to
114// the amount already in use. The proportion is controlled by GOGC environment variable
115// (100 by default). If GOGC=100 and we're using 4M, we'll GC again when we get to 8M
116// (this mark is tracked in next_gc variable). This keeps the GC cost in linear
117// proportion to the allocation cost. Adjusting GOGC just changes the linear constant
118// (and also the amount of extra memory used).
119
120// Oblets
121//
122// In order to prevent long pauses while scanning large objects and to
123// improve parallelism, the garbage collector breaks up scan jobs for
124// objects larger than maxObletBytes into "oblets" of at most
125// maxObletBytes. When scanning encounters the beginning of a large
126// object, it scans only the first oblet and enqueues the remaining
127// oblets as new scan jobs.
128
129package runtime
130
131import (
132	"internal/cpu"
133	"runtime/internal/atomic"
134	"unsafe"
135)
136
137const (
138	_DebugGC         = 0
139	_ConcurrentSweep = true
140	_FinBlockSize    = 4 * 1024
141
142	// debugScanConservative enables debug logging for stack
143	// frames that are scanned conservatively.
144	debugScanConservative = false
145
146	// sweepMinHeapDistance is a lower bound on the heap distance
147	// (in bytes) reserved for concurrent sweeping between GC
148	// cycles.
149	sweepMinHeapDistance = 1024 * 1024
150)
151
152// heapminimum is the minimum heap size at which to trigger GC.
153// For small heaps, this overrides the usual GOGC*live set rule.
154//
155// When there is a very small live set but a lot of allocation, simply
156// collecting when the heap reaches GOGC*live results in many GC
157// cycles and high total per-GC overhead. This minimum amortizes this
158// per-GC overhead while keeping the heap reasonably small.
159//
160// During initialization this is set to 4MB*GOGC/100. In the case of
161// GOGC==0, this will set heapminimum to 0, resulting in constant
162// collection even when the heap size is small, which is useful for
163// debugging.
164var heapminimum uint64 = defaultHeapMinimum
165
166// defaultHeapMinimum is the value of heapminimum for GOGC==100.
167const defaultHeapMinimum = 4 << 20
168
169// Initialized from $GOGC.  GOGC=off means no GC.
170var gcpercent int32
171
172func gcinit() {
173	if unsafe.Sizeof(workbuf{}) != _WorkbufSize {
174		throw("size of Workbuf is suboptimal")
175	}
176
177	// No sweep on the first cycle.
178	mheap_.sweepdone = 1
179
180	// Set a reasonable initial GC trigger.
181	memstats.triggerRatio = 7 / 8.0
182
183	// Fake a heap_marked value so it looks like a trigger at
184	// heapminimum is the appropriate growth from heap_marked.
185	// This will go into computing the initial GC goal.
186	memstats.heap_marked = uint64(float64(heapminimum) / (1 + memstats.triggerRatio))
187
188	// Set gcpercent from the environment. This will also compute
189	// and set the GC trigger and goal.
190	_ = setGCPercent(readgogc())
191
192	work.startSema = 1
193	work.markDoneSema = 1
194}
195
196func readgogc() int32 {
197	p := gogetenv("GOGC")
198	if p == "off" {
199		return -1
200	}
201	if n, ok := atoi32(p); ok {
202		return n
203	}
204	return 100
205}
206
207// gcenable is called after the bulk of the runtime initialization,
208// just before we're about to start letting user code run.
209// It kicks off the background sweeper goroutine, the background
210// scavenger goroutine, and enables GC.
211func gcenable() {
212	// Kick off sweeping and scavenging.
213	c := make(chan int, 2)
214	go bgsweep(c)
215	go bgscavenge(c)
216	<-c
217	<-c
218	memstats.enablegc = true // now that runtime is initialized, GC is okay
219}
220
221//go:linkname setGCPercent runtime/debug.setGCPercent
222func setGCPercent(in int32) (out int32) {
223	// Run on the system stack since we grab the heap lock.
224	systemstack(func() {
225		lock(&mheap_.lock)
226		out = gcpercent
227		if in < 0 {
228			in = -1
229		}
230		gcpercent = in
231		heapminimum = defaultHeapMinimum * uint64(gcpercent) / 100
232		// Update pacing in response to gcpercent change.
233		gcSetTriggerRatio(memstats.triggerRatio)
234		unlock(&mheap_.lock)
235	})
236	// Pacing changed, so the scavenger should be awoken.
237	wakeScavenger()
238
239	// If we just disabled GC, wait for any concurrent GC mark to
240	// finish so we always return with no GC running.
241	if in < 0 {
242		gcWaitOnMark(atomic.Load(&work.cycles))
243	}
244
245	return out
246}
247
248// Garbage collector phase.
249// Indicates to write barrier and synchronization task to perform.
250var gcphase uint32
251
252// The compiler knows about this variable.
253// If you change it, you must change builtin/runtime.go, too.
254// If you change the first four bytes, you must also change the write
255// barrier insertion code.
256var writeBarrier struct {
257	enabled bool    // compiler emits a check of this before calling write barrier
258	pad     [3]byte // compiler uses 32-bit load for "enabled" field
259	needed  bool    // whether we need a write barrier for current GC phase
260	cgo     bool    // whether we need a write barrier for a cgo check
261	alignme uint64  // guarantee alignment so that compiler can use a 32 or 64-bit load
262}
263
264// gcBlackenEnabled is 1 if mutator assists and background mark
265// workers are allowed to blacken objects. This must only be set when
266// gcphase == _GCmark.
267var gcBlackenEnabled uint32
268
269const (
270	_GCoff             = iota // GC not running; sweeping in background, write barrier disabled
271	_GCmark                   // GC marking roots and workbufs: allocate black, write barrier ENABLED
272	_GCmarktermination        // GC mark termination: allocate black, P's help GC, write barrier ENABLED
273)
274
275//go:nosplit
276func setGCPhase(x uint32) {
277	atomic.Store(&gcphase, x)
278	writeBarrier.needed = gcphase == _GCmark || gcphase == _GCmarktermination
279	writeBarrier.enabled = writeBarrier.needed || writeBarrier.cgo
280}
281
282// gcMarkWorkerMode represents the mode that a concurrent mark worker
283// should operate in.
284//
285// Concurrent marking happens through four different mechanisms. One
286// is mutator assists, which happen in response to allocations and are
287// not scheduled. The other three are variations in the per-P mark
288// workers and are distinguished by gcMarkWorkerMode.
289type gcMarkWorkerMode int
290
291const (
292	// gcMarkWorkerDedicatedMode indicates that the P of a mark
293	// worker is dedicated to running that mark worker. The mark
294	// worker should run without preemption.
295	gcMarkWorkerDedicatedMode gcMarkWorkerMode = iota
296
297	// gcMarkWorkerFractionalMode indicates that a P is currently
298	// running the "fractional" mark worker. The fractional worker
299	// is necessary when GOMAXPROCS*gcBackgroundUtilization is not
300	// an integer. The fractional worker should run until it is
301	// preempted and will be scheduled to pick up the fractional
302	// part of GOMAXPROCS*gcBackgroundUtilization.
303	gcMarkWorkerFractionalMode
304
305	// gcMarkWorkerIdleMode indicates that a P is running the mark
306	// worker because it has nothing else to do. The idle worker
307	// should run until it is preempted and account its time
308	// against gcController.idleMarkTime.
309	gcMarkWorkerIdleMode
310)
311
312// gcMarkWorkerModeStrings are the strings labels of gcMarkWorkerModes
313// to use in execution traces.
314var gcMarkWorkerModeStrings = [...]string{
315	"GC (dedicated)",
316	"GC (fractional)",
317	"GC (idle)",
318}
319
320// gcController implements the GC pacing controller that determines
321// when to trigger concurrent garbage collection and how much marking
322// work to do in mutator assists and background marking.
323//
324// It uses a feedback control algorithm to adjust the memstats.gc_trigger
325// trigger based on the heap growth and GC CPU utilization each cycle.
326// This algorithm optimizes for heap growth to match GOGC and for CPU
327// utilization between assist and background marking to be 25% of
328// GOMAXPROCS. The high-level design of this algorithm is documented
329// at https://golang.org/s/go15gcpacing.
330//
331// All fields of gcController are used only during a single mark
332// cycle.
333var gcController gcControllerState
334
335type gcControllerState struct {
336	// scanWork is the total scan work performed this cycle. This
337	// is updated atomically during the cycle. Updates occur in
338	// bounded batches, since it is both written and read
339	// throughout the cycle. At the end of the cycle, this is how
340	// much of the retained heap is scannable.
341	//
342	// Currently this is the bytes of heap scanned. For most uses,
343	// this is an opaque unit of work, but for estimation the
344	// definition is important.
345	scanWork int64
346
347	// bgScanCredit is the scan work credit accumulated by the
348	// concurrent background scan. This credit is accumulated by
349	// the background scan and stolen by mutator assists. This is
350	// updated atomically. Updates occur in bounded batches, since
351	// it is both written and read throughout the cycle.
352	bgScanCredit int64
353
354	// assistTime is the nanoseconds spent in mutator assists
355	// during this cycle. This is updated atomically. Updates
356	// occur in bounded batches, since it is both written and read
357	// throughout the cycle.
358	assistTime int64
359
360	// dedicatedMarkTime is the nanoseconds spent in dedicated
361	// mark workers during this cycle. This is updated atomically
362	// at the end of the concurrent mark phase.
363	dedicatedMarkTime int64
364
365	// fractionalMarkTime is the nanoseconds spent in the
366	// fractional mark worker during this cycle. This is updated
367	// atomically throughout the cycle and will be up-to-date if
368	// the fractional mark worker is not currently running.
369	fractionalMarkTime int64
370
371	// idleMarkTime is the nanoseconds spent in idle marking
372	// during this cycle. This is updated atomically throughout
373	// the cycle.
374	idleMarkTime int64
375
376	// markStartTime is the absolute start time in nanoseconds
377	// that assists and background mark workers started.
378	markStartTime int64
379
380	// dedicatedMarkWorkersNeeded is the number of dedicated mark
381	// workers that need to be started. This is computed at the
382	// beginning of each cycle and decremented atomically as
383	// dedicated mark workers get started.
384	dedicatedMarkWorkersNeeded int64
385
386	// assistWorkPerByte is the ratio of scan work to allocated
387	// bytes that should be performed by mutator assists. This is
388	// computed at the beginning of each cycle and updated every
389	// time heap_scan is updated.
390	assistWorkPerByte float64
391
392	// assistBytesPerWork is 1/assistWorkPerByte.
393	assistBytesPerWork float64
394
395	// fractionalUtilizationGoal is the fraction of wall clock
396	// time that should be spent in the fractional mark worker on
397	// each P that isn't running a dedicated worker.
398	//
399	// For example, if the utilization goal is 25% and there are
400	// no dedicated workers, this will be 0.25. If the goal is
401	// 25%, there is one dedicated worker, and GOMAXPROCS is 5,
402	// this will be 0.05 to make up the missing 5%.
403	//
404	// If this is zero, no fractional workers are needed.
405	fractionalUtilizationGoal float64
406
407	_ cpu.CacheLinePad
408}
409
410// startCycle resets the GC controller's state and computes estimates
411// for a new GC cycle. The caller must hold worldsema.
412func (c *gcControllerState) startCycle() {
413	c.scanWork = 0
414	c.bgScanCredit = 0
415	c.assistTime = 0
416	c.dedicatedMarkTime = 0
417	c.fractionalMarkTime = 0
418	c.idleMarkTime = 0
419
420	// Ensure that the heap goal is at least a little larger than
421	// the current live heap size. This may not be the case if GC
422	// start is delayed or if the allocation that pushed heap_live
423	// over gc_trigger is large or if the trigger is really close to
424	// GOGC. Assist is proportional to this distance, so enforce a
425	// minimum distance, even if it means going over the GOGC goal
426	// by a tiny bit.
427	if memstats.next_gc < memstats.heap_live+1024*1024 {
428		memstats.next_gc = memstats.heap_live + 1024*1024
429	}
430
431	// Compute the background mark utilization goal. In general,
432	// this may not come out exactly. We round the number of
433	// dedicated workers so that the utilization is closest to
434	// 25%. For small GOMAXPROCS, this would introduce too much
435	// error, so we add fractional workers in that case.
436	totalUtilizationGoal := float64(gomaxprocs) * gcBackgroundUtilization
437	c.dedicatedMarkWorkersNeeded = int64(totalUtilizationGoal + 0.5)
438	utilError := float64(c.dedicatedMarkWorkersNeeded)/totalUtilizationGoal - 1
439	const maxUtilError = 0.3
440	if utilError < -maxUtilError || utilError > maxUtilError {
441		// Rounding put us more than 30% off our goal. With
442		// gcBackgroundUtilization of 25%, this happens for
443		// GOMAXPROCS<=3 or GOMAXPROCS=6. Enable fractional
444		// workers to compensate.
445		if float64(c.dedicatedMarkWorkersNeeded) > totalUtilizationGoal {
446			// Too many dedicated workers.
447			c.dedicatedMarkWorkersNeeded--
448		}
449		c.fractionalUtilizationGoal = (totalUtilizationGoal - float64(c.dedicatedMarkWorkersNeeded)) / float64(gomaxprocs)
450	} else {
451		c.fractionalUtilizationGoal = 0
452	}
453
454	// In STW mode, we just want dedicated workers.
455	if debug.gcstoptheworld > 0 {
456		c.dedicatedMarkWorkersNeeded = int64(gomaxprocs)
457		c.fractionalUtilizationGoal = 0
458	}
459
460	// Clear per-P state
461	for _, p := range allp {
462		p.gcAssistTime = 0
463		p.gcFractionalMarkTime = 0
464	}
465
466	// Compute initial values for controls that are updated
467	// throughout the cycle.
468	c.revise()
469
470	if debug.gcpacertrace > 0 {
471		print("pacer: assist ratio=", c.assistWorkPerByte,
472			" (scan ", memstats.heap_scan>>20, " MB in ",
473			work.initialHeapLive>>20, "->",
474			memstats.next_gc>>20, " MB)",
475			" workers=", c.dedicatedMarkWorkersNeeded,
476			"+", c.fractionalUtilizationGoal, "\n")
477	}
478}
479
480// revise updates the assist ratio during the GC cycle to account for
481// improved estimates. This should be called either under STW or
482// whenever memstats.heap_scan, memstats.heap_live, or
483// memstats.next_gc is updated (with mheap_.lock held).
484//
485// It should only be called when gcBlackenEnabled != 0 (because this
486// is when assists are enabled and the necessary statistics are
487// available).
488func (c *gcControllerState) revise() {
489	gcpercent := gcpercent
490	if gcpercent < 0 {
491		// If GC is disabled but we're running a forced GC,
492		// act like GOGC is huge for the below calculations.
493		gcpercent = 100000
494	}
495	live := atomic.Load64(&memstats.heap_live)
496
497	// Assume we're under the soft goal. Pace GC to complete at
498	// next_gc assuming the heap is in steady-state.
499	heapGoal := int64(memstats.next_gc)
500
501	// Compute the expected scan work remaining.
502	//
503	// This is estimated based on the expected
504	// steady-state scannable heap. For example, with
505	// GOGC=100, only half of the scannable heap is
506	// expected to be live, so that's what we target.
507	//
508	// (This is a float calculation to avoid overflowing on
509	// 100*heap_scan.)
510	scanWorkExpected := int64(float64(memstats.heap_scan) * 100 / float64(100+gcpercent))
511
512	if live > memstats.next_gc || c.scanWork > scanWorkExpected {
513		// We're past the soft goal, or we've already done more scan
514		// work than we expected. Pace GC so that in the worst case it
515		// will complete by the hard goal.
516		const maxOvershoot = 1.1
517		heapGoal = int64(float64(memstats.next_gc) * maxOvershoot)
518
519		// Compute the upper bound on the scan work remaining.
520		scanWorkExpected = int64(memstats.heap_scan)
521	}
522
523	// Compute the remaining scan work estimate.
524	//
525	// Note that we currently count allocations during GC as both
526	// scannable heap (heap_scan) and scan work completed
527	// (scanWork), so allocation will change this difference
528	// slowly in the soft regime and not at all in the hard
529	// regime.
530	scanWorkRemaining := scanWorkExpected - c.scanWork
531	if scanWorkRemaining < 1000 {
532		// We set a somewhat arbitrary lower bound on
533		// remaining scan work since if we aim a little high,
534		// we can miss by a little.
535		//
536		// We *do* need to enforce that this is at least 1,
537		// since marking is racy and double-scanning objects
538		// may legitimately make the remaining scan work
539		// negative, even in the hard goal regime.
540		scanWorkRemaining = 1000
541	}
542
543	// Compute the heap distance remaining.
544	heapRemaining := heapGoal - int64(live)
545	if heapRemaining <= 0 {
546		// This shouldn't happen, but if it does, avoid
547		// dividing by zero or setting the assist negative.
548		heapRemaining = 1
549	}
550
551	// Compute the mutator assist ratio so by the time the mutator
552	// allocates the remaining heap bytes up to next_gc, it will
553	// have done (or stolen) the remaining amount of scan work.
554	c.assistWorkPerByte = float64(scanWorkRemaining) / float64(heapRemaining)
555	c.assistBytesPerWork = float64(heapRemaining) / float64(scanWorkRemaining)
556}
557
558// endCycle computes the trigger ratio for the next cycle.
559func (c *gcControllerState) endCycle() float64 {
560	if work.userForced {
561		// Forced GC means this cycle didn't start at the
562		// trigger, so where it finished isn't good
563		// information about how to adjust the trigger.
564		// Just leave it where it is.
565		return memstats.triggerRatio
566	}
567
568	// Proportional response gain for the trigger controller. Must
569	// be in [0, 1]. Lower values smooth out transient effects but
570	// take longer to respond to phase changes. Higher values
571	// react to phase changes quickly, but are more affected by
572	// transient changes. Values near 1 may be unstable.
573	const triggerGain = 0.5
574
575	// Compute next cycle trigger ratio. First, this computes the
576	// "error" for this cycle; that is, how far off the trigger
577	// was from what it should have been, accounting for both heap
578	// growth and GC CPU utilization. We compute the actual heap
579	// growth during this cycle and scale that by how far off from
580	// the goal CPU utilization we were (to estimate the heap
581	// growth if we had the desired CPU utilization). The
582	// difference between this estimate and the GOGC-based goal
583	// heap growth is the error.
584	goalGrowthRatio := gcEffectiveGrowthRatio()
585	actualGrowthRatio := float64(memstats.heap_live)/float64(memstats.heap_marked) - 1
586	assistDuration := nanotime() - c.markStartTime
587
588	// Assume background mark hit its utilization goal.
589	utilization := gcBackgroundUtilization
590	// Add assist utilization; avoid divide by zero.
591	if assistDuration > 0 {
592		utilization += float64(c.assistTime) / float64(assistDuration*int64(gomaxprocs))
593	}
594
595	triggerError := goalGrowthRatio - memstats.triggerRatio - utilization/gcGoalUtilization*(actualGrowthRatio-memstats.triggerRatio)
596
597	// Finally, we adjust the trigger for next time by this error,
598	// damped by the proportional gain.
599	triggerRatio := memstats.triggerRatio + triggerGain*triggerError
600
601	if debug.gcpacertrace > 0 {
602		// Print controller state in terms of the design
603		// document.
604		H_m_prev := memstats.heap_marked
605		h_t := memstats.triggerRatio
606		H_T := memstats.gc_trigger
607		h_a := actualGrowthRatio
608		H_a := memstats.heap_live
609		h_g := goalGrowthRatio
610		H_g := int64(float64(H_m_prev) * (1 + h_g))
611		u_a := utilization
612		u_g := gcGoalUtilization
613		W_a := c.scanWork
614		print("pacer: H_m_prev=", H_m_prev,
615			" h_t=", h_t, " H_T=", H_T,
616			" h_a=", h_a, " H_a=", H_a,
617			" h_g=", h_g, " H_g=", H_g,
618			" u_a=", u_a, " u_g=", u_g,
619			" W_a=", W_a,
620			" goalΔ=", goalGrowthRatio-h_t,
621			" actualΔ=", h_a-h_t,
622			" u_a/u_g=", u_a/u_g,
623			"\n")
624	}
625
626	return triggerRatio
627}
628
629// enlistWorker encourages another dedicated mark worker to start on
630// another P if there are spare worker slots. It is used by putfull
631// when more work is made available.
632//
633//go:nowritebarrier
634func (c *gcControllerState) enlistWorker() {
635	// If there are idle Ps, wake one so it will run an idle worker.
636	// NOTE: This is suspected of causing deadlocks. See golang.org/issue/19112.
637	//
638	//	if atomic.Load(&sched.npidle) != 0 && atomic.Load(&sched.nmspinning) == 0 {
639	//		wakep()
640	//		return
641	//	}
642
643	// There are no idle Ps. If we need more dedicated workers,
644	// try to preempt a running P so it will switch to a worker.
645	if c.dedicatedMarkWorkersNeeded <= 0 {
646		return
647	}
648	// Pick a random other P to preempt.
649	if gomaxprocs <= 1 {
650		return
651	}
652	gp := getg()
653	if gp == nil || gp.m == nil || gp.m.p == 0 {
654		return
655	}
656	myID := gp.m.p.ptr().id
657	for tries := 0; tries < 5; tries++ {
658		id := int32(fastrandn(uint32(gomaxprocs - 1)))
659		if id >= myID {
660			id++
661		}
662		p := allp[id]
663		if p.status != _Prunning {
664			continue
665		}
666		if preemptone(p) {
667			return
668		}
669	}
670}
671
672// findRunnableGCWorker returns the background mark worker for _p_ if it
673// should be run. This must only be called when gcBlackenEnabled != 0.
674func (c *gcControllerState) findRunnableGCWorker(_p_ *p) *g {
675	if gcBlackenEnabled == 0 {
676		throw("gcControllerState.findRunnable: blackening not enabled")
677	}
678	if _p_.gcBgMarkWorker == 0 {
679		// The mark worker associated with this P is blocked
680		// performing a mark transition. We can't run it
681		// because it may be on some other run or wait queue.
682		return nil
683	}
684
685	if !gcMarkWorkAvailable(_p_) {
686		// No work to be done right now. This can happen at
687		// the end of the mark phase when there are still
688		// assists tapering off. Don't bother running a worker
689		// now because it'll just return immediately.
690		return nil
691	}
692
693	decIfPositive := func(ptr *int64) bool {
694		if *ptr > 0 {
695			if atomic.Xaddint64(ptr, -1) >= 0 {
696				return true
697			}
698			// We lost a race
699			atomic.Xaddint64(ptr, +1)
700		}
701		return false
702	}
703
704	if decIfPositive(&c.dedicatedMarkWorkersNeeded) {
705		// This P is now dedicated to marking until the end of
706		// the concurrent mark phase.
707		_p_.gcMarkWorkerMode = gcMarkWorkerDedicatedMode
708	} else if c.fractionalUtilizationGoal == 0 {
709		// No need for fractional workers.
710		return nil
711	} else {
712		// Is this P behind on the fractional utilization
713		// goal?
714		//
715		// This should be kept in sync with pollFractionalWorkerExit.
716		delta := nanotime() - gcController.markStartTime
717		if delta > 0 && float64(_p_.gcFractionalMarkTime)/float64(delta) > c.fractionalUtilizationGoal {
718			// Nope. No need to run a fractional worker.
719			return nil
720		}
721		// Run a fractional worker.
722		_p_.gcMarkWorkerMode = gcMarkWorkerFractionalMode
723	}
724
725	// Run the background mark worker
726	gp := _p_.gcBgMarkWorker.ptr()
727	casgstatus(gp, _Gwaiting, _Grunnable)
728	if trace.enabled {
729		traceGoUnpark(gp, 0)
730	}
731	return gp
732}
733
734// pollFractionalWorkerExit reports whether a fractional mark worker
735// should self-preempt. It assumes it is called from the fractional
736// worker.
737func pollFractionalWorkerExit() bool {
738	// This should be kept in sync with the fractional worker
739	// scheduler logic in findRunnableGCWorker.
740	now := nanotime()
741	delta := now - gcController.markStartTime
742	if delta <= 0 {
743		return true
744	}
745	p := getg().m.p.ptr()
746	selfTime := p.gcFractionalMarkTime + (now - p.gcMarkWorkerStartTime)
747	// Add some slack to the utilization goal so that the
748	// fractional worker isn't behind again the instant it exits.
749	return float64(selfTime)/float64(delta) > 1.2*gcController.fractionalUtilizationGoal
750}
751
752// gcSetTriggerRatio sets the trigger ratio and updates everything
753// derived from it: the absolute trigger, the heap goal, mark pacing,
754// and sweep pacing.
755//
756// This can be called any time. If GC is the in the middle of a
757// concurrent phase, it will adjust the pacing of that phase.
758//
759// This depends on gcpercent, memstats.heap_marked, and
760// memstats.heap_live. These must be up to date.
761//
762// mheap_.lock must be held or the world must be stopped.
763func gcSetTriggerRatio(triggerRatio float64) {
764	// Compute the next GC goal, which is when the allocated heap
765	// has grown by GOGC/100 over the heap marked by the last
766	// cycle.
767	goal := ^uint64(0)
768	if gcpercent >= 0 {
769		goal = memstats.heap_marked + memstats.heap_marked*uint64(gcpercent)/100
770	}
771
772	// If we let triggerRatio go too low, then if the application
773	// is allocating very rapidly we might end up in a situation
774	// where we're allocating black during a nearly always-on GC.
775	// The result of this is a growing heap and ultimately an
776	// increase in RSS. By capping us at a point >0, we're essentially
777	// saying that we're OK using more CPU during the GC to prevent
778	// this growth in RSS.
779	//
780	// The current constant was chosen empirically: given a sufficiently
781	// fast/scalable allocator with 48 Ps that could drive the trigger ratio
782	// to <0.05, this constant causes applications to retain the same peak
783	// RSS compared to not having this allocator.
784	const minTriggerRatio = 0.6
785
786	// Set the trigger ratio, capped to reasonable bounds.
787	if triggerRatio < minTriggerRatio {
788		// This can happen if the mutator is allocating very
789		// quickly or the GC is scanning very slowly.
790		triggerRatio = minTriggerRatio
791	} else if gcpercent >= 0 {
792		// Ensure there's always a little margin so that the
793		// mutator assist ratio isn't infinity.
794		maxTriggerRatio := 0.95 * float64(gcpercent) / 100
795		if triggerRatio > maxTriggerRatio {
796			triggerRatio = maxTriggerRatio
797		}
798	}
799	memstats.triggerRatio = triggerRatio
800
801	// Compute the absolute GC trigger from the trigger ratio.
802	//
803	// We trigger the next GC cycle when the allocated heap has
804	// grown by the trigger ratio over the marked heap size.
805	trigger := ^uint64(0)
806	if gcpercent >= 0 {
807		trigger = uint64(float64(memstats.heap_marked) * (1 + triggerRatio))
808		// Don't trigger below the minimum heap size.
809		minTrigger := heapminimum
810		if !isSweepDone() {
811			// Concurrent sweep happens in the heap growth
812			// from heap_live to gc_trigger, so ensure
813			// that concurrent sweep has some heap growth
814			// in which to perform sweeping before we
815			// start the next GC cycle.
816			sweepMin := atomic.Load64(&memstats.heap_live) + sweepMinHeapDistance
817			if sweepMin > minTrigger {
818				minTrigger = sweepMin
819			}
820		}
821		if trigger < minTrigger {
822			trigger = minTrigger
823		}
824		if int64(trigger) < 0 {
825			print("runtime: next_gc=", memstats.next_gc, " heap_marked=", memstats.heap_marked, " heap_live=", memstats.heap_live, " initialHeapLive=", work.initialHeapLive, "triggerRatio=", triggerRatio, " minTrigger=", minTrigger, "\n")
826			throw("gc_trigger underflow")
827		}
828		if trigger > goal {
829			// The trigger ratio is always less than GOGC/100, but
830			// other bounds on the trigger may have raised it.
831			// Push up the goal, too.
832			goal = trigger
833		}
834	}
835
836	// Commit to the trigger and goal.
837	memstats.gc_trigger = trigger
838	memstats.next_gc = goal
839	if trace.enabled {
840		traceNextGC()
841	}
842
843	// Update mark pacing.
844	if gcphase != _GCoff {
845		gcController.revise()
846	}
847
848	// Update sweep pacing.
849	if isSweepDone() {
850		mheap_.sweepPagesPerByte = 0
851	} else {
852		// Concurrent sweep needs to sweep all of the in-use
853		// pages by the time the allocated heap reaches the GC
854		// trigger. Compute the ratio of in-use pages to sweep
855		// per byte allocated, accounting for the fact that
856		// some might already be swept.
857		heapLiveBasis := atomic.Load64(&memstats.heap_live)
858		heapDistance := int64(trigger) - int64(heapLiveBasis)
859		// Add a little margin so rounding errors and
860		// concurrent sweep are less likely to leave pages
861		// unswept when GC starts.
862		heapDistance -= 1024 * 1024
863		if heapDistance < _PageSize {
864			// Avoid setting the sweep ratio extremely high
865			heapDistance = _PageSize
866		}
867		pagesSwept := atomic.Load64(&mheap_.pagesSwept)
868		pagesInUse := atomic.Load64(&mheap_.pagesInUse)
869		sweepDistancePages := int64(pagesInUse) - int64(pagesSwept)
870		if sweepDistancePages <= 0 {
871			mheap_.sweepPagesPerByte = 0
872		} else {
873			mheap_.sweepPagesPerByte = float64(sweepDistancePages) / float64(heapDistance)
874			mheap_.sweepHeapLiveBasis = heapLiveBasis
875			// Write pagesSweptBasis last, since this
876			// signals concurrent sweeps to recompute
877			// their debt.
878			atomic.Store64(&mheap_.pagesSweptBasis, pagesSwept)
879		}
880	}
881
882	gcPaceScavenger()
883}
884
885// gcEffectiveGrowthRatio returns the current effective heap growth
886// ratio (GOGC/100) based on heap_marked from the previous GC and
887// next_gc for the current GC.
888//
889// This may differ from gcpercent/100 because of various upper and
890// lower bounds on gcpercent. For example, if the heap is smaller than
891// heapminimum, this can be higher than gcpercent/100.
892//
893// mheap_.lock must be held or the world must be stopped.
894func gcEffectiveGrowthRatio() float64 {
895	egogc := float64(memstats.next_gc-memstats.heap_marked) / float64(memstats.heap_marked)
896	if egogc < 0 {
897		// Shouldn't happen, but just in case.
898		egogc = 0
899	}
900	return egogc
901}
902
903// gcGoalUtilization is the goal CPU utilization for
904// marking as a fraction of GOMAXPROCS.
905const gcGoalUtilization = 0.30
906
907// gcBackgroundUtilization is the fixed CPU utilization for background
908// marking. It must be <= gcGoalUtilization. The difference between
909// gcGoalUtilization and gcBackgroundUtilization will be made up by
910// mark assists. The scheduler will aim to use within 50% of this
911// goal.
912//
913// Setting this to < gcGoalUtilization avoids saturating the trigger
914// feedback controller when there are no assists, which allows it to
915// better control CPU and heap growth. However, the larger the gap,
916// the more mutator assists are expected to happen, which impact
917// mutator latency.
918const gcBackgroundUtilization = 0.25
919
920// gcCreditSlack is the amount of scan work credit that can
921// accumulate locally before updating gcController.scanWork and,
922// optionally, gcController.bgScanCredit. Lower values give a more
923// accurate assist ratio and make it more likely that assists will
924// successfully steal background credit. Higher values reduce memory
925// contention.
926const gcCreditSlack = 2000
927
928// gcAssistTimeSlack is the nanoseconds of mutator assist time that
929// can accumulate on a P before updating gcController.assistTime.
930const gcAssistTimeSlack = 5000
931
932// gcOverAssistWork determines how many extra units of scan work a GC
933// assist does when an assist happens. This amortizes the cost of an
934// assist by pre-paying for this many bytes of future allocations.
935const gcOverAssistWork = 64 << 10
936
937var work struct {
938	full  lfstack          // lock-free list of full blocks workbuf
939	empty lfstack          // lock-free list of empty blocks workbuf
940	pad0  cpu.CacheLinePad // prevents false-sharing between full/empty and nproc/nwait
941
942	wbufSpans struct {
943		lock mutex
944		// free is a list of spans dedicated to workbufs, but
945		// that don't currently contain any workbufs.
946		free mSpanList
947		// busy is a list of all spans containing workbufs on
948		// one of the workbuf lists.
949		busy mSpanList
950	}
951
952	// Restore 64-bit alignment on 32-bit.
953	_ uint32
954
955	// bytesMarked is the number of bytes marked this cycle. This
956	// includes bytes blackened in scanned objects, noscan objects
957	// that go straight to black, and permagrey objects scanned by
958	// markroot during the concurrent scan phase. This is updated
959	// atomically during the cycle. Updates may be batched
960	// arbitrarily, since the value is only read at the end of the
961	// cycle.
962	//
963	// Because of benign races during marking, this number may not
964	// be the exact number of marked bytes, but it should be very
965	// close.
966	//
967	// Put this field here because it needs 64-bit atomic access
968	// (and thus 8-byte alignment even on 32-bit architectures).
969	bytesMarked uint64
970
971	markrootNext uint32 // next markroot job
972	markrootJobs uint32 // number of markroot jobs
973
974	nproc  uint32
975	tstart int64
976	nwait  uint32
977	ndone  uint32
978
979	// Number of roots of various root types. Set by gcMarkRootPrepare.
980	nFlushCacheRoots                               int
981	nDataRoots, nBSSRoots, nSpanRoots, nStackRoots int
982
983	// Each type of GC state transition is protected by a lock.
984	// Since multiple threads can simultaneously detect the state
985	// transition condition, any thread that detects a transition
986	// condition must acquire the appropriate transition lock,
987	// re-check the transition condition and return if it no
988	// longer holds or perform the transition if it does.
989	// Likewise, any transition must invalidate the transition
990	// condition before releasing the lock. This ensures that each
991	// transition is performed by exactly one thread and threads
992	// that need the transition to happen block until it has
993	// happened.
994	//
995	// startSema protects the transition from "off" to mark or
996	// mark termination.
997	startSema uint32
998	// markDoneSema protects transitions from mark to mark termination.
999	markDoneSema uint32
1000
1001	bgMarkReady note   // signal background mark worker has started
1002	bgMarkDone  uint32 // cas to 1 when at a background mark completion point
1003	// Background mark completion signaling
1004
1005	// mode is the concurrency mode of the current GC cycle.
1006	mode gcMode
1007
1008	// userForced indicates the current GC cycle was forced by an
1009	// explicit user call.
1010	userForced bool
1011
1012	// totaltime is the CPU nanoseconds spent in GC since the
1013	// program started if debug.gctrace > 0.
1014	totaltime int64
1015
1016	// initialHeapLive is the value of memstats.heap_live at the
1017	// beginning of this GC cycle.
1018	initialHeapLive uint64
1019
1020	// assistQueue is a queue of assists that are blocked because
1021	// there was neither enough credit to steal or enough work to
1022	// do.
1023	assistQueue struct {
1024		lock mutex
1025		q    gQueue
1026	}
1027
1028	// sweepWaiters is a list of blocked goroutines to wake when
1029	// we transition from mark termination to sweep.
1030	sweepWaiters struct {
1031		lock mutex
1032		list gList
1033	}
1034
1035	// cycles is the number of completed GC cycles, where a GC
1036	// cycle is sweep termination, mark, mark termination, and
1037	// sweep. This differs from memstats.numgc, which is
1038	// incremented at mark termination.
1039	cycles uint32
1040
1041	// Timing/utilization stats for this cycle.
1042	stwprocs, maxprocs                 int32
1043	tSweepTerm, tMark, tMarkTerm, tEnd int64 // nanotime() of phase start
1044
1045	pauseNS    int64 // total STW time this cycle
1046	pauseStart int64 // nanotime() of last STW
1047
1048	// debug.gctrace heap sizes for this cycle.
1049	heap0, heap1, heap2, heapGoal uint64
1050}
1051
1052// GC runs a garbage collection and blocks the caller until the
1053// garbage collection is complete. It may also block the entire
1054// program.
1055func GC() {
1056	// We consider a cycle to be: sweep termination, mark, mark
1057	// termination, and sweep. This function shouldn't return
1058	// until a full cycle has been completed, from beginning to
1059	// end. Hence, we always want to finish up the current cycle
1060	// and start a new one. That means:
1061	//
1062	// 1. In sweep termination, mark, or mark termination of cycle
1063	// N, wait until mark termination N completes and transitions
1064	// to sweep N.
1065	//
1066	// 2. In sweep N, help with sweep N.
1067	//
1068	// At this point we can begin a full cycle N+1.
1069	//
1070	// 3. Trigger cycle N+1 by starting sweep termination N+1.
1071	//
1072	// 4. Wait for mark termination N+1 to complete.
1073	//
1074	// 5. Help with sweep N+1 until it's done.
1075	//
1076	// This all has to be written to deal with the fact that the
1077	// GC may move ahead on its own. For example, when we block
1078	// until mark termination N, we may wake up in cycle N+2.
1079
1080	// Wait until the current sweep termination, mark, and mark
1081	// termination complete.
1082	n := atomic.Load(&work.cycles)
1083	gcWaitOnMark(n)
1084
1085	// We're now in sweep N or later. Trigger GC cycle N+1, which
1086	// will first finish sweep N if necessary and then enter sweep
1087	// termination N+1.
1088	gcStart(gcTrigger{kind: gcTriggerCycle, n: n + 1})
1089
1090	// Wait for mark termination N+1 to complete.
1091	gcWaitOnMark(n + 1)
1092
1093	// Finish sweep N+1 before returning. We do this both to
1094	// complete the cycle and because runtime.GC() is often used
1095	// as part of tests and benchmarks to get the system into a
1096	// relatively stable and isolated state.
1097	for atomic.Load(&work.cycles) == n+1 && sweepone() != ^uintptr(0) {
1098		sweep.nbgsweep++
1099		Gosched()
1100	}
1101
1102	// Callers may assume that the heap profile reflects the
1103	// just-completed cycle when this returns (historically this
1104	// happened because this was a STW GC), but right now the
1105	// profile still reflects mark termination N, not N+1.
1106	//
1107	// As soon as all of the sweep frees from cycle N+1 are done,
1108	// we can go ahead and publish the heap profile.
1109	//
1110	// First, wait for sweeping to finish. (We know there are no
1111	// more spans on the sweep queue, but we may be concurrently
1112	// sweeping spans, so we have to wait.)
1113	for atomic.Load(&work.cycles) == n+1 && atomic.Load(&mheap_.sweepers) != 0 {
1114		Gosched()
1115	}
1116
1117	// Now we're really done with sweeping, so we can publish the
1118	// stable heap profile. Only do this if we haven't already hit
1119	// another mark termination.
1120	mp := acquirem()
1121	cycle := atomic.Load(&work.cycles)
1122	if cycle == n+1 || (gcphase == _GCmark && cycle == n+2) {
1123		mProf_PostSweep()
1124	}
1125	releasem(mp)
1126}
1127
1128// gcWaitOnMark blocks until GC finishes the Nth mark phase. If GC has
1129// already completed this mark phase, it returns immediately.
1130func gcWaitOnMark(n uint32) {
1131	for {
1132		// Disable phase transitions.
1133		lock(&work.sweepWaiters.lock)
1134		nMarks := atomic.Load(&work.cycles)
1135		if gcphase != _GCmark {
1136			// We've already completed this cycle's mark.
1137			nMarks++
1138		}
1139		if nMarks > n {
1140			// We're done.
1141			unlock(&work.sweepWaiters.lock)
1142			return
1143		}
1144
1145		// Wait until sweep termination, mark, and mark
1146		// termination of cycle N complete.
1147		work.sweepWaiters.list.push(getg())
1148		goparkunlock(&work.sweepWaiters.lock, waitReasonWaitForGCCycle, traceEvGoBlock, 1)
1149	}
1150}
1151
1152// gcMode indicates how concurrent a GC cycle should be.
1153type gcMode int
1154
1155const (
1156	gcBackgroundMode gcMode = iota // concurrent GC and sweep
1157	gcForceMode                    // stop-the-world GC now, concurrent sweep
1158	gcForceBlockMode               // stop-the-world GC now and STW sweep (forced by user)
1159)
1160
1161// A gcTrigger is a predicate for starting a GC cycle. Specifically,
1162// it is an exit condition for the _GCoff phase.
1163type gcTrigger struct {
1164	kind gcTriggerKind
1165	now  int64  // gcTriggerTime: current time
1166	n    uint32 // gcTriggerCycle: cycle number to start
1167}
1168
1169type gcTriggerKind int
1170
1171const (
1172	// gcTriggerHeap indicates that a cycle should be started when
1173	// the heap size reaches the trigger heap size computed by the
1174	// controller.
1175	gcTriggerHeap gcTriggerKind = iota
1176
1177	// gcTriggerTime indicates that a cycle should be started when
1178	// it's been more than forcegcperiod nanoseconds since the
1179	// previous GC cycle.
1180	gcTriggerTime
1181
1182	// gcTriggerCycle indicates that a cycle should be started if
1183	// we have not yet started cycle number gcTrigger.n (relative
1184	// to work.cycles).
1185	gcTriggerCycle
1186)
1187
1188// test reports whether the trigger condition is satisfied, meaning
1189// that the exit condition for the _GCoff phase has been met. The exit
1190// condition should be tested when allocating.
1191func (t gcTrigger) test() bool {
1192	if !memstats.enablegc || panicking != 0 || gcphase != _GCoff {
1193		return false
1194	}
1195	switch t.kind {
1196	case gcTriggerHeap:
1197		// Non-atomic access to heap_live for performance. If
1198		// we are going to trigger on this, this thread just
1199		// atomically wrote heap_live anyway and we'll see our
1200		// own write.
1201		return memstats.heap_live >= memstats.gc_trigger
1202	case gcTriggerTime:
1203		if gcpercent < 0 {
1204			return false
1205		}
1206		lastgc := int64(atomic.Load64(&memstats.last_gc_nanotime))
1207		return lastgc != 0 && t.now-lastgc > forcegcperiod
1208	case gcTriggerCycle:
1209		// t.n > work.cycles, but accounting for wraparound.
1210		return int32(t.n-work.cycles) > 0
1211	}
1212	return true
1213}
1214
1215// gcStart starts the GC. It transitions from _GCoff to _GCmark (if
1216// debug.gcstoptheworld == 0) or performs all of GC (if
1217// debug.gcstoptheworld != 0).
1218//
1219// This may return without performing this transition in some cases,
1220// such as when called on a system stack or with locks held.
1221func gcStart(trigger gcTrigger) {
1222	// Since this is called from malloc and malloc is called in
1223	// the guts of a number of libraries that might be holding
1224	// locks, don't attempt to start GC in non-preemptible or
1225	// potentially unstable situations.
1226	mp := acquirem()
1227	if gp := getg(); gp == mp.g0 || mp.locks > 1 || mp.preemptoff != "" {
1228		releasem(mp)
1229		return
1230	}
1231	releasem(mp)
1232	mp = nil
1233
1234	// Pick up the remaining unswept/not being swept spans concurrently
1235	//
1236	// This shouldn't happen if we're being invoked in background
1237	// mode since proportional sweep should have just finished
1238	// sweeping everything, but rounding errors, etc, may leave a
1239	// few spans unswept. In forced mode, this is necessary since
1240	// GC can be forced at any point in the sweeping cycle.
1241	//
1242	// We check the transition condition continuously here in case
1243	// this G gets delayed in to the next GC cycle.
1244	for trigger.test() && sweepone() != ^uintptr(0) {
1245		sweep.nbgsweep++
1246	}
1247
1248	// Perform GC initialization and the sweep termination
1249	// transition.
1250	semacquire(&work.startSema)
1251	// Re-check transition condition under transition lock.
1252	if !trigger.test() {
1253		semrelease(&work.startSema)
1254		return
1255	}
1256
1257	// For stats, check if this GC was forced by the user.
1258	work.userForced = trigger.kind == gcTriggerCycle
1259
1260	// In gcstoptheworld debug mode, upgrade the mode accordingly.
1261	// We do this after re-checking the transition condition so
1262	// that multiple goroutines that detect the heap trigger don't
1263	// start multiple STW GCs.
1264	mode := gcBackgroundMode
1265	if debug.gcstoptheworld == 1 {
1266		mode = gcForceMode
1267	} else if debug.gcstoptheworld == 2 {
1268		mode = gcForceBlockMode
1269	}
1270
1271	// Ok, we're doing it! Stop everybody else
1272	semacquire(&worldsema)
1273
1274	if trace.enabled {
1275		traceGCStart()
1276	}
1277
1278	// Check that all Ps have finished deferred mcache flushes.
1279	for _, p := range allp {
1280		if fg := atomic.Load(&p.mcache.flushGen); fg != mheap_.sweepgen {
1281			println("runtime: p", p.id, "flushGen", fg, "!= sweepgen", mheap_.sweepgen)
1282			throw("p mcache not flushed")
1283		}
1284	}
1285
1286	gcBgMarkStartWorkers()
1287
1288	systemstack(gcResetMarkState)
1289
1290	work.stwprocs, work.maxprocs = gomaxprocs, gomaxprocs
1291	if work.stwprocs > ncpu {
1292		// This is used to compute CPU time of the STW phases,
1293		// so it can't be more than ncpu, even if GOMAXPROCS is.
1294		work.stwprocs = ncpu
1295	}
1296	work.heap0 = atomic.Load64(&memstats.heap_live)
1297	work.pauseNS = 0
1298	work.mode = mode
1299
1300	now := nanotime()
1301	work.tSweepTerm = now
1302	work.pauseStart = now
1303	if trace.enabled {
1304		traceGCSTWStart(1)
1305	}
1306	systemstack(stopTheWorldWithSema)
1307	// Finish sweep before we start concurrent scan.
1308	systemstack(func() {
1309		finishsweep_m()
1310	})
1311	// clearpools before we start the GC. If we wait they memory will not be
1312	// reclaimed until the next GC cycle.
1313	clearpools()
1314
1315	work.cycles++
1316
1317	gcController.startCycle()
1318	work.heapGoal = memstats.next_gc
1319
1320	// In STW mode, disable scheduling of user Gs. This may also
1321	// disable scheduling of this goroutine, so it may block as
1322	// soon as we start the world again.
1323	if mode != gcBackgroundMode {
1324		schedEnableUser(false)
1325	}
1326
1327	// Enter concurrent mark phase and enable
1328	// write barriers.
1329	//
1330	// Because the world is stopped, all Ps will
1331	// observe that write barriers are enabled by
1332	// the time we start the world and begin
1333	// scanning.
1334	//
1335	// Write barriers must be enabled before assists are
1336	// enabled because they must be enabled before
1337	// any non-leaf heap objects are marked. Since
1338	// allocations are blocked until assists can
1339	// happen, we want enable assists as early as
1340	// possible.
1341	setGCPhase(_GCmark)
1342
1343	gcBgMarkPrepare() // Must happen before assist enable.
1344	gcMarkRootPrepare()
1345
1346	// Mark all active tinyalloc blocks. Since we're
1347	// allocating from these, they need to be black like
1348	// other allocations. The alternative is to blacken
1349	// the tiny block on every allocation from it, which
1350	// would slow down the tiny allocator.
1351	gcMarkTinyAllocs()
1352
1353	// At this point all Ps have enabled the write
1354	// barrier, thus maintaining the no white to
1355	// black invariant. Enable mutator assists to
1356	// put back-pressure on fast allocating
1357	// mutators.
1358	atomic.Store(&gcBlackenEnabled, 1)
1359
1360	// Assists and workers can start the moment we start
1361	// the world.
1362	gcController.markStartTime = now
1363
1364	// Concurrent mark.
1365	systemstack(func() {
1366		now = startTheWorldWithSema(trace.enabled)
1367		work.pauseNS += now - work.pauseStart
1368		work.tMark = now
1369	})
1370	// In STW mode, we could block the instant systemstack
1371	// returns, so don't do anything important here. Make sure we
1372	// block rather than returning to user code.
1373	if mode != gcBackgroundMode {
1374		Gosched()
1375	}
1376
1377	semrelease(&work.startSema)
1378}
1379
1380// gcMarkDoneFlushed counts the number of P's with flushed work.
1381//
1382// Ideally this would be a captured local in gcMarkDone, but forEachP
1383// escapes its callback closure, so it can't capture anything.
1384//
1385// This is protected by markDoneSema.
1386var gcMarkDoneFlushed uint32
1387
1388// debugCachedWork enables extra checks for debugging premature mark
1389// termination.
1390//
1391// For debugging issue #27993.
1392const debugCachedWork = false
1393
1394// gcWorkPauseGen is for debugging the mark completion algorithm.
1395// gcWork put operations spin while gcWork.pauseGen == gcWorkPauseGen.
1396// Only used if debugCachedWork is true.
1397//
1398// For debugging issue #27993.
1399var gcWorkPauseGen uint32 = 1
1400
1401// gcMarkDone transitions the GC from mark to mark termination if all
1402// reachable objects have been marked (that is, there are no grey
1403// objects and can be no more in the future). Otherwise, it flushes
1404// all local work to the global queues where it can be discovered by
1405// other workers.
1406//
1407// This should be called when all local mark work has been drained and
1408// there are no remaining workers. Specifically, when
1409//
1410//   work.nwait == work.nproc && !gcMarkWorkAvailable(p)
1411//
1412// The calling context must be preemptible.
1413//
1414// Flushing local work is important because idle Ps may have local
1415// work queued. This is the only way to make that work visible and
1416// drive GC to completion.
1417//
1418// It is explicitly okay to have write barriers in this function. If
1419// it does transition to mark termination, then all reachable objects
1420// have been marked, so the write barrier cannot shade any more
1421// objects.
1422func gcMarkDone() {
1423	// Ensure only one thread is running the ragged barrier at a
1424	// time.
1425	semacquire(&work.markDoneSema)
1426
1427top:
1428	// Re-check transition condition under transition lock.
1429	//
1430	// It's critical that this checks the global work queues are
1431	// empty before performing the ragged barrier. Otherwise,
1432	// there could be global work that a P could take after the P
1433	// has passed the ragged barrier.
1434	if !(gcphase == _GCmark && work.nwait == work.nproc && !gcMarkWorkAvailable(nil)) {
1435		semrelease(&work.markDoneSema)
1436		return
1437	}
1438
1439	// Flush all local buffers and collect flushedWork flags.
1440	gcMarkDoneFlushed = 0
1441	systemstack(func() {
1442		gp := getg().m.curg
1443		// Mark the user stack as preemptible so that it may be scanned.
1444		// Otherwise, our attempt to force all P's to a safepoint could
1445		// result in a deadlock as we attempt to preempt a worker that's
1446		// trying to preempt us (e.g. for a stack scan).
1447		casgstatus(gp, _Grunning, _Gwaiting)
1448		forEachP(func(_p_ *p) {
1449			// Flush the write barrier buffer, since this may add
1450			// work to the gcWork.
1451			wbBufFlush1(_p_)
1452			// For debugging, shrink the write barrier
1453			// buffer so it flushes immediately.
1454			// wbBuf.reset will keep it at this size as
1455			// long as throwOnGCWork is set.
1456			if debugCachedWork {
1457				b := &_p_.wbBuf
1458				b.end = uintptr(unsafe.Pointer(&b.buf[wbBufEntryPointers]))
1459				b.debugGen = gcWorkPauseGen
1460			}
1461			// Flush the gcWork, since this may create global work
1462			// and set the flushedWork flag.
1463			//
1464			// TODO(austin): Break up these workbufs to
1465			// better distribute work.
1466			_p_.gcw.dispose()
1467			// Collect the flushedWork flag.
1468			if _p_.gcw.flushedWork {
1469				atomic.Xadd(&gcMarkDoneFlushed, 1)
1470				_p_.gcw.flushedWork = false
1471			} else if debugCachedWork {
1472				// For debugging, freeze the gcWork
1473				// until we know whether we've reached
1474				// completion or not. If we think
1475				// we've reached completion, but
1476				// there's a paused gcWork, then
1477				// that's a bug.
1478				_p_.gcw.pauseGen = gcWorkPauseGen
1479				// Capture the G's stack.
1480				for i := range _p_.gcw.pauseStack {
1481					_p_.gcw.pauseStack[i] = 0
1482				}
1483				callers(1, _p_.gcw.pauseStack[:])
1484			}
1485		})
1486		casgstatus(gp, _Gwaiting, _Grunning)
1487	})
1488
1489	if gcMarkDoneFlushed != 0 {
1490		if debugCachedWork {
1491			// Release paused gcWorks.
1492			atomic.Xadd(&gcWorkPauseGen, 1)
1493		}
1494		// More grey objects were discovered since the
1495		// previous termination check, so there may be more
1496		// work to do. Keep going. It's possible the
1497		// transition condition became true again during the
1498		// ragged barrier, so re-check it.
1499		goto top
1500	}
1501
1502	if debugCachedWork {
1503		throwOnGCWork = true
1504		// Release paused gcWorks. If there are any, they
1505		// should now observe throwOnGCWork and panic.
1506		atomic.Xadd(&gcWorkPauseGen, 1)
1507	}
1508
1509	// There was no global work, no local work, and no Ps
1510	// communicated work since we took markDoneSema. Therefore
1511	// there are no grey objects and no more objects can be
1512	// shaded. Transition to mark termination.
1513	now := nanotime()
1514	work.tMarkTerm = now
1515	work.pauseStart = now
1516	getg().m.preemptoff = "gcing"
1517	if trace.enabled {
1518		traceGCSTWStart(0)
1519	}
1520	systemstack(stopTheWorldWithSema)
	// The gcphase is _GCmark; it will transition to _GCmarktermination
	// below. The important thing is that the write barrier remains
	// active until all marking is complete, including writes made by
	// the GC itself.
1524
1525	if debugCachedWork {
		// For debugging, double check that no work was added after we
		// went around above and disabled write barrier buffering.
1528		for _, p := range allp {
1529			gcw := &p.gcw
1530			if !gcw.empty() {
1531				printlock()
1532				print("runtime: P ", p.id, " flushedWork ", gcw.flushedWork)
1533				if gcw.wbuf1 == nil {
1534					print(" wbuf1=<nil>")
1535				} else {
1536					print(" wbuf1.n=", gcw.wbuf1.nobj)
1537				}
1538				if gcw.wbuf2 == nil {
1539					print(" wbuf2=<nil>")
1540				} else {
1541					print(" wbuf2.n=", gcw.wbuf2.nobj)
1542				}
1543				print("\n")
1544				if gcw.pauseGen == gcw.putGen {
1545					println("runtime: checkPut already failed at this generation")
1546				}
1547				throw("throwOnGCWork")
1548			}
1549		}
1550	} else {
1551		// For unknown reasons (see issue #27993), there is
1552		// sometimes work left over when we enter mark
1553		// termination. Detect this and resume concurrent
1554		// mark. This is obviously unfortunate.
1555		//
1556		// Switch to the system stack to call wbBufFlush1,
1557		// though in this case it doesn't matter because we're
1558		// non-preemptible anyway.
1559		restart := false
1560		systemstack(func() {
1561			for _, p := range allp {
1562				wbBufFlush1(p)
1563				if !p.gcw.empty() {
1564					restart = true
1565					break
1566				}
1567			}
1568		})
1569		if restart {
1570			getg().m.preemptoff = ""
1571			systemstack(func() {
1572				now := startTheWorldWithSema(true)
1573				work.pauseNS += now - work.pauseStart
1574			})
1575			goto top
1576		}
1577	}
1578
1579	// Disable assists and background workers. We must do
1580	// this before waking blocked assists.
1581	atomic.Store(&gcBlackenEnabled, 0)
1582
1583	// Wake all blocked assists. These will run when we
1584	// start the world again.
1585	gcWakeAllAssists()
1586
1587	// Likewise, release the transition lock. Blocked
1588	// workers and assists will run when we start the
1589	// world again.
1590	semrelease(&work.markDoneSema)
1591
1592	// In STW mode, re-enable user goroutines. These will be
1593	// queued to run after we start the world.
1594	schedEnableUser(true)
1595
1596	// endCycle depends on all gcWork cache stats being flushed.
	// The termination algorithm above ensured this for everything
	// except allocations made since the ragged barrier.
1599	nextTriggerRatio := gcController.endCycle()
1600
1601	// Perform mark termination. This will restart the world.
1602	gcMarkTermination(nextTriggerRatio)
1603}
1604
1605func gcMarkTermination(nextTriggerRatio float64) {
1606	// World is stopped.
	// Start mark termination, which includes enabling the write barrier.
1608	atomic.Store(&gcBlackenEnabled, 0)
1609	setGCPhase(_GCmarktermination)
1610
1611	work.heap1 = memstats.heap_live
1612	startTime := nanotime()
1613
1614	mp := acquirem()
1615	mp.preemptoff = "gcing"
1616	_g_ := getg()
1617	_g_.m.traceback = 2
1618	gp := _g_.m.curg
1619	casgstatus(gp, _Grunning, _Gwaiting)
1620	gp.waitreason = waitReasonGarbageCollection
1621
1622	// Run gc on the g0 stack. We do this so that the g stack
1623	// we're currently running on will no longer change. Cuts
1624	// the root set down a bit (g0 stacks are not scanned, and
1625	// we don't need to scan gc's internal state).  We also
1626	// need to switch to g0 so we can shrink the stack.
1627	systemstack(func() {
1628		gcMark(startTime)
1629		// Must return immediately.
1630		// The outer function's stack may have moved
1631		// during gcMark (it shrinks stacks, including the
1632		// outer function's stack), so we must not refer
1633		// to any of its variables. Return back to the
1634		// non-system stack to pick up the new addresses
1635		// before continuing.
1636	})
1637
1638	systemstack(func() {
1639		work.heap2 = work.bytesMarked
1640		if debug.gccheckmark > 0 {
1641			// Run a full non-parallel, stop-the-world
1642			// mark using checkmark bits, to check that we
1643			// didn't forget to mark anything during the
1644			// concurrent mark process.
1645			gcResetMarkState()
1646			initCheckmarks()
1647			gcw := &getg().m.p.ptr().gcw
1648			gcDrain(gcw, 0)
1649			wbBufFlush1(getg().m.p.ptr())
1650			gcw.dispose()
1651			clearCheckmarks()
1652		}
1653
1654		// marking is complete so we can turn the write barrier off
1655		setGCPhase(_GCoff)
1656		gcSweep(work.mode)
1657	})
1658
1659	_g_.m.traceback = 0
1660	casgstatus(gp, _Gwaiting, _Grunning)
1661
1662	if trace.enabled {
1663		traceGCDone()
1664	}
1665
1666	// all done
1667	mp.preemptoff = ""
1668
1669	if gcphase != _GCoff {
1670		throw("gc done but gcphase != _GCoff")
1671	}
1672
1673	// Record next_gc and heap_inuse for scavenger.
1674	memstats.last_next_gc = memstats.next_gc
1675	memstats.last_heap_inuse = memstats.heap_inuse
1676
1677	// Update GC trigger and pacing for the next cycle.
1678	gcSetTriggerRatio(nextTriggerRatio)
1679
1680	// Pacing changed, so the scavenger should be awoken.
1681	wakeScavenger()
1682
1683	// Update timing memstats
1684	now := nanotime()
1685	sec, nsec, _ := time_now()
1686	unixNow := sec*1e9 + int64(nsec)
1687	work.pauseNS += now - work.pauseStart
1688	work.tEnd = now
1689	atomic.Store64(&memstats.last_gc_unix, uint64(unixNow)) // must be Unix time to make sense to user
1690	atomic.Store64(&memstats.last_gc_nanotime, uint64(now)) // monotonic time for us
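	// pause_ns and pause_end are circular buffers of recent GC pauses,
	// indexed by the GC count modulo their length.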
1691	memstats.pause_ns[memstats.numgc%uint32(len(memstats.pause_ns))] = uint64(work.pauseNS)
1692	memstats.pause_end[memstats.numgc%uint32(len(memstats.pause_end))] = uint64(unixNow)
1693	memstats.pause_total_ns += uint64(work.pauseNS)
1694
1695	// Update work.totaltime.
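	// The STW phases (sweep termination and mark termination) are
	// charged as if all stwprocs Ps spent the whole pause in GC.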
1696	sweepTermCpu := int64(work.stwprocs) * (work.tMark - work.tSweepTerm)
1697	// We report idle marking time below, but omit it from the
1698	// overall utilization here since it's "free".
1699	markCpu := gcController.assistTime + gcController.dedicatedMarkTime + gcController.fractionalMarkTime
1700	markTermCpu := int64(work.stwprocs) * (work.tEnd - work.tMarkTerm)
1701	cycleCpu := sweepTermCpu + markCpu + markTermCpu
1702	work.totaltime += cycleCpu
1703
1704	// Compute overall GC CPU utilization.
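	// sched.totaltime only accumulates through the last gomaxprocs
	// change (see procresize), so the second term adds the CPU time
	// available since then at the current gomaxprocs.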
1705	totalCpu := sched.totaltime + (now-sched.procresizetime)*int64(gomaxprocs)
1706	memstats.gc_cpu_fraction = float64(work.totaltime) / float64(totalCpu)
1707
1708	// Reset sweep state.
1709	sweep.nbgsweep = 0
1710	sweep.npausesweep = 0
1711
1712	if work.userForced {
1713		memstats.numforcedgc++
1714	}
1715
1716	// Bump GC cycle count and wake goroutines waiting on sweep.
1717	lock(&work.sweepWaiters.lock)
1718	memstats.numgc++
1719	injectglist(&work.sweepWaiters.list)
1720	unlock(&work.sweepWaiters.lock)
1721
1722	// Finish the current heap profiling cycle and start a new
1723	// heap profiling cycle. We do this before starting the world
1724	// so events don't leak into the wrong cycle.
1725	mProf_NextCycle()
1726
1727	systemstack(func() { startTheWorldWithSema(true) })
1728
1729	// Flush the heap profile so we can start a new cycle next GC.
1730	// This is relatively expensive, so we don't do it with the
1731	// world stopped.
1732	mProf_Flush()
1733
1734	// Prepare workbufs for freeing by the sweeper. We do this
1735	// asynchronously because it can take non-trivial time.
1736	prepareFreeWorkbufs()
1737
1738	// Free stack spans. This must be done between GC cycles.
1739	systemstack(freeStackSpans)
1740
	// Ensure all mcaches are flushed. Each P will flush its own
	// mcache before allocating, but idle Ps may not. Since flushing
	// every mcache is necessary to sweep all spans, do it here,
	// before the next GC cycle can start.
1745	systemstack(func() {
1746		forEachP(func(_p_ *p) {
1747			_p_.mcache.prepareForSweep()
1748		})
1749	})
1750
1751	// Print gctrace before dropping worldsema. As soon as we drop
1752	// worldsema another cycle could start and smash the stats
1753	// we're trying to print.
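	// The output follows the format documented for GODEBUG=gctrace=1
	// in the runtime package:
	//
	//   gc # @#s #%: #+#+# ms clock, #+#/#/#+# ms cpu, #->#-># MB, # MB goal, # P
	//
	// The clock times are STW sweep termination, concurrent mark and
	// scan, and STW mark termination; the mark/scan CPU time is broken
	// down into assist, background, and idle time.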
1754	if debug.gctrace > 0 {
1755		util := int(memstats.gc_cpu_fraction * 100)
1756
1757		var sbuf [24]byte
1758		printlock()
1759		print("gc ", memstats.numgc,
1760			" @", string(itoaDiv(sbuf[:], uint64(work.tSweepTerm-runtimeInitTime)/1e6, 3)), "s ",
1761			util, "%: ")
1762		prev := work.tSweepTerm
1763		for i, ns := range []int64{work.tMark, work.tMarkTerm, work.tEnd} {
1764			if i != 0 {
1765				print("+")
1766			}
1767			print(string(fmtNSAsMS(sbuf[:], uint64(ns-prev))))
1768			prev = ns
1769		}
1770		print(" ms clock, ")
1771		for i, ns := range []int64{sweepTermCpu, gcController.assistTime, gcController.dedicatedMarkTime + gcController.fractionalMarkTime, gcController.idleMarkTime, markTermCpu} {
1772			if i == 2 || i == 3 {
1773				// Separate mark time components with /.
1774				print("/")
1775			} else if i != 0 {
1776				print("+")
1777			}
1778			print(string(fmtNSAsMS(sbuf[:], uint64(ns))))
1779		}
1780		print(" ms cpu, ",
1781			work.heap0>>20, "->", work.heap1>>20, "->", work.heap2>>20, " MB, ",
1782			work.heapGoal>>20, " MB goal, ",
1783			work.maxprocs, " P")
1784		if work.userForced {
1785			print(" (forced)")
1786		}
1787		print("\n")
1788		printunlock()
1789	}
1790
1791	semrelease(&worldsema)
1792	// Careful: another GC cycle may start now.
1793
1794	releasem(mp)
1795	mp = nil
1796
1797	// now that gc is done, kick off finalizer thread if needed
1798	if !concurrentSweep {
1799		// give the queued finalizers, if any, a chance to run
1800		Gosched()
1801	}
1802}
1803
1804// gcBgMarkStartWorkers prepares background mark worker goroutines.
1805// These goroutines will not run until the mark phase, but they must
// be started while the world is not stopped and from a regular G
1807// stack. The caller must hold worldsema.
1808func gcBgMarkStartWorkers() {
1809	// Background marking is performed by per-P G's. Ensure that
1810	// each P has a background GC G.
1811	for _, p := range allp {
1812		if p.gcBgMarkWorker == 0 {
1813			go gcBgMarkWorker(p)
1814			notetsleepg(&work.bgMarkReady, -1)
1815			noteclear(&work.bgMarkReady)
1816		}
1817	}
1818}
1819
1820// gcBgMarkPrepare sets up state for background marking.
1821// Mutator assists must not yet be enabled.
1822func gcBgMarkPrepare() {
1823	// Background marking will stop when the work queues are empty
1824	// and there are no more workers (note that, since this is
1825	// concurrent, this may be a transient state, but mark
1826	// termination will clean it up). Between background workers
1827	// and assists, we don't really know how many workers there
1828	// will be, so we pretend to have an arbitrarily large number
1829	// of workers, almost all of which are "waiting". While a
// worker is working, it decrements nwait. If nproc == nwait,
1831	// there are no workers.
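	//
	// For example, when a background worker begins draining it does
	// atomic.Xadd(&work.nwait, -1), making nwait == nproc-1. When it
	// finishes it re-increments nwait; if that restores nwait == nproc
	// and gcMarkWorkAvailable reports no work, the worker calls
	// gcMarkDone (see gcBgMarkWorker below).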
1832	work.nproc = ^uint32(0)
1833	work.nwait = ^uint32(0)
1834}
1835
1836func gcBgMarkWorker(_p_ *p) {
1837	gp := getg()
1838
1839	type parkInfo struct {
1840		m      muintptr // Release this m on park.
1841		attach puintptr // If non-nil, attach to this p on park.
1842	}
1843	// We pass park to a gopark unlock function, so it can't be on
1844	// the stack (see gopark). Prevent deadlock from recursively
1845	// starting GC by disabling preemption.
1846	gp.m.preemptoff = "GC worker init"
1847	park := new(parkInfo)
1848	gp.m.preemptoff = ""
1849
1850	park.m.set(acquirem())
1851	park.attach.set(_p_)
1852	// Inform gcBgMarkStartWorkers that this worker is ready.
1853	// After this point, the background mark worker is scheduled
1854	// cooperatively by gcController.findRunnable. Hence, it must
1855	// never be preempted, as this would put it into _Grunnable
1856	// and put it on a run queue. Instead, when the preempt flag
	// is set, it puts itself into _Gwaiting to be woken up by
1858	// gcController.findRunnable at the appropriate time.
1859	notewakeup(&work.bgMarkReady)
1860
1861	for {
1862		// Go to sleep until woken by gcController.findRunnable.
1863		// We can't releasem yet since even the call to gopark
1864		// may be preempted.
1865		gopark(func(g *g, parkp unsafe.Pointer) bool {
1866			park := (*parkInfo)(parkp)
1867
1868			// The worker G is no longer running, so it's
1869			// now safe to allow preemption.
1870			releasem(park.m.ptr())
1871
1872			// If the worker isn't attached to its P,
1873			// attach now. During initialization and after
1874			// a phase change, the worker may have been
1875			// running on a different P. As soon as we
1876			// attach, the owner P may schedule the
1877			// worker, so this must be done after the G is
1878			// stopped.
1879			if park.attach != 0 {
1880				p := park.attach.ptr()
1881				park.attach.set(nil)
1882				// cas the worker because we may be
1883				// racing with a new worker starting
1884				// on this P.
1885				if !p.gcBgMarkWorker.cas(0, guintptr(unsafe.Pointer(g))) {
1886					// The P got a new worker.
1887					// Exit this worker.
1888					return false
1889				}
1890			}
1891			return true
1892		}, unsafe.Pointer(park), waitReasonGCWorkerIdle, traceEvGoBlock, 0)
1893
1894		// Loop until the P dies and disassociates this
1895		// worker (the P may later be reused, in which case
1896		// it will get a new worker) or we failed to associate.
1897		if _p_.gcBgMarkWorker.ptr() != gp {
1898			break
1899		}
1900
1901		// Disable preemption so we can use the gcw. If the
1902		// scheduler wants to preempt us, we'll stop draining,
1903		// dispose the gcw, and then preempt.
1904		park.m.set(acquirem())
1905
1906		if gcBlackenEnabled == 0 {
1907			throw("gcBgMarkWorker: blackening not enabled")
1908		}
1909
1910		startTime := nanotime()
1911		_p_.gcMarkWorkerStartTime = startTime
1912
1913		decnwait := atomic.Xadd(&work.nwait, -1)
1914		if decnwait == work.nproc {
1915			println("runtime: work.nwait=", decnwait, "work.nproc=", work.nproc)
1916			throw("work.nwait was > work.nproc")
1917		}
1918
1919		systemstack(func() {
1920			// Mark our goroutine preemptible so its stack
1921			// can be scanned. This lets two mark workers
1922			// scan each other (otherwise, they would
1923			// deadlock). We must not modify anything on
1924			// the G stack. However, stack shrinking is
1925			// disabled for mark workers, so it is safe to
1926			// read from the G stack.
1927			casgstatus(gp, _Grunning, _Gwaiting)
1928			switch _p_.gcMarkWorkerMode {
1929			default:
1930				throw("gcBgMarkWorker: unexpected gcMarkWorkerMode")
1931			case gcMarkWorkerDedicatedMode:
1932				gcDrain(&_p_.gcw, gcDrainUntilPreempt|gcDrainFlushBgCredit)
1933				if gp.preempt {
1934					// We were preempted. This is
1935					// a useful signal to kick
1936					// everything out of the run
1937					// queue so it can run
1938					// somewhere else.
1939					lock(&sched.lock)
1940					for {
1941						gp, _ := runqget(_p_)
1942						if gp == nil {
1943							break
1944						}
1945						globrunqput(gp)
1946					}
1947					unlock(&sched.lock)
1948				}
1949				// Go back to draining, this time
1950				// without preemption.
1951				gcDrain(&_p_.gcw, gcDrainFlushBgCredit)
1952			case gcMarkWorkerFractionalMode:
1953				gcDrain(&_p_.gcw, gcDrainFractional|gcDrainUntilPreempt|gcDrainFlushBgCredit)
1954			case gcMarkWorkerIdleMode:
1955				gcDrain(&_p_.gcw, gcDrainIdle|gcDrainUntilPreempt|gcDrainFlushBgCredit)
1956			}
1957			casgstatus(gp, _Gwaiting, _Grunning)
1958		})
1959
1960		// Account for time.
1961		duration := nanotime() - startTime
1962		switch _p_.gcMarkWorkerMode {
1963		case gcMarkWorkerDedicatedMode:
1964			atomic.Xaddint64(&gcController.dedicatedMarkTime, duration)
1965			atomic.Xaddint64(&gcController.dedicatedMarkWorkersNeeded, 1)
1966		case gcMarkWorkerFractionalMode:
1967			atomic.Xaddint64(&gcController.fractionalMarkTime, duration)
1968			atomic.Xaddint64(&_p_.gcFractionalMarkTime, duration)
1969		case gcMarkWorkerIdleMode:
1970			atomic.Xaddint64(&gcController.idleMarkTime, duration)
1971		}
1972
1973		// Was this the last worker and did we run out
1974		// of work?
1975		incnwait := atomic.Xadd(&work.nwait, +1)
1976		if incnwait > work.nproc {
1977			println("runtime: p.gcMarkWorkerMode=", _p_.gcMarkWorkerMode,
1978				"work.nwait=", incnwait, "work.nproc=", work.nproc)
1979			throw("work.nwait > work.nproc")
1980		}
1981
1982		// If this worker reached a background mark completion
1983		// point, signal the main GC goroutine.
1984		if incnwait == work.nproc && !gcMarkWorkAvailable(nil) {
1985			// Make this G preemptible and disassociate it
1986			// as the worker for this P so
1987			// findRunnableGCWorker doesn't try to
1988			// schedule it.
1989			_p_.gcBgMarkWorker.set(nil)
1990			releasem(park.m.ptr())
1991
1992			gcMarkDone()
1993
1994			// Disable preemption and prepare to reattach
1995			// to the P.
1996			//
1997			// We may be running on a different P at this
1998			// point, so we can't reattach until this G is
1999			// parked.
2000			park.m.set(acquirem())
2001			park.attach.set(_p_)
2002		}
2003	}
2004}
2005
2006// gcMarkWorkAvailable reports whether executing a mark worker
2007// on p is potentially useful. p may be nil, in which case it only
2008// checks the global sources of work.
2009func gcMarkWorkAvailable(p *p) bool {
2010	if p != nil && !p.gcw.empty() {
2011		return true
2012	}
2013	if !work.full.empty() {
2014		return true // global work available
2015	}
2016	if work.markrootNext < work.markrootJobs {
2017		return true // root scan work available
2018	}
2019	return false
2020}
2021
// gcMark runs the mark phase (or, for concurrent GC, mark termination).
2023// All gcWork caches must be empty.
2024// STW is in effect at this point.
2025func gcMark(start_time int64) {
2026	if debug.allocfreetrace > 0 {
2027		tracegc()
2028	}
2029
2030	if gcphase != _GCmarktermination {
2031		throw("in gcMark expecting to see gcphase as _GCmarktermination")
2032	}
2033	work.tstart = start_time
2034
2035	// Check that there's no marking work remaining.
2036	if work.full != 0 || work.markrootNext < work.markrootJobs {
2037		print("runtime: full=", hex(work.full), " next=", work.markrootNext, " jobs=", work.markrootJobs, " nDataRoots=", work.nDataRoots, " nBSSRoots=", work.nBSSRoots, " nSpanRoots=", work.nSpanRoots, " nStackRoots=", work.nStackRoots, "\n")
2038		panic("non-empty mark queue after concurrent mark")
2039	}
2040
2041	if debug.gccheckmark > 0 {
2042		// This is expensive when there's a large number of
2043		// Gs, so only do it if checkmark is also enabled.
2044		gcMarkRootCheck()
2045	}
2046	if work.full != 0 {
2047		throw("work.full != 0")
2048	}
2049
2050	// Clear out buffers and double-check that all gcWork caches
2051	// are empty. This should be ensured by gcMarkDone before we
2052	// enter mark termination.
2053	//
2054	// TODO: We could clear out buffers just before mark if this
2055	// has a non-negligible impact on STW time.
2056	for _, p := range allp {
2057		// The write barrier may have buffered pointers since
2058		// the gcMarkDone barrier. However, since the barrier
2059		// ensured all reachable objects were marked, all of
2060		// these must be pointers to black objects. Hence we
2061		// can just discard the write barrier buffer.
2062		if debug.gccheckmark > 0 || throwOnGCWork {
2063			// For debugging, flush the buffer and make
2064			// sure it really was all marked.
2065			wbBufFlush1(p)
2066		} else {
2067			p.wbBuf.reset()
2068		}
2069
2070		gcw := &p.gcw
2071		if !gcw.empty() {
2072			printlock()
2073			print("runtime: P ", p.id, " flushedWork ", gcw.flushedWork)
2074			if gcw.wbuf1 == nil {
2075				print(" wbuf1=<nil>")
2076			} else {
2077				print(" wbuf1.n=", gcw.wbuf1.nobj)
2078			}
2079			if gcw.wbuf2 == nil {
2080				print(" wbuf2=<nil>")
2081			} else {
2082				print(" wbuf2.n=", gcw.wbuf2.nobj)
2083			}
2084			print("\n")
2085			throw("P has cached GC work at end of mark termination")
2086		}
2087		// There may still be cached empty buffers, which we
2088		// need to flush since we're going to free them. Also,
2089		// there may be non-zero stats because we allocated
2090		// black after the gcMarkDone barrier.
2091		gcw.dispose()
2092	}
2093
2094	throwOnGCWork = false
2095
2096	cachestats()
2097
2098	// Update the marked heap stat.
2099	memstats.heap_marked = work.bytesMarked
2100
2101	// Update other GC heap size stats. This must happen after
2102	// cachestats (which flushes local statistics to these) and
2103	// flushallmcaches (which modifies heap_live).
2104	memstats.heap_live = work.bytesMarked
2105	memstats.heap_scan = uint64(gcController.scanWork)
2106
2107	if trace.enabled {
2108		traceHeapAlloc()
2109	}
2110}
2111
2112// gcSweep must be called on the system stack because it acquires the heap
2113// lock. See mheap for details.
2114//go:systemstack
2115func gcSweep(mode gcMode) {
2116	if gcphase != _GCoff {
2117		throw("gcSweep being done but phase is not GCoff")
2118	}
2119
2120	lock(&mheap_.lock)
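	// h.sweepgen advances by 2 each cycle. Span sweepgen values are
	// interpreted relative to it (see mspan.sweepgen in mheap.go):
	// sweepgen-2 means the span needs sweeping, sweepgen-1 means it is
	// currently being swept, and sweepgen means it is already swept.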
2121	mheap_.sweepgen += 2
2122	mheap_.sweepdone = 0
2123	if mheap_.sweepSpans[mheap_.sweepgen/2%2].index != 0 {
2124		// We should have drained this list during the last
2125		// sweep phase. We certainly need to start this phase
2126		// with an empty swept list.
2127		throw("non-empty swept list")
2128	}
2129	mheap_.pagesSwept = 0
2130	mheap_.sweepArenas = mheap_.allArenas
2131	mheap_.reclaimIndex = 0
2132	mheap_.reclaimCredit = 0
2133	unlock(&mheap_.lock)
2134
2135	if !_ConcurrentSweep || mode == gcForceBlockMode {
2136		// Special case synchronous sweep.
2137		// Record that no proportional sweeping has to happen.
2138		lock(&mheap_.lock)
2139		mheap_.sweepPagesPerByte = 0
2140		unlock(&mheap_.lock)
2141		// Sweep all spans eagerly.
2142		for sweepone() != ^uintptr(0) {
2143			sweep.npausesweep++
2144		}
2145		// Free workbufs eagerly.
2146		prepareFreeWorkbufs()
2147		for freeSomeWbufs(false) {
2148		}
2149		// All "free" events for this mark/sweep cycle have
2150		// now happened, so we can make this profile cycle
2151		// available immediately.
2152		mProf_NextCycle()
2153		mProf_Flush()
2154		return
2155	}
2156
2157	// Background sweep.
2158	lock(&sweep.lock)
2159	if sweep.parked {
2160		sweep.parked = false
2161		ready(sweep.g, 0, true)
2162	}
2163	unlock(&sweep.lock)
2164}
2165
2166// gcResetMarkState resets global state prior to marking (concurrent
2167// or STW) and resets the stack scan state of all Gs.
2168//
2169// This is safe to do without the world stopped because any Gs created
2170// during or after this will start out in the reset state.
2171//
2172// gcResetMarkState must be called on the system stack because it acquires
2173// the heap lock. See mheap for details.
2174//
2175//go:systemstack
2176func gcResetMarkState() {
2177	// This may be called during a concurrent phase, so make sure
2178	// allgs doesn't change.
2179	lock(&allglock)
2180	for _, gp := range allgs {
		gp.gcscandone = false // set to true when gp's stack has been scanned
2182		gp.gcAssistBytes = 0
2183	}
2184	unlock(&allglock)
2185
2186	// Clear page marks. This is just 1MB per 64GB of heap, so the
2187	// time here is pretty trivial.
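	//
	// (For instance, on linux/amd64 arenas are 64MB and pages are 8KB,
	// so pageMarks is one bit per page = 1KB per arena, or 1MB for
	// every 1024 arenas = 64GB of heap.)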
2188	lock(&mheap_.lock)
2189	arenas := mheap_.allArenas
2190	unlock(&mheap_.lock)
2191	for _, ai := range arenas {
2192		ha := mheap_.arenas[ai.l1()][ai.l2()]
2193		for i := range ha.pageMarks {
2194			ha.pageMarks[i] = 0
2195		}
2196	}
2197
2198	work.bytesMarked = 0
2199	work.initialHeapLive = atomic.Load64(&memstats.heap_live)
2200}
2201
2202// Hooks for other packages
2203
2204var poolcleanup func()
2205
2206//go:linkname sync_runtime_registerPoolCleanup sync.runtime_registerPoolCleanup
2207func sync_runtime_registerPoolCleanup(f func()) {
2208	poolcleanup = f
2209}
2210
2211func clearpools() {
2212	// clear sync.Pools
2213	if poolcleanup != nil {
2214		poolcleanup()
2215	}
2216
2217	// Clear central sudog cache.
	// Leave per-P caches alone; they have strictly bounded size.
2219	// Disconnect cached list before dropping it on the floor,
2220	// so that a dangling ref to one entry does not pin all of them.
2221	lock(&sched.sudoglock)
2222	var sg, sgnext *sudog
2223	for sg = sched.sudogcache; sg != nil; sg = sgnext {
2224		sgnext = sg.next
2225		sg.next = nil
2226	}
2227	sched.sudogcache = nil
2228	unlock(&sched.sudoglock)
2229
2230	// Clear central defer pools.
	// Leave per-P pools alone; they have strictly bounded size.
2232	lock(&sched.deferlock)
2233	for i := range sched.deferpool {
2234		// disconnect cached list before dropping it on the floor,
2235		// so that a dangling ref to one entry does not pin all of them.
2236		var d, dlink *_defer
2237		for d = sched.deferpool[i]; d != nil; d = dlink {
2238			dlink = d.link
2239			d.link = nil
2240		}
2241		sched.deferpool[i] = nil
2242	}
2243	unlock(&sched.deferlock)
2244}
2245
2246// Timing
2247
2248// itoaDiv formats val/(10**dec) into buf.
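// For example, itoaDiv(buf[:], 12345, 3) produces "12.345".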
2249func itoaDiv(buf []byte, val uint64, dec int) []byte {
2250	i := len(buf) - 1
2251	idec := i - dec
2252	for val >= 10 || i >= idec {
2253		buf[i] = byte(val%10 + '0')
2254		i--
2255		if i == idec {
2256			buf[i] = '.'
2257			i--
2258		}
2259		val /= 10
2260	}
2261	buf[i] = byte(val + '0')
2262	return buf[i:]
2263}
2264
2265// fmtNSAsMS nicely formats ns nanoseconds as milliseconds.
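// For example, 123456789ns formats as "123" and 1234567ns as "1.2".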
2266func fmtNSAsMS(buf []byte, ns uint64) []byte {
2267	if ns >= 10e6 {
2268		// Format as whole milliseconds.
2269		return itoaDiv(buf, ns/1e6, 0)
2270	}
2271	// Format two digits of precision, with at most three decimal places.
2272	x := ns / 1e3
2273	if x == 0 {
2274		buf[0] = '0'
2275		return buf[:1]
2276	}
2277	dec := 3
2278	for x >= 100 {
2279		x /= 10
2280		dec--
2281	}
2282	return itoaDiv(buf, x, dec)
2283}
2284