| Name | Date | Size | #Lines | LOC |
|---|---|---|---|---|
| .. | 03-May-2022 | - | | |
| README.md | 02-Dec-2021 | 3.3 KiB | 80 | 66 |
| job.c | 02-Dec-2021 | 25.2 KiB | 906 | 649 |
| job.h | 02-Dec-2021 | 2.3 KiB | 62 | 41 |
| job_stat.c | 02-Dec-2021 | 16.8 KiB | 540 | 398 |
| job_stat.h | 02-Dec-2021 | 1.3 KiB | 42 | 26 |
| launcher_interface.c | 02-Dec-2021 | 1.6 KiB | 64 | 45 |
| launcher_interface.h | 02-Dec-2021 | 570 | 17 | 9 |
| scheduler.c | 02-Dec-2021 | 24 KiB | 906 | 613 |
| scheduler.h | 02-Dec-2021 | 1.2 KiB | 40 | 22 |
| timer.c | 02-Dec-2021 | 2.7 KiB | 128 | 93 |
| timer.h | 02-Dec-2021 | 692 | 31 | 18 |
README.md

# Background worker jobs

TimescaleDB needs to run multiple background jobs. This module
implements a simple scheduler so that jobs inserted into a jobs table
can be run on a schedule. Each database in an instance runs its own
scheduler because different databases may run different TimescaleDB
extension versions, which may require different scheduler logic.

## Schedules

The scheduler lets you set a `schedule_interval` for every job. This is
the interval the scheduler waits after a job finishes successfully before
starting it again. If the job fails, the scheduler instead uses `retry_period`
in an exponential backoff to decide when to run the job again.
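
The rule above can be sketched as a small function. This is only an illustration, not the module's actual code: the struct, the function name, the use of plain seconds, the doubling factor, and the cap on the exponent are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cap on the backoff exponent; the README does not specify one. */
#define MAX_BACKOFF_EXPONENT 5

/* Hypothetical per-job settings; all values are in seconds for simplicity. */
typedef struct JobSchedule
{
    int64_t schedule_interval; /* wait after a successful run */
    int64_t retry_period;      /* base wait after a failed run */
} JobSchedule;

/*
 * Compute when a job should start next, given when its last run finished,
 * whether it succeeded, and how many consecutive failures it has had
 * (counting the run that just finished).
 */
int64_t
next_start_after_run(const JobSchedule *sched, int64_t finish_time,
                     bool succeeded, int consecutive_failures)
{
    int exponent;

    assert(sched != NULL);

    if (succeeded)
        return finish_time + sched->schedule_interval;

    /* First failure waits retry_period, the second twice that, and so on. */
    exponent = consecutive_failures - 1;
    if (exponent < 0)
        exponent = 0;
    if (exponent > MAX_BACKOFF_EXPONENT)
        exponent = MAX_BACKOFF_EXPONENT;

    return finish_time + sched->retry_period * ((int64_t) 1 << exponent);
}
```

With a `schedule_interval` of 3600 and a `retry_period` of 300, a successful run that finishes at time 1000 is rescheduled for 4600, while a third consecutive failure at the same time is rescheduled for 2200.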

## Design

The scheduler itself is a background job that runs continuously and waits
until jobs are due to be scheduled. It then launches jobs as new
background workers that it controls through the background worker handle.
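
One piece of that wait loop, deciding how long to sleep, might look roughly like the following. The function names and the representation of times as plain seconds are hypothetical; a real scheduler would also have to wake up for worker exits and timeouts, which this sketch ignores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the earliest next_start among the scheduled jobs, or `fallback`
 * when there are no jobs. Times are seconds since an arbitrary epoch.
 */
int64_t
earliest_next_start(const int64_t *next_starts, size_t njobs, int64_t fallback)
{
    int64_t earliest = fallback;
    size_t i;

    assert(next_starts != NULL || njobs == 0);

    for (i = 0; i < njobs; i++)
    {
        if (i == 0 || next_starts[i] < earliest)
            earliest = next_starts[i];
    }
    return earliest;
}

/* How long to sleep before the next wake-up; never negative. */
int64_t
sleep_duration(int64_t now, int64_t wakeup)
{
    return wakeup > now ? wakeup - now : 0;
}
```

A job whose `next_start` is already in the past yields a sleep of zero, so it is launched on the next pass through the loop.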

Aggregate statistics about a job are kept in the job stat catalog table.
These statistics include the start and finish times of the last run of the job
as well as whether or not the job succeeded. The `next_start` is used to
figure out when to run a job next after a scheduler is restarted.

The statistics table also tracks consecutive failures and crashes for the job.
These counts are used to calculate the exponential backoff after a crash or
failure, which in turn sets the `next_start`. Note also that there is a
minimum wait between the time the database scheduler starts up and the time
a crashed job is restarted. This gives the operator enough time to disable
the job if needed.

Note that the number of crashes is an overestimate of the actual number of crashes
for a job. We are deliberately conservative so that we never miss a crash and fail to
use the appropriate backoff logic. There is some complexity
in ensuring that all crashes are counted. A crash in Postgres causes *all*
processes to quit immediately, so we cannot write anything to the database once
any process has crashed. Thus, we must be able to deduce that a crash occurred
from a commit that happened before any crash. We accomplish
this by committing a change to the stats table before a job starts and
undoing the change after it finishes. If a job crashed, it will be left
in an intermediate state from which we deduce that it could have been the
crashing process.
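
The commit-then-undo idea can be illustrated with an in-memory stand-in for a stats row. The struct, its fields, and the function names below are hypothetical and do not claim to match the catalog table or the functions in `job_stat.c`; in the real module each of these updates would be a committed database write.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical in-memory stand-in for one job's row in the stats table. */
typedef struct JobStat
{
    int consecutive_crashes;
    int total_crashes;
} JobStat;

/* Recorded before the job's worker starts: provisionally count a crash. */
void
job_stat_mark_start(JobStat *stat)
{
    assert(stat != NULL);
    stat->consecutive_crashes++;
    stat->total_crashes++;
}

/*
 * Recorded after the job finishes without crashing: undo the provisional
 * crash. A clean finish also ends any streak of consecutive crashes.
 */
void
job_stat_mark_end(JobStat *stat)
{
    assert(stat != NULL && stat->total_crashes > 0);
    stat->consecutive_crashes = 0;
    stat->total_crashes--;
}

/* After a restart: was this job left in the intermediate state? */
bool
job_stat_may_have_crashed(const JobStat *stat)
{
    assert(stat != NULL);
    return stat->consecutive_crashes > 0;
}
```

This also shows why the count is an overestimate: if two jobs are running and only one of them crashes, neither gets to record its end, so both rows are left in the intermediate state and both are treated as possible crashers.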

## Scheduler State Machine

The scheduler implements a state machine for each job.
Each job starts in the SCHEDULED state. As soon as a job starts,
it enters the STARTING state. If the scheduler determines that the
job should be terminated (e.g. it has reached a timeout), it moves
the job to the TERMINATING state. Once the background worker for
a job has stopped, the job returns to the SCHEDULED state.
The states and associated transitions are as follows.

```
      +---------+         +--------+
+---> |SCHEDULED+-------> |DISABLED|
|     +----+----+         +--------+
|          |
|          |
|          v
|      +---+----+
+<-----+STARTING|
|      +---+----+
|          |
|          |
|          v
|      +---+-------+
+<-----+TERMINATING|
       +-----------+
```
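
Read as code, the diagram corresponds to a transition table like the one below. The enum and function names are illustrative only and are not taken from `scheduler.c`; the allowed transitions are exactly the arrows drawn above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for the four states shown in the diagram. */
typedef enum JobState
{
    JOB_STATE_DISABLED,
    JOB_STATE_SCHEDULED,
    JOB_STATE_STARTING,
    JOB_STATE_TERMINATING
} JobState;

/* True if the diagram has an arrow from `from` to `to`. */
bool
job_state_transition_allowed(JobState from, JobState to)
{
    switch (from)
    {
        case JOB_STATE_SCHEDULED:
            return to == JOB_STATE_STARTING || to == JOB_STATE_DISABLED;
        case JOB_STATE_STARTING:
            return to == JOB_STATE_TERMINATING || to == JOB_STATE_SCHEDULED;
        case JOB_STATE_TERMINATING:
            return to == JOB_STATE_SCHEDULED;
        case JOB_STATE_DISABLED:
            return false; /* the diagram shows no arrow out of DISABLED */
    }
    return false;
}

/* Move a job to a new state, asserting that the diagram permits it. */
void
job_state_set(JobState *state, JobState to)
{
    assert(state != NULL);
    assert(job_state_transition_allowed(*state, to));
    *state = to;
}
```

Centralizing the check in one function means an unexpected transition, such as DISABLED back to STARTING, fails loudly in a debug build instead of silently corrupting the job's state.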

## Limitations

This first implementation has two limitations:

- The list of jobs to be run is read from the database when the scheduler is first started.
  We do not update this list if the jobs table changes.
- There is no prioritization for when to run jobs.