# Satellite Service Separation

## Table of Contents
1. [Abstract](#abstract)
2. [Background](#background): Current Satellite Design
3. [Design](#design)
4. [Rationale](#rationale)
5. [Implementation](#implementation)

## Abstract

The goal of this design doc is to describe the work necessary to make the Satellite horizontally scalable.

## Background

Currently all Satellite services run as a single process. While this is great for development, it is bad for scaling.

### Current Satellite design

Currently the Satellite runs as a single process made up of the following parts:

#### overlay
The overlay maintains critical information about nodes in the network (via the nodes table in satellite.DB). One important use of overlay is to help select which storage nodes to upload files to.

#### metainfo
Metainfo is responsible for all things related to the metainfo stored for each file on the network. The metainfo system is currently composed of the following parts:

1) metainfo database, which stores data about buckets, objects and segments.

2) metainfo loop, which iterates over all the data in the metainfo database.

3) metainfo endpoint, which creates the public RPCs for creating/deleting metainfo.

4) metainfo service, which ensures proper access to the metainfo database and simplifies modification for other services.
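
To make the relationship between these parts concrete, here is a rough sketch in Go; the type and method names are illustrative assumptions, not the actual storj code. The endpoint handles the public RPCs and delegates to the service, and the service is the only component that talks to the database directly:

```go
package metainfo

import "context"

// DB is the storage layer holding buckets, objects, and segments (hypothetical interface).
type DB interface {
	Get(ctx context.Context, path string) (pointer []byte, err error)
	Put(ctx context.Context, path string, pointer []byte) error
	Delete(ctx context.Context, path string) error
}

// Service guards access to DB so other systems (audit, repair, GC, accounting)
// never have to touch the database directly.
type Service struct {
	db DB
}

// Put validates the request before writing to the database.
func (s *Service) Put(ctx context.Context, path string, pointer []byte) error {
	// validation / access checks would live here
	return s.db.Put(ctx, path, pointer)
}

// Endpoint exposes the public RPCs for creating/deleting metainfo and delegates to Service.
type Endpoint struct {
	service *Service
}
```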

#### orders
Orders is responsible for creating/managing orders that the satellite issues for a file upload/download. Orders are used to keep track of how much bandwidth was used to upload/download a file. This data is used to pay storage nodes for bandwidth usage and to charge the uplinks. See this [doc on the lifecycle of data](https://github.com/storj/docs/blob/main/code/payments/Accounting.md#lifecycle-of-the-data) related to accounting, which includes orders.

#### audit
Audit performs audits of the storage nodes to make sure the data they store is still retrievable. The audit system is currently made up of an audit service that runs on an interval, auditing one segment at a time. The results of the audits are reported to the overlay service to store in the nodes table in satellite.DB. See [docs on audit](https://github.com/storj/docs/blob/main/code/audits/audit-service.md) for more details.

#### repair
Repair searches metainfo for injured segments and adds them to the repair queue. The repairer picks segments from the queue and tries to fix them. When repair fails, the segment is added to the irreparable database. The repair system is currently made up of 4 parts and 2 database tables. The 4 parts are: 1) the repair observer (which contains the ReliabilityCache), 2) the irreparable loop, 3) the repairer, and 4) the repair queue. The 2 tables are: 1) the injuredsegment (repair queue) table in satellite.DB, and 2) the irreparabledb table in satellite.DB.

#### garbage collection (GC)
GC iterates over the metainfo and creates a list for each storage node. It sends these lists to the storage nodes, which can then delete all pieces missing from the list. See the [GC design doc](https://github.com/storj/storj/blob/main/docs/design/garbage-collection.md) for more details.

#### accounting
Accounting calculates how much storage the uplinks use and how much data the storage nodes store. See [docs on accounting](https://github.com/storj/docs/blob/main/code/payments/Accounting.md) for more details.

#### console
Console provides the web UI for the Satellite, where users can create new accounts/projects/apiKeys needed for uploading/downloading to the network.

#### mail
The `mail` service sends emails. Currently it's used by the console UI.

#### marketing
Marketing provides a private web UI for the marketing admins for the referral program.

#### nodestats
Nodestats allows storage nodes to request information about themselves from the satellite. For example, a node can ask for stats on its reputation and storage usage accounting.

#### inspectors
Inspectors allow private diagnostics on certain systems. The following inspectors currently exist: overlay inspector, health inspector, and irreparable inspector.

#### kademlia
Kademlia, discovery, bootstrap, and vouchers are being removed and not included in this doc. See [kademlia removal design doc](https://github.com/storj/storj/blob/main/docs/design/kademlia-removal.md) for more details.

#### RPC endpoints
The Satellite has the following RPC endpoints:
- Public: metainfo, nodestats, orders, overlay (currently part of kademlia, but may be added here)
- Private: inspectors

#### HTTP endpoints
The Satellite has the following HTTP endpoints:
- Public: console
- Private: marketing admin UI, debug

#### databases
All services (except version) make connections to satellite.DB. Five services rely on the metainfo service to access metainfoDB: inspectors, accounting, audit, repair, and garbage collection.

See these docs for details on the current database design:
- https://github.com/storj/docs/blob/main/code/Database.md#database
- https://github.com/storj/docs/blob/main/code/persistentstorage/Databases.md

#### limitations
The current items that prevent Satellite horizontal scaling include:
- the live accounting cache is currently stored in memory, which cannot be shared across many replicas
- in-memory/locally stored databases; this includes revocationDB and the devDefaults for satellite.DB and metainfoDB
- potentially anything that uses mutexes (sync.Mutex or sync.RWMutex); these don't prevent scaling by themselves, but they indicate that there might be shared state
- non-transactional database changes, i.e. "set" operations on the database
- possibly other items not yet identified

## Design

The plan is to break the Satellite into multiple processes. Each process runs independently in its own isolated environment. Ideally, each process can be replicated and load balanced. This means they need to share access to any persistent data.

### New satellite processes

Currently there is only one Satellite process. We propose to add the following processes:

#### satellite api
The satellite api will handle all public RPC and HTTP requests; this includes all public endpoints for nodestats, overlay, orders, metainfo, and the console web UI. It will need all the code to successfully process these public requests, but no more than that. For example, if the console needs the mail service to successfully complete any request, then that code should be added, but make sure to only include the necessary parts. There shouldn't be any background jobs or persistent state here, meaning that if no requests are coming in, the satellite api should be idle.
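
As a rough sketch of what this process boundary could look like (the names and wiring below are assumptions, not a settled implementation): the api only ties together request-serving pieces and blocks on serving, with nothing running in the background.

```go
package satelliteapi

import (
	"context"
	"net"
	"net/http"

	"golang.org/x/sync/errgroup"
	"google.golang.org/grpc"
)

// API holds only the public-facing servers; endpoint registration is omitted here.
type API struct {
	RPC     *grpc.Server // metainfo, orders, nodestats, overlay endpoints registered on this
	Console *http.Server // console web UI
}

// Run serves public requests until the context is canceled; with no traffic it is idle.
func (api *API) Run(ctx context.Context, rpcLis, httpLis net.Listener) error {
	var group errgroup.Group
	group.Go(func() error { return api.RPC.Serve(rpcLis) })
	group.Go(func() error {
		if err := api.Console.Serve(httpLis); err != nil && err != http.ErrServerClosed {
			return err
		}
		return nil
	})
	group.Go(func() error {
		<-ctx.Done()
		api.RPC.GracefulStop()
		return api.Console.Shutdown(context.Background())
	})
	return group.Wait()
}
```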

#### private api
The private api process handles all private RPC and HTTP requests; this includes the inspectors (overlay, health, irreparable), debug endpoints, and the marketing web UI. Open question: do we need the inspectors? If not, should they be removed?

#### metainfo loop and the observer system
The metainfo loop process iterates over all the segments in metainfoDB repeatedly on an interval. With each loop, the process also executes the code for the observer systems, which take a segment as input and perform some action with it. The observer systems currently include: the audit observer, the GC observer, the repair checker observer, and accounting tally.
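
A minimal sketch of this observer pattern (the interface and names are illustrative assumptions, not the exact storj API): the loop walks metainfoDB once per pass and hands every segment to each joined observer, so a single iteration serves audit, repair, GC, and tally at the same time.

```go
package metainfoloop

import "context"

// Segment carries whatever the observers need about one segment (illustrative fields).
type Segment struct {
	Path   string
	Pieces []string // piece IDs and the node IDs holding them would go here
	Size   int64
}

// Observer is implemented by the audit, repair checker, GC, and tally systems.
type Observer interface {
	HandleSegment(ctx context.Context, seg *Segment) error
}

// Loop owns the iteration over metainfoDB and fans each segment out to all observers.
type Loop struct {
	observers []Observer
}

// Join registers an observer for the next pass.
func (l *Loop) Join(obs Observer) { l.observers = append(l.observers, obs) }

// RunOnce does a single pass; iterate is whatever walks every segment in metainfoDB.
func (l *Loop) RunOnce(ctx context.Context, iterate func(func(*Segment) error) error) error {
	return iterate(func(seg *Segment) error {
		for _, obs := range l.observers {
			if err := obs.HandleSegment(ctx, seg); err != nil {
				return err
			}
		}
		return nil
	})
}
```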

The audit observer uses the segments from the metainfo loop to create segment reservoir samples for each storage node and saves those samples to a reservoir cache. The audit observer currently runs on a 30s interval for the release default setting. See the [audit-v2 design](https://github.com/storj/storj/blob/main/docs/design/audit-v2.md) for more details.

The repair (checker) observer uses the segments from the metainfo loop to identify segments that need to be repaired and adds those injured segments to the repair queue. The repair checker currently has a `checkerObserver.ReliabilityCache`; this should be ok to stay in-memory for the time being since we plan on only running a single metainfo loop. The repair observer currently runs on a 30s interval for the release default setting.

The garbage collector (GC) observer uses the segments from the metainfo loop to create bloom filters for each storage node. The bloom filters contain all the pieceIDs that the storage node should have. The bloom filters are stored in-memory and then sent to the appropriate storage node. Keeping the bloom filters in-memory is ok for the time being since we don't plan to run more than a single replica of the metainfo loop service. The GC observer currently runs on a 5-day interval for the release default setting.

The tally observer uses the segments from the metainfo loop to sum the total data stored on storage nodes and in buckets, and then saves these values in the `storagenode_storage_tally` and `bucket_storage_tally` tables in the satellite.DB database.

The following diagram outlines the metainfo loop with the 4 observers:

***

![Diagram of the above described metainfo loop observers](images/metainfo-loop-design.svg)

***

#### irreparable loop
Let's get rid of the irreparable loop. For one, we never expect there to be files that can't be repaired. In the off chance there are, let's just add the file back to the repair queue and try again later, but keep track of how many times we've tried.
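
A small sketch of this re-queue idea (the types are hypothetical): when a repair fails, the segment goes back on the queue with an incremented attempt counter instead of being moved to a separate irreparable table.

```go
package repair

import "context"

// InjuredSegment is a repair queue entry; AttemptCount records failed repair attempts.
type InjuredSegment struct {
	Path         string
	AttemptCount int
}

// RepairQueue is the shared queue backed by the injuredsegment table (hypothetical interface).
type RepairQueue interface {
	Insert(ctx context.Context, seg InjuredSegment) error
	Select(ctx context.Context) (InjuredSegment, error)
}

// requeueFailedRepair puts a failed segment back for a later retry and bumps the counter,
// so chronic failures stay visible without a dedicated irreparable loop.
func requeueFailedRepair(ctx context.Context, queue RepairQueue, seg InjuredSegment) error {
	seg.AttemptCount++
	return queue.Insert(ctx, seg)
}
```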

#### repair workers
A repair worker executes a repair for an item in the repair queue. We want to work through the repair queue as fast as possible, so it's important to be able to dispatch many workers at a time.
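
Reusing the hypothetical `RepairQueue` and `InjuredSegment` types from the previous sketch (and assuming `golang.org/x/sync/errgroup`), dispatching many workers could look like the fragment below: a fixed pool of goroutines pulls from the shared queue, so throughput scales with the worker count and with the number of replicas of this process.

```go
// runRepairWorkers starts `workers` goroutines that repeatedly pull from the repair
// queue and execute repairs until the context is canceled.
func runRepairWorkers(ctx context.Context, queue RepairQueue,
	repair func(context.Context, InjuredSegment) error, workers int) error {

	var group errgroup.Group
	for i := 0; i < workers; i++ {
		group.Go(func() error {
			for {
				// Select is assumed to block until an item is available and to fail once ctx is canceled.
				seg, err := queue.Select(ctx)
				if err != nil {
					return err
				}
				if err := repair(ctx, seg); err != nil {
					// a failed repair goes back on the queue with a bumped attempt count
					_ = requeueFailedRepair(ctx, queue, seg)
				}
			}
		})
	}
	return group.Wait()
}
```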

#### audit workers
The audit process should be able to run many audits in parallel. See the [audit-v2 design doc](https://github.com/storj/storj/blob/main/docs/design/audit-v2.md) for updates to the audit design.

#### accounting
The accounting process is responsible for calculating disk usage and bandwidth usage. These calculations are used for uplink invoices and storage node payments. Accounting should receive each storage node's total stored bytes from the tally observer running with the metainfo loop.

#### uptime
The uptime process will be responsible for pinging nodes we haven't heard from in a while and updating the overlay cache with the results.
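
A rough sketch of what this could look like; the overlay methods and the ping function below are assumptions for illustration, not existing APIs.

```go
package uptime

import (
	"context"
	"time"
)

type NodeID string

// OverlayCache is the subset of overlay behavior the uptime process would need (hypothetical).
type OverlayCache interface {
	NodesNotSeenSince(ctx context.Context, cutoff time.Time) ([]NodeID, error)
	UpdateUptime(ctx context.Context, id NodeID, online bool) error
}

// Run pings nodes that haven't been seen recently and records the results in overlay.
func Run(ctx context.Context, cache OverlayCache, ping func(context.Context, NodeID) error,
	interval, notSeenFor time.Duration) error {

	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
		nodes, err := cache.NodesNotSeenSince(ctx, time.Now().Add(-notSeenFor))
		if err != nil {
			return err
		}
		for _, id := range nodes {
			pingErr := ping(ctx, id)
			if err := cache.UpdateUptime(ctx, id, pingErr == nil); err != nil {
				return err
			}
		}
	}
}
```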

***

The following diagram shows the above proposed design:

![Diagram of the above listed binaries](images/sa-separation-design.svg)

***

## Rationale

#### metainfo loop and observer system
For database performance reasons we should only have one thing looping over the metainfo database. This is why we are combining everything into the metainfo loop observer system. There is no sense in having multiple different processes all redoing the work of iterating over the database and parsing protobufs when lots of things need that same data over and over. There has been discussion about making the audit and GC observers run on different loops, but for performance reasons it's been decided to run all observers on the same metainfo loop (though GC will run less frequently).

Keep in mind it's important for the GC observer that we hit every single item in metainfoDB; otherwise the GC bloom filters could be inaccurate, which could cause storage nodes to delete live data.

For the metainfo loop and observer system it's not critical to have high availability in the short term, therefore it's ok to have all these observers depend on a single metainfo loop process. If the need arises to have multiple concurrent metainfo loops then we can address those specific needs as they come up. In the worst case scenario, if the metainfo loop goes down we don't expect any overall satellite system downtime; however, there may be metainfo loop downtime until it's fixed and back up. This should be fine for small periods of time. We should figure out what an acceptable amount of downtime for the metainfo loop is, should it occur.

#### satellite api

For creating the satellite api, there are two options for the design. One, a single process containing all public RPC/HTTP endpoints **and** all code necessary to process any request to those endpoints. Or two, a single process that contains all public RPC/HTTP endpoints but does **not** contain the code to process requests; instead the api would act as a proxy, passing along requests to the correct backend services. Here we will do the first option since it is less complex and it fulfills our need to run replicas to handle lots of traffic. In the future we can migrate to option two should the additional complexity be needed to satisfy some other need.

#### version

We need a way for all these new satellite binaries to perform a version compatibility check. In the short term the version server can be responsible for this, so in addition to the storage nodes, the satellite system binaries can check in as well. Let's make sure that they cannot start up unless they are running the right version.
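
A minimal sketch of that startup gate, assuming a version client with roughly the shape below (not the actual version server API): the process refuses to run if its own build is not in the allowed set.

```go
package versioncheck

import (
	"context"
	"fmt"
)

// VersionClient asks the version server which versions are currently allowed (hypothetical).
type VersionClient interface {
	AllowedVersions(ctx context.Context) (map[string]bool, error)
}

// checkVersion is called at process startup; any error means the process must not start.
func checkVersion(ctx context.Context, client VersionClient, buildVersion string) error {
	allowed, err := client.AllowedVersions(ctx)
	if err != nil {
		return fmt.Errorf("version check failed: %w", err)
	}
	if !allowed[buildVersion] {
		return fmt.Errorf("version %s is not allowed; refusing to start", buildVersion)
	}
	return nil
}
```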

## Implementation

Note: getting Kademlia removed is a blocker to implementation because the routing table is stored in-memory.

We should break out one process at a time from the Satellite. Here are the things we should do first since they impede scaling the most:
- satellite api
- repair workers
- audit workers

We will need to add a `Satellite` in `testplanet` so that we can continue using the same unit tests for the Satellite as we break it apart. See an example of that [here](https://github.com/storj/storj/pull/2836/files#diff-c2ce7a2c9b2f4920e2de524c0fffd2f1R70).

For each new satellite process we need to do the following steps:
- create xProcess, where x is a Satellite service, e.g. RepairProcess, AccountingProcess, etc.
- update the satellite binary to include a subcommand so that the new xProcess can be run like this:
`satellite run repair` (see the sketch after this list)
- look for any areas in the code that prevent horizontal scaling for this new process and fix them if found
- update testplanet so that unit tests still pass
- update storj-sim so that integration tests still pass
- create kubernetes deployment configs (this includes a dockerfile and a k8s HPA resource)
- set up automated deployment to the staging kubernetes environment
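
Here is a sketch of the subcommand wiring, assuming a cobra-based CLI; `NewRepairProcess` and its `Run` method are hypothetical stand-ins for the new xProcess.

```go
package main

import (
	"context"
	"os"

	"github.com/spf13/cobra"
)

var rootCmd = &cobra.Command{Use: "satellite"}

func main() {
	runCmd := &cobra.Command{Use: "run", Short: "run a satellite process"}
	runCmd.AddCommand(&cobra.Command{
		Use:   "repair",
		Short: "run only the repair process",
		RunE: func(cmd *cobra.Command, args []string) error {
			ctx := context.Background()
			process, err := NewRepairProcess(ctx) // hypothetical constructor for the new xProcess
			if err != nil {
				return err
			}
			return process.Run(ctx)
		},
	})
	rootCmd.AddCommand(runCmd)
	if err := rootCmd.Execute(); err != nil {
		os.Exit(1)
	}
}
```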

A couple of important notes:
- every satellite system process must check in with the version server to confirm it is running the right version. The process cannot start unless it is running the correct version.
- when creating these new satellite system binaries, we want to make sure to only pull in the code necessary for the new process to accomplish its main tasks. For example, if the new satellite api process needs to create a pointer for pointerDB, we only want to include the minimum metainfo code needed to do so. We should not be adding any additional code like the metainfo loop or anything else unnecessary.

[Here](https://github.com/storj/storj/pull/2836) is a prototype PR with an example that implements these steps (minus deployment configs) for the repair service. This is now out of date since we are adding the repair (checker) observer to the metainfo loop observer system, but this prototype is still useful as an example.

Other work:
- Add a shared backend for the live accounting cache so that it's no longer stored in process memory (see the sketch after this list).
- Add support for postgres to revocationDB.
- Update storj-sim to run with all the new services.
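
As one possible shape for the live accounting change (the choice of Redis and the go-redis client here are assumptions for illustration, not decisions from this doc): back the cache with a shared store so that every satellite api replica sees the same running totals.

```go
package liveaccounting

import (
	"context"
	"fmt"

	"github.com/go-redis/redis/v8"
)

// Cache tracks per-project space usage that hasn't been tallied yet, in a shared Redis store.
type Cache struct {
	client *redis.Client
}

// AddSpaceUsed atomically adds delta to the project's running total.
func (c *Cache) AddSpaceUsed(ctx context.Context, projectID string, delta int64) error {
	return c.client.IncrBy(ctx, "live-space:"+projectID, delta).Err()
}

// GetSpaceUsed returns the current running total for the project (zero if none recorded).
func (c *Cache) GetSpaceUsed(ctx context.Context, projectID string) (int64, error) {
	val, err := c.client.Get(ctx, "live-space:"+projectID).Int64()
	if err == redis.Nil {
		return 0, nil
	}
	if err != nil {
		return 0, fmt.Errorf("live accounting: %w", err)
	}
	return val, nil
}
```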