# Satellite Service Separation

## Table of Contents
1. [Abstract](#abstract)
2. [Background](#background): Current Satellite Design
3. [Design](#design)
4. [Rationale](#rationale)
5. [Implementation](#implementation)

## Abstract

The goal of this design doc is to describe the work necessary to make the Satellite horizontally scalable.

## Background

Currently all Satellite services run as a single process. While this is great for development, it is bad for scaling.

### Current Satellite design

Currently the Satellite runs as a single process made up of the following parts:

#### overlay
The overlay maintains critical information about nodes in the network (via the nodes table in satellite.DB). One important use of overlay is to help select which storage nodes to upload files to.

#### metainfo
Metainfo is responsible for all things related to the metainfo stored for each file on the network. The metainfo system is currently composed of the following parts:

1) metainfo database, which stores data about buckets, objects, and segments.

2) metainfo loop, which iterates over all the data in the metainfo database.

3) metainfo endpoint, which creates the public RPCs for creating/deleting metainfo.

4) metainfo service, which ensures proper access to the metainfo database and simplifies modification for other services.

#### orders
Orders is responsible for creating/managing orders that the Satellite issues for a file upload/download. Orders are used to keep track of how much bandwidth was used to upload/download a file. This data is used to pay storage nodes for bandwidth usage and to charge the uplinks. See this [doc on the lifecycle of data](https://github.com/storj/docs/blob/main/code/payments/Accounting.md#lifecycle-of-the-data) related to accounting, which includes orders.

#### audit
Audit performs audits of the storage nodes to make sure the data they store is still retrievable.
The audit system is currently made up of an audit service that runs on an interval, performing audits on one segment at a time. The results of the audits are reported to the overlay service, which stores them in the nodes table in satellite.DB. See the [docs on audit](https://github.com/storj/docs/blob/main/code/audits/audit-service.md) for more details.

#### repair
Repair searches metainfo for injured segments and adds them to the repair queue. The repairer picks segments from the queue and tries to fix them. When repair fails, the segment is added to the irreparable database. The repair system is currently made up of 4 parts: 1) the repair observer (which contains the ReliabilityCache), 2) the irreparable loop, 3) the repairer, and 4) the repair queue. It uses 2 DBs (satellite.DB tables): 1) the injuredsegment table (the repair queue) and 2) the irreparabledb table.

#### garbage collection (GC)
GC iterates over the metainfo and creates a list for each storage node. It sends these lists to the storage nodes, which can then delete all pieces missing from the list. See the [GC design doc](https://github.com/storj/storj/blob/main/docs/design/garbage-collection.md) for more details.

#### accounting
Accounting calculates how much storage uplinks use and how much data storage nodes store. See the [docs on accounting](https://github.com/storj/docs/blob/main/code/payments/Accounting.md) for more details.

#### console
Console provides the web UI for the Satellite, where users can create new accounts/projects/apiKeys needed for uploading/downloading to the network.

#### mail
The `mail` service sends emails. Currently it's used by the console UI.

#### marketing
Marketing provides a private web UI for the marketing admins to manage the referral program.

#### nodestats
Nodestats allows storage nodes to ask the satellite for information about themselves. For example, a node can ask for stats on its reputation and its accounting storage usage.
#### inspectors
Inspectors allow private diagnostics on certain systems. The following inspectors currently exist: overlay inspector, health inspector, and irreparable inspector.

#### kademlia
Kademlia, discovery, bootstrap, and vouchers are being removed and are not included in this doc. See the [kademlia removal design doc](https://github.com/storj/storj/blob/main/docs/design/kademlia-removal.md) for more details.

#### RPC endpoints
The Satellite has the following RPC endpoints:
- Public: metainfo, nodestats, orders, overlay (currently part of kademlia, but may be added here)
- Private: inspectors

#### HTTP endpoints
The Satellite has the following HTTP endpoints:
- Public: console
- Private: marketing admin UI, debug

#### databases
All services (except version) make connections to satellite.DB. Five services rely on the metainfo service to access metainfoDB: inspectors, accounting, audit, repair, and garbage collection.

See these docs for details on the current database design:
- https://github.com/storj/docs/blob/main/code/Database.md#database
- https://github.com/storj/docs/blob/main/code/persistentstorage/Databases.md

#### limitations
The current items that prevent Satellite horizontal scaling include:
- the live accounting cache, which is currently stored in memory and so cannot be shared across many replicas
- in-memory/locally stored databases; this includes revocationDB and the devDefaults for satellite.DB and metainfoDB
- potentially things that use mutexes, sync.Mutex or sync.RWMutex (these don't prevent scaling, but indicate that there might be shared state)
- non-transactional db changes, i.e. a set on the database
- ?

## Design

The plan is to break the Satellite into multiple processes. Each process runs independently in its own isolated environment. Ideally, each process can be replicated and load balanced. This means they need to share access to any persistent data.
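To make the split concrete, here is a rough sketch of per-process wiring. The service types and constructor names below are hypothetical, purely for illustration; the real code wires concrete services from the storj codebase.

```go
package main

import "fmt"

// Service is a hypothetical interface standing in for any satellite service.
type Service interface {
	Name() string
	Run() error
}

// MetainfoEndpoint and RepairWorker are illustrative placeholders,
// not the actual types from the storj codebase.
type MetainfoEndpoint struct{}

func (MetainfoEndpoint) Name() string { return "metainfo-endpoint" }
func (MetainfoEndpoint) Run() error   { return nil }

type RepairWorker struct{}

func (RepairWorker) Name() string { return "repair-worker" }
func (RepairWorker) Run() error   { return nil }

// APIProcess wires only the services needed to answer public requests;
// RepairProcess wires only repair. Because each process carries only its
// own services and no shared in-memory state, each can be replicated
// and load balanced independently.
func APIProcess() []Service    { return []Service{MetainfoEndpoint{}} }
func RepairProcess() []Service { return []Service{RepairWorker{}} }

func main() {
	for _, svc := range APIProcess() {
		fmt.Println("api process runs:", svc.Name())
	}
	for _, svc := range RepairProcess() {
		fmt.Println("repair process runs:", svc.Name())
	}
}
```

The point of the sketch is the shape, not the names: each binary constructs only the dependency graph its endpoints actually need.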
### New satellite processes

Currently there is only one Satellite process. We propose to add the following processes:

#### satellite api
The satellite api will handle all public RPC and HTTP requests; this includes all public endpoints for nodestats, overlay, orders, metainfo, and the console web UI. It will need all the code required to successfully process these public requests, but no more than that. For example, if the console needs the mail service to successfully complete a request, then that code should be added, but make sure to include only the necessary parts. There shouldn't be any background jobs running here, nor persistent state: if there are no requests coming in, the satellite api should be idle.

#### private api
The private api process handles all private RPC and HTTP requests; this includes the inspectors (overlay, health, irreparable), debug endpoints, and the marketing web UI. Open question: do we still need the inspectors, and if not, should they be removed?

#### metainfo loop and the observer system
The metainfo loop process iterates over all the segments in metainfoDB repeatedly on an interval. On each loop, the process can also execute the code for the observer systems, each of which takes a segment as input and performs some action with it. The observer systems currently include: audit observer, GC observer, repair checker observer, and accounting tally.

The audit observer uses the segments from the metainfo loop to create segment reservoir samples for each storage node and saves those samples to a reservoir cache. The audit observer currently runs on a 30s interval with the release default settings. See the [audit-v2 design](https://github.com/storj/storj/blob/main/docs/design/audit-v2.md) for more details.

The repair (checker) observer uses the segments from the metainfo loop to identify segments that need to be repaired and adds those injured segments to the repair queue.
The repair checker currently has a `checkerObserver.ReliabilityCache`; it should be OK for this to stay in-memory for the time being, since we plan on running only a single metainfo loop. The repair observer currently runs on a 30s interval with the release default settings.

The garbage collection (GC) observer uses the segments from the metainfo loop to create bloom filters for each storage node. The bloom filters contain all the pieceIDs that the storage node should have. The bloom filters are stored in-memory, then sent to the appropriate storage node. Keeping the bloom filters in-memory is OK for the time being, since we don't plan to run more than a single replica of the metainfo loop service. The GC observer currently runs on a 5 day interval with the release default settings.

The tally observer uses the segments from the metainfo loop to sum the total data stored on storage nodes and in buckets, then saves these values in the `storagenode_storage_tally` and `bucket_storage_tally` tables in the satellite.DB database.

The following diagram outlines the metainfo loop with the 4 observers:

***

![Diagram of the above described metainfo loop observers](images/metainfo-loop-design.svg)

***

#### irreparable loop
Let's get rid of the irreparable loop. For one, we never expect there to be files that can't be repaired. In the off chance there are, let's just add the file back to the repair queue and try again later, but keep track of how many times we've tried.

#### repair workers
A repair worker executes a repair for an item in the repair queue. We want to work through the repair queue as fast as possible, so it's important to be able to dispatch many workers at a time.

#### audit workers
The audit process should be able to run many audits in parallel. See the [audit-v2 design doc](https://github.com/storj/storj/blob/main/docs/design/audit-v2.md) for updates to the audit design.
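The single-loop-with-observers pattern described above can be sketched roughly as follows. The `Segment`, `Observer`, and observer type names are illustrative stand-ins, not the actual interfaces in the storj codebase:

```go
package main

import "fmt"

// Segment stands in for one metainfo segment.
type Segment struct {
	Path string
}

// Observer is the hook the metainfo loop calls for every segment.
type Observer interface {
	Name() string
	ProcessSegment(seg Segment)
}

// auditObserver sketches the audit observer: in the real system it builds
// per-node reservoir samples; here it just records what it saw.
type auditObserver struct{ samples []string }

func (o *auditObserver) Name() string              { return "audit" }
func (o *auditObserver) ProcessSegment(seg Segment) { o.samples = append(o.samples, seg.Path) }

// checkerObserver sketches the repair checker: in the real system it
// enqueues injured segments; here it just records candidates.
type checkerObserver struct{ queued []string }

func (o *checkerObserver) Name() string              { return "repair-checker" }
func (o *checkerObserver) ProcessSegment(seg Segment) { o.queued = append(o.queued, seg.Path) }

// runLoop makes one pass over all segments, feeding every observer.
// This is the key property: the database is iterated once per interval,
// no matter how many observers are attached.
func runLoop(segments []Segment, observers []Observer) {
	for _, seg := range segments {
		for _, obs := range observers {
			obs.ProcessSegment(seg)
		}
	}
}

func main() {
	segments := []Segment{{Path: "bucket/a"}, {Path: "bucket/b"}}
	audit := &auditObserver{}
	checker := &checkerObserver{}
	runLoop(segments, []Observer{audit, checker})
	fmt.Println("audit saw", len(audit.samples), "segments")
	fmt.Println("checker saw", len(checker.queued), "segments")
}
```

Adding the GC or tally observer in this sketch is just another `Observer` in the slice; the cost of the loop itself is paid once.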
135 136#### accounting 137The accounting process is responsible for calculating disk usage and bandwidth usage. These calculations are used for uplink invoices and storage nodes payments. Accounting should receive storage node total stored bytes data from the tally observer running with the metainfo loop. 138 139#### uptime 140The uptime process will be responsible for pinging nodes we haven't heard from in a while and updating the overlay cache with the results. 141 142*** 143 144The following diagram shows the above proposed design: 145 146![Diagram of the above listed binaries](images/sa-separation-design.svg) 147 148*** 149 150## Rationale 151 152#### metainfo loop and observer system 153For database performance reasons we should only have one thing looping over the metainfo database. This is why we are combining everything into the metainfo loop observer system. There is no sense in having multiple different processes all redoing the work of iterating over the database and parsing protobufs if lots of things need to do that same work over and over. We could add functionality to this version service so that all binaries running in the new satellite system check in and provide software version compatibility details. There has been discussion about making the audit and GC observers run on different loops, but for performance concerns, its been decided to run all observers on the same metainfo loop (though GC will run less frequently). 154 155Keep in mind it's important for the GC observer that we hit every single item in the metainfoDB, otherwise the GC bloom filters could be inaccurate which could cause storage nodes to delete live data. 156 157For the metainfo loop and observer system its not critical to have high availability for these systems in the short term, therefore its ok to have all these observers depend on a single metainfo loop process. 
If the need arises to have multiple concurrent metainfo loops, then we can address those specific needs as they come up. In the worst case scenario, if the metainfo loop goes down, we don't expect any overall satellite system downtime; however, there may be metainfo loop downtime until it's fixed and back up. This should be fine for small periods of time. We should figure out what an acceptable amount of metainfo loop downtime is, should it occur.

#### satellite api

For creating the satellite api, there are two design options. One: a single process containing all public RPC/HTTP endpoints **and** all code necessary to process any request to those endpoints. Two: a single process that contains all public RPC/HTTP endpoints but does **not** contain the code to process requests; instead the api would act as a proxy, passing requests along to the correct backend services. Here we will take the first option, since it is less complex and it fulfills our need to run replicas to handle lots of traffic. In the future we can migrate to option two should the additional complexity be needed to satisfy some other need.

#### version

We need a way for all these new satellite binaries to perform a version compatibility check. In the short term the version server can be responsible for this: in addition to the storage nodes, the satellite system binaries can check in as well. Let's make sure that they cannot start up unless they are running the right version.

## Implementation

Note: removing Kademlia is a blocker for implementation, because the routing table is stored in-memory.

We should break out one process at a time from the Satellite. Here are the things we should do first, since they impede scale the most:
- satellite api
- repair workers
- audit workers

We will need to add a `Satellite` in `testplanet` so that we can continue using the same unit tests for the Satellite as we break it apart.
See an example of that [here](https://github.com/storj/storj/pull/2836/files#diff-c2ce7a2c9b2f4920e2de524c0fffd2f1R70).

For each new satellite process we need to do the following steps:
- create xProcess, where x is a Satellite service, e.g. RepairProcess, AccountingProcess, etc.
- update the satellite binary to include a subcommand so that the new xProcess can be run like this: `satellite run repair`
- look for any areas in the code that prevent horizontal scaling for this new process and fix them if found
- update testplanet so that unit tests still pass
- update storj-sim so that integration tests still pass
- create kubernetes deployment configs (this includes a dockerfile and a k8s HPA resource)
- set up automated deployment to the staging kubernetes environment

A couple of important notes:
- Every satellite system process must check in with the version server to confirm it is running the right version. The process cannot start unless it is running the correct version.
- When creating these new satellite system binaries, we want to make sure to pull in only the code necessary for the new process to accomplish its main tasks. For example, if the new satellite api process needs to create a pointer for pointerDB, we only want to include the minimum metainfo code needed to do so. We should not be adding any additional code like the metainfo loop or anything else unnecessary.

[Here](https://github.com/storj/storj/pull/2836) is a prototype PR with an example that implements these steps (minus deployment configs) for the repair service. It is now out of date, since we are adding the repair (checker) observer to the metainfo loop observer system, but this prototype is still useful as an example.

Other work:
- Add support for a shared cache for live accounting so that it's no longer stored only in memory.
- Add support for postgres in revocationDB.
- Update storj-sim to run with all the new services.
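As one illustration of the live accounting item above: hiding the cache behind an interface would let the in-memory backend be swapped for a shared store (e.g. Redis) without touching callers. This is a minimal sketch with hypothetical names, not the actual storj API:

```go
package main

import (
	"fmt"
	"sync"
)

// LiveAccountingCache abstracts the per-project usage cache so the backend
// can later move from process memory to a store shared across replicas.
type LiveAccountingCache interface {
	AddUsage(projectID string, bytes int64)
	GetUsage(projectID string) int64
}

// memoryCache is the current-style in-process implementation. A shared
// implementation (e.g. backed by Redis) would satisfy the same interface.
type memoryCache struct {
	mu    sync.Mutex
	usage map[string]int64
}

func newMemoryCache() *memoryCache {
	return &memoryCache{usage: make(map[string]int64)}
}

func (c *memoryCache) AddUsage(projectID string, bytes int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.usage[projectID] += bytes
}

func (c *memoryCache) GetUsage(projectID string) int64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.usage[projectID]
}

func main() {
	// Callers only see the interface, so the backend can change freely.
	var cache LiveAccountingCache = newMemoryCache()
	cache.AddUsage("project-1", 1024)
	cache.AddUsage("project-1", 2048)
	fmt.Println(cache.GetUsage("project-1")) // 3072
}
```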