# paraLLEl-RDP

This project is a revival and complete rewrite of the old, defunct paraLLEl-RDP project.

The goal is to implement the Nintendo 64 RDP graphics chip as accurately as possible using Vulkan compute.
The implementation aims to be bit-exact with the
[Angrylion-Plus](https://github.com/ata4/angrylion-rdp-plus) reference renderer where possible.

## Disclaimer

While paraLLEl-RDP uses [Angrylion-Plus](https://github.com/ata4/angrylion-rdp-plus)
as an implementation reference, it is not a port, and not a derived codebase of said project.
It is written from scratch by studying [Angrylion-Plus](https://github.com/ata4/angrylion-rdp-plus)
and trying to understand what is going on.
The test suite uses [Angrylion-Plus](https://github.com/ata4/angrylion-rdp-plus) as a reference
to validate the implementation and cross-check behavior.

## Use cases

- **Much** faster LLE RDP emulation of N64 compared to a CPU implementation,
  as parallel graphics workloads are offloaded to the GPU.
  Emulation performance is now completely bound by CPU and LLE RSP performance.
  Early benchmarks based on GPU timestamp data suggest 2000-5000 VI/s on mid-range desktop GPUs.
  CPU emulation cannot keep up with that, which means this should
  scale down to fairly weak GPUs as well, assuming the driver requirements are met.
- A backend renderer for standalone engines which aim to efficiently reproduce faithful N64 graphics.
- Hopefully, an easier to understand implementation than the reference renderer.
- An esoteric use case of advanced Vulkan compute programming.

## Missing features

The implementation is quite complete, and compatibility is very high in the limited amount of content I've tested.
However, not every single feature is supported at this moment.
Ticking the last boxes depends mostly on real content making use of said features.

- Color combiner chroma keying
- Various "bugs" / questionable behavior that seems meaningless to emulate
- Certain extreme edge cases in TMEM upload. The implementation has tests for many "crazy" edge cases though.
- ... possibly other obscure features

The VI is essentially complete. A fancy deinterlacer might be useful to add since we have plenty of GPU cycles to spare in the graphics queue.
VI filtering is always turned on if the game requests it, but features can selectively be turned off for the pixel purists.

## Environment variables for development / testing

### `RDP_DEBUG` / `RDP_DEBUG_X` / `RDP_DEBUG_Y`

Supports printf in shaders, which is extremely useful for drilling down into difficult bugs.
Output can be filtered so that only printfs from certain pixels get through, which avoids spam.

### `VI_DEBUG` / `VI_DEBUG_X` / `VI_DEBUG_Y`

Same as `RDP_DEBUG` but for the VI.

### `PARALLEL_RDP_MEASURE_SYNC_TIME`

Measures time stalled in `CommandProcessor::wait_for_timeline`. Useful to measure
CPU overhead in hard-synced emulator integrations.

### `PARALLEL_RDP_SMALL_TYPES=0`

Force-disables 8/16-bit arithmetic support. Useful when suspecting driver bugs.

### `PARALLEL_RDP_UBERSHADER=1`

Forces the use of ubershaders. Can be extremely slow depending on the shader compiler.

### `PARALLEL_RDP_FORCE_SYNC_SHADER=1`

Disables the async pipeline optimization, and blocks on every shader compilation.
Only use if the ubershader crashes, since this adds the dreaded shader compilation stalls.

### `PARALLEL_RDP_BENCH=1`

Measures RDP rendering time spent on GPU using Vulkan timestamps.
At the end of a run, reports average time spent per render pass,
and how many render passes are flushed per frame.
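
The arithmetic behind such a report can be sketched as follows. This is an illustration only, not the code paraLLEl-RDP uses: `ticks_to_ns` and `BenchStats` are hypothetical names. The one Vulkan fact relied on is that `VkPhysicalDeviceLimits::timestampPeriod` gives the number of nanoseconds per timestamp tick.

```cpp
#include <cstdint>

// Converts a begin/end pair of GPU timestamps to nanoseconds.
// timestamp_period_ns mirrors VkPhysicalDeviceLimits::timestampPeriod.
inline double ticks_to_ns(std::uint64_t begin_ticks, std::uint64_t end_ticks,
                          float timestamp_period_ns)
{
    return double(end_ticks - begin_ticks) * double(timestamp_period_ns);
}

// Hypothetical accumulator for the two numbers reported at the end of a run.
struct BenchStats
{
    double total_ns = 0.0;
    std::uint64_t render_passes = 0;
    std::uint64_t frames = 0;

    // Called once per flushed render pass with its measured GPU time.
    void add_render_pass(double ns)
    {
        total_ns += ns;
        render_passes++;
    }

    // Called once per presented frame.
    void end_frame()
    {
        frames++;
    }

    double average_ns_per_pass() const
    {
        return render_passes ? total_ns / double(render_passes) : 0.0;
    }

    double passes_per_frame() const
    {
        return frames ? double(render_passes) / double(frames) : 0.0;
    }
};
```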

### `PARALLEL_RDP_SUBGROUP=0`

Force-disables use of Vulkan subgroup operations,
which are used to optimize the tile binning algorithm.

### `PARALLEL_RDP_ALLOW_EXTERNAL_HOST=0`

Disables use of `VK_EXT_external_memory_host`. For testing.
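
For integrators who want to mirror these switches in their own frontend, a minimal sketch of reading `=0` / `=1` style variables might look like the following. The helper names (`parse_flag`, `env_flag`, `DevOptions`, `read_dev_options`) are hypothetical, and the defaults are only inferred from the wording above (a `=0` variable is assumed to default to enabled, a `=1` variable to disabled). paraLLEl-RDP's own parsing may differ.

```cpp
#include <cstdlib>

// Interprets a raw environment value as returned by std::getenv.
// A null or empty value means "unset", so the default wins.
inline bool parse_flag(const char *value, bool default_value)
{
    if (!value || *value == '\0')
        return default_value;
    return std::strtol(value, nullptr, 0) != 0;
}

inline bool env_flag(const char *name, bool default_value)
{
    return parse_flag(std::getenv(name), default_value);
}

// Hypothetical bundle of the development switches documented above.
// "true" for the =0 style options only means "not force-disabled";
// actual use still depends on what the driver supports.
struct DevOptions
{
    bool small_types;
    bool ubershader;
    bool force_sync_shader;
    bool bench;
    bool subgroup;
    bool allow_external_host;
};

inline DevOptions read_dev_options()
{
    DevOptions opts;
    opts.small_types = env_flag("PARALLEL_RDP_SMALL_TYPES", true);
    opts.ubershader = env_flag("PARALLEL_RDP_UBERSHADER", false);
    opts.force_sync_shader = env_flag("PARALLEL_RDP_FORCE_SYNC_SHADER", false);
    opts.bench = env_flag("PARALLEL_RDP_BENCH", false);
    opts.subgroup = env_flag("PARALLEL_RDP_SUBGROUP", true);
    opts.allow_external_host = env_flag("PARALLEL_RDP_ALLOW_EXTERNAL_HOST", true);
    return opts;
}
```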

## Vulkan driver requirements

paraLLEl-RDP requires up-to-date Vulkan implementations. Many of the big improvements over the previous implementation
come from the idea that we can implement the N64's UMA by simply importing RDRAM directly as an SSBO and performing 8-bit and 16-bit
data access over the bus. With the tile-based architecture in paraLLEl-RDP, this works very well and actual
PCI-e traffic is massively reduced. The bandwidth required is also trivial. On iGPU systems, this also works really well, since
it's all the same memory anyway.

Thus, the requirements are as follows. All of these features are widely supported, or will soon be in drivers.
paraLLEl-RDP does not aim for compatibility with ancient hardware and drivers.
Just use the reference renderer for that. This is enthusiast software for a niche audience.

- Vulkan 1.1
- VK_KHR_8bit_storage / VK_KHR_16bit_storage
- Optionally VK_KHR_shader_float16_int8 which enables small integer arithmetic
- Optionally subgroup support with VK_EXT_subgroup_size_control
- For integration in emulators, VK_EXT_external_memory_host is currently required (may be relaxed later at some performance cost)
### Tested drivers

paraLLEl-RDP has been tested on Linux and Windows with drivers from all desktop vendors.

- Intel Mesa (20.0.6) - Passes conformance
- Intel Windows - Passes conformance (**CAVEAT**: Intel Windows requires 64 KiB alignment for host memory import, so make sure to add some padding around RDRAM in an emulator to make this work well.)
- AMD RADV LLVM (20.0.6) - Passes conformance
- AMD RADV ACO - Passes conformance with bleeding edge drivers and `PARALLEL_RDP_SMALL_TYPES=0`.
- Linux AMDGPU-PRO - Passes conformance, with the caveat that 8/16-bit arithmetic does not work correctly for some tests.
  paraLLEl-RDP automatically disables small integer arithmetic for the proprietary AMD driver.
- AMD Windows - Passes conformance with the same caveat and workaround as AMDGPU-PRO.
- NVIDIA Linux - Passes conformance (**MAJOR CAVEAT**: NVIDIA Linux does not support VK_EXT_external_memory_host as of 2020-05-12.)
- NVIDIA Windows - Passes conformance
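
One way an emulator could add the padding mentioned in the Intel Windows caveat is sketched below. It over-allocates and carves out a 64 KiB aligned window for RDRAM. `PaddedRDRAM`, `align_up` and `kImportAlignment` are hypothetical names, and the hard-coded 64 KiB is taken from the caveat above; a real integration would more likely query `minImportedHostPointerAlignment` from `VkPhysicalDeviceExternalMemoryHostPropertiesEXT` instead.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Assumed alignment, taken from the Intel Windows caveat.
constexpr std::size_t kImportAlignment = 64 * 1024;

// Rounds size up to a multiple of alignment (alignment must be a power of two).
constexpr std::size_t align_up(std::size_t size, std::size_t alignment)
{
    return (size + alignment - 1) & ~(alignment - 1);
}

// Owns padded storage and exposes an aligned RDRAM window inside it.
struct PaddedRDRAM
{
    std::vector<std::uint8_t> storage;
    std::uint8_t *base = nullptr; // aligned start of RDRAM
    std::size_t size = 0;         // RDRAM size rounded up to the alignment

    explicit PaddedRDRAM(std::size_t rdram_size)
    {
        size = align_up(rdram_size, kImportAlignment);
        // Over-allocate by one alignment unit so an aligned window always fits.
        storage.resize(size + kImportAlignment);
        void *ptr = storage.data();
        std::size_t space = storage.size();
        base = static_cast<std::uint8_t *>(std::align(kImportAlignment, size, ptr, space));
    }

    // base points into storage, so copying would leave it dangling.
    PaddedRDRAM(const PaddedRDRAM &) = delete;
    PaddedRDRAM &operator=(const PaddedRDRAM &) = delete;
};
```

`base` and `size` are then the pointer and size an integration would hand to the host memory import path.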

## Implementation strategy

This project uses Vulkan compute shaders to implement a fully programmable rasterization pipeline.
The overall rendering architecture is reused from [RetroWarp](https://github.com/Themaister/RetroWarp)
with some further refinements.

The lower level Vulkan backend comes from [Granite](https://github.com/Themaister/Granite).

### Asynchronous pipeline optimization

Toggleable paths in RDP state are expressed as specialization constants. The rendering thread
detects new state combinations and kicks off building pipelines which only specify the exact state needed to render.
This is a massive performance optimization.

The same shaders are used for an "ubershader" fallback when pipelines are not ready.
In this case, specialization constants are simply not used.
The same SPIR-V modules are reused to great effect using this Vulkan feature.
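
The control flow described above can be modelled in a few lines. This is a deliberately simplified sketch, not the actual implementation: `PipelineCache`, `PipelineKind` and the 32-bit `state_bits` key are hypothetical, no Vulkan objects are created, and the background compiler is replaced by an explicit `complete_pending()` call so the example stays deterministic.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <unordered_map>

enum class PipelineKind
{
    Ubershader,
    Specialized
};

// Hypothetical cache keyed by a bitmask of toggleable RDP state.
class PipelineCache
{
public:
    // Called by the rendering thread. Never blocks: an unseen or
    // still-compiling state combination falls back to the ubershader.
    PipelineKind request(std::uint32_t state_bits)
    {
        auto itr = ready.find(state_bits);
        if (itr == ready.end())
        {
            ready[state_bits] = false;     // seen, but not compiled yet
            pending.push_back(state_bits); // kick off a "compile"
            return PipelineKind::Ubershader;
        }
        return itr->second ? PipelineKind::Specialized : PipelineKind::Ubershader;
    }

    // Stand-in for a background compiler finishing everything in its queue.
    void complete_pending()
    {
        for (auto key : pending)
            ready[key] = true;
        pending.clear();
    }

    std::size_t pending_count() const
    {
        return pending.size();
    }

private:
    std::unordered_map<std::uint32_t, bool> ready;
    std::deque<std::uint32_t> pending;
};
```

The point of the design is that `request()` has no blocking path, so a new state combination costs at most some frames of slower ubershader rendering rather than a compilation stall.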

### Tile-based rendering

See [RetroWarp](https://github.com/Themaister/RetroWarp) for more details.

### GPU-driven TMEM management

TMEM management is fully GPU-driven, but this is a very complicated implementation.
Certain combinations of formats are not supported, but such cases would produce
meaningless results, and it is unclear whether applications can make meaningful use of these "weird" uploads.

### Synchronization

Synchronizing the GPU and CPU emulation is one of the hot-button issues of N64 emulation.
The integration code is designed around a timeline of synchronization points which can be waited on by the CPU
when appropriate. For accurate emulation, an OpSyncFull is generally followed by a full wait,
but most games can be more relaxed and only synchronize with the CPU N frames later.
Implementation of this behavior is outside the scope of paraLLEl-RDP, and is left up to the integration code.
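
Since that policy lives in the integration code, here is one possible shape for it, as a sketch under assumptions. `SyncPolicy` and `on_sync_full` are hypothetical names, and timeline values are modelled as plain 64-bit integers. A latency of 0 corresponds to the accurate "full wait" behavior, while a latency of N defers each wait by N sync points.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <optional>

// Hypothetical integration-side policy deciding when the CPU should wait.
class SyncPolicy
{
public:
    explicit SyncPolicy(std::size_t latency_frames)
        : latency(latency_frames)
    {
    }

    // Call when a full sync is encountered, passing the timeline value
    // signalled for it. Returns the timeline value the CPU should wait
    // on right now, or nothing if the wait can be deferred.
    std::optional<std::uint64_t> on_sync_full(std::uint64_t timeline_value)
    {
        in_flight.push_back(timeline_value);
        if (in_flight.size() <= latency)
            return std::nullopt;

        std::uint64_t wait_value = in_flight.front();
        in_flight.pop_front();
        return wait_value;
    }

private:
    std::size_t latency;
    std::deque<std::uint64_t> in_flight;
};
```

A returned value would then be handed to the wait entry point (`CommandProcessor::wait_for_timeline` is the one named earlier in this document; its exact signature is not shown here, so the call is left out of the sketch).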

### Asynchronous compute

GPUs with a dedicated compute queue are recommended for optimal performance, since
RDP shading work can happen on the compute queue and won't be blocked by graphics workloads
in the graphics queue, which will typically be VI scanout and the frontend applying shaders on top.

## Project structure

This project implements several submodules which are quite useful.

### rdp-replayer

This app replays RDP dump files, which are produced by running content through an RDP dumper.
An implementation can be found in e.g. parallel-N64. The file format is very simple and essentially
contains a record of RDRAM changes and RDP command streams.
The dump is replayed, and the reference renderer is compared live against paraLLEl-RDP
with visual output. The UI is extremely crude and not user-friendly, but good enough for my use.

### rdp-conformance

I made a somewhat comprehensive test suite for the RDP, with a custom higher level RDP command stream generator.
There are roughly 150 fuzz tests which exercise many aspects of the RDP.
In order to pass a test, paraLLEl-RDP must produce bit-exact results compared to Angrylion,
so the test condition is as stringent as possible.
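
To make "bit-exact" concrete: the pass condition amounts to a byte-for-byte comparison of two output buffers with no tolerance. The sketch below shows that kind of check in isolation. `first_mismatch` is a hypothetical helper, not a function from the test suite, and how the suite actually reads back and compares memory is not described here.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Returns the byte offset of the first difference, or nothing if the
// buffers are identical. A size mismatch counts as a difference at the
// end of the shorter buffer.
inline std::optional<std::size_t> first_mismatch(const std::vector<std::uint8_t> &reference,
                                                 const std::vector<std::uint8_t> &candidate)
{
    const std::size_t common = std::min(reference.size(), candidate.size());
    for (std::size_t i = 0; i < common; i++)
        if (reference[i] != candidate[i])
            return i;

    if (reference.size() != candidate.size())
        return common;
    return std::nullopt;
}
```

Reporting the offset rather than a plain pass/fail makes it easier to map a failure back to a pixel.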

#### A note on bit-exactness

There are a few cases where bit-exactness is a meaningless term, such as the noise feature of the RDP.
It is not particularly meaningful to exactly reproduce noise, since it is by its very nature unpredictable.
For that reason, this repo references a fork of the reference renderer which implements deterministic "undefined behavior"
where appropriate. The exact formulation of the noise generator is not very interesting as long as
correct entropy and output range are reproduced.

##### Intentional differences from reference renderer

Certain effects invoke "undefined behavior" in the RDP and require cycle accuracy to resolve bit-accurately with the real RDP.
The reference renderer attempts to emulate these effects, but reproducing this behavior breaks any form of multi-threading.
To be able to validate dumps in a sensible way with buggy content, I modified the reference slightly to make certain
"undefined behavior" deterministic. This doesn't meaningfully change the rendered output in the cases I've seen in the wild.
Some of these effects would be possible to emulate,
but at the cost of lots of added complexity, and it wouldn't be quite correct anyway given the cycle accuracy issue.

- CombinedColor/Alpha in first cycle is cleared to zero. Some games read this in first cycle,
  and the reference renderer will read whatever was generated for the last pixel.
  This causes issues in some cases, where cycle accuracy would have caused the feedback to converge to zero over time.
- Reading LODFrac in 1-cycle mode. This is currently ignored. The results generated seem nonsensical. Never seen this in the wild.
- Using TexLOD in copy mode. This is currently ignored. The results generated seem nonsensical. Never seen this in the wild.
- Reading MemoryColor in first blender cycle in 2-cycle mode. The reference seems to wait until the second cycle before updating this value,
  despite memory coverage being updated right away. The sensible thing to do is to allow reading memory color in first cycle.
- Alpha testing in 2-cycle mode reads combined alpha from the next pixel in the reference.
  Just doing alpha testing in first cycle on the current pixel is good enough.
  If this is correct hardware behavior, I consider this a hardware bug.
- Reading Texel1 in cycle 1 of 2-cycle mode reads the Texel0 from the next pixel.
  In the few cases I've seen this, the rendered output is slightly buggy, but it's hardly visible in motion.
  The workaround is just to read Texel0 from the current pixel, which still renders fine.

### vi-conformance

This is the same kind of conformance suite as rdp-conformance, but for the video interface (VI) unit.

### rdp-validate-dump

This tool replays an RDP dump headless and compares outputs between the reference renderer and paraLLEl-RDP.
To pass, bit-exact output must be generated.

## Build

Check out submodules. This pulls in Angrylion-Plus as well as Granite.

```
git submodule update --init --recursive
```

Standard CMake build. On MSVC, append `--config Release` to the last command.

```
mkdir build
cd build
cmake ..
cmake --build . --parallel
```

### Run test suite

You can run rdp-conformance and vi-conformance with ctest to verify that your driver is behaving correctly.
On MSVC, append `-C Release`.

```
ctest
```

### Embedding shaders in a C++ header

If embedding paraLLEl-RDP in an emulator project, it is helpful to pre-compile and bake SPIR-V shaders in a C++ header.
Build slangmosh from Granite, and then run:

```
slangmosh parallel-rdp/shaders/slangmosh.json --output slangmosh.hpp --vk11 --strip -O --namespace RDP
```

### Generating a standalone code base for emulator integration

Run the `generate_standalone_codebase.sh` script with an output directory `$OUTDIR` as its argument to generate a standalone code base which can be built without any special build system support.
Include `$OUTDIR/config.mk` if building with Make to make your life easier.
Note that `slangmosh` must be in your `PATH` for this script to run. It executes the command above to build `slangmosh.hpp`.

## License

paraLLEl-RDP is licensed under the permissive MIT license. See the included LICENSE file.
This implementation builds heavily on the knowledge (but not code) gained from studying the reference implementation,
thus it felt fair to release it under a permissive license, so my work could be reused more easily.