Scaling Up 3D Gaussian Splatting Training

Cover Image

We present Grendel, a distributed training system that partitions 3D Gaussian Splatting (3D GS) parameters and parallelizes their computation across multiple GPUs. To optimize batched training, we explore different optimization hyperparameter scaling strategies and find a simple √(batch_size) scaling rule to be highly effective. Using Grendel, we show that scaling up 3D GS training in terms of parameters and compute leads to significant improvements in visual quality on multiple large-scale scene reconstruction datasets.

Videos Showcase

Please select the highest playback quality. YouTube defaults to low resolution :(

MegaNeRF Rubble (4K)

MatrixCity (1080p)

Introduction

3D Gaussian Splatting (3D GS) is an emerging and popular technique for novel 3D view synthesis, owing to its faster training and rendering compared with prior approaches such as NeRF. However, most existing 3D GS pipelines are limited to a single GPU for training, whose memory and compute constraints become a bottleneck when applying 3D GS to higher-resolution or larger-scale scenes. To address these constraints, our Grendel system enables fast distributed training with more Gaussians and larger batches to improve reconstruction quality.

Quantitative Results

Image Quality vs. # of Gaussians on Rubble and MatrixCity. Grendel enables fitting more Gaussians than is possible on one GPU, leading to improved PSNR and LPIPS metrics. Left: On Rubble, PSNR and LPIPS continue to improve as more Gaussians are used. Right: Similarly, on MatrixCity, image quality continues to improve with the # of Gaussians.

Grendel System Design

We design Grendel to leverage the inherent mixed parallelism of 3D GS. For tasks exhibiting Gaussian-wise parallelism, such as projection, color computation, and parameter storage, Grendel distributes Gaussians across GPUs. For pixel-wise rendering and loss computation, pixels are distributed across GPUs. Grendel then uses sparse all-to-all communication to transfer Gaussians to their designated GPUs by exploiting spatial locality. Additionally, Grendel employs a dynamic load balancer that utilizes observations from previous training iterations to partition images, aiming to minimize workload imbalance.
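As a toy illustration of the dynamic load balancer described above, the sketch below splits an image's rows into contiguous per-GPU chunks using per-row rendering costs measured in a previous iteration. The function name and the greedy prefix-sum strategy are our own simplification for exposition, not Grendel's actual implementation:

```python
# Hypothetical sketch of iteration-feedback load balancing: assign each
# GPU a contiguous block of image rows so that the measured rendering
# cost per GPU is roughly equal.
def partition_rows(row_costs, num_gpus):
    """Return per-GPU (start, end) row ranges with near-equal total cost.

    row_costs: per-row rendering times observed in the previous iteration.
    """
    total = sum(row_costs)
    target = total / num_gpus  # ideal cost share per GPU
    bounds, acc, start = [], 0.0, 0
    for i, cost in enumerate(row_costs):
        acc += cost
        # Close a chunk once it reaches the ideal share; the last GPU
        # always takes whatever rows remain.
        if acc >= target and len(bounds) < num_gpus - 1:
            bounds.append((start, i + 1))
            start, acc = i + 1, 0.0
    bounds.append((start, len(row_costs)))
    return bounds
```

A production balancer would operate on pixel tiles rather than whole rows and smooth timings across iterations; the greedy split here only conveys the idea of using last-iteration observations to rebalance the next one.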

Grendel System Design. (left) conventional 3D GS training using a single GPU. (right) Our Grendel system distributes 3D Gaussians across multiple GPUs to alleviate the GPU memory bottleneck. We partition rendering in both the pixel and batch dimensions to achieve optimal speedup.

Hyperparameter Scaling for Batched Training

To scale efficiently to many GPUs, Grendel increases the batch size beyond one, so it can partition training both across a batch of images and across pixels within each image. However, increasing the batch size without tuning optimization hyperparameters can lead to unstable and inefficient training, yet hyperparameter tuning is itself a time-consuming and tedious process. Guided by a heuristic Independent Gradients Hypothesis for 3D GS training, we propose to scale Adam's learning rate with a square-root rule and its momentum coefficients with an exponential rule:

λ ← λ × √(batch_size)

β₁, β₂ ← β₁^(batch_size), β₂^(batch_size)

Assuming the gradients from different images in a batch are independent, we want a batched update step to equal the sum of the individual images' updates. Thus, we scale the learning rate to "undo" the Adam optimizer's second-moment normalization, and scale the momentum coefficients to keep the effective per-image momentum similar. Together, our learning rate and momentum scaling rules enable hyperparameter-tuning-free training by making the training trajectory approximately invariant to the batch size. In the experiments below, we first train a 3D GS model on the Rubble scene to iteration 15,000, then reset the Adam optimizer state and continue training with different batch sizes. Since different parameter groups of 3D GS have vastly different magnitudes, we focus on one specific group, namely the diffuse color, to make the comparisons meaningful. We find that our proposed scaling rules maintain high cosine similarity and approximately equal update magnitudes relative to batch-size-1 training, regardless of batch size:

Learning rate scaling rules vs. similarity to BS 1 training. Our proposed sqrt(batch_size) learning rate scaling (red curves) maintains similar training trajectories to BS 1 training in terms of direction and magnitude, while other scaling rules differ more. Variants tested here adopt our proposed exponential momentum scaling.
Momentum scaling rules vs. similarity to BS 1 training. Our proposed exponential momentum scaling (red curves) maintains similar training trajectories to BS 1 training in terms of direction and magnitude, while other scaling rules differ more. Variants tested here adopt our proposed sqrt(batch_size) learning rate scaling.
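The two scaling rules above amount to a few lines of code. The sketch below (the function name is ours, not from the Grendel codebase) maps batch-size-1 Adam hyperparameters to their batched equivalents:

```python
import math

# Hedged sketch of the proposed scaling rules: given batch-size-1 Adam
# hyperparameters, return the values to use at a larger batch size B.
def scale_adam_hyperparams(lr, beta1, beta2, batch_size):
    return (
        lr * math.sqrt(batch_size),  # sqrt(batch_size) learning-rate rule
        beta1 ** batch_size,         # exponential momentum rule
        beta2 ** batch_size,
    )
```

For example, going from batch size 1 to 16 multiplies the learning rate by 4 and raises β₁ = 0.9 to 0.9¹⁶ ≈ 0.185, so each image's contribution to the momentum decays as it would over 16 batch-size-1 steps.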

BibTeX

@misc{zhao2024scaling3dgaussiansplatting,
  title={On Scaling Up 3D Gaussian Splatting Training},
  author={Hexu Zhao and Haoyang Weng and Daohan Lu and Ang Li and Jinyang Li and Aurojit Panda and Saining Xie},
  year={2024},
  eprint={2406.18533},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2406.18533},
}