What Everyone Gets Wrong About 3D Gaussian Splatting

Video Title: What Everyone Gets Wrong About 3D Gaussian Splatting: with Jonathan Stephens
Channel: Michal Gula
Publication Date: 2026-02-05
Duration: 01:25:32
Source URL: https://www.youtube.com/watch?v=Duhv81cMPLw
Series: 2D is Against My Religion PODCAST

Video Section Index

The following timestamps link to sections where live demonstrations are shown, including simulation environments, world model outputs, and robotics training scenarios.

| Timestamp | Demo |
|---|---|
| 41:44 | Tesla autopilot simulator - synthetic 4D Gaussian scenes |
| 44:14 | Real-time driving at NeurIPS - Stephens drives the simulator |
| 47:48 | Parallel Domain - mapping large urban areas as Gaussians |
| 55:02 | Robotic arm reinforcement learning with Gaussian scenes |
| 56:14 | SplatSim - mapping Gaussians onto low-quality simulations |
| 56:59 | Robot in NVIDIA Isaac Sim with NeRFRek Gaussian pipeline |
| 59:53 | World Labs Marble - hotel room from single 360 image |
| 64:29 | Kitchen scene from World Labs single 360 image |

Summary

This episode of the “2D is Against My Religion” podcast features Jonathan Stephens, a professional at Lightwheel AI and NVIDIA creator/partner, in conversation with host Michal Gula. The discussion spans three major domains: the technical mechanics of 3D Gaussian splatting (3DGS) and common misconceptions about its accuracy, practical applications of 3DGS in autonomous driving and robotics simulation, and the emerging role of world models in physical AI training pipelines. Stephens provides corrections and deeper technical explanations to points raised in a prior episode with a different guest, emphasizing that 3DGS is a visualization layer and not a replacement for measurement-grade data from LiDAR or photogrammetry. The conversation includes multiple live demonstrations of Tesla’s autopilot simulator, Parallel Domain’s urban-scale reconstructions, robotic arm training, NVIDIA Isaac Sim integration, and World Labs’ Marble world model generating 3D scenes from single images.

Main Analysis

What 3D Gaussian Splatting Actually Is

Stephens opens by identifying what he calls a “marketing problem” with 3DGS: the term conflates two distinct concepts. “3D Gaussian splatting” refers to the process of generating a radiance field scene through iterative optimization, while “3D Gaussian splats” refer to the output data, the collection of ellipsoidal primitives that represent the scene.

The process begins with Structure from Motion (SFM) points, sparse surface points derived from overlapping photographs. These points serve only as initialization. Stephens emphasizes that initialization can come from any source: a dense LiDAR point cloud, a sparse SFM cloud, or even a random distribution of points in 3D space. The algorithm converts each initial point into a 3D Gaussian (an ellipsoid) and then iterates through a cycle of projection, comparison, and adjustment approximately 30,000 times.

At ~10:00, Stephens walks through the optimization loop:

  1. Project the current 3D Gaussians onto a 2D image plane from a known camera position
  2. Compare the projected result against the ground truth photograph from that same position
  3. Adjust the Gaussians: move, duplicate, resize, reshape, or remove them
  4. Repeat from another camera position until the scene matches all source photographs
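
The loop can be sketched in miniature. The toy below is an illustration, not the real renderer: one isotropic 2D Gaussian splatted onto a small grayscale image stands in for the camera projection, a squared pixel difference stands in for the photometric loss, and a finite-difference gradient stands in for the backpropagation (and densification heuristics) the real implementation uses. All numbers are illustrative.

```python
import numpy as np

def render(means, size=32, sigma=3.0):
    """Toy 'projection': splat isotropic 2D Gaussians onto a grayscale image.
    (The real pipeline projects 3D ellipsoids through a camera model.)"""
    ys, xs = np.mgrid[0:size, 0:size].astype(float)
    img = np.zeros((size, size))
    for mx, my in means:
        img += np.exp(-((xs - mx) ** 2 + (ys - my) ** 2) / (2 * sigma ** 2))
    return img

# "Ground-truth photograph": a single blob at (20, 12)
target = render([(20.0, 12.0)])

# Initialise one Gaussian at the wrong position (a stand-in for an SFM point)
mean = np.array([14.0, 10.0])

def loss(m):
    # Photometric loss: how far the render is from the source photo
    return np.sum((render([tuple(m)]) - target) ** 2)

for step in range(200):
    # Finite-difference gradient of the loss w.r.t. the Gaussian's mean
    grad = np.array([
        (loss(mean + e) - loss(mean - e)) / 2e-3
        for e in (np.array([1e-3, 0.0]), np.array([0.0, 1e-3]))
    ])
    mean = mean - 0.1 * grad   # "adjust the Gaussians"

print(np.round(mean, 2))   # converges toward [20. 12.]
```

Note that nothing in the loop cares where the Gaussian started, only whether the render matches the photo, which is exactly the appearance-over-geometry behaviour discussed later in the episode.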

Each Gaussian stores more information than a traditional point cloud point. Beyond XYZ coordinates, a Gaussian carries covariance parameters (defining its 3D shape, from football-like to basketball-like to javelin-shaped), an opacity value, and spherical harmonic color encoding that changes based on viewing angle.
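
As a rough illustration, the per-Gaussian attribute set can be written as a record. The field names and shapes below are illustrative (chosen to match the attributes described in the episode and the conventions of the original reference implementation), not an official schema:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Gaussian:
    """One splat primitive, holding the attributes described in the episode."""
    mean: np.ndarray      # (3,) XYZ position
    scale: np.ndarray     # (3,) per-axis extent: football, basketball, javelin...
    rotation: np.ndarray  # (4,) quaternion; together with scale, defines the covariance
    opacity: float        # alpha used when blending overlapping splats
    sh: np.ndarray        # (16, 3) degree-3 spherical harmonic RGB coefficients

def param_count(g: Gaussian) -> int:
    """Floats stored per Gaussian (ignoring any file-format padding)."""
    return g.mean.size + g.scale.size + g.rotation.size + 1 + g.sh.size

g = Gaussian(np.zeros(3), np.ones(3), np.array([1.0, 0.0, 0.0, 0.0]),
             1.0, np.zeros((16, 3)))
print(param_count(g))   # 59 floats, versus 3 (or 6, with color) for a point cloud point
```

The spherical harmonic coefficients are what make the color view-dependent: evaluating them in the direction of the camera yields a slightly different RGB value from each angle.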

Tips for Creating High-Quality 3D Gaussian Splats

Several practical recommendations emerge from the discussion:

Initialization quality affects speed, not ceiling. Starting with a dense point cloud (from a LiDAR scanner or dense photogrammetry) means the Gaussians begin closer to their final positions, reducing training time. A sparse SFM cloud or random points will eventually reach comparable visual quality but requires significantly more iterations. At ~18:27, Stephens states: “If you have a really good dense point cloud, you will have these Gaussians probably already somewhat aligned in a more dense and closer-to-reality way, and you’ll get to a really good looking result faster.”

Source photo coverage determines novel view quality. The optimization matches Gaussians to ground truth photographs. Regions of the scene with poor photo coverage will exhibit artifacts when viewed from novel angles. At ~28:00, Stephens explains that the “really lifelike” quality between known viewpoints works because the Gaussians generalize well between camera positions, but gaps in coverage still produce noticeable degradation.

Coordinate preservation requires explicit configuration. By default, most 3DGS implementations transform point coordinates for optimization speed. At ~39:48, Stephens notes that platforms like gsplat allow users to preserve original coordinates during training: “You can change the default to say no, preserve the coordinates of the original splat.” For georeferenced work, this setting matters.

Use RTK-quality positioning for coordinate-locked scenes. iPhone captures produce unreliable spatial coordinates. At ~39:52: “Most people aren’t really getting high-quality scans… They’re using an iPhone. Well, I’ve seen iPhones bounce off walls and put you across the street.” RTK GNSS or equivalent positioning is recommended when Gaussians must live in real-world coordinate space.

Source photos do not need to be 4K. In the context of autonomous driving, Stephens notes at ~50:14 that perception sensors on self-driving cars operate at resolutions like 640x480. The training data resolution should match the target application, not an arbitrary quality ceiling.

The Accuracy Debate - Appearance vs Geometry

The central technical argument of the episode begins at ~29:59. Stephens describes the accuracy question as an “apples and oranges” comparison with LiDAR and photogrammetry.

LiDAR and photogrammetry accuracy measures how precisely surfaces are captured in physical space, typically reported in millimeters of deviation from ground truth geometry.

3DGS accuracy measures how well the scene looks compared to source photographs, using three standard metrics:

| Metric | Full Name | What It Measures |
|---|---|---|
| PSNR | Peak Signal-to-Noise Ratio | Pixel-level color and intensity similarity between rendered and GT image |
| SSIM | Structural Similarity Index | Edge and structural pattern fidelity between rendered and GT image |
| LPIPS | Learned Perceptual Image Patch Similarity | Human-perceived visual realism of the rendered scene |
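
Of the three, PSNR is simple enough to compute by hand; SSIM requires windowed statistics and LPIPS requires a pretrained network, so those are normally taken from a library. A minimal PSNR sketch (toy images, not from the episode):

```python
import numpy as np

def psnr(rendered: np.ndarray, ground_truth: np.ndarray, max_val: float = 1.0) -> float:
    """Peak Signal-to-Noise Ratio in dB: higher means the render matches the photo better."""
    mse = np.mean((rendered - ground_truth) ** 2)
    if mse == 0:
        return float("inf")          # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))

gt = np.full((4, 4), 0.5)            # stand-in ground-truth photo
noisy = gt + 0.1                     # uniform error of 0.1 -> MSE = 0.01
print(psnr(noisy, gt))               # ~20.0 dB
```

Note that all three metrics compare pixels against pixels: none of them says anything about whether the underlying Gaussians sit on the true surfaces, which is the crux of the apples-and-oranges argument.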

The optimization process actively destroys geometric accuracy. Within a few iterations, Gaussians leave their initialized surface positions because the algorithm cares only about matching photo appearance. Stephens summarizes: “The first thing it does is it takes these semi-accurate sparse points… and says I don’t care where those belong anymore. I’m just going to move them around wherever I need them to be to make sure the scene looks right.”

The practical conclusion, summarized by commenter David Borman: “Splats for show, points for dough.” 3DGS serves as a visualization layer. Measurement-grade point clouds from LiDAR or photogrammetry live underneath, sharing the same coordinate space.

Primitive Alternatives - 2D Gaussians, Convex Shapes, and Triangles

At ~19:39, the discussion moves to alternative primitives beyond 3D Gaussians.

2D Gaussian Splatting uses flat discs instead of 3D ellipsoids. These discs align to geometric surfaces, making them better at representing flat structures like walls and edges. The tradeoff is that 2D Gaussians require surface geometry information to align to, while 3D Gaussians are geometry-agnostic. Surface normals guide orientation, indicating which direction each disc faces. 2DGS may produce more geometrically accurate results, though Stephens qualifies that the centroid of a disc still cannot align perfectly to sharp edges.

Convex Splatting (~22:17) replaces Gaussians with arbitrary convex shapes: rectangles, circles, and complex polygons. These shapes can have hard rigid edges or soft Gaussian falloff. The key advantage is efficiency: a large flat wall can be represented by a few hard-edged rectangles rather than millions of round blobs. Convex splatting uses fewer primitives to represent scenes and can achieve better visual quality in architectural contexts.

Triangle Splatting (~21:23) reduces the convex approach to three-point primitives. Triangles retain the hard-edge and soft-edge properties of convex shapes but use less data per primitive. Stephens considers triangle splatting “perhaps the most superior of the different ways you can reproduce a scene” because triangles can be combined to form any shape and match both curved and angular scene geometry.

Standardization pressure, however, may override technical superiority. At ~26:22, Stephens reports that the glTF standard is moving to formalize 3D Gaussian splats for radiance fields. Once standardized, retooling every viewer, plugin, and renderer for triangles becomes impractical. He compares this to the VHS-versus-Betamax outcome: “Betamax tapes supposedly were better, but everyone went to VHS, and so we ended up with this inferior product. That’s what the world chose.”

| Primitive | Shape | Edge Type | Surface Alignment | Data Per Primitive | Geometric Accuracy |
|---|---|---|---|---|---|
| 3D Gaussian | Ellipsoid | Soft falloff | None | High | Low |
| 2D Gaussian | Flat disc | Soft falloff | Surface normals | Medium | Medium |
| Convex shape | Arbitrary polygon | Hard or soft | Optional | Medium | Medium-High |
| Triangle | Three-point | Hard or soft | Optional | Low | Medium-High |

Dedicated Hardware - XGRIDS and SLAM Scanners

XGRIDS is referenced at two points in the video. At ~18:12, Stephens speculates on XGRIDS’ capture workflow: “If you’re using like an XGRIDS scanner, I don’t know exactly how their process works because it’s not like there’s open code you’re looking at. But I’m sure they initialize off of some points that they’re getting from their LiDAR scanner, that SLAM scanner.” This suggests XGRIDS uses its own LiDAR-based initialization for Gaussian training, likely producing higher-quality starting positions than phone-based capture.

At ~35:54, Stephens discusses the Portal Cam (presumably a SLAM-based handheld scanner) in favorable terms: “It does get a good SLAM-based point cloud… their collision layers are always so good.” He theorizes that the collision detection in the Portal Cam software uses the geometric point cloud rather than the Gaussians for surface detection, which aligns with the “splats for show, points for dough” principle.

The broader point regarding dedicated hardware: SLAM-based scanners that produce both a geometric point cloud and camera images provide the best of both worlds. The point cloud supplies accurate initialization and measurement-grade geometry, while the camera images feed the 3DGS training pipeline for visual quality. Stephens suggests this hybrid approach (geometric LiDAR layer plus visual Gaussian layer in shared coordinate space) is the practical standard in professional workflows.

XGRIDS is also confirmed as a partner at the upcoming FreeDays conference in Prague (May 2026) at ~85:12, described alongside other “processing companies or hardware software producers.”

3DGS in Autonomous Driving Simulation

The simulation section begins at ~41:07. Stephens initially was not excited about 4D Gaussian splatting (dynamic Gaussians with temporal dimension) due to the impractical hardware requirements: volumetric capture studios, multiple synchronized cameras, and extensive processing time. Simulation changed his perspective because the data is synthetically generated, removing real-world capture constraints.

Tesla Autopilot Simulator (~41:44): Tesla captures real-world driving scenes from their fleet cameras and reconstructs them as 4D Gaussian splat scenes using their world model. The entire scene, including moving vehicles, is represented as dynamic Gaussians. Text on trucks appears as gibberish and brand logos are unreadable because the world model predicts appearance rather than copying exact content. The key advantage is that the scene maintains 3D consistency: turning one’s head left then back right preserves scene continuity because the Gaussians exist in 3D space, unlike pure video generation where occluded objects can disappear. Tesla uses these reconstructed scenes for “car policy” training, replaying near-misses and edge cases in varied conditions. An example: if a ball rolls across the street, the system learns to wait for a child who might be chasing it before accelerating.

NeurIPS Live Demo (~44:14): Stephens personally drove the Tesla simulator at NeurIPS, confirming it runs in real time via server-side rendering streamed to a local machine. He was given several minutes of free driving before the scene degraded due to long-horizon prediction limits. The scene, including background vehicles, was entirely composed of Gaussians.

Parallel Domain (~47:48): This company mapped a large portion of a city (Stephens references Michigan/Detroit) as Gaussian splats by driving through corridors with 360 cameras. The base scene is static Gaussians; vehicles and dynamic objects are added post-capture as meshes. This provides “perfect annotation” because every inserted asset has a known identity (a car is labeled as a car with 100% certainty). They simulate LiDAR and sonar returns within the Gaussian environment for training perception systems. Regarding scale, Stephens states there is no inherent size limit to 3DGS scenes.

3DGS in Robotics Training

Robotic Arm Learning (~55:02): A real robotic arm performing a task is captured and reconstructed as a Gaussian splat scene. The arm’s state data (joint positions, capabilities) is also recorded. The reconstructed scene enables reinforcement learning: the robot can replay the task thousands of times in different variations without physical equipment. This bypasses the need for 3D artists to manually rebuild the scene as a mesh.

SplatSim (~56:14): A technique that takes a low-quality simulation environment and overlays a captured Gaussian splat scene onto it. This provides visual realism for perception system testing without manual scene creation. The robot’s attention networks then train to focus on task-relevant objects rather than background elements.

NVIDIA Isaac Sim (~56:59): A robot navigates a room captured using NeRFRek (NVIDIA’s open-source Gaussian splatting package). The pipeline: capture photos of a room, run through NeRFRek, convert to Gaussians, extract a mesh for collisions, then train the robot to navigate and perform tasks within the scene.

Lightwheel AI - Physics-Accurate Simulation Data

At ~58:02, Stephens describes his work at Lightwheel AI, an NVIDIA Inception partner. Using a Matrix analogy, he explains that Lightwheel creates the simulation environments and assets for robots to learn in. Their data must be physics-accurate: objects must have correct weight, friction, spring-back, and material properties. They physically test real objects (bending cables, measuring force curves) and replicate those behaviors in simulation. If an apple has no weight or water floats instead of spilling, the robot learns incorrect physics.

Lightwheel operates a three-layer data engine:

  • SimReady Library: Physics-accurate 3D assets for simulation
  • EgoSuite: Egocentric human demonstration data collection
  • RoboFinals: Industrial-grade evaluation platform for robot policies

World Labs Marble - Scenes from Single Images

At ~59:53, Stephens demonstrates World Labs’ Marble world model. From a single 360-degree photograph of a hotel room, Marble generates a complete 3D Gaussian splat scene that can be navigated in all directions. The generated scene includes areas that were occluded in the original photo, predicted by the world model’s understanding of typical room structures.

The practical workflow at Lightwheel: capture a 360 image, feed it to Marble, receive a Gaussian splat file, convert it with NVIDIA’s 3DGUT (3D Gaussian Unscented Transform) tool to USDZ format for Omniverse, then place robot assets and begin training. This eliminates weeks of manual 3D artist work for environment creation. The API, launched January 21, 2026, enables automated pipelines.

Limitations observed at ~64:29: Stephens shows his kitchen generated from a single 360 image. Surfaces that were entirely occluded (like a kitchen island behind the camera) are missing or degraded. Marble produces geometrically approximate scenes useful for training context but not measurement-grade reconstruction. For exact replicas, camera-based or SLAM-based capture remains necessary.

World Models and Frame Prediction

At ~71:12, Stephens defines world models as “prediction engines for future world states.” Unlike LLMs that predict the next most likely token, world models predict the next most likely physical scene state given inputs. They must conform to real-world physics: a cat in a spacesuit playing golf on the moon is a valid video generation output but an invalid world model output.

NVIDIA Cosmos is an open-source world foundation model platform offering three capabilities:

  • Cosmos Predict: Generates video scenes up to 30 seconds from multimodal prompts
  • Cosmos Transfer: Applies style modifications to physics-based video from simulators
  • Cosmos Reason: Multimodal vision-language model for scene understanding

Gen3C (~65:40) is an open-source NVIDIA project that takes five input images and predicts intermediate frames. It can also identify degraded novel views (areas where 3DGS quality drops due to insufficient source photos) and generate corrected predictions to reinject into training. One demonstrated capability: shifting the viewpoint 6 meters from the original camera path and producing plausible scenes despite having no source footage from that position.

Data scarcity is identified as the primary bottleneck at ~79:40. Text-based LLMs consumed the internet’s text corpus. World models require physics-accurate video, annotations describing scene contents, and sensor-specific data (hand tracking, joint states for robotics). Collecting this data runs at a 1:1 ratio (one hour of recording produces one hour of data), unlike text which was scraped passively. World models may help break this bottleneck by generating synthetic variations: one hour of real capture could produce thousands of hours of training data through world model inference.

Long-horizon prediction is the key remaining challenge at ~74:20. Current systems can maintain coherent scene prediction for roughly one minute before small errors compound and degrade the output. Auto-regressive approaches (predicting 30-second chunks, seeding each from the end of the previous) extend this range, but multi-hour continuous prediction, necessary for VR experiences or full-length immersive content, remains unsolved.

The Hybrid Pipeline in Practice

The discussion converges on a practical workflow for professional use:

Dedicated hardware like XGRIDS or Portal Cam captures both LiDAR geometry and camera imagery simultaneously. The geometric point cloud handles collision detection, measurements, and coordinate accuracy. The camera images train a Gaussian splatting scene for visual fidelity. Both layers coexist in the same coordinate space. In viewers like BLK 360’s software, users see the visual layer but measurements are taken from the geometric layer beneath.

For simulation and AI training, the visual layer matters most because perception systems operate on camera data. For AEC and engineering applications, the geometric layer provides the measurement-grade accuracy. Attempting to use 3DGS for both visual quality and geometric precision misunderstands the technology’s design.
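
The two-layer split can be sketched as a data structure. Everything below is a hypothetical illustration of the pattern (class and field names are invented, not any vendor's API): measurements are always answered from the geometric layer, while the splat layer exists only to be rendered.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class HybridScene:
    """Two co-registered layers sharing one coordinate frame:
    a measurement-grade point cloud and a Gaussian splat scene."""
    points: np.ndarray   # (N, 3) LiDAR/photogrammetry points, in metres
    splats: object       # Gaussian scene, used for visualization only

    def measure(self, i: int, j: int) -> float:
        """'Points for dough': distances always come from the geometric layer."""
        return float(np.linalg.norm(self.points[i] - self.points[j]))

scene = HybridScene(points=np.array([[0.0, 0.0, 0.0], [3.0, 4.0, 0.0]]),
                    splats=None)
print(scene.measure(0, 1))   # 5.0 metres
```

Viewers that follow this pattern show the user the splat layer while silently routing measurement and collision queries to the point cloud underneath.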

Key Findings

  • 3D Gaussian splatting optimizes for appearance fidelity (PSNR, SSIM, LPIPS), not geometric accuracy, and should be treated strictly as a visualization layer complementing, not replacing, LiDAR and photogrammetry
  • Initialization from dense point clouds accelerates training but does not change the final visual quality ceiling; even random point distributions converge to comparable results given sufficient iterations
  • The glTF standardization of 3D Gaussians may lock the ecosystem into this primitive type despite technically superior alternatives (triangle splatting, convex splatting), echoing the VHS-Betamax dynamic
  • 4D Gaussian splatting has found its strongest application in simulation (Tesla autopilot, Parallel Domain) where synthetic data generation removes real-world capture constraints
  • World Labs Marble and NVIDIA Cosmos enable rapid environment generation for robotics training by converting single images or short captures into navigable 3D Gaussian scenes, reducing weeks of manual 3D artist work to minutes
  • The primary bottleneck for world model advancement is physics-accurate training data, not compute or algorithmic innovation, because world data must be purposefully collected rather than scraped passively
  • Professional workflows combine SLAM or LiDAR scanners (geometric accuracy) with 3DGS (visual quality) in a shared coordinate space, with dedicated hardware like XGRIDS and Portal Cam providing both data streams simultaneously

Speakers

Jonathan Stephens works at Lightwheel AI and is an NVIDIA creator/partner. Lightwheel specializes in physics-accurate synthetic data for robot and autonomous vehicle training. Stephens runs a YouTube channel with deep technical content on 3D Gaussian splatting and attended NeurIPS to test Tesla’s autopilot simulator firsthand.

Michal Gula hosts the “2D is Against My Religion” podcast covering reality capture, 3D Gaussian splatting, and geospatial technology. He organizes the FreeDays conference (Prague, May 2026) focused on 3D Gaussian splatting and related technologies, with XGRIDS as a confirmed partner. Channel: @michalgula.

References

  1. What Everyone Gets Wrong About 3D Gaussian Splatting (YouTube) - Published 2026-02-05
  2. 3D Gaussian Splatting for Real-Time Radiance Field Rendering, Kerbl et al., SIGGRAPH 2023 - Original paper
  3. Lightwheel AI - Physical AI infrastructure company
  4. World Labs - Marble - Spatial intelligence and world model platform, API launched 2026-01-21
  5. NVIDIA Cosmos - Open-source world foundation models for physical AI
  6. Parallel Domain - Driving simulation from Gaussian splat reconstructions
  7. Is Photogrammetry Dead? The Truth About 3D Gaussian Splatting - Prior episode referenced in discussion
  8. FreeDays Conference - Prague, May 2026