Gasgoo Munich - Xiaomi EV has formally launched the new Xiaomi Auto World Model framework, which deeply couples 3D reconstruction with video generation. Unlike the traditional serial path of simply chaining "reconstruction plus generation," this framework forces the two to structurally constrain each other. The reconstruction side provides geometric anchors to "lay the foundation" for generation, while the generation side fills in unobserved areas to "expand boundaries" for reconstruction. Together, the two suppress long-term drift.

Image Source: Xiaomi EV Technology

World models are viewed as a "rehearsal system" for the autonomous driving brain, capable of predicting how the environment will evolve next based on historical and current observations. This helps vehicles handle low-probability, high-risk scenarios such as torrential rain, falling rocks, and wrong-way drivers. Previously, the industry's two mainstream paths each had distinct shortcomings: reconstruction offers high fidelity but lacks imagination, while generation is predictive but prone to drift. Xiaomi's integrated architecture attempts to merge the strengths of both.

On the technical front, the reconstruction module WorldRec uses sparse 3D anchor representations to replace traditional pixel-wise dense Gaussian methods. By aggregating features across multiple views and time steps with visibility-weighted fusion, it reconstructs a 10-second video in just 10 seconds. The generation module WorldGen undergoes two-stage training: full bidirectional temporal attention pre-training, followed by causal fine-tuning and distillation acceleration. It requires only four denoising steps and 0.19 seconds to generate a frame. It supports videos up to one minute long and can simulate long-tail scenarios like rare animal intrusions and extreme weather.

According to Xiaomi, the framework has achieved comprehensive state-of-the-art (SOTA) results on mainstream benchmarks such as Waymo and nuScenes.
WorldRec reached a PSNR of 28.48 on the Waymo dataset, surpassing previous best methods. WorldGen achieved an FVD of 64.97 on nuScenes, with single-view generation speeds roughly 5.6 times faster than similar autoregressive methods.

The framework is already deployed across three scenarios at Xiaomi EV: synthetic data generation (over 100,000 clips delivered for perception model training), simulation testing (closed-loop reproduction of real accidents), and the Assisted Driving Academy (which has launched realistic simulation functions for all vehicle models).
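
For readers unfamiliar with the reconstruction metric cited above, PSNR (peak signal-to-noise ratio) measures how closely a rendered image matches a reference image, in decibels; higher is better. The snippet below is a minimal sketch of the standard textbook definition, not Xiaomi's evaluation code. The function name `psnr`, the 8-bit peak value of 255, and the four-pixel toy images are all illustrative assumptions.

```python
import math


def psnr(reference, rendered, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Images are flat sequences of pixel intensities. `peak` is the maximum
    possible pixel value (255 for 8-bit images, an assumption made here).
    """
    if len(reference) != len(rendered) or not reference:
        raise ValueError("images must be non-empty and the same size")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images have unbounded PSNR
    return 10.0 * math.log10(peak ** 2 / mse)


# Toy example (hypothetical data): every pixel is off by 10 intensity levels.
reference = [100, 150, 200, 250]
rendered = [110, 140, 210, 240]
print(round(psnr(reference, rendered), 2))  # prints 28.13
```

Because the scale is logarithmic, a gain of about 3 dB corresponds to roughly halving the mean squared error, which is why small differences in reported PSNR can reflect a visible change in reconstruction quality.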