NVIDIA researchers are in Seattle this week presenting their latest work on visual generative AI models and techniques at the Computer Vision and Pattern Recognition (CVPR) conference.
The new work spans custom image generation, 3D scene editing, visual language understanding, and autonomous vehicle perception.
"Artificial intelligence, and in particular generative AI, is the most important technological advance," said Jan Kautz, VP of learning and perception research at NVIDIA.
"At CVPR, NVIDIA Research is sharing how we're pushing the boundaries of what's possible — from powerful image generation models that could supercharge professional creators to autonomous driving software that could help enable next-generation self-driving cars."
Of the more than 50 NVIDIA research projects being presented at the conference, two papers are finalists for CVPR's Best Paper Awards: one on the training dynamics of diffusion models and another on high-definition maps for self-driving cars.
NVIDIA also ranked first in the CVPR Autonomous Grand Challenge's End-to-End Driving at Scale track, topping more than 450 entries worldwide.
The win demonstrates NVIDIA's leading-edge work in applying generative AI to end-to-end self-driving vehicle models, and it also earned the company a CVPR Innovation Award.
One of the headline projects is JeDi, a new technique that lets creators quickly customize an existing diffusion model to depict specific objects or characters using just a few reference images, rather than fine-tuning on a custom dataset.
Another breakthrough is FoundationPose, a new foundation model that instantly understands and tracks the 3D pose of objects in video without per-object training. It set new performance records and could unlock new applications in augmented reality and robotics.
The NVIDIA team also demonstrated NeRFDeformer, a method for editing a 3D scene captured by a neural radiance field (NeRF) using a single 2D snapshot, rather than laboriously reanimating changes or recreating the NeRF from scratch. This could streamline 3D scene editing for graphics, robotics, and digital twin applications.
In visual language understanding, NVIDIA collaborated with MIT on a new family of vision language models, called VILA, that achieves top performance in understanding images, videos, and text. With its advanced reasoning capabilities, VILA can even understand internet memes by combining visual and linguistic understanding.
NVIDIA's AI visual research spans many industries, with over a dozen papers dedicated to new approaches for the perception, mapping, and planning of autonomous vehicles. Sanja Fidler, vice president of AI Research at NVIDIA, discusses how vision language models could assist in creating better self-driving cars.
These projects offer a glimpse of how NVIDIA's research at CVPR spans the possibilities of generative AI: empowering creators, accelerating automation in manufacturing and health care, and advancing the future of autonomy and robotics.