To address the challenges, we present LLaVA-UHD, a large multimodal model that can efficiently perceive images in any aspect ratio and high resolution.
We introduce AutoCoder, the first Large Language Model to surpass GPT-4 Turbo (April 2024) and GPT-4o in pass@1 on the HumanEval benchmark ($\mathbf{90.9\%}$ vs. $\mathbf{90.2\%}$).
We present Orbit, a unified and modular framework for robot learning powered by NVIDIA Isaac Sim.
Photorealistic 3D reconstruction of street scenes is a critical technique for developing real-world simulators for autonomous driving.
The motion module can be adapted to various DiT baseline methods to generate video with different styles.
We design models based on T5-Base and T5-Large to obtain up to 7x increases in pre-training speed with the same computational resources.
The polygon mesh representation of 3D data offers great flexibility, fast rendering speed, and storage efficiency, and is therefore widely preferred in various applications.
Creating high-quality scientific figures can be time-consuming and challenging, even though sketching ideas on paper is relatively easy.
We present MOFA-Video, an advanced controllable image animation method that generates video from a given image using various additional controllable signals (such as human landmark references, manual trajectories, or even another provided video) or their combinations.
Existing learning rate schedules that do not require specification of the optimization stopping step T are greatly outperformed by learning rate schedules that depend on T. We propose an approach that avoids the need for this stopping time by eschewing the use of schedules entirely, while exhibiting state-of-the-art performance compared to schedules across a wide family of problems ranging from convex problems to large-scale deep learning problems.