NVIDIA and Apple Collaborate to Accelerate Machine-Learning Research

NVIDIA said developers can “significantly boost LLM workload performance on NVIDIA GPUs” using the novel ReDrafter coding technique with its TensorRT-LLM


Published: December 20, 2024

Zeus Kerravala

On December 18, NVIDIA announced that TensorRT-LLM, its open-source library for large language model (LLM) inference, now supports recurrent drafting (ReDrafter), an open-source machine-learning technique pioneered by Apple. The technique improves LLM inference efficiency, delivering throughput gains of up to 2.7x.

In a blog post, NVIDIA said developers can “significantly boost LLM workload performance on NVIDIA GPUs” using the novel ReDrafter coding technique with its TensorRT-LLM. The collaboration has “made TensorRT-LLM more powerful and more flexible, enabling the LLM community to innovate more sophisticated models and easily deploy them with TensorRT-LLM to achieve unparalleled performance on NVIDIA GPUs,” according to an NVIDIA technical blog. The company added that it “eagerly anticipate[s] the next generation of advanced models from the community that leverage TensorRT-LLM capabilities, driving further improvements in LLM workloads.”

In its own blog post, Apple wrote: “Accelerating LLM inference is an important ML research problem, as auto-regressive token generation is computationally expensive and relatively slow, and improving inference efficiency can reduce latency for users.”
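To see why auto-regressive generation is slow, consider a minimal decoding loop: each new token requires a full forward pass through the model, and every step must wait on the previous one. The sketch below is purely illustrative (the model interface is a hypothetical stand-in, not Apple's or NVIDIA's code):

```python
import torch

@torch.no_grad()
def greedy_decode(model, input_ids, max_new_tokens):
    """Plain auto-regressive (greedy) decoding: one forward pass per token.

    `model` is a hypothetical callable returning logits of shape
    [batch, seq_len, vocab]; real serving stacks add KV caching,
    but the sequential dependency shown here remains.
    """
    tokens = input_ids
    for _ in range(max_new_tokens):
        logits = model(tokens)                            # full forward pass
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_token], dim=-1)  # strictly sequential
    return tokens
```

Speculative decoding attacks exactly this bottleneck: if cheaply drafted tokens can be verified in parallel, the expensive model can land several tokens per forward pass instead of one.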

Earlier this year, Apple published and open-sourced ReDrafter, which it calls “a novel approach to speculative decoding that achieves state-of-the-art performance.” ReDrafter uses a recurrent neural network (RNN) draft model that “combines beam search with dynamic tree attention to speed up LLM token generation by up to 3.5 tokens per generation step for open-source models, surpassing the performance of prior speculative decoding techniques.”
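For readers unfamiliar with speculative decoding, the hedged sketch below shows the basic draft-and-verify loop the technique builds on. ReDrafter goes further, using an RNN draft head plus beam search and dynamic tree attention to propose and verify many candidate continuations at once; this simplified version drafts a single greedy chain and is not Apple's implementation:

```python
import torch

@torch.no_grad()
def speculative_step(target_model, draft_model, tokens, k=4):
    """One draft-and-verify step of basic greedy speculative decoding
    (batch size 1 assumed; models are hypothetical callables returning
    logits of shape [1, seq_len, vocab]). A cheap draft model proposes
    k tokens; the large target model scores all of them in a single
    forward pass and keeps the longest prefix it agrees with."""
    draft = tokens
    for _ in range(k):                                # cheap sequential drafting
        next_tok = draft_model(draft)[:, -1, :].argmax(dim=-1, keepdim=True)
        draft = torch.cat([draft, next_tok], dim=-1)

    target_logits = target_model(draft)               # one pass verifies all k
    accepted = tokens
    for i in range(tokens.shape[1], draft.shape[1]):
        expected = target_logits[:, i - 1, :].argmax(dim=-1, keepdim=True)
        accepted = torch.cat([accepted, expected], dim=-1)
        if not torch.equal(expected, draft[:, i:i + 1]):
            break                                     # target disagreed: stop here
    else:                                             # every draft token accepted:
        bonus = target_logits[:, -1, :].argmax(dim=-1, keepdim=True)
        accepted = torch.cat([accepted, bonus], dim=-1)  # free extra token
    return accepted
```

Because the target model typically agrees with several drafted tokens per step, each expensive forward pass yields multiple tokens rather than one, which is where gains like the reported 3.5 tokens per generation step come from.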

Apple said it collaborated with NVIDIA to integrate ReDrafter into the NVIDIA TensorRT-LLM inference acceleration framework, making this advancement “production-ready for NVIDIA GPUs.”

Although TensorRT-LLM supports numerous open-source LLMs and the Medusa speculative decoding method, ReDrafter’s beam search and tree attention algorithms rely on operators that had not been used in previous applications. To integrate ReDrafter, NVIDIA added new operators or exposed existing ones, considerably improving TensorRT-LLM’s ability to accommodate sophisticated models and decoding methods. ML developers using NVIDIA GPUs can now easily tap ReDrafter’s accelerated token generation for their production LLM applications with TensorRT-LLM.
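Tree attention is a good example of what those operators must do. ReDrafter's beam search produces a tree of candidate continuations, and an attention mask that lets each drafted token attend only to itself and its ancestors allows every branch to be verified in one forward pass. The sketch below illustrates the masking idea only; it is not NVIDIA's TensorRT-LLM operator:

```python
import torch

def tree_attention_mask(parent):
    """Attention mask for draft tokens arranged as a tree.

    parent[i] is the index of node i's parent (-1 for the root).
    Each token may attend only to itself and its ancestors, so many
    candidate continuations can share a single forward pass. This is
    a simplified illustration, not ReDrafter's dynamic tree attention
    kernel.
    """
    n = len(parent)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:               # walk up the tree, marking ancestors
            mask[i, j] = True
            j = parent[j]
    return mask

# Example: root token 0 with two children (1, 2); node 1 has one child (3).
print(tree_attention_mask([-1, 0, 0, 1]).int())
```

Running the example prints a 4x4 mask in which, for instance, node 3 may attend to nodes 0, 1, and itself but not to the sibling branch at node 2.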

Apple said it benchmarked a “tens-of-billions parameter production model” on NVIDIA GPUs, using the TensorRT-LLM inference acceleration framework with ReDrafter, which yielded a “2.7x speed-up in generated tokens per second” for what’s known as “greedy decoding.” The benchmark results showed that the tech could “significantly reduce latency users may experience while also using fewer GPUs and consuming less power,” Apple said.
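To put that ratio in concrete terms, here is a back-of-the-envelope illustration; the baseline figure is invented for the example, since Apple reported only the 2.7x multiplier:

```python
# Hypothetical numbers: Apple disclosed only the 2.7x ratio, not absolute rates.
baseline_tps = 40.0                          # assumed baseline tokens/second
redrafter_tps = 2.7 * baseline_tps           # -> 108 tokens/second
for n_tokens in (100, 500):
    before = n_tokens / baseline_tps
    after = n_tokens / redrafter_tps
    print(f"{n_tokens}-token response: {before:.1f}s -> {after:.1f}s")
```

The same multiplier underpins the power and GPU-count savings Apple highlights: serving the same traffic requires proportionally fewer expensive forward passes.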

Key Takeaways

ML developers using NVIDIA GPUs will realize several benefits, including:

  • ReDrafter with NVIDIA TensorRT-LLM can deliver up to 2.7x throughput improvements over the base LLM on NVIDIA H100 GPUs with TP8 (tensor parallelism across eight GPUs), a significant LLM workload performance gain.
  • A more powerful and flexible NVIDIA TensorRT-LLM delivers optimized inference, enabling developers to achieve what Apple calls “unparalleled performance” on NVIDIA GPUs.
  • Better utilization of GPU resources will improve efficiency, particularly in low-traffic scenarios where GPUs are typically underutilized due to small batch sizes.

Over time, every application we use will have a generative AI interface built into it, much as search became ubiquitous. Integrating LLMs into applications is rapidly becoming the norm, but doing so can be complicated and expensive. ReDrafter’s approach to speculative decoding, integrated into NVIDIA’s TensorRT-LLM framework, enables faster token generation on NVIDIA GPUs, reducing latency, lowering costs, and improving efficiency.

Partnerships like this are essential for accelerating the growth of AI, since many technologies must work together to deliver an optimized experience. An ecosystem approach, particularly from two tech giants like Apple and NVIDIA, is good for everyone: developers can focus on their apps instead of tweaking and tuning technology components. I’m hoping 2025 is the year AI vendors focus on growing their ecosystems as much as on one-upping each other on speeds and feeds.
