The Race to the Bottom - Low Latency in the age of the Transformer Berlin Buzzwords 2022






So you want to deploy a large language model, and keep your latency SLA? NLP adds enormous value to customers, but getting it to work efficiently is fraught with uncertainty and high cost. As transformers and other big neural network architectures make their way into your platform, you may be finding it difficult to get the speed and throughput you need within your budget, or even understand why it is so expensive.

This talk will give an overview of the latency and throughput challenges, and how to solve them. We will give an overview in the product and cost implications as well as the technical improvements that can be used to get things running fast. We will compare solutions and help make sense of difficult to understand technology.

The audience will walk away with the information they need to decide on the best direction for inference in their production platform.

Keywords: MLOps, Inference, Latency


The Race to the Bottom - Low Latency in the age of the Transformer Berlin Buzzwords 2022 So you want to deploy a large language model, and keep your latency SLA? NLP adds enormous value to customers, but getting it to work efficiently is fraught with uncertainty and high cost. As transformers and other big neural network architectures make their way into your platform, you may be finding it difficult to get the speed and throughput you need within your budget, or even understand why it is so expensive.

This talk will give an overview of the latency and throughput challenges, and how to solve them. We will give an overview in the product and cost implications as well as the technical improvements that can be used to get things running fast. We will compare solutions and help make sense of difficult to understand technology.

The audience will walk away with the information they need to decide on the best direction for inference in their production platform.

Keywords: MLOps, Inference, Latency So you want to deploy a large language model, and keep your latency SLA? NLP adds enormous value to customers, but getting it to work efficiently is fraught with uncertainty and high cost. As transformers and other big neural network architectures make their way into your platform, you may be finding it difficult to get the speed and throughput you need within your budget, or even understand why it is so expensive. This talk will give an overview of the latency and throughput challenges, and how to solve them. We will give an overview in the product and cost implications as well as the technical improvements that can be used to get things running fast. We will compare solutions and help make sense of difficult to understand technology. The audience will walk away with the information they need to decide on the best direction for inference in their production platform. Keywords: MLOps, Inference, Latency Get your ticket now! Register for Berlin Buzzwords in our ticket shop! We also have online tickets and reduced tickets for students available and you can find more information about our Diversity Ticket Initiative here!