NVIDIA Triton Inference Server with LangChain

April 22, 2024

Background

LangChain is an open-source orchestration framework for developing applications with large language models (LLMs). It has become a very popular framework among AI application developers and is used with a variety of LLM providers such as OpenAI, Cohere, and Hugging Face.

Triton Inference Server is open-source software built by NVIDIA that standardizes AI model deployment and execution across every workload. It is a high-performance platform that supports CPUs and GPUs as well as a variety of model frameworks such as PyTorch, TensorFlow, and TensorRT.

While LangChain provides integrations with many GenAI model hosting services, there is no out-of-the-box support for models hosted on Triton Inference Server.

Extend LangChain to support Triton

LangChain's LLM class is the base class to extend when creating a custom LLM component.

The following methods need to be implemented in the custom LLM class (a minimal skeleton is sketched after the list):

  • A _call method that takes in a string, some optional stop words, and returns a string.
  • A _llm_type property that returns a string. Used for logging purposes only.
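A minimal sketch of such a subclass is shown below; the class name, the langchain_core import path, and the query_triton helper are assumptions for illustration, not taken from the linked example file. The body of _call, which builds and sends the request, is sketched after the payload description that follows.

from typing import Any, List, Optional

from langchain_core.language_models.llms import LLM


class TritonLLM(LLM):
    """Sketch of a custom LangChain LLM that forwards prompts to Triton."""

    @property
    def _llm_type(self) -> str:
        # Identifies this LLM type in LangChain logs.
        return "triton"

    def _call(
        self,
        prompt: str,
        stop: Optional[List[str]] = None,
        **kwargs: Any,
    ) -> str:
        # Build the Triton payload and return the generated text.
        # query_triton is sketched after the payload description below.
        return query_triton(prompt)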

In this implementation, the Triton endpoint is read from an environment variable: TRITON_LLM_ENDPOINT.

The payload for the request to Triton looks as follows:

{
    "id": trace_id,              // a 16-byte trace id in hex format
    "inputs": [
        {
            "name": "text_input",
            "datatype": "BYTES",
            "shape": [1],
            "data": [prompt]     // the prompt to send to the model
        }
    ]
}
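As a sketch of how the request itself could be made (again an illustration, not the code from the linked example file): the snippet below reads the endpoint from TRITON_LLM_ENDPOINT, assumes that variable holds the full inference URL, and assumes the model returns its text in an output tensor named text_output; adjust both to match your Triton model configuration.

import os
import uuid

import requests


def query_triton(prompt: str, timeout: int = 60) -> str:
    """Send a prompt to Triton using the payload shape shown above."""
    endpoint = os.environ["TRITON_LLM_ENDPOINT"]  # assumed to be the full inference URL

    payload = {
        "id": uuid.uuid4().hex,  # 16-byte trace id in hex format
        "inputs": [
            {
                "name": "text_input",
                "datatype": "BYTES",
                "shape": [1],
                "data": [prompt],
            }
        ],
    }

    response = requests.post(endpoint, json=payload, timeout=timeout)
    response.raise_for_status()
    body = response.json()

    # Assumes an output tensor named "text_output"; change this to match
    # the output name defined in your model's configuration.
    outputs = {output["name"]: output for output in body.get("outputs", [])}
    return outputs["text_output"]["data"][0]

With these two pieces in place, the wrapper behaves like any other LangChain LLM and can be used in a chain, for example with TritonLLM().invoke("Hello").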

Example

okahu-demo/rag_triton_hosted/triton_llm.py
