LangChain is an open-source orchestration framework for developing applications powered by large language models (LLMs). It has become a very popular framework among AI application developers and is used with a variety of LLM providers such as OpenAI, Cohere, and Hugging Face.
Triton Inference Server is open-source software built by NVIDIA that standardizes AI model deployment and execution across every workload. It is a high-performance platform that supports both CPUs and GPUs, as well as a variety of model frameworks such as PyTorch, TensorFlow, and TensorRT.
While LangChain provides integrations with many GenAI model hosting services, there is no out-of-the-box support for models hosted on Triton Inference Server.
LangChain's LLM class is the base class to extend when creating any custom LLM component.
The following methods need to be added to the custom LLM class; a skeleton is sketched below.
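As a minimal sketch, assuming the standard LangChain custom-LLM interface (a subclass implements the _llm_type property and the _call method), the skeleton could look like this; the class name TritonLLM is illustrative:

    from typing import Any, List, Optional

    from langchain.callbacks.manager import CallbackManagerForLLMRun
    from langchain.llms.base import LLM


    class TritonLLM(LLM):
        """Custom LangChain LLM that forwards prompts to a Triton Inference Server."""

        @property
        def _llm_type(self) -> str:
            # Identifier LangChain uses for logging and serialization.
            return "triton_llm"

        def _call(
            self,
            prompt: str,
            stop: Optional[List[str]] = None,
            run_manager: Optional[CallbackManagerForLLMRun] = None,
            **kwargs: Any,
        ) -> str:
            # Build the Triton request payload from the prompt and return the
            # generated text; a possible implementation is sketched after the
            # payload description below.
            raise NotImplementedError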
In this implementation, the Triton endpoint needs to be set as an environment variable: TRITON_LLM_ENDPOINT.
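For instance, the endpoint can be read once at startup; this is a sketch, and the example URL and the fail-fast check are not part of the original implementation:

    import os

    # Base URL of the Triton HTTP frontend, e.g. "http://triton-host:8000"
    # (the example value is an assumption). Failing fast here gives a clearer
    # error than a failed HTTP call later.
    TRITON_LLM_ENDPOINT = os.environ.get("TRITON_LLM_ENDPOINT")
    if not TRITON_LLM_ENDPOINT:
        raise RuntimeError("TRITON_LLM_ENDPOINT environment variable is not set")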
The payload for the request to Triton looks as follows:
{
  "id": trace_id,            // a trace id in 16-byte hex format
  "inputs": [
    {
      "name": "text_input",
      "datatype": "BYTES",
      "shape": [1],
      "data": [prompt]       // the prompt to send to the model
    }
  ]
}
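Putting the pieces together, a helper along these lines can build the payload, POST it to Triton's standard KServe-v2 HTTP route (/v2/models/<model>/infer), and extract the generated text; _call can then simply return triton_generate(prompt). The function name triton_generate, the placeholder model name my_model, the output tensor name text_output, and the use of the requests library are all illustrative assumptions:

    import os
    import uuid

    import requests


    def triton_generate(prompt: str, model: str = "my_model") -> str:
        """Send a prompt to Triton and return the generated text (illustrative sketch)."""
        endpoint = os.environ["TRITON_LLM_ENDPOINT"]
        payload = {
            "id": uuid.uuid4().hex,  # 16-byte trace id in hex format
            "inputs": [
                {
                    "name": "text_input",
                    "datatype": "BYTES",
                    "shape": [1],
                    "data": [prompt],
                }
            ],
        }
        # Standard KServe-v2 inference route exposed by Triton's HTTP frontend.
        response = requests.post(
            f"{endpoint}/v2/models/{model}/infer", json=payload, timeout=60
        )
        response.raise_for_status()
        result = response.json()
        # The output tensor name ("text_output") depends on the model's configuration.
        outputs = {out["name"]: out for out in result["outputs"]}
        return outputs["text_output"]["data"][0]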