WebLLM API Reference¶
The MLCEngine class is the core interface of WebLLM. It enables model loading, chat completions, embeddings, and other operations. Below, we document its methods, along with the associated configuration interfaces.
Interfaces¶
The following interfaces are used as parameters or configurations within MLCEngine methods. They are linked to their respective methods for reference.
MLCEngineConfig¶
Optional configurations for CreateMLCEngine() and CreateWebWorkerMLCEngine().
- Fields:
  - appConfig: Configures the app, including the list of models and whether to use the IndexedDB cache.
  - initProgressCallback: A callback for showing the progress of loading the model.
  - logitProcessorRegistry: A registry for stateful logit processors; see webllm.LogitProcessor.
- Usage:
  - appConfig: Contains application-specific settings, including:
    - Model configurations.
    - IndexedDB caching preferences.
  - initProgressCallback: Allows developers to visualize model loading progress by implementing a callback.
  - logitProcessorRegistry: A Map object for registering custom logit processors. Only applies to MLCEngine; see the sketch after the example below.
Note
All fields are optional, and logitProcessorRegistry is only used for MLCEngine.
Example:
import { CreateMLCEngine } from "@mlc-ai/web-llm";

const engine = await CreateMLCEngine("Llama-3.1-8B-Instruct", {
appConfig: { /* app-specific config */ },
initProgressCallback: (progress) => console.log(progress),
});
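For logitProcessorRegistry, a minimal sketch follows. It assumes the webllm.LogitProcessor interface (processLogits, processSampledToken, resetState) and uses the model ID as the registry key, following the convention in the WebLLM examples; treat it as an illustration rather than a definitive implementation.
import * as webllm from "@mlc-ai/web-llm";

// Illustrative processor: nudges logits before sampling.
class MyLogitProcessor {
  processLogits(logits) {
    logits[0] += 1.0; // bias token id 0 upward (arbitrary demonstration)
    return logits;
  }
  processSampledToken(token) { /* observe sampled tokens if needed */ }
  resetState() { /* clear any per-conversation state */ }
}

const registry = new Map();
registry.set("Llama-3.1-8B-Instruct", new MyLogitProcessor()); // keyed by model ID (assumed)
const engine = await webllm.CreateMLCEngine("Llama-3.1-8B-Instruct", {
  logitProcessorRegistry: registry,
});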
GenerationConfig¶
Configurations for a single generation task, primarily used in chat completions.
- Fields:
  - repetition_penalty, ignore_eos: Specific to MLC models.
  - top_p, temperature, max_tokens, stop: Shared with the OpenAI API.
  - logit_bias, n: Additional parameters for sampling control.
- Usage:
  - Fields like repetition_penalty and ignore_eos allow fine control over the output generation behavior (see the sketch after the example below).
  - Parameters shared with the OpenAI API (e.g., temperature, top_p) ensure compatibility.
Example:
const messages = [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "Explain WebLLM." },
];
const response = await engine.chatCompletion({
messages,
top_p: 0.9,
temperature: 0.8,
max_tokens: 150,
});
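A minimal sketch of the MLC-specific fields, assuming they are supplied via the chatOpts argument of MLCEngine.reload (documented below), which accepts GenerationConfig-style settings:
await engine.reload("Llama-3.1-8B-Instruct", {
  repetition_penalty: 1.2, // MLC-specific: penalize repeated tokens
  temperature: 0.8,        // shared with the OpenAI API
});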
ChatCompletionRequest¶
Defines the structure for chat completion requests.
- Base Interface: ChatCompletionRequestBase
  Contains parameters like messages, stream, frequency_penalty, and presence_penalty.
- Variants:
  - ChatCompletionRequestNonStreaming: For non-streaming completions.
  - ChatCompletionRequestStreaming: For streaming completions.
- Usage:
  - Combines settings from GenerationConfig and ChatCompletionRequestBase to provide complete control over chat behavior.
  - The stream parameter enables streaming responses, improving interactivity in conversational agents.
  - The logit_bias feature allows fine-tuning of token generation probabilities, providing a mechanism to restrict or encourage specific outputs (see the sketch after the example below).
Example:
const chunks = await engine.chatCompletion({
  messages: [
    { role: "user", content: "Tell me about WebLLM." },
  ],
  stream: true,
});
// With stream: true, the result is an AsyncIterable of chunks.
for await (const chunk of chunks) {
  console.log(chunk.choices[0]?.delta?.content ?? "");
}
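For logit_bias, a hedged sketch follows; the token ID "12345" is a placeholder, since real IDs depend on the model's tokenizer, and the OpenAI-style bias range of -100 to 100 is assumed:
const biased = await engine.chatCompletion({
  messages: [{ role: "user", content: "Pick a color." }],
  logit_bias: { "12345": -100 }, // effectively bans this (placeholder) token ID
});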
Model Loading¶
MLCEngine.reload(modelId: string | string[], chatOpts?: ChatOptions | ChatOptions[]): Promise<void>
Loads the specified model(s) into the engine. Uses MLCEngineConfig during initialization.
- Parameters:
  - modelId: Identifier(s) for the model(s) to load.
  - chatOpts: Configuration for generation (see GenerationConfig).
Example:
await engine.reload(["Llama-3.1-8B", "Gemma-2B"], [
{ temperature: 0.7 },
{ top_p: 0.9 },
]);
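When several models are loaded this way, a request can be directed at one of them. The sketch below assumes the request's model field selects among loaded models:
const response = await engine.chatCompletion({
  messages: [{ role: "user", content: "Hello!" }],
  model: "Llama-3.1-8B", // target one of the loaded models (assumed field)
});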
MLCEngine.unload(): Promise<void>
Unloads all loaded models and clears their associated configurations.
Example:
await engine.unload();
Chat Completions¶
MLCEngine.chat.completions.create(request: ChatCompletionRequest): Promise<ChatCompletion | AsyncIterable<ChatCompletionChunk>>
Generates chat-based completions using a specified request configuration.
- Parameters:
  - request: A ChatCompletionRequest instance.
Example:
const response = await engine.chat.completions.create({
messages: [
{ role: "system", content: "You are a helpful AI assistant." },
{ role: "user", content: "What is WebLLM?" },
],
temperature: 0.8,
stream: false,
});
Utility Methods¶
MLCEngine.getMessage(modelId?: string): Promise<string>
Retrieves the current output message from the specified model.
MLCEngine.resetChat(keepStats?: boolean, modelId?: string): Promise<void>
Resets the chat history and optionally retains usage statistics.
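Both utilities use only the signatures shown above; a brief combined sketch (the interleaving is illustrative):
const lastMessage = await engine.getMessage(); // latest output of the loaded model
console.log(lastMessage);
await engine.resetChat(true); // clear history but keep usage statistics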
GPU Information¶
The following methods provide detailed information about the GPU used for WebLLM computations.
MLCEngine.getGPUVendor(): Promise<string>
Retrieves the vendor name of the GPU used for computations. Useful for understanding the hardware capabilities during inference.
Returns: A string indicating the GPU vendor (e.g., “Intel”, “NVIDIA”).
Example:
const gpuVendor = await engine.getGPUVendor();
console.log(`GPU Vendor: ${gpuVendor}`);
MLCEngine.getMaxStorageBufferBindingSize(): Promise<number>
Returns the maximum storage buffer size supported by the GPU. This is important when working with larger models that require significant memory for processing.
Returns: A number representing the maximum size in bytes.
Example:
const maxBufferSize = await engine.getMaxStorageBufferBindingSize();
console.log(`Max Storage Buffer Binding Size: ${maxBufferSize}`);
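The two GPU queries can be combined into a simple preflight check before loading a large model; the 2 GiB threshold below is an arbitrary illustrative value, not a WebLLM requirement:
const vendor = await engine.getGPUVendor();
const maxBufferSize = await engine.getMaxStorageBufferBindingSize();
if (maxBufferSize < 2 * 1024 ** 3) { // hypothetical threshold
  console.warn(`GPU (${vendor}) storage buffers may be too small for this model.`);
}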