Accessible Chat Client for Ollama
To use VOLlama, you must first set up Ollama and download a model from Ollama’s library. Follow these steps:
Download and install Ollama.
You will need a model to generate text. Run the command below in a terminal (or Command Prompt on Windows) to download one. If you prefer a different model, replace llama3 with your chosen model.
ollama pull llama3
Optionally, if you want to use the image description feature, download a multimodal (vision + language) model:
ollama pull llava
There are also llava:13b and llava:34b which have higher accuracy but require more storage, memory, and computing power.
Optionally, if you want to use the retrieval-augmented generation feature, download nomic-embed-text for embedding:
ollama pull nomic-embed-text
Finally, run VOLlama.
On Mac, VOLlama is not notarized by Apple, so you need to allow it to run under System Settings > Privacy & Security.
VOLlama may take a while to load especially on Mac, so be patient. You’ll eventually hear “VOLlama is starting.”
If you want responses to be read aloud automatically, you can enable the “Speak Response with System Voice” option from the chat menu.
If you are operating Ollama on a different machine, configure the host address in Chat menu > API Settings > Ollama > Base URL.
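Before pointing VOLlama at a remote machine, it can help to confirm the Ollama server is reachable from yours. The host address below is a placeholder; substitute the address of the machine actually running Ollama:

```shell
# Placeholder address for the machine running Ollama; replace with your own.
OLLAMA_HOST="${OLLAMA_HOST:-192.168.1.50}"
BASE_URL="http://${OLLAMA_HOST}:11434"   # 11434 is Ollama's default port
echo "Base URL to enter in VOLlama: $BASE_URL"
# List the models the remote server has available (requires network access):
# curl "$BASE_URL/api/tags"
```

If the curl call returns a JSON list of models, that same base URL should work in VOLlama's API settings.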
Shortcuts for all the features can be found in the menu bar. The exceptions are listed below.
To ask a multimodal model questions about an image:
This table lists the generation parameters available in VOLlama, along with their descriptions, types, and default values:
Parameter | Description | Value Type | Default Value |
---|---|---|---|
num_ctx | Sets the size of the context window used to generate the next token. Depends on the model’s limit. | int | 4096 |
num_predict | Maximum number of tokens to predict during text generation. Use -1 for infinite, -2 to fill context. | int | -1 |
temperature | Adjusts the model’s creativity. Higher values lead to more creative responses. Range: 0.0-2.0. | float | 0.8 |
repeat_penalty | Penalizes repetitions. Higher values increase the penalty. Range: 0.0-2.0. | float | 1.0 |
repeat_last_n | How far back the model checks to prevent repetition. 0 = disabled, -1 = num_ctx. | int | 64 |
top_k | Limits the likelihood of less probable responses. Higher values allow more diversity. Range: -1 to 100. | int | 40 |
top_p | Works with top_k to manage diversity of responses. Higher values lead to more diversity. Range: 0.0-1.0. | float | 0.95 |
tfs_z | Tail free sampling reduces the impact of less probable tokens. Higher values diminish this impact. | float | 1.0 |
typical_p | Sets a minimum likelihood threshold for considering a token. Range: 0.0-1.0. | float | 1.0 |
presence_penalty | Penalizes new tokens based on their presence so far. Range: 0.0-1.0. | float | 0.0 |
frequency_penalty | Penalizes new tokens based on their frequency so far. Range: 0.0-1.0. | float | 0.0 |
mirostat | Enables Mirostat sampling to control perplexity. 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0. | int | 0 |
mirostat_tau | Balances between coherence and diversity of output. Lower values yield more coherence. Range: 0.0-10.0. | float | 5.0 |
mirostat_eta | Influences response speed to feedback in text generation. Higher rates mean quicker adjustments. Range: 0.0-1.0. | float | 0.1 |
num_keep | Number of tokens to keep unchanged at the beginning of generated text. | int | 0 |
penalize_newline | Whether to penalize the generation of new lines. | bool | True |
stop | Triggers the model to stop generating text when this pattern is encountered. List strings separated by “, ”. | string Array | empty |
seed | Sets the random number seed for generation. Specific numbers ensure reproducibility. -1 = random. | int | -1 |
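These parameters correspond to the `options` object of Ollama's REST API, which clients like VOLlama send with each request. As a sketch, a request carrying a few of the options from the table might look like this (the prompt and values are illustrative):

```shell
# Build a request body with some of the generation options listed above.
read -r -d '' PAYLOAD <<'EOF' || true
{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "stream": false,
  "options": {
    "temperature": 0.8,
    "num_ctx": 4096,
    "top_k": 40,
    "top_p": 0.95,
    "repeat_penalty": 1.0
  }
}
EOF
echo "$PAYLOAD"
# Send it to a locally running Ollama server:
# curl http://localhost:11434/api/generate -d "$PAYLOAD"
```

Options omitted from the request fall back to the model's own defaults.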
To retrieve a document and ask questions about it, follow these steps:
Note: It retrieves only snippets of text relevant to your question, so full summaries are not available.
https://bbc.com/
/q What are some positive news for today?
Prefacing your message with /q triggers processing your prompt with RAG using LlamaIndex.
This section describes the parameters related to the Retrieval-Augmented Generation (RAG) feature:
Parameter | Description | Value Type | Default Value |
---|---|---|---|
show_context | When enabled, displays the text chunks sent to the model. | bool | False |
chunk_size | Determines the size of text chunks for indexing. | int | 1024 |
chunk_overlap | Specifies the overlap between the start and end of each chunk. | int | 20 |
similarity_top_k | Number of the most relevant chunks fed to the model. | int | 2 |
similarity_cutoff | The threshold for filtering out less relevant chunks. Setting too high may exclude all chunks. | float | 0.0 |
response_mode | Determines how RAG synthesizes responses. | string | refine |
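To see how chunk_size and chunk_overlap interact, here is a toy illustration with deliberately tiny values (real indexing works at a much larger scale, but the stepping logic is the same):

```shell
# Toy example: split 26 characters into chunks of 10 with an overlap of 2.
TEXT="abcdefghijklmnopqrstuvwxyz"
CHUNK_SIZE=10
CHUNK_OVERLAP=2
STEP=$((CHUNK_SIZE - CHUNK_OVERLAP))  # each chunk starts 8 characters after the previous one
i=0
while [ "$i" -lt "${#TEXT}" ]; do
  echo "chunk: ${TEXT:$i:$CHUNK_SIZE}"  # first chunks: abcdefghij, ijklmnopqr, ...
  i=$((i + STEP))
done
```

A larger overlap makes adjacent chunks share more text, which helps when a relevant passage straddles a chunk boundary, at the cost of indexing more duplicated content.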
This feature allows you to duplicate an existing model via a model file, enabling you to use it as a preset with a different name and parameters (e.g., temperature, repeat penalty, maximum generation length, context length). It does not duplicate the model’s weight files, thus conserving storage space even with multiple duplicates.
For more details, see modelfile.
For Mac users, it is crucial to disable smart quotes before opening the copy model dialog. If your model file displays a left double quotation mark instead of a straight quotation mark, smart quotes are enabled.
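As a sketch, creating such a preset from the command line might look like this. The preset name, parameter values, and system prompt are made up; note that every quotation mark in the file must be a straight quote, which is exactly what the smart-quotes warning above is about:

```shell
# Write a hypothetical model file that presets llama3 with custom parameters.
printf '%s\n' \
  'FROM llama3' \
  'PARAMETER temperature 1.2' \
  'PARAMETER repeat_penalty 1.1' \
  'PARAMETER num_ctx 8192' \
  'SYSTEM "You are a concise assistant."' > Modelfile
cat Modelfile
# Register the preset with Ollama (requires a running Ollama server):
# ollama create llama3-concise -f Modelfile
```

Because the file only references llama3, the new preset reuses the existing weights on disk.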
If you prefer to run Ollama using Docker, follow the instructions below:
Install Ollama by executing the following command in the command line:
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
Download a model to generate text. Replace llama3 with your desired model if you wish to use a different one:
docker exec ollama ollama pull llama3
Optionally, if you wish to use the retrieval-augmented generation feature, download nomic-embed-text for embedding:
docker exec ollama ollama pull nomic-embed-text
To stop Ollama, use the following command:
docker stop ollama
To restart Ollama, use the command below:
docker start ollama