Small local agents¶
Experimenting with agents does not require access to external services, expensive hardware, or even the cloud. This short tutorial notebook is an introduction to language model-based agents using small language models, and demonstrates their use to query a database.
The code is available in a repository on GitHub: https://github.com/lgautier/local-small-language-agent
Setup¶
Local Ollama server¶
The language models we use will be served by Ollama.
The default instructions use podman to run that Ollama server in a rootless container.
If podman is not available, or if this is running on an isolated system where running ollama outside of a container is not an issue, alternative instructions follow.
This assumes that podman is installed. If that is not yet the case, the instructions are here:
https://podman.io/docs/installation
! podman run --replace -d --rm -p 11434:11434 -v dot_ollama:/root/.ollama --name ollama docker.io/ollama/ollama
ed86c9e2787467f1ee8c4c1e8996388bd48bbc189d8792e8e6be7800d06ddccc
If on a Debian-based system where sudo-ing an installation of ollama is possible, for example on Google Colab, the following instructions can be used instead. Create a new Jupyter cell with the code below, or replace the podman cell above with it.
# Derived from the Google Cloud doc:
# https://medium.com/google-cloud/gemma-3-ollama-on-colab-a-developers-quickstart-7bbf93ab8fef
! echo $(tput setaf 1)"Installing APT packages."$(tput setaf 0)
! sudo apt-get -qq update && sudo apt-get -qq install -y pciutils lshw
! echo $(tput setaf 1)"Fetching Ollama."$(tput setaf 0)
! curl -fsSL https://ollama.com/install.sh | sh
! echo $(tput setaf 1)"Starting Ollama server."$(tput setaf 0)
! nohup ollama serve > ollama.log 2>&1 &
Required Python packages¶
The Python packages ollama, langchain, and langchain_ollama are required. Make sure they are installed before continuing.
import dataclasses
import langchain
import langchain_ollama
import ollama
Utility code¶
To minimize code duplication, we encapsulate most of the Ollama model management into a class.
@dataclasses.dataclass(frozen=True)
class OllamaModel:
"""Wrapper class for models with Ollama and LangChain."""
name: str
def __str__(self) -> str:
return self.name
def has_tools(self) -> bool:
return 'tools' in self.show().capabilities
def is_pulled(self) -> bool:
for mdl in ollama.list().models:
if mdl.model == self.name:
return True
return False
def pull(self):
"""Pull the model to make it available to the local Ollama instance."""
ollama.pull(self.name)
def show(self) -> ollama._types.ShowResponse:
return ollama.show(self.name)
def new_chat(self, temperature:float = 0) -> langchain_ollama.ChatOllama:
res = langchain_ollama.ChatOllama(
model=str(self),
temperature=temperature
)
return res
Object display customization¶
We also create a convenience custom HTML render for AI responses.
import jinja2
_aimessage_template = jinja2.Template(
"""<table>
<thead><tr><td colspan="2">AI message</td></tr></thead>
<tbody>
<tr>
{% if response.content %}
<td style="vertical-align: top; border-right: 2px solid rgb(40, 40, 150); width: 6em;"><b>response:</b></td>
<td style="text-align: left; text-indent: 0px;">
<div style="white-space: pre-wrap;">
{{ response.content }}
</div>
</td>
{% endif %}
{% if response.tool_calls %}
<td style="vertical-align: top; border-right: 2px solid rgb(150, 40, 40); width: 6em;"><b>tools:</b></td>
<td style="text-align: left; text-indent: 0px;">
<div style="white-space: pre-wrap;">
{{ response.tool_calls }}
</div>
</td>
{% endif %}
</tr></tbody></table>
""")
def aimessage_html(obj):
return _aimessage_template.render(response=obj)
html_formatter = get_ipython().display_formatter.formatters['text/html']
html_formatter.for_type_by_name('langchain_core.messages.ai', 'AIMessage', aimessage_html)
def toolmessage_html(obj):
_toolmessage_template = jinja2.Template(
"""<table>
<thead><tr><td colspan="2">Tool message</td></tr></thead>
<tbody>
<tr>
{% if response.content %}
<td style="vertical-align: top; border-right: 2px solid rgb(40, 40, 150); width: 6em;"><b>response:</b></td>
<td style="text-align: left; text-indent: 0px;">
<div style="white-space: pre-wrap;">
{{ response.content }}
</div>
</td>
{% endif %}
</tr>
</tbody></table>
""")
return _toolmessage_template.render(response=obj)
html_formatter = get_ipython().display_formatter.formatters['text/html']
html_formatter.for_type_by_name('langchain_core.messages.tool', 'ToolMessage', toolmessage_html)
def highlight(text):
return f'\n\033[1m{text}\033[0m'
List of language models¶
Below is the list of all the models to use here. Be mindful of the capabilities of the system this will run on, in particular the RAM, or the VRAM if a GPU is present.
This notebook was created with gemma3:4b (Gemma3 with 4B parameters) and granite4:1b (Granite4 with 1B parameters).
MODELS = (
# OllamaModel('deepseek-r1:1.5b'),
OllamaModel('gemma3:4b'),
OllamaModel('functiongemma:270m'),
OllamaModel('granite4:1b'),
# OllamaModel('phi4-mini:3.8b'),
# OllamaModel('qwen3:4b')
)
We are ready to start. First, we ensure that the models we need are pulled locally.
for mdl in MODELS:
if mdl.is_pulled():
print(f'Model {mdl} already pulled.')
else:
print(f'Pulling model {mdl}...', end='', flush=True)
mdl.pull()
print('done.')
Model gemma3:4b already pulled. Pulling model functiongemma:270m...done. Model granite4:1b already pulled.
Language models and their limitations¶
We'll start with a simple question that demonstrates that LLMs are eerily good at the pattern-matching part of understanding, but might not be able to go beyond that.
from langchain_core.messages.human import HumanMessage
from langchain_core.messages.system import SystemMessage
from langchain_core.tools import tool
human_message = HumanMessage('Make the following calculation: 392373366237 * 562310951975')
for mdl in MODELS:
print(f'\n\033[1m{str(mdl)}\033[0m\n')
chat = mdl.new_chat()
response = chat.invoke([human_message])
display(response)
gemma3:4b
| AI message | |
| response: |
Okay, let's calculate 392373366237 * 562310951975.
This is a large multiplication, and it's best done using a calculator or a programming language. Here's the result:
**2236358838888888888888888888888888888888
|
functiongemma:270m
| AI message | |
| response: |
I apologize, but I cannot perform this calculation. The calculation tool I have access to is specialized for calculating mathematical operations related to financial calculations, such as calculating stock market valuations or stock market indices. I cannot access or perform complex financial data like this.
|
granite4:1b
| AI message | |
| response: |
The result of multiplying 392373366237 by 562310951975 is:
220,646,204,278,964,000,795
|
Is this correct? The actual answer is:
print(f'{392373366237 * 562310951975:,}')
220,635,841,098,362,793,468,075
Unless this notebook is running in a future when LLMs have learned to multiply, or this notebook became part of the training data, the LLM answer is likely not correct, OR the LLM declines to answer the question (which is the case with functiongemma).
One should not rely on current LLM technology to correctly perform mathematical operations, or more generally to evaluate expressions or run algorithms. However, LLMs have seen training data that describes steps
to resolve similar problems. They might be able to generate code (software instructions), for example a function call, to get to the answer. This can already be observed when using gemma3:4b: the answer from the LLM includes Python code that returns the correct answer when evaluated.
If we had a way to know when an LLM answer contains code, we could simply evaluate that code and consider the resulting text the answer in a dialogue. In fact, it does not even have to be a dialogue: evaluating code simply represents an action.
Additional instructions in the prompt can get us there with this example. We can add that if evaluating code seems a better path to the correct answer, only code should be returned. To facilitate the triaging of answers (the answer is text, that is the direct answer, versus the answer is code to get to the answer), we can also indicate the format to use.
answer_as_code = HumanMessage("""
IF this type of question would be better answered by evaluating code, return instead
ONLY Python code in the answer. Use the Markdown format for code snippets:
```python
<code>
```
""")
human_message = HumanMessage("""Make the following calculation: 392373366237 * 562310951975""")
for mdl in MODELS:
print(highlight(str(mdl)))
chat = mdl.new_chat()
response = chat.invoke([human_message, answer_as_code])
display(response)
gemma3:4b
| AI message | |
| response: |
```python
print(392373366237 * 562310951975)
```
|
functiongemma:270m
| AI message | |
| response: |
I apologize, but I cannot assist with generating Python code. My current capabilities are limited to assisting with mathematical calculations and data analysis using the provided tools. I cannot generate programming code.
|
granite4:1b
| AI message | |
| response: |
```python
result = 392373366237 * 562310951975
result
```
|
This is looking promising, except for functiongemma, which is adamant about being unable to generate code.
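To act on such answers we need to detect and evaluate the returned code. A minimal sketch of that step, assuming answers shaped like the gemma3:4b response above (the helper name `maybe_evaluate` is ours, not part of any library):

```python
import contextlib
import io
import re

# Match a Markdown-fenced Python block, as requested in the prompt above.
_CODE_FENCE = re.compile(r'```python\n(.*?)```', re.DOTALL)

def maybe_evaluate(answer: str) -> str:
    """If the answer contains a fenced Python block, run it and return its
    printed output; otherwise return the answer text unchanged."""
    match = _CODE_FENCE.search(answer)
    if match is None:
        return answer
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        # Caution: exec() runs arbitrary model-generated code. Sandbox it
        # in anything beyond a local experiment.
        exec(match.group(1), {})
    return buffer.getvalue().strip()

answer = '```python\nprint(392373366237 * 562310951975)\n```'
print(maybe_evaluate(answer))
```

Note that the granite4:1b snippet above ends with a bare expression rather than a `print()` call, so capturing stdout alone would miss its result; handling that case is left aside here.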
A good check to perform with LLMs is whether a new prompt causes a regression (that is, loses the ability to answer some questions correctly). To achieve this we ask a question that should be answerable without code, that is a question for which the answer was almost certainly in the training data for the model.
human_message = HumanMessage("""What is the name of the natural satellite orbiting around the earth?""")
for mdl in MODELS:
print(highlight(str(mdl)))
chat = mdl.new_chat()
response = chat.invoke([human_message, answer_as_code])
display(response)
gemma3:4b
| AI message | |
| response: |
```python
print("The name of the natural satellite orbiting around the Earth is the Moon.")
```
|
functiongemma:270m
| AI message | |
| response: |
I apologize, but I cannot assist with programming or generating Python code. My current capabilities are limited to assisting with tasks related to natural language processing and text generation. I cannot provide programming advice or code snippets.
|
granite4:1b
| AI message | |
| response: |
The name of the natural satellite orbiting around the Earth is called **the Moon**.
|
While gemma3:4b is technically correct, its answer may indicate that this model will resort to writing code more often than necessary. functiongemma is consistent in messaging that it cannot generate code.
Adding tools as context¶
Agents are language models to which the availability of tools (software functions), with a description of what they do, is specified as context. For gemma3:4b this can be achieved through a system prompt, while for models with a "tools" capability Python functions can be passed directly. This is the case with granite4:1b, for example.
Tools in prompt¶
We showed earlier that a prompt can indicate that code evaluation is available. Similarly, we can indicate that tools are available and calling a tool can be used as an answer.
If we make our multiplication a tool, a prompt can look as follows:
system_prompt_tools = """
You have access to functions.
You don't have to use tools if not necessary, but *IF* you decide to invoke any of the function(s),
you MUST put it in a valid JSON expression like:
[
{"name": function name,
"parameters": dictionary of argument name and its value}
]
You SHOULD NOT include any other text in the response if you call a function.
Functions:
[
{ "name": "multiply",
"description": "Multiply two numbers.",
"parameters": {
"type": "object",
"properties": {
"a": "number",
"b": "number"
},
"required": [
"a", "b"
]
}
}
]
"""
Tools as tools¶
The use of tools has been somewhat formalized since the early days of prompt engineering, a mere double-digit number of months in the past.
Many of the most recent language models specifically include a capability called "tools". This frees a user, or developer,
from having to write prompt sections that declare the availability of tools in a way the language model reacts to, and from parsing out whether an answer is text or code. langchain provides a wrapper for this and lets one simply declare any Python function as a tool. For the multiplication we can just plug in the Python multiplication operator.
import operator
tools = [tool(operator.mul)]
Language model opting to answer with a tool call¶
We have shown that a prompt context, or a langchain tool declaration when the model has the capability, can lead a language model to answer with code. If we ask our multiplication question again:
human_message = HumanMessage("""Make the following calculation: 392373366237 * 562310951975""")
for model in MODELS:
chat = model.new_chat()
if model.has_tools():
print(f'{highlight(model.name)} (with tools)\n')
chat = chat.bind_tools(tools)
messages = [human_message]
else:
print(f'{highlight(model.name)}\n')
messages = [SystemMessage(system_prompt_tools),
human_message]
response = chat.invoke(messages)
display(response)
gemma3:4b
| AI message | |
| response: |
```json
[
{
"name": "multiply",
"parameters": {
"a": 392373366237,
"b": 562310951975
}
}
]
```
|
functiongemma:270m (with tools)
| AI message | |
| tools: |
[{'name': 'mul', 'args': {'a': 392373366237, 'b': 562310951975}, 'id': '7c9769d1-8531-4868-866f-3d00eb598097', 'type': 'tool_call'}]
|
granite4:1b (with tools)
| AI message | |
| tools: |
[{'name': 'mul', 'args': {'a': '392373366237', 'b': '562310951975'}, 'id': '62be38a6-3878-4ad3-839d-31093d5d8d3e', 'type': 'tool_call'}]
|
All language models understand the assignment and provide correct function calls. Interestingly, creating a function call is creating code, which functiongemma is otherwise convinced it is unable to do.
Agentic linear chaining - an example¶
Database¶
We create a local database for demonstration purposes.
import sqlite3
dbcon = sqlite3.connect(':memory:', check_same_thread=False)
dbcon.execute("""
CREATE TABLE medication(
ndc STRING PRIMARY KEY,
common_name STRING,
quantity INTEGER
);
""")
for values in (
('ABCD-DEF', 'Aspirin', 3), ('ABCD-GHI', 'Aspirin', 5), ('GHIK-KLM', 'bacitracin', 10)
):
dbcon.execute("""
INSERT INTO medication(ndc, common_name, quantity) VALUES (?,?,?);
""", values)
Tools to query a database¶
Functions to query the database:
- get_ndc_code to get all NDC codes for the common name of a medication
- get_ndc_stock to get the stock for a given NDC code
def get_ndc_code(medication: str) -> tuple[str, ...]:
"""Get possible ndc codes for a medication.
Args:
medication: name of the medication for which NDC code should be returned.
"""
cursor = dbcon.cursor()
cursor.execute('SELECT ndc FROM medication WHERE common_name==?',
(medication, ))
return tuple(row[0] for row in cursor.fetchall())
def get_ndc_stock(ndc_code: str) -> int:
"""Get the number of boxes of medication in stock.
Args:
ndc_code: NDC code to retrieve stock for.
"""
cursor = dbcon.cursor()
cursor.execute('SELECT quantity FROM medication WHERE ndc==?',
(ndc_code, ))
return cursor.fetchone()[0]
Prompting an LLM to use tools (when needed)¶
The prompt may declare several available tools. A prompt context will be similar to the one we showed for the multiplication.
system_prompt_tools = """
You have access to functions.
You don't have to use tools if not necessary, but *IF* you decide to invoke any of the function(s),
you MUST put it in a valid JSON expression like:
[
{"name": function name,
"parameters": dictionary of argument name and its value}
]
You SHOULD NOT include any other text in the response if you call a function.
Functions:
[
{
"name": "get_ndc_stock",
"description": "Get the number of boxes of medication in stock.",
"parameters": {
"type": "object",
"properties": {
"ndc_code": {
"type": "string"
}
},
"required": [
"ndc_code"
]
}
},
{
"name": "get_ndc_code",
"description": "Get possible ndc codes for a medication.",
"parameters": {
"type": "object",
"properties": {
"medication": {
"type": "string"
}
},
"required": [
"medication"
]
},
}
]
"""
Whenever a model has the capability "tools", the user or developer is spared the prompt context engineering.
The decorator tool() in langchain can be used instead.
Note: langchain is essentially an abstraction layer over prompts.
It can create the equivalent of our engineered prompt contexts from Python function definitions,
feeding the LLM with information in the docstring, or in additional declarative structures
(see https://docs.langchain.com/oss/python/langchain/tools#advanced-schema-definition).
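To illustrate the idea (this is a conceptual sketch of our own, not langchain's actual implementation), most of the information needed for such a declaration can be pulled from a Python function with the standard library's inspect module:

```python
import inspect

def describe_function(func) -> dict:
    """Sketch of the information a tools wrapper can derive from a function:
    its name, the first line of its docstring, and its parameter names."""
    signature = inspect.signature(func)
    return {
        'name': func.__name__,
        'description': (func.__doc__ or '').strip().split('\n')[0],
        'parameters': {'required': list(signature.parameters)},
    }

# Stub with the same signature and docstring as the notebook's get_ndc_code,
# repeated here so the cell stands alone.
def get_ndc_code(medication: str) -> tuple:
    """Get possible ndc codes for a medication."""
    ...

print(describe_function(get_ndc_code))
# {'name': 'get_ndc_code',
#  'description': 'Get possible ndc codes for a medication.',
#  'parameters': {'required': ['medication']}}
```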
tools = [tool(get_ndc_code), tool(get_ndc_stock)]
def invoke_model(model, human_message, system_message=SystemMessage('')):
chat = model.new_chat()
if model.has_tools():
print(f'\n{highlight(model.name)} (with tools)\n')
chat = chat.bind_tools(tools)
else:
print(f'\n{highlight(model.name)}\n')
messages = [system_message,
human_message]
response = chat.invoke(messages)
return response, chat
human_message = HumanMessage('Name the first month of the year.')
for model in MODELS:
response, chat = invoke_model(model, human_message)
display(response)
gemma3:4b
| AI message | |
| response: |
January!
Do you want to play a quick game of guessing months? 😊
|
functiongemma:270m (with tools)
| AI message | |
| response: |
I apologize, but I cannot assist with retrieving the specific NDC code for the first month of the year. My current tools are designed for retrieving medication-related data and stock information. I cannot query or retrieve specific calendar or stock data for specific dates.
|
granite4:1b (with tools)
| AI message | |
| response: |
The first month of the year is **January**.
|
functiongemma:270m appears challenged by the question. Its size (just over a quarter of granite4:1b's) might be the issue.
tool_map = {tool.name: tool for tool in tools}
if len(tool_map) != len(tools):
raise ValueError("'tools' contains at least one tool name duplicate.")
human_message = HumanMessage('How many boxes of medication with NDC code ABCD-DEF do we have left?')
for model in MODELS:
response, chat = invoke_model(model, human_message, SystemMessage(system_prompt_tools))
display(response)
if response.tool_calls:
for tool_call in response.tool_calls:
print(
f' --> calling {tool_call["name"]}'
f'({", ".join("=".join((k, repr(v))) for k, v in tool_call["args"].items())})'
)
selected_tool = tool_map[tool_call['name']]
res = selected_tool.invoke(tool_call)
print(' result:')
print(f' {res.content}')
else:
print('Assessing automatically if there is a tool call is left as an exercise for the reader.')
gemma3:4b
| AI message | |
| response: |
```json
[
{
"name": "get_ndc_stock",
"parameters": {
"ndc_code": "ABCD-DEF"
}
}
]
```
|
Assessing automatically if there is a tool call is left as an exercise for the reader.
functiongemma:270m (with tools)
| AI message | |
| tools: |
[{'name': 'get_ndc_stock', 'args': {'ndc_code': 'ABCD-DEF'}, 'id': 'a63df468-8c35-4d1c-ba01-f07c58b4a29a', 'type': 'tool_call'}]
|
--> calling get_ndc_stock(ndc_code='ABCD-DEF')
result:
3
granite4:1b (with tools)
| AI message | |
| tools: |
[{'name': 'get_ndc_stock', 'args': {'ndc_code': 'ABCD-DEF'}, 'id': 'fe96a455-02c4-4f10-ab45-e64a4d3ff059', 'type': 'tool_call'}]
|
--> calling get_ndc_stock(ndc_code='ABCD-DEF')
result:
3
human_message = HumanMessage("What are possible NDC codes for the medication 'Aspirin'?")
for model in MODELS:
response, chat = invoke_model(model, human_message, SystemMessage(system_prompt_tools))
display(response)
if response.tool_calls:
print(' result:')
for tool_call in response.tool_calls:
selected_tool = tool_map[tool_call['name']]
res = selected_tool.invoke(tool_call)
print(f' {res.content}')
else:
print('Assessing automatically if there is a tool call is left as an exercise for the reader.')
gemma3:4b
| AI message | |
| response: |
```json
[
{
"name": "get_ndc_code",
"parameters": {
"medication": "Aspirin"
}
}
]
```
|
Assessing automatically if there is a tool call is left as an exercise for the reader.
functiongemma:270m (with tools)
| AI message | |
| tools: |
[{'name': 'get_ndc_code', 'args': {'medication': 'Aspirin'}, 'id': '5d987bf2-cc49-4058-b0d7-08196877a7d3', 'type': 'tool_call'}]
|
result:
["ABCD-DEF", "ABCD-GHI"]
granite4:1b (with tools)
| AI message | |
| tools: |
[{'name': 'get_ndc_code', 'args': {'medication': 'Aspirin'}, 'id': '65211877-5146-4db6-acfd-6c658edb630b', 'type': 'tool_call'}]
|
result:
["ABCD-DEF", "ABCD-GHI"]
Cool. But what exactly did we gain here? If you are a programmer or a data analyst who knows how to write SQL, arguably not much. However, what we have achieved is akin to a text user interface, or language user interface, that can be an alternative to a graphical user interface.
Reasoning in steps¶
So far we have only shown questions for which either the language model had the answer memorized from the data it was trained on, or for which the answer was one tool call away.
There are questions for which the answer requires a sequence of recalls from training data or tool calls, with the output of one step used as the input to the next. For example, asking how many Aspirin boxes are in stock would require to 1) get the NDC codes for Aspirin products, 2) query the stock for each, and finally 3) sum those values.
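Spelled out by hand with plain Python (dict stand-ins for the database tables, so the cell stands alone), the chain is:

```python
# Dict stand-ins for the demo database above.
ndc_by_name = {'Aspirin': ('ABCD-DEF', 'ABCD-GHI'),
               'bacitracin': ('GHIK-KLM',)}
stock = {'ABCD-DEF': 3, 'ABCD-GHI': 5, 'GHIK-KLM': 10}

codes = ndc_by_name['Aspirin']                # 1) NDC codes for Aspirin
quantities = [stock[code] for code in codes]  # 2) stock for each code
total = sum(quantities)                       # 3) sum the quantities
print(total)
# 8
```

The challenge for an agent is to produce this sequence of calls on its own, one step at a time.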
human_message = HumanMessage('How many boxes of Aspirin do we have in stock altogether?')
for model in MODELS:
response, chat = invoke_model(model, human_message, SystemMessage(system_prompt_tools))
display(response)
gemma3:4b
| AI message | |
| response: |
```json
[
{
"name": "get_ndc_stock",
"parameters": {
"ndc_code": "98304-600"
}
}
]
```
|
functiongemma:270m (with tools)
| AI message | |
| tools: |
[{'name': 'get_ndc_stock', 'args': {'ndc_code': 'Aspirin'}, 'id': '7d0683fa-c668-4640-86c8-2b74bc4f273f', 'type': 'tool_call'}]
|
granite4:1b (with tools)
| AI message | |
| tools: |
[{'name': 'get_ndc_code', 'args': {'medication': 'Aspirin'}, 'id': 'bf98a6fe-bc7f-4027-ae38-1a8e8a9b8c2e', 'type': 'tool_call'}]
|
Odds are that the results are not great. The bare language models used in this notebook generally do not handle this well without guidance from a prompt.
ReAct¶
The ReAct approach prompts a language model to match its answers to the following pattern:
- question
- thought: the language model's answer to the question. The answer can be the use of a tool.
- action: whenever the answer is the use of a tool, a function call, this is selected here.
- action input: the input to the action.
- observation: observe the result of performing the action.
- The sequence of steps 2 to 5 can be repeated N times
- final answer
It is derived from the empirical observation that LLMs appeared to give more "thoughtful" answers, or to better solve reasoning-oriented questions, when asked to think step by step.
ReAct can be independently achieved by adding the pattern description to a prompt and iteratively parsing language model responses and crafting new messages. However, in many scenarios this can be abstracted away through the use of a library. We use langchain again to demonstrate how it can work:
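For illustration, here is a minimal hand-rolled loop in the spirit of that pattern: invoke the model, execute any requested tool calls, feed the results back, and repeat until a plain-text answer. The model is stubbed out with a deterministic function so the sketch is self-contained and does not need an Ollama server; `Response`, `react_loop`, and `fake_invoke` are our own stand-ins, not library APIs.

```python
import dataclasses

@dataclasses.dataclass
class Response:
    """Stand-in for an AI message: text content plus requested tool calls."""
    content: str = ''
    tool_calls: list = dataclasses.field(default_factory=list)

def react_loop(invoke, tool_map, question: str, max_steps: int = 10) -> str:
    """Alternate model invocations and tool executions until the model
    answers with plain text."""
    messages = [('human', question)]
    for _ in range(max_steps):
        response = invoke(messages)
        if not response.tool_calls:
            return response.content  # final answer
        for call in response.tool_calls:
            result = tool_map[call['name']](**call['args'])
            messages.append(('tool', call['name'], result))
    raise RuntimeError('No final answer within max_steps.')

# Stubbed "model": first requests a multiplication, then phrases the result.
def fake_invoke(messages):
    if messages[-1][0] == 'human':
        return Response(tool_calls=[{'name': 'mul', 'args': {'a': 3, 'b': 5}}])
    return Response(content=f'The result is {messages[-1][2]}.')

print(react_loop(fake_invoke, {'mul': lambda a, b: a * b}, 'What is 3 * 5?'))
# The result is 15.
```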
from langchain.agents import create_agent
human_message = HumanMessage("How many boxes of Aspirin do we have in stock altogether?")
for model in MODELS:
print(highlight(model.name))
if model.has_tools():
react = create_agent(model.new_chat(), tools=tools)
r_stream = react.stream({'messages': [human_message]}, stream_mode='updates')
try:
for chunk in r_stream:
for k in chunk.keys():
for m in chunk[k]['messages']:
display(m)
except TypeError as err:
print('Error:')
print(err)
else:
print('Parsing AI messages into next actions and results is left as an exercise.')
gemma3:4b Parsing AI messages into next actions and results is left as an exercise. functiongemma:270m
| AI message | |
| tools: |
[{'name': 'get_ndc_stock', 'args': {'ndc_code': 'Aspirin'}, 'id': '0d55848c-3003-408c-846e-2ccb2238bc38', 'type': 'tool_call'}]
|
Error:
'NoneType' object is not subscriptable
granite4:1b
| AI message | |
| tools: |
[{'name': 'get_ndc_code', 'args': {'medication': 'Aspirin'}, 'id': '89afdd0e-92bf-4d6e-9270-10ec8dfb20c9', 'type': 'tool_call'}]
|
| Tool message | |
| response: |
["ABCD-DEF", "ABCD-GHI"]
|
| AI message | |
| tools: |
[{'name': 'get_ndc_stock', 'args': {'ndc_code': 'ABCD-DEF'}, 'id': '6886023e-3e2e-4ea3-a0f4-5e0a4e1b0c6a', 'type': 'tool_call'}]
|
| Tool message | |
| response: |
3
|
| AI message | |
| tools: |
[{'name': 'get_ndc_stock', 'args': {'ndc_code': 'ABCD-GHI'}, 'id': '7e28b8c2-e435-4474-86ef-e55bc82bc785', 'type': 'tool_call'}]
|
| Tool message | |
| response: |
5
|
| AI message | |
| response: |
The total number of boxes in stock for Aspirin is **8**.
|
When calling tools works without error, the answer is indeed correct.
Conclusion¶
Small language models are able to handle simple queries and chain sequences of steps to reach an answer. They can run locally, isolated, and on relatively modest hardware.