Ollama API Interaction

晚夜微雨问海棠呀

267 views · Published 2026-05-14 18:45:14

Ollama API Interaction in Detail

Ollama exposes a REST API that lets developers interact with locally running models over HTTP. The default service address is http://localhost:11434.

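1. Core API Endpoints

1.1 Chat Completion (`POST /api/chat`)

The most commonly used endpoint. It supports multi-turn conversations and takes a list of messages, each with a `role` and a `content` field, much like the OpenAI chat format.

Request body structure (the `//` comments are annotations only; JSON does not allow comments, so remove them before sending):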

{
  "model": "llama3",          // model name
  "messages": [               // message list
    {"role": "system", "content": "You are an assistant"},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hello!"}
  ],
  "stream": false,            // whether to stream the output (defaults to true)
  "options": {                // optional parameters
    "temperature": 0.7,
    "num_predict": 500,
    "top_p": 0.9
  },
  "keep_alive": "5m"          // how long the model stays loaded
}

Response example (non-streaming; durations are in nanoseconds):

{
  "model": "llama3",
  "created_at": "2023-10-27T10:00:00.000Z",
  "message": {
    "role": "assistant",
    "content": "Hello! How can I help you?"
  },
  "done": true,
  "total_duration": 1500000000,
  "load_duration": 100000000,
  "prompt_eval_count": 15,
  "eval_count": 20,
  "eval_duration": 1400000000
}
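
The timing fields in the response above can be turned into a generation speed. The sketch below assumes `eval_count` is the number of generated tokens and `eval_duration` is in nanoseconds, as in the Ollama API reference; the helper name `tokens_per_second` is made up for illustration.

```python
def tokens_per_second(response: dict) -> float:
    """Return generation speed computed from an Ollama response's timing fields."""
    eval_count = response["eval_count"]            # number of tokens generated
    eval_duration_ns = response["eval_duration"]   # generation time in nanoseconds
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return eval_count / (eval_duration_ns / 1e9)


# Values taken from the sample response: 20 tokens in 1.4 seconds.
sample = {"eval_count": 20, "eval_duration": 1_400_000_000}
print(f"{tokens_per_second(sample):.1f} tokens/s")  # prints "14.3 tokens/s"
```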

Streaming response example (one JSON object per line):

{"model":"llama3","created_at":"...","message":{"role":"assistant","content":"Hel"},"done":false}
{"model":"llama3","created_at":"...","message":{"role":"assistant","content":"lo"},"done":false}
{"model":"llama3","created_at":"...","done":true,"total_duration":...}
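
A streamed reply has to be reassembled on the client. As a minimal sketch, the hypothetical helper `collect_stream` below takes an iterable of response lines (however they were read from the network), concatenates the `message.content` pieces, and stops at the object whose `done` is true. The sample lines are shortened stand-ins for real output.

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[str]) -> str:
    """Join the content pieces of a line-delimited /api/chat stream."""
    parts = []
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        # The final object may carry no message, so fall back to an empty string.
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


sample_lines = [
    '{"model":"llama3","message":{"role":"assistant","content":"Hel"},"done":false}',
    '{"model":"llama3","message":{"role":"assistant","content":"lo"},"done":false}',
    '{"model":"llama3","done":true}',
]
print(collect_stream(sample_lines))  # prints "Hello"
```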

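1.2 Text Completion (`POST /api/generate`)

Suited to single-turn tasks (such as continuation or summarization); it does not maintain conversation history.

Request body: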

{
  "model": "llama3",
  "prompt": "Once upon a time",
  "stream": false,
  "options": {
    "temperature": 0.7
  }
}

Response:

{
  "model": "llama3",
  "created_at": "...",
  "response": "... there was a brave knight...",
  "done": true,
  "context": [123, 456, ...], // token ID sequence (optional)
  "total_duration": ...,
  "prompt_eval_count": 5,
  "eval_count": 10
}

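1.3 Embeddings (`POST /api/embeddings`)

Generates a vector representation of text, for use in RAG, semantic search, and similar tasks. Newer Ollama releases also document `POST /api/embed`, which takes `input` instead of `prompt`; check the API reference for the version you run.

Request body:

{
  "model": "nomic-embed-text",
  "prompt": "Here is a passage of text about machine learning."
}

Response:

{
  "embedding": [0.123, -0.456, 0.789, ...] // vector array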
}
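
Semantic search over embeddings usually comes down to comparing vectors. The sketch below uses cosine similarity, one common choice rather than anything Ollama prescribes, and ranks a few toy vectors against a query. The function names and the two-dimensional vectors are illustrative; real embeddings have hundreds of dimensions.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(query: Sequence[float],
                       documents: Dict[str, Sequence[float]]) -> List[str]:
    """Return document names ordered from most to least similar to the query."""
    return sorted(documents,
                  key=lambda name: cosine_similarity(query, documents[name]),
                  reverse=True)


# Toy vectors standing in for the "embedding" arrays returned by the API.
docs = {"close": [1.0, 0.1], "unrelated": [0.0, 1.0], "opposite": [-1.0, 0.0]}
print(rank_by_similarity([1.0, 0.0], docs))  # ['close', 'unrelated', 'opposite']
```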

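1.4 Model Management API

Endpoint	Method	Purpose
`/api/tags`	GET	List all downloaded models
`/api/pull`	POST	Download a model
`/api/create`	POST	Create a custom model
`/api/delete`	DELETE	Delete a model
`/api/copy`	POST	Copy a model
`/api/show`	POST	Show model information
`/api/ps`	GET	List currently running models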
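
A common use of `/api/tags` is checking whether a model is already downloaded before calling it. The sketch below works on an already-parsed response and assumes the documented shape, a top-level `models` list whose entries carry a `name` such as `llama3:latest`. The helper names and the sample data are hypothetical.

```python
from typing import List


def model_names(tags_response: dict) -> List[str]:
    """Extract the sorted model names from a parsed /api/tags response."""
    return sorted(m["name"] for m in tags_response.get("models", []))


def is_installed(tags_response: dict, model: str) -> bool:
    """Check for a model, treating a bare name as the ':latest' tag."""
    wanted = model if ":" in model else f"{model}:latest"
    return wanted in model_names(tags_response)


# Hand-written sample; a real response has more fields per model.
sample = {"models": [{"name": "nomic-embed-text:latest"}, {"name": "llama3:latest"}]}
print(model_names(sample))
print(is_installed(sample, "llama3"), is_installed(sample, "llava"))
```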

2. Code Examples

2.1 Python

Using the requests library (plain HTTP):

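import json

import requests

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"

def chat_ollama(prompt, model="llama3", stream=False):
    """Send one user message and return the assistant's reply as a string."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

    if not stream:
        response = requests.post(OLLAMA_CHAT_URL, json=payload, timeout=120)
        response.raise_for_status()
        return response.json()["message"]["content"]

    # Streaming: the body is one JSON object per line; print each piece as it arrives.
    parts = []
    with requests.post(OLLAMA_CHAT_URL, json=payload, stream=True, timeout=120) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            if not line:
                continue
            data = json.loads(line)
            piece = data.get("message", {}).get("content", "")
            print(piece, end="", flush=True)
            parts.append(piece)
            if data.get("done"):
                print("\nGeneration complete")
                break
    return "".join(parts)

# Usage
print(chat_ollama("Introduce yourself in one sentence"))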

Using the official ollama library:

import ollama

# Simple chat
response = ollama.chat(model='llama3', messages=[
    {'role': 'user', 'content': 'Hello'}
])
print(response['message']['content'])

# Streaming chat
stream = ollama.chat(
    model='llama3',
    messages=[{'role': 'user', 'content': 'Write a poem'}],
    stream=True
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)

2.2 JavaScript / Node.js

Using fetch (native):

async function chatOllama(prompt) {
  const response = await fetch('http://localhost:11434/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'llama3',
      messages: [{ role: 'user', content: prompt }],
      stream: true
    })
  });
  if (!response.ok) {
    throw new Error(`Ollama request failed: HTTP ${response.status}`);
  }

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';  // holds a partial line that spans two network chunks

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    // stream: true keeps multi-byte characters intact across chunk boundaries
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop();  // the last element may be an incomplete line

    for (const line of lines) {
      if (line.trim()) {
        const data = JSON.parse(line);
        if (data.message?.content) {
          process.stdout.write(data.message.content);
        }
      }
    }
  }
}

chatOllama("Hello");

Using the official ollama library:

import ollama from 'ollama';

const response = await ollama.chat({
  model: 'llama3',
  messages: [{ role: 'user', content: 'Hello' }]
});
console.log(response.message.content);

// Streaming
const stream = await ollama.chat({
  model: 'llama3',
  messages: [{ role: 'user', content: 'Write a story' }],
  stream: true
});

for await (const part of stream) {
  process.stdout.write(part.message.content);
}

2.3 cURL (Command Line)

Non-streaming chat:

curl http://localhost:11434/api/chat -d '{
  "model": "llama3",
  "messages": [
    {"role": "system", "content": "You are an assistant"},
    {"role": "user", "content": "Hello"}
  ],
  "stream": false
}'

Streaming chat:

curl http://localhost:11434/api/chat -d '{
  "model": "llama3",
  "messages": [{"role": "user", "content": "Write a story"}],
  "stream": true
}'

Download a model:

curl http://localhost:11434/api/pull -d '{"name": "llama3"}'

List models:

curl http://localhost:11434/api/tags

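3. Advanced Usage

3.1 Managing Multi-Turn Conversation State

The API itself is stateless, so the client must maintain the messages array and resend it with every request.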

import ollama

messages = [
    {"role": "system", "content": "You are a Python expert"},
    {"role": "user", "content": "What is a list comprehension?"},
]

# First turn
response = ollama.chat(model="llama3", messages=messages)
reply = response["message"]["content"]
print(f"AI: {reply}")

# Add the reply to the history
messages.append({"role": "assistant", "content": reply})

# Second turn
messages.append({"role": "user", "content": "Can you give an example?"})
response = ollama.chat(model="llama3", messages=messages)
print(f"AI: {response['message']['content']}")
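
Because the whole history is resent on each turn, a long conversation eventually outgrows the model's context window. One simple mitigation, sketched below, keeps the system messages plus only the most recent turns. The helper `trim_history` is hypothetical, and counting messages rather than tokens is a deliberate simplification.

```python
from typing import Dict, List

Message = Dict[str, str]


def trim_history(messages: List[Message], max_messages: int) -> List[Message]:
    """Keep every system message plus the last max_messages other messages."""
    if max_messages < 0:
        raise ValueError("max_messages must be non-negative")
    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]
    # Guard the zero case: others[-0:] would return the whole list.
    recent = others[-max_messages:] if max_messages > 0 else []
    return system + recent


history = [
    {"role": "system", "content": "You are a Python expert"},
    {"role": "user", "content": "What is a list comprehension?"},
    {"role": "assistant", "content": "A compact way to build a list."},
    {"role": "user", "content": "Can you give an example?"},
    {"role": "assistant", "content": "[x * x for x in range(5)]"},
]
for message in trim_history(history, max_messages=2):
    print(message["role"], "->", message["content"])
```

With an odd `max_messages` the trimmed window can begin with an assistant message; choosing an even value keeps user and assistant turns paired.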

3.2 Keeping the Model Loaded (Keep Alive)

By default, a model is unloaded after 5 minutes of inactivity. The keep_alive parameter extends this.

import requests

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3",
        "messages": [{"role": "user", "content": "Hello"}],
        "keep_alive": "30m"  # keep the model loaded for 30 minutes
    }
)
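
Besides duration strings such as "30m", the Ollama documentation describes keep_alive as also accepting a number of seconds, with 0 unloading the model right after the request and a negative value keeping it loaded indefinitely. The sketch below only builds request bodies for `POST /api/generate` with no prompt, which the documentation describes as a way to load or unload a model without generating text. The helper name is made up.

```python
from typing import Union


def keep_alive_request(model: str, keep_alive: Union[str, int, float]) -> dict:
    """Build a prompt-less /api/generate body that only sets the keep-alive."""
    return {"model": model, "keep_alive": keep_alive}


unload_now = keep_alive_request("llama3", 0)       # unload immediately
pin_in_memory = keep_alive_request("llama3", -1)   # keep loaded indefinitely
one_hour = keep_alive_request("llama3", "1h")      # duration string

print(unload_now)
print(pin_in_memory)
print(one_hour)
```

Each dictionary can be sent with `requests.post("http://localhost:11434/api/generate", json=...)` in the same way as the chat requests above.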

3.3 Custom Parameters

Adjust model behavior through the options field.

import ollama

response = ollama.chat(
    model="llama3",
    messages=[{"role": "user", "content": "Write some code"}],
    options={
        "temperature": 0.2,      # low randomness
        "num_predict": 500,      # cap the number of generated tokens
        "top_p": 0.9,
        "repeat_penalty": 1.2    # reduce repetition
    }
)

3.4 Multimodal Interaction (Image Input)

Vision models such as llava are supported.

import ollama

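response = ollama.chat(
    model="llava",
    messages=[{
        "role": "user",
        "content": "What is in this image?",
        # The Python library accepts a local path or Base64 data;
        # the raw REST API expects Base64-encoded strings.
        "images": ["path/to/image.jpg"]
    }]
)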
print(response["message"]["content"])
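
When calling the REST endpoint directly instead of using the Python library, the image has to be sent as a Base64 string in the `images` list. A minimal sketch of building such a message, with a hypothetical helper name and placeholder bytes in place of a real image file:

```python
import base64


def build_image_message(prompt: str, image_bytes: bytes) -> dict:
    """Build a user message whose image is Base64-encoded for the REST API."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {"role": "user", "content": prompt, "images": [encoded]}


# Placeholder bytes; in practice read them with open(path, "rb").read().
fake_image = b"\x89PNG\r\n\x1a\n"
message = build_image_message("What is in this image?", fake_image)
print(message["images"][0])
```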

3.5 Using the OpenAI SDK in Compatibility Mode

Ollama's /v1 endpoint is compatible with the OpenAI API format.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # required by the SDK but ignored by Ollama; any value works
)

completion = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True
)

for chunk in completion:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

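4. Error Handling

Common errors

Error	Cause	Fix
`Connection refused`	The service is not running	Run `ollama serve`
`Model not found`	The model has not been downloaded	Run `ollama pull <model>`
`404 Not Found`	Wrong endpoint	Check the URL and API path
`500 Internal Server Error`	The model failed to load	Check that enough VRAM/RAM is available

Error handling example (Python)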

import requests
from requests.exceptions import ConnectionError, HTTPError

try:
    response = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "llama3",
            "messages": [{"role": "user", "content": "Hello"}],
            # The API streams by default; disable it so response.json() can parse one object.
            "stream": False,
        }
    )
    response.raise_for_status()
    print(response.json()["message"]["content"])
except ConnectionError:
    print("Error: the Ollama service is not running")
except HTTPError as e:
    print(f"HTTP error: {e}")
except Exception as e:
    print(f"Unknown error: {e}")
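
Transient failures, such as the service still starting up, are often handled by retrying with exponential backoff. The sketch below is generic and not part of Ollama or requests: `with_retries` is a made-up helper that retries any zero-argument callable. Note that `requests.exceptions.ConnectionError` does not derive from the built-in `ConnectionError`, so pass the requests exception classes through `retry_on` when wrapping a `requests.post` call.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(call: Callable[[], T],
                 attempts: int = 3,
                 base_delay: float = 0.5,
                 retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Run call(), retrying on the given exceptions with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 0.5 s, 1 s, 2 s, ...
    raise RuntimeError("unreachable")


# Demonstration with a fake call that fails twice before succeeding.
state = {"calls": 0}
delays = []

def flaky() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("service not ready")
    return "ok"

result = with_retries(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)  # prints "ok [0.5, 1.0]"
```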

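5. Performance Tips

Streaming: set stream: true so users see output as it is generated, which lowers perceived latency.
Keep the model loaded: for frequent calls, set keep_alive to avoid reloading the model each time.
Cap output length: when processing large volumes of text, consider using num_predict to limit the number of tokens generated per request.
Model choice: pick a model that matches the task's complexity (for example, llama3:8b for general use and llama3:70b for complex reasoning).
Quantized versions: use a quantized model (such as q4_0) to reduce memory usage.

With the Ollama API, you can integrate local large models into web applications, mobile applications, automation scripts, and many other scenarios.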
