FastMLX

FastMLX is a high-performance, production-ready API for hosting MLX models, including vision language models (VLMs) and language models (LMs).

Free software: Apache Software License 2.0
Documentation: https://Blaizzy.github.io/fastmlx

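Features

- OpenAI-compatible API: Integrates easily with existing applications that use the OpenAI API.
- Dynamic model loading: Load MLX models on the fly, or use pre-loaded models for better performance.
- Support for multiple model types: Compatible with a variety of MLX model architectures.
- Image processing: Handles both text and image inputs for versatile model interactions.
- Efficient resource management: Optimized for high performance and scalability.
- Error handling: Robust error management for production environments.
- Customizable: Easy to extend for specific use cases and model types.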

Usage

Installation
```
pip install fastmlx
```
Running the Server

Start the FastMLX server:
```
fastmlx
```
or
```
uvicorn fastmlx:app --reload --workers 0
```
[!WARNING] The --reload flag should not be used in production. It is intended for development only.

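Running with Multiple Workers (Parallel Processing)

For better performance and parallel processing, you can specify either an absolute number of worker processes or a fraction of the CPU cores to use. This is particularly useful for handling multiple requests simultaneously.

You can also set the FASTMLX_NUM_WORKERS environment variable to specify the number of workers or the fraction of CPU cores to use. If the value is neither passed explicitly nor set through the environment variable, workers defaults to 2.

In order of precedence (highest to lowest), the number of workers is determined by:
- An explicit command-line argument
  - --workers 4 sets the number of workers to 4
  - --workers 0.5 sets the number of workers to half the available CPU cores (minimum of 1)
- The FASTMLX_NUM_WORKERS environment variable
- The default value of 2

To use all available CPU cores, set the value to 1.0.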
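
The precedence rules above can be sketched in a few lines of Python. This is only an illustration, not FastMLX's actual implementation: the function name `resolve_workers` and its parameters are hypothetical, and the sketch assumes that a value which parses as an integer (such as `4`) is an absolute count, while any other numeric value (such as `0.5` or `1.0`) is a fraction of the CPU cores.

```python
import os


def resolve_workers(cli_value=None, env=None, cpu_count=None, default=2):
    """Hypothetical helper: pick a worker count using the documented precedence."""
    env = os.environ if env is None else env
    cpu_count = cpu_count or os.cpu_count() or 1

    # 1. An explicit command-line value wins; 2. then the environment variable.
    raw = cli_value if cli_value is not None else env.get("FASTMLX_NUM_WORKERS")
    if raw is None:
        return default  # 3. Fall back to the documented default of 2.

    text = str(raw)
    try:
        # Assumption: integer-looking values are absolute worker counts.
        return max(1, int(text))
    except ValueError:
        # Assumption: anything else is a fraction of the CPU cores (minimum 1).
        return max(1, int(cpu_count * float(text)))
```

Under these assumptions, `resolve_workers("0.5", cpu_count=8)` yields 4 workers, and `resolve_workers("1.0", cpu_count=8)` yields 8.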

Example:
```
fastmlx --workers 4
```
or
```
uvicorn fastmlx:app --workers 4
```
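[!NOTE]
- The --reload flag is not compatible with multiple workers.
- For best performance, the number of workers should typically not exceed the number of CPU cores available on your machine.

Considerations for Multi-Worker Setup

1. Stateless application: Make sure your FastMLX application is stateless, because each worker process runs independently.
2. Database connections: If your application uses a database, make sure your connection pool is configured to handle multiple workers.
3. Resource usage: Monitor your system's resource usage to find the optimal number of workers for your hardware and application needs. You can also use the delete model endpoint to remove any unused models.
4. Load balancing: When running with multiple workers, incoming requests are automatically load-balanced across the worker processes.

By using multiple workers, you can significantly improve the throughput and responsiveness of your FastMLX application, especially under high load.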

Making API Calls

Use the API in the same way as OpenAI's chat completions API:

Vision Language Model

import requests
import json

url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}
data = {
    "model": "mlx-community/nanoLLaVA-1.5-4bit",
    "image": "http://images.cocodataset.org/val2017/000000039769.jpg",
    "messages": [{"role": "user", "content": "What are these?"}],
    "max_tokens": 100
}

response = requests.post(url, headers=headers, data=json.dumps(data))
print(response.json())

With streaming:

import requests
import json

def process_sse_stream(url, headers, data):
    # stream=True lets us read the server-sent events as they arrive.
    response = requests.post(url, headers=headers, json=data, stream=True)

    if response.status_code != 200:
        print(f"Error: Received status code {response.status_code}")
        print(response.text)
        return

    full_content = ""

    try:
        for line in response.iter_lines():
            if line:
                line = line.decode('utf-8')
                if line.startswith('data: '):
                    event_data = line[6:]  # Remove the 'data: ' prefix
                    if event_data == '[DONE]':
                        print("\nStream finished. ✅")
                        break
                    try:
                        chunk_data = json.loads(event_data)
                        content = chunk_data['choices'][0]['delta']['content']
                        full_content += content
                        print(content, end='', flush=True)
                    except json.JSONDecodeError:
                        print(f"\nFailed to decode JSON: {event_data}")
                    except KeyError:
                        print(f"\nUnexpected data structure: {chunk_data}")

    except KeyboardInterrupt:
        print("\nStream interrupted by user.")
    except requests.exceptions.RequestException as e:
        print(f"\nAn error occurred: {e}")

    return full_content

if __name__ == "__main__":
    url = "http://localhost:8000/v1/chat/completions"
    headers = {"Content-Type": "application/json"}
    data = {
        "model": "mlx-community/nanoLLaVA-1.5-4bit",
        "image": "http://images.cocodataset.org/val2017/000000039769.jpg",
        "messages": [{"role": "user", "content": "What are these?"}],
        "max_tokens": 500,
        "stream": True
    }
    process_sse_stream(url, headers, data)

Language Model

import requests
import json

url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}
data = {
    "model": "mlx-community/gemma-2-9b-it-4bit",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 100
}

response = requests.post(url, headers=headers, data=json.dumps(data))
print(response.json())

With streaming:

import requests
import json

def process_sse_stream(url, headers, data):
    # stream=True lets us read the server-sent events as they arrive.
    response = requests.post(url, headers=headers, json=data, stream=True)

    if response.status_code != 200:
        print(f"Error: Received status code {response.status_code}")
        print(response.text)
        return

    full_content = ""

    try:
        for line in response.iter_lines():
            if line:
                line = line.decode('utf-8')
                if line.startswith('data: '):
                    event_data = line[6:]  # Remove the 'data: ' prefix
                    if event_data == '[DONE]':
                        print("\nStream finished. ✅")
                        break
                    try:
                        chunk_data = json.loads(event_data)
                        content = chunk_data['choices'][0]['delta']['content']
                        full_content += content
                        print(content, end='', flush=True)
                    except json.JSONDecodeError:
                        print(f"\nFailed to decode JSON: {event_data}")
                    except KeyError:
                        print(f"\nUnexpected data structure: {chunk_data}")

    except KeyboardInterrupt:
        print("\nStream interrupted by user.")
    except requests.exceptions.RequestException as e:
        print(f"\nAn error occurred: {e}")

if __name__ == "__main__":
    url = "http://localhost:8000/v1/chat/completions"
    headers = {"Content-Type": "application/json"}
    data = {
        "model": "mlx-community/gemma-2-9b-it-4bit",
        "messages": [{"role": "user", "content": "Hi, how are you?"}],
        "max_tokens": 500,
        "stream": True
    }
    process_sse_stream(url, headers, data)
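
The per-line parsing in the streaming examples can also be factored into a small pure function, which makes it easy to unit-test without a running server. The sketch below is one possible way to do that; the name `extract_delta` is hypothetical. It additionally tolerates chunks whose delta has no `content` key (as can happen in OpenAI-style streams, for example a chunk that carries only a role); whether FastMLX emits such chunks is an assumption, not something this document states.

```python
import json


def extract_delta(line: bytes):
    """Hypothetical helper: return (text, done) for one raw SSE line."""
    text = line.decode("utf-8").strip()
    if not text.startswith("data: "):
        return "", False  # Blank keep-alive lines and other fields are skipped.

    payload = text[len("data: "):]
    if payload == "[DONE]":
        return "", True  # End-of-stream marker used in the examples above.

    chunk = json.loads(payload)
    delta = chunk["choices"][0].get("delta", {})
    # Assumption: a delta without "content" contributes no text.
    return delta.get("content") or "", False
```

Inside the `for line in response.iter_lines()` loop, each line can then be passed to `extract_delta`, with the returned text appended to the running output and the loop ended once `done` is true.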

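Function Calling

FastMLX now supports tool calling in accordance with the OpenAI API specification. This feature is available for the following models:

- Llama 3.1
- Arcee Agent
- C4ai-Command-R-Plus
- Firefunction
- xLAM

Supported modes:

- Without streaming
- Parallel tool calling

Note: Tool choice and OpenAI-compliant streaming for function calling are currently under development.

Here is an example of how to use function calling with FastMLX: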

import requests
import json

url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}
data = {
  "model": "mlx-community/Meta-Llama-3.1-8B-Instruct-8bit",
  "messages": [
    {
      "role": "user",
      "content": "What's the weather like in San Francisco and Washington?"
    }
  ],
  "tools": [
    {
      "name": "get_current_weather",
      "description": "Get the current weather",
      "parameters": {
        "type": "object",
        "properties": {
          "location": {
            "type": "string",
            "description": "The city and state, e.g. San Francisco, CA"
          },
          "format": {
            "type": "string",
            "enum": ["celsius", "fahrenheit"],
            "description": "The temperature unit to use. Infer this from the user's location."
          }
        },
        "required": ["location", "format"]
      }
    }
  ],
  "max_tokens": 150,
  "temperature": 0.7,
  "stream": False,
}

response = requests.post(url, headers=headers, data=json.dumps(data))
print(response.json())

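This example shows how to use the get_current_weather tool with the Llama 3.1 model. The API will process the user's question and use the provided tool to fetch the required information.

Note that while streaming is available for regular text generation, the streaming implementation for function calling is still under development and does not yet fully comply with the OpenAI specification.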
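
Once a response containing tool calls comes back, the client typically has to map each call to local code. The sketch below shows one way to do that. It assumes the response follows the OpenAI layout, with calls under `choices[0].message.tool_calls` and each call holding a `function` object with a `name` and `arguments`; this document does not show FastMLX's exact response shape, so verify it against a real response first. The names `run_tool_calls`, `REGISTRY`, and the stub `get_current_weather` are hypothetical.

```python
import json


def get_current_weather(location, format):
    """Hypothetical local stub for the tool declared in the request."""
    # Placeholder result; a real implementation would query a weather service.
    return {"location": location, "unit": format, "forecast": "unknown"}


# Maps tool names from the request to local Python callables.
REGISTRY = {"get_current_weather": get_current_weather}


def run_tool_calls(response_json, registry):
    """Run every tool call found in an OpenAI-style response (assumed layout)."""
    message = response_json["choices"][0]["message"]
    results = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        arguments = function["arguments"]
        if isinstance(arguments, str):
            # OpenAI-style responses encode arguments as a JSON string.
            arguments = json.loads(arguments)
        handler = registry.get(function["name"])
        if handler is None:
            raise KeyError(f"no local handler for tool {function['name']!r}")
        results.append(handler(**arguments))
    return results
```

With parallel tool calling, a question about two cities would be expected to produce two entries in `tool_calls`, and `run_tool_calls(response.json(), REGISTRY)` would then return one result per call.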

Listing Supported Models

To see all vision and language models supported by MLX:

import requests

url = "http://localhost:8000/v1/supported_models"
response = requests.get(url)
print(response.json())

Adding a Model

You can add new models to the API:

import requests

url = "http://localhost:8000/v1/models"
params = {
    "model_name": "hf-repo-or-path",
}

response = requests.post(url, params=params)
print(response.json())

Listing Available Models

To see all available models:

import requests

url = "http://localhost:8000/v1/models"
response = requests.get(url)
print(response.json())

Deleting Models

To remove any model loaded in memory:

import requests

url = "http://localhost:8000/v1/models"
params = {
   "model_name": "hf-repo-or-path",
}
response = requests.delete(url, params=params)
print(response)
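
The add, list, and delete calls above all target the same /v1/models URL and differ only in HTTP method, with the model passed as a model_name query parameter. As a rough illustration of that pattern, the sketch below builds the three requests with the standard library alone. The helper names `build_models_request` and `send` are hypothetical, and the base URL simply mirrors the host and port used in the examples above.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # Same host and port as the examples above.


def build_models_request(method, model_name=None):
    """Hypothetical helper: build (but do not send) a /v1/models request."""
    url = f"{BASE_URL}/models"
    if model_name is not None:
        # The model is identified by a query parameter, not a JSON body.
        url += "?" + urllib.parse.urlencode({"model_name": model_name})
    return urllib.request.Request(url, method=method)


def send(request):
    """Send a prepared request and return (status, parsed JSON body or None)."""
    with urllib.request.urlopen(request) as response:
        body = response.read()
        return response.status, (json.loads(body) if body else None)
```

For example, `send(build_models_request("DELETE", "hf-repo-or-path"))` would issue the same request as the delete example above, while `build_models_request("GET")` corresponds to listing the loaded models.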

For more detailed usage instructions and API documentation, please refer to the full documentation.
