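Claude Computer Use Tutorial: Let AI Operate Your Computer on Its Own

Through the API, Claude can watch the screen and operate your computer the way a person would: taking screenshots, clicking, typing, and running commands. This article goes from setting up the Docker environment to writing your own Agent Loop, taking you in four steps from zero to Claude operating a desktop by itself, and includes a fix for the coordinate-scaling pitfall.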

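1. What Is Computer Use?

Automating desktop operations used to mean Selenium or RPA tools, which break whenever the UI changes. Claude Computer Use works differently: it actually looks at screenshots, interprets what is on screen, and then decides what to do next. Because it relies on visual understanding rather than fixed DOM selectors, it copes with UI redesigns far better.

The whole flow fits in one sentence: you give a task → Claude returns an action → your program executes it → the result goes back to Claude → repeat until the task is done.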
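
The flow above can be sketched in a few lines before any real API is involved. This is a toy illustration only: `run_loop`, `fake_decide`, and `fake_execute` are hypothetical names, and `fake_decide` stands in for the Claude API call that section 4 builds for real.

```python
def run_loop(task, decide, execute, max_rounds=10):
    """Generic agent loop: ask for an action, run it, feed the result back."""
    history = [("task", task)]
    for _ in range(max_rounds):
        action = decide(history)   # stand-in for the Claude API call
        if action is None:         # no action requested: the task is done
            break
        result = execute(action)   # your program performs the action
        history.append((action, result))
    return history


# Toy stand-ins so the sketch runs without an API key or a desktop
def fake_decide(history):
    steps = ["screenshot", "click", "type"]
    done = len(history) - 1        # actions already performed
    return steps[done] if done < len(steps) else None


def fake_execute(action):
    return f"{action} ok"


history = run_loop("open the browser", fake_decide, fake_execute)
for entry in history:
    print(entry)
```

The real loop has the same shape; only `decide` becomes an API request and `execute` becomes code that drives the desktop.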

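2. Setting Up the Docker Environment

Anthropic provides an official demo project: a Docker container with a Linux desktop, a browser, and assorted tools preinstalled, ready for Claude to operate. Setup takes about two minutes.

Prerequisite: Docker Desktop must be installed on your computer. If it is not, download it from the Docker website.

Open a terminal, clone the official repo, and enter the demo directory: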

 Bash 
git clone https://github.com/anthropics/anthropic-quickstarts.git
cd anthropic-quickstarts/computer-use-demo

Set your API key and start the Docker container:

 Bash 
export ANTHROPIC_API_KEY=your_api_key
docker run \
    -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
    -v $HOME/.anthropic:/home/computeruse/.anthropic \
    -p 5900:5900 \
    -p 8501:8501 \
    -p 6080:6080 \
    -p 8080:8080 \
    -it ghcr.io/anthropics/anthropic-quickstarts:computer-use-demo-latest

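The command opens four ports, but 8080 is the only one you need to remember: it serves the main interface. The others are used by VNC and Streamlit. The first run downloads the image, roughly 1 to 2 GB, so give it a moment. Once the container is up, open http://localhost:8080 in your browser.

The left side is where you chat with Claude; the right side is the virtual desktop. The environment is ready.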

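3. Letting Claude Operate the Computer for the First Time

Type an instruction straight into the chat box on the left, for example: "Open Firefox, search for the weather in Taipei, then take a screenshot and save it to the desktop."

After you send it, the virtual desktop on the right starts moving on its own. Claude first takes a screenshot to see what is on screen, finds the Firefox icon and double-clicks it, types the search into the address bar, and once the results appear takes another screenshot and saves it. You just sit and watch: chatting is enough to control the computer, with no code to write.

Cost reference: a simple task takes roughly 5 to 10 rounds of API calls and costs under 1 US dollar.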

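4. Writing Your Own Agent Loop

So far we have been playing with someone else's demo; now we write our own. Once you have this down, you can have Claude automate many kinds of desktop work, such as batch-renaming files, filling in forms, or running tests.

4.1 How It Works

The mechanism is called an Agent Loop. Picture an assistant sitting at the computer: you say what you want, the assistant does one step and reports back, and you give the next instruction. The only difference is that Claude is the assistant and your program is its hands. Claude also checks its progress with screenshots as it goes, so it is not acting blindly. The loop keeps running until Claude considers the task complete.

4.2 Entering the Docker Container

This code has to run inside the Docker container, because that is where the virtual desktop lives. Open another terminal window and enter the container: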

 Bash 
docker exec -it $(docker ps -q --filter ancestor=ghcr.io/anthropics/anthropic-quickstarts:computer-use-demo-latest) bash

There is no need to set the API key again; the -e flag on docker run already passed it into the container. Install the dependencies first:

 Bash 
pip install anthropic Pillow

4.3 Imports and Initialization

 Python 
import anthropic
import subprocess
import base64

client = anthropic.Anthropic()

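Each of the three imports has a job: anthropic is Claude's official SDK; subprocess lets Python call system commands, which is how the screenshots and mouse control are done; base64 encodes the screenshot as text, because the request body is JSON and cannot carry raw PNG bytes.

client = anthropic.Anthropic() sets up the client for the Claude API. The API key is read automatically from the ANTHROPIC_API_KEY environment variable.

4.4 Defining the Tool List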

 Python 
tools = [
    {
        "type": "computer_20251124",
        "name": "computer",
        "display_width_px": 1024,
        "display_height_px": 768,
        "display_number": 1
    },
    {
        "type": "bash_20250124",
        "name": "bash"
    },
    {
        "type": "text_editor_20250728",
        "name": "str_replace_based_edit_tool"
    }
]

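The tools array tells Claude that three tools are available:

computer: operates the screen; screenshots, clicks, and typing all go through it. display_width_px and display_height_px tell Claude how large the screen is, so it knows the valid coordinate range
bash: runs terminal commands directly
text editor: reads, creates, and edits files

The suffix in each type (for example computer_20251124) is the tool's version, written as a date; copy the values as they are. display_number is the X11 display number, and in the Docker demo it is always 1.

4.5 The execute_tool Function

This is the core of the whole program. Claude tells your program "take a screenshot", "click here", or "type this text", but Claude has no hands of its own; it only says what it wants. execute_tool is the function that actually does the work.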
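
The snippets in this section are branches of one function, and the original does not show its header. The following is a minimal sketch of how they might be assembled, not the author's exact code. It assumes a single if/elif chain inside a try block, which would match the indentation of the snippets, and infers the signature from the call execute_tool(block.name, block.input) in the loop of section 4.6. The placeholder return values mark where each real branch goes.

```python
def execute_tool(tool_name, tool_input):
    """Dispatch one tool call from Claude and return the result to send back."""
    try:
        # Only the computer tool sends an "action"; for other tools this is None
        action = tool_input.get("action")

        if action == "screenshot":
            return "(screenshot branch)"
        elif action == "left_click":
            return "(click branch)"
        elif tool_name == "bash":
            return "(bash branch)"
        elif tool_name == "str_replace_based_edit_tool":
            return "(text editor branch)"

        # Report unsupported calls instead of silently returning None
        return f"Error: unsupported call {tool_name} / {action}"
    except Exception as exc:
        # Return failures as text so the loop keeps running and Claude can react
        return f"Error: {exc}"


print(execute_tool("computer", {"action": "screenshot"}))
print(execute_tool("computer", {"action": "zoom"}))
```

Returning an error string rather than raising is a design choice: Claude sees the failure in the next round and can try another approach.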

Screenshot:

 Python 
        if action == "screenshot":
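            # Delete the previous capture first: some scrot versions refuse to
            # overwrite an existing file, which would leave a stale screenshot here
            subprocess.run(["rm", "-f", "/tmp/screenshot.png"])
            subprocess.run(["scrot", "/tmp/screenshot.png"], check=True)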
            with open("/tmp/screenshot.png", "rb") as f:
                img_base64 = base64.b64encode(f.read()).decode()
            return [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": img_base64
                    }
                }
            ]

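scrot, a Linux screenshot tool, captures the screen, and base64 turns the image into text to send back. Once Claude receives it, it can "see" what is on screen. Both scrot and xdotool, which is used below, come preinstalled in the Docker container.

Click: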

 Python 
        elif action == "left_click":
            x, y = tool_input["coordinate"]
            subprocess.run(["xdotool", "mousemove", str(x), str(y), "click", "1"])
            return "clicked"

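When Claude says "click at 500,300", xdotool moves the pointer to that position and simulates a left click. xdotool is a Linux tool for simulating keyboard and mouse input.

Typing and keyboard shortcuts: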

 Python 
        elif action == "type":
            subprocess.run(["xdotool", "type", "--clearmodifiers", tool_input["text"]])
            return "typed"

        elif action == "key":
            subprocess.run(["xdotool", "key", "--clearmodifiers", tool_input["text"]])
            return "key pressed"

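--clearmodifiers keeps modifier keys such as Shift or Ctrl that are still held down from interfering with the input. Shortcuts go through xdotool as well: if Claude asks for ctrl+s to save, xdotool presses it.

Mouse movement and scrolling: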

 Python 
        elif action == "mouse_move":
            x, y = tool_input["coordinate"]
            subprocess.run(["xdotool", "mousemove", str(x), str(y)])
            return "moved"

        elif action == "scroll":
            x, y = tool_input["coordinate"]
            direction = tool_input["scroll_direction"]
            amount = tool_input["scroll_amount"]
            subprocess.run(["xdotool", "mousemove", str(x), str(y)])
            # X11 wheel buttons: 4 = up, 5 = down, 6 = left, 7 = right
            button = {"up": "4", "down": "5", "left": "6", "right": "7"}[direction]
            for _ in range(amount):
                subprocess.run(["xdotool", "click", button])
            return "scrolled"

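Scrolling is a little unusual: xdotool treats the wheel as mouse buttons, where button 5 scrolls down and button 4 scrolls up (6 and 7 are left and right). Scrolling down three notches means clicking button 5 three times in a row.

Bash tool: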

 Python 
        elif tool_name == "bash":
            result = subprocess.run(tool_input["command"], shell=True, capture_output=True, text=True)
            return result.stdout + result.stderr
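
One thing the branch above leaves open is a command that never returns, which would stall the whole loop. The following is a minimal sketch of a more defensive runner, assuming a hypothetical helper named `run_bash`; the 30-second timeout and the 10,000-character cap are arbitrary illustration values, not recommendations from the original.

```python
import subprocess


def run_bash(command, timeout_s=30, max_chars=10_000):
    """Run a shell command and return its combined output as text."""
    try:
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # Report the timeout as text so Claude can choose another approach
        return f"Error: command timed out after {timeout_s}s"
    output = result.stdout + result.stderr
    if len(output) > max_chars:
        # Very long output costs tokens on every later round, so cap it
        output = output[:max_chars] + "\n[output truncated]"
    return output or "(no output)"


print(run_bash("echo hello"))
```

Inside execute_tool, the bash branch could then simply return run_bash(tool_input["command"]).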

Text Editor tool:

 Python 
        elif tool_name == "str_replace_based_edit_tool":
            command = tool_input["command"]
            path = tool_input["path"]

            if command == "view":
                with open(path, "r") as f:
                    return f.read()
            elif command == "create":
                with open(path, "w") as f:
                    f.write(tool_input["file_text"])
                return f"Created {path}"
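            elif command == "str_replace":
                with open(path, "r") as f:
                    content = f.read()
                old_str = tool_input["old_str"]
                # Refuse ambiguous edits: old_str must match exactly once
                matches = content.count(old_str)
                if matches != 1:
                    return f"Error: old_str matched {matches} times in {path}"
                content = content.replace(old_str, tool_input.get("new_str", ""))
                with open(path, "w") as f:
                    f.write(content)
                return f"Replaced in {path}"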

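This supports viewing, creating, and replacing file content, using nothing more than basic Python file I/O.

Beyond these, Claude can also request double_click, right_click, and drag actions, and on Opus 4.5 and later there is a zoom action as well. Add the more advanced ones gradually; the logic is the same.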
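
As an illustration of "the logic is the same", here is a hedged sketch that builds xdotool commands for three of the extra actions. `build_xdotool_commands` is a hypothetical helper, and the input field names for drag (`start_coordinate` and `coordinate`) are an assumption to check against the tool documentation. It only builds argument lists, so it can be tried without a desktop.

```python
def build_xdotool_commands(action, tool_input):
    """Return a list of xdotool argument lists for a few extra pointer actions."""
    if action in ("double_click", "right_click"):
        x, y = tool_input["coordinate"]
        move = ["xdotool", "mousemove", str(x), str(y)]
        if action == "double_click":
            # Two quick clicks with button 1 (left)
            return [move + ["click", "--repeat", "2", "1"]]
        return [move + ["click", "3"]]  # button 3 is the right button
    if action == "left_click_drag":
        sx, sy = tool_input["start_coordinate"]  # assumed field name
        ex, ey = tool_input["coordinate"]
        return [
            # Press at the start point, then release at the end point
            ["xdotool", "mousemove", str(sx), str(sy), "mousedown", "1"],
            ["xdotool", "mousemove", str(ex), str(ey), "mouseup", "1"],
        ]
    raise ValueError(f"unsupported action: {action}")


for argv in build_xdotool_commands("double_click", {"coordinate": [500, 300]}):
    print(" ".join(argv))
```

In execute_tool, each returned list would be passed to subprocess.run in order.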

4.6 The Agent Loop Itself

 Python 
messages = [
    {
        "role": "user",
        "content": "Open Firefox and search for Python tutorials"
    }
]

max_iterations = 10
iteration = 0

while iteration < max_iterations:
    iteration += 1
    print(f"\n--- Round {iteration} ---")

    response = client.beta.messages.create(
        model="claude-opus-4-6",
        max_tokens=4096,
        tools=tools,
        messages=messages,
        betas=["computer-use-2025-11-24"]
    )

    print(f"Stop reason: {response.stop_reason}")

    tool_results = []
    for block in response.content:
        if block.type == "text":
            print(f"Claude: {block.text}")
        elif block.type == "tool_use":
            print(f"Tool: {block.name} -> {block.input}")
            result = execute_tool(block.name, block.input)
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": result if isinstance(result, list) else str(result)
            })

    if not tool_results:
        print("\nTask complete!")
        break

    messages.append({"role": "assistant", "content": response.content})
    messages.append({"role": "user", "content": tool_results})

print(f"\nRan {iteration} rounds")

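Each round does four things:

Call the API: send the conversation history and the tool list to Claude, which looks at the task and the screenshots and decides the next step. Note the call is client.beta.messages.create: Computer Use is still in beta, so the request goes through the beta endpoint and the betas parameter is required
Parse the response: a tool_use block means Claude wants to use a tool, so call execute_tool and wrap the result as a tool_result
Check for completion: if Claude requested no tools this round, the task is done and the loop ends
Continue the loop: send the tool results back to Claude so it can decide the next step

Important: always set a limit such as max_iterations = 10. The Agent Loop keeps running, so if Claude gets stuck and there is no cap, the API bill can climb quickly.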
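
A round cap limits how many calls are made but not how large each one is, and every round resends the growing history. As a complementary guard, token usage can be tracked as well. The following is a sketch under stated assumptions: `TokenBudget` is a hypothetical helper, the 200,000-token limit is an arbitrary illustration value, and the object here is a stand-in for the `usage` field (with `input_tokens` and `output_tokens`) that the SDK returns on each response.

```python
from types import SimpleNamespace


class TokenBudget:
    """Track cumulative token usage and flag when a limit is reached."""

    def __init__(self, max_total_tokens):
        self.max_total_tokens = max_total_tokens
        self.used = 0

    def add(self, usage):
        # usage is expected to expose input_tokens and output_tokens
        self.used += usage.input_tokens + usage.output_tokens
        return self.used

    def exhausted(self):
        return self.used >= self.max_total_tokens


budget = TokenBudget(max_total_tokens=200_000)  # arbitrary illustration value
fake_usage = SimpleNamespace(input_tokens=1500, output_tokens=300)  # stand-in for response.usage
budget.add(fake_usage)
print(budget.used, budget.exhausted())
```

In the loop, this would mean calling budget.add(response.usage) after each API call and breaking out once budget.exhausted() is true.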

4.7 Model Version Reference Table

Different models need different version strings:

                 Newer models                      Older models
Model name       opus-4-6, sonnet-4-6, opus-4-5    sonnet-4-5, haiku-4-5, sonnet-4
Beta header      computer-use-2025-11-24           computer-use-2025-01-24
Computer tool    computer_20251124                 computer_20250124
Bash tool        bash_20250124                     bash_20250124
Text editor      text_editor_20250728              text_editor_20250124
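
To avoid mixing values from the two columns, the table can be turned into a small lookup. This is a sketch only: `versions_for` is a hypothetical helper, the values are copied from the table as given, and matching on substrings of the model name is an assumption about how model IDs are written.

```python
NEWER = {
    "beta": "computer-use-2025-11-24",
    "computer": "computer_20251124",
    "bash": "bash_20250124",
    "text_editor": "text_editor_20250728",
}
OLDER = {
    "beta": "computer-use-2025-01-24",
    "computer": "computer_20250124",
    "bash": "bash_20250124",
    "text_editor": "text_editor_20250124",
}
NEWER_MODELS = ("opus-4-6", "sonnet-4-6", "opus-4-5")
OLDER_MODELS = ("sonnet-4-5", "haiku-4-5", "sonnet-4")


def versions_for(model):
    """Return the version strings from the table for a given model name."""
    # Check the newer list first: "sonnet-4" is a prefix of "sonnet-4-6"
    if any(tag in model for tag in NEWER_MODELS):
        return NEWER
    if any(tag in model for tag in OLDER_MODELS):
        return OLDER
    raise ValueError(f"model not in the table: {model}")


print(versions_for("claude-opus-4-6")["computer"])
```

The returned values can then fill in the type fields of the tools list and the betas parameter of the API call.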

Save the code as computer_use_demo.py and run the script:

 Bash 
python computer_use_demo.py

5. Three Tips for Making It More Reliable

5.1 Turn On Thinking

Add one parameter to the API call:

 Python 
    response = client.beta.messages.create(
        model="claude-opus-4-6",
        max_tokens=4096,
        tools=tools,
        messages=messages,
        betas=["computer-use-2025-11-24"],
        thinking={"type": "enabled", "budget_tokens": 1024}
    )

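budget_tokens is how many tokens Claude may spend on thinking. 1024 is the minimum the API accepts and is usually enough here; with too little room, Claude may act before it has thought things through. With thinking on, Claude first lays out what it sees, what it plans to do, and why before each step, which is very handy for debugging.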
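
The loop in section 4.6 prints only text and tool_use blocks, so the thinking output would not show up in the log by itself. The following is a small sketch of a formatter that also handles thinking blocks. `describe_block` is a hypothetical helper, and the objects below are stand-ins for SDK content blocks, assuming a thinking block exposes its text as `.thinking`.

```python
from types import SimpleNamespace


def describe_block(block):
    """Format one response content block for the debug log."""
    if block.type == "thinking":
        return f"Thinking: {block.thinking}"
    if block.type == "text":
        return f"Claude: {block.text}"
    if block.type == "tool_use":
        return f"Tool: {block.name} -> {block.input}"
    return f"({block.type} block)"  # e.g. redacted thinking


# Stand-in blocks so the sketch runs without an API call
blocks = [
    SimpleNamespace(type="thinking", thinking="The desktop shows a Firefox icon."),
    SimpleNamespace(type="tool_use", name="computer", input={"action": "screenshot"}),
]
for block in blocks:
    print(describe_block(block))
```

In the loop, print(describe_block(block)) could replace the two separate print calls.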

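5.2 Coordinate Scaling (the Most Common Pitfall)

Anthropic's API has two image limits: the long edge cannot exceed 1568 pixels, and the total pixel count cannot exceed about 1.15 million. Larger screenshots are therefore shrunk automatically. Claude analyzes the shrunken image and returns coordinates based on it, but your screen is still at its original resolution, so clicks land in the wrong place unless the coordinates are converted.

Step 1: compute the scale factor, at the top of the file: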

 Python 
import math
from PIL import Image  # requires: pip install Pillow

# Put this at the top of the file to compute the scale factor
SCREEN_W, SCREEN_H = 1024, 768  # change to your screen resolution

def get_scale_factor(width, height):
    long_edge = max(width, height)
    total_pixels = width * height
    long_edge_scale = 1568 / long_edge
    total_pixels_scale = math.sqrt(1_150_000 / total_pixels)
    return min(1.0, long_edge_scale, total_pixels_scale)

SCALE = get_scale_factor(SCREEN_W, SCREEN_H)
SCALED_W = int(SCREEN_W * SCALE)
SCALED_H = int(SCREEN_H * SCALE)

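Note: if your screen is not 1024x768, also change display_width_px and display_height_px in the tools declaration to SCALED_W and SCALED_H, so Claude knows the size of the screen it actually sees.

Step 2: shrink the screenshot before sending it: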

 Python 
        if action == "screenshot":
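            subprocess.run(["rm", "-f", "/tmp/screenshot.png"])  # avoid a stale capture
            subprocess.run(["scrot", "/tmp/screenshot.png"], check=True)
            # Shrink the capture by SCALE before sending it to Claude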
            img = Image.open("/tmp/screenshot.png")
            img = img.resize((SCALED_W, SCALED_H))
            img.save("/tmp/screenshot.png")
            with open("/tmp/screenshot.png", "rb") as f:
                img_base64 = base64.b64encode(f.read()).decode()
            return [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": img_base64
                    }
                }
            ]

Step 3: divide the coordinates of every action by SCALE:

 Python 
def scale_coordinate(x, y):
    """Convert coordinates returned by Claude back to the original screen position."""
    return int(x / SCALE), int(y / SCALE)

Then, everywhere execute_tool reads a coordinate, pass it through this function first:

 Python 
        elif action == "left_click":
            x, y = scale_coordinate(*tool_input["coordinate"])
            subprocess.run(["xdotool", "mousemove", str(x), str(y), "click", "1"])
            return "clicked"

        elif action == "mouse_move":
            x, y = scale_coordinate(*tool_input["coordinate"])
            subprocess.run(["xdotool", "mousemove", str(x), str(y)])
            return "moved"

        elif action == "scroll":
            x, y = scale_coordinate(*tool_input["coordinate"])
            direction = tool_input["scroll_direction"]
            amount = tool_input["scroll_amount"]
            subprocess.run(["xdotool", "mousemove", str(x), str(y)])
            # X11 wheel buttons: 4 = up, 5 = down, 6 = left, 7 = right
            button = {"up": "4", "down": "5", "left": "6", "right": "7"}[direction]
            for _ in range(amount):
                subprocess.run(["xdotool", "click", button])
            return "scrolled"

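With the demo's default 1024x768, SCALE works out to exactly 1, so coordinates are unchanged and the screenshot is not shrunk. As soon as your screen resolution is larger than that, all three changes are required.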
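
To see the effect on a larger screen, here is the same formula worked through for 1920x1080, a resolution chosen only as an example. `to_screen` is a hypothetical name that does the same conversion as scale_coordinate, with the scale passed in explicitly so the block stands alone.

```python
import math


def get_scale_factor(width, height):
    """Same formula as above: respect both the long-edge and total-pixel limits."""
    long_edge_scale = 1568 / max(width, height)
    total_pixels_scale = math.sqrt(1_150_000 / (width * height))
    return min(1.0, long_edge_scale, total_pixels_scale)


def to_screen(x, y, scale):
    """Map a coordinate on the shrunken image back to the real screen."""
    return int(x / scale), int(y / scale)


scale = get_scale_factor(1920, 1080)
scaled_size = (int(1920 * scale), int(1080 * scale))
# A click at the center of the shrunken image should land near the screen center
center = to_screen(scaled_size[0] // 2, scaled_size[1] // 2, scale)
print(round(scale, 4), scaled_size, center)
```

Here the total-pixel limit is the binding one: the scale comes out at about 0.745, so Claude sees an image of roughly 1429x804, and a click at the center of that image maps back to within a couple of pixels of (960, 540) on the real screen.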

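5.3 Security

Computer Use gives Claude control of a computer, which is no small privilege. Anthropic itself says to run it inside a Docker container or virtual machine and to keep Claude away from your main system. Never hand it passwords or bank account details to work with: web pages may contain prompt injection, and Claude can be led astray by what it sees.

Security reminder: always run Computer Use in an isolated environment (Docker / VM), and never let Claude touch sensitive information or your main system.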