Screenshot to Action: A Deep Dive Into the /v1/predict Endpoint
Most automation tools rely on brittle selectors or rigid APIs that break when a UI changes. The /v1/predict endpoint flips that model. You send a base64 screenshot and a natural language instruction, and the model returns concrete actions like click, type, and scroll. This is the core of a reliable computer use agent that reads the screen and acts like a human. This guide walks through the endpoint, request fields, pricing, and a working example.
How /v1/predict works
The endpoint takes a base64 screenshot, an instruction, and a CUA version, then returns an actions array and a status. You loop capture, predict, and act until status is done.
Request fields (POST https://coasty.ai/v1/predict):
- screenshot: base64-encoded image, e.g., a PNG or JPEG
- instruction: natural language describing what to do
- cua_version: one of 'v3' or 'v4' (default 'v3')
Response fields:
- actions: array of action objects (e.g., type, click, scroll)
- status: 'pending', 'done', or an error code
Every prediction costs $0.05, so a task that needs N predictions costs N × $0.05. Keep sending fresh screenshots and executing the returned actions until status is done.
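The request and response fields listed above can be sketched as two small helpers. This is a minimal sketch, not an official client: build_predict_payload and parse_predict_response are hypothetical names, and treating every status other than 'pending' or 'done' as an error is an assumption drawn from the field list.

```python
import base64

ALLOWED_CUA_VERSIONS = {"v3", "v4"}  # the two values named in the field list


def build_predict_payload(image_bytes, instruction, cua_version="v3"):
    """Build the JSON body for POST /v1/predict (hypothetical helper)."""
    if cua_version not in ALLOWED_CUA_VERSIONS:
        raise ValueError(f"cua_version must be one of {sorted(ALLOWED_CUA_VERSIONS)}")
    if not instruction.strip():
        raise ValueError("instruction must not be empty")
    return {
        "screenshot": base64.b64encode(image_bytes).decode("ascii"),
        "instruction": instruction,
        "cua_version": cua_version,
    }


def parse_predict_response(data):
    """Return (actions, status) from a decoded response body.

    Assumption: any status other than 'pending' or 'done' is an error code.
    """
    status = data.get("status")
    if status not in ("pending", "done"):
        raise RuntimeError(f"prediction failed with status {status!r}")
    return data.get("actions", []), status
```

Validating cua_version locally avoids paying for a request that the server would presumably reject anyway.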
curl -X POST https://coasty.ai/v1/predict \
  -H "X-API-Key: $COASTY_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "screenshot": "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mP8/5+hHgAHggJ/PchI7wAAAABJRU5ErkJggg==",
    "instruction": "Click the OK button",
    "cua_version": "v3"
  }'
Response:
{
  "actions": [
    {"type": "click", "x": 200, "y": 150}
  ],
  "status": "done"
}
Full Python loop
- Capture the screen using pyautogui or a library like mss
- Encode the image to base64
- POST to /v1/predict with the screenshot, instruction, and cua_version
- Loop capture, predict, and act until status is done
- Each prediction costs $0.05
import base64
import io
import os

import pyautogui
import requests


def predict_and_act(instruction, cua_version="v3"):
    url = "https://coasty.ai/v1/predict"
    api_key = os.getenv("COASTY_API_KEY")
    if not api_key:
        raise RuntimeError("Set the COASTY_API_KEY environment variable")
    headers = {"X-API-Key": api_key}

    while True:
        # Capture the screen and encode it as a base64 PNG, in memory
        screenshot = pyautogui.screenshot()
        buffer = io.BytesIO()
        screenshot.save(buffer, format="PNG")
        base64_img = base64.b64encode(buffer.getvalue()).decode()

        # Predict
        resp = requests.post(
            url,
            headers=headers,
            json={
                "screenshot": base64_img,
                "instruction": instruction,
                "cua_version": cua_version,
            },
            timeout=60,
        )
        resp.raise_for_status()
        data = resp.json()
        actions = data.get("actions", [])
        status = data.get("status")

        # Act on each action
        for act in actions:
            if act["type"] == "click":
                pyautogui.click(act["x"], act["y"])
            elif act["type"] == "type":
                pyautogui.write(act["text"])
            elif act["type"] == "scroll":
                pyautogui.scroll(act["delta"])

        if status == "done":
            break
        if status != "pending":
            # Anything other than pending or done is an error code
            raise RuntimeError(f"Prediction failed with status: {status}")


if __name__ == "__main__":
    predict_and_act("Click the OK button")
The loop repeats capture, predict, and act until status is done, at $0.05 per prediction.
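Because every prediction is billed and a task can stay pending for many rounds, it may be worth capping the loop. The sketch below is one way to do that, under stated assumptions: run_with_budget and step_fn are hypothetical names, step_fn stands for a single capture, predict, and act iteration that returns the status string, and the $0.05 price is the figure quoted in this guide, tracked in whole cents to avoid floating-point drift.

```python
COST_PER_PREDICTION_CENTS = 5  # $0.05 per prediction, as quoted in this guide


def run_with_budget(step_fn, max_steps=20):
    """Call step_fn until it returns "done" or max_steps is reached.

    step_fn is a hypothetical callable that performs one capture, predict,
    and act iteration and returns the status string from the response.
    Returns (finished, steps_used, cost_in_cents).
    """
    steps = 0
    while steps < max_steps:
        status = step_fn()
        steps += 1  # assumption: each call is billed, whatever its status
        if status == "done":
            return True, steps, steps * COST_PER_PREDICTION_CENTS
        if status != "pending":
            raise RuntimeError(f"unexpected status {status!r} after {steps} steps")
    return False, steps, steps * COST_PER_PREDICTION_CENTS


# Usage with a stand-in step function, so no network calls are made
statuses = iter(["pending", "pending", "done"])
finished, steps, cents = run_with_budget(lambda: next(statuses))
print(f"finished={finished} steps={steps} cost=${cents / 100:.2f}")
# prints: finished=True steps=3 cost=$0.15
```

With the default cap of 20 steps, a task that never reaches done stops after at most $1.00 of predictions instead of running indefinitely.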
Where this beats brittle automation
Traditional automation relies on fixed selectors or specific API endpoints. When a UI updates, those selectors break. With a computer use API, your agent sees the screen and chooses actions based on the current layout. It can handle dynamic buttons, overlapping elements, and language changes. You get a robust agent that adapts to real-world software without brittle selectors.
The /v1/predict endpoint is the foundation of a powerful computer use agent. Build a bot that reads a screen, decides what to do, and executes clicks and keystrokes just like a human. Want to try it? Get your API key at https://coasty.ai/developers and start turning screenshots into real actions.