How to Test AI Agents: Evaluation Frameworks & Quality Metrics

A complete guide to AI agent testing in 2026: unit tests, LLM evals, tracing, and CI/CD gates, with real tools and real metrics. Build agents that don't fail in production.

Frequently Asked Questions

How do you evaluate the performance of an AI agent?
Evaluate AI agents across four dimensions: task completion rate (did it finish the job?), tool selection accuracy (did it pick the right APIs?), quality (were outputs relevant and hallucination-free?), and cost/latency. Use frameworks like DeepEval or LangSmith to run these evaluations systematically against a curated golden dataset. See our full breakdown in the [metrics section below](#what-metrics-actually-matter-for-ai-agents).
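The four dimensions above can be sketched as a single scoring function. This is a minimal illustration, not any framework's API: `AgentRun`, `score_run`, and the "required phrases" quality proxy are all hypothetical names invented here, and real quality scoring would use an LLM judge or semantic similarity rather than substring checks.

```python
from dataclasses import dataclass

@dataclass
class AgentRun:
    completed: bool          # did the agent finish the task?
    tools_called: list[str]  # tools actually invoked, in order
    output: str              # final answer text
    cost_usd: float
    latency_s: float

def score_run(run: AgentRun, expected_tools: list[str],
              required_phrases: list[str],
              max_cost: float = 0.05, max_latency: float = 10.0) -> dict:
    """Score one agent run across the four dimensions."""
    tool_hits = sum(1 for t in expected_tools if t in run.tools_called)
    return {
        "task_completion": 1.0 if run.completed else 0.0,
        "tool_accuracy": tool_hits / len(expected_tools) if expected_tools else 1.0,
        # crude quality proxy: required facts present in the output
        "quality": sum(p.lower() in run.output.lower() for p in required_phrases)
                   / max(len(required_phrases), 1),
        "within_budget": run.cost_usd <= max_cost and run.latency_s <= max_latency,
    }
```

Running `score_run` over every case in a golden dataset and aggregating the per-dimension averages gives you the systematic evaluation the answer describes.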
What is the difference between LLM testing and AI agent testing?
LLM testing checks a single prompt-response pair — is the output correct? AI agent testing evaluates a multi-step execution graph: did the agent pick the right tool? Did it reason correctly at each step? Did it complete the full task without accumulating errors? Agent testing requires step-level tracing, not just output scoring.
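Step-level tracing can be illustrated with a check over an execution trace rather than the final output. The trace shape below (`{"tool": ..., "status": ...}`) is a simplified assumption; tools like LangSmith and Langfuse record richer spans, but the idea is the same: assert on each step, not just the answer.

```python
def check_trace(trace: list[dict], expected_sequence: list[str]) -> list[str]:
    """Return step-level failures for an agent trace.

    `trace` is a list of steps like {"tool": "search", "status": "ok"}.
    Checking each step catches errors that output-only scoring would miss,
    e.g. a wrong tool that happened to produce a plausible final answer.
    """
    failures = []
    actual = [step["tool"] for step in trace]
    if actual != expected_sequence:
        failures.append(f"tool sequence {actual} != expected {expected_sequence}")
    for i, step in enumerate(trace):
        if step.get("status") != "ok":
            failures.append(f"step {i} ({step['tool']}) returned {step.get('status')}")
    return failures
```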
How do you handle non-determinism when testing AI agents?
Run each test case multiple times and average the scores. Use semantic similarity metrics instead of exact-match assertions. Set score thresholds (e.g., pass if ≥70% of 5 runs succeed) rather than binary pass/fail. Track the pass^k metric — how often an agent succeeds k consecutive times — since a 60% pass@1 rate can hide a sub-25% consistency rate.
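A minimal sketch of that repeat-and-threshold pattern, with `run_eval` as a hypothetical harness function. It estimates pass^k from the observed pass@1 by assuming runs are independent, which is why a 60% pass@1 collapses to roughly 22% for three consecutive successes (0.6³ ≈ 0.216).

```python
def run_eval(agent_fn, case, trials: int = 5, threshold: float = 0.7) -> dict:
    """Run one test case several times to absorb non-determinism.

    `agent_fn(case)` returns True/False for a single run; in practice this
    would wrap a semantic-similarity or LLM-judge score, not exact match.
    """
    results = [agent_fn(case) for _ in range(trials)]
    pass_at_1 = sum(results) / trials
    return {
        "pass_at_1": pass_at_1,
        # pass^k for k=3: chance of 3 consecutive successes, assuming independence
        "pass_hat_3": pass_at_1 ** 3,
        "passed": pass_at_1 >= threshold,  # score threshold, not binary pass/fail
    }
```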
What tools do developers use to test AI agents?
The most widely adopted tools are LangSmith (best for LangChain ecosystems), DeepEval (open-source Pytest-style evals), Langfuse (self-hosted observability), and Braintrust (collaborative eval iteration). For lower-level testing of deterministic components, standard Pytest with mocked LLMs works well.
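The mocked-LLM approach for deterministic components can be sketched like this. `route` and the label set are hypothetical; the point is that routing logic is injected with a fake LLM callable, so the tests are fast, free, and fully deterministic.

```python
# test_router.py — run with `pytest test_router.py`

def route(llm_classify, query: str) -> str:
    """Deterministic routing logic under test; the LLM call is injected."""
    label = llm_classify(query)
    return {"billing": "billing_agent", "tech": "support_agent"}.get(label, "fallback_agent")

def test_route_billing():
    fake_llm = lambda q: "billing"  # mocked LLM: fixed output, no API call
    assert route(fake_llm, "refund my card") == "billing_agent"

def test_route_falls_back_on_unknown_label():
    fake_llm = lambda q: "gibberish"
    assert route(fake_llm, "hello?") == "fallback_agent"
```

The LLM call itself still needs eval-style testing; mocking only isolates the deterministic plumbing around it.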
What is a golden dataset for AI agent evaluation?
A golden dataset is a versioned, curated collection of representative prompts paired with expected outcomes — your evaluation source of truth. Build it from real production queries, edge cases, and known failure modes. Update it whenever you add new agent capabilities or observe a new failure category in production.
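One common concrete form is a JSONL file checked into version control, one case per line. The field names below (`expected_tools`, `must_mention`, `source`) are illustrative, not a standard schema:

```python
import json

GOLDEN_JSONL = """\
{"id": "ref-001", "prompt": "Refund order #1234", "expected_tools": ["lookup_order", "issue_refund"], "must_mention": ["refund"], "source": "production"}
{"id": "edge-007", "prompt": "Refund an order I never placed", "expected_tools": ["lookup_order"], "must_mention": ["no order found"], "source": "known-failure"}
"""

def load_golden(text: str) -> list[dict]:
    """Parse a JSONL golden dataset into evaluation cases."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

cases = load_golden(GOLDEN_JSONL)
```

Tagging each case with a `source` (production query, edge case, known failure) makes it easy to see which categories regress when an eval run fails.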