Research state-of-the-art evaluation frameworks for agentic workflows, in both industry and academic research.
Apply that research to build automated evaluation pipelines that run agent scenarios, capture execution artifacts, score results, and detect regressions.
Evaluate tool-use behavior, including whether the agent selects the right tool, passes correct arguments, avoids unnecessary calls, and handles tool errors appropriately.
Analyze agent trajectories using traces, logs, intermediate steps, and final outputs to identify reasoning failures, context misuse, hallucinated assumptions, and brittle workflow patterns.
Design metrics for agent reliability, including success rate, tool-call precision, argument accuracy, recovery rate, retry count, latency, cost, and safety-related failure rates.
Create reusable evaluation datasets from synthetic cases, golden workflows, and real anonymized executions.
Support experiments comparing prompts, model providers, tool descriptions, memory strategies, context construction methods, and execution modes.
Help build human evaluation workflows and rubrics for judging agent correctness, faithfulness, usefulness, and risk awareness.
Work with engineers to translate evaluation findings into better tests, monitoring signals, tool interfaces, prompts, and guardrails.
Potentially write research papers and publish them at scientific conferences.
Currently pursuing, or recently graduated from, a Master’s or PhD program in Computer Science, Artificial Intelligence, Machine Learning, Software Engineering, Data Science, or a related field.
Strong Python fundamentals and interest in AI systems.
Curious about how LLM agents work, fail, and improve.
Interested in evaluation methodology, not just application building.
Comfortable reading logs, traces, test cases, and structured data.
Detail-oriented and able to define clear, measurable criteria for ambiguous agent behavior.
Prior experience with LLMs, LangChain-like agents, tool calling, pytest, data analysis, or observability tools is helpful but not required.
As an equal opportunity employer, we firmly believe that diverse voices fuel our innovation and allow us to better serve our users and the community. We foster an environment where every employee of Tencent feels supported and inspired to achieve individual and common goals.
Copyright © 2026 Grabjobs Pte.Ltd. All Rights Reserved.