AI Quality Engineer

Engineering

About the role

An agent that demos well is not the same as an agent that works in production, so every agent we ship is backed by evals. We're hiring an AI Quality Engineer to own our evaluation system. You'll run and extend the evaluation pipelines that prove our agents work, both in client engagements and across our own internal agent fleet, and keep them honest as the agents, models, and data underneath them change.

What you'll do

  • Own the evaluation pipelines that gate every change we ship, keeping them automated and running in CI/CD rather than leaning on manual spot-checks.
  • Keep agents scored at the step level, not just the final answer, across tool selection, planning and reasoning chains, and retrieval quality, so subtle regressions surface before clients notice them.
  • Extend the eval toolkit as new failure modes appear: LLM-as-a-judge, RAG faithfulness and relevance, hallucination detection, prompt and agent regression suites, and red-teaming for safety.
  • Maintain the golden datasets that mirror real production inputs and the tracing and observability that watch our agents once they are live.
  • Own the quality bar for what ships and raise it over time with the rest of the engineering team.
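
To make "scored at the step level" and "gate every change" concrete, here is a minimal Python sketch. It is illustrative only and not a description of our actual pipeline: the names StepScore, mean_by_step, and regressions, the step labels, and the 0.05 tolerance are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StepScore:
    step: str     # hypothetical label, e.g. "tool_selection" or "retrieval"
    score: float  # 0.0 to 1.0, produced by whatever scorer is in use


def mean_by_step(runs):
    """Average each step's score across repeated runs of the same eval cases.

    Averaging over several runs is one simple way to cope with
    non-deterministic agent output.
    """
    collected = {}
    for run in runs:
        for item in run:
            collected.setdefault(item.step, []).append(item.score)
    return {step: sum(vals) / len(vals) for step, vals in collected.items()}


def regressions(baseline, candidate, tolerance=0.05):
    """Return the steps whose mean score fell more than `tolerance` below baseline.

    A step missing from the candidate counts as a score of 0.0.
    """
    return sorted(
        step
        for step, base in baseline.items()
        if base - candidate.get(step, 0.0) > tolerance
    )


# Assumed baseline means from the last accepted build (made-up numbers).
baseline = {"tool_selection": 0.90, "retrieval": 0.85, "final_answer": 0.88}

# Two runs of a candidate build: the final answer still looks fine,
# but tool selection has quietly degraded.
candidate_runs = [
    [StepScore("tool_selection", 0.7), StepScore("retrieval", 0.85), StepScore("final_answer", 0.9)],
    [StepScore("tool_selection", 0.8), StepScore("retrieval", 0.85), StepScore("final_answer", 0.9)],
]

failed_steps = regressions(baseline, mean_by_step(candidate_runs))
print(failed_steps)  # a CI job could fail the build when this list is non-empty
```

In this made-up example the final-answer score holds steady while tool selection drops, which is the kind of regression that scoring only the final answer would miss.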

What we're looking for

  • Strong software engineering with Python, and a feel for testing non-deterministic systems, where classic assertion-based tests are not enough.
  • Experience evaluating LLM and agent systems: agent graphs (e.g. LangGraph), eval frameworks (DeepEval, RAGAS, G-Eval, LangSmith, or similar), and LLM-as-a-judge methods.
  • Fluency with RAG and retrieval quality, plus observability and tracing for agents in production.
  • A bias for measurement: you define what "good" means, then build the harness that proves it.
  • Bonus: data-pipeline testing experience, or exposure to private markets, financial services, or enterprise data.

Why Dealstitch

We're a senior team with 40+ years of combined private-market technology experience. You'll work on problems that matter, with direct access to decision-makers, and see the impact of your work in weeks, not quarters.
