Vetting standard
Vetted for production AI, not resume keywords.
Devlyn screens engineers for shipped AI work, evaluation discipline, failure-mode reasoning, AI-native tooling, communication, and ownership before they reach your shortlist.
What makes AI vetting different?
A generic coding screen can miss the areas where AI products actually break: evals, grounding, tool permissions, model cost, latency, failure modes, user feedback, and release judgment.
Why generic tests fail
AI production risk is role-specific.
A candidate can pass a standard code screen and still be weak at retrieval evals, model routing, agent traces, prompt injection risk, or cost-aware release decisions. Devlyn screens for the production surface the role will own.
Screening modules
The AI-native vetting standard.
No fake acceptance percentage. The standard is visible in the checks.
Shipped-AI evidence
Proof of LLM features shipped to real users, not tutorials or side projects.
Weak signal: Only demos, hackathons, or course projects.
Strong signal: Production features with users, traffic, and lessons from what broke.
Role specialism review
Depth in the specific role — retrieval, agents, platform, security, or decision science — instead of a broad AI label.
Weak signal: Generalist answers that apply to any AI job.
Strong signal: Specific tradeoffs and failure modes from their specialism.
Live technical deep-dive
A working session in their specialism with senior engineers, not a recruiter quiz.
Weak signal: Memorized definitions with no hands-on reasoning.
Strong signal: Thinks aloud, debugs, and defends choices under questioning.
Eval and failure-mode reasoning
How they measure quality and catch regressions before users do.
Weak signal: Judges output by feel with no baseline.
Strong signal: Builds eval sets and reasons about edge cases and regressions.
Cost and latency reasoning
Whether they treat token cost and latency as first-class production constraints.
Weak signal: Ignores cost and speed until the bill or the user complains.
Strong signal: Reasons about cost, latency, and quality as deliberate tradeoffs.
AI-native workflow review
Whether AI tooling is part of how they actually work, not a novelty.
Weak signal: Talks about AI tools but does not use them in practice.
Strong signal: Uses agentic tooling daily and knows where it helps and hurts.
Communication and ownership
Clear writing, honest status updates, and the instinct to own outcomes, not tickets.
Weak signal: Waits for instructions and reports activity, not results.
Strong signal: Writes clearly, flags risk early, and owns the outcome end to end.
Security judgment
Awareness of the AI-specific attack surface: injection, leakage, and tool misuse.
Weak signal: Assumes normal appsec covers AI risk.
Strong signal: Anticipates prompt injection, data exposure, and unsafe tool authority.
What we reject
Most candidates do not pass — on purpose.
- Demo-only AI claims with no users or production history.
- No eval thinking — quality judged by feel, not a baseline.
- Vague, prompt-only experience with no specialism depth.
- Cannot explain failure modes for their own role.
- Cannot communicate tradeoffs in writing or in the deep-dive.
- Ignores cost and latency until something breaks.
- Ignores security boundaries, permissions, and tool authority.
Role-specific checks
Every role has different failure modes.
The shortlist is filtered for the role, not a broad AI label.
Forward-Deployed AI Engineer
- Workflow discovery judgment
- Production code ownership
- Stakeholder clarity
AI Application Engineer
- AI product UX
- Structured outputs
- Streaming and retries
LLM Engineer
- Eval design
- Model comparison discipline
- Cost and latency judgment
RAG & Context Engineer
- Retrieval evals
- Chunking and metadata judgment
- Permission-aware context
Agentic Workflow Engineer
- Agent graph design
- Tool permission boundaries
- Human approval gates
AI Platform Engineer
- Platform API design
- Provider routing
- Observability
AI Security Engineer
- AI threat modeling
- Prompt injection reasoning
- Tool permission audit
Data Scientist
- Metric design
- Data quality skepticism
- Experiment judgment
What buyers see
The shortlist should explain itself.
Role fit
Why the engineer matches the exact ownership model.
Relevant proof
AI systems, workflows, evals, or delivery evidence tied to the role.
Interview focus
Questions that test the role’s highest-risk decisions.
Trial proof
What the first two weeks should produce.
Trial connection
Vetting continues in your codebase.
Pull request
The trial should create inspectable code, not a private demo.
Eval report
Quality, retrieval, routing, or workflow behavior should have a baseline.
Architecture decision record
Tradeoffs should be written clearly enough for your team to maintain.
Workflow map
The engineer should make scope, actors, systems, and failure paths visible.
FAQ
Vetting questions.
How does Devlyn vet AI engineers?
Devlyn reviews shipped-AI evidence, role specialism, live technical judgment, eval and failure-mode reasoning, cost and latency reasoning, AI-native workflow, security judgment, and communication and ownership.
Do you use generic coding tests?
A generic test is not enough on its own. Each AI role is screened around the production failure modes of its own area: retrieval quality, agents, evals, model routing, platform, security, or decision science.
Do you claim top 1% or top 3%?
No. Devlyn avoids unverifiable percentile claims and explains the vetting standard instead.
What does a buyer see in the shortlist?
A short profile set with role fit, relevant proof, availability, and matching rationale.
What does Devlyn reject?
Vague AI claims, demo-only experience, no eval discipline, weak production judgment, unclear communication, and inability to explain failure modes.
How does vetting connect to the trial?
The trial tests the same criteria inside the buyer’s real workflow.
Can we run our own interview?
Yes. Devlyn encourages role-specific interviews focused on practical judgment.
Is security reviewed for every role?
Security judgment is part of the screen, with deeper review for AI Security, RAG, Agentic, and Platform roles.