How to Assess Software Engineers: Beyond the Technical Test

Technical tests measure one dimension of engineering ability. The most predictive hiring framework for developers combines a work-sample technical assessment, cognitive evaluation, and structured personality measurement — and the order in which you run them matters.

Antonio Romero · May 18, 2026 · 11 min read

Why the technical test alone is not enough

The dominant hiring process for software engineers worldwide follows a recognizable pattern: screen the résumé, run a coding challenge, conduct one or more technical interviews, make an offer. The technical assessment is the gate. Pass the code and you advance. Everything else is a formality.

This pattern has a significant validity problem. Research on technical interview formats, including a widely cited 2020 study by Behroozi, Shirolkar, Barik, and Parnin, found that whiteboard coding interviews are shaped by performance anxiety in ways that inflate score variance without adding predictive signal. A candidate's whiteboard performance depends on whether they have recently practiced the specific problem type, how they respond to being observed, and which interviewer they happen to draw, not just on their underlying engineering ability.

The predictive validity issue is compounded by format. LeetCode-style algorithmic problems — binary tree traversals, dynamic programming puzzles, graph search implementations — measure the ability to recall and apply specific computer science patterns under timed pressure. That skill can be trained directly and efficiently with practice. Engineers who can afford interview coaching or dedicate weeks to algorithm drilling will score higher than engineers who are simply better at building production software. The test measures preparation more than it measures performance.

None of this means technical assessment is not worth doing. Work-sample tests, which replicate the actual work a candidate would perform in the role, have among the highest predictive validity of any selection tool. The problem is not technical assessment itself; it is treating a single, poorly designed technical screen as the complete picture of an engineering candidate.

The cognitive profile of high-performing software engineers

Software engineering is cognitively demanding in specific and measurable ways. The organizational psychology literature on cognitive ability and technical role performance points to three dimensions as particularly load-bearing for developers.

Abstract and logical reasoning

Abstract reasoning — the ability to identify patterns, infer rules, and reason about novel systems without relying on prior domain knowledge — is the single cognitive dimension most predictive of software engineering performance. It maps directly to the core challenges of the work: reasoning about a system you have never seen before, debugging a failure in code you did not write, designing an abstraction that will hold up under requirements you cannot fully anticipate.

Abstract reasoning is also the cognitive dimension most independent of language and cultural background, which has practical implications for technical hiring in multinational contexts. A candidate whose written English is imperfect may still score at the top of the abstract reasoning distribution — and that score predicts engineering performance regardless of the language differential.

Working memory

Working memory — the capacity to hold and actively manipulate multiple pieces of information simultaneously — becomes progressively more load-bearing as engineering seniority increases. Junior engineers primarily work within well-defined task boundaries. Senior engineers and technical leads must hold the full dependency graph of a feature in mind, track edge cases across multiple interacting systems, reason about the downstream effects of architectural decisions, and maintain context across multiple work streams simultaneously.

Working memory capacity is one reason why high-performing senior engineers tend to appear qualitatively different from their peers even when technical knowledge and experience are similar. The difference is often not what they know — it is how much they can hold in working memory at once while reasoning about the problem.

Verbal reasoning

Verbal reasoning is consistently underweighted in developer assessment, but the evidence for its relevance is strong. Software engineering above the junior level requires continuous production of complex written artifacts: technical specifications, architecture decision records, code review comments, incident post-mortems, design documents, and cross-functional communications with product and business stakeholders.

Engineers with high verbal reasoning write clearer pull request descriptions, produce more useful documentation, communicate technical constraints to non-technical stakeholders more effectively, and generate fewer misunderstandings that require expensive clarification cycles. For roles involving technical leadership, system design, or significant external collaboration, verbal reasoning is not a soft nice-to-have — it is a direct predictor of output quality.

"Abstract reasoning is the top cognitive predictor for software engineers — it maps directly to reasoning about novel systems, debugging unfamiliar code, and designing abstractions that hold under unknown future requirements."

What personality science says about developer performance

Personality assessment has a mixed reputation in technical hiring circles, often dismissed as either subjective impression-gathering or pop-psychology. This reputation does not reflect the scientific evidence. Validated personality models — particularly HEXACO and D/I/S/C — have demonstrated predictive validity for job performance across role types, including technical ones, when administered as structured psychometric instruments rather than conversational impressions.

HEXACO Conscientiousness: the reliability signal

Across virtually every job performance dataset in organizational psychology, Conscientiousness is the most consistent personality predictor of individual contributor quality. In software engineering, this manifests as code quality discipline — writing tests, maintaining documentation, following naming conventions, catching edge cases before review — as well as reliability in estimates, commitments, and follow-through on technical debt acknowledgments.

High-Conscientiousness engineers are the ones whose pull requests are consistently well-scoped, whose tickets are accurately estimated, and whose code requires fewer review cycles. This is not a personality preference — it is a measurable predictor of output quality that shows up reliably in performance ratings, incident rates, and on-call reliability metrics.

HEXACO Openness to Experience: the adaptability signal

Software engineering has a shorter half-life of technical knowledge than almost any other professional discipline. Languages fall in and out of favor; frameworks are superseded; architectural paradigms shift. Engineers who score high on Openness to Experience adapt to these shifts faster, experience less resistance to relearning, and tend to maintain technical currency more effectively over a multi-year tenure than engineers with lower scores who prefer established methods.

For roles that will involve significant greenfield work, technology migrations, or work in fast-moving areas like machine learning infrastructure, cloud-native architectures, or emerging languages, Openness to Experience is a primary predictor of long-term role fit — not just initial performance.

HEXACO Honesty-Humility: the team-fit signal

The HEXACO model's sixth factor — Honesty-Humility — has no direct equivalent in the older Big Five or D/I/S/C models. It captures the tendency toward sincere, fair, and unassuming behavior in interactions with others, as opposed to manipulative, self-serving, or attention-seeking behavior. In software engineering contexts, Honesty-Humility is the best personality predictor of collaborative code review behavior, willingness to acknowledge mistakes in post-mortems, and avoidance of the "brilliant jerk" dynamic — the engineer whose individual technical output is high but whose collaborative impact on the surrounding team is negative.

Teams with low average Honesty-Humility scores tend to accumulate interpersonal debt alongside their technical debt. High-Honesty-Humility engineers are disproportionately valuable in code review, architectural discussion, and incident response — contexts where intellectual honesty and the willingness to be wrong publicly determine how much the team actually learns.

D/I/S/C behavioral profiles in engineering

While HEXACO describes stable personality traits, D/I/S/C (Dominance, Influence, Steadiness, Conscientiousness) describes behavioral tendencies and communication style: how a person prefers to act and interact at work, particularly under pressure. Both instruments add value in developer assessment; they measure different things and do not substitute for each other.

C-style (Conscientiousness) profiles appear frequently among high-performing individual contributors: methodical, quality-focused, detail-oriented, and systematic in approach. They produce reliable, well-tested code but may need support when requirements are deliberately ambiguous or when speed is prioritized over thoroughness.

S-style (Steadiness) profiles tend toward consistency, patience, and collaborative reliability — traits that make them excellent in on-call rotations, pair programming, and team mentorship roles. They typically perform well in stable, well-defined engineering contexts and may need additional support during rapid organizational change.

D-style (Dominance) profiles appear more frequently in technical leads, architects, and engineering managers — roles where decisive action, comfort with ambiguity, and willingness to make judgment calls under incomplete information matter. At the individual contributor level, high-D engineers can produce fast results but may accumulate technical debt and conflict in code review without structural counterbalance.

I-style (Influence) profiles are less common in deep individual-contributor roles but are strongly represented among developer advocates, technical sales engineers, and senior engineers whose role involves significant evangelism, community building, or cross-functional stakeholder management.

"HEXACO Honesty-Humility is the best personality predictor of constructive code review behavior — and the most reliable guard against the 'brilliant jerk' dynamic that corrodes team productivity."

The three-layer assessment framework

The evidence from selection research points to a clear structure for developer hiring: a weighted composite of technical, cognitive, and personality assessment, administered in a sequence designed to minimize halo effects between layers.

Layer 1 — Technical work sample

The technical assessment should be a work sample — a task that reflects the actual engineering work the candidate will perform in the role. For a backend Python engineer, this means reviewing a pull request for a Django module, debugging a failing integration test, or extending a small service with a new endpoint. For a data engineer, it means transforming a messy dataset, optimizing a slow query, or designing a pipeline schema.

Work samples have higher predictive validity than algorithmic puzzle tests and lower adverse impact on candidates from underrepresented groups. They also produce more actionable signal for hiring managers: the quality of a candidate's code review comments tells you far more about how they will perform in daily work than whether they can implement a red-black tree from memory under observation pressure.

The technical assessment should be accompanied by a structured rubric with predefined scoring dimensions — code quality, correctness, edge case handling, documentation and communication, and approach to ambiguity. Without a rubric, evaluator bias dominates the score.
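As a rough illustration of what a structured rubric can look like in practice, the sketch below encodes the five scoring dimensions named above and averages them across evaluators. The 1-to-5 scale, the class name RubricScore, and the helper aggregate are assumptions made for this example, not part of any specific assessment product.

```python
from dataclasses import dataclass
from statistics import mean

# The five dimensions named in the text. The 1-5 scale is an assumption.
RUBRIC_DIMENSIONS = (
    "code_quality",
    "correctness",
    "edge_case_handling",
    "documentation_and_communication",
    "approach_to_ambiguity",
)


@dataclass
class RubricScore:
    """One evaluator's scores for one candidate (hypothetical structure)."""
    evaluator: str
    scores: dict  # dimension name -> integer score from 1 to 5

    def __post_init__(self):
        # Reject partial or free-form scoring so every evaluator rates
        # exactly the same predefined dimensions.
        if set(self.scores) != set(RUBRIC_DIMENSIONS):
            raise ValueError("scores must cover exactly the rubric dimensions")
        for dimension, value in self.scores.items():
            if not 1 <= value <= 5:
                raise ValueError(f"{dimension} must be between 1 and 5")


def aggregate(ratings):
    """Average each dimension across evaluators, keeping dimensions separate."""
    if not ratings:
        raise ValueError("at least one rating is required")
    return {
        dimension: mean(r.scores[dimension] for r in ratings)
        for dimension in RUBRIC_DIMENSIONS
    }
```

Keeping the dimensions separate in the aggregate, rather than collapsing them into one number at this stage, preserves the distinction between implementation quality and communication or judgment.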

Layer 2 — Cognitive assessment

The cognitive layer should cover at minimum three dimensions: abstract reasoning, working memory, and verbal reasoning. A well-designed cognitive battery for engineering roles runs approximately 25 to 35 minutes and produces normed dimension scores rather than a single composite.

Scores should be normed against a relevant engineering population, not against general population distributions. A score at the 65th percentile of the general population may be at the 40th percentile of software engineers — a meaningful difference when making a seniority-level hiring decision. Build role-level score distributions over time as your candidate pool grows; each hiring cycle makes the cognitive data more useful than the last.
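To make the norming point concrete, here is a minimal sketch that computes the percentile rank of one raw score against two different reference groups. The function name percentile_rank and both norm groups are synthetic and chosen only to mirror the 65th-versus-40th-percentile illustration above; real norm data would come from your own candidate pool or the test publisher.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(score, norm_scores):
    """Percentile rank of score within a norm group (mid-rank for ties)."""
    ordered = sorted(norm_scores)
    if not ordered:
        raise ValueError("norm group must not be empty")
    below = bisect_left(ordered, score)         # scores strictly lower
    at_or_below = bisect_right(ordered, score)  # scores lower or equal
    return 100.0 * (below + at_or_below) / (2 * len(ordered))


# Synthetic norm groups: the engineering group is shifted upward.
general_population = list(range(0, 100))
software_engineers = list(range(25, 125))

raw_score = 65
vs_general = percentile_rank(raw_score, general_population)
vs_engineers = percentile_rank(raw_score, software_engineers)
```

The same raw score lands much lower against the engineering group than against the general one, which is why the choice of reference population matters for seniority decisions.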

Layer 3 — Personality assessment

The personality layer should cover both HEXACO (for stable trait prediction of job performance) and D/I/S/C (for behavioral style and team-fit context). Together they add approximately 15 to 20 minutes of candidate time and produce role-relevant interpretations rather than raw score outputs.

Personality results are most useful when interpreted in combination with the cognitive and technical data — not in isolation. A candidate with high Conscientiousness but low abstract reasoning will be reliably diligent but may struggle with ambiguous architecture problems. A candidate with very high abstract reasoning and low Conscientiousness may produce technically brilliant but poorly documented, difficult-to-maintain code. The combination tells you which engineering contexts the candidate will thrive in and which support structures they will need.

Common mistakes in developer technical interviews

The most pervasive mistake is treating a single technical screen as the full picture of a candidate's engineering ability. A coding test, however well-designed, produces one data point. Hiring decisions made on a single data point have high error variance — the variance that produces both false rejections of strong engineers and false acceptances of candidates who performed well on the specific test format they practiced for.

The second most common mistake is using the same technical assessment format regardless of seniority level. A junior engineer should be evaluated primarily on code correctness, basic design patterns, and learning speed. A senior engineer should be evaluated on system design judgment, technical communication, and the quality of their reasoning about trade-offs — not on whether they can implement an LRU cache from scratch in 30 minutes.

Third: running technical assessment first and then allowing interviewers to know the result before administering subsequent assessments. If an interviewer knows a candidate aced the coding challenge, their ratings in the structured interview and their interpretation of personality assessment results are influenced by that knowledge — even when they try to evaluate independently. The halo effect is not a bias that can be reasoned away; it must be controlled structurally by sequencing.

Fourth: conflating technical assessment with a cultural fit evaluation. "Culture fit" evaluated informally in technical interviews is frequently a proxy for demographic similarity — an interviewer's unconscious preference for candidates who resemble them. Structured personality and behavioral assessment replaces this informal judgment with a standardized, validated signal that can be audited and calibrated over time.

A practical framework for engineering hiring teams

The framework that emerges from the research is straightforward to implement but requires deliberate sequencing and a commitment to treating all three assessment layers as genuine inputs rather than formalities.

Start by defining the engineering success profile for the specific role before selecting assessment instruments. Which cognitive dimensions are most load-bearing? What HEXACO trait profile correlates with success in this engineering context? Which D/I/S/C profile is consistent with the team's current composition needs? A solo architect role and a collaborative staff engineer role on a large team should not use identical success profiles, even if they use the same assessment instruments.

Administer personality and cognitive assessment first — before seeing any technical results. This is the single highest-impact structural change most engineering hiring processes can make. It ensures each layer's signal is genuinely independent. Send the assessment link before or concurrently with the technical challenge; require completion before technical results are reviewed.
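One way to enforce this sequencing structurally, rather than relying on reviewer discipline, is to lock the technical score until the other layers are complete. The sketch below assumes a hypothetical in-house record type; CandidateFile, ResultsLockedError, and the layer names are illustrative and do not correspond to any particular applicant tracking system.

```python
from dataclasses import dataclass, field
from typing import Optional

# Layers that must be finished before technical results are revealed.
REQUIRED_BEFORE_REVEAL = frozenset({"cognitive", "personality"})


class ResultsLockedError(Exception):
    """Raised when a reviewer requests technical results too early."""


@dataclass
class CandidateFile:
    candidate_id: str
    completed: set = field(default_factory=set)
    _technical_score: Optional[float] = None

    def record_completion(self, layer):
        self.completed.add(layer)

    def record_technical_score(self, score):
        # The score can be stored at any time; only its visibility is gated.
        self._technical_score = score

    def technical_score_for_reviewer(self):
        missing = REQUIRED_BEFORE_REVEAL - self.completed
        if missing:
            raise ResultsLockedError(
                f"complete these layers first: {sorted(missing)}"
            )
        return self._technical_score
```

With this shape the candidate can take the technical challenge whenever it suits them, while reviewers still see the cognitive and personality results without knowing how the coding went.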

Use the technical work sample as a scored layer with an explicit rubric, not as a binary pass/fail gate. Some of the most valuable engineering candidates will produce imperfect code on a time-limited assessment but demonstrate excellent reasoning about trade-offs in their written notes, or flag constraints they would investigate before committing to an approach. A rubric that scores communication and judgment separately from implementation captures this signal.

Combine the three layers in a weighted composite. For most individual contributor roles, technical assessment warrants the highest weight (approximately 40%) given its direct job relevance, with cognitive assessment second (approximately 35%) and personality providing directional context (approximately 25%). For senior and technical lead roles, the cognitive weight should increase: a cognitive ceiling becomes more limiting as seniority grows, while baseline technical skills become easier to verify through experience evidence.

Platforms like Calibers.ai integrate HEXACO, D/I/S/C, and cognitive assessment into a single developer evaluation workflow, producing a unified candidate report that maps each dimension to role-level benchmarks without requiring a psychometrics background to interpret.
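The weighting described above can be sketched as a small function. This assumes each layer has already been converted to a common 0-to-100 scale, which the article does not specify. The individual contributor weights come from the text; the senior weights are purely hypothetical, since the text says only that the cognitive weight should increase.

```python
# Weights for individual contributor roles, as given in the text.
IC_WEIGHTS = {"technical": 0.40, "cognitive": 0.35, "personality": 0.25}

# Hypothetical senior/lead weights: the text does not give numbers,
# only that the cognitive share should be higher.
SENIOR_WEIGHTS = {"technical": 0.30, "cognitive": 0.45, "personality": 0.25}


def composite_score(layer_scores, weights):
    """Weighted composite of layer scores on an assumed 0-100 scale."""
    if set(layer_scores) != set(weights):
        raise ValueError("layer_scores and weights must use the same layers")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(layer_scores[layer] * weights[layer] for layer in weights)


candidate = {"technical": 80, "cognitive": 60, "personality": 70}
ic_composite = composite_score(candidate, IC_WEIGHTS)
senior_composite = composite_score(candidate, SENIOR_WEIGHTS)
```

The same candidate gets a different composite under each weighting, which is the point of defining the role's success profile before scoring anyone.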

Finally, build score distributions over time. A cognitive score is only interpretable relative to a reference population. As you accumulate 10, 20, and then 50 engineering hires, you build the empirical basis to norm your assessment data against your own hiring population and to correlate assessment scores with first-year performance ratings. Correlations from the first handful of hires are noisy, so treat early estimates as directional. That correlation data is the most valuable output of a structured assessment program: it tells you which dimensions are predictive in your specific engineering context, not just in the literature.
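A minimal sketch of that correlation step follows, using a plain Pearson coefficient per assessment dimension. The record layout, the key first_year_rating, and the function names are assumptions for illustration; a real analysis would also need to consider sample size and the fact that only hired candidates have performance ratings.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 3:
        raise ValueError("need at least three paired observations")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / sqrt(var_x * var_y)


def validity_by_dimension(hires, outcome_key="first_year_rating"):
    """Correlate every assessment dimension with the outcome rating.

    hires is a list of dicts that share the same keys: one key per
    assessment dimension plus the outcome key (hypothetical layout).
    """
    outcomes = [hire[outcome_key] for hire in hires]
    dimensions = [key for key in hires[0] if key != outcome_key]
    return {
        dim: pearson_r([hire[dim] for hire in hires], outcomes)
        for dim in dimensions
    }
```

Re-running this after each hiring cycle shows whether a dimension's observed relationship with performance is stabilizing or was an artifact of a small early sample.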

Frequently asked questions

Do coding tests predict software engineer performance?

Coding tests predict some aspects of software engineer performance, but their validity is lower than commonly assumed — particularly for algorithmic puzzle formats like LeetCode, which measure practiced pattern recognition more than real-world engineering ability. Work-sample tests that mirror actual job tasks have substantially higher predictive validity, and their power increases further when combined with cognitive ability assessment and structured personality measurement.

What cognitive abilities matter most for software engineers?

Abstract and logical reasoning is the highest-value cognitive dimension for software engineering — it predicts the ability to reason about novel systems, debug unfamiliar code, and adapt to new technology paradigms. Working memory is essential for senior roles that require holding multiple requirements and dependencies in mind simultaneously. Verbal reasoning strongly predicts performance in specification writing, code review, and cross-team technical communication.

Which personality traits predict software engineer performance?

HEXACO Conscientiousness is the most consistent personality predictor of individual contributor quality — correlating with code quality, documentation discipline, and reliability in meeting commitments. HEXACO Openness to Experience predicts adaptability to new languages and paradigms. HEXACO Honesty-Humility predicts collaborative code review behavior and avoidance of the "brilliant jerk" dynamic. D/I/S/C behavioral profiles add context on communication style and team fit.

What is the right order for developer assessments?

Run cognitive and personality assessments before interviewers see the technical test results. When evaluators know a candidate performed well on a coding challenge, it creates a halo effect that inflates ratings in subsequent assessments. Sequencing personality and cognitive measurement first ensures each assessment layer produces genuinely independent signal.

What is wrong with LeetCode-style technical interviews?

LeetCode-style interviews measure the ability to solve algorithmic puzzles under timed pressure — a skill that can be trained directly with practice but correlates weakly with day-to-day software engineering performance. They also introduce significant assessment anxiety that affects scores independently of actual ability. Work-sample tests that reflect the actual engineering work have higher predictive validity and lower adverse impact on candidates from underrepresented groups.

How long should a complete developer assessment take?

A complete framework should take 60 to 90 minutes of candidate time across all three layers. A focused work-sample technical test runs 30 to 45 minutes. A cognitive assessment covering abstract reasoning, working memory, and verbal reasoning requires 25 to 35 minutes. A HEXACO and behavioral personality assessment takes 15 to 20 minutes. Spreading assessment across stages with time gaps reduces fatigue and keeps each signal independent.

About the author

Antonio Romero

Electronics Engineer · Operations & Technology Leader · Airelia LLC Operations Director

Antonio Romero is an Electronics and Telecommunications Engineer who has spent more than two decades recruiting and leading technical teams in cybersecurity operations across Latin America, the United States, and Europe — environments where the cost of a wrong hire is not measured in lost productivity but in incident response failures.

That context forced an early reckoning with what actually distinguishes people who hold up under sustained pressure. Technical depth matters at the door, but the engineers who earned trust and grew into leadership consistently shared a cluster of personality traits — conscientiousness, genuine intellectual openness, and a commitment to doing things correctly when no one was watching. That pattern, across hundreds of hiring decisions, is what led to the development of Calibers.ai.

Electronics and Telecommunications Engineer. Postgraduate studies in Strategic Management (ITBA, Buenos Aires) and Technology Management (EOI, Madrid).
