The Science of Personality Measurement · Personality Evaluation

The field has a rulebook: the Standards for Educational and Psychological Testing (AERA/APA/NCME, 2014). It defines validity as “the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests” and asks every serious instrument for evidence of reliability, validity, and fairness.

Reliability – does it repeat?

A measurement you can’t reproduce isn’t a measurement. Three indices matter:

Internal consistency. Do the items on a scale hang together? Cronbach’s α is the old default; McDonald’s ω is the methodologically preferred coefficient today. We target ≥ .80 for the kind of feedback we give.
Test-retest reliability. Do you get the same score weeks later? Good Big Five inventories land at .80–.90. By contrast, the MBTI reclassifies 39–76% of people into a different type within five weeks – an indictment of categorical scoring.
Inter-rater reliability. Would two scorers agree? Critical for any open-ended or AI-scored text (a feature we treat cautiously).
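
As a rough illustration of the internal-consistency index above, here is a minimal sketch of Cronbach’s α computed from a respondents-by-items table. The function name `cronbach_alpha` and the sample responses are hypothetical, made up for this example; it shows the textbook formula only and is not a description of our scoring pipeline.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of rows (one row per respondent,
    one column per item on the same scale)."""
    k = len(responses[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Sample variance of each item (column) across respondents.
    item_vars = [variance(col) for col in zip(*responses)]
    # Sample variance of each respondent's total score.
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical 1-5 Likert responses: 5 respondents x 4 items.
sample = [
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [3, 3, 4, 3],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
]
print(round(cronbach_alpha(sample), 2))
```

With only five made-up respondents the result is purely illustrative; a real estimate needs a proper sample, and ω requires fitting a factor model rather than this closed-form calculation.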

Validity – does it measure the right thing?

Validity isn’t one number; it’s a case built from several kinds of evidence:

Construct validity – does the score reflect the intended trait? (Established by factor analysis and nomological networks.)
Convergent & discriminant validity – the score correlates strongly with other measures of the same trait, weakly with measures of different ones.
Criterion / predictive validity – it predicts real outcomes. For the Big Five, a landmark review (Roberts et al., 2007) found effects on mortality, divorce, and occupational attainment comparable in magnitude to socioeconomic status and IQ.
Face validity – it merely looks like it measures what it claims. Matters for engagement, not for evidence – and it’s exactly what Barnum-style tests exploit.
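
The convergent and discriminant pattern described above can be checked with plain correlations. The sketch below is a minimal example under stated assumptions: the three score lists are invented, and the helper `pearson_r` is written out by hand so the example has no dependencies.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical scores for six people (illustrative only).
new_scale = [10, 14, 18, 22, 26, 30]        # the scale being validated
same_trait = [12, 15, 17, 24, 25, 31]       # established measure, same trait
different_trait = [20, 11, 25, 14, 22, 16]  # measure of an unrelated trait

r_convergent = pearson_r(new_scale, same_trait)
r_discriminant = pearson_r(new_scale, different_trait)
print(f"convergent r = {r_convergent:.2f}, discriminant r = {r_discriminant:.2f}")
```

Evidence for validity looks like a high convergent correlation next to a low discriminant one; either number alone says little.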

Factor analysis & item response theory

The Big Five and HEXACO structures were discovered, not invented, through factor analysis – the statistical search for the smallest set of dimensions that explains how trait words and ratings cluster. Item response theory (IRT) goes finer, modeling each item’s difficulty and discrimination. IRT is how researchers trimmed the full 300-item IPIP-NEO down to validated 120- and 60-item forms that keep most of the predictive power in a fraction of the time.
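
To make “difficulty and discrimination” concrete, here is a minimal sketch of the two-parameter logistic (2PL) model, the simplest IRT model that has both parameters. It is a simplification: personality items usually use multi-point rating scales, which call for polytomous models such as the graded response model. The item names and parameter values below are hypothetical.

```python
from math import exp


def p_endorse(theta, a, b):
    """2PL probability that a person at trait level theta endorses an item
    with discrimination a and difficulty (location) b."""
    return 1.0 / (1.0 + exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information the item contributes at trait level theta."""
    p = p_endorse(theta, a, b)
    return a * a * p * (1.0 - p)


# Hypothetical items: (name, discrimination a, difficulty b).
items = [("item_A", 2.0, 0.0), ("item_B", 0.6, 0.0), ("item_C", 1.5, 1.0)]

# Rank items by how much they tell us about an average respondent (theta = 0).
ranked = sorted(items, key=lambda it: item_information(0.0, it[1], it[2]), reverse=True)
for name, a, b in ranked:
    print(name, round(item_information(0.0, a, b), 3))
```

A short form built this way would keep the items that carry the most information across the trait range of interest, which is one plausible reading of how a long inventory can shrink without losing much precision.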

Norming – compared to whom?

A raw score is meaningless until you know the reference group. “You agreed with 18 of 24 Conscientiousness items” tells you nothing; “you score higher than 73% of the reference sample” does. That’s why we report percentiles against a disclosed norming sample – and a confidence band around each, because every score carries measurement error. See exactly which norms we use on the method page.
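
One common way to turn a raw score into a percentile with a confidence band is sketched below. It assumes the norming sample is approximately normal and uses the classical standard error of measurement, SEM = SD × √(1 − reliability). The function name, the norm mean and SD, and the reliability value are all hypothetical; the actual norms are the ones documented on the method page.

```python
from math import sqrt
from statistics import NormalDist


def percentile_with_band(raw, norm_mean, norm_sd, reliability, z=1.96):
    """Return (percentile, band_low, band_high) for a raw scale score.

    Assumes normally distributed norms; z=1.96 gives a roughly 95% band.
    """
    sem = norm_sd * sqrt(1.0 - reliability)  # standard error of measurement
    norms = NormalDist(norm_mean, norm_sd)

    def to_percentile(score):
        return 100.0 * norms.cdf(score)

    return (
        to_percentile(raw),
        to_percentile(raw - z * sem),
        to_percentile(raw + z * sem),
    )


# Hypothetical norms: mean 60, SD 10, scale reliability .85.
pct, low, high = percentile_with_band(raw=66, norm_mean=60, norm_sd=10, reliability=0.85)
print(f"percentile {pct:.0f}, band {low:.0f}-{high:.0f}")
```

The band widens as reliability drops, which is why a less reliable scale supports less specific feedback.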

The Barnum trap

In 1949, psychologist Bertram Forer gave students a personality description he claimed was tailored to each of them. On average, they rated it 4.26 out of 5 for accuracy. It was identical for everyone – assembled from a newsstand astrology book. This Forer (or Barnum) effect is the engine behind a lot of “spookily accurate” personality feedback: vague, flattering, universally endorsable statements feel uniquely true.

How we fight it

We anchor feedback to your actual percentiles, include specific and sometimes-unflattering statements, show confidence bands instead of false precision, and tell you plainly when a score is too close to the middle to mean much. You should still expect some of it to feel uncannily accurate – that is partly real signal and partly the Barnum effect, and knowing the difference is the point.

See the framework we build on, and why it won the field: the Big Five.

What makes a personality test trustworthy?