Unit 4: Test Administration and Scoring — Glossary · Professional Certificate in Neuropsychological Testing

Adaptive Testing #

Adaptive Testing

Related terms #

Fixed‑form testing, Item response theory, Computer‑adaptive administration

Explanation #

Adaptive testing modifies item difficulty in real‑time based on the examinee’s responses, aiming to estimate ability with fewer items. Example: A neuropsychological battery that selects memory items of increasing difficulty as the patient succeeds. Challenges include ensuring algorithm transparency, maintaining test security, and validating equivalence across adaptive pathways.

Age‑Adjusted Norms #

Age‑Adjusted Norms

Related terms #

Demographic corrections, Standard scores, Percentile ranks

Explanation #

Age‑adjusted norms compare an individual’s performance to a reference group matched for age, controlling for developmental effects. For instance, a processing speed score is interpreted against a sample of 70‑year‑olds rather than the entire adult cohort. Challenges arise when age ranges are broad, leading to reduced specificity, and when normative data are outdated.

Alternate Forms #

Alternate Forms

Related terms #

Parallel forms, Test–retest reliability, Counterbalancing

Explanation #

Alternate forms are two or more versions of the same test designed to reduce practice effects while preserving construct measurement. A common application is administering Form A at baseline and Form B at follow‑up. The primary challenge is ensuring equivalence in difficulty and psychometric properties across forms.

Administration Protocol #

Administration Protocol

Related terms #

Standardized procedures, Examiner training, Test environment

Explanation #

The administration protocol outlines step‑by‑step instructions for delivering a test, including timing, instructions, and handling of interruptions. For example, the protocol for the Trail Making Test specifies how to present the stimulus sheet and when to stop the clock. Deviations can introduce measurement error and threaten validity.

Artifact #

Artifact

Related terms #

Measurement error, Confounding variable, Data quality

Explanation #

An artifact is any extraneous factor that distorts the true score, such as background noise during auditory testing. Recognizing artifacts is essential for accurate interpretation. Practical strategies include documenting environmental conditions and, when possible, repeating the assessment in a controlled setting.

Baseline Assessment #

Baseline Assessment

Related terms #

Pre‑test, Initial evaluation, Reference point

Explanation #

A baseline assessment provides the initial set of scores against which future changes are compared. In longitudinal neuropsychology, baseline cognitive scores are crucial for detecting post‑injury decline. Challenges include ensuring the baseline is free from fatigue, medication effects, or acute stress.

Behavioral Observation #

Behavioral Observation

Related terms #

Qualitative data, Examiner‑participant rapport, Test validity

Explanation #

Systematic observation of the examinee’s behavior during testing (e.g., signs of frustration) supplements quantitative scores. For example, noting frequent perseveration during the Stroop task can inform interpretation of executive function deficits. The main challenge is maintaining objectivity and inter‑rater consistency.

Blinding #

Blinding

Related terms #

Examiner bias, Double‑blind design, Confidentiality

Explanation #

Blinding prevents the examiner from knowing the hypothesis or the participant’s clinical status, reducing bias in administration and scoring. In a study comparing two rehabilitation programs, the scorer may be blind to group assignment. Practical difficulties include maintaining blinding when obvious clinical signs are present.

Calibration #

Calibration

Related terms #

Test security, Equipment verification, Standardization

Explanation #

Calibration ensures that testing equipment (e.g., reaction‑time devices) operates within manufacturer specifications. Regular calibration reduces systematic error. For instance, a computerized attention task requires millisecond accuracy; any drift can inflate reaction‑time scores. Failure to calibrate can compromise data integrity.

Ceiling Effect #

Ceiling Effect

Related terms #

Floor effect, Test sensitivity, Normative range

Explanation #

A ceiling effect occurs when a test is too easy, causing many examinees to achieve maximum scores, thus limiting discrimination among high‑ability individuals. The Digit Span Forward may exhibit a ceiling effect in highly educated samples. Mitigation strategies include using more challenging items or alternate forms.

Clinical Interview #

Clinical Interview

Related terms #

Structured interview, Diagnostic clarification, Rapport building

Explanation #

The clinical interview provides contextual information that guides test selection, administration, and interpretation. During a neuropsychological assessment, the interview may reveal recent medication changes that affect attention. Challenges include balancing open‑ended questioning with time constraints and avoiding leading statements.

Composite Score #

Composite Score

Related terms #

Domain score, Factor analysis, Weighted average

Explanation #

A composite score aggregates multiple test scores within a cognitive domain (e.g., memory) to improve reliability. For example, a memory composite may combine scores from the California Verbal Learning Test and the Logical Memory subtest. The challenge is determining appropriate weighting and ensuring that combined tests tap the same construct.

Counterbalancing #

Counterbalancing

Related terms #

Order effects, Randomization, Alternate forms

Explanation #

Counterbalancing varies the order of test administration across participants to control for sequence effects such as fatigue. In a study using three subtests, one group may receive the order A‑B‑C while another receives C‑B‑A. Implementing counterbalancing increases logistical complexity.

Cut‑Score #

Cut‑Score

Related terms #

Diagnostic threshold, Sensitivity, Specificity

Explanation #

A cut‑score defines the point at which a test result is classified as “impaired” versus “normal.” For instance, a T‑score ≤ 35 on a visuospatial test may indicate clinically significant deficit. Determining cut‑scores requires balancing false positives against false negatives and may differ across populations.

Data Management #

Data Management

Related terms #

Secure storage, Data entry, Quality control

Explanation #

Data management encompasses procedures for recording, storing, and safeguarding test results, including raw scores, demographic variables, and scoring keys. Electronic databases must comply with privacy regulations (e.g., HIPAA). Common challenges involve preventing transcription errors and ensuring backup redundancy.

Demographic Corrections #

Demographic Corrections

Related terms #

Age‑adjusted norms, Education adjustment, Regression‑based norms

Explanation #

Demographic corrections adjust raw scores for variables such as age, education, and sex, producing standardized scores that reflect expected performance for a given subgroup. For example, a regression‑based formula may subtract 0.5 points per year of education. Limitations include over‑reliance on group averages and potential masking of genuine deficits.

Examiner Training #

Examiner Training

Related terms #

Certification, Competency assessment, Ongoing supervision

Explanation #

Examiner training ensures that administrators apply protocols consistently, interpret behaviors accurately, and score reliably. Training typically includes didactic sessions, supervised practice, and periodic competency checks. Challenges include resource allocation and maintaining skill retention over time.

Ethical Guidelines #

Ethical Guidelines

Related terms #

Informed consent, Confidentiality, Test security

Explanation #

Ethical guidelines govern the responsible conduct of test administration, emphasizing respect for participants, accurate reporting, and protection of test materials. Violations, such as sharing proprietary items, can lead to legal repercussions and compromised test validity. Ongoing ethics education is essential.

Examiner Bias #

Examiner Bias

Related terms #

Blinding, Subjectivity, Scoring drift

Explanation #

Examiner bias occurs when the administrator’s expectations influence test delivery or scoring, potentially inflating or deflating performance. For example, an examiner who anticipates severe impairment may unintentionally provide more prompts. Countermeasures include standardized scripts and blind scoring.

Examiner Fatigue #

Examiner Fatigue

Related terms #

Test fatigue, Session length, Performance monitoring

Explanation #

Prolonged testing sessions can lead to examiner fatigue, reducing attentiveness and increasing scoring errors. Fatigue may manifest as slower reaction to participant cues or missed administration steps. Scheduling breaks and limiting session duration mitigate this risk.

Examiner Standardization #

Examiner Standardization

Related terms #

Administration protocol, Inter‑rater reliability, Training manual

Explanation #

Standardization requires that all examiners follow identical procedures, minimizing variability due to personal style. This includes using the same script, timing devices, and scoring criteria. Achieving high inter‑rater reliability depends on rigorous training and periodic fidelity checks.

Inter‑Rater Reliability #

Inter‑Rater Reliability

Related terms #

Scoring consistency, Kappa statistic, Training calibration

Explanation #

Inter‑rater reliability measures the degree of agreement between two or more examiners scoring the same material. High reliability (e.g., κ > 0.80) indicates that scoring rules are clear and applied uniformly. Low reliability often signals ambiguous scoring criteria or insufficient training.

Item Response Theory #

Item Response Theory

Related terms #

Adaptive testing, Item difficulty, Parameter estimation

Explanation #

Item response theory (IRT) models the probability of a correct response as a function of the examinee’s latent ability and item characteristics. IRT underpins many computer‑adaptive batteries, allowing precise ability estimates with fewer items. Implementation requires sophisticated software and large calibration samples.

Manual Scoring #

Manual Scoring

Related terms #

Automated scoring, Scoring key, Human error

Explanation #

Manual scoring involves the examiner assigning scores by hand, often using a printed scoring key. While flexible, manual methods are vulnerable to transcription errors and inconsistent application of rules. Double‑checking and using standardized forms can reduce these risks.

Normative Sample #

Normative Sample

Related terms #

Representative cohort, Standardization, Demographic stratification

Explanation #

A normative sample is a group of individuals selected to represent the target population for establishing test norms. The quality of norms depends on sample size, diversity, and recruitment methods. Inadequate representation can lead to biased standard scores for under‑represented groups.

Normed Scores #

Normed Scores

Related terms #

Standard scores, Percentiles, Z‑scores

Explanation #

Normed scores translate raw performance into a metric that reflects relative standing within the normative sample. Common formats include T‑scores (mean = 50, SD = 10) and percentile ranks. Proper conversion requires accurate demographic corrections and up‑to‑date normative tables.

Raw Score #

Raw Score

Related terms #

Direct performance, Scoring conversion, Test manual

Explanation #

The raw score is the untransformed count of correct responses or points earned on a test. For example, a raw score of 12 on a 15‑item memory test reflects the number of items recalled. Raw scores are later converted to standardized metrics for interpretation.

Reliability #

Reliability

Related terms #

Internal consistency, Test‑retest reliability, Inter‑rater reliability

Explanation #

Reliability refers to the consistency of test scores across administrations, items, or raters. High reliability (e.g., Cronbach’s α > 0.90) indicates that the test measures the construct stably. Low reliability limits interpretability and may stem from poorly defined items or inconsistent administration.

Retest Interval #

Retest Interval

Related terms #

Practice effects, Test–retest reliability, Temporal stability

Explanation #

The retest interval is the time elapsed between two administrations of the same test. Short intervals increase practice effects, while long intervals may introduce true change. Selecting an appropriate interval (e.g., 2 weeks for short‑term stability) balances these considerations.

Scoring Key #

Scoring Key

Related terms #

Manual scoring, Automated algorithms, Item weighting

Explanation #

The scoring key provides the rules for assigning points to each response, including partial credit and penalty criteria. A well‑constructed key ensures uniformity across examiners. Errors in the key can propagate systematic bias throughout a dataset.

Standard Deviation #

Standard Deviation

Related terms #

Variability, Normed scores, Z‑score calculation

Explanation #

The standard deviation quantifies the spread of scores around the mean within a normative group. It is essential for converting raw scores to Z‑scores (raw – mean / SD). Misestimation of SD can distort standardized scores, leading to over‑ or under‑identification of impairment.

Standard Scores #

Standard Scores

Related terms #

T‑scores, Z‑scores, Percentile ranks

Explanation #

Standard scores express performance relative to the normative mean, facilitating comparison across different tests. For example, a T‑score of 40 indicates performance one standard deviation below the mean. Accurate conversion requires correct raw‑to‑norm transformations.

Test Booklet #

Test Booklet

Related terms #

Stimulus materials, Secure handling, Administration logistics

Explanation #

The test booklet contains the printed stimuli and response sheets for a specific assessment session. Proper organization of booklets (e.g., version A vs. B) prevents item exposure and maintains test security. Mishandling can result in compromised test integrity.

Test Environment #

Test Environment

Related terms #

Controlled setting, Distractions, Lighting conditions

Explanation #

The test environment encompasses physical factors that influence performance, such as noise level, lighting, and room temperature. A quiet, well‑lit space reduces extraneous variability. Failure to control the environment can introduce artifacts that mimic cognitive deficits.

Test Fatigue #

Test Fatigue

Related terms #

Examiner fatigue, Session length, Break scheduling

Explanation #

Test fatigue refers to declining performance due to prolonged mental effort, affecting both participants and examiners. It typically manifests after 60–90 minutes of continuous testing. Mitigation strategies include inserting short breaks and alternating demanding with less demanding tasks.

Test Security #

Test Security

Related terms #

Item confidentiality, Secure storage, Unauthorized distribution

Explanation #

Test security safeguards the confidentiality of test items and prevents unauthorized access or dissemination. Measures include locked cabinets for physical materials and encrypted servers for digital data. Breaches can invalidate scores and result in legal consequences.

Test‑Retest Reliability #

Test‑Retest Reliability

Related terms #

Temporal stability, Retest interval, Practice effects

Explanation #

Test‑retest reliability assesses the consistency of scores across two administrations separated by a defined interval. High reliability (e.g., r > 0.80) suggests that the test measures a stable construct. Practice effects and intervening events can lower reliability estimates.

Time Limits #

Time Limits

Related terms #

Administration protocol, Scoring criteria, Speeded vs. non‑speeded tests

Explanation #

Time limits dictate the maximum duration allowed for completing a test or subtest. For example, the Symbol‑Digit Modalities Test imposes a 90‑second limit per trial. Inconsistent timing can inflate error rates and affect comparability across administrations.

Validity #

Validity

Related terms #

Construct validity, Criterion validity, Content validity

Explanation #

Validity is the degree to which a test measures what it purports to assess. Evidence for validity includes correlations with established measures (criterion) and theoretical alignment (construct). A test lacking validity provides misleading information, regardless of reliability.

Variable Administration #

Variable Administration

Related terms #

Adaptive testing, Conditional branching, Protocol deviations

Explanation #

Variable administration refers to any departure from a fixed testing sequence, such as skipping items based on prior performance. While it can improve efficiency, it introduces complexity in scoring and norm comparison. Documentation of each deviation is essential for accurate interpretation.

Version Control #

Version Control

Related terms #

Test booklet, Alternate forms, Revision tracking

Explanation #

Version control tracks changes to test materials, ensuring that all examiners use the intended edition. Updates may include new items, corrected scoring keys, or revised normative tables. Failure to maintain version control can lead to mixing of incompatible forms.

Yield #

Yield

Related terms #

Diagnostic yield, Test efficiency, Information gain

Explanation #

Yield describes the proportion of assessments that produce clinically useful information. High‑yield tests (e.g., a brief screening battery) provide actionable data with minimal administration time. Low yield may indicate redundancy or inappropriate test selection.

Z‑Score #

Z‑Score

Related terms #

Standard scores, Standard deviation, Normal distribution

Explanation #

A Z‑score represents how many standard deviations a raw score lies above or below the normative mean (Z = (raw – mean)/SD). A Z‑score of –2 indicates performance two SDs below average, often considered clinically significant. Accurate calculation depends on reliable mean and SD values.

Percentile Rank #

Percentile Rank

Related terms #

Normed scores, Standard scores, Distribution interpretation

Explanation #

Percentile rank indicates the percentage of the normative sample that scored at or below a given individual’s score. For example, a 10th percentile rank means the examinee performed better than 10 % of peers. Percentiles are intuitive for clinicians but can be less precise in the extreme tails of the distribution.