Unit 4: Test Administration and Scoring

Expert-defined terms from the Professional Certificate in Neuropsychological Testing course at London School of Business and Administration. Free to read, free to share, paired with a professional course.

Download PDF Free · printable · SEO-indexed
Unit 4: Test Administration and Scoring

Adaptive Testing #

Adaptive Testing

Explanation #

Adaptive testing modifies item difficulty in real‑time based on the examinee’s responses, aiming to estimate ability with fewer items. Example: A neuropsychological battery that selects memory items of increasing difficulty as the patient succeeds. Challenges include ensuring algorithm transparency, maintaining test security, and validating equivalence across adaptive pathways.

Age‑Adjusted Norms #

Age‑Adjusted Norms

Explanation #

Age‑adjusted norms compare an individual’s performance to a reference group matched for age, controlling for developmental effects. For instance, a processing speed score is interpreted against a sample of 70‑year‑olds rather than the entire adult cohort. Challenges arise when age ranges are broad, leading to reduced specificity, and when normative data are outdated.

Alternate Forms #

Alternate Forms

Explanation #

Alternate forms are two or more versions of the same test designed to reduce practice effects while preserving construct measurement. A common application is administering Form A at baseline and Form B at follow‑up. The primary challenge is ensuring equivalence in difficulty and psychometric properties across forms.

Administration Protocol #

Administration Protocol

Explanation #

The administration protocol outlines step‑by‑step instructions for delivering a test, including timing, instructions, and handling of interruptions. For example, the protocol for the Trail Making Test specifies how to present the stimulus sheet and when to stop the clock. Deviations can introduce measurement error and threaten validity.

Artifact #

Artifact

Explanation #

An artifact is any extraneous factor that distorts the true score, such as background noise during auditory testing. Recognizing artifacts is essential for accurate interpretation. Practical strategies include documenting environmental conditions and, when possible, repeating the assessment in a controlled setting.

Baseline Assessment #

Baseline Assessment

Explanation #

A baseline assessment provides the initial set of scores against which future changes are compared. In longitudinal neuropsychology, baseline cognitive scores are crucial for detecting post‑injury decline. Challenges include ensuring the baseline is free from fatigue, medication effects, or acute stress.

Behavioral Observation #

Behavioral Observation

Explanation #

Systematic observation of the examinee’s behavior during testing (e.g., signs of frustration) supplements quantitative scores. For example, noting frequent perseveration during the Stroop task can inform interpretation of executive function deficits. The main challenge is maintaining objectivity and inter‑rater consistency.

Blinding #

Blinding

Explanation #

Blinding prevents the examiner from knowing the hypothesis or the participant’s clinical status, reducing bias in administration and scoring. In a study comparing two rehabilitation programs, the scorer may be blind to group assignment. Practical difficulties include maintaining blinding when obvious clinical signs are present.

Calibration #

Calibration

Explanation #

Calibration ensures that testing equipment (e.g., reaction‑time devices) operates within manufacturer specifications. Regular calibration reduces systematic error. For instance, a computerized attention task requires millisecond accuracy; any drift can inflate reaction‑time scores. Failure to calibrate can compromise data integrity.

Ceiling Effect #

Ceiling Effect

Explanation #

A ceiling effect occurs when a test is too easy, causing many examinees to achieve maximum scores, thus limiting discrimination among high‑ability individuals. The Digit Span Forward may exhibit a ceiling effect in highly educated samples. Mitigation strategies include using more challenging items or alternate forms.

Clinical Interview #

Clinical Interview

Explanation #

The clinical interview provides contextual information that guides test selection, administration, and interpretation. During a neuropsychological assessment, the interview may reveal recent medication changes that affect attention. Challenges include balancing open‑ended questioning with time constraints and avoiding leading statements.

Composite Score #

Composite Score

Explanation #

A composite score aggregates multiple test scores within a cognitive domain (e.g., memory) to improve reliability. For example, a memory composite may combine scores from the California Verbal Learning Test and the Logical Memory subtest. The challenge is determining appropriate weighting and ensuring that combined tests tap the same construct.

Counterbalancing #

Counterbalancing

Explanation #

Counterbalancing varies the order of test administration across participants to control for sequence effects such as fatigue. In a study using three subtests, one group may receive the order A‑B‑C while another receives C‑B‑A. Implementing counterbalancing increases logistical complexity.

Cut‑Score #

Cut‑Score

Explanation #

A cut‑score defines the point at which a test result is classified as “impaired” versus “normal.” For instance, a T‑score ≤ 35 on a visuospatial test may indicate clinically significant deficit. Determining cut‑scores requires balancing false positives against false negatives and may differ across populations.

Data Management #

Data Management

Explanation #

Data management encompasses procedures for recording, storing, and safeguarding test results, including raw scores, demographic variables, and scoring keys. Electronic databases must comply with privacy regulations (e.g., HIPAA). Common challenges involve preventing transcription errors and ensuring backup redundancy.

Demographic Corrections #

Demographic Corrections

Explanation #

Demographic corrections adjust raw scores for variables such as age, education, and sex, producing standardized scores that reflect expected performance for a given subgroup. For example, a regression‑based formula may subtract 0.5 points per year of education. Limitations include over‑reliance on group averages and potential masking of genuine deficits.

Examiner Training #

Examiner Training

Explanation #

Examiner training ensures that administrators apply protocols consistently, interpret behaviors accurately, and score reliably. Training typically includes didactic sessions, supervised practice, and periodic competency checks. Challenges include resource allocation and maintaining skill retention over time.

Ethical Guidelines #

Ethical Guidelines

Explanation #

Ethical guidelines govern the responsible conduct of test administration, emphasizing respect for participants, accurate reporting, and protection of test materials. Violations, such as sharing proprietary items, can lead to legal repercussions and compromised test validity. Ongoing ethics education is essential.

Examiner Bias #

Examiner Bias

Explanation #

Examiner bias occurs when the administrator’s expectations influence test delivery or scoring, potentially inflating or deflating performance. For example, an examiner who anticipates severe impairment may unintentionally provide more prompts. Countermeasures include standardized scripts and blind scoring.

Examiner Fatigue #

Examiner Fatigue

Explanation #

Prolonged testing sessions can lead to examiner fatigue, reducing attentiveness and increasing scoring errors. Fatigue may manifest as slower reaction to participant cues or missed administration steps. Scheduling breaks and limiting session duration mitigate this risk.

Examiner Standardization #

Examiner Standardization

Explanation #

Standardization requires that all examiners follow identical procedures, minimizing variability due to personal style. This includes using the same script, timing devices, and scoring criteria. Achieving high inter‑rater reliability depends on rigorous training and periodic fidelity checks.

Inter‑Rater Reliability #

Inter‑Rater Reliability

Explanation #

Inter‑rater reliability measures the degree of agreement between two or more examiners scoring the same material. High reliability (e.g., κ > 0.80) indicates that scoring rules are clear and applied uniformly. Low reliability often signals ambiguous scoring criteria or insufficient training.

Item Response Theory #

Item Response Theory

Explanation #

Item response theory (IRT) models the probability of a correct response as a function of the examinee’s latent ability and item characteristics. IRT underpins many computer‑adaptive batteries, allowing precise ability estimates with fewer items. Implementation requires sophisticated software and large calibration samples.

Manual Scoring #

Manual Scoring

Explanation #

Manual scoring involves the examiner assigning scores by hand, often using a printed scoring key. While flexible, manual methods are vulnerable to transcription errors and inconsistent application of rules. Double‑checking and using standardized forms can reduce these risks.

Normative Sample #

Normative Sample

Explanation #

A normative sample is a group of individuals selected to represent the target population for establishing test norms. The quality of norms depends on sample size, diversity, and recruitment methods. Inadequate representation can lead to biased standard scores for under‑represented groups.

Normed Scores #

Normed Scores

Explanation #

Normed scores translate raw performance into a metric that reflects relative standing within the normative sample. Common formats include T‑scores (mean = 50, SD = 10) and percentile ranks. Proper conversion requires accurate demographic corrections and up‑to‑date normative tables.

Raw Score #

Raw Score

Explanation #

The raw score is the untransformed count of correct responses or points earned on a test. For example, a raw score of 12 on a 15‑item memory test reflects the number of items recalled. Raw scores are later converted to standardized metrics for interpretation.

Reliability #

Reliability

Explanation #

Reliability refers to the consistency of test scores across administrations, items, or raters. High reliability (e.g., Cronbach’s α > 0.90) indicates that the test measures the construct stably. Low reliability limits interpretability and may stem from poorly defined items or inconsistent administration.

Retest Interval #

Retest Interval

Explanation #

The retest interval is the time elapsed between two administrations of the same test. Short intervals increase practice effects, while long intervals may introduce true change. Selecting an appropriate interval (e.g., 2 weeks for short‑term stability) balances these considerations.

Scoring Key #

Scoring Key

Explanation #

The scoring key provides the rules for assigning points to each response, including partial credit and penalty criteria. A well‑constructed key ensures uniformity across examiners. Errors in the key can propagate systematic bias throughout a dataset.

Standard Deviation #

Standard Deviation

Explanation #

The standard deviation quantifies the spread of scores around the mean within a normative group. It is essential for converting raw scores to Z‑scores (raw – mean / SD). Misestimation of SD can distort standardized scores, leading to over‑ or under‑identification of impairment.

Standard Scores #

Standard Scores

Explanation #

Standard scores express performance relative to the normative mean, facilitating comparison across different tests. For example, a T‑score of 40 indicates performance one standard deviation below the mean. Accurate conversion requires correct raw‑to‑norm transformations.

Test Booklet #

Test Booklet

Explanation #

The test booklet contains the printed stimuli and response sheets for a specific assessment session. Proper organization of booklets (e.g., version A vs. B) prevents item exposure and maintains test security. Mishandling can result in compromised test integrity.

Test Environment #

Test Environment

Explanation #

The test environment encompasses physical factors that influence performance, such as noise level, lighting, and room temperature. A quiet, well‑lit space reduces extraneous variability. Failure to control the environment can introduce artifacts that mimic cognitive deficits.

Test Fatigue #

Test Fatigue

Explanation #

Test fatigue refers to declining performance due to prolonged mental effort, affecting both participants and examiners. It typically manifests after 60–90 minutes of continuous testing. Mitigation strategies include inserting short breaks and alternating demanding with less demanding tasks.

Test Security #

Test Security

Explanation #

Test security safeguards the confidentiality of test items and prevents unauthorized access or dissemination. Measures include locked cabinets for physical materials and encrypted servers for digital data. Breaches can invalidate scores and result in legal consequences.

Test‑Retest Reliability #

Test‑Retest Reliability

Explanation #

Test‑retest reliability assesses the consistency of scores across two administrations separated by a defined interval. High reliability (e.g., r > 0.80) suggests that the test measures a stable construct. Practice effects and intervening events can lower reliability estimates.

Time Limits #

Time Limits

Explanation #

Time limits dictate the maximum duration allowed for completing a test or subtest. For example, the Symbol‑Digit Modalities Test imposes a 90‑second limit per trial. Inconsistent timing can inflate error rates and affect comparability across administrations.

Validity #

Validity

Explanation #

Validity is the degree to which a test measures what it purports to assess. Evidence for validity includes correlations with established measures (criterion) and theoretical alignment (construct). A test lacking validity provides misleading information, regardless of reliability.

Variable Administration #

Variable Administration

Explanation #

Variable administration refers to any departure from a fixed testing sequence, such as skipping items based on prior performance. While it can improve efficiency, it introduces complexity in scoring and norm comparison. Documentation of each deviation is essential for accurate interpretation.

Version Control #

Version Control

Explanation #

Version control tracks changes to test materials, ensuring that all examiners use the intended edition. Updates may include new items, corrected scoring keys, or revised normative tables. Failure to maintain version control can lead to mixing of incompatible forms.

Yield #

Yield

Explanation #

Yield describes the proportion of assessments that produce clinically useful information. High‑yield tests (e.g., a brief screening battery) provide actionable data with minimal administration time. Low yield may indicate redundancy or inappropriate test selection.

Z‑Score #

Z‑Score

Explanation #

A Z‑score represents how many standard deviations a raw score lies above or below the normative mean (Z = (raw – mean)/SD). A Z‑score of –2 indicates performance two SDs below average, often considered clinically significant. Accurate calculation depends on reliable mean and SD values.

Percentile Rank #

Percentile Rank

Explanation #

Percentile rank indicates the percentage of the normative sample that scored at or below a given individual’s score. For example, a 10th percentile rank means the examinee performed better than 10 % of peers. Percentiles are intuitive for clinicians but can be less precise in the extreme tails of the distribution.

June 2026 intake · open enrolment
from £90 GBP
Enrol