Patient Based Outcome Measures In Neurosurgery

Quality of life is defined as an individual's perception of their position in life in the context of the culture and value systems in which they live, and in relation to their goals, expectations, standards and concerns. It is a broad-ranging concept affected in a complex way by the person's physical health, psychological state, level of independence, social relationships and relationship to salient features of their environment [1]. Since health is the most highly valued state [2,3], health status is increasingly referred to as health-related quality of life (HR-QOL).

Health-related quality of life is patient-based and multidimensional. Doctors' and patients' perceptions of outcome have been observed to differ significantly [4,5,6]. There is ongoing debate about whether quality of life should be assessed with objective or subjective data [7]. Objective data are externally observable and directly measurable along a physical dimension. HR-QOL, by contrast, relies on subjective data and focuses more on the impact of a perceived health state on the ability to live a fulfilling life. Quality of life may be viewed in terms of an individual, a group or a large population of patients. The World Health Organisation's definition of health is 'a state of complete physical, mental and social well-being and not merely the absence of disease or infirmity'.

To understand the principles of health measurement, it helps to think of the human being as a measuring instrument. Psychophysicists [8] developed methods for studying judgements of measurable physical stimuli (e.g. length, weight, loudness). These methods are as relevant to judgements about health as they are to judgements about physical stimuli.

One important aspect of health outcomes in neurosurgery is the postoperative course as a direct result of our interventions. We may not realise it, but simple questions such as 'how are you?' or 'have you any pain?' are the basic tools for measuring outcome in the outpatient clinic. The traditional measures of outcome in neuro-oncological surgery are the extent of surgical resection, the postoperative neurological examination, and in-hospital morbidity and mortality. The next step in the journey of a patient with a brain tumour may involve radiotherapy and chemotherapy, and the toxicity of these treatments and its influence on quality of life must also be taken into account. A patient with a head injury may be left with severe impairment, or may be able to function with only mild or moderate neuropsychological sequelae. Similarly, patients with spinal injuries may have not only a neurological deficit but also devastating psychological consequences.

There are several reasons to measure outcome rigorously in neurosurgery. Firstly, technological advances and the spiralling cost of health care mean that we are increasingly held accountable and are required to demonstrate the benefit of our interventions, as well as the relative benefit of different interventions. Another factor is the extraordinary publicity given to relatively infrequent medical scandals.

In neurosurgery, outcome is mainly measured with the Glasgow Outcome Scale [9] and the Karnofsky Performance Index [10]. Two limitations of such traditional single-item measures have been highlighted: they provide little information about the diverse consequences of disease, and they fail to incorporate the patient's perspective [11]. Although single-item measures are simple, user-friendly and appropriate for measuring some individual properties, they have a number of scientific limitations. They are likely to be interpreted in different ways by different observers, may not discriminate fine degrees of an attribute, and may be unreliable because they do not produce consistent answers over time. Finally, the measurement properties of single-item measures are difficult to estimate.

The Glasgow Outcome Scale (GOS) is the most widely used outcome measure after traumatic brain injury and other neurosurgical conditions, but it is increasingly recognised to have important limitations. The main criticism of the GOS is that its categories are too broad to discriminate important clinical change. However, it is widely accepted and easy to administer. The Karnofsky Performance Index is also used to measure outcome, particularly in neuro-oncology. It was not designed as a quality of life measure, but it is frequently used as one. It was originally developed for patients with lung cancer, but has also been used in breast cancer. The Karnofsky scale measures the physical dimensions of life rather than the social and psychological ones. Evidence for its construct and concurrent validity is generally good [12], but studies have reported poor inter-rater agreement [13]. A major disadvantage of the Karnofsky scale is that patients are categorised by another person rather than rating themselves.

Multi-item instruments, in which each item addresses a different aspect of the same underlying construct, can overcome the scientific limitations of single items. Using more items increases the scope of the measure, makes it less open to variable interpretation, improves precision, and improves reliability by allowing random errors of measurement to average out [14].
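The claim that random error averages out as items are added can be illustrated with the Spearman-Brown prophecy formula, a standard psychometric result that this article does not itself cite. The minimal Python sketch below assumes the items are parallel (they measure the same construct with equal reliability and independent errors), an idealisation that real questionnaires only approximate; the single-item reliability of 0.4 is an arbitrary illustrative value, not data.

```python
def spearman_brown(single_item_reliability: float, n_items: int) -> float:
    """Predicted reliability of a score built from n_items parallel items.

    Spearman-Brown prophecy formula: k * r / (1 + (k - 1) * r), where r is the
    reliability of one item. Assumes parallel items with independent errors.
    """
    if not 0.0 <= single_item_reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    if n_items < 1:
        raise ValueError("a scale needs at least one item")
    r, k = single_item_reliability, n_items
    return k * r / (1 + (k - 1) * r)


# Arbitrary illustrative single-item reliability of 0.4.
for k in (1, 5, 10, 20):
    print(k, round(spearman_brown(0.4, k), 2))
```

Under these assumptions the predicted reliability rises from 0.4 for one item to about 0.77, 0.87 and 0.93 for 5, 10 and 20 items, with diminishing returns as more items are added.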

One of the difficulties in multidimensional measurement is determining whether generic measures are adequate to measure outcome or whether more disease-specific measures need to be developed. In neurological disease, a combination of generic and disease-specific questionnaires has been used effectively to determine outcome.

Developing And Evaluating Rating Scales

A. Development of Scales
Before embarking on the development of a new instrument, one should define exactly what the instrument is to measure. This is similar to writing the protocol for a clinical study. Specifying the measurement goals clearly will also help other users of the instrument to recognise whether it is appropriate for their own patients.

For example, the target population should be defined in terms of patient characteristics, inclusion/exclusion criteria and other factors that may influence scores on the outcome scale. The primary purpose of the instrument should be specified, i.e. whether the scale is intended to be discriminative, predictive or evaluative. It is possible to achieve all three in the same scale, but at the cost of maximum efficiency for each purpose. The specification should also set out the domains to be measured that are specific to the disease, and the number of items. Finally, the format of administration should be specified, i.e. face-to-face interview, telephone interview or self-administration.
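One way to make these decisions explicit, in the spirit of a study protocol, is to record them in a structured form before any items are written. The Python sketch below is purely illustrative: the class and field names (`InstrumentSpec`, `Purpose`, `Administration`) are hypothetical, the one-line glosses of the three purposes are one common interpretation, and the rule that every domain needs at least one item is an assumption added for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Purpose(Enum):
    DISCRIMINATIVE = "discriminative"  # distinguish between patients at one point in time
    PREDICTIVE = "predictive"          # classify patients against a criterion or future outcome
    EVALUATIVE = "evaluative"          # detect change within patients over time


class Administration(Enum):
    FACE_TO_FACE = "face-to-face interview"
    TELEPHONE = "telephone interview"
    SELF = "self-administration"


@dataclass
class InstrumentSpec:
    """Hypothetical record of design decisions fixed before item generation."""
    target_population: str
    inclusion_criteria: list[str]
    exclusion_criteria: list[str]
    primary_purpose: Purpose
    domains: list[str]
    n_items: int
    administration: Administration

    def __post_init__(self) -> None:
        # Assumed sanity checks for the example, not rules from the text.
        if not self.domains:
            raise ValueError("specify at least one domain to be measured")
        if self.n_items < len(self.domains):
            raise ValueError("each domain needs at least one item")


# Hypothetical specification for an evaluative, self-completed spinal surgery scale.
spec = InstrumentSpec(
    target_population="adults undergoing elective lumbar spine surgery",
    inclusion_criteria=["aged 18 or over"],
    exclusion_criteria=["unable to complete a questionnaire"],
    primary_purpose=Purpose.EVALUATIVE,
    domains=["pain", "physical function", "emotional well-being"],
    n_items=15,
    administration=Administration.SELF,
)
print(spec.primary_purpose.value, spec.administration.value)
```

Writing the specification down in this way makes it easy for other users to check whether their own patients and purpose match the ones the instrument was designed for.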

The next step is item generation, which simply means assembling a pool of all potentially relevant items. Sources include interviews with patients, patient focus groups, searching the literature, discussion with health care professionals and review of other HR-QOL measures. Item generation is followed by item reduction, in which the investigator selects the items most suitable for the measure. One way to do this is to ask patients to rate the importance of each item on a scale such as a 5-point Likert scale, ranging from 'very important' to 'not important'. Results are expressed as frequency (the proportion of patients experiencing a particular item), importance (the mean importance score attached to each item) and impact, which is the product of frequency and importance.
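The frequency, importance and impact calculation can be sketched in a few lines of Python. The coding scheme is an assumption made for illustration: each patient's response to an item is 0 if they do not experience it, otherwise an importance rating from 1 (not important) to 5 (very important), and mean importance is taken over the patients who experience the item. The item names and ratings are invented.

```python
from statistics import mean
from typing import Sequence


def item_impact(ratings: Sequence[int]) -> tuple[float, float, float]:
    """Return (frequency, importance, impact) for one candidate item.

    ratings holds one value per patient: 0 if the patient does not experience
    the item, otherwise importance on a 5-point scale (1 = not important,
    5 = very important). This coding is an assumption for the example.
    """
    experienced = [r for r in ratings if r > 0]
    frequency = len(experienced) / len(ratings)  # proportion experiencing the item
    importance = mean(experienced) if experienced else 0.0
    return frequency, importance, frequency * importance


# Invented pilot ratings from five patients for three candidate items.
item_pool = {
    "headache": [5, 4, 0, 5, 3],
    "fatigue": [2, 3, 3, 2, 0],
    "seizures": [0, 0, 5, 0, 0],
}
ranked = sorted(item_pool, key=lambda item: item_impact(item_pool[item])[2], reverse=True)
for item in ranked:
    frequency, importance, impact = item_impact(item_pool[item])
    print(f"{item}: frequency={frequency:.2f} importance={importance:.2f} impact={impact:.2f}")
```

Here "seizures" is rated highly important by the one patient who experiences it, but its low frequency gives it the lowest impact. Where to set the cut-off for retaining items is a judgement for the investigator and is not specified in the text.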

B. Evaluation of Scales
The aim of scale evaluation is to determine whether an instrument satisfies the criteria for rigorous measurement. Formal evaluation matters for two reasons. Firstly, measurement properties are sample dependent, so the performance of a measure in a specific application is more important than its performance in general. Secondly, data are only as good as the instruments used to collect them: sophisticated statistical methods and advances in study design will do little to overcome the damage done by poor-quality measures. Hobart and Thompson [11] recommend evaluating six measurement properties: data quality, scaling assumptions, acceptability, reliability, validity and responsiveness.

Data Quality

Indicators of data quality, such as the percentage of item non-response and the percentage of computable scores, determine the extent to which an instrument can be incorporated into a clinical setting. Like all psychometric properties, these indicators vary across samples. If the measure is patient-reported, they reflect respondents' understanding and acceptance of the measure and help to identify items that may be irrelevant, confusing or upsetting to patients. If the measure is clinician-reported, they reflect how readily the measure can be incorporated into a clinical setting.
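Both indicators are simple percentages and can be sketched in Python as follows. Whether a scale score counts as computable depends on the scoring rule chosen by an instrument's developers; the `max_missing` parameter below is a hypothetical stand-in for such a rule, and the responses are invented.

```python
from typing import Optional, Sequence

# One row per respondent, one column per item; None marks an unanswered item.
Responses = Sequence[Sequence[Optional[int]]]


def data_quality(responses: Responses, max_missing: int = 0) -> tuple[list[float], float]:
    """Return (percentage missing for each item, percentage of computable scores).

    max_missing is a hypothetical scoring rule: a respondent's scale score is
    counted as computable if no more than this many items are unanswered.
    """
    n_respondents = len(responses)
    n_items = len(responses[0])
    pct_missing = [
        100 * sum(row[j] is None for row in responses) / n_respondents
        for j in range(n_items)
    ]
    n_computable = sum(
        sum(value is None for value in row) <= max_missing for row in responses
    )
    return pct_missing, 100 * n_computable / n_respondents


# Invented responses from four patients to a three-item scale.
sample = [
    [1, 2, 3],
    [None, 2, 3],
    [2, 2, None],
    [1, 2, 3],
]
print(data_quality(sample))                 # strict rule: no missing items allowed
print(data_quality(sample, max_missing=1))  # lenient rule: one missing item tolerated
```

An item with a conspicuously high non-response percentage is a candidate for being irrelevant, confusing or upsetting and is worth reviewing with patients.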

Scaling Assumptions

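Tests of scaling assumptions determine whether it is legitimate to generate scores for an instrument using the algorithms proposed by its developers. For example, the Medical Outcomes Study 36-item Short Form Health Survey (SF-36) generates scale scores by summing item responses, which is legitimate only if these assumptions are satisfied in the sample being studied.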

Acceptability

Acceptability is the extent to which the spectrum of health measured by a scale matches the distribution of health in the study sample, and is determined simply by examining score distributions. Ideally, the observed scores should span the entire range of the scale, the mean score should be near the midpoint, and floor and ceiling effects (the percentage of the sample scoring the minimum and maximum possible score, respectively) should be small. McHorney recommends that floor and ceiling effects be less than 15%.
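Examining a score distribution in this way can be sketched in Python. The 15% threshold is the one quoted above; the 0-100 range and the scores are invented for illustration.

```python
from statistics import mean
from typing import Sequence


def acceptability(scores: Sequence[float], scale_min: float, scale_max: float,
                  threshold_pct: float = 15.0) -> dict:
    """Summarise a score distribution against the scale's possible range."""
    n = len(scores)
    floor_pct = 100 * sum(s == scale_min for s in scores) / n
    ceiling_pct = 100 * sum(s == scale_max for s in scores) / n
    return {
        "observed_range": (min(scores), max(scores)),
        "mean": mean(scores),
        "midpoint": (scale_min + scale_max) / 2,
        "floor_pct": floor_pct,
        "ceiling_pct": ceiling_pct,
        # McHorney's suggested limit, as quoted in the text: below 15%.
        "floor_ok": floor_pct < threshold_pct,
        "ceiling_ok": ceiling_pct < threshold_pct,
    }


# Invented scores on a hypothetical 0-100 scale.
scores = [100, 100, 95, 80, 100, 60, 100, 45, 100, 90]
print(acceptability(scores, scale_min=0, scale_max=100))
```

In this invented sample half the patients score the maximum, so the ceiling effect (50%) is well above the suggested limit and the mean (87) lies far from the midpoint (50): the scale would struggle to detect further improvement in these patients.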

Reliability

Reliability is the extent to which a measure is free from random error, and is expressed as a reliability coefficient. For health measures, the most important types of reliability are internal consistency and reproducibility (test-retest, inter-rater and intra-rater). Internal consistency is the extent to which the items within a scale are reliable measures of the same construct. This type of reliability applies only to multi-item measures, such as the SF-36 and the Barthel Index, and is determined using Cronbach's alpha coefficient. Reproducibility is the agreement between two or more ratings of the same person. Test-retest reliability is the agreement between ratings of the same patient made on two or more occasions while the patient's condition is stable. Intra-rater reproducibility is the agreement between two or more ratings of the same patient made by the same observer, and inter-rater reproducibility is the agreement between ratings of the same patient made by different observers. For continuous data, reproducibility should be reported as an intraclass correlation coefficient (ICC).
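Cronbach's alpha is computed from the item variances and the variance of the total score, using the standard formula alpha = k/(k - 1) x (1 - sum of item variances / variance of totals) for k items. The Python sketch below assumes complete data (no missing responses); the scores are invented.

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha(item_scores: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for complete data: rows are respondents, columns are items.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
    Sample variances are used throughout; the ratio is the same with
    population variances as long as one kind is used consistently.
    """
    k = len(item_scores[0])
    if k < 2:
        raise ValueError("internal consistency needs at least two items")
    columns = list(zip(*item_scores))            # scores for each item
    totals = [sum(row) for row in item_scores]   # each respondent's scale total
    item_variance_sum = sum(variance(col) for col in columns)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Invented responses from five patients to a three-item scale (1-5 per item).
responses = [
    [4, 5, 4],
    [2, 2, 3],
    [3, 3, 3],
    [5, 4, 5],
    [1, 2, 1],
]
print(round(cronbach_alpha(responses), 2))  # prints 0.95
```

Alpha is high here because patients who score high on one item tend to score high on the others, so the total varies much more than the items would if they were unrelated.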

Validity

The validity of a health measure is the extent to which it measures what it purports to measure. Validity is difficult to establish. Firstly, it can only be supported, never proven. Secondly, there is no consensus on the minimum evidence required to establish validity, and evidence supporting validity in one context does not guarantee validity in another.

The strongest evidence of an instrument's validity is provided by examining its correlations with other measures collected at the same time.
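Such correlations are often summarised with Pearson's correlation coefficient; a rank-based coefficient such as Spearman's may suit ordinal scores better. The pure-Python sketch below is a minimal illustration: the "established measure" is hypothetical and the paired scores are invented. It is generally advisable to state the expected direction and rough size of each correlation in advance, since the text does not say what value counts as supportive.

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two sets of scores from the same patients."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need at least two paired scores")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    s_xy = sum(a * b for a, b in zip(dx, dy))
    return s_xy / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


# Invented scores from five patients on a new scale and on a hypothetical
# established measure of the same construct, collected at the same visit.
new_scale = [10, 20, 30, 40, 50]
established = [12, 25, 29, 41, 48]
print(round(pearson_r(new_scale, established), 2))  # prints 0.99
```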

Responsiveness

Responsiveness is the ability of an instrument to detect clinically significant change in the attribute it measures. There are various methods of assessing responsiveness, but no clear consensus on which is optimal. Most methods compare scores at two points in time, usually before and after an intervention. It is recommended that responsiveness be reported as an effect size (standardised change score). The most commonly reported effect size is the mean change score divided by the standard deviation of the baseline scores; the larger the effect size, the more responsive the instrument.
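The effect size described above can be sketched directly in Python. The sample standard deviation of the baseline scores is used, and the before and after scores are invented.

```python
from statistics import mean, stdev
from typing import Sequence


def effect_size(baseline: Sequence[float], follow_up: Sequence[float]) -> float:
    """Mean change score divided by the standard deviation of baseline scores."""
    if len(baseline) != len(follow_up) or len(baseline) < 2:
        raise ValueError("need paired before/after scores for at least two patients")
    changes = [after - before for before, after in zip(baseline, follow_up)]
    return mean(changes) / stdev(baseline)


# Invented scores for five patients before and after an intervention.
before = [40, 50, 45, 55, 60]
after = [55, 62, 50, 70, 68]
print(round(effect_size(before, after), 2))  # prints 1.39
```

Cohen's conventional benchmarks (0.2 small, 0.5 moderate, 0.8 large) are often used to interpret effect sizes, although they are rules of thumb rather than clinical thresholds and are not part of the text above.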


Single-item measures have limitations when used to assess functional outcome in neurosurgery. There is a need to develop patient-based multi-item measures that can be used to assess the effects of our therapeutic interventions. Currently, very few patient-based outcome measures are available in neurosurgery. An important goal is to design and develop a range of measures for different conditions, such as head injury, subarachnoid haemorrhage, brain tumours and spinal surgery, since not every multi-item measure can be applied across differing pathologies. Any attempt to develop patient-based outcome scales should have practical applications for neurosurgeons. Rather than serving only as a research tool for auditing outcome, an outcome scale should be used in everyday practice so that we can strive to improve our own results. Finally, the outcome scale has to be patient-based, scientific and cost-effective, so that the three key players, i.e. patient, doctor and manager, are all satisfied.


  1. World Health Organisation Quality of Life Group (WHOQOL) (1993b).
  2. Rokeach, M. (1973). The Nature of Human Values. New York: Free Press.
  3. Kaplan, R. M., Feeny, D. and Revicki, D. A. (1993a). Methods for assessing relative importance in preference based outcome measures. Quality of Life Research, 2: 467-75.
  4. Orth-Gomer, K., Britton, M. and Rehnqvist, N. (1979). Quality of care in an outpatient department: the patient's view. Social Science and Medicine, 13A: 347-57.
  5. Thomas, M. R. and Lyttle, D. (1980). Patients expectations about success of treatment and reported relief from low back pain. Journal of Psychosomatic Research, 24: 297-301.
  6. Jachuck, S. J., Brierley, H., Jachuck, S. and Willcox, P. M. (1982). The effect of hypotensive drugs on the quality of life. Journal of the Royal College of General Practitioners, 32: 103-5.
  7. Slevin, M. L., Plant, H., Lynch, D. et al. (1988). Who should measure quality of life, the doctor or the patient? British Journal of Cancer, 57: 109-12.
  8. Guilford, J. P. (1954). Psychometric Methods. New York: McGraw-Hill.
  9. Jennett, B. and Bond, M. (1975). Assessment of outcome after severe brain damage: a practical scale. Lancet, 1: 480-4.
  10. Karnofsky, D. A., Abelmann, W. H., Craver, L. F. et al. (1948). The use of nitrogen mustards in the palliative treatment of carcinoma. Cancer, 1: 634-56.
  11. Hobart, J. C. and Thompson, A. J. Measurement of neurological outcomes. Neurological Outcome Measures Unit, Institute of Neurology, Queen Square, London, UK.
  12. Mor, V., Laliberte, L., Morris, J. N. and Wiemann, M. (1984). The Karnofsky performance status scale: an examination of its reliability and validity in a research setting. Cancer, 53: 2002-7.
  13. Hutchinson, T. A., Boyd, N. F., Feinstein, A. R. et al. (1979). Scientific problems in clinical scales as demonstrated in the Karnofsky Index of Performance Status. Journal of Chronic Diseases, 32: 661-6.
  14. Nunnally, J. C. (1978). Psychometric Theory. New York: McGraw-Hill.

Article used with kind permission from Jabir Nagaria
