Health Information Systems for Low-Income Countries: An Overview
Canadian Society for International Health

 


Appendix 5 - Descriptive statistics
Paul Fisher

Why use statistics?

Populations require health care decisions.
Health care decisions require health care information.
Health statistics are a kind of health information.
 
Natural variation exists among the individuals that make up a population.
Natural variation may mask differences between populations.
Statistical tools are needed to reveal true differences.
 

Descriptive Statistics

The first step of any analysis is to describe your data.
Type of description depends on the nature of the data.
  • Qualitative (categorical) variables are described by counts and proportions.
  • Quantitative (continuous) variables are described by the shape, location, and dispersion of their distribution.
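
As a minimal sketch of describing a qualitative variable by counts and proportions, the following uses a made-up list of residence categories. The data and the variable names are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical residence records for ten people (illustrative data only).
residence = ["urban", "rural", "urban", "urban", "rural",
             "urban", "rural", "urban", "urban", "urban"]

counts = Counter(residence)  # count of each category
n = len(residence)
proportions = {category: count / n for category, count in counts.items()}

for category in sorted(counts):
    print(f"{category}: count={counts[category]}, "
          f"proportion={proportions[category]:.2f}")
```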

Shape of a Distribution

The shape of a distribution is its configuration of points when plotted on a graph.
The data's symmetry, modality, and kurtosis describe the shape.
Useful graphs for describing distribution:
  • histogram
  • stem-and-leaf plot
  • dot plot
  • box plot
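
One of the graphs listed above, the stem-and-leaf plot, can be sketched in a few lines of text output. The ages are made up, the function name `stem_and_leaf` is hypothetical, and the sketch assumes non-negative whole numbers with the tens digit as the stem.

```python
from collections import defaultdict

# Hypothetical ages in years (illustrative data only).
ages = [23, 25, 31, 34, 34, 38, 42, 45, 47, 51, 58, 63]

def stem_and_leaf(values):
    """Group values by tens (stem) and list the units digits (leaves)."""
    stems = defaultdict(list)
    for value in sorted(values):
        stems[value // 10].append(value % 10)
    return dict(stems)

plot = stem_and_leaf(ages)
for stem in sorted(plot):
    leaves = " ".join(str(leaf) for leaf in plot[stem])
    print(f"{stem} | {leaves}")
```

Read sideways, the rows of leaves give a rough picture of the shape of the distribution while keeping the individual values visible.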

Location of a Distribution

The location of a distribution is the position of its values when plotted on a graph.
The distribution's centre summarizes the location.
Common measures of location are:
  • Mean (arithmetic average)
  • Median (centre value)
  • Mode (most commonly occurring value)
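
The three measures of location can be compared on a small data set. This is a sketch using Python's standard `statistics` module; the visit counts are invented for illustration.

```python
import statistics

# Hypothetical number of clinic visits per person (illustrative data only).
visits = [1, 2, 2, 3, 4, 2, 7]

mean_visits = statistics.mean(visits)      # arithmetic average
median_visits = statistics.median(visits)  # centre value of the sorted list
mode_visits = statistics.mode(visits)      # most commonly occurring value

print(mean_visits, median_visits, mode_visits)
```

In this example the single large value (7) pulls the mean above the median, which is one reason to report more than one measure of location.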

Dispersion of a Distribution

The dispersion of a distribution is its spread (variability) around some central point.
The dispersion is as important as the location of a distribution.
Common measures of dispersion are:
  • Standard deviation
  • Interquartile range
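
Both measures of dispersion can be computed with Python's standard `statistics` module, as in the sketch below. The lengths of stay are invented, and the quartiles use the module's default "exclusive" method, which is only one of several conventions; other software may give slightly different quartiles for the same data.

```python
import statistics

# Hypothetical lengths of hospital stay in days (illustrative data only).
stays = [2, 4, 4, 4, 5, 5, 7, 9]

# Sample standard deviation (divisor n - 1).
sd = statistics.stdev(stays)

# Quartile cut points; n=4 splits the data into four equal-probability groups.
q1, q2, q3 = statistics.quantiles(stays, n=4)
iqr = q3 - q1  # interquartile range: spread of the middle half of the data

print(f"standard deviation = {sd:.2f}, interquartile range = {iqr}")
```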

Association between Variables

Association refers to the degree to which values "go together".
If there is a tendency for variables to go together in the same direction, a positive association is said to exist.
If there is a tendency for variable values to go in opposite directions, a negative association is said to exist.
If there is no association, variables are said to be independent.
Examples of measures of association include:
  • mean difference (paired and independent samples)
  • correlation coefficient
  • regression coefficient
  • relative risk
  • odds ratio
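
Three of the measures listed above can be sketched with plain arithmetic. The 2x2 table and the function name `pearson_r` are assumptions for illustration, not data from any study.

```python
import math

# Hypothetical 2x2 table (illustrative counts only):
#                 diseased   not diseased
# exposed            20           80
# unexposed          10           90
a, b = 20, 80
c, d = 10, 90

risk_exposed = a / (a + b)
risk_unexposed = c / (c + d)
relative_risk = risk_exposed / risk_unexposed  # ratio of the two risks
odds_ratio = (a * d) / (b * c)                 # ratio of the two odds

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Values that rise together give a positive coefficient;
# values that move in opposite directions give a negative one.
r_positive = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_negative = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
```

A relative risk or odds ratio of 1, or a correlation coefficient of 0, corresponds to no association on that measure.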

Some Basic Definitions

Inference is the capacity to say something about a population based on examination of a sample or samples.

Regardless of the inferential method used, it is important to keep clearly in mind the distinction between:

  • the parameters (numerical summaries of the population) being inferred, and
  • the statistics (numerical summaries of the sample) used to infer them.

Although the two are related, they are not interchangeable. Distinctions to keep in mind are:

Parameters
  • population based
  • unknown
  • hypothetical
  • constants
Statistics
  • sample based
  • calculated
  • random variables

Different symbols are used to represent sample statistics and population parameters. For example:

p̂ ("p hat") may be used to represent a sample proportion (statistic)
p may be used to represent a population proportion (parameter)
x̄ ("x bar") may be used to represent a sample mean (statistic)
µ may be used to represent a population mean (parameter)
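
The distinction between a constant parameter and a statistic that varies from sample to sample can be illustrated with a simulated population. In this sketch the population is invented, so p is known only because we constructed it; in practice it would be unknown. The seed is there only to make the sketch repeatable.

```python
import random

# Hypothetical population of 1,000 people: 1 = vaccinated, 0 = not vaccinated.
population = [1] * 300 + [0] * 700
p = sum(population) / len(population)  # parameter: a constant (0.3 here)

rng = random.Random(0)  # seeded only so that the sketch is repeatable
p_hats = []
for _ in range(5):
    sample = rng.sample(population, 50)       # one sample of 50 people
    p_hats.append(sum(sample) / len(sample))  # statistic: differs by sample

print("p =", p)
print("p hat for five samples:", p_hats)
```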

A statistic is a number that can be computed from data, involving no unknown parameters. As a function of a random sample, a statistic is a random variable. Statistics are used to estimate parameters, and to test hypotheses.

A parameter is a numerical property of a population, such as its mean.

A variable is a numerical value or a characteristic that can differ from individual to individual.

Types of variables are:

Qualitative Variables
  • Variable whose values are adjectives
Categorical Variables
  • Value ranges over categories, such as:
    • urban /rural
    • male / female
    • native / immigrant / refugee
    • Azeri / Russian / Georgian
Ordinal Variables
  • Values have a natural order, such as:
    • thin / normal / fat
    • infant / child / adolescent / adult
Random Variables
  • An assignment of numbers to the possible outcomes of a random process
  • For example, consider a therapy for which outcomes are coded:
    • 0 = unchanged
    • 1 = improved
    • 2 = worse
Quantitative Variables
  • Variable whose values are numbers AND whose values have logical meaning
  • Variables that have units of measure
    • height and weight are quantitative
    • passport numbers are not quantitative
Discrete Variables
  • A variable whose set of possible values is countable
  • Examples of discrete variables are:
    • number of people in a community
    • number of doses of a vaccine on hand
Continuous Variables
  • A variable whose set of possible values is uncountable
  • Examples of continuous variables are:
    • temperature
    • height
    • age
  • In practice, one can never measure a continuous variable to infinite precision, so continuous variables are sometimes approximated by discrete variables

A population is a collection of units being studied and about which information is desired. Units can be people, programs, institutions, time periods, procedures, etc.

Much of statistics is concerned with inferring the characteristics of an entire population (parameters) from the characteristics of a random sample of units from the population. Examples of population parameters are:

  • Mean - the arithmetic average of ALL of the values of a variable for a WHOLE population
  • Median - the middle value of ALL of the values of a variable for a WHOLE population
  • Standard Deviation - a measure of the spread of ALL of the values of a variable around the mean for a WHOLE population

A sample is a subset (size = n) of the population (size = N) selected for study.
  • A simple random sample is one where every possible subset of the same size has the same chance of being selected.
  • A probability sample is one where every element has a known chance of ending up in the sample.
  • A stratified sample is one where elements are selected separately from different strata.
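
The difference between drawing from the whole population and drawing separately within strata can be sketched as follows. The population, the stratum labels, and the sample sizes are all assumptions for illustration; the seed only makes the sketch repeatable.

```python
import random

rng = random.Random(42)  # seeded only so that the sketch is repeatable

# Hypothetical population of 100 units: unit id -> stratum label.
population = {i: ("urban" if i < 60 else "rural") for i in range(100)}

# Random sample: drawn from the whole population, so every subset
# of size n has the same chance of being selected.
n = 10
simple_sample = rng.sample(sorted(population), n)

# Stratified sample: group the units by stratum, then draw separately
# within each stratum.
strata = {}
for unit, stratum in population.items():
    strata.setdefault(stratum, []).append(unit)

per_stratum = 5
stratified_sample = {
    stratum: rng.sample(units, per_stratum)
    for stratum, units in strata.items()
}
```

The stratified design guarantees five units from each stratum, whereas the sample drawn from the whole population may by chance contain few or no rural units.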

Sample Statistics

Mean
  • The mean is an arithmetic average of the values of a sample.
  • The expected value of the sample mean is the population mean.
  • This is a statistic commonly used to estimate the population mean.
  • Suppose there are n data values $x_1, x_2, \ldots, x_n$;

    then the sample mean is $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
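
The statement that the expected value of the sample mean is the population mean can be checked on a very small population by listing every possible sample. This is a sketch with an invented population of three values; the function name `sample_mean` is hypothetical.

```python
from itertools import combinations

# Hypothetical population of three values (illustrative data only).
population = [2, 4, 9]
population_mean = sum(population) / len(population)

def sample_mean(values):
    """Arithmetic average: the sum of the values divided by their number."""
    return sum(values) / len(values)

# Every possible sample of size 2; each is equally likely
# under random sampling.
n = 2
sample_means = [sample_mean(s) for s in combinations(population, n)]

# Averaging the sample mean over all equally likely samples
# gives its expected value.
expected_sample_mean = sum(sample_means) / len(sample_means)

print(sample_means, expected_sample_mean, population_mean)
```

No single sample mean need equal the population mean, but their average over all possible samples does.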

Median

  • The median is the middle of the ordered list of values.
  • If the list has an even number of entries, the median is taken here to be the smaller of the two middle numbers. Many texts instead use the average of the two middle numbers.
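
Python's standard `statistics` module offers both common conventions for lists with an even number of entries: `median_low` returns the smaller of the two middle values, as in the definition above, while `median` returns their average. The lists below are invented for illustration.

```python
import statistics

odd_list = [7, 1, 5]      # sorted: 1, 5, 7
even_list = [7, 1, 5, 3]  # sorted: 1, 3, 5, 7

# Odd number of entries: the middle value of the sorted list.
median_odd = statistics.median_low(odd_list)

# Even number of entries: the two conventions differ.
median_even_low = statistics.median_low(even_list)  # smaller middle value
median_even_avg = statistics.median(even_list)      # average of the two

print(median_odd, median_even_low, median_even_avg)
```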

Standard Deviation

  • The sample standard deviation is an estimator of the standard deviation of a population.
  • It measures how dispersed the sample is around the sample mean (x̄).
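
As a sketch of the calculation, the sample standard deviation can be computed by hand and compared with `statistics.stdev` from Python's standard library. The divisor n - 1 is an assumption here, chosen because it matches that library function and the unbiased variance described in this appendix; the birth weights are invented.

```python
import math
import statistics

# Hypothetical birth weights in kilograms (illustrative data only).
weights = [2.9, 3.1, 3.4, 3.6, 4.0]

n = len(weights)
x_bar = sum(weights) / n  # sample mean

# Square each deviation from the sample mean, add them up,
# divide by n - 1, then take the square root.
sum_of_squares = sum((x - x_bar) ** 2 for x in weights)
s = math.sqrt(sum_of_squares / (n - 1))

print(f"mean = {x_bar:.2f} kg, standard deviation = {s:.3f} kg")
```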

Sample Variance

  • The sample variance is the square of the sample standard deviation.
  • When computed with the divisor n - 1, it is an unbiased estimator of the population variance.
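
Unbiasedness can be checked on a tiny population by averaging the sample variance over every possible sample. The sketch below assumes independent draws (sampling with replacement) from an invented population of three values, and compares the divisor n - 1 (`statistics.variance`) with the divisor n (`statistics.pvariance`).

```python
import statistics
from itertools import product

# Hypothetical population of three values (illustrative data only).
population = [2, 4, 9]
population_variance = statistics.pvariance(population)  # divisor N

# All ordered samples of size 2 drawn with replacement; each is equally likely.
samples = list(product(population, repeat=2))

# Sample variance with divisor n - 1, averaged over all samples.
mean_unbiased = sum(statistics.variance(s) for s in samples) / len(samples)

# The same calculation with divisor n, for comparison.
mean_biased = sum(statistics.pvariance(s) for s in samples) / len(samples)

print(population_variance, mean_unbiased, mean_biased)
```

Under these assumptions the divisor n - 1 averages out to the population variance, while the divisor n underestimates it.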


© 2005 Canadian Society for International Health and the Contributors
last update: 2005-06-28