Glossary
Accuracy. The degree to which a sample statistic would correspond to the population parameter it is meant to estimate if there were no random error. Accuracy is high when bias is low.
Active block. As defined by Survey Sampling, Inc., a block is the set of 100 numbers identified by the first two digits of the last four digits of the telephone number. In the telephone number 255-4200, "42" is the block. A block is termed working, or active, if one or more listed telephone numbers are found in that block.
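As a small illustration of the block definition, the following Python sketch pulls the block out of a telephone number. The function name block_of is hypothetical, and the code assumes a number whose last four digits form the line number, as in the 255-4200 example.

```python
def block_of(phone_number: str) -> str:
    """Return the block: the first two of the last four digits."""
    digits = [ch for ch in phone_number if ch.isdigit()]
    if len(digits) < 4:
        raise ValueError("need at least four digits")
    last_four = digits[-4:]
    return "".join(last_four[:2])


print(block_of("255-4200"))  # prints 42
```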
Allocation. The method of distributing sample sizes to the strata in a stratified sample. Two commonly used methods are proportional allocation, in which the sample size of a stratum is proportional to the population size of the stratum, and optimum allocation, in which the sample sizes are allocated to the strata in such a manner as to minimize the standard error for some particular estimate.
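The two allocation methods can be sketched in Python as follows. This is a minimal illustration, not a prescribed procedure: the function names and the stratum figures are hypothetical, and Neyman allocation is used here as one common form of optimum allocation (it assumes equal per-unit costs across strata and known stratum standard deviations). Results are left unrounded.

```python
def proportional_allocation(stratum_sizes, total_sample):
    """n_h = n * N_h / N for each stratum h."""
    population = sum(stratum_sizes.values())
    return {h: total_sample * size / population
            for h, size in stratum_sizes.items()}


def neyman_allocation(stratum_sizes, stratum_sds, total_sample):
    """n_h = n * N_h * S_h / sum(N_k * S_k) for each stratum h."""
    products = {h: stratum_sizes[h] * stratum_sds[h] for h in stratum_sizes}
    total = sum(products.values())
    return {h: total_sample * p / total for h, p in products.items()}


# Hypothetical strata: population sizes and standard deviations.
sizes = {"urban": 6000, "rural": 4000}
sds = {"urban": 10.0, "rural": 30.0}

print(proportional_allocation(sizes, 500))
print(neyman_allocation(sizes, sds, 500))
```

With these made-up figures, proportional allocation follows population share, while Neyman allocation shifts sample toward the more variable rural stratum.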
Area samples. Samples that incorporate the selection of certain explicit geographic areas as part of the sample design; typically used in multi-stage door-to-door studies.
Bias. A tendency to underestimate or overestimate a population value of interest.
Block. Normally a rectangular piece of land, bounded by four streets. However, a block may also be irregular in shape or bounded by railroad tracks, streams, or other features. Blocks do not cross the boundaries of counties, census tracts or block numbering areas (BNAs). Census data are tabulated by block in all urbanized areas but much information is suppressed to protect the confidentiality of census information. A block is the smallest level of census geography.
Block group. A combination of numbered blocks that is a subdivision of a census tract or untracted area. Block groups are defined in areas for which block statistics are prepared.
Callbacks. Repeated attempts to contact a respondent who could not be interviewed on an earlier attempt. A reasonable number of callbacks should be made to increase the overall response and the probability that the survey results are representative of the population.
CAPI (computer-assisted personal interviewing). Data collection method in which the researcher reads questions to the respondents from the computer screen and keys in the answers.
CASI (computer-assisted self-interviewing). Data collection method in which the researcher directs respondents to a computer, and the respondents enter their own answers into the computer.
CATI (computer-assisted telephone interviewing). Data collection method in which researchers use random digit dialing to phone potential respondents, ask questions as directed by the computer, and key the responses directly into the system.
CAWI (computer-assisted web interviewing). Data collection method using web-based servers.
CGFS (computer generated fax survey). Data collection method in which researchers use random digit dialing to contact respondents' fax machines and fax a survey to them. The respondent then completes the survey and faxes it back to the researcher.
CPS (Current Population Survey). An ongoing national survey of very high quality conducted by the Bureau of the Census. The CPS is the best source of current population statistics. However, most small areas such as states and MSAs are not individually reported.
Census. An enumeration of the total population of interest. Since no sample is selected from the population, there is no sampling error. However, nonsampling errors are still possible in a census.
Cleaning. To "clean" a data file is to check for wild codes and inconsistent responses (see Consistency Check); to verify that the file has the correct and expected number of records, cases, and cards or records per case; and to correct errors found.
Cluster sample. The selection of groups of elements from a population. Typically used in multi-stage area probability designs to improve the efficiency of fieldwork by ensuring that groups of neighboring households are interviewed. Since neighbors tend to share certain characteristics, clustering almost always reduces sampling efficiency; that is, a clustered sample usually yields less precise estimates than an unclustered sample of the same size.
Codebook. Generically, any information on the structure, contents, and layout of a data file. Typically, a codebook includes: column locations and widths for each variable; definitions of different record types; response codes for each variable; codes used to indicate non-response and missing data; exact questions and skip patterns used in a survey; and other indications of the content of each variable. Many codebooks also include frequencies of response. Codebooks vary widely in quality and amount of information included. They may be machine-readable, paper copy, or microfiche.
Coefficient of variation (C.V.). The ratio of the standard error for a variable to the mean value of the variable. This is used to measure the imprecision in survey estimates introduced by sampling. A coefficient of variation of 1 percent would indicate that an estimate could vary slightly due to sampling error, while a coefficient of variation of 50 percent means that the estimate is very imprecise. The most common way to improve the coefficient of variation is to increase the sample size, which is typically expensive to accomplish.
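The ratio in this definition is simple to compute. The Python sketch below expresses it as a percentage; the function name and the figures are hypothetical and serve only to illustrate the arithmetic.

```python
def coefficient_of_variation(standard_error: float, estimate: float) -> float:
    """Return the C.V. as a percentage: 100 * standard error / mean."""
    if estimate == 0:
        raise ValueError("the C.V. is undefined when the estimate is zero")
    return 100.0 * standard_error / abs(estimate)


# Hypothetical estimate: mean income of 30,000 with a standard error of 5,000.
cv = coefficient_of_variation(5000, 30000)
print(f"{cv:.1f}%")  # prints 16.7%
```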
Completion rate. The percentage of qualified respondents from whom a completed interview is obtained.
Confidence intervals. A range around the sample estimate in which the population value is expected to fall with a specified degree of confidence, usually 95% or 90% of the time.
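A confidence interval can be sketched in Python as the estimate plus or minus a multiple of its standard error. The sketch assumes a large sample, so that the normal approximation applies; the function name and figures are hypothetical.

```python
from statistics import NormalDist


def confidence_interval(estimate, standard_error, level=0.95):
    """Return (low, high) using the normal approximation."""
    # Two-sided critical value, about 1.96 for a 95% level.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * standard_error
    return estimate - half_width, estimate + half_width


low, high = confidence_interval(30000, 5000, level=0.95)
print(f"{low:.0f} to {high:.0f}")  # prints 20200 to 39800
```

Passing level=0.90 gives a narrower interval, since less confidence is demanded of the same data.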
Consistency Check. A process of data cleaning which looks for inappropriate responses to branched questions. For instance, one question might ask if the respondent attended church last week; a response of "no" should skip the follow-up questions about church attendance and code the answers to those questions as "inapplicable." If those questions were coded any other way than "inapplicable," this would be inconsistent with the skip patterns of the survey instrument.
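The church attendance example can be turned into a small Python check. This is only a sketch: the record layout, the field names, and the use of 9 as the "inapplicable" code are all hypothetical.

```python
INAPPLICABLE = 9  # hypothetical code for "inapplicable"


def inconsistent_cases(records):
    """Return the ids of cases whose follow-up answers violate the skip."""
    flagged = []
    for rec in records:
        skipped = rec["attended"] == "no"
        if skipped and any(a != INAPPLICABLE for a in rec["follow_ups"]):
            flagged.append(rec["id"])
    return flagged


records = [
    {"id": 1, "attended": "yes", "follow_ups": [1, 2]},
    {"id": 2, "attended": "no", "follow_ups": [9, 9]},
    {"id": 3, "attended": "no", "follow_ups": [9, 1]},  # should be all 9s
]
print(inconsistent_cases(records))  # prints [3]
```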
Cooperation rate. The percentage of in-scope individuals (or organizations) who complete a survey after being contacted. The denominator for the cooperation rate excludes individuals (or organizations) whom one has tried unsuccessfully to contact. Thus, the cooperation rate for a survey will be higher than its response rate unless all selected individuals (or organizations) are contacted.
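The difference between the two rates lies only in the denominator, as the following Python sketch shows. The function names and counts are hypothetical, and the response rate uses one common definition (completed cases as a percentage of all in-scope sample members).

```python
def cooperation_rate(completes: int, contacted: int) -> float:
    """Completed surveys as a percentage of in-scope cases contacted."""
    return 100.0 * completes / contacted


def response_rate(completes: int, in_scope_selected: int) -> float:
    """Completed surveys as a percentage of all in-scope cases selected."""
    return 100.0 * completes / in_scope_selected


# Hypothetical survey: 1,000 in-scope cases selected, 800 contacted,
# 600 completed.
print(cooperation_rate(600, 800))   # prints 75.0
print(response_rate(600, 1000))     # prints 60.0
```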
Coverage. The extent of correspondence between the target population and the sampling frame. Ideally, all members of the target population are included in the sampling frame. However, this is infrequently the case for major surveys. Coverage is rarely estimable in precise terms; however, survey designers are usually aware of the likely reasons for undercoverage and can often estimate the extent of the problem. In addition to the problem of undercoverage (missing population members), sampling frames can suffer from overcoverage, i.e., the inclusion of units that do not belong on the sampling frame and/or the listing of a given unit more than once. These problems are usually correctable. Duplicate listings are either deleted prior to sample selection or are corrected for by appropriate statistical adjustments. Listings that are not in-scope according to the survey definition are typically deleted during data collection or analysis, and corresponding statistical adjustments are made to estimate the likely extent of out-of-scope cases among the survey nonrespondents.
Cross Sectional Study. In survey research, a study in which data are obtained only once. Contrast with longitudinal studies, in which a panel of individuals is interviewed repeatedly over a period of time. Note, though, that a cross-sectional study can ask questions about previous periods of time.
Disproportionate sampling. The deliberate use of different sampling rates for different strata, such as high-income neighborhoods.
Estimation procedures. Procedures followed in making population estimates from the survey responses.
Exchange. A telephone exchange is the three digits of the phone number that immediately follow the area code.
Frequencies. (Also called "marginals.") In survey research, the number of respondents who responded to each of the possible answers to a question. Often codebooks list the frequency of response to each question. So, for instance, you might be able to tell from a codebook how many House Members voted in favor of a bill and how many voted against it.
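Frequencies of this kind can be tallied in a few lines of Python. The vote data below are invented for illustration.

```python
from collections import Counter

# Hypothetical recorded votes on a single bill.
votes = ["yea", "nay", "yea", "yea", "not voting", "nay"]

frequencies = Counter(votes)
for answer, count in frequencies.most_common():
    share = 100 * count / len(votes)
    print(f"{answer}: {count} ({share:.1f}%)")
```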
Household. The person or persons occupying a housing unit. Families are a subset of households.
Housing unit. A house, apartment, mobile home or trailer, group of rooms, or a single room occupied or intended for occupancy as separate living quarters. Separate living quarters are those in which the occupants do not live and eat with any other person in the structure and which have direct access from the outside of the building or through a common hall.
Imputation. The process by which one estimates missing values for items that a survey respondent failed to provide.
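As one very simple illustration, the Python sketch below fills each missing value with the mean of the observed values. Mean imputation is chosen here only because it is easy to show; it is not the only approach, and it tends to understate the variability of the completed data. The function name and data are hypothetical, and None marks a missing answer.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


print(impute_mean([30, None, 50, 40]))  # prints [30, 40, 50, 40]
```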
Incidence. In market research, the term incidence describes what percentage of a population or group qualifies on some criterion.
In-scope. Sampling units that are part of the population of interest.
Item nonresponse. The failure of a respondent to answer a particular item on the survey. When item nonresponse is high and respondents and nonrespondents differ substantially, item nonresponse can be a serious threat to the accuracy of the estimates. Imputation techniques can be used to reduce the impact of this problem, but the extent to which they are effective is difficult to determine.
Listed telephone households. Those households that are listed in published telephone directories.
Margin of error. A measure of the sampling precision of the results of a survey. Example: A margin of error of plus or minus 3.5 percentage points at the 95% confidence level means that there is a 95% chance that the responses of the target population as a whole would fall within 3.5 points above or below the responses of the sample (a 7-point spread). However, for any specific question, the margin of error could be greater or less than plus or minus 3.5 points.
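For a percentage estimated from a simple random sample, the margin of error is commonly approximated as z times the square root of p(1 - p)/n. The Python sketch below applies that formula; it assumes simple random sampling and a large sample, and the function name and sample sizes are hypothetical.

```python
import math


def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Approximate margin of error for a proportion p from a sample of n."""
    return z * math.sqrt(p * (1.0 - p) / n)


# A 50/50 question asked of 784 respondents, 95% confidence (z = 1.96).
moe = margin_of_error(0.5, 784)
print(f"+/- {100 * moe:.1f} percentage points")  # prints +/- 3.5 percentage points

# The same question asked of a subgroup of 196 respondents.
print(f"+/- {100 * margin_of_error(0.5, 196):.1f} percentage points")
```

The second call shows why the margin differs by question: an item answered by only part of the sample rests on a smaller n and so carries a wider margin.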
Measurement error. The extent to which there are discrepancies between survey results and the true value of what the survey researcher is attempting to measure. There are several possible sources of error here. Respondents may report inaccurate information because they do not have the required information, because of carelessness, or because they do not understand the question asked. Alternatively, respondents may provide accurate information, but errors are introduced in the data processing stage due to keypunching, coding, or programming errors. Since it is often not possible to determine the "true value" of what one is trying to measure, precise estimates of measurement error are usually not possible. However, techniques exist for obtaining some information about the likely extent of measurement error. For example, information reported by individuals may be compared with appropriate institutional records on the individual.
Microdata. Nonaggregated data about the units sampled. For surveys of individuals, microdata contain records for each individual interviewed; for surveys of organizations, the microdata contain records for each organization.
Multimodal survey. A survey in which more than one data collection mode was used, e.g., a mix of mail and phone data collection. This approach is often used in large surveys because mail data collection is cheaper than phone, but mail response rates are typically too low to meet desired levels, so mail nonrespondents are surveyed by phone. The major problem with this approach is that the mode of data collection may produce different answers. This can potentially lead to incorrect inferences about the associations among variables.
Non-probability sampling. A sampling procedure in which the selection of population elements is based in part on the judgment of the researcher or field interviewer.
Non-response error. The difference on measures between those who respond to a survey and those who do not respond.
Omnibus panel. A fixed sample of respondents measured on different variables over a period of time.
Out-of-scope. Sampling units that are not part of the population of interest. For example, in the National Survey of Recent College Graduates, only individuals who received a bachelor's or master's degree within a specified time frame are of interest. If an educational institution provided the name of an individual who failed to graduate, the individual would be considered out-of-scope for the survey. Information on this individual would not be included in the final estimates from the survey.
Panel Study. A longitudinal study in which a panel of individuals is interviewed at intervals over a period of time. In general usage, the definitions of longitudinal study and panel study overlap. At least one author says that the term "panel study" is sometimes used for studies that are restricted to a short period of time or are limited to two or three interviews and "longitudinal study" is used for studies that last longer or include more interviews; but there are significant examples where this distinction is not accurate. In general, longitudinal studies involve panels of respondents and panel studies are longitudinal studies. Examples of panel studies include the Survey of Income and Program Participation (SIPP) and the Panel Study of Income Dynamics (PSID).
Parameter. Characteristic of a population.
Population. The individuals or organizations of interest in a given survey. In sample surveys one makes inferences about the population from the sample selected.
Predictive dialing. A computer-driven process that automatically dials a file of phone numbers and passes connected calls to available agents.
Primary Sampling Units (PSUs). Geographic areas where a survey will be conducted. Generally applied to door-to-door and cluster sampling.
Probability sampling. A sampling procedure in which the sample is selected by a random procedure that gives every member of the population to be sampled a known nonzero chance of selection. The probabilities of selection may or may not be equal.
Probability proportional to size (pps). A sampling technique in which the probability of a unit's being selected is based on a measure of size. For example, if the measure of size is expenditures, organizations with high expenditures are selected with higher probability than organizations with low expenditures.
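The expenditure example can be sketched in Python as follows. The organization names and figures are invented, the function names are hypothetical, and the draw is made with replacement purely for simplicity; many real designs select without replacement.

```python
import random


def pps_probabilities(sizes):
    """Single-draw selection probability for each unit: size / total size."""
    total = sum(sizes.values())
    return {unit: size / total for unit, size in sizes.items()}


def pps_draw(sizes, k, seed=None):
    """Draw k units with replacement, with probability proportional to size."""
    rng = random.Random(seed)
    units = list(sizes)
    return rng.choices(units, weights=[sizes[u] for u in units], k=k)


# Hypothetical expenditures (the measure of size) for three organizations.
expenditures = {"A": 500, "B": 300, "C": 200}
print(pps_probabilities(expenditures))  # prints {'A': 0.5, 'B': 0.3, 'C': 0.2}
print(pps_draw(expenditures, k=5, seed=1))
```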
Random digit dialing. Process used to generate telephone samples in which all the working exchanges and working blocks within the study area are determined, and then all the possible combinations of telephone numbers within these working exchanges and blocks are generated. A block is deemed working if three or more listed numbers are found within that block. Within any given block, there are 100 possible two-digit combinations to form a complete number.
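The generation step can be sketched in Python: for each working block, append every two-digit suffix from 00 to 99. The function name is hypothetical, and the exchange and block values are taken from the 255-4200 example used elsewhere in this glossary.

```python
import random


def numbers_in_block(exchange: str, block: str):
    """List the 100 possible numbers in one block of one exchange."""
    return [f"{exchange}-{block}{suffix:02d}" for suffix in range(100)]


frame = numbers_in_block("255", "42")
print(len(frame))            # prints 100
print(frame[0], frame[-1])   # prints 255-4200 255-4299

# Draw a small random sample of numbers to dial from the generated frame.
rng = random.Random(0)
print(rng.sample(frame, 5))
```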
Respondent. The individual or organization providing the information requested in the survey. The type of respondent influences what type of information can be obtained, e.g., individuals completing a degree may provide different information about the degree than a representative of the academic institution granting the degree would provide.
Response codes. Typically responses to questions are "coded" by assigning numeric codes to each possible response. Thus a "yes" might be coded "1" and a "no" "2"; female respondents might be indicated by a "1" and male respondents by a "2"; each state or county might be assigned a numeric code.
Response rate. Indicates the percentage of sample members who provided information in response to being surveyed. Care in interpreting response rates is necessary, because there is not one single uniformly accepted measure of response rate. One common measure, used extensively in demographic surveys, is the percentage of in-scope sample members who responded to the survey. In surveys that focus on estimating expenditures, the response rate is often calculated as the percentage of the total expenditures represented by responding sample members. This measure is often referred to as a weighted response rate (though weighting may also be used to adjust for different probabilities of sample selection).
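The two measures described above can give very different figures for the same survey, as this Python sketch shows. The record layout, function names, and expenditure figures are hypothetical.

```python
def unweighted_response_rate(units):
    """Percentage of in-scope sample members who responded."""
    responded = sum(1 for u in units if u["responded"])
    return 100.0 * responded / len(units)


def expenditure_weighted_response_rate(units):
    """Percentage of total expenditures held by responding members."""
    total = sum(u["expenditures"] for u in units)
    covered = sum(u["expenditures"] for u in units if u["responded"])
    return 100.0 * covered / total


# Hypothetical in-scope sample of four organizations.
sample = [
    {"expenditures": 900, "responded": True},
    {"expenditures": 50, "responded": False},
    {"expenditures": 30, "responded": False},
    {"expenditures": 20, "responded": True},
]
print(unweighted_response_rate(sample))            # prints 50.0
print(expenditure_weighted_response_rate(sample))  # prints 92.0
```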
Sample. The individuals or organizations selected to represent the population.
Sample design. The procedures used in selecting the sample. These procedures can be as simple as randomly selecting a certain percentage of the cases. However, more complex designs are frequently used in order to obtain reliable information about a particular group(s) of interest and/or to minimize the cost of obtaining the information desired.
Sample frame. Those individuals or organizations from which one selects the actual sample for the survey. Ideally, the sample frame is the same as the target population. In reality, however, there are often differences.
Scope of survey. The population to which the researcher plans to generalize his or her results. The scope of the survey may be limited by both theoretical and practical considerations. For example, while it may be of theoretical interest to obtain information on the characteristics of institutionalized individuals, practical difficulties often lead researchers to declare such individuals out-of-scope for a survey. Out-of-scope cases may be eliminated at the time of sample frame construction or during data collection or data processing.
Skip Pattern. In survey research, the sequence of questions asked and skipped. For instance, an answer indicating that a respondent did not vote in the last election would trigger a "skip" so that the interviewer would not ask that respondent questions about how they voted in the last election.
Standard error. Commonly used measure of how precisely one can estimate a population value from a given sample. For large sample surveys, a reasonable interpretation of the standard error is that approximately 68 percent of the time the sample estimate will be within one standard error of the population value. For example, if one estimates that the mean income for individuals within a specified group is $30,000 with a standard error of $5,000, one would be right 68 percent of the time in assuming that the true (or population) mean income for the group is between $25,000 and $35,000.
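For the mean of a simple random sample, the standard error is commonly estimated as the sample standard deviation divided by the square root of the sample size. The Python sketch below applies that formula to invented income data; it assumes simple random sampling and ignores any finite population correction, and the function name is hypothetical.

```python
import math
from statistics import mean, stdev


def standard_error_of_mean(sample):
    """Estimate the standard error of the mean: s / sqrt(n)."""
    return stdev(sample) / math.sqrt(len(sample))


# Hypothetical incomes from a (very small) simple random sample.
incomes = [24000, 28000, 30000, 32000, 36000]
estimate = mean(incomes)
se = standard_error_of_mean(incomes)

# One standard error on either side of the estimate.
print(f"{estimate - se:.0f} to {estimate + se:.0f}")  # prints 28000 to 32000
```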
Subsample. A sample selected from a sample frame that is itself a sample of a larger population. Often the original sample is used to identify individuals or organizations of interest or is used to sort units into groups to be sampled at different rates.
Stratification. A sampling technique in which sampling is done separately for separate parts of the population. Stratification is often used to ensure that one has an adequate number of sampling units with relatively rare characteristics (e.g., stratification may be done on race/ethnic status if one wishes to make comparisons among racial/ethnic groups).
Target population. Those individuals or organizations about which one wishes to make inferences on the basis of the survey results.
Two-stage sample. A sample selected in two steps. In one common type of two-stage sample, the first stage consists of a sample of organizations of interest and the second stage consists of individuals within organizations.
Unit nonresponse. The failure of an individual or organization to respond to the survey. When unit nonresponse is high and respondents and nonrespondents differ substantially, unit nonresponse can be a serious threat to the accuracy of a survey. There are statistical techniques that can be used to reduce the impact of this problem, but all rest on assumptions about the characteristics of missing units that are difficult to evaluate without expensive additional data collection.
Unit of analysis. The basic observable entity being analyzed by a study and for which data are collected in the form of variables. Although a unit of analysis is sometimes referred to as the case or "observation," these are not always synonymous. For instance, in public opinion polls, the unit of analysis is usually a single person and the answers to the survey questions by one person constitute a "case." In a census, however, a "case" could be considered the household because all the data for one household are collected on one survey instrument; the household "case" may contain different variables for the different units of analysis: a physical housing structure, a family within the structure, a person within the family. Contrast with Unit of Observation.
Unit of Observation. When social science methodology is used to collect data, the entity which is observed or about which information is collected is the unit of observation. The unit of observation is the same as the unit of analysis when the generalizations being made from a statistical analysis are attributed to the unit of observation (i.e., the objects about which data were collected and organized for statistical analysis). While the units of observation and analysis are often the same, the wealth of secondary data sources creates opportunities to conduct analyses with data from multiple units of observation. This is probably most recognizable in GIS research.
Variable. In social science research, for each unit of analysis, each item of data (e.g., age of person, income of family, consumer price index) is called a variable.
Wave. In a panel study, a wave is the interviewing period during which the entire panel is questioned and asked the same questions. Typically, a panel study consists of several waves. Waves are important because each wave typically covers a different time period and, often, different topics.
Weight. In survey research, a number associated with a case or unit of analysis; the weight is used as a measure of the relative significance of the variables of that case when making estimates for the entire population. Even when a probability sample is used, some elements of the population are often under- or overrepresented in the sample. In order to allow more accurate estimates for the complete population, therefore, "weights" are assigned to each case and used to adjust the overall results to more closely conform to the total population.
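The effect of weighting can be sketched in Python. The example assumes, as one common convention, that each case's weight is the inverse of its probability of selection; the function names and data are hypothetical.

```python
def base_weight(selection_probability: float) -> float:
    """Inverse-probability weight: cases each respondent stands for."""
    return 1.0 / selection_probability


def weighted_mean(values, weights):
    """Mean of the values, with each case counted according to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical sample: two cases from a stratum sampled at 1 in 10 and
# one case from a stratum sampled at 1 in 100.
values = [80, 90, 30]
weights = [base_weight(0.1), base_weight(0.1), base_weight(0.01)]

print(f"unweighted: {sum(values) / len(values):.2f}")      # prints unweighted: 66.67
print(f"weighted:   {weighted_mean(values, weights):.2f}")  # prints weighted:   39.17
```

The unweighted mean overstates the result because the heavily sampled stratum is overrepresented; the weights restore each stratum to its share of the population.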