Glossary

Accuracy.  The degree to which a sample statistic would correspond to the population parameter it is meant to estimate if there were no random error. Accuracy is high when bias is low.

Active block.  As defined by Survey Sampling, Inc., a block is the set of 100 numbers identified by the first two digits of the last four digits of the telephone number. In the telephone number 255-4200, "42" is the block. A block is termed working if one or more listed telephone numbers are found in that block.

Allocation.  The method of distributing sample sizes to the strata in a stratified sample. Two commonly used methods of allocation are: proportional allocation where the sample size of a stratum is proportional to the population size of the stratum, and optimum allocation in which the sample sizes are allocated to the strata in such a manner as to minimize the standard error for some particular estimate.
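The two allocation methods can be sketched as follows. This is a minimal illustration, not a prescribed procedure: the stratum names, sizes, and standard deviations are invented, and the "optimum" function assumes the Neyman form (sample size proportional to stratum size times stratum standard deviation, with equal cost per interview), which is only one common version of optimum allocation.

```python
def proportional_allocation(stratum_sizes, total_sample):
    """Allocate total_sample across strata in proportion to population size."""
    population = sum(stratum_sizes.values())
    return {
        name: round(total_sample * size / population)
        for name, size in stratum_sizes.items()
    }


def optimum_allocation(stratum_sizes, stratum_sds, total_sample):
    """Neyman allocation (assumed form): n_h proportional to N_h * S_h."""
    products = {
        name: stratum_sizes[name] * stratum_sds[name] for name in stratum_sizes
    }
    total = sum(products.values())
    return {
        name: round(total_sample * product / total)
        for name, product in products.items()
    }


# Hypothetical strata: population sizes and standard deviations of the
# variable being estimated.
sizes = {"urban": 6000, "suburban": 3000, "rural": 1000}
sds = {"urban": 10.0, "suburban": 20.0, "rural": 40.0}

print(proportional_allocation(sizes, 400))
# {'urban': 240, 'suburban': 120, 'rural': 40}
print(optimum_allocation(sizes, sds, 400))
# {'urban': 150, 'suburban': 150, 'rural': 100}
```

Because each stratum size is rounded independently, the allocated sizes may differ from the requested total by a unit or two for other inputs.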

Area samples.  Those kinds of samples that incorporate the selection of certain explicit geographic areas as part of the sample design; typically used in multi-stage door-to-door studies.

Bias. A tendency to underestimate or overestimate a population value of interest.

Block.  Normally a rectangular piece of land, bounded by four streets. However, a block may also be irregular in shape or bounded by railroad tracks, streams, or other features. Blocks do not cross the boundaries of counties, census tracts or block numbering areas (BNAs). Census data are tabulated by block in all urbanized areas but much information is suppressed to protect the confidentiality of census information. A block is the smallest level of census geography.

Block group.  A combination of numbered blocks that is a subdivision of a census tract or untracted area.  Block groups are defined in areas for which block statistics are prepared. 

Callbacks.  Repeated attempts to contact a respondent who could not be interviewed on an earlier attempt.  A reasonable number of callbacks should be made to increase overall response and the probability that the survey results are representative of the population.

CAPI (computer-assisted personal interviewing).  Data collection method in which the researcher reads questions to the respondents from the computer screen and keys in answers.

CASI (computer-assisted self-interviewing).  Data collection method in which the researcher directs respondents to a computer, and the respondents enter their own answers into the computer.

CATI (computer-assisted telephone interviewing).  Data collection method in which researchers use random digit dialing to phone potential respondents, ask questions as directed by the computer, and key the responses directly into the system.

CAWI (computer-assisted web interviewing).  Data collection method using web-based servers.

CGFS (computer-generated fax survey).  Data collection method in which researchers use random digit dialing to contact respondents' fax machines and fax a survey to them. The respondent then completes the survey and faxes it back to the researcher.

CPS (Current Population Survey).  An ongoing national survey of very high quality conducted by the Bureau of the Census.  The CPS is the best source of current population statistics.  However, most small areas such as states and MSAs are not individually reported.

Census. An enumeration of the total population of interest. Since no sample is selected from the population, there is no sampling error. However, nonsampling errors are still possible in a census.

Cleaning. To "clean" a data file is to check for wild codes and inconsistent responses (see Consistency Check); to verify that the file has the correct and expected number of records, cases, and cards or records per case; and to correct errors found.

Cluster sample.  The selection of groups of elements from a population.   Typically used in multi-stage area probability designs to improve the efficiency of fieldwork by assuring that groups of neighboring households are interviewed.  Since neighbors tend to share certain characteristics, clustering almost always reduces sampling efficiency.

Codebook. Generically, any information on the structure, contents, and layout of a data file. Typically, a codebook includes: column locations and widths for each variable; definitions of different record types; response codes for each variable; codes used to indicate non-response and missing data; exact questions and skip patterns used in a survey; and other indications of the content of each variable. Many codebooks also include frequencies of response. Codebooks vary widely in quality and amount of information included. They may be machine-readable, paper copy, or microfiche.

Coefficient of variation (C.V.). The ratio of the standard error for a variable to the mean value of the variable. This is used to measure the imprecision in survey estimates introduced by sampling. A coefficient of variation of 1 percent would indicate that an estimate could vary slightly due to sampling error, while a coefficient of variation of 50 percent means that the estimate is very imprecise. The most common way to improve the coefficient of variation requires increases in sample size that are typically expensive to accomplish.
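As a small worked illustration of the ratio described above, the sketch below computes a coefficient of variation as a percentage. The function name is hypothetical, and the figures (a mean of $30,000 with a standard error of $5,000) are borrowed from the example in the Standard error entry.

```python
def coefficient_of_variation(standard_error, mean):
    """Return the C.V. as a percentage: standard error divided by the mean."""
    if mean == 0:
        raise ValueError("the C.V. is undefined when the mean is zero")
    return 100.0 * standard_error / mean


cv = coefficient_of_variation(5_000, 30_000)
print(f"{cv:.1f}%")  # 16.7%
```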

Completion rate.  The percent of qualified respondents from whom a completed interview is obtained. 

Confidence intervals.  A range around the sample estimate in which the population value is expected to fall with a specified degree of confidence, usually 95% of the time or 90% of the time.
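One way to compute such a range is sketched below. It assumes a large sample, so that the estimate is approximately normally distributed around the population value; the function name and the input figures are illustrative only.

```python
from statistics import NormalDist


def confidence_interval(estimate, standard_error, confidence=0.95):
    """Return (low, high) using the normal approximation (an assumption)."""
    # Two-sided critical value, e.g. about 1.96 for 95% and 1.645 for 90%.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * standard_error
    return estimate - half_width, estimate + half_width


low95, high95 = confidence_interval(30_000, 5_000, confidence=0.95)
low90, high90 = confidence_interval(30_000, 5_000, confidence=0.90)
print(f"95%: {low95:,.0f} to {high95:,.0f}")
print(f"90%: {low90:,.0f} to {high90:,.0f}")
```

The 90% interval is narrower than the 95% interval: accepting a lower degree of confidence buys a tighter range.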

Consistency Check.   A process of data cleaning which looks for inappropriate responses to branched questions. For instance, one question might ask if the respondent attended church last week; a response of "no" should skip the questions about church attendance and code the answers to those questions as "inapplicable." If those questions were coded any other way than "inapplicable," this would be inconsistent with the skip patterns of the survey instrument.
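The church-attendance example can be sketched as a simple check over a data file. The field names and the numeric code used for "inapplicable" are hypothetical; a real codebook would define its own.

```python
INAPPLICABLE = 9  # hypothetical code for "inapplicable"


def find_skip_violations(records):
    """Return ids of cases that answered "no" to the filter question but
    whose follow-up question is not coded as inapplicable."""
    violations = []
    for rec in records:
        if rec["attended"] == "no" and rec["times_attended"] != INAPPLICABLE:
            violations.append(rec["id"])
    return violations


records = [
    {"id": 1, "attended": "yes", "times_attended": 2},
    {"id": 2, "attended": "no", "times_attended": INAPPLICABLE},
    {"id": 3, "attended": "no", "times_attended": 1},  # inconsistent
]
print(find_skip_violations(records))  # [3]
```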

Cooperation rate. The percentage of in-scope individuals (or organizations) who complete a survey after being contacted. The denominator for the cooperation rate excludes individuals (or organizations) whom one has tried unsuccessfully to contact. Thus, the cooperation rate for a survey will be higher than its response rate unless all selected individuals (or organizations) are contacted.
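The difference between the two denominators can be illustrated with a short sketch. The function name and the counts are invented, and all selected units are assumed to be in-scope.

```python
def cooperation_and_response_rates(completed, contacted, selected):
    """Return (cooperation rate, response rate) as percentages.

    completed: in-scope units that completed the survey
    contacted: in-scope units successfully contacted
    selected:  all in-scope units selected, contacted or not
    """
    if not (0 <= completed <= contacted <= selected) or contacted == 0:
        raise ValueError("expected 0 <= completed <= contacted <= selected")
    cooperation = 100.0 * completed / contacted
    response = 100.0 * completed / selected
    return cooperation, response


coop, resp = cooperation_and_response_rates(
    completed=600, contacted=800, selected=1000
)
print(coop, resp)  # 75.0 60.0
```

The two rates coincide only when every selected unit is contacted.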

Coverage. The extent of correspondence between the target population and the sampling frame. Ideally, all members of the target population are included in the sampling frame. However, this is infrequently the case for major surveys. Coverage is rarely estimable in precise terms; however, survey designers are usually aware of the likely reasons for undercoverage and can often estimate the extent of the problem. In addition to the problem of undercoverage (missing population members), sampling frames can suffer from overcoverage, i.e., the inclusion of units that do not belong on the sampling frame and/or the listing of a given unit more than once. These problems are usually correctable. Duplicate listings are either deleted prior to sample selection or are corrected for by appropriate statistical adjustments. Listings that are not in-scope according to the survey definition are typically deleted during data collection or analysis and corresponding statistical adjustments are made to estimate the likely extent of out-of-scope cases among the survey nonrespondents.

Cross Sectional Study.   In survey research, a study in which data are obtained only once. Contrast with longitudinal studies in which a panel of individuals is interviewed repeatedly over a period of time. Note that a cross sectional study can ask questions about previous periods of time, though.

Disproportionate sampling.  The deliberate use of different sampling rates for different strata (for example, a separate rate for high-income neighborhoods).

Estimation procedures. Procedures followed in making population estimates from the survey responses.

Exchange.  A telephone exchange is the next three digits of the phone number after the area code.

Frequencies.   (Also called "marginals.") In survey research, the number of respondents who responded to each of the possible answers to a question. Often codebooks list the frequency of response to each question. So, for instance, you might be able to tell from a codebook how many House Members voted in favor of a bill and how many voted against it.
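A frequency table of this kind is easy to produce from raw responses. The sketch below uses invented vote data and Python's standard-library Counter.

```python
from collections import Counter

# Hypothetical coded responses, one per respondent.
votes = ["yes", "no", "yes", "yes", "abstain", "no"]

frequencies = Counter(votes)
for answer, count in frequencies.most_common():
    print(f"{answer}: {count}")
```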

Household.  The person or persons occupying a housing unit.  Families are a subset of households.

Housing unit.  A house, apartment, mobile home or trailer, group of rooms, or a single room occupied or intended for occupancy as separate living quarters.  Separate living quarters are those in which the occupants do not live and eat with any other person in the structure and which have direct access from the outside of the building or through a common hall.

Imputation. The process by which one estimates missing values for items that a survey respondent failed to provide.
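As a minimal sketch, the example below fills missing values with the mean of the observed values. Mean imputation is only one simple technique, chosen here for illustration; it is not the only or necessarily the preferred method. Missing items are represented by None, which is an assumption about the data layout.

```python
from statistics import mean


def impute_mean(values):
    """Replace missing items (None) with the mean of the observed items."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute when every value is missing")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


incomes = [30_000, None, 50_000, 40_000, None]
print(impute_mean(incomes))
```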

Incidence.  In market research, the term incidence describes the percentage of a population or group that meets some qualifying criteria.

In-scope. Sampling units that are part of the population of interest.

Item nonresponse. The failure of a respondent to answer a particular item on the survey. When item nonresponse is high and respondents and nonrespondents differ substantially, item nonresponse can be a serious threat to the accuracy of the estimates. Imputation techniques can be used to reduce the impact of this problem, but the extent to which they are effective is difficult to determine.

Listed telephone households.  Those households that are listed in published telephone directories.

Margin of error.   A measurement of the sampling error in the results of a survey. Example: A margin of error of plus or minus 3.5 percentage points at the 95% confidence level means that there is a 95% chance that the response of the target population as a whole would fall between 3.5 points below and 3.5 points above the response of the sample (a 7-point spread). However, for any specific question, the margin of error could be greater or less than plus or minus 3.5 points.
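The sketch below shows one common way a margin of error for a percentage is computed. It assumes a simple random sample and the normal approximation; the sample size of 784 is chosen only because it yields roughly 3.5 points when the observed proportion is 50%.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(n, p=0.5, confidence=0.95):
    """Margin of error for a proportion p from a simple random sample of n."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sqrt(p * (1 - p) / n)


moe = margin_of_error(784)             # worst case, p = 0.5
moe_rare = margin_of_error(784, p=0.1)  # a question with a lopsided answer
print(f"{100 * moe:.1f} points")       # 3.5 points
print(f"{100 * moe_rare:.1f} points")  # 2.1 points
```

The second call illustrates why the margin for a specific question can differ from the headline figure: it depends on the proportion observed for that question.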

Measurement error. The extent to which there are discrepancies between survey results and the true value of what the survey researcher is attempting to measure. There are several possible sources of error here. Respondents may report inaccurate information because they do not have the required information, because they are careless, or because they do not understand the question asked. Alternatively, respondents may provide accurate information, but errors are introduced in the data processing stage due to keypunching, coding, or programming errors. Since it is often not possible to determine the "true value" of what one is trying to measure, precise estimates of measurement error are usually not possible. However, techniques exist for obtaining some information about the likely extent of measurement error. For example, information reported by individuals may be compared with appropriate institutional records on the individual.

Microdata. Nonaggregated data about the units sampled. For surveys of individuals, microdata contain records for each individual interviewed; for surveys of organizations, the microdata contain records for each organization.

Multimodal survey. A survey in which more than one data collection mode was used, e.g., a mix of mail and phone data collection. This approach is often used in large surveys because mail data collection is cheaper than phone, but mail response rates are typically too low to meet desired levels, so mail nonrespondents are surveyed by phone. The major problem with this approach is that the mode of data collection may produce different answers. This can potentially lead to incorrect inferences about the associations among variables.

Non-probability sampling.  A sampling procedure in which the selection of population elements is based in part on the judgment of the researcher or field interviewer.

Non-response error.  The difference on measures between those who respond to a survey and those who do not respond.

Omnibus panel.  A fixed sample of respondents measured on different variables over a period of time.

Out-of-scope. Sampling units that are not part of the population of interest. For example, in the National Survey of Recent College Graduates, only individuals who received a bachelor's or master's degree within a specified time frame are of interest. If an educational institution provided the name of an individual who failed to graduate, the individual would be considered out-of-scope for the survey. Information on this individual would not be included in the final estimates from the survey.

Panel Study. A longitudinal study in which a panel of individuals is interviewed at intervals over a period of time. In general usage, the definitions of longitudinal study and panel study overlap. At least one author says that the term "panel study" is sometimes used for studies that are restricted to a short period of time or are limited to two or three interviews and "longitudinal study" is used for studies that last longer or include more interviews; but there are significant examples where this distinction is not accurate. In general, longitudinal studies involve panels of respondents and panel studies are longitudinal studies. Examples of panel studies include the Survey of Income and Program Participation (SIPP) and the Panel Study of Income Dynamics (PSID).

Parameter.  Characteristic of a population.

Population. The individuals or organizations of interest in a given survey. In sample surveys one makes inferences about the population from the sample selected.

Predictive dialing.  A computer driven process that automatically dials a file of phone numbers and passes connected calls to available agents.

Primary Sampling Units (PSUs).  Geographic areas where a survey will be conducted.  Generally applied to door-to-door and cluster sampling.

Probability sampling.  A sample selected by a random procedure that gives every member of the population to be sampled a known nonzero chance of selection.  The probabilities of selection may or may not be equal.

Probability proportional to size (pps). A sampling technique in which the probability of a unit's being selected is based on a measure of size. For example, if the measure of size is expenditures, organizations with high expenditures are selected with higher probability than organizations with low expenditures.
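A minimal sketch of the idea follows. The organization names and expenditure figures are invented, and the draw is made with replacement purely to keep the example short; practical designs often select without replacement.

```python
import random


def selection_probabilities(sizes):
    """Per-draw selection probability of each unit, proportional to size."""
    total = sum(sizes.values())
    return {unit: size / total for unit, size in sizes.items()}


def pps_sample(sizes, k, seed=0):
    """Draw k units WITH replacement, probability proportional to size."""
    rng = random.Random(seed)
    units = list(sizes)
    weights = [sizes[unit] for unit in units]
    return rng.choices(units, weights=weights, k=k)


# Hypothetical measure of size: annual expenditures per organization.
expenditures = {"A": 500, "B": 300, "C": 150, "D": 50}

print(selection_probabilities(expenditures))
# {'A': 0.5, 'B': 0.3, 'C': 0.15, 'D': 0.05}
print(pps_sample(expenditures, k=3))
```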

Random digit dialing.  Process used to generate telephone samples in which all the working exchanges and working blocks within the study are determined, and then all the possible combinations of telephone numbers within these working exchanges and blocks are generated.  A block is deemed working if three or more listed numbers are found within that block.  Within any given block, there are 100 possible two-digit combinations to form a complete number.
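The generation step can be sketched as follows, assuming seven-digit numbers written as exchange, hyphen, and four digits (as in the 255-4200 example under Active block). The function names and the list of working blocks are hypothetical.

```python
import random


def numbers_in_block(exchange, block):
    """List all 100 numbers in a block, e.g. exchange "255", block "42"."""
    return [f"{exchange}-{block}{suffix:02d}" for suffix in range(100)]


def draw_rdd_sample(working_blocks, k, seed=0):
    """Draw k distinct numbers at random from the given working blocks."""
    rng = random.Random(seed)
    pool = [
        number
        for exchange, block in working_blocks
        for number in numbers_in_block(exchange, block)
    ]
    return rng.sample(pool, k)


numbers = numbers_in_block("255", "42")
print(numbers[0], numbers[-1], len(numbers))  # 255-4200 255-4299 100

sample = draw_rdd_sample([("255", "42"), ("255", "17")], k=5)
print(sample)
```

Because every possible number in a working block is generated, unlisted numbers have the same chance of selection as listed ones.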

Respondent. The individual or organization providing the information requested in the survey. The type of respondent influences what type of information can be obtained, e.g., individuals completing a degree may provide different information about the degree than a representative of the academic institution granting the degree would provide.

Response codes.   Typically responses to questions are "coded" by assigning numeric codes to each possible response. Thus a "yes" might be coded "1" and a "no" "2"; female respondents might be indicated by a "1" and male respondents by a "2"; each state or county might be assigned a numeric code.

Response rate. Indicates the percentage of sample members who provided information in response to being surveyed. Care in interpreting response rates is necessary, because there is not one single uniformly accepted measure of response rate. One common measure, used extensively in demographic surveys, is the percentage of in-scope sample members who responded to the survey. In surveys that focus on estimating expenditures, the response rate is often calculated as the percentage of the total expenditures represented by responding sample members. This measure is often referred to as a weighted response rate (though weighting may also be used to adjust for different probabilities of sample selection).
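The contrast between the two measures can be illustrated with a short sketch. The organizations and expenditure figures are invented, and every sample member is assumed to be in-scope.

```python
def response_rates(sample, size_key="expenditures"):
    """Return (unweighted, expenditure-weighted) response rates in percent."""
    responders = [unit for unit in sample if unit["responded"]]
    unweighted = 100.0 * len(responders) / len(sample)
    weighted = (
        100.0
        * sum(unit[size_key] for unit in responders)
        / sum(unit[size_key] for unit in sample)
    )
    return unweighted, weighted


# Hypothetical in-scope sample of four organizations.
sample = [
    {"expenditures": 700, "responded": True},
    {"expenditures": 200, "responded": True},
    {"expenditures": 50, "responded": False},
    {"expenditures": 50, "responded": False},
]
print(response_rates(sample))  # (50.0, 90.0)
```

Half of the organizations responded, but they account for 90 percent of total expenditures, so the two measures tell quite different stories.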

Sample. The individuals or organizations selected to represent the population.

Sample design. The procedures used in selecting the sample. These procedures can be as simple as randomly selecting a certain percentage of the cases. However, more complex designs are frequently used in order to obtain reliable information about a particular group(s) of interest and/or to minimize the cost of obtaining the information desired.

Sample frame. Those individuals or organizations from which one selects the actual sample for the survey. Ideally, the sample frame is the same as the target population. In reality, however, there are often differences.

Scope of survey. The population to which the researcher plans to generalize his or her results. The scope of the survey may be limited by both theoretical and practical considerations. For example, while it may be of theoretical interest to obtain information on the characteristics of institutionalized individuals, practical difficulties often lead researchers to declare such individuals out-of-scope for a survey. Out-of-scope cases may be eliminated at the time of sample frame construction or during data collection or data processing.

Skip Pattern.   In survey research, the sequence of questions asked and skipped. For instance, an answer indicating that a respondent did not vote in the last election would trigger a "skip," so that the interviewer would not ask that respondent questions about how he or she voted in that election.

Standard error. Commonly used measure of how precisely one can estimate a population value from a given sample. For large sample surveys, a reasonable interpretation of the standard error is that approximately 68 percent of the time the sample estimate will be within one standard error of the population value. For example, if one estimates that the mean income for individuals within a specified group is $30,000 with a standard error of $5,000, one would be right 68 percent of the time in assuming that the true (or population) mean income for the group is between $25,000 and $35,000.
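For the mean of a simple random sample, the standard error is commonly estimated as the sample standard deviation divided by the square root of the sample size. The sketch below assumes that design and ignores any finite population correction; the data are invented.

```python
from math import sqrt
from statistics import mean, stdev


def standard_error_of_mean(sample):
    """Estimated standard error of the mean for a simple random sample."""
    return stdev(sample) / sqrt(len(sample))


sample = [2, 4, 4, 4, 5, 5, 7, 9]
estimate = mean(sample)
se = standard_error_of_mean(sample)

# Roughly 68 percent of the time, the population mean lies in this range.
print(f"mean = {estimate}, SE = {se:.3f}")
print(f"one-SE range: {estimate - se:.3f} to {estimate + se:.3f}")
```

With only eight observations this is far from the large-sample case described above, so the 68 percent interpretation is approximate at best here.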

Subsample. A sample selected from a sample frame that is itself a sample of a larger population. Often the original sample is used to identify individuals or organizations of interest or is used to sort units into groups to be sampled at different rates.

Stratification. A sampling technique in which sampling is done separately for separate parts of the population. Stratification is often used to ensure that one has an adequate number of sampling units with relatively rare characteristics (e.g., stratification may be done on race/ethnic status if one wishes to make comparisons among racial/ethnic groups).
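A stratified selection can be sketched as below: the frame is split into strata and a separate simple random sample is drawn within each. The frame, the stratum labels, and the per-stratum sample sizes are invented for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(frame, stratum_of, n_per_stratum, seed=0):
    """Draw a separate simple random sample within each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in frame:
        strata[stratum_of(unit)].append(unit)
    sample = {}
    for name, units in strata.items():
        # Never ask for more units than the stratum contains.
        k = min(n_per_stratum.get(name, 0), len(units))
        sample[name] = rng.sample(units, k)
    return sample


# Hypothetical frame: 60 urban units and 40 rural units.
frame = [{"id": i, "region": "urban" if i < 60 else "rural"} for i in range(100)]
sample = stratified_sample(
    frame, lambda unit: unit["region"], {"urban": 6, "rural": 10}
)
print({name: len(units) for name, units in sample.items()})
# {'urban': 6, 'rural': 10}
```

Here the smaller rural stratum is sampled at a higher rate (10 of 40) than the urban stratum (6 of 60), which is how stratification secures enough cases from a relatively rare group.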

Target population. Those individuals or organizations about which one wishes to make inferences on the basis of the survey results.

Two-stage sample. A sample selected in two steps. In one common type of two-stage sample, the first stage consists of a sample of organizations of interest and the second stage consists of individuals within organizations.

Unit nonresponse. The failure of an individual or organization to respond to the survey. When unit nonresponse is high and respondents and nonrespondents differ substantially, unit nonresponse can be a serious threat to the accuracy of a survey. There are statistical techniques that can be used to reduce the impact of this problem, but all rest on assumptions about the characteristics of missing units that are difficult to evaluate without expensive additional data collection.

Unit of analysis.   The basic observable entity being analyzed by a study and for which data are collected in the form of variables. Although a unit of analysis is sometimes referred to as the case or "observation," these are not always synonymous. For instance, in public opinion polls, the unit of analysis is usually a single person and the answers to the survey questions by one person constitute a "case." In a census, however, a "case" could be considered the household because all the data for one household are collected on one survey instrument; the household "case" may contain different variables for the different units of analysis: a physical housing structure, a family within the structure, a person within the family. Contrast with Unit of observation.

Unit of Observation.  When social science methodology is used to collect data, the entity which is observed or about which information is collected is the unit of observation.  The unit of observation is the same as the unit of analysis when the generalizations being made from a statistical analysis are attributed to the unit of observation (i.e., the objects about which data were collected and organized for statistical analysis). While the units of observation and analysis are often the same, the wealth of secondary data sources creates opportunities to conduct analyses with data from multiple units of observation. This is probably most recognizable in GIS research.

Variable.  In social science research, for each unit of analysis, each item of data (e.g., age of person, income of family, consumer price index) is called a variable.

Wave.  In a panel study, a wave is the interviewing period during which the entire panel is questioned and asked the same questions. Typically, a panel study consists of several waves. Waves are important because each wave typically covers a different time period and, often, different topics.

Weight.   In survey research, a number associated with a case or unit of analysis; the weight is used as a measure of the relative significance of the variables of that case when making estimates for the entire population. When a probability sample is used, there is often a chance that some elements of the population are under- or overrepresented in the sample. In order to allow more accurate estimates of a complete population, therefore, "weights" are assigned to each case and used to adjust the overall results to more closely conform to the total population.
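One common way to construct a weight, assumed here for illustration, is the inverse of the case's probability of selection, so that a case sampled at a low rate stands in for more members of the population. The incomes and selection probabilities below are invented.

```python
def weighted_mean(values, weights):
    """Mean of values in which each case counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical cases: high-income households were sampled at 10 percent,
# others at 2 percent.
cases = [
    {"income": 80_000, "selection_prob": 0.10},
    {"income": 80_000, "selection_prob": 0.10},
    {"income": 30_000, "selection_prob": 0.02},
]

incomes = [case["income"] for case in cases]
# Assumed design weight: inverse of the selection probability.
weights = [1 / case["selection_prob"] for case in cases]

unweighted = sum(incomes) / len(incomes)
weighted = weighted_mean(incomes, weights)
print(f"unweighted: {unweighted:,.0f}")  # unweighted: 63,333
print(f"weighted:   {weighted:,.0f}")    # weighted:   44,286
```

The unweighted mean overstates income because the oversampled high-income cases count too heavily; the weights pull the estimate back toward the population.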