Sampling in 1C and in sociological research

A selection in 1C 8.2 and 8.3 is a specialized way of iterating over the records of infobase tables. Let's take a closer look at what a selection is and how to use it.

What is a selection in 1C?

A selection is a way of iterating over information in 1C that consists of sequentially positioning a cursor on the next record. A selection can be obtained from a query result or from an object manager, for example, of documents or catalogs.

An example of getting a selection from an object manager and iterating over it:

Selection = Catalogs.Banks.Select();
While Selection.Next() Do
	// work with the current record
EndDo;

An example of obtaining a selection from a query:


Query = New Query("SELECT Ref, Code, Description FROM Catalog.Banks");
Selection = Query.Execute().Select();
While Selection.Next() Do
	// perform the actions of interest with the "Banks" catalog
EndDo;

Both examples above produce the same set of records to iterate over.

Selection methods in 1C 8.3

A selection has a large number of methods; let's consider them in more detail:

  • Select() - the method by which a selection is obtained. From a selection you can obtain another, subordinate selection if the traversal type "by groupings" is specified.
  • Owner() - the method inverse to Select(); it allows you to get the "parent" selection of a subordinate one.
  • Next() - moves the cursor to the next record. Returns True if the record exists and False if there are no more records.
  • FindNext() - a very useful method that lets you iterate over only the required records by filter values (the filter is a structure of fields).
  • NextByFieldValue() - gets the next record whose field value differs from the current one. For example, to iterate over all records with unique values of the "Account" field: Selection.NextByFieldValue("Account").
  • Reset() - resets the cursor to its original position.
  • Count() - returns the number of records in the selection.
  • Get() - positions the cursor on the desired record by index.
  • Level() - the level of the current record in the hierarchy (a number).
  • RecordType() - returns the record type: DetailedRecord, TotalByGrouping, TotalByHierarchy, or GeneralTotal.
  • Group() - returns the name of the current grouping; if the record is not a grouping, it returns an empty string.


Learning goals

  1. Clearly distinguish between the concepts of census and sampling.
  2. Know the essence and sequence of the six stages implemented by researchers to obtain a sample population.
  3. Define the concept of "sampling frame".
  4. Explain the difference between probability and deterministic sampling.
  5. Distinguish between fixed-size sampling and sequential sampling.
  6. Explain what purposive sampling is and describe both its strengths and weaknesses.
  7. Define the concept of quota sampling.
  8. Explain what a parameter is in a sampling procedure.
  9. Explain what a derived population is.
  10. Explain why the concept of a sampling distribution is the most important concept in statistics.

So, the researcher has precisely defined the problem and settled on a research design and data collection tools suitable for solving it. The next stage of the research process is the selection of the elements to be examined. It is possible to survey every element of a population by taking a complete canvass of it; such a complete canvass is called a census. There is another possibility: a certain part of the population, a sample of elements of the larger group, is surveyed, and on the basis of the data obtained from this subset, conclusions are drawn about the entire group. The ability to generalize from sample data to the larger group depends on the method by which the sample was collected. Much of this chapter is devoted to how a sample should be selected and why.

Census
A complete canvass of the population.
Sample
A collection of elements of a subset of a larger group of objects.

The concept of "population" or "collection" can refer not only to people, but also to firms operating in the manufacturing industry, to retail or wholesale organizations, or even to completely inanimate objects, such as parts produced in an enterprise; this concept is defined as the entire set of elements that satisfy certain specified conditions. These conditions clearly define both the elements that belong to the target group and the elements that should be excluded from consideration.

Research to determine the demographic profile of frozen pizza consumers should begin by establishing who should and should not be classified as such. Do people who have tried this pizza at least once belong to the category? Individuals who buy at least one pizza per month? Per week? People who eat more than a certain minimum amount of pizza in a month? The researcher must be very precise in identifying the target group. It is also necessary to ensure that the sample is drawn from the target population and not from "some" population, which happens when the sampling frame is unsuitable or incomplete. The sampling frame is the list of elements from which the actual sample will be formed.

A researcher may prefer sampling to a survey of the entire population for several reasons. First, a complete examination of even a relatively small population requires large material and time costs. Often, by the time a census is completed and the data are processed, the information is already out of date. In some cases a census is simply impossible. Say researchers set out to check whether the actual service life of incandescent lamps matches the rated one, for which they must keep the lamps on until they fail. If the entire stock of lamps is examined this way, reliable data will be obtained, but there will be nothing left to sell.

Finally, to the surprise of novices, a researcher may prefer sampling to a census in order to improve the accuracy of the results. Conducting a census requires a large field staff, which increases the likelihood of systematic (non-sampling) errors. This is one of the reasons the US Census Bureau uses sample surveys to check the accuracy of various censuses. You read that right: sample surveys may be conducted to verify the accuracy of census data.

Sample Design Steps

Figure 15.1 shows a six-step sequence that a researcher designing a sample might follow. First of all, it is necessary to define the target population, the set of elements about which the researcher wants to learn something.

For example, when studying children's preferences, researchers need to decide whether the population being studied will consist of only children, only parents, or both.

Population
The entire set of elements that satisfy certain specified conditions.
Sampling frame
The list of elements from which the sample will be drawn; it may consist of territorial units, organizations, individuals, and other elements.

A certain company tested its electric toy "racers" only on children, who were delighted with them. Parents reacted differently to the new product: mothers did not like that the ride did not teach children to take care of cars, and fathers did not like that the product was made like a toy.
The opposite situation is also possible. A certain company began producing a new food product and launched a nationwide advertising campaign in which the main role was given to a precocious child. The company tested the effectiveness of the commercials only on mothers, who were thrilled. The children found the precocious child, and with him the advertised product itself, disgusting. The product failed 1.

The researcher must decide who or what the relevant population will consist of: individuals, families, firms, other organizations, credit card transactions, and so on. In making such decisions, it is necessary to determine which elements should be excluded from the population. Elements must also be referenced in both time and space, and in some cases additional conditions or restrictions apply. For example, if we are talking about individuals, the population of interest may consist only of persons over 18, or only of women, or only of persons with at least a high school education.

Determining geographical boundaries for the target population can be a special problem in international marketing research, since it increases the heterogeneity of the system under consideration. For example, the relative proportions of urban and rural areas can vary significantly from country to country. The territorial aspect also seriously affects the composition of the population within a single country. For example, in the north of Chile a predominantly Indian population lives compactly, while the southern regions are inhabited mainly by descendants of Europeans.

Coverage (incidence)
Expressed as a percentage, the proportion of elements of a population or group that meet the conditions for inclusion in the sample.

Generally speaking, the more simply the target population is defined, the higher its coverage (incidence) and the easier and cheaper the sampling procedure. Coverage (incidence) is the percentage of elements of a population or group that meet the conditions for inclusion in the sample. Coverage directly affects the time and material costs of the survey. If coverage is high (that is, a large proportion of the population satisfies the one or more simple criteria used to identify potential respondents), the time and cost required for data collection are minimal. Conversely, as the number of criteria that potential respondents must satisfy grows, both material and time costs increase.
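The cost implication of coverage can be sketched with simple arithmetic. In this illustrative snippet, the function name and the target of 400 completed interviews are our own assumptions; the incidence rates are the motorcycling and walking figures cited from Figure 15.2:

```python
def expected_contacts(completes_needed, incidence):
    """Approximate number of people who must be screened to obtain the
    desired number of qualifying respondents, given the incidence rate
    (the share of the population that qualifies)."""
    return completes_needed / incidence

# For a hypothetical target of 400 completed interviews:
print(round(expected_contacts(400, 0.036)))  # motorcycling, 3.6% incidence
print(round(expected_contacts(400, 0.274)))  # recreational walking, 27.4% incidence
```

The same target sample requires roughly 11,000 screening contacts for the low-incidence group versus roughly 1,500 for the high-incidence one, which is exactly why narrowly defined populations cost more to survey.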

Figure 15.2 shows the proportion of the adult population involved in certain sports. The data indicate that surveying people involved in motorcycling (only 3.6% of all adults) is much more difficult and costly than surveying people who take regular recreational walks (27.4% of all adults). The main thing is for the researcher to be precise in determining which elements should be included in the population under study and which should be excluded. A clear statement of the research goal greatly facilitates this task.

The second step in the sample selection process is to determine the sampling frame, which, as you already know, is the list of elements from which the sample will be drawn. Let the target population of a study be all families living in the Dallas area. At first glance, a good and easily accessible sampling frame would be the Dallas telephone directory. On closer examination, however, the list of families in the directory turns out to be not quite right: some families are omitted (it obviously excludes families without a telephone), while other families have several telephone numbers. Persons who have recently changed their place of residence and, accordingly, their telephone number are also missing from the directory.

Experienced researchers find that there is rarely an exact match between the sampling frame and the target population of interest. One of the most creative parts of sample design is determining an appropriate sampling frame when a list of population elements is difficult to obtain. This may require sampling from working telephone-number blocks and prefixes when, for example, random digit dialing is used because of the shortcomings of telephone directories. However, the significant increase in working number blocks over the past 10 years has made this task more difficult. Similar situations can arise in sample observation of territorial zones or organizations with subsequent subsampling, when, say, the target population is individuals but no accurate current list of them exists.

Source: based on data contained in SSI-LITe™: Low Incidence Targeted Sampling (Fairfield, Conn.: Survey Sampling, Inc., 1994).

The third stage of the sampling procedure is closely related to the determination of the sampling frame. The choice of sampling method or procedure depends largely on the sampling frame adopted by the researcher, and different types of samples require different types of sampling frames. This chapter and the next provide an overview of the main types of samples used in marketing research. When they are described, the connection between the sampling frame and the method of forming the sample should become obvious.

The fourth step in the sampling procedure is to determine the sample size; this problem is discussed in Chapter 17. At the fifth stage, the researcher must actually select the elements that will be examined. The method used for this purpose is determined by the selected sample type; when discussing sampling methods, we will also talk about the selection of elements. Finally, the researcher must actually survey the selected respondents. At this stage there is a high probability of a number of errors. These problems and some methods for resolving them are discussed in Chapter 18.

Types of sampling plans

All sampling methods can be divided into two categories: probability sampling and deterministic sampling. Each member of the population can be included in a probability sample with a certain specified non-zero probability. The probability of inclusion may vary across members of the population, but for each element it is known. It is determined by the specific mechanical procedure used to select the sample elements.
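The defining property of a probability sample, a known non-zero inclusion probability produced by a mechanical procedure, can be illustrated with the simplest case, a simple random sample. This is a sketch with made-up numbers (N = 1000, n = 50):

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw a simple random sample without replacement: every element
    has the same known inclusion probability, n / N."""
    rng = random.Random(seed)
    return rng.sample(population, n)

population = list(range(1000))                           # N = 1000 hypothetical members
sample = simple_random_sample(population, 50, seed=42)   # n = 50
print(len(sample))                # 50 distinct elements
print(50 / len(population))       # known inclusion probability: 0.05
```

A deterministic sample has no analogue of that last line: with no selection mechanism, there is no probability to compute, which is exactly why its error cannot be assessed.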

For deterministic samples, it is impossible to assess the probability that any given element is included in the sample, so the representativeness of such a sample cannot be guaranteed. For example, Allstate Corporation developed a system to process insurance claims data for 14 million households (its clients). The company plans to use these data to find patterns in demand for its services, for example, the likelihood that a household that owns a Mercedes-Benz will also own a vacation home (which will require insurance). Although the database is very large, the company has no way to assess the likelihood that any particular customer will file a claim. The company therefore cannot be sure that the data on customers who file claims are representative of all of its customers, still less of its potential customers.

All deterministic samples are based on the individual position, judgment, or preference of the researcher rather than on a mechanical procedure for selecting sample elements. Such judgments can sometimes yield good estimates of population characteristics, but there is no way to determine objectively whether the sample is suitable for the task at hand. The accuracy of sample results can be assessed only when the probabilities of selecting particular elements are known. For this reason, probability sampling is generally considered the superior method for estimating the magnitude of sampling error. Samples can also be divided into fixed-size samples and sequential samples. With fixed-size samples, the sample size is determined before the survey begins, and all the necessary data are collected before the results are analyzed. We will mainly be interested in fixed-size samples, since this is the type usually used in marketing research.

Probability sampling
A sample in which each element of the population can be included with some known non-zero probability.
Deterministic sampling
Sampling based on certain private preferences or judgments that determine the selection of certain elements; in this case, assessing the probability of including an arbitrary population element in the sample becomes impossible.

However, it should not be forgotten that there are also sequential samples that can be used with each of the basic sampling designs discussed below.

In sequential sampling, the number of selected elements is not known in advance; it is determined through a series of successive decisions. If a survey of a small sample does not lead to a reliable result, the range of elements surveyed is expanded. If the result is still inconclusive, the sample size is increased again. At each stage a decision is made: either the result obtained is considered sufficiently convincing, or data collection continues. Working with a sequential sample makes it possible to assess the trend in the data as they are collected, which reduces the cost of additional observations once their usefulness fades.
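A minimal sketch of this stopping logic, with all numbers invented for illustration: observations are drawn in batches, and after each batch the researcher checks whether an approximate 95% confidence interval for the mean is narrow enough to stop, so the final sample size is decided by the data rather than in advance.

```python
import random
import statistics

def sequential_mean_estimate(draw, tolerance, batch=25, max_n=2000, z=1.96):
    """Collect observations batch by batch; after each batch, decide
    whether the result is convincing enough (half-width of an
    approximate 95% CI below `tolerance`) or whether to keep sampling."""
    data = []
    while len(data) < max_n:
        data.extend(draw() for _ in range(batch))
        half_width = z * statistics.stdev(data) / len(data) ** 0.5
        if half_width < tolerance:
            break
    return statistics.mean(data), len(data)

rng = random.Random(0)
mean, n = sequential_mean_estimate(lambda: rng.gauss(50, 10), tolerance=1.0)
print(n)  # sample size chosen by the stopping rule, not fixed beforehand
```

With a fixed-size design, by contrast, `n` would be set before the first observation was drawn.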

Both probability and deterministic sampling designs come in a number of types. Deterministic samples, for example, can be non-representative (convenience), purposive, or quota samples; probability samples are divided into simple random, stratified, and group (cluster) samples, which in turn can be divided into subtypes. Figure 15.3 shows the types of samples that will be discussed in this and the next chapter.

Fixed-size sample
A sample whose size is determined a priori; the necessary information is collected from the selected elements.
Sequential sampling
A sample formed based on a series of sequential decisions. If, after considering a small sample, the result appears inconclusive, a larger sample is considered; if this step does not lead to a result, the sample size is increased again, etc. Thus, at each stage a decision is made as to whether the result obtained can be considered sufficiently convincing.

It should be remembered that the basic sample types can be combined into more complex sampling plans. If you understand the basic initial types, it will be easier to understand more complex combinations.

Deterministic samples

As already mentioned, personal assessments or decisions play a decisive role in selecting the elements of a deterministic sample. Sometimes these judgments come from the researcher; in other cases the selection of population elements is left to field workers. Since elements are not selected mechanically, it is impossible to determine the probability that an arbitrary element is included in the sample, and hence the sampling error. Not knowing the error attributable to the sampling procedure prevents researchers from assessing the accuracy of their estimates.

Non-representative (convenience) samples

Non-representative (convenience) samples are sometimes called random because the selection of sample elements is carried out in a “random” manner - those elements that are or seem to be most available during the sampling period are selected.

Our daily life is replete with examples of such selections. We talk with friends and, based on their reactions and positions, we draw conclusions regarding the political biases prevailing in society; a local radio station calls on people to express their views on a controversial issue, and the views they express are interpreted as prevailing; We encourage volunteers to cooperate and work with those who volunteer to help us. The problem with convenience samples is obvious—we cannot be sure that samples of this kind actually represent the target population. We may still doubt that the opinions of our friends accurately reflect the political views prevailing in society, but we often really want to believe that larger samples, selected in the same way, are representative. Let us show the fallacy of such an assumption with an example.
Several years ago, one of the local television stations in the author's city conducted a daily public opinion poll on topics of interest to the local community. The polls, called the "Pulse of Madison," were conducted as follows: each evening during the six o'clock news, the station asked viewers a question on a controversial issue that could be answered yes or no.

A positive answer required calling one telephone number; a negative answer, another. The numbers of votes "for" and "against" were counted automatically. The ten o'clock news broadcast reported the results of the telephone survey. Every evening 500 to 1,000 people called the studio to express their position on some issue, and a television commentator interpreted the poll results as prevailing public opinion.

Non-representative (convenience) sample
Sometimes called random because the selection of sample elements is carried out in a “random” manner—those elements that are or appear to be most available during the sampling period are selected.

On one of the six o'clock broadcasts, viewers were asked: "Do you think the drinking age in Madison should be lowered to 18?" The existing legal age was 21. The audience responded with extraordinary activity: that evening almost 4,000 people called the studio, 78% of whom favored lowering the age limit. It seems obvious that a sample of 4,000 people "should be representative" of a community of 180,000. Nothing of the kind. As you have probably guessed, a certain age group of the population was far more interested in the outcome of the vote than the others. Accordingly, it was not surprising when, during discussion of the issue a few weeks later, it turned out that students had acted in concert during the time allotted for the survey: they called the station in turn, each several times. Thus neither the sample size nor the percentage of supporters of liberalizing the law was at all surprising. The sample was not representative.

Simply increasing the sample size does not make it representative. Representativeness is ensured not by size but by a proper procedure for selecting elements. When survey participants identify themselves voluntarily, or when sample elements are selected on the basis of availability, the sampling plan does not guarantee a representative sample. Empirical evidence suggests that samples selected for convenience are rarely representative, regardless of their size. Telephone call-in polls that tally 800-900 votes are the most common form of large but unrepresentative samples.
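A small simulation makes the same point that size does not cure self-selection. The numbers here are invented: suppose 40% of a community actually supports a measure, but supporters are five times as likely to phone in. The call-in share then settles near 77% no matter how many calls arrive:

```python
import random

def call_in_poll(n_calls, true_support=0.40, zeal=5.0, seed=1):
    """Simulate a volunteer call-in poll in which supporters are
    `zeal` times more likely to call in than opponents."""
    rng = random.Random(seed)
    # Share of incoming calls that are "yes", given unequal propensity to call.
    p_yes_call = true_support * zeal / (true_support * zeal + (1 - true_support))
    yes = sum(rng.random() < p_yes_call for _ in range(n_calls))
    return yes / n_calls

for n in (500, 4000, 100000):
    print(n, round(call_in_poll(n), 3))  # stays near 0.77, never approaches 0.40
```

Growing the sample only tightens the estimate around the wrong value; the bias comes from the selection mechanism, not from the sample size.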

Purposive sample
A deterministic sample whose elements are selected by hand; exactly those elements are selected that, in the researcher's opinion, meet the objectives of the survey.
Snowball sample
A purposive sample that depends on the researcher's ability to identify an initial set of respondents with the desired characteristics; these respondents are then used as informants to determine the further selection of individuals.

Unfortunately, many people take the results of such surveys on faith. One of the most typical uses of non-representative samples in international marketing research is to survey a country on the basis of a sample of its nationals currently living in the country that initiated the survey (for example, Scandinavians living in the USA). Although such samples may shed some light on certain aspects of the population in question, it must be remembered that these individuals usually represent an "Americanized" elite whose connection with their home country may be rather tenuous. Non-representative samples are not recommended for descriptive or causal research. They are permissible only in exploratory research aimed at developing ideas or concepts, but even there purposive samples are preferable.

Purposive samples

The elements of a purposive sample, sometimes called a judgment sample, are selected by hand: exactly those elements are chosen that, in the researcher's opinion, serve the objectives of the study. Procter & Gamble used this method when it showed advertisements to 13- to 17-year-olds living near its headquarters in Cincinnati. The company's food and beverage division hired this group of teenagers to act as a kind of consumer panel. Working 10 hours a week in exchange for $1,000 and a trip to a concert, they watched television commercials, visited supermarkets with company managers to view product displays, tested new products, and discussed purchasing behavior. By selecting sample members through a "recruitment" process rather than at random, the company could focus on attributes it considered useful, such as a teenager's ability to express himself clearly, at the risk that the views expressed might not be representative of the age group.

As already stated, the distinctive feature of a purposive sample is the directed selection of its elements. In some cases, sample elements are selected not because they are representative but because they can provide information of interest to the researchers. When a court relies on expert testimony, it is, in a sense, resorting to purposive sampling. A similar situation can arise in the development of research projects: in an initial study of an issue, the researcher is primarily interested in determining the prospects of the research, and this determines the selection of sample elements.

Snowball sampling is a type of purposive sampling used when working with special types of populations. This sample depends on the researcher's ability to identify an initial set of respondents with the desired characteristics. These respondents are then used as informants to determine the further selection of individuals.

Imagine, for example, that a company wants to evaluate the need for a certain product that would allow deaf people to communicate by telephone. Researchers can begin to develop this problem by identifying key figures in the deaf community; the latter could name other members of this group who would agree to take part in the survey. With such tactics, the sample grows like a snowball.

While the researcher is at the initial stages of exploring a problem, identifying the prospects and possible limitations of a planned survey, the use of purposive sampling can be very effective. But the weaknesses of this type of sample must never be forgotten, because a researcher may also use it in descriptive or causal studies, and this will immediately affect the quality of the results. A classic example of such forgetfulness is the Consumer Price Index (CPI). As Sudman points out: "The CPI is determined only for 56 cities and metropolitan areas, whose selection is also influenced by political factors. In fact, these cities can represent only themselves, while the index is called the consumer price index for urban wage earners* and clerical workers and is taken by most people as an index reflecting the price level in any region of the United States. The selection of retail outlets is also made in a non-random manner, as a result of which estimating the possible sampling error becomes impossible" (emphasis added) 2.

* That is, workers. - Translator's note.

Quota samples

The third type of deterministic sample is the quota sample, whose known representativeness is achieved by including the same proportion of elements with certain characteristics as in the population under study (see "Research Window 15.1"). As an example, consider an attempt to create a representative sample of students living on campus. If a sample of 500 individuals contains not a single senior, we would be right to doubt its representativeness and the legitimacy of applying results obtained from it to the surveyed population. When working with a quota sample, the researcher can ensure that the proportion of seniors in the sample corresponds to their proportion in the total number of students.

Suppose a researcher conducting a sample survey of university students wants the sample to reflect not only the students' gender but also their distribution across years of study. Let the total number of students be 10,000: 3,200 freshmen, 2,600 sophomores, 2,200 juniors, and 2,000 seniors; 7,000 of them are men and 3,000 are women. For a sample of 1,000, the proportional quota plan calls for 320 freshmen, 260 sophomores, 220 juniors, and 200 seniors; 700 men and 300 women. The researcher can implement this plan by assigning each interviewer a specific quota determining which students to contact.
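The quota arithmetic above is simply proportional allocation. A minimal sketch (the helper name is ours, and rounding is handled naively):

```python
def proportional_quotas(counts, sample_size):
    """Allocate sample quotas proportionally to known population counts.
    Simple rounding; a real design must reconcile rounding so the quotas
    still sum to the intended sample size."""
    total = sum(counts.values())
    return {group: round(sample_size * count / total)
            for group, count in counts.items()}

year = {"freshmen": 3200, "sophomores": 2600, "juniors": 2200, "seniors": 2000}
gender = {"men": 7000, "women": 3000}

print(proportional_quotas(year, 1000))    # 320 / 260 / 220 / 200
print(proportional_quotas(gender, 1000))  # 700 / 300
```

This reproduces the quotas stated in the text: each group's quota is its population share multiplied by the sample size.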

Quota sample
A deterministic sample selected so that the proportion of sample elements possessing certain characteristics approximately corresponds to the proportion of such elements in the population under study; each field worker is given a quota defining the characteristics of the people he or she must contact.

An interviewer who is to conduct 20 interviews may be instructed to question:

            • six freshmen: five men and one woman;
            • six sophomores: four men and two women;
            • four juniors: three men and one woman;
            • four seniors: two men and two women.

Note that the selection of specific sample elements is determined not by the research plan but by the interviewer's choice, which must satisfy only the conditions set by the quota: interview six freshmen, five of them men, and so on.

Note also that this quota accurately reflects the gender distribution of the student population but somewhat distorts the distribution across years of study: 70% (14 of 20) of the interviews are with men, but only 30% (6 of 20) are with freshmen, although freshmen make up 32% of the student body. The quota given to each individual interviewer need not, and usually does not, reflect the distribution of the control characteristics in the population; only the resulting overall sample must have the appropriate proportions.

It should be remembered that quota samples depend on personal, subjective judgments rather than on an objective procedure for selecting sample elements. Moreover, unlike a purposive sample, the personal judgment here belongs not to the designer of the study but to the interviewer. The question arises whether quota samples can be considered representative even when they reproduce the population's proportions on the control characteristics. Three remarks need to be made in this regard.

First, the sample may differ significantly from the population on some other important characteristic, which can seriously affect the result. For example, if the study concerns racial prejudice among students, it may matter whether respondents come from a city or the countryside. Since no quota was set for the "urban/rural background" characteristic, an accurate representation of this characteristic is unlikely. Of course, there is an alternative: define quotas for all potentially relevant characteristics. However, increasing the number of control characteristics leads to a more complex specification, which makes selecting sample elements harder, sometimes even impossible, and in any case raises the cost. If, for example, urban or rural background and socioeconomic status are also relevant to the study, the interviewer may have to look for a freshman who is male, urban, and upper or middle class. Agreed, finding simply a male freshman is much easier.

Second, it is very difficult to ensure that a given sample is truly representative. One can, of course, check whether the sample distribution of characteristics not used as controls matches their distribution in the population. However, such a check can lead only to negative conclusions: the only thing it can reveal is a divergence of distributions. Even if the sample and population distributions match on each of these characteristics, the sample may still differ from the population in some other way that was not explicitly checked.

And finally, third: interviewers left to their own devices tend to take shortcuts. They too often interview their friends, and since friends tend to resemble the interviewers themselves, there is a danger of bias. Evidence from England suggests that quota samples tend toward:

  1. exaggeration of the role of the most accessible elements;
  2. downplaying the role of small families;
  3. exaggeration of the role of families with children;
  4. downplaying the role of workers involved in industrial production;
  5. downplaying the role of those with the highest and lowest incomes;
  6. downplaying the role of poorly educated citizens;
  7. downplaying the role of persons occupying a low social position.
Interviewers who fill quotas by stopping random passers-by are likely to concentrate on areas with many potential respondents, such as shopping centers, railway stations and airports, entrances to large supermarkets, and the like. This practice leads to overrepresentation of the groups of people who visit such places most often. When home visits are required, interviewers are often guided by convenience.
For example, they may conduct surveys only during the day, which underrepresents the opinions of working people. They also tend to avoid dilapidated buildings and, as a rule, do not climb to the upper floors of buildings without elevators.

Depending on the specifics of the problem being studied, these tendencies can lead to various kinds of errors, and correcting them at the data-analysis stage is extremely difficult. With an objective selection of sample elements, by contrast, researchers have at their disposal tools that simplify the assessment of a sample's representativeness. When analyzing the representativeness of such samples, the researcher considers not so much the composition of the sample as the procedure by which its elements were selected.

Research Window: Brilliant! But who will read this?

Every year advertisers spend millions of dollars on advertisements running in countless publications, from Advertising Age to Yankee. Some evaluation of the copy and artwork can be carried out before publication, in-house at the advertising agency; the true test comes only after the advertisement appears, surrounded by dozens of equally carefully prepared advertisements vying for the reader's attention.

Roper Starch Worldwide assesses the readership of advertisements placed in consumer, business, trade, and professional magazines and newspapers. The results of this research are made available to advertisers and agencies, for an appropriate fee of course. Because advertisers go to great lengths every day trying to get their advertisements in front of consumers, Starch set out to compile a sample that would provide subscribers with timely and accurate information about advertising effectiveness. Every year Starch surveyed more than 50,000 people and examined about 20,000 advertisements in roughly 500 individual publications.

Starch used proportional (quota) sampling with a minimum of 100 readers of each gender, having concluded that at this sample size the major variations in readership levels stabilized. Readers over 18 years of age were interviewed in person for all publications except those intended for special groups of the population (for example, teenage girls were interviewed to evaluate Seventeen magazine).

The surveys took into account the distribution zone of each publication. A study of Los Angeles magazine, say, covered readers living in southern California, while Time was studied nationally. Each survey was devoted to individual issues of the magazine and was conducted in 20 to 30 cities simultaneously.

Each interviewer was assigned a small quota of interviews, which served to minimize survey bias. Interviews were distributed among people of different occupations, ages, and incomes, so that each study represented a reasonably broad readership. For professional, business, and industry publications, the specifics of their subscription and distribution were also taken into account; for publications with a fairly narrow distribution, subscription lists made it possible to select suitable respondents.

In each survey, interviewers asked respondents to look through the publication and asked whether they had noticed particular advertisements. If the answer was affirmative, the interviewer asked a series of questions to assess how fully the advertisement had been perceived.

This assessment distinguished three levels:

  • Noted: those who remembered having seen the advertisement in that issue.
  • Associated: those who remembered some part of the advertisement identifying the brand or the advertiser.
  • Read most: those who read at least half of the advertisement.

After examining all the advertisements, interviewers recorded basic classification information: gender, age, occupation, family status, nationality, income, family size and composition. This allowed cross-tabulation of the levels of reader interest.

When used properly, Starch data allow advertisers and agencies to identify both the advertising approaches that fail and those that attract and hold the reader's attention. Such information is extremely valuable for advertisers concerned above all with the effectiveness of their advertising campaigns.

Source: Roper Starch Worldwide, Mamaroneck, NY 10543.

Probability samples

A researcher can determine the probability that any element of the population is included in a probability sample, because the selection of elements is carried out by an objective process that does not depend on the whims and preferences of the researcher or field worker. Since the selection procedure is objective, the researcher can assess the reliability of the results, which is impossible with deterministic samples, however carefully their elements were selected.

One should not think that probabilistic samples are always more representative than deterministic ones. In fact, a deterministic sample may be more representative. The advantage of probability samples is that they allow one to estimate possible sampling error. If a researcher works with a deterministic sample, he does not have an objective method for assessing its adequacy to the purposes of the study.

Simple random sampling

Most people have encountered simple random sampling in one way or another, either in a college statistics course or by reading about the results of such studies in newspapers or magazines. In simple random sampling, each element of the population has an equal, known probability of being included in the sample, and every combination of elements of the population can potentially become the sample. For example, to draw a simple random sample of all students enrolled in a particular college, we need only make a list of all students, assign a number to each name, and have a computer randomly select the required number of elements.

Population

Population
A set of elements that satisfy certain specified conditions; also called the study (target) population.
Parameter
A specific characteristic or indicator of a general or study population.

The general, or studied, population is the population from which the selection is made. This population can be described by a number of parameters: characteristics of the general population, each of which is a quantitative indicator distinguishing one population from another.

Imagine that the population under study is the entire adult population of Cincinnati. A number of parameters can be used to describe it: average age, proportion of the population with higher education, income level, and so on. Note that all these indicators have fixed values. We could, of course, calculate them by conducting a complete census of the population being studied. Usually, however, we rely not on a census but on a sample, and we use the values obtained from sample observation to estimate the required population parameters.

Let us illustrate this with Table 15.1, which presents a hypothetical population of 20 individuals. Working with such a small hypothetical population has a number of advantages. First, the small size makes it easy to calculate the population parameters that describe it. Second, it provides insight into what can happen under a particular sampling plan. Both features make it easy to compare the sample results with the "true", in this case known, population value, something that is impossible in the typical situation where the actual population value is unknown. The comparison of an estimate with the "true" value becomes especially clear in this case.

Suppose we want to estimate, from two randomly selected elements, the average income of the individuals in the population. The average income is the parameter to be estimated; denote it μ. To compute it, we divide the sum of all income values by their number:

Population mean μ = Sum of the values of the population elements / Number of elements.

For the population of Table 15.1 the calculation gives μ = 9400.

Derived population

The derived population consists of all possible samples that can be selected from the general population under a given sampling plan. A statistic is a characteristic, or indicator, of a sample. The value of a sample statistic is used to estimate a particular population parameter; different samples yield different values of the statistic, that is, different estimates of the same parameter.

Derived population
The totality of all possible distinguishable samples that can be selected from the population according to a given sampling plan.
Statistic
A characteristic or indicator of a sample.

Consider the derived population of all possible samples that can be selected from our hypothetical population of 20 individuals under a sampling plan in which a sample of size n = 2 is obtained by random selection without replacement.

Suppose for the moment that the data for each population unit, in our case the individual's name and income, are recorded on disks, which are then dropped into a jug and mixed. The researcher draws one disk from the jug, writes down its information, and sets it aside; he does the same with a second disk. Then both disks are returned to the jug, its contents are mixed, and the sequence of actions is repeated. Table 15.2 shows the possible outcomes of this procedure: for 20 disks, 190 such paired combinations are possible.
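The jug procedure can be reproduced by direct enumeration. A minimal sketch in Python, using assumed stand-in incomes since the actual values of Table 15.1 are not reproduced here:

```python
from itertools import combinations

# Hypothetical stand-in incomes for the 20 individuals of Table 15.1
# (assumed values, labeled A through T).
population = {chr(ord("A") + i): 1000 * (i + 1) for i in range(20)}

# Every distinguishable sample of size n = 2 drawn without replacement:
pairs = list(combinations(population, 2))

print(len(pairs))  # 190 paired combinations, as in Table 15.2
```

Each element of `pairs` corresponds to one row of Table 15.2, i.e. to one element of the derived population.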

For each combination the average income can be calculated. For sample AB (k = 1):

k-th sample mean = Sum of the sample values / Number of sample elements = 5800.

Fig. 15.4 shows the estimate of average income for the whole population and the size of the error of each estimate for the samples k = 25, 62, 108, 147, and 189.

Before considering the relationship between the sample mean income (a statistic) and the mean income of the population (the parameter to be estimated), let us say a few words about the derived population. First, in practice we do not construct populations of this kind; that would require too much time and effort. The practitioner compiles just one sample of the required size. The researcher uses the concept of the derived population, and the associated concept of the sampling distribution, when formulating final conclusions, as will be shown below.

Second, it should be remembered that the derived population is defined as the totality of all possible distinguishable samples that can be selected from the population under a given sampling plan. When any part of the sampling plan changes, the derived population changes as well. Thus, if the researcher returns the first disk to the jug before drawing the second, the derived population will also include samples AA, BB, and so on. If the size of the without-replacement samples is 3 rather than 2, samples such as ABC appear, and there are 1140 of them rather than 190. Changing from simple random sampling to any other method of selecting elements likewise changes the derived population.

It should also be remembered that selecting a sample of a given size from the general population is equivalent to selecting one element (here, 1 out of 190) from the derived population. This fact underlies many statistical inferences.

Sample mean and population mean

Are we entitled to equate the sample mean with the true population mean? We assume that they are related, but we also expect some error. For example, information obtained from Internet users can be expected to differ significantly from the results of a survey of the general population. In other cases we may expect a fairly close match; otherwise we could not use the sample value to estimate the population value. But how large might the error be?

Let us add up all the sample means in Table 15.2 and divide the sum by the number of samples, that is, let us average the averages. The result coincides with the population mean. In this case we say that we are dealing with an unbiased statistic.

A statistic is said to be unbiased if its mean over all possible samples is equal to the population parameter being estimated. Note that we are not speaking of any particular value: a particular estimate may be quite far from the true value; take, for example, samples AB or ST. In some cases the true population value cannot be attained by any possible sample, even though the statistic is unbiased. That is not the case here: a number of possible samples, for example AT, yield a sample mean equal to the true population mean.
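The unbiasedness claim can be checked by brute force: average the means of all 190 without-replacement pairs and compare with the population mean. A sketch with assumed stand-in incomes (not the values of Table 15.1):

```python
from itertools import combinations

# Hypothetical incomes standing in for Table 15.1 (assumed values).
incomes = [1000 * (i + 1) for i in range(20)]
mu = sum(incomes) / len(incomes)  # population mean

# Mean of every two-element sample in the derived population,
# then the average of those 190 means:
sample_means = [(a + b) / 2 for a, b in combinations(incomes, 2)]
mean_of_means = sum(sample_means) / len(sample_means)

# The average of the sample means reproduces the population mean exactly.
print(mean_of_means, mu)
```

By symmetry each element appears in exactly 19 of the 190 pairs, which is why the average of the averages collapses back to μ.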

It makes sense to consider the spread of these sample estimates, and in particular the relationship between this spread and the variation of income in the population. The population variance is used as the measure of variation. To determine it, we calculate the deviation of each value from the mean, sum the squares of all deviations, and divide the sum by the number of terms. Denote the population variance by σ². Then:

Population variance σ² = Sum of squared deviations of each population element from the population mean / Number of population elements.

The variance of the mean income level can be determined in the same way: by finding the deviation of each sample mean from the overall mean of the means, summing the squares of those deviations, and dividing the sum by the number of terms.

We can also determine the variance of the mean income level in another way, from the variance of income in the population, since the two quantities are directly related. Specifically, when the sample constitutes only a small fraction of the population, the variance of the sample mean equals the population variance divided by the sample size:

σx̄² = σ²/n,

where σx̄² is the variance of the sample mean income, σ² is the variance of income in the population, and n is the sample size.
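For sampling with replacement the relation between the two variances is exact and can be verified by enumerating every ordered pair. A sketch with assumed stand-in incomes (not the values of Table 15.1); for without-replacement sampling from a small population a finite population correction would apply:

```python
from itertools import product

# Hypothetical incomes (assumed values).
incomes = [1000 * (i + 1) for i in range(20)]
N = len(incomes)
mu = sum(incomes) / N
sigma2 = sum((x - mu) ** 2 for x in incomes) / N  # population variance

# All ordered samples of size n = 2 drawn WITH replacement (20 x 20 = 400):
n = 2
means = [(a + b) / 2 for a, b in product(incomes, repeat=n)]
var_of_means = sum((m - mu) ** 2 for m in means) / len(means)

# Variance of the sample mean equals sigma^2 / n exactly in this setting.
print(var_of_means, sigma2 / n)
```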

Now let us compare this distribution of estimates with the distribution of the quantitative characteristic in the population. Figure 15.5 shows that the population distribution of the characteristic, shown in panel A, is multimodal (each of the 20 values occurs only once) and symmetric about the true population mean of 9400.

Sampling distribution
The distribution of the values of a given statistic calculated for all possible distinguishable samples that can be selected from the population under a given sampling plan.

The distribution of estimates shown in panel B is based on the data of Table 15.3, which in turn was compiled by assigning the values of Table 15.2 to groups according to their size and then counting the number in each group. Panel B is a traditional histogram of the kind considered at the very beginning of a statistics course; it represents the sampling distribution of the statistic. Note in passing that the concept of the sampling distribution is the most important concept in statistics; it is the cornerstone of statistical inference. From the known sampling distribution of the statistic under study, we can draw conclusions about the corresponding population parameter. If it were known only that the sample estimate varies from sample to sample, but the nature of that variation were unknown, it would be impossible to determine the sampling error attached to the estimate. Because the sampling distribution of an estimate describes how it varies from sample to sample, it provides a basis for judging the reliability of the sample estimate. This is why probability sampling designs are so important for statistical inference.

From the known probabilities of including each population element in the sample, researchers can derive the sampling distribution of various statistics. They rely on these distributions, whether of the sample mean, the sample proportion, the sample variance, or some other statistic, when extending a sample result to the population. Note also that for samples of size 2 the distribution of sample means is unimodal and symmetric about the true mean.

So we have shown that:

  1. The mean of all possible sample means equals the population mean.
  2. The variance of the sample means is related in a definite way to the population variance.
  3. The distribution of sample means is unimodal, while the distribution of the values of the quantitative characteristic in the population is multimodal.

Central limit theorem

A theorem stating that for simple random samples of size n drawn from a population with mean μ and variance σ², the distribution of the sample mean x̄ approaches, for large n, a normal distribution with mean μ and variance σ²/n. The accuracy of this approximation increases with increasing n.

Central limit theorem. The unimodal distribution of the estimates can be seen as a manifestation of the central limit theorem, which states that for simple random samples of size n drawn from a population with true mean μ and variance σ², the distribution of the sample mean approaches, for large n, a normal distribution with center equal to the true mean and variance equal to the population variance divided by the sample size, i.e.:

x̄ ∼ N(μ, σ²/n).
The approximation becomes more and more accurate as n grows. Remember: regardless of the shape of the population distribution, the distribution of sample means will be approximately normal for samples of sufficiently large size. What counts as sufficiently large? If the distribution of the quantitative characteristic in the population is itself normal, the distribution of sample means is normal even for samples of size n = 1. If the population distribution is symmetric but not normal, very small samples already yield a near-normal distribution of sample means. If the population distribution is markedly skewed, larger samples are needed. In any case, the distribution of the sample mean can be taken as normal only when the sample is of sufficient size.
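The behavior described by the central limit theorem is easy to see in a quick simulation: even for a markedly skewed population, the means of modest-sized samples cluster around μ with variance close to σ²/n. A sketch with an assumed exponential population (not drawn from the chapter's data):

```python
import random
import statistics

random.seed(0)  # fixed seed for reproducibility

# A markedly skewed population (exponential), far from normal.
population = [random.expovariate(1.0) for _ in range(100_000)]
mu = statistics.fmean(population)
sigma2 = statistics.pvariance(population)

# Draw many simple random samples of size n and collect their means.
n = 50
sample_means = [
    statistics.fmean(random.sample(population, n)) for _ in range(2_000)
]

# The sample means center on mu, and their variance is close to sigma2 / n.
print(statistics.fmean(sample_means))
print(statistics.pvariance(sample_means) * n / sigma2)  # ratio near 1
```

Plotting a histogram of `sample_means` would show the near-normal, unimodal shape the theorem predicts, despite the skewed parent population.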

To draw conclusions using the normal curve, it is not necessary to assume that the quantitative characteristic is normally distributed in the population. Rather, we rely on the central limit theorem and, depending on the population distribution, determine a sample size large enough to let us work with the normal curve. Fortunately, an approximately normal distribution of the statistic is achieved with relatively small samples, as Fig. 15.6 clearly demonstrates. Confidence interval estimates. Can all of the above help us draw conclusions about the population mean? After all, in practice we select only one of all possible samples of a given size, and from its data we draw conclusions about the target population.

How does this happen? As is known, in a normal distribution a definite percentage of all observations falls within a definite number of standard deviations of the mean; for example, 95% of observations lie within ±1.96 standard deviations. The normal distribution of sample means to which the central limit theorem applies is no exception. The mean of this sampling distribution equals the population mean μ, and its standard deviation is called the standard error of the mean:

σx̄ = σ/√n.
It turns out that:

  • 68.26% of sample means deviate from the population mean by no more than ±1σx̄;
  • 95.45% of sample means deviate from the population mean by no more than ±2σx̄;
  • 99.73% of sample means deviate from the population mean by no more than ±3σx̄,

that is, a definite proportion of the sample means, depending on the chosen value of z, is contained in the interval determined by z. This can be written as the inequality:

Population mean − z(standard error of the mean) < Sample mean < Population mean + z(standard error of the mean).   (15.1)

Thus, with a certain probability, the sample mean lies in the interval whose endpoints are the mean of the distribution plus and minus a certain number of standard deviations. This inequality can be transformed into:

Sample mean − z(standard error of the mean) < Population mean < Sample mean + z(standard error of the mean).   (15.2)

If relation 15.1 holds in, say, 95% of cases (z = 1.96), then relation 15.2 also holds in 95% of cases. When an inference is based on a single sample mean, we use expression 15.2.

It is important to remember that expression 15.2 does not mean that the interval built around a given sample mean must necessarily include the population mean. The interval refers instead to the selection procedure: an interval constructed around a particular sample mean may or may not include the true population mean. Our confidence in the conclusions rests on the fact that 95% of all intervals constructed under the chosen sampling plan will contain the true mean; we trust that our sample is among that 95%.

To illustrate this important point, imagine for a moment that the distribution of sample means for samples of size n = 2 in our hypothetical example is normal. Table 15.4 shows the outcome for the first 10 of the 190 possible samples that could be selected under this plan. Note that only 7 of the 10 intervals include the true population mean. Confidence in the conclusion stems not from any particular estimate but from the estimation procedure: of 100 samples for which the sample mean and confidence interval are computed, in 95 cases the interval will include the true population value. The accuracy of a given sample is determined by the procedure by which it was selected. A representative sampling plan does not guarantee that every sample is representative. Statistical inference procedures rest on the representativeness of the sampling plan, which is why the plan is so critical for probability samples.
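The coverage property of the interval procedure can be illustrated by simulation: draw many samples, build a 95% interval around each sample mean, and count how often the interval contains the true mean. A sketch with an assumed normal population (the parameters 9400 and 2000 are illustrative, not taken from Table 15.1):

```python
import random
import statistics

random.seed(1)  # fixed seed for reproducibility

# Hypothetical population with a known mean and standard deviation.
population = [random.gauss(9400, 2000) for _ in range(10_000)]
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

n, z = 30, 1.96
trials = 1_000
covered = 0
for _ in range(trials):
    sample = random.sample(population, n)
    x_bar = statistics.fmean(sample)
    half_width = z * sigma / n ** 0.5  # known-sigma standard error
    if x_bar - half_width <= mu <= x_bar + half_width:
        covered += 1

print(covered / trials)  # close to 0.95
```

It is the long-run proportion of intervals, not any single interval, that carries the 95% confidence.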

Probability samples allow us to assess the accuracy of the results as the closeness of the estimates to the true value. The larger the standard error of the statistic, the greater the spread of the estimates and the lower the accuracy of the procedure.

Some may be confused that the confidence level refers to the procedure and not to a particular sample value, but remember that the researcher can adjust the confidence level used in estimating the population value. If you do not want to take chances and are worried that your interval might be one of the 5 in 100 that do not include the population mean, you can choose a 99% confidence interval, for which only one interval in a hundred fails to include the population mean. Furthermore, if you can increase the sample size, you can achieve the desired accuracy of the estimate at a given confidence level. We discuss this in more detail in Chapter 17.

The procedure we are describing has one more component that can cause confusion. Three quantities are used in estimating the confidence interval: x̄, z, and σx̄. The sample mean x̄ is calculated from the sample data, and z is chosen according to the desired confidence level. But what about the standard error of the mean, σx̄? It equals

σx̄ = σ/√n,

and therefore its determination requires the standard deviation σ of the quantitative characteristic in the population. What can be done when σ is unknown? This is not a serious problem, for two reasons. First, for most quantitative characteristics used in marketing research, the variation changes much more slowly than the level of most variables of interest to the marketer; accordingly, if a study is repeated, we can use the previously obtained value of s in the calculations. Second, once the sample has been selected and the data obtained, we can estimate the population variance from the sample variance. The unbiased sample variance is defined as:

Sample variance ŝ² = Sum of squared deviations from the sample mean / (Number of sample elements − 1). To determine the sample variance, first find the sample mean; then take the difference between each sample value and the sample mean, square these differences, sum them, and divide by the number of sample observations minus one. The sample variance not only provides an estimate of the population variance but can also be used to estimate the standard error of the mean. When the population variance σ² is known, the standard error σx̄ is also known, since σx̄ = σ/√n.
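The reason for the (n − 1) divisor is that it makes the sample variance unbiased: averaged over the whole derived population of with-replacement samples, ŝ² reproduces σ² exactly. A sketch with assumed stand-in incomes (not the values of Table 15.1):

```python
from itertools import product

# Hypothetical incomes (assumed values).
incomes = [1000 * (i + 1) for i in range(20)]
N = len(incomes)
mu = sum(incomes) / N
sigma2 = sum((x - mu) ** 2 for x in incomes) / N  # population variance


def sample_variance(xs):
    """Sample variance with the (n - 1) divisor."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


# Compute s^2 for every ordered with-replacement sample of size n = 2,
# then average over the whole derived population.
variances = [sample_variance(pair) for pair in product(incomes, repeat=2)]
expected_s2 = sum(variances) / len(variances)

print(expected_s2, sigma2)  # the two values coincide
```

Dividing by n instead of n − 1 would make `expected_s2` systematically smaller than σ², i.e. a biased estimator.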

When the population variance is unknown, the standard error of the mean can only be estimated. The estimate, denoted ŝx̄, equals the sample standard deviation divided by the square root of the sample size: ŝx̄ = ŝ/√n. The estimate is determined just as the true value would be, but with the sample standard deviation substituted into the formula in place of the population standard deviation. So, for sample AB with a sample mean of 5800:

Accordingly, ŝ = 283, and

ŝx̄ = 283/√2 ≈ 200,

and the 95% interval is now

5800 ± 1.96 × 200, that is, 5800 ± 392,

which is narrower than the previous interval.

Table 15.5 summarizes the calculation formulas for the various means and variances discussed in this chapter. Forming a simple random sample. In our example, sample elements were selected using a jug containing all the elements of the parent population. This helped us visualize the concepts of the derived population and the sampling distribution. We do not recommend this method in practice, since it increases the likelihood of error: the disks can differ in size and texture, which in certain cases can lead to some being preferred over others. The selection of draftees for the Vietnam War, carried out by lottery, is an example of this kind of error.

The selection was carried out by drawing disks bearing dates of birth from a large drum, and television broadcast the procedure across the country. Unfortunately, the disks had been loaded into the drum systematically, January dates first and December dates last. Although the drum was spun intensively, December dates were drawn far more often than January ones. The procedure was later revised so that the likelihood of such systematic errors was significantly reduced. The preferred method of forming a simple random sample is based on a table of random numbers.

Using such a table involves the following steps. First, the elements of the population are assigned sequential numbers from 1 to N; in our hypothetical population, element A receives number 1, element B number 2, and so on. Second, the numbers in the random number table must have the same number of digits as N: for N = 20, two-digit numbers are used; for N between 100 and 999, three-digit numbers; and so on. Third, the starting position is determined randomly; we can open the table of random numbers and, as they say, point a finger with our eyes closed. Since the numbers in the table are in random order, the starting position does not really matter.

Finally, we can move in any predetermined direction, up, down, or across, selecting the elements whose numbers match the random numbers in the table. To illustrate, consider the abbreviated table of random numbers in Table 15.6. Since N = 20, we work only with two-digit numbers, so Table 15.6 suits us perfectly. Suppose we decide in advance to move down the column, with the starting position at the intersection of the eleventh row and the fourth column, where the number 77 stands. This number is too large and must be discarded. The next two numbers are also discarded, but the fourth value, 02, is used, since 2 is the number assigned to element B.

The next five numbers are also discarded as too large, while the number 05 points to element E. Thus elements B and E become our two-element sample for judging the income level of this population. An alternative strategy is to use a computer program that generates random numbers as the basis for selection. Recent publications indicate that the numbers produced by such programs are not perfectly random, which can matter when building complex mathematical models, but they are adequate for most applied marketing research. Note again that simple random sampling requires a sequential numbered list of the elements of the population.
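The whole table-of-random-numbers procedure, numbering the elements and drawing n distinct numbers, is what a pseudorandom generator does in one call. A sketch for the hypothetical 20-element population (element labels assumed):

```python
import random

# Step one: number the population elements 1 through N.
elements = [chr(ord("A") + i) for i in range(20)]  # A..T
numbering = {i + 1: e for i, e in enumerate(elements)}

random.seed(42)  # fixed seed so the draw is reproducible

# random.sample plays the role of the random number table: it returns
# n distinct numbers, i.e. a simple random sample without replacement.
chosen_numbers = random.sample(range(1, 21), 2)
sample = [numbering[k] for k in chosen_numbers]
print(sample)
```

Unlike scanning a printed table, no out-of-range numbers need to be discarded, since the generator is asked only for numbers between 1 and N.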

In other words, each member of the parent population must be identifiable. For some populations this is not difficult; for example, when studying the 500 largest American corporations, the list published in Fortune magazine has already been compiled, so forming a simple random sample presents no difficulty. For other parent populations (for example, all families living in a certain city), compiling a complete list is extremely difficult, which forces researchers to resort to other sampling schemes.

Summary

Learning Objective 1
Clearly distinguish between the concepts of census (qualification) and sampling

A complete enumeration of a population is called a census. A sample is a collection formed from selected elements of the population.

Learning Objective 2
Know the essence and sequence of the six stages implemented by researchers to obtain a sample population

The sampling process is divided into six stages:

  1. defining the population;
  2. determining the sampling frame;
  3. choosing the selection procedure;
  4. determining the sample size;
  5. selecting the sample elements;
  6. examining the selected elements.

Learning Objective 3
Define the concept of "sampling frame"

The sampling frame is the list of elements from which the sample will be drawn.

Learning Objective 4
Explain the difference between probability and deterministic sampling

In a probability sample, each member of the population can be included with a certain given non-zero probability. The probabilities of including particular members of the population in the sample may differ from one another, but the probability of including each element is known. For deterministic samples, assessing the probability of including any element in the sample is impossible, and the representativeness of such a sample cannot be guaranteed. Deterministic sampling relies instead on personal opinion, judgment or preference. Such preferences can sometimes yield good estimates of population characteristics, but there is no way to determine objectively whether the sample is suitable for the task at hand.

Learning Objective 5
Distinguish between fixed-size sampling and multi-stage (sequential) sampling

When working with fixed-size samples, the sample size is determined before the survey begins and the analysis of the results is preceded by the collection of all necessary data. In sequential sampling, the number of selected elements is unknown in advance; it is determined based on a series of sequential decisions.

Learning Objective 6
Explain what purposive sampling is and describe both its strengths and weaknesses

Items in a purposive sample are selected by hand because they appear to the researcher to meet the objectives of the survey, on the assumption that the selected elements can give a complete picture of the population being studied. While the researcher is still at the early stages of exploring the problem and determining the prospects and possible limitations of the planned survey, purposive sampling can be very effective. But its weaknesses must not be forgotten, since the researcher may also be tempted to use such a sample in descriptive or causal studies, which will immediately affect the quality of their results.

Learning Objective 7
Define the concept of quota sampling

A quota sample is selected so that the proportion of sample elements possessing certain characteristics approximately matches the proportion of such elements in the population being studied; to achieve this, each enumerator is given a quota defining the characteristics of the people he must contact.

Learning Objective 8
Explain what a parameter is in a sampling procedure

A parameter is a characteristic or indicator of the general or studied population; a quantitative indicator that distinguishes one population from another.

Learning Objective 9
Explain what a derived set is

The derived population consists of all possible samples that can be selected from the population according to a given sampling plan.

Learning Objective 10
Explain why the concept of sampling distribution is an essential concept in statistics.

The concept of sampling distribution is the cornerstone of statistical inference. Based on the known sampling distribution of the statistics under study, we can draw a conclusion about the corresponding parameter of the population. If it is known only that the sample estimate varies from sample to sample, but the nature of this change is unknown, it becomes impossible to determine the sampling error associated with this estimate. Because the sampling distribution of an estimate describes its variation from sample to sample, it provides a basis for determining the validity of the sample estimate.

A sample is a set of cases (subjects, objects, events, specimens) selected from the general population by a certain procedure to participate in the study.

Sample size

Sample size is the number of cases included in the sample population. For statistical reasons, it is recommended that the number of cases be at least 30-35.

Dependent and independent samples

When comparing two (or more) samples, their dependence is an important parameter. If a homomorphic pair can be established for each case in the two samples (that is, when each case from sample X corresponds to one and only one case from sample Y, and vice versa), and this basis for pairing is relevant to the trait being measured, such samples are called dependent. Examples of dependent samples:

  1. pairs of twins,
  2. two measurements of any trait before and after experimental exposure,
  3. husbands and wives
  4. and so on.

If there is no such relationship between the samples, then these samples are considered independent, for example:

  1. men and women,
  2. psychologists and mathematicians.

Accordingly, dependent samples always have the same size, while the sizes of independent samples may differ.

Comparison of samples is made using various statistical criteria:

  • Student's t-test
  • Wilcoxon T-test
  • Mann-Whitney U test
  • Sign criterion
  • and others
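As a sketch of how dependence changes the comparison, here is the Student's t statistic for dependent (paired) samples, e.g. one trait measured before and after an experimental exposure. The data values are invented for illustration:

```python
import math
import statistics

# Paired (dependent) samples: each "after" value is matched to
# one and only one "before" value. Data are invented.
before = [12, 15, 11, 18, 14, 16, 13, 17]
after  = [14, 17, 12, 21, 15, 19, 13, 20]

# The paired t-test works on the per-pair differences.
diffs = [a - b for a, b in zip(after, before)]
n = len(diffs)
mean_d = statistics.mean(diffs)
sd_d = statistics.stdev(diffs)          # sample SD (n - 1 denominator)
t = mean_d / (sd_d / math.sqrt(n))      # paired t, df = n - 1
print(round(t, 2))                      # → 4.71
```

For independent samples (men vs women, say) the two-sample t-test would be used instead, and the group sizes need not match.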

Representativeness

The sample may be considered representative or non-representative.

Example of a non-representative sample

In the United States, one of the best-known historical examples of an unrepresentative sample dates from the 1936 presidential election. The Literary Digest, which had successfully predicted the outcomes of several previous elections, erred in its predictions when it sent out ten million test ballots to its subscribers, to people selected from telephone books throughout the country, and to people on automobile registration lists. In the 25% of ballots returned (almost 2.5 million), the votes were distributed as follows:

57% preferred Republican candidate Alf Landon

40% chose then-Democratic President Franklin Roosevelt

In the actual election, as is well known, Roosevelt won, gaining more than 60% of the votes. The Literary Digest's mistake was this: wanting to increase the representativeness of the sample (they knew that most of their own subscribers considered themselves Republicans), they expanded it with people selected from telephone books and registration lists. But they failed to take into account the realities of their time and in fact recruited even more Republicans: during the Great Depression it was mainly representatives of the middle and upper classes, that is, mostly Republicans rather than Democrats, who could afford to own telephones and cars.

Types of plan for constructing groups from samples

There are several main types of group building plans:

  • A study with experimental and control groups, which are placed in different conditions.
  • Study with experimental and control groups using a pairwise selection strategy
  • A study using only one group - an experimental group.
  • A study using a mixed (factorial) design - all groups are placed in different conditions.

Group Building Strategies

The selection of groups for participation in a psychological experiment is carried out using various strategies intended to ensure, as far as possible, internal and external validity.

  • Randomization (random selection)
  • Attracting real groups

Randomization

Randomization, or random selection, is used to create simple random samples. The use of such a sample is based on the assumption that each member of the population is equally likely to be included in the sample. For example, to make a random sample of 100 students, you can put pieces of paper with the names of all university students in a hat, and then take 100 pieces of paper out of it - this will be a random selection (Goodwin J., p. 147).
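The hat-drawing procedure maps directly onto `random.sample`, which gives every member of the population an equal chance of selection. The student list here is synthetic:

```python
import random

# Randomization sketch: draw 100 students "out of a hat".
# Names are synthetic placeholders, not real data.
students = [f"student_{i:04d}" for i in range(2500)]

rng = random.Random(7)
sample = rng.sample(students, 100)   # equal chance for every student

print(len(sample), len(set(sample)))   # → 100 100
```

`random.sample` selects without replacement, so no student can appear twice, exactly as with pieces of paper taken from a hat and not returned.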

Pairwise selection

Pairwise selection is a strategy for constructing sample groups in which the groups are made up of subjects equivalent in the secondary parameters that are significant for the experiment. This strategy is effective for experiments with experimental and control groups, with the best option being the attraction of …

In statistics there are two main research methods: continuous and selective. When conducting a sample study, two requirements must be met: representativeness of the sample population and a sufficient number of observation units. When selecting observation units, errors are possible, i.e. events whose occurrence cannot be accurately predicted; these errors are objective and natural. When determining the degree of accuracy of a sampling study, the amount of error that can occur during sampling is estimated: the random representativeness error (m) is the actual difference between the average or relative values obtained in a sample study and the corresponding values that would be obtained in a study of the general population.

Assessing the reliability of the research results involves determining:

1. errors of representativeness

2. confidence limits of average (or relative) values ​​in the population

3. reliability of the difference between average (or relative) values ​​(according to the t criterion)

The representativeness error (m_M) of the arithmetic mean (M) is calculated as:

m_M = σ / √n,

where σ is the standard deviation and n is the sample size (n > 30).

The representativeness error (m_P) of a relative value (P) is calculated as:

m_P = √(P · Q / n),

where P is the corresponding relative value (calculated, for example, in %); Q = 100 − P is the value complementary to P; n is the sample size (n > 30).

In clinical and experimental work it is often necessary to use a small sample, when the number of observations is less than or equal to 30. For a small sample, when calculating the representativeness errors of both average and relative values, the number of observations is reduced by one:

m_M = σ / √(n − 1);  m_P = √(P · Q / (n − 1)).

The magnitude of the representativeness error depends on the sample size: the larger the number of observations, the smaller the error. To assess the reliability of a sample indicator, the following approach is adopted: the indicator (or average value) must be at least 3 times greater than its error, in which case it is considered reliable.
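A minimal sketch of the representativeness-error formulas described above (m = σ/√n for the mean and m = √(P·Q/n) for a relative value, with n − 1 in the denominator for small samples). The numeric inputs are illustrative:

```python
import math

def error_mean(sigma, n):
    """Representativeness error of the mean: sigma / sqrt(n)
    for n > 30, sigma / sqrt(n - 1) for a small sample."""
    return sigma / math.sqrt(n if n > 30 else n - 1)

def error_share(p_percent, n):
    """Representativeness error of a relative value P (in %):
    sqrt(P * Q / n), Q = 100 - P; n - 1 for a small sample."""
    q = 100 - p_percent
    return math.sqrt(p_percent * q / (n if n > 30 else n - 1))

print(round(error_mean(7.3, 100), 2))    # large sample → 0.73
print(round(error_share(30, 1000), 2))   # P = 30 %, n = 1000 → 1.45
```

The reliability rule from the text then amounts to checking that the indicator is at least three times its error, e.g. `M / error_mean(sigma, n) >= 3`.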

Knowing the magnitude of the error is not enough to be confident in the results of a sample study, since a specific error in a particular sample may be significantly greater (or smaller) than the average representativeness error. To specify the accuracy with which a researcher wants to obtain a result, statistics uses the concept of the probability of an error-free forecast, which characterizes the reliability of the results of sample biomedical statistical studies. Typically, a probability of an error-free forecast of 95% or 99% is used. In the most critical cases, when particularly important theoretical or practical conclusions must be drawn, a probability of an error-free forecast of 99.7% is used.

To each degree of probability of an error-free forecast there corresponds a certain marginal error of random sampling (Δ, delta), which is determined by the formula:

Δ = t · m, where t is the confidence coefficient, which, for a large sample, equals 2 for a 95% probability of an error-free forecast, 2.6 for a probability of 99%, and 3 for a probability of 99.7%; for a small sample it is determined using a special table of Student's t values.

Using the marginal sampling error (Δ), one can determine the confidence limits within which, with a given probability of an error-free forecast, the actual value of the statistical quantity characterizing the entire population (average or relative) lies.

To determine confidence limits, the following formulas are used:

1) for average values:

M_gen = M_sample ± t · m_M,

where M_gen are the confidence limits of the average value in the general population; M_sample is the average value obtained in a study of the sample population; t is the confidence coefficient, chosen according to the probability of an error-free forecast with which the researcher wants to obtain the result; m_M is the representativeness error of the average value.

2) for relative values:

P_gen = P_sample ± t · m_P,

where P_gen are the confidence limits of the relative value in the general population; P_sample is the relative value obtained in a study of the sample population; t is the confidence coefficient; m_P is the representativeness error of the relative value.

Confidence limits show the range within which the population value can fluctuate due to random causes.
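The confidence-limit calculation described above (estimate ± t·m) can be sketched as follows; the inputs are illustrative:

```python
def confidence_limits(estimate, m, t=2):
    """Confidence limits: estimate ± t*m, where t is the confidence
    coefficient (t = 2 for ~95% with a large sample) and m is the
    representativeness error of the mean or relative value."""
    delta = t * m                      # marginal sampling error
    return estimate - delta, estimate + delta

# e.g. a sample mean of 12.0 days with error m = 2.3 days
low, high = confidence_limits(12.0, 2.3, t=2)
print(round(low, 1), round(high, 1))   # → 7.4 16.6
```

The same function serves for relative values: pass P_sample and m_P instead of the mean and m_M.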

With a small number of observations (n < 30), the value of the coefficient t for calculating confidence limits is found from a special Student table. The values of t are located at the intersection of the chosen probability of an error-free forecast and the row indicating the number of degrees of freedom, which equals n − 1.

Statistical population- a set of units that have mass, typicality, qualitative homogeneity and the presence of variation.

The statistical population consists of materially existing objects (employees, enterprises, countries, regions).

Unit of the population— each specific unit of a statistical population.

The same statistical population can be homogeneous in one characteristic and heterogeneous in another.

Qualitative uniformity- similarity of all units of the population on some basis and dissimilarity on all others.

In a statistical population, the differences between one population unit and another are often of a quantitative nature. Quantitative changes in the values ​​of a characteristic of different units of a population are called variation.

Variation of a trait- a quantitative change in a characteristic (for a quantitative characteristic) during the transition from one unit of the population to another.

A sign (attribute) is a property, characteristic feature or other trait of units, objects and phenomena that can be observed or measured. Signs are divided into quantitative and qualitative. The diversity and variability of the values of a sign across individual units of a population is called variation.

Attributive (qualitative) characteristics cannot be expressed numerically (population composition by gender). Quantitative characteristics have a numerical expression (population composition by age).

Index- this is a generalizing quantitative and qualitative characteristic of any property of units or the population as a whole under specific conditions of time and place.

Scorecard is a set of indicators that comprehensively reflect the phenomenon being studied.

For example, salary is studied:
  • Sign - wages
  • Statistical population - all employees
  • The unit of the population is each employee
  • Qualitative homogeneity - accrued wages
  • Variation of a sign - a series of numbers

Population and sample from it

The basis is a set of data obtained by measuring one or more characteristics. The actually observed set of objects, statistically represented by a series of observations of a random variable, is a sample, and the hypothetically existing (conjectural) set is the general population. The general population may be finite (number of observations N = const) or infinite (N = ∞), while a sample from a population is always the result of a limited number of observations. The number of observations forming a sample is called the sample size. If the sample size is sufficiently large (n → ∞), the sample is considered large; otherwise it is called a sample of limited size. A sample is considered small if, when measuring a one-dimensional random variable, the sample size does not exceed 30 (n ≤ 30), or if, when measuring k features simultaneously in a multidimensional space, the ratio of n to k does not exceed 10 (n/k < 10). A sample forms a variation series if its members are order statistics, i.e. the sample values of the random variable X are arranged in ascending order (ranked); the values of the characteristic are then called variants.

Example. The same randomly selected set of objects, say the commercial banks of one administrative district of Moscow, can be considered as a sample from the general population of all commercial banks in that district, as a sample from the general population of all commercial banks in Moscow, as a sample from the commercial banks of the country, and so on.

Basic methods of organizing sampling

The reliability of statistical conclusions and the meaningful interpretation of results depend on the representativeness of the sample, i.e. on the completeness and adequacy with which it represents the properties of the general population. The statistical properties of a population can be studied in two ways: by continuous or by non-continuous observation. Continuous observation examines all units of the studied population; partial (selective) observation examines only part of it.

There are five main ways to organize sample observation:

1. simple random selection, in which objects are selected at random from the population (for example, using a table or generator of random numbers), each possible sample having equal probability. Such samples are called properly random;

2. simple selection using a regular procedure is carried out with a mechanical component (for example, date, day of the week, apartment number, letters of the alphabet, etc.); samples obtained in this way are called mechanical;

3. stratified selection consists in dividing the general population of size N into subpopulations or layers (strata) of sizes N1, N2, …, NL so that N1 + N2 + … + NL = N. Strata are homogeneous in terms of statistical characteristics (for example, the population is divided into strata by age group or social class; enterprises, by industry). Such samples are called stratified (also layered, typical, or regionalized);

4. methods of serial selection are used to form serial or nest (cluster) samples. They are convenient when it is necessary to survey a whole "block" or series of objects at once (for example, a batch of goods, products of a certain series, or the population of a territorial-administrative unit of the country). The series can be selected purely randomly or mechanically, after which a complete inspection of the selected batch of goods or territorial unit (a residential building or block) is carried out;

5. combined(stepped) selection can combine several selection methods at once (for example, stratified and random or random and mechanical); such a sample is called combined.
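Three of the selection schemes above (simple random, mechanical, stratified) can be sketched on a toy population of numbered units; the population, seed and stratum boundaries are invented for illustration:

```python
import random

# Toy population of N = 100 numbered units.
population = list(range(1, 101))
rng = random.Random(1)

# 1. simple random (properly random) selection
simple = rng.sample(population, 10)

# 2. mechanical (systematic) selection: every k-th unit
k = len(population) // 10
mechanical = population[::k]            # units 1, 11, 21, ..., 91

# 3. stratified selection: an equal share from each stratum
strata = [population[:50], population[50:]]   # e.g. two age groups
stratified = [u for s in strata for u in rng.sample(s, 5)]

print(len(simple), len(mechanical), len(stratified))   # → 10 10 10
```

Serial selection would apply `rng.sample` to whole series (blocks) rather than to individual units, and combined selection chains two of the schemes together.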

Types of selection

By type, individual, group and combined selection are distinguished. In individual selection, individual units of the general population are selected into the sample; in group selection, qualitatively homogeneous groups (series) of units are selected; combined selection combines the first and second types.

By method, repeated and non-repetitive selection are distinguished.

Non-repetitive selection is selection in which a unit included in the sample does not return to the original population and does not participate in further selection; the number of units in the general population N decreases during selection. In repeated selection, a unit that has entered the sample is, after registration, returned to the general population and thus retains an equal opportunity, along with other units, of being selected again; the number of units in the general population N remains unchanged (this method is rarely used in socio-economic research). However, for large N (N → ∞) the formulas for non-repetitive selection approach those for repeated selection, and the latter are used more often in practice (N = const).
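In code, the two selection methods correspond to sampling without and with replacement; a toy sketch with an invented population and seed:

```python
import random

# Non-repetitive selection: a drawn unit does not return (sample
# without replacement). Repeated selection: the unit is returned
# (sample with replacement).
population = list(range(1, 21))   # N = 20
rng = random.Random(3)

without_replacement = rng.sample(population, 10)    # non-repetitive
with_replacement = rng.choices(population, k=10)    # repeated

print(len(set(without_replacement)), len(without_replacement))  # → 10 10
```

`random.sample` can never repeat a unit, whereas `random.choices` may, which is exactly the distinction drawn above.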

Basic characteristics of the parameters of the general and sample population

The statistical conclusions of a study are based on the distribution of a random variable; the observed values (x1, x2, …, xn) are called realizations of the random variable X (n is the sample size). The distribution of the random variable in the general population is theoretical and ideal in nature, and its sample analogue is the empirical distribution. Some theoretical distributions are specified analytically, i.e. their parameters determine the value of the distribution function at every point in the space of possible values of the random variable. For a sample, the distribution function is difficult and sometimes impossible to determine, so the parameters are estimated from empirical data and then substituted into the analytical expression describing the theoretical distribution. Here the assumption (or hypothesis) about the type of distribution may be statistically correct or erroneous. But in any case, the empirical distribution reconstructed from the sample only roughly characterizes the true one. The most important distribution parameters are the expected value and the variance.

By their nature, distributions are continuous and discrete. The best-known continuous distribution is the normal one; the sample analogues of its parameters μ and σ² are the mean value x̄ and the empirical variance s². Among discrete distributions, the alternative (dichotomous) distribution is used most often in socio-economic research. The expected-value parameter of this distribution expresses the relative value (share) of the population units that possess the characteristic being studied (it is denoted by the letter p); the share of the population not possessing this characteristic is denoted by q (q = 1 − p). The variance of the alternative distribution also has an empirical analogue.

Depending on the type of distribution and on the method of selecting population units, the characteristics of the distribution parameters are calculated differently. The main ones for theoretical and empirical distributions are given in table. 1.

The sampling fraction k_n is the ratio of the number of units in the sample population to the number of units in the general population:

k_n = n/N.

The sample share w is the ratio of the number of units possessing the studied characteristic, n_x, to the sample size n:

w = n_x / n.

Example. In a batch of goods containing 1000 units, with a 5% sample the sampling fraction k_n in absolute terms is 50 units (n = N · 0.05); if 2 defective products are found in this sample, the sample defect share w is 0.04 (w = 2/50 = 0.04, or 4%).

Since the sample population is different from the general population, there are sampling errors.

Table 1. Main parameters of the general and sample populations

Sampling errors

In any case (continuous or selective), errors of two types may occur: registration errors and representativeness errors. Registration errors can be random or systematic in character. Random errors arise from many different uncontrollable causes, are unintentional, and usually balance each other out (for example, changes in instrument readings due to temperature fluctuations in the room).

Systematic errors are directed in one way, since they arise when the rules for selecting objects into the sample are violated (for example, measurement deviations after the settings of the measuring device are changed).

Example. To assess the social situation of the population in the city, it is planned to survey 25% of families. If the selection of every fourth apartment is based on its number, then there is a danger of selecting all apartments of only one type (for example, one-room apartments), which will provide a systematic error and distort the results; choosing an apartment number by lot is more preferable, since the error will be random.

Representativeness errors are inherent only in sample observation; they cannot be avoided, and they arise because the sample population does not reproduce the general population exactly. The values of indicators obtained from the sample differ from the values of the same indicators in the general population (i.e., those that would be obtained by continuous observation).

Sampling bias is the difference between the value of a parameter in the general population and its sample value. For the average value of a quantitative characteristic it equals x̄ − μ, and for the share (alternative characteristic) it equals w − p.

Sampling errors are inherent only to sample observations. The larger these errors, the more the empirical distribution differs from the theoretical one. The parameters of the empirical distribution are random variables, therefore, sampling errors are also random variables, they can take different values ​​for different samples and therefore it is customary to calculate average error.

The average sampling error is the quantity expressing the standard deviation of the sample mean from the mathematical expectation. Under random selection this value depends primarily on the sample size and on the degree of variation of the characteristic: the larger the sample size n and the smaller the variation of the characteristic (and hence its variance), the smaller the average sampling error. The relationship between the variances of the general and sample populations is expressed by the formula:

σ² = s² · n/(n − 1),

i.e. for sufficiently large n we can assume that σ² ≈ s². The average sampling error shows possible deviations of the sample population parameter from the general population parameter. Table 2 gives expressions for calculating the average sampling error for different ways of organizing observation.

Table 2. Average error (m) of sample mean and proportion for different types of samples

Notation in Table 2: σ̄²_i is the average of the within-group sample variances for a continuous characteristic; w̄(1 − w̄) is the average of the within-group variances of the share; r is the number of selected series, R is the total number of series;

δ²_x = Σ(x̄_j − x̄)² / r,

where x̄_j is the mean of the j-th series and x̄ is the overall mean for the entire sample population for a continuous characteristic;

δ²_w = Σ(w_j − w̄)² / r,

where w_j is the share of the characteristic in the j-th series and w̄ is the total share of the characteristic across the entire sample population.

However, the magnitude of the average error can be judged only with a certain probability P (P ≤ 1). A.M. Lyapunov proved that, for a sufficiently large sample size, the distribution of sample means, and hence of their deviations from the general mean, approximately obeys the normal law, provided that the general population has a finite mean and limited variance.

Mathematically, this statement for the mean is expressed as:

P(|x̄ − μ| ≤ Δ) = Ф(t), (1)

and for the share, expression (1) takes the form:

P(|w − p| ≤ Δ_w) = Ф(t), (2)

where Δ = t · m is the marginal sampling error, a multiple of the average sampling error m, and the multiplicity coefficient t is Student's test ("confidence coefficient"), proposed by W.S. Gosset (pseudonym "Student"); its values for different sample sizes are given in a special table.

The values of the function Ф(t) for some values of t are:

Ф(1) = 0.683; Ф(2) = 0.954; Ф(3) = 0.997. (3)

Therefore, expression (3) can be read as follows: with probability P = 0.683 (68.3%) it can be asserted that the difference between the sample and general means will not exceed one average error m (t = 1); with probability P = 0.954 (95.4%), that it will not exceed two average errors (t = 2); and with probability P = 0.997 (99.7%), that it will not exceed three (t = 3). Thus, the probability that this difference exceeds three times the average error determines the error level, which amounts to no more than 0.3%.
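Under the normal law, Ф(t) can be computed from the error function, which reproduces the probabilities quoted above:

```python
import math

def F(t):
    """Probability that the deviation of the sample mean from the
    general mean does not exceed t average errors, under the
    normal law: F(t) = erf(t / sqrt(2))."""
    return math.erf(t / math.sqrt(2))

for t in (1, 2, 3):
    print(t, round(F(t), 3))
# → 1 0.683
# → 2 0.954
# → 3 0.997
```

This is the same three-sigma rule the text states: exceeding three average errors has probability 1 − 0.997 ≈ 0.3%.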

Table 3 gives formulas for calculating the marginal sampling error.

Table 3. Marginal error (Δ) of the sample for the mean and the share (p) for different types of sample observation

Generalization of sample results to the population

The ultimate goal of sample observation is to characterize the general population. With small sample sizes, the empirical estimates (x̄ and w) may deviate significantly from the true values (μ and p). Therefore it is necessary to establish the boundaries within which, for given sample values of the parameters, the true values lie.

The confidence interval of a parameter θ of the general population is a random range of values of this parameter which, with a probability close to 1 (the reliability), contains the true value of the parameter.

The marginal sampling error Δ allows one to determine the limiting values of the characteristics of the general population and their confidence intervals:

x̄ − Δ ≤ μ ≤ x̄ + Δ;  w − Δ_w ≤ p ≤ w + Δ_w.

The lower confidence limit is obtained by subtracting the marginal error from the sample mean (share), and the upper one by adding it.

The confidence interval for the mean uses the marginal sampling error and, for a given confidence level, is determined by the formula:

x̄ ± Δ = x̄ ± t · m.

This means that, with a given probability P, called the confidence level and uniquely determined by the value of t, it can be asserted that the true value of the mean lies in the range from x̄ − Δ to x̄ + Δ, and the true value of the share in the range from w − Δ_w to w + Δ_w.

When calculating the confidence interval for the three standard confidence levels P = 95%, P = 99% and P = 99.9%, the value of t is selected from the Student table in the Appendix according to the number of degrees of freedom. If the sample size is sufficiently large, the values of t corresponding to these probabilities are 1.96, 2.58 and 3.29. Thus, the marginal sampling error allows us to determine the limiting values of the characteristics of the population and their confidence intervals:

μ = x̄ ± Δ_x;  p = w ± Δ_w.

Extending the results of sample observation to the general population in socio-economic research has its own peculiarities, since it requires complete representation of all its types and groups. The basis for the possibility of such extension is the calculation of the relative error:

Δ_x% = (Δ_x / x̄) · 100%;  Δ_w% = (Δ_w / w) · 100%,

where Δ% is the relative marginal sampling error.

There are two main methods for extending a sample observation to a population: direct recalculation and coefficient method.

The essence of direct recalculation is to multiply the sample mean x̄ by the size of the general population N.

Example. Suppose the average number of toddlers per young family in the city, estimated by the sampling method, is 1.2. If there are 1000 young families in the city, then the number of places needed in municipal nurseries is obtained by multiplying this average by the size of the general population N = 1000, i.e. 1200 places.

The coefficient method is advisable when selective observation is carried out in order to verify and correct the data of continuous observation: the count obtained by continuous observation is multiplied by a correction coefficient computed from a control sample (the ratio of the number of units established by the sample check to the number recorded in the continuous observation), where all the quantities involved are population counts.

Required sample size

Table 4. Required sample size (n) for different types of sample observation organization

When planning a sample observation with a predetermined value of the permissible sampling error, it is necessary to estimate the required sample size correctly. This size can be determined from the permissible error, on the basis of a given probability guaranteeing the permissible error level (and taking into account the way the observation is organized). Formulas for the required sample size n follow directly from the formulas for the marginal sampling error. Thus, from the expression for the marginal error:

Δ = t · σ / √n,

the required sample size n is determined directly:

n = t² · σ² / Δ².

This formula shows that as the permissible marginal sampling error Δ decreases, the required sample size increases significantly, being proportional to the variance and to the square of Student's t.
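A sketch of the required-sample-size formula n = t²σ²/Δ² for repeated selection; the numeric inputs are illustrative:

```python
import math

def required_n(t, sigma, delta):
    """Required sample size for repeated selection:
    n = t^2 * sigma^2 / delta^2, rounded up to a whole unit."""
    return math.ceil((t ** 2) * (sigma ** 2) / (delta ** 2))

# e.g. t = 2 (P ≈ 0.954), sigma = 7.3 days, permissible error 2 days
print(required_n(2, 7.3, 2))   # → 54
```

Halving the permissible error Δ quadruples the required n, which is the dependence the paragraph above describes.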

For a specific way of organizing observation, the required sample size is calculated using the formulas given in Table 4.

Practical calculation examples

Example 1. Calculation of the mean value and confidence interval for a continuous quantitative characteristic.

To assess the speed of settlements with creditors, a random sample of 10 payment documents was taken at a bank. Their values turned out to be (in days): 10; 3; 15; 15; 22; 7; 8; 1; 19; 20.

It is necessary, with probability P = 0.954, to determine the marginal error Δ of the sample mean and the confidence limits of the average settlement time.

Solution. The average value is calculated using the formula from Table 9.1 for the sample population: x̄ = Σx / n = 120 / 10 = 12.0 days.

The variance is calculated using the formula from Table 9.1: s² = Σ(x − x̄)² / (n − 1) = 478 / 9 ≈ 53.1.

The mean square deviation is s ≈ 7.3 days.

The average error of the mean is calculated using the formula m = s / √n = 7.3 / √10 ≈ 2.3 days,

i.e. the average is x̄ ± m = 12.0 ± 2.3 days.

The reliability of the mean was t = x̄ / m = 12.0 / 2.3 ≈ 5.2.

We calculate the maximum error using the formula from Table 9.3 for repeated sampling (since the population size is unknown), taking t = 2 for the confidence level P = 0.954: Δ = t · m = 2 · 2.3 ≈ 4.6 days.

Thus, the average value is x̄ ± Δ = x̄ ± 2m = 12.0 ± 4.6 days, i.e. its true value lies in the range from 7.4 to 16.6 days.

Using the Student's t-table from the appendix, we conclude that for n − 1 = 10 − 1 = 9 degrees of freedom the obtained value is reliable at a significance level α ≤ 0.001, i.e. the resulting mean value differs significantly from 0.
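Example 1 can be reproduced numerically. A minimal sketch (using the n − 1 variance, which matches the m ≈ 2.3 obtained above):

```python
import math

days = [10, 3, 15, 15, 22, 7, 8, 1, 19, 20]  # settlement times from Example 1
n = len(days)

x_bar = sum(days) / n                               # sample mean
s2 = sum((x - x_bar) ** 2 for x in days) / (n - 1)  # variance (n - 1 denominator)
s = math.sqrt(s2)                                   # mean square deviation
m = s / math.sqrt(n)                                # average error of the mean
delta = 2 * m                                       # marginal error, t = 2 for P = 0.954

print(round(x_bar, 1), round(m, 1), round(delta, 1))     # 12.0 2.3 4.6
print(round(x_bar - delta, 1), round(x_bar + delta, 1))  # 7.4 16.6
```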

Example 2. Estimation of probability (general share) p.

In a mechanical (systematic) sample survey of the social status of 1000 families, it was found that the proportion of low-income families was w = 0.3 (30%); the sample covered 2% of the population, i.e. n/N = 0.02. It is required, with confidence level P = 0.997, to determine the share p of low-income families in the whole region.

Solution. From the tabulated values of the function Ф(t), for the given confidence level P = 0.997 we find t = 3 (see formula 3). The marginal error of the share w is determined by the formula from Table 9.3 for non-repetitive sampling (mechanical sampling is always non-repetitive):

Δw = t · √( w(1 − w) / n · (1 − n/N) ) = 3 · √( 0.3 · 0.7 / 1000 · (1 − 0.02) ) ≈ 0.043.

The maximum relative sampling error in % will be:

Δ% = Δw / w · 100% ≈ 14.3%.

The probability (general share) of low-income families in the region is p = w ± Δw, and the confidence limits of p are calculated from the double inequality:

w − Δw ≤ p ≤ w + Δw, i.e. the true value of p lies within:

0.3 − 0.043 < p < 0.3 + 0.043, namely from 25.7% to 34.3%.

Thus, with probability 0.997 it can be stated that the share of low-income families among all families in the region ranges from 25.7% to 34.3%.
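A sketch of the share-error computation under non-repetitive selection (t = 3 is assumed for P = 0.997):

```python
import math

w = 0.3       # sample share of low-income families
n = 1000      # sample size
ratio = 0.02  # n / N: the sample covers 2% of the population
t = 3         # confidence coefficient for P = 0.997

# Standard error of the share under non-repetitive selection,
# and the marginal error as t times that standard error.
mu_w = math.sqrt(w * (1 - w) / n * (1 - ratio))
delta_w = t * mu_w
print(round(mu_w, 4), round(delta_w, 3))  # 0.0143 0.043
```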

Example 3. Calculation of the mean value and confidence interval for a discrete characteristic specified by an interval series.

Table 5 shows the distribution of orders placed with the enterprise by completion time.

Table 5. Distribution of orders by completion time

| Completion time, months | Number of orders, f_i | Share of orders, p_i | Interval midpoint, x_i |
|---|---|---|---|
| up to 6 | 20 | 0.1 | 3 |
| 6–12 | 80 | 0.4 | 9 |
| 12–36 | 60 | 0.3 | 24 |
| 36–60 | 20 | 0.1 | 48 |
| over 60 | 20 | 0.1 | 72 |
| Total | 200 | 1.0 | |

Solution. The average order-completion time is calculated using the weighted arithmetic mean formula x̄ = Σ x_i f_i / Σ f_i.

The average period will be:

x̄ = (3·20 + 9·80 + 24·60 + 48·20 + 72·20) / 200 = 23.1 months.

We get the same answer if we use the shares p_i from the penultimate column of Table 5, applying the formula x̄ = Σ x_i p_i = 3·0.1 + 9·0.4 + 24·0.3 + 48·0.1 + 72·0.1 = 23.1 months.

Note that the midpoint of the last, open-ended interval is found by artificially closing it with the width of the previous interval, 60 − 36 = 24 months, which gives a midpoint of (60 + 84) / 2 = 72 months.
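The weighted mean above can be checked with a short sketch (midpoints and frequencies as used in the calculation above):

```python
# Interval midpoints (months) and order counts from the example.
midpoints = [3, 9, 24, 48, 72]
counts = [20, 80, 60, 20, 20]

n = sum(counts)                                            # 200 orders in total
x_bar = sum(x * f for x, f in zip(midpoints, counts)) / n  # weighted arithmetic mean
print(round(x_bar, 1))  # 23.1
```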

The variance is calculated using the formula

σ² = Σ(x_i − x̄)² / (n − 1),

where x_i is the midpoint of the interval series.

Hence σ² = (20² + 14² + 1² + 25² + 49²) / 4 ≈ 906, and the mean square deviation is σ ≈ 30.1 months.

The average error is calculated as m = σ / √n = 30.1 / √5 ≈ 13.4 months (the five interval midpoints are treated as the observations), i.e. the average value is x̄ ± m = 23.1 ± 13.4 months.

We calculate the maximum error using the formula from Table 9.3 for repeated selection (since the population size is unknown), taking t = 2 for the 0.954 confidence level: Δ = 2m ≈ 26.9 months.

So the average is x̄ ± Δ = 23.1 ± 26.9 months,

i.e. its true value lies in the range from 0 to 50 months (negative values are not meaningful for a time period, so the lower limit is taken as 0).

Example 4. To determine the speed of settlements with creditors of the N = 500 enterprises of a corporation, a commercial bank needs to conduct a sample study by the method of random non-repetitive selection. Determine the required sample size n so that, with probability P = 0.954, the error of the sample mean does not exceed 3 days, given that trial estimates showed a standard deviation σ of 10 days.

Solution. To determine the required sample size n, we use the formula for non-repetitive selection from Table 9.4:

n = t² · σ² · N / (Δ² · N + t² · σ²).

Here t is determined from the confidence level P = 0.954 and equals 2; the mean square deviation σ = 10; the population size N = 500; and the maximum error of the mean Δx = 3. Substituting these values into the formula, we get:

n = (2² · 10² · 500) / (3² · 500 + 2² · 10²) = 200000 / 4900 ≈ 40.8,

i.e. it is enough to draw a sample of 41 enterprises in order to estimate the required parameter, the speed of settlements with creditors.
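The calculation in Example 4 can be sketched with the non-repetitive-selection formula:

```python
import math

def sample_size_nonrepeated(t: float, sigma: float, delta: float, N: int) -> float:
    """Required sample size for non-repetitive selection:
    n = t^2 * sigma^2 * N / (delta^2 * N + t^2 * sigma^2)."""
    return t ** 2 * sigma ** 2 * N / (delta ** 2 * N + t ** 2 * sigma ** 2)

n = sample_size_nonrepeated(t=2, sigma=10, delta=3, N=500)
print(round(n, 1), math.ceil(n))  # 40.8 41
```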