Samples in which the observation covers a small number of units (n < 30) are usually called small samples. They are typically used when it is impossible or impractical to use a large sample, for example in studies of product quality that involve the destruction of the product (strength testing, service-life testing, and so on).

The marginal error of a small sample is determined by the formula

$$\Delta_{s.s.} = t\,\mu_{s.s.}.$$

Average error of a small sample:

$$\mu_{s.s.} = \sqrt{\frac{\sigma^2_{s.s.}}{n}},$$

where $\sigma^2_{s.s.}$ is the variance of the small sample:

$$\sigma^2_{s.s.} = \frac{\sum_i (x_i - \tilde{x})^2}{n-1},$$

where $\tilde{x}$ is the mean value of the feature in the sample;

$n - 1$ is the number of degrees of freedom;

$t$ is the confidence coefficient of a small sample, which depends not only on a given confidence probability, but also on the number of sample units $n$.

The probability that the general average lies within certain limits is determined by the formula

$$P\left(|\tilde{x} - \bar{x}| \le \Delta_{s.s.}\right) = S(t),$$

where $S(t)$ is the value of Student's function.

To calculate the confidence coefficient, the value of $t$ is determined by the formula

$$t = \frac{\Delta_{s.s.}}{\mu_{s.s.}}.$$

Then, according to the Student's distribution table (see Appendix 4), the value of $S(t)$ is determined depending on the value of $t$ and the number of degrees of freedom.

The function $S(t)$ is also used to determine the probability that the actual normalized deviation will not exceed the tabulated value.


Topic 7. Statistical study of relationships: the concept of a statistical relationship. Types and forms of statistical relationships. Tasks of the statistical study of relationships between phenomena. Features of relationships among socio-economic phenomena. Basic methods of the statistical study of relationships.

Correlation is a relationship that does not appear in each individual case, but manifests itself in the mass of cases, in average values, in the form of a trend.

A statistical study of a relationship aims to obtain a model of the dependence for practical use. This problem is solved in the following sequence.

1. Logical analysis of the essence of the phenomenon under study and of its cause-and-effect relationships. As a result, the resultant (performance) indicator $y$ is identified, together with the factors of its change, characterized by the indicators $x_1, x_2, x_3, \dots, x_n$. A relationship between two features ($y$ and $x$) is called pair correlation. The influence of several factors on the resultant feature is called multiple correlation.

By direction, a relationship can be direct or inverse. With a direct relationship, as the feature $x$ increases, the feature $y$ also increases; with an inverse relationship, as $x$ increases, $y$ decreases.

2. Collection of primary information and checking it for homogeneity and normal distribution. To assess the homogeneity of the population, the coefficient of variation of the factor characteristics is used:

$$V = \frac{\sigma}{\bar{x}} \cdot 100\%.$$

The population is considered homogeneous if the coefficient of variation does not exceed 33%. The normality of the distribution of the studied factor features $x_1, x_2, x_3, \dots, x_n$ is checked using the three-sigma rule. The results of the check for normal distribution should be presented in tabular form.
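As an illustration, here is a minimal sketch of these two checks (coefficient of variation and the three-sigma rule) in Python; the 33% threshold comes from the text, while the data values are assumptions chosen for the example.

```python
import numpy as np

def coefficient_of_variation(x):
    """V = sigma / mean * 100%."""
    x = np.asarray(x, dtype=float)
    return x.std(ddof=0) / x.mean() * 100.0

def three_sigma_share(x):
    """Share of observations falling within mean +/- 3*sigma."""
    x = np.asarray(x, dtype=float)
    m, s = x.mean(), x.std(ddof=0)
    return np.mean((x >= m - 3 * s) & (x <= m + 3 * s))

# Hypothetical values of a factor feature
factor = [12.1, 13.4, 11.8, 12.9, 14.2, 13.1, 12.5, 13.8]

v = coefficient_of_variation(factor)
print(f"Coefficient of variation: {v:.1f}%  (homogeneous: {v <= 33})")
print(f"Share within three sigma: {three_sigma_share(factor):.2%}")
```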

When controlling the quality of goods in economic research, the experiment can be carried out on the basis of a small sample.

A small sample is understood to be a non-continuous (sampling) statistical survey in which the sample population is formed from a relatively small number of units of the general population. The volume of a small sample usually does not exceed 30 units and may be as small as 4-5 units.

In trade, the small-sample method is used when a large sample is either impossible or impractical (for example, if the study involves damage to or destruction of the items being examined).

The error of a small sample is determined by formulas that differ from those used in sample observation with a relatively large sample size (n > 100). The mean error of a small sample $\mu_{s.s.}$ is calculated by the formula

$$\mu_{s.s.} = \sqrt{\frac{\sigma^2_{s.s.}}{n}},$$

where $\sigma^2_{s.s.}$ is the variance of the small sample.

According to the formula relating the general and the sample variance, we have

$$\sigma_0^2 = \sigma^2 \cdot \frac{n}{n-1}.$$

But since with a small sample the ratio n/(n - 1) is significant, the variance of a small sample is calculated taking into account the so-called number of degrees of freedom. The number of degrees of freedom is understood as the number of variants that can take arbitrary values without changing the value of the average. When determining the variance $\sigma^2_{s.s.}$, the number of degrees of freedom is n - 1:

$$\sigma^2_{s.s.} = \frac{\sum_i (x_i - \tilde{x})^2}{n-1}.$$

The marginal error of a small sample $\Delta_{s.s.}$ is determined by the formula

$$\Delta_{s.s.} = t\,\mu_{s.s.}.$$

In this case, the value of the confidence coefficient t depends not only on the given confidence probability, but also on the number of sample units n. For individual values of t and n, the confidence probability of a small sample is determined from special Student tables, which give the distribution of the normalized deviation

$$t = \frac{\tilde{x} - \bar{x}}{\mu_{s.s.}},$$

where $\tilde{x}$ is the small-sample mean and $\bar{x}$ is the general mean.

Student's tables are given in textbooks on mathematical statistics. The values in these tables characterize the probability that the marginal error of a small sample will not exceed t times the average error:

$$S_t = P\left(|\tilde{x} - \bar{x}| < t\,\mu_{s.s.}\right).$$

As the sample size increases, the Student's distribution approaches the normal distribution, and at n = 20 it already differs little from it.

When conducting small sample surveys, it is important to keep in mind that the smaller the sample size, the greater the difference between the Student's distribution and normal distribution. With a minimum sample size (n=4), this difference is very significant, which indicates a decrease in the accuracy of the results of a small sample.

By means of a small sample, a number of practical problems are solved in trade, first of all the establishment of the limits within which the general average of the trait under study lies.

Since in practice a confidence probability of 0.95 or 0.99 is taken when conducting a small sample, the corresponding values of the Student's distribution are used to determine the marginal sampling error $\Delta_{s.s.}$.
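A minimal sketch of such a calculation in Python using scipy.stats.t is shown below; the data values and the 0.95 confidence probability are assumptions chosen for the example.

```python
import numpy as np
from scipy import stats

# Hypothetical small sample (n < 30)
x = np.array([50.2, 49.8, 51.1, 50.5, 49.4, 50.9, 50.1, 49.7])
n = x.size

mean = x.mean()
s2 = x.var(ddof=1)                 # small-sample variance with n - 1 degrees of freedom
mu = np.sqrt(s2 / n)               # mean (standard) error of the small sample

t_crit = stats.t.ppf(0.975, df=n - 1)   # confidence coefficient for P = 0.95
delta = t_crit * mu                      # marginal error of the small sample

print(f"mean = {mean:.2f}, marginal error = {delta:.2f}")
print(f"95% limits for the general average: "
      f"{mean - delta:.2f} ... {mean + delta:.2f}")
```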

Small-sample statistics

It is generally accepted that small-sample statistics, or, as it is often called, "small n" statistics, began in the first decade of the 20th century with the publication of the work of W. Gosset, in which he presented the t-distribution that later gained worldwide fame. At the time, Gosset was working as a statistician for the Guinness breweries. One of his duties was to analyze successive batches of casks of freshly brewed stout. For reasons he never really explained, Gosset experimented with the idea of greatly reducing the number of samples taken from the very large number of casks in the brewery's warehouses for random quality control of the porter. This led him to postulate the t-distribution. Since the Guinness breweries' charter forbade its employees from publishing research results, Gosset published the results of his experiment comparing sampling quality control based on the small-sample t-distribution and on the traditional z-distribution (normal distribution) anonymously, under the pseudonym "Student" (hence the name Student's t-distribution).

The t-distribution. The theory of the t-distribution, like that of the z-distribution, is used to test the null hypothesis that two samples are simply random samples from the same population and that, therefore, the calculated statistics (e.g., the mean and standard deviation) are unbiased estimates of the population parameters. However, unlike the theory of the normal distribution, the theory of the t-distribution for small samples does not require a priori knowledge or exact estimates of the mathematical expectation and variance of the general population. Moreover, although testing the difference between the means of two large samples for statistical significance requires the fundamental assumption that the characteristic is normally distributed in the population, the theory of the t-distribution does not require assumptions about the parameters.

It is well known that normally distributed characteristics are described by a single curve - the Gaussian curve - which satisfies the following equation:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}.$$

With the t-distribution, a whole family of curves is represented by the following formula:

$$f(t) = \frac{\Gamma\!\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\;\Gamma\!\left(\frac{n}{2}\right)} \left(1 + \frac{t^2}{n}\right)^{-\frac{n+1}{2}}.$$

This is why the equation for t includes the gamma function: as n changes, a different curve satisfies the equation.

Degrees of freedom

In the equation for t, n denotes the number of degrees of freedom (df) associated with the estimate of the population variance (S²), which is the second moment of any moment-generating function, such as the one for the t-distribution. In statistics, the number of degrees of freedom indicates how many observations remain free to vary after part of them has been used up in a particular type of analysis. In a t-distribution, one of the deviations from the sample mean is always fixed, since the sum of all such deviations must equal zero. This affects the sum of squares when the sample variance is calculated as an unbiased estimate of the parameter S² and leads to df being equal to the number of measurements minus one for each sample. Hence, in the formulas and procedures for calculating the t-statistic for testing the null hypothesis, df = n - 2.

The F-distribution. The null hypothesis tested by the t-test is that the two samples were randomly drawn from the same population, or were randomly drawn from two different populations with the same variance. But what if more groups need to be analyzed? The answer to this question was sought for twenty years after Gosset discovered the t-distribution. Two of the most prominent statisticians of the 20th century were directly involved in obtaining it. One was the great English statistician R. A. Fisher, who proposed the first theoretical formulations, the development of which led to the F-distribution; his work on small-sample theory, developing Gosset's ideas, was published in the mid-1920s (Fisher, 1925). The other was George Snedecor, one of the first American statisticians, who developed a way to compare two independent samples of any size by calculating the ratio of two variance estimates. He called this ratio the F-ratio, in honor of Fisher. Snedecor's results led to the F-distribution being defined as the distribution of the ratio of two χ² statistics, each with its own degrees of freedom:

$$F = \frac{\chi^2_1 / df_1}{\chi^2_2 / df_2}.$$

From this grew Fisher's classic work on the analysis of variance, a statistical technique explicitly oriented toward the analysis of small samples.
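As a small sketch of such an analysis, here is a one-way analysis of variance (F-ratio) with scipy; the three small groups are invented for the illustration.

```python
from scipy import stats

# Three hypothetical small groups
group_a = [23, 25, 21, 24, 26]
group_b = [28, 27, 30, 29, 26]
group_c = [22, 20, 24, 23, 21]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A small p-value suggests that at least one group mean differs from the others.
```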

The sampling distribution of F (where $n_1$ and $n_2$ are the df) is represented by the following equation:

$$f(F) = \frac{\Gamma\!\left(\frac{n_1+n_2}{2}\right)}{\Gamma\!\left(\frac{n_1}{2}\right)\Gamma\!\left(\frac{n_2}{2}\right)} \left(\frac{n_1}{n_2}\right)^{\frac{n_1}{2}} F^{\frac{n_1}{2}-1} \left(1 + \frac{n_1}{n_2}F\right)^{-\frac{n_1+n_2}{2}}, \qquad F > 0.$$

As in the case of the t-distribution, the gamma function indicates that there is a family of distributions satisfying the equation for F. In this case, however, the analysis involves two values of df: the number of degrees of freedom for the numerator and for the denominator of the F-ratio.

Tables for evaluating the t- and F-statistics. When testing the null hypothesis using statistics based on large-sample theory, usually only one reference table is required - the table of normal deviates (z), which gives the area under the normal curve between any two values of z on the x-axis. However, the tables for the t- and F-distributions are of necessity presented as sets of tables, since they are based on the multitude of distributions obtained by varying the number of degrees of freedom. Although the t- and F-distributions are probability density distributions, like the normal distribution for large samples, they differ from it in the four moments used to describe them. The t-distribution, for example, is symmetric (note the t² in its equation) for all df, but becomes more and more peaked as the sample size decreases. Peaked curves (with greater-than-normal kurtosis) tend to be less asymptotic (i.e., they approach the x-axis more slowly in the tails) than curves with normal kurtosis, such as the Gaussian curve. This difference leads to noticeable discrepancies between the points on the x-axis corresponding to the values of t and z. With df = 5 and a two-tailed level α = 0.05, t = 2.57, while the corresponding z = 1.96. Therefore, t = 2.57 indicates statistical significance at the 5% level. However, in the case of the normal curve, z = 2.57 (more precisely, 2.58) would already indicate the 1% level of statistical significance. Similar comparisons can be made with the F-distribution, since t² equals F when two groups are compared.

What constitutes a "small" sample?

At one time, the question was raised of how small a sample must be in order to be considered small. There is simply no definitive answer to this question. However, it is customary to take df = 30 as the conventional boundary between a small and a large sample. The basis for this somewhat arbitrary decision is the result of comparing the t-distribution with the normal distribution. As noted above, the discrepancy between the values of t and z tends to increase as df decreases and to decrease as df increases. In fact, t begins to approach z closely long before the limiting case t = z at df = ∞. A simple visual examination of the tabulated values of t shows that this convergence becomes quite good from df = 30 onward. The comparative values of t (at df = 30) and z are, respectively: 2.04 and 1.96 for p = 0.05; 2.75 and 2.58 for p = 0.01; 3.65 and 3.29 for p = 0.001.
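These tabulated values are easy to reproduce; a short check with scipy (two-tailed critical values at df = 30):

```python
from scipy import stats

for p in (0.05, 0.01, 0.001):
    t_crit = stats.t.ppf(1 - p / 2, df=30)   # two-tailed critical value of t
    z_crit = stats.norm.ppf(1 - p / 2)       # corresponding normal deviate z
    print(f"p = {p:<6} t(30) = {t_crit:.2f}  z = {z_crit:.2f}")
```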

Other statistics for "small" samples

Although statistical tests such as t and F were specifically designed for small samples, they are equally applicable to large samples. There are, however, many other statistical methods intended for the analysis of small samples and often used for this purpose. These are the so-called nonparametric, or distribution-free, methods. Basically, the statistics appearing in these methods are intended for measurements obtained with scales that do not satisfy the definition of ratio or interval scales; most often these are ordinal (rank) or nominal measurements. Nonparametric statistics do not require assumptions about the distribution parameters, in particular about estimates of variance, because ordinal and nominal scales exclude the very concept of variance. For this reason, nonparametric methods are also used with measurements obtained on interval and ratio scales when small samples are analyzed and there is a possibility that the basic assumptions required for parametric methods are violated. Among the statistics that can reasonably be applied to small samples are: Fisher's exact probability test, Friedman's two-way nonparametric (rank) analysis of variance, Kendall's rank correlation coefficient τ, Kendall's coefficient of concordance (W), the Kruskal-Wallis H-test for nonparametric (rank) one-way analysis of variance, the Mann-Whitney U-test, the median test, the sign test, Spearman's rank correlation coefficient rs, and Wilcoxon's T-test.
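By way of illustration, here is a minimal sketch of one of these tests (the Mann-Whitney U-test) applied to two small samples; the data are hypothetical.

```python
from scipy import stats

# Two small hypothetical samples measured on an ordinal-like scale
sample_1 = [3, 5, 4, 6, 2, 5, 4]
sample_2 = [7, 6, 8, 9, 7, 6]

u_stat, p_value = stats.mannwhitneyu(sample_1, sample_2, alternative="two-sided")
print(f"U = {u_stat}, p = {p_value:.4f}")
# No assumption of normality or equal variances is needed here.
```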

A person can recognize his abilities only by trying to apply them. (Seneca)

Bootstrap, small samples, application in data analysis

Main idea

The bootstrap method was proposed by B. Efron in 1979 as a development of the jackknife method.

Let's describe the main idea of ​​the bootstrap.

The purpose of data analysis is to obtain the most accurate sample estimates and to generalize the results to the entire population.

The technical term for numerical data obtained from a sample is sample statistics.

The main descriptive statistics are the sample mean, median, standard deviation, etc.

The resulting statistics, such as the sample mean, median, or correlation, will vary from sample to sample.

The researcher needs to know the size of these deviations depending on the population; the margin of error is calculated on this basis.

The overall picture of all possible values of a sample statistic, in the form of a probability distribution, is called the sampling distribution.

The key issue is the sample size. What if the sample size is small? One reasonable approach is to repeatedly draw data at random from the existing sample.

The idea of the bootstrap is to use the results of calculations on the sample as a "dummy population" in order to determine the sampling distribution of the statistic. In effect, a large number of "phantom" samples, called bootstrap samples, are analyzed.

Usually several thousand samples are randomly generated; from this set we can find the bootstrap distribution of the statistic of interest.

So, suppose we have a sample. At the first step we randomly select one of its elements, return this element to the sample, randomly select an element again, and so on.

Let's repeat the described random selection procedure n times.

In the bootstrap, random selection is carried out with replacement: selected elements of the original sample are returned to the sample and can then be selected again.

Formally, at each step, we select an element of the original sample with a probability of 1/n.

In total, we have n elements of the initial sample; the probability of obtaining a sample with counts $(N_1, \dots, N_n)$, where each $N_i$ varies from 0 to n, is described by the multinomial distribution.

Several thousand such samples are generated, which is quite achievable for modern computers.

For each sample, an estimate of the quantity of interest is constructed; the estimates are then averaged.

Since there are many samples, we can construct the empirical distribution function of the estimates, and then calculate quantiles and a confidence interval.
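A minimal sketch of this procedure in Python (bootstrap of the sample median with a percentile confidence interval); the data and the number of replications B = 2000 are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical original sample
x = np.array([8.5, 12.1, 9.7, 15.3, 11.0, 7.9, 14.2, 10.5, 13.8, 9.1])
n = x.size
B = 2000  # number of bootstrap samples

# Resample with replacement and recompute the statistic on each bootstrap sample
boot_medians = np.array([
    np.median(rng.choice(x, size=n, replace=True)) for _ in range(B)
])

estimate = boot_medians.mean()                              # averaged bootstrap estimate
ci_low, ci_high = np.percentile(boot_medians, [2.5, 97.5])  # percentile 95% interval

print(f"bootstrap median estimate: {estimate:.2f}")
print(f"95% percentile interval: ({ci_low:.2f}, {ci_high:.2f})")
```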

It is clear that the bootstrap method is a modification of the Monte Carlo method.

If the samples are generated without replacement of elements, the well-known jackknife method is obtained.

Question: why do this and when is it reasonable to use the method in real data analysis?

In the bootstrap, we do not receive new information, but we use the available data wisely, based on the task at hand.

For example, the bootstrap can be used with small samples, for estimating the median or correlations, for constructing confidence intervals, and in other situations.

Efron's original paper considered pairwise correlation estimates for a sample size of n = 15.

B = 1000 bootstrap samples (bootstrap replications) are generated.

Based on the obtained coefficients $\rho_1, \dots, \rho_B$, an overall estimate of the correlation coefficient and an estimate of its standard deviation are constructed.

The standard error of the sample correlation coefficient calculated using the normal approximation is

$$\widehat{se}_{norm} = \frac{1 - \hat{\rho}^2}{\sqrt{n-3}} \approx 0.115,$$

where the correlation coefficient is $\hat{\rho} = 0.776$ and the initial sample size is n = 15.

The bootstrap estimate of the standard error is 0.127 (see Efron and Gail Gong, 1982).
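A sketch of this kind of calculation in Python; the 15 data pairs below are synthetic stand-ins (the original data are not reproduced here), so the numbers will differ from Efron's.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic bivariate sample of size n = 15 (stand-in for the original data)
x = rng.normal(600, 60, size=15)
y = 0.8 * x + rng.normal(0, 40, size=15)
n = x.size
B = 1000

rho_hat = np.corrcoef(x, y)[0, 1]
se_normal = (1 - rho_hat**2) / np.sqrt(n - 3)   # normal-approximation standard error

boot_rhos = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)             # resample pairs with replacement
    boot_rhos[b] = np.corrcoef(x[idx], y[idx])[0, 1]

print(f"rho = {rho_hat:.3f}")
print(f"normal-approximation SE = {se_normal:.3f}")
print(f"bootstrap SE            = {boot_rhos.std(ddof=1):.3f}")
```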

Theoretical background

Let $\theta$ be the target parameter of the study, for example, the average income in the society under study.

For an arbitrary sample of size $n$ we obtain a data set $X_1, \dots, X_n$. Let the corresponding sample statistic be $\hat{\theta} = \hat{\theta}(X_1, \dots, X_n)$.

For most sample statistics, with a large sample size ($n > 30$) the sampling distribution is a normal curve with center $\theta$ and standard deviation $\sigma/\sqrt{n}$, where the positive parameter $\sigma$ depends on the population and on the type of statistic.

This classical result is known as the central limit theorem.

There are often significant technical difficulties in estimating the required standard deviation from the data.

For example, when the statistic is the median or the sample correlation.

The bootstrap method circumvents these difficulties.

The idea is simple: denote by $\hat{\theta}^*$ an arbitrary value of the same statistic calculated from a bootstrap sample drawn from the original sample $X_1, \dots, X_n$.

What can be said about the sampling distribution of $\hat{\theta}^*$ if the "original" sample is fixed?

In the limit, this sampling distribution is also bell-shaped, with parameters $\hat{\theta}$ and $\sigma/\sqrt{n}$.

Thus, the bootstrap distribution of $\hat{\theta}^* - \hat{\theta}$ approximates well the sampling distribution of $\hat{\theta} - \theta$.

Note that when we move from one bootstrap sample to another, only $\hat{\theta}^*$ changes in this expression, since $\hat{\theta}$ is calculated from the original sample $X_1, \dots, X_n$.

This is essentially a bootstrap version of the central limit theorem.

It was also found that if the limiting sampling distribution of a statistical function does not involve unknown population quantities, the bootstrap distribution provides a better approximation to the sampling distribution than the central limit theorem.

In particular, when the statistical function has the form $t = (\hat{\theta} - \theta)/\widehat{se}$, where $\widehat{se}$ denotes the true or the sample estimate of the standard error of $\hat{\theta}$, the limiting sampling distribution is usually standard normal.

This effect is called second-order correction using bootstrapping.

Let $\theta = \mu$, i.e., the population mean, and $\hat{\theta} = \bar{x}$, the sample mean; let $\sigma$ be the population standard deviation, $s$ the sample standard deviation calculated from the original data, and $s^*$ the one calculated from a bootstrap sample.

Then the sampling distribution of the value $t = \sqrt{n}\,(\bar{x} - \mu)/\sigma$ will be approximated by the bootstrap distribution of $t^* = \sqrt{n}\,(\bar{x}^* - \bar{x})/s$, where $\bar{x}^*$ is the mean of the bootstrap sample.

Similarly, the sampling distribution of $\sqrt{n}\,(\bar{x} - \mu)/s$ will be approximated by the bootstrap distribution of $\sqrt{n}\,(\bar{x}^* - \bar{x})/s^*$.

The first results on second order correction were published by Babu and Singh in 1981-83.

Bootstrap applications

Approximation of the standard error of a sample estimate

Assume that a parameter $\theta$ is defined for the population.

Let $\hat{\theta}$ be an estimate based on a random sample of size $n$, i.e., $\hat{\theta}$ is a function of $X_1, \dots, X_n$. Since the sample varies over the set of all possible samples, the following approach is used to estimate the standard error of $\hat{\theta}$:

Calculate $\hat{\theta}^*_1, \dots, \hat{\theta}^*_B$ using the same formula as for $\hat{\theta}$, but this time based on $B$ different bootstrap samples, each of size $n$. Roughly speaking, $B$ can be taken quite large, provided $n$ is not very large; in that case it can be reduced to about $n \ln n$. The standard error is then determined, in fact, from the essence of the bootstrap method: the population (sample) is replaced by the empirical population (the sample), and the standard deviation of the bootstrap replications $\hat{\theta}^*_1, \dots, \hat{\theta}^*_B$ serves as the estimate.

Bayesian correction using the bootstrap method

The mean of the sampling distribution of $\hat{\theta}$ often depends on $\theta$, typically as $\theta$ plus a bias term for large $n$; this suggests the following correction:

$$\hat{\theta}_{adj} = 2\hat{\theta} - \bar{\theta}^*,$$

where $\bar{\theta}^*$ is the mean of the bootstrap copies of $\hat{\theta}$. The adjusted value $\hat{\theta}_{adj}$ is then used in place of $\hat{\theta}$.

It is worth noting that the earlier resampling method, the jackknife, is more popular for this purpose.

Confidence intervals

Confidence intervals (CI) for a given parameter are sample-based ranges.

This range has the property that the parameter value belongs to it with a very high (preset) probability, called the confidence level. Of course, this probability must apply over all possible samples, since each sample contributes to the determination of the confidence interval. The two most commonly used confidence levels are 95% and 99%. Here we will limit ourselves to 95%.

Traditionally, CIs depend on the sampling distribution of the quantity $\hat{\theta} - \theta$, more precisely on its limit. There are two main kinds of confidence intervals that can be built with the bootstrap.

Percentile method

This method has already been mentioned in the introduction; it is very popular due to its simplicity and naturalness. Suppose we have 1000 bootstrap copies of $\hat{\theta}$; denote them by $\hat{\theta}^*_1, \dots, \hat{\theta}^*_{1000}$. Then the values between the 2.5th and the 97.5th percentiles of this set form the confidence interval. Returning to the theoretical justification of the method, it is worth noting that it requires symmetry of the sampling distribution around $\hat{\theta}$. The reason is that the sampling distribution of $\hat{\theta} - \theta$ is approximated in this method directly by the bootstrap distribution of $\hat{\theta}^* - \hat{\theta}$, whereas it should be approximated by the quantity of the opposite sign.

Centered bootstrap percentile method

Assume that the sampling distribution of $\hat{\theta} - \theta$ is approximated by the bootstrap distribution of $\hat{\theta}^* - \hat{\theta}$, as originally intended in bootstrapping. Let us denote the $100\alpha$-th percentile (over the bootstrap repetitions) by $q_\alpha$. Then the statement that the value $\hat{\theta} - \theta$ lies in the range from $q_{2.5} - \hat{\theta}$ to $q_{97.5} - \hat{\theta}$ will be true with a probability of 95%. The same expression is easily converted into a similar one for $\theta$: the range from $2\hat{\theta} - q_{97.5}$ to $2\hat{\theta} - q_{2.5}$. This interval is called the centered bootstrap percentile confidence interval (at the 95% confidence level).
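A sketch contrasting the two intervals on the same bootstrap replications (numpy only; the data and B are invented for the illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

x = np.array([8.5, 12.1, 9.7, 15.3, 11.0, 7.9, 14.2, 10.5, 13.8, 9.1])
n, B = x.size, 2000

theta_hat = x.mean()
boot = np.array([rng.choice(x, size=n, replace=True).mean() for _ in range(B)])

q_low, q_high = np.percentile(boot, [2.5, 97.5])

# Percentile interval: take the bootstrap quantiles directly
percentile_ci = (q_low, q_high)

# Centered (basic) interval: reflect the quantiles around the original estimate
centered_ci = (2 * theta_hat - q_high, 2 * theta_hat - q_low)

print(f"percentile 95% CI: ({percentile_ci[0]:.2f}, {percentile_ci[1]:.2f})")
print(f"centered   95% CI: ({centered_ci[0]:.2f}, {centered_ci[1]:.2f})")
```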

Bootstrap-t criterion

As already noted, the bootstrap can use a function of the form $t = (\hat{\theta} - \theta)/\widehat{se}$, where $\widehat{se}$ is a sample estimate of the standard error of $\hat{\theta}$.

This gives additional precision.

As a basic example, let us take the standard t-statistic (hence the name of the method): this is the special case where $\theta = \mu$ (the population mean), $\hat{\theta} = \bar{x}$ (the sample mean) and $\widehat{se} = s/\sqrt{n}$, with $s$ the sample standard deviation. The bootstrap analogue of such a function is $t^* = (\bar{x}^* - \bar{x})/(s^*/\sqrt{n})$, where $s^*$ is calculated in the same way as $s$, only on the bootstrap sample.

Let us denote the $100\alpha$-th bootstrap percentile of $t^*$ by $t^*_{(\alpha)}$, and state that the value $t$ lies in the interval from $t^*_{(2.5)}$ to $t^*_{(97.5)}$.

Using the equality $t = (\bar{x} - \mu)/(s/\sqrt{n})$, one can rewrite the previous statement for $\mu$ itself, i.e., $\mu$ lies in the interval from $\bar{x} - t^*_{(97.5)}\,s/\sqrt{n}$ to $\bar{x} - t^*_{(2.5)}\,s/\sqrt{n}$.

This interval is called the bootstrap-t confidence interval for $\mu$ at the 95% level.

It is used in the literature to achieve greater accuracy than the previous approach.
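A sketch of the bootstrap-t interval for the mean, following the construction above (numpy only; the data and B are assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(3)

x = np.array([8.5, 12.1, 9.7, 15.3, 11.0, 7.9, 14.2, 10.5, 13.8, 9.1])
n, B = x.size, 2000

x_bar = x.mean()
s = x.std(ddof=1)

t_star = np.empty(B)
for b in range(B):
    xb = rng.choice(x, size=n, replace=True)
    t_star[b] = (xb.mean() - x_bar) / (xb.std(ddof=1) / np.sqrt(n))

q_low, q_high = np.percentile(t_star, [2.5, 97.5])

# Invert the studentized statistic to get limits for the population mean
ci = (x_bar - q_high * s / np.sqrt(n), x_bar - q_low * s / np.sqrt(n))
print(f"bootstrap-t 95% CI for the mean: ({ci[0]:.2f}, {ci[1]:.2f})")
```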

Real Data Example

As a first example, take the data from Hollander and Wolfe (1999, p. 63) on the effect of light on the rate of chick hatching.

The standard box plot suggests a lack of normality in the population data. A bootstrap analysis of the median and the mean was carried out.

Separately, it is worth noting the lack of symmetry in the bootstrap-t histogram, which differs from the standard limiting curve. The 95% confidence intervals for the median and the mean (calculated by the bootstrap percentile method) roughly cover a common range of values; this range represents the overall difference (increase) in the chick hatching rate depending on the lighting.

As a second example, consider the data from Devore (2003, p. 553), which concern the correlation between Biochemical Oxygen Demand (BOD) and Hydrostatic Weight (HW) measurements of professional football players.

Two-dimensional data consist of pairs, and whole pairs are drawn during bootstrap resampling: for example, first one pair is taken, then another, and so on.

In the figure, the box-whisker plot shows the lack of normality for the main populations. Correlation histograms calculated from 2D bootstrap data are asymmetric (shifted to the left).

For this reason, the centered bootstrap percentile method is more appropriate in this case.

As a result of the analysis, it turned out that the correlation between the measurements in the population is at least 0.78.

Data for example 1:

8.5 -4.6 -1.8 -0.8 1.9 3.9 4.7 7.1 7.5 8.5 14.8 16.7 17.6 19.7 20.6 21.9 23.8 24.7 24.7 25.0 40.7 46.9 48.3 52.8 54.0

Data for example 2:

2.5 4.0 4.1 6.2 7.1 7.0 8.3 9.2 9.3 12.0 12.2 12.6 14.2 14.4 15.1 15.2 16.3 17.1 17.9 17.9

8.0 6.2 9.2 6.4 8.6 12.2 7.2 12.0 14.9 12.1 15.3 14.8 14.3 16.3 17.9 19.5 17.5 14.3 18.3 16.2

The literature often suggests different schemes for bootstrapping, which could give reliable results in different statistical situations.

What was discussed above are only the most basic elements; in fact there are many other scheme variants. For example, which method is better to use in the case of two-stage or stratified sampling?

In this case, it is not difficult to invent a natural scheme. Bootstrapping for data described by regression models attracts a great deal of attention. There are two main methods: in the first, the covariates and the response variables are resampled together (pairwise bootstrapping); in the second, bootstrapping is performed on the residuals (residual bootstrapping).

The pairwise method remains correct (in terms of the limiting results as $n \to \infty$) even if the error variances in the model are not equal. The second method is incorrect in this case. This drawback is compensated by the fact that such a scheme gives additional accuracy in estimating the standard error.
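A compact sketch of both schemes for simple linear regression (numpy only; the simulated data and B are assumptions for the illustration):

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated regression data: y = 2 + 0.5 x + error
n, B = 30, 1000
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=n)

def fit_slope(x, y):
    """Ordinary least-squares slope."""
    return np.polyfit(x, y, 1)[0]

slope = fit_slope(x, y)
fitted = np.polyval(np.polyfit(x, y, 1), x)
residuals = y - fitted

pair_slopes, resid_slopes = np.empty(B), np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)
    # Pairwise bootstrap: resample (x, y) pairs together
    pair_slopes[b] = fit_slope(x[idx], y[idx])
    # Residual bootstrap: keep x fixed, resample the residuals
    y_star = fitted + rng.choice(residuals, size=n, replace=True)
    resid_slopes[b] = fit_slope(x, y_star)

print(f"slope = {slope:.3f}")
print(f"pairwise bootstrap SE = {pair_slopes.std(ddof=1):.3f}")
print(f"residual bootstrap SE = {resid_slopes.std(ddof=1):.3f}")
```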

It is much more difficult to apply bootstrapping to time series data.

Time series analysis, however, is one of the key areas of econometrics. Two main difficulties can be distinguished here: first, time series data are sequentially dependent, that is, $x_t$ depends on $x_{t-1}$, and so on.

Secondly, the statistical population changes over time, that is, non-stationarity appears.

To deal with this, methods have been developed that transfer the dependence of the source data to the bootstrap samples, in particular the block scheme (block bootstrap).

Instead of selecting individual observations, blocks of data are built at once, which retain the dependence structure of the original sample.
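A minimal sketch of a moving-block bootstrap (numpy only; the AR(1) series, the block length, and B are assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated AR(1) series: x_t = 0.6 * x_{t-1} + noise
T, B, block_len = 100, 1000, 10
x = np.zeros(T)
for t in range(1, T):
    x[t] = 0.6 * x[t - 1] + rng.normal()

n_blocks = T // block_len
boot_means = np.empty(B)
for b in range(B):
    # Draw block start points and glue the blocks into one bootstrap series
    starts = rng.integers(0, T - block_len + 1, size=n_blocks)
    series = np.concatenate([x[s:s + block_len] for s in starts])
    boot_means[b] = series.mean()

print(f"sample mean = {x.mean():.3f}")
print(f"block-bootstrap SE of the mean = {boot_means.std(ddof=1):.3f}")
```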

Quite a lot of research is currently being carried out on applying the bootstrap in various areas of econometrics; in general, the method is being actively developed.

Small Sample Method

The main advantage of the small sample method is the ability to estimate the dynamics of the process over time with a reduction in the time for computational procedures.

Instantaneous samples of 5 to 20 units are taken at random at certain time intervals. The sampling interval is established empirically and depends on the stability of the process, as determined by the analysis of a priori information.

For each instantaneous sample, the main statistical characteristics are determined. Instantaneous samples and their main statistical characteristics are presented in Appendix B.

A hypothesis about the homogeneity of the sample variances is put forward and tested using one of the possible criteria (here, Fisher's F-test).

Testing the hypothesis about the homogeneity of sample characteristics.

To check the significance of the difference between the arithmetic means in two series of measurements, the measure G is introduced. The calculations are given in Appendix B.

The decision rule is formulated as follows:

where $t_P$ is the value of the quantile of the normalized distribution for a given confidence probability P; for P = 0.95 and n = 10, $t_P$ = 2.78.

When the inequality is fulfilled, the hypothesis that the difference between the sample means is not significant is confirmed.

Since the inequality is satisfied in all cases, the hypothesis that the difference between the sample means is not significant is confirmed.

To test the hypothesis about the homogeneity of the sample variances, the measure $F_0$ is introduced as the ratio of the unbiased variance estimates of the results of two series of measurements. The larger of the two estimates is taken as the numerator: if $S_{x1} > S_{x2}$, then $F_0 = S_{x1}^2 / S_{x2}^2$.

The calculation results are given in Appendix B.

Then the value of the confidence probability P is set and the value of F(K1; K2; α/2) is determined for K1 = n1 - 1 and K2 = n2 - 1.

At P = 0.025, with K1 = 10 - 1 = 9 and K2 = 10 - 1 = 9, F(9; 9; 0.025) = 4.1.

Decision rule: if F(K1; K2; α/2) > F0, then the hypothesis of homogeneity of the variances of the two samples is accepted.

Since the condition F(K1; K2; α/2) > F0 is satisfied in all cases, the hypothesis of homogeneity of variances is accepted.
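A sketch of this check for two instantaneous samples in Python (scipy is used for the F quantile; the data are hypothetical):

```python
import numpy as np
from scipy import stats

# Two hypothetical instantaneous samples of 10 units each
sample_1 = np.array([10.2, 10.5, 9.8, 10.1, 10.4, 9.9, 10.3, 10.0, 10.2, 9.7])
sample_2 = np.array([10.1, 9.6, 10.4, 10.0, 9.8, 10.2, 9.9, 10.5, 10.1, 9.7])

s1 = sample_1.var(ddof=1)
s2 = sample_2.var(ddof=1)

# The larger variance estimate goes in the numerator
f0 = max(s1, s2) / min(s1, s2)
k1 = k2 = sample_1.size - 1

alpha = 0.05
f_crit = stats.f.ppf(1 - alpha / 2, k1, k2)   # F(K1; K2; alpha/2)

print(f"F0 = {f0:.2f}, critical F = {f_crit:.2f}")
print("variances homogeneous" if f_crit > f0 else "variances differ")
```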

Thus, the hypothesis about the homogeneity of sample variances is confirmed, which indicates the stability of the process; the hypothesis about the homogeneity of the sample means according to the method of comparison of means is confirmed, which means that the center of dispersion has not changed and the process is in a stable state.

Method of scatter and accuracy diagrams

At certain time intervals, instantaneous samples of 3 to 10 products are taken, and the statistical characteristics of each sample are determined.

The data obtained are plotted on charts: the abscissa shows the time t or the numbers k of the samples, and the ordinate shows the individual values x_k or the value of one of the statistical characteristics (the sample arithmetic mean, the sample standard deviation). In addition, two horizontal lines, Tv (upper) and Tn (lower), are drawn on the diagram, bounding the tolerance field of the product.
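A small sketch of such an accuracy chart in Python with matplotlib; the instantaneous samples and the tolerance limits are invented for the illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)

# Five hypothetical instantaneous samples of 5 products each
samples = [rng.normal(10.0, 0.15, size=5) for _ in range(5)]
means = [s.mean() for s in samples]

T_upper, T_lower = 10.5, 9.5   # tolerance limits of the product

plt.plot(range(1, len(means) + 1), means, "o-", label="sample mean")
plt.axhline(T_upper, color="red", linestyle="--", label="Tv (upper tolerance)")
plt.axhline(T_lower, color="red", linestyle="--", label="Tn (lower tolerance)")
plt.xlabel("sample number k")
plt.ylabel("value of the characteristic")
plt.title("Accuracy chart")
plt.legend()
plt.show()
```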

The instantaneous samples are given in Appendix B.


Figure 1 Accuracy Chart

The diagram clearly displays the course of the production process. It can be concluded that the production process is unstable.