7.3.1. Coefficients of correlation and determination. The closeness of the relationship between factors, and its direction (direct or inverse), can be quantified by calculating:

1) if it is necessary to determine a linear relationship between two factors - a paired correlation coefficient: subsections 7.3.2 and 7.3.3 cover the calculation of the paired linear Bravais–Pearson correlation coefficient (r) and Spearman's paired rank correlation coefficient (ρ);

2) if we want to determine the relationship between two factors, but this relationship is clearly non-linear - the correlation ratio (η);

3) if we want to determine the relationship between one factor and a set of other factors - the multiple correlation coefficient;

4) if we want to identify, in isolation, the relationship of one factor with one specific other factor belonging to a group of factors affecting the first, for which we must treat the influence of all other factors as unchanged - the partial correlation coefficient.

No correlation coefficient (r, ρ) can exceed 1 in absolute value, i.e. –1 ≤ r (ρ) ≤ +1. If a value of 1 is obtained, this means that the dependence under consideration is not statistical but functional; if 0 is obtained, there is no correlation at all.

The sign of the correlation coefficient determines the direction of the relationship: the "+" sign (or the absence of a sign) means that the relationship is direct (positive), the "–" sign that it is inverse (negative). The sign has nothing to do with the tightness of the relationship.

The correlation coefficient characterizes a statistical relationship. But often it is necessary to determine another kind of dependence, namely: what is the contribution of a certain factor to the formation of another factor related to it. This kind of dependence is, with a certain degree of conventionality, characterized by the determination coefficient (D), determined by the formula D = r² × 100% (where r is the Bravais–Pearson correlation coefficient, see 7.3.2). If the measurements were taken in an order (rank) scale, then, with some loss of reliability, the value of ρ (Spearman's correlation coefficient, see 7.3.3) can be substituted into the formula instead of r.

For example, if we obtained, as a characteristic of the dependence of factor B on factor A, the correlation coefficient r = 0.8 or r = –0.8, then D = 0.8² × 100% = 64%, that is, about 2/3. Therefore, the contribution of factor A and its changes to the formation of factor B is approximately 2/3 of the total contribution of all factors taken together.

7.3.2. Bravais–Pearson correlation coefficient. The procedure for calculating the Bravais–Pearson correlation coefficient (r) can be used only in those cases when the relationship is considered on the basis of samples that have a normal frequency distribution (normal distribution) and were obtained by measurements on interval or ratio scales. The calculation formula for this correlation coefficient is:



r = Σ(x_i – x̄)(y_i – ȳ) / (n·s_x·s_y).

What does the correlation coefficient show? First, the sign of the correlation coefficient shows the direction of the relationship: the "–" sign indicates that the relationship is inverse, or negative (there is a trend: as the values of one factor decrease, the corresponding values of the other factor increase, and as they increase, they decrease), while the absence of a sign or the "+" sign indicates a direct, or positive, relationship (there is a trend: with an increase in the values of one factor, the values of the other increase, and with a decrease, they decrease). Second, the absolute (sign-independent) value of the correlation coefficient indicates the tightness (strength) of the relationship. It is customary to assume (rather conventionally): for values r < 0.3 the correlation is very weak, and it is often simply not taken into account; for 0.3 ≤ r < 0.5 the correlation is weak; for 0.5 ≤ r < 0.7, average; for 0.7 ≤ r ≤ 0.9, strong; and, finally, for r > 0.9, very strong. In our case (r ≈ –0.83), the relationship is inverse (negative) and strong.

Recall that the values of the correlation coefficient can lie only in the range from –1 to +1. If the value of r goes beyond these limits, this indicates that a mistake was made in the calculations. If |r| = 1, the relationship is not statistical but functional, which practically never happens in sports, biology, or medicine. Although with a small number of measurements a random selection of values that gives the picture of a functional relationship is possible, such a case is the less likely, the larger the volume of the compared samples (n), that is, the number of pairs of compared measurements.

The calculation table (Table 7.1) is built according to the formula.

Table 7.1.

Calculation table for the Bravais–Pearson coefficient

x_i | y_i | x_i – x̄ | (x_i – x̄)² | y_i – ȳ | (y_i – ȳ)² | (x_i – x̄)(y_i – ȳ)
13.2 | 4.75 | 0.2 | 0.04 | –0.35 | 0.1225 | –0.07
13.5 | 4.70 | 0.5 | 0.25 | –0.40 | 0.1600 | –0.20
12.7 | 5.10 | –0.3 | 0.09 | 0.00 | 0.0000 | 0.00
12.5 | 5.40 | –0.5 | 0.25 | 0.30 | 0.0900 | –0.15
13.0 | 5.10 | 0.0 | 0.00 | 0.00 | 0.0000 | 0.00
13.2 | 5.00 | 0.1 | 0.01 | –0.10 | 0.0100 | –0.02
13.1 | 5.00 | 0.1 | 0.01 | –0.10 | 0.0100 | –0.01
13.4 | 4.65 | 0.4 | 0.16 | –0.45 | 0.2025 | –0.18
12.4 | 5.60 | –0.6 | 0.36 | 0.50 | 0.2500 | –0.30
12.3 | 5.50 | –0.7 | 0.49 | 0.40 | 0.1600 | –0.28
12.7 | 5.20 | –0.3 | 0.09 | 0.10 | 0.0100 | –0.03
Σx_i = 143, x̄ = 13.00 | Σy_i = 56.1, ȳ = 5.1 | | Σ(x_i – x̄)² = 1.78 | | Σ(y_i – ȳ)² = 1.015 | Σ(x_i – x̄)(y_i – ȳ) = –1.24

Since s_x = √(Σ(x_i – x̄)²/(n – 1)) = √(1.78/10) ≈ 0.42 and s_y = √(1.015/10) ≈ 0.32, we get

r = –1.24 / (11 × 0.42 × 0.32) ≈ –1.24/1.48 ≈ –0.83.

In other words, you need to know very firmly that the correlation coefficient cannot exceed 1.0 in absolute value. This often makes it possible to avoid gross errors, or rather, to find and correct errors made in the calculations.
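For readers who want to check the arithmetic, here is a minimal Python sketch of the computation in Table 7.1 (variable names are mine). Note that the formula above divides by n·s_x·s_y while s_x and s_y are computed with n – 1 in the denominator; the standard estimator (scipy.stats.pearsonr) divides by (n – 1)·s_x·s_y, so the two values differ slightly, and both differ a little from the table's –0.83, which was obtained with rounded intermediate values.

```python
import numpy as np
from scipy.stats import pearsonr

# Paired measurements from Table 7.1
x = np.array([13.2, 13.5, 12.7, 12.5, 13.0, 13.2, 13.1, 13.4, 12.4, 12.3, 12.7])
y = np.array([4.75, 4.70, 5.10, 5.40, 5.10, 5.00, 5.00, 4.65, 5.60, 5.50, 5.20])

n = len(x)
sx, sy = x.std(ddof=1), y.std(ddof=1)            # standard deviations (n - 1 in the denominator)
cross = np.sum((x - x.mean()) * (y - y.mean()))  # sum of products of deviations

r_textbook = cross / (n * sx * sy)   # the formula as printed in 7.3.2
r_standard, _ = pearsonr(x, y)       # the usual estimator, (n - 1) in the denominator

print(f"textbook formula: r = {r_textbook:.2f}")
print(f"scipy.stats.pearsonr: r = {r_standard:.2f}")
```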

7.3.3. Spearman correlation coefficient. As already mentioned, the Bravais–Pearson correlation coefficient (r) can be applied only in those cases when the analyzed factors are close to normal in frequency distribution and the values of the variants are obtained by measurements necessarily on a ratio scale or an interval scale, which happens when they are expressed in physical units. In other cases, the Spearman correlation coefficient (ρ) is found. However, this coefficient can also be applied in cases where it is permissible (and desirable!) to apply the Bravais–Pearson correlation coefficient. It should be borne in mind, though, that the procedure for determining the Bravais–Pearson coefficient has more power ("resolving ability"), which is why r is more informative than ρ. Even with a large n, the deviation of ρ from r may be of the order of ±10%.

The calculation formula for the Spearman correlation coefficient is:

ρ = 1 – 6Σd_R² / (n(n² – 1)).

We use our example for the calculation of ρ, but build a different table (Table 7.2).

Table 7.2

x_i | y_i | R_x | R_y | |d_R| | d_R²
13.2 | 4.75 | 8.5 | 3.0 | 5.5 | 30.25
13.5 | 4.70 | 11.0 | 2.0 | 9.0 | 81.00
12.7 | 5.10 | 4.5 | 6.5 | 2.0 | 4.00
12.5 | 5.40 | 3.0 | 9.0 | 6.0 | 36.00
13.0 | 5.10 | 6.0 | 6.5 | 0.5 | 0.25
13.2 | 5.00 | 8.5 | 4.5 | 4.0 | 16.00
13.1 | 5.00 | 7.0 | 4.5 | 2.5 | 6.25
13.4 | 4.65 | 10.0 | 1.0 | 9.0 | 81.00
12.4 | 5.60 | 2.0 | 11.0 | 9.0 | 81.00
12.3 | 5.50 | 1.0 | 10.0 | 9.0 | 81.00
12.7 | 5.20 | 4.5 | 8.0 | 3.5 | 12.25
Σd_R² = 423

Substitute the values:

ρ = 1 – 6 × 423 / (11 × (11² – 1)) = 1 – 2538/1320 ≈ 1 – 1.9 ≈ –0.9.

We see that |ρ| turned out to be somewhat larger than |r|, but the difference is not great. After all, with so small an n, the values of r and ρ are very approximate and not very reliable; their actual value can fluctuate widely, so a difference between r and ρ of 0.1 is insignificant. Usually ρ is considered an analogue of r, but a less accurate one. The signs of r and ρ show the direction of the relationship.
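The rank computation is just as easy to script. Below is a sketch on the same data (tied values receive averaged ranks, as in Table 7.2). The simplified formula assumes no ties, so scipy.stats.spearmanr, which handles ties exactly, gives a slightly different value; small arithmetic slips in the printed Σd_R² also shift the result a little.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

# The same paired measurements as in Table 7.2
x = np.array([13.2, 13.5, 12.7, 12.5, 13.0, 13.2, 13.1, 13.4, 12.4, 12.3, 12.7])
y = np.array([4.75, 4.70, 5.10, 5.40, 5.10, 5.00, 5.00, 4.65, 5.60, 5.50, 5.20])

rx = rankdata(x)   # tied values receive the average of their ranks
ry = rankdata(y)
d = rx - ry
n = len(x)

rho_simple = 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))  # the simplified formula of 7.3.3
rho_exact, _ = spearmanr(x, y)                        # Pearson r of the two rank series

print(f"simplified formula: rho = {rho_simple:.2f}")
print(f"scipy.stats.spearmanr: rho = {rho_exact:.2f}")
```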

7.3.4. Application and validation of correlation coefficients. Determining the degree of correlation between factors is necessary for controlling the development of the factor we need: to do so, we have to influence other factors that significantly affect it, and we need to know the measure of their effectiveness. It is also necessary to know about the relationships of factors in order to develop or select ready-made tests: the informativeness of a test is determined by the correlation of its results with the manifestations of the trait or property of interest to us. Without knowledge of correlations, no form of selection is possible.

It was noted above that in sports and, in general, in pedagogical, medical, and even economic and sociological practice, it is of great interest to determine the contribution that one factor makes to the formation of another. This is because, in addition to the factor-cause under consideration, other factors also act on the target factor (the one of interest to us), each making one contribution or another to it.

It is believed that the measure of the contribution of each factor-cause can be the coefficient of determination D_i = r² × 100%. So, for example, if r = 0.6, i.e. the relationship between factors A and B is average, then D = 0.6² × 100% = 36%. Knowing, therefore, that the contribution of factor A to the formation of factor B is approximately 1/3, it is possible, for example, to devote approximately 1/3 of training time to it. If the correlation coefficient is r = 0.4, then D = r² × 100% = 16%, or approximately 1/6, about twice as small, and by this logic only about 1/6 of training time should be devoted to it.

The values of D_i for the various significant factors give an approximate idea of the quantitative relationship of their influences on the target factor of interest to us, for the sake of improving which we, in fact, work on the other factors (for example, a long jumper works on increasing his sprint speed, since it is the factor that makes the most significant contribution to the formation of the jump result).

Recall that when determining D, ρ may be substituted for r, although the accuracy of the determination is then, of course, lower.

From a sample correlation coefficient (one calculated from sample data) alone, it is impossible to conclude that a connection between the considered factors exists in general. In order to draw such a conclusion with varying degrees of validity, standard correlation significance criteria are used. Their application assumes a linear relationship between the factors and a normal frequency distribution in each of them (meaning not the sample distribution, but that of the general population).

One can, for example, apply Student's t-test. Its calculation formula is:

t_p = r·√(n – 2) / √(1 – r²),

where r is the studied sample correlation coefficient and n is the volume of the compared samples. The resulting calculated value of the t-criterion (t_p) is compared with the table value at the significance level we have chosen and the number of degrees of freedom ν = n – 2. To avoid the calculation work, one can use a special table of critical values of sample correlation coefficients (Table 7.3), corresponding to the presence of a significant relationship between the factors (taking into account n and α).
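A minimal sketch of this significance check (the function name is mine; scipy supplies the critical points of Student's t):

```python
import math
from scipy.stats import t

def r_significant(r: float, n: int, alpha: float = 0.05) -> bool:
    """Test H0: rho = 0 with the criterion t_p = r * sqrt(n - 2) / sqrt(1 - r^2)."""
    t_p = abs(r) * math.sqrt(n - 2) / math.sqrt(1 - r**2)
    t_crit = t.ppf(1 - alpha / 2, df=n - 2)   # two-sided critical point, nu = n - 2
    print(f"t_p = {t_p:.2f}, critical t(alpha={alpha}, nu={n - 2}) = {t_crit:.2f}")
    return t_p >= t_crit

r_significant(-0.83, 11)   # the Bravais-Pearson coefficient from 7.3.2
```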

Table 7.3.

Boundary values ​​of the reliability of the sample correlation coefficient

The number of degrees of freedom when determining correlation coefficients is taken as n – 2 (i.e. ν = n – 2). For the values indicated in Table 7.3, the lower bound of the confidence interval of the true correlation coefficient equals 0; that is, at such values it cannot yet be argued that correlation takes place at all. If the value of the sample correlation coefficient is higher than the one indicated in the table, it can be considered, at the corresponding significance level, that the true correlation coefficient is not equal to zero.

But an affirmative answer to the question of whether there is a real connection between the factors under consideration leaves room for another question: in what interval does the true value of the correlation coefficient lie, as it would be with an infinitely large n? This interval, for any particular values of r and n of the compared factors, can be calculated, but it is more convenient to use a system of graphs (a nomogram), where each pair of curves, constructed for the n specified above them, corresponds to the boundaries of the interval.

Fig. 7.4. Confidence limits of the sample correlation coefficient (α = 0.05). Each pair of curves corresponds to the n indicated above it.

Referring to the nomogram in Fig. 7.4, one can determine the interval of values of the true correlation coefficient for the calculated values of the sample correlation coefficient at α = 0.05.

7.3.5. Correlation ratios. If a paired correlation is non-linear, the correlation coefficient cannot be calculated; instead, correlation ratios are determined. A mandatory requirement: the features must be measured on a ratio scale or an interval scale. One can calculate the correlation dependence of factor X on factor Y and the correlation dependence of factor Y on factor X; they are different. With a small volume n of the samples representing the factors, the correlation ratios can be calculated using the formulas:

correlation ratio η_x|y = √(1 – Σ(x_i – x̄_y)² / Σ(x_i – x̄)²);

correlation ratio η_y|x = √(1 – Σ(y_i – ȳ_x)² / Σ(y_i – ȳ)²).

Here x̄ and ȳ are the arithmetic means of samples X and Y, and x̄_y and ȳ_x are the intraclass arithmetic means. That is, x̄_y is the arithmetic mean of those values in the sample of factor X which are conjugate with equal values in the sample of factor Y (for example, if factor X has values 4, 6 and 5, with which 3 variants with the same value 9 are associated in the sample of factor Y, then x̄_y = (4 + 6 + 5)/3 = 5). Accordingly, ȳ_x is the arithmetic mean of those values in the sample of factor Y which are associated with identical values in the sample of factor X. Let us give an example and calculate:

X: 75 77 78 76 80 79 83 82;  Y: 42 42 43 43 43 44 44 45.

Table 7.4

Calculation table

x_i | y_i | x̄_y | x_i – x̄ | (x_i – x̄)² | x_i – x̄_y | (x_i – x̄_y)²
(deviations computed for each of the eight pairs)
x̄ = 79, ȳ = 43 | | | | Σ(x_i – x̄)² = 76 | | Σ(x_i – x̄_y)² = 28

Therefore η²_x|y = 1 – 28/76 ≈ 0.63, i.e. η_x|y ≈ 0.8.
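Below is a sketch of the same computation (names are mine). The sums printed in Table 7.4 differ a little from a direct recomputation from the raw pairs, so the sketch's output does not reproduce them exactly:

```python
import numpy as np
from collections import defaultdict

# Data from the example in 7.3.5
x = np.array([75, 77, 78, 76, 80, 79, 83, 82], dtype=float)
y = np.array([42, 42, 43, 43, 43, 44, 44, 45], dtype=float)

# Intraclass means: mean of the x values tied to each distinct y value
groups = defaultdict(list)
for xi, yi in zip(x, y):
    groups[yi].append(xi)
x_bar_y = np.array([np.mean(groups[yi]) for yi in y])

ss_total = np.sum((x - x.mean())**2)    # sum of (x_i - x_mean)^2
ss_within = np.sum((x - x_bar_y)**2)    # sum of (x_i - intraclass mean)^2

eta = np.sqrt(1 - ss_within / ss_total) # correlation ratio of x on y
print(f"eta = {eta:.2f}")
```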

7.3.6. Partial and multiple correlation coefficients. When evaluating the relationship between two factors by calculating correlation coefficients, we assume by default that no other factors affect this relationship. In reality this is not so. Thus, the relationship between weight and height is very significantly affected by calorie intake, the amount of systematic physical activity, heredity, etc. When, in assessing the relationship between two factors, it is necessary to take into account the significant influence of other factors and at the same time to isolate ourselves from them, considering them unchanged, partial correlation coefficients are calculated.

Example: we need to evaluate the paired dependencies between three essential factors X, Y and Z. Denote by r_XY(Z) the partial correlation coefficient between factors X and Y (the value of factor Z being considered unchanged), by r_XZ(Y) the partial correlation coefficient between factors X and Z (with the value of factor Y constant), and by r_ZY(X) the partial correlation coefficient between factors Z and Y (with the value of factor X constant). Using the computed simple paired (Bravais–Pearson) correlation coefficients r_XY, r_XZ and r_YZ,

one can calculate the partial correlation coefficients by the formulas:

r_XY(Z) = (r_XY – r_XZ·r_YZ) / √((1 – r²_XZ)(1 – r²_YZ));

r_XZ(Y) = (r_XZ – r_XY·r_ZY) / √((1 – r²_XY)(1 – r²_ZY));

r_ZY(X) = (r_ZY – r_ZX·r_YX) / √((1 – r²_ZX)(1 – r²_YX)).

Partial correlation coefficients can also take values from –1 to +1. By squaring them, we get the corresponding partial determination coefficients, also called partial measures of certainty (multiplying by 100 expresses them in %). Partial correlation coefficients differ more or less from the simple (full) paired coefficients, depending on the strength of the influence of the third factor (treated as unchanged) on them. The null hypothesis (H₀), that is, the hypothesis that there is no connection (dependence) between factors X and Y, is tested (with a total number of features k) by calculating the t-criterion according to the formula:

t_P = r_XY(Z) · √(n – k) / √(1 – r²_XY(Z)).

If t_P < t_αν, the hypothesis is accepted (we assume that there is no dependence); if t_P ≥ t_αν, the hypothesis is rejected, that is, the dependence is believed really to take place. t_αν is taken from the table of Student's t-criterion; k is the number of factors taken into account (in our example, 3), and the number of degrees of freedom is ν = n – 3. The other partial correlation coefficients are checked similarly (r_XZ(Y) or r_ZY(X) is substituted into the formula instead of r_XY(Z)).

Table 7.5

Initial data


To assess the dependence of factor X on the combined action of several factors (here, factors Y and Z), the values of the simple paired correlation coefficients are calculated and used to compute the multiple correlation coefficient r_X(YZ):

r_X(YZ) = √((r²_XY + r²_XZ – 2·r_XY·r_XZ·r_YZ) / (1 – r²_YZ)).
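Both formulas are easy to put into code. Here is a minimal sketch; the three pairwise coefficients below are hypothetical values chosen only for illustration:

```python
import math

def partial_r(r_xy: float, r_xz: float, r_yz: float) -> float:
    """Partial correlation of X and Y with Z held fixed."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))

def multiple_r(r_xy: float, r_xz: float, r_yz: float) -> float:
    """Multiple correlation of X with the combined action of Y and Z."""
    num = r_xy**2 + r_xz**2 - 2 * r_xy * r_xz * r_yz
    return math.sqrt(num / (1 - r_yz**2))

# Illustrative pairwise coefficients (hypothetical, not from the text)
r_xy, r_xz, r_yz = 0.71, 0.50, 0.40
print(f"r_XY(Z) = {partial_r(r_xy, r_xz, r_yz):.2f}")
print(f"r_X(YZ) = {multiple_r(r_xy, r_xz, r_yz):.2f}")
```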

7.3.7. Association coefficient. It is often necessary to quantify the relationship between qualitative features, i.e. features that cannot be represented (characterized) quantitatively and are immeasurable. For example, the task is to find out whether there is a relationship between the sports specialization of athletes and such personal properties as introversion (the personality's focus on the phenomena of its own subjective world) and extraversion (the personality's focus on the world of external objects). The symbols are presented in Table 7.6.

Table 7.6.

Feature 1 \ Feature 2 | Introversion | Extraversion
Sport games | a | b
Gymnastics | c | d

Obviously, the numbers at our disposal here can only be distribution frequencies. In this case, the association coefficient (another name: "contingency coefficient") is calculated. Let us consider the simplest case: the relationship between two pairs of features, for which the calculated contingency coefficient is called tetrachoric (see Table 7.7).

Table 7.7.

a = 20 | b = 15 | a + b = 35
c = 15 | d = 5 | c + d = 20
a + c = 35 | b + d = 20 | n = 55

We make the calculation according to the formula:

r_A = (ad – bc) / √((a + b)(c + d)(a + c)(b + d)) = (100 – 225) / √(35 × 20 × 35 × 20) = –125/700 ≈ –0.18.

The calculation of association (contingency) coefficients with a larger number of features involves calculations using a similar matrix of the corresponding order.
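A one-function sketch of the tetrachoric calculation on the frequencies of Table 7.7:

```python
import math

def association_coefficient(a: float, b: float, c: float, d: float) -> float:
    """Tetrachoric (four-cell) association coefficient for a 2x2 frequency table."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

# Frequencies from Table 7.7: (20*5 - 15*15) / 700 = -0.18
print(f"r_A = {association_coefficient(20, 15, 15, 5):.2f}")
```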

When studying correlations, one tries to establish whether there is any relationship between two indicators in the same sample (for example, between the height and weight of children, or between IQ level and school performance) or between two different samples (for example, when comparing pairs of twins), and, if this relationship exists, whether an increase in one indicator is accompanied by an increase (positive correlation) or a decrease (negative correlation) in the other.

In other words, correlation analysis helps to establish whether it is possible to predict the possible values ​​of one indicator, knowing the value of another.

Until now, when analyzing the results of our experiment on the effects of marijuana, we have deliberately ignored such an indicator as reaction time. Meanwhile, it would be interesting to check whether there is a relationship between the efficiency of reactions and their speed. This would allow one, for example, to argue that the slower a person is, the more accurate and effective his actions will be, and vice versa.

For this purpose, two different methods can be used: the parametric method of calculating the Bravais–Pearson coefficient (r), and the calculation of the Spearman rank correlation coefficient (r_s), which applies to ordinal data, i.e. is non-parametric. However, let us first understand what a correlation coefficient is.

Correlation coefficient

The correlation coefficient is a value that can vary from –1 to +1. In the case of a complete positive correlation this coefficient is plus 1, and in the case of a complete negative one, minus 1. On a graph, this corresponds to a straight line passing through the points of intersection of the values of each pair of data.


If these points do not line up in a straight line but form a "cloud", the absolute value of the correlation coefficient becomes less than one and approaches zero as the cloud becomes rounder.

If the correlation coefficient is 0, there is no linear relationship between the two variables.

In the humanities, a correlation is considered strong if its coefficient is greater than 0.60; if it exceeds 0.90, the correlation is considered very strong. However, in order to be able to draw conclusions about the relationships between variables, the sample size is of great importance: the larger the sample, the more reliable the value of the obtained correlation coefficient. There are tables with the critical values of the Bravais–Pearson and Spearman correlation coefficients for different numbers of degrees of freedom (equal to the number of pairs minus 2, i.e. n – 2). Only if the correlation coefficients are greater than these critical values can they be considered reliable. So, in order for a correlation coefficient of 0.70 to be reliable, at least 8 pairs of data (ν = n – 2 = 6) should be taken into the analysis when calculating r (Table B.4), and 7 data pairs (ν = n – 2 = 5) when calculating r_s (Table 5 in Appendix B.5).

Bravais–Pearson coefficient

To calculate this coefficient, the following formula is used (it may look different in different authors):

r = (ΣXY – n·X̄·Ȳ) / ((n – 1)·S_X·S_Y),

where ΣXY is the sum of the products of the data in each pair;

n is the number of pairs;

X̄ is the mean of the data of variable X;

Ȳ is the mean of the data of variable Y;

S_X is the standard deviation for distribution X;

S_Y is the standard deviation for distribution Y.

Now we can use this coefficient to determine whether there is a relationship between the reaction time of the subjects and the effectiveness of their actions. Take, for example, the background level of the control group.

n = 15; n·X̄·Ȳ = 15 × 15.8 × 13.4 = 3175.8;

(n – 1)·S_X·S_Y = 14 × 3.07 × 2.29 = 98.42;

r = …

A negative value of the correlation coefficient may mean that the longer the reaction time, the lower the efficiency. However, its value here is too small for one to speak of a significant relationship between these two variables.

n·X̄·Ȳ = ………

(n – 1)·S_X·S_Y = ……

What conclusion can be drawn from these results? If you think that there is a relationship between the variables, then what is it: direct or inverse? Is it reliable [cf. Table 4 (in Appendix B.5) with the critical values of r]?

Spearman rank correlation coefficient r_s

This coefficient is easier to calculate, but the results are less accurate than those obtained with r. This is because the calculation of the Spearman coefficient uses the order of the data rather than their quantitative characteristics and the intervals between classes.

The point is that when using the Spearman rank correlation coefficient (r_s), one only checks whether the ranking of the data for one series coincides with that of another series paired with it for the same sample (for example, will students be "ranked" equally when they take both psychology and mathematics, or even with two different psychology professors?). If the coefficient is close to +1, the two series practically coincide; if it is close to –1, we can speak of a complete inverse relationship.

The coefficient r_s is calculated according to the formula

r_s = 1 – 6Σd² / (n(n² – 1)),

where d is the difference between the ranks of conjugate feature values (regardless of its sign), and n is the number of pairs.

Typically, this non-parametric test is used in cases where conclusions need to be drawn not so much about the intervals between the data as about their ranks, and also when the distribution curves are too asymmetric to allow the use of parametric criteria such as the coefficient r (in these cases it may be necessary to convert quantitative data to ordinal data).

Since this is the case with the distribution of the efficiency and reaction-time values in the experimental group after exposure, you can repeat the calculations that you have already done for this group, only now not for the coefficient r but for the indicator r_s. This will allow you to see how different the two indicators are*.

* It should be remembered that

1) for the number of hits, the 1st rank corresponds to the highest and the 15th to the lowest performance, while for the reaction time, the 1st rank corresponds to the shortest time, and the 15th to the longest;

2) ex aequo data are given an average rank.

Thus, as in the case of the coefficient r, a positive, albeit unreliable, result was obtained. Which of the two results is more plausible: r = –0.48 or r_s = +0.24? Such a question can arise only if the results are reliable.

I would like to emphasize once again that the essence of these two coefficients is somewhat different. A negative coefficient r indicates that the efficiency is most often higher the shorter the reaction time, while in calculating the coefficient r_s it was necessary to check whether faster subjects always react more accurately and slower ones less accurately.

Since a coefficient r_s equal to 0.24 was obtained in the experimental group after exposure, such a trend is obviously not traced here. Try to make sense of the data for the control group after exposure on your own, knowing that Σd² = 122.5:

r_s = … ; is it reliable?

What is your conclusion?…………………………………………………………………………………………………

…………………………………………………………………………………………………………………….

So, we have considered various parametric and non-parametric statistical methods used in psychology. Our review was very superficial, and its main task was to show the reader that statistics are not as frightening as they seem and mostly require common sense. We remind you that the data of the "experiment" we have dealt with here are fictitious and cannot serve as a basis for any conclusions. However, such an experiment would be worth doing. Since a purely classical technique was chosen for this experiment, the same statistical analysis could be used in many different experiments. In any case, it seems to us that we have outlined some main directions that may be useful to those who do not know where to start the statistical analysis of their results.

There are three main branches of statistics: descriptive statistics, inductive statistics and correlation analysis.

Regression analysis allows you to evaluate how one variable depends on another and what is the spread of the values ​​of the dependent variable around the straight line that defines the relationship. These estimates and the corresponding confidence intervals make it possible to predict the value of the dependent variable and determine the accuracy of this prediction.

The results of regression analysis can only be presented in a fairly complex numerical or graphical form. However, we are often interested not in predicting the value of one variable from the value of another, but simply in characterizing the tightness (strength) of the relationship between them, expressed as a single number.

This characteristic is called the correlation coefficient and is usually denoted by the letter r. The correlation coefficient can take values from –1 to +1. The sign of the correlation coefficient shows the direction of the connection (direct or inverse), and the absolute value shows the closeness of the connection. A coefficient equal to –1 defines as rigid a connection as one equal to +1. In the absence of a connection, the correlation coefficient is zero.

Fig. 8.10 shows examples of dependencies and the values of r corresponding to them. We will consider two correlation coefficients.

The Pearson correlation coefficient is intended to describe the linear relationship of quantitative traits; like regression analysis, it requires a normal distribution. When people simply speak of the "correlation coefficient", they almost always mean Pearson's correlation coefficient, and that is exactly what we will do.

Spearman's rank correlation coefficient can be used when the relationship is non-linear - and not only for quantitative, but also for ordinal features. This is a non-parametric method and does not require any particular type of distribution.

We have already spoken about quantitative, qualitative and ordinal features in Chap. 5. Quantitative features are ordinary numerical data, such as height, weight, temperature. The values of a quantitative trait can be compared with each other, and one can say which of them is greater, by how much and by how many times. For example, if one Martian weighs 15 g and another 10 g, the first is heavier than the second both by 5 g and one and a half times. The values of an ordinal feature can also be ordered, but one cannot say by how much or by how many times one exceeds another. In medicine, ordinal features are quite common. For example, the results of a vaginal Pap test are evaluated on the following scale: 1) normal, 2) mild dysplasia, 3) moderate dysplasia, 4) severe dysplasia, 5) cancer in situ. Both quantitative and ordinal features can be arranged in order; on this common property is based a large group of non-parametric criteria, which include the Spearman rank correlation coefficient. We will get acquainted with other non-parametric criteria in Chap. 10.

Pearson correlation coefficient

And yet, why can't regression analysis be used to describe the tightness of a relationship? The residual standard deviation could be used as a measure of the closeness of the relationship. However, if the dependent and independent variables are swapped, the residual standard deviation, like the other indicators of regression analysis, will be different.

Let us look at Fig. 8.11. Based on a sample of 10 Martians known to us, two regression lines were constructed. In one case, weight is the dependent variable; in the second, the independent one. The regression lines differ markedly.

Fig. 8.11. If you swap x and y, the regression equation will be different, but the correlation coefficient will remain the same.

It turns out that the relationship of height with weight is one thing, and that of weight with height another. The asymmetry of regression analysis is what prevents it from being used directly to characterize the strength of a relationship. The correlation coefficient, although its idea stems from regression analysis, is free from this shortcoming. We present the formula:

r = Σ(X – X̄)(Y – Ȳ) / √(Σ(X – X̄)²·Σ(Y – Ȳ)²),

where X̄ and Ȳ are the mean values of the variables X and Y. The expression for r is "symmetrical": swapping X and Y, we get the same value. The correlation coefficient takes values from –1 to +1. The closer the relationship, the greater the absolute value of the correlation coefficient. The sign shows the direction of the connection. For r > 0 one speaks of a direct correlation (as one variable increases, the other also increases); for r < 0, of an inverse one (as one variable increases, the other decreases). Let us take the example with the 10 Martians, which we have already considered from the point of view of regression analysis, and calculate the correlation coefficient. The initial data and intermediate results of the calculations are given in Table 8.3. Sample size n = 10; the average height is

X̄ = ΣX/n = 369/10 = 36.9 and the average weight Ȳ = ΣY/n = 103.8/10 = 10.38.

We find Σ(X – X̄)(Y – Ȳ) = 99.9, Σ(X – X̄)² = 224.8, Σ(Y – Ȳ)² = 51.9.

Let us substitute the obtained values into the formula for the correlation coefficient:

r = 99.9 / √(224.8 × 51.9) ≈ 0.92.

The value of r is close to 1, which indicates a close relationship between height and weight. To get a better idea of which correlation coefficient should be considered large and which insignificant, take a look at

Table 8.3. Calculation of the correlation coefficient

X | Y | X – X̄ | Y – Ȳ | (X – X̄)(Y – Ȳ) | (X – X̄)² | (Y – Ȳ)²
31 | 7.8 | –5.9 | –2.6 | 15.3 | 34.8 | 6.8
32 | 8.3 | –4.9 | –2.1 | 10.3 | 24.0 | 4.4
33 | 7.6 | –3.9 | –2.8 | 10.9 | 15.2 | 7.8
34 | 9.1 | –2.9 | –1.3 | 3.8 | 8.4 | 1.7
35 | 9.6 | –1.9 | –0.8 | 1.5 | 3.6 | 0.6
35 | 9.8 | –1.9 | –0.6 | 1.1 | 3.6 | 0.4
40 | 11.8 | 3.1 | 1.4 | 4.3 | 9.6 | 2.0
41 | 12.1 | 4.1 | 1.7 | 7.0 | 16.8 | 2.9
42 | 14.7 | 5.1 | 4.3 | 22.0 | 26.0 | 18.5
46 | 13.0 | 9.1 | 2.6 | 23.7 | 82.8 | 6.8
369 | 103.8 | 0.0 | 0.2 | 99.9 | 224.8 | 51.9


Table 8.4, which shows the correlation coefficients for the examples that we analyzed earlier.
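A short Python sketch ties the last two points together: it reproduces r ≈ 0.92 from the raw columns of Table 8.3 (tiny deviations in the last digit come from the table's rounded sums), shows that swapping X and Y leaves r unchanged, and shows that the two regression slopes of Fig. 8.11 are genuinely different:

```python
import numpy as np

# Height (X) and weight (Y) of the 10 Martians from Table 8.3
X = np.array([31, 32, 33, 34, 35, 35, 40, 41, 42, 46], dtype=float)
Y = np.array([7.8, 8.3, 7.6, 9.1, 9.6, 9.8, 11.8, 12.1, 14.7, 13.0])

dx, dy = X - X.mean(), Y - Y.mean()
r = np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))
print(f"r = {r:.3f}")                                     # about 0.925

# The correlation coefficient is symmetric in X and Y ...
print(np.corrcoef(X, Y)[0, 1], np.corrcoef(Y, X)[0, 1])   # identical values

# ... but the regression slopes are not interchangeable:
b_yx = np.polyfit(X, Y, 1)[0]   # slope of weight on height
b_xy = np.polyfit(Y, X, 1)[0]   # slope of height on weight, not 1/b_yx
print(f"slope Y on X: {b_yx:.3f}; slope X on Y: {b_xy:.3f}")
```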

Relationship between regression and correlation

We initially used all the examples of correlation coefficients (Table 8.4) to build regression lines. Indeed, there is a close relationship between the correlation coefficient and the parameters of regression analysis, which we will now demonstrate. The different ways of presenting the correlation coefficient that we will obtain along the way will allow us to better understand the meaning of this indicator.

Recall that the regression equation is constructed in such a way as to minimize the sum of squared deviations from the regression line.


We denote this minimum sum of squares by S_res (this value is called the residual sum of squares), and the sum of the squared deviations of the values of the dependent variable Y from its mean Ȳ by S_tot (the total sum of squares). Then

r² = 1 – S_res / S_tot.

The value r² is called the coefficient of determination; it is simply the square of the correlation coefficient. The coefficient of determination shows the strength of the connection, but not its direction.

From the above formula it can be seen that if the values of the dependent variable lie exactly on the regression line, then S_res = 0, and thus r = +1 or r = –1, that is, there is a linear relationship between the dependent and independent variables: any value of the independent variable allows one to predict the value of the dependent variable exactly. Conversely, if the variables are not related at all, then S_res = S_tot and r = 0.

It can also be seen that the coefficient of determination is equal to that share of the total variation S_tot which is caused or, as they say, explained by the linear regression.

The residual sum of squares S_res is related to the residual variance s²_y|x by the relation S_res = (n – 2)·s²_y|x, and the total sum of squares S_tot to the variance s²_y by the relation S_tot = (n – 1)·s²_y. In this case

r² = 1 – ((n – 2)/(n – 1)) · (s²_y|x / s²_y).

This formula makes it possible to judge how the correlation coefficient depends on the share of the residual variance in the total variance, s²_y|x / s²_y: the smaller this share, the greater (in absolute value) the correlation coefficient, and vice versa.

We have seen that the correlation coefficient reflects the tightness of the linear relationship between variables. However, when it comes to predicting the value of one variable from the value of another, one should not rely too heavily on the correlation coefficient. For example, the data in Fig. 8.7 correspond to a very high correlation coefficient (r = 0.92), yet the width of the confidence region shows that the prediction uncertainty is quite significant. Therefore, even with a large correlation coefficient, be sure to calculate the confidence region.


Finally, we give the relation between the correlation coefficient and the slope b of the regression line:

r = b · (s_X / s_Y),

where b is the slope of the regression line, and s_X and s_Y are the standard deviations of the variables.

If we set aside the case s_X = 0, then the correlation coefficient equals zero if and only if b = 0. We will now use this fact to assess the statistical significance of a correlation.

Statistical Significance of Correlation

Since b = 0 implies r = 0, the hypothesis of no correlation is equivalent to the hypothesis of a zero slope of the regression line. Therefore, to assess the statistical significance of a correlation, we can use the formula already known to us for assessing the statistical significance of the difference of b from zero: t = b/s_b, where s_b is the standard error of the slope.

Here the number of degrees of freedom is ν = n – 2. However, if the correlation coefficient has already been calculated, it is more convenient to use the formula:

t = r·√((n – 2) / (1 – r²)).

The number of degrees of freedom here is also ν = n – 2.

For all the outward dissimilarity of the two formulas for t, they are identical. Indeed, from

r² = 1 – ((n – 2)/(n – 1)) · (s²_y|x / s²_y)

one can express s_y|x through r; substituting this value of s_y|x into the formula for the standard error of b leads to the same value of t.

Animal fat and breast cancer

In experiments on laboratory animals, it has been shown that a high content of animal fat in the diet increases the risk of breast cancer. Is this dependence observed in humans? K. Carroll collected data on the consumption of animal fats and mortality from breast cancer in 39 countries. The result is shown in Fig. 8.12A. The correlation coefficient between the consumption of animal fats and mortality from breast cancer was found to be 0.90. Let us estimate the statistical significance of this correlation.

t = 0.90·√((39 – 2) / (1 – 0.90²)) ≈ 12.6.

The critical value of t for the number of degrees of freedom ν = 39 – 2 = 37 is 3.574, which is less than the value we obtained. Thus, at a significance level of 0.001, it can be argued that there is a correlation between animal fat intake and mortality from breast cancer.

Now let us check whether mortality is associated with the consumption of vegetable fats. The corresponding data are shown in Fig. 8.12B. The correlation coefficient is 0.15. Then

t = 0.15·√((39 – 2) / (1 – 0.15²)) ≈ 0.92.

Even at a significance level of 0.10, the calculated value of t is less than the critical one. The correlation is not statistically significant.
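A sketch of both checks (scipy gives the exact critical point that the printed table rounds to 3.574):

```python
import math
from scipy.stats import t

def t_for_r(r: float, n: int) -> float:
    """t = r * sqrt((n - 2)/(1 - r^2)), with nu = n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r**2))

n = 39
print(f"animal fats:    t = {t_for_r(0.90, n):.2f}")   # about 12.6, far above 3.574
print(f"vegetable fats: t = {t_for_r(0.15, n):.2f}")   # about 0.92
print(f"critical t (two-sided alpha = 0.001, nu = 37) = {t.ppf(1 - 0.001/2, 37):.3f}")
```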

The correlation coefficient is a value that can vary from +1 to –1. In the case of a complete positive correlation this coefficient equals plus 1 (one says that as the value of one variable increases, the value of the other increases too), and in the case of a complete negative correlation, minus 1 (an inverse relationship: when the values of one variable increase, the values of the other decrease).

Ex 1:

A graph of the dependence between shyness and depression. As you can see, the points (subjects) are not located randomly but line up around one line, and, looking at this line, we can say that the more pronounced a person's shyness, the more depressive he is; i.e., these phenomena are interconnected.

Ex 2: A graph for shyness and sociability. We see that as shyness increases, sociability decreases. The correlation coefficient here is –0.43. Thus, a correlation coefficient from 0 to 1 indicates a directly proportional relationship (the more ... the more ...), and a coefficient from –1 to 0 an inversely proportional one (the more ... the less ...).

If the correlation coefficient is 0, there is no linear relationship between the two variables.

Correlation is a relationship in which the influence of individual factors appears only as a tendency (on average) in the mass observation of actual data. Examples of correlation dependence are the dependence between the size of a bank's assets and the amount of the bank's profit, or between the growth of labor productivity and the length of service of employees.

Two systems of classification of correlations according to their strength are used: general and particular.

The general classification of correlations: 1) strong, or close, with a correlation coefficient r > 0.70; 2) medium at 0.50 ≤ r ≤ 0.70; 3) weak at r < 0.50. The particular classification, in contrast, rests on the level of statistical significance attained: what matters there is a statistically significant correlation at a high significance level, and not just a correlation of large absolute value.

The following table lists the names of the correlation coefficients for different types of scales.

Scale of one variable \ Scale of the other | Dichotomous scale (1/0) | Rank (ordinal) scale | Interval and absolute scale
Dichotomous scale (1/0) | Pearson's association coefficient, Pearson's four-cell contingency coefficient | Rank-biserial correlation | Biserial correlation
Rank (ordinal) scale | Rank-biserial correlation | Spearman's or Kendall's rank correlation coefficient | The values of the interval scale are converted into ranks and a rank coefficient is used
Interval and absolute scale | Biserial correlation | The values of the interval scale are converted into ranks and a rank coefficient is used | Pearson correlation coefficient (linear correlation coefficient)

At r=0 there is no linear correlation. In this case, the group means of the variables coincide with their general means, and the regression lines are parallel to the coordinate axes.

The equality r = 0 speaks only of the absence of a linear correlation dependence (uncorrelated variables), but not of the absence of a correlation in general, much less of the absence of a statistical dependence.

Sometimes the conclusion that there is no correlation is more important than the presence of a strong correlation. A zero correlation between two variables may indicate that one variable has no influence on the other, provided we trust the results of the measurements.

In SPSS: 11.3.2 Correlation coefficients

Until now, we have established only the mere fact of the existence of a statistical relationship between two features. Next, we will try to find out what conclusions can be drawn about the strength or weakness of this dependence, as well as about its form and direction. Criteria for the quantification of the dependence between variables are called correlation coefficients or measures of connectivity. Two variables are positively correlated if there is a direct, unidirectional relationship between them: small values of one variable correspond to small values of the other, and large values to large ones. Two variables are negatively correlated if there is an inverse, multidirectional relationship between them: small values of one variable correspond to large values of the other, and vice versa. The values of correlation coefficients always lie in the range from –1 to +1.

Spearman's coefficient is used as the correlation coefficient between variables belonging to an ordinal scale, while Pearson's correlation coefficient (the product-moment coefficient) is used for variables belonging to an interval scale. It should be noted that every dichotomous variable, that is, a variable belonging to the nominal scale and having two categories, can be considered ordinal.

First, we will check if there is a correlation between the sex and psyche variables from the studium.sav file. In doing so, we take into account that the dichotomous variable sex can be considered an ordinal variable. Do the following:

· Select from the command menu: Analyze → Descriptive Statistics → Crosstabs... (contingency tables)

· Move the variable sex into the row list and the variable psyche into the column list.

· Click the Statistics... button. In the Crosstabs: Statistics dialog, check the Correlations box. Confirm your choice with the Continue button.

· In the Crosstabs dialog, suppress the display of tables by checking the Suppress tables checkbox. Click the OK button.

The Spearman and Pearson correlation coefficients will be calculated and their significance tested.
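Outside SPSS, the same pair of coefficients can be computed, for example, in Python; the data frame below is a hypothetical stand-in for studium.sav (sex coded 0/1 and treated as ordinal, psyche coded as ordered categories):

```python
import pandas as pd
from scipy.stats import spearmanr, pearsonr

# Hypothetical stand-in for studium.sav
df = pd.DataFrame({
    "sex":    [0, 0, 1, 1, 0, 1, 0, 1, 1, 0],   # dichotomous, treated as ordinal
    "psyche": [2, 3, 1, 2, 4, 1, 3, 2, 1, 4],   # ordinal categories
})

rho, p_rho = spearmanr(df["sex"], df["psyche"])
r, p_r = pearsonr(df["sex"], df["psyche"])
print(f"Spearman rho = {rho:.2f} (p = {p_rho:.3f})")
print(f"Pearson r = {r:.2f} (p = {p_r:.3f})")
```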

Theory. Correlation coefficient

The correlation coefficient is a two-dimensional descriptive statistic, a quantitative measure of the relationship (joint variability) of two variables.

To date, a large number of various correlation coefficients have been developed. However, the most important measures of connection are Pearson's, Spearman's and Kendall's. Their common feature is that they reflect the relationship of two features measured on a quantitative scale, whether rank or metric.

Generally speaking, any empirical research is focused on the study of relationships between two or more variables.

If a change in one variable by one unit always results in a change in the other variable by the same amount, the function is linear (its graph is a straight line); any other connection is non-linear. If an increase in one variable is associated with an increase in the other, the connection is positive (direct); if an increase in one variable is associated with a decrease in the other, the connection is negative (inverse). If the direction of change of one variable does not change as the other variable increases (or decreases), such a function is monotonic; otherwise the function is called non-monotonic.

Functional connections are idealizations. Their peculiarity lies in the fact that one value of one variable corresponds to a strictly defined value of the other. Such, for example, is the relationship of two physical variables, weight and body length (linear, positive). However, even in physical experiments the empirical relationship will differ from the functional one due to unaccounted-for or unknown causes: fluctuations in the composition of the material, measurement errors, etc.

When studying the relationship of features, the researcher inevitably loses sight of many possible causes of the variability of these features. The result is that even a functional relationship between variables that exists in reality appears empirically as probabilistic (stochastic): the same value of one variable corresponds to a distribution of different values of the other variable (and vice versa).

The simplest example is the relationship between people's height and weight. Empirical results of the study of these two features will, of course, show a positive relationship between them. But it is easy to guess that it will differ from a strict, linear, positive, ideal mathematical function, despite all the researcher's tricks to take into account the slenderness or stoutness of the subjects. It is unlikely that anyone would, on this basis, deny the existence of a strict functional relationship between body length and weight.

So, the functional interconnection of phenomena can be empirically revealed only as a probabilistic connection of the corresponding features.

A visual representation of the nature of a probabilistic relationship is given by a scatter diagram: a graph whose axes correspond to the values of the two variables and on which each subject is a point. Correlation coefficients are used as a numerical characteristic of the probabilistic connection.

Three gradations of correlation values according to the strength of the connection can be introduced:

r < 0.3 — weak connection (less than 10% of the total share of variance);

0.3 < r < 0.7 — moderate connection (from 10% to 50% of the total share of variance);

r > 0.7 — strong connection (50% or more of the total share of variance).

Partial Correlation

It often happens that two variables correlate with each other only because both of them change under the influence of some third variable. That is, in fact there is no connection between the corresponding properties of the two variables, but it manifests itself as a statistical relationship (correlation) under the influence of a common cause (the third variable).

Thus, if the correlation between two variables decreases when a third random variable is held fixed, this means that their interdependence arises in part through the influence of this third variable. If the partial correlation is zero or very small, then we can conclude that their interdependence is entirely due to the influence of the third variable and does not reflect any connection of their own.

Also, if the partial correlation is greater than the initial correlation between the two variables, it can be concluded that the other variables have weakened the relationship, or "hidden" the correlation.

In addition, it must be remembered that correlation is not causation. On this basis, we have no right to speak categorically about the presence of a causal relationship: some variable completely different from those considered in the analysis may be the source of the correlation. In both ordinary and partial correlations, the assumption of causality must always have its own non-statistical grounds.

Pearson correlation coefficient

Pearson's r is used to study the relationship of two metric variables measured on the same sample. There are many situations in which its use is appropriate. Does intelligence affect undergraduate performance? Is an employee's salary related to his goodwill towards colleagues? Does a student's mood affect the success of solving a complex arithmetic problem? To answer such questions, the researcher must measure the two indicators of interest for each member of the sample.

The value of the correlation coefficient is not affected by the units in which the features are presented. Therefore, any linear transformations of the features (multiplication by a constant, addition of a constant) do not change the value of the correlation coefficient. An exception is multiplication of one of the features by a negative constant: the correlation coefficient changes its sign to the opposite.

Pearson correlation is a measure of the linear relationship between two variables. It allows one to determine how proportional the variability of the two variables is. If the variables are proportional to each other, then graphically the relationship between them can be represented as a straight line with a positive (direct proportion) or negative (inverse proportion) slope.

In practice, the relationship between two variables, if there is one, is probabilistic and graphically looks like an ellipsoidal scatter cloud. This ellipsoid, however, can be represented (approximated) by a straight line, or regression line. A regression line is a straight line constructed by the method of least squares: the sum of the squared distances (calculated along the y-axis) from each point of the scatterplot to the line is minimal.

Of particular importance for assessing the accuracy of the prediction is the variance of estimates of the dependent variable. In essence, the variance of estimates of the dependent variable Y is that part of its total variance that is due to the influence of the independent variable X. In other words, the ratio of the variance of estimates of the dependent variable to its true variance is equal to the square of the correlation coefficient.

The square of the correlation coefficient of the dependent and independent variables represents the proportion of the variance of the dependent variable due to the influence of the independent variable, and is called the coefficient of determination. The coefficient of determination thus shows the extent to which the variability of one variable is due to (determined by) the influence of another variable.

The coefficient of determination has an important advantage over the correlation coefficient. Correlation is not a linear function of the closeness of the relationship between two variables. Therefore, the arithmetic mean of the correlation coefficients of several samples does not coincide with the correlation calculated at once for all the subjects of these samples (i.e., the correlation coefficient is not additive). The coefficient of determination, on the contrary, reflects the relationship linearly and is therefore additive: it can be averaged over several samples.

Additional information about the strength of the connection is given by the square of the correlation coefficient, the coefficient of determination: it is the part of the variance of one variable that can be explained by the influence of the other variable. In contrast to the correlation coefficient, the coefficient of determination grows linearly with the strength of the connection.

Spearman and Kendall's τ correlation coefficients (rank correlations). If both variables between which the relationship is being studied are presented on an ordinal scale, or one of them on an ordinal scale and the other on a metric one, rank correlation coefficients are applied: Spearman's or Kendall's τ. Both coefficients require preliminary ranking of both variables.

Spearman's rank correlation coefficient is a non-parametric method used for the statistical study of the relationship between phenomena. In this case, the actual degree of parallelism between the two quantitative series of the studied features is determined, and the tightness of the established relationship is estimated by a quantitatively expressed coefficient.

If the members of a group are ranked first by the variable x and then by the variable y, the correlation between the variables x and y can be obtained by simply calculating the Pearson coefficient for the two series of ranks. Provided there are no ties (i.e. no repeated ranks) in either variable, the Pearson formula can be greatly simplified computationally and transformed into the formula known as Spearman's.

The power of the Spearman rank correlation coefficient is somewhat inferior to the power of the parametric correlation coefficient.

It is advisable to use the rank correlation coefficient when there is a small number of observations. This method can be applied not only to quantified data, but also in cases when the recorded values are determined by descriptive features of varying intensity.

With a large number of identical ranks in one or both of the compared variables, Spearman's rank correlation coefficient gives coarsened values. Ideally, both correlated series should be sequences of non-coinciding values.

An alternative to the Spearman correlation for ranks is Kendall's τ correlation. The correlation proposed by M. Kendall is based on the idea that the direction of the connection can be judged by comparing subjects in pairs: if a pair of subjects shows a change in x that coincides in direction with the change in y, this indicates a positive relationship; if the directions do not coincide, a negative one.
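Here is a sketch of Kendall's pairwise-comparison idea (the two rank series are hypothetical; scipy.stats.kendalltau is used as a cross-check):

```python
from itertools import combinations
from scipy.stats import kendalltau

x = [1, 2, 3, 4, 5]   # hypothetical ranks on the first feature
y = [2, 1, 3, 5, 4]   # hypothetical ranks on the second feature

# Kendall's idea: compare subjects in pairs, counting concordant/discordant pairs
concordant = discordant = 0
for i, j in combinations(range(len(x)), 2):
    s = (x[i] - x[j]) * (y[i] - y[j])
    if s > 0:
        concordant += 1
    elif s < 0:
        discordant += 1

n_pairs = len(x) * (len(x) - 1) // 2
tau_manual = (concordant - discordant) / n_pairs
tau_scipy, _ = kendalltau(x, y)
print(f"manual tau = {tau_manual:.2f}, scipy tau = {tau_scipy:.2f}")
```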

Correlation coefficients were specifically designed to determine numerically the strength and direction of the relationship between two properties measured on numerical scales (metric or rank).

As already mentioned, correlation values of +1 (a strict direct, or directly proportional, relationship) and –1 (a strict inverse, or inversely proportional, relationship) correspond to the maximum strength of the connection; a correlation equal to zero corresponds to the absence of a connection.

Additional information about the strength of the connection is provided by the value of the coefficient of determination: it is the part of the variance of one variable that can be explained by the influence of another variable.

Topic 12. Correlation analysis

Functional dependence and correlation. As early as the 5th century BC, Hippocrates drew attention to the existence of a connection between people's physique and temperament, between the structure of the body and predisposition to certain diseases. Certain types of such connections have also been identified in the animal and plant worlds. Thus, there is a relationship between physique and productivity in farm animals; the relationship between seed quality and crop yield is known, etc. As for such dependences in ecology, there are dependences between the content of heavy metals in the soil and snow cover and their concentration in the atmospheric air, etc. It is therefore natural to strive to use such regularities in the interests of man and to give them a more or less precise quantitative expression.

As is known, to describe relationships between variables the mathematical concept of a function f is used, which assigns to each specific value of the independent variable x a definite value of the dependent variable y, i.e. y = f(x). This kind of unambiguous relationship between the variables x and y is called functional. However, such relationships are not always found in natural objects. The relationship between biological, and also ecological, characteristics is not functional but statistical in nature: in a mass of homogeneous individuals, a given value of one feature, considered as the argument, corresponds not to one and the same numerical value but to a whole range of numerical values of the other feature (considered as the dependent variable, or function), distributed in a variation series. This kind of relationship between variables is called correlational, or simply correlation.

Functional relationships are easy to detect and measure on single and group objects, but this cannot be done with correlations, which can be studied only on group objects by the methods of mathematical statistics. A correlation between features can be linear and non-linear, positive and negative. The task of correlation analysis reduces to establishing the direction and form of the relationship between varying features, measuring its tightness and, finally, verifying the reliability of the sample correlation indicators.

The dependence between the variables X and Y can be expressed analytically (by formulas and equations) and graphically (as the locus of points in a rectangular coordinate system). The correlation graph is built according to the equation of the function ȳ_x = f(x) or x̄_y = f(y), which is called regression. Here ȳ_x and x̄_y are the arithmetic means found under the condition that X or Y takes certain values x or y. These averages are called conditional.

12.1. Parametric indicators of connection

Correlation coefficient. Conjugacy between the variables x and y can be established by comparing the numerical values of one of them with the corresponding values of the other. If an increase in one variable is accompanied by an increase in the other, this indicates a positive connection between these quantities; conversely, when an increase in one variable is accompanied by a decrease in the other, this indicates a negative connection.

To characterize the relationship, its direction and the degree of association of the variables, the following indicators are used:

    linear dependence - correlation coefficient;

    non-linear - correlation ratio.

The following formula is used to determine the empirical correlation coefficient:

$$r_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n\,\sigma_x \sigma_y}. \qquad (1)$$

Here $\sigma_x$ and $\sigma_y$ are the standard deviations of x and y.

The correlation coefficient can also be calculated without first computing the standard deviations, which simplifies the computational work, using the following equivalent formula:

$$r_{xy} = \frac{\sum x_i y_i - \dfrac{\sum x_i \sum y_i}{n}}{\sqrt{\left(\sum x_i^2 - \dfrac{\left(\sum x_i\right)^2}{n}\right)\left(\sum y_i^2 - \dfrac{\left(\sum y_i\right)^2}{n}\right)}}. \qquad (2)$$

The correlation coefficient is a dimensionless number ranging from –1 to +1. With independent variation of the features, when the connection between them is completely absent, r_xy = 0. The stronger the association between the features, the higher the absolute value of the correlation coefficient; consequently, this indicator characterizes not only the presence but also the degree of association between the features. With a positive, or direct, relationship, when larger values of one feature correspond to larger values of the other, the correlation coefficient has a positive sign and ranges from 0 to +1; with a negative, or inverse, relationship, when larger values of one feature correspond to smaller values of the other, the correlation coefficient carries a negative sign and ranges from 0 to –1.

The correlation coefficient has found wide application in practice, but it is not a universal indicator of correlation, since it can characterize only linear relationships, i.e. relationships expressed by a linear regression equation (see topic 12). If there is no linear dependence between the varying features, other indicators of connection, discussed below, are used.
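For readers who want to check formulas (1) and (2) numerically, here is a minimal Python sketch on invented data (the numbers are illustrative, not taken from any example in the text); both forms, and a library routine, give the same value:

```python
import numpy as np

# Illustrative paired observations (made-up numbers, not from the text).
x = np.array([31.0, 36.0, 42.0, 48.0, 55.0, 61.0, 67.0])
y = np.array([2.1, 2.4, 2.9, 3.3, 3.4, 3.9, 4.1])
n = len(x)

# Formula (1): deviations from the means, divided by n * sigma_x * sigma_y
# (sigma here is the population standard deviation, ddof=0).
r1 = ((x - x.mean()) * (y - y.mean())).sum() / (n * x.std() * y.std())

# Formula (2): the computational form that avoids explicit deviations.
num = (x * y).sum() - x.sum() * y.sum() / n
den = np.sqrt(((x**2).sum() - x.sum()**2 / n) * ((y**2).sum() - y.sum()**2 / n))
r2 = num / den

print(r1, r2, np.corrcoef(x, y)[0, 1])  # all three values coincide
```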

Calculation of the correlation coefficient. The calculation is carried out differently depending on the number of observations (sample size). Let us consider separately the specifics of calculating the correlation coefficient for small samples and for large samples.

Small samples. For small samples the correlation coefficient is calculated directly from the values of the conjugate features, without preliminary grouping of the sample data into variation series. For this, the above formulas (1) and (2) are used. More convenient, especially in the presence of multi-digit and fractional numbers, which express the deviations of the variants x_i and y_i from the means $\bar{x}$ and $\bar{y}$, are the following working formulas:

$$r_{xy} = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left(n\sum x_i^2 - \left(\sum x_i\right)^2\right)\left(n\sum y_i^2 - \left(\sum y_i\right)^2\right)}};$$

or, through the differences of the paired variants,

$$r_{xy} = \frac{\sigma_x^2 + \sigma_y^2 - \sigma_d^2}{2\,\sigma_x \sigma_y}. \qquad (3)$$

Here x_i and y_i are the paired variants of the conjugate features x and y; $\bar{x}$ and $\bar{y}$ are the arithmetic means; d_i = x_i − y_i is the difference between the paired variants of the conjugate features, with variance $\sigma_d^2$; n is the total number of paired observations, or sample size.

The empirical correlation coefficient, like any other sample indicator, serves as an estimate of its general parameter ρ and, as a random value, is accompanied by an error

$$s_r = \sqrt{\frac{1 - r^2}{n - 2}}.$$

The ratio of the sample correlation coefficient to its error serves as a criterion for testing the null hypothesis, i.e. the assumption that in the general population this parameter is equal to zero (ρ = 0). The null hypothesis is rejected at the accepted significance level α if

$$t = \frac{r}{s_r} = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}} \ge t_{st}.$$

The values of the critical points t_st for different significance levels α and numbers of degrees of freedom are given in Table 1 of the Appendix.

It has been found that when processing small samples (especially when n < 30) the calculation of the correlation coefficient by formulas (1)–(3) gives somewhat underestimated estimates of the general parameter ρ, i.e. a correction needs to be made, which is commonly taken in the form

$$r^{*} = r\left(1 + \frac{1 - r^2}{2(n - 3)}\right).$$
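A sketch of the above test together with the small-sample correction (given here in the Olkin–Pratt form, an assumption on our part) might look as follows; the input values are invented:

```python
import numpy as np
from scipy import stats

def r_significance(r, n, alpha=0.05):
    """t-test of H0: rho = 0; s_r = sqrt((1 - r^2)/(n - 2)), df = n - 2."""
    s_r = np.sqrt((1 - r**2) / (n - 2))
    t = r / s_r
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    return t, t_crit, abs(t) >= t_crit

def r_corrected(r, n):
    """Small-sample correction (Olkin-Pratt-type approximation, assumed form)."""
    return r * (1 + (1 - r**2) / (2 * (n - 3)))

print(r_significance(0.83, 12))  # strong r on a small sample: significant
print(r_corrected(0.83, 12))     # slightly larger, less biased estimate
```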

Fisher z-transformation. Correct application of the correlation coefficient assumes a normal distribution of the two-dimensional set of conjugate values of the random variables x and y. It is known from mathematical statistics that in the presence of a substantial correlation between the variables, i.e. when r_xy > 0.5, the sampling distribution of the correlation coefficient for small samples taken from a normally distributed population deviates noticeably from the normal curve.

Considering this circumstance, R. Fisher found a more accurate way to estimate the general parameter from the value of the sample correlation coefficient. This method consists in replacing r_xy by the transformed value z, which is related to the empirical correlation coefficient as follows:

$$z = \frac{1}{2}\ln\frac{1 + r}{1 - r}.$$

The distribution of the z value is almost unchanged in shape, since it depends little on the sample size and on the value of the correlation coefficient in the general population, and it approaches the normal distribution.

The criterion for the reliability of the indicator z is the ratio

$$t = \frac{z}{s_z} = z\sqrt{n - 3}, \qquad s_z = \frac{1}{\sqrt{n - 3}}.$$

The null hypothesis is rejected at the accepted significance level α and the corresponding number of degrees of freedom. The values of the critical points t_st are given in Table 1 of the Appendix.

Application of the z-transformation allows the statistical significance of the sample correlation coefficient to be assessed with more confidence, as well as, when necessary, the difference between empirical coefficients.
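A hedged Python sketch of the z-transformation, both for testing a single coefficient and for comparing two independent coefficients (sample values invented for illustration):

```python
import numpy as np
from scipy import stats

def fisher_z(r):
    """z = 0.5 * ln((1 + r)/(1 - r)), i.e. arctanh(r)."""
    return 0.5 * np.log((1 + r) / (1 - r))

def z_significance(r, n, alpha=0.05):
    """Test H0: rho = 0 using s_z = 1/sqrt(n - 3)."""
    t = fisher_z(r) * np.sqrt(n - 3)
    return t, abs(t) >= stats.norm.ppf(1 - alpha / 2)

def compare_r(r1, n1, r2, n2, alpha=0.05):
    """Difference between two independent coefficients via the z-transform."""
    se = np.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    t = (fisher_z(r1) - fisher_z(r2)) / se
    return t, abs(t) >= stats.norm.ppf(1 - alpha / 2)

print(z_significance(0.62, 25))        # significant at alpha = 0.05
print(compare_r(0.72, 40, 0.51, 55))   # difference not significant here
```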

The minimum sample size for an accurate estimate of the correlation coefficient. It is possible to calculate, for a given value of the correlation coefficient, the sample size that would be sufficient to disprove the null hypothesis (if the correlation between the features Y and X really exists). For this, the following formula is used:

$$n = \left(\frac{t}{z}\right)^2 + 3,$$

where n is the desired sample size; t is the value set according to the accepted significance level (preferably for α = 1%); z is the transformed empirical correlation coefficient.
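In code, the same formula inverts the z-test to ask how many observations a given true correlation would need to reach significance; t = 2.58 below corresponds to α = 1%:

```python
import math

def min_sample_size(r, t_crit=2.58):
    """n = (t/z)^2 + 3, from var(z) = 1/(n - 3); t_crit = 2.58 for alpha = 1%."""
    z = 0.5 * math.log((1 + r) / (1 - r))
    return math.ceil((t_crit / z) ** 2 + 3)

print(min_sample_size(0.40))  # about 41 observations for r = 0.40
```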

Large samples. In the presence of numerous initial data, they have to be grouped into variation series and, after constructing a correlation lattice, the joint frequencies of the conjugate series are entered into its cells. The correlation lattice is formed by the intersection of rows and columns, whose number equals the number of groups, or classes, of the correlated series. The classes are located in the top row and in the first (left) column of the correlation table, and the joint frequencies, denoted by the symbol f_xy, occupy the cells of the correlation lattice, which is the main part of the correlation table.

The classes placed in the top row of the table are usually arranged from left to right in ascending order, and those in the first column of the table from top to bottom in descending order. With such an arrangement of the classes of the variation series, their joint frequencies (in the presence of a positive relationship between the features Y and X) will be distributed over the lattice cells in the form of an ellipse along the diagonal from the lower left corner to the upper right corner of the lattice, or (if there is a negative relationship between the features) from the upper left corner to the lower right corner. If the frequencies f_xy are distributed over the cells of the correlation lattice more or less evenly, without forming an ellipse, this indicates the absence of a correlation between the features.

The distribution of the frequencies f_xy over the cells of the correlation lattice gives only a general idea of the presence or absence of a relationship between the features. The tightness of the relationship can be judged more or less accurately only from the value and sign of the correlation coefficient. When calculating the correlation coefficient with preliminary grouping of the sample data into interval variation series, one should not take too wide class intervals: rough grouping affects the value of the correlation coefficient much more strongly than it does the calculation of means and indicators of variation.

Recall that the size of the class interval is determined by the formula

$$\lambda = \frac{x_{\max} - x_{\min}}{K},$$

where x_max and x_min are the maximum and minimum variants of the population, and K is the number of classes into which the variation of the feature should be divided. Experience has shown that in correlation analysis the value of K can be made dependent on the sample size approximately as shown in Table 1.

Table 1

Sample size      K value
50 ≥ n > 30
100 ≥ n > 50
200 ≥ n > 100
300 ≥ n > 200

Like other statistical characteristics calculated with preliminary grouping of the initial data into variation series, the correlation coefficient can be determined in different ways, which give identical results.

Method of products. The correlation coefficient can be calculated using the basic formulas (1) or (2), correcting them for the repetition of variants in the two-dimensional population. Simplifying the notation, let us denote the deviations of the variants from their means by a, i.e. a_x = x_i − x̄ and a_y = y_i − ȳ. Then the basic formula, taking the frequencies of the deviations into account, takes the following form:

$$r_{xy} = \frac{\sum f_{xy}\,a_x a_y}{n\,\sigma_x \sigma_y}.$$

The reliability of this indicator is assessed using Student's test, which is the ratio of the sample correlation coefficient to its error:

$$t = \frac{r}{s_r}, \qquad s_r = \sqrt{\frac{1 - r^2}{n - 2}}.$$

If this value exceeds the critical value of Student's test t_st for k = n − 2 degrees of freedom and significance level α (see Table 2 of the Appendix), the null hypothesis is rejected.

Method of conditional averages. When calculating the correlation coefficient, the deviations of the variants ("classes") can be taken not only from the arithmetic means x̄ and ȳ but also from conditional means A_x and A_y. With this method the numerator of formula (2) is amended and the formula takes the following form:

$$r_{xy} = \frac{\sum f_{xy}\,a_x a_y - n\,b_x b_y}{n\,s_x s_y},$$

where f_xy are the frequencies of the classes of the two distribution series; a_x = (x_i − A_x)/λ_x and a_y = (y_i − A_y)/λ_y are the deviations of the classes from the conditional means, related to the sizes of the class intervals λ; n is the total number of paired observations, or sample size; b_x = Σ f_x a_x / n and b_y = Σ f_y a_y / n are the conditional moments of the first order, where f_x are the frequencies of the series X and f_y the frequencies of the series Y; s_x and s_y are the standard deviations of the series X and Y expressed in class-interval units and calculated by the formula $s = \sqrt{\sum f a^2 / n - b^2}$.

The method of conditional averages has an advantage over the method of products, since it allows one to avoid operations with fractional numbers and to give the same (positive) sign to the deviations a_x and a_y, which simplifies the computational work, especially in the presence of multi-digit numbers.
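A small sketch of the method of products on grouped data; the correlation lattice, class midpoints and frequencies below are invented purely for illustration:

```python
import numpy as np

# Hypothetical correlation lattice: rows are classes of Y, columns classes of X,
# cells hold the joint frequencies f_xy; x_mid, y_mid are class midpoints.
f = np.array([[2.0, 1.0, 0.0],
              [1.0, 4.0, 2.0],
              [0.0, 2.0, 3.0]])
x_mid = np.array([10.0, 20.0, 30.0])
y_mid = np.array([1.0, 2.0, 3.0])

n = f.sum()
fx, fy = f.sum(axis=0), f.sum(axis=1)        # marginal frequencies
x_bar = (fx * x_mid).sum() / n
y_bar = (fy * y_mid).sum() / n
s_x = np.sqrt((fx * (x_mid - x_bar) ** 2).sum() / n)
s_y = np.sqrt((fy * (y_mid - y_bar) ** 2).sum() / n)

# Frequency-weighted form of formula (1): sum f_xy * a_x * a_y / (n s_x s_y).
num = (f * np.outer(y_mid - y_bar, x_mid - x_bar)).sum()
print(num / (n * s_x * s_y))
```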

Estimating the difference between correlation coefficients. When comparing the correlation coefficients of two independent samples, the null hypothesis reduces to the assumption that in the general population the difference between these indicators is zero. In other words, one proceeds from the assumption that the difference observed between the compared empirical correlation coefficients arose by chance.

To test the null hypothesis, Student's t-test is used, i.e. the ratio of the difference between the empirical correlation coefficients r_1 and r_2 to the statistical error of this difference, determined by the formula

$$s_d = \sqrt{s_{r_1}^2 + s_{r_2}^2},$$

where s_{r_1} and s_{r_2} are the errors of the compared correlation coefficients.

The null hypothesis is refuted provided that, for the accepted significance level α and the given number of degrees of freedom, t = (r_1 − r_2)/s_d ≥ t_st.

It is known that a more accurate assessment of the reliability of the correlation coefficient is obtained by converting r_xy into the quantity z. The assessment of the difference between the sample correlation coefficients r_1 and r_2 is no exception, especially in those cases when they are calculated from samples of relatively small size (n < 100) and exceed 0.50 in absolute value.

The difference is estimated using Student's t-test, built as the ratio of this difference to its error, calculated by the formula

$$s_{z_1 - z_2} = \sqrt{\frac{1}{n_1 - 3} + \frac{1}{n_2 - 3}}.$$

The null hypothesis is rejected if $t = (z_1 - z_2)/s_{z_1 - z_2} \ge t_{st}$ for the accepted significance level α.

Correlation ratio. To measure non-linear relationships between the variables x and y, an indicator called the correlation ratio is used, which describes the relationship in both directions. The construction of the correlation ratio involves comparing two kinds of variation: the variability of individual observations with respect to the partial (group) means, and the variation of the partial means themselves with respect to the overall mean. The smaller the first component relative to the second, the greater the tightness of the connection. In the limit, when no variation of the individual values of the feature around the partial means is observed, the tightness of the connection is maximal; similarly, in the absence of variability of the partial means, the tightness of the relationship is minimal. Since this ratio of variations can be considered for each of the two features, two indicators of the tightness of the relationship are obtained, h_yx and h_xy. The correlation ratio is a relative value and can take values from 0 to 1; in general the two coefficients are not equal to each other, i.e. h_yx ≠ h_xy. Equality between these indicators holds only with a strictly linear relationship between the features. The correlation ratio is a universal indicator: it allows one to characterize any form of correlation, both linear and non-linear.
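Computationally, the correlation ratio h_yx reduces to the ratio of between-group to total variation of y; a minimal sketch with invented groups:

```python
import numpy as np

def correlation_ratio(groups):
    """h_yx from y-values grouped by classes of x:
    square root of (between-group variation / total variation)."""
    all_y = np.concatenate(groups)
    grand_mean = all_y.mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_total = ((all_y - grand_mean) ** 2).sum()
    return np.sqrt(ss_between / ss_total)

# Hypothetical data: y observations within three classes of x.
groups = [np.array([2.0, 2.2, 2.1]),
          np.array([3.0, 3.4]),
          np.array([2.9, 2.5, 2.6])]
print(correlation_ratio(groups))
```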

The correlation ratios h_yx and h_xy are determined by the methods discussed above, i.e. the method of products and the method of conditional averages.

Method of products. The correlation ratios h_yx and h_xy are determined by the following formulas:

$$h_{yx} = \sqrt{\frac{\sigma_{\bar{y}_x}^2}{\sigma_y^2}}, \qquad h_{xy} = \sqrt{\frac{\sigma_{\bar{x}_y}^2}{\sigma_x^2}},$$

where

$$\sigma_{\bar{y}_x}^2 = \frac{\sum f_{x_i}\left(\bar{y}_{x_i} - \bar{y}\right)^2}{n} \quad \text{and} \quad \sigma_{\bar{x}_y}^2 = \frac{\sum f_{y_i}\left(\bar{x}_{y_i} - \bar{x}\right)^2}{n}$$

are the group variances, and

$$\sigma_y^2 = \frac{\sum f_{y_i}\left(y_i - \bar{y}\right)^2}{n} \quad \text{and} \quad \sigma_x^2 = \frac{\sum f_{x_i}\left(x_i - \bar{x}\right)^2}{n}$$

are the total variances. Here $\bar{y}$ and $\bar{x}$ are the overall arithmetic means, and $\bar{y}_{x_i}$ and $\bar{x}_{y_i}$ are the group arithmetic means; f_yi are the frequencies of the series Y, and f_xi the frequencies of the series X; k is the number of classes; n is the sample size.

The working formulas for calculating the coefficients of the correlation ratio (the common factor n cancels) are as follows:

$$h_{yx} = \sqrt{\frac{\sum f_{x_i}\left(\bar{y}_{x_i} - \bar{y}\right)^2}{\sum f_{y_i}\left(y_i - \bar{y}\right)^2}}, \qquad h_{xy} = \sqrt{\frac{\sum f_{y_i}\left(\bar{x}_{y_i} - \bar{x}\right)^2}{\sum f_{x_i}\left(x_i - \bar{x}\right)^2}}. \qquad (15)$$

Method of conditional averages. When determining the coefficients of the correlation ratio by formulas (15), the deviations of the class variants x_i and y_i can be taken not only from the arithmetic means x̄ and ȳ but also from the conditional means A_x and A_y. In such cases the group and total sums of squared deviations are computed from the conditional deviations a_y = y_i − A_y and a_x = x_i − A_x, with the usual correction by the conditional moments b_y = Σ f_y a_y / n and b_x = Σ f_x a_x / n.

In expanded form, formulas (15) look as follows:

$$h_{yx} = \sqrt{\frac{\sum_x \dfrac{\left(\sum_y f_{xy} a_y\right)^2}{f_x} - \dfrac{\left(\sum f_y a_y\right)^2}{n}}{\sum f_y a_y^2 - \dfrac{\left(\sum f_y a_y\right)^2}{n}}};$$

$$h_{xy} = \sqrt{\frac{\sum_y \dfrac{\left(\sum_x f_{xy} a_x\right)^2}{f_y} - \dfrac{\left(\sum f_x a_x\right)^2}{n}}{\sum f_x a_x^2 - \dfrac{\left(\sum f_x a_x\right)^2}{n}}}. \qquad (17)$$

In these formulas a_y and a_x are the deviations of the classes from the conditional means, reduced by the value of the class intervals; the values a_y and a_x are expressed in natural numbers: 0, 1, 2, 3, 4, .... The rest of the symbols are explained above.

Comparing the method of products with the method of conditional averages, one cannot fail to notice the advantage of the latter, especially in those cases when one has to deal with multi-digit numbers. Like other sample indicators, the correlation ratio is an estimate of its general parameter and, as a random value, is accompanied by an error.

The reliability of the estimate of the correlation ratio can be checked by Student's t-test. The H_0 hypothesis proceeds from the assumption that the general parameter is equal to zero; it is rejected when the ratio of h to its error reaches the critical value t_st for the given number of degrees of freedom and significance level α.

Determination coefficient. To interpret the values taken by the indicators of the tightness of a correlation, determination coefficients are used, which show what proportion of the variation of one feature depends on the variation of the other. In the presence of a linear relationship the coefficient of determination is the square of the correlation coefficient, r²_xy, and in the case of a non-linear relationship between the features y and x it is the square of the correlation ratio, h²_yx. The coefficients of determination give grounds for the following approximate scale of the closeness of the relationship between the features: at r ≈ 0.5 (a determination of about 25%) the relationship is considered average; r ≈ 0.3 (about 9%) indicates a weak connection; and only at r ≈ 0.7, when about 50% of the variation of the trait Y depends on the variation of the trait X, is it possible to speak of a strong connection.

Evaluation of the form of the relationship. With a strictly linear relationship between the variables y and x the equality h_yx = h_xy = |r_xy| is achieved. In such cases the coefficients of the correlation ratio coincide with the value of the correlation coefficient, and the coefficients of determination also coincide, i.e. h²_yx = r²_xy. Therefore, the difference between these values can be used to judge the form of the correlation dependence between the variables y and x:

$$\gamma = h_{yx}^2 - r_{xy}^2.$$

Obviously, with a linear relationship between the variables y and x the indicator γ is equal to zero; if the relationship between y and x is non-linear, γ > 0.

The indicator γ is an estimate of a general parameter and, as a random value, needs verification. In this case one proceeds from the assumption that the relationship between the quantities y and x is linear (the null hypothesis). Fisher's F-criterion allows this hypothesis to be tested:

$$F = \frac{\gamma/(a - 2)}{\left(1 - h_{yx}^2\right)/(N - a)},$$

where a is the number of groups, or classes, of the variation series, and N is the sample size. The null hypothesis is rejected if F ≥ F_st for k_1 = a − 2 (found horizontally in Table 2 of the Appendix), k_2 = N − a (found in the first column of the same table) and the accepted significance level α.
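A sketch of this linearity check, assuming the degrees of freedom k_1 = a − 2 and k_2 = N − a given above (the input values are illustrative):

```python
from scipy import stats

def linearity_test(eta, r, a, n, alpha=0.05):
    """F-test of H0: the regression is linear (gamma = eta^2 - r^2 = 0).
    a = number of classes, n = sample size."""
    gamma = eta**2 - r**2
    F = (gamma / (a - 2)) / ((1 - eta**2) / (n - a))
    F_crit = stats.f.ppf(1 - alpha, dfn=a - 2, dfd=n - a)
    return F, F_crit, F >= F_crit

print(linearity_test(eta=0.80, r=0.72, a=6, n=100))  # H0 of linearity rejected
```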

Determining the Significance of a Correlation

Classifications of correlation coefficients

Correlation coefficients are characterized by strength and significance.

Classification of correlation coefficients by strength.

Classification of correlation coefficients by significance.

These two classifications should not be confused, since they define different characteristics. A strong correlation may turn out to be random and, therefore, unreliable; this is especially true for small sample sizes. Conversely, in a large sample even a weak correlation can be highly significant.

After calculating the correlation coefficient, it is necessary to put forward statistical hypotheses:

H 0: The correlation index is not significantly different from zero (it is random).

H 1: the correlation indicator is significantly different from zero (it is non-random).

Hypothesis testing is carried out by comparing the obtained empirical coefficients with the tabulated critical values. If the empirical value reaches or exceeds the critical value, the null hypothesis is rejected: r_emp ≥ r_cr ⇒ H_1. In such cases it is concluded that the significance of the correlation has been established.

If the empirical value does not exceed the critical value, the null hypothesis is not rejected: r_emp < r_cr ⇒ H_0. In such cases it is concluded that the significance has not been established.


Calculation of the matrix of paired correlation coefficients

To calculate the matrix of paired correlation coefficients, call the Correlation matrices item of the Basic statistics module.

Fig. 1. Basic statistics module panel

We will consider the main stages of correlation analysis in the STATISTICA system using example data (see Fig. 2). The initial data are the results of observations of the activities of 23 enterprises in one of the industries.

Fig. 2. Initial data

The columns of the table contain the following indicators:

RENTABEL - profitability,%;

SHARE WORKERS - the proportion of workers in the industrial production personnel (PPP), units;

FUNDOOTD - return on assets, units;

CAPITAL FUND - the average annual value of fixed production assets, million rubles;

NEPRRASH - non-production expenses, thousand rubles.

It is required to investigate the dependence of profitability on the other indicators.

Suppose that the features under consideration are subject to the normal distribution law in the general population, and that the observational data constitute a sample from that population.

Let us calculate the pairwise correlation coefficients between all variables. After selecting the row Correlation matrices, the Pearson correlations dialog box will appear on the screen. The name is due to the fact that this coefficient was first studied by Pearson, Edgeworth and Weldon.

Let us choose the variables for analysis. There are two buttons in the dialog box for this: Square matrix (one list) and Rectangular matrix (two lists).


Fig. 3. Correlation analysis dialog box

The first button is designed to calculate the usual symmetric matrix containing the paired correlation coefficients of all combinations of the variables. If all indicators are used in the analysis, you can click the Select all button in the variable selection dialog box. (If the variables are not consecutive, they can be selected with mouse clicks while the Ctrl key is held down.)


If you press the Details button of the dialog box, long names will be displayed for each variable. Clicking this button again (it will now be labeled Briefly) returns the short names.

The Information button opens a window for the selected variable, where you can view its characteristics: long name, display format, sorted list of values, descriptive statistics (number of values, mean, standard deviation).

After selecting the variables, press OK or the Correlations button of the Pearson correlations dialog box. The calculated correlation matrix will appear on the screen.

Significant correlation coefficients are highlighted in red on the screen.

In our example, the profitability indicator turned out to be most closely related to return on assets (a direct connection) and to non-production costs (an inverse connection: Y decreases as X increases). But how closely are the features related? A relationship is considered close when the absolute value of the coefficient is greater than 0.7, and weak when it is less than 0.3. Thus, in the further construction of the regression equation one should limit oneself to the indicators "Return on assets" and "Non-production costs" as the most informative.
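Outside STATISTICA, the same square and rectangular matrices can be sketched in a few lines of Python; the data below are synthetic stand-ins for the 23-enterprise worksheet, not the actual example values:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the worksheet (variable names as in the example).
rng = np.random.default_rng(1)
n = 23
fundootd = rng.uniform(1.0, 3.0, n)    # return on assets (invented values)
neprrash = rng.uniform(10.0, 50.0, n)  # non-production expenses (invented)
rentabel = 5.7 * fundootd - 0.07 * neprrash + rng.normal(0.0, 1.0, n)

df = pd.DataFrame({"RENTABEL": rentabel,
                   "FUNDOOTD": fundootd,
                   "NEPRRASH": neprrash})
print(df.corr())  # square matrix of paired Pearson coefficients
# "Rectangular" view: only the coefficients with the dependent variable.
print(df.corr().loc[["FUNDOOTD", "NEPRRASH"], ["RENTABEL"]])
```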

However, in our example there is the phenomenon of multicollinearity, when the independent variables themselves are interrelated (a paired correlation coefficient greater than 0.8 in absolute value).

The Rectangular matrix option (two lists of variables) opens a dialog box for selecting two lists of variables; the variables are placed as shown in the corresponding figure.


As a result, we obtain a rectangular correlation matrix containing only correlation coefficients with the dependent variable.


If the Corr. matrix (highlight significant) option is set, then after pressing the Correlations button a matrix will be built in which the coefficients significant at level p are highlighted.


If the Detailed results table option is selected, then pressing the Correlations button produces a table that contains not only the correlation coefficients but also the means, standard deviations, the coefficients of the regression equation, the intercept of the regression equation and other statistics.


When variables have a small relative variation (the ratio of the standard deviation to the mean is less than 0.0000000000001), a higher precision of computation is required. It can be set by checking the Calculations with high precision checkbox in the Pearson correlations dialog box.

The mode of operation with missing data is determined by the Line-by-line deletion of missing data option. If it is selected, STATISTICA will ignore all observations that have gaps; otherwise, missing data are removed pairwise.

The Show long variable names checkbox will result in a table with long variable names.

Graphical representation of correlation dependencies

The Pearson correlations dialog box contains a series of buttons for obtaining graphical representations of correlation dependencies.

The 2D scatterplot option builds a sequence of scatterplots for each selected variable. The window for selecting them is identical to Fig. 6: on the left you should indicate the dependent variables, on the right the independent one, RENTABEL. Clicking OK produces a graph showing the fitted regression line and the confidence limits of prediction.

The linear correlation coefficient gives the most objective estimate of the tightness of the connection if the points in the coordinate system resemble a straight line or an elongated ellipse; if the points are arranged along a curve, the correlation coefficient gives an underestimate.

On the basis of the graph we can once again confirm the relationship between profitability and return on assets, since the observational data are arranged in the form of an inclined ellipse. It must be said that the connection is considered the closer, the nearer the points lie to the main axis of the ellipse.

In our example, a change in return on assets by one unit leads to a change in profitability of 5.7376%.

Let us look at the impact of non-production costs on profitability. To do this, we construct a similar graph.

The analyzed data resemble an ellipse less closely in shape, and the correlation coefficient is somewhat lower. The value found for the regression coefficient shows that with an increase in non-production costs of 1 thousand rubles, profitability decreases by 0.7017%.

It should be noted that the construction of a multiple regression (discussed in subsequent chapters), when the equation contains both features at the same time, leads to different values of the regression coefficients, which is explained by the interaction of the explanatory variables with each other.

When the Named button is used, the points on the scatterplot acquire the numbers or names of the corresponding observations, if these are predefined.

The next graphical option, Matrix, plots a matrix of scatterplots for the selected variables. Each graphical element of this matrix contains the correlation field formed by the corresponding pair of variables, with the regression line drawn on it.

When analyzing the matrix of scatterplots, attention should be paid to those graphs whose regression lines have a significant slope to the X axis, which suggests the existence of an interdependence between the corresponding features.

The 3D scatterplot option builds a three-dimensional correlation field for the selected variables. If the Named button is used, the points on the scatterplot will be labeled with the numbers or names of the corresponding observations, if they have them.

The Surface graphical option plots a 3D scatterplot for the selected triple of variables together with a fitted second-order surface.

The Categorized scatterplots option, in turn, builds a cascade of correlation fields for the selected indicators.

After pressing the corresponding button, the program will ask the user to form two lists from the variables previously selected with the Variables button. Then a new query window will appear on the screen for specifying a grouping variable, on the basis of which all available cases will be classified. The result is the construction of correlation fields in the context of groups of observations, for each pair of variables assigned to the different lists.

3.4. Calculation of partial and multiple correlation coefficients

To calculate partial and multiple correlation coefficients, call the Multiple regression module using the module selector button. The following dialog box will appear on the screen:

Pressing the Variables button, select the variables for analysis: on the left the dependent variable, profitability, and on the right the independent ones, return on assets and non-production expenses. The remaining variables will not participate in further analysis: on the basis of the correlation analysis they were recognized as uninformative for the regression model.

In the Input file field, the input data offered are either ordinary initial data, i.e. a table with variables and observations, or a correlation matrix. The correlation matrix can be created beforehand in the Multiple Regression module itself or calculated using the Quick Basic Statistics option.

When working with the source data file, you can set the mode of handling gaps:

    Line-by-line deletion. When this option is selected, only cases that have no missing values in any of the selected variables are used in the analysis.

    Replacing with the mean. Missing values in each variable are replaced by the mean calculated from the available complete observations.

    Pairwise removal of missing data. If this option is selected, then when calculating pairwise correlations, observations that have missing values in the corresponding pair of variables are removed.

In the Regression type field the user can choose standard or fixed non-linear regression. By default, standard multiple regression analysis is selected, which calculates the standard correlation matrix of all selected variables.

The Fixed non-linear regression mode allows various transformations of the independent variables to be performed. The Conduct the analysis option by default uses the settings corresponding to the definition of a standard regression line that includes an intercept. If this option is deselected, clicking the OK button of the launch panel will open the Model definition dialog box, in which you can select both the type of regression analysis (for example, stepwise, ridge, etc.) and other options.

By checking the option Show descriptive statistics, corr. matrices and clicking OK, we get a dialog box with the statistical characteristics of the data.

In it, you can view detailed descriptive statistics (including the number of observations from which the correlation coefficient was calculated for each pair of variables). Click OK to continue the analysis and open the Model definition dialog box.

If the analyzed indicators have an extremely small relative variation (the standard deviation divided by the mean), you should check the box next to the High precision calculations option to obtain more accurate values of the elements of the correlation matrix.

Having set all the necessary parameters in the Multiple regression dialog box, press OK and obtain the results of the required calculations.

According to our example, the multiple correlation coefficient turned out to be 0.61357990 and, accordingly, the determination coefficient 0.37648029. Thus, only 37.6% of the variance of the "profitability" indicator is explained by the change in the indicators of return on assets and non-production costs. Such a low value indicates an insufficient number of factors introduced into the model. Let us try to change the number of independent variables by adding the variable "Fixed assets" to the list (introducing the indicator "share of workers in the PPP" into the model leads to multicollinearity, which is unacceptable). The coefficient of determination increased slightly, but not enough to significantly improve the results: its value was about 41%. Obviously, identifying the factors that affect profitability requires additional research.

The significance of the multiple correlation coefficient is assessed using the table of Fisher's F-criterion. The hypothesis of its significance is accepted if the attained probability value (p-level) is below the given level (most often α = 0.1, 0.05, 0.01 or 0.001 is taken). In our example p = 0.008882 < 0.05, which indicates the significance of the coefficient.
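The same F-check can be reproduced by hand; the sketch below uses the standard F-statistic for a multiple determination coefficient and recovers a p-value close to the one quoted in the text:

```python
from scipy import stats

def multiple_R2_significance(R2, n, m):
    """F-test of a multiple determination coefficient R^2
    with m predictors and n observations."""
    F = (R2 / m) / ((1 - R2) / (n - m - 1))
    p = 1 - stats.f.cdf(F, m, n - m - 1)
    return F, p

print(multiple_R2_significance(0.37648029, n=23, m=2))  # p close to 0.0089
```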

The results table contains the following columns:

    Beta coefficient (β) - the standardized regression coefficient of the corresponding variable;

    Partial correlation - the partial correlation coefficient between the corresponding variable and the dependent one, with the influence of the other variables included in the model held fixed.

The partial correlation coefficient between profitability and return on assets in our example is 0.459899. This means that after the indicator of non-production expenses is entered into the model, the impact of return on assets on profitability decreases somewhat: from 0.49 (the value of the paired correlation coefficient) to 0.46. The similar coefficient for non-production expenses also decreased, from 0.46 (the value of the paired correlation coefficient) to 0.42 (taken in absolute value); it characterizes the change in the relationship with the dependent variable after the return-on-assets indicator is entered into the model.

    Semipartial correlation - the correlation between the unadjusted dependent variable and the corresponding independent variable, taking into account the influence of the others included in the model.

    Tolerance - defined as 1 minus the square of the multiple correlation between the corresponding variable and all the other independent variables in the regression equation.

    The coefficient of determination is the square of the multiple correlation coefficient between the corresponding independent variable and all other variables included in the regression equation.

    t-values - the calculated values of Student's t-test for testing the hypothesis of the significance of the partial correlation coefficient, with the number of degrees of freedom indicated in parentheses;

    p-level - the significance probability corresponding to the test of the partial correlation coefficient.

In our case, the p-value obtained for the first coefficient (0.031277) is less than the chosen α = 0.05. The value for the second coefficient slightly exceeds it (0.050676), which indicates its insignificance at this level. It is, however, significant at, for example, α = 0.1 (then in ten cases out of a hundred the rejection of the null hypothesis would still be mistaken).
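For two predictors, the first-order partial correlation can be recomputed from the three paired coefficients; a sketch with illustrative inputs (the full pairwise matrix of the example is not reproduced in the text, so the numbers below are assumptions):

```python
import numpy as np

def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation r_xy.z from the three paired coefficients."""
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# Illustrative values only.
print(partial_r(r_xy=0.49, r_xz=0.30, r_yz=-0.46))
```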

$$r_{xy} = \frac{\overline{xy} - \bar{x}\cdot\bar{y}}{\sigma(x)\,\sigma(y)},$$

where $\overline{xy}$, $\bar{x}$, $\bar{y}$ are the mean values of the samples; σ(x), σ(y) are the standard deviations.
Besides, the Pearson linear paired correlation coefficient can be determined through the regression coefficient b: $r_{xy} = b\,\dfrac{\sigma(x)}{\sigma(y)}$, where σ(x) = S(x) and σ(y) = S(y) are the standard deviations, and b is the coefficient of x in the regression equation y = a + bx.

Another variant of the formula:

$$r_{xy} = \frac{K_{xy}}{\sigma(x)\,\sigma(y)},$$

where $K_{xy} = \overline{xy} - \bar{x}\cdot\bar{y}$ is the correlation moment (the covariance).

To find the linear Pearson correlation coefficient, it is necessary to find the sample means $\bar{x}$ and $\bar{y}$ and their standard deviations σ_x = S(x), σ_y = S(y).

The linear correlation coefficient indicates the presence of a connection and takes values from −1 to +1 (see the Chaddock scale). For example, if an analysis of the tightness of the linear correlation between two variables gave a paired linear correlation coefficient equal to −1, this means that there is an exact inverse linear relationship between the variables.

You can calculate the value of the correlation coefficient using the given sample means, or directly.


The geometric meaning of the correlation coefficient: r_xy shows how much the slopes of the two regression lines, y(x) and x(y), differ, i.e. how much the results of minimizing the deviations in x and in y differ. The greater the angle between the two lines, the smaller r_xy.
The sign of the correlation coefficient coincides with the sign of the regression coefficient and determines the slope of the regression line, i.e. the general direction of the dependence (increase or decrease). The absolute value of the correlation coefficient is determined by the degree of closeness of the points to the regression line.

Properties of the correlation coefficient

  1. |r_xy| ≤ 1;
  2. if X and Y are independent, then r_xy = 0; the converse is not always true;
  3. if |r_xy| = 1, then Y = aX + b; conversely, |r_xy(X, aX + b)| = 1, where a and b are constants and a ≠ 0;
  4. |r_xy(X, Y)| = |r_xy(a_1X + b_1, a_2Y + b_2)|, where a_1, a_2, b_1, b_2 are constants.

Therefore, to check the direction and reliability of the connection, a hypothesis test based on the Pearson correlation coefficient is chosen, with a subsequent check of reliability using the t-test (see the example below).

Typical tasks (see also non-linear regression)

The dependence of labor productivity y on the level of mechanization of work x (%) is studied from the data of 14 industrial enterprises. The statistical data are given in the table.
Required:
1) Find estimates of the parameters of the linear regression of y on x. Build a scatterplot and plot the regression line on it.
2) At the significance level α = 0.05, test the hypothesis that the linear regression agrees with the results of observations.
3) With reliability γ = 0.95, find confidence intervals for the parameters of the linear regression.


Example. Based on the data given in Appendix 1 and corresponding to your option (Table 2), you need:

  1. Calculate the coefficient of linear paired correlation and construct the equation of linear paired regression of one feature on the other. One of the features, corresponding to your variant, will play the role of the factor (x), the other of the result (y). Establish cause-and-effect relationships between the features on the basis of economic analysis. Explain the meaning of the parameters of the equation.
  2. Determine the theoretical coefficient of determination and the residual (unexplained by the regression equation) variance. Make a conclusion.
  3. Assess the statistical significance of the regression equation as a whole at the 5 percent level using Fisher's F-test. Make a conclusion.
  4. Perform a forecast of the expected value of the result feature y for a predicted value of the factor feature x equal to 105% of its average level. Assess the accuracy of the forecast by calculating the forecast error and its confidence interval with probability 0.95.
Solution. The regression equation is y = ax + b.
Sample means:
x̄ = Σx_i/n = 78/12 = 6.5; ȳ = Σy_i/n = 1503/12 = 125.25; $\overline{xy}$ = Σx_iy_i/n = 10343/12 ≈ 861.92.
Variances:
σ_x² = Σx_i²/n − x̄² = 650/12 − 6.5² ≈ 11.92; σ_y² = Σy_i²/n − ȳ² = 190617/12 − 125.25² ≈ 197.19.
Standard deviations:
σ_x ≈ 3.45; σ_y ≈ 14.04.
Correlation coefficient:
r_xy = ($\overline{xy}$ − x̄·ȳ)/(σ_x σ_y) = (861.92 − 6.5·125.25)/(3.45·14.04) ≈ 0.986.

The relationship between the feature Y and the factor X is strong and direct (as determined by the Chaddock scale).
Regression equation: y(x) = 4.01x + 99.18. Regression coefficient: k = a = 4.01.

Coefficient of determination:
R² = 0.986² ≈ 0.97, i.e. in 97% of cases changes in x lead to a change in y; in other words, the accuracy of the fit of the regression equation is high. Residual variance: 3%.
x    y     x²    y²      x·y    y(x)    (y_i−ȳ)²  (y−y(x))²  (x−x̄)²
1    107   1     11449   107    103.19  333.06    14.5       30.25
2    109   4     11881   218    107.2   264.06    3.23       20.25
3    110   9     12100   330    111.21  232.56    1.47       12.25
4    113   16    12769   452    115.22  150.06    4.95       6.25
5    120   25    14400   600    119.23  27.56     0.59       2.25
6    122   36    14884   732    123.24  10.56     1.55       0.25
7    123   49    15129   861    127.26  5.06      18.11      0.25
8    128   64    16384   1024   131.27  7.56      10.67      2.25
9    136   81    18496   1224   135.28  115.56    0.52       6.25
10   140   100   19600   1400   139.29  217.56    0.51       12.25
11   145   121   21025   1595   143.3   390.06    2.9        20.25
12   150   144   22500   1800   147.31  612.56    7.25       30.25
78   1503  650   190617  10343  1503    2366.25   66.23      143

Note: y(x) values ​​are found from the resulting regression equation:
y(1) = 4.01*1 + 99.18 = 103.19
y(2) = 4.01*2 + 99.18 = 107.2
... ... ...
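The worked example above can be reproduced from the x and y columns of the table; a short sketch:

```python
import numpy as np

x = np.arange(1, 13)  # x = 1..12, as in the table
y = np.array([107, 109, 110, 113, 120, 122, 123,
              128, 136, 140, 145, 150], dtype=float)

a, b = np.polyfit(x, y, 1)       # slope a and intercept b
r = np.corrcoef(x, y)[0, 1]
print(a, b)                      # about 4.01 and 99.18, as in y(x) above
print(r, r**2)                   # about 0.986 and 0.97
```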

Significance of the correlation coefficient

We put forward the hypotheses:
H_0: r_xy = 0 — there is no linear relationship between the variables;
H_1: r_xy ≠ 0 — there is a linear relationship between the variables.
In order to test, at significance level α, the null hypothesis that the general correlation coefficient of a normal two-dimensional random variable is equal to zero against the competing hypothesis H_1: ρ ≠ 0, it is necessary to calculate the observed value of the criterion

$$T_{obs} = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}} = \frac{0.986\sqrt{10}}{\sqrt{1 - 0.972}} \approx 18.7.$$

From the Student table we find t_tab(n − m − 1; α/2) = t(10; 0.025) = 2.228.
Since T_obs > t_tab, we reject the hypothesis that the correlation coefficient is equal to 0; in other words, the correlation coefficient is statistically significant.
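The same test in a few lines (values taken from the example above):

```python
import numpy as np
from scipy import stats

r, n = 0.986, 12
T_obs = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
t_tab = stats.t.ppf(1 - 0.05 / 2, df=n - 2)
print(T_obs, t_tab)  # about 18.7 > 2.228: the coefficient is significant
```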
Interval estimate for the correlation coefficient (confidence interval):

r − Δr ≤ ρ ≤ r + Δr, where Δr = t_tab·m_r and $m_r = \sqrt{\dfrac{1 - r^2}{n - 2}} = \sqrt{\dfrac{1 - 0.972}{10}} \approx 0.0529$.

Δr = 2.228 · 0.0529 = 0.118
0.986 − 0.118 ≤ ρ ≤ 0.986 + 0.118
Confidence interval for the correlation coefficient: 0.868 ≤ ρ ≤ 1

Analysis of the accuracy of the estimates of the regression coefficients

Residual standard error: $S = \sqrt{\dfrac{\sum (y - y(x))^2}{n - 2}} = \sqrt{\dfrac{66.23}{10}} \approx 2.57.$

Standard error of the slope: $S_a = \dfrac{S}{\sqrt{\sum (x - \bar{x})^2}} = \dfrac{2.57}{\sqrt{143}} = 0.2152.$

Standard error of the intercept: $S_b = S\sqrt{\dfrac{\sum x^2}{n \sum (x - \bar{x})^2}} = 2.57\sqrt{\dfrac{650}{12 \cdot 143}} \approx 1.584.$

Confidence intervals for the dependent variable

Let us calculate the boundaries of the interval in which 95% of the possible values of Y will be concentrated for an unlimitedly large number of observations and X = 7:
(122.4; 132.11)
Testing hypotheses about the coefficients of the linear regression equation

1) t-statistics:

t_a = a/S_a = 4.01/0.2152 ≈ 18.6; t_b = b/S_b = 99.18/1.584 ≈ 62.6.

Both values exceed t_tab = 2.228, so the statistical significance of the regression coefficients is confirmed.
Confidence intervals for the coefficients of the regression equation
Let us determine the confidence intervals of the regression coefficients, which with 95% reliability will be as follows:
(a − t·S_a; a + t·S_a) = (3.6205; 4.4005)
(b − t·S_b; b + t·S_b) = (96.3117; 102.0519)