Correlation coefficients

So far we have established only the mere fact that a statistical relationship exists between two features. Next we will find out what conclusions can be drawn about the strength or weakness of this relationship, as well as about its form and direction. Criteria for quantifying the relationship between variables are called correlation coefficients, or measures of association. Two variables are positively correlated if there is a direct, unidirectional relationship between them: small values of one variable correspond to small values of the other, and large values correspond to large ones. Two variables are negatively correlated if there is an inverse relationship between them: small values of one variable correspond to large values of the other, and vice versa. The values of correlation coefficients always lie in the range from -1 to +1.

Spearman's coefficient is used as the correlation coefficient for variables belonging to an ordinal scale, and the Pearson (product-moment) correlation coefficient for variables belonging to an interval scale. Note that any dichotomous variable, that is, a variable belonging to the nominal scale and having two categories, can be treated as ordinal.

First, we will check whether there is a correlation between the sex and psyche variables from the studium.sav file. Here the dichotomous variable sex can be treated as ordinal. Do the following:

    Select Analyze → Descriptive Statistics → Crosstabs... from the menu.

    Move the variable sex to the row list and the variable psyche to the column list.

    Click the Statistics... button. In the Crosstabs: Statistics dialog, check the Correlations box. Confirm your choice with the Continue button.

    In the Crosstabs dialog, suppress the display of the tables by checking the Suppress tables checkbox. Click the OK button.

The Spearman and Pearson correlation coefficients will be calculated, and their significance will be tested:

Symmetric Measures

                                             Value   Asympt. Std. Error(a)   Approx. T(b)   Approx. Sig.
Interval by Interval   Pearson's R           .441    .081                    5.006          .000(c)
Ordinal by Ordinal     Spearman Correlation  .439    .083                    4.987          .000(c)
N of Valid Cases                             106

Since there are no interval-scaled variables here, we consider the Spearman correlation coefficient. It is 0.439 and is highly significant (p < 0.001).

The following table is used to describe the values of the correlation coefficient verbally:

Value of the correlation coefficient    Interpretation
up to 0.2                               very weak correlation
up to 0.5                               weak correlation
up to 0.7                               moderate correlation
up to 0.9                               strong correlation
over 0.9                                very strong correlation

Based on the above table, the following conclusions can be drawn: there is a weak correlation between the variables sex and psyche (a conclusion about the strength of the relationship), and the variables correlate positively (a conclusion about its direction).

In the psyche variable, smaller values correspond to a negative mental state and larger values to a positive one. In the sex variable, in turn, the value "1" corresponds to female and "2" to male.

Therefore, the unidirectional relationship can be interpreted as follows: female students assess their mental state more negatively than their male colleagues, or at least are more ready to admit to such an assessment in a survey. When building such interpretations, one should keep in mind that a correlation between two traits is not necessarily the same as a functional or causal relationship between them; see Section 15.3 for more on this.
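The same check can be reproduced outside SPSS. Below is a minimal sketch in Python, assuming the data have been exported from studium.sav to a CSV file; the file name and column names are assumptions here:

```python
# A sketch, not the SPSS procedure itself: assumes studium.sav was exported
# to "studium.csv" with columns "sex" (1 = female, 2 = male) and "psyche"
# (ordinal codes of the mental state).
import pandas as pd
from scipy import stats

df = pd.read_csv("studium.csv")

# Spearman is appropriate here: sex is dichotomous (treated as ordinal)
# and psyche is ordinal.
rho, p_value = stats.spearmanr(df["sex"], df["psyche"])
print(f"Spearman rho = {rho:.3f}, p = {p_value:.4f}")  # expected: ~0.439, p < 0.001
```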

Now let us check the correlation between the alter and semester variables, applying the method described above. We get the following coefficients:

Symmetric Measures

                                             Value   Asympt. Std. Error(a)   Approx. T(b)   Approx. Sig.
Interval by Interval   Pearson's R           .807    ...                     ...            ...
Ordinal by Ordinal     Spearman Correlation  ...     ...                     ...            ...
N of Valid Cases                             ...

a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.
c. Based on normal approximation.

Since alter and semester are metric variables, we consider the Pearson (product-moment) coefficient. It is 0.807: there is a strong correlation between alter and semester, and the variables correlate positively. Consequently, older students study in the senior semesters, which is hardly an unexpected conclusion.

Let us now check the variables sozial (assessment of social position) and psyche for correlation. We get the following coefficients:

Symmetric Measures

                                             Value   Asympt. Std. Error(a)   Approx. T(b)   Approx. Sig.
Interval by Interval   Pearson's R           ...     ...                     ...            ...
Ordinal by Ordinal     Spearman Correlation  -.703   ...                     ...            ...
N of Valid Cases                             ...

a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.
c. Based on normal approximation.

In this case, we consider the Spearman correlation coefficient; it is -0.703. There is a moderate to strong correlation between sozial and psyche (the value lies right at the 0.7 cutoff). The variables correlate negatively: the greater the value of the first variable, the smaller the value of the second, and vice versa. Since small values of sozial characterize a positive state (1 = very good, 2 = good), while large values of psyche characterize a negative state (1 = extremely unstable, 2 = unstable), psychological difficulties are largely linked to social problems.

The correlation coefficient measures the degree of association between two variables. Its calculation gives an idea of whether there is a relationship between two data sets. Unlike regression, correlation does not allow one to predict values; nevertheless, computing the coefficient is an important step of preliminary statistical analysis. For example, having found that the correlation coefficient between the level of foreign direct investment and GDP growth is high, we get an indication that ensuring prosperity may require creating a favorable climate specifically for foreign entrepreneurs. Not such an obvious conclusion at first glance!

Correlation and causation

Perhaps no other area of statistics has become so firmly established in our lives. The correlation coefficient is used in all areas of social knowledge. Its main danger is that its high values are often exploited to convince people and make them believe certain conclusions. In fact, however, a strong correlation by no means indicates a causal relationship between the quantities.

Correlation coefficient: the Pearson and Spearman formulas

Several main indicators characterize the relationship between two variables. Historically, the first is Pearson's linear correlation coefficient, which is taught in school. It was developed by K. Pearson and G. U. Yule based on the work of Francis Galton. This coefficient shows the strength of the linear relationship between quantitative variables. It always lies between -1 and 1. A negative value indicates an inverse relationship; if the coefficient is zero, there is no linear relationship between the variables; a positive value indicates a direct relationship between the studied quantities. Spearman's rank correlation coefficient simplifies the calculations by replacing the values of the variables with their ranks.

Relationships between variables

Correlation helps answer two questions: first, whether the relationship between the variables is positive or negative; second, how strong the dependence is. Correlation analysis is a powerful tool for obtaining this important information. It is easy to see that household incomes and expenses rise and fall proportionally; such a relationship is considered positive. On the contrary, when the price of a product rises, the demand for it falls; such a relationship is called negative. The values of the correlation coefficient lie between -1 and 1. Zero means that there is no relationship between the studied values. The closer the indicator is to the extreme values, the stronger the relationship (negative or positive). A coefficient between -0.1 and 0.1 indicates the absence of dependence. It must be understood that such a value only indicates the absence of a linear relationship.

Application features

The use of both indicators is subject to certain assumptions. First, the presence of a strong relationship does not establish that one quantity determines the other: there may well be a third quantity that determines both of them. Second, a high Pearson correlation coefficient does not indicate a causal relationship between the studied variables. Third, it captures an exclusively linear relationship. Correlation can be used to evaluate meaningful quantitative data (e.g., barometric pressure, air temperature) rather than categories such as gender or favorite color.

Multiple correlation coefficient

Pearson and Spearman investigated the relationship between two variables. But what if there are three or even more of them? This is where the multiple correlation coefficient comes in. For example, the gross national product is affected not only by foreign direct investment but also by the state's monetary and fiscal policies, as well as by the level of exports. The growth rate and volume of GDP are the result of the interaction of a number of factors. It should be understood, however, that the multiple correlation model rests on a number of simplifications and assumptions: first, multicollinearity between the quantities is excluded; second, the relationship between the dependent variable and the variables that affect it is assumed to be linear.

Areas of use of correlation and regression analysis

This method of finding the relationship between quantities is widely used in statistics. It is most often resorted to in three main cases:

  1. For testing causal relationships between the values of two variables. As a result, the researcher hopes to find a linear relationship and derive a formula that describes the relationship between the quantities; their units of measurement may differ.
  2. For checking whether there is a relationship between values. In this case, no variable is designated as dependent; it may turn out that some other factor determines the values of both quantities.
  3. For deriving an equation. In this case, one can simply substitute numbers into it and find the values of the unknown variable.

A man in search of a causal relationship

Consciousness is arranged in such a way that we feel compelled to explain the events happening around us. A person is always looking for a connection between the picture of the world in which he lives and the information he receives. The brain often creates order out of chaos: it can easily see a causal relationship where there is none. Scientists have to learn specifically to overcome this tendency. The ability to evaluate relationships between data objectively is essential in an academic career.

Media bias

Consider how the presence of a correlation can be misinterpreted. A group of badly behaved British students were asked whether their parents smoked, and the study was then published in a newspaper. The result showed a strong correlation between parents' smoking and their children's delinquency. The professor who conducted this study even suggested putting a warning about it on cigarette packs. However, there are several problems with this conclusion. First, a correlation does not indicate which of the quantities is independent, so it is quite possible to assume that the parents' pernicious habit is caused by the children's disobedience. Second, it is impossible to say with certainty that both problems did not arise from some third factor, for example, low family income. Finally, the emotional side of the professor's initial conclusions should be noted: he was an ardent opponent of smoking, so it is not surprising that he interpreted the results of his study this way.

Conclusions

Misinterpreting correlation as a causal relationship between two variables can lead to embarrassing research errors. The problem is that this tendency lies at the very core of human consciousness, and many marketing tricks are based on it. Understanding the difference between causation and correlation allows you to analyze information rationally both in everyday life and in your professional career.

The correlation coefficient is a value that can vary from +1 to -1. In the case of a complete positive correlation the coefficient equals plus 1 (as the value of one variable increases, the value of the other variable increases too), and in the case of a complete negative correlation it equals minus 1 (indicating an inverse relationship: when the values of one variable increase, the values of the other decrease).

Ex 1:

A graph of the dependence between shyness and depression. As you can see, the points (subjects) are not located randomly but line up around a line, and looking at this line we can say that the more pronounced a person's shyness, the stronger the depression; that is, these phenomena are interrelated.

Ex 2: A graph for shyness and sociability. We see that as shyness increases, sociability decreases. Their correlation coefficient is -0.43. Thus, a correlation coefficient from 0 to 1 indicates a directly proportional relationship (the more ... the more ...), and a coefficient from -1 to 0 an inversely proportional one (the more ... the less ...).

If the correlation coefficient is 0, there is no linear relationship between the two variables (which, as noted below, does not yet mean they are independent).

Correlation is a relationship in which the impact of individual factors appears only as a tendency (on average) under mass observation of actual data. Examples of correlation dependence are the dependences between the size of a bank's assets and its profit, or between the growth of labor productivity and employees' length of service.

Two systems of classifying correlations according to their strength are used: a general one and a particular one.

The general classification of correlations: 1) strong, or close, with a correlation coefficient r > 0.70; 2) medium at 0.50 < r < 0.70; 3) moderate at 0.30 < r < 0.50; 4) weak at 0.20 < r < 0.30; 5) very weak at r < 0.20. The particular classification grades correlations by the statistical significance level they reach; in practice one often needs precisely a strong correlation, and not just a correlation of a high level of significance.

The names of the correlation coefficients for different combinations of scale types are as follows:

• Dichotomous (1/0) with dichotomous: Pearson's association coefficient, Pearson's four-field contingency coefficient.
• Dichotomous with rank (ordinal): rank-biserial correlation.
• Dichotomous with interval or absolute: biserial correlation.
• Rank with rank: Spearman's or Kendall's rank correlation coefficient.
• Rank with interval or absolute: the interval values are converted into ranks and a rank coefficient is used.
• Interval or absolute with interval or absolute: Pearson correlation coefficient (linear correlation coefficient).

At r=0 there is no linear correlation. In this case, the group means of the variables coincide with their general means, and the regression lines are parallel to the coordinate axes.

The equality r = 0 speaks only of the absence of a linear correlation dependence (uncorrelated variables), but not of the absence of correlation in general, much less of the absence of statistical dependence.

Sometimes the conclusion that there is no correlation is more important than the presence of a strong correlation. A zero correlation between two variables may indicate that one variable has no influence on the other, provided we trust the measurement results.


Task No. 10. Correlation analysis

The concept of correlation

Correlation, or the correlation coefficient, is a statistical indicator of a probabilistic relationship between two variables measured on quantitative scales. In contrast to a functional connection, in which each value of one variable corresponds to a strictly defined value of the other variable, a probabilistic connection is characterized by the fact that each value of one variable corresponds to a set of values of the other variable. An example of a probabilistic relationship is the relationship between people's height and weight: clearly, people of different weights can have the same height, and vice versa.

Correlation is a value between -1 and +1 and is denoted by the letter r. If the value is closer to ±1, this means the presence of a strong connection; if it is closer to 0, a weak one. A correlation of less than 0.2 in absolute value is considered weak, and more than 0.5 high. If the correlation coefficient is negative, there is an inverse relationship: the higher the value of one variable, the lower the value of the other.

Depending on the values taken by the coefficient r, different types of correlation can be distinguished:

Strict positive correlation is defined by the value r = 1. The term "strict" means that the value of one variable is uniquely determined by the values of the other variable, and the term "positive" means that as the value of one variable increases, the value of the other variable also increases.

Strict correlation is a mathematical abstraction and almost never occurs in real research.

Positive correlation corresponds to the values 0 < r < 1.

Absence of correlation is defined by the value r = 0. A correlation coefficient of zero indicates that the values of the variables are not linearly related to each other.

The absence of correlation, H0: r_xy = 0, is what the null hypothesis in correlation analysis states.

Negative correlation corresponds to the values -1 < r < 0.

Strict negative correlation is defined by the value r = -1. Like strict positive correlation, it is an abstraction and does not find expression in practical research.

Table 1. Types of correlation and their definitions (summarized above).

The method of calculating the correlation coefficient depends on the type of scale on which the values ​​of the variable are measured.

The Pearson correlation coefficient r is the basic one; it can be used for variables belonging to interval scales, and partially to ordinal scales, whose distribution of values corresponds to the normal distribution (product-moment correlation). The Pearson correlation coefficient gives fairly accurate results in cases of non-normal distributions as well.

For distributions that are not normal, it is preferable to use the Spearman and Kendall rank correlation coefficients. They are called rank coefficients because the program pre-ranks the correlated variables.

The SPSS program calculates the Spearman correlation r as follows: first the variables are converted to ranks, and then the Pearson formula is applied to the ranks.
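This rank-then-Pearson equivalence is easy to verify; a small sketch with made-up data:

```python
# Spearman's rho equals Pearson's r applied to the ranks of the values
# (rankdata assigns average ranks to ties, as SPSS does).
import numpy as np
from scipy import stats

x = np.array([3, 1, 4, 1, 5, 9, 2, 6])
y = np.array([2, 7, 1, 8, 2, 8, 1, 8])

r_on_ranks = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))[0]
rho_direct = stats.spearmanr(x, y)[0]
print(r_on_ranks, rho_direct)  # identical up to floating-point error
```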

The correlation proposed by M. Kendall is based on the idea that the direction of the connection can be judged by comparing subjects in pairs: if for a pair of subjects the change in X coincides in direction with the change in Y, this indicates a positive relationship; if it does not coincide, a negative one. This coefficient is used mainly by psychologists working with small samples. Since sociologists work with large data arrays, it is difficult to enumerate the pairs and to identify the difference between the relative frequencies of concordances and inversions over all pairs of subjects in the sample. The most commonly used coefficient is Pearson's.

Since the Pearson correlation coefficient r is the basic one and can be used (with some error depending on the type of scale and the degree of non-normality of the distribution) for all variables measured on quantitative scales, we will consider examples of its use and compare the results with the results of measurements using other correlation coefficients.

The formula for calculating the Pearson coefficient r:

r_xy = Σ (X_i − X̄)·(Y_i − Ȳ) / ((N − 1)·σ_x·σ_y),

where X_i, Y_i are the values of the two variables;

X̄, Ȳ are the mean values of the two variables;

σ_x, σ_y are the standard deviations;

N is the number of observations.
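A direct implementation of this formula, as a sketch; note that the (N − 1) in the denominator implies sample standard deviations (ddof = 1):

```python
import numpy as np

def pearson_r(x, y):
    # r_xy = sum((X_i - Xmean)(Y_i - Ymean)) / ((N - 1) * sigma_x * sigma_y)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    sx, sy = x.std(ddof=1), y.std(ddof=1)  # sample standard deviations
    return ((x - x.mean()) * (y - y.mean())).sum() / ((n - 1) * sx * sy)

print(pearson_r([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))  # 0.8
```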

Pair correlations

For example, we would like to find out how the answers about different types of traditional values correlate in students' ideas about the ideal place of work (variables a9.1, a9.3, a9.5, a9.7), and then about the ratio of liberal values (a9.2, a9.4, a9.6, a9.8). These variables are measured on five-point ordered scales.

We use the procedure "Analysis" → "Correlations" → "Paired". By default, the Pearson coefficient is set in the dialog box; we use it.

The tested variables are transferred to the selection window: a9.1, a9.3, a9.5, a9.7.

By pressing OK, we get the calculation:

Correlations

The output is a correlation matrix for the variables:
• a9.1.t. How important is it to have enough time for family and personal life?
• a9.3.t. How important is it to not be afraid of losing your job?
• a9.5.t. How important is it to have a boss who consults with you when making decisions?
• a9.7.t. How important is it to work in a well-coordinated team and feel part of it?
For each pair of variables the matrix shows the Pearson correlation and its 2-tailed significance.

** Correlation is significant at the 0.01 level (2-tailed).

Partial correlations:

First, let us build the pairwise correlation between the two variables c8 and c12:

Correlations

The output shows the pairwise Pearson correlation between the variables:
• c8. Feel close to those who live near you, neighbors
• c12. Feel close to their family
together with its 2-tailed significance; the coefficient equals 0.120 (see below).

**. Correlation is significant at the 0.01 level (2-tailed).

Then we use the procedure for constructing a partial correlation: "Analysis" → "Correlations" → "Partial".

Suppose the value "It is important to independently determine and change the order of your work" is, with respect to the indicated variables, the decisive factor under whose influence the previously identified relationship will disappear or turn out to be insignificant.

Correlations (partial correlation output)

The table repeats the correlations for
• c8. Feel close to those who live near you, neighbors
• c12. Feel close to their family
• c16. Feel close to people who have the same wealth as you
with the chosen value judgment as the control variable; for each pair, the partial correlation and its 2-tailed significance are shown.

As can be seen from the table, under the influence of the control variable the relationship decreased slightly: from 0.120 to 0.102. It nevertheless remains high enough to allow rejecting the null hypothesis.
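Outside SPSS, a first-order partial correlation can be obtained from the three pairwise coefficients; a sketch of the standard formula (only 0.120 comes from the output above, the two correlations involving the control variable are hypothetical placeholders):

```python
import math

def partial_r(r_xy, r_xz, r_yz):
    # Correlation between x and y with the control variable z partialled out.
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# 0.120 is the pairwise c8-c12 coefficient reported above; the correlations
# with the control variable are made-up placeholders.
print(partial_r(0.120, 0.20, 0.25))
```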

Correlation coefficient

The most accurate way to determine the tightness and nature of a correlation is to find the correlation coefficient. The correlation coefficient is a number determined by the formula:

r_xy = Σ (x_i − x̄)(y_i − ȳ) / √( Σ(x_i − x̄)² · Σ(y_i − ȳ)² )   (32)
where r_xy is the correlation coefficient;

x_i are the values of the first feature;

y_i are the values of the second feature;

x̄ is the arithmetic mean of the values of the first feature;

ȳ is the arithmetic mean of the values of the second feature.

To use formula (32), we construct a table that will provide the necessary sequence in the preparation of numbers to find the numerator and denominator of the correlation coefficient.

As can be seen from formula (32), the sequence of actions is as follows: we find the arithmetic means of both features, x̄ and ȳ; we find the deviation of each value from its mean, (x_i − x̄) and (y_i − ȳ); then we find their products (x_i − x̄)(y_i − ȳ), the sum of which gives the numerator of the correlation coefficient. To find the denominator, we square the differences (x_i − x̄) and (y_i − ȳ), find the sums of the squares, multiply them, and extract the square root of the product.

Thus, for example 31, the calculation of the correlation coefficient in accordance with formula (32) can be presented as follows (Table 50).

The resulting value of the correlation coefficient makes it possible to establish the presence, closeness, and nature of the relationship.

1. If the correlation coefficient is zero, there is no (linear) relationship between the features.

2. If the correlation coefficient is equal to one, the relationship between the features is so great that it turns into a functional one.

3. The absolute value of the correlation coefficient does not go beyond the interval from zero to one: 0 ≤ |r_xy| ≤ 1.

This makes it possible to focus on the tightness of the connection: the closer the coefficient is to zero, the weaker the connection, and the closer to unity, the closer the connection.

4. The sign of the correlation coefficient "plus" means direct correlation, the sign "minus" means the opposite.

Table 50

x_i       y_i       (x_i − x̄)   (y_i − ȳ)   (x_i − x̄)(y_i − ȳ)   (x_i − x̄)²   (y_i − ȳ)²
14.00     12.10     -1.70        -2.30        +3.91                  2.89          5.29
14.20     13.80     -1.50        -0.60        +0.90                  2.25          0.36
14.90     14.20     -0.80        -0.20        +0.16                  0.64          0.04
15.40     13.00     -0.30        -1.40        +0.42                  0.09          1.96
16.00     14.60     +0.30        +0.20        +0.06                  0.09          0.04
17.20     15.90     +1.50        +1.50        +2.25                  2.25          2.25
18.10     17.40     +2.40        +2.00        +4.80                  5.76          4.00
Σ 109.80  Σ 101.00                            Σ 12.50                Σ 13.97       Σ 13.94

r_xy = 12.50 / √(13.97 · 13.94) ≈ +0.9
Thus, the correlation coefficient calculated in example 31, r_xy = +0.9, allows us to draw the following conclusions: there is a correlation between the muscle strength of the right and left hands of the studied schoolchildren (the coefficient r_xy = +0.9 is non-zero); the relationship is very close (the coefficient is close to one); the correlation is direct (the coefficient is positive), i.e., as the muscle strength of one hand increases, the strength of the other hand increases.
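The hand computation in Table 50 is easy to check numerically; a quick sketch:

```python
# Verifying the Table 50 result with numpy.
import numpy as np

x = [14.0, 14.2, 14.9, 15.4, 16.0, 17.2, 18.1]  # right-hand strength
y = [12.1, 13.8, 14.2, 13.0, 14.6, 15.9, 17.4]  # left-hand strength
print(round(np.corrcoef(x, y)[0, 1], 2))  # 0.9
```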

When calculating the correlation coefficient and using its properties, it should be kept in mind that the conclusions are correct when the features are normally distributed and when a large number of values of both features is considered.

In example 31, only 7 values of both features were analyzed, which, of course, is not enough for such studies. We remind you again that the examples in this book in general, and in this chapter in particular, serve to illustrate the methods rather than to present any scientific experiments in detail. For this reason, a small number of feature values is considered and the measurements are rounded, so as not to obscure the idea of the method with cumbersome calculations.

Particular attention should be paid to the essence of the relationship under consideration. The correlation coefficient cannot lead to correct research results if the analysis of the relationship between the features is carried out formally. Let us return to example 31. Both features considered were values of the muscle strength of the right and left hands. But imagine that by feature x_i in example 31 (14.0; 14.2; 14.9 ... 18.1) we mean the length of randomly caught fish in centimeters, and by feature y_i (12.1; 13.8; 14.2 ... 17.4) the weight of instruments in the laboratory in kilograms. Formally applying the computational apparatus to find the correlation coefficient, and in this case also obtaining r_xy = +0.9, we would have to conclude that there is a close direct relationship between the length of the fish and the weight of the instruments. The absurdity of such a conclusion is obvious.

To avoid a formal approach to the correlation coefficient, one should first use some other method — mathematical, logical, experimental, or theoretical — to establish that a correlation between the features is possible at all, that is, to detect the organic unity of the features. Only then can one begin to use correlation analysis and establish the magnitude and nature of the relationship.

Mathematical statistics also has the concept of multiple correlation — a relationship between three or more features. In these cases, a multiple correlation coefficient is used, built from the pairwise correlation coefficients described above.

For example, the multiple correlation coefficient of three features x_i, y_i, z_i is:

R_x.yz = √( (r²_xy + r²_xz − 2·r_xy·r_xz·r_yz) / (1 − r²_yz) ),

where R_x.yz is the multiple correlation coefficient expressing how the feature x_i depends on the features y_i and z_i;

r_xy is the correlation coefficient between the features x_i and y_i;

r_xz is the correlation coefficient between the features x_i and z_i;

r_yz is the correlation coefficient between the features y_i and z_i.
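A sketch of this computation from the three pairwise coefficients (the input values here are made up for illustration):

```python
import math

def multiple_r(r_xy, r_xz, r_yz):
    # R_x.yz = sqrt((r_xy^2 + r_xz^2 - 2 r_xy r_xz r_yz) / (1 - r_yz^2))
    return math.sqrt((r_xy**2 + r_xz**2 - 2 * r_xy * r_xz * r_yz)
                     / (1 - r_yz**2))

print(multiple_r(0.6, 0.5, 0.3))  # ~0.69 for these hypothetical inputs
```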

Correlation analysis

Correlation is a statistical relationship between two or more random variables (or variables that can be considered such with some acceptable degree of accuracy), in which changes in one or more of these quantities are accompanied by a systematic change in the other quantity or quantities. The correlation coefficient serves as a mathematical measure of the correlation of two random variables.

Correlation can be positive or negative (it is also possible that there is no statistical relationship, e.g., for independent random variables). Negative correlation is a correlation in which an increase in one variable is associated with a decrease in the other, and the correlation coefficient is negative. Positive correlation is a correlation in which an increase in one variable is associated with an increase in the other, and the correlation coefficient is positive.

Autocorrelation is a statistical relationship between random variables from the same series, but taken with a shift; for example, for a random process, with a shift in time.

The method of processing statistical data, which consists in studying the coefficients (correlations) between variables, is called correlation analysis.

Correlation coefficient

The correlation coefficient, or pair correlation coefficient, in probability theory and statistics is an indicator of the nature of the joint variation of two random variables. The correlation coefficient is denoted by the Latin letter R and can take values between -1 and +1. If its absolute value is closer to 1, this means the presence of a strong connection (with a correlation coefficient equal to one, one speaks of a functional connection); if it is closer to 0, a weak one.

Pearson correlation coefficient

For metric quantities the Pearson correlation coefficient is used, the exact formula of which was introduced by Francis Galton.

Let X, Y be two random variables defined on the same probability space. Then their correlation coefficient is given by the formula:

R_{X,Y} = cov(X, Y) / √(D(X)·D(Y)),

where cov denotes the covariance and D the variance, or, equivalently,

R_{X,Y} = (M[X·Y] − M[X]·M[Y]) / √((M[X²] − M[X]²)·(M[Y²] − M[Y]²)),

where the symbol M denotes the mathematical expectation.

To represent such a relationship graphically, one can use a rectangular coordinate system with axes corresponding to the two variables. Each pair of values is marked with a specific symbol. Such a plot is called a scatterplot.

The method of calculating the correlation coefficient depends on the type of scale to which the variables belong. To measure variables on interval and quantitative scales, the Pearson correlation coefficient (product-moment correlation) is to be used. If at least one of the two variables has an ordinal scale or is not normally distributed, Spearman's rank correlation or Kendall's τ (tau) must be used. When one of the two variables is dichotomous, point-biserial correlation is used, and if both variables are dichotomous, four-field correlation. Calculating the correlation coefficient between two non-dichotomous variables makes sense only if the relationship between them is linear (unidirectional).

Kendall correlation coefficient

Used to measure the mutual disorder between two rankings.

Spearman's correlation coefficient

Calculated by applying the Pearson formula to the ranks of the values (see above).

Properties of the correlation coefficient

  • Cauchy–Bunyakovsky inequality: if we take the covariance as the scalar product of two random variables, then the norm of a random variable equals ‖X‖ = √D(X), and the consequence of the Cauchy–Bunyakovsky inequality is |R_{X,Y}| ≤ 1. The coefficient equals ±1 only when the variables are linearly related, Y = kX + b; moreover, in this case the signs of R and k coincide.

Correlation analysis

Correlation analysis is a method of processing statistical data consisting in studying the coefficients (correlations) between variables. In this process, the correlation coefficients between one pair, or multiple pairs, of features are compared in order to establish statistical relationships between them.

The goal of correlation analysis is to provide some information about one variable with the help of another variable. In cases where it is possible to achieve this goal, the variables are said to correlate. In the most general form, accepting the hypothesis of the presence of correlation means that a change in the value of variable A will occur simultaneously with a proportional change in the value of B: if both variables increase, the correlation is positive; if one variable increases while the other decreases, the correlation is negative.

Correlation reflects only the linear dependence of quantities but does not reflect their functional connectedness. For example, if we calculate the correlation coefficient between the quantities A = sin(x) and B = cos(x), it will be close to zero, i.e., there is no linear dependence between the quantities. Meanwhile, A and B are obviously related functionally by the law sin²(x) + cos²(x) = 1.
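A quick numerical illustration of this point:

```python
# sin(x) and cos(x) are functionally related, yet over a whole number of
# periods their linear correlation is (numerically) zero.
import numpy as np

x = np.linspace(0, 2 * np.pi, 1000, endpoint=False)
print(np.corrcoef(np.sin(x), np.cos(x))[0, 1])  # ~0 up to rounding error
```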

Limitations of correlation analysis



Figure: plots of distributions of pairs (x, y) with the corresponding correlation coefficient for each of them. The correlation coefficient reflects a linear relationship (top row), but does not describe a curved relationship (middle row), and is not at all suitable for describing complex, non-linear relationships (bottom row).
  1. Application is possible if there is a sufficient number of cases to study: for a particular type of correlation coefficient, from 25 to 100 pairs of observations.
  2. The second limitation follows from the hypothesis of correlation analysis, which assumes a linear dependence of the variables. In many cases, when it is reliably known that a dependence exists, correlation analysis may give no results simply because the dependence is non-linear (expressed, for example, as a parabola).
  3. By itself, the fact of correlation gives no grounds to assert which of the variables precedes or causes the changes, or that the variables are causally related at all; the correlation may, for example, be due to the action of a third factor.

Application area

This method of processing statistical data is very popular in economics and social sciences (in particular, in psychology and sociology), although the scope of application of correlation coefficients is extensive: quality control of industrial products, metallurgy, agricultural chemistry, hydrobiology, biometrics, and others.

The popularity of the method is due to two things: the correlation coefficients are relatively easy to calculate, and their application requires no special mathematical training. Combined with its ease of interpretation, this simplicity of application has led to the widespread use of the coefficient in statistical data analysis.

Spurious correlation

The often tempting simplicity of correlation research encourages the researcher to draw false intuitive conclusions about the presence of a causal relationship between pairs of traits, whereas correlation coefficients establish only statistical relationships.

In the modern quantitative methodology of the social sciences there has, in fact, been an abandonment of attempts to establish causal relationships between observed variables by empirical methods. Therefore, when researchers in the social sciences talk about establishing relationships between the variables they study, either a general theoretical assumption or a statistical dependence is implied.

See also

  • Autocorrelation function
  • Cross-correlation function
  • Covariance
  • Determination coefficient
  • Regression analysis


The Pearson coefficient can also be computed from sample means and standard deviations:

r_xy = (mean(x·y) − x̄·ȳ) / (σ(x)·σ(y)),

where mean(x·y) is the sample mean of the products x·y; x̄, ȳ are the sample means; and σ(x), σ(y) are the standard deviations.

Besides, Pearson's linear pair correlation coefficient can be determined through the regression coefficient b:

r_xy = b·σ(x) / σ(y),

where σ(x) = S(x), σ(y) = S(y) are the standard deviations, and b is the coefficient of x in the regression equation y = a + bx.

Another variant of the formula:

r_xy = K_xy / (σ_x·σ_y),

where K_xy is the correlation moment (covariance).

To find the linear Pearson correlation coefficient it is thus necessary to find the sample means x̄ and ȳ and the standard deviations σ_x = S(x), σ_y = S(y).

The linear correlation coefficient indicates the presence of a connection and takes values from -1 to +1 (see the Chaddock scale). For example, if the analysis of the tightness of the linear correlation between two variables gives a pair linear correlation coefficient equal to -1, this means an exact inverse linear relationship between the variables.

The value of the correlation coefficient can be calculated either from the given sample means or directly.

The geometric meaning of the correlation coefficient: r_xy shows how much the slopes of the two regression lines, y(x) and x(y), differ, i.e., how much the results of minimizing the deviations in x and in y differ. The smaller the angle between the two lines, the closer |r_xy| is to one (at |r_xy| = 1 the lines coincide).
The sign of the correlation coefficient coincides with the sign of the regression coefficient and determines the slope of the regression line, i.e., the general direction of the dependence (increase or decrease). The absolute value of the correlation coefficient is determined by the degree of closeness of the points to the regression line.

Properties of the correlation coefficient

  1. |r_xy| ≤ 1;
  2. if X and Y are independent, then r_xy = 0; the converse is not always true;
  3. if |r_xy| = 1, then Y = aX + b with constants a ≠ 0 and b, and conversely |r_xy(X, aX + b)| = 1;
  4. |r_xy(X, Y)| = |r_xy(a₁X + b₁, a₂Y + b₂)|, where a₁, a₂, b₁, b₂ are constants with a₁·a₂ ≠ 0.

Therefore, to check the direction of the relationship, a hypothesis test using the Pearson correlation coefficient is chosen, with a further check of reliability using the t-test (see the example below).

Typical tasks (see also non-linear regression)

The dependence of labor productivity y on the level of mechanization of work x (%) is studied according to the data of 14 industrial enterprises. The statistical data are given in the table.
Required:
1) Find estimates for the parameters of the linear regression of y on x. Build a scatterplot and plot the regression line on it.
2) At the significance level α = 0.05, test the hypothesis of agreement between the linear regression and the observational results.
3) With reliability γ = 0.95, find confidence intervals for the linear regression parameters.


Example. Based on the data given in Appendix 1 and corresponding to your option (Table 2), you need to:

  1. Calculate the linear pair correlation coefficient and construct the equation of linear pair regression of one feature on another. One of the features corresponding to your option will play the role of the factor (x), the other of the result (y). Establish cause-and-effect relationships between the features on the basis of economic analysis. Explain the meaning of the parameters of the equation.
  2. Determine the theoretical coefficient of determination and the residual variance (unexplained by the regression equation). Draw a conclusion.
  3. Estimate the statistical significance of the regression equation as a whole at the five percent level using Fisher's F-test. Draw a conclusion.
  4. Forecast the expected value of the result y for a predicted value of the factor x equal to 105% of its average level. Assess the accuracy of the forecast by calculating the forecast error and its confidence interval with a probability of 0.95.
Solution. The equation is y = ax + b.
Sample means:
x̄ = Σx_i / n = 78 / 12 = 6.5; ȳ = Σy_i / n = 1503 / 12 = 125.25
Variances:
σ²_x = Σx²_i / n − x̄² = 650 / 12 − 6.5² = 11.92; σ²_y = Σy²_i / n − ȳ² = 190617 / 12 − 125.25² = 197.19
Standard deviations:
σ_x = 3.45; σ_y = 14.04

The relationship between the trait Y and the factor X is very high and direct (determined by the Chaddock scale).
Regression equation:
y = 4.01·x + 99.18
Regression coefficient: k = a = 4.01
Coefficient of determination:
R² = 0.986² ≈ 0.97, i.e., in 97% of cases changes in x lead to a change in y. In other words, the accuracy of the fit of the regression equation is high. Residual variance: 3%.
x    y    x²    y²    x·y    y(x)    (y_i − ȳ)²    (y − y(x))²    (x − x̄)²
1 107 1 11449 107 103.19 333.06 14.5 30.25
2 109 4 11881 218 107.2 264.06 3.23 20.25
3 110 9 12100 330 111.21 232.56 1.47 12.25
4 113 16 12769 452 115.22 150.06 4.95 6.25
5 120 25 14400 600 119.23 27.56 0.59 2.25
6 122 36 14884 732 123.24 10.56 1.55 0.25
7 123 49 15129 861 127.26 5.06 18.11 0.25
8 128 64 16384 1024 131.27 7.56 10.67 2.25
9 136 81 18496 1224 135.28 115.56 0.52 6.25
10 140 100 19600 1400 139.29 217.56 0.51 12.25
11 145 121 21025 1595 143.3 390.06 2.9 20.25
12 150 144 22500 1800 147.31 612.56 7.25 30.25
78 1503 650 190617 10343 1503 2366.25 66.23 143

Note: y(x) values ​​are found from the resulting regression equation:
y(1) = 4.01*1 + 99.18 = 103.19
y(2) = 4.01*2 + 99.18 = 107.2
... ... ...

Significance of the correlation coefficient

We put forward the hypotheses:
H₀: r_xy = 0 — there is no linear relationship between the variables;
H₁: r_xy ≠ 0 — there is a linear relationship between the variables.
In order to test, at significance level α, the null hypothesis that the general correlation coefficient of a normal two-dimensional random variable is equal to zero against the competing hypothesis H₁: r ≠ 0, we calculate the observed value of the criterion:

t_obs = r·√(n − 2) / √(1 − r²) = 0.986·√10 / √(1 − 0.986²) ≈ 18.7

From the Student table we find t_tab(n − m − 1; α/2) = t(10; 0.025) = 2.228.
Since t_obs > t_tab, we reject the hypothesis that the correlation coefficient equals 0. In other words, the correlation coefficient is statistically significant.
Interval estimate for the correlation coefficient (confidence interval)

m_r = √((1 − r²) / (n − 2)) = 0.0529
r − Δr ≤ r ≤ r + Δr
Δr = ±t_tab·m_r = ±2.228·0.0529 = 0.118
0.986 − 0.118 ≤ r ≤ 0.986 + 0.118
Confidence interval for the correlation coefficient: 0.868 ≤ r ≤ 1
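The same significance test and interval estimate in a few lines of Python, a sketch following the formulas used above:

```python
import math
from scipy import stats

r, n, alpha = 0.986, 12, 0.05

# Observed criterion value: t = r * sqrt(n - 2) / sqrt(1 - r^2)
t_obs = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
t_tab = stats.t.ppf(1 - alpha / 2, n - 2)        # 2.228 for df = 10

# Standard error of r and half-width of the confidence interval
m_r = math.sqrt((1 - r**2) / (n - 2))
delta = t_tab * m_r
print(f"t_obs = {t_obs:.1f} > t_tab = {t_tab:.3f}")          # 18.7 > 2.228
print(f"{r - delta:.3f} <= r <= {min(r + delta, 1.0):.3f}")  # 0.868 <= r <= 1.000
```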

Analysis of the accuracy of the estimates of the regression coefficients

The residual variance is S² = Σ(y − y(x))² / (n − 2) = 66.23 / 10 = 6.62, so S = 2.574, and the standard error of the slope is

S_a = S / √(Σ(x − x̄)²) = 2.574 / √143 = 0.2152

Confidence intervals for the dependent variable

Let us calculate the boundaries of the interval in which 95% of the possible values of Y will be concentrated for an unlimited number of observations and X = 7:
(122.4; 132.11)
Testing hypotheses about the coefficients of the linear regression equation

1) t-statistic:

t_a = a / S_a = 4.01 / 0.2152 = 18.6 > t_tab = 2.228

The statistical significance of the regression coefficient is confirmed.
Confidence intervals for the coefficients of the regression equation
Let us determine the confidence intervals of the regression coefficients, which with 95% reliability are as follows:
(a − t·S_a; a + t·S_a) = (3.6205; 4.4005)
(b − t·S_b; b + t·S_b) = (96.3117; 102.0519)

The purpose of correlation analysis is to obtain an estimate of the strength of the connection between random variables (features) that characterize some real process.
The problems of correlation analysis are:
a) measuring the degree of connection (tightness, strength, severity, intensity) between two or more phenomena;
b) selecting the factors that have the most significant impact on the resulting attribute, based on measuring the degree of connection between phenomena; factors significant in this respect are then used in regression analysis;
c) detecting unknown causal relationships.

The forms in which interrelations manifest themselves are very diverse. Their most common types are the functional (complete) and the correlational (incomplete) connection.
A correlational connection manifests itself on average, over mass observations, when a given value of the factor attribute corresponds to a set of probabilistic values of the resulting attribute. A connection is called functional if each value of the factor attribute corresponds to a well-defined, non-random value of the resulting attribute.
The correlation field serves as a visual representation of the correlation table. It is a graph on which the X values are plotted along the abscissa, the Y values along the ordinate, and the combinations of X and Y are shown by dots. The presence of a connection can be judged by the location of the dots.
Tightness indicators make it possible to characterize the dependence of the variation of the resulting trait on the variation of the trait-factor.
A better indicator of the degree of tightness of a correlational connection is the linear correlation coefficient. When calculating this indicator, not only the signs of the deviations of the individual values of the attribute from the mean are taken into account, but also the magnitudes of these deviations.

The key issues of this topic are the equations of the regression relationship between the resulting feature and the explanatory variable, the least squares method for estimating the parameters of the regression model, analyzing the quality of the resulting regression equation, building confidence intervals for predicting the values ​​of the resulting feature using the regression equation.

Example 2


System of normal equations.
a n + b∑x = ∑y
a∑x + b∑x 2 = ∑y x
For our data, the system of equations has the form
30a + 5763 b = 21460
5763 a + 1200261 b = 3800360
From the first equation we express a and substitute into the second equation:
We get b = -3.46, a = 1379.33
Regression equation:
y = -3.46 x + 1379.33
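The same system of normal equations can be solved numerically; a sketch using the aggregate sums of this example:

```python
# Least squares via the normal equations:
#   a*n      + b*sum(x)   = sum(y)
#   a*sum(x) + b*sum(x^2) = sum(x*y)
import numpy as np

n, sum_x, sum_y = 30, 5763, 21460
sum_x2, sum_xy = 1200261, 3800360

A = np.array([[n, sum_x], [sum_x, sum_x2]], dtype=float)
rhs = np.array([sum_y, sum_xy], dtype=float)
a, b = np.linalg.solve(A, rhs)
print(f"y = {b:.2f} x + {a:.2f}")  # y = -3.46 x + 1379.33
```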

2. Calculation of the parameters of the regression equation.
Sample means:
x̄ = Σx_i / n = 5763 / 30 = 192.1; ȳ = Σy_i / n = 21460 / 30 = 715.33
Sample variance:
σ²_x = Σx²_i / n − x̄² = 1200261 / 30 − 192.1² = 3106.29
Standard deviation:
σ_x = 55.73

1.1. Correlation coefficient
Covariance:
cov(x, y) = Σx_i·y_i / n − x̄·ȳ = 3800360 / 30 − 192.1·715.33 = −10736
We calculate the indicator of the closeness of the connection. Such an indicator is the sample linear correlation coefficient, which is calculated by the formula:
r_xy = cov(x, y) / (σ_x·σ_y) = −0.74
The linear correlation coefficient takes values from −1 to +1.
Connections between features can be weak or strong (close). Their criteria are evaluated on the Chaddock scale:
0.1 < |r_xy| < 0.3: weak;
0.3 < |r_xy| < 0.5: moderate;
0.5 < |r_xy| < 0.7: noticeable;
0.7 < |r_xy| < 0.9: high;
0.9 < |r_xy| < 1: very high.
In our example, the relationship between feature Y and factor X is high and inverse.
In addition, the linear pair correlation coefficient can be determined through the regression coefficient b:

r_xy = b·σ_x / σ_y = −0.74
1.2. Regression equation (estimation of the regression equation).

The linear regression equation is y = −3.46·x + 1379.33.

The coefficient b = −3.46 shows the average change in the resulting indicator (in units of y) per unit increase or decrease of the factor x. In this example, as x increases by 1 unit, y decreases by 3.46 on average.
The coefficient a = 1379.33 formally shows the predicted level of y, but only if x = 0 is close to the sample values.
If x = 0 is far from the sample values of x, a literal interpretation can lead to incorrect results, and even if the regression line describes the observed sample values accurately, there is no guarantee that this will also be the case when extrapolating to the left or to the right.
By substituting the corresponding values of x into the regression equation, we can determine the aligned (predicted) values of the resulting indicator y(x) for each observation.
The relationship between y and x determines the sign of the regression coefficient b (if b > 0, the relationship is direct, otherwise inverse). In our example the relationship is inverse.
1.3. Elasticity coefficient.
It is undesirable to use regression coefficients (b in our example) to assess the influence of factors on the resulting attribute directly when the units of measurement of the resulting indicator y and the factor attribute x differ.
For these purposes, elasticity coefficients and beta coefficients are calculated.
The average elasticity coefficient E shows by how many percent, on average, the result y changes from its average value when the factor x changes by 1% from its average value.
The elasticity coefficient is found by the formula:

E = b·x̄ / ȳ = −3.46·192.1 / 715.33 = −0.93

The absolute value of the elasticity coefficient is less than 1. Therefore, if X changes by 1%, Y will change by less than 1%. In other words, the influence of X on Y is not significant.
The beta coefficient shows by what part of its standard deviation the resulting attribute will change, on average, when the factor attribute changes by its standard deviation, the remaining independent variables being fixed at a constant level:

β = b·S_x / S_y = −0.74

That is, an increase of x by one standard deviation S_x leads to a decrease of the average value of Y by 0.74 of its standard deviation S_y.
1.4. Approximation error.
Let us evaluate the quality of the regression equation using the absolute approximation error. The average approximation error is the average deviation of the calculated values from the actual ones:

Ā = (1/n)·Σ |(y_i − y(x_i)) / y_i| · 100%

Since the error is less than 15%, this equation can be used as the regression.
Analysis of variance.
The task of analysis of variance is to decompose the variance of the dependent variable:
Σ(y_i − ȳ)² = Σ(y(x_i) − ȳ)² + Σ(y_i − y(x_i))²,
where
Σ(y_i − ȳ)² is the total sum of squared deviations;
Σ(y(x_i) − ȳ)² is the sum of squared deviations due to the regression ("explained" or "factorial");
Σ(y_i − y(x_i))² is the residual sum of squared deviations.
The theoretical correlation ratio for a linear relationship is equal to the correlation coefficient r_xy.
For any form of dependence, the tightness of the connection is determined using the multiple correlation coefficient:

R = √(1 − Σ(y_i − y(x_i))² / Σ(y_i − ȳ)²)

This coefficient is universal: it reflects the tightness of the connection and the accuracy of the model, and can be used for any form of connection between the variables. When constructing a one-factor correlation model, the multiple correlation coefficient is equal to the pair correlation coefficient r_xy.
1.6. Coefficient of determination.
The square of the (multiple) correlation coefficient is called the coefficient of determination, which shows the proportion of the variation of the resulting attribute explained by the variation of the factor attribute.
Most often, when interpreting the coefficient of determination, it is expressed as a percentage.
R² = (−0.74)² = 0.5413,
i.e., in 54.13% of cases changes in x lead to a change in y. In other words, the accuracy of the fit of the regression equation is average. The remaining 45.87% of the change in Y is explained by factors not taken into account in the model.
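For reference, the summary indicators of this example can be recomputed from the values derived above; a sketch (because r is rounded to two decimals here, R² comes out as 0.5476 rather than the 0.5413 obtained in the text from the unrounded coefficient):

```python
r, b = -0.74, -3.46          # correlation and regression coefficients
x_mean, y_mean = 192.1, 715.33

R2 = r**2                    # coefficient of determination, ~0.55
E = b * x_mean / y_mean      # average elasticity coefficient, ~ -0.93
beta = r                     # in pair regression beta = b*S_x/S_y = r_xy
print(f"R^2 = {R2:.4f}, E = {E:.2f}, beta = {beta:.2f}")
```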
