For the statistics nerds here, this may be a dumb question, but it confuses me:

Assume you have a set of data points and fit a straight line y = a + b x through them the usual way. Then you can determine the standard errors of the parameters a and b as well, i.e. s(a) and s(b).

Using these fit results, you can compute a forecast y* = a + b x*. According to Gauss' error propagation, the standard error of this forecast should be s(y*) = SQRT(s(a)² + x*² s(b)²).

OTOH, there is a 'standard error of the estimate' s(y^) = sy SQRT((n - 1)(1 - r²)/(n - 2)), with sy being the standard deviation of the y-values. This formula is said to describe the scatter of the data points' y-values around the fit line. Hence s(y^) is independent of x.

What is the correct formula for the standard error of the forecast y* ?
If it is s(y^), why are you not allowed to apply Gauss here? Is it because a and b are not independent variables?
And I wonder why the error of the forecast should be constant over the entire range of x. I think it should increase when x* exceeds the range covered by the data points.

I looked through some statistics books but found none that really covers this topic (or I missed it).

The formula for the variance of a forecast value f is:
S(f)^2 = S(e)^2 * (1 + (1/N) + (x0 - mean(x))^2/sum((xi - mean(x))^2))
where:
N = number of observations
mean(x) = arithmetic mean of x
x0 = value of x for which you want a forecast
S(e)^2 = sum((yi - (a + b*xi))^2)/(N - 2)
a is the constant, b the slope of the regression function.

The standard error is the square root of that.
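As a rough illustration, the formula for S(f) can be evaluated directly. The sketch below uses plain Python and made-up data; the function name `forecast_se` and the sample values are assumptions for illustration only and are not taken from the attached example.

```python
import math

def forecast_se(xs, ys, x0):
    """Standard error S(f) of a forecast at x0 for the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    a = mean_y - b * mean_x
    # S(e)^2: residual variance with N - 2 degrees of freedom
    se2 = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2)
    # S(f)^2 = S(e)^2 * (1 + 1/N + (x0 - mean(x))^2 / sum((xi - mean(x))^2))
    sf2 = se2 * (1 + 1 / n + (x0 - mean_x) ** 2 / sxx)
    return math.sqrt(sf2)

# Made-up illustrative data (not from the thread)
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

# x0 at the mean of x, at the edge of the data, and well outside the data
for x0 in (3.5, 6.0, 10.0):
    print(x0, forecast_se(xs, ys, x0))
```

The printed values grow as x0 moves away from mean(x), which is the behaviour asked for in the original question.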

Your formula is not correct because the regression parameters a and b are correlated, while Gauss' error propagation in its simple form assumes independent variables.
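The effect of the correlation can be written out explicitly. The following is a sketch using the standard results for a least-squares line with error variance sigma^2; the symbols \sigma^2, \bar{x} and S_{xx} are notation introduced here, not taken from the posts above.

```latex
\begin{align*}
S_{xx} &= \sum_i (x_i - \bar{x})^2, \qquad
\operatorname{Var}(a) = \sigma^2\left(\frac{1}{N} + \frac{\bar{x}^2}{S_{xx}}\right), \qquad
\operatorname{Var}(b) = \frac{\sigma^2}{S_{xx}}, \qquad
\operatorname{Cov}(a,b) = -\frac{\bar{x}\,\sigma^2}{S_{xx}} \\[4pt]
\operatorname{Var}(a + b\,x_0)
  &= \operatorname{Var}(a) + x_0^2 \operatorname{Var}(b) + 2\,x_0 \operatorname{Cov}(a,b) \\
  &= \sigma^2\left(\frac{1}{N} + \frac{\bar{x}^2 - 2\,x_0\,\bar{x} + x_0^2}{S_{xx}}\right)
   = \sigma^2\left(\frac{1}{N} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)
\end{align*}
```

Error propagation for uncorrelated a and b keeps only the first two terms of the second line and drops the covariance term. Adding sigma^2 for the scatter of a new observation around the line gives the "1 +" in S(f)^2, with sigma^2 estimated by S(e)^2.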

Attached you will find a small example.

Reference: Hill / Griffiths / Lim: Principles of Econometrics, Wiley 2008

Your formula gives the standard error of all y-values of the regression, one result for all y-values together (see the example in Excel).
The formula I posted gives the standard error for one single value, which may be in the data set of the regression analysis but may just as well be an additional value, not included in the regression analysis, for a forecast. This is how I understood Walter's question.
It is clear that a formula for a forecast has to take explicitly into account the x0-value for which the forecast is made. The farther the x0-value is from the mean of the x-values on which the regression analysis is based, the greater the standard error becomes.

Thank you Raimund, thank you Pierre, for confirming that Gauss can't be used here. With respect to the formula for S(f), I lean towards Raimund's, since it also depends on x in a plausible way.

But I think S(e) should be replaced by sy there, shouldn't it? Please compare https://de.wikipedia.org/wiki/Lineare_E ... Vorhersage. I have observed more than once, though, that different authors state different (statistical) formulas because they write and condense terms differently, which makes it difficult to find out whether they are actually identical or not.

s(e) should not be replaced by s(y). In the Wikipedia article sigma is defined only implicitly. Under the headline "t-test" you see that the error term is assumed to be normally distributed with mean 0 and variance sigma^2. From this it is clear that the sigma used later is the standard deviation of the error term, which I called s(e).

Likewise, in my reference "Principles of Econometrics", on page 102, it is clear that it is the variance of the errors that is used in the formula I cited.

And this makes sense: if you used s(y) instead of s(e), the quality of the forecast would not depend on the quality of the regression.
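The difference between s(e) and s(y) can be checked on numbers. The sketch below, on made-up data, computes s(e) from the residuals and compares it with sy SQRT((n - 1)(1 - r²)/(n - 2)), the 'standard error of the estimate' from the original question. The variable names are assumptions for illustration.

```python
import math

# Made-up illustrative data (not from the thread)
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

b = sxy / sxx                 # slope
a = mean_y - b * mean_x       # constant
r2 = sxy ** 2 / (sxx * syy)   # squared correlation coefficient

# s(y): standard deviation of the y-values, ignoring the regression
s_y = math.sqrt(syy / (n - 1))
# s(e): standard deviation of the residuals around the fitted line
s_e = math.sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2))
# 'standard error of the estimate' written in terms of s(y) and r^2
s_e_from_r = s_y * math.sqrt((n - 1) * (1 - r2) / (n - 2))

print("s(y) =", s_y)
print("s(e) =", s_e)
print("s(y) * sqrt((n-1)(1-r^2)/(n-2)) =", s_e_from_r)
```

On these data the last two values agree up to rounding, while s(y) is much larger because it still contains the trend in y that the regression explains.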

BTW: I agree 100% with your comment on statistical formulas in different textbooks.

Yes in principle, but the formula for S(f)^2 given earlier is the standard error of the sample values. I think that if the forecast value lies on the regression line, then
S(f)^2 = t^(-1)(n-2,0.683) * S(e)^2 * ((1/N) + (x0 - mean(x))^2/sum((xi - mean(x))^2))
should be used (the standard error for estimating the mean response).
This formula can then be written more simply using Sx and S(b).
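To make the two quantities concrete, the sketch below computes both the standard error of the mean response and that of an individual forecast, written with S(b)^2 = S(e)^2 / sum((xi - mean(x))^2). It uses the textbook forms without any t^(-1) factor; the function name `regression_errors` and the data are assumptions for illustration.

```python
import math

def regression_errors(xs, ys, x0):
    """Return (S(e), SE of the mean response at x0, SE of an individual forecast at x0)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    a = mean_y - b * mean_x
    se2 = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2)
    sb2 = se2 / sxx                                  # S(b)^2, variance of the slope
    s_mean2 = se2 / n + sb2 * (x0 - mean_x) ** 2     # mean response: no "1 +" term
    s_fore2 = se2 + s_mean2                          # individual forecast: adds S(e)^2
    return math.sqrt(se2), math.sqrt(s_mean2), math.sqrt(s_fore2)

# Made-up illustrative data (not from the thread)
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

for x0 in (3.5, 6.0, 10.0):
    s_e, s_mean, s_fore = regression_errors(xs, ys, x0)
    print(x0, s_mean, s_fore)
```

In this form the two variances differ by exactly S(e)^2 at every x0, which is the "1 +" term discussed in this thread.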

The formula I gave is not the standard error of the sample values but that of a value at x0, which may or may not be a sample value.
I do not know what t is, and 2,0.683 is surely a typo, but even if I read it as 2 I cannot see why this multiplication makes sense here.
Sorry, but for the time being I cannot help further.

And BTW, I still believe that the formula I gave you is correct.

This t^(-1) is the inverse of the t distribution for n-2 degrees of freedom and a probability of 0.683 (it returns roughly 0.5 ± 5%). I found it in some old books and papers explaining confidence limits for the estimated forecast (Schätzwert). The probability 0.683 corresponds to the standard error. Your formula matches the confidence limits for the population (though there, too, multiplied by a t^(-1)). The only difference is the term "1+" under the square root.

Some authors drop the t^(-1) for unknown reasons. The errors then become about 2 times greater, so you can say that's a conservative assumption (and you don't have to remember n anymore).

Statistics can be a little messy, especially when authors refrain from deriving their results...
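The size of t^(-1) can be checked numerically. The sketch below uses only the Python standard library and evaluates the inverse t distribution at two probabilities: 0.683 read as a one-sided cumulative probability, and 0.8415 = 1 - (1 - 0.683)/2, the cumulative probability that corresponds to a central two-sided coverage of 0.683. The helper names `t_pdf`, `t_cdf` and `t_quantile` are assumptions for illustration; the numerical integration is a simple stand-in for a library routine.

```python
import math
from statistics import NormalDist

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))

def t_cdf(t, df, steps=2000):
    """CDF for t >= 0, integrating the density on [0, t] with Simpson's rule."""
    if t == 0:
        return 0.5
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_quantile(p, df):
    """Inverse CDF for 0.5 <= p < 1, found by bisection on [0, 20]."""
    lo, hi = 0.0, 20.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for df in (5, 10, 30):
    print(df, t_quantile(0.683, df), t_quantile(0.8415, df))

# Large-n limit from the normal distribution, for comparison
print("normal", NormalDist().inv_cdf(0.683), NormalDist().inv_cdf(0.8415))
```

For large n the two values approach the normal quantiles of about 0.48 and 1.00, so which probability is passed to t^(-1) changes the result by roughly a factor of two.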

Found the errors I made: For calculating t^(-1) for the standard error of a forecast on the regression line, the probability must be set to one minus half the error probability, i.e. 1 - (1 - 0.683)/2 = 0.8415. Then t^(-1) approx. equals 1, so ditching this factor in the formula