For each model, start by defining the residual sum of squares (RSS), \(RSS=\sum (y_i - \hat{y}_i)^2\), and then differentiate and solve.
With \(RSS=\sum (y_i - \alpha)^2\), we have \(d RSS/d\alpha = -2 \sum (y_i - \alpha)\). Assuming that the minimum occurs at a turning point, \(d RSS/d\alpha = 0\) and hence \(\hat\alpha = \bar y\). This means that the fitted model is simply a horizontal line through the mean. Note that \(d^2 RSS/d\alpha^2=2n>0\), confirming a minimum.
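As a quick numerical illustration (a minimal sketch using made-up data, not the exercise data), the following confirms that the sample mean minimises the RSS for the constant model:

```python
import numpy as np

# Made-up data, purely to illustrate the result (not the exercise data).
y = np.array([2.1, 3.4, 2.9, 4.0, 3.2])

alpha_hat = y.mean()                          # least-squares estimate for the constant model
rss_at_mean = np.sum((y - alpha_hat) ** 2)

# RSS evaluated over a grid of candidate alpha values is never below rss_at_mean.
grid = np.linspace(y.min(), y.max(), 1001)
rss_grid = ((y[:, None] - grid) ** 2).sum(axis=0)
assert rss_grid.min() >= rss_at_mean - 1e-9
print(alpha_hat, rss_at_mean)
```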
With \(RSS=\sum (y_i - \alpha-\beta x_i)^2\), then \[ \frac{d RSS}{d\alpha} = -2 \sum (y_i - \alpha-\beta x_i) \] and \[ \frac{d RSS}{d\beta} = -2 \sum x_i(y_i - \alpha-\beta x_i). \] Setting these to zero and solving the pair of simultaneous equations gives the usual regression estimates \[ \hat\alpha = \bar y -\hat\beta \bar x, \qquad \hat\beta = \frac{\sum x_iy_i -n \bar x \bar y}{\sum x_i^2-n \bar x^2}. \] Further, the Hessian determinant is \(H= 2n\cdot 2 \sum x_i^2 - 2\sum x_i \cdot 2 \sum x_i = 4 n \sum (x_i - \bar x)^2\), which is positive (provided the \(x_i\) are not all equal), and the diagonal entries of the Hessian are positive; together these confirm a minimum.
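A minimal sketch of these closed-form estimates, using made-up data and a hypothetical helper `simple_regression`, checked against numpy's built-in degree-1 polynomial fit:

```python
import numpy as np

def simple_regression(x, y):
    """Closed-form least-squares estimates for y = alpha + beta * x."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    beta_hat = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x ** 2) - n * x.mean() ** 2)
    alpha_hat = y.mean() - beta_hat * x.mean()
    return alpha_hat, beta_hat

# Quick check against numpy's degree-1 polynomial fit (made-up data).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
print(simple_regression(x, y))
print(np.polyfit(x, y, 1))   # returns [slope, intercept]
```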
With \(RSS=\sum (y_i - \alpha-\beta x_i-\gamma x_i^2)^2\), then \[ \frac{d RSS}{d\alpha} = -2 \sum (y_i - \alpha-\beta x_i-\gamma x_i^2), \] \[ \frac{d RSS}{d\beta} = -2 \sum x_i(y_i - \alpha-\beta x_i-\gamma x_i^2) \] and \[ \frac{d RSS}{d\gamma} = -2 \sum x_i^2(y_i - \alpha-\beta x_i-\gamma x_i^2). \] Setting these to zero and writing the resulting normal equations in matrix form gives \[ \begin{bmatrix} n & \sum x_i & \sum x_i^2 \\ \sum x_i & \sum x_i^2 & \sum x_i^3 \\ \sum x_i^2 & \sum x_i^3 & \sum x_i^4 \end{bmatrix} \begin{bmatrix} \hat\alpha \\ \hat\beta \\ \hat\gamma \end{bmatrix} = \begin{bmatrix} \sum y_i \\ \sum x_i y_i \\ \sum x_i^2 y_i \end{bmatrix}. \] This can be solved to give explicit estimates, and the Hessian should be checked to confirm a minimum.
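The normal equations above can also be solved numerically. A minimal sketch, using made-up data and a hypothetical helper `quadratic_fit`:

```python
import numpy as np

def quadratic_fit(x, y):
    """Solve the 3x3 normal equations for y = alpha + beta*x + gamma*x^2."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(x), x, x ** 2])   # design matrix
    A = X.T @ X    # the matrix of sums shown above
    b = X.T @ y    # the right-hand-side vector
    return np.linalg.solve(A, b)   # (alpha_hat, beta_hat, gamma_hat)

# Made-up data for illustration.
rng = np.random.default_rng(1)
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = 1.0 + 0.5 * x - 0.3 * x ** 2 + rng.normal(0, 0.1, size=x.size)
print(quadratic_fit(x, y))
print(np.polyfit(x, y, 2)[::-1])   # same estimates, lowest-order coefficient first
```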
Clearly, least squares can again be used, with \(RSS=\sum (y_i - \alpha e^{\beta x_i})^2\), followed by the usual mechanics of differentiation and solving.
An alternative approach is to first take logs of the data and the model (provided the \(y\) values are all positive), and consider the minimisation of \(RSS=\sum (\log y_i - \log \alpha -\beta x_i)^2\). Notice that this has the form of a simple linear regression with intercept \(\log \alpha\) and slope \(\beta\), making fitting straightforward.
Note that these two approaches will not, unless there is no error, give the same parameter estimates.
Either of these approaches could be implemented numerically.
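A sketch of both approaches on made-up positive-valued data; the two sets of estimates differ because they minimise RSS on different scales:

```python
import numpy as np
from scipy.optimize import curve_fit

# Made-up positive-valued data.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0, 20)
y = 2.0 * np.exp(0.8 * x) + rng.normal(0, 0.2, size=x.size)

# Approach 1: nonlinear least squares on the original scale.
(alpha_nls, beta_nls), _ = curve_fit(lambda x, a, b: a * np.exp(b * x), x, y, p0=(1.0, 0.5))

# Approach 2: linear regression of log(y) on x (requires all y > 0).
slope, intercept = np.polyfit(x, np.log(y), 1)
alpha_log, beta_log = np.exp(intercept), slope

print(alpha_nls, beta_nls)   # minimises RSS on the y scale
print(alpha_log, beta_log)   # minimises RSS on the log(y) scale; generally different
```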
As with many similar problems, imagine breaking the problem down into steps. If you imagine that you know the location of the change-point, then separate linear functions can be fitted to the data points below the change-point and to those above it. This process can then be repeated for all possible change-point locations, keeping the location that gives the smallest RSS. Note that all potential change-points lying between the same pair of adjacent data points will give the same answer. If performed with the given data, the following graph shows the fitted model with estimated change-point at 0.56.
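A sketch of this grid-search idea, with a hypothetical helper `fit_changepoint` and no exercise data; note that it fits the two lines independently, whereas the model in the exercise may constrain them to meet at the change-point:

```python
import numpy as np

def fit_changepoint(x, y):
    """Grid search over candidate change-points: fit a separate straight line to the
    points on each side of each candidate split and keep the split with the smallest
    total RSS. (Sketch only: the two lines are not forced to join at the change-point.)"""
    order = np.argsort(x)
    x, y = np.asarray(x, dtype=float)[order], np.asarray(y, dtype=float)[order]
    best = None
    for i in range(2, len(x) - 1):           # keep at least two points on each side
        c = 0.5 * (x[i - 1] + x[i])          # any value in this gap gives the same split
        rss = 0.0
        for xs, ys in ((x[:i], y[:i]), (x[i:], y[i:])):
            coef = np.polyfit(xs, ys, 1)
            rss += np.sum((ys - np.polyval(coef, xs)) ** 2)
        if best is None or rss < best[1]:
            best = (c, rss)
    return best   # (estimated change-point, total RSS)
```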
The constant values between data points in (a) make the method very straightforward: no calculations are needed. This approach seems suitable if the response is discrete, since then only the corresponding discrete values can be predicted. In contrast, it does not seem sensible for continuous measurements, where real-valued predictions are more appropriate. Similarly, the linear interpolation in (b) is simple to use, with straightforward calculation. It is most suitable for continuous response variables, but could be used for a discrete response in much the same way that a sample mean is used to give an average value. It is likely to be reliable if the data values are close together, as a linear approximation should then be good. The cubic interpolation in (c) gives a similar answer to (b) for this example, but gives smooth changes near to the data points; in some cases this difference could be important. Finally, the smoothing model in (d) shows a curve which passes close to, but not exactly through, the data points. This gives smooth changes around the data points and a good description of the variation in the data.
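For concreteness, the four approaches could be computed as follows. This is a sketch with made-up data, and the scipy functions named are one possible choice, not necessarily what produced the graphs:

```python
import numpy as np
from scipy.interpolate import interp1d, CubicSpline, UnivariateSpline

# Made-up data standing in for the exercise values.
x = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
y = np.array([1.0, 1.4, 1.3, 2.0, 2.6, 2.5])
xq = np.linspace(0.0, 1.0, 201)

step   = interp1d(x, y, kind="previous")(xq)   # (a) constant between data points
linear = np.interp(xq, x, y)                   # (b) linear interpolation
cubic  = CubicSpline(x, y)(xq)                 # (c) cubic interpolation through the points
smooth = UnivariateSpline(x, y, s=0.2)(xq)     # (d) smoothing spline: close to, not through, the points
```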
Overall, (a) and (b) are much simpler to calculate, and to explain, whereas (c) and (d) involve substantial theory and calculation, requiring computer software. All except (d) pass through the data points exactly. This property is most appropriate if we believe the recorded response values are very accurate, or even that they contain no error at all. If there is measurement error in the data, then there is no reason for a curve to follow that random component. Instead, it is better to accept that points lying away from the general trend are likely to be associated with larger errors.