We observe \( y_k \) for each item in the sample \( s \). The parameter of interest is the population total \( Y = \sum_{k \in \wp} y_k \). We define the ratio estimator of \( Y \) to be \[\begin{aligned} \widehat{Y} &= \frac{ \sum_{h} N_h \: \bar{y}_h } { \sum_{h} N_h \: \bar{x}_h } \: X & \\ & \\ &= \frac{ \sum_{k \in s} \pi_k^{-1} \: y_k } { \sum_{k \in s} \pi_k^{-1} \: x_k } \: X \end{aligned} \] Here \( \bar{y}_h \) and \( \bar{x}_h \) are the sample means of \( y \) and \( x \) in stratum \( h \), and \( X = \sum_{ k \in \wp} x_k \) is the known population total of \( x \).
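As a minimal sketch of the second form of the estimator, the following Python function applies the \( \pi^{-1} \) weights directly. The function name `ratio_estimate` and its argument names are hypothetical, chosen only for this illustration, and the inputs are assumed to be plain lists of equal length.

```python
def ratio_estimate(y, x, pi, X_total):
    """Weighted ratio estimator of the population total Y.

    y, x    : sample values of y_k and x_k
    pi      : inclusion probabilities pi_k of the sampled items
    X_total : known population total X of x
    All names here are illustrative assumptions.
    """
    num = sum(yk / pk for yk, pk in zip(y, pi))  # sum of y_k / pi_k over the sample
    den = sum(xk / pk for xk, pk in zip(x, pi))  # sum of x_k / pi_k over the sample
    return num / den * X_total
```

For example, when every sampled item satisfies \( y_k = 2 x_k \), the function returns \( 2X \) regardless of the inclusion probabilities.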
The sample design is said to be strongly stratified if the population coefficient of variation of \( x \) is small in each stratum. The stratified ratio estimator is essentially unbiased if the sample design is strongly stratified, and its variance is asymptotically equal to \[\begin{aligned} V &= \sum_{h} { N_h^2 \left( 1- \frac{n_h}{N_h} \right) \frac{S^2_h}{n_h} } & \\ & \\ \text{where} & \\ & \\ S^2_h &= \frac{ \sum_{k \in \wp_h } \left( e_k - \bar{e}_h \right)^2 } { N_h - 1 } & \\ & \\ \text{with} & \\ & \\ B &= \frac{ \sum_{k \in \wp} {y_k} } { \sum_{k \in \wp} {x_k} } & \\ e_k &= y_k - B \: x_k & \\ & \\ \text{and} & \\ & \\ \bar{e}_h &= \frac{ \sum_{k \in \wp_h} {e_k} } { N_h }. \end{aligned} \]
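The variance formula can be evaluated directly when population values are available, for instance in a simulation study or with prior data standing in for the population. The sketch below assumes each stratum is supplied as a dictionary holding the population lists `y` and `x` and the stratum sample size `n`; that data layout and the function name are assumptions made for this example.

```python
def asymptotic_variance(strata):
    """Asymptotic variance V of the stratified ratio estimator.

    strata : list of dicts, one per stratum, with keys
             "y", "x" (population values in the stratum) and "n" (sample size).
    The layout is an illustrative assumption.
    """
    # Population ratio B = sum(y) / sum(x) over all strata
    B = (sum(sum(st["y"]) for st in strata)
         / sum(sum(st["x"]) for st in strata))
    V = 0.0
    for st in strata:
        e = [yk - B * xk for yk, xk in zip(st["y"], st["x"])]  # residuals e_k
        N_h = len(e)
        e_bar = sum(e) / N_h                                   # stratum mean residual
        S2_h = sum((ek - e_bar) ** 2 for ek in e) / (N_h - 1)  # stratum variance
        n_h = st["n"]
        V += N_h ** 2 * (1 - n_h / N_h) * S2_h / n_h
    return V
```

If \( y_k = B \, x_k \) exactly for every item, all residuals vanish and the function returns zero.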
Denoting the expected value of \( y_k \) as \( \mu_k \) and the standard deviation of \( y_k \) as \( \sigma_k \), we assume the heteroscedasticity equation: \[ \sigma_k = \sigma_0 \: x_k^\gamma, \: \: \sigma_0 > 0. \]
We define the error ratio to be \[ er = \frac{ \sum_{k \in \wp} {\sigma_k} } { \sum_{k \in \wp} {\mu_k} } \] The final parameters of the model are \( \beta \), \(\gamma \), and \( er \).
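To illustrate how the three parameters determine the \( \sigma_k \), the sketch below solves the error ratio definition for \( \sigma_0 \). It assumes the model mean is \( \mu_k = \beta \, x_k \), which is how we read the role of \( \beta \) here, and that every \( x_k \) is positive; the function name is hypothetical.

```python
def model_sigmas(x, beta, gamma, er):
    """Return sigma_k = sigma_0 * x_k**gamma consistent with a given error ratio.

    Assumes mu_k = beta * x_k (an assumption of this sketch) and x_k > 0.
    """
    sum_mu = beta * sum(x)                    # sum of mu_k over the population
    sum_xg = sum(xk ** gamma for xk in x)     # sum of x_k**gamma
    sigma_0 = er * sum_mu / sum_xg            # solves er = sum(sigma_k) / sum(mu_k)
    return [sigma_0 * xk ** gamma for xk in x]
```

By construction, the returned values satisfy \( \sum_k \sigma_k = er \sum_k \mu_k \).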
More generally, a sample design is efficient under the assumed model if and only if for any subset of the population the expected sample size within the subset is proportional to the sum of the \( \sigma_k \) within the subset.
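Applied to subsets consisting of a single item, the condition says that each inclusion probability is proportional to \( \sigma_k \). A minimal sketch of that calculation follows; the function name is hypothetical, and the check for probabilities above one is our own safeguard rather than part of the stated condition.

```python
def efficient_inclusion_probs(sigma, n):
    """Inclusion probabilities proportional to sigma_k with expected sample size n."""
    total = sum(sigma)
    pi = [n * s / total for s in sigma]
    if max(pi) > 1:
        # Not covered by this sketch: items this large would need separate handling.
        raise ValueError("some pi_k exceed 1; reduce n or treat large items separately")
    return pi
```

The probabilities sum to \( n \), so the expected sample size in any subset is \( n \) times that subset's share of \( \sum_k \sigma_k \).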
The expected relative precision of \( \widehat{Y} \) is defined to be \[ rp = \frac { z \: \sqrt{ \sum_{k \in \wp} ( \pi_k^{-1} -1 ) \sigma_k^2 } } {\sum_{k \in \wp} \mu_k } \] If \( N \) is sufficiently large and if the sample design is efficient, then the expected relative precision is equal to \[ rp = \frac { z \: er } { \sqrt { n } } \] where \( er \) is the error ratio of the assumed model. We often use this equation to guide the preliminary choice of the sample size before we have collected detailed information about the distribution of \( x \) in the population.
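Solving \( rp = z \, er / \sqrt{n} \) for \( n \) gives \( n = (z \, er / rp)^2 \). The sketch below implements both the general definition and this planning shortcut. The function names are hypothetical, and the default \( z = 1.645 \) (a two-sided 90 percent confidence level) is an assumption of this example, not a value taken from the text.

```python
import math

def expected_rp(pi, sigma, mu, z=1.645):
    """Expected relative precision from the general definition."""
    var_sum = sum((1 / pk - 1) * sk ** 2 for pk, sk in zip(pi, sigma))
    return z * math.sqrt(var_sum) / sum(mu)

def planning_sample_size(er, rp_target, z=1.645):
    """Smallest n with z * er / sqrt(n) <= rp_target (large N, efficient design)."""
    return math.ceil((z * er / rp_target) ** 2)
```

For example, with an error ratio of 0.5 and a target relative precision of 0.1, `planning_sample_size(0.5, 0.1)` gives a planning sample size of 68 at the assumed \( z \).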
Once we do have the detailed information, we usually use model-based stratification to construct a stratified sample design. We choose the number of strata \( H \) to be sufficiently large for strong stratification, but often as small as five or ten. Then the strata are constructed by sorting the population by increasing \( \sigma_k \) and choosing stratum cut points that equalize the sum of \( \sigma_k \) in each stratum. We use the expected relative precision to help choose the sample size, and then we divide the chosen sample size equally among the \( H \) strata.
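The stratum construction described above can be sketched as follows. Each item is assigned to a stratum according to where the midpoint of its \( \sigma_k \) falls in the cumulative sum of the sorted \( \sigma_k \); this midpoint rule is one simple way to place the cut points and is an assumption of this sketch, as is the function name.

```python
def model_based_strata(sigma, H):
    """Assign each population item a stratum label 0..H-1.

    Items are sorted by increasing sigma_k and cut so that each stratum
    holds roughly 1/H of the total of sigma_k.
    """
    order = sorted(range(len(sigma)), key=lambda k: sigma[k])
    total = sum(sigma)
    labels = [0] * len(sigma)
    cum = 0.0
    for k in order:
        # Share of the cumulative sum at the midpoint of item k (assumed rule)
        share = (cum + sigma[k] / 2) / total
        labels[k] = min(int(share * H), H - 1)
        cum += sigma[k]
    return labels
```

With the labels in hand, equal allocation assigns \( n / H \) sample items to each stratum. A stratum may contain fewer population items than its allocation, in which case the design would need adjusting; that case is not handled here.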
If the number of strata is sufficiently large, the inclusion probabilities will be immaterially different from the optimal inclusion probabilities and the stratified sample design will be nearly efficient.
So, conditional on \( \gamma \), the error ratio \( er \) is estimated as \[ \hat{er} = \frac { \sqrt{ \left( \sum_{k \in s} { \pi_k^{-1} \: \hat e_k^2/ x_k^\gamma } \right) \: \left( \sum_{k \in s} { \pi_k^{-1} \: x_k^\gamma } \right) } } { \sum_{k \in s} { \pi_k^{-1} \: y_k } }. \] The estimate of \( \gamma \) is the value that minimizes the right-hand side of the preceding equation.
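A minimal sketch of this estimation step is given below. It assumes the residuals are \( \hat e_k = y_k - \hat B \, x_k \), with \( \hat B \) the weighted sample ratio, and that every \( x_k \) is positive. The minimization over \( \gamma \) uses a simple grid search from 0 to 2 in steps of 0.01; the grid, like the function names, is an assumption of this example rather than a prescribed method.

```python
import math

def error_ratio_hat(y, x, pi, gamma):
    """Estimated error ratio for a given gamma, using 1/pi_k weights."""
    w = [1 / pk for pk in pi]
    sum_wy = sum(wk * yk for wk, yk in zip(w, y))
    B_hat = sum_wy / sum(wk * xk for wk, xk in zip(w, x))  # weighted sample ratio
    a = sum(wk * (yk - B_hat * xk) ** 2 / xk ** gamma
            for wk, yk, xk in zip(w, y, x))
    b = sum(wk * xk ** gamma for wk, xk in zip(w, x))
    return math.sqrt(a * b) / sum_wy

def estimate_gamma(y, x, pi, grid=None):
    """Grid value of gamma that minimizes the estimated error ratio."""
    if grid is None:
        grid = [i / 100 for i in range(201)]  # assumed search range 0.00 .. 2.00
    return min(grid, key=lambda g: error_ratio_hat(y, x, pi, g))
```

The estimated error ratio is then `error_ratio_hat(y, x, pi, estimate_gamma(y, x, pi))`. By the Cauchy-Schwarz inequality, the minimum is attained when \( |\hat e_k| \) is proportional to \( x_k^\gamma \) across the sample.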
It is impossible to develop an optimal sample design without assumptions about the population. We have found the ratio model to be an effective and parsimonious guide to developing a sample design. Its three parameters reflect the three most relevant aspects of the sample design: \( \beta \) represents the ratio to be estimated, \( er \) measures the overall population variability affecting the statistical precision, and \( \gamma \) characterizes the heteroscedasticity in the population and guides efficient stratification.
Of course the results are best when the model is an accurate characterization of the population. But even if the model is inaccurate, the ratio estimator is asymptotically unbiased and conventional design-based methods can be used to assess its statistical precision.
When the model is reasonably accurate, the error ratio is a reliable guide to the sample size needed for a desired relative precision. In our applications, we have been able to assess the expected error ratio from the circumstances of each application, sometimes by analyzing prior data but often just from general considerations.
As an integral part of the analysis of sample data, we always estimate the error ratio to aid in planning future studies. The simulations suggest that the estimator of the error ratio is somewhat less reliable than the ratio estimator. Therefore, in planning a future study, we try to synthesize an estimate of the error ratio from all relevant experience and past results.
In most of our applications, we have found \( \gamma \) to be in the range 0.5 to 1. We also find that the required sample size and expected statistical precision generally vary little throughout this range, and that 0.8 usually works well.
Our preferred equation for the achieved relative precision is based on the assumption that the mean of the residuals is small in each stratum. We check this assumption by examining the scatter plot between \( y \) and \(x \).
Our three estimators, and our preferred equation for the achieved relative precision of the ratio estimator, are all calculated using \( \pi^{-1} \) weights applied to each sample item. A practical advantage is that the same equations can be used within any domain identifiable in the sample.
The model is used primarily to plan the sample design. The analysis is assisted by the model, but is not dependent on the accuracy of the model. As indicated by the use of \( \pi^{-1} \) weights, our estimators are essentially design-based. Their validity rests on the assumption that the sample has been randomly selected following a known sample design with little or no non-response or missing data. We also assume that \( y \) is observed with little or no measurement error and, especially, with no material bias.