# Statistical Methodology

This note gives a concise description of the statistical methodology used on this web site. If you have a strong background in mathematical statistics and are knowledgeable about finite population sampling, then this may be a good place to acquaint yourself with our methods and see how they fit with other tools and techniques of survey sampling. If this note seems to be above your head, please don't worry -- everything is explained much less technically on the main web site.

## The Stratified Ratio Estimator

Assume a finite population $$\wp$$ of $$N$$ items labeled $$1, \ldots, N$$. For each item $$k$$ in the population, $$x_k$$ is known and $$x_k > 0$$. A sample $$s$$ of fixed size $$n$$ is selected from the population using a stratified sample design. We let $$\wp_h$$ and $$s_h$$ denote the items in stratum $$h$$ in the population and in the sample, respectively. The numbers of items in the population and in the sample of stratum $$h$$ are $$N_h$$ and $$n_h$$, respectively. The inclusion probability is $$\pi_k = n_h / N_h$$ for each item $$k$$ in stratum $$h$$.

We observe $$y_k$$ for each item in the sample $$s$$. The parameter of interest is the population total $$Y = \sum_{k \in \wp} y_k$$. We define the ratio estimator of $$Y$$ to be \begin{aligned} \widehat{Y} &= \frac{ \sum_{h} N_h \: \bar{y}_h } { \sum_{h} N_h \: \bar{x}_h } \: X & \\ & \\ &= \frac{ \sum_{k \in s} \pi_k^{-1} \: y_k } { \sum_{k \in s} \pi_k^{-1} \: x_k } \: X \end{aligned} Here $$\bar{y}_h$$ and $$\bar{x}_h$$ are the sample means of $$y$$ and $$x$$ in stratum $$h$$, and $$X = \sum_{ k \in \wp} x_k$$ is the known population total of $$x$$.
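As an illustration only, the weighted form of the ratio estimator can be sketched in a few lines of Python. The function name, the `(stratum, x, y)` tuple layout and the `stratum_sizes` dictionary are assumptions made for this sketch and are not part of the web site.

```python
from collections import defaultdict

def stratified_ratio_estimate(sample, stratum_sizes, x_total):
    """Ratio estimate of the population total Y.

    sample        -- list of (stratum, x, y) tuples for the sampled items
    stratum_sizes -- dict mapping stratum label h to N_h (hypothetical layout)
    x_total       -- known population total X
    """
    # Count the sampled items n_h in each stratum.
    n_h = defaultdict(int)
    for h, _, _ in sample:
        n_h[h] += 1
    # Design weight 1 / pi_k = N_h / n_h for every sampled item in stratum h.
    weights = [stratum_sizes[h] / n_h[h] for h, _, _ in sample]
    y_hat = sum(w * y for w, (_, _, y) in zip(weights, sample))
    x_hat = sum(w * x for w, (_, x, _) in zip(weights, sample))
    return y_hat / x_hat * x_total
```

When $$y_k$$ is exactly proportional to $$x_k$$ in the sample, this returns that proportion times $$X$$, whatever the stratum weights are.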

The sample design is said to be strongly stratified if the population coefficient of variation of $$x$$ is small in each stratum. The stratified ratio estimator is essentially unbiased if the sample design is strongly stratified, and its variance is asymptotically equal to \begin{aligned} V &= \sum_{h} { N_h^2 \left( 1- \frac{n_h}{N_h} \right) \frac{S^2_h}{n_h} } & \\ & \\ \text{where} & \\ & \\ S^2_h &= \frac{ \sum_{k \in \wp_h } \left( e_k - \bar{e}_h \right)^2 } { N_h - 1 } & \\ & \\ \text{with} & \\ & \\ B &= \frac{ \sum_{k \in \wp} {y_k} } { \sum_{k \in \wp} {x_k} } & \\ e_k &= y_k - B \: x_k & \\ & \\ \text{and} & \\ & \\ \bar{e}_h &= \frac{ \sum_{k \in \wp_h} {e_k} } { N_h }. \end{aligned}
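The variance expression can be evaluated directly when $$y$$ is available for the whole population, for example in a simulation. The following is a minimal sketch under that assumption; the function name and data layout are hypothetical, and each stratum needs at least two population items because of the $$N_h - 1$$ divisor.

```python
from collections import defaultdict

def asymptotic_variance(population, n_by_stratum):
    """Asymptotic variance V of the stratified ratio estimator.

    population   -- list of (stratum, x, y) tuples for every population item
    n_by_stratum -- dict mapping stratum label h to the sample size n_h
    """
    # Population ratio B and residuals e_k = y_k - B * x_k, grouped by stratum.
    B = sum(y for _, _, y in population) / sum(x for _, x, _ in population)
    residuals = defaultdict(list)
    for h, x, y in population:
        residuals[h].append(y - B * x)
    V = 0.0
    for h, e in residuals.items():
        N_h, n_h = len(e), n_by_stratum[h]
        e_bar = sum(e) / N_h
        S2_h = sum((ek - e_bar) ** 2 for ek in e) / (N_h - 1)  # requires N_h >= 2
        V += N_h ** 2 * (1 - n_h / N_h) * S2_h / n_h
    return V
```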

## The Ratio Model

To help choose the sample size, develop an efficiently stratified sample design, and guide the analysis of the sample data, we assume a heteroscedastic, zero-intercept regression model for the relationship between $$y$$ and $$x$$: $y_k = \beta \: x_k + \epsilon_k$ We assume that $$\epsilon_1, \ldots, \epsilon_N$$ are mutually independent and that $$E(\epsilon_k) = 0$$ for all $$k$$.

Denoting the expected value of $$y_k$$ as $$\mu_k$$ (so that $$\mu_k = \beta \: x_k$$ under the model) and the standard deviation of $$y_k$$ as $$\sigma_k$$, we assume the heteroscedasticity equation: $\sigma_k = \sigma_0 \: x_k^\gamma, \: \: \sigma_0 > 0.$

We define the error ratio to be $er = \frac{ \sum_{k \in \wp} {\sigma_k} } { \sum_{k \in \wp} {\mu_k} }$ The final parameters of the model are $$\beta$$, $$\gamma$$, and $$er$$.
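Because $$E(\epsilon_k) = 0$$ implies $$\mu_k = \beta \: x_k$$, the error ratio and $$\sigma_0$$ determine one another once $$\beta$$, $$\gamma$$ and the $$x_k$$ are fixed. The sketch below shows that relationship; both function names are hypothetical and the code is an illustration, not the web site's implementation.

```python
def error_ratio(x, beta, gamma, sigma0):
    """er = sum(sigma_k) / sum(mu_k) with sigma_k = sigma0 * x_k**gamma, mu_k = beta * x_k."""
    sigma = [sigma0 * xk ** gamma for xk in x]
    mu = [beta * xk for xk in x]
    return sum(sigma) / sum(mu)

def sigma0_from_error_ratio(x, beta, gamma, er):
    """Invert the definition: sigma0 = er * beta * sum(x_k) / sum(x_k**gamma)."""
    return er * beta * sum(x) / sum(xk ** gamma for xk in x)
```

For example, `sigma0_from_error_ratio(x, beta, gamma, er)` followed by `error_ratio(x, beta, gamma, sigma0)` recovers the original `er`.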

## The Anticipated Variance and Achieved Relative Precision

Motivated by the model, we use the approximation $S^2_h = \frac{ \sum_{k \in \wp_h } e_k^2 } { N_h }.$ Then \begin{aligned} V &= \sum_{h} { N_h^2 \left( 1- \frac{n_h}{N_h} \right) \frac{S^2_h}{n_h} } & \\ & \\ &= \sum_{k \in \wp} { \left( \pi_k^{-1} - 1 \right) e_k^2 } \end{aligned} The anticipated variance is the expected value under the model of the right-hand side of the preceding equation: $AV = \sum_{k \in \wp} { \left( \pi_k^{-1} - 1 \right) \sigma_k^2 }$ Given the sample data, we calculate the achieved relative precision as $\frac { z\: \sqrt{ \sum_{k \in s} { \pi_k^{-1} ( \pi_k^{-1} - 1 ) \:\hat e_k^2 } } } { \sum_{k \in s} { \pi_k^{-1} \: y_k } }$ where $$z$$ is the standard normal coefficient associated with the specified level of confidence and \begin{aligned} \hat e_k &= y_k - b \: x_k & \\ & \\ \text{with} & \\ & \\ b &= \frac{ \sum_{k \in s} \pi_k^{-1} \: y_k } { \sum_{k \in s} \pi_k^{-1} \: x_k } \end{aligned}
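The achieved relative precision uses only the sample values and their inclusion probabilities, so it is straightforward to compute. Here is a minimal sketch, assuming the sample is supplied as `(pi, x, y)` tuples; the function name and the default `z = 1.645` (roughly 90% two-sided confidence) are choices made for the example.

```python
import math

def achieved_relative_precision(sample, z=1.645):
    """Achieved relative precision from (pi_k, x_k, y_k) tuples for sampled items."""
    weights = [1.0 / pi for pi, _, _ in sample]
    y_hat = sum(w * y for w, (_, _, y) in zip(weights, sample))
    x_hat = sum(w * x for w, (_, x, _) in zip(weights, sample))
    b = y_hat / x_hat  # weighted sample ratio
    # Sum of pi^-1 * (pi^-1 - 1) * e_hat^2 over the sample.
    var = sum(w * (w - 1.0) * (y - b * x) ** 2
              for w, (_, x, y) in zip(weights, sample))
    return z * math.sqrt(var) / y_hat
```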

## Efficient Sample Designs and Model-Based Stratification

For any given sample size $$n = \sum_{k \in \wp} { \pi_k}$$ and sufficiently large $$N$$, the anticipated variance $$AV$$ is minimized by choosing a sample design with inclusion probabilities $\pi_k = \frac { n \: \sigma_k} {\sum_{k \in \wp} { \sigma_k } }$ Such a sample design is said to be efficient and the inclusion probabilities are said to be optimal under the assumed model.
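To make the optimality claim concrete, the sketch below computes the optimal inclusion probabilities and the anticipated variance for a small population. It works with $$\sigma_k$$ up to the constant $$\sigma_0$$, which cancels in $$\pi_k$$, and it does not cap probabilities at 1, which is consistent with the large-$$N$$ assumption. The function names are hypothetical.

```python
def optimal_inclusion_probabilities(x, n, gamma):
    """pi_k = n * sigma_k / sum(sigma_k), with sigma_k proportional to x_k**gamma."""
    s = [xk ** gamma for xk in x]
    total = sum(s)
    return [n * sk / total for sk in s]

def anticipated_variance(pi, sigma):
    """AV = sum((1 / pi_k - 1) * sigma_k**2) over the population."""
    return sum((1.0 / p - 1.0) * sk ** 2 for p, sk in zip(pi, sigma))
```

With `x = [1, 4, 9, 16]`, `n = 2` and `gamma = 0.5`, the optimal probabilities are 0.2, 0.4, 0.6 and 0.8, and the anticipated variance (taking $$\sigma_0 = 1$$) is 20, compared with 30 for equal probabilities of 0.5.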

More generally, a sample design is efficient under the assumed model if and only if for any subset of the population the expected sample size within the subset is proportional to the sum of the $$\sigma_k$$ within the subset.

The expected relative precision of $$\widehat{Y}$$ is defined to be $rp = \frac { z \: \sqrt{ \sum_{k \in \wp} ( \pi_k^{-1} -1 ) \sigma_k^2 } } {\sum_{k \in \wp} \mu_k }$ If $$N$$ is sufficiently large and if the sample design is efficient, then the expected relative precision is equal to $rp = \frac { z \: er } { \sqrt { n } }$ where $$er$$ is the error ratio of the assumed model. We often use this equation to guide the preliminary choice of the sample size before we have collected detailed information about the distribution of $$x$$ in the population.
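Solving $$rp = z \: er / \sqrt{n}$$ for $$n$$ gives a preliminary sample size of $$(z \: er / rp)^2$$. A minimal sketch follows; the function name and the default `z = 1.645` are assumptions for the example, and the result ignores any finite population correction because the equation assumes large $$N$$.

```python
import math

def preliminary_sample_size(er, rp, z=1.645):
    """Smallest integer n with z * er / sqrt(n) <= rp."""
    return math.ceil((z * er / rp) ** 2)
```

For instance, an assumed error ratio of 0.5 and a target relative precision of 0.1 at `z = 1.645` gives a sample size of 68.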

Once we do have the detailed information, we usually use model-based stratification to construct a stratified sample design. We choose the number of strata $$H$$ to be large enough for strong stratification, which is often as few as five or ten. Then the strata are constructed by sorting the population by increasing $$\sigma_k$$ and choosing stratum cut points that equalize the sum of $$\sigma_k$$ in each stratum. We use the expected relative precision to help choose the sample size, and then we divide the chosen sample size equally among the $$H$$ strata.
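The construction just described can be sketched as follows. This is one possible reading rather than the web site's code: the midpoint rule used to place an item that straddles a cut point is an assumption of the sketch, as are the function names, and the rounding of $$n / H$$ to whole items is ignored.

```python
from collections import Counter

def model_based_strata(x, H, gamma):
    """Return a stratum index in 0..H-1 for each item, equalizing sum(sigma_k) per stratum."""
    sigma = [xk ** gamma for xk in x]  # sigma_k up to the constant sigma_0
    order = sorted(range(len(x)), key=lambda k: sigma[k])
    total = sum(sigma)
    strata = [0] * len(x)
    cum = 0.0
    for k in order:
        # Assumed rule: place the item by the midpoint of its cumulative-sigma interval.
        mid = cum + sigma[k] / 2.0
        strata[k] = min(int(mid / total * H), H - 1)
        cum += sigma[k]
    return strata

def stratified_inclusion_probabilities(strata, n, H):
    """Equal allocation of n / H items per stratum, so pi_k = (n / H) / N_h."""
    N_h = Counter(strata)
    return [(n / H) / N_h[h] for h in strata]
```

With `x = [2, 1, 1, 2, 1, 1]`, `H = 2`, `gamma = 1` and `n = 2`, the two strata each carry half of the total $$\sigma_k$$, and the resulting inclusion probabilities (0.5 for the large items, 0.25 for the small ones) coincide with $$n \: \sigma_k / \sum \sigma_k$$.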

If the number of strata is sufficiently large, the inclusion probabilities will differ immaterially from the optimal inclusion probabilities and the stratified sample design will be nearly efficient.

## Estimating the Error Ratio

For large $$N$$, the expected relative precision is approximately $rp = \frac { z \: \sqrt{ \sum_{k \in \wp} \pi_k^{-1} \sigma_k^2 } } {\sum_{k \in \wp} \mu_k }$ Assuming an efficient sample design under the assumed model with a given $$\gamma$$, we can write the optimal inclusion probability as \begin{aligned} \pi_k &= \frac { n \; \sigma_k} {\sum_{k \in \wp} \sigma_k } & \\ & \\ &= \frac { n \; x_k^\gamma } { \sum_{k \in \wp} x_k^\gamma } \end{aligned} We can also write $rp = \frac { z \: er } { \sqrt { n } }$ Setting these two expressions for $$rp$$ equal, replacing $$\sigma_k^2$$ by $$e_k^2$$ and $$\mu_k$$ by $$y_k$$, and solving for $$er$$, we get \begin{aligned} {er} = \frac { \sqrt{ \left( \sum_{k \in \wp} { e_k^2/ x_k^\gamma } \right) \: \left( \sum_{k \in \wp} { x_k^\gamma } \right) } } { \sum_{k \in \wp} { y_k } } \end{aligned}

So, conditional on $$\gamma$$, the error ratio $$er$$ is estimated as $\hat{er} = \frac { \sqrt{ \left( \sum_{k \in s} { \pi_k^{-1} \: \hat e_k^2/ x_k^\gamma } \right) \: \left( \sum_{k \in s} { \pi_k^{-1} \: x_k^\gamma } \right) } } { \sum_{k \in s} { \pi_k^{-1} \: y_k } }$ The estimate of $$\gamma$$ is the value that minimizes the right-hand side of the preceding equation.
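A minimal sketch of this two-step estimate is given here. The source does not say how the minimization over $$\gamma$$ is carried out, so the grid search over 0 to 1.5 in steps of 0.01 is an assumption of the sketch, as are the function names and the `(pi, x, y)` tuple layout.

```python
import math

def estimate_error_ratio(sample, gamma):
    """Estimated error ratio for a given gamma, from (pi_k, x_k, y_k) tuples."""
    weights = [1.0 / pi for pi, _, _ in sample]
    y_hat = sum(w * y for w, (_, _, y) in zip(weights, sample))
    x_hat = sum(w * x for w, (_, x, _) in zip(weights, sample))
    b = y_hat / x_hat  # weighted sample ratio
    a = sum(w * (y - b * x) ** 2 / x ** gamma
            for w, (_, x, y) in zip(weights, sample))
    c = sum(w * x ** gamma for w, (_, x, _) in zip(weights, sample))
    return math.sqrt(a * c) / y_hat

def estimate_gamma(sample, grid=None):
    """Grid value of gamma that minimizes the estimated error ratio (assumed search method)."""
    if grid is None:
        grid = [i / 100.0 for i in range(0, 151)]  # assumed search range 0 to 1.5
    return min(grid, key=lambda g: estimate_error_ratio(sample, g))
```

By the Cauchy-Schwarz inequality the product under the square root is smallest when $$|\hat e_k|$$ is proportional to $$x_k^\gamma$$, so a sample whose residuals scale exactly like $$x_k$$ yields an estimate of 1 for $$\gamma$$.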

## Remarks

The simulation provided in the final step of the web site can be used to assess the effectiveness of these methods. We have also had more than thirty years of experience with a large number of applications drawn from electric and gas utility work. A number of lessons have emerged.

It is impossible to develop an optimal sample design without assumptions about the population. We have found the ratio model to be an effective and parsimonious guide to developing a sample design. Its three parameters reflect the three most relevant aspects of the sample design -- $$\beta$$ represents the ratio to be estimated, $$er$$ measures the overall population variability affecting the statistical precision, and $$\gamma$$ characterizes the heteroscedasticity in the population and guides efficient stratification.

Of course the results are best when the model is an accurate characterization of the population. But even if the model is inaccurate, the ratio estimator is asymptotically unbiased and conventional design-based methods can be used to assess its statistical precision.

When the model is reasonably accurate, the error ratio is a reliable guide to the sample size needed for a desired relative precision. In our applications, we have been able to assess the expected error ratio from the circumstances of each application, sometimes by analyzing prior data but often just from general considerations.

As an integral part of the analysis of sample data, we always estimate the error ratio to aid in planning future studies. The simulations suggest that the estimator of the error ratio is somewhat less reliable than the ratio estimator. Therefore, in planning a future study, we try to synthesize an estimate of the error ratio from all relevant experience and past results.

In most of our applications, we have found $$\gamma$$ to be in the range 0.5 to 1. We also find that the required sample size and expected statistical precision generally vary little throughout this range, and that 0.8 usually works well.

Our preferred equation for the achieved relative precision is based on the assumption that the mean of the residuals is small in each stratum. We check this assumption by examining the scatter plot of $$y$$ against $$x$$.

Our three estimators, and our preferred equation for the achieved relative precision of the ratio estimator, are all calculated using $$\pi^{-1}$$ weights applied to each sample item. We use these same equations within any domain identifiable in the sample. We find this very advantageous.

The model is used primarily to plan the sample design. The analysis is assisted by the model, but is not dependent on the accuracy of the model. As indicated by the use of $$\pi^{-1}$$ weights, our estimators are essentially design-based. Their validity rests on the assumption that the sample has been randomly selected following a known sample design with little or no non-response or missing data. We also assume that $$y$$ is observed with little or no measurement error, and especially, no material bias.