Sampling

Sampling is the selection of a subset of data from within a statistical population to estimate characteristics of the whole population.
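
As a minimal sketch of this idea, the example below draws a simple random sample from a synthetic population and compares the sample mean with the population mean. The population values, the sample size of 200, and the variable names are illustrative assumptions, not part of any particular dataset.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 synthetic measurements (an assumption).
population = [random.gauss(50.0, 10.0) for _ in range(10_000)]

# Simple random sample drawn without replacement.
sample = random.sample(population, k=200)

# The sample statistic serves as an estimate of the population characteristic.
population_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

The two means will generally be close but not identical; the gap is the sampling error that the rest of this section is concerned with.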

Sampling Factors

  • Application - the use to which the data will be put

  • Availability - the availability of data sources and the data itself

  • Bias - the risk of selection bias, in which proper randomization is not achieved

  • Cost - the cost and resources needed to collect, store, and maintain data

  • Representation - the need to include or exclude specific groups of data

Methods of Sampling

The choice of sampling method depends on sampling factors such as those shown above.

Sample Size Determination

Sample size determination is an important factor in the sampling process. Approaches to sample size determination include one or a combination of methods such as:

Experimentation

During the Modeling Process, metrics such as Loss, Bias, Variance, and Accuracy can be monitored as the sample size is varied to determine the best sample size.
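
One way to run such an experiment is to train the same model on progressively larger samples and watch a validation metric. The sketch below is only an illustration under stated assumptions: the data is synthetic, the "model" is a one-feature nearest-centroid classifier, and the helper names (make_data, fit_threshold, accuracy) are hypothetical.

```python
import random
import statistics

def make_data(n, rng):
    """Synthetic two-class data: one feature, class means at 0.0 and 2.0."""
    data = []
    for i in range(n):
        label = i % 2  # alternate labels so both classes are always present
        data.append((rng.gauss(2.0 * label, 1.0), label))
    return data

def fit_threshold(train):
    """1-D nearest-centroid classifier: threshold at the midpoint of class means."""
    mean0 = statistics.fmean(x for x, y in train if y == 0)
    mean1 = statistics.fmean(x for x, y in train if y == 1)
    return (mean0 + mean1) / 2.0

def accuracy(threshold, data):
    """Fraction of points whose predicted class (x > threshold) matches the label."""
    correct = sum((x > threshold) == bool(y) for x, y in data)
    return correct / len(data)

rng = random.Random(0)
validation = make_data(2_000, rng)  # held-out data used only for monitoring
pool = make_data(5_000, rng)        # training pool from which samples are taken

results = {}
for n in (10, 50, 250, 1_000, 5_000):
    threshold = fit_threshold(pool[:n])
    results[n] = accuracy(threshold, validation)
    print(f"n={n:>5}  validation accuracy={results[n]:.3f}")
```

When the monitored metric stops improving as n grows, the additional samples are adding cost without adding much value.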

Larger is Better

Some Machine Learning Models, such as Artificial Neural Networks, perform better with large sample sizes. This is in part due to the number of network weights that need to be fitted during the model training process. However, sample size is just one of the modeling hyperparameters, so increasing sample size can be combined with modeling experimentation to achieve optimal results.

Statistical Power

The Statistical Power of a binary hypothesis test is the probability that the test rejects the null hypothesis when a specific alternative hypothesis is true.
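
Power can be turned into a sample size estimate by fixing the significance level, the desired power, and the effect size to be detected. The sketch below assumes a two-sided, two-sample z-test with equal group sizes and unit variance; the function names and the default values of 0.05 and 0.80 are illustrative choices, and a t-test or other design would give somewhat different numbers.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution

def z_test_power(effect_size, n_per_group, alpha=0.05):
    """Power of a two-sided, two-sample z-test (equal n, unit variance assumed)."""
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    shift = effect_size * math.sqrt(n_per_group / 2.0)
    # Probability that the test statistic falls in either rejection region.
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)

def n_per_group_for_power(effect_size, power=0.80, alpha=0.05):
    """Approximate per-group sample size needed to reach the requested power."""
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    z_power = _STD_NORMAL.inv_cdf(power)
    return math.ceil(2.0 * ((z_crit + z_power) / effect_size) ** 2)

n_needed = n_per_group_for_power(effect_size=0.5)
achieved = z_test_power(effect_size=0.5, n_per_group=n_needed)
print(f"samples per group: {n_needed}, achieved power: {achieved:.3f}")
```

Smaller effect sizes require sharply larger samples, since the required n grows with the inverse square of the effect size in this approximation.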

Resampling

Resampling is the process of:

  • changing/exchanging data samples

  • identifying the impact of these changes on model and prediction characteristics

  • continuing until optimal results are achieved

Resampling can include methods such as:

  • Bootstrap - repeatedly draws random samples from the data with replacement

  • Jackknife - systematically leaves out each observation from the dataset in turn, calculates the estimate on the remaining data, and then averages these estimates

  • Label Exchange - classes associated with data samples are exchanged
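
The three methods above can be sketched in a few lines each. This is a minimal illustration, not a reference implementation: the function names and the two small groups of numbers are made up, and Label Exchange is interpreted here as a permutation test on the difference in group means, which is an assumption about what the method is used for.

```python
import random
import statistics

def bootstrap_standard_error(data, statistic, n_resamples, rng):
    """Spread of the statistic across resamples drawn with replacement."""
    estimates = [
        statistic(rng.choices(data, k=len(data)))
        for _ in range(n_resamples)
    ]
    return statistics.stdev(estimates)

def jackknife_estimate(data, statistic):
    """Average of the statistic over all leave-one-out subsets."""
    leave_one_out = [
        statistic(data[:i] + data[i + 1:])
        for i in range(len(data))
    ]
    return statistics.fmean(leave_one_out)

def label_exchange_p_value(group_a, group_b, n_permutations, rng):
    """Share of label shufflings with a mean difference at least as large as observed."""
    observed = abs(statistics.fmean(group_a) - statistics.fmean(group_b))
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # exchange the class labels among the samples
        a, b = pooled[:len(group_a)], pooled[len(group_a):]
        if abs(statistics.fmean(a) - statistics.fmean(b)) >= observed:
            hits += 1
    return hits / n_permutations

rng = random.Random(0)
# Hypothetical measurements for two classes (illustrative values only).
group_a = [4.1, 5.0, 4.7, 5.3, 4.4, 4.9, 5.1, 4.6]
group_b = [6.9, 7.4, 7.1, 6.6, 7.8, 7.0, 7.3, 6.8]

se = bootstrap_standard_error(group_a, statistics.fmean, 1_000, rng)
jk = jackknife_estimate(group_a, statistics.fmean)
p = label_exchange_p_value(group_a, group_b, 2_000, rng)
print(f"bootstrap SE of mean: {se:.3f}")
print(f"jackknife mean:       {jk:.3f}")
print(f"label exchange p:     {p:.4f}")
```

For the mean, the jackknife average coincides with the ordinary sample mean; the method becomes more informative for statistics whose bias or variance is harder to derive analytically.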