Sampling
Sampling is the selection of a subset of data from within a statistical population to estimate characteristics of the whole population.
Sampling Factors
- Application - the use to which the data will be put 
- Availability - the availability of data sources and the data itself 
- Bias - aka selection bias, in which proper randomization is not achieved 
- Cost - the cost and resources needed to collect, store, and maintain data 
- Representation - the need to include or exclude specific groups of data 
Methods of Sampling
Which method of sampling is chosen depends on sampling factors such as those shown above. Methods of sampling include:
- Cluster Sampling - samples are selected from data organized into clusters such as by geography 
- Convenience Sampling - samples are selected from data close at hand 
- Quota Sampling - samples are selected non-randomly from each of multiple groups until a preset quota for each group is filled 
- Simple Random Sampling - samples are chosen by chance 
- Snowball Sampling - samples are selected from an initial group and then from further subjects referred by members of that group 
- Stratified Sampling - data is divided into non-overlapping groups (strata) and samples are selected at random from each group 
- Systematic Sampling - aka interval sampling, samples are selected at regular intervals from an ordered list 
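Three of the methods listed above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a definitive implementation. The function names, the `region` field, and the 100-record population are hypothetical and chosen only for the example.

```python
import random
from collections import defaultdict


def simple_random_sample(data, k, seed=0):
    """Pick k items purely by chance, without replacement."""
    return random.Random(seed).sample(data, k)


def systematic_sample(data, step, start=0):
    """Pick every step-th item from an ordered list, beginning at index start."""
    return data[start::step]


def stratified_sample(data, key, fraction, seed=0):
    """Group items by key(item), then draw the same fraction at random from each group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for item in data:
        groups[key(item)].append(item)
    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * fraction))  # at least one item per group
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical population: 100 records, 80 in region "north" and 20 in region "south"
population = [{"id": i, "region": "north" if i < 80 else "south"} for i in range(100)]

srs = simple_random_sample(population, 10)
systematic = systematic_sample(population, step=10)
stratified = stratified_sample(population, key=lambda r: r["region"], fraction=0.1)
```

In this sketch the stratified sample preserves the 80/20 split between regions, which simple random sampling only achieves on average.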
Sample Size Determination
Sample size determination is an important factor in the sampling process. Approaches to sample size determination include one or a combination of methods such as:
Experimentation
During the Modeling Process, various factors such as Loss, Bias, Variance, and Accuracy can be monitored to determine the best sample size.
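One way to run such an experiment is a learning curve: train the same model on progressively larger samples and watch a validation metric. The sketch below is only an illustration under stated assumptions. The data is synthetic (y = 3x + 2 plus noise), the model is a one-variable least-squares line, and the helper names `fit_line` and `mse` are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(model, xs, ys):
    """Mean squared error of the fitted line on (xs, ys)."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)
# Synthetic data (an assumption for this example): y = 3x + 2 with unit-variance noise
xs = [rng.uniform(0, 10) for _ in range(2200)]
ys = [3 * x + 2 + rng.gauss(0, 1) for x in xs]
train_x, train_y = xs[:2000], ys[:2000]
val_x, val_y = xs[2000:], ys[2000:]

# Validation loss as a function of training sample size
curve = {}
for n in (10, 50, 200, 1000, 2000):
    model = fit_line(train_x[:n], train_y[:n])
    curve[n] = mse(model, val_x, val_y)

for n, loss in curve.items():
    print(f"n={n:5d}  validation MSE={loss:.3f}")
```

When the validation loss stops improving as n grows, additional samples are unlikely to help this particular model, which suggests a practical sample size.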
Larger is Better
Some Machine Learning Models, such as Artificial Neural Networks, perform better with large sample sizes. This is in part due to the large number of parameters (the weights connecting network nodes) that must be fitted during the model training process. However, sample size is just one of the factors that can be tuned during modeling, so increasing sample size can be combined with modeling experimentation to achieve optimal results.
Statistical Power
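The Statistical Power of a binary hypothesis test is the probability that the test rejects the null hypothesis when a specific alternative hypothesis is true. Power increases with sample size, so a target power (0.8 is a common choice) can be used to determine the minimum sample size needed to detect an effect of a given size.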
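As a worked example of how power relates to sample size, the sketch below uses the normal approximation for a two-sided, two-sample test with equal group sizes, where n per group is roughly 2 * (z_(1 - alpha/2) + z_power)^2 / d^2 and d is the standardized effect size. The choice of test, the default alpha and power, and the function names are assumptions made for illustration. An exact t-test calculation gives slightly larger values.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample z-test."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)


def approximate_power(n_per_group, effect_size, alpha=0.05):
    """Approximate power for the same test, ignoring the negligible opposite tail."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return _STD_NORMAL.cdf(effect_size * sqrt(n_per_group / 2) - z_alpha)


n_medium = sample_size_per_group(0.5)  # medium effect: 63 per group
n_small = sample_size_per_group(0.2)   # small effect: 393 per group
```

Because the required n scales with 1 / d^2, halving the effect size that must be detected roughly quadruples the sample size.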
Resampling
Resampling is the process of:
- repeatedly drawing, omitting, or exchanging data samples 
- identifying the impact of these changes on model and prediction characteristics 
- continuing until optimal results are achieved 
Resampling can include methods such as:
- Bootstrap - new samples are drawn at random from the original data with replacement 
- Jackknife - a parameter is estimated repeatedly, each time leaving out one observation from the dataset, and the resulting estimates are then averaged 
- Label Exchange - class labels associated with data samples are exchanged (permuted) among the samples, as in a permutation test 
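The three methods above can be sketched with the Python standard library. This is a minimal illustration, not a definitive implementation. The function names and the small datasets are hypothetical, and the label exchange example assumes the quantity of interest is the difference between two group means.

```python
import random
from statistics import mean, stdev


def bootstrap_estimates(data, statistic, n_resamples=1000, seed=0):
    """Recompute the statistic on resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    return [statistic(rng.choices(data, k=n)) for _ in range(n_resamples)]


def jackknife_estimates(data, statistic):
    """Recompute the statistic with each observation left out in turn."""
    return [statistic(data[:i] + data[i + 1:]) for i in range(len(data))]


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Exchange group labels at random and compare mean differences to the observed one."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # shuffling the pool reassigns the labels
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value above zero
    return (extreme + 1) / (n_permutations + 1)


data = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]  # hypothetical observations

boot = bootstrap_estimates(data, mean)
bootstrap_standard_error = stdev(boot)  # spread of the bootstrap means

jack = jackknife_estimates(data, mean)
jackknife_mean = mean(jack)  # for the mean, this equals the plain sample mean

p_value = permutation_p_value([1, 2, 3, 4, 5, 6], [101, 102, 103, 104, 105, 106])
```

The spread of the bootstrap estimates approximates the standard error of the statistic, and a small permutation p-value indicates that the observed group difference is unlikely to arise from label assignment alone.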

