The entire set of items or individuals of interest in a study. Its size is denoted by N.
A subset selected from the larger population. Its size is denoted by n.
A numerical value that describes a characteristic of the entire population. Its sample counterpart is a statistic.
A numerical value that describes a characteristic of a sample and is used to estimate a population parameter. Its population counterpart is a parameter.
A sample in which every member of the population has an equal chance of being selected.
A sample that accurately mirrors the characteristics of the larger population.
A characteristic or attribute that can take on different values or categories. Examples: height, occupation, age.
The classification of data based on its nature. There are two types of data - categorical and numerical.
Data that represents categories or labels without inherent numerical value.
Data that represents quantifiable amounts or values. Can be further classified into discrete and continuous.
Numerical data that can only take on specific, distinct values. Opposite of continuous.
Numerical data that can take any value within a range, so its possible values cannot be counted. Opposite of discrete.
A way to classify data. There are two levels of measurement - qualitative and quantitative.
A subgroup of levels of measurement. There are two types of qualitative data - nominal and ordinal.
A subgroup of levels of measurement. There are two types of quantitative data - ratio and interval.
Nominal level of measurement refers to variables that describe different categories or names. These categories cannot be put in any specific order.
Ordinal level of measurement refers to variables that describe different categories, and they can be ordered.
Ratio level of measurement refers to numerical variables that have a unique and unambiguous (true) zero point, whether the values are whole numbers or fractions. Because zero means 'none of the quantity', ratios between values are meaningful. For example, the temperature in Kelvin is a ratio variable.
Interval level of measurement refers to numerical variables for which differences between values are meaningful, but there is no unique and unambiguous zero point. For example, temperatures in degrees Celsius and Fahrenheit are interval variables.
The number of times a particular value or category occurs in a dataset.
The number of occurrences of a value or category divided by the total number of observations. Usually expressed as a percentage.
The sum of the relative frequencies of all members in a dataset up to a certain point. The cumulative frequency of all members is 100% or 1.
A type of bar chart where frequencies are shown in descending order. There is an additional line on the chart, showing the cumulative frequency.
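The frequency definitions above can be sketched in a few lines of Python. This is only an illustration on made-up data; the list `colors` and all variable names are assumptions, not part of the glossary. Sorting the counts in descending order and adding a running total gives the numbers that a descending bar chart with a cumulative line would plot.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical sample of a categorical variable (made-up data).
colors = ["red", "blue", "red", "green", "blue", "red", "red", "green"]

# Frequency: how many times each category occurs, most common first.
frequency = Counter(colors).most_common()

n = len(colors)

# Relative frequency: each count divided by the number of observations.
relative = [(category, count / n) for category, count in frequency]

# Cumulative frequency: running total of the relative frequencies.
cumulative = list(accumulate(share for _, share in relative))

for (category, count), (_, share), running in zip(frequency, relative, cumulative):
    print(f"{category:<6} {count:>2} {share:>6.1%} {running:>7.1%}")
```

The last cumulative value is 1 (100%), as the definition requires.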
A type of bar chart that represents numerical data. It is divided into intervals (or bins) that are not overlapping and span from the first observation to the last. The intervals (bins) are adjacent - where one stops, the other starts.
A table in a matrix format that displays the frequency distribution of the variables.
A plot that represents numerical data. Graphically, each observation looks like a point on the scatter plot.
The middle number in a dataset sorted in ascending or descending order. With an even number of observations, it is the average of the two middle numbers.
The value that occurs most frequently in the dataset. A dataset can have one mode (unimodal), more than one mode (multimodal), or no mode at all.
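A minimal sketch of the two definitions above, using Python's standard `statistics` module on made-up numbers (the list `values` is an assumption chosen to have an even length and two modes):

```python
import statistics

# Hypothetical data: eight observations, so there is no single middle value.
values = [2, 3, 3, 5, 7, 8, 9, 9]

# Median: with an even count, the average of the two middle values (5 and 7).
median = statistics.median(values)

# Modes: every value tied for the highest count; here both 3 and 9 occur twice.
modes = statistics.multimode(values)

print(median)  # 6.0
print(modes)   # [3, 9]
```

Note that `multimode` returns every value when all of them occur equally often, which corresponds to the "no mode" case in the definition.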
A measure which indicates whether the observations in a dataset are concentrated on one side.
Sample Formula
A formula that is calculated on a sample. The value obtained is a statistic.
A formula that is calculated on a population. The value obtained is a parameter.
Measures that describe the data through the level of dispersion (variability). The most common ones are variance and standard deviation.
Measures the dispersion of the dataset around its mean. It is measured in units squared. Denoted σ² for a population and s² for a sample.
Measures the dispersion of the dataset around its mean. It is measured in original units. Denoted σ for a population and s for a sample.
Measures the dispersion of the dataset relative to its mean: the standard deviation divided by the mean. The coefficient of variation is unitless. Therefore, it is useful when comparing the dispersion across datasets that have different units of measurement.
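The three measures above can be compared side by side with the standard `statistics` module. The data below is made up for illustration, and treating the same list once as a population and once as a sample is an assumption made only to show the two formulas.

```python
import statistics

# Hypothetical data (made up); its mean is 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)

# Population formulas divide by N.
pop_variance = statistics.pvariance(data)   # sigma squared, in units squared
pop_std = statistics.pstdev(data)           # sigma, in original units

# Sample formulas divide by n - 1.
sample_variance = statistics.variance(data)  # s squared
sample_std = statistics.stdev(data)          # s

# Coefficient of variation: standard deviation relative to the mean (unitless).
coefficient_of_variation = sample_std / mean

print(pop_variance, pop_std)
print(round(sample_variance, 3), round(sample_std, 3))
print(round(coefficient_of_variation, 3))
```

The sample variance is slightly larger than the population variance because it divides by n - 1 instead of N.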
A univariate measure summarizes a single variable. Measures that involve two or more variables are multivariate.
A statistical measure that quantifies the degree to which two random variables in a dataset change together. Usually, because of its scale of measurement, covariance is not directly interpretable.
A measure of the strength and direction of a linear relationship between two variables. Very useful for direct interpretation, as it takes on values in the interval [-1, 1]. Denoted ρ_xy for a population and r_xy for a sample.
A statistical measure that describes the extent to which two variables change together. There are several ways to compute it, the most common being the linear correlation coefficient.
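To see why covariance is hard to interpret directly while the correlation coefficient is not, here is a small sketch with the sample formulas written out by hand. The helper names `sample_covariance` and `sample_correlation` and the data are assumptions for illustration.

```python
import math


def sample_covariance(x, y):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def sample_correlation(x, y):
    """Linear correlation: covariance divided by the product of the standard deviations."""
    std_x = math.sqrt(sample_covariance(x, x))
    std_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (std_x * std_y)


# Hypothetical paired observations (made up).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# The same x values expressed in a unit that is 100 times smaller.
x_rescaled = [value * 100 for value in x]

print(sample_covariance(x, y))           # 1.5
print(sample_covariance(x_rescaled, y))  # 150.0
print(round(sample_correlation(x, y), 4))
print(round(sample_correlation(x_rescaled, y), 4))
```

Changing the unit of measurement multiplies the covariance by 100 but leaves the correlation unchanged, which is why the latter is the one usually reported.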
A function that shows the possible values for a variable and the probability of their occurrence.
A continuous, symmetric probability distribution that is completely described by its mean and its variance. Also known as the Gaussian distribution or bell curve.
Another name for the normal distribution. Named after the mathematician Carl Friedrich Gauss, who studied it extensively, although it had been described earlier by other mathematicians.
A normal distribution with a mean of 0 and a standard deviation of 1.
The cumulative frequency of a data value in a frequency distribution.
A variable which has been standardized using the z-score formula - by first subtracting the mean and then dividing by the standard deviation.
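Standardization can be sketched as follows. The data is made up, and using the population standard deviation (`pstdev`) is an assumption made so that the result has a standard deviation of exactly 1.

```python
import statistics

# Hypothetical data (made up); mean 5, population standard deviation 2.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)
std = statistics.pstdev(data)

# z-score formula: subtract the mean, then divide by the standard deviation.
standardized = [(value - mean) / std for value in data]

print(standardized)
print(statistics.mean(standardized))    # mean of the standardized variable
print(statistics.pstdev(standardized))  # standard deviation of the standardized variable
```

The standardized variable has mean 0 and standard deviation 1. Standardizing changes the scale only; it does not change the shape of the distribution.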
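The sampling distribution of the sample mean approaches a normal distribution as the sample size increases, regardless of the shape of the population distribution (provided its variance is finite). As a rule of thumb, a sample of at least 30 is often considered sufficient for the approximation to be reasonable.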
The probability distribution of a given statistic (like the mean or variance) based on all possible samples of a fixed size from a population.
The standard deviation of the sampling distribution of a statistic. For the sample mean, it reflects the variability of sample means and equals the population standard deviation divided by the square root of the sample size, so larger samples have smaller standard errors.
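A small simulation can make the sampling distribution and its standard error concrete. This is a sketch under stated assumptions: the population is exponential with mean 1 and standard deviation 1 (a deliberately skewed choice), the sample size is 30, and 5000 samples are drawn with a fixed seed. None of these numbers come from the glossary.

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

sample_size = 30
num_samples = 5000

# Draw many samples from a skewed population and keep each sample's mean.
sample_means = [
    statistics.fmean(rng.expovariate(1.0) for _ in range(sample_size))
    for _ in range(num_samples)
]

# Empirical standard error: standard deviation of the simulated sample means.
observed_se = statistics.stdev(sample_means)

# Theoretical standard error: population standard deviation / sqrt(n).
theoretical_se = 1.0 / math.sqrt(sample_size)

print(round(statistics.fmean(sample_means), 3))
print(round(observed_se, 3), round(theoretical_se, 3))
```

The two standard errors should be close, and a histogram of `sample_means` would look roughly bell-shaped even though the population itself is strongly skewed.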
The particular value that was estimated through an estimator.
The difference between an estimator's expected value and the true population parameter.
Refers to an estimator's variability. An efficient estimator has minimal variability compared to others.
A function or a rule, according to which we make estimations that will result in a single number.
The specific numerical value obtained from a point estimator.
A function or a rule, according to which we make estimations that will result in an interval.
The categorization of data into discrete groups based on their attributes.
A confidence interval is the range within which you expect the population parameter to lie, with a stated degree of confidence equal to the confidence level, 1 - α (not the significance level, α).
A singular metric that captures the entire variance of a dataset.
The proportion of confidence intervals, built by the same procedure over repeated samples, that would contain the population parameter. Denoted 1 - α.
A threshold value from a statistical table (z, t, F, etc.) associated with a chosen significance level.
A table showing values of the Z-statistic for various probabilities under the standard normal distribution.
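Instead of reading a table, the critical value can be computed from the standard normal distribution. The sketch below builds a confidence interval for a mean under the assumption that the population standard deviation is known, so the z-statistic applies. All numbers (sample mean 100, standard deviation 15, sample size 36, confidence level 95%) are made up for illustration.

```python
import math
from statistics import NormalDist

# Hypothetical inputs (made up).
sample_mean = 100.0
population_std = 15.0  # assumed known, which is what justifies using z
n = 36
confidence_level = 0.95

alpha = 1 - confidence_level

# Critical value: the z that leaves alpha / 2 in each tail of the standard normal.
z_critical = NormalDist().inv_cdf(1 - alpha / 2)

# Standard error of the mean and the resulting half-width of the interval.
standard_error = population_std / math.sqrt(n)
half_width = z_critical * standard_error

interval = (sample_mean - half_width, sample_mean + half_width)

print(round(z_critical, 3))
print(tuple(round(bound, 2) for bound in interval))
```

A higher confidence level gives a larger critical value and therefore a wider interval; a larger sample gives a smaller standard error and a narrower one.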
A statistic that is generally associated with the Student's t-distribution, in the same way the z-statistic is associated with the normal distribution.
A table showing t-statistic values for given probabilities and degrees of freedom.
The number of values in a statistical calculation that are free to vary without violating the data's constraints.
Half the width of a confidence interval: the largest distance expected, at a specific confidence level, between the estimate and the true population parameter. Often expressed as a percentage of the estimate itself.
A testable proposition or assumption about a population parameter.
A procedure that uses sample data to decide whether there is enough statistical evidence to reject a hypothesis.
A default hypothesis for testing. Whenever we are conducting a test, we are trying to reject the null hypothesis.
The hypothesis that contradicts the null hypothesis. It represents the researcher's claim.
The statistical evidence shows that the hypothesis is likely to be true.
The statistical evidence shows that the hypothesis is likely to be false.
A test that examines if a parameter is greater than or less than a specified value. In a one-tailed test, the alternative hypothesis focuses on a difference in one specific direction (either higher than or lower than).
A test that examines if a parameter differs from a specified value. A two-tailed test considers the possibility of a difference in either direction from the null hypothesis.
The probability of rejecting the null hypothesis when it is true. Denoted α. You choose the significance level. A lower level reduces the risk of rejecting a true null hypothesis, but makes it harder to reject a false one.
The part of the distribution for which we would reject the null hypothesis.
Rejecting a null hypothesis that is true. The probability of committing it is α, the significance level.
Failing to reject (accepting) a null hypothesis that is false. The probability of committing it is β.
The probability of correctly rejecting a false null hypothesis (the researcher's goal). Denoted by 1 - β.
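The link between the significance level and the first type of error can be checked by simulation. In this sketch the null hypothesis is true by construction: every sample comes from a normal population with mean 0 and a known standard deviation of 1, and a two-tailed z-test is run at α = 0.05. The seed, sample size and number of trials are arbitrary assumptions.

```python
import random
from statistics import NormalDist, mean

rng = random.Random(42)  # fixed seed so the run is repeatable

alpha = 0.05
n = 25
trials = 2000

# Two-tailed critical value for the chosen significance level.
critical = NormalDist().inv_cdf(1 - alpha / 2)

rejections = 0
for _ in range(trials):
    # The null hypothesis (mean = 0) is true for every simulated sample.
    sample = [rng.gauss(0, 1) for _ in range(n)]
    z = (mean(sample) - 0) / (1 / n ** 0.5)
    if abs(z) > critical:  # z falls in the rejection region
        rejections += 1

# Share of true null hypotheses that were rejected anyway.
type_one_rate = rejections / trials

print(type_one_rate)
```

The rejection rate should land near α, since every rejection here is a rejection of a true null hypothesis.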
A value indicating how many standard deviations an element is from the mean.
The smallest significance level at which the null hypothesis can be rejected based on the observed data.
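As a rough illustration of how this value is used, the sketch below runs a two-tailed z-test. It assumes a known population standard deviation, and the numbers (hypothesized mean 100, sample mean 104, standard deviation 15, sample size 36, α = 0.05) are made up.

```python
import math
from statistics import NormalDist

# Hypothetical inputs (made up).
hypothesized_mean = 100.0  # value stated by the null hypothesis
sample_mean = 104.0
population_std = 15.0      # assumed known, so a z-test applies
n = 36
alpha = 0.05

# Test statistic: distance from the hypothesized mean in standard errors.
z = (sample_mean - hypothesized_mean) / (population_std / math.sqrt(n))

# Two-tailed p-value: probability of a result at least this extreme
# in either direction, if the null hypothesis were true.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

reject_null = p_value < alpha

print(round(z, 2))        # 1.6
print(round(p_value, 4))
print(reject_null)        # False
```

Here the p-value is above 0.05, so the null hypothesis is not rejected at that level; it would be rejected at any significance level larger than the p-value.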