If you’re like me, you never got much exposure to statistics in school. It wasn’t until I started my degree in analytics that I began to delve into what statistics really is and why it matters so much in our lives. Given the importance of statistics, I’d like to put together a bit of a primer on the subject. To start, here are 5 basic concepts that will help you begin to understand it.
1. Population vs. Sample
When you’re pondering a question like, “How many cups of coffee did the average coffee-loving American adult drink in 2018?” you’re examining a population. In short, a population in statistics is just the collection of every possible subject that you’re investigating. This population could be people, like every adult in the United States who drank coffee in 2018. It could also be a collection of objects, if, let’s say, I wanted to examine how long I can expect the tires on my car to last. In this case, the population would be every single tire of the same brand and model that I have on my car.
Now, going around and asking every single American how much coffee they drink or tracking the life of hundreds of thousands or millions of tires is just impossible. I don’t have enough time, energy, or resources to do this, even if I had a team of researchers working with me. This is where we encounter the concept of a sample. A sample is just what it sounds like. It’s a subset, or a selection, of the population that you can use to help examine the population as a whole and begin to draw conclusions. In order to do this though, a sample needs to be truly representative of the population. Using our coffee example, if I were to only collect data from coffee drinkers in Seattle, New York, and Los Angeles, or if I only ended up with data from people in their 20s and 30s, I couldn’t really use that to gather any sort of insight as to what the average American drinks, because I didn’t include data from people who might be 40 or older or live in, say, the rural Midwest.
2. Parameters and Statistics
Now that we have an idea of the difference between a population and sample, let’s consider what we’re examining. This is the measurement, such as the number of cups of coffee or the miles that tires drive before they need to be replaced, that we really care about. Of course, statistics has to be confusing, and use two different terms for what seem at first to be the same thing. A parameter is the measure that we’re considering in relation to the population as a whole. A statistic is that same measure, but only in relation to a sample. This might seem like a meaningless distinction, but here’s why it matters.
A population parameter is generally static, unchanging. The number of cups of coffee that Americans drank in 2018 will not change. An important caveat here is that while a parameter will be static at any given point in time, it is not guaranteed to be so moving forward. This is why it’s important to ensure that our questions in statistics are time-bounded. In the coffee case, we’re looking at coffee consumption only within 2018.
A sample statistic, however, varies from sample to sample. Because we’re only looking at a subset of the population, the measurements we’re getting are an incomplete picture. Ideally, a sample is still representative enough to be a “good enough” picture that we can fill in the blanks with a little guesswork, but the fact remains that the sample is, nonetheless, missing data. It’s this incompleteness that accounts for the margin of error when we use a sample to make an inference about a population as a whole.
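To make the parameter versus statistic distinction concrete, here’s a minimal Python sketch. Everything in it is an assumption for illustration: the population is simulated with made-up numbers (it is not real coffee data), and the names `population`, `parameter`, and `sample_means` are hypothetical. It computes the mean once for the whole population, then for several random samples.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: yearly cups of coffee for 100,000 made-up drinkers.
# These values are simulated for illustration, not real survey data.
population = [max(0.0, random.gauss(300, 120)) for _ in range(100_000)]

# The parameter: computed from the entire population, so it is a single fixed value.
parameter = statistics.mean(population)

# Statistics: each random sample of 500 people gives a slightly different mean.
sample_means = [
    statistics.mean(random.sample(population, 500)) for _ in range(5)
]

print(f"population mean (parameter): {parameter:.1f}")
for i, sample_mean in enumerate(sample_means, start=1):
    print(f"sample {i} mean (statistic): {sample_mean:.1f}")
```

Each sample mean should land near the population mean without matching it exactly, and that gap is the incompleteness described above.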
3. Mean, Median, and Mode
When we think about a concept like the average coffee consumption, what are we really talking about? In general, when we discuss “average”, we mean we want to find a reasonable middle ground. There are people who drink coffee only a couple of times a week or month. Others may drink several cups a day. How do we know where that middle ground is? This is where we encounter what are called “measures of central tendency”. In other words, these define where the middle of our data lies.
The first measure of central tendency is the mean. The mean is what most people think of when they talk about an “average”. The mean, simply put, is the sum of all the values in your dataset, divided by the number of values in the dataset. In effect, this distributes the entire dataset equally among the individual values. If you imagine your data as the landscape in a garden, the mean is what you’d have if you took all the dirt from the garden, and leveled out all the little hills and valleys until you had a nice, smooth, flat surface. The mean is probably the most commonly used measure of central tendency in statistics. It does have one major flaw though: it can be affected, or skewed, by data points that are very high or very low in relation to the rest of the data.
The median is the next most commonly used measure of central tendency. You’ll often see it reported in economic figures such as “median household income”. The median is the value that falls in the very middle of your data, when it’s all ordered and sorted by value. For data sets with an even number of observations (which will have two observations at the midpoint) the median is the mean of the middle two values. It is, quite literally, the middle ground in your data. Because the median is not impacted by very high or very low values, it is generally considered to be more robust than the mean. On the other hand, because it ignores how far the high and low values sit from the middle, the median tells us little about how extreme those values are.
How do we determine the most usual value for a non-numeric variable though? What if we want to know what a “usual” eye color is? You can’t add up or order colors. This is where the last measure of central tendency comes in handy. The mode is the value that appears the most in your dataset. If you look at the eye color of people entering a room and get 3 people with brown eyes, 1 with blue eyes, and 1 with green eyes, it’s reasonable to state that (at least among these individuals) the “usual” eye color would be brown.
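Here’s a small Python sketch of that calculation. The numbers are made up for illustration, and the variable names (`cups_per_week`, `with_outlier`) are hypothetical. It also shows how a single very large value pulls the mean upward.

```python
# Made-up data: cups of coffee per week for five people.
cups_per_week = [5, 7, 7, 8, 10]

# Mean = sum of all values divided by the number of values.
mean = sum(cups_per_week) / len(cups_per_week)  # 37 / 5

# Add one extreme coffee drinker and recompute.
with_outlier = cups_per_week + [70]
mean_with_outlier = sum(with_outlier) / len(with_outlier)  # 107 / 6

print(f"mean without the extreme value: {mean:.1f}")
print(f"mean with the extreme value:    {mean_with_outlier:.1f}")
```

One extra data point moves the mean from 7.4 to roughly 17.8, even though five of the six people drink 10 cups or fewer.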
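A quick Python sketch of the median, using the standard library’s `statistics.median` on made-up numbers (the list names are hypothetical). It covers the odd case, the even case, and what happens when an extreme value is added.

```python
import statistics

# Made-up data: cups of coffee per week. The input does not need to be
# sorted; statistics.median sorts a copy internally.
odd_count = [8, 5, 10, 7, 7]        # sorted: 5, 7, 7, 8, 10
even_count = [8, 5, 10, 7]          # sorted: 5, 7, 8, 10
with_outlier = odd_count + [70]     # sorted: 5, 7, 7, 8, 10, 70

median_odd = statistics.median(odd_count)        # the single middle value
median_even = statistics.median(even_count)      # mean of the two middle values
median_outlier = statistics.median(with_outlier)

print(median_odd, median_even, median_outlier)
```

Adding the 70 only nudges the median from 7 to 7.5, which is the robustness described above.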
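The eye color example can be sketched in Python with `collections.Counter`, which tallies how often each value appears. The list below simply encodes the five people from the example, and the variable names are hypothetical.

```python
from collections import Counter

# The five people from the example: 3 brown, 1 blue, 1 green.
eye_colors = ["brown", "blue", "brown", "green", "brown"]

# Count each color, then take the most frequent one.
counts = Counter(eye_colors)
mode_color, mode_count = counts.most_common(1)[0]

print(f"mode: {mode_color} (seen {mode_count} times)")
```

Note that no arithmetic on the colors is needed, only counting, which is why the mode works for non-numeric data.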
4. Probability
When we discuss probability, we’re talking about the odds, likelihood, or chance that a given condition will be met. By definition, probability is the ratio of the number of times the condition is actually met to the number of times the condition could have been met. If we flip a fair coin, there are two equally likely outcomes: heads or tails. You have one chance in two possible outcomes that it’ll land on heads. In other words, you have a 50% chance of getting heads when flipping a coin. Probability is especially useful when we start analyzing samples of data to determine how reasonable an outcome was, and whether that data supports or contradicts our predictions.
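Here’s a minimal Python sketch of the coin example, assuming a fair coin. It first computes the probability as a ratio of outcomes, then simulates 10,000 flips to see how close the observed share of heads comes to it. The names `p_heads`, `flips`, and `observed_share` are hypothetical.

```python
import random
from fractions import Fraction

# A fair coin has two equally likely outcomes.
outcomes = ["heads", "tails"]

# Probability = outcomes that meet the condition / all possible outcomes.
favorable = [o for o in outcomes if o == "heads"]
p_heads = Fraction(len(favorable), len(outcomes))  # 1 out of 2

# Simulate 10,000 flips (seeded so the run is repeatable).
random.seed(0)
flips = [random.choice(outcomes) for _ in range(10_000)]
observed_share = flips.count("heads") / len(flips)

print(f"theoretical probability of heads: {float(p_heads):.2f}")
print(f"observed share of heads:          {observed_share:.3f}")
```

The observed share should sit close to 0.5 but will rarely equal it exactly, which is the kind of gap we judge when asking how reasonable an outcome was.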
5. Standard Deviation
Last, but not least, let’s discuss standard deviation. Standard deviation is a measure of how spread out your data is in relation to the mean. This is commonly used to examine whether a given data point is significantly high or low. A common rule of thumb is that a data point falling two or more standard deviations away from the mean in either direction is unusual enough to be treated as a potential outlier. For example, if we have a sample that gives the mean number of cups of coffee that someone drinks in a year as 300, with a standard deviation of 120, then someone who drinks only 20 cups in a year is 280 cups, or about 2.3 standard deviations, below the mean, and so drinks significantly less than the average person in this sample. We’ll discuss how the standard deviation is calculated in a later post.
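The coffee example can be checked with a few lines of Python. This sketch takes the mean and standard deviation as given (it does not compute the standard deviation) and uses the two-standard-deviation cutoff purely as a rule of thumb. The variable names are hypothetical.

```python
# Values taken from the example above.
mean_cups = 300      # mean cups per year in the sample
sd_cups = 120        # standard deviation of the sample
cups_observed = 20   # the person we want to check

# How many standard deviations away from the mean is this person?
deviations_from_mean = (cups_observed - mean_cups) / sd_cups  # -280 / 120

# Rule of thumb (an assumption, not a strict definition): flag anything
# two or more standard deviations from the mean, in either direction.
is_unusual = abs(deviations_from_mean) >= 2

print(f"standard deviations from the mean: {deviations_from_mean:.2f}")
print(f"flagged as unusual: {is_unusual}")
```

The negative sign just means the value sits below the mean; it is the size of the number that decides whether the point gets flagged.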