What you need to know
Sampling is the process of looking at a small selection of people/objects from a wider population in order to learn new information about that population. For example, suppose you want to know about reading habits in the UK. To do this, you might ask 500 people from the UK how many books they finished last year – this is a sample. Ideally, you would ask every single member of the population, since your information would then be more reliable, but this is often too time-consuming/too expensive to do. The benefit of sampling is that it is less time-consuming and less expensive, but it might also be less reliable, and possibly biased. For example, if you only asked people who were leaving a library, then your results would probably be higher than for the population as a whole. The aim is to make a sample as representative of the whole population as possible (i.e., reduce the amount of bias).
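To see how a biased sample can mislead, here is a minimal Python sketch of the reading-habits example. Everything in it is made up for illustration: the population of 10,000 people, the split into light and heavy readers, and the assumption that only people who finished 10 or more books visit the library. The function names are hypothetical too.

```python
import random
from statistics import mean

def build_population():
    """Made-up population: books finished last year by 10,000 people."""
    light_readers = [i % 6 for i in range(8000)]        # 0 to 5 books each
    heavy_readers = [10 + i % 21 for i in range(2000)]  # 10 to 30 books each
    return light_readers + heavy_readers

def random_sample_mean(population, size, rng):
    """Mean of a sample where everyone is equally likely to be asked."""
    return mean(rng.sample(population, size))

def library_sample_mean(population, size, rng):
    """Mean of a sample taken only from people leaving a library.
    Assumption: only people who finished 10+ books visit the library."""
    library_visitors = [books for books in population if books >= 10]
    return mean(rng.sample(library_visitors, size))

rng = random.Random(1)  # fixed seed so the run is repeatable
population = build_population()
print("Whole population:", mean(population))
print("Random sample of 500:", random_sample_mean(population, 500, rng))
print("Library sample of 500:", library_sample_mean(population, 500, rng))
```

Under these assumptions, the random sample should land close to the true population mean, while the library sample comes out much higher, because light readers never get the chance to be asked.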
Before we go any further, however, we should know the following 4 definitions.
- Discrete data (or discrete variables) can only take certain values, e.g. shoe sizes increase in increments of a half: 7, 7.5, 8, 8.5, and so on. You can’t buy size 7.65 shoes.
- Continuous data (or continuous variables) can take any value, e.g. the height of a person can be any number of centimetres: 182cm, 173.4cm, 167.892cm, and so on.
- Primary data is data that you collect first-hand, e.g. you go out into the street and ask people how many books they finished last year.
- Secondary data is data that was collected in the past by other people, e.g. you look at a published survey of 500 people on the internet which asked them about the numbers of books they finished in the previous year.
Note: the first two definitions are about the data itself, whereas the last two are about how it is collected. So, it is possible (almost guaranteed) that your data will be one of the first two types and also one of the last two types. For example, if you measured the heights of 50 different people, then your data would be both continuous and primary.
Primary and secondary data both come with their own advantages and disadvantages.
Primary data is trustworthy, i.e. you know how/when/where it was collected since you collected it yourself. However, for this same reason, collecting primary data can be expensive and time-consuming. Secondary data is cheaper and easier to use since all the data has already been collected for you. However, for this same reason, we might not know how/where/when the data was collected, so we can’t be sure how representative the sample is.
One common way to collect primary data is through questionnaires, i.e. asking the same question(s) to a bunch of people. This can be done face-to-face, or it can be done indirectly e.g. over the internet. Typically, they will involve tick-boxes, but could also involve worded answers. The main thing is to know what makes a questionnaire good or bad. This is easiest to see with a pretty terrible example.
Example: Philippa has designed a questionnaire (seen below) to learn about TV-watching habits. She intends to put this questionnaire to 30 of her classmates. Find 2 problems with this questionnaire and explain why it might be biased.
There are actually more than just two problems with this, so we’ll go through them all.
- There is no timeframe – people are asked how many hours of TV they watch, but is this per day, per week, per month? It’s impossible to know without specifying a timeframe.
- There is no option for someone who watches no TV at all, nor is there an option for someone who watches more than 20 hours of TV. You must include all possible options.
- There is a crossover between the options – if I watch 5 hours of TV, then should I tick the first box or the second? A questionnaire must be clear and easily understandable.
- It isn’t clear whether the hours spent watching TV include time spent watching online streaming services such as Netflix – some people might include this time and others might not. As a result, Philippa won’t really know what the answers to her question mean.
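The missing-option and crossover problems above can be checked mechanically. The sketch below is only an illustration: the questionnaire itself isn't reproduced here, so the tick-box ranges are assumed values chosen to match the problems described (nothing for 0 hours, nothing above 20, and 5 fitting two boxes). It also assumes answers are whole numbers of hours, and the function name `find_problems` is hypothetical.

```python
def find_problems(options, lowest=0):
    """List the problems with a set of tick-box ranges.

    Each option is an inclusive (low, high) pair of whole numbers;
    high=None means "or more".
    """
    problems = []
    ordered = sorted(options, key=lambda option: option[0])

    # Is there a box for the smallest possible answer?
    if ordered[0][0] > lowest:
        problems.append(f"no box for values below {ordered[0][0]}")

    # Compare each box with the next one along.
    for (low_a, high_a), (low_b, high_b) in zip(ordered, ordered[1:]):
        if high_a is None or low_b <= high_a:
            problems.append(f"overlap: {low_b} fits two boxes")
        elif low_b > high_a + 1:
            problems.append(f"gap between {high_a} and {low_b}")

    # Is there an open-ended box for very large answers?
    if ordered[-1][1] is not None:
        problems.append(f"no box for values above {ordered[-1][1]}")

    return problems

# Assumed boxes matching the problems described: 1-5, 5-10, 10-20
flawed = [(1, 5), (5, 10), (10, 20)]
# One possible fix: 0-5, 6-10, 11-20, 21 or more
improved = [(0, 5), (6, 10), (11, 20), (21, None)]

print(find_problems(flawed))
print(find_problems(improved))
```

A check like this only catches gaps and crossovers in the boxes. It can't tell you that the timeframe is missing or that the wording is unclear, so the question itself still needs reading carefully.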
Now, why might this survey be biased? Since she is only asking her classmates, everyone will be the same age as her, and as teenagers tend to watch more TV than adults, her results might be higher than the value for the whole population overall, i.e. not representative.
Considering all the issues we’ve highlighted, here’s a better questionnaire she could use. Also, to make her sample more representative, she could try asking people other than her classmates, e.g. standing outside a local shopping centre and asking passers-by of varying ages.
Additionally, there are a couple of other important things to consider when writing questionnaires.
- Leading questions – questions that include the author’s opinion, e.g. “how amazing was the last Star Wars film?” – this could pressure the person into answering positively, even if they might not agree. They should be avoided, always.
- Personal questions – questions such as “do you have a criminal record?” might make the person feel uncomfortable and result in a dishonest answer. Knowing this, these types of questions should be avoided wherever possible.
1) Tilly wants to know people’s opinions on the outcome of a recent election. She plans to use the following question.
Write down two criticisms of her question and construct a better question.
The first criticism is that Tilly’s question is a leading question – she leads people into agreeing with her opinion that the new government will be a disaster.
The second criticism is that there are not enough options – somebody might have no opinion on the matter, or they could be neutral about it.
Here is an example of an improved question:
2) Saru wishes to learn about people’s spending habits on food. He stands at the end of his road and asks 100 people the following question.
Write down two criticisms of his question and explain why his survey might be biased.
The first criticism is that there are crossovers between the options – if I spend £30 on food each week I won’t know whether to tick the first or the second box.
The second criticism is that there are not enough options – there is no suitable box to tick for someone who spends £150 on food every week.
Because he is only asking people at the end of his own road, Saru’s answers will mostly reflect his own neighbourhood. For example, they would probably be lower than if he asked this question at the end of a different road where the houses were bigger/more expensive. As a result, his survey will probably be biased.