Proposal 3: Differential Item Functioning Analysis

Current position

Differential Item Functioning (DIF) analysis is a well-established statistical procedure that is often used to identify individual questions that may be biased against particular groups of test-takers.

In a DIF analysis the performance on each question of all the members of one group of test-takers is compared with the performance of the members of another group. For example, in a gender-based DIF analysis the results for girls and for boys might be compared for each question in a test. This can help test developers to identify particular items on which members of one of these groups perform in a way that does not match their overall performance on the test. So, for instance, a gender-based DIF analysis might reveal that girls who perform well on the test overall tend to perform less well on a specific question or part of a question. In this case the question should be reviewed to check that it does not have some hidden barriers to accessibility for girls. For example, it might be found that a question set in the context of cricket or motor racing was not as readily accessible to girls as to boys, and so it would be biased against girls and should be amended or rejected. This is a simple and straightforward example which helps to explain the principles behind the use of DIF analysis.
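The matched comparison described above can be sketched in code. The text does not say which DIF procedure the test developers use, so this minimal example assumes one common choice, the Mantel-Haenszel common odds ratio, with pupils matched on their total test score. The group labels, the function names and all of the counts are hypothetical and serve only as illustration.

```python
from collections import defaultdict


def mantel_haenszel_odds_ratio(records):
    """Pooled odds ratio for one item, matching pupils on total score.

    records: iterable of (group, total_score, item_correct), where group is
    "reference" or "focus" (assumed labels) and item_correct is 0 or 1.
    """
    # One 2x2 table per total-score stratum: table[group] = [wrong, right]
    tables = defaultdict(lambda: {"reference": [0, 0], "focus": [0, 0]})
    for group, total_score, item_correct in records:
        tables[total_score][group][item_correct] += 1

    numerator = 0.0
    denominator = 0.0
    for table in tables.values():
        ref_wrong, ref_right = table["reference"]
        foc_wrong, foc_right = table["focus"]
        n = ref_wrong + ref_right + foc_wrong + foc_right
        numerator += ref_right * foc_wrong / n
        denominator += ref_wrong * foc_right / n
    if denominator == 0:
        return None  # ratio is undefined for this data
    return numerator / denominator


def expand(group, total_score, n_right, n_wrong):
    """Turn summary counts into one record per pupil."""
    return ([(group, total_score, 1)] * n_right
            + [(group, total_score, 0)] * n_wrong)


# Hypothetical counts for one question, in two total-score strata.
records = (
    expand("reference", 10, 30, 10) + expand("focus", 10, 20, 20)
    + expand("reference", 20, 36, 4) + expand("focus", 20, 30, 10)
)
ratio = mantel_haenszel_odds_ratio(records)
print(ratio)  # 3.0
```

A ratio close to 1 suggests that pupils with the same total score do equally well on the question whichever group they belong to. A ratio well above 1, as in this invented data, suggests the question is harder for the focus group than their overall performance would predict, so it would be a candidate for review.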

It is important to note that a DIF analysis compares the performance of test-takers in the focus group on each individual question with their overall performance on that particular test. It might show that question 4, say, is biased for or against the members of a particular group relative to the other questions. But a DIF analysis cannot show whether the whole test is biased against the members of the group. So, for example, if all the members of a particular group tend to do worse than other pupils on every question in the test then the mean score of the group will be lower than that of the mainstream pupils, but the DIF analysis itself will not suggest that there are any issues. For this reason DIF analysis is not useful in identifying problems with a test as a whole, only with specific questions within the test. A DIF analysis cannot say 'This test is biased against pupils in this group'. It can only say 'Questions x, y and z are biased against these pupils'.
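This limitation can be illustrated with a small worked example. The counts below are invented: pupils are split into two total-score bands, and the focus group is concentrated in the lower band. The sketch shows a single question, but the same pattern could hold for every question in a test.

```python
# Hypothetical counts for one question:
# (total-score band, group) -> (number of pupils, number answering correctly)
counts = {
    ("low", "reference"): (100, 40),
    ("low", "focus"): (300, 120),
    ("high", "reference"): (300, 240),
    ("high", "focus"): (100, 80),
}


def overall_facility(group):
    """Proportion of the whole group answering the question correctly."""
    pupils = sum(n for (band, g), (n, right) in counts.items() if g == group)
    correct = sum(right for (band, g), (n, right) in counts.items() if g == group)
    return correct / pupils


def matched_gaps():
    """Reference minus focus proportion correct, within each score band."""
    gaps = {}
    for band in ("low", "high"):
        n_ref, right_ref = counts[(band, "reference")]
        n_foc, right_foc = counts[(band, "focus")]
        gaps[band] = right_ref / n_ref - right_foc / n_foc
    return gaps


print(overall_facility("reference"))  # 0.7
print(overall_facility("focus"))      # 0.5
print(matched_gaps())                 # {'low': 0.0, 'high': 0.0}
```

Taken as a whole, the focus group does markedly worse on the question (0.5 against 0.7). Once pupils are compared only with others in the same total-score band, the gap disappears, so a DIF analysis would not flag the question, even though the group is at a disadvantage on the test overall.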

National Assessment development agencies have extensive experience of carrying out some types of DIF analyses during the development phase of their projects. The outcomes of these are used to identify any questions which show bias against particular groups, such as boys or girls, and to guide amendments to these questions. If the sample size allowed, similar DIF analyses could be carried out to identify any questions that might be biased against pupils from different ethnic groups or with different social backgrounds.

However, for the key stage tests a question that was set in a context that was so obviously likely to be biased as cricket or motor racing (or fashion or ballet) would almost certainly be rejected by the test developers at an early stage, so it would not even reach the first pre-test. Only questions that do not carry any obvious cause of bias are selected and trialled, so when the statistical results show that a question is biased against girls (or against boys) there may be no apparent reason for this, with other very similar questions showing no such bias. For example,

'the following question was trialled for a mathematics test:

Put one number in each gap to make the sentences true.

Example

Multiplying by 2 and then by 6 is the same as multiplying by 12 .

a) Multiplying by 3 and then by 2 is the same as multiplying by _____.

b) Multiplying by 4 and then by 6 is the same as multiplying by _____.

The wording, presentation and layout of the two parts of the question are identical. None the less, girls did significantly better than boys (at the one per cent level) in part a), but not in part b). This could have been a random statistical effect, but it still gave the test developers some cause for concern.'

(Clausen-May, 2001, pp. 31-33)

So the statistical results of a DIF analysis showing which items are biased may, or may not, be useful for the selection of questions for a test.

Furthermore, effective DIF analyses rely upon there being enough test-takers in the category which is the focus of the analysis to allow statistically robust conclusions to be drawn. Zieky (1993, p. 346) argues that there must be at least a hundred people in the smaller group, and at least five hundred altogether, for DIF analyses to be used at the development phase, and more for evaluation purposes.
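As a rough illustration, the development-phase minimum quoted from Zieky could be checked before an analysis is attempted. The function and constant names are hypothetical, and "altogether" is assumed here to mean the two groups being compared added together.

```python
MIN_SMALLER_GROUP = 100  # at least a hundred in the smaller group
MIN_TOTAL = 500          # at least five hundred altogether


def meets_development_minimum(reference_n, focus_n):
    """True if both group sizes satisfy the quoted development-phase minimum."""
    smaller = min(reference_n, focus_n)
    total = reference_n + focus_n  # assumption: total of the two groups compared
    return smaller >= MIN_SMALLER_GROUP and total >= MIN_TOTAL


print(meets_development_minimum(400, 100))  # True
print(meets_development_minimum(450, 60))   # False: smaller group under 100
print(meets_development_minimum(150, 150))  # False: fewer than 500 altogether
```

The check covers only the development-phase figures. The source says that more pupils are needed for evaluation purposes but does not give a number, so none is assumed here.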

For a gender DIF analysis the sample size is not likely to be a problem, as there are approximately equal numbers of boys and girls taking any key stage test. Similarly, there may be enough pupils who have English as an additional language (EAL) to offer a viable sub-sample of the total. But where only a relatively small number of pupils belong to a group there may not be enough to provide an adequate sample on which to base any meaningful statistical conclusions. So while the sub-sample of all pupils with EAL may be large enough, there may be too few pupils from each specific ethnic group to allow test developers to carry out a differential item analysis.

For example, a test development agency might want to use a DIF analysis to compare the performance of pupils from different ethnic backgrounds in order to identify particular items on which, say, Caribbean boys or Chinese girls tend to do particularly badly – or particularly well – in relation to their overall performance on the test. However, there may not be enough pupils from these groups to provide an adequate sub-sample for the analysis. Similarly pupils with specific forms of special educational or assessment need may not be adequately represented to allow for a valid DIF analysis for their particular condition. Furthermore, as has been noted above, if the pupils in the focus group tend to do less well or better than mainstream pupils overall in the test then the DIF analysis will not show this.

It should also be noted that even the pupils with one particular condition or disability, such as hearing impaired pupils or dyslexic pupils, may not form a homogeneous group. There may be so much variation between individuals that classing them together is not useful as it ignores their very significant differences.

Key stage test developers routinely carry out at least gender and English as an Additional Language (EAL) DIF analyses as part of the process of developing draft tests. None the less, to Ofqual’s knowledge there has been little published work on this use of DIF analyses.

Ofqual is now considering the possibility of using DIF analyses after the tests have been taken, using the results of the live tests to confirm that none of the questions were biased against a particular group of pupils, or, if they were, to identify these in order to guide the development of questions for later tests. The proposed study would look at the functioning of test items in Key Stage 2 tests from 2008-2010 with respect to a range of pupil background factors, including gender, ethnicity, eligibility for free school meals (FSM), special educational needs (SEN), English as an Additional Language (EAL), Income Deprivation Affecting Children Index (IDACI), and school type. The purpose of this study would be to establish for quality control purposes whether any of the questions were, in fact, biased against pupils in any particular group.

Proposal

To allow DIF analyses to be used in a similar way for evaluation purposes in the future, item-level data (the number of marks awarded for each part of each question to each pupil) would need to be collected, along with relevant pupil background data, for at least a sample of the total cohort of test-takers. However, to make the collection of DIF data worthwhile we must identify the purposes to which any evidence that a particular item may be biased would be put. It would not be possible to amend or remove the item at this stage as the test would already have been taken by the cohort of pupils. Two possible purposes might be:

  • to guide future item development by providing test developers with robust data relating to previously developed items
  • to establish a bank of biased items for further research.
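To make the data requirement concrete, the item-level and background data described above might be held in records along the following lines. This is only a sketch: the class names, field names and the choice of background fields are hypothetical, and a real collection would need to cover whichever groups are eventually selected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemMark:
    """Marks awarded to one pupil for one part of one question."""
    pupil_id: str
    question_part: str  # for example "4a"
    marks_awarded: int


@dataclass(frozen=True)
class PupilBackground:
    """Background data for one pupil (illustrative subset of fields)."""
    pupil_id: str
    gender: str
    eal: bool  # English as an Additional Language
    fsm: bool  # eligible for free school meals


def total_scores(item_marks):
    """Sum each pupil's marks to give the total score used for matching."""
    totals = {}
    for mark in item_marks:
        totals[mark.pupil_id] = totals.get(mark.pupil_id, 0) + mark.marks_awarded
    return totals


marks = [
    ItemMark("P001", "4a", 1),
    ItemMark("P001", "4b", 0),
    ItemMark("P002", "4a", 1),
    ItemMark("P002", "4b", 1),
]
print(total_scores(marks))  # {'P001': 1, 'P002': 2}
```

Keeping marks and background data in separate records linked by a pupil identifier means the same item-level data could be re-analysed for different groups without being collected again.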

Your views are now sought on whether the use of DIF analyses using live data from the released tests, rather than pre-test data, as part of an evaluation procedure would be worthwhile, and if so, which groups should be identified for data collection and analysis and what use should be made of the outcomes.

Responding to Proposal 3

Please complete the response form online. The form is available to print at Annex 1 of the PDF version of this consultation.

3 Responses to “Proposal 3: Differential Item Functioning Analysis”

  1. Steve says:

    On face value a great idea. But isn’t having questions that cover a range of more accessible/less accessible to a group all part of candidates being prepared for a general assessment? Don’t we want girls to tackle questions on cricket and boys questions on patchwork?

  2. Tandi Clausen-May says:

    Yes, Steve is quite right – it is of course more complicated than this. The example of the cricket etc was just intended to explain what a DIF analysis can do – and, most important, what it cannot do. It CANNOT show whether a whole test is biased against a particular group: only whether a specific question is.
    But if we are assessing, say, mathematics, then if we want to set the question in the context of cricket then we must make sure that the test-taker is given all the information s/he needs about cricket within the question. The question should not assess the pupils’ knowledge of the game – only their knowledge of mathematics, and their ability to apply that knowledge appropriately. We might also want to set other questions in different contexts so that overall the test is balanced, and not particularly boy-friendly or girl-friendly.
    But there are certainly times when we might want to include a question which does show differential performance that is gender-related. For example, some studies have found that boys tend to do better than girls on questions that rely heavily on spatial ability. Thus a DIF analysis might indicate that such a question is ‘biased’ – but we would not want to exclude it from the assessment if it assessed an aspect of the curriculum. To do so would distort the assessment, and ultimately would be likely to distort the taught curriculum.
    So yes, we might well want to keep the cricket question – but perhaps with a picture of some people, including girls, playing cricket, and with all the relevant cricket-related background knowledge explained carefully. And perhaps we would have another question set in a context that was likely to be more familiar to girls.
    Thanks for your comment, Steve. Keep them coming!

  3. Michael says:

    If the general assessment were on hobbies, then yes, we would want candidates to be prepared to answer a question on cricket. Otherwise probably not.

    The proposal likely comes at this from the point of view that a maths score should not be affected by the knowledge a candidate has of a hobby. How much the need to understand cricket affects the score is, in general, what DIF is attempting to measure.

    If it turns out that knowledge of cricket doesn’t seem to have an effect on candidates’ scores, then by all means keep cricket questions.

    If, on the other hand, it turns out that knowledge of cricket is what is being measured and not maths, then it makes sense to get rid of the cricket questions.

    I think the key is right now we don’t know if a question about cricket is valid or not. So the proposal is to use DIF to first find out if the questions are measuring maths knowledge or cricket knowledge.
