How Are Psychometric Tests Designed?

 

It is a simple enough task to write a 'test' that looks like a test - to the untrained eye it may look plausible enough. However, the quality of a test is determined by its psychometric and scaling properties, not by what the test items look like. Our practice tests have been developed according to the guidelines laid down by Kline (1995) and other respected experts in the test development field.

Is It Easy to Write a Psychometric Test?

If you have ever seen an aptitude test you may well have thought ‘This looks simple enough, I could write one of these!’ On the first point you would be correct: they do indeed look simple to produce, but the illusion of simplicity ends there.

Would you be surprised to discover that it can take two or three years to develop a single basic-level test, and can involve literally thousands of test subjects? This is one reason why test publishers are so jealously protective of their tests and charge so much money for clients to use them!

 

Why Is There So Much Involved in Developing a Test?

A properly designed and constructed test must have certain technical properties. Of course there are tests in circulation which do not reach these standards, but the major test publishers have spent the last few years working with the various professional associations, the main one being the British Psychological Society, to produce standards to which test development will ideally adhere.

Having said that, some of these standards do not apply to certain types of test. For instance, there would be little point in trying to write appropriately ‘difficult’ items for a personality test, since we have seen that in such a case we are more interested in the test taker’s performance under normal, rather than difficult, conditions.

It is also true that in the case of some highly job-specific work samples, e.g. a test pilot flying an aircraft to see how well he or she can recover from a spin or stall, it is clearly not feasible to amass a database of how several thousand other people have performed on the same test.

 

1.      Degree of Difficulty. 

If all of the questions in a test were so easy that everyone could answer them correctly, or so difficult that no-one could answer them correctly, then that test would tell us nothing about the differences between people on the particular ability or aptitude it was measuring. We need to avoid this, because in selection and recruitment particularly it is those very differences in which we are interested.

As a result, a large number of test items are written initially, often five or ten times as many as will eventually be used. This is because most individual test items will fail the crucial test of difficulty: roughly equal numbers of people should get a test question right and wrong. Several hundred people need to complete each test question before we can determine whether this is the case.
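To make this concrete, item difficulty is usually summarised as the proportion of trial respondents who answer the item correctly, and items that nearly everyone gets right or nearly everyone gets wrong are dropped. The sketch below is purely illustrative (the 0.3 to 0.7 band and the trial data are assumptions, not any publisher's actual rules), with answers coded 1 for correct and 0 for incorrect:

```python
# Minimal sketch: screen trial items by difficulty.
# Each item's responses are coded 1 (correct) or 0 (incorrect); in practice
# several hundred trial participants would answer each item.

def item_difficulty(responses):
    """Proportion of respondents answering the item correctly."""
    return sum(responses) / len(responses)

def screen_items(item_responses, low=0.3, high=0.7):
    """Keep items whose difficulty falls in a middle band.

    The 0.3-0.7 band is an illustrative choice, not a universal standard;
    the idea is simply to drop items that nearly everyone gets right or
    nearly everyone gets wrong.
    """
    kept = {}
    for item_id, responses in item_responses.items():
        p = item_difficulty(responses)
        if low <= p <= high:
            kept[item_id] = p
    return kept

# Hypothetical trial data for three items.
trial = {
    "item_1": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],   # too easy (p = 0.9)
    "item_2": [1, 0, 1, 0, 1, 1, 0, 0, 1, 0],   # usable   (p = 0.5)
    "item_3": [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],   # too hard (p = 0.1)
}
print(screen_items(trial))   # {'item_2': 0.5}
```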

2.      Degree of Accuracy. 

Without getting too technical, most test developers agree that tests and test use are prone to error. This error can come from the test itself, e.g. poorly written or easily misunderstood items, or from the test administration process, e.g. instructions not being adhered to properly, or time limits not being followed.

In the case of a test, this error affects how closely the test will measure the characteristic it is designed to measure. This is known as Reliability. The more reliable the test, the more stable and accurate it is.

Think of reliability like a ruler. If the ruler is made from wood, then one would not expect the measurements of length it provides to vary much. If we measured something such as a person’s height one day, and then measured it the next day with the same ruler, we would expect a high degree of correlation.

If, on the other hand, the ruler were made of rubber, then we would see a large variation from one day to the next. Measuring the same height (which we know has not had time to change) on two occasions might produce two very different values.

The exact same principle applies to tests: they too have to be stable and accurate before they can be used. The easiest way to establish whether a test possesses such Reliability is to administer it to a group of people, and then a few weeks later administer it to them again. If the test is stable and reliable we would expect to see a high positive correlation between the scores obtained on the two occasions (for the mathematically inclined, the accepted criterion is a correlation of 0.7 or higher).
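As a purely illustrative sketch (the scores below are hypothetical, and this is not any publisher's actual procedure), the test-retest check amounts to correlating each person's score from the first administration with their score from the second:

```python
import statistics

# Hypothetical scores for the same ten people on two administrations of the
# same test, a few weeks apart.
first_sitting = [42, 35, 50, 28, 44, 39, 31, 47, 36, 41]
second_sitting = [40, 37, 52, 30, 43, 38, 33, 45, 35, 42]

# Pearson correlation between the two sets of scores
# (statistics.correlation requires Python 3.10 or later).
r = statistics.correlation(first_sitting, second_sitting)

# 0.7 is the conventional test-retest criterion mentioned in the text.
print(f"test-retest reliability r = {r:.2f}")
print("acceptably reliable" if r >= 0.7 else "too unstable to use")
```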

This notion of Reliability again boils down to the quality of the test questions.

3.      Degree of Relevance to the Job 

The fundamental question here is this: ‘Does the test measure what it claims to measure?’ This may seem like a strange thing to ask; surely a test which contains numerical problems for the person to solve is measuring numerical ability.

This is not necessarily the case. Many tests, even numerical or abstract reasoning tests with complex instructions, rely on a good grasp of verbal comprehension in order to be completed successfully.

In these cases, although the test claims to measure numerical ability, and the employer may well interpret the test scores in that light, they may be overlooking the fact that the test is, to some unknown extent, a measure of verbal ability. This is quite a challenge to overcome during the test development process.

This concept of whether a test measures what it claims to measure is known as the ‘Validity’ of a test, and it is most often established by examining statistically the degree of correlation between our own test and another established test of the same characteristic or ability. In the case of Validity the accepted degree of correlation is 0.3 or higher.
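The same correlation idea can be sketched for a simple validity check: correlate candidates' scores on the new test with their scores on an established test of the same ability. The data below are hypothetical; only the 0.3 criterion comes from the text:

```python
import statistics

# Hypothetical paired scores: the same candidates sat both the new test and
# an established test of the same ability.
new_test = [12, 18, 25, 9, 21, 15, 30, 11, 24, 17]
established_test = [45, 52, 60, 38, 55, 47, 66, 41, 58, 50]

r = statistics.correlation(new_test, established_test)

# 0.3 is the conventional minimum given in the text for validity evidence.
print(f"validity correlation r = {r:.2f}, "
      f"{'acceptable' if r >= 0.3 else 'insufficient'}")
```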

4.      Establishing a Benchmark Against Which to Interpret Test Scores.

If I were to tell you that you had scored 25 on some test, what would you think? Would you think that was a good score, a poor score, average?

In truth, a standalone figure like that means nothing, because you have no idea whether you were awarded 1, 2 or 3 points, or whatever, for a correct answer.

If I then said your score was 25 out of 100, what would you think? You might think that was not such a good score. But why? If the test was particularly difficult, your score might be amongst the best.

If I then said to you that your score was 25 out of 100, and that this was an average score, then you might now know what I meant, but not quite. There is still one piece of information missing. You now know your score is average, but average compared to whom? 16-year-old school leavers? Graduates? You still do not have a definite idea of what your score means.

If I finally told you that your test score was 25 out of 100, and that this was average compared to the test scores of graduates obtained on the same test, you would finally know what your score meant.

This is what happens in ability testing: a test score is interpreted in relation to some comparison group. This is a fundamental concept in testing.

Very often a test manual can provide information on the scores obtained by several comparison groups, e.g. school leavers, A-level students, diploma students, degree-level students, the general population, junior, middle and senior managers, and skilled, semi-skilled and unskilled workers. Each group has to have information based on several hundred (or even several thousand) subjects. The aim is to produce a set of these ‘norm groups’ to enable the employer to compare a candidate’s test score with the performance of a known group of people who are as similar to the final job holder as possible.
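To make the idea of a norm group concrete, here is a minimal, entirely hypothetical sketch: the raw score of 25 from the example above is placed against a made-up ‘graduate’ norm group by working out what proportion of that group scored below it (a percentile). A real norm table would be built from hundreds or thousands of scores:

```python
# Hypothetical "graduate" norm group; a real norm table would be built from
# hundreds or thousands of scores on the same test.
graduate_norms = [16, 18, 20, 21, 22, 23, 24, 25, 26, 27, 28, 30, 32, 35]

def percentile(raw_score, norm_group):
    """Percentage of the norm group scoring strictly below the raw score."""
    below = sum(1 for s in norm_group if s < raw_score)
    return 100 * below / len(norm_group)

candidate_score = 25  # the "25 out of 100" from the example above
print(f"percentile vs. graduates: {percentile(candidate_score, graduate_norms):.0f}")
# The candidate sits in the middle of this (made-up) graduate group, which
# is what "average compared to graduates" means in practice.
```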

Norm groups take a long time to produce, and much of the work is done by test publishers prior to releasing the test, although it is a never-ending task and norm groups are constantly being updated.

It is important to remember that if you are tested together with 100 other people, your test score might be viewed in relation to theirs, but the comparison that determines whether it is average, above average, or below average will most often be made using the norms provided by the test publisher. Very occasionally, if enough candidates have been tested using the same test (perhaps several hundred or a thousand), a company may produce its own ‘local’ norms.

5.      Fairness and Discrimination in Test Use

As well as the formal legal requirement that a test should not unfairly discriminate against particular ethnic or gender groups, there are ethical and practical reasons why employers should use tests that are fair.

We know that males and females, and different ethnic groups, all have the same overall level of intellectual ability. This means that if a test systematically produced lower scores for men than for women, and the results were then used to select candidates for a job, a disproportionate number of women would be selected.

This would be fine if what the test scores suggested about men, i.e. that they had less intellectual ability than women, were true, but it is not. In practice, the higher levels of work success predicted by the test for women would not be found.

It may sound a strange thing to say, but the purpose of a test is to discriminate, though only between people who have differing levels of ability on the characteristic in question, and not on the basis of irrelevant characteristics such as gender.

Consequently, ability and aptitude tests need to be carefully scrutinised before use to make sure that they do not discriminate between people on anything other than the actual ability or aptitude in question. Again, this takes a long time and can involve thousands of research subjects.
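A full adverse-impact analysis is well beyond the scope of a short sketch, but the basic idea of the scrutiny described above can be illustrated with a hypothetical check: compare average trial scores across groups that are known to have the same underlying ability, and flag the items for review if a consistent gap appears. The group labels, scores and gap threshold below are all assumptions made for illustration:

```python
# Minimal, illustrative fairness screen on trial data.
# Group labels and scores are hypothetical; real analyses use large samples
# and formal statistical tests, not a simple mean comparison.
scores_by_group = {
    "group_a": [24, 27, 30, 22, 28, 26, 25, 29],
    "group_b": [23, 28, 29, 21, 27, 26, 24, 30],
}

means = {group: sum(s) / len(s) for group, s in scores_by_group.items()}
print(means)

# A large, consistent gap between groups known to have the same underlying
# ability would flag the items for review before the test is released.
gap = abs(means["group_a"] - means["group_b"])
print("review items" if gap > 2 else "no obvious group difference")
```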

In personality tests this is much less of an issue, because differences in personality test results may well reflect genuine differences between males and females. For instance, females tend to be reported as being more sensitive to other people’s feelings and more socially oriented than men. Remember that personality is not the same thing as ability, so the issue is much less contentious.

 
