Bad Apples and Bad Barrels: Bias and Corruption in Polygraphy

Bias in polygraphy, my research suggests, is a bad apples problem: it affects relatively few outcomes but undermines fairness in sometimes egregious ways, including in life-and-death contexts. At the same time, polygraphy’s propensity to increase the very corruption it’s meant to decrease is a bad barrel problem, undermining overall efficacy in contexts that can be just as practically significant. Government non-transparency hinders progress on both issues.

These bias and corruption problems are related but distinct. The bias issue is one of bad apples. A minority of prejudiced polygraphers probably have huge effects on a minority of the police departments, companies, and federal agencies that use their services, by disproportionately flunking certain groups (e.g., blacks, homosexuals, rape victims, whistleblowers): an authoritarian’s banquet of outsiders. Anecdotal evidence of such targeting comes from a broad range of documents and sources that I compiled, shared, and enriched with others’ work in my 2012 collaboration with Marisa Taylor at McClatchy, my 2014 Ph.D. dissertation research, and my 2018 collaboration with Mark Harris at Wired.

“Bad apples” is more than an expression. Operation Bad Apple was a polygrapher fraud-busting operation that veteran CIA polygrapher John F. Sullivan spoke about in a 2008 interview after writing about it in his book Gatekeeper: Memoirs of a CIA Polygraph Examiner. John called abusive polygraphers who manufactured fraudulent confessions or charts “bad apples.” The evidence on bias in polygraphy suggests he was probably right to think of them that way.

By contrast, the corruption issue is about “bad barrels”—not egregious cases of bias, fraud, or abuse, but a deleterious overall effect. In the aggregate, polygraph programs seem to hurt police departments that intend to use them in order to decrease corruption, by unexpectedly increasing corruption instead—much like the infamous D.A.R.E. program increased the very juvenile drug use it was expected to decrease. (As a sidenote: No one knew about that backfiring effect until field experimental data revealed it. This type of efficacy data generation remains surprisingly rare in public policy. This is known as the evidence-policy gap. It’s counter-intuitive if you’re not familiar with the literature in this area, but it’s not unusual.)

Secrecy provides cover for both bias and corruption. Programs that seem to be unaccountable under equal opportunity law on a case by case basis, like federal polygraph programs, will be more vulnerable to biased “bad apples” because non-transparency keeps possible aggregate disparities from being publicly analyzed. And anti-corruption programs that backfire, increasing the corruption they seek to decrease, will similarly be better able to protect their own interests in persisting, when their efficacy is not being publicly or independently evaluated.

Non-transparency, as I recently noted, limits research on polygraph bias and efficacy. But federal agencies are not the only possible source of relevant data. Experimental data can be collected in any population, although its generalizability is then an open question. Survey data can be collected from relevant populations, such as state-licensed polygraphers. And national-level survey data collected by the Bureau of Justice Statistics can be analyzed using statistical tools as if it were experimental data, allowing causal inferences to be drawn about the effects of polygraph programs on police departments. Triangulating all these sources of data sheds novel light on questions of bias and corruption in polygraphy.

Survey and experimental data on bias

Before formulating and beginning to test hypotheses with experiments using the scientific method, I conducted interviews—recently released and curated in an essay on AltGov2—that documented bias, fraud, and abuse in polygraphy. My subsequent hypotheses centered mainly around whether such bias was systematic, and how it worked under real-life conditions such as polygraphers viewing background investigations that could introduce confirmation as well as racial bias. Observational data from the federal polygraph institute itself had in 1990 shown significant racial bias against innocent blacks. But follow-up experimental data—vulnerable to design criticisms such as expectancy effects and artificiality—showed no such bias. Overall, the qualitative and quantitative evidence on racial bias in polygraphy remained suggestive enough to warrant triangulating, or combining it with more forms of data from more sources, to see what they would all suggest when considered together.

A series of four Internet survey experiments showed racial bias (bias against blacks and Hispanics) does not systematically affect polygraph chart interpretation. Confirmation bias (bias against people with negative background investigations) does. And it’s possible to “hack,” or neutralize, that confirmation bias. The hack works by tricking people interpreting polygraph charts into thinking they’re running polygraphs in “suspicious mode,” so they delegate their confirmation bias to the computer.

These Internet survey experiments employed naive interpreters on an online platform run by Amazon called Mechanical Turk (aka MTurk). On MTurk, workers sign up to complete online tasks for pay. MTurk has been criticized, but remains widely used because it seems to yield good data in large samples, cheaply and quickly.

However, it is an open question whether these results generalize to the population of interest, professional polygraphers, or to the field more broadly. Ideally, the study sample would have been polygraphers. But their limited accessibility precluded that possibility.

Online survey experimental subjects’ demographics differed from those of polygraphers as a group in potentially important ways. My survey of Virginia state-licensed polygraphers’ demographics, political attitudes, and self-reported bias showed that polygraphers (like American police management) tend to skew white, male, older, conservative, Republican, and Fox News-watching compared with the general population. Per Berinsky et al., MTurk samples skew in the opposite direction in some ways that can impact bias effects, such as gender (female), party identification (Democrat), and especially age (younger).

Polygraphers also tend to be overwhelmingly current and former law enforcement—a distinct group in terms of social and political attitudes and behaviors. For instance, this group skews more authoritarian than the national norm according to national survey data from American National Election Studies analyzed in tandem with polygrapher survey data. This might matter, because right-wing authoritarianism is associated with racial bias.

Indeed, the small-scale pilot study data that helped secure NSF funding for this larger-scale dissertation research on bias in polygraphs used criminal justice students and professionals in real life as much as possible, in order to get a sample more similar to polygraphers than the general population. That preliminary data did indicate racial and confirmation bias, and a possible compounding interaction between them. There wasn’t enough data in that small sample, however, to assess an interaction, or speak to statistical significance at all: thus, the need for a larger sample size. But due to accessibility, that larger sample got further away from polygraphers demographically.

Polygraphers as a group might also be distinct in ways that specifically matter for the bias effects of interest. About 20% of Virginia state-licensed polygrapher survey respondents reported thinking that some groups (e.g., blacks or homosexuals) tend to fail polygraphs more than others. This suggests that a substantial minority of people interpreting polygraph charts for a living in the field hold biases against some groups that could affect the outcomes of the “tests” they administer. If they were simply honestly reporting observed disparities, then those disparities should have been consistent across polygraphers; but the disparities were inconsistent.

The available evidence altogether suggests that polygraphs themselves are not biased, and most people interpreting them may exhibit confirmation, but not racial, bias. Smaller or less systematic effects will be harder to measure with statistical significance. Those effects would be particularly hard to measure if they came from so-called “bad apples” who might self-select into positions of power in the field and abuse them in a relatively small proportion of cases that could yet have socially and politically significant consequences. For instance, a polygraph-related, erroneous credibility attribution to the CIA source Curveball contributed to the illegal 2003 U.S. invasion of Iraq. Polygraphs have similarly contributed to innocent (and subsequently exonerated) men being sentenced to death in America, and to (probably) innocent men being sent to Abu Ghraib prison in Iraq. Moreover, Thomas Schelling’s segregation game shows it only takes a small bias to endanger equality for society as a whole. But relatively small bias effects will nonetheless be difficult to measure with statistical significance in this context.
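The measurement problem can be made concrete with a rough sample-size calculation. The sketch below uses the standard normal-approximation formula for comparing two proportions. Every number in it (the base fail rate, the share of biased examiners, and the fail rate they impose) is invented purely for illustration and is not an estimate from any dataset; the function names are likewise hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def diluted_fail_rate(base_rate, biased_share, biased_rate):
    """Aggregate fail rate for a targeted group when only some examiners are biased."""
    return (1 - biased_share) * base_rate + biased_share * biased_rate


def required_n_per_group(p_a, p_b, alpha=0.05, power=0.80):
    """Approximate subjects needed per group to detect p_a vs. p_b
    with a two-sided two-proportion z-test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_a + p_b) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p_a * (1 - p_a) + p_b * (1 - p_b))
    ) ** 2
    return ceil(numerator / (p_a - p_b) ** 2)


# Hypothetical scenario: the comparison group fails 30% of the time.
base = 0.30

# "Bad barrel" assumption: every examiner fails the targeted group 45% of the time.
systematic = diluted_fail_rate(base, biased_share=1.0, biased_rate=0.45)

# "Bad apples" assumption: 5% of examiners fail the targeted group 60% of the
# time, while everyone else treats both groups identically.
bad_apples = diluted_fail_rate(base, biased_share=0.05, biased_rate=0.60)

n_systematic = required_n_per_group(base, systematic)
n_bad_apples = required_n_per_group(base, bad_apples)

print(f"aggregate fail rate, systematic bias: {systematic:.3f}")
print(f"aggregate fail rate, bad apples only: {bad_apples:.3f}")
print(f"subjects per group needed, systematic bias: {n_systematic}")
print(f"subjects per group needed, bad apples only: {n_bad_apples}")
```

Under these assumed numbers, the few biased examiners treat their own subjects far worse than in the systematic scenario, yet the aggregate disparity they produce is much smaller and takes a far larger sample to detect.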

Qualitative data from interviews and documents, as well as survey data from Virginia state-licensed polygraphers, suggest possible racial, religious, sexuality, and other forms of bias in polygraphy. By contrast, psychophysiology lab studies and online survey experimental data suggest that racial bias in polygraphs is not a systematic, statistically significant effect resulting from stereotype threat or authoritarian selection on the subject side, or from interpreter bias on the polygrapher side. Triangulating field data with these other sources is required to assess the fuller picture of how polygraph programs work in the field. Do they institutionalize systematic racial bias? And do they work in law enforcement agencies to address the corruption they are meant to lessen?

Field data on bias and corruption

The Bureau of Justice Statistics collects national survey data from thousands of state, county, and local law enforcement agencies through its Law Enforcement Management and Administrative Statistics (LEMAS) survey. This data is sufficient to run quasi-experimental analyses, which treat the observational data “as if” it were experimental data, using coarsened exact matching (CEM) and difference-in-differences (DID) analysis (for a more detailed explanation, please see my dissertation, Chapter 4). CEM is a matching procedure that here allows comparison of local and state law enforcement agencies that are highly similar, except for their implementation of particular selection tools like polygraphs. This comparison reduces model dependence, average treatment effect estimation error, and internal validity threats from measurement error.
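For readers unfamiliar with these methods, here is a minimal sketch of the general idea of CEM followed by DID. It is not the dissertation’s actual analysis. The covariates, bin edges, field names, toy agencies, and the weighting by treated-agency count are all assumptions made for illustration, and the outcome stands in for something like sustained complaints per 100 officers in two survey waves.

```python
from collections import defaultdict


def coarsen(agency):
    """Map an agency's covariates to a coarse stratum key (bin edges are assumptions)."""
    officers = agency["officers"]
    size_bin = 0 if officers < 50 else 1 if officers < 500 else 2
    return (size_bin, agency["region"])


def cem_match(agencies):
    """Group agencies by coarsened covariates and keep only the strata
    that contain both polygraph and non-polygraph agencies."""
    strata = defaultdict(list)
    for agency in agencies:
        strata[coarsen(agency)].append(agency)
    return {
        key: group
        for key, group in strata.items()
        if any(a["polygraph"] for a in group)
        and any(not a["polygraph"] for a in group)
    }


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def did_within_stratum(group):
    """Difference in differences: change among treated minus change among controls."""
    treated = [a["post"] - a["pre"] for a in group if a["polygraph"]]
    control = [a["post"] - a["pre"] for a in group if not a["polygraph"]]
    return mean(treated) - mean(control)


def cem_did(agencies):
    """Average the within-stratum DID, weighting each stratum by its treated count."""
    matched = cem_match(agencies)
    n_treated = {k: sum(1 for a in g if a["polygraph"]) for k, g in matched.items()}
    total = sum(n_treated.values())
    return sum(did_within_stratum(g) * n_treated[k] for k, g in matched.items()) / total


# Invented toy data, not LEMAS records.
agencies = [
    {"officers": 30, "region": "south", "polygraph": True, "pre": 4.0, "post": 3.0},
    {"officers": 40, "region": "south", "polygraph": False, "pre": 4.0, "post": 4.0},
    {"officers": 200, "region": "west", "polygraph": True, "pre": 6.0, "post": 4.0},
    {"officers": 300, "region": "west", "polygraph": False, "pre": 5.0, "post": 5.0},
    # No comparable non-polygraph agency exists, so matching drops this one.
    {"officers": 900, "region": "west", "polygraph": True, "pre": 9.0, "post": 2.0},
]

print("matched strata:", sorted(cem_match(agencies)))
print("estimated effect:", cem_did(agencies))
```

The design choice worth noticing is that the large agency with no comparable control is discarded rather than extrapolated from. That pruning is how matching reduces dependence on modeling assumptions, at the cost of discarding some data.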

These quasi-experimental analyses do not show evidence of systematic racial bias effects of polygraph programs on law enforcement agencies. However, it seems likely, based on data recently obtained by Mark Harris for Wired in combination with my own research, that there are prejudiced polygraphers who, for example, disproportionately flunk blacks vying for jobs at particular departments. Other sources of data generate cause for concern about racial and other bias in polygraphy in the field. Agencies that observe large racial disparities in hiring and demographics should be wary of polygraphs as one possible avenue for bias in both administrative and employment contexts. That said, there is insufficient scientific evidence to reject the null hypothesis of no systematic racial bias in the aggregate in police polygraph programs in the field. The generalizability of this finding to other contexts, like federal agencies, remains an open question.

These analyses also show that polygraph tests, and only polygraph tests (among eighteen police selection tools), decrease sustained citizen complaints of excessive officer use of force, and this effect is statistically significant. At first glance that might sound great; it might sound like polygraphs decrease police brutality. That’s not what the data suggest, though. Some selection tools do cause a decrease in total complaints, and polygraphs are not among them. So polygraphs appear to select specifically on recruit characteristics that change not brutality or complaint rates themselves, but only complaint outcomes. Polygraphing police hires may select against people who are worse at lying or more honest, selecting instead for police officers who are better at lying and getting away with it when they have done something wrong.

This is a distinct effect. Polygraphs are the only police selection tool, of the eighteen such tools on which LEMAS collects data, that shows it. That is ironic and important, because the U.S. has long exported polygraphs as part of its sponsored anti-corruption programs, as in Plan Colombia, the Mérida Initiative in Mexico, and others in the Bahamas, Bolivia, Guatemala, Honduras, and Iraq (dissertation, Chapter 3). It would be both horrifying and surprising if American taxpayer-funded anti-corruption programs worldwide actually increased corruption.

Yet this effect has face plausibility. Most American polygraph schools and polygraphers rely primarily on the so-called control question test format. This interrogation protocol intends to cause subjects to lie to polygraphers without knowing that they’re supposed to be lying. Those presumed lies to so-called control questions generate physiological responses that are then compared with physiological responses to so-called relevant questions, to say whether someone is lying. People who are better at lying in the first place are likely to admit less wrongdoing during the dialogue that is arguably the point of the whole exercise. People who are relatively honest are going to both admit more derogatory information, and not lie (as easily, or at all) in response to control questions—skewing the test against them in two ways. So the design of the control question test polygraphs most common in American policing may discriminate against honest people.
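A toy scoring sketch may help show how this skew could arise from the comparison step alone. It is a deliberate simplification and not any real scoring system used by polygraphers: the response values are in arbitrary invented units, and the function names and decision threshold are hypothetical.

```python
def cqt_score(control_responses, relevant_responses):
    """Sum the paired differences (control minus relevant).
    Positive totals point toward 'truthful', negative toward 'deceptive'."""
    return sum(c - r for c, r in zip(control_responses, relevant_responses))


def classify(score, threshold=1.0):
    """Turn a total score into a call; the threshold is an arbitrary assumption."""
    if score >= threshold:
        return "no deception indicated"
    if score <= -threshold:
        return "deception indicated"
    return "inconclusive"


# Two hypothetical subjects who are both innocent on the relevant questions
# and show identical physiological responses to them.
relevant = [2.0, 2.0, 2.0]

# Subject A fibs on the control questions, as the protocol expects,
# and so reacts strongly to them.
fibbing_controls = [3.0, 2.5, 3.0]

# Subject B answers the control questions candidly and so reacts weakly to them.
candid_controls = [1.0, 1.0, 1.5]

print("Subject A:", classify(cqt_score(fibbing_controls, relevant)))
print("Subject B:", classify(cqt_score(candid_controls, relevant)))
```

In this toy setup the only difference between the two subjects is how they handle the control questions, yet the candid subject ends up with the worse score.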

That could explain the apparent corruption effect. Discriminating against honest police recruits through polygraph screenings may cause more hiring of dishonest officers. That would explain why these officers later get the same rate of brutality complaints as their non-polygraphed counterparts, but are better able to get out of these complaints, by lying, rather than having them sustained. In this way, police polygraph screenings may systematically contribute to the problem of American police perjury, aka “testilying.”

Generalizability is an open question. But if this result generalizes to other contexts, then it likely contributes to a number of other, serious security problems. For example, federal agencies with polygraph programs could be selecting employees who will be better able to avoid consequences for wrongdoing by lying under questioning.

What next?

The Government Accountability Office (GAO) is accountable not to the executive branch but to the legislative branch, that is, to Congress. It is uniquely independent in its audit capabilities vis-a-vis the military, Department of Defense (DOD), and intelligence community. The National Academy of Sciences and other researchers have been unable to acquire bias and efficacy data directly from federal agencies or indirectly through the courts. But GAO could acquire this data by auditing or investigating federal polygraph programs.

Individual Congresspersons have the power to trigger GAO audits and investigations, making this process accessible. However, DOD has increasingly circumscribed GAO’s powers, and that seems to be an escalating trend. Nonetheless, liberal democratic political institutions must check and balance the power of these programs in the public interests of due process and national security alike. Bad apples should not be able to institutionalize bias. And anti-corruption programs should not increase corruption.