On Being "Antiscience", XKCD Jellybeans, and the Mechanics of Scientific Bias Accumulation

Jun 13, 2021

One bit of advice you hear about blogging is to pick a focus - to choose one thing you are particularly knowledgeable about or good at and mostly write about that, so people know what your blog is about and what they can expect from it. I would like to do this, but attentive readers have probably already picked up on the fact I’m a generalist at heart - I’m not especially knowledgeable about or good at any one thing, even if I like to think I’m OK at a bunch of them. The practical implication of this is that my articles tend to be about the best idea I had that week, prompted perhaps by something I read or an argument (real or imagined) that I had.

The topic-of-the-week often ends up being science related, usually focusing on instances of misrepresented or inaccurate data. The end result of this is articles that often point out instances where you can’t trust the processed end-form of the science, whether that be on the acceptable usage of Q-tips, the safety of overnight pizza or whether or not it’s OK to kill a bunch of people to avoid having to reconsider your favorite habits.

That particular form of contrarianism comes with some pushback, the most interesting of which (to me) is the assertion that I’m antiscience. If you don’t know the term, there’s an official-ish definition summarized by Wikipedia thusly:

Antiscience is a set of attitudes that involve a rejection of science and the scientific method. People holding antiscientific views do not accept science as an objective method that can generate universal knowledge.

But like most official definitions of things, the common usage is a bit different. An anti-feminist doesn’t mean “someone who wants women to do well” when he says “Feminist”. Nor does an antiracist mean anything like the “historic” meaning of racism when he mentions it, and though the entries for the word have been obediently lengthened it’s rare to see it used by a speaker in such a way that their intended meaning is entirely covered by the dictionary definition.

So it is here - antiscience in this usage is meant to be taken in a way that gestures broadly at the political right - that this is antiscience of the sort that denies global warming, believes the earth is flat or is generically anti-vax. The general feel of the usage is broad - they are saying I’m a dude who rejects science whole-cloth. There’s no idea of a rationale besides “oh, I just don’t trust science at all - I’m too busy polishing my truck and maintaining my gun collection” in play - it’s generally a dismissal, not a counterargument.

I don’t think I’m antiscience in this way; I don’t really think anybody explicitly is, even if some are in effect only paying attention to science when it suits them. But I am distrustful of science, and I’m distrustful in a way that’s going to keep triggering this particular accusation; they may not be right in the scope of the accusation, but there’s some there there. I approach every study I read assuming potential fishiness and needing an awful lot of scientific rigor to shake me out of that pattern. Where I might look antiscientific at times, an awful lot of people look excessively pro-science to me; I think pleading my case is worthwhile here.

The first movement towards the bottom of things is, as you might expect, a comic strip.

XKCD is a webcomic by Randall Munroe that is focused, mainly, on science-nerdiness. I don’t always agree with the author (see here for him arguing that any non-government suppression of speech is justified, acceptable, and always deserved by the victim) but he’s good enough at his job that I nonetheless find myself regularly enjoying his work. The author is prolific to the extent that there’s a “the Simpsons did it!” aura around his work, resulting in the “relevant XKCD” meme; there’s seemingly a strip for every possible topic a person might discuss.

That pattern holds here, with this gem about jellybeans and statistical significance:

If that’s too long, here’s my very probably even longer summary of it. The strip deals with statistical significance as defined by the classic .05 p-value. The idea here is that although a particular finding where p = .05 would only be expected to occur by coincidence 1 out of 20 times the study was run, running enough similar studies will often produce a surprising result by accident, whether or not an honest-to-god effect actually exists.
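For anyone who wants that arithmetic spelled out, here is a minimal Python sketch. It assumes every test is independent and that no real effect exists at all; the function name is made up for illustration and isn't from the strip.

```python
def chance_of_false_positive(num_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of `num_tests` independent tests of a
    true null hypothesis comes back "significant" at threshold `alpha`."""
    # Each test stays quiet with probability (1 - alpha); all of them
    # staying quiet is that value raised to the number of tests.
    return 1 - (1 - alpha) ** num_tests


# One study, the comic's 20 jellybean colors, and a whole field's worth.
for n in (1, 20, 100):
    print(f"{n:>3} tests -> {chance_of_false_positive(n):.1%} chance of a fluke")
```

Under those assumptions, testing 20 jellybean colors gives you roughly a 64% chance of at least one "significant" result, even though nothing is going on.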

I think this point is valid, but I think it stops too soon (although I can’t claim that Munroe didn’t address it elsewhere - as of the writing of this article, XKCD had ~2500 entries, a bit too much for me to claim I’ve combed completely through). I think there are implications here a bit wider than he got into, but before getting into them I want to make clear what the non-conclusion sections of this article won’t address:

I’m not going to be talking about p-hacking, the art of cheating your data to produce significant-looking results where by rights you should have thrown it out and started over.
I’m not going to talk about bias in terms of what journals will and won’t print, at least as it relates to political bias.
I’m not going to discuss out-and-out falsification of data or fraud.

I’m excluding these things for a reason - I want people to listen to me. It’s no good if I go “those people in my outgroup are liars - of course they’d say that”, because for a lot of you that outgroup is actually your ingroup; if I go after them with ad hominem attacks, you are going to tune right out and rightfully so. This argument will be politics-agnostic at least up until the conclusion and summary - no promises after that.

The first big thing that strikes me as a shortcoming of the jellybean comic is it considers exactly one (very silly) hypothetical study. But I want you to imagine what would happen in a world where all the studies on a particular subject were pointed in a particular direction. Imagine, for instance, that all dieticians who studied fish were for whatever reason pretty sure that fish-based diets were harmful, and as a body made studies overwhelmingly aimed at having titles like “Does fish give you cancer?” or “Erectile Dysfunction and Smelt: A Possible Link”. What would the body of science related to eating fish look like in a few years?

If the jellybean theory above is correct, it would look pretty bad for fish-eaters. The first reason why is as previously discussed; 1 out of 20 of those gill-centric studies is going to hit on something. Dirty science isn’t necessary here; if you were willing to do ~30 “Does RC cheat at coin tosses?” studies, you’d be likely to find an instance where my coin landed on heads five times in a row, and be able to publish a valid study indicating I was a cheater at a p value of ~.03. You don’t need bad faith - just a field where the questions asked go in a single direction consistently enough that the false positives reinforce each other.
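The coin-toss numbers can be checked the same way. This is a rough sketch assuming a fair coin, five tosses per study, and 30 independent studies; the helper names are hypothetical.

```python
def p_all_heads(flips: int) -> float:
    """Chance that a fair coin lands heads on every one of `flips` tosses."""
    return 0.5 ** flips


def p_at_least_one_hit(studies: int, p_single: float) -> float:
    """Chance that at least one of `studies` independent studies shows an
    outcome that has probability `p_single` in any single study."""
    return 1 - (1 - p_single) ** studies


p_single = p_all_heads(5)                 # the ~.03 p-value for five heads in a row
p_any = p_at_least_one_hit(30, p_single)  # chance some study out of 30 sees it

print(f"p-value for five heads in a row: {p_single:.3f}")
print(f"chance at least one of 30 studies finds it: {p_any:.0%}")
```

With a perfectly honest coin, thirty tries are more likely than not to turn up one "cheater" result.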

If you are currently screaming that this effect should be incredibly easy to control for in any number of ways, you are absolutely correct - it should be. The first control that should-but-doesn’t fix this problem is null results - a study that attempts to find some particular disease-causing property of an ichthyoid diet, but doesn’t, actually provides some evidence to the contrary by implication, in the same way that quickly looking for socks in your sock-drawer and not finding them should increase your confidence that you need to do laundry. But if you trust studies like this (Don’t! Check to see if they are any good first!), null results just don’t get published that often:

Strong results are 40 percentage points more likely to be published than are null results and 60 percentage points more likely to be written up. We provide direct evidence of publication bias and identify the stage of research production at which publication bias occurs: Authors do not write up and submit null findings.

This table from the same study is also instructive:

If this pattern holds true you’d expect to see about 20% of null, “fish is fine, at least as far as we could tell” studies get published while 60% of the “fish is a silent killer” studies would make it to the business end of a printing press.

To be clear, this isn’t enough by itself; it’s still pretty bad, though. If fish was completely non-carcinogenic, and we did 100 studies trying to prove it was anyway, we’d expect to see about 5 coincidental fish-kills-people results (with about 3 of them getting published), but only about 19 of the 95 fish-is-fine null results would get published. The literature supporting the untrue conclusion would be about 15% as strong as the truth, with no less apparent scientific validity - nobody cheated to get to this point, it’s just how the system works. And this is an optimistic take, as the authors note in the full study you absolutely shouldn’t use Scihub to read:

Because TESS proposals undergo rigorous peer review, the studies in the sample all exceed a substantial quality threshold. While TESS studies are clearly not a random sample of the research conducted in the social sciences, it is unlikely that publication bias is less severe than what is reported here. The baseline probability of publishing experimental findings based on representative samples is likely higher than that of observational studies using “off-the-shelf” datasets or experiments conducted on convenience samples where there is lower “sunk cost” involved in obtaining the data. Because the TESS data were collected at considerable expense - in terms of time to obtain the grant - authors should, if anything, be more motivated to attempt to publish null results.

The quick translation of that is that these results took a lot of time and money to get, null or not; you’d expect people to be more likely to write them up as a result. But a lot of studies aren’t like that - they are quick and relatively cheap analyses of existing data. In those cases, you’d expect this effect to be more powerful - we don’t really know how much more powerful, but it’s almost certainly so.
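The 100-study fish example can be put into a short Python sketch so you can fiddle with the inputs yourself. The 60% and 20% publication rates are my reading of the study's figures, the .05 false-positive rate assumes fish is completely harmless, and the function and parameter names are invented for illustration.

```python
def published_literature(total_studies: int = 100,
                         alpha: float = 0.05,
                         publish_rate_positive: float = 0.60,
                         publish_rate_null: float = 0.20):
    """Expected number of published false positives and published nulls
    when no real effect exists (so every 'hit' is a coincidence)."""
    false_positives = total_studies * alpha            # about 5 of 100
    null_results = total_studies - false_positives     # about 95 of 100
    published_wrong = false_positives * publish_rate_positive
    published_null = null_results * publish_rate_null
    return published_wrong, published_null


wrong, null = published_literature()
print(f"published fish-kills-people studies: {wrong:.0f}")
print(f"published fish-is-fine studies:      {null:.0f}")
print(f"wrong literature relative to right:  {wrong / null:.0%}")

# If cheap, low-sunk-cost nulls get written up even less often (a guess at
# 10% here, not a measured figure), the imbalance gets worse.
wrong_cheap, null_cheap = published_literature(publish_rate_null=0.10)
print(f"same ratio with 10% of nulls published: {wrong_cheap / null_cheap:.0%}")
```

Dropping the null publication rate is the knob that matters most here: every null that stays in a file drawer makes the handful of flukes look proportionally stronger.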

Another control that should stop this from happening is ideological diversity. If we counter those dirty fish-haters with a similar number of people asking “are fish a panacea for all illness” questions with their studies, it would wash out this effect completely. But in some fields that’s just not how it is:

But the percentages varied. Regarding economic affairs, approximately nineteen per cent called themselves moderates, and eighteen per cent, conservative. On foreign policy, just over twenty-one per cent were moderate, and ten per cent, conservative. It was only on the social-issues scale that the numbers reflected Haidt’s fears: more than ninety per cent reported themselves to be liberal, and just under four per cent, conservative.

The article is trying to soften the conclusion here, but in the only way that matters to us (what are their political alignments as it applies to their field), we are looking at a better than 22-to-1 imbalance (more than ninety per cent against just under four), before we get to the part where the few remaining right-leaning folks are pretty scared to stick their necks out:

As the degree of conservatism rose, so, too, did the hostility that people experienced. Conservatives really were significantly more afraid to speak out. Meanwhile, the liberals thought that tolerance was high for everyone. The more liberal they were, the less they thought discrimination of any sort would take place.

Or that a not insignificant portion of their colleagues willingly self-report they’d attempt to suppress them and their work if they did:

As a final step, the team asked each person a series of questions to see how willing she would personally be to do something that could be considered discrimination against a conservative. Here, an interesting disconnect emerged between self-perception—does my field discriminate?—and theoretical responses about behaviors. Over all, close to nineteen per cent reported that they would have a bias against a conservative-leaning paper; twenty-four per cent, against a conservative-leaning grant application; fourteen per cent, against inviting a conservative to a symposium; and thirty-seven and a half per cent, against choosing a conservative as a future colleague. They persisted in saying that no discrimination existed, yet their theoretical behaviors belied that idealized reality.

The list of controls that don’t work well goes on and on - if peer review was worth a shit, we wouldn’t have a replication crisis.

Citations could work as a control - if these accidental bad-result stories weren’t used, that would diminish the negative effect of them existing. But when we check that out by looking to see if studies that fail to replicate are cited less, we find they are actually cited more, by about 150 citations per paper on average (Don’t believe the study right away! Go read it!):

As shown in Fig. 2, papers that replicate are cited 153 times less, on average, than papers that do not (N = 80, Poisson regression, residual df = 76, Z = −3.47, and P = 0.001). The point estimate for the difference in citations is largest for papers published in Nature and Science, compared with studies published in economics and psychology journals. Yet, the relationship between replicability and citations is not significantly different across the three replication projects. When we include several individual characteristics of the studies replicated [based on (15)], such as the number of authors, the rate of male authors, and the characteristics of the experiment (location, language, and online implementation), as well as the field in which the paper was published (16), the relationship between replicability and citations is qualitatively unchanged (the same occurs if we control for the highest seniority level among the authors).

Media reporting on science is geared towards sensation, not substance; it’s driven by abstracts and press releases, not the studies themselves; by and large they care absolutely nothing for reporting null results. You aren’t going to control for this with a group of partisan English majors writing on a deadline and maximizing for hits - it’s not in their best interest and it’s just not their job in the first place.

Government funding could support pro-fish studies at the same rate as anti-fish and provide balance, but in the real world that isn’t always how it works: see here for a list of studies firmly slanted towards the “can this hurt somebody” range, but conspicuously light on the “does this have any benefits compared to alternatives” front. This isn’t weird; government has its own aims and works towards them with truth as a secondary afterthought. Fauci has been made into a partisan battle touchpoint, but that’s the wrong framing. The government is a big machine, not a concerned philosopher; of course it views science first and foremost as a tool for governing - what else would we expect? Fauci reflects that; he didn’t invent it.

Nothing here stops the mechanic of wrong results accidentally looking right fairly often and null results only having a limited effect on counterbalancing that - the lower bound on “studies that look fine but are completely wrong” is something like 15% in a best case scenario almost any way you slice it.

The internet has changed and improved a lot of things, but one of my favorites is what it did for arguments between laymen. I’m old enough to remember the before-times when getting confirmation that your friend was full of it was a lot more costly - once I saw my dad make a 15 minute phone call to a reference librarian to learn the name of Will Rogers’ horse. As a result most arguments between “normal people” relied on who had the better memory and rhetoric.

The availability of information on the internet spurred a new norm - we went from Mitchell and Webb cheese arguments to an environment where it was reasonable to expect and demand some sort of sourcing. By and large this was a great development that really did improve the level of discourse, even if the internet undercut the quality of our conversations in other important ways. But what’s less great is the bit where some number of people looked at that improvement and couldn’t imagine anything greater, taking the provision of a link to a source as the be-all and end-all of argument winning. Which would be great if every study in bona fide science-ese could be relied on to be accurate; where that fails, the provide-a-source-win-an-argument model limits us.

The body of this article was necessarily the closest I could get to a best-case scenario, and it still ended up with a probable model where 3 out of every 20 published studies would be expected to be wrong by default. But that’s the best case scenario. I avoided bringing up actual cheating there, but I sort of have to here, because it’s relevant: that best-case scenario isn’t actually the scenario we have.

Nobody really disagrees that the publish-or-perish environment is a thing - people’s entire careers often hang on whether or not their research produces a significant, interesting result. In that environment it would be outright weird to not see a significant amount of cheating ranging from mild p-hacking to out-and-out fraud. It’s difficult to estimate the prevalence of this because nobody admits to it and scientists are pretty soft on their in-group: see here for an example of a guy talking about three known and common methods of cheating that are intentionally performed as if people are just kinda doing them by accident - as if all they need is a quick bit of kindly education and they will stop doing the known-bad career-promoting practices nobody is likely to notice or willing to call them on.

You might disagree with my math (and frankly, as always, you shouldn’t take the word of some internet jackass at face value). But when we check to see how bad it’s gotten on a macro level we find we have a replication crisis; when multiple fields are worse than a coin toss in terms of determining the truth, it’s clear something has gone terribly wrong.

And, of course, that’s before we get into the kind of science that’s known-bad from the start but pushed anyway to promote desirable-to-some political or social goals. The Helena Miracle study was known-bad science almost from the start (and has been debunked more thoroughly since) but that hasn’t stopped it from being used to push policy for decades. Everyone in that field knows (or should know and doesn’t) that it’s trash, but it doesn’t matter - it’s often the case that if a finding supports a desired outcome, it’s desirable science, no questions asked.

None of this is a reason to reject science outright. One reason why is that there’s clearly a lot of good science out there, and even a study that is wrong in its conclusions can yield at least some good information. Another is that although science is often wrong, it’s still about the best we have. With that said, if you believe even a single one of the flaws I’ve outlined, then we are well served by approaching any particular conclusion with a reasonable amount of skepticism - trusting but verifying before we believe outright.

So what do we do with this? It depends on what you want. If the goal is to have better and better discussions and to come closer and closer to an accurate understanding of the world, you raise standards; you read papers skeptically knowing there’s a reasonable chance they are accidentally or intentionally inaccurate and put whatever limited influence you have behind asking for better. This is particularly important in the case of science being used to push policy - if you want real solutions, your best first step is to demand they be built on a foundation of reality.

If you just want to win arguments and be on the right team, you do something different. There’s a common accusation that gets thrown around that some groups are “just like a religion” - that people concerned by the implications of climate science or similar are part of a cult, for instance. People who call the accusers here out on that are correct to do so; it’s a sloppy comparison. But there are still several important ways it can be right, and unquestioning belief is one - this is true whether that belief is for its own sake or motivated by a need for social status or power.

I hope you understand that all this, as negative as it might sound, is not meant to dismiss science as a tool for truth. At a bare minimum, the internet’s effect on argument norms of sourcing is a great change. I also argue that as bad as policy sometimes is, it’s better than it would have been before our collective access to information was boosted by the web. But doing better doesn’t mean we have to stop - it would be a mistake to let pride in how far we’ve come hold us back from going even further.

(A special thanks once again to The Resident Contrarian Proofreading Corps; I couldn’t do it without either you or dozens and dozens of errors.)

12 Comments

Feeling Sentient

Jun 13, 2021

"I approach every study I read assuming potential fishiness and needing an awful lot of scientific rigor to shake me out of that pattern"

That sounds a lot like "scientist" if you ask me.

Temagami

I used to work for a quantitative hedge fund. We hired really really smart scientists and statisticians and gave them all the resources they could ever want. A high level description of their job was basically to develop hypotheses and test them. Incentives were aligned: Once their results were verified in the real world they got paid lots of money, and if not, they got nothing. It would be difficult to come up with a more ideal setting for correct research and statistics to take place. Whenever I read articles like this I think yup we solved those problems.

And bad science still happened. I can't overstate how insanely hard it is to do correctly. Humans are fallible. So are all the processes we design. We can fix every single issue anyone has ever thought up and we'll still be far from the ideal. A healthy understanding of the scientific method incorporates this.

But hey, Humans keep trying and it's been working pretty well, on average, over long periods of time.
