Variability and Reproducibility in fMRI Research

While conducting experiments, researchers face an abundance of decisions. Among competing theories, methodologies, standards and innovations, choosing what is “correct” is not always easy. The methods and pipelines used to analyse data are becoming more and more complex, especially in neuroimaging research, and at every step decisions have to be made that are not always obvious. Researchers are bombarded with new methodologies, multiple implementations of the same pipeline, and a myriad of ways to analyse their data to answer a research question. On the one hand, these “researcher degrees of freedom” open the door to endless possibilities. On the other hand, how does one choose among so many options? And what does this mean for the replicability and validity of our results?

Let’s go back a bit and start from the beginning: what exactly is a research pipeline? In every neuroimaging research project, the data has to be preprocessed before it is analysed, so that one can make reasonable sense of it. Preprocessing consists of several steps, such as reducing noise in the data and making sure that the data is in the right format. After preprocessing, the data is analysed using statistical modelling. During this process, voxel activity is thresholded to determine whether the recorded activity in a certain brain region is significant. Each of these steps can be performed in multiple ways and in different orders. Threshold values can vary, and smoothing kernels of different types and widths can be applied (Lindquist, 2020). As of now, there are no strict rules for such data-handling ‘algorithms’.
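
To make the effect of these choices concrete, here is a minimal sketch in Python. It is a toy under stated assumptions, not a real fMRI pipeline: the "brain" is a single row of 100 simulated voxels, the kernel width `sigma` is given in voxel units (real pipelines usually specify a FWHM in millimetres), and the threshold values are arbitrary. All function names are made up for illustration.

```python
import math
import random


def gaussian_kernel(sigma, radius=None):
    """Return a normalised 1D Gaussian kernel (weights sum to 1)."""
    if radius is None:
        radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(signal, sigma):
    """Smooth a 1D signal with a Gaussian kernel; sigma == 0 means no smoothing."""
    if sigma == 0:
        return list(signal)
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(signal)
    smoothed = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)  # clamp indices at the edges
            acc += w * signal[j]
        smoothed.append(acc)
    return smoothed


def count_suprathreshold(signal, threshold):
    """Count the voxels whose value exceeds the threshold."""
    return sum(1 for value in signal if value > threshold)


# Simulated data (an assumption): unit-variance noise plus a weak "activation"
# in voxels 40 to 59.
random.seed(0)
voxels = [random.gauss(0, 1) + (1.5 if 40 <= i < 60 else 0.0) for i in range(100)]

# Run the same data through every combination of kernel width and threshold.
results = {}
for sigma in (0, 1, 2):
    for threshold in (1.0, 2.0):
        results[(sigma, threshold)] = count_suprathreshold(smooth(voxels, sigma), threshold)

for (sigma, threshold), n_voxels in sorted(results.items()):
    print(f"sigma={sigma}, threshold={threshold}: {n_voxels} voxels above threshold")
```

Every entry in `results` comes from the same simulated data. Only the two analysis settings change, and each combination can yield a different count of "active" voxels.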

Thus, there can be a lot of variation in:

  • The steps that are taken
  • The implementation and approach to these steps
  • The order in which the steps are conducted
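
These three sources of variation multiply quickly. The sketch below counts the pipelines that arise from a handful of choices. The option names and values are illustrative assumptions, not the settings any team actually used.

```python
from itertools import permutations, product

# Hypothetical analysis choices (illustrative only).
choices = {
    "software": ["FSL", "SPM", "AFNI"],
    "smoothing_fwhm_mm": [4, 6, 8],
    "motion_regressors": [6, 24],
    "threshold_method": ["voxelwise", "clusterwise"],
}

# Every combination of one option per choice is a distinct pipeline.
pipelines = [dict(zip(choices, combo)) for combo in product(*choices.values())]

# Some steps can also be run in either order (again, an assumption).
reorderable_steps = ["slice_timing", "motion_correction"]
orderings = list(permutations(reorderable_steps))

total = len(pipelines) * len(orderings)
print(len(pipelines), "combinations of options")    # 3 * 3 * 2 * 2 = 36
print(total, "pipelines once step order is varied")  # 36 * 2 = 72
```

Four choices and one reorderable pair of steps already give 72 distinct pipelines. A real analysis involves far more decisions than that.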


Botvinik-Nezer et al. (2020) investigated exactly this variation and whether it affects the significance of results and how we interpret them. Seventy teams of researchers were given the same functional MRI dataset and a research question with nine pre-defined hypotheses. The teams were asked to conduct the entire analysis the way they usually would and to report a detailed description of their methods and results. Strikingly, no two teams used the same preprocessing and analysis pipeline, and the reported significant results for each hypothesis varied across teams. Averaged across the nine hypotheses, 20% of the teams reported a result that differed from the majority. This raises the question: why do the results differ so much?
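
A figure like "20% of teams differed from the majority" can be computed as follows. This is only a sketch: the yes/no reports are invented toy data for five teams and three hypotheses, not the study's results, and the function name is hypothetical.

```python
from collections import Counter

# Toy data (invented): True means the team reported a significant result.
reports = {
    "H1": [True, True, False, True, False],
    "H2": [False, False, False, True, False],
    "H3": [True, False, True, True, True],
}


def minority_fraction(outcomes):
    """Fraction of teams whose report differs from the most common report."""
    counts = Counter(outcomes)
    majority_count = counts.most_common(1)[0][1]
    return 1 - majority_count / len(outcomes)


per_hypothesis = {h: minority_fraction(o) for h, o in reports.items()}
average = sum(per_hypothesis.values()) / len(per_hypothesis)

for hypothesis, fraction in per_hypothesis.items():
    print(f"{hypothesis}: {fraction:.0%} of teams differ from the majority")
print(f"Average across hypotheses: {average:.1%}")
```

With an even number of teams a tie is possible. In that case the function treats either side as the majority, which gives a fraction of 50%.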

To investigate this, the authors used a logistic regression model to identify the variables driving this diversity. Among others, a higher degree of spatial smoothness was associated with a higher chance of a significant outcome. In other words, several steps in the analysis, e.g., motion correction, spatially smooth the raw neuroimaging data, and the smoother the data, the more likely a significant result becomes. Even more surprisingly, the choice of software package was also associated with the significance of results: FSL was associated with a higher likelihood of significant results than SPM.

This variability explains the different findings of the teams, but does it systematically influence fMRI results? To assess the consistency of results, Botvinik-Nezer et al. conducted an image-based meta-analysis of the statistical maps provided by the teams. The results suggest that, although the teams reported different outcomes of their analyses, the activated clusters converged across teams and reached a significant consensus.
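
To see how maps that disagree after thresholding can still converge, consider this simplified sketch. It combines toy z-maps with Stouffer's method, which assumes the maps are independent. Maps from teams analysing the same dataset are not independent, so this is a stand-in for illustration and not the consensus analysis used in the study. The z-values and the one-sided p < 0.001 threshold are assumptions.

```python
import math
from statistics import NormalDist


def stouffer_combine(z_maps):
    """Combine equally sized z-maps voxel by voxel: Z = sum(z_i) / sqrt(k)."""
    k = len(z_maps)
    n_voxels = len(z_maps[0])
    if any(len(z_map) != n_voxels for z_map in z_maps):
        raise ValueError("all maps must have the same number of voxels")
    return [sum(z_map[v] for z_map in z_maps) / math.sqrt(k)
            for v in range(n_voxels)]


# Toy unthresholded z-maps (invented): four teams, three voxels each.
team_maps = [
    [2.0, 0.1, -0.3],
    [1.5, -0.2, 0.4],
    [1.8, 0.3, -0.1],
    [1.2, 0.0, 0.2],
]

# z-threshold for a one-sided p < 0.001 (an assumed, uncorrected threshold).
threshold = NormalDist().inv_cdf(1 - 0.001)

# What each team would report after thresholding its own map.
individual = [[z > threshold for z in z_map] for z_map in team_maps]

# What the combined, unthresholded maps show.
combined = stouffer_combine(team_maps)
consensus = [z > threshold for z in combined]

print("Any team significant on its own:", any(any(row) for row in individual))
print("Combined z-values:", [round(z, 2) for z in combined])
print("Significant after combining:", consensus)
```

In this toy example no team reports the first voxel as significant on its own, yet the combined value crosses the threshold. The evidence is only visible because the full, unthresholded values were kept.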

Does that mean there is no reason to worry? Not quite. While the results are consistent when combined across teams, the study also showed that for each of the nine hypotheses, at least four different pipelines used by the teams produced a significant result. Such flexibility reduces the reproducibility of fMRI studies. It is common in fMRI studies to explore different pipelines during analysis but report only the best-fitting one (Simmons, Nelson, & Simonsohn, 2011). This practice can lead to more false positives and makes it harder for other researchers to replicate findings exactly (Lindquist, 2020).

The authors propose several solutions to this problem. First, unthresholded statistical maps should be shared on platforms such as NeuroVault (Gorgolewski et al., 2015). This is vital because it allows image-based meta-analyses like the one employed in this study. The drawback of reporting only thresholded maps is that information is lost by discarding the values of voxels below the threshold, similar to discarding null results (Salimi-Khorshidi et al., 2009). Second, to make an experiment properly reproducible, the analysis code should be shared publicly so that other labs can re-run the analysis or validate the code. Furthermore, the authors suggest that datasets should be analysed by more than one research team, which creates a need for automated analysis tools like FitLins (Markiewicz et al., 2019). Lastly, preregistration requires researchers to specify their hypotheses and analysis plan before conducting the experiment.

It is clear that there is no single optimal pipeline for analysing all types of data. But if researchers raise awareness of reproducibility and encourage open and transparent practices, such variance in pipelines will be less of a problem for future scientific endeavours.


Botvinik-Nezer, R., Holzmeister, F., Camerer, C. F., Dreber, A., Huber, J., Johannesson, M., … & Rieck, J. R. (2020). Variability in the analysis of a single neuroimaging dataset by many teams. Nature, 582(7810), 84-88.

Gorgolewski, K. J., Varoquaux, G., Rivera, G., Schwarz, Y., Ghosh, S. S., Maumet, C., … & Margulies, D. S. (2015). NeuroVault.org: a web-based repository for collecting and sharing unthresholded statistical maps of the human brain. Frontiers in Neuroinformatics, 9(8).

Lindquist, M. (2020). Neuroimaging results altered by varying analysis pipelines. Nature, 582(7810), 36-37. doi: 10.1038/d41586-020-01282-z

Markiewicz, C., De La Vega, A., Yarkoni, T., Poldrack, R. & Gorgolewski, K. FitLins: reproducible model estimation for fMRI. Poster W621 in 25th Annual Meeting of the Organization for Human Brain Mapping (OHBM, 2019).

Salimi-Khorshidi, G., Smith, S. M., Keltner, J. R., Wager, T. D., & Nichols, T. E. (2009). Meta-analysis of neuroimaging data: a comparison of image-based and coordinate-based pooling of studies. Neuroimage, 45(3), 810-823.

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological science, 22(11), 1359-1366.
