Faster SPM99/SPM2/FSL

Introduction

The SPM99 and more recent SPM2 software suites are used by many neuroimagers worldwide for analysing MRI images. This analysis takes a lot of computer time, so any approach that speeds it up is welcome. Tom Womack and I found that recompiling parts of the SPM package with a more advanced compiler than the one used for the standard distribution (Intel C++ version 7.1, rather than GCC) produces quite significant speed-ups. The tables below show the speeds we found on different computers and with differently compiled MEX files (measured with Matthew Brett's mextest.m file). Note that many computers are much slower at computing with Not-a-Number (NaN) values than with real values (compare the speed of 'Linear Resample' with that of 'NaN linear Resample'). Unfortunately, the statistics stage of SPM processing uses NaN values heavily: I think this explains why Athlon and Athlon64 computers are so much faster than other systems at the 'statistics' stage of SPM (they do not have a NaN penalty). The tables indicate the time required to complete different tasks, with lower numbers indicating faster performance. Systems that show a NaN penalty are highlighted in red: these systems will be slow during the statistics portion of SPM processing. Sun and SGI data are from Otto Muzik and Shane McKie.

System                     MEX files             SimpleRead  Linear Resample  Sinc Resample  NaN Linear Resample  Smooth
Celeron 0.8GHz             OriginalSPM99         1.26        3.59             48.22          11.68                17.02
                           icc7.1 p2             1.26        1.30             23.15          9.16                 15.45
Athlon 2200XP 1.8GHz       OriginalSPM99         0.32        1.14             15.67          1.09                 14.98
                           icc7.1 p2             0.31        0.39             6.28           0.38                 11.37
Athlon 2800XP 2.1GHz       OriginalSPM2          0.17        0.95             6.16           0.95                 9.55
Athlon64 3400+ 2.2GHz      SPM2std               0.11        0.70             7.86           0.70                 5.95
                           SPM2p3                0.11        0.66             8.58           0.65                 5.95
                           SPM2p4                0.11        0.69             7.45           0.63                 5.97
Pentium4 2.4GHz            OriginalSPM99         0.23        1.16             11.45          20.66                16.71
                           Matthew Brett gcc3.2  0.20        0.99             8.59           0.99                 15.31
                           icc7.1 p2             0.21        0.31             5.55           19.30                19.20
                           icc7.1 p4             0.20        0.32             7.59           0.28                 18.29
Xeon 2x3GHz                SPM2p3                0.19        0.99             8.84           15.73                13.3
                           SPM2p4                0.20        0.66             9.98           0.62                 13.4
Sun Ultra II 0.25GHz       OriginalSPM99         6.20        7.20             134.17         27.12                49.00
Sun Ultra III CPU 0.75GHz  OriginalSPM99         0.68        2.88             33.27          123.42               11.72
Sun Ultra III+ CPU 1.0GHz  OriginalSPM99         0.30        1.31             23.94          95.91                4.44
SGI R14000 0.6GHz          OriginalSPM2          0.35        0.62             15.46          117.03               10.94
Macintosh G4: 2x1.4GHz     OriginalSPM99         0.46        0.89             18.68          0.88                 7.48
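
The NaN penalty discussed above can be read straight off the table by dividing each 'NaN linear Resample' time by the matching 'Linear Resample' time. The Python sketch below does this for a handful of rows copied from the table. The 2.0 cut-off used to flag a penalty is an arbitrary assumption made for this illustration, not a threshold used in the original measurements.

```python
# (system, MEX files, linear resample, NaN linear resample), copied from the table
ROWS = [
    ("Athlon 2200XP 1.8GHz", "OriginalSPM99", 1.14, 1.09),
    ("Athlon64 3400+ 2.2GHz", "SPM2std", 0.70, 0.70),
    ("Pentium4 2.4GHz", "OriginalSPM99", 1.16, 20.66),
    ("Pentium4 2.4GHz", "icc7.1 p4", 0.32, 0.28),
    ("Xeon 2x3GHz", "SPM2p3", 0.99, 15.73),
    ("Sun Ultra III+ CPU 1.0GHz", "OriginalSPM99", 1.31, 95.91),
    ("Macintosh G4: 2x1.4GHz", "OriginalSPM99", 0.89, 0.88),
]

# Assumption: arbitrary cut-off chosen only for this illustration.
PENALTY_THRESHOLD = 2.0


def nan_penalty(linear, nan_linear):
    """Return how many times slower the NaN resample is than the plain one."""
    return nan_linear / linear


def has_nan_penalty(linear, nan_linear, threshold=PENALTY_THRESHOLD):
    """True when the NaN resample is more than `threshold` times slower."""
    return nan_penalty(linear, nan_linear) > threshold


for system, mex, linear, nan_linear in ROWS:
    flag = "NaN penalty" if has_nan_penalty(linear, nan_linear) else "no penalty"
    print(f"{system:26s} {mex:14s} x{nan_penalty(linear, nan_linear):6.1f}  {flag}")
```

On these rows the sketch flags the Pentium 4 with the original SPM99 files, the Xeon with the SPM2p3 files and the Sun Ultra III+, while the Pentium 4 with the icc7.1 p4 files comes out at parity with the Athlon systems.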

SPM optimizations

As Matthew Brett explains on his web page, there are a couple of other optimizations that you can do to make SPM faster.

SPM2

SPM2 allows you to write scripts to process data fairly automatically. I created a script to process a simple block design dataset (130 volumes). This benchmark gives you an idea of the amount of time required to process a dataset. Note that for event-related designs the 'statistics' portion of the analysis would take considerably longer. Since the statistics are often recomputed for different comparisons, I suggest that performance on the statistics portion of this benchmark should be emphasized. All times are in seconds: smaller numbers are faster. Note that 'preprocess total' is the total for realignment, unwarping, normalization and smoothing (in most rows it is slightly higher than the sum of the four stage times as listed). The excellent performance of the Athlon64 for statistics is probably due to the fact that it does not suffer a NaN penalty. If you wish, you can download this data set and SPM2 batch scripts here (18MB). Note that the same 2.4GHz Pentium 4 was tested with both Windows and Linux, with Linux showing much better performance in SPM. I found similar results with the FSL3.1 FEEDS benchmark running on WindowsXP/Cygwin1.5.9 versus Linux.

System                      Realign  Unwarp  Normalize  Smooth  Preprocess Total  Statistics  Notes
0.8GHz Celeron              530      2374    953        131     4005              364         WinXP; Matlab5.3, 192MB RAM
Dual 1.0GHz Pentium3        233      1016    343        63      1661              271         WinXP; SGI330, 1.5GB RAM
1.3GHz PentiumM 'Centrino'  177      525     183        34      922               143         WinXP; Dell D500 Latitude laptop
Dual 1.4GHz Opteron 240     118      478     255        24      875               66          LinuxRH9, 1GB RAM, UC Irvine Brain Imaging Center
1.83GHz AthlonXP 2500       141      558     170        85      958               106         WinXP, 512MB RAM, HP Laptop
Dual 2.0GHz G5 Macintosh    156      440     161        28      789               104         Mac10.3
2.2GHz Athlon64 3400+       76       314     97         15      509               69          WinXP, 1GB RAM
                            61       293     124        15      494               41          SUSE9.1 64bit, 1GB RAM
2.4GHz Pentium4             149      426     141        24      743               145         WinXP, 512MB RAM, FSB: 800MHz
                            185      513     163        23      887               144         WinXP, 512MB RAM, FSB: 800MHz, MATLAB 7.0
                            112      294     116        21      547               67          SUSE9.0, 512MB RAM, FSB: 800MHz
                            71       320     121        22      536               65          SUSE9.1, 512MB RAM, FSB: 800MHz
3.0GHz Pentium4             88       258     105        18      472               57          Mandrake10, 2GB RAM, UC Berkeley
Dual 3.0GHz Xeon            125      339     103        21      591               121         WinXP, 3GB RAM, Optimized Mex/ATLAS
                            (without optimised files: statistics required 346 seconds)
Dual 3.0GHz Xeon            65       267     96         21      451               60          RH9, 3GB RAM, Optimized Mex/ATLAS, Campinas
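
To put a number on the Windows versus Linux difference for the 2.4GHz Pentium 4, the rough Python sketch below divides the WinXP times by the Linux times for the same machine. The figures are copied from the table. The `speedup` helper is a name made up for this example.

```python
# Times in seconds for the 2.4GHz Pentium 4 rows of the table:
# label -> (preprocess total, statistics)
PENTIUM4 = {
    "WinXP": (743, 145),
    "SUSE9.0": (547, 67),
    "SUSE9.1": (536, 65),
}


def speedup(baseline_seconds, other_seconds):
    """How many times faster `other` is than `baseline` (lower time = faster)."""
    return baseline_seconds / other_seconds


base_pre, base_stats = PENTIUM4["WinXP"]
for label, (pre, stats) in PENTIUM4.items():
    print(
        f"{label:8s} preprocess x{speedup(base_pre, pre):.2f}  "
        f"statistics x{speedup(base_stats, stats):.2f}"
    )
```

With these figures the statistics stage runs a little over twice as fast under Linux, while the preprocessing total improves by roughly a third.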

Matlab bench functions

I think the real-world tests above are probably a better benchmark of SPM performance than the built-in Matlab benchmark test. However, here are a few Matlab 'bench' values that should give a rough idea of performance. Lower values mean faster performance, except for the 'score', where a higher value means faster overall performance. Note that Matlab is single-threaded, so there is little benefit from dual processors. These values are from Matlab 6.5, and it is possible that the G5 and Athlon64 will show improved performance if Mathworks releases new versions of Matlab optimized for these systems. Good Pentium4 performance depends on compiling code specifically for its quirks. Note that Linux systems are slower on the 3-D test than the same systems running Windows (the 2-D results are mixed), a finding not reflected in the SPM2 benchmark.

System                      LU    FFT   ODE   Sparse  2-D   3-D   Score  Notes
Dual 1.0GHz Pentium3        1.61  1.98  0.86  1.30    2.27  0.70  11.5   WinXP, SGI330, 1.5GB RAM
1.3GHz PentiumM 'Centrino'  0.82  0.99  0.37  0.73    0.78  0.41  24.4   WinXP, Dell D500 Latitude laptop
1.8GHz Athlon XP2200        0.81  1.48  0.50  0.86    1.31  0.52  17.3   WinXP
Dual 2.0GHz G5 Macintosh    0.32  1.10  0.44  0.52    1.10  0.90  22.8   Mac10.3
2.1GHz AthlonXP 2800+       0.46  0.85  0.31  0.55    0.47  0.16  35.8   WinXP [is 3D score correct?]
2.2GHz Athlon64 3400+       0.28  0.59  0.20  0.44    0.66  0.67  35.2   WinXP
                            0.38  0.60  0.34  0.61    0.43  0.76  32.0   SUSE9.1 64bit, 1GB RAM
2.4GHz Pentium4             0.31  1.02  0.47  0.57    0.79  0.68  26.0   WinXP, FSB: 800MHz, optimized ATLAS
                            0.31  0.82  0.58  0.72    0.81  1.31  22.0   SUSE9, FSB: 800MHz, optimized ATLAS
Dual 3.0GHz Xeon            0.38  1.02  0.36  0.47    0.56  0.30  32.4   WinXP, optimized ATLAS, ATI Radeon 9800pro

FSL benchmark

The FMRIB in Oxford provides a group of neuroimaging tools known collectively as FSL. These tools also come with a benchmark named FEEDS that allows you to test that the programs are installed correctly and gives you an idea of the performance of your system. Since FSL is available as source code, we can recompile these tools to take advantage of architecture-specific features (like SSE or the extra registers provided when the Athlon64 is in 64-bit mode). The times below reflect the total time (not the shorter user time) in seconds to complete the FEEDS 3.1 benchmark. Lower values mean faster performance.
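
The distinction between total time and user time can be illustrated with a small Python sketch. This is not how FEEDS itself reports its timings; it only shows the two quantities side by side. The sleeping child process is a placeholder command chosen for the example, and `time_command` is a hypothetical helper name. On Windows, `os.times()` reports zero for child CPU time, so the user figure is only meaningful on Unix-like systems.

```python
import os
import subprocess
import sys
import time


def time_command(argv):
    """Run argv and return (wall_seconds, child_user_cpu_seconds)."""
    before = os.times()
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    wall = time.perf_counter() - start
    after = os.times()
    # CPU time spent in user mode by child processes that have finished.
    user = after.children_user - before.children_user
    return wall, user


# Placeholder command: a child Python process that sleeps briefly, so it
# accumulates wall-clock time while using very little CPU.
wall, user = time_command([sys.executable, "-c", "import time; time.sleep(0.2)"])
print(f"total (wall-clock): {wall:.2f}s, user CPU: {user:.2f}s")
```

A job that waits on disk or sleeps, as here, shows a total time well above its user time, which is why the total is the more honest figure for a whole benchmark run.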

System                 Total time (sec)  Notes
2.2GHz Athlon64 3400+  1852              WinXP/Cygwin: distribution from FSL website
                       1620              SUSE9.1 64bit OS, 32-bit: distribution from FSL website
                       1305              SUSE9.1 64bit OS, 32-bit: -march=k8 -mcpu=k8 -mfpmath=sse -O3 -fexpensive-optimizations
                       1137              SUSE9.1 64bit OS, 64-bit: -march=k8 -mcpu=k8 -mfpmath=sse -O3 -fexpensive-optimizations -m64
2.4GHz Pentium4        2572              WinXP/Cygwin: distribution from FSL website
                       2155              SUSE9.0: distribution from FSL website
                       1784              SUSE9.1: -march=pentium4 -mcpu=pentium4 -mfpmath=sse -O3 -fexpensive-optimizations

*Note: I was unable to get 'slicer' and 'overlay' to compile in 64-bit mode. Furthermore, 'melodic' was much slower as a 64-bit executable, so 32-bit executables were used for these stages of processing.
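
As a rough guide to what recompiling buys, the Python sketch below expresses each recompiled FEEDS time as a percentage saved relative to the stock Linux distribution on the same machine. The times are copied from the table. The `percent_saved` helper is a name made up for this example.

```python
# FEEDS 3.1 total times in seconds, copied from the table.
# system -> (stock Linux build, {recompiled build label: time})
FEEDS = {
    "2.2GHz Athlon64 3400+": (1620, {"32-bit, -march=k8": 1305, "64-bit, -march=k8 -m64": 1137}),
    "2.4GHz Pentium4": (2155, {"-march=pentium4": 1784}),
}


def percent_saved(baseline_seconds, new_seconds):
    """Percentage of the baseline run time removed by the new build."""
    return 100.0 * (baseline_seconds - new_seconds) / baseline_seconds


for system, (stock, builds) in FEEDS.items():
    for label, seconds in builds.items():
        print(f"{system}: {label}: {percent_saved(stock, seconds):.1f}% less time than stock")
```

On these figures the architecture-specific 32-bit builds save a little under a fifth of the run time on both machines, and the 64-bit Athlon64 build saves close to 30%.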

 

Tom Womack and Chris Rorden, 7 June 2004