Microsoft Excel tries to be helpful, leads to errors in a fifth of published scientific research

Microsoft Office has long been prone to over-zealous auto-correction errors. From auto-numbering issues (if you’ve ever had to type a list using decimals such as 1.1.1, 1.1.2, etc., you know what I mean) to automatically selecting more text than you actually selected, undoing Microsoft’s “helpfulness” is usually more difficult and frustrating than it is worth.

Recently, researchers published a study in Genome Biology in which they found that fully 19.6% of peer-reviewed research in the field of genomics contain errors that were solely caused by Excel. Here is an excerpt from their report:

The problem of Excel software (Microsoft Corp., Redmond, WA, USA) inadvertently converting gene symbols to dates and floating-point numbers was originally described in 2004 [1]. For example, gene symbols such as SEPT2 (Septin 2) and MARCH1 [Membrane-Associated Ring Finger (C3HC4) 1, E3 Ubiquitin Protein Ligase] are converted by default to ‘2-Sep’ and ‘1-Mar’, respectively. Furthermore, RIKEN identifiers were described to be automatically converted to floating point numbers (i.e. from accession ‘2310009E13’ to ‘2.31E+13’). Since that report, we have uncovered further instances where gene symbols were converted to dates in supplementary data of recently published papers (e.g. ‘SEPT2’ converted to ‘2006/09/02’). This suggests that gene name errors continue to be a problem in supplementary files accompanying articles. Inadvertent gene symbol conversion is problematic because these supplementary files are an important resource in the genomics community that are frequently reused. Our aim here is to raise awareness of the problem.

The real issue that is causing the problem: researchers are copying and pasting data into Excel. Without formatting the cells to receive values such as “SEPT2” or “MARCH1” as text, Excel assigns the formatting itself to the cell.