Almost half of the world’s population still cooks on biomass cookstoves of poor efficiency and primitive design, such as three-stone fires (TSF). Emissions from biomass cookstoves contribute to adverse health effects and climate change. A number of “improved cookstoves” with higher energy efficiency and lower emissions have been designed and promoted across the world. During design development, and when selecting a stove for dissemination, stove performance and emissions are commonly evaluated, communicated, and compared using the arithmetic average of replicate tests made with a standardized laboratory-based test, commonly the water boiling test (WBT). However, the published literature shows different WBT results reported by different laboratories for the same stove technology. There is also no agreement in the literature on how many replicate tests should be performed to ensure “significance” in the reported average performance. This matter has not received attention in the rapidly growing literature on stoves, yet it is crucial for estimating and communicating the performance of a stove and for comparing performance between stoves. We present results of statistical analyses using data from a number of replicate tests of the performance and emissions of the Berkeley-Darfur Stove (BDS) and the TSF under well-controlled laboratory conditions. We observed moderate variability in the test results for both the TSF and the BDS across several measured characteristics. Here we focus on two as illustrative: time-to-boil and emissions of PM2.5 (particulate matter less than or equal to 2.5 micrometers in diameter). We demonstrate that interpretation of the results comparing these stoves could be misleading if only a small number of replicates had been conducted. We then describe a practical approach, useful to both stove testers and designers, for assessing the number of replicates needed to obtain useful data.
Caution should be exercised in attaching high credibility to results based on only a few replicates of cookstove performance and emissions tests. Stove designers, testers, program implementers, and decision makers should all benefit from improved awareness of the importance of conducting an adequate number of replicates to produce practically useful test data.
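To illustrate why the number of replicates matters, the following is a minimal sketch, not the procedure used in this work. It assumes a normal approximation to the confidence interval of a mean; the function names, the standard deviation, and the target margin are hypothetical inputs chosen by the tester, not measured values. It estimates how many replicates are needed before the reported average is known to within a chosen margin.

```python
import math
from statistics import NormalDist


def ci_half_width(sample_sd, n, confidence=0.95):
    """Half-width of the normal-approximation confidence interval
    for the mean of n replicate tests (hypothetical helper)."""
    if sample_sd <= 0 or n < 1:
        raise ValueError("sample_sd must be positive and n at least 1")
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sample_sd / math.sqrt(n)


def replicates_needed(sample_sd, margin, confidence=0.95):
    """Smallest n for which the half-width z * sd / sqrt(n) does not
    exceed the requested margin (hypothetical helper)."""
    if sample_sd <= 0 or margin <= 0:
        raise ValueError("sample_sd and margin must be positive")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil((z * sample_sd / margin) ** 2)


if __name__ == "__main__":
    # Assumed inputs: a pilot standard deviation of 4 (in the units of
    # the measured quantity) and a desired margin of 2 in the same units.
    n = replicates_needed(sample_sd=4.0, margin=2.0)
    print(n, ci_half_width(4.0, n))
```

Because the normal approximation ignores the extra uncertainty in a standard deviation estimated from few tests, it tends to understate the replicates required when that number is small; an interval based on the t-distribution would call for somewhat more.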