Our current understanding of terrestrial carbon processes is represented in various models used to integrate and scale measurements of CO2 exchange from remote sensing and other spatiotemporal data. Yet assessments are rarely conducted to determine how well models simulate carbon processes across vegetation types and environmental conditions. Using standardized data from the North American Carbon Program, we compare observed monthly CO2 exchange from 44 eddy covariance flux towers in North America with simulations from 22 terrestrial biosphere models. The analysis spans ∼220 site-years and 10 biomes and includes two large-scale drought events, providing a natural experiment to evaluate model skill as a function of drought and seasonality. We evaluate models' ability to simulate the seasonal cycle of CO2 exchange using multiple model skill metrics and analyze links between model characteristics, site history, and model skill. Overall model performance was poor; the difference between observations and simulations was ∼10 times the observational uncertainty, with forested ecosystems better predicted than nonforested ones. Model-data agreement was highest in summer and in temperate evergreen forests. In contrast, model performance declined in spring and fall, especially in ecosystems with large deciduous components, and in dry periods during the growing season. Models used across multiple biomes and sites, the mean model ensemble, and a model using assimilated parameter values showed high consistency with observations. The models with the highest skill across all biomes all used prescribed canopy phenology, calculated NEE as the difference between GPP and ecosystem respiration, and did not use a daily time step.
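
As an illustration of what it means for the model-data difference to be a multiple of observational uncertainty, the following minimal Python sketch computes NEE from its components and expresses the mean absolute model-data difference in units of observational uncertainty. It is not the skill metric used in this study. The function names, the sign convention (NEE = ecosystem respiration − GPP, so negative values denote net uptake), and all numerical values are assumptions chosen for illustration only.

```python
def nee_from_components(gpp, reco):
    """Return NEE for each month as Reco - GPP.

    Assumed sign convention: negative NEE means net carbon uptake.
    """
    if len(gpp) != len(reco):
        raise ValueError("gpp and reco must have the same length")
    return [r - g for g, r in zip(gpp, reco)]


def normalized_misfit(observed, simulated, uncertainty):
    """Mean absolute model-data difference in units of observational uncertainty.

    A value of 1 means the typical misfit equals the uncertainty of the
    observations; larger values indicate poorer agreement.
    """
    if not (len(observed) == len(simulated) == len(uncertainty)):
        raise ValueError("all inputs must have the same length")
    if not observed:
        raise ValueError("inputs must not be empty")
    ratios = [abs(s - o) / u for o, s, u in zip(observed, simulated, uncertainty)]
    return sum(ratios) / len(ratios)


# Hypothetical monthly fluxes for one site (illustrative values, not study data).
sim_gpp = [50.0, 120.0, 200.0]
sim_reco = [60.0, 90.0, 130.0]
sim_nee = nee_from_components(sim_gpp, sim_reco)

obs_nee = [5.0, -50.0, -100.0]
obs_uncertainty = [5.0, 10.0, 10.0]

misfit = normalized_misfit(obs_nee, sim_nee, obs_uncertainty)
print(sim_nee, misfit)
```

With these illustrative inputs the misfit is twice the observational uncertainty; a value near 10 would correspond to the overall level of disagreement reported above.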