A number of recent psychophysical studies have argued that human behavioral processing of sensory inputs is best captured by probabilistic computations. Because of conflicting cues, real scenes are ambiguous and support multiple hypotheses about how they should be interpreted, and this ambiguity requires handling uncertainty. The effects of this inherent perceptual uncertainty on immediate perceptual decisions have been well characterized, but its effects on learning (beyond a non-specific slowing) have not been studied. Although it is known that statistically optimal learning requires combining evidence from all alternative hypotheses weighted by their respective certainties, it remains an open question whether humans learn this way. In this study, we tested whether human observers can learn about, and make inferences in, situations where multiple interpretations compete for each stimulus. We used an unsupervised visual learning paradigm in which ecologically relevant but conflicting cues gave rise to alternative hypotheses as to how unknown complex multi-shape visual scenes should be segmented. The strengths of the conflicting segmentation cues, “high-level” statistically learned chunks and “low-level” grouping features of the input based on connectedness, were systematically manipulated in a series of experiments, and human performance was compared to the predictions of Bayesian model averaging. We found that humans weighted and combined alternative hypotheses about scene description according to their reliability, demonstrating an optimal treatment of uncertainty in learning. These results capture not only the way adults learn to segment new visual scenes, but also the qualitative shift in learning performance from 8-month-old infants to adults. Our results suggest that perceptual learning models based on point estimates, which, instead of averaging over models, evaluate only a single hypothesis with the “best explanatory power,” are not sufficient to characterize human visual learning of complex sensory inputs.
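For concreteness, the ideal-observer computation invoked above can be sketched as follows (a minimal formulation in our own notation, not taken verbatim from the study). Given observed scenes $D$ and competing segmentation hypotheses $h_1, \dots, h_K$, a Bayesian model-averaging learner evaluates a new stimulus $x$ by weighting every hypothesis by its posterior probability,
\[
P(x \mid D) \;=\; \sum_{k=1}^{K} P(x \mid h_k, D)\, P(h_k \mid D),
\]
whereas a point-estimate learner commits to the single best hypothesis $h^{*} = \arg\max_k P(h_k \mid D)$ and predicts $P(x \mid h^{*}, D)$ alone. The central claim here is that human learning performance tracks the former, reliability-weighted computation rather than the latter.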