Traditionally, statistical learning (SL) studies in the auditory domain have been linked to language processing and, therefore, to sequential predictability or "temporal" structure learning. In contrast, research on visual SL has focused more on the discovery of general spatio-temporal patterns, which requires spatial structure learning. We asked whether this dichotomy is justified or whether auditory SL should also be considered within the more general framework of domain-independent discovery of spatio-temporal patterns. In three auditory experiments, we used the co-occurrence statistics of the classical visual spatial SL paradigm, without and with spatial information included. From co-occurring but not spatially separated auditory "scenes" containing up to four different sounds presented concurrently, human adults learned the same statistics of underlying pair-based chunks as in visual SL tasks. When, with the help of a two-dimensional loudspeaker grid, the auditory stimuli were presented in a spatial layout that closely followed the structure of earlier visual studies, the number of sounds per scene that adults could parse into chunks increased by 50%. In addition, adults learned the different statistics to different degrees depending on the difficulty of the task. These results support the idea of treating auditory and visual statistical learning within a joint framework.