While research on visual statistical learning (VSL) is divided into two distinct lines investigating the learning of temporal and spatial regularities separately, such a distinction does not hold in real-world environments, where the two types of regularities are perpetually intertwined as spatial patterns unfold over time. We investigated the interplay between spatial and temporal regularities in a new VSL paradigm, in which spatially defined chunks moved continuously in and out of the observer’s view. First, participants passively observed a stream of stimuli in a task-free setup. Scenes composed of novel shape pairs (oriented horizontally, vertically, or diagonally) were presented through a 3×3 grid aperture without between-pair segmentation cues. Periodically, the whole scene within the aperture shifted by one grid cell in a given direction, so that some shapes moved out of the aperture while others moved in, and consequently particular pairs were sometimes only partially visible. Subsequently, participants completed a two-alternative forced-choice (2AFC) familiarity task, judging between real and foil pairs. In Experiment 1 (n=20), participants reached the same level of correct responses in this new setup as in classical spatial VSL experiments (M=61.11%, SE=3.19, p=0.003, BF=16.31). In Experiments 2a (n=73) and 2b (n=75), we introduced different levels of spatial noise by biasing the ratio between specific movement directions. More horizontal movement led to significantly more partial presentations (i.e., more noise) of horizontal than of vertical pairs, and vice versa. Despite the strong differences in spatial conditional probabilities within the different pair types produced by these manipulations, learning of pairs was not selectively hindered: observers performed equally well with all pair types. Evidently, observers can rely on the high temporal coherence of the evolving scenes to recover and represent the spatial structure regardless of spatial noise, and their learning is not a direct consequence of exposure frequency.