Research on visual statistical learning (VSL) is classically divided into two independent lines: temporal and spatial VSL. For observers in real-world environments, however, spatial patterns unfold over time, fundamentally intertwining the two types of regularities. Using a new spatio-temporal VSL setup, we investigated the nature of this interdependence by moving spatially defined structures in and out of participants’ view over time. First, participants passively observed, for several minutes, a large grid-like plane cluttered with novel shapes through a 3×3-cell aperture moving over the plane. Unbeknownst to the participants, the shape arrangement was composed of fixed pairs of shapes oriented horizontally, vertically, or diagonally and placed on the grid without any between-pair segmentation cues. The aperture periodically moved by one grid cell in a given direction, so that some shapes moved out of the visible field while others moved in. As a result, pairs were sometimes only partially visible as they entered or left the aperture, creating noise in the perceived spatial structure. After this passive exposure, participants’ acquired sensitivity to real vs. foil pairs was measured in a two-alternative forced-choice (2AFC) familiarity task. In Experiment 1 (n=20), participants showed the same level of learning in this setup as in classical static spatial VSL experiments (M=61.11%, SE=3.19, p=0.003, BF=16.31), demonstrating that learning of spatial structures is possible in dynamic contexts. In Experiment 2a (n=70), we manipulated the spatial noise associated with the different pair types by enforcing more horizontal than vertical aperture movements, thereby producing more partial presentations of horizontal than of vertical pairs. All experiments were run with both a horizontal and a vertical movement bias in different groups of participants, yielding symmetrical results. In the condition with more horizontal moves, the bias unevenly reduced the spatial conditional probabilities within pairs from 1.0 to 0.67 (diagonal), 0.75 (horizontal), and 0.92 (vertical). In the control Experiment 2b (n=76), we used the same stimulus displays as in Experiment 2a but presented them statically and in scrambled order, thereby removing the temporal coherence of the stream while keeping the same level of spatial noise. Participants in the temporally coherent version (2a) learned significantly better overall than participants in the temporally scrambled version (2b) (F(1,144)=5.969, p=0.016), and in both experiments they learned the diagonal pairs better than the horizontal pairs (F(2,288)=3.601, p=0.029). These results confirm the hypothesis that observers can rely on the temporal coherence of evolving scenes to recover the spatial structure of noisy input. Surprisingly, however, the success of learning the pair structure was not tied directly to the amount of spatial noise: the diagonal pairs, with higher noise, were learned better than the horizontal pairs, with lower noise. Overall, these results indicate that the common assumption in spatial VSL that learning is a direct consequence of spatial statistics alone is an oversimplification.
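
To make the noise manipulation concrete, the minimal sketch below simulates a 3×3 aperture taking biased single-cell steps over a toroidal shape plane and estimates, per pair orientation, how often a shape’s pair partner is co-visible during one of its presentation episodes. Everything in it is an illustrative assumption rather than the experiments’ actual procedure: the plane size, the bias strength, the random-walk movement schedule, and the episode-based operationalization of the within-pair conditional probability are all hypothetical, so the printed estimates need not reproduce the reported values of 0.67, 0.75, and 0.92.

```python
import random
from collections import defaultdict

random.seed(1)

G = 12          # side of the toroidal shape plane (illustrative)
AP = 3          # aperture spans AP x AP grid cells
H_BIAS = 0.75   # probability that a step is horizontal; 0.5 = unbiased walk
STEPS = 300_000

# One example pair per orientation at arbitrary anchor cells (hypothetical
# layout; the real displays tiled the whole plane with such pairs).
pairs = {
    "horizontal": ((0, 0), (0, 1)),   # same row, adjacent columns
    "vertical":   ((0, 4), (1, 4)),   # same column, adjacent rows
    "diagonal":   ((4, 0), (5, 1)),   # offset by one row and one column
}

def visible(cell, top_left):
    """Is `cell` inside the AP x AP aperture whose top-left is `top_left`?"""
    dr = (cell[0] - top_left[0]) % G
    dc = (cell[1] - top_left[1]) % G
    return dr < AP and dc < AP

r, c = 0, 0
open_visits = {}                        # shape -> partner seen this visit?
visits = defaultdict(lambda: [0, 0])    # orientation -> [co-visible, total]
shapes = [(ori, i) for ori in pairs for i in (0, 1)]

for _ in range(STEPS):
    # Biased random-walk step of the aperture: mostly horizontal moves.
    if random.random() < H_BIAS:
        c = (c + random.choice((-1, 1))) % G
    else:
        r = (r + random.choice((-1, 1))) % G

    for ori, i in shapes:
        me, partner = pairs[ori][i], pairs[ori][1 - i]
        key = (ori, i)
        if visible(me, (r, c)):
            if key not in open_visits:
                open_visits[key] = False          # a new visit starts
            if visible(partner, (r, c)):
                open_visits[key] = True           # partner co-occurred
        elif key in open_visits:
            seen = open_visits.pop(key)           # visit ends: record it
            visits[ori][0] += seen
            visits[ori][1] += 1

for ori, (co, total) in visits.items():
    print(f"{ori:10s} conditional probability ≈ {co / total:.2f}")
```

Varying H_BIAS between 0.5 and 1.0 and comparing the three printed estimates gives a feel for how the movement statistics, and not the pair layout alone, shape the effective spatial statistics of the input.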