We investigated visuo-auditory statistical learning using four visual shape pairs and four auditory sound pairs, creating strong and weak cross-modal quadruples by manipulating how reliably a visual pair and an auditory pair co-occurred across a large number of audio-visual scenes. In Exp 1, only the strong and weak quads were used; in Exp 2, additional individual shapes and sounds were mixed into the same cross-modal structures as noise. After passive exposure to these scenes, participants completed three familiarity tests: (T1) visual or auditory pairs against pairs of randomly combined elements, tested unimodally; (T2) strong cross-modal quads against weak ones; and (T3) visual or auditory pairs from the strong quads against those from the weak quads, tested unimodally. Without noise (Exp 1), participants learned all structures but performed at chance in T3. In Exp 2, although auditory T1 performance was at chance, participants preferred strong pairs in the auditory T3, revealing a strong cross-modal boost.
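The co-occurrence manipulation behind the strong and weak quads can be sketched as a simple scene generator. This is a minimal, hypothetical sketch: the element labels, the number of scenes, and the co-occurrence probabilities (`p_strong`, `p_weak`) are illustrative assumptions, not the actual parameters of the experiments.

```python
import random

# Hypothetical inventories: four visual shape pairs and four auditory sound
# pairs (labels are placeholders, not the real stimuli).
VISUAL_PAIRS = [("v1", "v2"), ("v3", "v4"), ("v5", "v6"), ("v7", "v8")]
AUDITORY_PAIRS = [("a1", "a2"), ("a3", "a4"), ("a5", "a6"), ("a7", "a8")]

# Fix a one-to-one mapping: each visual pair has a designated auditory partner,
# forming a cross-modal quadruple.
QUADS = list(zip(VISUAL_PAIRS, AUDITORY_PAIRS))
STRONG = QUADS[:2]  # partners co-occur reliably
WEAK = QUADS[2:]    # partners co-occur less reliably

def make_scene(rng, p_strong=1.0, p_weak=0.5):
    """Return one audio-visual scene: a visual pair plus an auditory pair.

    A strong quad presents its designated auditory partner with probability
    p_strong; a weak quad does so with probability p_weak, otherwise a
    different auditory pair is substituted, diluting the co-occurrence.
    """
    vis, aud = QUADS[rng.randrange(len(QUADS))]
    p = p_strong if (vis, aud) in STRONG else p_weak
    if rng.random() >= p:
        # Break the pairing: substitute a randomly chosen other auditory pair.
        aud = rng.choice([a for _, a in QUADS if a != aud])
    return vis, aud

rng = random.Random(0)
scenes = [make_scene(rng) for _ in range(1000)]
```

Under this sketch, a learner tracking cross-modal co-occurrence statistics would see a visual pair and its auditory partner together on every strong-quad scene, but only on about half of the weak-quad scenes.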