Statistical learning (SL) within modalities is an area of intensive research, but much less attention has been paid to how SL works across modalities, beyond demonstrating that learning can benefit from information provided in more than one modality. We investigated visuo-auditory SL using the standard arrangement of SL paradigms. Four visual and four auditory pairs were created from eight abstract shapes and eight distinctive sounds, respectively. Visual pairs consisted of two shapes always appearing together in a fixed spatial relation; auditory pairs were defined by two sounds always being heard at the same time. Next, strong and weak cross-modal quadruples (quads) were defined: in a strong quad, a visual pair always occurred together with one particular auditory pair, whereas in a weak quad, a visual pair appeared with one of two possible auditory pairs. Using additional individual shapes and sounds, a large number of cross-modal six-element scenes were created, each containing one visual pair, a single shape, one auditory pair, and a single sound. During familiarization, adult participants were exposed to a succession of such cross-modal scenes without any explicit task instruction. They were then given three familiarity tests: (1) visual or auditory pairs against pairs of randomly combined elements, tested unimodally; (2) strong cross-modal quads against weak ones; and (3) visual or auditory pairs from the strong quads against those from the weak quads, again tested unimodally. We set the relative difficulties so that in Test 1 the visual pairs were strongly preferred over random pairs, whereas the choice between auditory pairs and random sound pairs was at chance. Surprisingly, this setup led participants to choose the weak quads significantly more often as the familiar constructs in Test 2, and to prefer both the visual and the auditory strong pairs equally strongly over the corresponding weak pairs in Test 3. 
We interpreted this complex interaction in terms of probabilistic explaining-away effects occurring within the participants’ emerging internal model.
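
The scene construction described above can be illustrated with a minimal Python sketch. It is not the authors' stimulus code. All labels (`v1`, `a1`, `vs1`, and so on), the split into two strong and two weak visual pairs, the sharing of two auditory pairs between the weak visual pairs, and the number of additional single shapes and sounds are assumptions made purely for illustration, since the abstract does not specify them.

```python
import random

# Hypothetical labels: 8 shapes and 8 sounds grouped into 4 + 4 pairs.
VISUAL_PAIRS = [("v1", "v2"), ("v3", "v4"), ("v5", "v6"), ("v7", "v8")]
AUDIO_PAIRS = [("a1", "a2"), ("a3", "a4"), ("a5", "a6"), ("a7", "a8")]

# Assumption: visual pairs 0 and 1 are "strong" (one fixed auditory pair),
# visual pairs 2 and 3 are "weak" (one of two possible auditory pairs).
ALLOWED_AUDIO = {
    0: [0],
    1: [1],
    2: [2, 3],
    3: [2, 3],
}

# Assumption: four extra single shapes and four extra single sounds.
SINGLE_SHAPES = ["vs1", "vs2", "vs3", "vs4"]
SINGLE_SOUNDS = ["as1", "as2", "as3", "as4"]


def make_scene(rng):
    """Build one six-element scene: visual pair, single shape,
    auditory pair, single sound."""
    v = rng.randrange(len(VISUAL_PAIRS))
    a = rng.choice(ALLOWED_AUDIO[v])
    return {
        "visual_pair": VISUAL_PAIRS[v],
        "audio_pair": AUDIO_PAIRS[a],
        "single_shape": rng.choice(SINGLE_SHAPES),
        "single_sound": rng.choice(SINGLE_SOUNDS),
    }


def make_familiarization(n_scenes, seed=0):
    """Return a list of randomly sampled scenes for the exposure phase."""
    rng = random.Random(seed)
    return [make_scene(rng) for _ in range(n_scenes)]
```

Under these assumptions, calling `make_familiarization(200)` yields scenes in which each strong visual pair co-occurs with exactly one auditory pair, while each weak visual pair co-occurs with either of two auditory pairs, which is the contrast probed in Tests 2 and 3.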