Contrastive Learning Relies More on Spatial Inductive Bias Than Supervised Learning: An Empirical Study
Yuanyi Zhong1†
Haoran Tang2†
Junkun Chen1†
Yuxiong Wang1
1University of Illinois at Urbana-Champaign
2University of Pennsylvania
† Equal Contribution

[Paper]
[GitHub]
[Bibtex]

Abstract

Though self-supervised contrastive learning (CL) has shown its potential to achieve state-of-the-art accuracy without any supervision, its behavior remains under-investigated. Different from most previous work that seeks to understand CL through its learning objectives, we focus on an unexplored yet natural aspect: the spatial inductive bias that CL seems to exploit implicitly via data augmentations. We design an experiment to study the reliance of CL on such spatial inductive bias by destroying the global or local spatial structure of images with global or local patch shuffling, and comparing the performance drop between the original and corrupted datasets to quantify the reliance on each inductive bias. We also use the uniformity of the feature space to further investigate how CL-pretrained models behave on the corrupted datasets. Our results and analysis show that CL relies much more on spatial inductive bias than supervised learning (SL), regardless of the specific CL algorithm or backbone, opening a new direction for studying the behavior of CL.


Presentation


[Slides]

Destroying Spatial Inductive Bias via Data Corruptions

We destroy the spatial inductive bias in the data by corrupting its structural information in two ways (see the sketch after this list):
  • Global shuffling: divide the image into N×N patches of patch size P, then shuffle the patches across the whole image
  • Local shuffling: divide the image into N×N patches of patch size P, then shuffle the pixels within each patch independently
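Below is a minimal PyTorch sketch of the two corruptions, assuming square (C, H, W) images whose side length equals N×P; the function and argument names are illustrative and not taken from our released code.

```python
# Minimal sketch of global and local patch shuffling (illustrative, not the released code).
import torch


def global_shuffle(img: torch.Tensor, num_patches: int) -> torch.Tensor:
    """Split a (C, H, W) image into num_patches x num_patches patches and permute them."""
    c, h, w = img.shape
    p = h // num_patches  # patch size, assuming H == W == num_patches * p
    # (C, N, P, N, P) -> (N*N, C, P, P)
    patches = img.view(c, num_patches, p, num_patches, p).permute(1, 3, 0, 2, 4)
    patches = patches.reshape(num_patches * num_patches, c, p, p)
    patches = patches[torch.randperm(patches.size(0))]  # shuffle the patch order
    # Reassemble the permuted patches back into an image.
    patches = patches.view(num_patches, num_patches, c, p, p).permute(2, 0, 3, 1, 4)
    return patches.reshape(c, h, w)


def local_shuffle(img: torch.Tensor, num_patches: int) -> torch.Tensor:
    """Shuffle pixels independently within each patch, keeping patch locations fixed."""
    c, h, w = img.shape
    p = h // num_patches
    out = img.clone()
    for i in range(num_patches):
        for j in range(num_patches):
            patch = out[:, i * p:(i + 1) * p, j * p:(j + 1) * p].reshape(c, -1)
            perm = torch.randperm(p * p)  # one pixel permutation shared across channels
            out[:, i * p:(i + 1) * p, j * p:(j + 1) * p] = patch[:, perm].reshape(c, p, p)
    return out
```

Global shuffling destroys the global arrangement of patches while leaving local statistics intact, whereas local shuffling destroys local texture while keeping the coarse layout of the image.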
Pretraining and downstream evaluation with these corruptions lead to consistent observations that hold across datasets, architectures, etc.:
  • CL relies more on spatial inductive bias than SL, and relies more on the global spatial inductive bias than the local one
  • Well-pretrained CL models that have exploited the spatial bias are more robust to patch shuffling than SL models on downstream tasks

Empirical Analysis from Feature Space

Grad-CAM maps from learned representations

Class-wise feature uniformity throughout training (black represents the overall feature space)
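Feature-space uniformity is commonly measured with the log of the average pairwise Gaussian potential over normalized features (Wang & Isola, 2020); the sketch below shows this standard metric, though whether the plots above use exactly these defaults is an assumption.

```python
# Minimal sketch of the common uniformity measure from Wang & Isola (2020):
# log E[exp(-t * ||z_i - z_j||^2)] over L2-normalized features.
# Lower values indicate a more uniform distribution on the unit hypersphere.
import torch
import torch.nn.functional as F


def uniformity(features: torch.Tensor, t: float = 2.0) -> torch.Tensor:
    """features: (num_samples, dim) embeddings; returns a scalar uniformity value."""
    z = F.normalize(features, dim=1)            # project embeddings onto the unit hypersphere
    sq_dists = torch.pdist(z, p=2).pow(2)       # pairwise squared Euclidean distances
    return sq_dists.mul(-t).exp().mean().log()  # log of the average Gaussian potential
```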

t-SNE visualizations of the learned features (panel labels: ori, def, glo, loc)


This template was originally made by Phillip Isola and Richard Zhang for a colorful ECCV project; the code can be found here.