Self-supervised visual representation learning (SSL) aims to extract meaningful features from unlabeled datasets, alleviating the need for labor-intensive and time-consuming manual labeling. However, existing contrastive learning-based methods typically underutilize the data, consume substantial computational resources, and require long training schedules or large batch sizes. In this study, we propose a novel method for optimizing self-supervised learning that combines the advantages of sparse-dense sampling and collaborative optimization, thereby significantly improving performance on downstream tasks. Specifically, sparse-dense sampling focuses primarily on high-level semantic features while exploiting the spatial structure of the unlabeled data to also incorporate low-level texture features, improving data utilization. In addition, collaborative optimization, comprising contrastive and location tasks, further enhances the model's ability to perceive features of different dimensions, thereby improving its use of the embedding space. Moreover, combining sparse-dense sampling with collaborative optimization reduces computational cost while improving performance. Extensive experiments demonstrate that the proposed method effectively reduces computational requirements while delivering favorable results. The code and model weights will be available at https://github.com/AI-TYQ/S4.
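To make the "collaborative optimization" idea concrete, the following is a minimal PyTorch sketch of a joint objective that pairs an InfoNCE-style contrastive term with a patch-location classification term. This is an illustrative sketch under assumptions, not the authors' implementation: the names `info_nce`, `joint_loss`, `lambda_loc`, and `num_locations` are hypothetical, and the actual loss formulation in the paper may differ.

```python
# Hedged sketch: contrastive loss + location-prediction loss, jointly weighted.
# All names (joint_loss, lambda_loc, num_locations) are illustrative assumptions.
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.2):
    """Contrastive loss: matched views are positives, rest of the batch negatives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

def joint_loss(z1, z2, loc_logits, loc_targets, lambda_loc=0.5):
    """Collaborative objective: contrastive term + location-prediction term."""
    l_con = info_nce(z1, z2)
    l_loc = F.cross_entropy(loc_logits, loc_targets)  # which spatial patch?
    return l_con + lambda_loc * l_loc

# Toy usage with random tensors standing in for encoder outputs.
B, D, num_locations = 8, 128, 9                   # e.g., a 3x3 grid of patches
z1, z2 = torch.randn(B, D), torch.randn(B, D)     # embeddings of two views
loc_logits = torch.randn(B, num_locations)        # location-head predictions
loc_targets = torch.randint(0, num_locations, (B,))
print(joint_loss(z1, z2, loc_logits, loc_targets))
```

The design intuition, as the abstract describes it, is that the contrastive term captures high-level semantics while the location term forces sensitivity to low-level spatial structure; weighting the two (here via the assumed `lambda_loc`) lets one objective complement the other.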