With large-scale supervised pre-training, Transformer models have recently approached or even surpassed the performance of ConvNets on computer vision tasks such as classification and segmentation. In this work, we bridge the gap between ConvNets and Transformers for Earth observation by self-supervised pre-training on large-scale unlabeled remote sensing data. The resulting representations can be used for both land-cover classification and segmentation tasks, where they significantly outperform fully supervised baselines while requiring only a fraction of the labeled training data.
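To make the pre-train-then-fine-tune recipe concrete, here is a minimal sketch of the general pattern the abstract describes. The abstract does not specify the pretext task, so this assumes a SimCLR-style contrastive objective as a stand-in; the encoder, tensor shapes, loss name, and class count are all illustrative, not the presenters' actual method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent contrastive loss between two augmented views, each (B, D)."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)        # (2B, D)
    sim = z @ z.t() / temperature                             # pairwise similarities
    B = z1.size(0)
    mask = torch.eye(2 * B, dtype=torch.bool, device=z.device)
    sim.masked_fill_(mask, float("-inf"))                     # exclude self-similarity
    # Positive for row i is the other view of the same scene.
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)]).to(z.device)
    return F.cross_entropy(sim, targets)

# Toy encoder + projection head; any ConvNet or Transformer backbone fits here.
encoder = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 64))
projector = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 32))
opt = torch.optim.Adam(list(encoder.parameters()) + list(projector.parameters()), lr=1e-3)

# Self-supervised step on unlabeled imagery: two augmented views of the same scene.
view1, view2 = torch.rand(8, 3, 64, 64), torch.rand(8, 3, 64, 64)  # placeholder augmentations
loss = nt_xent_loss(projector(encoder(view1)), projector(encoder(view2)))
opt.zero_grad(); loss.backward(); opt.step()

# Downstream: freeze the pre-trained encoder and fit a small head on a labeled subset.
for p in encoder.parameters():
    p.requires_grad = False
head = nn.Linear(64, 10)  # e.g. 10 land-cover classes (illustrative)
```

Because only the lightweight head is trained downstream, a small labeled subset can suffice, which is the "fraction of the labeled training data" claim in the abstract.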
Presenters: Linus Scheibenreif and Joëlle Hanna, University of St. Gallen (Switzerland)