Fusing data from sensors with different modalities is a common technique for improving land cover classification performance in remote sensing. However, the full set of modalities is rarely available for all test data, and this missing data problem poses severe challenges for multi-modal learning. Inspired by recent successes in deep learning, we propose as a remedy a convolutional neural network (CNN) architecture for urban remote sensing image segmentation that is trained on data modalities which are not all available at test time. We train our architecture with a cost function particularly suited to imbalanced classes, a frequent problem in remote sensing. We demonstrate the method on a benchmark dataset containing RGB and digital surface model (DSM) images. Assuming that the DSM images are missing during testing, our method outperforms both a CNN trained on RGB images and an ensemble of two CNNs trained on RGB images by exploiting training-time information from the missing modality.
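The abstract does not say which cost function is used for imbalanced classes. As an illustration only, the sketch below shows one common choice for segmentation: a cross-entropy loss in which each class is weighted by the median class frequency divided by its own frequency, so that rare classes contribute more per pixel. The function names `median_frequency_weights` and `weighted_cross_entropy` are hypothetical, the frequencies are computed over all pixels as a simplification, and nothing here should be read as the loss actually used in this work.

```python
import math
from collections import Counter
from statistics import median


def median_frequency_weights(labels, num_classes):
    """Return one weight per class: median(freq) / freq[c].

    `labels` is a flat list of integer class ids, one per pixel.
    Frequencies are taken over all pixels (a simplifying assumption).
    Classes absent from `labels` get weight 0.0.
    """
    counts = Counter(labels)
    total = len(labels)
    freqs = [counts.get(c, 0) / total for c in range(num_classes)]
    # The median is taken over classes that actually occur.
    med = median(f for f in freqs if f > 0)
    return [med / f if f > 0 else 0.0 for f in freqs]


def weighted_cross_entropy(probs, labels, weights):
    """Mean class-weighted negative log-likelihood over pixels.

    `probs[i][c]` is the predicted probability of class c at pixel i.
    """
    eps = 1e-12  # guards against log(0)
    losses = [
        -weights[y] * math.log(max(p[y], eps))
        for p, y in zip(probs, labels)
    ]
    return sum(losses) / len(losses)


# Toy example: 10 pixels, three classes with a 6 / 3 / 1 split.
toy_labels = [0] * 6 + [1] * 3 + [2] * 1
toy_weights = median_frequency_weights(toy_labels, num_classes=3)
toy_probs = [[1 / 3, 1 / 3, 1 / 3] for _ in toy_labels]
toy_loss = weighted_cross_entropy(toy_probs, toy_labels, toy_weights)
```

In the toy example the rarest class receives the largest weight, so a misclassified pixel of that class is penalized more heavily than a pixel of the dominant class.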