publications
publications by category in reverse chronological order. generated by jekyll-scholar.
2023
- ECCV 2024. CLAP: Isolating Content from Style through Contrastive Learning with Augmented Prompts. Yichao Cai, Yuhang Liu, Zhen Zhang, and 1 more author. 2023.
Contrastive vision-language models, such as CLIP, have garnered considerable attention for various downstream tasks, mainly due to the remarkable generalization ability of the learned features. However, the features they learn often blend content and style information, which limits their generalization under distribution shifts. To address this limitation, we adopt a causal generative perspective on multimodal data and propose contrastive learning with data augmentation to disentangle content features from the original representations. To achieve this, we begin by exploring image augmentation techniques and develop a method to seamlessly integrate them into pre-trained CLIP-like models to extract pure content features. Taking a step further, recognizing the inherent semantic richness and logical structure of text data, we explore the use of text augmentation to isolate latent content from style features. This enables the encoders of CLIP-like models to concentrate on latent content information, refining the representations learned by pre-trained CLIP-like models. Our extensive experiments across diverse datasets demonstrate significant improvements in zero-shot and few-shot classification tasks, alongside enhanced robustness to various perturbations. These results underscore the effectiveness of our proposed methods in refining vision-language representations and advancing the state-of-the-art in multimodal learning.
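To make the contrastive objective concrete, here is a minimal, hypothetical sketch of an InfoNCE-style loss that pulls together a frozen encoder's features for a prompt and its style-augmented variant through a small projection head. The `ContentHead` module, the feature dimensions, and the random tensors standing in for real CLIP text features are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: align features of a prompt and its style-augmented
# variant so a small projection head keeps only the shared (content) signal.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContentHead(nn.Module):
    """Projection head trained on top of frozen CLIP-like features (assumed dims)."""
    def __init__(self, dim_in=512, dim_out=256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim_in, dim_in), nn.ReLU(),
                                  nn.Linear(dim_in, dim_out))

    def forward(self, x):
        return F.normalize(self.proj(x), dim=-1)

def info_nce(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE: each prompt should match its augmented counterpart."""
    logits = z_a @ z_b.t() / temperature                    # (B, B) similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)  # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random tensors in place of frozen text-encoder outputs.
head = ContentHead()
feats_orig = torch.randn(8, 512)   # features of original prompts
feats_aug = torch.randn(8, 512)    # features of style-augmented prompts
loss = info_nce(head(feats_orig), head(feats_aug))
loss.backward()
```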
- ICUS 2023. Robust Real-Time Curb Detection for Autonomous Sanitation Vehicles. Yichao Cai, Kejun Ou, Dachuan Li, and 3 more authors. In 2023 IEEE International Conference on Unmanned Systems (ICUS), 2023.
Curb detection is a key enabling capability for the precise curb-following control of autonomous sanitation vehicles, and robust, efficient curb detection in complex environments remains a challenging problem. In this paper, we propose a novel semantic segmentation-based framework for curb detection using monocular bird's-eye-view images. We employ a lightweight segmentation network based on HRNet to extract the drivable area. A zero-shot post-processing approach is proposed to extract a candidate point set from the segmented image for robust curve fitting. In addition, we propose a modified RANSAC fitting approach that accounts for outlier points to achieve dynamic-order curve fitting and curb representation. Experimental results in complex sanitation scenarios demonstrate the efficiency, accuracy, and robustness of the proposed approach.
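As a rough illustration of the curve-fitting step, the sketch below shows a basic RANSAC polynomial fit over candidate curb points. The sampling budget, inlier threshold, and fixed polynomial order are assumptions; the paper's modified, dynamic-order RANSAC differs from this baseline.

```python
# Minimal RANSAC polynomial fit over candidate curb points (illustrative only).
import numpy as np

def ransac_polyfit(x, y, order=2, n_iters=200, thresh=0.05, seed=None):
    """Fit y = poly(x) robustly; return coefficients of the best consensus fit."""
    rng = np.random.default_rng(seed)
    min_samples = order + 1
    best_coeffs, best_count = None, 0
    for _ in range(n_iters):
        idx = rng.choice(len(x), size=min_samples, replace=False)
        coeffs = np.polyfit(x[idx], y[idx], order)           # fit a minimal sample
        residuals = np.abs(np.polyval(coeffs, x) - y)
        inliers = residuals < thresh
        if inliers.sum() > best_count:
            best_count = inliers.sum()
            best_coeffs = np.polyfit(x[inliers], y[inliers], order)  # refit on inliers
    return best_coeffs

# Toy usage: noisy quadratic curb boundary with a few injected outliers.
x = np.linspace(0.0, 10.0, 100)
y = 0.02 * x**2 + 0.1 * x + 1.0 + np.random.normal(0, 0.01, x.shape)
y[::17] += 1.0
print(ransac_polyfit(x, y, order=2))
```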
2018
- Sensors. Robust drivable road region detection for fixed-route autonomous vehicles using map-fusion images. Yichao Cai, Dachuan Li, Xiao Zhou, and 1 more author. Sensors, 2018.
Environment perception is one of the major challenges in autonomous driving systems. In particular, effective and robust drivable road region detection remains an open problem for autonomous vehicles in multi-lane roads, intersections, and unstructured road environments. In this paper, a computer vision and neural network-based drivable road region detection approach is proposed for fixed-route autonomous vehicles (e.g., shuttles, buses, and other vehicles operating on fixed routes), using a vehicle-mounted camera, a route map, and real-time vehicle location. The key idea of the proposed approach is to fuse an image with its corresponding local route map to obtain a map-fusion image (MFI) in which the image and the route map provide complementary information. The image is informative in road regions with rich features, while the local route map acts as a critical heuristic that enables robust drivable road region detection in areas without clear lane markings or borders. A neural network model built upon Convolutional Neural Networks (CNNs), namely FCN-VGG16, is used to extract the drivable road region from the fused MFI. The proposed approach is validated using real-world driving scenario videos captured by an industrial camera mounted on a testing vehicle. Experiments demonstrate that the proposed approach outperforms the conventional approach that uses non-fused images in terms of detection accuracy and robustness, and that it remains robust under adverse illumination conditions and varying pavement appearance, as well as under projection and map-fusion errors.
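The sketch below illustrates one simple way such a map-fusion image could be assembled, by appending a route-map mask (already projected into the camera frame) as an extra input channel. The function name, shapes, and channel-concatenation fusion are assumptions for illustration rather than the paper's exact formulation.

```python
# Illustrative sketch of forming a map-fusion image (MFI): the local route map,
# rasterized into the camera frame, is stacked with the RGB frame as a fourth
# channel before being fed to a segmentation network.
import numpy as np

def make_mfi(image_rgb: np.ndarray, route_mask: np.ndarray) -> np.ndarray:
    """Stack an RGB frame (H, W, 3) with a projected route-map mask (H, W)."""
    assert image_rgb.shape[:2] == route_mask.shape, "map must be in the image frame"
    route = route_mask.astype(image_rgb.dtype)[..., None]   # (H, W, 1)
    return np.concatenate([image_rgb, route], axis=-1)      # (H, W, 4)

# Toy usage: a 4-channel MFI ready for an FCN-style segmentation model.
frame = np.zeros((480, 640, 3), dtype=np.float32)
route = np.zeros((480, 640), dtype=np.float32)
route[240:, 200:440] = 1.0                                  # projected route region
mfi = make_mfi(frame, route)
print(mfi.shape)   # (480, 640, 4)
```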