Visual-Semantic Transformer for Scene Text Recognition




Liang Diao (Ping An Property & Casualty Insurance Company of China, Ltd.), Xin Tang (Ping An Property & Casualty Insurance Company of China, Ltd.), Jun Wang (Ping An Technology (Shenzhen) Co. Ltd.), Rui Fang (Ping An Property & Casualty Insurance Company of China), Guotong Xie (Ping An Technology (Shenzhen) Co. Ltd.), Weifu Chen (Guangzhou Maritime University)*
The 33rd British Machine Vision Conference

Abstract

Semantic information, like visual information, plays an important role in scene text recognition (STR). Although state-of-the-art models have achieved great improvements in STR, they usually rely on external language models to refine semantic features through context information, and using semantic and visual information separately leads to biased results, which limits the performance of those models. In this paper, we propose a novel model called Visual-Semantic Transformer (VST) for text recognition. VST consists of several key modules: a ConvNet, a visual module, two visual-semantic modules, a visual-semantic feature interaction module and a semantic module. VST is conceptually a much simpler model. Unlike existing STR models, VST can efficiently extract semantic features without using external language models, and it allows visual features and semantic features to interact with each other in parallel, so that global information from the two domains can be fully exploited and more powerful representations can be learned. The working mechanism of VST is highly similar to our cognitive system, where visual information is first captured by our sensory organs and is simultaneously transformed into semantic information by our brain. Extensive experiments on seven public benchmarks, including regular and irregular text recognition datasets, verify the effectiveness of VST: it outperformed 14 other popular models on four of the seven benchmark datasets and yielded competitive performance on the other three.


Citation

@inproceedings{Diao_2022_BMVC,
author    = {Liang Diao and xin tang and Jun Wang and RUI FANG and Guotong Xie and Weifu Chen},
title     = {Visual-Semantic Transformer for Scene Text Recognition},
booktitle = {33rd British Machine Vision Conference 2022, {BMVC} 2022, London, UK, November 21-24, 2022},
publisher = {{BMVA} Press},
year      = {2022},
url       = {https://bmvc2022.mpi-inf.mpg.de/0772.pdf}
}


Copyright © 2022 The British Machine Vision Association and Society for Pattern Recognition
The British Machine Vision Conference is organised by The British Machine Vision Association and Society for Pattern Recognition. The Association is a Company limited by guarantee, No.2543446, and a non-profit-making body, registered in England and Wales as Charity No.1002307 (Registered Office: Dept. of Computer Science, Durham University, South Road, Durham, DH1 3LE, UK).
