Label embedding: A frugal baseline for text recognition

Albert Gordo; Florent Perronnin; Jose Antonio Rodriguez Serrano

doi:10.1007/s11263-014-0793-6

Research output: Not indexed journal article › Article

79 Citations (Scopus)

Abstract

The standard approach to recognizing text in images consists of first classifying local image regions into candidate characters and then combining them with high-level word models such as conditional random fields. This paper explores a new paradigm that departs from this bottom-up view. We propose to embed word labels and word images into a common Euclidean space. Given a word image to be recognized, the text recognition problem is cast as one of retrieval: find the closest word label in this space. This common space is learned using the Structured SVM framework by requiring matching label-image pairs to be closer than non-matching pairs. This method offers several advantages: it does not require ad-hoc or costly pre-/post-processing operations, it can build on top of any state-of-the-art image descriptor (Fisher vectors in our case), it allows for the recognition of never-seen-before words (zero-shot recognition), and the recognition process is simple and efficient, as it amounts to a nearest neighbor search. Experiments are performed on challenging datasets of license plates and scene text. The main conclusion of the paper is that with such a frugal approach it is possible to obtain results that are competitive with standard bottom-up approaches, thus establishing label embedding as an interesting and simple-to-compute baseline for text recognition.
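The retrieval view described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. The character-count label embedding, the alphabet, the linear projection `weights`, the margin value, and every function name below are assumptions made for illustration, and the image descriptor is a plain list of floats standing in for a real descriptor such as a Fisher vector.

```python
import math
import string

# Assumed symbol set for the toy label embedding (not taken from the paper).
ALPHABET = string.ascii_lowercase + string.digits


def embed_label(word):
    """Toy label embedding: L2-normalized character-count histogram."""
    counts = [0.0] * len(ALPHABET)
    for ch in word.lower():
        idx = ALPHABET.find(ch)
        if idx >= 0:
            counts[idx] += 1.0
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]


def project_image(descriptor, weights):
    """Hypothetical linear map from descriptor space into the label space."""
    return [sum(w * x for w, x in zip(row, descriptor)) for row in weights]


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def recognize(descriptor, weights, lexicon):
    """Nearest neighbor search: return the closest word label in the lexicon."""
    point = project_image(descriptor, weights)
    return min(lexicon, key=lambda w: squared_distance(point, embed_label(w)))


def ranking_hinge(descriptor, weights, true_word, wrong_word, margin=0.1):
    """Zero once the matching label is closer than the non-matching one by `margin`."""
    point = project_image(descriptor, weights)
    d_pos = squared_distance(point, embed_label(true_word))
    d_neg = squared_distance(point, embed_label(wrong_word))
    return max(0.0, margin + d_pos - d_neg)


if __name__ == "__main__":
    size = len(ALPHABET)
    # Identity projection: pretend the descriptor already lives in label space.
    identity = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    # Simulated image of "stop": its label embedding plus a small perturbation.
    descriptor = embed_label("stop")
    descriptor[0] += 0.05
    print(recognize(descriptor, identity, ["shop", "stop", "exit"]))  # prints "stop"
```

Because any string can be embedded with `embed_label`, a word that never appeared during training can still be placed in the lexicon, which is the zero-shot property the abstract mentions. A plain character histogram cannot tell anagrams apart ("stop" and "post" map to the same point), so a practical label embedding would also need to encode character positions.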

Original language	English
Pages	193-207
Specialist publication	International Journal of Computer Vision
DOIs	https://doi.org/10.1007/s11263-014-0793-6
Publication status	Published - 1 Jul 2015

Cite this

@article{881d19670bc94beaa43e2ff05ae8da0b,

title = "Label embedding: A frugal baseline for text recognition",

abstract = "The standard approach to recognizing text in images consists in first classifying local image regions into candidate characters and then combining them with high-level word models such as conditional random fields. This paper explores a new paradigm that departs from this bottom-up view. We propose to embed word labels and word images into a common Euclidean space. Given a word image to be recognized, the text recognition problem is cast as one of retrieval: find the closest word label in this space. This common space is learned using the Structured SVM framework by enforcing matching label-image pairs to be closer than non-matching pairs. This method presents several advantages: it does not require ad-hoc or costly pre-/post-processing operations, it can build on top of any state-of-the-art image descriptor (Fisher vectors in our case), it allows for the recognition of never-seen-before words (zero-shot recognition) and the recognition process is simple and efficient, as it amounts to a nearest neighbor search. Experiments are performed on challenging datasets of license plates and scene text. The main conclusion of the paper is that with such a frugal approach it is possible to obtain results which are competitive with standard bottom-up approaches, thus establishing label embedding as an interesting and simple to compute baseline for text recognition.",

author = "Albert Gordo and Florent Perronnin and {Rodriguez Serrano}, {Jose Antonio}",

year = "2015",

month = jul,

day = "1",

doi = "10.1007/s11263-014-0793-6",

language = "English",

pages = "193--207",

journal = "International Journal of Computer Vision",

issn = "0920-5691",

publisher = "Springer Science and Business Media LLC",

}

TY - JOUR

T1 - Label embedding: A frugal baseline for text recognition

AU - Gordo, Albert

AU - Perronnin, Florent

AU - Rodriguez Serrano, Jose Antonio

PY - 2015/7/1

Y1 - 2015/7/1

N2 - The standard approach to recognizing text in images consists in first classifying local image regions into candidate characters and then combining them with high-level word models such as conditional random fields. This paper explores a new paradigm that departs from this bottom-up view. We propose to embed word labels and word images into a common Euclidean space. Given a word image to be recognized, the text recognition problem is cast as one of retrieval: find the closest word label in this space. This common space is learned using the Structured SVM framework by enforcing matching label-image pairs to be closer than non-matching pairs. This method presents several advantages: it does not require ad-hoc or costly pre-/post-processing operations, it can build on top of any state-of-the-art image descriptor (Fisher vectors in our case), it allows for the recognition of never-seen-before words (zero-shot recognition) and the recognition process is simple and efficient, as it amounts to a nearest neighbor search. Experiments are performed on challenging datasets of license plates and scene text. The main conclusion of the paper is that with such a frugal approach it is possible to obtain results which are competitive with standard bottom-up approaches, thus establishing label embedding as an interesting and simple to compute baseline for text recognition.

U2 - 10.1007/s11263-014-0793-6

DO - 10.1007/s11263-014-0793-6

M3 - Article

SN - 0920-5691

SP - 193

EP - 207

JO - International Journal of Computer Vision

JF - International Journal of Computer Vision

ER -