This exploratory paper seeks a stochastic and context-sensitive
grammar of images. The grammar should achieve the following four
objectives and thus serve as a unified framework of representation,
learning, and recognition for a large number of object categories. (i) The
grammar represents both the hierarchical decompositions from scenes
to objects, parts, primitives, and pixels by terminal and non-terminal
nodes, and the contexts for spatial and functional relations by horizontal
links between the nodes. It formulates each object category as the
set of all possible valid configurations produced by the grammar. (ii)
The grammar is embodied in a simple And–Or graph representation
where each Or-node points to alternative sub-configurations and an
And-node is decomposed into a number of components. This representation
supports recursive top-down/bottom-up procedures for image
parsing under the Bayesian framework and makes it convenient to scale
up in complexity. Given an input image, the image parsing task constructs
the most probable parse graph on the fly as the output interpretation;
this parse graph is a subgraph of the And–Or graph after
making choices at the Or-nodes. (iii) A probabilistic model is defined
on this And–Or graph representation to account for the natural occurrence
frequency of objects and parts as well as their relations. This
model is learned from a relatively small training set per category and
then sampled to synthesize a large number of configurations to cover
novel object instances in the test set. This generalization capability
is mostly missing in discriminative machine learning methods and can
substantially improve recognition performance in experiments. (iv) To fill the
well-known semantic gap between symbols and raw signals, the grammar
includes a series of visual dictionaries and organizes them through
graph composition. At the bottom level, the dictionary is a set of image
primitives each having a number of anchor points with open bonds to
link with other primitives. These primitives can be combined to form
larger and larger graph structures for parts and objects. The ambiguities
in inferring local primitives are resolved through top-down
computation using larger structures. Finally, these primitives form a
primal sketch representation, which generates the input image with
every pixel explained. The proposed grammar integrates three prominent
representations in the literature: stochastic grammars for composition,
Markov (or graphical) models for contexts, and sparse coding
with primitives (wavelets). It also combines structure-based and
appearance-based methods in the vision literature. Finally, the paper
presents three case studies to illustrate the proposed grammar.
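
To make item (ii) concrete, the following Python sketch is an illustration added
here rather than the paper's implementation; the node names and branch
probabilities are invented. It shows a minimal And–Or graph in which an And-node
decomposes into components, an Or-node selects among alternative
sub-configurations with branching probabilities, and a parse graph is obtained by
making one choice at every Or-node.

    # Minimal And-Or graph sketch (illustrative only; not the authors' code).
    import random
    from dataclasses import dataclass, field
    from typing import List, Union

    @dataclass
    class Terminal:                  # leaf node: an image primitive / visual word
        name: str

    @dataclass
    class AndNode:                   # decomposition into all listed components
        name: str
        parts: List["Node"] = field(default_factory=list)

    @dataclass
    class OrNode:                    # alternative sub-configurations with branch probabilities
        name: str
        alternatives: List["Node"] = field(default_factory=list)
        probs: List[float] = field(default_factory=list)

    Node = Union[Terminal, AndNode, OrNode]

    def sample_parse_graph(node: "Node") -> dict:
        """Sample a parse graph by drawing one alternative at each Or-node."""
        if isinstance(node, Terminal):
            return {"terminal": node.name}
        if isinstance(node, AndNode):
            return {"and": node.name,
                    "parts": [sample_parse_graph(p) for p in node.parts]}
        choice = random.choices(node.alternatives, weights=node.probs, k=1)[0]
        return {"or": node.name, "chosen": sample_parse_graph(choice)}

    # Toy category "clock": the face is either round or square; the hands
    # decompose into an hour hand and a minute hand.
    face = OrNode("face", [Terminal("round-face"), Terminal("square-face")], [0.7, 0.3])
    hands = AndNode("hands", [Terminal("hour-hand"), Terminal("minute-hand")])
    clock = AndNode("clock", [face, hands])
    print(sample_parse_graph(clock))  # one sampled configuration of the category

Under these assumptions, repeated sampling at the Or-nodes enumerates valid
configurations of a category, which is the mechanism items (ii) and (iii)
rely on to cover novel object instances.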