XML Vectorization

Byron Choi

Visiting Student at Department of Computer Science, HKUST

The received wisdom on storing tables in a relational database is to store each tuple contiguously in secondary storage. A simple alternative is to store the columns contiguously, so that a table is represented as a set of arrays, or vectors, all of the same length. Such a representation has been shown to perform well for queries that require few columns and for some main-memory algorithms. This paper reviews a shredding scheme used in XMill, an XML compressor, which provides a natural notion of columns in XML. We treat this shredding as a storage model, called vectorization, and present an indexing scheme and a physical algebra associated with a detailed I/O cost model. The notion of columns enables us to develop an efficient join algorithm for vectorized XML, which is based on two hash-based join algorithms. We study the detailed I/O cost model and provide experimental results. This work was done at HKUST.
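
To make the notion of columns in XML concrete, the following is a minimal sketch of path-based shredding in the spirit of XMill, assuming that text values are grouped into one vector per root-to-element tag path and kept in document order. The function name `vectorize`, the sample document, and the use of Python's standard `xml.etree.ElementTree` parser are illustrative assumptions, not the storage model, indexing scheme, or physical algebra presented in the paper; the sketch also ignores attributes, mixed content, and the structural skeleton needed to reconstruct the document.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def vectorize(xml_text):
    """Group text values by root-to-element tag path, in document order.

    Hypothetical helper for illustration only: each distinct path acts as
    a "column", and its list of values acts as the column's vector.
    """
    vectors = defaultdict(list)

    def walk(elem, parent_path):
        path = parent_path + "/" + elem.tag
        text = (elem.text or "").strip()
        if text:  # skip elements that carry only whitespace or no text
            vectors[path].append(text)
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return dict(vectors)


# Illustrative sample document (not taken from the paper).
doc = """<db>
  <book><title>A</title><price>10</price></book>
  <book><title>B</title><price>20</price></book>
</db>"""

columns = vectorize(doc)
for path, values in columns.items():
    print(path, values)
```

In this sketch a query that touches only prices would read the `/db/book/price` vector and leave the title vector untouched, which mirrors the column-store argument made above for relational tables.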