Vertical partitioning algorithms used in physical design of databases

Introduction

Vertical partitioning is a crucial step in physical database design, mainly used in disk-based row-store database systems. A number of vertical partitioning algorithms have been proposed over the last three decades for a variety of niche scenarios. In principle, the underlying problem remains the same: decompose a table into one or more vertical partitions, i.e. find a complete and (disjunct) partitioning of the set of attributes of a table such that it is optimal with respect to a cost function.

We implemented the following 6 + 1 vertical partitioning algorithms in Java:

AutoPart [S. Papadomanolakis and A. Ailamaki. AutoPart: Automating Schema Design for Large Scientific Databases Using Data Partitioning. In SSDBM, pages 383–392, 2004.]
HillClimb [R. A. Hankins and J. M. Patel. Data Morphing: An Adaptive, Cache-Conscious Storage Technique. In VLDB, pages 417–428, 2003.]
HYRISE [M. Grund, J. Krüger, H. Plattner, A. Zeier, P. Cudre-Mauroux, and S. Madden. HYRISE: A Main Memory Hybrid Storage Engine. PVLDB, 4(2):105–116, 2010.]
Navathe [S.Navathe, S.Ceri, G.Wiederhold, and J.Dou. Vertical Partitioning Algorithms for Database Design. ACM TODS, 9(4):680–710, 1984.]
O2P [A.Jindal and J.Dittrich. Relax and let the database do the partitioning online. In BIRTE, pages 65–80, 2011.]
Trojan Layouts [A.Jindal, J.-A. Quianeé-Ruiz, and J.Dittrich. Trojan Data Layouts: Right Shoes for a Running Elephant. In ACM - SOCC, pages 21:1–21:14, 2011.]
Brute Force

Read about our findings for legacy row-store database systems in our VLDB'13 paper: Jindal, A., Palatinus, E., Pavlov, V., & Dittrich, J. (2013). A comparison of knives for bread slicing. Proceedings of the VLDB Endowment, 6(6), 361-372.

A short walk-through of vertical partitioning the TPC-H PartSupp table

Let's take a look at the TPC-H PartSupp table's workload, and it's vertical partitioning. The table shows the following informations: which query references which attribute, marked by an X.

In our implementation each Workload is first reduced by kicking out queries which do not touch the given table. Furthermore, the workload is converted to a more efficient representation using primitive types only (called SimplifiedWorkload in the code), using GNU Trove collections. As a result of this step performed on the TPC-H PartSupp table we get only 5 queries (with IDs 0 to 4, shown in the second column in the above Figure).

Now let's run the AUTOPART algorithm for the PartSupp table. Here's the output of the run:

partsupp
1 #AUTOPART
0.0038134 #seconds computation time
4 #partition count -- partitions:
1 0
2
3
4
5 #query count -- best solutions:
2 0
2 0
2 1 0
0
1 0

We get two mappings as a result. Partitions is the result of a VP algo: the list of partitions with the attributes in them. Thus the first line (1 0) means Partition 0 contains attributes 1 and 0 (marked green in the figure).

The second mapping, best solutions is a bit more tricky. If we have a vertically partitioned table, and a query comes in, we have to decide which vertical partitions we should scan to answer the query. This itself is an NP-hard problem when partial attribute replication is allowed, like in the AUTOPART algorithm (Note: these algos are called AbstractPartitionsAlgorithm in the code, in contrast to AbstractPartitioningAlgorithms that create a disjunct set of partitions, i.e. without overlaps). The best solutions tells us for each query which partitions to scan, in order to answer the query WITH THE LOWEST COST, according to the cost model used (Note: this is only non-trivial for overlapping (non-disjunct) partitionings). Thus the first line (2 0) means query 0 (which is TPC-H Q2) needs to scan partitions 2 and 0 (the red and green columns in the figure).

Getting started

Build metis (under lib/metis) on your machine
Create a new instance of experiments.AlgorthmRunner using its default constructor setting up to run all VP algorithms, the whole TPC-H benchmark using scale factor 10, and using an HDD cost model.
Call the experiments.AlgorthmRunner.runTPC_H_Tables() method to get the vertical partitionings of all TPC-H tables, for each VP algorithm.
Print the resulting partitionings calling the experiments.AlgorithmResults.exportResults() method, passing in the results attribute of the experiments.AlgorthmRunner instance.

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
img		img
lib		lib
src		src
HYRISE_data_file		HYRISE_data_file
HYRISE_data_file.part.3		HYRISE_data_file.part.3
LICENSE		LICENSE
README.md		README.md
database-vertical-partitioning.iml		database-vertical-partitioning.iml
query10		query10
query1000		query1000

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

img

img

lib

lib

src

src

HYRISE_data_file

HYRISE_data_file

HYRISE_data_file.part.3

HYRISE_data_file.part.3

LICENSE

LICENSE

README.md

README.md

database-vertical-partitioning.iml

database-vertical-partitioning.iml

query10

query10

query1000

query1000

Repository files navigation

Vertical partitioning algorithms used in physical design of databases

Table of Contents

Introduction

A short walk-through of vertical partitioning the TPC-H PartSupp table

Getting started

About

Releases

Packages

Languages

License

jnwang/database-vertical-partitioning

Folders and files

Latest commit

History

Repository files navigation

Vertical partitioning algorithms used in physical design of databases

Table of Contents

Introduction

A short walk-through of vertical partitioning the TPC-H PartSupp table

Getting started

About

Resources

License

Stars

Watchers

Forks

Languages