Elastic Parallel Framework
Chronos consists of three major components: column decomposition, temporal dependency extraction, and parallel tuple generation. The column decomposition component partitions the original schema into column sets, so that each column set is processed independently by the other two components. The temporal dependency extraction component identifies latent factors behind the tuples and analyzes the temporal dependency between the latent factors across neighboring timestamps. The third component is deployed on a distributed platform and supports elastic tuple generation. Given the column combinations and data distributions from the previous two components, each distributed node parses the received information and generates the tuples to the specified target.
Given a schema R and the example database D, the major goal of the column decomposition component is to partition the columns into groups. This decomposition benefits the other two components of the Chronos framework by controlling the size of each column set, which improves the efficiency of dependency extraction and tuple generation. On the other hand, some information on column correlations is lost: the dependency between columns is highly complex, and the decomposition may not preserve all correlations under the given constraint on computation capacity.
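As an illustration only, the trade-off between group size and preserved correlation can be sketched as a greedy merge: visit column pairs from the most to the least correlated and merge their groups whenever the result stays within a size cap. The correlation table `corr`, the cap `max_size`, and the function `greedy_decompose` are hypothetical; the text above does not specify how Chronos actually chooses the groups.

```python
from itertools import combinations


def greedy_decompose(columns, corr, max_size):
    """Greedily merge column groups by pairwise correlation, capping group size.

    columns:  list of column names
    corr:     dict mapping frozenset({a, b}) -> correlation strength
              (hypothetical input; missing pairs count as 0.0)
    max_size: maximum number of columns per group, a stand-in for the
              constraint on computation capacity
    """
    # Every column starts in its own singleton group.
    group_of = {c: frozenset([c]) for c in columns}
    # Visit pairs from the strongest to the weakest correlation.
    pairs = sorted(
        combinations(columns, 2),
        key=lambda p: corr.get(frozenset(p), 0.0),
        reverse=True,
    )
    for a, b in pairs:
        ga, gb = group_of[a], group_of[b]
        if ga == gb:
            continue  # already in the same group
        if len(ga) + len(gb) <= max_size:
            merged = ga | gb
            for c in merged:
                group_of[c] = merged
    # Deduplicate groups and return them in a stable, readable order.
    return sorted(sorted(g) for g in set(group_of.values()))


example_corr = {
    frozenset({"a", "b"}): 0.9,
    frozenset({"c", "d"}): 0.8,
    frozenset({"b", "c"}): 0.5,
}
groups = greedy_decompose(["a", "b", "c", "d"], example_corr, max_size=2)
```

With a cap of two columns per group, the correlation between `b` and `c` in this toy input cannot be kept, which is the kind of information loss described above.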
Given the sub-schemas output by the column decomposition component, Chronos analyzes the distribution of the tuples, as well as the temporal dependency, on each sub-schema. The temporal dependency analysis uses a Latent Dirichlet Allocation (LDA) model. The intuition is to construct a parameterized generative model that is capable of simulating the latent factors and temporal correlation behind the data domain. Given the sample tuples in D, Chronos searches for the parameters that maximize the likelihood of the observations under the generative model. The optimized parameters are later used to generate the distributions of the simulated database in the elastic generator component of Chronos.
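One way to picture the role of the latent factors is as a mixture. If each latent factor (topic) has its own distribution over the cells of a sub-schema's domain, and each timestamp has a weight vector over the topics, then the cell distribution for that timestamp is the weighted sum of the topic distributions. The sketch below assumes this simple mixture form; `topic_cell`, `cell_distribution`, and the numbers are illustrative, and the temporal LDA model used in Chronos may differ in its details.

```python
def cell_distribution(topic_weights, topic_cell):
    """Mix per-topic cell distributions into one distribution over cells.

    topic_weights: K non-negative weights summing to 1 (one timestamp)
    topic_cell:    K lists, each a probability distribution over the same cells
    """
    n_cells = len(topic_cell[0])
    return [
        sum(w * dist[i] for w, dist in zip(topic_weights, topic_cell))
        for i in range(n_cells)
    ]


# Two hypothetical topics over three cells.
topic_cell = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
# Topic weights that change between two timestamps.
p1 = cell_distribution([0.5, 0.5], topic_cell)
p2 = cell_distribution([1.0, 0.0], topic_cell)
```

Because only the topic weights change between timestamps in this sketch, temporal dependency reduces to how the weight vector evolves from one timestamp to the next.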
Elastic Parallel Generation
The parallel generator follows a master-slave architecture. The master node is responsible for creating the generation plan for all the slave nodes. It runs the column decomposition component and the temporal dependency extraction component on the sample database D. For each timestamp j in the newly generated benchmark database, the master node only builds a distribution P for each sub-schema R based on the temporal LDA model, outputting a vector Pj = (Pj[1], ..., Pj[S(R)]) that gives the probabilities of tuples falling in each cell of R's data domain on snapshot j. The master node then writes the probabilities into the file for timestamp j on the distributed synchronization subsystem, e.g. ZooKeeper.
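The division of labour described here can be sketched as follows: the master publishes one probability vector per timestamp, and a generating node reads it and draws cell indices from it. This is a minimal sketch under stated assumptions: a plain dictionary stands in for the synchronization subsystem (no ZooKeeper client is used), and `publish_plan`, `generate_cells`, and the path layout are hypothetical names, not part of Chronos.

```python
import json
import random


def publish_plan(store, sub_schema, j, probs):
    """Master side: write the distribution for timestamp j.

    `store` is a dict standing in for the synchronization subsystem.
    """
    store[f"/plan/{sub_schema}/{j}"] = json.dumps(probs)


def generate_cells(store, sub_schema, j, n_tuples, seed=None):
    """Generating-node side: read the distribution and sample cell indices."""
    probs = json.loads(store[f"/plan/{sub_schema}/{j}"])
    rng = random.Random(seed)
    # Each sampled index identifies the domain cell a generated tuple falls in.
    return rng.choices(range(len(probs)), weights=probs, k=n_tuples)


store = {}
publish_plan(store, "R1", 0, [0.7, 0.2, 0.1])
cells = generate_cells(store, "R1", 0, n_tuples=1000, seed=42)
```

Since the plan is only a probability vector per timestamp, any number of nodes can read the same entry and generate tuples independently.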
Across various numbers of topics, the shape of the ordered statistics is almost the same and close to that of the original sample database. The x-axis shows the sequential ids of the domain values, ordered by their frequency in each corresponding generated database. The orange lines indicate the regression results on the first halves, while the blue lines represent the regressions on the second halves. The lines are almost identical, so the simulation results have almost the same power-law characteristics as the original data.
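The comparison described here, fitting a line to each half of the frequency-ordered values and comparing the results, can be sketched as an ordinary least-squares fit in log-log space, where a power law appears as a straight line. The function names and the synthetic counts are illustrative assumptions; the text does not state how the regressions were computed.

```python
import math


def loglog_slope(ranked):
    """Least-squares slope of log(frequency) against log(rank).

    ranked: list of (rank, frequency) pairs with positive values
    """
    xs = [math.log(r) for r, _ in ranked]
    ys = [math.log(f) for _, f in ranked]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def half_slopes(counts):
    """Order counts by descending frequency and fit each half separately."""
    ranked = list(enumerate(sorted(counts, reverse=True), start=1))
    mid = len(ranked) // 2
    return loglog_slope(ranked[:mid]), loglog_slope(ranked[mid:])


# Synthetic counts following an exact power law with exponent 1.5.
counts = [1000.0 / r ** 1.5 for r in range(1, 201)]
first, second = half_slopes(counts)
```

For data that follows a single power law, the two slopes agree; a noticeable gap between them would indicate that the head and the tail of the distribution behave differently.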
Aoying Zhou, professor at East China Normal University, is in charge of this project.
Minqi Zhou, associate professor at East China Normal University, is in charge of this project.
Ling Gu is responsible for designing the demo cases of CLAIMS.
1. Ling Gu, Minqi Zhou, Qiangqiang Kang, and Aoying Zhou. A Scalable Framework for Universal Data Generation in Parallel. TPCTC 2014.
2. Ling Gu, Minqi Zhou, Zhenjie Zhang, Ming-Chien Shan, Aoying Zhou, and Marianne Winslett. Chronos: An Elastic Parallel Framework for Stream Benchmark Generation and Simulation. ICDE 2015.