Highlights

Elastic Parallel Framework

Chronos consists of three major components: column decomposition, temporal dependency extraction, and parallel tuple generation. The column decomposition component partitions the original schema into column sets, such that each column set is processed independently by the other two components. The temporal dependency extraction component identifies latent factors behind the tuples and analyzes the temporal dependency between these latent factors across neighboring timestamps. The third component is deployed on a distributed platform and supports elastic tuple generation. Given the column combinations and data distributions from the previous two components, each distributed node parses the received information and generates tuples to the specified target.

Schema Decomposition

Given a schema R and the example database D, the goal of the column decomposition component is to partition the columns into groups. By controlling the size of each column set, the decomposition improves the efficiency of dependency extraction and tuple generation in the other two components of the Chronos framework. On the other hand, some information on column correlations is lost: the dependency between columns is highly complex, and the decomposition may not preserve all correlations under the given constraint on computation capacity.
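The trade-off described above can be illustrated with a minimal sketch. It is not the decomposition algorithm used in Chronos; it assumes, purely for illustration, that column correlation is measured by empirical mutual information and that the computation constraint is a cap on group size. The names `mutual_information`, `decompose` and `max_group_size` are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log


def mutual_information(rows, a, b):
    """Empirical mutual information (in nats) between columns a and b."""
    n = len(rows)
    joint = Counter((r[a], r[b]) for r in rows)
    count_a = Counter(r[a] for r in rows)
    count_b = Counter(r[b] for r in rows)
    return sum(
        (c / n) * log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def decompose(rows, columns, max_group_size):
    """Greedily merge the most correlated column groups under a size cap.

    Assumption: this greedy rule is a stand-in for illustration only.
    """
    groups = [[c] for c in columns]
    mi = {
        frozenset(pair): mutual_information(rows, *pair)
        for pair in combinations(columns, 2)
    }
    while True:
        best, best_score = None, 0.0
        for g, h in combinations(groups, 2):
            if len(g) + len(h) > max_group_size:
                continue  # merging would exceed the capacity constraint
            score = max(mi[frozenset((a, b))] for a in g for b in h)
            if score > best_score:
                best, best_score = (g, h), score
        if best is None:
            return groups  # no admissible merge with positive correlation
        g, h = best
        groups.remove(g)
        groups.remove(h)
        groups.append(g + h)
```

With a small cap, correlated columns that do not fit into the same group are separated, which is exactly where correlation information is lost.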

Temporal Dependency

Given the sub-schemas output by the schema decomposition component, Chronos analyzes the distribution of the tuples, as well as the temporal dependency, on each sub-schema. The temporal dependency analysis uses the Latent Dirichlet Allocation (LDA) model. The intuition behind the methodology is to construct a parameterized generative model that is capable of simulating the latent factors and the temporal correlation behind the data domain. Given the sample tuples in D, Chronos tries to find the optimal parameters that maximize the likelihood of the observations in the generative model. The optimized parameters are later used to generate the distributions of the simulated database in the elastic generator component of Chronos.
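To make the likelihood objective concrete, the following sketch evaluates how well a set of latent-factor weights explains observed tuple counts. It is a simplified, non-temporal stand-in and not the temporal LDA model of Chronos; the names `theta` (factor weights), `phi` (per-factor distributions over domain cells) and both function names are assumptions made for this example.

```python
from math import log


def cell_distribution(theta, phi):
    """Mix the per-factor cell distributions phi[k] with weights theta[k]."""
    num_cells = len(phi[0])
    return [
        sum(weight * dist[c] for weight, dist in zip(theta, phi))
        for c in range(num_cells)
    ]


def log_likelihood(counts, theta, phi):
    """Log-likelihood of observed cell counts, up to an additive constant."""
    probs = cell_distribution(theta, phi)
    # Cells with zero observations contribute nothing to the sum.
    return sum(n * log(p) for n, p in zip(counts, probs) if n > 0)
```

Parameter fitting then amounts to searching for the `theta` and `phi` with the highest `log_likelihood` on the sample tuples; the search procedure itself is outside the scope of this sketch.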

Elastic Parallel Generation

The parallel generator follows a master-slave architecture. The master node is responsible for creating the generation plan for all the slave nodes. It runs the schema decomposition component and the temporal dependency extraction component on the sample database D. For each timestamp j in the newly generated benchmark database, the master node only builds a distribution P for each sub-schema R based on the temporal LDA model, outputting a vector $P_j = (P_j[1], \ldots, P_j[S(R)])$ that indicates the probabilities of the tuples falling in the cells of R's data domain on snapshot j. The master node then writes the probability vector into the file for timestamp j on the distributed synchronization subsystem, e.g. ZooKeeper.
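The hand-off between the master and the generating nodes can be sketched as follows. This is an illustration under stated assumptions, not the Chronos implementation: the JSON payload layout and the function names `publish_plan` and `generate_cells` are hypothetical, and the synchronization subsystem is replaced by a plain string passed between the two functions.

```python
import json
import random


def publish_plan(timestamp, probabilities):
    """Master side: serialize the vector P_j for one timestamp.

    Assumption: the payload format is invented for this example.
    """
    total = sum(probabilities)
    return json.dumps(
        {"timestamp": timestamp, "p": [p / total for p in probabilities]}
    )


def generate_cells(payload, num_tuples, seed=None):
    """Generating node: parse the plan and draw one domain cell per tuple."""
    plan = json.loads(payload)
    rng = random.Random(seed)
    cells = range(len(plan["p"]))
    drawn = rng.choices(cells, weights=plan["p"], k=num_tuples)
    return plan["timestamp"], drawn
```

Because each node only needs the probability vector for a timestamp, nodes can presumably be added or removed without recomputing the model, which is consistent with the elastic design described above.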

Performance

Under various numbers of topics, the shape of the ordered statistics is almost the same and close to that of the original sample database. The x-axis shows the sequential ids of the domain values, ordered by frequency in each corresponding generated database. The orange lines indicate the regression results on the first halves, while the blue lines represent the regressions on the second halves. The lines are almost identical, indicating that the simulation results have almost the same power-law characteristics as the original data.
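A check of this kind could be sketched as a least-squares fit in log-log space. The example assumes that "halves" refers to the first and second half of the frequency-ordered domain values and that all frequencies are positive; the function names are hypothetical and the evaluation scripts of the project may differ.

```python
from math import log


def loglog_slope(points):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [log(rank) for rank, _ in points]
    ys = [log(freq) for _, freq in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def half_slopes(frequencies):
    """Fit separate slopes on the first and second half of the ranked values."""
    ordered = sorted(frequencies, reverse=True)
    ranked = list(enumerate(ordered, start=1))  # ranks start at 1
    mid = len(ranked) // 2
    return loglog_slope(ranked[:mid]), loglog_slope(ranked[mid:])
```

Similar slopes on both halves, and between the generated and original data, would suggest a shared power-law exponent.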

Team

Aoying Zhou, professor at East China Normal University, in charge of this project.

Minqi Zhou, associate professor at East China Normal University, in charge of this project.

Ling Gu is responsible for designing the demo cases of CLAIMS.

Publications

        1. Ling Gu, Minqi Zhou, Qiangqiang Kang, and Aoying Zhou, A Scalable Framework for Universal Data Generation in Parallel, TPCTC 2014.
        2. Ling Gu, Minqi Zhou, Zhenjie Zhang, Ming-Chien Shan, Aoying Zhou, and Marianne Winslett, Chronos: An Elastic Parallel Framework for Stream Benchmark Generation and Simulation, ICDE 2015.

Download

Instructions for the Chronos code: readme
Code of Chronos: code
Dataset: Telecom data

Contact

Contact us (see Team for emails) if you are interested in joining us!

East China Normal University
Institute for Data Science and Engineering
3663 N.Zhongshan Rd.
200062 Shanghai
China
© East China Normal University - DaSE 2014.