AI REGIO-Synthetic Data Generation Engine (SynData)
A flexible and extensible synthetic data generation engine based on mainstream statistical distributions and on time-series generative AI techniques.
- Generation of (a) fully synthetic data from scratch, and (b) partially synthetic data to complement/augment any existing dataset, based on:
  - statistical distributions that are relevant for certain columns, e.g. for discrete data: Poisson, Binomial (Bernoulli for trials = 1), Negative Binomial, Uniform Integer; for continuous data: Normal, Gamma (Exponential for shape = 1), Beta, Weibull, Uniform; and
  - machine learning techniques (e.g. a time-series generative adversarial network, GAN) to detect and analyse patterns and “inject” outliers into the synthetic dataset.
- User-friendly interface allowing for: configuring the synthetic data generation; downloading the created synthetic dataset as a CSV file while previewing a sample online to confirm that it meets the user’s expectations; identifying, through informative messages, any errors or problems that prevented the synthetic data generation.
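The two special cases noted in the distribution list (Bernoulli as a Binomial with trials = 1, Exponential as a Gamma with shape = 1) can be illustrated with a minimal sketch using only Python's standard `random` module. The helper names `binomial`, `bernoulli` and `exponential`, the seed and the parameter values are assumptions made for illustration; this is not SynData's actual implementation.

```python
import random


def binomial(rng, trials, p):
    """Count successes in `trials` independent draws, each succeeding with probability p."""
    return sum(1 for _ in range(trials) if rng.random() < p)


def bernoulli(rng, p):
    """Bernoulli is the Binomial special case with trials = 1 (result is 0 or 1)."""
    return binomial(rng, 1, p)


def exponential(rng, scale):
    """Exponential is the Gamma special case with shape = 1.

    random.gammavariate(alpha, beta) takes alpha as the shape and beta as the scale.
    """
    return rng.gammavariate(1.0, scale)


# Hypothetical parameters; a fixed seed keeps the run reproducible.
rng = random.Random(42)
bernoulli_draws = [bernoulli(rng, 0.3) for _ in range(10_000)]
exponential_draws = [exponential(rng, 2.0) for _ in range(10_000)]

bernoulli_mean = sum(bernoulli_draws) / len(bernoulli_draws)
exponential_mean = sum(exponential_draws) / len(exponential_draws)
print(f"Bernoulli(p=0.3) sample mean: {bernoulli_mean:.3f}")
print(f"Exponential(scale=2.0) sample mean: {exponential_mean:.3f}")
```

The sample means should land close to the theoretical values (p for Bernoulli, the scale for Exponential), which is a quick sanity check that a column configured with one of these distributions behaves as expected.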
The AI REGIO Synthetic Data Generation Engine (SynData) asset allows users to generate appropriate time-series (IoT) data on demand, according to their needs, through a combined data-driven and process-driven approach. The value of synthetic data lies in its ability to provide features that meet specific needs or conditions that would otherwise be unavailable in real-world data (e.g. edge or sporadic cases not yet encountered, or data restricted by confidentiality or privacy concerns). SynData therefore aims to address the lack of representative IoT data for training artificial intelligence and machine learning (AI/ML) models.
Through SynData, users can define the exact structure of the synthetic dataset they want to create (unless it has been derived from an existing dataset). For each column, the user configures the set of rules to apply, depending on the selected data type and the available options (from the supported synthetic data generation techniques), and sets the number of rows to generate in the dataset.
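The per-column configuration described above can be sketched roughly as follows. Everything in this example is hypothetical: the `COLUMNS` structure, the `SAMPLERS` table, the `generate_csv` function, the column names and the parameter values are assumptions chosen for illustration and do not reflect SynData's actual configuration format or API. Only Python's standard library is used, and the Poisson sampler is Knuth's simple multiplication method, which is adequate only for small rates.

```python
import csv
import io
import math
import random


def poisson(rng, lam):
    """Knuth's multiplication method for Poisson draws (suitable for small lam)."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


# Hypothetical mapping from a distribution name to a sampling function.
SAMPLERS = {
    "normal": lambda rng, mu, sigma: rng.gauss(mu, sigma),
    "poisson": poisson,
    "beta": lambda rng, alpha, beta: rng.betavariate(alpha, beta),
    "uniform_int": lambda rng, low, high: rng.randint(low, high),
}

# Hypothetical column rules: one distribution and its parameters per column.
COLUMNS = [
    {"name": "temperature_c", "distribution": "normal", "params": {"mu": 70.0, "sigma": 2.5}},
    {"name": "fault_count", "distribution": "poisson", "params": {"lam": 0.4}},
    {"name": "load_ratio", "distribution": "beta", "params": {"alpha": 2.0, "beta": 5.0}},
]


def generate_csv(columns, n_rows, seed=0):
    """Return CSV text with a header row followed by n_rows synthetic rows."""
    for column in columns:
        if column["distribution"] not in SAMPLERS:
            # Fail early with an informative message instead of a bare KeyError.
            raise ValueError(
                f"Column '{column['name']}': unsupported distribution "
                f"'{column['distribution']}'"
            )
    rng = random.Random(seed)
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow([column["name"] for column in columns])
    for _ in range(n_rows):
        writer.writerow(
            [SAMPLERS[c["distribution"]](rng, **c["params"]) for c in columns]
        )
    return buffer.getvalue()


csv_text = generate_csv(COLUMNS, n_rows=100)
# Preview the header and the first five rows before using the full dataset.
print("\n".join(csv_text.splitlines()[:6]))
```

Keeping the rules as plain data (a list of dictionaries) means the same structure could be filled in from a form, and validating the rules before any row is generated makes it straightforward to report a clear error message when a column is misconfigured.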