Fujitsu Laboratories Ltd. today announced the development of technology to automate data preparation tasks, including the integration of the data format, which is necessary when the acquired data is generated from multiple sources serving different purposes.
The integration and transformation of data that comes in a variety of sizes and formats can take a significant amount of time, from a few weeks to several months. Transforming data from sources such as companies and social media also requires sufficient understanding of the data contents, and for that reason, it has been difficult to put many valuable data sources into use.
Fujitsu Laboratories has now developed a technology to automatically handle the data formatting and integration process when linking multiple data sources, by defining examples of the desired data results.
The company has applied this technology to a marketing analysis dataset based on data from about 8,000 past POS transactions, and confirmed that it was able to shorten the preparation time for data analysis. The data preparation tasks, which had previously taken five days, were completed in about half a day. With this development, Fujitsu Laboratories is promoting data interoperability and exchange among companies, contributing to the creation of new businesses based on new insights generated by accelerated data analysis.
In recent years, it has become increasingly important to create businesses or develop new products that integrate and make use of a variety of data, such as marketing analyses combining POS data and social media data, or drug discovery research analyzing electronic medical records collected from regional hospitals. Fujitsu Laboratories is advancing this research and, under the name “Data Bazaar,” is systematizing the data processing technologies needed for data exchange and usage (Figure 1). The “Data Bazaar” is a comprehensive set of fundamental data processing technologies that efficiently transform scattered data into connectable information by formatting, integrating and analyzing it. This will lead to opportunities to create new businesses by delivering the value extracted from the data to users in a safe and timely manner. The new technology, which automatically formats and integrates data into a connectable form, is one of the constituent elements of the “Data Bazaar.”
For example, this technology can be applied to set sales strategies and develop new products, as it links the POS data of one’s own company to such data as weather information for a joint analysis, thereby generating insights that would be difficult to gain otherwise.
Previously, data preparation required highly skilled individuals to understand and carry out the transformation and integration processes needed to achieve the desired result. During integration, they also had to solve problems such as insufficient datasets and issues with the transformation program. Because data preparation involves a repetitive cycle of understanding, reformatting, integrating and validating the data, it accounts for about 80% of the entire data analysis process.
To increase the efficiency of data formatting and integration, there have been efforts to develop technology that automatically transforms data based on examples of the desired transformation results. For this method to work, however, it is necessary to conduct an exhaustive search for a combination of processing steps that can produce the desired data integration results. At the same time, datasets need to be integrated to fill gaps in the original datasets, and a variety of transformations have to be tested, including notation and format unification as well as unit conversion. As this formatting becomes more complicated, the number of necessary transformations and gaps in the data also increases, so the number of combinations that must be searched grows rapidly, making it difficult to finish processing in a realistic timeframe.
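To give a sense of how quickly an exhaustive search grows, the following minimal sketch counts the ordered transformation sequences that would have to be tried. The function name and the figures (10 or 20 available transformations, sequences of up to 3 or 6 steps) are illustrative assumptions, not numbers from Fujitsu Laboratories.

```python
def count_sequences(num_transforms: int, max_depth: int) -> int:
    """Count ordered sequences of 1..max_depth transformations.

    Assumes (for illustration only) that any transformation may follow
    any other, so there are num_transforms ** d sequences of length d.
    """
    return sum(num_transforms ** d for d in range(1, max_depth + 1))


if __name__ == "__main__":
    # Hypothetical sizes chosen only to show the growth rate.
    print(count_sequences(10, 3))   # 1110
    print(count_sequences(20, 6))   # 67368420
```

Doubling the number of available transformations and the sequence length in this toy setting raises the count from about a thousand to tens of millions, which is why pruning the search matters.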
About the Newly Developed Technology
Fujitsu Laboratories has developed a technology that automatically conducts data formatting and integration. By streamlining the search over combinations of these processing steps, it enables high-speed processing even as the number of transformations and missing datasets increases.
Features of this technology are as follows.
1. Improved search efficiency that predicts processing based on conversion history
This technology calculates intermediate results by applying a variety of transformations to data columns in the database, including notation and format unification, unit conversion, and the integration of supplemental data. It then calculates the degree of similarity between the desired data and each intermediate result. Starting from the intermediate results with the highest similarity, it applies further transformations to produce the next intermediate results and calculates the degree of similarity again. Repeating this process efficiently approaches the desired data.
By maintaining a sequence of previous transformations and their results, the new technology eliminates unnecessary transformations by predicting those that will generate data similar to the desired data (Figure 2).
As a consequence, the search time was reduced to a few percent of that previously needed, when extensive and repetitive searches were required to find the transformations that reach the desired result.
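The similarity-guided search described in this feature can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Fujitsu Laboratories' implementation: the transformation set `TRANSFORMS`, the string similarity from Python's `difflib`, and the use of the history only to skip transformations that reproduce an already-seen intermediate result are all simplifications chosen for the example. The actual prediction method is not described in the announcement.

```python
from difflib import SequenceMatcher

# Hypothetical library of column transformations (assumption for illustration).
TRANSFORMS = {
    "strip": str.strip,
    "upper": str.upper,
    "lower": str.lower,
    "remove_hyphens": lambda s: s.replace("-", ""),
    "title": str.title,
}


def similarity(column, target):
    """Mean string similarity between a column and the desired example column."""
    ratios = [SequenceMatcher(None, a, b).ratio() for a, b in zip(column, target)]
    return sum(ratios) / len(ratios)


def search(column, target, max_steps=5):
    """Greedily apply the transformation whose result is most similar to the target."""
    applied = []
    seen = {tuple(column)}          # history of intermediate results
    best = similarity(column, target)
    for _ in range(max_steps):
        if best >= 1.0:
            break
        candidates = []
        for name, fn in TRANSFORMS.items():
            result = [fn(value) for value in column]
            key = tuple(result)
            if key in seen:         # skip results already produced earlier
                continue
            seen.add(key)
            candidates.append((similarity(result, target), name, result))
        if not candidates:
            break
        score, name, result = max(candidates, key=lambda c: c[0])
        if score <= best:           # no candidate moves closer to the target
            break
        applied.append(name)
        column, best = result, score
    return applied, column


if __name__ == "__main__":
    source = [" tokyo-100 ", " osaka-200 "]
    desired = ["TOKYO100", "OSAKA200"]
    steps, final = search(source, desired)
    print(steps, final)
```

In this toy run the search reaches the desired column after three transformations, having evaluated only a handful of candidates at each step rather than every possible sequence.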
2. Efficient searches for missing data based on high-speed similarity filtering
When data required to reach the desired result is missing, people can efficiently find appropriate datasets by hand, drawing on background knowledge. When the search is automated, however, the processing time can increase dramatically because supplementary datasets, such as those provided as libraries, must be searched thoroughly.
Fujitsu Laboratories has now developed a rapid similarity filtering technology that quickly finds the missing data. It first calculates, as metadata, the characteristics of the distribution of data in the columns for each line of the supplementary data stored as a library. It then calculates the degree of similarity between these characteristics and those of the intermediate data needed to reach the desired result (Figure 3).
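The filtering idea can be illustrated with a small sketch that precomputes a profile for each supplementary column and compares it with the profile of the intermediate data. Everything specific here is an assumption made for the example: the four features in `profile`, the distance-based similarity, the threshold of 0.8, and the sample library are hypothetical and are not taken from the announcement.

```python
import math


def profile(values):
    """Summarize a column as a small feature vector (hypothetical metadata)."""
    texts = [str(v) for v in values]
    n = len(texts)
    total_chars = sum(len(t) for t in texts)
    numeric_share = sum(t.replace(".", "", 1).isdigit() for t in texts) / n
    digit_share = sum(c.isdigit() for t in texts for c in t) / max(1, total_chars)
    distinct_share = len(set(texts)) / n
    avg_len = min(total_chars / n, 20) / 20   # scaled to the range 0..1
    return (numeric_share, digit_share, distinct_share, avg_len)


def profile_similarity(p, q):
    """Map Euclidean distance between two profiles to a score in 0..1."""
    return 1.0 - math.dist(p, q) / math.sqrt(len(p))


def filter_candidates(library_profiles, intermediate, threshold=0.8):
    """Return (name, score) pairs for library columns resembling the intermediate data."""
    wanted = profile(intermediate)
    scored = [(name, profile_similarity(p, wanted)) for name, p in library_profiles.items()]
    kept = [item for item in scored if item[1] >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical supplementary library; profiles are computed once, up front.
    library = {
        "postal_codes": ["100-0001", "530-0001", "460-0002"],
        "city_names": ["Tokyo", "Osaka", "Nagoya"],
        "temperatures": ["21.5", "19.0", "23.2"],
    }
    library_profiles = {name: profile(col) for name, col in library.items()}
    print(filter_candidates(library_profiles, ["150-0001", "600-8001", "810-0001"]))
```

Because only the compact profiles are compared, candidate datasets can be narrowed down without scanning the full contents of every supplementary dataset on each search.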
Fujitsu Laboratories applied this technology to a marketing analysis dataset based on data from about 8,000 past POS transactions, and confirmed the ability to shorten preparation time for data analysis. The data preparation tasks, which had previously taken five days, were completed in about half a day.
This technology will promote data interoperability and exchange between companies, contributing to the creation of new businesses based on new insights generated by accelerating data analysis.
Fujitsu Laboratories will continue field trials of this technology while improving its functionality, including expanding the types of data transformations and supporting open data as supplementary data. The company aims to commercialize this development in fiscal 2018 as a constituent component of the “Data Bazaar” technology.