Data scientists turn raw data into actionable insights that help drive business value or, sometimes, even disrupt industries or create entirely new ones. In the IT industry, data scientist is one of the most sought-after positions. In 2012, Harvard Business Review even called the profession the "sexiest job of the 21st century."

Surprisingly, however, data scientists spend the majority of their time on low-level tasks such as collecting, cleaning and organising data. According to Forbes, such tasks take up to 70% of the time in a typical data science project.

Figure 1: A typical Data Science project workflow.

Figure 1 shows the typical workflow of a Data Science project. The initial step is the formulation of the business problem to be solved. The next two steps involve acquiring, cleaning and curating relevant data. Feature engineering transforms raw data into numerical or categorical values (so-called features) that can be used as inputs for machine learning models. The machine learning models themselves are selected and fine-tuned in the last step.

The feature engineering step is particularly time-consuming and tedious. It can take days or even weeks, even in short-term projects. Often the raw data are stored across various tables in a relational database and need to be combined in various ways. Once a rich set of features is available, a variety of methods exist to select the optimal subset of features and optimise the machine learning models accordingly. Hence, automating the feature engineering process, at least to a large extent, would dramatically speed up the creation of machine learning models on new data sets and in new application domains.
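As a rough illustration of what combining related tables into features can look like, the following minimal sketch joins a one-to-many child table to its parent and summarises one column per parent row. The table names, column names and values (`customers`, `orders`, `amount`) and the helper `aggregate_features` are hypothetical and chosen only for this example; this is not taken from any particular tool.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy tables; names and values are illustrative only.
customers = [{"customer_id": 1}, {"customer_id": 2}, {"customer_id": 3}]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 20.0},
    {"order_id": 11, "customer_id": 1, "amount": 40.0},
    {"order_id": 12, "customer_id": 2, "amount": 15.0},
]


def aggregate_features(parents, children, key, column):
    """Join children to parents on `key` and summarise `column` per parent."""
    # Group the child values by the foreign key.
    grouped = defaultdict(list)
    for row in children:
        grouped[row[key]].append(row[column])

    # Emit one feature row per parent, including parents with no children.
    features = []
    for parent in parents:
        values = grouped.get(parent[key], [])
        features.append({
            key: parent[key],
            f"{column}_count": len(values),
            f"{column}_sum": sum(values),
            f"{column}_mean": mean(values) if values else None,
        })
    return features


customer_features = aggregate_features(customers, orders, "customer_id", "amount")
```

Each parent row ends up with a fixed-length set of numerical features regardless of how many related rows it has, which is the form most machine learning models expect. Writing such aggregations by hand for every pair of related tables is the part that quickly becomes tedious.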

A team of IBM researchers in Ireland has completed the first phase of a project that helps automate the feature engineering step at the push of a button. Called the “One Button Machine” project, it computes aggregate features that can be used as input for machine learning models.

Figure 2: The team behind “One Button Machine” (L-R): Francesco Vigliaturo, Thanh Lam Hoang, Ambrish Rawat, Francesco Fusco, Valentina Zantedeschi, Maria-Irina Nicolae, Minh Tran, Vincent Lonij and Mathieu Sinn.

The team has successfully applied the One Button Machine in various data science competitions, where it outperformed most human teams and ranked among the top 16–24% of participants. In a client project with a social service provider from the U.S., it helped improve the accuracy of a complex classification task (involving a database with more than 20 tables) from 57% to 64%. One Button Machine produced these results with a few hours of effort, whereas manually engineering the features would have taken days or even weeks to reach the same level of accuracy.

Figure 3: Thanh Lam Hoang, author of the research paper “One Button Machine: A framework for automating feature engineering in relational databases”.

One Button Machine works by traversing the graph defined by the entities (tables) and relations (primary/foreign keys) of a relational database. The aggregation functions can be specified by the user, or chosen generically for certain data types. To deal with the combinatorial explosion of related entities, the One Button Machine deploys heuristics and sub-sampling strategies. Scalability to big databases is achieved by dynamic caching of intermediate results and a parallelisable implementation in Apache Spark, a distributed computing framework for analysing massive amounts of data.
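To make the idea of traversing the entity graph more concrete, here is a minimal sketch that enumerates join paths from a root table by depth-limited search. It is not the One Button Machine implementation: the schema, the function name `join_paths` and the `max_depth` cut-off are assumptions made for illustration, with the depth limit standing in as one very simple way to curb the combinatorial growth in the number of paths.

```python
# Hypothetical schema graph: each table maps to the tables it is linked to
# by a primary/foreign key relation. Names are illustrative only.
SCHEMA = {
    "customers": ["orders", "tickets"],
    "orders": ["customers", "order_items"],
    "order_items": ["orders", "products"],
    "products": ["order_items"],
    "tickets": ["customers"],
}


def join_paths(schema, root, max_depth):
    """Enumerate simple paths starting at `root` with at most `max_depth` edges."""
    paths = []

    def visit(path):
        # Every path with at least one edge is a candidate join path.
        if len(path) > 1:
            paths.append(tuple(path))
        # Stop extending once the depth limit is reached.
        if len(path) - 1 == max_depth:
            return
        for neighbour in schema.get(path[-1], []):
            # Skip tables already on the path to avoid cycles.
            if neighbour not in path:
                visit(path + [neighbour])

    visit([root])
    return paths


paths = join_paths(SCHEMA, "customers", 2)
```

Each returned path is a chain of tables along which rows could be joined back to the root table and then aggregated. Raising `max_depth` from 2 to 3 in this toy schema adds the path that reaches `products`, which hints at how quickly the number of candidate paths grows in a database with many tables.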

The team is working on improving feature detection for unstructured data, as well as on integration with algorithms for optimal feature selection and optimisation of the machine learning models.

IBM's vision for the future is to build cognitive agents that serve as autonomous assistants for data scientists. Such agents will take over the most tedious and time-consuming tasks. A key capability of those agents will be to understand and reason about the application domain, and to automatically detect, diagnose and resolve inconsistencies in the data. If the agents encounter inconsistencies that cannot be resolved, they will ask for specific feedback from data scientists and update their domain knowledge accordingly. Ultimately, this will give data scientists more time to think about actual business challenges, develop creative solutions, and communicate actionable insights to stakeholders at the right time.