Like any software, a machine learning program can be faulty, producing unreliable outcomes with potentially catastrophic consequences. To avoid this, machine learning development teams need efficient fault detection techniques. Researchers at Polytechnique Montréal are currently working to help developers who are not machine learning experts find design issues and faults in their programs.

We use machine learning software systems (MLSSs) daily: recommendation systems, speech recognition, face detection, personal banking, and autonomous driving. These systems make intelligent decisions automatically, based on patterns, associations, and knowledge learned from data. Design issues and faults directly affect the quality of MLSSs, and since false or poor decisions can lead other systems to malfunction, quality assurance for MLSSs is necessary. In safety-critical systems, unreliable decisions may cause significant financial losses or even threaten human life. A quality profile can be defined to include a set of quality-related measurements along with a set of quality assurance indicators, which determine when quality issues must be raised and then addressed during the MLSS lifecycle.

My research currently focuses on detecting faults and poorly designed models in deep learning (DL) programs, and on assuring their quality. DL is an active field of machine learning that is being successfully applied to many complex real-world problems, and it is widely used in MLSSs. At the core of such systems are deep neural networks. Once built, a deep neural network must be trained by executing a learning algorithm on data.

Our research team has examined common structural errors (faults) and design inefficiencies in DL programs. Faults may lead to incorrect functionality, such as an unexpected output or behaviour, or to crashes, where the program stops running with or without displaying an error message. Design inefficiencies result in poor performance, with symptoms such as low prediction accuracy. We have identified 8 design issues and 14 faults for a popular DL architecture, deep feedforward neural networks, which are widely employed in industrial applications. They are called feedforward because information flows forward from the input layer, through the hidden layers, to the output layer, which produces a class probability or a predicted real value. The faults and issues were identified through a review of the existing literature on DL design and a manual inspection of problematic DL programs.
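The forward flow of a feedforward network can be sketched in a few lines of NumPy. The layer sizes, weights, and activation choices below are illustrative assumptions, not taken from our study; the sketch only shows how information passes from input, through a hidden layer, to an output layer of class probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    # Common hidden-layer activation: keeps positive values, zeroes the rest.
    return np.maximum(0.0, x)

def softmax(x):
    # Turns raw output scores into class probabilities that sum to 1.
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

# Illustrative weights for a 4-input, 8-hidden-unit, 3-class network.
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)

x = rng.normal(size=4)             # one input example
hidden = relu(W1 @ x + b1)         # hidden layer
probs = softmax(W2 @ hidden + b2)  # output layer: class probabilities
print(probs)
```

In a real DL program these weights would be set by training, but the structure of the computation is the same, which is what makes structural faults (such as a missing or wrong activation) detectable by inspection.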

Design issues are specified by describing their context, consequences, and recommended refactorings. To provide empirical evidence on the relevance and perceived impact of the identified design issues, we conducted a survey of 81 DL developers. Overall, the developers perceived the proposed design issues as reflective of design or implementation problems, with agreement levels ranging from 47% to 68%. We believe that a list of known bad design practices for DL models can help developers avoid pitfalls during development, resulting in better software quality.

Using the list of identified faults and design issues, we created NeuraLint, a model-based verification approach for DL programs. To do so, we proposed a meta-model for DL programs that includes their base skeleton and fundamental properties. This meta-model, as a model of models, captures the programs' essential properties independently of the available DL libraries. Based on the proposed meta-model, a verification rule is specified for each fault or design issue to detect its occurrence. We then propose a checking process to verify models of DL programs that conform to the meta-model. We implemented NeuraLint using graph transformations: DL models are represented as graphs, and those graphs are checked with transformations. In this way, the model of each DL program is extracted from its code and then checked to detect potential problems. We evaluated NeuraLint by finding various types of faults and design issues in several real-world DL programs extracted from GitHub repositories and StackOverflow posts. The results show that NeuraLint effectively detects faults and design issues in both synthesized and real-world examples, with a recall of 70.5% and a precision of 100%.
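The idea of checking an extracted model against verification rules can be sketched as follows. Everything here is a simplified assumption for illustration: the `Layer` and rule names are hypothetical, the graph is reduced to an ordered chain of layer nodes, and the single rule shown (a classifier's last layer should apply a probability-producing activation) is only inspired by the kind of structural fault NeuraLint targets, not its actual implementation.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Layer:
    """Hypothetical node in the graph extracted from a DL program."""
    kind: str                         # e.g. "dense", "conv"
    activation: Optional[str] = None  # e.g. "relu", "softmax", or None

def check_output_activation(layers: List[Layer]) -> List[str]:
    """Rule sketch: flag a classifier whose final dense layer does not
    produce probabilities (softmax or sigmoid)."""
    warnings = []
    last = layers[-1]
    if last.kind == "dense" and last.activation not in ("softmax", "sigmoid"):
        warnings.append("last layer does not apply a probability-producing activation")
    return warnings

# A model whose author forgot the final softmax triggers the rule;
# the corrected model passes.
faulty = [Layer("dense", "relu"), Layer("dense", None)]
correct = [Layer("dense", "relu"), Layer("dense", "softmax")]
print(check_output_activation(faulty))
print(check_output_activation(correct))
```

A real checker runs many such rules over the extracted graph and reports every match, which is how a static, model-based approach can flag problems without ever training the network.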

There are several directions for future work. Although the proposed meta-model is designed for feedforward architectures, it could be extended to support other neural network architectures, such as recurrent neural networks. Researchers could also expand our set of verification rules to cover more types of faults or issues in DL programs. Assuring the quality of other ML techniques, like reinforcement learning, is another potential application. Moreover, DL programs could be run experimentally to assess their performance (and functionality), leading to the detection of potential problems. Such evaluation, called dynamic analysis, would enrich the quality assessment of MLSSs. Finally, the quality profile of MLSSs could be extended beyond faults and design issues, to detect performance degradation (model staleness) over time and quality issues arising from interactions with other parts of a system.

This article was produced by Amin Nikanjam, Research Associate (Polytechnique Montréal), with the guidance of Marie-Paule Primeau, science communication advisor, as part of our “My research project in 800 words” initiative.