Microsoft claims it has developed a system that correctly distinguishes between security and non-security software bugs 99% of the time. It also says the system accurately identifies critical, high-priority security bugs 97% of the time on average. The company plans to open-source the methodology on GitHub in the coming months, along with example models and other resources.
The system was trained on a data set of 13 million work items and bugs from 47,000 developers at Microsoft, stored across AzureDevOps and GitHub repositories. It is meant to support human experts. It is estimated that fixing a bug takes 30 times longer than writing a line of code and that developers create 70 bugs per 1,000 lines of code. In the United States, $113 billion is spent annually on identifying and fixing product defects.
Microsoft Technique
Microsoft says that security experts approved the training data in the course of architecting the model. Statistical sampling was used to give those experts a manageable amount of data to review. The data was then encoded into representations called feature vectors. Microsoft researchers then set about designing the system using a two-step process. First, the model learned to classify security and non-security bugs. Then it learned to apply severity labels (critical, important, or low impact) to the security bugs.
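The two-step process can be pictured as a simple cascade: a first classifier decides whether a bug is a security bug, and a second one assigns a severity label only to the bugs that pass the first step. The sketch below is an illustration under stated assumptions, not Microsoft's implementation. The function names (`triage`, `toy_is_security`, `toy_severity`) and the keyword rules are hypothetical stand-ins for trained models.

```python
from typing import Callable, Optional, Tuple

# Severity labels named in the article.
SEVERITIES = ("critical", "important", "low impact")


def triage(
    title: str,
    is_security: Callable[[str], bool],
    severity_of: Callable[[str], str],
) -> Tuple[str, Optional[str]]:
    """Run the two steps in order; step 2 only sees security bugs."""
    if not is_security(title):
        return ("non-security", None)
    return ("security", severity_of(title))


# Hypothetical keyword rules standing in for the trained step-1 model.
def toy_is_security(title: str) -> bool:
    keywords = ("overflow", "injection", "xss", "credential")
    return any(k in title.lower() for k in keywords)


# Hypothetical keyword rules standing in for the trained step-2 model.
def toy_severity(title: str) -> str:
    text = title.lower()
    if "overflow" in text:
        return "critical"
    if "injection" in text or "xss" in text:
        return "important"
    return "low impact"


for bug_title in (
    "Buffer overflow in image parser",
    "Credential shown in debug log",
    "Typo in welcome message",
):
    print(bug_title, "->", triage(bug_title, toy_is_security, toy_severity))
```

Keeping the two steps separate means the severity model never has to learn what a non-security bug looks like, and either step can be swapped out independently.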
To make its bug predictions, Microsoft’s model leverages two techniques. The first is term frequency-inverse document frequency (TF-IDF), an information retrieval approach that assigns importance to a word based on how often it appears in a document and checks how relevant the word is across a collection of titles. Microsoft says that bug titles are generally very short, containing around 10 words. The second technique is a logistic regression model, which uses a logistic function to model the probability of a certain class or event existing.
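To make the combination of the two techniques concrete, here is a minimal pure-Python sketch that turns short bug titles into TF-IDF feature vectors and fits a logistic regression on them with plain stochastic gradient descent. Everything in it is an assumption made for illustration: the six titles and their labels are invented, the smoothed IDF formula is one common variant, and the training loop is far simpler than anything a production system would use.

```python
import math
from collections import Counter


def tokenize(title):
    return title.lower().split()


def fit_idf(titles):
    """Smoothed inverse document frequency for every word seen in training."""
    n = len(titles)
    doc_freq = Counter()
    for title in titles:
        doc_freq.update(set(tokenize(title)))
    return {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in doc_freq.items()}


def tfidf_vector(title, idf):
    """Sparse feature vector: term frequency times IDF, unknown words dropped."""
    counts = Counter(w for w in tokenize(title) if w in idf)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {w: (c / total) * idf[w] for w, c in counts.items()}


def predict_proba(vec, weights, bias):
    """Logistic function applied to a weighted sum of the features."""
    z = bias + sum(weights.get(w, 0.0) * x for w, x in vec.items())
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(vectors, labels, epochs=500, lr=0.5):
    """Fit weights by stochastic gradient descent on the log loss."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for vec, y in zip(vectors, labels):
            err = predict_proba(vec, weights, bias) - y
            bias -= lr * err
            for w, x in vec.items():
                weights[w] = weights.get(w, 0.0) - lr * err * x
    return weights, bias


# Invented example titles; 1 = security bug, 0 = non-security bug.
TITLES = [
    "buffer overflow in image parser",
    "sql injection in login form",
    "xss vulnerability in comment field",
    "button misaligned on settings page",
    "typo in welcome message",
    "slow page load on dashboard",
]
LABELS = [1, 1, 1, 0, 0, 0]

idf = fit_idf(TITLES)
vectors = [tfidf_vector(t, idf) for t in TITLES]
weights, bias = train_logreg(vectors, LABELS)

p_security = predict_proba(tfidf_vector("sql injection in search form", idf), weights, bias)
p_other = predict_proba(tfidf_vector("typo on settings page", idf), weights, bias)
print(f"security-like title: {p_security:.2f}, other title: {p_other:.2f}")
```

Because titles are only about 10 words long, each vector has very few non-zero entries, which is why the sketch stores them as sparse dictionaries rather than dense arrays.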