Experiment: Supervised learning model to classify a GitHub issue as an enhancement or a bug based purely on the issue title

Quick Summary: I mined more than 100,000 issues from open-source GitHub repositories. The mined data included the issue title (string), the issue description (paragraph), and the labels (discrete strings). Most issues were labelled either enhancement or bug, so I started with a simple classifier that classifies an issue as Enhancement or Bug based on the issue title alone. I could achieve up to 85.8% accuracy on the mined data. The pipeline was built using scikit-learn. Here are the things I have learnt in the process (in the most naive terms, abstracting away which parameters I used in each algorithm).
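
For context, here is a minimal sketch of what such a title-classification pipeline can look like in scikit-learn. The real pipeline lives in classifier.py (see References); the sample titles below and the choice of TfidfVectorizer are mine, purely for illustration.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # Hypothetical placeholders -- in the experiment these come from the mined
    # dataset (dataset_complete.json), one label per issue title.
    titles = [
        "App crashes when opening the settings page",  # bug
        "Null pointer exception on login",             # bug
        "Add dark mode support",                       # enhancement
        "Allow exporting reports as CSV",              # enhancement
    ]
    labels = ["bug", "bug", "enhancement", "enhancement"]

    X_train, X_test, y_train, y_test = train_test_split(
        titles, labels, test_size=0.5, random_state=42, stratify=labels)

    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer()),  # issue title -> sparse term-weight vector
        ("clf", RandomForestClassifier(n_estimators=15, criterion="entropy")),
    ])
    pipeline.fit(X_train, y_train)
    print("accuracy:", accuracy_score(y_test, pipeline.predict(X_test)))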

  • The resulting accuracy seemed to go up with training data size (number of rows) 😀
  • But for a few ML algorithms, training time also went up with training data size. A few algorithms even seemed to take prediction time proportional to the training data size, for example Gaussian Naive Bayes and SVM, while prediction time was pretty much constant for tree-based algorithms like decision tree, random forest, and AdaBoost with a decision tree as the weak learner.
  • Here's the best accuracy I could achieve so far with different algorithms (without mentioning the parameters, training time or data size); a sketch of the comparison loop is included after this list.
    Algorithm        Accuracy (%)
    SVM              80.08
    AdaBoost         74.82
    Naive Bayes      68.84
    Random Forest    85.80
    Decision Tree    77.52

    This made me an obvious fan of the Random Forest ensemble, considering both speed and accuracy.

  • Feature selection seemed to improve the accuracy of the random forest classifier by a modest (not huge) margin, but it had a huge impact on training/testing time. The best results were observed when 11-percentile feature selection was applied. Without any feature selection in the pipeline, accuracy was 85.74% for the same amount of data and the same parameters (n_estimators=15, criterion='entropy'). However, training time dropped to 129s with 9-percentile feature selection, compared to 1077s when 12-percentile selection was applied, nearly 8 times as long. (A feature-selection sketch is included after this list.)
  • Accuracy is not the only metric to consider; metrics like precision, recall & f1 score are important too. At its best the classifier produced the following numbers. Note that there were two labels, so metrics like precision, recall & fbeta have two values, one per label (these are also printed in the sketch after this list).
    Random Forest: accuracy = 85.34%, precision = [0.87, 0.82],
    recall = [0.92, 0.73], fbeta_score = [0.89, 0.77]
  • PCA didn't improve the accuracy of the classifier in initial experiments; it brought it down by around 5%. As might be expected, it also took a toll on the other metrics (precision, recall & fbeta score), and it increased the time taken by the pipeline.
  • As per initial experiments, stemming and English-stopword removal didn't bring much improvement either; the changes were marginal, and in some cases the metrics actually went down. (A sketch of this variant is included after this list.)
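
For the algorithm comparison (and the timing behaviour mentioned above), the loop can look roughly like the sketch below. It continues from the train/test split in the earlier sketch; the parameters shown are plain defaults, not the tuned values from the experiment.

    import time
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import SVC
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
    from sklearn.metrics import accuracy_score

    vectorizer = TfidfVectorizer()
    X_train_vec = vectorizer.fit_transform(X_train)  # X_train / X_test from the earlier split
    X_test_vec = vectorizer.transform(X_test)

    classifiers = {
        "SVM": SVC(),
        "AdaBoost": AdaBoostClassifier(),  # default weak learner is a depth-1 decision tree
        "Naive Bayes": GaussianNB(),
        "Random Forest": RandomForestClassifier(n_estimators=15, criterion="entropy"),
        "Decision Tree": DecisionTreeClassifier(),
    }

    for name, clf in classifiers.items():
        Xtr, Xte = X_train_vec, X_test_vec
        if name == "Naive Bayes":
            # GaussianNB cannot consume sparse matrices, so densify (costly on large data)
            Xtr, Xte = Xtr.toarray(), Xte.toarray()
        start = time.time()
        clf.fit(Xtr, y_train)
        accuracy = accuracy_score(y_test, clf.predict(Xte))
        print("%s: accuracy=%.4f, time=%.1fs" % (name, accuracy, time.time() - start))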
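
The feature-selection step and the per-class metrics can be sketched as follows, again continuing from the vectorized data above. The percentile value matches the one quoted in the bullet; chi2 as the scoring function is an assumption on my part, not necessarily what classifier.py uses.

    from sklearn.feature_selection import SelectPercentile, chi2
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score, precision_recall_fscore_support

    selector = SelectPercentile(chi2, percentile=11)   # keep only the top 11% of features
    X_train_sel = selector.fit_transform(X_train_vec, y_train)
    X_test_sel = selector.transform(X_test_vec)

    clf = RandomForestClassifier(n_estimators=15, criterion="entropy")
    clf.fit(X_train_sel, y_train)
    preds = clf.predict(X_test_sel)

    precision, recall, fbeta, _ = precision_recall_fscore_support(y_test, preds)
    print("accuracy :", accuracy_score(y_test, preds))
    print("precision:", precision)  # one value per label, ordered [bug, enhancement]
    print("recall   :", recall)
    print("fbeta    :", fbeta)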
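
Finally, the stemming + stopword-removal variant can be plugged into the vectorizer roughly like this. PorterStemmer from NLTK is my choice here for illustration, and the stem_and_filter helper is hypothetical, not taken from classifier.py.

    import re
    from nltk.stem import PorterStemmer
    from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

    stemmer = PorterStemmer()

    def stem_and_filter(text):
        # tokenize the (already lower-cased) title, drop English stopwords, then stem
        tokens = re.findall(r"(?u)\b\w\w+\b", text)
        return [stemmer.stem(tok) for tok in tokens if tok not in ENGLISH_STOP_WORDS]

    vectorizer = TfidfVectorizer(tokenizer=stem_and_filter)
    X_train_stemmed = vectorizer.fit_transform(X_train)  # feed these to any of the classifiers above
    X_test_stemmed = vectorizer.transform(X_test)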

TODOS / Things to test

  • Classifier features based on POS tagging of the issue title text.
  • Write a detailed writeup on the observations with each algorithm and each part of the pipeline.
  • Write a multi-label classifier for the same problem statement.
  • Use the issue description as well in the process.

References

  1. Experiment Code – https://github.com/mebjas/ml-experiments/blob/master/classifiers/GithubLabels_pipeline/classifier.py
  2. Dataset with two labels – https://github.com/mebjas/ml-experiments/blob/master/classifiers/GithubLabels_pipeline/dataset_complete.json
  3. Mining issue data from Github – https://github.com/mebjas/gils
