K-Nearest Neighbors

Explored by Vallie Tracy

Before doing any feature selection, I applied pd.get_dummies and kept all inputs except 'Score', 'Class', and 'Case No'. K=21 appeared to be where the graph leveled off, so using 21 neighbors, I got an accuracy score of 0.925. Because K-Nearest Neighbors doesn't have its own feature selection method, I used ExtraTreesClassifier to determine the highest-weighted inputs. The top 11 features were all 10 questions plus Age; Age ranked tenth, and question 'A8' ranked eleventh, trailing by a weight of only two ten-thousandths.
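The preprocessing and baseline steps above can be sketched as follows. This assumes a DataFrame `df` loaded from the screening data; aside from 'Score', 'Class', and 'Case No', any column names are illustrative placeholders, and the exact split and sweep range are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import ExtraTreesClassifier

def knn_baseline(df: pd.DataFrame):
    # One-hot encode everything except the target and ID-like columns.
    X = pd.get_dummies(df.drop(columns=['Score', 'Class', 'Case No']))
    y = df['Class']
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=42)

    # Sweep K to find where the accuracy curve levels off (K=21 here).
    scores = {k: KNeighborsClassifier(n_neighbors=k)
                 .fit(X_train, y_train)
                 .score(X_test, y_test)
              for k in range(1, 40, 2)}

    # KNN has no built-in feature importances, so rank the inputs
    # with an ExtraTreesClassifier instead.
    trees = ExtraTreesClassifier(n_estimators=100, random_state=42)
    trees.fit(X_train, y_train)
    ranked = pd.Series(trees.feature_importances_, index=X.columns)
    return scores, ranked.sort_values(ascending=False)
```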

After performing a grid search to tune the hyperparameters, the testing score bumped up from 0.925 to ~0.947. I then altered the feature selection five times. In addition to the questions, I focused on Age, Family History of ASD, Jaundice, and Sex. Though the ethnicity inputs 'white' and 'middle eastern' ranked 12th and 13th, I chose not to focus on ethnicity because two ethnicities accounted for the overwhelming majority of participants.
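A minimal sketch of that grid search, using scikit-learn's GridSearchCV. The parameter grid is an assumption based on the notes below (number of neighbors and the algorithm choices mentioned there), not the author's exact grid.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

def tune_knn(X_train, y_train):
    # Candidate settings; the real grid used may have differed.
    param_grid = {
        'n_neighbors': range(1, 31),
        'weights': ['uniform', 'distance'],
        'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    }
    search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
    search.fit(X_train, y_train)
    return search.best_estimator_, search.best_params_
```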

Through the five feature selection iterations, I removed the chosen features one by one, until I was eventually left with only the 10 AQ-10 questions as inputs. The highest testing score came in the last iteration, with only the questions as inputs.
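The removal loop might look like the sketch below. The extra-feature column names and the order of removal are assumptions; the end state, scoring with only the ten AQ-10 questions, matches the description above.

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

QUESTIONS = [f'A{i}' for i in range(1, 11)]  # the 10 AQ-10 questions

def removal_iterations(X, y, extras=('Sex', 'Jaundice', 'Family_ASD', 'Age')):
    # Score the full feature set, then drop one focus feature per
    # iteration until only the questions remain.
    results = {}
    remaining = list(extras)
    for _ in range(len(extras) + 1):
        cols = QUESTIONS + remaining
        score = cross_val_score(
            KNeighborsClassifier(n_neighbors=21), X[cols], y, cv=5).mean()
        results[tuple(cols)] = score
        if remaining:
            remaining.pop()  # remove the next focus feature
    return results
```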

Some interesting notes:

  • Even though Age ranked within the top 10 inputs, the testing score after grid search showed zero variance whether Age was included or not (Age was the last non-question input removed). Before hyperparameter tuning, though, the testing score varied by ~1.1%, meaning the model did improve, but ~0.986 appears to be a ceiling. The correlation heat map supports this: the mapping of Age to Class shows Age is a weak classifier.
  • Hyperparameter tuning: the chosen algorithm jumped around a bit between auto, brute, and ball_tree. The grid search chose 19-21 neighbors for a majority of the feature selection tests. Could we expect any correlation between the number of neighbors and how the machine naturally clustered participants in K-means?
  • Precision: the precision of the '1' label (1 = classified as yes for ASD) didn't improve until the last feature selection tweak, jumping from 0.91 to 0.97 in the final iteration, where I removed everything except the questions. I don't think this is due to the smaller feature set alone, since the change isn't across the board. I'm unsure where it stems from; it might be something to explore.
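One way to inspect that per-label precision is scikit-learn's classification_report, which breaks out precision for the '0' and '1' labels separately after each feature-selection pass. A minimal sketch, assuming pre-split train/test arrays:

```python
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier

def per_label_report(X_train, X_test, y_train, y_test, k=21):
    # Fit KNN and report precision/recall/F1 for each class label.
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    return classification_report(y_test, model.predict(X_test))
```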

Click below to view the code.