We thought it would be interesting to use data we found on Autism Diagnoses from Fadi Thabtah, who had used these data sets for his own machine learning projects. He is currently a professor of Computer Science & Data Analytics at New Zealand's Manukau Institute of Technology. The data sets and his subsequent publications can be found here. We focused on the Adults dataset.
Professor Thabtah's objectives for his machine learning model were to explore a ML technique called Covering, which is an understudied classification approach. He and his colleagues set out to create a new Autism Detection method based in Covering.
We set out to dive deeper into machine learning, with this dataset as our vehicle. We decided to explore K-Nearest Neighbors, Logistic Regression, and Random Forest.
A special thank you to Fadi Thabtah for lending us his time and thoughts. And thank you to the University of California Irvine: Center for Machine Learning and Intelligent Systems for providing access to their repo.
The dataset is composed of user information compiled thru an app, created soley for research purposes. The app is based on something called the AQ-10, which is a condensed version of the clincally accepted diagnostic tool, the AQ test. The AQ-10 consists of the ten most statistically effective questions. (NOTE: The AQ-10 is not new with this dataset; it is not part of the machine learning).
To view or take the Autism Spectrum Test that built this dataset, visit the ASDTests App Webpage.
| Participant Breakdown | |
|---|---|
| Total Participants | 1117 |
| Min Age | 17 |
| Max Age | 80 |
| Avg Age (median) | 28 |
| Men | 595 |
| Women | 522 |
| Percent with Autism | 32.05% |
| Without Autism | 67.95% |
While our machine learning models focused on the 10 questions from the AQ-10 Questionnaire as inputs, we also analyzed the makeup of the dataset. In addition to individual and overall scoring for each participant, plus subsequent classification, the dataset also contains additional information on each user. This includes: Age, Sex, Ethnicity, Jaundice, Family History of ASD, Residence, Having Used the App Before, Screening Type, and Language.
Of particular interest to our group were Family History of ASD, Sex, Age, Ethnicity, and User Type. We initially hypothesized that family history would be an important factor in the machine learning, but it's actually a weak correlation according to our models.
As we did feature selection, we noticed that both white and middle eastern ethnicities were weighted highly, even higher than sex or family history. Of note, participants of white and middle eastern ethnicities make up an overwhelming majority of the dataset. We did look at the ethnic breakdown of all users, and also compared the rates of autism within the whole group to rates within each ethnicity. It could be beneficial to analyze a more evenly distributed ethnic classification. Because Autism is a global medical diagnosis, further ethnic analysis could help to hone ML detection globally. Or upon further analysis, we might find that ethnicity is of very small significance.
Lastly, we explored how user type might impact the results of the machine learning models. User types included: parent, self, relative, healthcare professional, friend, others, and teacher. We created a series of boxplots to analyze the score distribution (ranging from 0 to 10) based on the user, and broken apart by classification. We found that there appears to be a stable median 50% across both groups. This could indicate that the test is viable in terms of giving steady results, not largely swayed based on who's taking it.
FEATURE SELECTION: While we did explore all components of the dataset, we ultimately focused on the 10 questions from the AQ-10 questionnaire for our feature selection. Please follow this link to explore our correlation heat map, which includes all questions, Age, and our target variable, Class.