Autism Spectrum Disorder (ASD)

Adult ASD Detection Using Machine Learning

Objectives

We thought it would be interesting to use data we found on Autism Diagnoses from Fadi Thabtah, who had used these data sets for his own machine learning projects. He is currently a professor of Computer Science & Data Analytics at New Zealand's Manukau Institute of Technology. The data sets and his subsequent publications can be found here. We focused on the Adults dataset.

Professor Thabtah's objectives for his machine learning model were to explore a ML technique called Covering, which is an understudied classification approach. He and his colleagues set out to create a new Autism Detection method based in Covering.

We set out to dive deeper into machine learning, with this dataset as our vehicle. We decided to explore K-Nearest Neighbors, Logistic Regression, and Random Forest.

A special thank you to Fadi Thabtah for lending us his time and thoughts. And thank you to the University of California Irvine: Center for Machine Learning and Intelligent Systems for providing access to their repo.

Overview of the Autism Questionnaire and Resulting Dataset

Origins of the Dataset

The dataset is composed of user information compiled thru an app, created soley for research purposes. The app is based on something called the AQ-10, which is a condensed version of the clincally accepted diagnostic tool, the AQ test. The AQ-10 consists of the ten most statistically effective questions. (NOTE: The AQ-10 is not new with this dataset; it is not part of the machine learning).

To view or take the Autism Spectrum Test that built this dataset, visit the ASDTests App Webpage.

The Data's Structure

There are two classification labels, 'yes' and 'no' (yes = an ASD diagnosis of yes)

Users of the ASDTests app could answer each question with one of three options: "Definitely", "Slightly Agree", or "Slightly Disagree". Then, based on how the user answered, their responses were put into binary format of 0's and 1's.

While the ASDTests app asks users their age range, this dataset has been cleaned, so the column 'Screening Type' is irrelevant.

The 'Score' column ranges from 0 to 10, so depending on the ML method, that column needs to be scaled.

A score of 7 or greater means a higher statistical probability that a user has Autism.

Score ('Class' is our label column) is highly tied to questions A1 thru A10. Something to keep in mind.

Overall it's a very clean dataset with only one null row.

It's interesting to note all of the data outside of the AQ-10 questions. We're unsure of the thought process behind including these questions in the ASDTests app. Interesting because the AQ-10 questions were the best feature selection group for us, but it appears that the app creator has worked substantially with autism. What is his thought process and is there more to glean here?

	Participant Breakdown
Total Participants	1117
Min Age	17
Max Age	80
Avg Age (median)	28
Men	595
Women	522
Percent with Autism	32.05%
Without Autism	67.95%

Analysis of the Dataset

While our machine learning models focused on the 10 questions from the AQ-10 Questionnaire as inputs, we also analyzed the makeup of the dataset. In addition to individual and overall scoring for each participant, plus subsequent classification, the dataset also contains additional information on each user. This includes: Age, Sex, Ethnicity, Jaundice, Family History of ASD, Residence, Having Used the App Before, Screening Type, and Language.

Of particular interest to our group were Family History of ASD, Sex, Age, Ethnicity, and User Type. We initially hypothesized that family history would be an important factor in the machine learning, but it's actually a weak correlation according to our models.

As we did feature selection, we noticed that both white and middle eastern ethnicities were weighted highly, even higher than sex or family history. Of note, participants of white and middle eastern ethnicities make up an overwhelming majority of the dataset. We did look at the ethnic breakdown of all users, and also compared the rates of autism within the whole group to rates within each ethnicity. It could be beneficial to analyze a more evenly distributed ethnic classification. Because Autism is a global medical diagnosis, further ethnic analysis could help to hone ML detection globally. Or upon further analysis, we might find that ethnicity is of very small significance.

Lastly, we explored how user type might impact the results of the machine learning models. User types included: parent, self, relative, healthcare professional, friend, others, and teacher. We created a series of boxplots to analyze the score distribution (ranging from 0 to 10) based on the user, and broken apart by classification. We found that there appears to be a stable median 50% across both groups. This could indicate that the test is viable in terms of giving steady results, not largely swayed based on who's taking it.

Please visit our Google Colab notebooks to explore the code and view our analysis:

Ethnicity Analysis
Exploring Sex, Family History, and Age

FEATURE SELECTION: While we did explore all components of the dataset, we ultimately focused on the 10 questions from the AQ-10 questionnaire for our feature selection. Please follow this link to explore our correlation heat map, which includes all questions, Age, and our target variable, Class.

Autism Spectrum Disorder (ASD)

Adult ASD Detection Using Machine Learning

Objectives

Overview of the Autism Questionnaire and Resulting Dataset

Origins of the Dataset

The Data's Structure

Analysis of the Dataset

Please visit our Google Colab notebooks to explore the code and view our analysis:

MACHINE LEARNING MODELS

K-Nearest Neighbors

Logistic Regression

Random Forest

K-means