Built logistic regression and decision tree models in Python to predict heart disease, kidney disease, and skin cancer from patient health data. Achieved up to 75% accuracy and AUC scores as high as 0.84. Class imbalance proved a key challenge: the models exhibited high recall (e.g., 78%) but low precision for positive cases. Evaluated performance using F1 score and AUC, and checked generalizability by confirming consistent training and testing accuracy, indicating no overfitting.
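The high-recall/low-precision pattern described above falls directly out of the confusion-matrix formulas. A minimal pure-Python sketch, using hypothetical counts for a rare positive class (not the project's actual numbers):

```python
# Precision, recall, and F1 from raw confusion-matrix counts. The counts
# below are illustrative: few positives are missed (high recall), but false
# positives flood the predictions (low precision), dragging down F1.

def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=78, fp=222, fn=22)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# -> precision=0.26 recall=0.78 f1=0.39
```

This is why accuracy alone is misleading on imbalanced data, and why F1 and AUC were used for evaluation instead.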
Completed an internship developing a player performance dashboard in Python and Excel, implementing a full ETL process to extract, clean, and organize performance data. Extracted, combined, and processed over 500 Excel and CSV files, each representing an individual training session or match, collected via wearable Playermaker devices that track metrics such as sprint speed, acceleration, and foot usage. Cleaned, transformed, and filtered the dataset in Python to generate 27 CSV files, one per player. Designed a user-friendly Excel dashboard to visualize this information and collaborated with coaching staff to align the tool with their decision-making needs.
Developed a logistic regression model in Python to predict loan default using LendingClub data. To address class imbalance, implemented various resampling techniques including SMOTE, ADASYN, RandomOverSampler, and RandomUnderSampler. Performed data preprocessing and model evaluation to compare the effectiveness of these strategies, achieving up to 70% recall, 0.70 AUC, and a 0.35 F1-score on unseen test data, demonstrating the utility of advanced sampling strategies in improving predictive accuracy for rare events.
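The project used imbalanced-learn's implementations (SMOTE, ADASYN, and the random samplers); as an illustration of the simplest of these, here is a pure-Python sketch of what random oversampling does for binary labels:

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate random minority-class rows until both classes are balanced.

    A pure-Python sketch of RandomOverSampler's behavior for binary labels;
    the actual project used imbalanced-learn's implementations, which also
    include synthetic-sample methods like SMOTE and ADASYN.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    majority, minority = sorted(counts, key=counts.get, reverse=True)
    minority_rows = [(xi, yi) for xi, yi in zip(X, y) if yi == minority]
    n_extra = counts[majority] - counts[minority]
    resampled = list(zip(X, y)) + [rng.choice(minority_rows)
                                   for _ in range(n_extra)]
    X_res, y_res = zip(*resampled)
    return list(X_res), list(y_res)
```

SMOTE and ADASYN go further by interpolating new synthetic minority samples between neighbors rather than duplicating rows, which is why they often generalize better on rare-event problems like loan default.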
Worked with real-world web traffic data from grammy.com and recordingacademy.com to evaluate the impact of The Recording Academy's decision to split its main website into two distinct platforms. Using Python and Pandas, I conducted an in-depth analysis to uncover differences in user behavior, traffic trends, and engagement across both sites. Key insights were visualized with the Plotly Express library to support data-driven conclusions and show how the split affected audience interaction.
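One simple way to quantify the effect of a site split is each platform's share of the combined daily traffic. A stdlib-only sketch with a hypothetical `(site, date, pageviews)` record schema standing in for the actual datasets:

```python
from collections import defaultdict

def traffic_share_by_date(records):
    """For (site, date, pageviews) records, return each site's share of the
    combined traffic on each date.

    Hypothetical schema standing in for the grammy.com /
    recordingacademy.com data; the real analysis used Pandas.
    """
    views = defaultdict(lambda: defaultdict(int))
    for site, date, pageviews in records:
        views[date][site] += pageviews
    shares = {}
    for date, per_site in views.items():
        total = sum(per_site.values())
        shares[date] = {site: v / total for site, v in per_site.items()}
    return shares
```

Plotting these shares over time (e.g., with `plotly.express.line`) makes it easy to see whether one platform absorbed most of the audience after the split.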
Used SQL to analyze collision data from the California Statewide Integrated Traffic Records System (SWITRS), focusing on accidents involving alcohol impairment or driver inattention (e.g., texting or phone use). The goal was to identify patterns in the occurrence of these accidents, including when they are most likely to happen. Findings were visualized using Tableau to effectively communicate insights and trends.
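The "when do these accidents happen" question reduces to a filtered GROUP BY over the collision records. A runnable sketch using Python's built-in sqlite3 module; the table and column names here are illustrative, not the actual SWITRS schema:

```python
import sqlite3

# Minimal sketch of the query pattern used: count collisions involving
# alcohol or inattention, grouped by hour of day. Schema is illustrative,
# not the real SWITRS layout.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE collisions (
        id INTEGER PRIMARY KEY,
        hour INTEGER,              -- hour of day, 0-23
        alcohol_involved INTEGER,  -- 1 if alcohol was a factor
        inattention INTEGER        -- 1 if texting/phone use was a factor
    )
""")
conn.executemany(
    "INSERT INTO collisions (hour, alcohol_involved, inattention) "
    "VALUES (?, ?, ?)",
    [(23, 1, 0), (23, 1, 0), (2, 1, 0), (14, 0, 1), (9, 0, 0)],
)
rows = conn.execute("""
    SELECT hour, COUNT(*) AS n
    FROM collisions
    WHERE alcohol_involved = 1 OR inattention = 1
    GROUP BY hour
    ORDER BY n DESC
""").fetchall()
print(rows)  # hours ranked by impaired/inattentive collision counts
```

The hourly counts produced this way are exactly the kind of aggregate that feeds cleanly into a Tableau visualization.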
Developed a single-variable regression model in SAS to predict flight arrival delays at Houston’s IAH airport, using one month (October) of data. The project involved analyzing five independent variables related to delays, such as weather, carrier issues, and late aircraft. Various plots and statistical summaries were generated to explore relationships between variables. The analysis revealed that carrier delay had the strongest impact on overall arrival delay times, providing key insights into operational inefficiencies.
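The single-variable fit behind this analysis reduces to the ordinary least-squares formulas for slope and intercept. A pure-Python stand-in for the SAS procedure, with a toy dataset (not the actual flight data):

```python
def simple_ols(x, y):
    """Least-squares fit y ~ b0 + b1 * x; returns (intercept, slope).

    slope = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Toy example: if arrival delay grows ~2 minutes per minute of carrier
# delay, the fit recovers that relationship exactly.
intercept, slope = simple_ols([0, 1, 2, 3], [1, 3, 5, 7])  # -> (1.0, 2.0)
```

Comparing the fit quality (e.g., R-squared) across each of the five candidate predictors is how one variable, here carrier delay, is identified as the strongest driver.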
Utilized Python to explore a comprehensive dataset of Olympic medalists from 1896 to 2016. Conducted data cleaning, analysis, and visualization to answer key questions such as: Who are the youngest and oldest medalists of all time? Are there physical differences between Summer and Winter Olympic medalists? Which country has won the most medals overall? This project demonstrates the use of Python for historical data exploration and insights into global athletic performance trends.
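Questions like "youngest and oldest medalist" map directly onto `min`/`max` with a key function over the cleaned records. A sketch with a few illustrative rows (the real dataset has one row per medalist per event):

```python
# Illustrative slice of a medalists table; the real analysis ran over the
# full 1896-2016 dataset after cleaning.
medalists = [
    {"name": "Dimitrios Loundras", "age": 10, "year": 1896},
    {"name": "Oscar Swahn", "age": 72, "year": 1920},
    {"name": "Usain Bolt", "age": 21, "year": 2008},
]

youngest = min(medalists, key=lambda m: m["age"])
oldest = max(medalists, key=lambda m: m["age"])
print(youngest["name"], "->", oldest["name"])
```

The country-level medal counts use the same idea at scale, typically a `groupby("country").size()` in Pandas followed by a sort.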
Utilized SAS to develop predictive models for the number of wins by the Toronto Blue Jays. Built and evaluated multiple linear regression models to determine the best combinations of variables. The most effective two-variable model included hits and batting average, while the top three-variable model incorporated hits, batting average, and RBIs as predictors of team wins.
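A two-predictor model like the hits-plus-batting-average one comes from solving the normal equations X'Xb = X'y. A pure-Python sketch of that computation (the project itself used SAS's regression procedures, not this code):

```python
def ols_two_predictors(x1, x2, y):
    """Fit y ~ b0 + b1*x1 + b2*x2 by solving the normal equations X'Xb = X'y.

    Pure-Python stand-in for a two-variable regression (e.g., hits and
    batting average as predictors of wins). Returns [b0, b1, b2].
    """
    n = len(y)
    X = [[1.0, a, b] for a, b in zip(x1, x2)]
    # Build the 3x3 matrix X'X and the vector X'y.
    xtx = [[sum(X[k][i] * X[k][j] for k in range(n)) for j in range(3)]
           for i in range(3)]
    xty = [sum(X[k][i] * y[k] for k in range(n)) for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting on [X'X | X'y].
    A = [row[:] + [t] for row, t in zip(xtx, xty)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(3):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][3] / A[i][i] for i in range(3)]
```

Model selection then amounts to fitting each candidate variable combination this way and comparing fit statistics such as adjusted R-squared.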
After acquiring the GoBike program from Ford, Lyft aimed to boost memberships by understanding how users interact with the service. Using SQL, I analyzed user behavior by joining and querying multiple datasets to compare patterns between former Ford GoBike users and current Lyft users. This analysis helped uncover key differences in usage trends to inform business strategy.
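The join-and-compare pattern at the heart of that analysis can be demonstrated with Python's built-in sqlite3 module. Table and column names below are illustrative, not the actual Lyft/GoBike schema:

```python
import sqlite3

# Sketch of the pattern used: join trips to users, then compare usage
# between former Ford GoBike users and newer Lyft users. Schema is
# illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, origin TEXT);
    CREATE TABLE trips (trip_id INTEGER PRIMARY KEY, user_id INTEGER,
                        duration_min REAL);
""")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "gobike"), (2, "lyft"), (3, "gobike")])
conn.executemany("INSERT INTO trips VALUES (?, ?, ?)",
                 [(10, 1, 12.0), (11, 1, 18.0), (12, 2, 30.0),
                  (13, 3, 10.0)])
rows = conn.execute("""
    SELECT u.origin, COUNT(*) AS trips, AVG(t.duration_min) AS avg_min
    FROM trips t
    JOIN users u ON u.user_id = t.user_id
    GROUP BY u.origin
    ORDER BY u.origin
""").fetchall()
print(rows)  # trip counts and average duration per user cohort
```

Per-cohort aggregates like these are what surface the behavioral differences between legacy and new users that inform business strategy.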
Hi, I’m Ayaan Omair, a Master’s student in Data Science at Texas A&M University, with a strong foundation in statistics and analytics. I recently earned my Bachelor’s degree in Mathematics (Statistics) from Arizona State University, graduating summa cum laude with a 3.90 GPA.
During my undergraduate studies, I built a solid skill set in data analysis, working with tools such as Python, SQL, and R. I’ve applied these tools across a range of personal and professional projects, developing expertise in data cleaning, preprocessing, wrangling, visualization, and exploratory data analysis (EDA).
I also have hands-on experience with a variety of machine learning techniques, including k-means clustering, Gaussian mixture models, decision trees, logistic regression, as well as linear, lasso, and ridge regression. When I’m not working with data, you’ll probably find me watching sports. I’m a huge fan of the Green Bay Packers, Phoenix Suns, and Manchester United.