Research
My research interests include but are not limited to:
- Data Management: Computational Reproducibility, Data Engineering, Data Management Systems.
- Machine Learning & Natural Language Processing: Applied Machine Learning, Large Language Models.
- Software Engineering: Program Analysis, Applications of Artificial Intelligence in Software Engineering.
Projects
- Evaluating Computational Reproducibility of Jupyter Notebooks Using Machine Learning and Natural Language Processing. [Details]
Several methods were developed to compare cell outputs between original and rerun Jupyter notebooks and thereby evaluate their reproducibility. Natural Language Processing was used to extract textual features from the markdown cells, and Machine Learning models were built to automatically classify the notebooks by reproducibility.
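The simplest such comparison can be sketched as follows: walk the code cells of the original and rerun notebook files and compare their text outputs cell by cell. This is a minimal illustration, not the project's actual implementation; the function names and the restriction to `text/plain` and stream outputs are assumptions:

```python
def cell_outputs(nb):
    """Collect the text outputs of each code cell in a parsed .ipynb dict
    (e.g. loaded with json.load() from the notebook file)."""
    outs = []
    for cell in nb["cells"]:
        if cell["cell_type"] != "code":
            continue
        texts = []
        for out in cell.get("outputs", []):
            data = out.get("data", {})
            if "text/plain" in data:            # execute_result / display_data
                texts.append("".join(data["text/plain"]))
            elif "text" in out:                 # stream output (stdout/stderr)
                texts.append("".join(out["text"]))
        outs.append(texts)
    return outs

def compare_notebooks(original, rerun):
    """Return a coarse reproducibility verdict for two parsed notebooks."""
    a, b = cell_outputs(original), cell_outputs(rerun)
    if len(a) != len(b):
        return "different cell count"
    return "same results" if a == b else "different results"
```

Richer output types (images, HTML) need their own comparison rules; plain-text outputs are the easy case.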
- Reproducibility of Visualizations in Jupyter Notebooks: Algorithms vs Programmers. [Details]
Changes in visualizations upon rerunning Jupyter notebooks were tracked and categorized. A set of survey questions was designed to understand how programmers' perceptions differ from algorithmic judgments when evaluating the reproducibility of image outputs.
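An algorithmic judgment of image reproducibility can be as strict as a pixel-by-pixel comparison, which is exactly where it may diverge from a programmer's perception (flagging a one-pixel shift a human would never notice). A minimal sketch of such a metric, assuming grayscale images as 2D lists; the function name and tolerance parameter are hypothetical, not the project's actual algorithm:

```python
def pixel_diff_ratio(img_a, img_b, tol=0):
    """Fraction of pixels whose grayscale values differ by more than tol.

    Both images are equally sized 2D lists of 0-255 values.
    """
    if len(img_a) != len(img_b) or len(img_a[0]) != len(img_b[0]):
        raise ValueError("images must have the same dimensions")
    total = len(img_a) * len(img_a[0])
    changed = sum(1
                  for row_a, row_b in zip(img_a, img_b)
                  for pa, pb in zip(row_a, row_b)
                  if abs(pa - pb) > tol)
    return changed / total
```

Raising `tol` makes the algorithm more forgiving, which is one knob for bringing its verdicts closer to human judgment.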
- ‘Reproducibility’ in Computer Science Conferences. [Details]
Texts mentioning reproducibility from 55 Computer Science conference websites, along with 250 research papers drawn from those 55 conferences and from 25 conferences that do not mention reproducibility, were analyzed to understand how Computer Science conferences assess the reproducibility of submitted work and whether papers accepted at reproducibility-aware conferences share common characteristics.
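Flagging whether a conference website mentions reproducibility can be approximated with a keyword scan over the page text; the term list below is an illustrative assumption, not the study's actual inclusion criteria:

```python
# Assumed keyword stems; the study's actual criteria may differ.
TERMS = ("reproducib", "replicat", "artifact evaluation")

def mentions_reproducibility(text):
    """True if a page's text contains any reproducibility-related term."""
    low = text.lower()
    return any(term in low for term in TERMS)
```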
- Weather Station using Raspberry Pi and Environmental Sensors. [Details]
In this Software Engineering project, an IoT-based web interface was developed to report temperature, humidity, and air pressure collected by a Raspberry Pi. Good software engineering practices were applied throughout to ensure the application's reliability, scalability, and robustness.
- Clinical Trials Categorization using Machine Learning and Natural Language Processing based on their Media Attention. [Details]
Machine Learning and Natural Language Processing techniques were applied to automatically categorize clinical trials based on their Altmetric Attention Scores (AAS). Fifteen classification models were used, including general-purpose classifiers such as Decision Tree (DT), Support Vector Machine (SVM), Logistic Regression (LR), and Naive Bayes (NB), as well as ensemble methods such as Random Forest (RF), Bagging, AdaBoost, and XGBoost. Numerical and text-based features were used separately to train and test the models. Among the general-purpose classifiers, the Decision Tree performed best on the numerical features, and the ensemble methods improved performance as expected: the Random Forest and extreme gradient boosting models performed best overall, with the XGBClassifier tuned via randomized search achieving the largest ROC AUC values. The XGBClassifier likewise outperformed the other classifiers on the text features. The highest test accuracy was 94.5% for the numerical features and 80.1% for the text features.
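The performance gain from ensembles reported above rests on a general principle: averaging many weak models trained on resampled data reduces variance. A stdlib-only sketch of that principle, bagging decision stumps (one-feature, one-threshold classifiers) with majority voting; this illustrates the idea behind Bagging and Random Forest, not the project's actual models or data:

```python
import random

def stump_fit(X, y):
    """Pick the (feature, threshold, sign) minimizing training error."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X}):
            for sign in (1, -1):
                preds = [1 if sign * (x[f] - t) > 0 else 0 for x in X]
                err = sum(p != yi for p, yi in zip(preds, y))
                if best is None or err < best[0]:
                    best = (err, f, t, sign)
    return best[1:]

def stump_predict(model, x):
    f, t, sign = model
    return 1 if sign * (x[f] - t) > 0 else 0

def bagging_fit(X, y, n_estimators=25, seed=0):
    """Train one stump per bootstrap resample of the training set."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        models.append(stump_fit([X[i] for i in idx], [y[i] for i in idx]))
    return models

def bagging_predict(models, x):
    """Majority vote over the ensemble."""
    votes = sum(stump_predict(m, x) for m in models)
    return 1 if 2 * votes > len(models) else 0
```

Random Forest adds per-split feature subsampling on top of this scheme, and boosting (AdaBoost, XGBoost) instead fits each new model to the previous models' errors.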
- Customer Segmentation using Centroid Based and Density Based Clustering Algorithms. [Details]
This project illustrated the use of density-based clustering algorithms such as DBSCAN (Density-Based Spatial Clustering of Applications with Noise) for customer segmentation, alongside centroid-based algorithms such as K-means. Clusters were formed among 440 customers based on their spending habits to identify high-value customers.
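Unlike K-means, which assigns every point to the nearest of k centroids, DBSCAN grows clusters outward from dense regions and leaves sparse points labeled as noise, which is useful for isolating outlier customers. A compact, stdlib-only sketch of the algorithm (for illustration only; in practice one would use a library implementation):

```python
from math import dist

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1              # provisional noise; may become a border point
            continue
        labels[i] = cluster             # i is a core point: start a new cluster
        seeds = [j for j in nbrs if j != i]
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster     # noise reached from a core point: border point
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbors(j)
            if len(jn) >= min_pts:      # j is a core point too: keep expanding
                seeds.extend(jn)
        cluster += 1
    return labels
```

`eps` sets the neighborhood radius and `min_pts` the density needed to be a core point; points reachable from no core point come back as -1, the natural "outlier customer" bucket.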
- Fake News Detection within a Static Dataset using Supervised Machine Learning Algorithms. [Details]
As part of this academic research project, Machine Learning and Natural Language Processing (ML-NLP) based models were built to detect fake news within a static dataset. Three classifiers, Support Vector Machine (SVM), Logistic Regression (LR), and Stochastic Gradient Descent (SGD), were compared; Logistic Regression achieved the highest accuracy, 91.1%, with a precision of 87.1% and a recall of 95.6%.
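A minimal version of such a model pairs bag-of-words features with logistic regression trained by stochastic gradient descent. Everything below (the feature scheme, hyperparameters, and helper names) is an illustrative assumption rather than the project's actual setup:

```python
import math
import random
from collections import Counter

def featurize(text, vocab):
    """Bag-of-words count vector over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def train_logreg(X, y, epochs=200, lr=0.1, seed=0):
    """Logistic regression fit by stochastic gradient descent on log-loss."""
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            z = b + sum(wj * xj for wj, xj in zip(w, X[i]))
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid probability of class 1
            g = p - y[i]                     # gradient of log-loss w.r.t. z
            w = [wj - lr * g * xj for wj, xj in zip(w, X[i])]
            b -= lr * g
    return w, b

def predict(w, b, x):
    return 1 if b + sum(wj * xj for wj, xj in zip(w, x)) > 0 else 0
```

Real pipelines typically swap raw counts for TF-IDF weighting and evaluate on a held-out split; precision and recall then come from the confusion matrix of those held-out predictions.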