Research and Projects

March 2024

[p7] Bill Split

An application for splitting grocery bills unevenly among my roommates, which turned into a public web app for everyone to use!

Skillset

Python
Flask
Docker
Google Cloud Run
GitHub Actions
HTML, CSS, JavaScript

Objectives

Given the participants, the items, and each participant's share of each item, return the amount each participant owes for a bill.
Use Flask for a Python-based frontend and backend.
Use Docker for easy deployment to the web.
Deploy on the web using Google Cloud Run, integrating with GitHub for CI/CD.
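The first objective can be sketched as follows. This is only an illustration of an uneven split, not the app's actual code: the function name `split_bill`, the tuple layout, and the sample names and prices are all assumptions made for the example.

```python
from collections import defaultdict


def split_bill(items):
    """Return the amount each participant owes.

    `items` is a list of (name, price, shares) tuples, where `shares`
    maps a participant to their relative weight for that item.
    (Hypothetical data layout, chosen for this sketch.)
    """
    totals = defaultdict(float)
    for name, price, shares in items:
        total_shares = sum(shares.values())
        if total_shares <= 0:
            raise ValueError(f"item {name!r} has no shares assigned")
        # Each participant pays in proportion to their share of the item.
        for person, share in shares.items():
            totals[person] += price * share / total_shares
    # Round to cents for display.
    return {person: round(amount, 2) for person, amount in totals.items()}


# Made-up example bill.
bill = [
    ("milk", 4.00, {"Asha": 1, "Ben": 1}),    # split evenly
    ("rice", 9.00, {"Asha": 2, "Ben": 1}),    # Asha takes two thirds
    ("snacks", 6.00, {"Cara": 1}),            # Cara pays alone
]
dues = split_bill(bill)
print(dues)
```

Rounding each total to cents can leave the sum a cent off the bill total in general, so a real app would need a rule for assigning the remainder.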

Try it out!

Aug 2023 - Dec 2023

[p6] Comprehensive Evaluation of ChatGPT Reliability Through Multilingual Inquiries

Conducted a thorough evaluation of ChatGPT's reliability by examining its responses to inquiries in multiple languages, focusing on the presence of jailbreak vulnerabilities which could potentially have adverse effects on users, including facilitating criminal activities. The study emphasized the importance of identifying and mitigating these vulnerabilities to enhance ChatGPT's security and reliability.

This was my first project as a Graduate Research Assistant under Prof. Yiming Tang. Worked with fellow student and friend Poorna Chander and got a chance to apply the knowledge gained in my graduate program.

Skillset

Data annotation
MS Excel
OpenAI API
Python

Objectives

Assess ChatGPT's reliability across multiple languages to understand its global usability and potential risks.
Identify the presence of jailbreak vulnerabilities within ChatGPT's responses.
Propose measures to enhance ChatGPT's security and mitigate the identified vulnerabilities.

Results

Discovery of specific instances where ChatGPT could potentially facilitate jailbreak scenarios, highlighting security flaws.
Evidence of varied reliability across different languages, with some languages showing higher susceptibility to vulnerabilities.
Recommendations for improving ChatGPT's security measures, including the implementation of more robust filtering and monitoring algorithms to prevent exploitation of vulnerabilities.

View the paper here

Aug 2023 - Dec 2023

[p5] Analyzing Developer-ChatGPT Conversations for Software Refactoring: An Exploratory Study

Conducted a comprehensive analysis of the impact of a large language model (ChatGPT) on software development using data from GitHub and Hacker News, focusing on its role in code refactoring and enhancing developer interactions.

This paper was co-authored with my classmates at RIT, Omkar Chavan, Divya Hinge, and Olivia Wang, under the guidance of Prof. Dr. Mohamed Wiem Mkaouer. The paper was accepted at Mining Software Repositories 2024 (MSR 2024). A link to the paper will be added here once it is published.

Skillset

Python
API requests to GitHub (data retrieval based on commit SHAs)
Statistical analysis
SQLite
JSON

Objectives

Analyze the interaction dynamics between developers and ChatGPT, focusing on conversation content and engagement patterns.
Examine how ChatGPT assists in code refactoring, identifying effective strategies and outcomes.
Quantify the average number of prompts needed to achieve resolution in ChatGPT interactions.

Results

Developer-ChatGPT conversations cover software engineering topics including documentation (9.5%), issue resolution (22.1%), new feature development (44.6%), configuration, testing, code refactoring (12.2%), and others (9.9%).
For code refactoring, developers either give specific instructions or allow ChatGPT to suggest improvements, with 54 out of 447 conversations focusing on refactoring.
On average, the number of prompts needed to reach a conclusion varies by topic, from 3 for commits, around 4 for discussions, Hacker News, and issues, to 5 for pull request-related conversations.
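The per-source averages above come down to grouping conversations by where they were shared and averaging the prompt counts. A minimal sketch of that calculation, assuming each conversation is a dict with hypothetical `source` and `prompts` keys (the real dataset's schema may differ, and the sample records below are made up):

```python
from collections import defaultdict
from statistics import mean


def average_prompts_by_source(conversations):
    """Average number of prompts per conversation, grouped by source."""
    counts_by_source = defaultdict(list)
    for conversation in conversations:
        counts_by_source[conversation["source"]].append(
            len(conversation["prompts"])
        )
    return {source: mean(counts) for source, counts in counts_by_source.items()}


# Made-up records, only to show the shape of the input.
sample = [
    {"source": "commit", "prompts": ["p1", "p2", "p3"]},
    {"source": "commit", "prompts": ["p1"]},
    {"source": "pull request", "prompts": ["p1", "p2", "p3", "p4", "p5"]},
]
averages = average_prompts_by_source(sample)
print(averages)
```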

Oct 2023 - Dec 2023

[p4] Multi-Turn Dialogue Analysis: Conversations Between Developers & ChatGPT

Explored the dynamics of Developer-ChatGPT conversations, analyzing interactions to uncover insights into user behavior and conversation patterns. Utilizing a dataset from GitHub and Hacker News, employed exploratory data analysis and the YAKE algorithm for keyword extraction, aiming to highlight contextual patterns and the impact of ChatGPT in software development.

This project was part of the Natural Language Processing course at RIT and pushed me to try various NLP techniques before settling on a simple and effective approach.

Skillset

Python
Statistical Analysis
NLTK
YAKE
Gensim
spaCy
JSON

Objectives

Investigate the dynamics of Developer-ChatGPT conversations.
Identify common themes and keywords in conversations using the YAKE algorithm.
Analyze sentiment distribution in developer interactions with ChatGPT.

Results

Effective identification of 'working sets' and action-oriented keywords (e.g., 'make', 'create') through the YAKE algorithm.
Discovery of patterns indicating positive and negative feedback between developers and ChatGPT.
Sentiment analysis revealed a predominance of positive sentiments in both prompts and responses, with a substantial number of prompts classified as neutral.

Find the report here

Feb 2023 - May 2023

[p3] Recommending Extract Method Refactoring Opportunities via Multi-view Representation of Code Property Graph

Researched and compared multiple embedding techniques derived from code property graphs for building machine learning models that recommend extract method refactoring opportunities, with the goal of finding the optimal embedding combinations for open source Java projects.

This project was a replication of work done in a paper published on Anonymous GitHub. We found multiple shortcomings in the methodology of the original research, and I worked to improve on these with my classmates Manohar Reddy Uppula and Meghana Kalluri as part of our Software Engineering for Data Science course at RIT.

Skillset

Java
Python
scikit-learn
Matplotlib
Seaborn

Objectives

Develop REMS to automate Extract Method refactoring recommendations.
Surpass limitations of existing heuristic and data-driven approaches.
Study the methodology used to develop REMS and patch any shortcomings.

Results

Established the optimal flow-view (CodeBERT) and tree-view (GraRep) representations (embeddings) of code for extract method refactoring.
Identified and rectified data leakage in the original methodology.

Find the report here

This is how I did it

Apr 2023 - May 2023

[p2] Multi-Class Image Classification

Developed a deep CNN in PyTorch, using dropout and batch normalization, that classifies the CIFAR-10 dataset with 75% accuracy using only 82,130 parameters.

This project was the culmination of the Neural Networks course at RIT. Here, we (finally) used PyTorch instead of Java to develop a CNN that performs competently on the CIFAR-10 dataset without being resource heavy.

Skillset

Python
Java
PyTorch
Convolutional Neural Networks (CNNs)

Objectives

Develop a CNN to achieve effective performance when tested on held out data from the CIFAR-10 dataset.

Results

Achieved an accuracy of 0.75 on held-out CIFAR-10 data.
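To give a sense of how a CNN's parameter count adds up, here is a small sketch of the standard per-layer formulas. The layer sizes below are hypothetical and chosen only for illustration; they are not the architecture of the model described here and do not sum to its parameter count.

```python
def conv2d_params(in_channels, out_channels, kernel_size, bias=True):
    """Learnable parameters in a 2D convolution with a square kernel."""
    weights_per_filter = in_channels * kernel_size * kernel_size
    return out_channels * (weights_per_filter + (1 if bias else 0))


def batchnorm_params(channels):
    """Learnable scale and shift per channel (running stats are not counted)."""
    return 2 * channels


def linear_params(in_features, out_features, bias=True):
    """Learnable parameters in a fully connected layer."""
    return out_features * (in_features + (1 if bias else 0))


# Hypothetical tiny network for 3-channel images and 10 classes.
layers = [
    conv2d_params(3, 16, 3),    # 3x3 conv, 3 -> 16 channels
    batchnorm_params(16),
    conv2d_params(16, 32, 3),   # 3x3 conv, 16 -> 32 channels
    batchnorm_params(32),
    linear_params(32, 10),      # classifier after global average pooling
]
total_params = sum(layers)
print(total_params)
```

Dropout adds no learnable parameters, which is one reason it is a cheap way to regularize a small network.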

Here's the CNN

Nov 2022 - Dec 2022

[p1] Fake Job Posting Prediction

Developed a cascading machine learning model employing neural networks and decision trees utilizing textual and numerical fields for binary classification of a highly imbalanced data set to achieve an F1-score of 0.72.

This project was an exercise to deal with imbalanced data (5% of the records were "fake job listings"). It was my first tryst with imbalanced data which continued through future projects [4] and also my first project using neural networks which sparked an interest to take up the neural networks course next semester.

Skillset

Python
scikit-learn
Matplotlib
Seaborn

Objectives

Develop a model using continuous and categorical data to predict whether a record is labelled as "true" or "false".
Tackle the imbalanced data problem.

Results

Developed a model with an F1-score of 0.72.
Used a cascading model which utilized neural networks and decision trees to account for data imbalance.
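The reason F1-score is reported here rather than accuracy can be shown with a short calculation. The counts below are hypothetical, picked to mirror a 5% positive rate; they are not the confusion matrix from this project.

```python
def f1_score(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical dataset: 1000 postings, 50 of them fake (5%).
total, fake = 1000, 50

# A classifier that labels every posting as real looks accurate...
accuracy_all_real = (total - fake) / total
# ...but it never finds a fake posting, so its F1 on the fake class is zero.
f1_all_real = f1_score(tp=0, fp=0, fn=fake)

# A hypothetical classifier that catches 30 of the 50 fakes with 10 false alarms.
f1_example = f1_score(tp=30, fp=10, fn=20)

print(accuracy_all_real, f1_all_real, round(f1_example, 3))
```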

Look at my first data science project
