Kazi Sajeed Mehrab

(I am currently in the process of transferring to a new website, and not updating this one!)

I completed my bachelor's degree in CSE at the Department of CSE, Bangladesh University of Engineering and Technology (BUET). I completed my undergraduate thesis under the supervision of Dr. Rifat Shahriyar, Professor at the Department of CSE, BUET, and worked as an undergraduate research assistant at the BUET CSE NLP Group. During my bachelor's, I worked on several research projects in Machine Learning, Natural Language Processing, and Source Code Tasks (ML for Programs). I also worked on academic projects in various fields of Computer Science, such as ML, NLP, Computer Security, Networking, and Database Management Systems.

Currently, my research focuses on ML, NLP, Data Mining, and Source Code Tasks. Alongside my research, I am a full-time Lecturer at the Department of CSE, United International University, Dhaka. I also volunteer with the IEEE Computer Society, where I serve as an International Ambassador.


Education

Bachelor of Science in Computer Science and Engineering

Bangladesh University of Engineering and Technology

February 2016 - February 2021

International A levels

Pearson Edexcel UK

2013 - February 2015

International GCSE

Pearson Edexcel UK

February 2011 - February 2013

Publications

CoDesc: A Large Code-Description Parallel Dataset

Masum Hasan, Tanveer Muttaqueen, Abdullah Al Ishtiaq, Kazi Sajeed Mehrab, Md. Mahim Anjum Haque, Tahmid Hasan, Wasi Uddin Ahmad, Anindya Iqbal, Rifat Shahriyar

Findings of the Association for Computational Linguistics, ACL 2021

Translation between natural language and source code can help software development by enabling developers to comprehend, ideate, search, and write computer programs in natural language. Despite growing interest from the industry and the research community, this task is often difficult due to the lack of large standard datasets suitable for training deep neural models, standard noise removal methods, and evaluation benchmarks. This leaves researchers to collect new small-scale datasets, resulting in inconsistencies across published works. In this study, we present CoDesc -- a large parallel dataset composed of 4.2 million Java methods and natural language descriptions. With extensive analysis, we identify and remove prevailing noise patterns from the dataset. We demonstrate the proficiency of CoDesc in two complementary tasks for code-description pairs: code summarization and code search. We show that the dataset helps improve code search by up to 22% and achieves the new state-of-the-art in code summarization. Furthermore, we show CoDesc's effectiveness in pre-training--fine-tuning setup, opening possibilities in building pretrained language models for Java. To facilitate future research, we release the dataset, a data processing tool, and a benchmark at this url.

The full paper can be found in the ACL Anthology here: CoDesc

2021

Text2App: A Framework for Creating Android Apps from Text Descriptions

Masum Hasan, Kazi Sajeed Mehrab, Wasi Uddin Ahmad, Rifat Shahriyar

Accepted (non-archival) at the NLP4Prog Workshop, co-located with ACL-IJCNLP 2021.

We present Text2App -- a framework that allows users to create functional Android applications from natural language specifications. The conventional method of source code generation tries to generate source code directly, which is impractical for creating complex software. We overcome this limitation by transforming natural language into an abstract intermediate formal language representing an application with a substantially smaller number of tokens. The intermediate formal representation is then compiled into target source code. This abstraction of programming details allows seq2seq networks to learn complex application structures with less overhead. In order to train sequence models, we introduce a data synthesis method grounded in a human survey. We demonstrate that Text2App generalizes well to unseen combinations of app components and is capable of handling noisy natural language instructions. We explore the possibility of creating applications from highly abstract instructions by coupling our system with GPT-3 -- a large pretrained language model. We perform an extensive human evaluation and identify the capabilities and limitations of our system. The source code, a ready-to-run demo notebook, and a demo video are publicly available at this url.

The full paper can be found here: Text2App

2021

Research

CoDesc: A Large Code-Description Parallel Dataset

Supervisors: Dr. Rifat Shahriyar (BUET), Dr. Anindya Iqbal (BUET), Dr. Wasi Uddin Ahmad (UCLA, AWS AI)

Status: Published in ACL Findings 2021

We collected the largest code-description parallel dataset by combining data from several scattered prior works. We manually inspected the raw data and discovered recurring noise patterns that can affect ML training. We cleaned these noise patterns and demonstrated the proficiency of CoDesc on two tasks: we obtained a new state of the art in source code summarization and improved source code search by up to 22%.

The full paper can be found in the ACL Anthology here: CoDesc

2020 - 2021

Text2App: A Framework for Creating Android Apps from Text Descriptions

Supervisors: Dr. Rifat Shahriyar (BUET), Dr. Wasi Uddin Ahmad (UCLA, AWS AI)

Status: Accepted (non-archival) at the NLP4Prog Workshop.

We design a pipeline to produce simple Android Apps from Natural Language Queries. We design our own formal language, a compiler for the language, and a data synthesizer. We make use of seq2seq networks, MIT App Inventor, and our own modules to build this framework.

Website of the work, which includes a demonstration as well: Text2App

2020 - 2021

BERT2Code: Can Pretrained Language Models be Leveraged for Code Search?

Supervisor: Dr. Rifat Shahriyar (BUET)

Status: Preprint available on arXiv

We devise a neural network method for searching source code with natural language queries. We convert source code and natural language (NL) queries into embeddings using their respective pretrained models. Different neural networks are then trained to map NL embeddings into code embeddings.

The arXiv preprint of the paper can be found here: Bert2Code

2020 - 2021

Analyzing effectiveness of medical treatments from Electronic Health Records

Supervisors: Dr. Shubhra Kanti Karmaker (Auburn University, Alabama), Dr. Anindya Iqbal (BUET)

Status: Ongoing work; started in 2021.

We aim to use NLP techniques to retrieve useful information from free-text doctors' notes. We have extracted doctors' notes written before and after a particular treatment was prescribed, and we are working to assess the effectiveness of particular treatments and identify relevant causal relationships.

Unstructured medical notes usually contain information that provides a holistic and comprehensive view of patients' medical conditions, including diagnosis history and treatment plans. A series of such notes can reveal detailed information about the treatment progress, i.e., how effective the prescribed treatment has been for the patient and the adverse effects they might have faced during this period. We plan to focus on the differences and similarities between the medical notes in a time series to draw meaningful conclusions about the overall effectiveness of a specific regimen for an illness.

2021 - Present

Work Experience

United International University, Dhaka

Lecturer, Department of CSE
July 2021 - Present

Eastern University, Dhaka

Lecturer, Department of CSE
February 2021 - July 2021

BUET CSE NLP Group, Bangladesh University of Engineering and Technology

Research Assistant
February 2020 - February 2021

Teaching

Summer 2021

United International University
  • Discrete Mathematics
  • Introduction to Computing Systems (including C Programming basics)
  • Operating Systems Lab
  • Electrical Circuits

Summer 2021

Eastern University, until Mid Terms
  • Database Management Systems
  • Information Security
  • Microprocessors, Microcontrollers and Assembly Language Programming
  • Microprocessors Lab
  • Compiler Lab

Spring 2021

Eastern University
  • Software Engineering
  • Microprocessors, Microcontrollers and Assembly Language Programming
  • Database Lab

Awards and Honors

Richard E. Merwin Scholarship of the IEEE Computer Society

The Richard E. Merwin Scholarship is an international student scholarship offered by the IEEE Computer Society. It is awarded based on academic achievements, extracurricular activities, and leadership roles. I was one of the 18 recipients selected in the Fall 2020 cycle.
Fall 2020

University Merit Scholarships

Received merit stipends from BUET in 5 of the 7 graded academic terms.
2016-2021

Innovation Fund from the ICT Division of the Government of Bangladesh

My academic thesis team was awarded this research fund by the ICT Division of Bangladesh for the Text2App and CoDesc works mentioned earlier.
February 2020

Pearson Edexcel High Achievers' Award

Received 5 awards for achieving the highest scores in the world and for strong results in IGCSEs and A Levels.
2013 and 2015

Competition Achievements

  • First Runner Up of Banglalink Social Development Goals Hackathon
  • Runner Up of Covid-19 Idea Contest by IEEE CS BDC

Projects

Bangla Parts Of Speech (POS) Tagger

Tools: Python, HuggingFace Transformers, BERT

Experimented with and developed a Bangla POS tagger using neural networks: RNN, BiLSTM, and RoBERTa. I analyzed the dataset most commonly used for Bangla POS tagging, both manually and with Python scripts, and discovered recurring noise patterns. The dataset was then cleaned and preprocessed before being used for training.

Project Repository:

2020

Decision Tree and AdaBoost for Classification

Tools: Python, NumPy, pandas, scikit-learn

Implemented a decision tree classifier and used it as the base learner in the AdaBoost algorithm for classification. The performance of the classifier was tested on several classification datasets from Kaggle.
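As a rough illustration of the boosting idea (not the project's actual code), the sketch below runs AdaBoost in plain Python with one-level decision trees (decision stumps) as the weak learner, which is a simplification of a full decision tree. All function names and the toy data are hypothetical, and labels are assumed to be +1 or -1.

```python
import math

def train_stump(X, y, w):
    """Return (weighted_error, feature, threshold, sign) of the best one-level tree."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for sign in (1, -1):
                # The stump predicts `sign` when feature j <= t, otherwise -sign.
                preds = [sign if row[j] <= t else -sign for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, sign)
    return best

def stump_predict(stump, row):
    _, j, t, sign = stump
    return sign if row[j] <= t else -sign

def adaboost(X, y, rounds=3):
    n = len(X)
    w = [1.0 / n] * n  # start with uniform sample weights
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)     # vote weight of this stump
        ensemble.append((alpha, stump))
        # Up-weight misclassified samples, down-weight correct ones, renormalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, yi, row in zip(w, y, X)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, row):
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Toy 1-D data that no single threshold can separate.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]
model = adaboost(X, y, rounds=3)
predictions = [predict(model, row) for row in X]
```

On this toy data the best single stump still misclassifies two of the six points, while the weighted vote of three stumps labels all six correctly.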

The code and documents of the project can be found here

2020

Text Classification on Stack Exchange Texts

Tools: Python, NumPy, pandas, NLTK, BeautifulSoup

Preprocessed a dataset of documents and texts collected from Stack Exchange and prepared it for a text classification task. The objective of the project was to predict the topic of a given text. The classification was done using the k-nearest neighbors (kNN) algorithm and the Naive Bayes algorithm. Several similarity measures were set up for these algorithms: Euclidean distance, Hamming distance, and TF-IDF-based similarity. The input texts were represented using simple binary vectors, bag-of-words vectors, and TF-IDF vectors. The classification accuracy and results of the different representations were then reported.
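A minimal sketch of the kNN variant over TF-IDF vectors is shown below. It is not the project's code: the cosine similarity measure, the function names, and the toy documents are assumptions made for illustration, and real Stack Exchange text would need the preprocessing described above.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def build_idf(docs):
    """Smoothed inverse document frequency for every term in the training docs."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log(n / count) + 1.0 for term, count in df.items()}

def tfidf_vector(text, idf):
    """Sparse TF-IDF vector as a dict; terms unseen in training are ignored."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    return {term: tf * idf[term] for term, tf in counts.items()}

def cosine(a, b):
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_predict(query, train_docs, train_labels, k=3):
    idf = build_idf(train_docs)
    vectors = [tfidf_vector(d, idf) for d in train_docs]
    q = tfidf_vector(query, idf)
    # Rank training documents by similarity to the query, most similar first.
    ranked = sorted(zip(vectors, train_labels),
                    key=lambda pair: cosine(q, pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy corpus with two topics.
train_docs = [
    "how to brew espresso coffee at home",
    "best grind size for pour over coffee",
    "coffee beans storage tips",
    "python list comprehension syntax",
    "python dictionary sort by value",
    "debugging a python import error",
]
train_labels = ["coffee", "coffee", "coffee",
                "programming", "programming", "programming"]
label = knn_predict("sort a python list", train_docs, train_labels, k=3)
```

Swapping `cosine` for a Euclidean or Hamming distance (and sorting in ascending order) would give the other kNN variants mentioned above.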

The code and report of the project can be found here

2020

Dimensionality Reduction and Clustering of Data

Tools: Python, NumPy, pandas, Matplotlib

In this project, a high-dimensional dataset was first reduced to two dimensions using the Principal Component Analysis algorithm. An Expectation-Maximization algorithm was then run to divide the data into clusters. The results were visualized using Matplotlib.

The data, code and results of the project can be found here

2020

Online Service Provider Website

Tools: Python, Django, MySQL

In this project, we built a website that allows general users to order various services (such as plumbing and cleaning) online. Service providers register and list their services and prices in our system, while general users can search for services, communicate and negotiate with the service providers, and confirm the delivery of a service. The whole project was designed from scratch, and a software development lifecycle was followed. All necessary requirements analysis was done, including identifying use cases and designing a use case diagram, a data flow diagram, and a class diagram. The system was designed to follow the Model-View-Controller pattern.

The requirements analysis and system design documents and diagrams, code, report, and user manual of the project can be found here

2019

Implementation of TCP Reset Attack on an SSH Server

Tools: Python, Scapy, SSH, VirtualBox VM, WireShark

In this project, I simulated the TCP Reset Attack on an SSH Server. An SSH Server was set up in a VirtualBox machine, while another virtual machine was set up to be the client. A third virtual machine was set up to be the attacker, who would send forged TCP packets to reset and disrupt the connection between the client and the SSH Server.

The detailed steps of the project can be found here

2019

Restaurants' Information Database System

Tools: Oracle PL/SQL, Oracle SQL Developer, Java, JavaFX

We developed a system that stores information about all restaurants in a city. Restaurant owners can register and enter their information into the system, while general users can look up that information, including the address, menu, and offers of different restaurants. The project focused on building an efficient database management system. We ensured that all tables are normalized, follow an Entity-Relationship model, and have appropriate integrity constraints, and we made use of various PL/SQL concepts such as triggers, database transactions, and functions.

The code of the project can be found here

2019

Air Hockey Game

Tools: Java, JavaFX

Developed the game of Air Hockey using Object Oriented Programming concepts in Java. The GUI was designed using JavaFX.

The code of the project can be found here

2016

Gesture Piano

Tools: C Programming, ATMega32 (Microcontroller), IR sensors, Piezo Buzzer, LED Matrix

In this microcontroller project, we developed a piano that can be played with finger gestures. IR sensors were used to detect finger proximity and movement, and a piezo buzzer was used to produce the tones. The system was programmed in C, and the code was loaded onto an ATMega32 microcontroller.

The code of the project can be found here

The video presentation of the project can be found here

2018

Technical Skills

Programming Languages
  • Python
  • Java
  • C and C++
  • SQL and PL/SQL
  • MATLAB
  • Octave
  • Assembly (Intel 8086)

Frameworks
  • PyTorch
  • Keras
  • HuggingFace Transformers
  • OpenNMT
  • Django

Tools/Software
  • Git and GitHub
  • PyCharm
  • NetBeans
  • CodeBlocks
  • Oracle SQL Developer
  • emu8086
  • Jupyter Notebook

Libraries
  • pandas
  • NumPy
  • scikit-learn
  • BeautifulSoup
  • Matplotlib

Scripting
  • LaTeX
  • Linux Shell Script

Familiar With
  • AWS EC2
  • ANTLR 4
  • Selenium
  • HTML, CSS

Volunteering and Leadership

ACL Student Volunteer

59th Annual Meeting of the Association for Computational Linguistics (ACL-IJCNLP 2021)

I was selected as a student volunteer for the ACL 2021 conference. My responsibilities were mainly at the help desk, where I answered participants' questions.

2021

International Ambassador

IEEE Computer Society

My responsibilities include collaborating with other IEEE sections and chapters to organize various events.

2021 - Present

Founding Chair

IEEE Computer Society BUET Student Branch Chapter

I petitioned for, formed, and led the BUET chapter of the IEEE Computer Society.

March 2019 - February 2021

Coordinator and Content Writer

BUET Systems Analysis, Design and Development Community (BSADD)

Responsible for coordinating various events, mainly live webinars, for BSADD, the largest student and alumni community of my undergraduate university. I was also in charge of writing social media content.

March 2020 - February 2021

Managing Editor and Publications Lead

Responsible for writing, editing, and managing technical articles for IoT for Bangladesh.

2018 - 2020

Magazine Subeditor and Program Coordinator

IEEE BUET Student Branch

Responsible for various written content and coordination of events.

2018 - 2020

Contact

ksmehrab at gmail dot com

sajeed at cse dot uiu dot ac dot bd