More Projects | Alvitay

This page collects the additional projects I have done across different topics. Click on a topic to see its projects.

1. Data Analysis and Visualization

Socio-economic Factors for Geographic Clustering

Developed a clustering model to group countries based on socio-economic indicators such as child mortality, health expenditure, and income.
Preprocessed the data using StandardScaler to normalize features and ensure consistent scaling across models.
Employed various clustering techniques, including K-Means, K-Medoids, Gaussian Mixture, DBSCAN, and Agglomerative Hierarchical Clustering.
The resulting clusters highlight global socio-economic disparities and can help inform policy decisions and development strategies.
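
The scale-then-cluster workflow described above can be sketched with scikit-learn as follows. This is only a minimal illustration: the indicator values are invented, and the choice of two clusters is an assumption for this toy data, not the project's setting.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: child mortality, health spending (% of GDP), income.
# All values are invented for illustration.
X = np.array([
    [90.0, 4.0, 1500.0],
    [85.0, 5.0, 1800.0],
    [80.0, 4.5, 2100.0],
    [5.0, 9.0, 42000.0],
    [4.0, 10.0, 45000.0],
    [6.0, 8.5, 39000.0],
])

# Standardize so every feature has mean 0 and unit variance; otherwise
# income would dominate the Euclidean distances used by K-Means.
X_scaled = StandardScaler().fit_transform(X)

# n_clusters=2 is an assumption that suits this toy data.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)
print(labels)
```

The same scaled matrix could be passed to the other clustering algorithms listed above in order to compare their groupings.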

Unsupervised Learning: Fantasy Sports Clustering Analysis

Developed an advanced clustering model to segment fantasy sports players based on performance metrics like goals, assists, and total points.
Utilized techniques such as K-Means, K-Medoids, DBSCAN, Gaussian Mixture Models, and Agglomerative Hierarchical Clustering to uncover distinct player groups.
Applied StandardScaler to normalize the data and PCA for dimensionality reduction, so that no single metric dominated the distance calculations.
Each cluster was thoroughly profiled to deliver insights into player potential and pricing strategies for the next season.
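
To show what the standardize-then-PCA step does, here is a small sketch that computes PCA directly with NumPy's SVD. The player metrics are synthetic, and keeping two components is an assumption made for the example; in practice scikit-learn's PCA would be the usual tool.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic player metrics: goals, assists, total points (invented data).
goals = rng.poisson(5, size=50).astype(float)
assists = rng.poisson(3, size=50).astype(float)
points = 4 * goals + 3 * assists + rng.normal(0, 1, size=50)
X = np.column_stack([goals, assists, points])

# Standardize each column to mean 0 and unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD of the standardized matrix: rows of Vt are the principal axes,
# and the squared singular values give the variance along each axis.
U, S, Vt = np.linalg.svd(X_std, full_matrices=False)
explained_ratio = S**2 / np.sum(S**2)

# Project onto the first two components to get a 2-D input for clustering.
X_2d = X_std @ Vt[:2].T
print(explained_ratio.round(3))
```

Because total points is almost a linear combination of goals and assists here, two components retain nearly all of the variance.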

2. Machine Learning

BigMart Sales Prediction

For the BigMart Sales Prediction project, I developed a machine learning model to predict the sales of products across different stores using various features such as product characteristics and store attributes.
The project involved thorough data cleaning, feature engineering, and handling missing values.
I used Linear Regression as the predictive model and MinMaxScaler to normalize the features.
Through statistical analysis and visualization, I identified key factors influencing sales. The insights gained were aimed at providing actionable recommendations for inventory management and sales strategy optimization.

HR Employee Attrition Prediction

Developed an Employee Attrition Prediction model to identify key factors driving employee turnover and predict the likelihood of attrition.
Applied machine learning techniques such as Logistic Regression, K-Nearest Neighbors (KNN), and Linear Discriminant Analysis (LDA), with GridSearchCV for hyperparameter tuning to enhance model performance.
Utilized SHAP (SHapley Additive exPlanations) to interpret the contribution of each feature to the predictions, providing deep insights into the drivers of employee attrition.
StandardScaler was used for feature scaling, ensuring consistent data handling.
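
A minimal sketch of hyperparameter tuning with GridSearchCV follows, using Logistic Regression as one of the models named above. The data is synthetic and the grid values and scoring metric are illustrative assumptions, not the ones used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced binary data standing in for an attrition table.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    weights=[0.8, 0.2], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scaling lives inside the pipeline so it is re-fit on each CV training fold,
# which avoids leaking validation-fold statistics into the scaler.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Illustrative grid over the regularization strength C.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

# score() uses the same scorer as the search, so this is the test-set F1.
test_f1 = search.score(X_test, y_test)
print(search.best_params_, round(test_f1, 3))
```

F1 is used here rather than accuracy because attrition data is typically imbalanced; that choice is also an assumption of the sketch.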

SuperKart Sales Prediction

For the SuperKart Sales Prediction project, I built a Linear Regression model to forecast sales based on product and store attributes. Key tools like statsmodels and scikit-learn were used for model building and diagnostics.
I performed residual analysis to check that the model is unbiased (mean of residuals ≈ 0), tested for homoscedasticity using the Goldfeld-Quandt test, assessed the linearity of variables through regression plots, and checked the normality of error terms via Q-Q plots.
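
The residual checks above can be illustrated with a small NumPy sketch. The data is synthetic, and the variance comparison is a simplified stand-in in the spirit of the Goldfeld-Quandt test (the formal test fits separate regressions on the two subsamples and compares their residual variances with an F statistic).

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-ins for product and store attributes (invented data).
n = 200
price = rng.uniform(10, 100, size=n)
store_size = rng.uniform(1, 5, size=n)
sales = 50 + 2.0 * price + 30.0 * store_size + rng.normal(0, 5, size=n)

# Ordinary least squares with an explicit intercept column.
X = np.column_stack([np.ones(n), price, store_size])
coef, *_ = np.linalg.lstsq(X, sales, rcond=None)

# With an intercept in the model, OLS residuals average to zero.
residuals = sales - X @ coef
print("mean residual:", residuals.mean())

# Rough homoscedasticity check: sort by one regressor, then compare the
# residual variance in the lower and upper halves. A ratio near 1 suggests
# constant error variance.
order = np.argsort(price)
low = residuals[order[: n // 2]]
high = residuals[order[n // 2:]]
variance_ratio = high.var(ddof=1) / low.var(ddof=1)
print("variance ratio:", variance_ratio)
```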

3. Practical Data Science

Forecasting Consumer Price Index

For the Consumer Price Index Forecasting project, I applied time series modeling techniques to predict the Consumer Price Index (CPI) for the next five years. Utilizing ARIMA and SARIMA models, I captured the seasonal trends and cyclic behaviors of CPI data.
Preprocessing involved transforming the date into the appropriate format and splitting the data into training and testing sets.
To ensure accuracy, I performed model evaluation using metrics like Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).
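
As a small illustration of the chronological split and the two error metrics, the sketch below evaluates a naive baseline forecast in plain Python. The index values and the 80/20 split are assumptions for the example; a fitted ARIMA or SARIMA forecast would take the place of the baseline list.

```python
import math

# Hypothetical index values in time order (invented for illustration).
series = [100.0, 101.0, 102.5, 103.0, 104.5, 105.0, 106.0, 108.0, 109.0, 110.0]

# Time series are split chronologically, never shuffled:
# the last 20% of observations form the test set.
split = int(len(series) * 0.8)
train, test = series[:split], series[split:]

# Naive baseline: repeat the last training value for every test step.
forecast = [train[-1]] * len(test)

errors = [actual - predicted for actual, predicted in zip(test, forecast)]
mae = sum(abs(e) for e in errors) / len(errors)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
print(f"MAE={mae:.3f} RMSE={rmse:.3f}")  # prints MAE=1.500 RMSE=1.581
```

RMSE is never smaller than MAE because squaring weights large errors more heavily, so a wide gap between the two points to a few large misses.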

Classification Project - Hotel Booking Cancellation Prediction

For the Hotel Booking Cancellation Prediction project, I developed a machine learning model to predict whether a customer’s hotel booking will be canceled, with the goal of minimizing last-minute cancellations, reducing revenue loss, and optimizing booking strategies.
Leveraging classification models such as Decision Trees, Random Forests, and ensemble methods, I tuned the hyperparameters using GridSearchCV to optimize model performance.
In this project, I conducted extensive feature engineering, applied scaling techniques to normalize data, and evaluated model performance through residual analysis and feature importance.

Crude Oil Production Forecasting

I developed a comprehensive time series model for the Crude Oil Production Forecasting project to predict future oil production using historical data from 1992 to 2018.
Time series decomposition, visualizations, and trend analysis were also applied to provide clear insights into the production patterns.
The project involved advanced techniques like ARIMA, Auto ARIMA, AR, MA, and ARMA to capture trends, seasonal variations, and cyclic behaviors.
Leveraging tools such as Statsmodels and pmdarima, I conducted thorough model evaluations and tuned parameters to enhance accuracy.

4. Deep Learning

Audio MNIST Digit Recognition

For the Audio MNIST Digit Recognition project, I developed an advanced neural network model using TensorFlow to classify spoken digits from audio recordings.
The project leveraged the Audio MNIST dataset, converting the audio files into MFCC spectrograms using Librosa for preprocessing and feature extraction. These spectrograms were then treated as image data for classification.
I utilized TensorFlow to build an Artificial Neural Network (ANN) for digit recognition and employed scikit-learn for performance evaluation, ensuring model accuracy and robustness.
This project highlights how MFCC, audio processing techniques, TensorFlow, and deep learning combine to tackle speech recognition and audio classification in a practical, real-world way.

Predicting Chances of Admission

For the project on Predicting Chances of Admission into UCLA, I developed a neural network-based classification model using a Sequential model from TensorFlow.
The model was optimized using different optimizers, including Adam and SGD, to enhance its predictive performance.
Key features such as GRE scores, TOEFL scores, undergraduate GPA, and research experience were used for prediction, with MinMaxScaler applied for feature scaling. By converting continuous probabilities of admission into categorical outcomes, the model provided clear and actionable insights into a student's likelihood of being admitted.
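
The two preprocessing ideas mentioned above, min-max scaling and turning a continuous admission probability into a category, can be sketched in plain Python. The sample values and the 0.8 cutoff are assumptions for illustration; the source does not state the threshold that was used.

```python
def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical GRE scores and admission probabilities (invented).
gre = [300, 310, 320, 330, 340]
chance_of_admit = [0.45, 0.62, 0.80, 0.91, 0.97]

gre_scaled = min_max_scale(gre)

# Assumed cutoff: probabilities at or above 0.8 are labeled as admitted (1).
THRESHOLD = 0.8
admitted = [1 if p >= THRESHOLD else 0 for p in chance_of_admit]

print(gre_scaled)  # [0.0, 0.25, 0.5, 0.75, 1.0]
print(admitted)    # [0, 0, 1, 1, 1]
```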

Data Scientist Employee Attrition

The Employee Attrition Prediction project provides a well-rounded approach to predicting employee attrition, combining machine learning models with advanced optimization techniques.
I applied thorough Univariate and Bivariate Analysis to understand the relationships between key features. I handled missing values through imputation and applied Label Encoding on categorical columns to prepare the data for modeling.
I built a Sequential model using TensorFlow and optimized it through multiple techniques, including RandomizedSearchCV, GridSearchCV, and Keras Tuner.
Additionally, I applied SMOTE (Synthetic Minority Oversampling Technique) combined with Keras Tuner to address class imbalance and fine-tune the model's hyperparameters.
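
The interpolation idea behind SMOTE can be shown with a simplified, hand-rolled sketch in plain Python. It is not the implementation used in the project: a library version such as imbalanced-learn's `SMOTE` is the usual choice, and the sample points, `k`, and the helper name `smote` are all assumptions for illustration.

```python
import math
import random

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points between minority samples and their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        # Pick a random position on the segment from base to neighbour.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

# Hypothetical minority-class samples with two features (invented).
minority = [(1.0, 1.0), (1.5, 1.2), (1.2, 1.8), (2.0, 1.5)]
new_points = smote(minority, n_new=6)
print(new_points)
```

Each synthetic point lies on a line segment between two real minority samples, so the oversampled class stays inside the region the minority already occupies. Oversampling should be applied to the training split only, so that synthetic points never leak into validation or test data.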

BERT - Article Categorization Using Transformer Models

For the BERT-based Article Categorization project, I developed a predictive model to automate article categorization, enhancing the speed and personalization of content delivery by leveraging advanced machine learning techniques.
The project began with preprocessing, including univariate analysis to understand the distribution of categories, label encoding to convert categorical labels into numerical form, and tokenizing the text data using BertTokenizer.
These tokenized sequences were then fed into the TFBertForSequenceClassification model for category prediction. The Adam optimizer was employed to fine-tune the model, ensuring efficient learning.
TensorFlow and Hugging Face’s transformers library were used for implementation, while scikit-learn was utilized for performance evaluation through metrics like accuracy, precision, and recall.

COVID-19 Detection Using Chest X-Rays with CNN and Pre-trained VGG16 Model

For the COVID-19 Chest X-Ray Classification project, I developed a Convolutional Neural Network (CNN) model using TensorFlow to classify chest X-ray images into three categories: COVID-19, Viral Pneumonia, and Normal.
The project involved preprocessing a dataset of X-ray images, converting them into NumPy arrays, with steps including resizing, normalizing pixel values, and using data augmentation to enhance model generalization.
I built the CNN using both TensorFlow's Sequential API and a pre-trained VGG16 model for improved feature extraction. The Adam optimizer was employed to fine-tune the model, ensuring efficient learning.
Performance evaluation was conducted using confusion matrices and metrics like precision, recall, and accuracy to assess the model’s robustness.
This project demonstrates how deep learning and transfer learning can be applied in medical imaging to help diagnose COVID-19 from X-ray images.
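
To make the evaluation step concrete, here is a plain-Python sketch that builds a three-class confusion matrix and derives per-class precision and recall from it. The labels are invented and do not reflect the project's results.

```python
CLASSES = ["COVID-19", "Viral Pneumonia", "Normal"]

# Hypothetical true and predicted labels (invented for illustration).
y_true = ["COVID-19", "COVID-19", "COVID-19", "Viral Pneumonia",
          "Viral Pneumonia", "Normal", "Normal", "Normal"]
y_pred = ["COVID-19", "COVID-19", "Viral Pneumonia", "Viral Pneumonia",
          "Normal", "Normal", "Normal", "COVID-19"]

index = {name: i for i, name in enumerate(CLASSES)}

# matrix[i][j] counts samples whose true class is i and predicted class is j.
matrix = [[0] * len(CLASSES) for _ in CLASSES]
for t, p in zip(y_true, y_pred):
    matrix[index[t]][index[p]] += 1

def precision(k):
    """Share of predictions for class k that were correct (column-wise)."""
    predicted_k = sum(row[k] for row in matrix)
    return matrix[k][k] / predicted_k if predicted_k else 0.0

def recall(k):
    """Share of true class-k samples that were found (row-wise)."""
    actual_k = sum(matrix[k])
    return matrix[k][k] / actual_k if actual_k else 0.0

accuracy = sum(matrix[k][k] for k in range(len(CLASSES))) / len(y_true)
for name, k in index.items():
    print(f"{name}: precision={precision(k):.2f} recall={recall(k):.2f}")
print(f"accuracy={accuracy:.3f}")
```

In a diagnostic setting, recall for the COVID-19 class is usually the number to watch, since a missed case tends to be costlier than a false alarm.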

Automating Food Image Classification Using CNNs for Efficient Labeling

For the Food Image Classification project, I developed a Convolutional Neural Network (CNN) model using TensorFlow to classify food images into three categories: Bread, Soup, and Vegetables-Fruits.
The project aimed to show how effective CNNs can be in automating image classification for large volumes of images without the need for manual labeling.
The images were resized, normalized, and augmented during preprocessing to improve model performance.
I built the CNN using TensorFlow's Sequential API, incorporating layers like Conv2D, MaxPooling, and Dropout to enhance feature extraction and prevent overfitting.
The SGD and Adam optimizers were used for efficient learning, and model performance was evaluated using classification accuracy and confusion matrices.

CIFAR-10 Image Classification Using CNN and Transfer Learning with Pre-trained VGG16

For the CIFAR-10 Image Classification project, I developed a Convolutional Neural Network (CNN) model using TensorFlow to classify images into 10 categories.
Data preprocessing involved normalizing pixel values and augmenting the dataset to improve generalization.
I built the CNN using TensorFlow's Sequential API, incorporating layers like Conv2D, MaxPooling, BatchNormalization, and Dropout to optimize feature extraction and prevent overfitting.
In addition to building a CNN, I implemented transfer learning using a pre-trained VGG16 model for enhanced feature extraction.
The Adam optimizer was employed to fine-tune the model for efficient training, and model performance was evaluated using classification accuracy and confusion matrices.

5. Recommendation Systems

Movie Recommendation System Using Clustering, Content-Based Filtering, and TF-IDF

For the Movie Recommendation System project, I developed a cluster-based recommendation system and a content-based recommendation system, utilizing tokenization with the NLTK package and TF-IDF for feature extraction in the content-based approach.
The data was preprocessed by merging relevant information. I implemented collaborative filtering with scikit-surprise to predict user preferences based on similarities between users and items.
The cluster-based system used co-clustering to group similar users and movies. TF-IDF and NLTK tokenization were applied to enhance the content-based model by extracting relevant features from the text data.
Model performance was evaluated using precision, recall, and F1-score to ensure accurate and relevant recommendations.
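
The content-based approach can be sketched in plain Python: build TF-IDF vectors from text and recommend the item with the highest cosine similarity. The movie descriptions are invented, whitespace splitting stands in for NLTK tokenization, and the smoothed IDF formula is one common variant chosen for this example.

```python
import math
from collections import Counter

# Hypothetical movie descriptions (invented for illustration).
docs = {
    "Space Quest": "astronaut crew explores distant planet",
    "Star Voyage": "crew of astronaut pilots explores deep space",
    "Baking Days": "pastry chef opens small bakery",
}

# Whitespace tokenization is a stand-in for a proper tokenizer.
tokens = {title: text.lower().split() for title, text in docs.items()}
vocab = sorted({w for words in tokens.values() for w in words})
n_docs = len(tokens)

# Document frequency: in how many descriptions each word appears.
doc_freq = {w: sum(w in words for words in tokens.values()) for w in vocab}

def tfidf(words):
    """Term frequency times a smoothed inverse document frequency."""
    counts = Counter(words)
    return [
        counts[w] / len(words)
        * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1)
        for w in vocab
    ]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

vectors = {title: tfidf(words) for title, words in tokens.items()}

def most_similar(title):
    """Return the other title whose TF-IDF vector is closest by cosine."""
    scored = [
        (cosine(vectors[title], vec), other)
        for other, vec in vectors.items()
        if other != title
    ]
    return max(scored)[1]

print(most_similar("Space Quest"))  # prints Star Voyage
```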

Book Recommendation System with Collaborative Filtering and Matrix Factorization

I developed a comprehensive AI-based book recommendation system using multiple techniques to provide personalized book suggestions.
I implemented a rank-based recommendation system along with user-based and item-based collaborative filtering using cosine similarity and KNN. The collaborative filtering models were built with the Surprise package and tuned with GridSearchCV.
Additionally, I built a matrix factorization model using SVD and further fine-tuned it to enhance the accuracy of predictions.
The models were evaluated using metrics such as precision, recall, RMSE, and F1-score to ensure the relevance and accuracy of the recommendations.
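
User-based collaborative filtering with cosine similarity can be illustrated with a short plain-Python sketch. The ratings are invented, and the prediction rule (a similarity-weighted average of other users' ratings) is written in the spirit of a basic KNN recommender rather than copied from the project or from the Surprise package.

```python
import math

# Hypothetical user -> {book: rating} data (invented for illustration).
ratings = {
    "ana": {"Dune": 5, "Emma": 1, "Ulysses": 4},
    "ben": {"Dune": 4, "Emma": 2, "Ulysses": 5, "Hamlet": 5},
    "cara": {"Dune": 1, "Emma": 5, "Hamlet": 2},
}

def cosine(u, v):
    """Cosine similarity computed over the books both users have rated."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][b] * ratings[v][b] for b in common)
    norm_u = math.sqrt(sum(ratings[u][b] ** 2 for b in common))
    norm_v = math.sqrt(sum(ratings[v][b] ** 2 for b in common))
    return dot / (norm_u * norm_v)

def predict(user, book):
    """Similarity-weighted average of other users' ratings for the book."""
    pairs = [
        (cosine(user, other), user_ratings[book])
        for other, user_ratings in ratings.items()
        if other != user and book in user_ratings
    ]
    total = sum(sim for sim, _ in pairs)
    if not total:
        return None  # no similar user has rated this book
    return sum(sim * rating for sim, rating in pairs) / total

prediction = predict("ana", "Hamlet")
print(round(prediction, 2))
```

Since ana's tastes are much closer to ben's than to cara's, the predicted rating for Hamlet lands nearer ben's 5 than cara's 2.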

Music Recommendation System with Collaborative Filtering, Clustering, and Content-Based Models

I implemented multiple recommendation techniques to enhance personalization and improve the accuracy of song suggestions. These included a rank-based recommendation system and user-user and item-item collaborative filtering using cosine similarity and KNNBasic.
The Surprise package was used for the collaborative filtering models, which I fine-tuned using GridSearchCV. I also developed a model-based collaborative filtering system using matrix factorization, leveraging latent features to recommend songs based on past user behavior, and tuned the model for optimal performance.
Additionally, I built a cluster-based recommendation system using co-clustering to group users with similar listening habits and recommend songs based on play counts within each cluster, optimized through hyperparameter tuning.
A content-based recommendation system was also created, where I processed text data using the NLTK package and employed CountVectorizer and TfidfVectorizer for feature extraction.
The models were evaluated using metrics such as RMSE, precision, recall, and F1-score to assess their accuracy and effectiveness in providing relevant recommendations.