Project Overview

Project Type: Customer Segmentation using Unsupervised Learning.

Role: Data Analyst / Machine Learning Engineer.

Project Outcome: Developed an unsupervised learning model to segment customers based on their spending habits, demographics, and engagement in marketing campaigns. This segmentation is crucial for optimizing marketing strategies and increasing return on investment (ROI).

Methods:

1. Unsupervised Learning techniques: K-means clustering, Hierarchical Clustering, DBSCAN, K-Medoids, t-SNE, and PCA (Dimensionality Reduction).

2. Data preprocessing: Handling missing values, scaling, and encoding categorical variables.

3. Model evaluation: Used Elbow Method, Silhouette Scores, and DBSCAN's core points analysis to determine optimal clustering.

Deliverables:

1. Segmentation of customers into distinct groups based on spending and engagement patterns.

2. Recommendations for personalized marketing strategies for each customer segment.

3. Detailed visualizations (t-SNE plots, cluster analysis) and key insights.

Tools: Python, Pandas, NumPy, Scikit-learn, Seaborn, Matplotlib, t-SNE, PCA, K-means, DBSCAN, Agglomerative Hierarchical Clustering, K-Medoids, and Google Colab.

Context

Customer Segmentation:

Customer segmentation divides customers into groups with similar characteristics for better-targeted marketing strategies. Using a large dataset from a marketing campaign, this project aimed to create effective customer segments to improve personalization in email marketing, boost campaign success, and ultimately drive higher revenue.

Objective

To apply unsupervised learning techniques, including K-means clustering, K-Medoids, Hierarchical clustering, and DBSCAN, together with dimensionality reduction (PCA and t-SNE), to analyze customer profiles and their responses to marketing campaigns. The goal is to understand the different customer segments and provide insights for optimizing marketing efforts.

 Key Research Questions

  • How can we segment customers based on spending habits, demographic information, and marketing engagement?

  • What are the main characteristics of each customer segment, and how can we use them to create personalized marketing campaigns?

  • How does clustering help improve the efficiency of targeted marketing?

PHASES OF PROJECT

This Customer Segmentation using Unsupervised Learning project progressed through six phases: Data Discovery, Data Preparation, Model Planning, Model Building, Communicating Results, and Operationalizing the system for real-world use. Each phase was crucial in developing and optimizing the segmentation model.

Data Discovery

Goal:

Identify and gather relevant data to build the customer segmentation model.

Data Preparation

Goal:

Clean and prepare the data for further analysis.

Model Planning

Goal:

Decide on the most appropriate algorithms for segmenting customers.

Model Building

Goal:

Develop and train the clustering models.

Communicating Results

Goal:

Present the model’s performance and insights to stakeholders.

Operationalize

Goal:

Deploy the model in a real-world environment and monitor its performance.

Phase-1

1 / Datasets:

Loaded the dataset containing 27 features, including demographic, spending, and marketing response data.

2 / Initial Data exploration
  • Missing values were minimal and found primarily in the "Income" column, where about 1.07% of entries were missing.

  • The dataset contained a mix of numerical and categorical variables. Columns such as ID were removed because they are unique identifiers with no predictive value.
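The initial exploration described above can be sketched roughly as follows. This is a minimal illustration only: the small DataFrame is a made-up stand-in for the real 27-feature dataset, and the Education column and all values are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the marketing dataset (values are illustrative only).
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Income": [58000.0, np.nan, 71000.0, 26000.0],
    "Education": ["Graduation", "PhD", "Graduation", "Master"],
})

# Percentage of missing entries per column, largest first.
missing_pct = df.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct.round(2))

# ID is a unique identifier, so it is dropped before modeling.
features = df.drop(columns=["ID"])
print(features.dtypes)
```

On the real data, the same two steps would surface the Income column as the main source of missing values and separate numerical from categorical columns via `dtypes`.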

Data Discovery

Key Insights

  • Missing Data: Only a very small proportion of the data was missing, indicating that the dataset is largely complete.

Challenges

  • Skewed Distributions: Some variables might show skewed distributions, posing a challenge for modeling.

  • Outliers: Certain variables, such as Income, contained potential outliers (for example, very high-income individuals), necessitating careful consideration during model building.

Data Preparation

Phase-2

1 / Process:
  • Dropped irrelevant columns like 'ID' and handled missing values through imputation.

  • Conducted univariate and bivariate analysis on key features such as Income, Age, and Marital Status.

  • Performed feature scaling using StandardScaler to standardize the spending variables.

  • Applied Label Encoding to categorical variables such as Education and Marital Status.

  • Feature Engineering: New variables such as Total Expenses and Age were created to enhance the dataset.

  • Imputing Missing Values: The missing values for the Income variable were imputed using the median, ensuring that the missing data didn't skew the results.
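The preparation steps listed above might look something like the sketch below. It is not the project's notebook code: the toy rows, the column names Year_Birth, MntWines and MntMeatProducts, and the reference year used to compute Age are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy stand-in; column names and values are assumed for this example.
df = pd.DataFrame({
    "Income": [58000.0, np.nan, 71000.0, 26000.0, 640000.0],
    "Year_Birth": [1957, 1984, 1965, 1990, 1977],
    "Education": ["Graduation", "PhD", "Graduation", "Master", "PhD"],
    "Marital_Status": ["Married", "Single", "Married", "Together", "Single"],
    "MntWines": [635, 11, 426, 11, 173],
    "MntMeatProducts": [546, 6, 127, 20, 118],
})

# Median imputation: the median is less sensitive to the extreme income.
df["Income"] = df["Income"].fillna(df["Income"].median())

# Feature engineering: Age and Total Expenses (reference year is assumed).
df["Age"] = 2024 - df["Year_Birth"]
df["Total_Expenses"] = df[["MntWines", "MntMeatProducts"]].sum(axis=1)

# Label-encode the categorical columns as integer codes.
for col in ["Education", "Marital_Status"]:
    df[col] = LabelEncoder().fit_transform(df[col])

# Standardize numeric columns to zero mean and unit variance.
num_cols = ["Income", "Age", "MntWines", "MntMeatProducts", "Total_Expenses"]
scaled = pd.DataFrame(
    StandardScaler().fit_transform(df[num_cols]),
    columns=num_cols,
    index=df.index,
)
print(scaled.round(2))
```

Label Encoding assigns arbitrary integer codes, so distance-based clustering will treat those codes as ordered; that is a known trade-off of this encoding choice.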

Key Insights

  • Income Skewness: The Income variable showed a right-skewed distribution with outliers, making the median a better measure for imputation and analysis.

  • Customer Demographics: Most customers are married and have graduation-level education, forming a key demographic for targeted marketing.

  • Spending Patterns: High-income customers spend more on premium products (wine, meat, gold), while families with small children prefer online shopping and are less inclined toward high-end items.

  • Feature Engineering: New features like Total Expenses and Age were created to better capture spending behaviors and improve model performance.

Challenges

  • Skewed Distributions and Outliers: Handling skewed variables like Income and managing outliers without distorting the data was critical for accurate segmentation.

  • Feature Selection and Missing Data: Imputing missing values and selecting the most relevant features, such as Total Expenses and Age, required balancing to avoid overfitting while enhancing model performance.

Phase-3

1 / Process:
  • Applied Dimensionality Reduction (PCA and t-SNE) to reduce the dataset's complexity and visualize clusters effectively while retaining key variance across features.

  • Decided to use K-means Clustering, DBSCAN, Hierarchical Clustering, and K-Medoids to evaluate different segmentation techniques.

  • DBSCAN was specifically chosen to handle noise and identify clusters that are not necessarily spherical, unlike K-means.
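As a rough sketch of the dimensionality-reduction step, PCA can be asked to keep just enough components to retain a chosen share of the variance. The 90% threshold and the synthetic data below are assumptions for illustration; the source does not state which threshold was used.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the scaled customer features (200 rows, 10 columns).
X = StandardScaler().fit_transform(rng.normal(size=(200, 10)))

# A float in (0, 1) keeps the fewest components whose cumulative
# explained variance reaches that fraction (assumed threshold: 0.90).
pca = PCA(n_components=0.90, svd_solver="full")
X_pca = pca.fit_transform(X)

retained = pca.explained_variance_ratio_.sum()
print("components kept:", pca.n_components_)
print("variance retained:", round(float(retained), 3))
```

Checking `explained_variance_ratio_` before clustering is one way to make the trade-off between fewer dimensions and information loss explicit.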

Model Planning

Key Insights

  • PCA and t-SNE allowed for effective visualization and interpretation of high-dimensional data, helping to confirm the separability of the clusters.

  • DBSCAN offered flexibility in identifying irregular clusters and proved useful in detecting outliers.

Challenges

  • Parameter Tuning: Finding the right parameters, especially for DBSCAN (e.g., epsilon and minimum samples), required several iterations to avoid misclassifying noise or small clusters.

  • Data Complexity: Balancing dimensionality reduction while maintaining the variance of key features was difficult, as reducing too much could lead to information loss, affecting clustering performance.
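One common heuristic for narrowing down DBSCAN's epsilon is the k-distance curve: sort every point's distance to its k-th nearest neighbor and look for the bend. The sketch below uses synthetic blobs, an assumed `min_samples` of 5, and a percentile as a crude stand-in for reading the bend off a plot; none of these values come from the project.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled customer features.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
X = StandardScaler().fit_transform(X)

min_samples = 5  # assumed value, not taken from the project

# Querying the training data returns each point as its own first
# neighbour, which matches DBSCAN counting the point itself.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)

# Sorted distance to the min_samples-th neighbour for every point.
k_dist = np.sort(distances[:, -1])

# Crude candidate for eps; in practice the bend is read from a plot.
eps_candidate = float(np.percentile(k_dist, 90))
print("candidate eps:", round(eps_candidate, 3))
```

Plotting `k_dist` and trying a few values around the bend is usually more reliable than any single percentile.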

Model Building

Phase-4

1 / K-means Clustering:

Employed the Elbow Method, plotting the sum of squared distances from each point to its assigned cluster centroid against the number of clusters, to choose a candidate number of clusters. Silhouette Scores were used to assess how well each point fits its own cluster compared with neighboring clusters; higher scores indicated better-defined clusters.

2 / Agglomerative Hierarchical Clustering:

Applied Ward’s linkage method to minimize variance within clusters and used dendrograms to visualize how clusters merged, providing insights into the relationships between data points.

3 / K-Medoids Clustering:

Selected medoids (actual data points) instead of centroids, which supports non-Euclidean distances and makes it a robust alternative to K-means when the data contain outliers.

4 / DBSCAN:

Applied DBSCAN for density-based clustering to detect irregular-shaped clusters and outliers, especially useful in identifying noise in customer behavior.

5 / t-SNE:

t-SNE was used to visualize and interpret the clustering results, providing insights into the separation between customer groups.
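The K-means model selection described in this phase (Elbow Method plus Silhouette Scores) can be sketched as follows. The synthetic blobs, the range of k, and the random seeds are assumptions for illustration, not the project's actual settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled customer features.
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.8, random_state=42)
X = StandardScaler().fit_transform(X)

inertias, silhouettes = {}, {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    # inertia_ is the sum of squared distances to the assigned centroids,
    # i.e. the quantity plotted against k in the Elbow Method.
    inertias[k] = km.inertia_
    silhouettes[k] = silhouette_score(X, km.labels_)

# k with the highest mean silhouette coefficient.
best_k = max(silhouettes, key=silhouettes.get)
for k in inertias:
    print(k, round(inertias[k], 1), round(silhouettes[k], 3))
print("best k by silhouette:", best_k)
```

The elbow and the silhouette maximum do not always agree, so it helps to read both alongside the business interpretability of the resulting segments.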

Key Insights

  • K-means: Produced well-defined, spherical clusters that were easy to interpret, making it a strong candidate for segmenting customer behavior.

  • DBSCAN: Added value by detecting irregularly shaped clusters and identifying outliers, which provided a more nuanced view of customer patterns. However, tuning DBSCAN's parameters required careful experimentation to prevent meaningful points from being classified as noise.

  • Hierarchical Clustering: Dendrograms offered additional insights into cluster relationships, especially when trying to understand how different groups of customers were related.

  • K-Medoids: Was effective in handling outliers and non-Euclidean distances, but did not outperform K-means or DBSCAN in terms of cluster definition.

  • t-SNE: Enabled visualization of high-dimensional data and helped confirm the separability of customer clusters, validating the effectiveness of the segmentation approach.
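A side-by-side comparison like the one summarized above could be run along these lines. This is a hedged sketch on synthetic data: the cluster count of 4 and the DBSCAN settings (`eps=0.3`, `min_samples=5`) are assumed values, and K-Medoids is left out because it is not part of core scikit-learn.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled customer features.
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.8, random_state=42)
X = StandardScaler().fit_transform(X)

labels = {
    "kmeans": KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X),
    "ward": AgglomerativeClustering(n_clusters=4, linkage="ward").fit_predict(X),
    "dbscan": DBSCAN(eps=0.3, min_samples=5).fit_predict(X),
}

results = {}
for name, lab in labels.items():
    mask = lab != -1                      # DBSCAN marks noise points as -1
    n_clusters = len(set(lab[mask]))
    # Silhouette needs at least two clusters; noise points are excluded.
    score = (
        silhouette_score(X[mask], lab[mask]) if n_clusters >= 2 else float("nan")
    )
    results[name] = {
        "clusters": n_clusters,
        "noise": int(np.sum(~mask)),
        "silhouette": float(score),
    }
    print(name, results[name])
```

Excluding noise points before scoring tends to flatter DBSCAN, so the noise count should be reported next to its silhouette value.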

Challenges

  • Algorithm Selection: Choosing between K-means, DBSCAN, and other methods required balancing cluster quality with business relevance, as each algorithm had its own strengths.

  • Parameter Tuning: Fine-tuning key parameters like epsilon (ε) for DBSCAN and the number of clusters for K-means required several iterations to avoid misclassification and poorly defined clusters.

  • Handling Noise: DBSCAN was effective for outlier detection but sometimes classified important data points as noise, making it challenging to find the right balance.

  • Model Visualization: While t-SNE helped visualize clusters, combining it with other techniques like PCA was necessary for a clearer understanding of both local and global data structures.
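One way to combine the two techniques mentioned in the last point is to run PCA first and feed its output to t-SNE. The sketch below assumes 10 principal components and a perplexity of 30 on synthetic data; these are illustrative choices, not the project's settings.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 300 customers described by 15 scaled features.
X, _ = make_blobs(n_samples=300, centers=4, n_features=15, random_state=0)
X = StandardScaler().fit_transform(X)

# PCA keeps the dominant global directions and removes some noise.
X_pca = PCA(n_components=10, random_state=0).fit_transform(X)

# t-SNE then builds a 2-D embedding that preserves local neighbourhoods.
X_2d = TSNE(
    n_components=2, perplexity=30, init="pca", random_state=0
).fit_transform(X_pca)

print(X_2d.shape)
```

The resulting two columns can be scatter-plotted and colored by cluster label; distances between well-separated groups in a t-SNE plot should not be read as meaningful.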

Phase-5

1 / Process:
  • Segment Profiling: Grouped customers into four distinct segments based on demographics (income, family size) and spending habits (wine, meat, and gold purchases).

  • Visualizing Results: Used t-SNE plots, Silhouette scores, and cluster distribution plots to visually demonstrate the distinctiveness of the clusters.

  • Stakeholder Communication: Presented insights using simple visualizations and summaries to ensure non-technical stakeholders could easily understand the segmentation results and apply them to marketing strategies.
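Segment profiling of the kind described above usually comes down to a grouped summary table. In this sketch the eight customers, the column names, and the segment labels are all invented for illustration; only the idea of four segments comes from the project.

```python
import pandas as pd

# Toy customers with an already-assigned segment label (illustrative only).
customers = pd.DataFrame({
    "Income": [82000, 79000, 31000, 28000, 55000, 60000, 45000, 47000],
    "Family_Size": [2, 1, 4, 3, 3, 2, 5, 4],
    "MntWines": [900, 750, 20, 35, 300, 280, 120, 100],
    "MntMeatProducts": [600, 540, 15, 25, 150, 170, 60, 80],
    "segment": [0, 0, 1, 1, 2, 2, 3, 3],
})

# One row per segment: size, typical income, family size and spending.
profile = customers.groupby("segment").agg(
    n_customers=("Income", "size"),
    median_income=("Income", "median"),
    mean_family_size=("Family_Size", "mean"),
    mean_wine_spend=("MntWines", "mean"),
    mean_meat_spend=("MntMeatProducts", "mean"),
)
print(profile)
```

A table like this is often easier for non-technical stakeholders to act on than the cluster plots themselves.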

Communicating Results

Key Insights

  • Segmentation identified four distinct customer groups based on demographics and spending behavior.

  • High-income, high-spending customers were the most responsive to marketing campaigns, particularly for wine products.

  • Lower-income customers responded better to discounts and promotions.

Challenges

  • Presenting complex model results in a digestible way for non-technical stakeholders required thoughtful use of visual aids like Silhouette scores, cluster distribution, and t-SNE plots to effectively communicate the results.

Phase-6

1 / Business Recommendation:
  • The segmented customer groups can be used to tailor marketing strategies. For instance, high-spending customers would receive premium product promotions, while discount-based campaigns would target lower-income segments.

Operationalization

Key Insights

  • Tailoring marketing efforts based on the segment insights is expected to improve engagement, raise click-through rates, and make better use of campaign resources; these outcomes were not measured in a live deployment.

Challenges

  • Rolling out these strategies across marketing channels while continuing to validate the segmentation results over time remains the main challenge.
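Ongoing validation could, for example, mean assigning each new batch of customers to the existing segments and tracking a quality score over time. The sketch below is one possible setup under stated assumptions: synthetic data, a K-means model with four clusters, and the silhouette score as the monitored metric.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: the first 400 rows are "historical" customers,
# the remaining 200 play the role of a later batch.
X_all, _ = make_blobs(n_samples=600, centers=4, random_state=1)
X_train, X_new = X_all[:400], X_all[400:]

# Scaling and clustering are fitted together so new data is
# transformed exactly as the training data was.
model = make_pipeline(
    StandardScaler(), KMeans(n_clusters=4, n_init=10, random_state=1)
)
model.fit(X_train)

# Assign the new batch to the existing segments.
new_segments = model.predict(X_new)

# Score the new batch in the same scaled space the model was fitted in.
X_new_scaled = model.named_steps["standardscaler"].transform(X_new)
new_score = silhouette_score(X_new_scaled, new_segments)
print("silhouette on new batch:", round(float(new_score), 3))
```

A sustained drop in this score on later batches would be a signal to re-examine or refit the segmentation.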

Reflection

This project was a fun dive into the world of customer segmentation. I got to play around with tools like K-means, DBSCAN, and Hierarchical Clustering to find just the right way to group customers, and t-SNE really helped bring those clusters to life with cool visualizations. Throw in some PCA for dimensionality reduction and Silhouette Scores to check how well the clusters held up, and it felt like a mix of detective work and creative problem-solving. While I didn’t get to see these strategies play out in real life, it was a great reminder of how machine learning can take raw data and turn it into something super practical—and a little bit of fun too!

© 2024 by Alvita Yathati. Powered and secured by Wix