Project Overview

Project Type: Customer Segmentation using Unsupervised Learning.

Role: Data Analyst / Machine Learning Engineer.

Project Outcome

Developed an unsupervised learning model to segment customers based on their spending habits, demographics, and engagement in marketing campaigns. This segmentation is crucial for optimizing marketing strategies and increasing return on investment (ROI).

Methods

1. Unsupervised Learning techniques: K-means clustering, Hierarchical Clustering, DBSCAN, K-Medoids, t-SNE, and PCA (Dimensionality Reduction).

2. Data preprocessing: Handling missing values, scaling, and encoding categorical variables.

3. Model evaluation: Used Elbow Method, Silhouette Scores, and DBSCAN's core points analysis to determine optimal clustering.

Deliverables

1. Segmentation of customers into distinct groups based on spending and engagement patterns.

2. Recommendations for personalized marketing strategies for each customer segment.

3. Detailed visualizations (t-SNE plots, cluster analysis) and key insights.

Tools

Python, Pandas, NumPy, Scikit-learn, Seaborn, Matplotlib, t-SNE, PCA, K-means, DBSCAN, Agglomerative Hierarchical Clustering, K-Medoids, and Google Colab.

Context

Customer Segmentation:

Customer segmentation divides customers into groups with similar characteristics for better-targeted marketing strategies. Using a large dataset from a marketing campaign, this project aimed to create effective customer segments to improve personalization in email marketing, boost campaign success, and ultimately drive higher revenue.

Objective

The objective was to apply Unsupervised Learning techniques (K-means clustering, K-Medoids, Hierarchical clustering, DBSCAN, PCA for dimensionality reduction, and t-SNE) to analyze customer profiles and their responses to marketing campaigns. The goal is to understand different customer segments and provide insights for optimizing marketing efforts.

Key Research Questions

How can we segment customers based on spending habits, demographic information, and marketing engagement?
What are the main characteristics of each customer segment, and how can we use them to create personalized marketing campaigns?
How does clustering help improve the efficiency of targeted marketing?

PHASES OF PROJECT

This Customer Segmentation using Unsupervised Learning project progressed through six phases: Data Discovery, Data Preparation, Model Planning, Model Building, Communicating Results, and Operationalizing the system for real-world use. Each phase was crucial in developing and refining the segmentation model.

Data Discovery

Goal:

Identify and gather relevant data to build the customer segmentation model.

Data Preparation

Goal:

Clean and prepare the data for further analysis.

Model Planning

Goal:

Decide on the most appropriate algorithms to build the segmentation model.

Model Building

Goal:

Develop and fit the segmentation model.

Communicating Results

Goal:

Present the model’s performance and insights to stakeholders.

Operationalize

Goal:

Deploy the model in a real-world environment and monitor its performance.

Phase-1

1 / Datasets:

Loaded the dataset containing 27 features, including demographic, spending, and marketing response data.

2 / Initial Data exploration

A check for missing values found very few. They were concentrated in the "Income" column, where about 1.07% of the values were missing.
The dataset contained a mix of numerical and categorical variables. Identifier columns such as ID were removed because they are unique per customer and carry no predictive value.
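The missing-value check and the identifier test described above can be sketched roughly as follows. This is a minimal illustration on a tiny made-up frame, not the project's actual notebook; the column names and values are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the campaign data (names and values are assumptions).
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Income": [58000.0, np.nan, 71000.0, 26000.0, 43000.0],
    "Education": ["Graduation", "PhD", "Graduation", "Master", "PhD"],
})

# Percentage of missing values per column, largest first.
missing_pct = df.isna().mean().mul(100).sort_values(ascending=False)

# A column that is unique for every row behaves like an identifier,
# so it says nothing about how similar two customers are.
is_identifier = df["ID"].is_unique
features = df.drop(columns=["ID"]) if is_identifier else df

print(missing_pct)
print(list(features.columns))
```

On the real data the same `isna().mean()` call would be the place where a figure like the 1.07% for Income shows up.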

Data Discovery

Phase1

Key Insights

Missing Data: Only a very small proportion of the data was missing (about 1.07% of Income values), so the dataset was largely complete.

Challenges

Skewed Distributions: Some variables might show skewed distributions, posing a challenge for modeling.
Outliers: Certain variables, such as Income, contained potential outliers (a few very high earners), necessitating careful consideration during model building.

Data Preparation

Histogram for the feature 'Income' to understand the distribution and outliers

The histogram shows some extreme values on the right side of the 'Income' distribution. A box plot is better suited to identifying extreme values, so it is used next.

Box plot for the feature 'Income' to understand the distribution and outliers

As both the histogram and the box plot show, there are outliers at the right end of the distribution, and the histogram is right-skewed. The star in the box plot marks the mean income, which lies very close to the median (the second quartile). The box plot shows an interquartile range from 35,303 to 68,522, which contains the middle 50% of the Income values.
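The box plot reading above can also be reproduced numerically. The sketch below applies the common 1.5 × IQR whisker convention to a handful of made-up income values; the numbers are hypothetical and the rule is an assumption about how the whiskers were drawn, not something stated in the project.

```python
import numpy as np

# Hypothetical income values; the last one plays the role of an extreme earner.
income = np.array([20000, 35000, 42000, 51000, 60000, 68000, 75000, 666666], dtype=float)

# Quartiles and interquartile range (the "box" of the box plot).
q1, q3 = np.percentile(income, [25, 75])
iqr = q3 - q1

# Whisker limits under the 1.5 * IQR convention.
upper_fence = q3 + 1.5 * iqr
lower_fence = q1 - 1.5 * iqr

# Anything outside the fences is flagged as a potential outlier.
outliers = income[(income > upper_fence) | (income < lower_fence)]
print(q1, q3, iqr, outliers)
```

Flagging a point this way does not decide what to do with it; whether to cap, drop, or keep such customers remains a modeling choice.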

2D correlation matrix between numerical features

Income and Spending Patterns: Higher income is strongly correlated with increased spending on premium products, while lower-income customers tend to rely on discounts and make fewer high-end purchases.
Family Size and Purchases: Households with small children tend to spend less on premium items and make more online purchases, with fewer catalog or in-store purchases.
Website vs. Catalog Purchases: Customers who make more catalog or in-store purchases tend to visit the company’s website less often.

Phase-2

1 / Process:

Dropped irrelevant columns like 'ID' and handled missing values through imputation.
Conducted univariate and bivariate analysis on key features such as Income, Age, and Marital Status.
Performed feature scaling using StandardScaler to standardize the spending variables (zero mean, unit variance).
Applied Label Encoding for categorical variables like Education and Marital Status.
Feature Engineering: Several new variables were created to enhance the dataset.
Imputing Missing Values: The missing values for the Income variable were imputed using the median, ensuring that the missing data didn't skew the results.
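The preparation steps listed above could look roughly like this in pandas and scikit-learn. It is a sketch under stated assumptions: the column names, the sample rows, and the reference year used to derive Age are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical sample; column names and values are assumptions for illustration.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Year_Birth": [1970, 1985, 1992, 1960],
    "Income": [58000.0, np.nan, 26000.0, 90000.0],
    "Education": ["Graduation", "PhD", "Master", "Graduation"],
    "MntWines": [635, 11, 426, 980],
    "MntMeatProducts": [546, 6, 127, 700],
})

# Drop the identifier and fill missing Income with the median,
# which is less sensitive to the right-skewed tail than the mean.
df = df.drop(columns=["ID"])
df["Income"] = df["Income"].fillna(df["Income"].median())

# Feature engineering: Age (reference year is an assumption) and Total_Expenses.
df["Age"] = 2024 - df["Year_Birth"]
df["Total_Expenses"] = df[["MntWines", "MntMeatProducts"]].sum(axis=1)

# Label-encode the categorical column (classes are numbered in sorted order).
df["Education"] = LabelEncoder().fit_transform(df["Education"])

# Standardize numeric features so no single scale dominates distance-based clustering.
numeric_cols = ["Income", "Age", "Total_Expenses", "MntWines", "MntMeatProducts"]
scaled = pd.DataFrame(
    StandardScaler().fit_transform(df[numeric_cols]),
    columns=numeric_cols,
    index=df.index,
)
print(scaled.round(2))
```

One design point worth keeping in mind: label encoding assigns integers in sorted order, which implies an ordering between categories that distance-based methods will take literally.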

Phase2

Key Insights

Income Skewness: The Income variable showed a right-skewed distribution with outliers, making the median a better measure for imputation and analysis.
Customer Demographics: Most customers are married and have graduation-level education, forming a key demographic for targeted marketing.
Spending Patterns: High-income customers spend more on premium products (wine, meat, gold), while families with small children prefer online shopping and are less inclined toward high-end items.
Feature Engineering: New features like Total Expenses and Age were created to better capture spending behaviors and improve model performance.

Challenges

Skewed Distributions and Outliers: Handling skewed variables like Income and managing outliers without distorting the data was critical for accurate segmentation.
Feature Selection and Missing Data: Imputing missing values and selecting the most relevant features, such as Total Expenses and Age, required balancing to avoid overfitting while enhancing model performance.

Phase-3

1 / Process:

Applied Dimensionality Reduction (PCA and t-SNE) to reduce the dataset's complexity and visualize clusters effectively while retaining key variance across features.
Decided to use K-means Clustering, DBSCAN, Hierarchical Clustering, and K-Medoids to evaluate different segmentation techniques.
DBSCAN was specifically chosen to handle noise and identify clusters that are not necessarily spherical, unlike K-means.
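To illustrate the "retaining key variance" idea in the planning step, here is a minimal PCA sketch on synthetic data. The 90% variance threshold and the generated features are assumptions chosen for the example, not values taken from the project.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 300 customers, 6 features built from 2 underlying factors.
latent = rng.normal(size=(300, 2))
X = np.hstack([
    latent[:, [0]] + 0.1 * rng.normal(size=(300, 3)),
    latent[:, [1]] + 0.1 * rng.normal(size=(300, 3)),
])
X_scaled = StandardScaler().fit_transform(X)

# A float n_components asks PCA to keep as many components as needed
# to explain at least that share of the variance (threshold is an assumption).
pca = PCA(n_components=0.90, svd_solver="full")
X_pca = pca.fit_transform(X_scaled)

cumulative = np.cumsum(pca.explained_variance_ratio_)
print(pca.n_components_, cumulative)
```

Inspecting `cumulative` is a simple way to see how much information each extra component adds before committing to a dimensionality.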

Model Planning

Phase3

Key Insights

PCA and t-SNE allowed for effective visualization and interpretation of high-dimensional data, helping to confirm the separability of the clusters.
DBSCAN offered flexibility in identifying irregular clusters and proved useful in detecting outliers.

Challenges

Parameter Tuning: Finding the right parameters, especially for DBSCAN (e.g., epsilon and minimum samples), required several iterations to avoid misclassifying noise or small clusters.
Data Complexity: Balancing dimensionality reduction while maintaining the variance of key features was difficult, as reducing too much could lead to information loss, affecting clustering performance.
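One common heuristic for narrowing down DBSCAN's epsilon is the sorted k-distance curve. The sketch below computes it on synthetic points; the data, `min_samples = 5`, and the percentile shortcut at the end are assumptions for illustration, and the project may have tuned its parameters differently.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)

# Synthetic stand-in: two dense groups plus a few scattered points.
X = np.vstack([
    rng.normal(0, 0.3, (100, 2)),
    rng.normal(5, 0.3, (100, 2)),
    rng.uniform(-3, 8, (10, 2)),
])

min_samples = 5  # assumed value; DBSCAN counts the point itself

# When querying the training data, the first neighbor is the point itself
# (distance 0), so the last column is the distance to the min_samples-th
# point of the neighborhood, matching DBSCAN's counting.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)
k_distances = np.sort(distances[:, -1])

# Plotting k_distances and reading off the "knee" gives an eps candidate.
# A high percentile is used here only as a crude stand-in for that visual step.
eps_candidate = float(np.percentile(k_distances, 95))
print(round(eps_candidate, 3))
```

The curve only suggests a starting range; several values around the knee usually still need to be tried.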

Model Building

Applying t-SNE to visualize the data in 2 dimensions

Applying PCA to visualize the data in 2 dimensions

Visualizing the DBSCAN clusters using PCA

Graphs

5 / t-SNE:

t-SNE was used to visualize and interpret the clustering results, providing insights into the separation between customer groups.
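A 2D t-SNE projection of the kind described here can be produced with scikit-learn as sketched below. The synthetic data, the perplexity, and the random seed are assumptions; the resulting coordinates are only meant for plotting, colored by cluster label.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-in: two groups of customers in an 8-dimensional feature space.
X = np.vstack([
    rng.normal(0, 1, (60, 8)),
    rng.normal(6, 1, (60, 8)),
])

# Perplexity and seed are assumed values; perplexity must be below the sample count.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
X_2d = tsne.fit_transform(X)

print(X_2d.shape)
```

Because t-SNE mainly preserves local neighborhoods, distances between far-apart groups in the plot should not be read as meaningful.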

4 / DBSCAN:

Applied DBSCAN for density-based clustering to detect irregular-shaped clusters and outliers, especially useful in identifying noise in customer behavior.
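The way DBSCAN separates dense groups from noise can be seen in a small sketch. The points and the `eps` and `min_samples` values are assumptions picked so the example is easy to follow; scikit-learn marks noise with the label -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)

# Synthetic stand-in: two dense groups and two isolated points far from both.
dense_a = rng.normal(0, 0.2, (80, 2))
dense_b = rng.normal(4, 0.2, (80, 2))
far_points = np.array([[10.0, 10.0], [-8.0, 9.0]])
X = np.vstack([dense_a, dense_b, far_points])

# eps and min_samples are assumed values for this toy data.
db = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = db.labels_

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))        # points labelled -1 are noise
n_core = len(db.core_sample_indices_)      # points with a dense neighborhood

print(n_clusters, n_noise, n_core)
```

Comparing `n_noise` across a few `eps` values is a quick check on whether meaningful customers are being discarded as noise.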

Phase-4

1 / K-means Clustering:

Employed the Elbow Method, plotting the sum of squared distances from each point to its assigned cluster centroid across a range of cluster counts. Silhouette Scores were used to assess how cohesive and well separated the clusters were; higher scores indicated better-defined clusters.
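A compact way to run both checks is to loop over candidate cluster counts and record inertia (for the elbow plot) and the silhouette score. The sketch below uses synthetic blobs with four known centers; the data, the range of k, and the seeds are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in with four well-separated groups (centers are assumptions).
centers = [[0, 0], [6, 0], [0, 6], [6, 6]]
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=0.6, random_state=7)

inertias, silhouettes = {}, {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=7).fit(X)
    inertias[k] = km.inertia_                         # within-cluster sum of squares
    silhouettes[k] = silhouette_score(X, km.labels_)  # cohesion vs. separation

# The elbow is read off a plot of inertias; the silhouette gives a direct ranking.
best_k = max(silhouettes, key=silhouettes.get)
print(best_k, round(silhouettes[best_k], 3))
```

Inertia always tends to fall as k grows, which is why it is read as an elbow rather than minimized, while the silhouette score can be compared across k directly.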

2 / Agglomerative Hierarchical Clustering:

Applied Ward’s linkage method to minimize variance within clusters and used dendrograms to visualize how clusters merged, providing insights into the relationships between data points.
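Ward linkage and the cut into flat clusters can be sketched with SciPy's hierarchy module, which is also what is commonly used to draw dendrograms. The use of SciPy, the synthetic points, and the choice of three clusters are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(3)

# Synthetic stand-in: three groups of 30 points along a diagonal.
X = np.vstack([rng.normal(c, 0.4, (30, 2)) for c in (0, 5, 10)])

# Ward linkage merges, at each step, the pair of clusters whose union
# gives the smallest increase in within-cluster variance.
Z = linkage(X, method="ward")

# Cut the tree into at most three flat clusters (count is an assumption).
labels = fcluster(Z, t=3, criterion="maxclust")

print(Z.shape, sorted(set(labels)))
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` draws the merge tree; large jumps in merge height are the usual visual cue for where to cut.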

3 / K-Medoids Clustering

Selected medoids (actual data points) instead of centroids as cluster centers. This allows non-Euclidean distances and makes the method less sensitive to outliers, offering a more robust alternative to K-means.
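The idea of using actual data points as centers can be shown with a small NumPy sketch. This is a simplified alternating variant (assign points, then re-pick each medoid) with Manhattan distance, written only for illustration; it is not the full PAM swap procedure and not the implementation used in the project, and the function name `k_medoids` is hypothetical.

```python
import numpy as np

def k_medoids(X, k, n_iter=100, seed=0):
    """Simplified alternating k-medoids with Manhattan distance (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Pairwise Manhattan distances; any dissimilarity matrix could be used instead.
    D = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest medoid.
        labels = np.argmin(D[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size == 0:
                continue
            # Update step: the member with the smallest total distance
            # to the other members becomes the new medoid.
            within = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(D[:, medoids], axis=1)
    return medoids, labels

rng = np.random.default_rng(5)
# Synthetic stand-in: two groups of 40 points each.
X = np.vstack([rng.normal(0, 0.5, (40, 2)), rng.normal(8, 0.5, (40, 2))])
medoids, labels = k_medoids(X, k=2)
print(X[medoids])
```

Because each center is a real customer record, a medoid can be shown to stakeholders as a concrete "typical member" of its segment. Packaged implementations exist as well, for example in scikit-learn-extra.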

Phase4

Key Insights

K-means: Produced well-defined, spherical clusters that were easy to interpret, making it a strong candidate for segmenting customer behavior.
DBSCAN: Added value by detecting irregularly shaped clusters and identifying outliers, which provided a more nuanced view of customer patterns. However, tuning DBSCAN's parameters required careful experimentation to prevent meaningful points from being classified as noise.
Hierarchical Clustering: Dendrograms offered additional insights into cluster relationships, especially when trying to understand how different groups of customers were related.
K-Medoids: Was effective in handling outliers and non-Euclidean distances, but did not outperform K-means or DBSCAN in terms of cluster definition.
t-SNE: Enabled visualization of high-dimensional data and helped confirm the separability of customer clusters, validating the effectiveness of the segmentation approach.

Challenges

Algorithm Selection: Choosing between K-means, DBSCAN, and other methods required balancing cluster quality with business relevance, as each algorithm had its own strengths.
Parameter Tuning: Fine-tuning key parameters like epsilon (ε) for DBSCAN and the number of clusters for K-means required several iterations to avoid misclassification and poorly defined clusters.
Handling Noise: DBSCAN was effective for outlier detection but sometimes classified important data points as noise, making it challenging to find the right balance.
Model Visualization: While t-SNE helped visualize clusters, combining it with other techniques like PCA was necessary for a clearer understanding of both local and global data structures.

Phase-5

1 / Process:

Segment Profiling: Grouped customers into four distinct segments based on demographics (income, family size) and spending habits (wine, meat, and gold purchases).
Visualizing Results: Used t-SNE plots, Silhouette scores, and cluster distribution plots to visually demonstrate the distinctiveness of the clusters.
Stakeholder Communication: Presented insights using simple visualizations and summaries to ensure non-technical stakeholders could easily understand the segmentation results and apply them to marketing strategies.
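Segment profiling of this kind usually comes down to a grouped summary per cluster label. The sketch below shows the pattern on made-up rows; the column names, values, and the four segment labels are hypothetical and do not reproduce the project's results.

```python
import pandas as pd

# Hypothetical customers with an assigned segment label (all values are made up).
profiled = pd.DataFrame({
    "segment": [0, 0, 1, 1, 2, 2, 3, 3],
    "Income": [82000, 78000, 30000, 34000, 55000, 59000, 41000, 45000],
    "Family_Size": [1, 2, 4, 3, 2, 3, 3, 4],
    "MntWines": [900, 850, 20, 40, 300, 340, 110, 90],
})

# One row per segment: size, typical income, family size, and wine spend.
profile = profiled.groupby("segment").agg(
    customers=("Income", "size"),
    median_income=("Income", "median"),
    mean_family_size=("Family_Size", "mean"),
    mean_wine_spend=("MntWines", "mean"),
)
print(profile)
```

A table like this is often the most direct artifact for non-technical readers, since each row can be given a plain-language segment name.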

Communicating Results

Phase5

Key Insights

Segmentation identified four distinct customer groups based on demographics and spending behavior.
High-income, high-spending customers were the most responsive to marketing campaigns, particularly for wine products.
Lower-income customers responded better to discounts and promotions.

Challenges

Presenting complex model results in a digestible way for non-technical stakeholders required thoughtful use of visual aids built on Silhouette scores, cluster distributions, and t-SNE plots.

Phase-6

1 / Business Recommendation:

The segmented customer groups can be used to tailor marketing strategies. For instance, high-spending customers could receive premium product promotions, while discount-based campaigns could target lower-income segments.

Operationalization

Phase6

Key Insights

Tailoring marketing efforts based on segmented insights is expected to improve engagement and click-through rates and to make better use of campaign resources.

Challenges

The main challenge is implementing these strategies across marketing channels while continuing to validate the segmentation results over time.

Reflection

This project was a fun dive into the world of customer segmentation. I got to play around with tools like K-means, DBSCAN, and Hierarchical Clustering to find just the right way to group customers, and t-SNE really helped bring those clusters to life with cool visualizations. Throw in some PCA for dimensionality reduction and Silhouette Scores to check how well the clusters held up, and it felt like a mix of detective work and creative problem-solving. While I didn’t get to see these strategies play out in real life, it was a great reminder of how machine learning can take raw data and turn it into something super practical—and a little bit of fun too!
