Project Overview
Project Type: Machine Learning Project on Used Cars Price Prediction
Role: Data Scientist / Machine Learning Engineer
Project Outcome
1. Developed a linear regression model to predict the price of used cars based on historical data.
2. Generated insights about the key factors affecting car prices, such as Year, Kilometers Driven, Fuel Type, and Engine Power.
3. Provided a pricing model that could assist Cars4U in determining the optimal price for used cars.
Methods
1. Data exploration and preprocessing: Dealing with missing values, outlier treatment, and feature engineering.
2. Linear regression modeling to predict car prices.
3. Model evaluation using performance metrics like RMSE, MAE, and R-squared.
Deliverables
1. A model capable of accurately predicting car prices.
2. Insights on the factors influencing pricing in the used car market, helping Cars4U in its pricing strategy.
Tools
Python, Pandas, NumPy, Scikit-learn, Seaborn, Matplotlib, and Google Colab.



Context
Used Cars Price Prediction:
With the rise in demand for used cars in India, especially as the new car market slows down, Cars4U aims to capitalize on this trend. Pricing in the used car market is tricky due to several factors like mileage, condition, brand, and more. The goal of this project is to develop a pricing model that can effectively predict the price of a used car based on various attributes, assisting Cars4U in making competitive pricing decisions.
Objective
To build a predictive model that can estimate the price of a used car based on factors such as Year, Kilometers Driven, Fuel Type, and Transmission. The model will help Cars4U in determining competitive and profitable pricing strategies.
Key Research Questions
-
What factors have the greatest impact on car prices?
-
How does mileage influence the price of a used car?
-
How can the linear regression model be improved for better prediction accuracy?
PHASES OF PROJECT
This Used Cars Price Prediction project progressed through six phases: Data Discovery, Data Preparation, Model Planning, Model Building, Communicating Results, and Operationalizing the system for real-world use. Each phase contributed to developing and refining the pricing model.
Phase-1
1 / Datasets:
The dataset contained features like Year, Kilometers Driven, Fuel Type, Transmission, Mileage, Engine, Power, and Price (the target variable).
2 / Initial Data exploration
-
Missing values were identified in key columns such as Price, New Price, Engine, and Power. They had to be addressed before model building to avoid skewing the results.
-
Several features such as Name, Mileage, Engine, and Power contained both textual and numerical components. These needed to be split and processed separately, converting the numerical portions for analysis.
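A quick way to see where values are missing is a per-column count and percentage. The sketch below uses a small hypothetical sample; the column names follow the ones mentioned above, but the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the real dataset has many more rows and columns.
df = pd.DataFrame({
    "Engine": ["1197 CC", None, "1498 CC", "998 CC"],
    "Power": ["82 bhp", "74 bhp", None, None],
    "New_Price": [None, None, "8.61 Lakh", None],
    "Price": [4.5, 3.2, np.nan, 2.1],
})

# Count and share of missing values per column, worst first.
missing = pd.DataFrame({
    "n_missing": df.isna().sum(),
    "pct_missing": (df.isna().mean() * 100).round(1),
}).sort_values("n_missing", ascending=False)

print(missing)
```

Sorting by the missing count puts the columns that need the most attention at the top of the report.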
Key Insights
-
Mileage, Engine, Power, and New_Price are stored as strings that contain numerical values, so the numbers must be extracted before further analysis. Seats: the minimum number of seats cannot be 0, so that value is likely a data-entry error and will be addressed later.
-
Location can provide valuable insights into regional differences in used car preferences, especially when visualized as a heatmap.
-
Identifying missing values at the start of the analysis helped plan for effective imputation techniques, ensuring that the final dataset used for model building would be complete and reliable. New Price and Price require special attention because their missing values could influence model performance if not handled appropriately.
-
A significant amount of data pre-processing is required before we can explore the dataset.
Challenges
-
The handling of strings with units (e.g., Mileage, Engine, and Power) requires careful preprocessing to ensure consistency, as different units may require conversion for comparison across all rows.
-
Missing values in critical fields like Price and Engine Power had to be addressed carefully to avoid introducing bias into the model.
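The unit-handling challenge can be sketched as follows. This is a minimal example, not the project's actual code: the helper `split_value_unit` and the sample values (including the units shown) are hypothetical. It separates the numeric part from the unit so that rows with different units can be spotted before any conversion is attempted.

```python
import pandas as pd

# Hypothetical sample values; the real columns may use other units.
df = pd.DataFrame({
    "Mileage": ["26.6 km/kg", "19.67 kmpl", None],
    "Engine": ["998 CC", "1582 CC", "1199 CC"],
    "Power": ["58.16 bhp", "null bhp", "88.7 bhp"],
})

def split_value_unit(series):
    """Split strings like '19.67 kmpl' into a float Series and a unit Series."""
    parts = series.str.extract(r"^\s*([\d.]+)\s*(.*)$")
    # Entries with no leading number (e.g. 'null bhp') become NaN.
    values = pd.to_numeric(parts[0], errors="coerce")
    return values, parts[1]

for col in ["Mileage", "Engine", "Power"]:
    df[col + "_num"], df[col + "_unit"] = split_value_unit(df[col])

# Listing the units per column shows whether a conversion step is needed.
print(df["Mileage_unit"].value_counts())
```

Keeping the unit in its own column makes it easy to check for mixed units before treating the numeric column as comparable across rows.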
Phase-2
1 / Feature Engineering:
-
Categorical variables such as Fuel Type and Transmission were one-hot encoded for use in the machine learning model. Additionally, numerical features like Mileage, Engine, and Power were split into numerical values and units, with the numerical parts retained for analysis.
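One-hot encoding of the categorical columns might look like the sketch below. The column names `Fuel_Type` and `Transmission` and the sample rows are assumptions, and `drop_first=True` is a choice made here for illustration (it drops one level per variable to avoid perfectly collinear dummies); the original analysis may have kept all levels.

```python
import pandas as pd

# Hypothetical sample rows.
df = pd.DataFrame({
    "Fuel_Type": ["Diesel", "Petrol", "CNG", "Petrol"],
    "Transmission": ["Manual", "Automatic", "Manual", "Manual"],
    "Year": [2015, 2012, 2018, 2016],
})

# One indicator column per category level, minus the first level of each variable.
encoded = pd.get_dummies(
    df,
    columns=["Fuel_Type", "Transmission"],
    drop_first=True,
    dtype=int,
)

print(encoded.columns.tolist())
```

The dropped level (here CNG and Automatic, the first in sorted order) becomes the baseline against which the remaining coefficients are interpreted.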
2 / Handling Missing values:
-
Missing values in critical columns like Price, Engine, and New Price were imputed using appropriate methods (mean or mode imputation) to ensure that the dataset remained complete for modeling.
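Mean and mode imputation can be sketched in a few lines of pandas. The sample values are hypothetical, and using Seats as the mode-imputed column is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps.
df = pd.DataFrame({
    "Engine": [1197.0, np.nan, 1498.0, 998.0],
    "Seats": [5.0, 5.0, np.nan, 7.0],
})

# Mean imputation for a continuous column.
df["Engine"] = df["Engine"].fillna(df["Engine"].mean())

# Mode imputation for a discrete column; mode() can return ties, so take the first.
df["Seats"] = df["Seats"].fillna(df["Seats"].mode().iloc[0])

print(df)
```

For a skewed column the median is a common alternative to the mean, since it is less affected by extreme values.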
3 / Univariate Analysis:
-
Each column was examined individually to understand its distribution and identify any anomalies. Key numerical features such as Price, Mileage, Engine, and Power were plotted to observe their distributions, helping to identify skewness and the presence of outliers.
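A numeric summary with skewness is a compact complement to the distribution plots. The sketch below uses made-up values with one deliberately extreme price.

```python
import pandas as pd

# Hypothetical values; one very expensive car stretches the Price distribution.
df = pd.DataFrame({
    "Price": [2.1, 3.2, 4.5, 5.0, 6.3, 7.8, 45.0],
    "Engine": [998, 1197, 1248, 1498, 1582, 1968, 2993],
})

# describe() gives count, mean, quartiles, min and max; skew() adds asymmetry.
summary = df.describe().T
summary["skew"] = df.skew()

print(summary[["mean", "50%", "max", "skew"]])
```

A mean well above the median together with a large positive skew is the usual signature of a long right tail.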
4 / Bivariate Analysis:
-
Relationships between the target variable Price and key features like Year, Mileage, Fuel Type, and Transmission were analyzed. This helped in identifying which features had the strongest relationships with car prices and were crucial for model building.
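The bivariate checks can be approximated numerically with correlations for numeric features and group-wise summaries for categorical ones. The data below are hypothetical and chosen only to show the mechanics.

```python
import pandas as pd

# Hypothetical sample rows.
df = pd.DataFrame({
    "Year": [2010, 2012, 2014, 2016, 2018, 2019],
    "Kilometers_Driven": [120000, 95000, 70000, 52000, 30000, 15000],
    "Transmission": ["Manual", "Manual", "Automatic", "Manual", "Automatic", "Automatic"],
    "Price": [1.8, 2.5, 6.0, 4.2, 9.5, 12.0],
})

# Pearson correlation of each numeric feature with Price.
corr_with_price = (
    df.select_dtypes("number").corr()["Price"].drop("Price").sort_values()
)

# Median price per category for a categorical feature.
median_by_transmission = df.groupby("Transmission")["Price"].median()

print(corr_with_price)
print(median_by_transmission)
```

Correlation only captures linear association, so it is best read alongside scatter plots rather than instead of them.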
5 / Outlier Treatment:
-
Outliers in columns such as Kilometers Driven and Price were detected and addressed, either by removing extreme values or treating them appropriately to avoid model distortion.
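The write-up does not name the outlier rule that was used. One common choice is the interquartile-range (IQR) rule, sketched here as an assumption, with both options mentioned above: removing the extreme rows or capping them. The helper `iqr_bounds` and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical values; the last entry is an implausibly high odometer reading.
df = pd.DataFrame({
    "Kilometers_Driven": [15000, 30000, 42000, 55000, 61000, 73000, 6500000],
})

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

low, high = iqr_bounds(df["Kilometers_Driven"])
is_outlier = ~df["Kilometers_Driven"].between(low, high)

# Option A: drop the flagged rows.
trimmed = df[~is_outlier]

# Option B: keep the rows but cap the values at the fences.
capped = df["Kilometers_Driven"].clip(lower=low, upper=high)

print(is_outlier.sum(), "outlier(s); upper fence =", high)
```

Capping keeps the row count intact, which matters when the other columns of an extreme row are still informative.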
6 / Standardization:
-
Numerical variables like Mileage, Engine, and Power were standardized to bring all features onto the same scale. For ordinary least squares this does not change the predictions, but it makes coefficient magnitudes comparable across features and helps scale-sensitive methods such as regularized regression.
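Standardization with scikit-learn can be sketched as below. The sample values are hypothetical. In practice the scaler is usually fitted on the training split only and then applied to the test split, so that test data do not leak into the scaling parameters.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features after unit extraction.
df = pd.DataFrame({
    "Mileage": [26.6, 19.67, 18.2, 20.77, 15.2],
    "Engine": [998.0, 1582.0, 1199.0, 1248.0, 1968.0],
    "Power": [58.16, 126.2, 88.7, 88.76, 140.8],
})
cols = ["Mileage", "Engine", "Power"]

# Subtract each column's mean and divide by its standard deviation.
scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(df[cols]), columns=cols, index=df.index
)

print(scaled.round(2))
```

After scaling, each column has mean 0 and unit (population) standard deviation, so a one-unit change means "one standard deviation" for every feature.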
Key Insights
-
Univariate analysis revealed that features like Price and Mileage were skewed, with outliers, necessitating careful outlier treatment for accurate modeling.
-
Bivariate analysis identified significant relationships between Year, Engine Power, and Price, confirming these variables as important predictors for the final model.
-
Imputation of missing values and proper handling of categorical variables through one-hot encoding ensured the dataset was clean and ready for the regression model.
Challenges
-
Handling missing values in columns such as Price and New Price required careful consideration to ensure the imputation method didn’t introduce bias into the model.
-
Outlier detection and treatment was crucial, especially for continuous variables like Kilometers Driven and Price, to prevent distortion of the regression model’s predictions.
-
Splitting and processing of features like Mileage, Engine, and Power, which contained both numerical values and units, was a time-intensive process but necessary for accurate analysis.
Phase-3
Key Insights
-
A simple linear regression model provided a good baseline for prediction, but there was room for improvement through feature selection and hyperparameter tuning.
Challenges
-
Multicollinearity between some of the features was identified as a potential issue, which required careful feature selection to avoid unreliable model coefficients.
Model Building

Figure: OLS price predictions, actuals vs. predictions.

Goldfeld-Quandt test. Homoscedasticity: if the residuals have a roughly constant spread around the regression line across the range of predictions, the data are said to be homoscedastic. Heteroscedasticity: if the spread of the residuals is not constant, the data are said to be heteroscedastic; in this case the residuals can form a funnel shape or another non-symmetrical pattern. The Goldfeld-Quandt test checks this, and its null hypothesis is that the residuals are homoscedastic.

Figure: Observed vs. predicted values.

Figure: Residuals Q-Q plot.
4 / Assumption Testing:
Residual plots were analyzed to check for linear regression assumptions (e.g., normality of residuals, homoscedasticity). Mild heteroscedasticity was observed, but the model still performed well.
Phase-4
1 / OLS Regression:
An Ordinary Least Squares (OLS) regression model was built using Statsmodels, with an adjusted R-squared value of 0.778, indicating a strong model fit. Significant predictors included car age, engine size, and kilometers driven.
2 / Linear Regression:
A Linear Regression model was built and trained using the training data. This served as a simple, interpretable baseline to assess relationships between the features and car prices.
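A baseline of this kind, with the evaluation metrics named in the Methods section (RMSE, MAE, R-squared), can be sketched with scikit-learn. The data are synthetic and the 70/30 split is an assumption; the original split ratio is not stated.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic features and target with a known linear relationship plus noise.
X = rng.normal(size=(400, 3))
y = 5.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.5, size=400)

# Hold out 30% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

# Metrics are computed on the held-out rows only.
rmse = np.sqrt(mean_squared_error(y_test, pred))
mae = mean_absolute_error(y_test, pred)
r2 = r2_score(y_test, pred)

print(f"RMSE={rmse:.3f}  MAE={mae:.3f}  R2={r2:.3f}")
```

RMSE penalizes large errors more heavily than MAE, so a wide gap between the two suggests a few badly mispredicted cars.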
3 / Multicollinearity Check:
Variance Inflation Factor (VIF) was used to check for multicollinearity. Features like fuel type showed some multicollinearity but were retained as they did not affect other coefficients.
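The VIF for a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the others. The sketch below computes it directly with NumPy on synthetic data; the helper `vif_table` is hypothetical, and Statsmodels offers `variance_inflation_factor` in `statsmodels.stats.outliers_influence` for the same purpose. The fuel-type dummies are simulated so that almost every car is either Diesel or Petrol, which makes the two indicators nearly mirror images of each other.

```python
import numpy as np
import pandas as pd

def vif_table(X):
    """VIF per column: regress it on the other columns (with intercept)."""
    rows = {}
    for col in X.columns:
        target = X[col].to_numpy(dtype=float)
        others = X.drop(columns=col).to_numpy(dtype=float)
        design = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1 - resid.var() / target.var()
        rows[col] = 1 / (1 - r2)
    return pd.Series(rows, name="VIF").sort_values(ascending=False)

rng = np.random.default_rng(7)
n = 500

# Synthetic fuel types: a rare third category is the implicit baseline.
fuel = rng.choice(["Diesel", "Petrol", "CNG"], size=n, p=[0.53, 0.45, 0.02])
X = pd.DataFrame({
    "Fuel_Type_Diesel": (fuel == "Diesel").astype(float),
    "Fuel_Type_Petrol": (fuel == "Petrol").astype(float),
    "Year": rng.integers(2005, 2020, size=n).astype(float),
})

vif = vif_table(X)
print(vif.round(2))
```

In this setup the two fuel dummies get high VIFs while the unrelated Year column stays near 1, which mirrors the situation described above: the collinearity is confined to the fuel-type indicators.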
Key Insights
-
The OLS regression model explained around 77.8% of the variance in used car prices, which indicates a strong relationship between the independent variables and the target variable (price).
-
Features like car age, engine size, kilometers driven, and transmission type were found to be significant predictors of used car prices.
-
Fuel type and transmission type showed a strong impact on the price, especially with diesel and manual transmission cars being valued differently compared to their counterparts.
-
While the model performed well, some multicollinearity was detected, especially between the fuel types. The affected features were reviewed but retained, as they did not affect the other coefficients in the model.
Challenges
-
Certain features like fuel type (Diesel and Petrol) exhibited high multicollinearity, which could lead to instability in the model’s coefficients. Although not removed, this required careful interpretation.
-
Mild heteroscedasticity was detected in the residuals, indicating that the variance of errors was not constant across predictions. This could affect model predictions but did not severely degrade overall performance.
-
While most linear regression assumptions were satisfied, slight deviations from homoscedasticity and normality of residuals were observed, warranting further refinement or exploration of more robust models.
Phase-5
1 / Process:
-
The results of the model, including RMSE, MAE, and R-squared, were communicated, emphasizing how well the model captured the relationship between the features and car prices.
-
The most influential features (such as Year, Mileage, Fuel Type, and Engine Power) were highlighted, providing actionable insights for pricing strategies.
Communicating Results
Key Insights
-
Engine Size: Larger engines are associated with higher car prices.
-
Car Category: Mid-range and luxury cars significantly increase pricing.
-
Region: Prices vary by region, with notable differences between the North, South, and West.
-
Fuel Type: Diesel and electric cars are valued higher than gasoline cars.
-
Mileage: Higher mileage tends to decrease car prices.
Phase-6
1 / Business Recommendation:
-
Although the model was not deployed in real-time, the findings from this regression model can be operationalized in several ways:
-
Pricing Strategy: The model’s predictions can be used to assist Cars4U in setting competitive prices for used cars based on key features such as engine size, car category, and fuel type.
-
Inventory Management: Insights into regional preferences can guide Cars4U in stocking the right types of vehicles for different markets, ensuring that demand is met efficiently.
-
Marketing Customization: Targeted marketing campaigns can be designed to highlight key selling points of cars in different categories and regions, such as promoting fuel-efficient vehicles in urban areas and SUVs in suburban markets.
Operationalization
Key Insights
-
Pricing Adjustments: Using the model’s insights, Cars4U can adjust its pricing based on market demands and car characteristics, improving profitability.
-
Market Segmentation: The regional data allows for more tailored inventory and marketing strategies, making Cars4U more competitive by focusing on local consumer preferences.
-
Scalability: The model can be retrained periodically as new data comes in, ensuring that pricing remains competitive and relevant as market conditions evolve.
Challenges
-
Model Scalability: Ensuring the model is continuously updated with new data to remain accurate over time.
-
Real-time Implementation: While the model was not deployed in real-time, integrating it into an operational system would require additional technical infrastructure to ensure seamless use.
Reflection
As a data analyst, this project really showed me how powerful data can be in the real world. Predicting used car prices wasn’t just about plugging numbers into a model; it was about uncovering the hidden relationships between features and understanding how they influence market value.
Dealing with challenges like multicollinearity and heteroscedasticity was a bit tricky, but it taught me that data rarely behaves perfectly, and that's okay. What I loved most was seeing how these insights could actually help shape business decisions, like pricing strategies. It was a great reminder that being a data analyst isn’t just about crunching numbers—it’s about turning data into something meaningful and useful.