Goals:
Explore hypothesis: Car prices rise in the winter months and drop in the summer
Improve my Python skills, specifically on web scraping and data analysis
Inspired by a merchant friend's words, I decided to gather some data and explore market trends and pricing, while gaining valuable hands-on experience with an end-to-end project.
I used Selenium to scrape data from Greece's largest used car website at three-month intervals, Pandas to explore the data, and Matplotlib and Seaborn to visualize it.
Key insights:
The overall mean price increases by 3.18% from winter to summer, while the median increases by 4.36% over the same period.
Cars with larger engine sizes appear to drive the mean price increase from winter to summer.
In general, the mean tends to increase more sharply than the median, suggesting that a small number of listings with extreme values in mileage, price, and engine size still have a noticeable impact.
I scraped data from Greece's largest used car website at three-month intervals, using Selenium to handle the site's dynamic content. This approach let me collect comprehensive, up-to-date information on car listings and kept the data points consistent across seasons, so that trends and patterns could be compared.
Here's a snapshot of the web scraper:
And the consolidation script:
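As a rough sketch of what a consolidation step like this could look like (not the exact script used here), the example below stacks one table per scrape into a single dataframe and tags each row with its season. The in-memory CSV contents, the column names `title` and `price`, and the `season` column are assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the CSV files saved after each scrape.
winter_csv = io.StringIO("title,price\nCar A,5000\nCar B,7000\n")
summer_csv = io.StringIO("title,price\nCar C,6500\n")

frames = []
for season, source in [("Winter", winter_csv), ("Summer", summer_csv)]:
    df = pd.read_csv(source)
    df["season"] = season  # remember which scrape each row came from
    frames.append(df)

# Stack all scrapes into one table with a fresh 0..n-1 index.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

With real files, the `io.StringIO` objects would simply be replaced by file paths.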
Next, I moved on to loading and cleaning the data. This involved:
removing duplicates,
handling missing values and outliers,
and formatting the data for easier manipulation.
This was my first time using Jupyter Notebook. I chose it to make data handling and manipulation easier and more efficient.
First, I installed and imported the necessary libraries:
Then, I imported the data and took a first look:
I decided to drop the 'date' and 'pagenumber' columns, since they were only helper columns for scraping and no longer useful. I also renamed a few columns for clarity.
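A minimal sketch of this step, assuming the raw columns were called `price` and `mileage` before renaming (the original names are not shown here, so these are placeholders):

```python
import pandas as pd

# Tiny made-up sample with the two helper columns left over from scraping.
raw = pd.DataFrame({
    "date": ["2024-01-10", "2024-01-10"],
    "pagenumber": [1, 1],
    "price": ["5000", "7000"],
    "mileage": ["120000", "90000"],
})

# Drop the helper columns, then rename the rest to include their units.
cleaned = raw.drop(columns=["date", "pagenumber"]).rename(
    columns={"price": "price(€)", "mileage": "mileage(km)"}
)
print(cleaned.columns.tolist())
```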
Checked data types and shape:
There were many unexpected object types, so I checked for missing values before attempting .describe():
The scraper did a good job of not leaving blank values, so I formatted the data types appropriately:
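One common way to do this kind of conversion, shown here as a sketch on made-up values rather than the notebook's actual code, is `pd.to_numeric` with `errors="coerce"`, which turns anything unparseable into NaN so it can be counted afterwards.

```python
import pandas as pd

# Made-up sample: numbers stored as text, plus one value that cannot be parsed.
df = pd.DataFrame({
    "price(€)": ["5000", "12500", "unknown"],
    "mileage(km)": ["120000", "95000", "60000"],
})

for column in ["price(€)", "mileage(km)"]:
    # Unparseable strings become NaN instead of raising an error.
    df[column] = pd.to_numeric(df[column], errors="coerce")

# Count how many values failed to convert in each column.
unparsed = df.isna().sum()
print(df.dtypes)
print(unparsed)
```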
And took a look at the data again:
Next, looked for duplicates and deleted them:
Out of the initial 305,595 rows, the data was reduced to 282,995 after deduplicating, meaning 22,600 entries were deleted, or about 7.4% of the original dataset. This reduction is reasonable, given the high volume of pages scraped and the fact that the website mixed a couple of ads in with the listings using a dynamic object, making it impossible to filter them out while scraping the frontend with Selenium.
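The deduplication and the share of removed rows can be sketched as follows. The four-row table is made up; the last lines repeat the arithmetic with the row counts reported above.

```python
import pandas as pd

# Made-up sample containing one exact duplicate row.
listings = pd.DataFrame({
    "title": ["Car A", "Car A", "Car B", "Car C"],
    "price(€)": [5000, 5000, 7000, 9000],
})

before = len(listings)
deduped = listings.drop_duplicates().reset_index(drop=True)
removed = before - len(deduped)
share_removed = removed / before * 100

# Same arithmetic with the reported row counts.
reported_removed = 305_595 - 282_995
reported_share = reported_removed / 305_595 * 100
print(removed, share_removed, reported_removed, round(reported_share, 1))
```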
Next, I looked for outliers in the price(€), mileage(km), and engine(cc) columns. I started by visualizing the price(€) column for a first look:
Listings costing under €1,500 were excluded during scraping. There are a few very expensive cars close to €1 million that are definitely outliers, and many more that the boxplot flags as outliers because they lie beyond its right whisker. I thought it would be helpful to see the price at that right whisker:
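As an illustration of how the right whisker can be computed, the sketch below assumes the boxplot uses the usual 1.5 × IQR rule (the default in Matplotlib and Seaborn). The price values are made up.

```python
import pandas as pd

# Made-up prices with one extreme listing.
prices = pd.Series([2000, 4000, 6000, 8000, 10000, 12000, 14000, 16000, 900000])

q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Points above this limit are drawn as outliers.
upper_fence = q3 + 1.5 * iqr
# The whisker itself ends at the largest observation inside the limit.
upper_whisker = prices[prices <= upper_fence].max()

print(upper_fence, upper_whisker)
```

Here the limit is 26,000 and the whisker ends at 16,000, so only the 900,000 listing is flagged.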
The whisker sits at almost €32,000, which felt too low a cutoff for excluding every car above it; something around the €100k mark seemed more reasonable for the car market. I counted the listings above €100k: 2,016 listings, or 0.71% of the deduplicated data, which seemed a safer set to exclude. I followed the same process for the 'mileage(km)' column, where I excluded values below the 1st and above the 99th percentile, and for the 'engine(cc)' column, where I excluded values below the 0.4th and above the 99.9th percentile, and ended up with 274,192 listings:
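A sketch of this filtering on made-up data, using the thresholds described above. The helper `trim_by_quantiles` is a hypothetical name introduced only for this example.

```python
import pandas as pd


def trim_by_quantiles(df, column, lower_q, upper_q):
    """Keep rows whose value in `column` lies between the two quantiles (inclusive)."""
    low, high = df[column].quantile([lower_q, upper_q])
    return df[df[column].between(low, high)]


# Made-up sample: 100 listings, one of them priced far above 100k.
df = pd.DataFrame({
    "price(€)": [5000] * 99 + [250000],
    "mileage(km)": list(range(0, 100000, 1000)),
    "engine(cc)": [1400] * 100,
})

step1 = df[df["price(€)"] <= 100_000]                        # fixed price cap
step2 = trim_by_quantiles(step1, "mileage(km)", 0.01, 0.99)    # 1st to 99th percentile
clean = trim_by_quantiles(step2, "engine(cc)", 0.004, 0.999)   # 0.4th to 99.9th percentile

print(len(df), len(step1), len(step2), len(clean))
```

Applying the filters one after another means each percentile is computed on the rows that survived the previous step; computing all thresholds on the full table first is an equally defensible choice and gives slightly different counts.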
This is our clean dataframe.
By this point, I had already gotten a first look at the data and was ready to dive deeper now that it was cleaned:
A first look at the clean data:
Price: The median price of €11,000 suggests that typical cars in the dataset are fairly affordable, while the higher mean price of €14,425 indicates that a few expensive cars are skewing the average upward.
Mileage: The mileage data shows that most cars have significant usage, with a median mileage of 133,300 km. The high mileage is typical for a used car dataset.
Engine Size: Most cars in the dataset fall within the typical range of 1,300 cc to 1,800 cc.
I visualised the price distribution and saw that most car prices fall between €5,000 and €20,000. The distribution is right-skewed, indicating a small number of high-priced cars:
I followed a similar process for engine sizes and found that the majority of cars have engines between 1,000 cc and 2,000 cc, which is typical for personal vehicles:
And the same for mileage, where most cars have mileage between 50,000 and 200,000 kilometers, with very few having extremely high or low mileage:
Then, I looked at my categorical data, starting with fuel types. By far the most common fuel types are Petrol and Diesel, which together make up almost 91% of the total listings. Another notable insight is that almost one in every twenty listed cars uses Gas/LPG as fuel:
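Shares like these can be read off with `value_counts(normalize=True)`. The sketch below uses ten made-up listings, so the percentages are illustrative and not the dataset's.

```python
import pandas as pd

# Made-up fuel column for ten listings.
fuel = pd.Series(["Petrol"] * 6 + ["Diesel"] * 3 + ["Gas/LPG"] * 1)

# Fraction of listings per fuel type, expressed as a percentage.
shares = fuel.value_counts(normalize=True).mul(100).round(1)
print(shares)
```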
Lastly, I looked at the number of listings by season. They are fairly even across seasons, with a slight increase during the Spring:
After gaining a better understanding of the dataset through the EDA, it was time to focus on testing my initial hypothesis: whether car prices tend to rise in the winter and drop in the summer. For that, I calculated and visualized the mean and median car prices for each season:
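A minimal sketch of the per-season calculation, on made-up prices and assuming the season is stored in a column named 'Season':

```python
import pandas as pd

# Made-up sample: three listings per season.
df = pd.DataFrame({
    "Season": ["Winter"] * 3 + ["Summer"] * 3,
    "price(€)": [8000, 10000, 18000, 9000, 11000, 19000],
})

# Mean and median price per season, ordered chronologically.
summary = (
    df.groupby("Season")["price(€)"]
    .agg(["mean", "median"])
    .loc[["Winter", "Summer"]]
)
print(summary)
```

Calling `summary.plot(kind="bar")` on the result is one simple way to visualize it.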
From this, I could see that:
The mean prices across the three seasons, Winter (€14,290), Spring (€14,256), and Summer (€14,745), are relatively close. The 3.18% increase in the mean from winter to summer, together with the median increase over the same period, suggests that car prices are fairly stable throughout the year overall, with no drastic seasonal fluctuations.
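The 3.18% figure follows directly from the seasonal means quoted above. The helper `pct_change` is a name introduced only for this worked example.

```python
def pct_change(old, new):
    """Relative change from `old` to `new`, in percent."""
    return (new - old) / old * 100


# Seasonal mean prices in euros, as reported above.
winter_mean, spring_mean, summer_mean = 14290, 14256, 14745

winter_to_summer = round(pct_change(winter_mean, summer_mean), 2)
winter_to_spring = round(pct_change(winter_mean, spring_mean), 2)
print(winter_to_summer, winter_to_spring)
```

The same calculation shows that the mean barely moves from winter to spring (about -0.24%), so nearly all of the increase happens between spring and summer.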
Moreover, this contradicts the initial hypothesis that car prices tend to rise in the winter. The data at hand show the exact opposite: car prices increase during the summer.
So, what if I tried to make similar comparisons in various segments of the data?
I initially segmented the data based on engine sizes, using the Greek vehicle tax categories as a framework:
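Binning of this kind is commonly done with `pd.cut`. In the sketch below the band boundaries and the labels A to D are placeholders chosen for illustration; they are not the actual Greek tax bands, which are not listed here.

```python
import pandas as pd

# Made-up engine sizes in cc.
df = pd.DataFrame({"engine(cc)": [999, 1400, 1600, 2500]})

# Placeholder boundaries (assumed, not the real tax bands).
# Each bin includes its upper edge, so 1400 falls in (1000, 1400].
bins = [0, 1000, 1400, 1800, float("inf")]
labels = ["A", "B", "C", "D"]

df["Engine Category"] = pd.cut(df["engine(cc)"], bins=bins, labels=labels)
print(df)
```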
And used a bar graph to view them:
Then, grouped by 'Engine Category' and 'Season'; calculated the mean and median of 'price(€)', and visualized them:
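A sketch of the two-level grouping on made-up prices. Unstacking the season level gives one row per engine category and one column per season, which is a convenient shape for line or bar plots.

```python
import pandas as pd

# Made-up sample: two engine categories, two listings per season each.
df = pd.DataFrame({
    "Engine Category": ["A"] * 4 + ["C"] * 4,
    "Season": ["Winter", "Winter", "Summer", "Summer"] * 2,
    "price(€)": [5000, 7000, 6000, 6400, 15000, 21000, 19000, 23000],
})

summary = df.groupby(["Engine Category", "Season"])["price(€)"].agg(["mean", "median"])

# One row per category, one column per season, in chronological order.
mean_table = summary["mean"].unstack("Season")[["Winter", "Summer"]]
print(mean_table)
```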
Mean and Median Price Trends by Engine Category and Season:
Both graphs consistently indicate that car prices tend to increase from Winter to Summer. This suggests a seasonal trend where prices rise as we move toward the summer months, which again contradicts my initial hypothesis. What is more, the flatter trend for smaller engine categories (A, B) in the second and third graphs suggests that these categories don't contribute much to the overall price increase from winter to summer.
To explore further segments of my data, I followed the same logic, starting with mileage categories and season:
Mean and Median Price Trends by Mileage Category and Season:
Mean and Median Price Trends by Price Category and Season:
Consistent with the analysis so far, car prices tend to increase slightly from Winter to Summer across both mileage and price categories. This trend is visible whether we look at mean or median values. However, not all subgroups contribute equally to the total increase; some lines are flatter than others.
I truly enjoyed the entire process and learned a lot along the way. Interestingly, the initial hypothesis—that car prices tend to rise in the winter and fall during the summer—was not supported by the data I collected. In fact, the opposite seemed to be true: prices actually increased during the summer months.
There is plenty to do to expand this project, like:
including cars priced below €1,500,
exploring motorbikes,
collecting more data over a longer period of time,
scraping more frequently, for example once a month,
gathering data from different dealers.
I hope to have the time and data to further explore the topic.
Thanks for reading. Feel free to reach out with any comments.