Predicting car prices with scikit-learn

In this notebook I predict automobile prices using Python and its data analysis and machine learning packages, pandas and scikit-learn. Forecasting car prices is useful to several kinds of businesses: insurance companies can use the estimates to calculate premiums; websites can show price estimates when an asking price is unavailable; and companies can write contracts that fix a car’s resale value in advance on better information, or judge whether a car is overvalued or undervalued relative to the market.

My data source is Mercado Livre, a widely used e-commerce platform in Brazil. For simplicity, and also to limit the geographical factors that affect prices, this research is restricted to ads from the state of Rio Grande do Sul (RS), where Porto Alegre is located. I also excluded trucks and minivans from the analysis. While new cars can be advertised, most of the cars listed are used. To publish an ad, the seller must fill in information about the car, such as brand, model, mileage, engine power and additional features. It is a common belief that these features can help predict a car’s asking price in the market, and I will explore them to improve our predictions (of course, there will also be unobservable factors that affect prices). A popular source for automobile prices in Brazil is the FIPE table, in which average prices are calculated from newspaper and web ads. The approach here can improve on the FIPE table: a learning model can be refit as new data arrive, and it can predict prices for cars with a specific set of features and location.

Throughout the notebook, I go through each step of the data analysis, with code and commentary.

Data wrangling

First, let’s import the required packages for this step:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

import numpy as np
import json
import requests

To access the ad data, I’ll make requests to the Mercado Livre API. We must build the query URL using category and location identifiers, which can be found on the API documentation site.

# parameters
category = 'MLB1744' # cars and trucks
city     = 'TUxCUFJJT0xkYzM0' # state of rs
limit    = '50'
offset   = [str(i) for i in list(range(50,10000,50))]

# access token
with open('ml_token.txt') as file:  
    token = file.read()

#url = 'https://api.mercadolibre.com/sites/MLB/search?category='+category+'&city='+city+'&limit='+limit+'&offset='+offset+'&access_token='+token

Users are only allowed to fetch 50 items per request, up to a total of 10000 items; the offset parameter selects which page of results is returned. Using requests, we can download all data:

responses = []
for off in offset:
    url = 'https://api.mercadolibre.com/sites/MLB/search?category='+category+'&state='+city+'&limit='+limit+'&offset='+off+'&access_token='+token
    responses.append(requests.get(url))

respd = [i.json() for i in responses]
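The paging logic can also be sketched without touching the network. The snippet below is only an illustration: the helper name build_urls and the placeholder token are made up, and it starts at offset 0, whereas the loop above starts at 50. It uses urlencode so that parameter values are escaped properly.

```python
from urllib.parse import urlencode

BASE_URL = 'https://api.mercadolibre.com/sites/MLB/search'

def build_urls(category, state, token, page_size=50, max_items=10000):
    """Return one search URL per page, stepping the offset by page_size."""
    urls = []
    for offset in range(0, max_items, page_size):
        query = urlencode({'category': category, 'state': state,
                           'limit': page_size, 'offset': offset,
                           'access_token': token})
        urls.append(f'{BASE_URL}?{query}')
    return urls

# 'PLACEHOLDER_TOKEN' is not a real credential
urls = build_urls('MLB1744', 'TUxCUFJJT0xkYzM0', token='PLACEHOLDER_TOKEN')
print(len(urls))
print(urls[0])
```

Each URL could then be passed to requests.get, as in the loop above.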

The .json method parses each response into a Python dictionary. Looking at the structure of the data, I find that ad entries are stored under the results key:

respd[0].keys()
dict_keys(['site_id', 'paging', 'results', 'secondary_results', 'related_results', 'sort', 'available_sorts', 'filters', 'available_filters'])
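# flatten each response's 'results' list into a DataFrame
# (pd.json_normalize replaces the deprecated pd.io.json.json_normalize in pandas >= 1.0)
data1 = [pd.json_normalize(i['results'], sep='_') for i in respd]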

The result is a list of pandas DataFrames. We can append all datasets using concat:

data1 = pd.concat(data1, sort=False).reset_index(drop=True)
data1.head()
accepts_mercadopago address_area_code address_city_id address_city_name address_phone1 address_state_id address_state_name attributes available_quantity buying_mode ... shipping_logistic_type shipping_mode shipping_store_pick_up shipping_tags site_id sold_quantity stop_time tags thumbnail title
0 True TUxCQ0NBTjcxNTM Canoas TUxCUFJJT0xkYzM0 Rio Grande Do Sul [{'value_id': '2230581', 'value_name': 'Usado'... 1 classified ... None not_specified False [] MLB 0 2019-02-24T04:00:00.000Z [poor_quality_thumbnail, poor_quality_picture,... http://mlb-s1-p.mlstatic.com/757910-MLB2918517... Hyundai Hb20s 1.6 Comfort Plus 16v Flex 4p Manual
1 True TUxCQ0VOQzZhYTdi Encantado TUxCUFJJT0xkYzM0 Rio Grande Do Sul [{'value_id': '2230581', 'value_name': 'Usado'... 1 classified ... None not_specified False [] MLB 0 2019-02-21T22:16:47.000Z [poor_quality_picture, poor_quality_thumbnail,... http://mlb-s1-p.mlstatic.com/671101-MLB2907376... Chevrolet Agile Ltz - Fernando Multimarcas
2 True TUxCQ1BPUjgwZTJl Porto Alegre TUxCUFJJT0xkYzM0 Rio Grande Do Sul [{'source': 1, 'id': 'ITEM_CONDITION', 'name':... 1 classified ... None not_specified False [] MLB 0 2019-02-09T09:34:55.000Z [good_quality_picture, good_quality_thumbnail,... http://mlb-s2-p.mlstatic.com/962446-MLB2896708... Renault Grand Scenic 2002
3 True TUxCQ0NBWDgxMzcw Caxias do Sul TUxCUFJJT0xkYzM0 Rio Grande Do Sul [{'id': 'ITEM_CONDITION', 'name': 'Condição do... 1 classified ... None not_specified False [] MLB 0 2019-03-19T04:00:00.000Z [only_html_description, immediate_payment] http://mlb-s2-p.mlstatic.com/668212-MLB2919195... Hb20 1.6 Comfort Plus 16v Flex 4p
4 True TUxCQ1BPUjgwZTJl Porto Alegre TUxCUFJJT0xkYzM0 Rio Grande Do Sul [{'value_name': 'Automática', 'value_struct': ... 1 classified ... None not_specified False [] MLB 0 2019-02-16T04:00:00.000Z [dragged_visits, good_quality_picture, good_qu... http://mlb-s1-p.mlstatic.com/968792-MLB2777645... Volvo Xc90 2.0 T6 Inscription Drive-e 5p

5 rows × 73 columns
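To see where column names such as address_city_name come from, here is a small sketch on made-up records (the ids and values are invented, and pd.json_normalize assumes pandas 1.0 or later). Nested dictionaries are flattened, and the key path is joined with the chosen separator.

```python
import pandas as pd

# two invented ad records with a nested 'address' dictionary
results = [
    {'id': 'A1', 'price': 46990.0, 'address': {'city_name': 'Canoas'}},
    {'id': 'A2', 'price': 28800.0, 'address': {'city_name': 'Encantado'}},
]

flat = pd.json_normalize(results, sep='_')
print(flat.columns.tolist())
print(flat)
```

Lists nested inside a record (like attributes) are not expanded this way, which is why that column still holds lists of dicts.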

In each entry, we have a series of characteristics of the car, such as make, model, asking price, features and location. Next, I will shape the data into a tidy pandas DataFrame for further manipulation. This requires a few steps, as the car attributes are stored as a list of dicts. I start by keeping only the columns of interest:

cols = ['title','price','address_city_name','attributes', 'id',
       'location_latitude', 'location_longitude',
       'permalink',
       'seller_car_dealer']

data1 = data1.loc[:,cols]
data1.head()
title price address_city_name attributes id location_latitude location_longitude permalink seller_car_dealer
0 Hyundai Hb20s 1.6 Comfort Plus 16v Flex 4p Manual 46990.0 Canoas [{'value_id': '2230581', 'value_name': 'Usado'... MLB1167035642 -29.9188 -51.1585 https://carro.mercadolivre.com.br/MLB-11670356... True
1 Chevrolet Agile Ltz - Fernando Multimarcas 28800.0 Encantado [{'value_id': '2230581', 'value_name': 'Usado'... MLB1159937620 -29.2249 -51.8895 https://carro.mercadolivre.com.br/MLB-11599376... True
2 Renault Grand Scenic 2002 8500.0 Porto Alegre [{'source': 1, 'id': 'ITEM_CONDITION', 'name':... MLB1152825359 -30.0346 -51.2177 https://carro.mercadolivre.com.br/MLB-11528253... False
3 Hb20 1.6 Comfort Plus 16v Flex 4p 44890.0 Caxias do Sul [{'id': 'ITEM_CONDITION', 'name': 'Condição do... MLB1167546961 -29.1634 -51.1797 https://carro.mercadolivre.com.br/MLB-11675469... True
4 Volvo Xc90 2.0 T6 Inscription Drive-e 5p 339950.0 Porto Alegre [{'value_name': 'Automática', 'value_struct': ... MLB1165970712 -30.0554 -51.2224 https://carro.mercadolivre.com.br/MLB-11659707... True

It’s a great step forward; however, the information on each car’s attributes is still trapped in a list of dicts, so applying json_normalize to the whole DataFrame as before will not unpack it. As an example, let’s apply the function to the attributes of a single ad:

pd.json_normalize(data1.attributes[0], sep='_').head()

Now, I will reshape the data in order to have a single row per ad. Let’s define a helper function:

def reshape_(data):
    return data.loc[:,['id','value_name']].set_index('id').transpose()

We must now apply the helper to every ad and then join the results. An explicit for loop can be replaced by a Python list comprehension:

df_temp = [reshape_(pd.json_normalize(i)) for i in data1.attributes]

data2 = pd.concat(df_temp, sort=False).reset_index(drop=True)
del df_temp
data2.tail()
ITEM_CONDITION TRANSMISSION ENGINE_DISPLACEMENT BRAND DOORS FUEL_TYPE KILOMETERS MODEL TRIM VEHICLE_YEAR TRACTION_CONTROL
9943 Usado Manual 999 cc Volkswagen 5 Gasolina e álcool 100000 km Fox 1.0 Vht Trend Total Flex 5p 2009 Dianteira
9944 Usado Manual 999 cc Ford 5 Gasolina 130000 km Fiesta 1.0 Street 5p 2004 Dianteira
9945 Usado Manual 1360 cc Peugeot 5 Gasolina e álcool 124 km 206 1.4 Presence Flex 5p 2008 Dianteira
9946 Usado Manual 999 cc Fiat 5 Gasolina e álcool 95800 km Palio 1.0 Fire Celebration Flex 5p 2009 Dianteira
9947 Usado Manual 1598 cc Ford 5 Gasolina 190000 km Focus 1.6 Gl 5p 2007 Dianteira
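The reshape step is easier to follow on a single ad. The sketch below uses an invented attribute list (the name and value_name entries are illustrative only) and applies the same select, set_index and transpose chain as the helper above.

```python
import pandas as pd

# invented attributes for one ad, mimicking the structure of the API response
attributes = [
    {'id': 'BRAND', 'name': 'Marca', 'value_name': 'Fiat'},
    {'id': 'KILOMETERS', 'name': 'Quilômetros', 'value_name': '95800 km'},
    {'id': 'VEHICLE_YEAR', 'name': 'Ano', 'value_name': '2009'},
]

long = pd.json_normalize(attributes)           # one row per attribute
wide = long.loc[:, ['id', 'value_name']].set_index('id').transpose()
print(wide)                                    # one row, one column per attribute id
```

Concatenating one such single-row frame per ad gives the wide table shown above. Attributes missing from an ad simply become NaN.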

Finally, we’ll merge with the original data:

df = pd.concat([data1,data2], axis=1)
df.head()
price address_city_name location_latitude location_longitude seller_car_dealer transmission engine_displacement brand doors fuel_type kilometers model vehicle_year traction_control
0 46990.0 Canoas -29.918818 -51.158540 True Manual 1591.0 hyundai 4.0 Gasolina e álcool 72.258 hb20s 2017.0 Dianteira
1 28800.0 Encantado -29.224850 -51.889533 True Manual 1389.0 chevrolet 4.0 Gasolina e álcool 74.000 agile 2011.0 Dianteira
3 44890.0 Caxias do Sul -29.163403 -51.179668 True Manual 1591.0 hyundai 4.0 Gasolina e álcool 39.293 hb20 2018.0 Dianteira
5 52900.0 Santa Maria -29.696541 -53.799310 True Automática 1598.0 volkswagen 4.0 Gasolina e álcool 30.600 crossfox 2015.0 4x2
6 61900.0 Santa Maria -29.696541 -53.799310 True Automática 999.0 ford 5.0 Gasolina 9.345 fiesta 2017.0 4x2

Some more data cleaning, to deal with strings and data types:

df.drop(['TRIM','ITEM_CONDITION','title','attributes','id','permalink'], axis=1, inplace=True)
df.columns = map(str.lower, df.columns)

#remove unwanted strings
cols = ['location_latitude', 'location_longitude','engine_displacement','kilometers','vehicle_year','doors']
df.loc[:,cols] = df.loc[:,cols].replace(regex=True,to_replace=r'\D',value=r'')

# replace empty strings with NaN
df.replace({'':np.nan}, inplace=True)

# set brand and model strings to lowercase
df.brand = df.brand.str.lower()
df.model = df.model.str.lower()
# type conversion
df = df.astype({'location_latitude':float, 'location_longitude':float,'engine_displacement':float,
                'kilometers':float,'vehicle_year':float,'doors':float})

df.dtypes
price                  float64
address_city_name       object
location_latitude      float64
location_longitude     float64
seller_car_dealer         bool
transmission            object
engine_displacement    float64
brand                   object
doors                  float64
fuel_type               object
kilometers             float64
model                   object
vehicle_year           float64
traction_control        object
dtype: object
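As a small illustration of the string cleaning, the sketch below strips non-digit characters from made-up values and converts them to numbers. It uses pd.to_numeric with errors='coerce' instead of the replace-then-astype sequence above, so that empty strings become NaN in one step. Note that the pattern \D also removes minus signs and decimal points, so it is only safe for columns that hold non-negative integers stored as text.

```python
import pandas as pd

# invented raw values with unit suffixes and one empty entry
raw = pd.DataFrame({'kilometers': ['100000 km', '124 km', ''],
                    'engine_displacement': ['999 cc', '1360 cc', '1598 cc']})

# drop every non-digit character, then convert; '' cannot be parsed and becomes NaN
clean = raw.replace(regex=True, to_replace=r'\D', value='')
clean = clean.apply(pd.to_numeric, errors='coerce')
print(clean)
print(clean.dtypes)
```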

Now, I will attempt to detect outliers in numerical features. Note that some extreme values are not outliers (e.g. kilometers is zero for new cars). Replacing outliers with NaN generates missing values that must be treated later.

df.describe().round(2)
price location_latitude location_longitude engine_displacement doors kilometers vehicle_year
count 9740.00 9740.00 9740.00 9740.00 9740.00 9740.00 9740.00
mean 47753.99 -29.70 -51.29 1653.57 3.97 57.63 2012.58
std 32267.50 1.17 0.99 528.68 0.76 51.19 4.39
min 8999.00 -33.69 -57.55 994.00 0.00 0.00 1960.00
25% 27900.00 -30.02 -51.20 1368.00 4.00 21.00 2011.00
50% 37900.00 -29.95 -51.18 1598.00 4.00 54.00 2013.00
75% 56040.00 -29.70 -51.14 1975.00 4.00 83.00 2015.00
max 249000.00 -6.93 -37.88 3960.00 7.00 920.00 2019.00
# fix extreme values: set them to NaN with .loc, which avoids chained assignment
df.loc[df.price == df.price.max(), 'price'] = np.nan
df.loc[(df.engine_displacement < 900) | (df.engine_displacement > 4000), 'engine_displacement'] = np.nan
df.loc[df.kilometers == df.kilometers.max(), 'kilometers'] = np.nan
df.loc[df.vehicle_year == df.vehicle_year.min(), 'vehicle_year'] = np.nan
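The range check on engine displacement can be illustrated on a toy Series. This is only a sketch with invented numbers. It uses between and where, which keep values inside the plausible range and set everything else to NaN.

```python
import pandas as pd

# invented engine sizes in cc; 16 and 20000 are implausible
engine = pd.Series([999.0, 1598.0, 16.0, 3960.0, 20000.0])

plausible = engine.between(900, 4000)   # inclusive on both ends
engine_clean = engine.where(plausible)  # values outside the range become NaN
print(engine_clean)
```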

Finally, we must deal with missing values, since most scikit-learn estimators do not accept them. It is often more fruitful to fill in missing values than to drop whole samples.

# find missing values
df.isna().sum()
price                  0
address_city_name      0
location_latitude      0
location_longitude     0
seller_car_dealer      0
transmission           0
engine_displacement    0
brand                  0
doors                  0
fuel_type              0
kilometers             0
model                  0
vehicle_year           0
traction_control       0
dtype: int64
df.transmission       = df.transmission.fillna('Manual')
df.kilometers         = df.kilometers.fillna(df.kilometers.mean())
df.price              = df.price.fillna(df.price.mean())
df.traction_control   = df.traction_control.fillna('Dianteira')
df.location_latitude  = df.location_latitude.ffill()   # fillna(method='ffill') is deprecated in recent pandas
df.location_longitude = df.location_longitude.ffill()
df.vehicle_year       = df.vehicle_year.fillna(df.vehicle_year.mean())

The engine_displacement feature still has many missing values. My strategy is to fill them with the mode (or median) of engine_displacement for each car model. This seems to be a better approach than filling with the average engine size of all cars.

import statistics
def mode_or_median(var):
    try:
        return statistics.mode(var)
    except statistics.StatisticsError: # there may be no unique value
        return np.median(var)
avg_eng = df.groupby('model')['engine_displacement'].apply(mode_or_median)
avg_eng.fillna(avg_eng.median(),inplace=True) # fill the remaining NaN's
avg_eng.sample(10)

Now, we will fill in these per-model values only where the data were originally missing:

df.set_index('model', drop=False, inplace=True)
df.update(avg_eng, overwrite=False) # overwrite=False only updates NaN's
df.reset_index(drop=True, inplace=True)
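The same idea, filling gaps with a per-model value and falling back to a global one, can be sketched more compactly with groupby and transform. This is an alternative illustration on invented data and not the code used above: it uses the per-model median instead of the mode-or-median helper, and the overall median of the column as the fallback.

```python
import numpy as np
import pandas as pd

# invented cars; 'uno' has no known engine size at all
cars = pd.DataFrame({
    'model': ['onix', 'onix', 'onix', 'gol', 'gol', 'uno'],
    'engine_displacement': [999.0, 999.0, np.nan, 1598.0, np.nan, np.nan],
})

# median engine size of each model, aligned row by row with the original frame
per_model = cars.groupby('model')['engine_displacement'].transform('median')
overall = cars['engine_displacement'].median()

# fill from the model first, then from the overall median where the model has no data
cars['engine_filled'] = (cars['engine_displacement']
                         .fillna(per_model)
                         .fillna(overall))
print(cars)
```

Only rows that were missing are changed, which mirrors the overwrite=False behavior of update.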

Last, I will drop extreme values from our target distribution of car prices. First, because luxury car prices are harder to predict (there are fewer data points), and second because some entries are simply wrong (e.g. ads for older cars whose prices have been mistyped).

# keep values within interval
df = df[(df.price > df.price.quantile(0.01)) & (df.price < df.price.quantile(0.99))]

Finally, we have a dataset ready for analysis:

df.head()
price address_city_name location_latitude location_longitude seller_car_dealer transmission engine_displacement brand doors fuel_type kilometers model vehicle_year traction_control
0 46990.0 Canoas -29.918818 -51.158540 True Manual 1591.0 hyundai 4.0 Gasolina e álcool 72.258 hb20s 2017.0 Dianteira
1 28800.0 Encantado -29.224850 -51.889533 True Manual 1389.0 chevrolet 4.0 Gasolina e álcool 74.000 agile 2011.0 Dianteira
3 44890.0 Caxias do Sul -29.163403 -51.179668 True Manual 1591.0 hyundai 4.0 Gasolina e álcool 39.293 hb20 2018.0 Dianteira
5 52900.0 Santa Maria -29.696541 -53.799310 True Automática 1598.0 volkswagen 4.0 Gasolina e álcool 30.600 crossfox 2015.0 4x2
6 61900.0 Santa Maria -29.696541 -53.799310 True Automática 999.0 ford 5.0 Gasolina 9.345 fiesta 2017.0 4x2

Exploratory data analysis

First, let’s check descriptive statistics:

#df.to_csv('~/mlcars.csv', index=False)
df = pd.read_csv('C:/Users/gsalt/mlcars.csv')
df.describe(include='all').round(2)
price address_city_name location_latitude location_longitude seller_car_dealer transmission engine_displacement brand doors fuel_type kilometers model vehicle_year traction_control
count 9740.00 9740 9740.00 9740.00 9740 9740 9740.00 9740 9740.00 9740 9740.00 9740 9740.00 9740
unique NaN 147 NaN NaN 2 4 NaN 49 NaN 12 NaN 440 NaN 5
top NaN Porto Alegre NaN NaN True Manual NaN chevrolet NaN Gasolina e álcool NaN onix NaN Dianteira
freq NaN 4103 NaN NaN 9019 6008 NaN 1514 NaN 6910 NaN 280 NaN 7812
mean 47753.99 NaN -29.70 -51.29 NaN NaN 1653.57 NaN 3.97 NaN 57.63 NaN 2012.58 NaN
std 32267.50 NaN 1.17 0.99 NaN NaN 528.68 NaN 0.76 NaN 51.19 NaN 4.39 NaN
min 8999.00 NaN -33.69 -57.55 NaN NaN 994.00 NaN 0.00 NaN 0.00 NaN 1960.00 NaN
25% 27900.00 NaN -30.02 -51.20 NaN NaN 1368.00 NaN 4.00 NaN 21.00 NaN 2011.00 NaN
50% 37900.00 NaN -29.95 -51.18 NaN NaN 1598.00 NaN 4.00 NaN 54.00 NaN 2013.00 NaN
75% 56040.00 NaN -29.70 -51.14 NaN NaN 1975.00 NaN 4.00 NaN 83.00 NaN 2015.00 NaN
max 249000.00 NaN -6.93 -37.88 NaN NaN 3960.00 NaN 7.00 NaN 920.00 NaN 2019.00 NaN

The describe method is very useful for finding strange values left in the data, such as those found in engine displacement and kilometers.

The correlation matrix:

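# correlate numeric and boolean columns only (recent pandas raises an error on text columns)
sns.heatmap(df.select_dtypes(include=['number', 'bool']).corr().round(2), annot=True)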
[Figure: heatmap of the correlation matrix]

The most frequently advertised cars and their average price:

df.groupby(['brand','model']).agg({'price':['mean','count']}).sort_values(('price','count'), ascending=False)[:10]
price
mean count
brand model
chevrolet onix 40334.564250 280
fiat palio 25952.777778 270
renault sandero 32828.739837 246
volkswagen gol 23750.466667 240
ford fiesta 32778.090909 231
volkswagen fox 31756.559633 218
fiat uno 27420.529126 206
ford ecosport 48480.684211 190
citroën c3 32972.803371 178
ford ka 32543.722543 173

The same query as before, now considering the manufacturing year of the cars:

df.groupby(['brand','model','vehicle_year']).agg({'price':['mean','count']}).sort_values(('price','count'), ascending=False)[:10]
price
mean count
brand model vehicle_year
fiat mobi 2018.0 33558.644068 59
ford ka 2018.0 39714.827586 58
chevrolet onix 2015.0 39301.210526 57
2018.0 39498.392857 56
ford fiesta 2014.0 33070.696429 56
jeep renegade 2016.0 75305.489796 49
chevrolet onix 2016.0 42870.638085 47
fiat uno 2018.0 32799.565217 46
chevrolet onix 2017.0 43780.888889 45
renault sandero 2018.0 37992.222222 45

Here are the most expensive brands, by average price:

df.pivot_table(index=['brand'], values=['price'], aggfunc='mean').round(2).sort_values('price', ascending=False)[:10].plot.bar()
[Figure: bar chart of the ten most expensive brands by average price]

How cars are located geographically:

import cartopy.crs as ccrs
from cartopy.io import shapereader

kw = dict(resolution='50m', category='cultural',
          name='admin_1_states_provinces')

states_shp = shapereader.natural_earth(**kw)
shp = shapereader.Reader(states_shp)

subplot_kw = dict(projection=ccrs.PlateCarree())

fig, ax = plt.subplots(figsize=(7, 11),
                       subplot_kw=subplot_kw)
ax.set_extent([-57.5,-49.5, -34,-27])
ax.add_geometries(shp.geometries(), ccrs.PlateCarree(), facecolor='lightgreen')
ax.scatter(df.location_longitude, df.location_latitude, marker='.', c='green', zorder=2)
[Figure: map of Rio Grande do Sul with the location of each ad]

The relationships between continuous features are key in our dataset. Are they linear? This matters because linear models, several of which are tested below, assume a linear relationship between the features and the target.

sns.pairplot(df[['price','kilometers','engine_displacement','vehicle_year']], diag_kind='kde')
[Figure: pair plot of price, kilometers, engine_displacement and vehicle_year, with kernel density estimates on the diagonal]

Model tests

First, let’s identify the scope of our problem. Our target, price, is continuous, so we’re dealing with a regression problem. In this section I will experiment with different regression techniques. But first, the dataset must be further processed: our categorical features must be translated into numeric ones. This can be done in pandas with get_dummies (or in scikit-learn with OneHotEncoder), which transforms a categorical feature into dummy variables that indicate which category an observation belongs to.
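As a quick illustration of dummy encoding, here is a sketch on three invented cars. get_dummies replaces the categorical column with one indicator column per category and leaves the numeric column untouched.

```python
import pandas as pd

# invented sample with one categorical and one numeric feature
cars = pd.DataFrame({'transmission': ['Manual', 'Automática', 'Manual'],
                     'kilometers': [72.3, 30.6, 74.0]})

encoded = pd.get_dummies(cars, columns=['transmission'])
print(encoded)
```

Each row has exactly one active indicator among the transmission columns.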

My theoretical model is currently specified as (simplified):

$$\text{price} = \beta_0 + \beta_1\,\text{kilometers} + \beta_2\,\text{vehicle\_year} + \beta_3\,\text{model} + \dots + \varepsilon$$

In a regression context, this means that all car models depreciate at a rate determined by the coefficient associated with $\text{kilometers}$. However, one would expect each model to lose value at a different rate (e.g. luxury, fuel-efficient or low-maintenance cars would preserve their value for longer). We can implement this by interacting the features $\text{model}$ and $\text{vehicle\_year}$, that is, $\text{model} \times \text{vehicle\_year}$. The downside of this approach is that it may lead to overfitting.
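One possible way to build such interaction features, shown here as a sketch on invented data and not as part of the pipeline used later, is to multiply each model dummy by vehicle_year. The column suffix _x_year is a made-up naming choice.

```python
import pandas as pd

# invented sample: two car models observed in two years
cars = pd.DataFrame({'model': ['onix', 'gol', 'onix', 'gol'],
                     'vehicle_year': [2015, 2015, 2018, 2018]})

# one indicator column per car model
dummies = pd.get_dummies(cars['model'], prefix='model').astype(float)

# multiply each indicator by the year: the column equals the year for that model and 0 otherwise
interactions = dummies.mul(cars['vehicle_year'], axis=0).add_suffix('_x_year')
print(interactions)
```

A linear model fitted on these columns estimates one year coefficient per car model, at the cost of one extra column for every model in the data.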

Let’s load the scikit-learn methods:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.model_selection import train_test_split
from sklearn import metrics

# disable standard scaler data conversion warning
import warnings
from sklearn.exceptions import DataConversionWarning
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

X = df.drop(['price','address_city_name'], axis=1).copy()
y = df.price.values.copy() # prices in R$

Next, I split the data into training and test sets. Training data will be used to fit the model, and the test set to assess model accuracy.

# generate train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In addition, we can standardize our numerical features with StandardScaler, which centers each feature on its mean and scales it to unit variance. Standardization can improve the accuracy of many models, especially distance-based and regularized linear ones. On the other hand, coefficients such as those of a linear regression are then expressed in standard deviations of each feature rather than in its original units, which makes them less direct to interpret. Fortunately, we can still use the predict method to obtain price predictions. This step is done inside a pipeline, after the train/test split, to avoid data leakage.

# categorical features
cat_ft = ['seller_car_dealer', 'transmission', 'brand', 'fuel_type', 'model', 'traction_control']

# run once on full dataset to get all category values
temp = ColumnTransformer([('cat', OneHotEncoder(), cat_ft)]).fit(X)
cats = temp.named_transformers_['cat'].categories_

cat_tr = OneHotEncoder(categories=cats, sparse_output=False) # the argument was named sparse before scikit-learn 1.2

# numerical features
num_ft = ['location_latitude', 'location_longitude', 'engine_displacement', 'doors', 'kilometers', 'vehicle_year']
num_tr = StandardScaler()
#num_tr = FunctionTransformer(func=None)

# data transformer
data_tr = ColumnTransformer([('num', num_tr, num_ft), ('cat', cat_tr, cat_ft)])
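An alternative to collecting the category values from the full dataset is to let the encoder ignore categories it did not see during fitting. The sketch below, on invented data, fits a small ColumnTransformer on training rows only. It is not the configuration used in this notebook. sparse_threshold=0 is set only to force a dense array that is easy to print.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# invented training and test rows; 'volvo' never appears in training
train = pd.DataFrame({'kilometers': [10.0, 50.0, 90.0],
                      'brand': ['fiat', 'ford', 'fiat']})
test = pd.DataFrame({'kilometers': [50.0], 'brand': ['volvo']})

tr = ColumnTransformer([
    ('num', StandardScaler(), ['kilometers']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['brand']),
], sparse_threshold=0)

train_t = tr.fit_transform(train)   # mean, variance and categories come from training data only
test_t = tr.transform(test)         # the unseen brand is encoded as all zeros
print(train_t)
print(test_t)
```

The trade-off is that an unseen category carries no information at prediction time, whereas fixing the categories up front keeps a column for it.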

Model comparison

There are many techniques that can be used in a regression problem. As a starting point, I will compare some of them in terms of scores and error metrics:

from sklearn import linear_model
from sklearn.neighbors import KNeighborsRegressor
from sklearn.dummy import DummyRegressor
from sklearn import ensemble

models = [DummyRegressor(),
          KNeighborsRegressor(),
          #linear_model.LinearRegression(), # omitted because of negative scores
          linear_model.Lasso(),
          linear_model.Ridge(),
          linear_model.ElasticNet(),
          ensemble.GradientBoostingRegressor(),
          ensemble.RandomForestRegressor(),
          ensemble.ExtraTreesRegressor()]

models_names = ['Dummy','K-nn','Lasso','Ridge','Elastic','Boost','Forest','Extra']
scores = []
mse = []
mae = []

for model in models:
    pipe = Pipeline([
        ('features', data_tr),
        ('model', model)
    ])
    # fit once and reuse the test-set predictions for all three metrics
    y_pred = pipe.fit(X_train, y_train).predict(X_test)
    scores.append(metrics.r2_score(y_test, y_pred))
    mse.append(metrics.mean_squared_error(y_test, y_pred))
    mae.append(metrics.median_absolute_error(y_test, y_pred)) # median (not mean) absolute error

Let’s take a look at the model scores (higher is better), mean squared error and median absolute error (lower is better):

f, (ax1, ax2, ax3) = plt.subplots(ncols=3, sharex=True, sharey=False, figsize=(18,6))
ax1.bar(models_names, scores)
ax1.set_ylabel('$R^2$')
ax2.bar(models_names, mse)
ax2.set_ylabel('Mean squared error')
ax3.bar(models_names, mae)
ax3.set_ylabel('Median absolute error')
[Figure: bar charts of R², mean squared error and median absolute error for each model]

All three metrics are calculated from the same test-set residuals, so they are correlated. Most models scored around 0.8 on the test set; the exception is the elastic net, which only reached about 0.5. Gradient boosting, random forest and extra trees had the best scores. However, scores should be read with caution, as they are biased towards larger models. Mean squared error (MSE) is a metric that penalizes larger prediction errors more heavily. Last is the median absolute error (stored as mae in the code, not to be confused with the mean absolute error). Absolute error measures are interesting because they express the typical deviation in the units of our target (car prices, in our case).
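To make the three metrics concrete, here is a small worked example on invented prices (in thousands). The residuals are 2, -2, 3 and 0, so the squared errors sum to 17 and the total sum of squares around the mean of 25 is 500.

```python
from sklearn import metrics

# invented true and predicted prices
y_true = [10, 20, 30, 40]
y_hat = [12, 18, 33, 40]

r2 = metrics.r2_score(y_true, y_hat)                   # 1 - 17/500
mse = metrics.mean_squared_error(y_true, y_hat)        # 17/4
medae = metrics.median_absolute_error(y_true, y_hat)   # median of [2, 2, 3, 0]
print(r2, mse, medae)
```

The single error of 3 accounts for more than half of the MSE, while the median absolute error is unaffected by how large the biggest error is.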

I’ll stick with the random forest regressor, as it was less prone to overfitting in previous tests.

Predictions

Now, we fit the chosen model and extract some predictions:

mod = Pipeline([
    ('features', data_tr),
    ('model', ensemble.RandomForestRegressor(n_estimators=100, min_samples_leaf=1))])

mod_fit = mod.fit(X_train, y_train)

mod_fit.score(X_test,y_test)

Some example predictions from the model:

pred = pd.DataFrame.from_dict({'predicted':mod_fit.predict(X_test), 'true':y_test})
pred['difference'] = pred.predicted - pred.true
pred.sample(n=10).round(2)
predicted true difference
2386 36963.40 34990.0 1973.40
144 162293.38 150000.0 12293.38
1175 31838.20 32960.0 -1121.80
2155 29880.00 29900.0 -20.00
1052 44718.00 44500.0 218.00
1759 32580.00 31900.0 680.00
622 27303.20 23900.0 3403.20
2363 40363.60 42900.0 -2536.40
429 32632.40 33900.0 -1267.60
2709 39779.80 33900.0 5879.80

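The average prediction error is around -R$600. Since the difference is computed as predicted minus true, a negative mean indicates that the model underestimates prices on average. The median error, however, is slightly positive (about R$150), so the mean is pulled down by a few large underestimates. The not-so-good news is that the standard deviation of the prediction errors is high, at around R$13,000.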

pred.difference.describe()
count      2922.000000
mean       -632.707264
std       13053.008512
min     -189274.200000
25%       -2489.150000
50%         150.560000
75%        2949.950000
max       80052.000000
Name: difference, dtype: float64

Model tuning

The random forest regressor has a few hyperparameters that can be tuned to improve overall accuracy. I will experiment with n_estimators (the number of trees) and min_samples_leaf (the minimum number of samples required at a leaf node). One might be tempted to test all parameters, but the computational cost grows quickly with the number of parameters, candidate values and cross-validation folds:

from sklearn.model_selection import GridSearchCV

# cross validation folds
folds=3

# set parameter range for grid search
params = {'model__n_estimators': [10, 20, 30, 50, 100], 'model__min_samples_leaf': [1,2]}

grids = GridSearchCV(mod, param_grid=params, cv=folds)
grids_fit = grids.fit(X_train, y_train)

print('Best choice of parameter:', grids_fit.best_params_)
Best choice of parameter: {'model__min_samples_leaf': 1, 'model__n_estimators': 100}
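The cost of a grid search is easy to count in advance. The sketch below uses scikit-learn's ParameterGrid to enumerate the candidate combinations for the same grid and multiplies by the number of folds.

```python
from sklearn.model_selection import ParameterGrid

params = {'model__n_estimators': [10, 20, 30, 50, 100],
          'model__min_samples_leaf': [1, 2]}
folds = 3

n_candidates = len(ParameterGrid(params))   # 5 values x 2 values
n_fits = n_candidates * folds               # one fit per candidate and fold
print(n_candidates, n_fits)
```

With the default refit=True, GridSearchCV performs one additional fit of the best candidate on the whole training set. Adding a third parameter with four values would quadruple the count.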

Validation

First, I will perform a three-fold cross-validation (that is, split the data into three folds and, for each fold in turn, fit the model on the other two and score it on the held-out fold) and take the mean of the scores:

from sklearn.model_selection import cross_validate

mod_cross_val = cross_validate(mod, X, y=y, cv=folds, return_train_score=True)

print('Average train score:', mod_cross_val['train_score'].mean().round(4))
print('Average test score:',  mod_cross_val['test_score'].mean().round(4))
Average train score: 0.9767
Average test score: 0.8178
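When cv is an integer and the estimator is a regressor, scikit-learn uses KFold without shuffling, so the folds are contiguous blocks of rows, not random sub-samples. The sketch below shows the resulting splits on six invented samples.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(6).reshape(-1, 1)   # six invented samples

splits = [(train.tolist(), test.tolist())
          for train, test in KFold(n_splits=3).split(X_demo)]
for train_idx, test_idx in splits:
    print('train:', train_idx, 'test:', test_idx)
```

If the rows are stored in a meaningful order, passing cv=KFold(n_splits=3, shuffle=True, random_state=0) to cross_validate is one way to randomize the folds. The random_state value here is arbitrary.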

Next, the learning curve tells us how model prediction improves as we add training samples (that’s when the machine is learning):

from sklearn.model_selection import learning_curve

learn_tr_size, learn_train_sc, learn_test_sc = learning_curve(mod, X_train, y_train, cv=folds)

# calculate mean over cross-validation folds
learn_train_m = learn_train_sc.mean(axis=1)
learn_test_m  = learn_test_sc.mean(axis=1)

plt.plot(learn_tr_size, learn_train_m)
plt.plot(learn_tr_size, learn_test_m)
plt.title('Learning curve')
plt.xlabel('Training samples')
plt.ylabel('Scores')
plt.legend(['Train score','Test score'])
[Figure: learning curve with train and test scores against the number of training samples]

The learning curve tells us that our model performs much better on training than on test data, although both scores improve as we add samples. Intuitively, this means that the model struggles to generalize to new, unknown cases. Improvements can be made in the following directions:

  • Model tuning: the gap between training and test scores in the learning curve is a sign of overfitting. A solution could be fitting more parsimonious trees, for example by limiting tree depth or raising min_samples_leaf.
  • Feature engineering: in a regression context, having a single coefficient for kilometers means that all cars depreciate at a constant rate. But we know that luxury or fuel-efficient models keep their value for longer. A solution would be the creation of features that are interactions of model and vehicle_year.
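As a rough sketch of the first point, the snippet below compares an unconstrained forest with a more parsimonious one on synthetic data. The dataset, the helper name train_test_scores and the parameter values are assumptions chosen for illustration only. Whether the test score improves on the car data would have to be checked there.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# synthetic noisy regression data standing in for the car dataset
X_syn, y_syn = make_regression(n_samples=400, n_features=5, noise=25.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=0)

def train_test_scores(**forest_kwargs):
    """Fit a forest with the given constraints and return (train R2, test R2)."""
    forest = RandomForestRegressor(n_estimators=50, random_state=0, **forest_kwargs)
    forest.fit(X_tr, y_tr)
    return forest.score(X_tr, y_tr), forest.score(X_te, y_te)

full_train, full_test = train_test_scores(min_samples_leaf=1)
slim_train, slim_test = train_test_scores(min_samples_leaf=10, max_depth=6)

print('unconstrained: train %.3f, test %.3f' % (full_train, full_test))
print('parsimonious:  train %.3f, test %.3f' % (slim_train, slim_test))
```

Constraining the trees lowers the training score by construction. The quantity to watch is whether the gap to the test score narrows without the test score itself dropping.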