Publicado em 29 Jan 2019

Python scikit-learn

In this notebook I will attempt to predict automobile prices using Python and its data analysis and machine learning packages such as pandas and scikit-learn. Forecasting car prices can be useful for businesses in general: insurance companies can use this information to calculate their premia; websites and enterprises can provide estimates even when asking prices are not available in a specific application; enterprises can set contracts where a car’s resale value must be defined a priori with greater information, or determine if a car is overvalued or undervalued with respect to the market.

My data source is Mercado Livre, a widely used e-commerce platform in Brazil. For simplicity, and also to isolate other geographical factors that affect prices, this research restricts to ads in the city of Porto Alegre/RS. I also excluded trucks and minivans from the analysis. While new cars can be announced, most of the cars advertised are used. To publish an ad, the seller must fill information about the car, such as brand, model, mileage, engine power and additional features. It is a common belief that this features can help predict a car’s asking price in the market, and in our methodology I will explore them to improve our predictions (of course, there also will be unobservable factors that affect prices). A popular source for automobile prices in Brazil is the FIPE table. In the table, average prices are calculated from newspaper and web ads. Data from the FIPE table can be improved here, as learning models can benefit from data updates, predict prices for cars with a specific set of features and location.

Along the notebook, I will go through all steps of data analysis, with code and commentary.

Data wrangling

First, let’s import the required packages for this step:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

import numpy as np
import json
import requests

To access data on ads, I’ll make requests to the MercadoLivre’s API. We must build the query URL, using category and location identifiers that can be found in the API documentation site.

# parameters
category = 'MLB1744' # cars and trucks
city     = 'TUxCUFJJT0xkYzM0' # state of rs
limit    = '50'
offset   = [str(i) for i in list(range(50,10000,50))]

# access token
with open('ml_token.txt') as file:  
    token = file.read()

#url = 'https://api.mercadolibre.com/sites/MLB/search?category='+category+'&city='+city+'&limit='+limit+'&offset='+offset+'&access_token='+token

Users are only allowed to fetch 50 items per request - up to 10000 items - and this is controlled by the offset parameter. Using requests, we can download all data:

responses = []
for off in offset:
    url = 'https://api.mercadolibre.com/sites/MLB/search?category='+category+'&state='+city+'&limit='+limit+'&offset='+off+'&access_token='+token
    responses.append(requests.get(url))

respd = [i.json() for i in responses]

The .json method transforms the data into a Python dictionary. Looking at the structure of data, I find that ad entries are stored under key results:

respd[0].keys()

dict_keys(['site_id', 'paging', 'results', 'secondary_results', 'related_results', 'sort', 'available_sorts', 'filters', 'available_filters'])

# tests
data1 = [pd.io.json.json_normalize(i['results'], sep='_') for i in respd]

The result is a list of pandas DataFrames. We can append all datasets using concat:

data1 = pd.concat(data1, sort=False).reset_index(drop=True)
data1.head()

	accepts_mercadopago	address_city_id	address_city_name	address_state_id	address_state_name	attributes	available_quantity	buying_mode	...	shipping_logistic_type	shipping_mode	shipping_store_pick_up	shipping_tags	site_id	stop_time	tags	thumbnail	title
0	True	TUxCQ0NBTjcxNTM	Canoas	TUxCUFJJT0xkYzM0	Rio Grande Do Sul	[{'value_id': '2230581', 'value_name': 'Usado'...	1	classified	...	None	not_specified	False	[]	MLB	2019-02-24T04:00:00.000Z	[poor_quality_thumbnail, poor_quality_picture,...	http://mlb-s1-p.mlstatic.com/757910-MLB2918517...	Hyundai Hb20s 1.6 Comfort Plus 16v Flex 4p Manual
1	True	TUxCQ0VOQzZhYTdi	Encantado	TUxCUFJJT0xkYzM0	Rio Grande Do Sul	[{'value_id': '2230581', 'value_name': 'Usado'...	1	classified	...	None	not_specified	False	[]	MLB	2019-02-21T22:16:47.000Z	[poor_quality_picture, poor_quality_thumbnail,...	http://mlb-s1-p.mlstatic.com/671101-MLB2907376...	Chevrolet Agile Ltz - Fernando Multimarcas
2	True	TUxCQ1BPUjgwZTJl	Porto Alegre	TUxCUFJJT0xkYzM0	Rio Grande Do Sul	[{'source': 1, 'id': 'ITEM_CONDITION', 'name':...	1	classified	...	None	not_specified	False	[]	MLB	2019-02-09T09:34:55.000Z	[good_quality_picture, good_quality_thumbnail,...	http://mlb-s2-p.mlstatic.com/962446-MLB2896708...	Renault Grand Scenic 2002
3	True	TUxCQ0NBWDgxMzcw	Caxias do Sul	TUxCUFJJT0xkYzM0	Rio Grande Do Sul	[{'id': 'ITEM_CONDITION', 'name': 'Condição do...	1	classified	...	None	not_specified	False	[]	MLB	2019-03-19T04:00:00.000Z	[only_html_description, immediate_payment]	http://mlb-s2-p.mlstatic.com/668212-MLB2919195...	Hb20 1.6 Comfort Plus 16v Flex 4p
4	True	TUxCQ1BPUjgwZTJl	Porto Alegre	TUxCUFJJT0xkYzM0	Rio Grande Do Sul	[{'value_name': 'Automática', 'value_struct': ...	1	classified	...	None	not_specified	False	[]	MLB	2019-02-16T04:00:00.000Z	[dragged_visits, good_quality_picture, good_qu...	http://mlb-s1-p.mlstatic.com/968792-MLB2777645...	Volvo Xc90 2.0 T6 Inscription Drive-e 5p

5 rows × 73 columns

In each entry, we have a series of characteristics on the car, such as make, model, asking price, features and location. Next, I will transform the data into a pandas DataFrame for further manipulation. This requires a few steps, as the data is structured in a list of dicts:

cols = ['title','price','address_city_name','attributes', 'id',
       'location_latitude', 'location_longitude',
       'permalink',
       'seller_car_dealer']

data1 = data1.loc[:,cols]
data1.head()

	title	price	address_city_name	attributes	id	location_latitude	location_longitude	permalink	seller_car_dealer
0	Hyundai Hb20s 1.6 Comfort Plus 16v Flex 4p Manual	46990.0	Canoas	[{'value_id': '2230581', 'value_name': 'Usado'...	MLB1167035642	-29.9188	-51.1585	https://carro.mercadolivre.com.br/MLB-11670356...	True
1	Chevrolet Agile Ltz - Fernando Multimarcas	28800.0	Encantado	[{'value_id': '2230581', 'value_name': 'Usado'...	MLB1159937620	-29.2249	-51.8895	https://carro.mercadolivre.com.br/MLB-11599376...	True
2	Renault Grand Scenic 2002	8500.0	Porto Alegre	[{'source': 1, 'id': 'ITEM_CONDITION', 'name':...	MLB1152825359	-30.0346	-51.2177	https://carro.mercadolivre.com.br/MLB-11528253...	False
3	Hb20 1.6 Comfort Plus 16v Flex 4p	44890.0	Caxias do Sul	[{'id': 'ITEM_CONDITION', 'name': 'Condição do...	MLB1167546961	-29.1634	-51.1797	https://carro.mercadolivre.com.br/MLB-11675469...	True
4	Volvo Xc90 2.0 T6 Inscription Drive-e 5p	339950.0	Porto Alegre	[{'value_name': 'Automática', 'value_struct': ...	MLB1165970712	-30.0554	-51.2224	https://carro.mercadolivre.com.br/MLB-11659707...	True

It’s a great step forward, however information on a car’s attributes are still trapped in a list of dicts. Hence any attempts at using the previous json_serialize method will fail. As an example, let’s apply the function to the feature alone:

pd.io.json.json_normalize(data1.attributes[0], sep='_').head()

Now, I will reshape the data in order to have a single row per advertising. Let’s define a helper function:

def reshape_(data):
    return data.loc[:,['id','value_name']].set_index('id').transpose()

We must now loop over samples and then join all data. For loops can be avoided, using Python’s list comprehensions:

df_temp = [reshape_(pd.io.json.json_normalize(i)) for i in data1.attributes]

data2 = pd.concat(df_temp, sort=False).reset_index(drop=True)
del df_temp
data2.tail()

	ITEM_CONDITION	TRANSMISSION	ENGINE_DISPLACEMENT	BRAND	DOORS	FUEL_TYPE	KILOMETERS	MODEL	TRIM	VEHICLE_YEAR	TRACTION_CONTROL
9943	Usado	Manual	999 cc	Volkswagen	5	Gasolina e álcool	100000 km	Fox	1.0 Vht Trend Total Flex 5p	2009	Dianteira
9944	Usado	Manual	999 cc	Ford	5	Gasolina	130000 km	Fiesta	1.0 Street 5p	2004	Dianteira
9945	Usado	Manual	1360 cc	Peugeot	5	Gasolina e álcool	124 km	206	1.4 Presence Flex 5p	2008	Dianteira
9946	Usado	Manual	999 cc	Fiat	5	Gasolina e álcool	95800 km	Palio	1.0 Fire Celebration Flex 5p	2009	Dianteira
9947	Usado	Manual	1598 cc	Ford	5	Gasolina	190000 km	Focus	1.6 Gl 5p	2007	Dianteira

Finally, we’ll merge with the original data:

df = pd.concat([data1,data2], axis=1)

df.head()

	price	address_city_name	location_latitude	location_longitude	seller_car_dealer	transmission	engine_displacement	brand	doors	fuel_type	kilometers	model	vehicle_year	traction_control
0	46990.0	Canoas	-29.918818	-51.158540	True	Manual	1591.0	hyundai	4.0	Gasolina e álcool	72.258	hb20s	2017.0	Dianteira
1	28800.0	Encantado	-29.224850	-51.889533	True	Manual	1389.0	chevrolet	4.0	Gasolina e álcool	74.000	agile	2011.0	Dianteira
3	44890.0	Caxias do Sul	-29.163403	-51.179668	True	Manual	1591.0	hyundai	4.0	Gasolina e álcool	39.293	hb20	2018.0	Dianteira
5	52900.0	Santa Maria	-29.696541	-53.799310	True	Automática	1598.0	volkswagen	4.0	Gasolina e álcool	30.600	crossfox	2015.0	4x2
6	61900.0	Santa Maria	-29.696541	-53.799310	True	Automática	999.0	ford	5.0	Gasolina	9.345	fiesta	2017.0	4x2

Some more data cleaning, to deal with strings and data types:

df.drop(['TRIM','ITEM_CONDITION','title','attributes','id','permalink'], axis=1, inplace=True)
df.columns = map(str.lower, df.columns)

#remove unwanted strings
cols = ['location_latitude', 'location_longitude','engine_displacement','kilometers','vehicle_year','doors']
df.loc[:,cols] = df.loc[:,cols].replace(regex=True,to_replace=r'\D',value=r'')

# replace empty strings with NaN
df.replace({'':np.nan}, inplace=True)

# set brand and model strings to lowercase
df.brand = df.brand.str.lower()
df.model = df.model.str.lower()

# type conversion
df = df.astype({'location_latitude':float, 'location_longitude':float,'engine_displacement':float,
                'kilometers':float,'vehicle_year':float,'doors':float})

df.dtypes

price                  float64
address_city_name       object
location_latitude      float64
location_longitude     float64
seller_car_dealer         bool
transmission            object
engine_displacement    float64
brand                   object
doors                  float64
fuel_type               object
kilometers             float64
model                   object
vehicle_year           float64
traction_control        object
dtype: object

Now, I will attempt to detect outliers in numerical features. We must note that some extreme values are not outliers (i.e. we will have feature kilometers set to zero for new cars). Outlier detection will generate missing values that must be treated later.

df.describe().round(2)

	price	location_latitude	location_longitude	engine_displacement	doors	kilometers	vehicle_year
count	9740.00	9740.00	9740.00	9740.00	9740.00	9740.00	9740.00
mean	47753.99	-29.70	-51.29	1653.57	3.97	57.63	2012.58
std	32267.50	1.17	0.99	528.68	0.76	51.19	4.39
min	8999.00	-33.69	-57.55	994.00	0.00	0.00	1960.00
25%	27900.00	-30.02	-51.20	1368.00	4.00	21.00	2011.00
50%	37900.00	-29.95	-51.18	1598.00	4.00	54.00	2013.00
75%	56040.00	-29.70	-51.14	1975.00	4.00	83.00	2015.00
max	249000.00	-6.93	-37.88	3960.00	7.00	920.00	2019.00

# fix extreme values
df.price.replace({df.price.max(): np.nan}, inplace=True)
df.engine_displacement[df.engine_displacement < 900] = np.nan
df.engine_displacement[df.engine_displacement > 4000] = np.nan
df.kilometers.replace({df.kilometers.max(): np.nan}, inplace=True)
df.vehicle_year.replace({df.vehicle_year.min(): np.nan}, inplace=True)

Finally, we must deal with missing values, since scikit-learn does not accept them. It is often more fruitful to fill in missing values, rather than dropping whole samples.

# find missing values
df.isna().sum()

price                  0
address_city_name      0
location_latitude      0
location_longitude     0
seller_car_dealer      0
transmission           0
engine_displacement    0
brand                  0
doors                  0
fuel_type              0
kilometers             0
model                  0
vehicle_year           0
traction_control       0
dtype: int64

df.transmission       = df.transmission.fillna('Manual')
df.kilometers         = df.kilometers.fillna(df.kilometers.mean())
df.price              = df.price.fillna(df.price.mean())
df.traction_control   = df.traction_control.fillna('Dianteira')
df.location_latitude  = df.location_latitude.fillna(method='ffill')
df.location_longitude = df.location_longitude.fillna(method='ffill')
df.vehicle_year       = df.vehicle_year.fillna(df.vehicle_year.mean())

Features engine_displacement still exhibit lots of missing values. For engine_displacement, my strategy was to fill with the mode (or median) values of features according to each car model. This seems to be a better approach than filling with the average engine size of all cars.

import statistics
def mode_or_median(var):
    try:
        return statistics.mode(var)
    except statistics.StatisticsError: # there may be no unique value
        return np.median(var)

avg_eng = df.groupby('model')['engine_displacement'].apply(mode_or_median)
avg_eng.fillna(avg_eng.median(),inplace=True) # fill the remaining NaN's
avg_eng.sample(10)

Now, we will insert the average values only when data were initially missing:

df.set_index('model', drop=False, inplace=True)
df.update(avg_eng, overwrite=False) # overwrite=False only updates NaN's
df.reset_index(drop=True, inplace=True)

Last, I will drop extreme values from our target distribution of car prices. First, because luxury car prices are harder to predict (because of fewer data points), and also because some entries are just wrong (i.e. ads for older cars whose prices have been mistyped).

# keep values within interval
df = df[(df.price > df.price.quantile(0.01)) & (df.price < df.price.quantile(0.99))]

Finally, we have a dataset ready for analysis:

df.head()

	price	address_city_name	location_latitude	location_longitude	seller_car_dealer	transmission	engine_displacement	brand	doors	fuel_type	kilometers	model	vehicle_year	traction_control
0	46990.0	Canoas	-29.918818	-51.158540	True	Manual	1591.0	hyundai	4.0	Gasolina e álcool	72.258	hb20s	2017.0	Dianteira
1	28800.0	Encantado	-29.224850	-51.889533	True	Manual	1389.0	chevrolet	4.0	Gasolina e álcool	74.000	agile	2011.0	Dianteira
3	44890.0	Caxias do Sul	-29.163403	-51.179668	True	Manual	1591.0	hyundai	4.0	Gasolina e álcool	39.293	hb20	2018.0	Dianteira
5	52900.0	Santa Maria	-29.696541	-53.799310	True	Automática	1598.0	volkswagen	4.0	Gasolina e álcool	30.600	crossfox	2015.0	4x2
6	61900.0	Santa Maria	-29.696541	-53.799310	True	Automática	999.0	ford	5.0	Gasolina	9.345	fiesta	2017.0	4x2

Exploratory data analysis

First, let’s check descriptive statistics:

#df.to_csv('~/mlcars.csv', index=False)
df = pd.read_csv('C:/Users/gsalt/mlcars.csv')

df.describe(include='all').round(2)

	price	address_city_name	location_latitude	location_longitude	seller_car_dealer	transmission	engine_displacement	brand	doors	fuel_type	kilometers	model	vehicle_year	traction_control
count	9740.00	9740	9740.00	9740.00	9740	9740	9740.00	9740	9740.00	9740	9740.00	9740	9740.00	9740
unique	NaN	147	NaN	NaN	2	4	NaN	49	NaN	12	NaN	440	NaN	5
top	NaN	Porto Alegre	NaN	NaN	True	Manual	NaN	chevrolet	NaN	Gasolina e álcool	NaN	onix	NaN	Dianteira
freq	NaN	4103	NaN	NaN	9019	6008	NaN	1514	NaN	6910	NaN	280	NaN	7812
mean	47753.99	NaN	-29.70	-51.29	NaN	NaN	1653.57	NaN	3.97	NaN	57.63	NaN	2012.58	NaN
std	32267.50	NaN	1.17	0.99	NaN	NaN	528.68	NaN	0.76	NaN	51.19	NaN	4.39	NaN
min	8999.00	NaN	-33.69	-57.55	NaN	NaN	994.00	NaN	0.00	NaN	0.00	NaN	1960.00	NaN
25%	27900.00	NaN	-30.02	-51.20	NaN	NaN	1368.00	NaN	4.00	NaN	21.00	NaN	2011.00	NaN
50%	37900.00	NaN	-29.95	-51.18	NaN	NaN	1598.00	NaN	4.00	NaN	54.00	NaN	2013.00	NaN
75%	56040.00	NaN	-29.70	-51.14	NaN	NaN	1975.00	NaN	4.00	NaN	83.00	NaN	2015.00	NaN
max	249000.00	NaN	-6.93	-37.88	NaN	NaN	3960.00	NaN	7.00	NaN	920.00	NaN	2019.00	NaN

The describe method is very useful to find strange values left in data, such as found in engine displacement and kilometers.

The correlation matrix:

sns.heatmap(df.corr().round(2), annot=True)

<matplotlib.axes._subplots.AxesSubplot at 0x285fae7c208>

png

The most popular cars announced and the average price:

df.groupby(['brand','model']).agg({'price':['mean','count']}).sort_values(('price','count'), ascending=False)[:10]

		price
		mean	count
brand	model
chevrolet	onix	40334.564250	280
fiat	palio	25952.777778	270
renault	sandero	32828.739837	246
volkswagen	gol	23750.466667	240
ford	fiesta	32778.090909	231
volkswagen	fox	31756.559633	218
fiat	uno	27420.529126	206
ford	ecosport	48480.684211	190
citroën	c3	32972.803371	178
ford	ka	32543.722543	173

The same query as before, now considering the manufacturing year of the cars:

df.groupby(['brand','model','vehicle_year']).agg({'price':['mean','count']}).sort_values(('price','count'), ascending=False)[:10]

			price
			mean	count
brand	model	vehicle_year
fiat	mobi	2018.0	33558.644068	59
ford	ka	2018.0	39714.827586	58
chevrolet	onix	2015.0	39301.210526	57
chevrolet	onix	2018.0	39498.392857	56
ford	fiesta	2014.0	33070.696429	56
jeep	renegade	2016.0	75305.489796	49
chevrolet	onix	2016.0	42870.638085	47
fiat	uno	2018.0	32799.565217	46
chevrolet	onix	2017.0	43780.888889	45
renault	sandero	2018.0	37992.222222	45

Here are the most expensive brands, by average price:

df.pivot_table(index=['brand'], values=['price'], aggfunc=np.mean).round(2).sort_values('price', ascending=False)[:10].plot.bar()

<matplotlib.axes._subplots.AxesSubplot at 0x285fb184828>

png

How cars are located geographically:

import cartopy.crs as ccrs
from cartopy.io import shapereader

kw = dict(resolution='50m', category='cultural',
          name='admin_1_states_provinces')

states_shp = shapereader.natural_earth(**kw)
shp = shapereader.Reader(states_shp)

subplot_kw = dict(projection=ccrs.PlateCarree())

fig, ax = plt.subplots(figsize=(7, 11),
                       subplot_kw=subplot_kw)
ax.set_extent([-57.5,-49.5, -34,-27])
ax.add_geometries(shp.geometries(), ccrs.PlateCarree(), facecolor='lightgreen')
ax.scatter(df.location_longitude, df.location_latitude, marker='.', c='green', zorder=2)

<matplotlib.collections.PathCollection at 0x285869b1f98>

png

The relationship between continuous features are key in our dataset. Are they linear? This is important as most models assume a linear relationship between features and the target.

sns.pairplot(df[['price','kilometers','engine_displacement','vehicle_year']], diag_kind='kde')

C:\Users\gsalt\Anaconda3\lib\site-packages\scipy\stats\stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval

<seaborn.axisgrid.PairGrid at 0x285885ec6d8>

png

Model tests

First, let’s identify the scope of our problem. Our target, price, is continuous, so we’re dealing with a regression problem. In this section I will experiment with different regression techniques. But first, the dataset must be further processed: our categorical features must be translated into numeric. This can be done in pandas with get_dummies (or in scikit-learn with OneHotEncoder): it transforms a categorical feature into dummy variables that indicate which category an observation belongs.’

My theoretical model is currently specified as (simplified):

\[price_i = model_i + year_i + features_i + error_i\]

In a regression context, this means that all auto models depreciate at a rate determined by the coefficient associated with $year$. However, one would expect each model to lose value at a different rate (i.e. luxury, fuel-efficient or low-maintenance cars would preserve their value for longer). We can implement this by interacting features $model$ and $year$, that is, $price_i = modelyear_i + features_i + error_i$. The downside of this approach is that it may lead to overfitting.

Let’s load the scikit-learn methods:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.model_selection import train_test_split
from sklearn import metrics

# disable standard scaler data conversion warning
import warnings
from sklearn.exceptions import DataConversionWarning
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

X = df.drop(['price','address_city_name'], axis=1).copy()
y = df.price.values.copy() # prices in R$

Next, I split the data in both training and data sets. Training data will be used to fit the model, and the test set to assess model accuracy.

# generate train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In addition, we can perform standardization (centering of data around the average, also called scaling) of our numerical features using StandardScaler. Standardization can improve the accuracy of most models, on the other hand we lose interpretation of model coefficients such as the ones given by linear regression. Fortunately, we can use the predict method to give us model predictions. This step will be done inside a pipeline, after training and test data splits, to avoid data leakage.

# categorical features
cat_ft = ['seller_car_dealer', 'transmission', 'brand', 'fuel_type', 'model', 'traction_control']

# run once on full dataset to get all category values
temp = ColumnTransformer([('cat', OneHotEncoder(), cat_ft)]).fit(X)
cats = temp.named_transformers_['cat'].categories_

cat_tr = OneHotEncoder(categories=cats, sparse=False)

# numerical features
num_ft = ['location_latitude', 'location_longitude', 'engine_displacement', 'doors', 'kilometers', 'vehicle_year']
num_tr = StandardScaler()
#num_tr = FunctionTransformer(func=None)

# data transformer
data_tr = ColumnTransformer([('num', num_tr, num_ft), ('cat', cat_tr, cat_ft)])

Model comparison

There are a lot of techniques that can be used in a regression problem. As a starting point, I will compare some of them in terms of scores and sum of errors:

from sklearn import linear_model
from sklearn.neighbors import KNeighborsRegressor
from sklearn.dummy import DummyRegressor
from sklearn import ensemble

models = [DummyRegressor(),
          KNeighborsRegressor(),
          #linear_model.LinearRegression(), # ommited because of negative scores
          linear_model.Lasso(),
          linear_model.Ridge(),
          linear_model.ElasticNet(),
          ensemble.GradientBoostingRegressor(),
          ensemble.RandomForestRegressor(),
          ensemble.ExtraTreesRegressor()]

models_names = ['Dummy','K-nn','Lasso','Ridge','Elastic','Boost','Forest','Extra']

scores = []
mse = []
mae = []

for model in models:
    pipe = Pipeline([
    ('features', data_tr),
    ('model', model)
    ])
    fits = pipe.fit(X_train,y_train)
    scores.append(metrics.r2_score(y_test, fits.predict(X_test)))
    mse.append(metrics.mean_squared_error(y_test, fits.predict(X_test)))
    mae.append(metrics.median_absolute_error(y_test, fits.predict(X_test)))

C:\Users\gsalt\Anaconda3\lib\site-packages\sklearn\linear_model\coordinate_descent.py:492: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems.
  ConvergenceWarning)
C:\Users\gsalt\Anaconda3\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
C:\Users\gsalt\Anaconda3\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)

Let’s take a look at the model $R^2$ scores (more is better), mean squared error and median absolute error (less is better):

f, (ax1, ax2, ax3) = plt.subplots(ncols=3, sharex=True, sharey=False, figsize=(18,6))
ax1.bar(models_names, scores)
ax1.set_ylabel('$R^2$')
ax2.bar(models_names, mse)
ax2.set_ylabel('Mean squared error')
ax3.bar(models_names, mae)
ax3.set_ylabel('Median absolute error')

Text(0, 0.5, 'Median absolute error')

png

All model scores are calculated using regression residuals, so they are correlated. The $R^2$ score in most models performed at around 0.8, except the elastic net model, which peaked at 0.5 in the training set. Gradient boosting, random forest and extra trees had the best scores. However, $R^2$ scores should be taken with precaution as they are biased towards larger models. Mean squared error (MSE) is a metric that penalizes larger prediction errors. Last are the median absolute error (MAE). Absolute error measures are interesting because they inform average deviations in terms of our target (predicted car prices, in our case).

I’ll stick with the random forest regressor, as it was less prone to overfitting in previous tests.

Predictions

Now, we fit the chosen model and extract some predictions:

mod = Pipeline([
    ('features', data_tr),
    ('model', ensemble.RandomForestRegressor(n_estimators=100, min_samples_leaf=1))])

mod_fit = mod.fit(X_train, y_train)

mod_fit.score(X_test,y_test)

Some example predictions from the model:

pred = pd.DataFrame.from_dict({'predicted':mod_fit.predict(X_test), 'true':y_test})
pred['difference'] = pred.predicted - pred.true
pred.sample(n=10).round(2)

	predicted	true	difference
2386	36963.40	34990.0	1973.40
144	162293.38	150000.0	12293.38
1175	31838.20	32960.0	-1121.80
2155	29880.00	29900.0	-20.00
1052	44718.00	44500.0	218.00
1759	32580.00	31900.0	680.00
622	27303.20	23900.0	3403.20
2363	40363.60	42900.0	-2536.40
429	32632.40	33900.0	-1267.60
2709	39779.80	33900.0	5879.80

The average prediction error in car prices is around -R$600. The value being negative means that the model tends to underestimate prices in general. The not-so-good news is that the standard deviation of predictions is pretty high.

pred.difference.describe()

count      2922.000000
mean       -632.707264
std       13053.008512
min     -189274.200000
25%       -2489.150000
50%         150.560000
75%        2949.950000
max       80052.000000
Name: difference, dtype: float64

Model tuning

The random forest regressor have a few hyperparameters that can be tuned to improve overall accuracy. I will play around with n_estimators (the number of estimators/trees) and min_samples_leaf (the minimum number of samples to be in a tree branch). One might be tempted to test all parameters, but the computational costs quickly grow according to the number of parameters, possible values and cross-validation folds:

from sklearn.model_selection import GridSearchCV

# cross validation folds
folds=3

# set parameter range for grid search
params = {'model__n_estimators': [10, 20, 30, 50, 100], 'model__min_samples_leaf': [1,2]}

grids = GridSearchCV(mod, param_grid=params, cv=folds)
grids_fit = grids.fit(X_train, y_train)

print('Best choice of parameter:', grids_fit.best_params_)

Best choice of parameter: {'model__min_samples_leaf': 1, 'model__n_estimators': 100}

Validation

First, I will perform a three-fold cross validation (that is, calculate training and test scores for three random sub-samples of data) and take the mean of scores:

from sklearn.model_selection import cross_validate

mod_cross_val = cross_validate(mod, X, y=y, cv=folds, return_train_score=True)

print('Average train score:', mod_cross_val['train_score'].mean().round(4))
print('Average test score:',  mod_cross_val['test_score'].mean().round(4))

Average train score: 0.9766709555336993
Average test score: 0.8178101080682266

Next, the learning curve tells us how model prediction improves as we add training samples (that’s when the machine is learning):

from sklearn.model_selection import learning_curve

learn_tr_size, learn_train_sc, learn_test_sc = learning_curve(mod, X_train, y_train, cv=folds)

# calculate mean over cross-validation folds
learn_train_m = np.apply_along_axis(np.mean, 1, learn_train_sc)
learn_test_m  = np.apply_along_axis(np.mean, 1, learn_test_sc)

plt.plot(learn_tr_size, learn_train_m)
plt.plot(learn_tr_size, learn_test_m)
plt.title('Learning curve')
plt.xlabel('Training samples')
plt.ylabel('Scores')
plt.legend(['Train score','Test score'])

<matplotlib.legend.Legend at 0x2859d6e8d68>

png

The learning curve tells us that our model performs much better in training than in test data, although both scores improve as we add samples. Intuitively, this means that the model struggles to generalize to new, unknown cases. Improvements can be made in the following senses:

Model tuning: Results from the learning curve are a sign of overfitting. A solution could be fitting a more parsimonious tree.
Feature engineering: in a regression context, having a single coefficient for kilometers means that all cars depreciate at a constant rate. But we know that luxury or fuel-efficient models keep their value for longer. A solution would be the creation of features that are interactions of model and vehicle_year.

Predicting car prices with scikit-learn