Categorical data
Nominal and ordinal features
Ordinal features: categorical values that can be sorted or ordered, e.g. t-shirt size (XL > L > M)
Nominal features: categorical values that don't imply any order, e.g. t-shirt color
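One way to make the distinction explicit in pandas is to store the column as a categorical dtype, ordered for ordinal features and unordered for nominal ones. The sketch below uses made-up example values (the `sizes` and `colors` Series are assumptions for illustration, not the DataFrame used in the rest of this section).

```python
import pandas as pd

# Hypothetical example values, chosen only for illustration
sizes = pd.Series(['M', 'XL', 'L', 'M'])
colors = pd.Series(['green', 'red', 'blue', 'green'])

# Ordinal: declare the order explicitly so comparisons and sorting respect it
size_dtype = pd.CategoricalDtype(categories=['M', 'L', 'XL'], ordered=True)
sizes_ord = sizes.astype(size_dtype)

# Nominal: categories without any order
colors_nom = colors.astype('category')

larger_than_m = (sizes_ord > 'M').tolist()      # element-wise comparison works
sorted_sizes = sizes_ord.sort_values().tolist()  # sorts by M < L < XL, not alphabetically
```

With an unordered categorical such as `colors_nom`, a comparison like `colors_nom > 'green'` raises a `TypeError`, which matches the idea that nominal values have no order.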
import pandas as pd
df = pd.DataFrame([['green', 'M', 10.1, 'class2'],
['red', 'L', 13.5, 'class1'],
['blue', 'XL', 15.3, 'class2']])
df.columns = ['color', 'size', 'price', 'classlabel']
df
Mapping ordinal features
size_mapping = {'XL': 3,
'L': 2,
'M': 1}
df['size'] = df['size'].map(size_mapping)
df
# Inverse mapping
inv_size_mapping = {v: k for k, v in size_mapping.items()}
df['size'].map(inv_size_mapping)
0 M
1 L
2 XL
Name: size, dtype: object
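One caveat with dictionary-based mapping: `Series.map` returns `NaN` for any value that is missing from the dictionary, so an unexpected size silently becomes a missing value. A minimal check, assuming a hypothetical size `'S'` that is not in `size_mapping`, could look like this:

```python
import pandas as pd

size_mapping = {'XL': 3, 'L': 2, 'M': 1}

# 'S' is a hypothetical value that the mapping does not cover
sizes = pd.Series(['M', 'S', 'XL'])
mapped = sizes.map(size_mapping)

# Values without a dictionary entry come back as NaN; list them explicitly
unmapped = sizes[mapped.isna()].unique().tolist()
```

Inspecting `unmapped` before training makes it easier to notice categories that still need an entry in the mapping.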
One-hot encoding nominal features
Note: do not use LabelEncoder for nominal features such as color.
Its integer codes would introduce an order that did not exist originally. Use one-hot encoding instead.
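To see the problem concretely, the sketch below applies LabelEncoder to the three color values. The variable names (`colors`, `color_le`, `codes`) are chosen here for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

colors = np.array(['green', 'red', 'blue'])

color_le = LabelEncoder()
codes = color_le.fit_transform(colors)

# Classes are sorted alphabetically, so blue=0, green=1, red=2
classes = list(color_le.classes_)
```

A model that treats these codes as numbers would assume red > green > blue, which is an artifact of alphabetical sorting and not a property of the data.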
# option1: Pandas (only string columns are encoded; drop_first=True drops one redundant dummy column)
pd.get_dummies(df[['price', 'color', 'size']], drop_first=True)
# option2: scikit-learn
from sklearn.preprocessing import OneHotEncoder
X = df[['color', 'size', 'price']].values
color_ohe = OneHotEncoder()
color_ohe.fit_transform(X[:, 0].reshape(-1, 1)).toarray()
array([[0., 1., 0.],
[0., 0., 1.],
[1., 0., 0.]])
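The scikit-learn call above encodes only the color column and returns it on its own. To encode the color and keep size and price in the same array, one possibility is `ColumnTransformer`. The sketch below rebuilds a small DataFrame with the size column already mapped to integers. The names `ct` and `X_enc` are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Same values as the DataFrame above, with size already mapped to integers
df = pd.DataFrame({'color': ['green', 'red', 'blue'],
                   'size': [1, 2, 3],
                   'price': [10.1, 13.5, 15.3]})

# One-hot encode 'color'; pass 'size' and 'price' through unchanged.
# sparse_threshold=0 forces a dense NumPy array as output.
ct = ColumnTransformer(
    [('onehot', OneHotEncoder(), ['color'])],
    remainder='passthrough',
    sparse_threshold=0)

X_enc = ct.fit_transform(df)
# Column order: color_blue, color_green, color_red, size, price
```

Passing `drop='first'` to `OneHotEncoder` would remove one of the color columns, similar to `drop_first=True` in `pd.get_dummies`.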
Encoding class labels
from sklearn.preprocessing import LabelEncoder
# Label encoding with sklearn's LabelEncoder
class_le = LabelEncoder()
y = class_le.fit_transform(df['classlabel'].values)
y
array([1, 0, 1])
# reverse mapping
class_le.inverse_transform(y)
array(['class2', 'class1', 'class2'], dtype=object)