References:
What is One Hot Encoding? Why And When do you have to use it?
preprocessing categorical features
1. The result of running a set of data through One-Hot Encoding shows clearly what One-Hot Encoding actually does.
The process can be summarized in this one sentence:
This estimator transforms each categorical feature with m possible values into m binary features, with only one active.
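To make the quoted sentence concrete, here is a minimal hand-rolled sketch in plain Python. It is not scikit-learn's implementation; the helper name `one_hot` and the sample values are assumptions chosen for illustration. A feature with m distinct values becomes m binary columns, and each row has exactly one column set to 1.

```python
def one_hot(values):
    """Encode a list of categorical values as binary rows (hypothetical helper)."""
    # The m distinct values, in a fixed (sorted) order; one output column per value.
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}

    rows = []
    for value in values:
        row = [0] * len(categories)   # m binary features, all inactive
        row[index[value]] = 1         # activate only the column for this value
        rows.append(row)
    return categories, rows


# Sample data is made up for this sketch.
categories, encoded = one_hot(["VW", "Acura", "Honda", "Acura"])

print(categories)
for row in encoded:
    print(row)
```

Sorting the categories is only one possible way to fix the column order; any stable ordering works as long as the same one is reused for new data.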
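2. Why One-Hot Encoding is needed
When categories are vectorized, they are encoded as numbers. Because categories have no inherent numeric relationship, the numbers produced by the encoding implicitly impose one on them, as described below: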
Let me explain: What this form of organization presupposes is VW > Acura > Honda based on the categorical values. Say supposing your model internally calculates average, then accordingly we get, 1+3 = 4/2 =2. This implies that: Average of VW and Honda is Acura. This is definitely a recipe for disaster. This model’s prediction would have a lot of errors.
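One-Hot Encoding effectively binarizes the category information: the entry for a category is 1 if the sample belongs to that category and 0 otherwise. This avoids introducing a spurious numeric relationship when encoding categories.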
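The averaging problem from the quote can be reproduced in a few lines. This is only a sketch: the integer codes (VW = 1, Acura = 2, Honda = 3) are an assumption chosen to match the arithmetic in the quote, and the variable names are made up for illustration.

```python
# Assumed integer codes, picked to match the "1 + 3" arithmetic in the quote.
label_codes = {"VW": 1, "Acura": 2, "Honda": 3}

# Averaging the integer codes of VW and Honda lands exactly on Acura's code.
mean_code = (label_codes["VW"] + label_codes["Honda"]) / 2
decoded = [name for name, code in label_codes.items() if code == mean_code]

# The same categories, one-hot encoded (column order: VW, Acura, Honda).
one_hot_codes = {
    "VW": [1, 0, 0],
    "Acura": [0, 1, 0],
    "Honda": [0, 0, 1],
}

# Averaging the one-hot vectors does not land on any single category.
mean_vector = [
    (a + b) / 2 for a, b in zip(one_hot_codes["VW"], one_hot_codes["Honda"])
]

print(mean_code, decoded)
print(mean_vector)
```

With integer codes the average is indistinguishable from a real category, whereas the averaged one-hot vector keeps equal weight on VW and Honda and none on Acura, so no ordering between the brands is implied.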