What makes a good feature

import numpy as np
import matplotlib.pyplot as plt

greyhounds = 500  # 500 greyhounds
labs = 500  # 500 Labradors

# Greyhounds are taller than Labradors on average
grey_height = 28 + 4 * np.random.randn(greyhounds)
lab_height = 24 + 4 * np.random.randn(labs)

# Stacked histogram: red = greyhounds, blue = Labradors
plt.hist([grey_height, lab_height], stacked=True, color=['r', 'b'])
plt.show()

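The figure above tells us the following. When height is below 20, we can predict with high probability that the dog is a Labrador. When height is above 35, we can be fairly confident that it is a greyhound. When height falls between these two values, however, the probabilities of the two breeds are very close. So height is a useful feature, but not a perfect one.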
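To put rough numbers on this reading of the histogram, the following is a minimal sketch that counts, within a height range, what fraction of simulated dogs are greyhounds. The fixed seed, the helper name `greyhound_fraction`, and the chosen ranges are assumptions made for illustration; they are not part of the original example.

```python
import numpy as np

# Fixed seed so the sketch is reproducible (an assumption, not in the original).
rng = np.random.default_rng(0)
grey_height = 28 + 4 * rng.standard_normal(500)
lab_height = 24 + 4 * rng.standard_normal(500)


def greyhound_fraction(low, high):
    """Fraction of simulated dogs with low <= height < high that are greyhounds."""
    n_grey = np.sum((grey_height >= low) & (grey_height < high))
    n_lab = np.sum((lab_height >= low) & (lab_height < high))
    total = n_grey + n_lab
    return n_grey / total if total else float("nan")


# Illustrative ranges: short dogs, the ambiguous middle, tall dogs.
for low, high in [(0, 20), (24, 28), (35, 100)]:
    frac = greyhound_fraction(low, high)
    print(f"height in [{low}, {high}): fraction of greyhounds = {frac:.2f}")
```

With these settings the fraction should be low for short dogs, high for tall dogs, and close to one half in the middle range, which is where height alone cannot separate the two breeds.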

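This is why machine learning almost always needs multiple features. If a single feature were enough, we could just write if-else rules instead of training a classifier.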

Independent features are best

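Independent features give you different kinds of information. Consider the example above with two features: height measured in centimetres and height measured in inches. These two features are highly correlated. Removing highly correlated features from the training data is good practice, because many classifiers are not smart enough to realize that height in centimetres and height in inches are the same thing. As a result, the classifier may double-count the importance of height.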
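One way to spot such redundancy is to look at the correlation matrix of the features. The sketch below assumes the simulated heights are in inches and derives a centimetre column from them; the 0.95 threshold and the greedy keep-or-drop loop are illustrative choices, not a prescribed method.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, assumed for reproducibility

# Assumption: heights are in inches; the second column is the same
# measurement converted to centimetres.
height_in = 26 + 4 * rng.standard_normal(1000)
height_cm = height_in * 2.54

features = np.column_stack([height_in, height_cm])
corr = np.corrcoef(features, rowvar=False)  # 2 x 2 correlation matrix
print("correlation(inches, cm) =", corr[0, 1])

# Greedily keep a column only if it is not highly correlated
# with any column already kept (threshold is a hypothetical choice).
threshold = 0.95
keep = [0]
for j in range(1, features.shape[1]):
    if all(abs(corr[j, k]) < threshold for k in keep):
        keep.append(j)

reduced = features[:, keep]
print("columns kept:", keep, "-> shape", reduced.shape)
```

Because the centimetre column is an exact rescaling of the inch column, the correlation is 1 up to floating-point error and only one of the two columns survives.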

Features should be easy to understand (simpler relationships are easier to learn)

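For example, suppose you want to predict how many days it takes to mail a letter from one city to another. The farther apart the two cities are, the longer it takes. A useful feature is the distance between the two cities in miles. A worse feature is the position of each city given as latitude and longitude, because the classifier would have to learn the more complicated relationship between coordinates and delivery time on its own.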
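If only coordinates are available, the simpler feature can be derived before training. The sketch below uses the haversine formula to turn two latitude/longitude pairs into a great-circle distance in miles. The function name `haversine_miles`, the Earth radius constant, and the sample coordinates are assumptions for illustration; real mail routes do not follow great circles, so this is only an approximation of the distance feature described above.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # approximate mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# Approximate coordinates, used purely as sample input.
new_york = (40.7128, -74.0060)
los_angeles = (34.0522, -118.2437)

distance = haversine_miles(*new_york, *los_angeles)
print(f"distance feature: {distance:.0f} miles")
```

A single distance column gives the classifier a relationship that is close to monotonic (farther means slower), instead of four raw coordinates whose connection to delivery time is much harder to learn.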

reference

What makes a good feature? - machine learning recipes # 3

histogram_demo_multihist.py - example code worth consulting when doing EDA
