“Always be the worst guy in every band you’re in, so you can learn. The people around you affect your performance. Choose your crowd wisely.”
—— Chad Fowler
His point is a simple truth about your level of talent relative to the people on your team. If you’re the least talented, you can’t help but play up toward the level of the more talented; conversely, the more talented players drift down toward the less talented ones. In a group setting, you inevitably become more like the people you play with.
Most of the time, though, I work alone. By Chad’s logic, that means I’m standing still, or worse, going backwards.
As a musician, I’ve NEVER been anything but the least talented player in the group. The people I play with are generally VERY good. I’m by no means good these days, but my playing has improved dramatically over the last few years because of it.
Today, I’m going to find something to get involved with where I’m the least talented person in the room. Hopefully I won’t break anything.
“Why shouldn’t I just use ordinary least squares?”
Applying OLS directly to a binary outcome gives the linear probability (LP) model:
Y = a + BX + e
where
Y is a dummy dependent variable (= 1 if the event happens, = 0 if it doesn’t),
a is the coefficient on the constant term,
B is the coefficient(s) on the independent variable(s),
X is the independent variable(s), and
e is the error term.
Use of the LP model generally gives the correct answers in terms of the sign and significance level of the coefficients; it is the predicted probabilities that usually cause trouble. There are three problems with using the LP model:
1. The error terms are heteroskedastic (heteroskedasticity: the variance of the error term differs across values of the independent variables): var(e) = p(1 − p), where p is the probability that EVENT = 1. Since p depends on X, the classical regression assumption that the error variance does not depend on the Xs is violated.
2. e is not normally distributed, because Y takes on only two values: for a given X, e must equal either 1 − a − BX (when Y = 1) or −a − BX (when Y = 0). A two-point distribution cannot be normal, which violates another classical regression assumption. (This is distinct from problem 1: there the issue is that the variance of e changes with X; here it is the shape of e’s distribution.)
3. The predicted probabilities can be greater than 1 or less than 0, which is a problem if the predicted values are used in a subsequent analysis. Some people try to solve this by clipping: probabilities greater than 1 (less than 0) are set equal to 1 (0). This amounts to interpreting a high probability of the event (non-event) occurring as a sure thing. Put differently, the LP model is equally sensitive over the entire real line, while probabilities must lie in [0, 1].
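Problem 3 is easy to reproduce. A minimal sketch in plain Python (the data here is invented purely for illustration): fit the LP model by OLS on a dummy dependent variable and inspect the fitted values.

```python
# Fit the linear probability model Y = a + B*X + e by OLS on a
# binary outcome, then show fitted "probabilities" escaping [0, 1].
# The data below is made up for illustration only.
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]   # dummy dependent variable

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# OLS slope and intercept for a single regressor
B = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
    sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - B * x_bar

fitted = [a + B * xi for xi in x]
print(min(fitted), max(fitted))  # the extremes fall outside [0, 1]
```

On this sample the fitted value at x = 0 is negative and the one at x = 9 exceeds 1, exactly the out-of-range behavior described above.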
The logistic regression model
The “logit” model solves these problems:
ln[p/(1-p)] = a + BX + e or
[p/(1-p)] = exp(a + BX + e)
ln is the natural logarithm, i.e. the logarithm to base exp, where exp = 2.71828…,
p is the probability that the event Y occurs, p(Y=1)
p/(1-p) is the “odds ratio” (the odds of the event),
ln[p/(1-p)] is the log odds ratio, or “logit” (the logit function), and
all other components of the model are the same.
The logistic regression model is simply a non-linear transformation of the linear regression. The “logistic” distribution is an S-shaped distribution function, similar to the standard normal distribution (which yields the probit regression model) but easier to work with in most applications: the probabilities are easier to calculate. The logistic form constrains the estimated probabilities to lie between 0 and 1. In other words, the LR model wraps the linear predictor in a logistic (sigmoid) function (the formula below), producing an S-shaped curve whose predictions are confined to (0, 1).
For instance, the estimated probability is:
p = σ(a + BX) = 1/[1 + exp(-a - BX)]
With this functional form:
if you let a + BX = 0, then p = .50 (the model cannot discriminate between the two outcomes);
as a + BX gets really big, p approaches 1 (the event occurs);
as a + BX gets really small (very negative), p approaches 0.
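The three cases above can be checked numerically; a small sketch (the helper name `sigmoid` is mine, not from the original text):

```python
import math

def sigmoid(z):
    # p = 1 / (1 + exp(-z)), where z plays the role of a + B*X
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))      # 0.5: no discrimination
print(sigmoid(10))     # close to 1: the event occurs
print(sigmoid(-10))    # close to 0: the event does not occur
```

Note also the symmetry σ(z) + σ(−z) = 1, which is why swapping the event and non-event labels simply flips the predicted probability.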
The likelihood function of a random sample is defined as its joint pdf:
L(θ) = L(θ; x1, x2, …, xn) = f(x1, x2, …, xn; θ)
With the sample x1, x2, …, xn held fixed, the value L(θ; x1, x2, …, xn) is called the likelihood at θ.
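For logistic regression specifically, each observation is Bernoulli with success probability p_i, so the joint pdf factors and the log-likelihood is a sum. A minimal sketch of evaluating it (the data and parameter values are invented for illustration):

```python
import math

def log_likelihood(a, B, xs, ys):
    # ln L = sum_i [ y_i * ln(p_i) + (1 - y_i) * ln(1 - p_i) ],
    # where p_i = 1 / (1 + exp(-(a + B * x_i)))  (Bernoulli joint pdf)
    total = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(a + B * x)))
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

xs = [0, 1, 2, 3, 4, 5]        # invented sample
ys = [0, 0, 1, 0, 1, 1]
# A slope that tracks the data should score higher than a flat
# model (a = B = 0, i.e. p = 0.5 for every observation).
print(log_likelihood(0.0, 0.0, xs, ys))   # equals 6 * ln(0.5)
print(log_likelihood(-2.0, 0.8, xs, ys))  # larger (less negative)
```

Maximum likelihood estimation picks the (a, B) that maximize this function; the comparison printed above shows the likelihood ordering that the optimizer exploits.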
3. The ROC curve (Receiver Operating Characteristic), which shares some similarities with the PR (precision–recall) curve
The ROC curve evaluates how a binary classifier’s quality changes as the decision threshold varies, and can be used to pick the optimal threshold. Its vertical axis is TPR (also called “sensitivity”) and its horizontal axis is FPR (equivalently, 1 − specificity); each threshold corresponds to one point on the curve. TPR can also be read as benefit and FPR as cost, so the ROC curve also describes a trade-off between benefit and cost. TPR and FPR are computed from the confusion matrix above. The better the classifier, the closer its points sit to the top-left corner of ROC space (larger TPR, smaller FPR); the point (0, 1) is called a perfect classification. A completely random guess gives a point along the diagonal (the so-called line of no discrimination) from the bottom left to the top right. An intuitive example of random guessing is deciding by flipping a coin; as the sample size increases, such a classifier’s ROC point migrates towards (0.5, 0.5). The diagonal divides ROC space: points above it represent good classification results (better than random), points below it poor results (worse than random). Note that the output of a consistently poor predictor can simply be inverted to obtain a good predictor. Mathematically, flipping every prediction turns each true positive into a false negative and each false positive into a true negative, so the point (FPR, TPR) maps to (1 − FPR, 1 − TPR), its reflection through the center (0.5, 0.5), which moves a below-diagonal point above the diagonal.
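The inversion argument can be sketched directly from the confusion-matrix counts (the labels and predictions below are invented for illustration):

```python
# Inverting a poor classifier: flipping every prediction maps its
# ROC point (FPR, TPR) to (1 - FPR, 1 - TPR).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
preds  = [0, 0, 0, 1, 1, 1, 1, 0]   # a consistently poor predictor

def roc_point(labels, preds):
    # Confusion-matrix counts at a fixed threshold
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    return fp / (fp + tn), tp / (tp + fn)   # (FPR, TPR)

fpr, tpr = roc_point(labels, preds)
inv_fpr, inv_tpr = roc_point(labels, [1 - p for p in preds])
print((fpr, tpr))          # below the diagonal: TPR < FPR
print((inv_fpr, inv_tpr))  # above the diagonal: TPR > FPR
```

Here the poor predictor sits at (0.75, 0.25) and its inversion at (0.25, 0.75), illustrating the reflection through (0.5, 0.5).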
The Akaike information criterion (AIC) is a standard for measuring the goodness of fit of a statistical model. Built on the concept of entropy, it trades off the complexity of the estimated model against how well the model fits the data: AIC = 2k - 2 ln(L̂), where k is the number of estimated parameters and L̂ is the maximized value of the likelihood; smaller AIC is better.
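A worked example of the complexity/fit trade-off, using the formula AIC = 2k - 2 ln(L̂) (the log-likelihood values here are invented for illustration):

```python
def aic(k, log_likelihood):
    # AIC = 2k - 2 * ln(L-hat): penalizes parameters, rewards fit
    return 2 * k - 2 * log_likelihood

# Invented comparison: a 3-parameter model that fits only slightly
# better than a 2-parameter one does not win on AIC.
print(aic(2, -100.0))   # 204.0
print(aic(3, -99.5))    # 205.0 -> the simpler model is preferred
```

The extra parameter costs 2 on the AIC scale, so the larger model must improve the log-likelihood by more than 1 to be preferred.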