Machine Learning Foundations Week 2 Notes (Coursera)


Learning to Answer Yes/No

Perceptron Hypothesis Set

Example

Approval Problem Revisited

A Simple Hypothesis Set: the 'Perceptron'

  • For $x=(x_1,x_2,...,x_d)$ 'features of customer', compute a weighted 'score' and

    approve credit if $\sum_{i=1}^{d} w_i x_i > \text{threshold}$

    deny credit if $\sum_{i=1}^{d} w_i x_i < \text{threshold}$

  • $y:\{+1(\text{good}),-1(\text{bad})\}$, 0 ignored; the linear formulas $h\in\mathcal{H}$ are

    $h(x)=\text{sign}\left(\left(\sum_{i=1}^{d}w_i x_i\right)-\text{threshold}\right)$

this is called the 'perceptron' hypothesis historically
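A minimal Python sketch of this threshold decision rule (my own illustration, not from the lecture; the feature values, weights, and threshold below are made up):

```python
import numpy as np

def perceptron_decide(x, weights, threshold):
    """h(x) = sign(sum_i w_i * x_i - threshold): +1 = approve, -1 = deny."""
    score = float(np.dot(weights, x))       # weighted 'score' of the customer
    return 1 if score > threshold else -1   # the score == threshold (sign 0) case, ignored in the note, falls to deny here

# toy usage with 3 made-up features and hand-picked weights
x = np.array([0.3, 0.8, 0.1])
print(perceptron_decide(x, weights=np.array([0.5, 1.0, -2.0]), threshold=0.4))
```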

Vector Form of Perceptron Hypothesis

$$\begin{split} h(x)&=\text{sign}\left(\left(\sum_{i=1}^{d}w_i x_i\right)-\text{threshold}\right)\\ &=\text{sign}\left(\left(\sum_{i=1}^{d}w_i x_i\right)+\underbrace{(-\text{threshold})}_{w_0}\cdot\underbrace{(+1)}_{x_0}\right)\\ &=\text{sign}\left(\sum_{i=0}^{d}w_i x_i\right)\\ &=\text{sign}(w^T x) \end{split}$$

  • each 'tall' w represents a hypothesis h and is multiplied with the 'tall' x; we will use the tall versions to simplify notation
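A quick consistency check (my own sketch) that prepending $x_0=+1$ and $w_0=-\text{threshold}$ gives the same decision as the threshold form:

```python
import numpy as np

def h_threshold(x, w, threshold):
    return np.sign(np.dot(w, x) - threshold)

def h_vector(x, w, threshold):
    x_tall = np.concatenate(([1.0], x))         # 'tall' x with x_0 = +1
    w_tall = np.concatenate(([-threshold], w))  # 'tall' w with w_0 = -threshold
    return np.sign(np.dot(w_tall, x_tall))      # sign(w^T x)

x, w = np.array([0.3, 0.8, 0.1]), np.array([0.5, 1.0, -2.0])
assert h_threshold(x, w, 0.4) == h_vector(x, w, 0.4)
```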

Q

what does h look like?

Perceptrons in $R^2$

perceptrons $\Leftrightarrow$ linear (binary) classifiers

Perceptron Learning Algorithm (PLA)

Select g from $\mathcal{H}$

  • want: $g\approx f$ (hard when f unknown)
  • almost necessary: $g\approx f$ on D, ideally $g(x_n)=f(x_n)=y_n$
  • difficult: $\mathcal{H}$ is of infinite size
  • idea: start from some $g_0$, and 'correct' its mistakes on D

Perceptron Learning Algorithm

For t=0,1,…

1. find a mistake of $w_t$, called $(x_{n(t)},y_{n(t)})$:

    $\text{sign}(w_t^T x_{n(t)})\ne y_{n(t)}$

2. (try to) correct the mistake by

    $w_{t+1}\leftarrow w_t+y_{n(t)}x_{n(t)}$

…until no more mistakes

return the last w (called $w_{PLA}$) as g
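A minimal PLA sketch in Python (my own illustration; it assumes X already contains the leading $x_0=+1$ column and y is ±1):

```python
import numpy as np

def pla(X, y, max_updates=10000):
    """Perceptron Learning Algorithm: returns w_PLA with no mistakes on (X, y)
    when the data are linearly separable (otherwise gives up after max_updates)."""
    w = np.zeros(X.shape[1])                         # start from w_0 = 0
    for _ in range(max_updates):
        mistakes = np.where(np.sign(X @ w) != y)[0]  # indices with sign(w^T x_n) != y_n
        if len(mistakes) == 0:                       # no more mistakes -> halt
            break
        n = mistakes[0]                              # a mistake (x_{n(t)}, y_{n(t)})
        w = w + y[n] * X[n]                          # correct it: w <- w + y_n x_n
    return w                                         # last w as g
```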

My question

determinant is like vector

Practical implementation of PLA

Cyclic PLA
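A sketch of the cyclic variant (my own Python illustration): visit the examples in a fixed, or pre-shuffled, cycle, correct any mistake on the spot, and halt after a full pass with no mistakes.

```python
import numpy as np

def cyclic_pla(X, y, shuffle=False, max_passes=1000, seed=0):
    """Cyclic PLA: scan the examples in a fixed (or randomly permuted) order,
    correcting each mistake immediately; halt after a mistake-free full pass."""
    order = np.arange(len(y))
    if shuffle:
        order = np.random.default_rng(seed).permutation(order)
    w = np.zeros(X.shape[1])
    for _ in range(max_passes):
        corrected = False
        for n in order:
            if np.sign(X[n] @ w) != y[n]:     # mistake: sign(w^T x_n) != y_n
                w = w + y[n] * X[n]           # correct it right away
                corrected = True
        if not corrected:                     # a clean pass -> no more mistakes
            break
    return w
```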

Some Remaining issues of PLA

Algorithmic: halt (no mistake)?

  • naive cyclic:??
  • random cyclic:??
  • other variant:??

Learning: $g\approx f$?

  • on D, if halt, yes (no mistake)
  • outside D:??
  • if not halting:??

Guarantee of PLA

Linear Separability

  • if PLA halts (i.e. no more mistakes), then (necessary condition) D allows some w to make no mistakes
  • call such D linearly separable
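A tiny check (my own sketch) of what 'linearly separable' means operationally: some w makes no mistakes on D.

```python
import numpy as np

def separates(w, X, y):
    """True iff sign(w^T x_n) == y_n for every example, i.e. w makes no mistakes on D."""
    return bool(np.all(np.sign(X @ w) == y))
```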

PLA Fact: $w_t$ Gets More Aligned with $w_f$

  • $w_f$ perfect, hence every $x_n$ is correctly away from the line: $y_{n(t)}w_f^T x_{n(t)}\ge\underset{n}{\min}\,y_n w_f^T x_n>0$

  • $w_f^T w_t\uparrow$ by updating with any $(x_{n(t)},y_{n(t)})$

    $$\begin{split} w_f^T w_{t+1}&=w_f^T(w_t+y_{n(t)}x_{n(t)})\\ &\ge w_f^T w_t+\underset{n}{\min}\,y_n w_f^T x_n\\ &> w_f^T w_t+0 \end{split}$$

$w_t$ appears more aligned with $w_f$

Q

length of a vector

PLA Fact: $w_t$ Does Not Grow Too Fast

$w_t$ changed only when mistake

$$\Leftrightarrow \text{sign}(w_t^T x_{n(t)})\ne y_{n(t)}\Leftrightarrow y_{n(t)}w_t^T x_{n(t)}\le 0$$

  • mistake 'limits' $\|w_t\|^2$ growth, even when updating with the 'longest' $x_n$:

    $$\begin{split} \|w_{t+1}\|^2&=\|w_t+y_{n(t)}x_{n(t)}\|^2\\ &=\|w_t\|^2+2y_{n(t)}w_t^T x_{n(t)}+\|y_{n(t)}x_{n(t)}\|^2\\ &\le \|w_t\|^2+0+\|y_{n(t)}x_{n(t)}\|^2\\ &\le \|w_t\|^2+\underset{n}{\max}\|y_n x_n\|^2 \end{split}$$

start from $w_0=0$; after T mistake corrections,

$$\frac{w_f^T w_T}{\|w_f\|\,\|w_T\|}\ge\sqrt{T}\cdot \text{constant}$$
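Filling in the step between the two facts (my own derivation; write $\rho=\underset{n}{\min}\,y_n w_f^T x_n$ and $R^2=\underset{n}{\max}\|x_n\|^2$): starting from $w_0=0$, Fact 1 gives $w_f^T w_T\ge T\rho$ and Fact 2 gives $\|w_T\|^2\le TR^2$, so

$$\frac{w_f^T w_T}{\|w_f\|\,\|w_T\|}\ge\frac{T\rho}{\|w_f\|\cdot\sqrt{T}R}=\sqrt{T}\cdot\underbrace{\frac{\rho}{\|w_f\|\,R}}_{\text{constant}}$$

and since this cosine is at most 1, the number of corrections is bounded: $T\le\frac{R^2\|w_f\|^2}{\rho^2}$.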

Novikoff theorem

ref: page 31

Suppose the training set $T=\{(x_1,y_1),(x_2,y_2),...,(x_N,y_N)\}$ is linearly separable, where $x_i\in\mathcal{X}=R^n$, $y_i\in\mathcal{Y}=\{-1,+1\}$, $i=1,2,...,N$. Then:

(1) There exists a hyperplane $\hat{w}_{opt}\cdot\hat{x}=w_{opt}\cdot x+b_{opt}=0$ with $\|\hat{w}_{opt}\|=1$ that separates the training set completely correctly, and there exists $\gamma>0$ such that for all $i=1,2,...,N$

$$y_i(\hat{w}_{opt}\cdot\hat{x}_i)=y_i(w_{opt}\cdot x_i+b_{opt})\ge\gamma$$

(2) Let $R=\underset{1\le i\le N}{\max}\|\hat{x}_i\|$; then the number of misclassifications k made by the perceptron algorithm on the training set satisfies the inequality

$$k\le\left(\frac{R}{\gamma}\right)^2$$
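A toy numeric check (my own sketch; the data and the 'true' separating weights below are made up) that the number of PLA corrections k stays within the $(R/\gamma)^2$ bound:

```python
import numpy as np

rng = np.random.default_rng(1)
w_true = np.array([-0.3, 1.0, -0.8])                               # made-up separating weights (bias first)
X = np.hstack([np.ones((100, 1)), rng.uniform(-1, 1, (100, 2))])   # x_0 = 1 prepended
y = np.where(X @ w_true >= 0, 1, -1)                               # labels from w_true -> separable by construction

w, k = np.zeros(3), 0
while True:                                                        # plain PLA, counting corrections
    mistakes = np.where(np.sign(X @ w) != y)[0]
    if len(mistakes) == 0:
        break
    n = mistakes[0]
    w, k = w + y[n] * X[n], k + 1

R = np.max(np.linalg.norm(X, axis=1))
gamma = np.min(y * (X @ w_true)) / np.linalg.norm(w_true)          # margin of the unit-normalized w_true
print(k, (R / gamma) ** 2)                                         # k should not exceed (R / gamma)^2
```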

Non-Separable Data

More about PLA

CONS

D may not be 'linearly separable', and we are not fully sure how long halting takes

Learning with Noisy Data

Line with Noise Tolerance

finding the w with the fewest mistakes is NP-hard

Pocket Algorithm

Keep the weights with the fewest mistakes seen so far, until enough iterations have passed
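A minimal Pocket sketch (my own Python illustration, same conventions as the PLA sketch above):

```python
import numpy as np

def pocket(X, y, max_updates=1000, seed=0):
    """Pocket algorithm: run PLA-style corrections, but keep ('pocket') the
    weights with the fewest mistakes seen so far; return the pocketed weights."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    best_w, best_err = w.copy(), int(np.sum(np.sign(X @ w) != y))
    for _ in range(max_updates):
        mistakes = np.where(np.sign(X @ w) != y)[0]
        if len(mistakes) == 0:                # already perfect (separable case)
            return w
        n = rng.choice(mistakes)              # a random mistake (x_n, y_n)
        w = w + y[n] * X[n]                   # PLA-style correction
        err = int(np.sum(np.sign(X @ w) != y))
        if err < best_err:                    # found weights with fewer mistakes
            best_w, best_err = w.copy(), err
    return best_w
```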

Summary

  • Perceptron Hypothesis Set

    hyperplanes/linear classifiers in $\mathcal{R}^d$

  • Perceptron Learning Algorithm(PLA)

    correct mistakes and improve iteratively

  • Guarantee of PLA

    no mistakes eventually if linearly separable

  • Non-Separable Data

    hold somewhat 'best' weights in pocket