Machine learning system design: prioritizing what to work on (spam classification example). Machine Learning

Source: jijun.hansung.ac.kr/ML/docs-slides-Lecture11-kr.pdf · 2016. 10. 5. · Andrew Ng


  • Machine learning system design

    Prioritizing what to work on: spam classification example

    Machine Learning

  • Andrew Ng

    Building a spam classifier

    From: [email protected]

    To: [email protected]

    Subject: Buy now!

    Deal of the week! Buy now!

    Rolex w4tchs - $100

    Med1cine (any kind) - $50

    Also low cost M0rgages

    available.

    From: Alfred Ng

    To: [email protected]

    Subject: Christmas dates?

    Hey Andrew,

    Was talking to Mom about plans

    for Xmas. When do you get off

    work. Meet Dec 22?

    Alf

  • Andrew Ng

    Building a spam classifier

    Supervised learning. x = features of email; y = spam (1) or not spam (0).
    Features x: choose 100 words indicative of spam / not spam.

    From: [email protected]

    To: [email protected]

    Subject: Buy now!

    Deal of the week! Buy now!

    Note: In practice, rather than picking 100 words by hand, take the most
    frequently occurring words (10,000 to 50,000) in the training set.
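The feature construction just described can be sketched as follows. This is a minimal Python illustration (not the course's Octave code); the example emails and the tiny vocabulary size are made up:

```python
from collections import Counter

def build_vocabulary(training_emails, k):
    """Pick the k most frequently occurring words across the training set."""
    counts = Counter(w for email in training_emails for w in email.lower().split())
    return [word for word, _ in counts.most_common(k)]

def feature_vector(email, vocabulary):
    """x_j = 1 if vocabulary word j occurs in the email, else 0."""
    present = set(email.lower().split())
    return [1 if word in present else 0 for word in vocabulary]

emails = ["buy now deal of the week buy now",
          "hey andrew when do you get off work"]
vocab = build_vocabulary(emails, 5)        # in practice k is 10,000-50,000
x = feature_vector("buy this deal now", vocab)
```

Each email thus becomes a fixed-length binary vector, which is the input a supervised learner expects.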

  • Andrew Ng

    Building a spam classifier

    How should you spend your time to make it have low error?

    - Collect lots of data. E.g., the "honeypot" project.
    - Develop sophisticated features based on email routing information
      (from the email header).
    - Develop sophisticated features for the message body. E.g., should
      "discount" and "discounts" be treated as the same word? How about
      "deal" and "Dealer"? Features about punctuation?
    - Develop sophisticated features to detect deliberate misspellings
      (e.g., m0rtgage, med1cine, w4tches).

  • Machine learning system design

    Error analysis

    Machine Learning

  • Andrew Ng

    Recommended approach

    - Start with a simple algorithm that you can implement quickly.
      Implement it and test it on your cross-validation data.
    - Plot learning curves to decide whether more data, more features,
      etc. are likely to help.
    - Error analysis: manually examine the examples (in the cross-validation
      set) that your algorithm made errors on. See whether you can spot any
      systematic trend in the types of examples it gets wrong.
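The "simple algorithm first, then learning curves" advice can be made concrete with a toy loop. The "model" below is just a majority-class baseline (my stand-in, not anything from the lecture); the point is only the mechanics of training on the first m examples and measuring both training and cross-validation error:

```python
def train(labels):
    """Trivial stand-in learner: predict the majority training label."""
    return max(set(labels), key=labels.count)

def err(prediction, labels):
    """Fraction of labels that differ from the constant prediction."""
    return sum(y != prediction for y in labels) / len(labels)

train_y = [1, 1, 0, 1, 1, 0, 1, 0]   # made-up training labels
cv_y    = [1, 0, 1, 0]               # made-up cross-validation labels

for m in range(1, len(train_y) + 1):
    model = train(train_y[:m])
    print(m, err(model, train_y[:m]), err(model, cv_y))
```

Plotting the two error columns against m is exactly the learning curve the slide refers to; with a real learner, the gap between the curves signals whether more data or more features is likely to help.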

  • Andrew Ng

    Error analysis

    500 examples in the cross-validation set; the algorithm misclassifies
    100 emails. Manually examine the 100 errors, and categorize them based on:
    (i) What type of email is it?
    (ii) What cues (features) do you think would have helped the classifier
    classify them correctly?

    Email type: Pharma / Replica/fake / Steal passwords / Other

    Helpful cues: deliberate misspellings (m0rgage, med1cine, etc.) /
    unusual email routing / unusual (spamming) punctuation
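Tallying the manual categorization is trivial but worth writing down; a sketch, where the tags are invented placeholders for the labels you would assign by hand:

```python
from collections import Counter

# Hypothetical manual tags for the misclassified CV emails:
# (email type, cue that might have helped the classifier).
error_tags = [
    ("pharma", "deliberate misspelling"),
    ("pharma", "deliberate misspelling"),
    ("replica/fake", "unusual punctuation"),
    ("steal passwords", "unusual routing"),
]

type_counts = Counter(t for t, _ in error_tags)
cue_counts = Counter(c for _, c in error_tags)
# The largest buckets suggest which features are worth engineering first.
```

If "pharma with deliberate misspellings" dominates the error count, a misspelling feature is a better use of time than, say, routing features.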

  • Andrew Ng

    The importance of numerical evaluation

    Should discount/discounts/discounted/discounting be treated as the same
    word? You can use "stemming" software (e.g., the "Porter stemmer"),
    though it may also conflate distinct words such as universe/university.

    Error analysis will not be helpful for deciding whether this is likely
    to improve performance. The only solution is to try it and see whether
    it works. You need a numerical evaluation (e.g., cross-validation
    error) of the algorithm's performance with and without stemming.

    Without stemming: / With stemming: /
    Distinguish upper vs. lower case (Mom/mom):
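A minimal with/without-stemming comparison might look like this. Here `crude_stem` is a toy suffix stripper standing in for a real Porter stemmer, the keyword classifier is deliberately simplistic, and the three-example CV set is invented:

```python
def crude_stem(word):
    """Toy suffix stripper (NOT the real Porter algorithm)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def classify(email, spam_words, stem=False):
    """Flag the email as spam (1) if any (optionally stemmed) word matches."""
    words = email.lower().split()
    if stem:
        words = [crude_stem(w) for w in words]
        spam_words = {crude_stem(w) for w in spam_words}
    return int(any(w in spam_words for w in words))

cv_set = [("huge discounts on watches", 1),
          ("discounted meds available", 1),
          ("meet for lunch tomorrow", 0)]
spam_words = {"discount"}

cv_error = {}
for stem in (False, True):
    wrong = sum(classify(email, spam_words, stem) != y for email, y in cv_set)
    cv_error[stem] = wrong / len(cv_set)
print(cv_error)
```

Comparing the two cross-validation error numbers is the entire decision procedure the slide advocates: no intuition required, just a single scalar per variant.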

  • Machine learning system design

    Error metrics for skewed classes

    Machine Learning

  • Andrew Ng

    Cancer classification example

    Train a logistic regression model h_θ(x). (y = 1 if cancer, y = 0
    otherwise.) You find that you get 1% error on the test set
    (99% correct diagnoses).

    But only 0.50% of patients actually have cancer.

    function y = predictCancer(x)

    y = 0; %ignore x!

    return
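Why 1% error is unimpressive here: with only 0.5% positives, the ignore-the-input predictor above already scores 99.5% accuracy. A quick Python check (the cohort of 1,000 patients is made up):

```python
def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# 1,000 patients, 0.5% of whom have cancer (y = 1).
labels = [1] * 5 + [0] * 995
always_zero = [0] * len(labels)    # same behavior as predictCancer above

acc = accuracy(always_zero, labels)
print(acc)   # 0.995: "better" than the 1%-error classifier, yet useless
```

This is the motivation for the precision/recall metrics introduced next, which a constant predictor cannot game.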

  • Andrew Ng

    Precision/Recall

    y = 1 in the presence of a rare class that we want to detect.

    Precision: of all patients where we predicted y = 1, what fraction
    actually has cancer?

    Recall: of all patients that actually have cancer, what fraction did we
    correctly detect as having cancer?
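The two definitions translate directly into code; the labels and predictions below are made up:

```python
def precision_recall(preds, labels):
    true_pos = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    predicted_pos = sum(p == 1 for p in preds)
    actual_pos = sum(y == 1 for y in labels)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall

labels = [1, 1, 0, 0, 1, 0]
preds  = [1, 0, 1, 0, 1, 0]
p, r = precision_recall(preds, labels)
```

Note that the always-predict-0 classifier from the previous slide gets recall 0, which exposes it immediately despite its high accuracy.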

  • Machine learning system design

    Trading off precision and recall

    Machine Learning

  • Andrew Ng

    Trading off precision and recall

    Logistic regression: 0 ≤ h_θ(x) ≤ 1.
    Predict 1 if h_θ(x) ≥ 0.5; predict 0 if h_θ(x) < 0.5.

    Suppose we want to predict y = 1 (cancer) only if very confident:
    raise the threshold (higher precision, lower recall).

    Suppose we want to avoid missing too many cases of cancer (avoid false
    negatives): lower the threshold (higher recall, lower precision).

    More generally: predict 1 if h_θ(x) ≥ threshold.

    [Figure: precision (y-axis) vs. recall (x-axis), the curve traced out
    as the threshold varies from 0 to 1.]

    precision = true positives / no. of predicted positives

    recall = true positives / no. of actual positives
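Sweeping the threshold makes the tradeoff visible. The h_θ(x) scores and labels below are invented for illustration:

```python
def precision_recall_at(threshold, scores, labels):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    predicted_pos = sum(preds)
    precision = tp / predicted_pos if predicted_pos else 1.0
    recall = tp / sum(labels)
    return precision, recall

# Hypothetical h_theta(x) outputs on a small CV set, with true labels.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1,   1,   0,   1,   0,   0]

for t in (0.3, 0.5, 0.7):
    p, r = precision_recall_at(t, scores, labels)
    print(t, round(p, 2), round(r, 2))
```

On this data, raising the threshold from 0.3 to 0.7 lifts precision from 0.6 to 1.0 while recall falls from 1.0 to 2/3; plotting (recall, precision) over many thresholds traces the curve in the figure.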

  • Andrew Ng

    F1 Score (F score)

    How do you compare precision/recall numbers?

                   Precision (P)   Recall (R)   Average   F1 Score
    Algorithm 1        0.5            0.4        0.45      0.444
    Algorithm 2        0.7            0.1        0.4       0.175
    Algorithm 3        0.02           1.0        0.51      0.0392

    Average: (P + R) / 2

    F1 Score: 2PR / (P + R)
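The table's last two columns follow directly from the formulas, which also shows why F1 (unlike the plain average) punishes Algorithm 3's near-zero precision:

```python
def f1_score(p, r):
    """2PR / (P + R); low if either precision or recall is near zero."""
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

algorithms = {"Algorithm 1": (0.5, 0.4),
              "Algorithm 2": (0.7, 0.1),
              "Algorithm 3": (0.02, 1.0)}

for name, (p, r) in algorithms.items():
    print(name, (p + r) / 2, round(f1_score(p, r), 4))
```

Algorithm 3's average (0.51) is the highest of the three, yet its F1 (0.0392) is by far the lowest, matching the table and making Algorithm 1 the sensible pick.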

  • Machine learning system design

    Data for machine learning

    Machine Learning

  • Designing a high accuracy learning system

    [Banko and Brill, 2001]

    E.g., classify between confusable words: {to, two, too}, {then, than}.

    For breakfast I ate _____ eggs.

    Algorithms:

    - Perceptron (logistic regression)
    - Winnow
    - Memory-based
    - Naïve Bayes

    “It’s not who has the best algorithm that wins. It’s who has the most data.”

    [Figure: accuracy vs. training set size (millions) for the four
    algorithms.]

  • Useful test: Given the input x, can a human expert confidently predict y?

    Large-data rationale

    Assume the features x contain sufficient information to predict y
    accurately.

    Example: For breakfast I ate _____ eggs.
    Counterexample: predict housing price from only the size (feet²) and no
    other features.

  • Large-data rationale

    Use a learning algorithm with many parameters (e.g. logistic regression/linear regression with many features; neural network with many hidden units).

    Use a very large training set (unlikely to overfit)