[Theme 05] Loss via MLE (Maximum Likelihood Estimation)

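Hello, this is pulluper! 😀 So far we have covered linear regression, logistic regression, and softmax regression. MLE is a concept that comes up again and again when you study machine learning. If the name is familiar but the actual idea is still fuzzy, this post is for you! This time we'll look at the concept of MLE and the properties of the losses derived from it. The outline is as follows.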

Likelihood (Bayes' rule)
MLE
Negative Log Likelihood
Examples

We'll start with the likelihood that appears in Bayes' rule, then cover the concept of MLE, the NLL that turns it into a loss, and examples matched to specific tasks. Let's get started ~ 🎬

1. Likelihood

The concept of likelihood comes from Bayes' rule, which reads as follows.

$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$

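What this means is that a conditional probability can be written as the probability of the intersection divided by the probability of the conditioning event: $P(A|B) = \frac{P(A \cap B)}{P(B)}$. The following figure makes this easy to see. The final form follows from the product rule of probability, $P(A \cap B) = P(B|A)P(A)$.

Now let's attach a meaning to each term of Bayes' theorem.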

$P(\theta|X) = \frac{P(X|\theta)P(\theta)}{P(X)}$

$\theta$ : the hypothesis, i.e., the value we want to estimate. It can be a particular class, or the parameters of a probability distribution.

$X$ : the observation, i.e., the observed data; for example, the training data.

$P(\theta)$ : the prior probability, the distribution we already hold over the value we want to estimate.

$P(X)$ : the marginal probability, or evidence; the distribution of the data X itself.

$P(\theta|X)$ : the posterior probability, the distribution of $\theta$ given X. It is influenced by X.

$P(X|\theta)$ : the likelihood, the probability distribution of X given $\theta$ (that is, under the assumed hypothesis).
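To make the four terms concrete, here is a minimal sketch with made-up numbers. It assumes a hypothetical coin whose heads-probability $\theta$ is either 0.5 or 0.8, a 50/50 prior over those two values, and three independent flips as the data X. None of these numbers come from the post; they are only for illustration.

```python
# Hypothetical setup: theta is the heads-probability of a coin,
# and we only consider two candidate values for it.
thetas = [0.5, 0.8]
prior = {0.5: 0.5, 0.8: 0.5}   # P(theta): assumed 50/50 before seeing data
data = [1, 1, 0]               # X: 1 = heads, 0 = tails (assumed independent flips)

def likelihood(theta, data):
    """P(X | theta): probability of the observed flips under a given theta."""
    p = 1.0
    for x in data:
        p *= theta if x == 1 else (1.0 - theta)
    return p

# P(X): evidence, obtained by summing likelihood * prior over all hypotheses.
evidence = sum(likelihood(t, data) * prior[t] for t in thetas)

# P(theta | X): posterior via Bayes' rule.
posterior = {t: likelihood(t, data) * prior[t] / evidence for t in thetas}

for t in thetas:
    print(f"theta={t}: likelihood={likelihood(t, data):.4f}, posterior={posterior[t]:.4f}")
```

The posterior values sum to 1 because the evidence normalizes them, while the likelihood values on their own do not have to.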

In an earlier post I described likelihood like this:

However, the likelihood can be expressed as a function of the distribution's parameters with the data held fixed. That is, $p(X\,(\text{fixed})|\theta) = \mathcal{L}(\theta|X\,(\text{fixed}))$. It can simply be written as the pdf value for that event.

While writing this post I suddenly got confused about this part myself. If X is fixed, shouldn't $P(\theta|X)$ be the one that means likelihood? 💦💦💦

A similar question helped me sort out the concept:

What is the conceptual difference between posterior and likelihood?

https://stats.stackexchange.com/questions/429819/what-is-the-conceptual-difference-between-posterior-and-likelihood?answertab=scoredesc#tab-top

The answerer explains the meaning of likelihood as follows.

"If we knew that 𝜃 took a particular value, what would be the probability of observing the data we have?"

Looking again at $P(X|\theta)$: $\theta$ is given (conditioned on), not fixed. So we can think of it as the probability of observing a particular X under a particular assumed distribution (a normal distribution, for example); for continuous data this is a density value. For instance, if the observed data X is {2, 3, 4}, the probability (likelihood) that these points came from a normal distribution is as follows.

$P(X|\theta) = P(X|\mu, \sigma)$

If the data points satisfy the IID condition, then

$\begin{aligned}
P(X|\theta) &= P(X|\mu, \sigma) \\
&= P(2|\mu, \sigma) \times P(3|\mu, \sigma) \times P(4|\mu, \sigma) \\
&= \prod_{i=1}^{n} P(x_i|\mu, \sigma) \\
&= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)
\end{aligned}$

So in the end, the likelihood $P(X|\theta)$ above is a function of $\mu, \sigma$ and can be written as "the product of that distribution's PDF values" over the data points. In short, remember that we assume a particular distribution and take the product of its densities. 👏👏👏👏👏
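The product-of-PDFs idea can be sketched in a few lines of Python. This is only an illustration: the function names `normal_pdf` and `likelihood` and the two candidate parameter settings are my own choices, not something from the original derivation.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    coef = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    return coef * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

def likelihood(data, mu, sigma):
    """Likelihood under the IID assumption: product of per-sample densities."""
    result = 1.0
    for x in data:
        result *= normal_pdf(x, mu, sigma)
    return result

data = [2.0, 3.0, 4.0]

# Same data, two different (hypothetical) parameter settings.
L_centered = likelihood(data, mu=3.0, sigma=1.0)
L_shifted = likelihood(data, mu=0.0, sigma=1.0)

print("L(mu=3, sigma=1) =", L_centered)
print("L(mu=0, sigma=1) =", L_shifted)
```

The data stay fixed while $\mu, \sigma$ change, and the setting centered on the data gives the larger likelihood.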

2. MLE (Maximum Likelihood Estimation)

Our goal is to maximize the posterior, because we need to find the $\theta$ that makes the hypothesis $\theta$ most probable given the data X. So how can we maximize the posterior? In Bayes' rule above, $P(X)$ does not depend on $\theta$, and the prior $P(\theta)$ is usually treated as a constant (set to 1). The posterior is therefore proportional to the likelihood.

$P(\theta|X) = \frac{P(X|\theta)P(\theta)}{P(X)} \propto P(X|\theta)$

To compute the posterior exactly we need to know the prior, and that is hard. MLE therefore approximates maximizing the posterior by finding the $\theta$ that maximizes the likelihood instead.

For example, the likelihood of data = {2, 3, 4} above is a function of $\mu, \sigma$, and MLE finds the $\mu$ and $\sigma$ that maximize it. How? Since this is now an optimization problem, we can take derivatives and set them to zero, or use a method such as Gradient Descent.
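For the normal distribution, setting the derivatives to zero gives the well-known closed-form answers: the sample mean for $\mu$ and the $\frac{1}{n}$ (not $\frac{1}{n-1}$) standard deviation for $\sigma$. The sketch below computes them for {2, 3, 4} and checks that nudging either parameter lowers the likelihood. The helper names and the step size 0.1 are my own assumptions.

```python
import math

def likelihood(data, mu, sigma):
    """Product of normal densities over the data (IID assumption)."""
    coef = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    result = 1.0
    for x in data:
        result *= coef * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))
    return result

data = [2.0, 3.0, 4.0]
n = len(data)

# Closed-form MLE for a normal distribution.
mu_hat = sum(data) / n
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in data) / n)

best = likelihood(data, mu_hat, sigma_hat)
print("mu_hat =", mu_hat, "sigma_hat =", sigma_hat, "likelihood =", best)

# Sanity check: small perturbations of either parameter give a lower likelihood.
perturbed = [
    likelihood(data, mu_hat + d_mu, sigma_hat + d_sigma)
    for d_mu in (-0.1, 0.0, 0.1)
    for d_sigma in (-0.1, 0.0, 0.1)
    if (d_mu, d_sigma) != (0.0, 0.0)
]
print("all perturbations lower:", all(p < best for p in perturbed))
```

Gradient Descent would reach the same point iteratively; the closed form is used here only because it is easy to verify by hand for three data points.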

3. Negative Log-Likelihood

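OK, good. Now all we have to do is maximize the likelihood! But why do we use the log likelihood? The following question explains it. [link]

The main reason is that it turns multiplication into addition. As in the figure below, the likelihood is a product, and taking the log expresses it as a sum.

In addition, as shown below, it lets us avoid computing the exponential terms contained in distributions such as the Gaussian and Bernoulli.

The log function is also monotonically increasing, so applying it to the likelihood does not change the ordering of the values, and the maximizing $\theta$ stays the same.

Finally, "Negative" means multiplying by -1 to turn the maximization problem into a minimization problem.

To sum up, suppose we have to solve this problem with GD. If the assumed model is made up of exponentials and products, we would have to differentiate all of those products, and the computational cost becomes very large. To avoid this, we use the log likelihood, which is much simpler to compute.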
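Here is a small check of the two claims above, using the same {2, 3, 4} data with $\sigma$ fixed to 1. The grid of candidate $\mu$ values is an arbitrary choice of mine. It confirms that the log of the product equals the sum of the logs, and that the $\mu$ maximizing the likelihood is the same $\mu$ that minimizes the NLL.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / math.sqrt(2.0 * math.pi * sigma ** 2)

data = [2.0, 3.0, 4.0]
sigma = 1.0

def likelihood(mu):
    """Product form: what we originally wanted to maximize."""
    return math.prod(normal_pdf(x, mu, sigma) for x in data)

def nll(mu):
    """Negative log likelihood: a sum of per-sample terms to minimize."""
    return -sum(math.log(normal_pdf(x, mu, sigma)) for x in data)

# Candidate values of mu from 0.0 to 6.0 in steps of 0.1 (arbitrary grid).
mu_grid = [i / 10 for i in range(0, 61)]

mu_max_lik = max(mu_grid, key=likelihood)
mu_min_nll = min(mu_grid, key=nll)

print("argmax likelihood:", mu_max_lik)
print("argmin NLL       :", mu_min_nll)
print("log(product) vs sum(log):", math.log(likelihood(3.0)), -nll(3.0))
```

`math.prod` requires Python 3.8 or later.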

4. Examples

Now let's look at two examples of likelihood. Given X, the likelihood of Y through a deep neural network is as follows.

$\begin{aligned}
P(Y|X) &= P(Y|w = f_\theta(X)) \\
&= P(Y|X; \theta)
\end{aligned}$

1. Let Y ~ Gaussian, and assume $\mu = f_\theta(X), \sigma = 1$.

Then, Likelihood $= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}} e^{-\frac{(y_i - f_\theta(x_i))^2}{2}}$

Negative Log Likelihood :

$\begin{aligned}
&= -\sum_{i=1}^{n} \left\{ \ln\frac{1}{\sqrt{2\pi}} - \frac{1}{2}(y_i - f_\theta(x_i))^2 \right\} \\
&= \sum_{i=1}^{n} \left\{ \frac{1}{2}(y_i - f_\theta(x_i))^2 + const \right\}
\end{aligned}$

This is the same as the Sum of Squared Errors (SSE), up to the factor $\frac{1}{2}$ and an additive constant, neither of which changes the minimizer.
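A quick numerical check of this result, as a sketch: the toy data and the one-parameter model `f(theta, x) = theta * x` are hypothetical stand-ins for a real network $f_\theta$. For several values of $\theta$, the Gaussian NLL and half the SSE should differ by the same constant, $\frac{n}{2}\ln(2\pi)$.

```python
import math

# Toy data (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 0.9, 2.2, 2.8]

def f(theta, x):
    """Hypothetical stand-in for the network f_theta(x)."""
    return theta * x

def gaussian_nll(theta):
    """NLL with Y ~ N(f_theta(x), 1), computed directly from the density."""
    total = 0.0
    for x, y in zip(xs, ys):
        pdf = math.exp(-((y - f(theta, x)) ** 2) / 2.0) / math.sqrt(2.0 * math.pi)
        total -= math.log(pdf)
    return total

def half_sse(theta):
    """Half the sum of squared errors."""
    return sum(0.5 * (y - f(theta, x)) ** 2 for x, y in zip(xs, ys))

# The constant that the derivation drops: n * (1/2) * ln(2*pi).
const = len(xs) * 0.5 * math.log(2.0 * math.pi)

gaps = [gaussian_nll(t) - half_sse(t) for t in (0.0, 0.5, 1.0, 1.5)]
print("NLL - SSE/2 for each theta:", gaps)
print("expected constant         :", const)
```

Because the gap does not depend on $\theta$, minimizing the NLL and minimizing the SSE pick the same $\theta$.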

2. Let Y ~ Bernoulli, and assume $p = f_\theta(X)$.

Then, Likelihood $= \prod_{i=1}^{n} f_\theta(x_i)^{y_i} (1 - f_\theta(x_i))^{(1 - y_i)}$

Negative Log Likelihood :

$= -\sum_{i=1}^{n} \left\{ y_i \ln(f_\theta(x_i)) + (1 - y_i) \ln(1 - f_\theta(x_i)) \right\}$

This is Binary Cross Entropy.
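The same kind of check works for the Bernoulli case. In this sketch the labels `ys` and the predicted probabilities `ps` (playing the role of $f_\theta(x_i)$) are made-up numbers. It builds the likelihood as a product first, then confirms that its negative log matches the summed Binary Cross Entropy formula.

```python
import math

ys = [1, 0, 1, 1]            # binary labels (made up)
ps = [0.9, 0.2, 0.6, 0.7]    # hypothetical outputs f_theta(x_i), each in (0, 1)

# Bernoulli likelihood: p^y * (1 - p)^(1 - y), multiplied over all samples.
likelihood = 1.0
for y, p in zip(ys, ps):
    likelihood *= (p ** y) * ((1.0 - p) ** (1 - y))

# Negative log of the product form.
nll_from_likelihood = -math.log(likelihood)

# Binary Cross Entropy written as a sum (not averaged over samples).
bce_sum = -sum(
    y * math.log(p) + (1 - y) * math.log(1.0 - p)
    for y, p in zip(ys, ps)
)

print("likelihood        :", likelihood)
print("-log(likelihood)  :", nll_from_likelihood)
print("BCE (summed)      :", bce_sum)
```

Library implementations of BCE often average over the samples instead of summing; that only rescales the loss by $\frac{1}{n}$.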

An example where Y follows a multinomial distribution is in the following post: [Theme 04] Multinomial Logistic Regression (Softmax Regression)

https://csm-kr.tistory.com/48

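Having seen the NLL loss examples derived through MLE, isn't it neat that the losses we already know show up? In this way, if Y is continuous we can assume a Gaussian distribution, and if Y takes one of two classes we can assume a Bernoulli distribution, and then apply MLE. These fit regression and classification problems, respectively.

That's it for MLE in this post. To finish, here is a short quiz for review.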

1. Explain Bayes' rule.

2. Explain the meaning of likelihood.

3. Explain why MLE uses the NLL.

4. Show that when Y is Gaussian, the loss of a deep neural net $p(Y|X)$ becomes the SSE.

Next time I plan to post about XOR and MLP. Thank you. 😎😎😎😎😎

Reference

https://sanghyu.tistory.com/10

https://stats.stackexchange.com/questions/429819/what-is-the-conceptual-difference-between-posterior-and-likelihood

https://math.stackexchange.com/questions/892832/why-we-consider-log-likelihood-instead-of-likelihood-in-gaussian-distribution

https://hwiyong.tistory.com/27

https://www.youtube.com/watch?v=o_peo6U7IRM&t=2977s
