[Theme 03] Logistic Regression (odds/logit/sigmoid/bce)

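Hello, this is pulluper. 😀

In this post, we will look at the Logistic Regression model.

In the previous post on [Linear Regression], the dependent variable Y was continuous. For example, when predicting Cheolsu's exam score from his study hours, the predicted score y is continuous data ranging from 0 to 100. But if the dependent variable is categorical (0 or 1, or several classes), using Linear Regression as-is becomes a problem. In the following figure, the x-axis is study time and the y-axis is a binary label, 0 or 1, indicating whether the exam was failed or passed.

Suppose linear regression gives us the following line: $y = 0.25x - 0.5$. Feeding the data into this line, we could classify outputs of 0.5 or higher as pass and outputs below 0.5 as fail.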

However, when new data comes in as in the figure below, the fitted line shifts from the original line to the dotted line on the right. With that new line, the 0.5 threshold produces far too many misclassifications. In this way, Linear Regression has the drawback of being sensitive to outliers. Also, any line with a nonzero slope has range $(-\infty, \infty)$, so a straight line is hard to use for accurate decisions during training, and it is likely to predict poorly on test data that was not used for training. Logistic regression addresses this by fitting a curve instead of a straight line.
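To make the outlier sensitivity concrete, here is a minimal sketch in plain Python. The study-hour data, the helper names `fit_line` and `boundary`, and the single extra point at 20 hours are all made up for illustration and do not come from the figures in this post.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for 1-D data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def boundary(slope, intercept, threshold=0.5):
    """x value at which the fitted line crosses the decision threshold."""
    return (threshold - intercept) / slope


# Hypothetical data: study hours and pass (1) / fail (0) labels.
hours = [1, 2, 3, 4, 5, 6]
passed = [0, 0, 0, 1, 1, 1]

before = boundary(*fit_line(hours, passed))
# Add one student who studied 20 hours and passed (an outlier in x).
after = boundary(*fit_line(hours + [20], passed + [1]))

print(f"boundary before outlier: {before:.2f} hours")
print(f"boundary after outlier:  {after:.2f} hours")
```

With this toy data the 0.5 crossing moves to the right of 4 hours, so the student who passed after 4 hours would now be classified as failing, even though the new point agrees with the original pattern.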

So how should we do regression on categorical data? With logistic regression! Before getting to the logit, let's first go over the concept of odds.

Odds is the ratio of the number of successes to the number of failures. For example, in a dice game where you win if (1, 3) comes up and lose if (2, 4, 5, 6) comes up, $odds = \frac{2}{4} = 0.5$. This is the same as success probability / failure probability: if an event has probability p, its odds are $odds = \frac{p}{1-p}$. (The event and its complement are mutually exclusive and together make up the whole sample space.) p ranges over $[0, 1]$, and the odds can be drawn as the following graph. The curve has a vertical asymptote at $p = 1$, and it maps the domain $[0, 1)$ to the range $[0, \infty)$.
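The dice example can be checked with a few lines of Python. This is only a sketch; the function name `odds` is my own choice, and `Fraction` is used just to keep the arithmetic exact.

```python
from fractions import Fraction


def odds(p):
    """Odds of an event with probability p, defined for 0 <= p < 1."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1 - p)


# Winning faces (1, 3) out of six equally likely faces.
p_win = Fraction(2, 6)
print(odds(p_win))  # 1/2, the same as 2 wins / 4 losses
```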

The logit is the log of the odds: $logit(p) = \log\left(\frac{p}{1-p}\right)$. Its range is $(-\infty, \infty)$, and it is symmetric about the point (0.5, 0). The graph looks like this.
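A quick numerical check of the symmetry about (0.5, 0) is sketched below. The function name `logit` and the sample probabilities are assumptions for illustration.

```python
import math


def logit(p):
    """Log-odds of a probability p, defined for 0 < p < 1."""
    return math.log(p / (1 - p))


# Symmetry about (0.5, 0): logit(0.5 + d) == -logit(0.5 - d).
for d in (0.1, 0.3, 0.49):
    print(d, logit(0.5 + d), -logit(0.5 - d))
```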

If we call the logit a and solve for p, we get $p = \frac{1}{e^{-a}+1}$. This means it transforms a in $(-\infty, \infty)$ into p in $(0, 1)$. The output of a linear regression usually ranges over $(-\infty, \infty)$, so if we plug it in place of a, this acts as the logistic function that converts the value to between 0 and 1. It is also called the sigmoid function. The derivation is as follows.

$$
\begin{aligned}
\log\left(\frac{p}{1-p}\right) &= a \\
\frac{p}{1-p} &= e^{a} \\
\frac{1-p}{p} &= e^{-a} \\
\frac{1}{p} &= e^{-a} + 1 \\
p &= \frac{1}{e^{-a}+1}
\end{aligned}
$$
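The derivation says the sigmoid is the inverse of the logit, which can be checked numerically. The sketch below uses my own function names; the branch on the sign of `a` is an implementation choice (not part of the derivation) that avoids overflow in `math.exp` for large negative inputs.

```python
import math


def logit(p):
    """Log-odds of a probability p, defined for 0 < p < 1."""
    return math.log(p / (1 - p))


def sigmoid(a):
    """1 / (1 + e^{-a}), written so that exp never receives a large positive argument."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    z = math.exp(a)  # algebraically the same value: e^a / (1 + e^a)
    return z / (1.0 + z)


for p in (0.1, 0.3, 0.5, 0.9):
    print(p, sigmoid(logit(p)))  # the round trip should give back p
```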

So if we write the linear regression as $Wx + b$ and plug it into the logistic (sigmoid) function, we get $\frac{1}{e^{-(Wx+b)}+1}$, which fits the data along a curve as in the following figure. This lets us do regression on categorical data much better. 👏👏👏
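As a small sketch of what this model computes, the snippet below evaluates $\frac{1}{e^{-(Wx+b)}+1}$ for a few study times. The values `W = 1.5` and `b = -5.25` are hypothetical numbers picked so that the curve crosses 0.5 at 3.5 hours; they are not fitted to any data in this post.

```python
import math


def sigmoid(a):
    """Numerically safe 1 / (1 + e^{-a})."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    z = math.exp(a)
    return z / (1.0 + z)


def predict_proba(x, W, b):
    """Pass probability for study time x under the model sigmoid(W * x + b)."""
    return sigmoid(W * x + b)


W, b = 1.5, -5.25  # hypothetical parameters, boundary at x = 3.5
for hours in (1, 3.5, 6, 20):
    print(hours, round(predict_proba(hours, W, b), 4))
```

Unlike the straight line, the output stays inside (0, 1) even for a very large study time such as 20 hours.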

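Now let's look at how to train a model whose output goes through the sigmoid function.

Let's try the OLS / GD / MLE approaches we used for Linear Regression.

I wanted to try OLS (ordinary least squares), but came across the following opinion: the sigmoid function rules out an OLS-style approach, because the model is no longer linear in W and b and there is no closed-form least-squares solution. So we have to use GD, Newton's method, MLE, and so on.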

https://stats.stackexchange.com/questions/236028/how-to-solve-logistic-regression-using-ordinary-least-squares

How To Solve Logistic Regression Using Ordinary Least Squares?


MLE

Logistic regression uses the binary cross entropy (BCE) loss. Here we will derive BCE through MLE.

The $\frac{1}{1+e^{-(Wx+b)}}$ we obtained above takes values between 0 and 1.

It is therefore reasonable to assume a Bernoulli distribution with this value as its parameter. A Bernoulli random variable Y has the following pmf: $p(Y|\theta) = L(\theta|Y) = \theta^{Y}(1-\theta)^{1-Y}, \text{ where } Y \in \{0, 1\}$. The likelihood of the data is then the product of these pmfs.
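A minimal sketch of the Bernoulli pmf and the resulting likelihood for a few observations follows. The function names, the value `theta = 0.7`, and the sample `[1, 1, 0]` are assumptions chosen for illustration.

```python
def bernoulli_pmf(y, theta):
    """P(Y = y) for a Bernoulli variable with success probability theta."""
    if y not in (0, 1):
        raise ValueError("y must be 0 or 1")
    return theta ** y * (1 - theta) ** (1 - y)


def likelihood(ys, theta):
    """Product of the pmf over independent observations ys."""
    result = 1.0
    for y in ys:
        result *= bernoulli_pmf(y, theta)
    return result


theta = 0.7
print(bernoulli_pmf(1, theta), bernoulli_pmf(0, theta))  # theta and 1 - theta
print(likelihood([1, 1, 0], theta))                      # theta * theta * (1 - theta)
```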

Written out, this is $L(\theta|Y) = \prod_i \theta^{y_i}(1-\theta)^{(1-y_i)}$. Taking the log to get the log likelihood gives $\log(L) = \sum_i y_i \log(\theta) + \sum_i (1-y_i) \log(1-\theta)$.
$\theta$ is a probability between 0 and 1, and it can be replaced by the $f(x|W,b) = \frac{1}{1+e^{-(Wx+b)}}$ we obtained.

Substituting it in gives $\log(L) = \sum_i y_i \log(f(x_i)) + \sum_i (1-y_i) \log(1-f(x_i))$. MLE has to maximize this, so multiplying by -1 and minimizing instead gives us the (binary) cross entropy.
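The claim that BCE is just the negative log likelihood can be checked numerically. In this sketch the labels `ys` and the predicted probabilities `ps` (standing in for $f(x_i)$) are made-up numbers, and the function names are my own.

```python
import math


def likelihood(ys, ps):
    """Product of Bernoulli pmfs, with p_i playing the role of f(x_i)."""
    result = 1.0
    for y, p in zip(ys, ps):
        result *= p ** y * (1 - p) ** (1 - y)
    return result


def bce(ys, ps):
    """Binary cross entropy, summed over the samples."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(ys, ps))


ys = [1, 0, 1]          # hypothetical labels
ps = [0.9, 0.2, 0.7]    # hypothetical model outputs f(x_i)

print(-math.log(likelihood(ys, ps)))
print(bce(ys, ps))      # matches the line above up to floating-point error
```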

GD (Gradient Descent)

We now need to minimize the binary cross entropy function obtained above, $bce(X, Y) = -\left(\sum_i y_i \log(f(x_i)) + \sum_i (1-y_i) \log(1-f(x_i))\right)$, so we can reduce it with Gradient Descent and update W and b.
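Below is a rough sketch of that update loop in plain Python. It relies on the standard gradient of BCE with a sigmoid output, $\frac{\partial\, bce}{\partial W} = \sum_i (f(x_i) - y_i)\,x_i$ and $\frac{\partial\, bce}{\partial b} = \sum_i (f(x_i) - y_i)$, which is not derived in this post. The data, the learning rate, the step count, and the choice to average the gradient over the samples are all assumptions for illustration.

```python
import math


def sigmoid(a):
    """Numerically safe 1 / (1 + e^{-a})."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    z = math.exp(a)
    return z / (1.0 + z)


def bce(W, b, xs, ys, eps=1e-12):
    """Summed binary cross entropy; eps keeps log away from 0."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(sigmoid(W * x + b), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def train(xs, ys, lr=0.1, steps=5000):
    """Gradient descent on BCE for the 1-D model sigmoid(W * x + b)."""
    W, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [sigmoid(W * x + b) - y for x, y in zip(xs, ys)]
        # Gradient averaged over samples (only rescales the summed BCE gradient).
        grad_W = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        W -= lr * grad_W
        b -= lr * grad_b
    return W, b


# Hypothetical study hours and pass/fail labels (not linearly separable).
hours = [1, 2, 3, 4, 5, 6, 7, 8]
passed = [0, 0, 1, 0, 1, 1, 1, 1]

initial_loss = bce(0.0, 0.0, hours, passed)
W, b = train(hours, passed)
final_loss = bce(W, b, hours, passed)
print(f"W={W:.3f}, b={b:.3f}, loss {initial_loss:.3f} -> {final_loss:.3f}")
```

The labels are deliberately not perfectly separable, so the loss settles at a finite minimum instead of W growing without bound.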

Quiz

Before wrapping up, let's review today's material with a quiz.

1. What are odds?

2. What is the logit?

3. How is the sigmoid derived?

4. What is the Bernoulli distribution?

5. How do you derive binary cross entropy through MLE?

In this post we covered Logistic Regression, which can classify two classes, along with the sigmoid and binary cross entropy. In the next post we will look at softmax regression, which can classify multiple classes. Questions and discussion are always welcome. Thank you! 😎😎😎
