[Theme 02] Linear Regression (Feat. OLS/GD/MLE)

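Hello, this is Pulluper. 😀

The second topic of Basic ML is Linear Regression, one kind of regression, which together with classification forms the foundation of supervised learning.

First, to regress means to return to an earlier state. Long ago, a British geneticist hypothesized that parents' heights and their children's heights are linearly related and that heights tend to return toward the overall average, and the method used to analyze this came to be called regression analysis. Before going further, let's go over a few terms.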

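As a statistical technique, regression is defined as follows.

Regression: a statistical method that describes the relationship between two or more variables

The independent and dependent variables used in mathematical and statistical models are defined as follows.

Independent variable: a variable that influences other variables (the cause); also called the explanatory or predictor variable

Dependent variable: a variable that is influenced by other variables (the effect); also called the response, explained, or predicted variable

The independent variable is the one we deliberately vary to produce some outcome, and the dependent variable is the one that represents that outcome.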

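We need the terms above because Linear Regression, today's topic, is a statistical technique. Korean Wikipedia describes linear regression as follows.

In statistics, linear regression is a regression analysis technique that models the linear relationship between a dependent variable y and one or more independent (or explanatory) variables X. When it is based on a single explanatory variable it is called simple linear regression, and when it is based on two or more it is called multiple linear regression.

In other words, linear regression is regression carried out under the assumption that the relationship between X and Y follows the linear model Y = WX + b.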

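Let's start with an example. 😎😎

Cheolsu's mom promised to buy him a game console if he averages over 70 on this midterm. 😀

But Cheolsu really hated studying. So, to clear 70 points with the least possible effort, he surveyed his friends' study hours and midterm scores. He asked four friends, and this is what he found.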

x(hours)	y(score)
10	90
9	80
3	50
2	20

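Then he formed a hypothesis: "Study time and score are related by some straight line."

He called that line H(x) and set up the relationship H(x) = Wx + b, where x is the study time and H(x) is the score. So Cheolsu has to find a good H(x) in order to score at least 70 with minimal effort.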

One way to find H(x) is to define an error between each data point and the hypothesis H(x), and then choose the H(x) that minimizes it. If we define the error as the mean of the squared differences between H(x) and y, it becomes $\frac{1}{m}\sum_{i=1}^{m}\left(H(x^{(i)}) - y^{(i)}\right)^2$, where $m$ is the number of data points and the superscript $(i)$ marks the $i$-th one.

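In the figure, the pink segments are the errors, and the cost is the mean of their squares. We square them so that no term is negative and so that large errors are penalized more heavily, which pushes the fit to reduce them first. This quantity is called the cost function or the loss function.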
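
To make the cost concrete, here is a minimal sketch that evaluates the mean squared error for two candidate lines on Cheolsu's four data points. The function name `mse` and the two candidate (W, b) pairs are assumptions picked only for illustration; they do not come from the post.

```python
import numpy as np

# Study hours and scores from the table above.
x = np.array([10, 9, 3, 2], dtype=float)
y = np.array([90, 80, 50, 20], dtype=float)

def mse(W, b, x, y):
    """Mean of the squared differences between H(x) = W*x + b and y."""
    predictions = W * x + b
    return np.mean((predictions - y) ** 2)

# Two hypothetical candidate lines, chosen only for illustration.
cost_flat = mse(0.0, 60.0, x, y)   # horizontal line at the mean score
cost_steep = mse(7.0, 18.0, x, y)  # a line that follows the upward trend
print(cost_flat, cost_steep)
```

The flat line gives a cost of 750.0 and the sloped line gives 67.5, so the sloped line is the better hypothesis under this error measure.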

Cheolsu's plan was to find the hypothesis H(x) that minimizes this error and then work out the study time needed to exceed 70 points.

So how can we minimize the error?

OLS (ordinary least squares)
GD (gradient descent)
MLE (maximum likelihood estimation)

Let's find the H(x) that minimizes the error with each of these three methods 🤪

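OLS (Ordinary Least Squares)

OLS (ordinary least squares), well known as the method of least squares, is the most basic way to solve linear regression. It simply uses differentiation to find the parameters at which the error is smallest. It is standard material in statistics and econometrics. Writing it in matrix form makes it convenient for solving multiple regression (linear regression with several parameters).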

First, let's write H(x) in matrix form. In $H(x) = Wx + b$, rename $b$ to $w_0$ and $W$ to $w_1$, so that $H(x) = w_0 + w_1 x$. This can be written as the matrix product $\begin{pmatrix} 1 & x \end{pmatrix} \begin{pmatrix} w_0 \\ w_1 \end{pmatrix} = w_0 + w_1 x$. With $X = \begin{pmatrix} 1 & x \end{pmatrix}$ and $w = \begin{pmatrix} w_0 \\ w_1 \end{pmatrix}$, $H(x)$ is $Xw$. OLS minimizes the residual sum of squares (RSS), which is the cost function above with a sum in place of the mean.

The RSS is computed as $(y - Xw)^T (y - Xw)$. We have four data points, which we will call $(x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4)$. Since there are four data points, the $y$ used in the RSS, written out, is $y = \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{pmatrix}$, which is a $4 \times 1$ matrix. Likewise, $Xw = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ 1 & x_4 \end{pmatrix} \begin{pmatrix} w_0 \\ w_1 \end{pmatrix}$ works out to a $4 \times 1$ matrix (a vector)!

Therefore $(y - Xw)^T (y - Xw)$ is a single scalar, the sum of the squared residuals over all data points. We can minimize it by differentiation.

$\begin{aligned} RSS &= (y - Xw)^T (y - Xw) \\ &= (y^T - w^T X^T)(y - Xw) \\ &= y^T y - y^T X w - w^T X^T y + w^T X^T X w \end{aligned}$
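
As a quick sanity check on the expansion, the sketch below builds the design matrix for Cheolsu's data and evaluates the RSS three ways: as a plain sum of squared residuals, as $(y - Xw)^T (y - Xw)$, and with the four-term expanded form. The weight vector `w` is an arbitrary illustrative value, not something taken from the post.

```python
import numpy as np

x = np.array([10, 9, 3, 2], dtype=float)
y = np.array([90, 80, 50, 20], dtype=float)

# Design matrix: a column of ones (for w_0) next to the x values (for w_1).
X = np.column_stack([np.ones_like(x), x])

# An arbitrary weight vector (w_0, w_1), chosen only for illustration.
w = np.array([18.0, 7.0])

residual = y - X @ w
rss_direct = np.sum(residual ** 2)   # sum of squared residuals
rss_matrix = residual @ residual     # (y - Xw)^T (y - Xw)
rss_expanded = y @ y - y @ X @ w - w @ X.T @ y + w @ X.T @ X @ w
print(rss_direct, rss_matrix, rss_expanded)
```

All three values agree (270.0 for this choice of `w`), which is what the algebra says should happen.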

Now, differentiating the RSS with respect to w and finding where the derivative is 0 gives the minimum.

$\begin{aligned} \frac{\partial RSS}{\partial w} &= -2 X^T y + (X^T X + X^T X) w \\ &= -2 X^T y + 2 X^T X w \end{aligned}$

Setting this to 0 gives $X^T y = X^T X w$, so $w = (X^T X)^{-1} X^T y$.

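Note that this uses the matrix/vector differentiation rules $\frac{\partial (a^T w)}{\partial w} = a$ and $\frac{\partial (w^T A w)}{\partial w} = (A + A^T) w$.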
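
If the matrix calculus feels opaque, one way to build trust in the result is to compare the analytic gradient $-2 X^T y + 2 X^T X w$ against a finite-difference estimate. The sketch below does that at an arbitrary test point; the helper names `rss`, `rss_gradient`, and `numerical_gradient` are my own and are not part of the original post.

```python
import numpy as np

X = np.array([[1, 10], [1, 9], [1, 3], [1, 2]], dtype=float)
y = np.array([90, 80, 50, 20], dtype=float)

def rss(w):
    """Residual sum of squares (y - Xw)^T (y - Xw)."""
    r = y - X @ w
    return r @ r

def rss_gradient(w):
    """Analytic gradient from the derivation: -2 X^T y + 2 X^T X w."""
    return -2 * X.T @ y + 2 * X.T @ X @ w

def numerical_gradient(f, w, eps=1e-6):
    """Central-difference estimate of the gradient, one coordinate at a time."""
    grad = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = eps
        grad[j] = (f(w + step) - f(w - step)) / (2 * eps)
    return grad

w_test = np.array([1.0, 2.0])  # arbitrary test point, for illustration only
analytic = rss_gradient(w_test)
numeric = numerical_gradient(rss, w_test)
print(analytic, numeric)
```

The two gradients should match up to floating-point noise, which supports the closed-form derivative used above.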

In addition, $X^T X$ must be invertible (that is, $X$ must have full column rank, with linearly independent columns), and for the stationary point to be a minimum, the Hessian of the RSS with respect to w (its second derivative) must be positive definite. That is, we need $\frac{\partial^2 RSS}{\partial w^2} = 2 X^T X \succ 0$.
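
Both conditions are easy to check numerically for Cheolsu's data. The sketch below assumes the Hessian of the RSS is $2 X^T X$ (differentiating the gradient $-2 X^T y + 2 X^T X w$ once more; the factor 2 does not affect definiteness) and uses NumPy's `matrix_rank` and `eigvalsh`.

```python
import numpy as np

X = np.array([[1, 10], [1, 9], [1, 3], [1, 2]], dtype=float)
gram = X.T @ X

# Full column rank means the two columns of X are linearly independent.
rank = np.linalg.matrix_rank(X)

# A symmetric matrix is positive definite when all its eigenvalues are > 0.
hessian = 2 * gram
eigenvalues = np.linalg.eigvalsh(hessian)
is_positive_definite = bool(np.all(eigenvalues > 0))
print(rank, eigenvalues, is_positive_definite)
```

Here the rank is 2 and both eigenvalues are positive, so the stationary point found by the normal equation is indeed the minimum.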

Now let's actually compute $w_0, w_1$ for Cheolsu's example. Following the formula derived above,

import numpy as np

y = np.array([90, 80, 50, 20])
X = np.array([[1, 10], [1, 9], [1, 3], [1, 2]])
w = np.linalg.inv(X.T @ X) @ X.T @ y
print(w)

we get the following result.

[15.6  7.4]

This gives H(x) = 15.6 + 7.4x, and the study time x needed for H(x) > 70 is

min_study_time = (70 - w[0]) / w[1]
print(min_study_time)

The result is

7.351351351351353

so according to OLS, Cheolsu has to study at least about 7.35 hours, a bit over 7 hours and 20 minutes 😂😂
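
As a cross-check, the same weights can be obtained without forming the inverse explicitly. This is only a sketch of two alternatives I am adding here, `np.linalg.solve` on the normal equation and `np.linalg.lstsq` on the original system; the post itself uses `np.linalg.inv`.

```python
import numpy as np

X = np.array([[1, 10], [1, 9], [1, 3], [1, 2]], dtype=float)
y = np.array([90, 80, 50, 20], dtype=float)

# Solve X^T X w = X^T y directly instead of inverting X^T X.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver applied to X w ~ y.
w_lstsq, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)

min_study_time = (70 - w_lstsq[0]) / w_lstsq[1]
print(w_normal, w_lstsq, min_study_time)
```

Both routes agree with [15.6, 7.4]. Avoiding the explicit inverse is usually the more numerically stable habit, although for a 2 x 2 problem like this the difference is negligible.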

GD (Gradient Descent)

Cheolsu couldn't believe it. Nearly 8 hours of studying!!

So this time he decided to find H(x) with GD (gradient descent).

Gradient descent finds the parameters using the following update rule.

$\theta = \theta - \eta \nabla_{\theta} L(H_{\theta}(x), y)$

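It is an iterative optimization method: compute the gradient of the loss (cost function) with respect to each parameter, then repeatedly update the parameters in the direction that lowers the loss, scaled by the learning rate $\eta$, until we reach a good H(x). It only needs first derivatives, and it has the advantage of working even when the parameters have no closed-form solution or when there are very many of them.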
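
Before applying this to two parameters, it may help to see the update rule on a one-dimensional toy problem. The loss $L(\theta) = (\theta - 3)^2$, the function name `gradient_descent_1d`, and the step settings below are all made up for illustration and do not appear in the post.

```python
def gradient_descent_1d(grad, theta, learning_rate, steps):
    """Repeatedly apply theta = theta - learning_rate * grad(theta)."""
    for _ in range(steps):
        theta = theta - learning_rate * grad(theta)
    return theta

# Toy loss L(theta) = (theta - 3)^2 has gradient 2 * (theta - 3)
# and its minimum at theta = 3.
theta_final = gradient_descent_1d(
    grad=lambda t: 2 * (t - 3),
    theta=-30.0,
    learning_rate=0.1,
    steps=200,
)
print(theta_final)
```

Each step shrinks the distance to the minimum by a constant factor (0.8 with these settings), so the iterate ends up essentially at 3.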

This time let's write $H(x)$ in terms of $\theta$: $H(x) = \theta_0 + \theta_1 x$

def hypothesis(theta_0, theta_1, x_data):
  return theta_0 + theta_1 * x_data

Next, let's set up the data and look at its distribution.

import numpy as np
import matplotlib.pyplot as plt

data = np.array([[10, 90], [9, 80], [3, 50], [2, 20]])

x_data = data[:, 0]
y_data = data[:, 1]

plt.figure(figsize=(8, 8))
plt.scatter(x_data, y_data, alpha=0.3, color='k')
plt.show()

Now let's look at the loss.

Earlier we defined $L = \frac{1}{m}\sum_{i=1}^{m}\left(H(x^{(i)}) - y^{(i)}\right)^2$. To make differentiation more convenient, we put $2m$ in the denominator instead, which gives the following.

def l2_loss (h, y_data):
  m = len(h)  
  ret = np.sum((h-y_data)*(h-y_data)) / (2*m)
  return ret

$L = \frac{1}{2m}\sum_{i=1}^{m}\left(H(x^{(i)}) - y^{(i)}\right)^2$ Here $\nabla_{\theta} L$ denotes the derivatives of the loss with respect to every theta. In our example these are the partial derivatives with respect to $\theta_0$ and $\theta_1$.

Therefore $\frac{\partial L}{\partial \theta_0} = \frac{1}{m}\sum\left((\theta_0 + \theta_1 x) - y\right)$ and $\frac{\partial L}{\partial \theta_1} = \frac{1}{m}\sum\left(\left((\theta_0 + \theta_1 x) - y\right) \times x\right)$. The gradient descent code that updates the parameters with them is as follows. The learning rate $\eta$ is 0.01.

def gradient_descent(x, y, theta_0, theta_1, learning_rate=0.01):
  m = len(x)  
  gradient_theta_0 = np.sum(hypothesis(theta_0, theta_1, x) - y) / m
  gradient_theta_1 = np.sum((hypothesis(theta_0, theta_1, x) - y) * x) / m

  new_theta_0 = theta_0 - learning_rate * gradient_theta_0
  new_theta_1 = theta_1 - learning_rate * gradient_theta_1
  return new_theta_0, new_theta_1
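
The two per-parameter gradients can also be written in the matrix notation of the OLS section as $\frac{1}{m} X^T (X\theta - y)$. The sketch below is my own restatement, not code from the post, and checks that the matrix form matches the element-wise formulas at the starting point (-30, -30).

```python
import numpy as np

x = np.array([10, 9, 3, 2], dtype=float)
y = np.array([90, 80, 50, 20], dtype=float)
X = np.column_stack([np.ones_like(x), x])  # column of ones, then x

def gradient_matrix_form(theta, X, y):
    """Gradient of L = (1/2m) * sum((X theta - y)^2): (1/m) X^T (X theta - y)."""
    m = len(y)
    return X.T @ (X @ theta - y) / m

theta = np.array([-30.0, -30.0])
grad = gradient_matrix_form(theta, X, y)

# Same quantities from the per-parameter formulas.
error = theta[0] + theta[1] * x - y
grad_elementwise = np.array([np.mean(error), np.mean(error * x)])
print(grad, grad_elementwise)
```

Both give [-270, -2087.5] here. The matrix form extends unchanged to more than one input feature, which is why it is the version usually seen in practice.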

The code below initializes each parameter to -30 (an arbitrary value) and then updates them.

# initialize thetas as -30 each.
theta_0 = -30
theta_1 = -30

theta_0_list = []
theta_1_list = []
loss_list = []

# run 10000 steps, which is enough for convergence on this data
converge_step = 10000
for i in range(converge_step):
  h = hypothesis(theta_0, theta_1, x_data)

  loss = l2_loss(h, y_data)
  loss_list.append(loss)

  theta_0_list.append(theta_0)
  theta_1_list.append(theta_1)
  theta_0, theta_1 = gradient_descent(x_data, y_data, theta_0, theta_1)
  
print(theta_0)
print(theta_1)

After 10000 iterations, theta_0 and theta_1 have the following values.

They are very close to the values [15.6, 7.4] obtained earlier with OLS (least squares).

15.599999999630178
7.400000000045993

This gives H(x) = 15.599999999630178 + 7.400000000045993x, and the study time x needed for H(x) > 70 again comes out to about 7.35 hours. Will Cheolsu give up? Will he meekly study for 8 hours? 🤬
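
One thing the post does not discuss is why a learning rate of 0.01 works here. As a hedged aside: for a quadratic loss like this one, a standard result says fixed-step gradient descent converges only if $\eta < 2 / \lambda_{max}$, where $\lambda_{max}$ is the largest eigenvalue of the Hessian of the loss. The sketch below computes that bound for Cheolsu's data, assuming the $\frac{1}{2m}$ loss used above, whose Hessian is $\frac{1}{m} X^T X$.

```python
import numpy as np

x = np.array([10, 9, 3, 2], dtype=float)
X = np.column_stack([np.ones_like(x), x])
m = len(x)

# Hessian of L = (1/2m) * sum((X theta - y)^2) with respect to theta.
hessian = X.T @ X / m
eigenvalues = np.linalg.eigvalsh(hessian)
lambda_max = eigenvalues.max()

# Assumed stability bound for fixed-step gradient descent on a quadratic.
max_stable_lr = 2.0 / lambda_max
print(eigenvalues, max_stable_lr)
```

The bound comes out to roughly 0.04, so 0.01 is safely inside it. The smallest eigenvalue is much smaller than the largest, which is why thousands of iterations are needed before the values settle.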

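MLE (Maximum Likelihood Estimation)

Finally, Cheolsu decided to verify things just one more time. This time he is seized by the urge to pin down exactly how long he needs to study, using a method called maximum likelihood estimation (MLE).

To do that, let's first briefly look at the difference between probability and likelihood. 🤪

Probability distributions can be divided into discrete probability distributions and continuous probability distributions. In a discrete distribution, as with the probability of a particular face of a die coming up, the sample space consists of finitely many (countable) events, and we can compute the probability of each one. Here probability and likelihood coincide: the likelihood of an outcome is simply its probability.

In a continuous probability distribution, on the other hand, a probability density is given, the probability of any single point event is 0, and the probability of an event is the area under the pdf over the corresponding interval.

Probability is that area when the probability density function is held fixed. Fixed means that the parameters defining the distribution have been specified. For example, the Gaussian distribution can be written as a function of $\mu, \sigma$.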

$\mathcal{N}(\mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$ Once $\mu, \sigma$ are fixed, we can compute probabilities.

Likelihood, in contrast, holds the data fixed and is expressed as a function of the distribution's parameters. That is, $p(X\,(\text{fixed}) \mid \theta) = \mathcal{L}(\theta \mid X\,(\text{fixed}))$. It is simply the value of the pdf at the observed event.
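
The distinction is easier to see in code. In this sketch the same Gaussian pdf is used in two ways: with the parameters fixed and x varying (the area gives a probability), and with one observation fixed and $\mu$ varying (each value is a likelihood). The observation 0.5 and the candidate means are made-up numbers for illustration only.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# Probability view: parameters fixed, x varies, probability = area under the pdf.
mu, sigma = 0.0, 1.0
grid = np.linspace(-1.0, 1.0, 2001)
density = gaussian_pdf(grid, mu, sigma)
# Trapezoid rule for the area between -1 and 1.
prob_within_one_sigma = np.sum((density[:-1] + density[1:]) / 2 * np.diff(grid))

# Likelihood view: data fixed, the parameter mu varies.
observed = 0.5  # hypothetical observation
candidate_mus = np.array([-1.0, 0.0, 0.5, 1.0])
likelihoods = gaussian_pdf(observed, candidate_mus, sigma)
best_mu = candidate_mus[np.argmax(likelihoods)]
print(prob_within_one_sigma, likelihoods, best_mu)
```

The area comes out near 0.683, the familiar one-sigma probability, while among the candidates the likelihood is largest for the mean that sits exactly on the observation.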

Let's apply this to Cheolsu's example. Suppose the relationship between time (hours) and score in his data can be expressed as $y = ax + b + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$; that is, the noise follows the normal distribution from the example above (mean 0, standard deviation $\sigma$). Then $\epsilon = y - (ax + b)$. The (x, y) values are (2, 20), (3, 50), (9, 80), (10, 90), and plugging each into the likelihood function gives the likelihoods below. Because $\mu = 0$ and the normal density depends on $\epsilon$ only through $\epsilon^2$, we can put $(ax + b) - y$ in place of x in the original pdf.

$p(20 \mid a, b, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{((2a + b) - 20)^2}{2\sigma^2}\right)$

$p(50 \mid a, b, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{((3a + b) - 50)^2}{2\sigma^2}\right)$

$p(80 \mid a, b, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{((9a + b) - 80)^2}{2\sigma^2}\right)$

$p(90 \mid a, b, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{((10a + b) - 90)^2}{2\sigma^2}\right)$

Assuming the samples are independent, the likelihood of the whole sample is the product of the individual likelihoods. For computational convenience we take the log of each likelihood and find, by differentiation, the $a, b$ that maximize the resulting log likelihood.

$\log(\text{likelihood}) = \log p(20 \mid a, b, \sigma^2) + \log p(50 \mid a, b, \sigma^2) + \log p(80 \mid a, b, \sigma^2) + \log p(90 \mid a, b, \sigma^2)$. Written out, this is $4\log\left(\frac{1}{\sigma\sqrt{2\pi}}\right) - \frac{((2a + b) - 20)^2}{2\sigma^2} - \frac{((3a + b) - 50)^2}{2\sigma^2} - \frac{((9a + b) - 80)^2}{2\sigma^2} - \frac{((10a + b) - 90)^2}{2\sigma^2}$.
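
Written as code, the log likelihood makes the link to least squares visible: the only part that depends on a and b is minus the sum of squared errors, scaled by $\frac{1}{2\sigma^2}$. The sketch below assumes an arbitrary $\sigma = 10$ (the post never fixes a value) and compares the OLS line found earlier with another candidate line chosen only for illustration.

```python
import numpy as np

x = np.array([2, 3, 9, 10], dtype=float)
y = np.array([20, 50, 80, 90], dtype=float)

def log_likelihood(a, b, sigma, x, y):
    """Sum of log N(0, sigma^2) densities evaluated at the errors (a*x + b) - y."""
    n = len(x)
    squared_errors = ((a * x + b) - y) ** 2
    constant = n * np.log(1.0 / (sigma * np.sqrt(2 * np.pi)))
    return constant - np.sum(squared_errors) / (2 * sigma ** 2)

sigma = 10.0  # hypothetical noise level; it does not change which (a, b) is best
ll_fit = log_likelihood(7.4, 15.6, sigma, x, y)    # line found by OLS and GD
ll_other = log_likelihood(7.0, 18.0, sigma, x, y)  # arbitrary comparison line
print(ll_fit, ll_other)
```

The OLS line has the higher log likelihood, and changing `sigma` rescales the gap without changing the ranking, so maximizing the likelihood over a and b is the same as minimizing the squared error.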

Now, to maximize this, we differentiate with respect to a and b in turn:

$\frac{\partial \log(\text{likelihood})}{\partial a} = \frac{388a + 48b - 3620}{-2\sigma^2} = 0$ and

$\frac{\partial \log(\text{likelihood})}{\partial b} = \frac{48a + 8b - 480}{-2\sigma^2} = 0$.

Solving this system of equations,

we get a = 7.4, b = 15.6, which matches the values above. 😎😎
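
The system can also be solved numerically. Since $-2\sigma^2$ is nonzero, the two conditions reduce to 388a + 48b = 3620 and 48a + 8b = 480, and the sketch below hands them to `np.linalg.solve`. The variable names are my own.

```python
import numpy as np

# Coefficients of the two numerators set to zero:
#   388a + 48b = 3620
#    48a +  8b =  480
A = np.array([[388.0, 48.0],
              [48.0, 8.0]])
rhs = np.array([3620.0, 480.0])

a, b = np.linalg.solve(A, rhs)
print(a, b)
```

Note that the coefficient matrix is just $2 X^T X$ with the roles of the slope and intercept swapped, which is another way to see that MLE under Gaussian noise lands on the OLS solution.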

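Cheolsu finally gave up and decided to start studying. He would need to put in nearly 8 hours, after all.

But after spending so long on verification, he had less than an hour left to study for the exam..

In the end, sadly, he never got the game console. 😂😂😂😂😂😂

In this post we worked through an example and solved linear regression in several different ways.

Questions, discussion, and comments pointing out typos are all welcome.

Thank you, bye! 👏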

Reference

https://ko.wikipedia.org/wiki/%ED%9A%8C%EA%B7%80_%EB%B6%84%EC%84%9D

https://ko.wikipedia.org/wiki/%EC%84%A0%ED%98%95_%ED%9A%8C%EA%B7%80

https://datascienceschool.net/02%20mathematics/04.04%20%ED%96%89%EB%A0%AC%EC%9D%98%20%EB%AF%B8%EB%B6%84.html

https://statproofbook.github.io/P/slr-mle

https://ko.d2l.ai/chapter_deep-learning-basics/linear-regression.html
