Apply future_encoders.py, add confidence interval computation
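
The confidence interval computation named in the title can be tried outside the notebook with something like the following. This is only a minimal standard-library sketch of the z-score variant, not the notebook's code: `rmse_confidence_interval` is a hypothetical helper, the data is synthetic, and clipping the lower bound at zero is an added assumption to keep the square root defined for very small samples.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def rmse_confidence_interval(y_true, y_pred, confidence=0.95):
    """Approximate confidence interval for the RMSE using z-scores.

    Hypothetical helper: builds an interval around the mean squared error
    and takes the square root of both bounds.
    """
    squared_errors = [(p - t) ** 2 for p, t in zip(y_pred, y_true)]
    m = len(squared_errors)
    mse = mean(squared_errors)
    # Two-sided critical value of the standard normal distribution.
    zscore = NormalDist().inv_cdf((1 + confidence) / 2)
    # Margin = critical value * standard error of the mean squared error.
    margin = zscore * stdev(squared_errors) / math.sqrt(m)
    # Assumption: clip at zero so the square root stays defined.
    lower = math.sqrt(max(mse - margin, 0.0))
    upper = math.sqrt(mse + margin)
    return lower, upper


# Synthetic example data (not the housing test set).
random.seed(42)
y_true = [random.gauss(0.0, 1.0) for _ in range(1000)]
y_pred = [t + random.gauss(0.0, 0.5) for t in y_true]

lower, upper = rmse_confidence_interval(y_true, y_pred)
rmse = math.sqrt(mean([(p - t) ** 2 for p, t in zip(y_pred, y_true)]))
print(lower, rmse, upper)
```

With a large test set the z-score and t-score intervals should be nearly identical, so this sketch uses the z-score form only to avoid a SciPy dependency.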

2018-05-19 18:01:33 +09:00
parent e168f9941e
commit 9ae4f7b037
2 changed files with 1098 additions and 3 deletions
--- a/02_end_to_end_machine_learning_project.ipynb
+++ b/02_end_to_end_machine_learning_project.ipynb
@@ -631,7 +631,7 @@
   "outputs": [],
   "source": [
    "# This version of test_set_check() also supports Python 2.\n",
-    "def test_set_check(identifier, test_ratio, hash):\n",
+    "def test_set_check(identifier, test_ratio, hash=hashlib.md5):\n",
    "    return bytearray(hash(np.int64(identifier)).digest())[-1] < 256 * test_ratio"
   ]
  },
@@ -2597,6 +2597,13 @@
    "Now let's preprocess the categorical input feature, `ocean_proximity`:"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### The method used in the book"
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": 59,
@@ -3057,6 +3064,129 @@
    "cat_encoder.categories_"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### The new method using future_encoders.py"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "housing_cat = housing[['ocean_proximity']]\n",
+    "housing_cat.head(10)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
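+    "Warning: the translated edition of the book uses the pandas `Series.factorize()` method to encode string categorical features as integers. The `OrdinalEncoder` class that will be added in Scikit-Learn 0.20 (PR #10521) is the better choice, because it is designed for input features (`X`, not labels `y`) and works well in pipelines (introduced later in this notebook). For now we import it from the `future_encoders.py` file, but once Scikit-Learn 0.20 is released you can import it directly from `sklearn.preprocessing`."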
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from future_encoders import OrdinalEncoder"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "ordinal_encoder = OrdinalEncoder()\n",
+    "housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)\n",
+    "housing_cat_encoded[:10]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "ordinal_encoder.categories_"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
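+    "Warning: the translated edition of the book uses `CategoricalEncoder` to convert each categorical value into a one-hot vector. It is better to use `OneHotEncoder`. The currently released `OneHotEncoder` can only handle integer categorical inputs, but in Scikit-Learn 0.20 it will also handle string categorical inputs (PR #10521). For now we import it from the `future_encoders.py` file, but once Scikit-Learn 0.20 is released you can import it directly from `sklearn.preprocessing`."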
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from future_encoders import OneHotEncoder\n",
+    "\n",
+    "cat_encoder = OneHotEncoder()\n",
+    "housing_cat_1hot = cat_encoder.fit_transform(housing_cat)\n",
+    "housing_cat_1hot"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "By default, the `OneHotEncoder` class returns a sparse matrix, but you can convert it to a dense array if needed by calling the `toarray()` method:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "housing_cat_1hot.toarray()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Alternatively, you can set `sparse=False` when creating the `OneHotEncoder`:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "cat_encoder = OneHotEncoder(sparse=False)\n",
+    "housing_cat_1hot = cat_encoder.fit_transform(housing_cat)\n",
+    "housing_cat_1hot"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "cat_encoder.categories_"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### The book's content resumes here"
+   ]
+  },
  {
   "cell_type": "markdown",
   "metadata": {},
@@ -3237,7 +3367,9 @@
    }
   ],
   "source": [
-    "housing_extra_attribs = pd.DataFrame(housing_extra_attribs, columns=list(housing.columns)+[\"rooms_per_household\", \"population_per_household\"])\n",
+    "housing_extra_attribs = pd.DataFrame(\n",
+    "    housing_extra_attribs, \n",
+    "    columns=list(housing.columns)+[\"rooms_per_household\", \"population_per_household\"])\n",
    "housing_extra_attribs.head()"
   ]
  },
@@ -3353,6 +3485,19 @@
    "    ])"
   ]
  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Method using future_encoders.py\n",
+    "cat_pipeline = Pipeline([\n",
+    "        ('selector', DataFrameSelector(cat_attribs)),\n",
+    "        ('cat_encoder', OneHotEncoder(sparse=False)),\n",
+    "    ])"
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": 75,
@@ -4789,6 +4934,74 @@
    "final_rmse"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We can compute a 95% confidence interval for the test RMSE:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from scipy import stats"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "confidence = 0.95\n",
+    "squared_errors = (final_predictions - y_test) ** 2\n",
+    "mean = squared_errors.mean()\n",
+    "m = len(squared_errors)\n",
+    "\n",
+    "np.sqrt(stats.t.interval(confidence, m - 1,\n",
+    "                         loc=np.mean(squared_errors),\n",
+    "                         scale=stats.sem(squared_errors)))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We can also compute the interval manually like this:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "tscore = stats.t.ppf((1 + confidence) / 2, df=m - 1)\n",
+    "tmargin = tscore * squared_errors.std(ddof=1) / np.sqrt(m)\n",
+    "np.sqrt(mean - tmargin), np.sqrt(mean + tmargin)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Alternatively, we can use z-scores instead of t-scores:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "zscore = stats.norm.ppf((1 + confidence) / 2)\n",
+    "zmargin = zscore * squared_errors.std(ddof=1) / np.sqrt(m)\n",
+    "np.sqrt(mean - zmargin), np.sqrt(mean + zmargin)"
+   ]
+  },
  {
   "cell_type": "markdown",
   "metadata": {},
@@ -6182,7 +6395,7 @@
    "from scipy.stats import expon, reciprocal\n",
    "\n",
    "# For expon(), reciprocal(), and other probability distribution functions, see:\n",
-    "# https://docs.scipy.org/doc/scipy-0.19.0/reference/stats.html\n",
+    "# https://docs.scipy.org/doc/scipy/reference/stats.html\n",
    "\n",
    "# Note: gamma is ignored when the kernel parameter is \"linear\".\n",
    "param_distribs = {\n",