XGBoost 에제 추가, 연습문제 8, 9번 해답 추가

2018-05-19 00:44:23 +09:00
parent e895575805
commit 2b77ff699a
1 changed files with 509 additions and 2 deletions
--- a/07_ensemble_learning_and_random_forests.ipynb
+++ b/07_ensemble_learning_and_random_forests.ipynb
@@ -1596,7 +1596,7 @@
    "    else:\n",
    "        error_going_up += 1\n",
    "        if error_going_up == 5:\n",
-    "            break  # early stopping"
+    "            break  # 조기 종료"
   ]
  },
  {
@@ -1616,6 +1616,81 @@
    "print(gbrt.n_estimators)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"최소 검증 MSE:\", min_val_error)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# XGBoost"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "try:\n",
    "    import xgboost\n",
    "except ImportError as ex:\n",
    "    print(\"에러: xgboost 라이브러리가 설치되지 않았습니다.\")\n",
    "    xgboost = None"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if xgboost is not None:  # 책에는 없음\n",
    "    xgb_reg = xgboost.XGBRegressor(random_state=42)\n",
    "    xgb_reg.fit(X_train, y_train)\n",
    "    y_pred = xgb_reg.predict(X_val)\n",
    "    val_error = mean_squared_error(y_val, y_pred)\n",
    "    print(\"검증 MSE:\", val_error)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if xgboost is not None:  # 책에는 없음\n",
    "    xgb_reg.fit(X_train, y_train,\n",
    "                eval_set=[(X_val, y_val)], early_stopping_rounds=2)\n",
    "    y_pred = xgb_reg.predict(X_val)\n",
    "    val_error = mean_squared_error(y_val, y_pred)\n",
    "    print(\"검증 MSE:\", val_error)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%timeit xgboost.XGBRegressor().fit(X_train, y_train) if xgboost is not None else None"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%timeit GradientBoostingRegressor().fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
@@ -1629,7 +1704,439 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "**곧 제공됩니다**"
+    "## 1. to 7.\n",
    "\n",
    "부록 A를 참조."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 8. 투표 기반 분류기\n",
    "\n",
    "*문제: MNIST 데이터를 불러들여 훈련 세트, 검증 세트, 테스트 세트로 나눕니다(예를 들면 훈련에 40,000개 샘플, 검증에 10,000개 샘플, 테스트에 10,000개 샘플).*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import fetch_mldata"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mnist = fetch_mldata('MNIST original')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train_val, X_test, y_train_val, y_test = train_test_split(\n",
    "    mnist.data, mnist.target, test_size=10000, random_state=42)\n",
    "X_train, X_val, y_train, y_val = train_test_split(\n",
    "    X_train_val, y_train_val, test_size=10000, random_state=42)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*문제: 그런 다음 랜덤 포레스트 분류기, 엑스트라 트리 분류기, SVM 같은 여러 종류의 분류기를 훈련시킵니다.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier\n",
    "from sklearn.svm import LinearSVC\n",
    "from sklearn.neural_network import MLPClassifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "random_forest_clf = RandomForestClassifier(random_state=42)\n",
    "extra_trees_clf = ExtraTreesClassifier(random_state=42)\n",
    "svm_clf = LinearSVC(random_state=42)\n",
    "mlp_clf = MLPClassifier(random_state=42)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "estimators = [random_forest_clf, extra_trees_clf, svm_clf, mlp_clf]\n",
    "for estimator in estimators:\n",
    "    print(\"훈련 예측기: \", estimator)\n",
    "    estimator.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "[estimator.score(X_val, y_val) for estimator in estimators]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "선형 SVM이 다른 분류기보다 성능이 많이 떨어집니다. 그러나 투표 기반 분류기의 성능을 향상시킬 수 있으므로 그대로 두겠습니다."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*문제: 그리고 검증 세트에서 개개의 분류기보다 더 높은 성능을 내도록 이들을 간접 또는 직접 투표 분류기를 사용하는 앙상블로 연결해보세요.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.ensemble import VotingClassifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "named_estimators = [\n",
    "    (\"random_forest_clf\", random_forest_clf),\n",
    "    (\"extra_trees_clf\", extra_trees_clf),\n",
    "    (\"svm_clf\", svm_clf),\n",
    "    (\"mlp_clf\", mlp_clf),\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "voting_clf = VotingClassifier(named_estimators)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "voting_clf.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "voting_clf.score(X_val, y_val)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "[estimator.score(X_val, y_val) for estimator in voting_clf.estimators_]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "SVM 모델을 제거해서 성능이 향상되는지 확인해 보죠. 다음과 같이 `set_params()`를 사용하여 `None`으로 지정하면 특정 예측기를  제외시킬 수 있습니다:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "voting_clf.set_params(svm_clf=None)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "예측기 목록이 업데이트되었습니다:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "voting_clf.estimators"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "하지만 훈련된 예측기 목록은 업데이트되지 않습니다:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "voting_clf.estimators_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`VotingClassifier`를 다시 훈련시키거나 그냥 훈련된 예측기 목록에서 SVM 모델을 제거할 수 있습니다:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "del voting_clf.estimators_[2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`VotingClassifier`를 다시 평가해 보죠:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "voting_clf.score(X_val, y_val)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "훨씬 나아졌네요! SVM 모델이 성능을 저하시켰습니다. 이제 간접 투표 분류기를 사용해 보죠. 분류기를 다시 훈련시킬 필요는 없고 `voting`을 `\"soft\"`로 지정하면 됩니다:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "voting_clf.voting = \"soft\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "voting_clf.score(X_val, y_val)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "성능이 많이 나아졌고 개개의 분류기보다 훨씬 좋습니다."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*문제: 앙상블을 얻고 나면 테스트 세트로 확인해보세요. 개개의 분류기와 비교해서 성능이 얼마나 향상되나요?*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "voting_clf.score(X_test, y_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "[estimator.score(X_test, y_test) for estimator in voting_clf.estimators_]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "투표 기반 분류기는 에러율을 제일 좋은 모델(`MLPClassifier`)의 4.9%에서 3.5%로 줄였습니다. 약 28%나 오류가 적습니다. 나쁘지 않네요!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 9. 스태킹 앙상블"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*문제: 이전 연습문제의 각 분류기를 실행해서 검증 세트에서 예측을 만들고 그 결과로 새로운 훈련 세트를 만들어보세요. 각 훈련 샘플은 하나의 이미지에 대한 전체 분류기의 예측을 담은 벡터고 타깃은 이미지의 클래스입니다. 새로운 이 훈련 세트에 분류기 하나를 훈련시켜 보세요.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_val_predictions = np.empty((len(X_val), len(estimators)), dtype=np.float32)\n",
    "\n",
    "for index, estimator in enumerate(estimators):\n",
    "    X_val_predictions[:, index] = estimator.predict(X_val)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_val_predictions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rnd_forest_blender = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)\n",
    "rnd_forest_blender.fit(X_val_predictions, y_val)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rnd_forest_blender.oob_score_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "이 블렌더를 세밀하게 튜닝하거나 다른 종류의 블렌더(예를 들어, MLPClassifier)를 시도해 볼 수 있습니다. 그런 늘 하던대로 다음 교차 검증을 사용해 가장 좋은 것을 선택합니다."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*문제: 축하합니다. 방금 블렌더를 훈련시켰습니다. 그리고 이 분류기를 모아서 스태킹 앙상블을 구성했습니다. 이제 테스트 세트에 앙상블을 평가해보세요. 테스트 세트의 각 이미지에 대해 모든 분류기로 예측을 만들고 앙상블의 예측 결과를 만들기 위해 블렌더에 그 예측을 주입합니다. 앞서 만든 투표 분류기와 비교하면 어떤가요?*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_test_predictions = np.empty((len(X_test), len(estimators)), dtype=np.float32)\n",
    "\n",
    "for index, estimator in enumerate(estimators):\n",
    "    X_test_predictions[:, index] = estimator.predict(X_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y_pred = rnd_forest_blender.predict(X_test_predictions)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import accuracy_score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "accuracy_score(y_test, y_pred)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "이 스태킹 앙상블은 앞서 만든 간접 투표 분류기만큼 성능을 내지는 못합니다. 하지만 개개의 분류기보다는 당연히 좋습니다."
   ]
  }
 ],