Change variance_scaling_initializer, apply ReplayMemory class, add solution to exercise 8

2018-05-19 23:11:06 +09:00
parent ef3099f411
commit 6935967596
1 changed files with 406 additions and 6 deletions
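
The `ReplayMemory` class this commit introduces is a fixed-size ring buffer: once it is full, each new item overwrites the oldest slot. The following is a minimal, self-contained sketch of that behavior rather than the notebook's exact cell. It assumes `dtype=object` in place of `np.object` (which newer NumPy releases no longer provide), and the transitions it stores are hypothetical placeholder tuples.

```python
import numpy as np


class ReplayMemory:
    """Fixed-size ring buffer, mirroring the class added in this commit."""

    def __init__(self, maxlen):
        self.maxlen = maxlen
        # Assumption: dtype=object stands in for np.object.
        self.buf = np.empty(shape=maxlen, dtype=object)
        self.index = 0   # next slot to write
        self.length = 0  # number of valid entries

    def append(self, data):
        self.buf[self.index] = data
        self.length = min(self.length + 1, self.maxlen)
        self.index = (self.index + 1) % self.maxlen

    def sample(self, batch_size, with_replacement=True):
        if with_replacement:
            # Draw indices independently, so duplicates are possible
            indices = np.random.randint(self.length, size=batch_size)
        else:
            indices = np.random.permutation(self.length)[:batch_size]
        return self.buf[indices]


memory = ReplayMemory(maxlen=3)
for step in range(5):
    # Hypothetical transition: (state, action, reward, next_state, continue)
    memory.append((step, 0, 1.0, step + 1, 1.0))

# Only the three most recent transitions survive the wrap-around
stored_states = sorted(item[0] for item in memory.buf)

# Sampling with replacement can return more items than are stored
batch = memory.sample(batch_size=8)
sampled_states = {item[0] for item in batch}
```

Because sampling with replacement only draws random indices, its cost depends on the batch size rather than on the buffer size, whereas the permutation branch shuffles every stored index before taking a slice.
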
--- a/16_reinforcement_learning.ipynb
+++ b/16_reinforcement_learning.ipynb
@@ -5446,7 +5446,7 @@
    "n_inputs = 4  # == env.observation_space.shape[0]\n",
    "n_hidden = 4  # 간단한 작업이므로 너무 많은 뉴런이 필요하지 않습니다\n",
    "n_outputs = 1 # 왼쪽으로 가속할 확률을 출력합니다\n",
-    "initializer = tf.contrib.layers.variance_scaling_initializer()\n",
+    "initializer = tf.variance_scaling_initializer()\n",
    "\n",
    "# 2. 네트워크를 만듭니다\n",
    "X = tf.placeholder(tf.float32, shape=[None, n_inputs])\n",
@@ -5641,7 +5641,7 @@
    "\n",
    "learning_rate = 0.01\n",
    "\n",
-    "initializer = tf.contrib.layers.variance_scaling_initializer()\n",
+    "initializer = tf.variance_scaling_initializer()\n",
    "\n",
    "X = tf.placeholder(tf.float32, shape=[None, n_inputs])\n",
    "y = tf.placeholder(tf.float32, shape=[None, n_outputs])\n",
@@ -5901,7 +5901,7 @@
    "\n",
    "learning_rate = 0.01\n",
    "\n",
-    "initializer = tf.contrib.layers.variance_scaling_initializer()\n",
+    "initializer = tf.variance_scaling_initializer()\n",
    "\n",
    "X = tf.placeholder(tf.float32, shape=[None, n_inputs])\n",
    "\n",
@@ -6760,7 +6760,7 @@
    "n_hidden = 512\n",
    "hidden_activation = tf.nn.relu\n",
    "n_outputs = env.action_space.n  # 9개의 행동이 가능합니다\n",
-    "initializer = tf.contrib.layers.variance_scaling_initializer()\n",
+    "initializer = tf.variance_scaling_initializer()\n",
    "\n",
    "def q_network(X_state, name):\n",
    "    prev_layer = X_state / 128.0 # 픽셀 강도를 [-1.0, 1.0] 범위로 스케일 변경합니다.\n",
@@ -6859,6 +6859,13 @@
    "saver = tf.train.Saver()"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
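+    "Note: when the book was first written, the loss was the squared error between the target Q-values (y) and the predicted Q-values (q_value). However, because the experiences are very noisy, it is better to use a quadratic loss only for small errors (1.0 or less) and a linear loss (twice the absolute error) for larger errors, as the computation above does. This way, large errors cannot change the model parameters too much. A few hyperparameters were also tweaked: a smaller learning rate, and Nesterov Accelerated Gradient instead of Adam optimization, since according to the paper adaptive gradient algorithms may sometimes perform badly. A few other hyperparameters are changed below as well (a larger replay memory, more decay steps for the ε-greedy policy, a larger discount rate, less frequent copies from the online DQN to the target DQN, and so on)."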
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": 64,
@@ -6872,7 +6879,7 @@
    "\n",
    "def sample_memories(batch_size):\n",
    "    indices = np.random.permutation(len(replay_memory))[:batch_size]\n",
-    "    cols = [[], [], [], [], []] # state, action, reward, next state, done\n",
+    "    cols = [[], [], [], [], []] # state, action, reward, next state, continue\n",
    "    for idx in indices:\n",
    "        memory = replay_memory[idx]\n",
    "        for col, value in zip(cols, memory):\n",
@@ -6881,6 +6888,78 @@
    "    return cols[0], cols[1], cols[2].reshape(-1, 1), cols[3], cols[4].reshape(-1, 1)"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Approach using the ReplayMemory class =================="
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
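+    "We use a ReplayMemory class instead of a deque because random access is much faster (thanks to @NileshPS for contributing it). In addition, sampling with replacement, which is the default, is much faster than sampling without replacement on large replay memories."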
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "class ReplayMemory:\n",
+    "    def __init__(self, maxlen):\n",
+    "        self.maxlen = maxlen\n",
+    "        self.buf = np.empty(shape=maxlen, dtype=object)  # np.object was removed in recent NumPy releases\n",
+    "        self.index = 0\n",
+    "        self.length = 0\n",
+    "        \n",
+    "    def append(self, data):\n",
+    "        self.buf[self.index] = data\n",
+    "        self.length = min(self.length + 1, self.maxlen)\n",
+    "        self.index = (self.index + 1) % self.maxlen\n",
+    "    \n",
+    "    def sample(self, batch_size, with_replacement=True):\n",
+    "        if with_replacement:\n",
+    "            indices = np.random.randint(self.length, size=batch_size) # faster\n",
+    "        else:\n",
+    "            indices = np.random.permutation(self.length)[:batch_size]\n",
+    "        return self.buf[indices]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "replay_memory_size = 500000\n",
+    "replay_memory = ReplayMemory(replay_memory_size)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def sample_memories(batch_size):\n",
+    "    cols = [[], [], [], [], []] # state, action, reward, next state, continue\n",
+    "    for memory in replay_memory.sample(batch_size):\n",
+    "        for col, value in zip(cols, memory):\n",
+    "            col.append(value)\n",
+    "    cols = [np.array(col) for col in cols]\n",
+    "    return cols[0], cols[1], cols[2].reshape(-1, 1), cols[3], cols[4].reshape(-1, 1)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### ============================================="
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": 65,
@@ -10967,6 +11046,8 @@
   "metadata": {},
   "outputs": [],
   "source": [
+    "from collections import deque\n",
+    "\n",
    "def combine_observations_multichannel(preprocessed_observations):\n",
    "    return np.array(preprocessed_observations).transpose([1, 2, 0])\n",
    "\n",
@@ -11027,7 +11108,326 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "Coming soon..."
+    "## 1. to 7."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "See Appendix A."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 8. BipedalWalker-v2"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "*Exercise: Use policy gradients to train OpenAI Gym's ‘BipedalWalker-v2’.*"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import gym"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "env = gym.make(\"BipedalWalker-v2\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Note: if creating the `BipedalWalker-v2` environment fails with an error such as \"`module 'Box2D._Box2D' has no attribute 'RAND_LIMIT'`\", try the following:\n",
+    "```\n",
+    "$ pip uninstall Box2D-kengz\n",
+    "$ pip install git+https://github.com/pybox2d/pybox2d\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "obs = env.reset()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "img = env.render(mode=\"rgb_array\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "plt.imshow(img)\n",
+    "plt.axis(\"off\")\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "obs"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "See the [online documentation](https://github.com/openai/gym/wiki/BipedalWalker-v2) for the meaning of these 24 numbers."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "env.action_space"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "env.action_space.low"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "env.action_space.high"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
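+    "This is a continuous 4D action space that controls the hip joint torque and the knee joint torque of each leg (from -1 to 1). One way to deal with a continuous action space is to discretize it. For example, we can restrict the possible torque values to three values: -1.0, 0.0, and 1.0. This leaves $3^4=81$ possible actions."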
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from itertools import product"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "possible_torques = np.array([-1.0, 0.0, 1.0])\n",
+    "possible_actions = np.array(list(product(possible_torques, possible_torques, possible_torques, possible_torques)))\n",
+    "possible_actions.shape"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "tf.reset_default_graph()\n",
+    "\n",
+    "# 1. Specify the network architecture\n",
+    "n_inputs = env.observation_space.shape[0]  # == 24\n",
+    "n_hidden = 10\n",
+    "n_outputs = len(possible_actions) # == 81 (3 torque values for each of 4 joints)\n",
+    "initializer = tf.variance_scaling_initializer()\n",
+    "\n",
+    "# 2. Build the neural network\n",
+    "X = tf.placeholder(tf.float32, shape=[None, n_inputs])\n",
+    "\n",
+    "hidden = tf.layers.dense(X, n_hidden, activation=tf.nn.selu,\n",
+    "                         kernel_initializer=initializer)\n",
+    "logits = tf.layers.dense(hidden, n_outputs,\n",
+    "                         kernel_initializer=initializer)\n",
+    "outputs = tf.nn.softmax(logits)\n",
+    "\n",
+    "# 3. Select a random action based on the estimated probabilities\n",
+    "action_index = tf.squeeze(tf.multinomial(logits, num_samples=1), axis=-1)\n",
+    "\n",
+    "# 4. Training\n",
+    "learning_rate = 0.01\n",
+    "\n",
+    "y = tf.one_hot(action_index, depth=len(possible_actions))\n",
+    "cross_entropy = tf.nn.softmax_cross_entropy_with_logits_v2(labels=y, logits=logits)\n",
+    "optimizer = tf.train.AdamOptimizer(learning_rate)\n",
+    "grads_and_vars = optimizer.compute_gradients(cross_entropy)\n",
+    "gradients = [grad for grad, variable in grads_and_vars]\n",
+    "gradient_placeholders = []\n",
+    "grads_and_vars_feed = []\n",
+    "for grad, variable in grads_and_vars:\n",
+    "    gradient_placeholder = tf.placeholder(tf.float32, shape=grad.get_shape())\n",
+    "    gradient_placeholders.append(gradient_placeholder)\n",
+    "    grads_and_vars_feed.append((gradient_placeholder, variable))\n",
+    "training_op = optimizer.apply_gradients(grads_and_vars_feed)\n",
+    "\n",
+    "init = tf.global_variables_initializer()\n",
+    "saver = tf.train.Saver()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's try running this policy network, even though it is not trained yet."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def run_bipedal_walker(model_path=None, n_max_steps = 1000):\n",
+    "    env = gym.make(\"BipedalWalker-v2\")\n",
+    "    frames = []\n",
+    "    with tf.Session() as sess:\n",
+    "        if model_path is None:\n",
+    "            init.run()\n",
+    "        else:\n",
+    "            saver.restore(sess, model_path)\n",
+    "        obs = env.reset()\n",
+    "        for step in range(n_max_steps):\n",
+    "            img = env.render(mode=\"rgb_array\")\n",
+    "            frames.append(img)\n",
+    "            action_index_val = action_index.eval(feed_dict={X: obs.reshape(1, n_inputs)})\n",
+    "            action = possible_actions[action_index_val]\n",
+    "            obs, reward, done, info = env.step(action[0])\n",
+    "            if done:\n",
+    "                break\n",
+    "    env.close()\n",
+    "    return frames"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "frames = run_bipedal_walker()\n",
+    "video = plot_animation(frames)\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Nope, it cannot walk at all. So let's train it!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "n_games_per_update = 10\n",
+    "n_max_steps = 1000\n",
+    "n_iterations = 1000\n",
+    "save_iterations = 10\n",
+    "discount_rate = 0.95\n",
+    "\n",
+    "with tf.Session() as sess:\n",
+    "    init.run()\n",
+    "    for iteration in range(n_iterations):\n",
+    "        print(\"\\rIteration: {}/{}\".format(iteration + 1, n_iterations), end=\"\")\n",
+    "        all_rewards = []\n",
+    "        all_gradients = []\n",
+    "        for game in range(n_games_per_update):\n",
+    "            current_rewards = []\n",
+    "            current_gradients = []\n",
+    "            obs = env.reset()\n",
+    "            for step in range(n_max_steps):\n",
+    "                action_index_val, gradients_val = sess.run([action_index, gradients],\n",
+    "                                                           feed_dict={X: obs.reshape(1, n_inputs)})\n",
+    "                action = possible_actions[action_index_val]\n",
+    "                obs, reward, done, info = env.step(action[0])\n",
+    "                current_rewards.append(reward)\n",
+    "                current_gradients.append(gradients_val)\n",
+    "                if done:\n",
+    "                    break\n",
+    "            all_rewards.append(current_rewards)\n",
+    "            all_gradients.append(current_gradients)\n",
+    "\n",
+    "        all_rewards = discount_and_normalize_rewards(all_rewards, discount_rate=discount_rate)\n",
+    "        feed_dict = {}\n",
+    "        for var_index, gradient_placeholder in enumerate(gradient_placeholders):\n",
+    "            mean_gradients = np.mean([reward * all_gradients[game_index][step][var_index]\n",
+    "                                      for game_index, rewards in enumerate(all_rewards)\n",
+    "                                          for step, reward in enumerate(rewards)], axis=0)\n",
+    "            feed_dict[gradient_placeholder] = mean_gradients\n",
+    "        sess.run(training_op, feed_dict=feed_dict)\n",
+    "        if iteration % save_iterations == 0:\n",
+    "            saver.save(sess, \"./my_bipedal_walker_pg.ckpt\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "frames = run_bipedal_walker(\"./my_bipedal_walker_pg.ckpt\")\n",
+    "video = plot_animation(frames)\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "It is not the best result, but at least it stays upright and (slowly) moves to the right. A better approach to this problem is to use an actor-critic algorithm. Because that method does not need to discretize the action space, it converges much faster. For more details, see the nice [blog post](https://towardsdatascience.com/reinforcement-learning-w-keras-openai-actor-critic-models-f084612cfd69) by Yash Patel."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 9.\n",
+    "**Coming soon**"
   ]
  }
 ],