Hello, this is mj!

Today we'll implement Q-learning, a reinforcement learning algorithm, in Python.

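What is Q-learning?

Q-learning is a reinforcement learning method in which an agent learns how to act by interacting with an environment. It learns a value (a Q-value) for each state-action pair, and the optimal policy follows from choosing the highest-valued action in each state.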
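
For reference, the standard tabular Q-learning update can be written as below. Here s is the current state, a the action taken, r the reward received, s' the next state, α the learning rate, and γ the discount factor; the last two correspond to the learning_rate and discount_factor parameters in the code later in this post.

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```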

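Implementing the Q-learning algorithm

In this post we'll implement Q-learning in a simple grid world, a grid-shaped environment in which the agent has to move from its starting cell to a goal cell.

Installing the required libraries

First, install the required libraries.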

pip install numpy matplotlib

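Setting up the environment

Next, set up the grid world environment. Below is a simple 4×4 grid world: the agent starts at (0, 0), the goal is at (3, 3), and the step that reaches the goal gives a reward of 1 (every other step gives 0).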


import numpy as np

class GridWorld:
    def __init__(self, size):
        self.size = size
        self.state = (0, 0)  # agent's initial position (row, column)
        self.goal = (size - 1, size - 1)  # goal cell

    def reset(self):
        # Put the agent back at the start and return the initial state
        self.state = (0, 0)
        return self.state

    def step(self, action):
        # Apply the action; moves that would leave the grid are clipped at the edge
        x, y = self.state
        if action == 0:  # up
            x = max(x - 1, 0)
        elif action == 1:  # down
            x = min(x + 1, self.size - 1)
        elif action == 2:  # left
            y = max(y - 1, 0)
        elif action == 3:  # right
            y = min(y + 1, self.size - 1)
        self.state = (x, y)
        reward = 1 if self.state == self.goal else 0  # reward only at the goal
        return self.state, reward

The Q-learning algorithm

Now let's implement the Q-learning agent.


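class QLearningAgent:
    def __init__(self, actions, learning_rate=0.1, discount_factor=0.9,
                 epsilon=0.1, grid_size=4):
        # One Q-value per (row, column, action), initialized to zero
        self.q_table = np.zeros((grid_size, grid_size, len(actions)))
        self.learning_rate = learning_rate
        self.discount_factor = discount_factor
        self.epsilon = epsilon  # probability of taking a random action
        self.actions = actions  # assumed to be the indices 0..n-1

    def choose_action(self, state):
        if np.random.rand() < self.epsilon:  # explore: random action
            return np.random.choice(self.actions)
        else:  # exploit: action with the highest Q-value
            return np.argmax(self.q_table[state[0], state[1]])

    def learn(self, state, action, reward, next_state):
        # Best action in the next state according to the current estimates
        best_next_action = np.argmax(self.q_table[next_state[0], next_state[1]])
        # TD target: reward plus the discounted value of that action
        td_target = reward + self.discount_factor * self.q_table[next_state[0], next_state[1], best_next_action]
        # Move the current estimate a fraction (learning_rate) toward the target
        td_delta = td_target - self.q_table[state[0], state[1], action]
        self.q_table[state[0], state[1], action] += self.learning_rate * td_delta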
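
To make the update in learn concrete, here is a minimal sketch of a single update worked by hand on a fresh Q-table. The state (3, 2), the action 3 (right), and the reward 1 are values picked for illustration, assuming the 4×4 grid and the default learning_rate=0.1 and discount_factor=0.9 from this post.

```python
import numpy as np

# Illustrative values, chosen for this example
learning_rate = 0.1
discount_factor = 0.9
q_table = np.zeros((4, 4, 4))  # (row, column, action), all zeros before training

state, action = (3, 2), 3       # one cell left of the goal, moving right
next_state, reward = (3, 3), 1  # the move reaches the goal

# TD target: immediate reward plus the discounted best value of the next state
td_target = reward + discount_factor * np.max(q_table[next_state[0], next_state[1]])
# TD error: how far the current estimate is from the target
td_delta = td_target - q_table[state[0], state[1], action]
q_table[state[0], state[1], action] += learning_rate * td_delta

updated_value = q_table[state[0], state[1], action]
print(updated_value)
```

The next state's values are still all zero, so the target is just the reward of 1, and the entry moves one tenth of the way from 0 toward it. Repeated visits keep pulling it closer to 1.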

Training

Now let's write the code that trains the agent.


actions = [0, 1, 2, 3]  # up, down, left, right
env = GridWorld(size=4)
agent = QLearningAgent(actions)

for episode in range(1000):
    state = env.reset()
    done = False
    while not done:
        action = agent.choose_action(state)
        next_state, reward = env.step(action)
        agent.learn(state, action, reward, next_state)
        state = next_state
        done = reward == 1  # the episode ends when the goal is reached
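
As a rough sanity check on what training should approach: in this setup the only reward is the 1 received on the step that enters the goal, and the goal cell's own Q-values are never updated, so they stay at 0. Under those conditions, the optimal Q-value of an action on a shortest path that reaches the goal in k moves (counting the action itself) is discount_factor raised to the power k - 1. The helper name below is hypothetical, and the learned values will only approximate these numbers after a finite number of episodes.

```python
def optimal_q_on_shortest_path(steps_to_goal, discount_factor=0.9):
    """Optimal Q-value of a shortest-path action that reaches the goal in
    `steps_to_goal` moves, with reward 1 at the goal and 0 elsewhere.
    Hypothetical helper, for illustration only."""
    return discount_factor ** (steps_to_goal - 1)

# The last move into the goal is worth the full reward
last_step_value = optimal_q_on_shortest_path(1)
# From the start (0, 0) of a 4x4 grid, the goal (3, 3) is 6 moves away
start_value = optimal_q_on_shortest_path(6)
print(last_step_value, round(start_value, 5))
```

If the largest Q-value at (0, 0) after training is far from roughly 0.59, that usually suggests the agent has not yet visited the relevant state-action pairs often enough.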

Printing the results

Let's look at the Q-table the agent has learned.


print(agent.q_table)

Running the code above prints the Q-table, an array of shape (4, 4, 4) that holds the Q-value of each of the four actions for every cell of the grid.
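
A raw Q-table is hard to read, so one option is to turn it into a greedy policy by taking the highest-valued action in each cell. The sketch below assumes the action encoding used in this post (0 up, 1 down, 2 left, 3 right). The function greedy_policy and the small hand-made table are hypothetical stand-ins; in practice you would pass agent.q_table instead.

```python
import numpy as np

ARROWS = ["^", "v", "<", ">"]  # index = action: 0 up, 1 down, 2 left, 3 right

def greedy_policy(q_table):
    """Return one string per grid row showing the greedy action in each cell."""
    best = np.argmax(q_table, axis=2)  # best action index for every (row, column)
    return ["".join(ARROWS[a] for a in row) for row in best]

# Hand-made 2x2 example, not the result of training
demo = np.zeros((2, 2, 4))
demo[0, 0, 3] = 0.9  # top-left: right is best
demo[0, 1, 1] = 1.0  # top-right: down is best
demo[1, 0, 3] = 1.0  # bottom-left: right is best

policy_rows = greedy_policy(demo)
for line in policy_rows:
    print(line)
```

Cells whose Q-values are all equal, such as the goal cell, show ^ because np.argmax returns the first index on ties, so that arrow carries no information there.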