Fit Q Evaluation in offline reinforcement learning
Summary
I am working on a PyTorch implementation of Implicit Q-Learning (IQL) (paper), given a dataset $\mathcal D = \left\{ (\mathbf s_i, \mathbf a_i, \mathbf s_i', r_i ) \right\}$ of transitions. I think I have implemented IQL correctly, so I now have a learned policy $\pi$ that takes an element of the state space $\mathcal S$ and outputs a mean vector and diagonal covariance entries for a multivariate normal distribution over the action space $\mathcal A = \mathbb R^d$. I would like to evaluate this learned policy $\pi$.

This leads me to off-policy evaluation, specifically Fit Q Evaluation (FQE), given a test dataset of transitions $\mathcal D_e$ as above. I have been reading the offline reinforcement learning paper by Prudencio et al. (link).

My problem is that I do not understand how to implement FQE as described in the paper, so I would appreciate guidance. Below I quote the relevant part of the paper and ask some questions.

> **C. Fit Q Evaluation**
>
> In fit Q evaluation (FQE), we first train a $Q$-function $Q_{\phi}^{\pi}$ by minimizing the Bellman error under the policy $\pi$. Then, we evaluate the policy by computing the average expected return over the states and actions from $\mathcal D_e$, such that
> \begin{align*} \hat J(\pi) = \mathbb E_{\mathbf s, \mathbf a \sim \mathcal D_e} \left[ Q_{\phi}^{\pi}(\mathbf s, \mathbf a) \right]. \end{align*}

Questions:

1. Presumably I should use a neural network for $Q_{\phi}^{\pi}$, right?
2. What is the Bellman error in this context? My course notes give a definition of the Bellman error involving $\max_{\mathbf a'}$, which is not workable here since my action space $\mathcal A$ is $\mathbb R^d$ for some positive integer $d$.

I have also looked at the off-policy evaluation paper by Voloshin et al. (link), which seems to describe FQE in a different way.
Specifically, $\hat Q(\cdot\,; \theta) = \lim_{k \to \infty} \hat Q_k$, where
\begin{align*}
\hat Q_k &= \min_{\theta} \frac{1}{N} \sum_{i = 1}^N \sum_{t = 0}^{\tilde T} \left( \hat Q_{k - 1}(x_t^i, a_t^i ; \theta) - y_t^i \right)^2, \\
y_t^i &\equiv r_t^i + \gamma\, \mathbb E_{\pi_e} \hat Q_{k - 1}(x_{t + 1}^i, \cdot\,; \theta).
\end{align*}

If I understand this correctly, we fit a sequence of neural-network $Q$-functions $\hat Q_1, \hat Q_2, \dots$, at each step minimizing the trajectory-averaged squared distance between $\hat Q_{k - 1}(x_t^i, a_t^i ; \theta)$ and $y_t^i$. But how do I calculate $y_t^i$ here? I don't know what $\mathbb E_{\pi_e} \hat Q_{k - 1}(x_{t + 1}^i, \cdot\,; \theta)$ means.

I appreciate any help.
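To make the question concrete, here is my own reading of the target computation as a sketch (this is my guess, not code from either paper): since the action space is continuous, I imagine the expectation $\mathbb E_{\pi_e} \hat Q_{k-1}(x_{t+1}^i, \cdot)$ would be approximated by Monte Carlo, sampling actions from $\pi$ at the next state. The names `policy` and `q_prev` are placeholders for my Gaussian policy and the previous critic iterate.

```python
# Hypothetical sketch of the FQE regression target
#   y = r + gamma * E_{a' ~ pi(.|s')}[ Q_{k-1}(s', a') ],
# with the expectation over the continuous action space estimated
# by averaging Q over actions sampled from the policy.
import torch


def fqe_target(q_prev, policy, r, s_next, gamma=0.99, n_samples=10):
    # No gradients flow through the target, as in standard fitted-Q methods.
    with torch.no_grad():
        q_vals = torch.stack(
            [q_prev(s_next, policy(s_next).sample()) for _ in range(n_samples)]
        )
        # Average over the n_samples sampled actions (dim 0).
        return r + gamma * q_vals.mean(dim=0)


# Toy stand-ins just to check shapes: a fixed Gaussian policy over a
# 2-dimensional action space and a trivial critic.
def policy(s):
    return torch.distributions.Normal(torch.zeros(s.shape[0], 2), 1.0)


def q_prev(s, a):
    return s.sum(dim=1) + a.sum(dim=1)  # shape (batch,)


r = torch.ones(5)          # rewards for a batch of 5 transitions
s_next = torch.zeros(5, 3) # next states, 3-dimensional
y = fqe_target(q_prev, policy, r, s_next)
print(y.shape)  # torch.Size([5])
```

Is this roughly the right way to interpret the expectation, or does the paper intend something else (e.g. using the policy mean rather than samples)?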