Amount of reward
In the two-choice task, the reward is given as all-or-none; however, its intensity depends on the agent’s own feeling as
$$R = 0 \ \text{or} \ \ln\frac{P_o}{1 - P_o},$$
(9)
where Po represents the desired probability of the reward, that is, how much the agent wants the reward, and controls the reward intensity felt by the agent (Supplementary Fig. 8). In Friston’s formulation, the presence and absence of rewards are given by log preferences in natural units, ln Po and ln(1 − Po), which take negative values22,23. In this equation, the reward is the difference between the log preferences, which ensures that the reward intensity is positive.
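As a minimal illustration of equation (9), the following Python sketch computes the reward intensity implied by a given desired probability; the value Po = 0.7 and the function name are illustrative assumptions, not settings taken from this study.

```python
import numpy as np

def reward_intensity(o, P_o=0.7):
    """Reward felt by the agent: 0 if absent (o=0), ln(P_o/(1-P_o)) if present (o=1)."""
    return o * np.log(P_o / (1.0 - P_o))

print(reward_intensity(1), reward_intensity(0))  # e.g. 0.847..., 0.0
```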
State space model for reward probability recognition
The agent assumed that the reward is generated probabilistically from the latent cause w, and probabilities λi are represented by
$$\lambda_i = f\left(w_i\right),$$
(10)
where i indexes the options and \(f\left(x\right) = 1/\left(1 + \mathrm{e}^{-x}\right)\). In addition, the agent assumed an ambiguous environment in which the reward probabilities fluctuate according to a random walk as
$$w_{i,t} = w_{i,t-1} + \sigma_w\xi_{i,t},$$
(11)
where t, \(\xi _{i,t}\), and \(\sigma _w\) denote the trial of the two-choice task, standard Gaussian noise and the noise intensity, respectively. Thus, the agent assumes an environment expressed by an SSM as
$$P\left(\mathbf{w}_t|\mathbf{w}_{t-1}\right) = \mathcal{N}\left(\mathbf{w}_t|\mathbf{w}_{t-1},\sigma_w^2\mathbf{I}\right),$$
(12)
$$P\left(o_t|\mathbf{w}_t,\mathbf{a}_t\right) = \prod_i \left[f\left(w_{i,t}\right)^{o_t}\left\{1 - f\left(w_{i,t}\right)\right\}^{1-o_t}\right]^{a_{i,t}},$$
(13)
where wt and ot denote the latent variables controlling the reward probabilities of both options at step t (\(\mathbf{w}_t = \left(w_{1,t},w_{2,t}\right)^{\mathrm{T}}\)) and the observation of the presence of the reward (\(o_t \in \left\{0,1\right\}\)), respectively. Meanwhile, at denotes the agent’s action at step t, which is represented by a one-hot vector (\(\mathbf{a}_t \in \left\{\left(1,0\right)^{\mathrm{T}},\left(0,1\right)^{\mathrm{T}}\right\}\)), \(\mathcal{N}\left(\mathbf{x}|\boldsymbol{\mu},\varSigma\right)\) denotes the Gaussian distribution with mean μ and variance Σ, \(\sigma_w^2\) denotes the variance of the transition probability of w, I denotes the identity matrix, and \(f(w_{i,t}) = 1/(1 + \mathrm{e}^{-w_{i,t}})\) represents the reward probability of option i at step t. The initial distribution of w1 is given by \(P\left(\mathbf{w}_1\right) = \mathcal{N}\left(\mathbf{w}_1|\mathbf{0},\kappa\mathbf{I}\right)\), where κ denotes the variance.
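For concreteness, the following Python sketch simulates the generative environment assumed in equations (10)–(13). The random action policy and all numerical values (number of trials, σw, κ) are illustrative assumptions rather than settings used in this study.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def simulate_environment(n_trials=200, sigma_w=0.1, kappa=1.0):
    """Simulate latent causes w (eq. 11), reward probabilities f(w) (eq. 10) and rewards o (eq. 13)."""
    w = rng.normal(0.0, np.sqrt(kappa), size=2)      # w_1 ~ N(0, kappa I)
    ws, rewards, actions = [], [], []
    for _ in range(n_trials):
        w = w + sigma_w * rng.standard_normal(2)     # random-walk transition (eq. 11)
        a = rng.integers(2)                          # placeholder action: random choice
        o = rng.binomial(1, sigmoid(w)[a])           # Bernoulli reward for the chosen option
        ws.append(w.copy()); rewards.append(o); actions.append(a)
    return np.array(ws), np.array(rewards), np.array(actions)

w_traj, o_seq, a_seq = simulate_environment()
```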
FEP for reward probability recognition
We modeled the agent’s recognition process of reward probability using sequential Bayesian updating as
$$P\left(\mathbf{w}_t|o_{1:t},\mathbf{a}_{1:t}\right) \propto P\left(o_t|\mathbf{w}_t,\mathbf{a}_t\right)\int P\left(\mathbf{w}_t|\mathbf{w}_{t-1}\right)P\left(\mathbf{w}_{t-1}|o_{1:t-1},\mathbf{a}_{1:t-1}\right)\mathrm{d}\mathbf{w}_{t-1}.$$
(14)
Because of the non-Gaussian P(ot|wt, at), the posterior distribution of wt, \(P\left( {{{{\mathbf{w}}}}_t|o_{1:t},{{{\mathbf{a}}}}_{1:t}} \right)\), becomes non-Gaussian and cannot be calculated analytically. To avoid this problem, we introduced a simple posterior distribution approximated by a Gaussian distribution:
$$Q\left(\mathbf{w}_t|\varphi_t\right) = \mathcal{N}\left(\mathbf{w}_t|\boldsymbol{\mu}_t,\boldsymbol{\Lambda}_t^{-1}\right) \fallingdotseq P\left(\mathbf{w}_t|o_{1:t},\mathbf{a}_{1:t}\right),$$
(15)
where φt = {μt, Λt}, and μt and Λt denote the mean and precision, respectively (\(\boldsymbol{\mu}_t = \left(\mu_{1,t},\mu_{2,t}\right)^{\mathrm{T}}\); \(\boldsymbol{\Lambda}_t = \mathrm{diag}\left(p_{1,t},p_{2,t}\right)\)). \(Q\left(\mathbf{w}_t|\varphi_t\right)\) denotes the recognition distribution. The model agent aims to update the recognition distribution through φt at each time step by minimizing the surprise, which is defined by \(-\ln P\left(o_t|o_{1:t-1}\right)\). The surprise can be decomposed as follows:
$$\begin{aligned} -\ln P\left(o_t|o_{1:t-1}\right) &= \int Q\left(\mathbf{w}_t|\varphi_t\right)\ln\frac{Q\left(\mathbf{w}_t|\varphi_t\right)}{P\left(o_t,\mathbf{w}_t|o_{1:t-1},\mathbf{a}_{1:t}\right)}\mathrm{d}\mathbf{w}_t \\ &\quad - \mathrm{KL}\left[Q\left(\mathbf{w}_t|\varphi_t\right)\|P\left(\mathbf{w}_t|o_{1:t},\mathbf{a}_{1:t}\right)\right], \end{aligned}$$
(16)
where \(\mathrm{KL}\left[q\left(\mathbf{x}\right)\|p\left(\mathbf{x}\right)\right]\) denotes the Kullback–Leibler (KL) divergence between the probability distributions q(x) and p(x). Because of the nonnegativity of the KL divergence, the first term is an upper bound of the surprise:
$$F\left(o_t,\varphi_t\right) = \int Q\left(\mathbf{w}_t|\varphi_t\right)\ln Q\left(\mathbf{w}_t|\varphi_t\right)\mathrm{d}\mathbf{w}_t + \int Q\left(\mathbf{w}_t|\varphi_t\right)J\left(o_t,\mathbf{w}_t\right)\mathrm{d}\mathbf{w}_t,$$
(17)
which is called the free energy, where \(J\left(o_t,\mathbf{w}_t\right) = -\ln P\left(o_t,\mathbf{w}_t|o_{1:t-1},\mathbf{a}_{1:t}\right)\). The first term of the free energy corresponds to the negative entropy of a Gaussian distribution:
$$F_1 = \int Q\left(\mathbf{w}_t|\varphi_t\right)\ln Q\left(\mathbf{w}_t|\varphi_t\right)\mathrm{d}\mathbf{w}_t.$$
(18)
The second term is approximated as
$$\begin{aligned} F_2 &= \int Q\left(\mathbf{w}_t|\varphi_t\right)J\left(o_t,\mathbf{w}_t\right)\mathrm{d}\mathbf{w}_t \\ &\cong \int Q\left(\mathbf{w}_t|\varphi_t\right)\left\{J\left(o_t,\boldsymbol{\mu}_t\right) + \frac{\mathrm{d}J}{\mathrm{d}\mathbf{w}}\left(\mathbf{w}_t - \boldsymbol{\mu}_t\right) + \frac{1}{2}\frac{\mathrm{d}^2J}{\mathrm{d}\mathbf{w}^2}\left(\mathbf{w}_t - \boldsymbol{\mu}_t\right)^2\right\}\mathrm{d}\mathbf{w}_t \\ &= \left.J\left(o_t,\mathbf{w}_t\right)\right|_{\mathbf{w}_t = \boldsymbol{\mu}_t} + \frac{1}{2}\left.\frac{\mathrm{d}^2J}{\mathrm{d}\mathbf{w}^2}\right|_{\mathbf{w}_t = \boldsymbol{\mu}_t}\boldsymbol{\Lambda}_t^{-1}. \end{aligned}$$
(19)
Note that \(J\left(o_t,\mathbf{w}_t\right)\) is expanded by a second-order Taylor series around \(\boldsymbol{\mu}_t\). At each time step, the agent updates \(\varphi_t\) by minimizing \(F\left(o_t,\varphi_t\right)\).
Calculation of free energy
The free energy is derived as follows: F1 simply becomes
$$F_1 = \frac{1}{2}\ln 2\pi p_{1,t}^{-1} + \frac{1}{2}\ln 2\pi p_{2,t}^{-1} + \text{const.}$$
(20)
For computing F2,
$$\begin{aligned} P\left(o_t,\mathbf{w}_t|o_{1:t-1},\mathbf{a}_{1:t}\right) &= P\left(o_t|\mathbf{w}_t,\mathbf{a}_t\right)\int P\left(\mathbf{w}_t|\mathbf{w}_{t-1}\right)P\left(\mathbf{w}_{t-1}|o_{1:t-1},\mathbf{a}_{1:t-1}\right)\mathrm{d}\mathbf{w}_{t-1} \\ &\cong P\left(o_t|\mathbf{w}_t,\mathbf{a}_t\right)\int P\left(\mathbf{w}_t|\mathbf{w}_{t-1}\right)\mathcal{N}\left(\mathbf{w}_{t-1}|\boldsymbol{\mu}_{t-1},\boldsymbol{\Lambda}_{t-1}^{-1}\right)\mathrm{d}\mathbf{w}_{t-1}. \end{aligned}$$
(21)
In the second line of this equation, the previous posterior \(P\left(\mathbf{w}_{t-1}|o_{1:t-1},\mathbf{a}_{1:t-1}\right)\) is replaced with the approximated recognition distribution. This equation can be written as
$$P\left(o_t,\mathbf{w}_t|o_{1:t-1},\mathbf{a}_{1:t}\right) \cong \prod_i \left[f\left(w_{i,t}\right)^{o_t}\left\{1 - f\left(w_{i,t}\right)\right\}^{1-o_t}\right]^{a_{i,t}}\mathcal{N}\left(w_{i,t}|\mu_{i,t-1},p_{i,t-1}^{-1} + \sigma_w^2\right).$$
(22)
Then
$$\left.J\left(o_t,\mathbf{w}_t\right)\right|_{\mathbf{w}_t = \boldsymbol{\mu}_t} = J_1\left(o_t,\mu_{1,t}\right) + J_2\left(o_t,\mu_{2,t}\right) + \text{const.},$$
(23)
where
$$\begin{aligned} J_i\left(o_t,\mu_{i,t}\right) &= a_{i,t}\left[o_t\ln f\left(\mu_{i,t}\right) + \left(1 - o_t\right)\ln\left\{1 - f\left(\mu_{i,t}\right)\right\}\right] \\ &\quad - \frac{1}{2}\frac{\left(\mu_{i,t} - \mu_{i,t-1}\right)^2}{p_{i,t-1}^{-1} + \sigma_w^2} - \frac{1}{2}\ln\left(p_{i,t-1}^{-1} + \sigma_w^2\right). \end{aligned}$$
(24)
Thus, F2 is calculated by substituting this equation into equation (19). Taken together,
$$F\left(o_t,\varphi_t\right) = \sum_i\left\{J_i\left(o_t,\mu_{i,t}\right) + \frac{1}{2}\left.\frac{\mathrm{d}^2J_i}{\mathrm{d}w_{i,t}^2}\right|_{w_{i,t} = \mu_{i,t}}p_{i,t}^{-1} + \frac{1}{2}\ln 2\pi p_{i,t}^{-1}\right\}.$$
(25)
Sequential updating of the agent’s recognition
The updating rule for φt was derived by minimizing the free energy. The optimized pi,t can be computed by \(\partial F/\partial p_{i,t}^{ – 1} = 0\), which leads to
$$p_{i,t} = \left.\frac{\mathrm{d}^2J_i}{\mathrm{d}w_{i,t}^2}\right|_{w_{i,t} = \mu_{i,t}}.$$
(26)
By substituting pi,t into equation (25), the second term in the summation becomes constant, irrespective of μi,t. Thus, μi,t is updated by minimizing only the first term as
$$\mu_{i,t} = \mu_{i,t-1} - \alpha\delta_{i,a}\left.\frac{\partial J_i}{\partial\mu_{i,t}}\right|_{\mu_{i,t} = \mu_{i,t-1}},$$
(27)
where α is the learning rate. These two equations finally lead to
$$\mu_{i,t} = \mu_{i,t-1} + \alpha K_{i,t}\left(o_t - f\left(\mu_{i,t-1}\right)\right),$$
(28)
$$p_{i,t} = K_{i,t}^{-1} + f\left(\mu_{i,t}\right)\left(1 - f\left(\mu_{i,t}\right)\right),$$
(29)
where \(K_{i,t} = \left(p_{i,t-1} + \sigma_w^{-2}\right)/\left(p_{i,t-1}\sigma_w^{-2}\right)\), which is called the Kalman gain. If option i was not selected, the second terms in both equations vanish, so that the belief \(\mu_{i,t}\) stays the same while its precision decreases (that is, \(p_{i,t} < p_{i,t-1}\)). If it was selected, the belief is updated by the prediction error (that is, \(o_t - f\left(\mu_{i,t-1}\right)\)), and its precision is improved. The confidence of the recognized reward probability should be evaluated not in \(w_{i,t}\) space but in \(\lambda_{i,t}\) space; hence, the confidence is defined by \(\gamma_{i,t} = p_{i,t}/f^{\prime}\left(\mu_{i,t}\right)^2\).
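The following Python sketch implements the update rules of equations (28) and (29) and the confidence γ for a single trial. Masking the updates with the one-hot action vector reflects the statement above that the second terms vanish for the unselected option; the parameter values are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def update_beliefs(mu, p, a, o, alpha=0.05, sigma_w=0.3):
    """mu, p: means and precisions of both options; a: one-hot action; o: reward (0 or 1)."""
    K = 1.0 / p + sigma_w ** 2                        # Kalman gain, K = p^{-1} + sigma_w^2
    mu_new = mu + alpha * K * a * (o - sigmoid(mu))   # eq. (28): prediction-error update for the chosen option
    f = sigmoid(mu_new)
    p_new = 1.0 / K + a * f * (1.0 - f)               # eq. (29): unchosen option keeps only the decayed precision
    gamma = p_new / (f * (1.0 - f)) ** 2              # confidence evaluated in lambda space
    return mu_new, p_new, gamma

mu, p, gamma = update_beliefs(np.zeros(2), np.ones(2), a=np.array([1, 0]), o=1)
```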
Expected net utility
The expected net utility is described by
$$\begin{aligned} U_t\left(\mathbf{a}_{t+1}\right) &= c_t \cdot E_{P\left(o_{t+1}|\mathbf{a}_{t+1}\right)}\left[\mathrm{KL}\left[Q\left(\mathbf{w}_{t+1}|o_{t+1},\mathbf{a}_{t+1}\right)\|Q\left(\mathbf{w}_{t+1}|\mathbf{a}_{t+1}\right)\right]\right] \\ &\quad + E_{P\left(o_{t+1}|\mathbf{a}_{t+1}\right)}\left[R\left(o_{t+1}\right)\right], \end{aligned}$$
(30)
where \(R\left(o_{t+1}\right) = o_{t+1}\ln\left(P_o/\left(1 - P_o\right)\right)\); the first and second terms represent the expected information gain and the expected reward, respectively; and ct denotes the intensity of curiosity at time t. The heuristic idea of introducing the curiosity meta-parameter ct was also proposed in the RL field30.
We briefly show how to derive this quantity based on ref. 41. The free energy at the current time t is described by
$$F\left(o_t,\varphi_t\right) = E_{Q\left(\mathbf{w}_t|\varphi_t\right)}\left[\ln Q\left(\mathbf{w}_t|\varphi_t\right) - \ln P\left(o_t,\mathbf{w}_t|o_{1:t-1},\mathbf{a}_{1:t}\right)\right],$$
(31)
which is a rewriting of equation (17). Here, we attempt to express the future free energy at time t + 1, conditioned on action at+1, as
$$F\left(o_{t+1},\mathbf{a}_{t+1}\right) = E_{Q\left(\mathbf{w}_{t+1}|\varphi_t\right)}\left[\ln Q\left(\mathbf{w}_{t+1}|\varphi_t\right) - \ln P\left(o_{t+1},\mathbf{w}_{t+1}|o_{1:t},\mathbf{a}_{1:t+1}\right)\right].$$
(32)
However, there is an apparent problem in this equation: ot+1 has not yet been observed because it is a future observation. To resolve this issue, we take an expectation with respect to ot+1 as
$$F\left(\mathbf{a}_{t+1}\right) = E_{Q\left(o_{t+1}|\mathbf{w}_{t+1},\mathbf{a}_{t+1}\right)Q\left(\mathbf{w}_{t+1}|\varphi_t\right)}\left[\ln Q\left(\mathbf{w}_{t+1}|\varphi_t\right) - \ln P\left(o_{t+1},\mathbf{w}_{t+1}|o_{1:t},\mathbf{a}_{1:t+1}\right)\right].$$
(33)
This can be rewritten as
$$\begin{aligned} F\left(\mathbf{a}_{t+1}\right) = E_{Q\left(o_{t+1},\mathbf{w}_{t+1}|o_{1:t},\mathbf{a}_{t+1}\right)}\big[&\ln Q\left(\mathbf{w}_{t+1}|\varphi_t\right) - \ln P\left(\mathbf{w}_{t+1}|o_{1:t+1},\mathbf{a}_{1:t+1}\right) \\ &- \ln P\left(o_{t+1}|o_{1:t},\mathbf{a}_{1:t+1}\right)\big]. \end{aligned}$$
(34)
Although \(P\left( {o_{t + 1}|o_{1:t},{{{\mathbf{a}}}}_{1:t + 1}} \right)\) can be computed by the generative model as
$$\begin{aligned} P\left(o_{t+1}|o_{1:t},\mathbf{a}_{1:t+1}\right) &= \int P\left(o_{t+1}|\mathbf{w}_{t+1},\mathbf{a}_{1:t+1}\right)P\left(\mathbf{w}_{t+1}|o_{1:t}\right)\mathrm{d}\mathbf{w}_{t+1} \\ &\cong \int P\left(o_{t+1}|\mathbf{w}_{t+1},\mathbf{a}_{1:t+1}\right)Q\left(\mathbf{w}_{t+1}|\varphi_t\right)\mathrm{d}\mathbf{w}_{t+1}, \end{aligned}$$
(35)
it is heuristically replaced with \(P\left(o_{t+1}\right)\), the agent’s prior preference for observing \(o_{t+1}\), so that \(\ln P\left(o_{t+1}\right)\) can be treated as an instrumental reward. Equation (34) is then further transformed into
$$\begin{aligned} F\left(\mathbf{a}_{t+1}\right) = E_{Q\left(o_{t+1}|\mathbf{a}_{t+1}\right)Q\left(\mathbf{w}_{t+1}|o_{1:t+1},\mathbf{a}_{t+1}\right)}\big[&\ln Q\left(\mathbf{w}_{t+1}|\varphi_t\right) \\ &- \ln Q\left(\mathbf{w}_{t+1}|o_{1:t+1},\mathbf{a}_{1:t+1}\right) - \ln P\left(o_{t+1}\right)\big]. \end{aligned}$$
(36)
Finally, we obtained the so-called expected free energy as
$$F\left(\mathbf{a}_{t+1}\right) = E_{Q\left(o_{t+1}|\mathbf{a}_{t+1}\right)}\left[-\mathrm{KL}\left[Q\left(\mathbf{w}_{t+1}|o_{t+1},\mathbf{a}_{t+1}\right)\|Q\left(\mathbf{w}_{t+1}\right)\right] - \ln P\left(o_{t+1}\right)\right],$$
(37)
where the KL divergence is called the Bayesian surprise, representing the extent to which the agent’s beliefs are updated by the observation, whereas the second term \(\ln P\left(o_{t+1}\right)\) can be regarded as the agent’s prior preference for observing \(o_{t+1}\), which can be interpreted as an instrumental reward. Therefore, the expected free energy is derived in an ad hoc manner.
In the first term of equation (30), the posterior and prior distributions of \({{{\mathbf{w}}}}_{t + 1}\) in the KL divergence are derived as
$$Q\left(\mathbf{w}_{t+1}\right) = \int P\left(\mathbf{w}_{t+1}|\mathbf{w}_t\right)Q\left(\mathbf{w}_t\right)\mathrm{d}\mathbf{w}_t,$$
(38)
$$Q\left(\mathbf{w}_{t+1}|o_{t+1},\mathbf{a}_{t+1}\right) = \frac{P\left(o_{t+1}|\mathbf{w}_{t+1},\mathbf{a}_{t+1}\right)Q\left(\mathbf{w}_{t+1}|\mathbf{a}_{t+1}\right)}{P\left(o_{t+1}|\mathbf{a}_{t+1}\right)},$$
(39)
respectively. Because \(o_{t+1}\) is a future observation, the expectation of the first term is taken with respect to \(P\left(o_{t+1}|\mathbf{a}_{t+1}\right)\), which can be calculated as
$$P\left(o_{t+1}|\mathbf{a}_{t+1}\right) = \int P\left(o_{t+1}|\mathbf{w}_{t+1},\mathbf{a}_{t+1}\right)Q\left(\mathbf{w}_{t+1}|\mathbf{a}_{t+1}\right)\mathrm{d}\mathbf{w}_{t+1}.$$
(40)
In the second term, the reward is quantitatively interpreted as the desired probability of \(o_{t + 1}\). For the two-choice task, we use
$$P\left(o_{t+1}\right) = P_o^{o_{t+1}}\left(1 - P_o\right)^{1-o_{t+1}},$$
(41)
where Po indicates the desired probability of the presence of a reward. According to the probabilistic interpretation of reward22,23, the presence and absence of a reward can be evaluated by ln Po and ln(1 − Po), respectively.
In this study, we flipped the sign of the expected free energy to define the expected net utility, because we modeled decision-making as the maximization of utility rather than the minimization of the expected free energy. We ventured to introduce the curiosity meta-parameter ct to express irrational decision-making. Because rewards are relative, we assigned \(\ln\left\{P_o/\left(1 - P_o\right)\right\}\) and 0 to the presence and absence of a reward, respectively. Thus, our expected net utility is described by equation (30); note that there is consequently a gap between the original expected free energy and our expected net utility.
Model for action selection
The agent probabilistically selects a choice with higher expected net utility as
$$P\left(\mathbf{a}_{t+1}\right) = \frac{\exp\left(\beta U\left(\mathbf{a}_{t+1}\right)\right)}{\sum_{\mathbf{a}}\exp\left(\beta U\left(\mathbf{a}\right)\right)},$$
(42)
where \(U\left( {{{{\mathbf{a}}}}_{t + 1}} \right)\) indicates the expected net utility of action at+1. Equation (42) is equivalent to equation (2). To derive equation (42), we considered the expectation of the expected net utility with respect to probabilistic action as
$$E\left[U\right] = E_{Q\left(\mathbf{a}_{t+1}\right)}\left[U\left(\mathbf{a}_{t+1}\right) - \beta^{-1}\ln Q\left(\mathbf{a}_{t+1}\right)\right],$$
(43)
where β indicates an inverse temperature, and the entropic constraint of action probability is introduced in the second term. This equation can be rewritten as
$$E\left[U\right] = -\beta^{-1}\mathrm{KL}\left[Q\left(\mathbf{a}_{t+1}\right)\|\exp\left(\beta U\left(\mathbf{a}_{t+1}\right)\right)/Z\right] + \beta^{-1}\ln Z,$$
(44)
where Z indicates a normalization constant. Thus, its maximization with respect to \(Q\left(\mathbf{a}_{t+1}\right)\) leads to the optimal action probability shown in equation (42).
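A minimal Python sketch of the softmax action-selection rule in equation (42) is given below; the inverse temperature and the utility values are arbitrary examples.

```python
import numpy as np

def action_probabilities(U, beta=2.0):
    """Softmax over expected net utilities U (eq. 42); subtracting the max improves numerical stability."""
    x = beta * (U - U.max())
    return np.exp(x) / np.exp(x).sum()

def select_action(U, beta=2.0, rng=np.random.default_rng(0)):
    probs = action_probabilities(U, beta)
    return rng.choice(len(U), p=probs)   # index of the selected option

probs = action_probabilities(np.array([0.4, 0.1]))
```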
Alternative expected net utility
For comparison, we consider an alternative expected net utility by introducing a time-dependent meta-parameter in the second term as follows:
$$\begin{aligned} U\left(\mathbf{a}_{t+1}\right) &= E_{P\left(o_{t+1}|\mathbf{a}_{t+1}\right)}\left[\mathrm{KL}\left[Q\left(\mathbf{w}_{t+1}|o_{t+1},\mathbf{a}_{t+1}\right)\|Q\left(\mathbf{w}_{t+1}|\mathbf{a}_{t+1}\right)\right]\right] \\ &\quad + d_t \cdot E_{P\left(o_{t+1}|\mathbf{a}_{t+1}\right)}\left[R\left(o_{t+1}\right)\right], \end{aligned}$$
(45)
where dt denotes the subjective intensity of reward at time t. In this case, an agent with a high dt will show more exploitative behavior, whereas an agent with dt = 0 shows explorative behavior driven only by the expected information gain.
Calculation of expected net utility
Here, we present the calculation of the expected net utility. The KL divergence in the first term of equation (30) can be transformed into
$$E_{P\left(o_{t+1}|\mathbf{a}_{t+1}\right)}\left[\mathrm{KL}\left[Q\left(\mathbf{w}_{t+1}|o_{t+1},\mathbf{a}_{t+1}\right)\|Q\left(\mathbf{w}_{t+1}|\mathbf{a}_{t+1}\right)\right]\right] = H\left(o_{t+1}\right) - H\left(o_{t+1}|\mathbf{w}_{t+1}\right),$$
(46)
where the first and second terms represent the marginal and conditional entropies, respectively:
$$H\left(o_{t+1}|\mathbf{w}_{t+1}\right) = E_{P\left(o_{t+1}|\mathbf{w}_{t+1},\mathbf{a}_{t+1}\right)Q\left(\mathbf{w}_{t+1}|\mathbf{a}_{t+1}\right)}\left[-\ln P\left(o_{t+1}|\mathbf{w}_{t+1},\mathbf{a}_{t+1}\right)\right],$$
(47)
$$H\left(o_{t+1}\right) = E_{P\left(o_{t+1}|\mathbf{a}_{t+1}\right)}\left[-\ln P\left(o_{t+1}|\mathbf{a}_{t+1}\right)\right].$$
(48)
The conditional entropy \({{{\mathrm{H}}}}\left( {o_{t + 1}|{{{\mathbf{w}}}}_{t + 1}} \right)\) can be calculated by substituting equation (13) into equation (47) as
$$H\left(o_{t+1}|\mathbf{w}_{t+1}\right) = -E_{Q\left(\mathbf{w}_{t+1}|\mathbf{a}_{t+1}\right)}\left[\sum_i a_{i,t+1}\,g\left(w_{i,t+1}\right)\right],$$
(49)
where
$$g\left(w\right) = f\left(w\right)\ln f\left(w\right) + \left(1 - f\left(w\right)\right)\ln\left(1 - f\left(w\right)\right).$$
(50)
Here, we calculate this expectation approximately by using a second-order Taylor expansion as
$$H\left(o_{t+1}|\mathbf{w}_{t+1}\right) \cong -E_{Q\left(\mathbf{w}_{t+1}|\mathbf{a}_{t+1}\right)}\left[\sum_i a_{i,t+1}\left\{g\left(\mu_{i,t+1}\right) + \frac{\partial g}{\partial w_{i,t+1}}\left(w_{i,t+1} - \mu_{i,t+1}\right) + \frac{1}{2}\frac{\partial^2 g}{\partial w_{i,t+1}^2}\left(w_{i,t+1} - \mu_{i,t+1}\right)^2\right\}\right],$$
(51)
which leads to
$$\begin{aligned} H\left(o_{t+1}|\mathbf{w}_{t+1}\right) = -\sum_i a_{i,t+1}\Big[&f\left(\mu_{i,t+1}\right)\ln f\left(\mu_{i,t+1}\right) + \left(1 - f\left(\mu_{i,t+1}\right)\right)\ln\left(1 - f\left(\mu_{i,t+1}\right)\right) \\ &+ \frac{1}{2}f\left(\mu_{i,t+1}\right)\left(1 - f\left(\mu_{i,t+1}\right)\right)\left(1 + \left(1 - 2f\left(\mu_{i,t+1}\right)\right)\ln\frac{f\left(\mu_{i,t+1}\right)}{1 - f\left(\mu_{i,t+1}\right)}\right)\left(p_{i,t}^{-1} + p_w^{-1}\right)\Big]. \end{aligned}$$
(52)
The marginal entropy \({{{\mathrm{H}}}}\left( {o_{t + 1}} \right)\) can be calculated as
$$\begin{aligned} H\left(o_{t+1}\right) = -\sum_i a_{i,t+1}\big\{&P\left(o_{t+1} = 0|\mathbf{a}_{t+1}\right)\ln P\left(o_{t+1} = 0|\mathbf{a}_{t+1}\right) \\ &+ P\left(o_{t+1} = 1|\mathbf{a}_{t+1}\right)\ln P\left(o_{t+1} = 1|\mathbf{a}_{t+1}\right)\big\}, \end{aligned}$$
(53)
where
$$\begin{aligned} P\left(o_{t+1}|\mathbf{a}_{t+1}\right) &= \int P\left(o_{t+1}|\mathbf{w}_{t+1},\mathbf{a}_{t+1}\right)Q\left(\mathbf{w}_{t+1}|\mathbf{a}_{t+1}\right)\mathrm{d}\mathbf{w}_{t+1} \\ &= \int \prod_i \left\{f\left(w_{i,t+1}\right)^{o_{t+1}}\left(1 - f\left(w_{i,t+1}\right)\right)^{1-o_{t+1}}\right\}^{a_{i,t+1}}Q\left(\mathbf{w}_{t+1}|\mathbf{a}_{t+1}\right)\mathrm{d}\mathbf{w}_{t+1} \\ &= \prod_i \Big[f\left(\mu_{i,t+1}\right)^{o_{t+1}}\left(1 - f\left(\mu_{i,t+1}\right)\right)^{1-o_{t+1}} \\ &\qquad + 1^{o_{t+1}}\left(-1\right)^{1-o_{t+1}}\frac{1}{2}f\left(\mu_{i,t+1}\right)\left\{1 - f\left(\mu_{i,t+1}\right)\right\}\left\{1 - 2f\left(\mu_{i,t+1}\right)\right\}\left(p_{i,t}^{-1} + p_w^{-1}\right)\Big]^{a_{i,t+1}}. \end{aligned}$$
(54)
The second term of the expected net utility (equation (36)) is calculated as
$$\begin{aligned} E_{P\left(o_{t+1}|\mathbf{a}_{t+1}\right)}\left[\ln P\left(o_{t+1}\right)\right] &= E_{P\left(o_{t+1}|\mathbf{a}_{t+1}\right)}\left[o_{t+1}\ln P_o + \left(1 - o_{t+1}\right)\ln\left(1 - P_o\right)\right] \\ &= P\left(o_{t+1} = 0|\mathbf{a}_{t+1}\right)\ln\left(1 - P_o\right) + P\left(o_{t+1} = 1|\mathbf{a}_{t+1}\right)\ln P_o. \end{aligned}$$
(55)
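As a numerical cross-check of equations (46)–(55), the following Python sketch evaluates the expected net utility of one option by Monte Carlo sampling over the Gaussian predictive belief, instead of the second-order Taylor approximation used above; all parameter values are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def expected_net_utility(mu_i, p_i, sigma_w=0.3, c=1.0, P_o=0.7,
                         n_samples=100_000, rng=np.random.default_rng(0)):
    """Expected net utility (eq. 30) of choosing option i, given its belief (mu_i, p_i)."""
    var = 1.0 / p_i + sigma_w ** 2                        # predictive variance of w_{i,t+1}
    f = sigmoid(rng.normal(mu_i, np.sqrt(var), n_samples))
    f = np.clip(f, 1e-12, 1.0 - 1e-12)                    # avoid log(0)
    p1 = f.mean()                                         # P(o_{t+1}=1 | a_{t+1})  (eq. 40)
    H_o = -(p1 * np.log(p1) + (1 - p1) * np.log(1 - p1))  # marginal entropy (eq. 48)
    H_o_w = -(f * np.log(f) + (1 - f) * np.log(1 - f)).mean()   # conditional entropy (eq. 47)
    info_gain = H_o - H_o_w                               # expected information gain (eq. 46)
    expected_reward = p1 * np.log(P_o / (1.0 - P_o))      # E[R(o_{t+1})] with R from eq. (9)
    return c * info_gain + expected_reward

U = np.array([expected_net_utility(0.0, 1.0), expected_net_utility(0.5, 2.0)])
```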
Observer-SSM
We constructed the observer-SSM, which describes the temporal transitions of the latent internal state z of the agent and the generation of its actions from the viewpoint of an observer of the agent. This is depicted graphically in Fig. 4. As prior information, we assumed that the agent acts based on its internal state, that is, the intensity of curiosity, the recognized reward probabilities and their confidence levels. The intensity of curiosity was assumed to change temporally as a random walk:
$$c_t = c_{t-1} + \epsilon\zeta_t,$$
(56)
where \(\zeta_t\) denotes white noise with zero mean and unit variance, and \(\epsilon\) denotes its noise intensity. The other internal states, that is, \(\mu_i\) and \(p_i\), were assumed to be updated according to equations (28) and (29). The transition of the internal state is expressed by the probability distribution
$$P\left(\mathbf{z}_t|\mathbf{z}_{t-1}\right) = \mathcal{N}\left(\mathbf{z}_t|\mathbf{F}\left(\mathbf{z}_{t-1},\mathbf{a}_{t-1}\right),\varGamma\right),$$
(57)
$$\mathbf{F}\left(\mathbf{z}_{t-1},\mathbf{a}_{t-1}\right) = \left[\begin{array}{c} c_{t-1} \\ h\left(\mu_{1,t-1},p_{1,t-1},o_t,a_1\right) \\ h\left(\mu_{2,t-1},p_{2,t-1},o_t,a_2\right) \\ k\left(\mu_{1,t-1},p_{1,t-1},a_1\right) \\ k\left(\mu_{2,t-1},p_{2,t-1},a_2\right) \end{array}\right],$$
(58)
where \({{{\mathbf{z}}}}_t = \left( {c_t,{{{\mathbf{\mu }}}}_t^{\mathrm{T}},{{{\mathbf{p}}}}_t^{\mathrm{T}}} \right)^{{{\mathrm{T}}}}\) and \({{{{{\varGamma}}}}} = {\it{\epsilon }}^2\text{diag}(1,0,0,0,0)\) . \(h\left( {\mu _{i,t – 1},p_{i,t – 1},o_t,a_i} \right)\) and \(k\left( {\mu _{i,t – 1},p_{i,t – 1},a_i} \right)\) represent the right-hand sides of equations (28) and (29), respectively; and \({{{{{\varGamma}}}}}\) and \(\text{diag}\left( {{{\mathbf{x}}}} \right)\) denote the variance–covariance matrix and square matrix whose diagonal component is \({{{\mathbf{x}}}}\), respectively. In addition, the agent was assumed to select an action \({{{\mathbf{a}}}}_{t + 1}\) based on the expected net utilities, as follows:
$$P\left(\mathbf{a}_{t+1}\right) = \frac{\exp\left(\beta U\left(\mathbf{a}_{t+1}\right)\right)}{\sum_{\mathbf{a}}\exp\left(\beta U\left(\mathbf{a}\right)\right)},$$
(59)
and the reward was obtained by the following probability distribution:
$$P\left(o_t|\mathbf{a}_t\right) = \prod_i \left\{\lambda_{i,t}^{o_t}\left(1 - \lambda_{i,t}\right)^{1-o_t}\right\}^{a_{i,t}}.$$
(60)
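The following Python sketch assembles the observer-SSM transition of equations (56)–(58), reusing the belief-update rule sketched after equation (29); variable names and parameter values are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def observer_transition(z, a, o, alpha=0.05, sigma_w=0.3, eps=1.0,
                        rng=np.random.default_rng(0)):
    """One step of z = (c, mu_1, mu_2, p_1, p_2); noise enters only through c (Gamma = eps^2 diag(1,0,0,0,0))."""
    c, mu, p = z[0], z[1:3], z[3:5]
    c_new = c + eps * rng.standard_normal()            # random walk on curiosity (eq. 56)
    K = 1.0 / p + sigma_w ** 2
    mu_new = mu + alpha * K * a * (o - sigmoid(mu))    # h(.): eq. (28)
    f = sigmoid(mu_new)
    p_new = 1.0 / K + a * f * (1.0 - f)                # k(.): eq. (29)
    return np.concatenate(([c_new], mu_new, p_new))

z_next = observer_transition(np.array([0.0, 0.0, 0.0, 1.0, 1.0]), a=np.array([1, 0]), o=1)
```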
Q-learning in the two-choice task and its observer-SSM
The decision-making in the two-choice task was also modeled by Q-learning. Reward prediction for the ith option Qi,t is updated as
$$Q_{i,t} = Q_{i,t-1} + \alpha_{t-1}\left(r_t a_{i,t-1} - Q_{i,t-1}\right),$$
(61)
where \(\alpha_t\) indicates the learning rate at trial t. The agent selected an action following a softmax function:
$$P\left(a_{i,t} = 1\right) = \frac{\exp\left(B_t Q_{i,t}\right)}{\sum_j\exp\left(B_t Q_{j,t}\right)},$$
(62)
where Bt indicates the inverse temperature at trial t, which controls the randomness of the action selection.
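A minimal Python sketch of the Q-learning comparison model (equations (61) and (62)) is shown below; the learning rate, inverse temperature and true reward probabilities are held constant and chosen arbitrarily for illustration, whereas the analysis below treats αt and Bt as time-varying.

```python
import numpy as np

def q_update(Q, a_onehot, r, alpha=0.1):
    """Eq. (61): update of the reward predictions for both options."""
    return Q + alpha * (r * a_onehot - Q)

def softmax_policy(Q, B=3.0):
    """Eq. (62): softmax action probabilities with inverse temperature B."""
    x = B * (Q - Q.max())
    return np.exp(x) / np.exp(x).sum()

rng = np.random.default_rng(0)
Q = np.zeros(2)
for t in range(100):
    a = rng.choice(2, p=softmax_policy(Q))
    r = rng.binomial(1, (0.7, 0.3)[a])       # illustrative true reward probabilities
    Q = q_update(Q, np.eye(2)[a], r)
```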
The time-dependent parameters αt and Bt in Q-learning were estimated from behavioral data37. These parameters were assumed to change temporally as a random walk:
$$\theta_t = \theta_{t-1} + \epsilon_\theta\zeta_{\theta,t},$$
(63)
where \(\theta \in \left\{ {\alpha ,B} \right\}\), \(\zeta _{\theta ,t}\) denotes white noise with zero mean and unit variance and \({\it{\epsilon }}_\theta\) denotes its noise intensity. Thus, the transition of the internal state is expressed by the probability distribution
$$P\left(\mathbf{z}_t|\mathbf{z}_{t-1}\right) = \mathcal{N}\left(\mathbf{z}_t|\mathbf{F}\left(\mathbf{z}_{t-1},\mathbf{a}_{t-1}\right),\varGamma\right),$$
(64)
$$\mathbf{F}\left(\mathbf{z}_{t-1},\mathbf{a}_{t-1}\right) = \left[\begin{array}{c} \alpha_{t-1} \\ B_{t-1} \\ h_1\left(\alpha_{t-1},Q_{1,t-1},\mathbf{a}_{t-1}\right) \\ h_2\left(\alpha_{t-1},Q_{2,t-1},\mathbf{a}_{t-1}\right) \end{array}\right],$$
(65)
where \(\mathbf{z}_t = \left(\alpha_t,B_t,Q_{1,t},Q_{2,t}\right)^{\mathrm{T}}\), \(\varGamma = \epsilon^2\,\mathrm{diag}\left(1,1,0,0\right)\), \(h_i\left(\alpha_{t-1},Q_{i,t-1},\mathbf{a}_{t-1}\right)\) represents the right-hand side of equation (61); and \(\varGamma\) and \(\mathrm{diag}\left(\mathbf{x}\right)\) denote the variance–covariance matrix and the square matrix whose diagonal components are \(\mathbf{x}\), respectively.
iFEP by particle filter and Kalman backward algorithm
Based on the observer-SSM, we estimated the posterior distribution of the latent internal state of the agent \(\mathbf{z}_t\) given all observations from 1 to T (\(\mathbf{x}_{1:T}\)) in a Bayesian manner, that is, \(P\left(\mathbf{z}_t|\mathbf{x}_{1:T}\right)\). This estimation was performed by forward and backward algorithms, which are called filtering and smoothing, respectively.
In filtering, the posterior distribution of \(z_t\) given observations until t (\(x_{1:t}\)) is sequentially updated in a forward direction as
$$P\left(\mathbf{z}_t|\mathbf{x}_{1:t}\right) \propto P\left(\mathbf{x}_t|\mathbf{z}_t,\theta\right)\int P\left(\mathbf{z}_t|\mathbf{z}_{t-1},\theta\right)P\left(\mathbf{z}_{t-1}|\mathbf{x}_{1:t-1}\right)\mathrm{d}\mathbf{z}_{t-1},$$
(66)
where \({{{\mathbf{x}}}}_t = \left( {{{{\mathbf{a}}}}_t^T,o_t} \right)^{{{\mathrm{T}}}}\) and \(\theta = \left\{ {\sigma ^2,\alpha ,P_o} \right\}\). The prior distribution of \({{{\mathbf{z}}}}_1\) is
$$P\left(\mathbf{z}_1\right) = \left[\prod_i \mathcal{N}\left(\mu_{i,1}|\mu_0,\sigma_\mu^2\right)\mathrm{Gam}\left(p_{i,1}|a_g,b_g\right)\right]\mathrm{Uni}\left(c_1|a_u,b_u\right),$$
(67)
where \(\mu_0\) and \(\sigma_\mu^2\) denote the mean and variance, \(\mathrm{Gam}(x|a_g,b_g)\) indicates the Gamma distribution with shape parameter \(a_g\) and scale parameter \(b_g\), and \(\mathrm{Uni}\left(x|a_u,b_u\right)\) indicates the uniform distribution from \(a_u\) to \(b_u\). We used a particle filter42 to sequentially calculate the posterior \(P\left(\mathbf{z}_t|\mathbf{x}_{1:t}\right)\), which cannot be derived analytically because of the nonlinear transition probability.
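The following Python sketch outlines a generic bootstrap particle filter for equation (66). The callables sample_prior, sample_transition and likelihood stand for the model-specific densities defined in equations (57)–(60) and (67); they are placeholders supplied by the user, not part of the original implementation.

```python
import numpy as np

def particle_filter(observations, sample_prior, sample_transition, likelihood,
                    n_particles=10_000, rng=np.random.default_rng(0)):
    """Approximate E[z_t | x_{1:t}] by sequential importance resampling."""
    particles = sample_prior(n_particles)                     # (n_particles, dim_z), drawn from P(z_1)
    filtered_means = []
    for x_t in observations:
        particles = sample_transition(particles)              # propagate through P(z_t | z_{t-1}, theta)
        weights = likelihood(x_t, particles)                  # evaluate P(x_t | z_t, theta)
        weights = weights / weights.sum()
        idx = rng.choice(len(particles), size=len(particles), p=weights)
        particles = particles[idx]                            # multinomial resampling
        filtered_means.append(particles.mean(axis=0))
    return np.array(filtered_means)
```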
After the particle filter, the posterior distribution of \({{{\mathbf{z}}}}_t\) given all observations (\({{{\mathbf{x}}}}_{1:T}\)) is sequentially updated in a backward direction as
$$\begin{aligned} P\left(\mathbf{z}_t|\mathbf{x}_{1:T}\right) &= \int P\left(\mathbf{z}_{t+1}|\mathbf{x}_{1:T}\right)P\left(\mathbf{z}_t|\mathbf{z}_{t+1},\mathbf{x}_{1:t},\theta\right)\mathrm{d}\mathbf{z}_{t+1} \\ &= \int P\left(\mathbf{z}_{t+1}|\mathbf{x}_{1:T}\right)\frac{P\left(\mathbf{z}_{t+1}|\mathbf{z}_t,\theta\right)P\left(\mathbf{z}_t|\mathbf{x}_{1:t},\theta\right)}{\int P\left(\mathbf{z}_{t+1}|\mathbf{z}_t,\theta\right)P\left(\mathbf{z}_t|\mathbf{x}_{1:t},\theta\right)\mathrm{d}\mathbf{z}_t}\mathrm{d}\mathbf{z}_{t+1}. \end{aligned}$$
(68)
However, this backward integration is intractable because of the non-Gaussian \(P\left(\mathbf{z}_T|\mathbf{x}_{1:T}\right)\), which was represented by the particle ensemble in the particle filter, and because of the nonlinear relationship between \(\mathbf{z}_t\) and \(\mathbf{z}_{t+1}\) in \(P\left(\mathbf{z}_{t+1}|\mathbf{z}_t,\theta\right)\) (equation (57)). Thus, we approximated \(P\left(\mathbf{z}_t|\mathbf{x}_{1:t}\right)\) as \(\mathcal{N}(\mathbf{z}_t|\mathbf{m}_t,\mathbf{V}_t)\), where \(\mathbf{m}_t\) and \(\mathbf{V}_t\) denote the sample mean and sample covariance of the particles at t, and we linearized \(P\left(\mathbf{z}_t|\mathbf{z}_{t-1},\theta\right)\) as
$$P\left(\mathbf{z}_t|\mathbf{z}_{t-1},\theta\right) \cong \mathcal{N}\left(\mathbf{z}_t|\mathbf{A}\mathbf{z}_{t-1} + \mathbf{b},\varGamma\right),$$
(69)
$$\mathbf{A} = \left.\frac{\partial\mathbf{F}\left(\mathbf{z}_{t-1},\mathbf{a}_{t-1}\right)}{\partial\mathbf{z}_{t-1}}\right|_{\mathbf{m}_t},$$
(70)
$$\mathbf{b} = \mathbf{F}\left(\mathbf{m}_t,\mathbf{a}_{t-1}\right) - \mathbf{A}\mathbf{m}_t,$$
(71)
where A denotes a Jacobian matrix. Because these approximations make the integration in equation (68) tractable, the posterior distribution \(P\left(\mathbf{z}_t|\mathbf{x}_{1:T}\right)\) can be approximated by a Gaussian distribution as
$$P\left(\mathbf{z}_t|\mathbf{x}_{1:T}\right) = \mathcal{N}\left(\mathbf{z}_t|\widehat{\mathbf{m}}_t,\widehat{\mathbf{V}}_t\right),$$
(72)
whose mean and variance were analytically updated by Kalman backward algorithms as43
$$\widehat{\mathbf{m}}_t = \mathbf{m}_t + \mathbf{J}_t\left\{\widehat{\mathbf{m}}_{t+1} - \left(\mathbf{A}\mathbf{m}_t + \mathbf{b}\right)\right\},$$
(73)
$$\widehat{\mathbf{V}}_t = \mathbf{V}_t + \mathbf{J}_t\left\{\widehat{\mathbf{V}}_{t+1} - \left(\mathbf{A}\mathbf{V}_t\mathbf{A}^{\mathrm{T}} + \varGamma\right)\right\}\mathbf{J}_t^{\mathrm{T}},$$
(74)
where
$$\mathbf{J}_t = \mathbf{V}_t\mathbf{A}^{\mathrm{T}}\left(\mathbf{A}\mathbf{V}_t\mathbf{A}^{\mathrm{T}} + \varGamma\right)^{-1}.$$
(75)
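The backward pass of equations (73)–(75) can be sketched in Python as follows, assuming that the filtered means m_t, covariances V_t and the local linearizations (A_t, b_t) of equations (69)–(71) have already been computed; the indexing convention (As[t] linearizes the transition into step t) is our assumption.

```python
import numpy as np

def kalman_backward(ms, Vs, As, bs, Gamma):
    """ms: (T, d) filtered means; Vs: (T, d, d) filtered covariances; As[t], bs[t]: linearization of P(z_t|z_{t-1})."""
    T = len(ms)
    m_hat, V_hat = [None] * T, [None] * T
    m_hat[-1], V_hat[-1] = ms[-1], Vs[-1]                # initialize with the final filtered estimate
    for t in range(T - 2, -1, -1):
        A, b = As[t + 1], bs[t + 1]
        P_pred = A @ Vs[t] @ A.T + Gamma                 # predicted covariance of z_{t+1}
        J = Vs[t] @ A.T @ np.linalg.inv(P_pred)          # smoother gain (eq. 75)
        m_hat[t] = ms[t] + J @ (m_hat[t + 1] - (A @ ms[t] + b))      # eq. (73)
        V_hat[t] = Vs[t] + J @ (V_hat[t + 1] - P_pred) @ J.T         # eq. (74)
    return np.array(m_hat), np.array(V_hat)
```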
Impossibility of model discrimination
In the ReCU and Q-learning models, the action selections were formulated with the same softmax functions as
$$P\left(a_{i,t} = 1\right) = \frac{\exp\left(\beta\left(E[\mathrm{Reward}_{i,t}] + c_t E[\mathrm{Info}_{i,t}]\right)\right)}{\sum_j\exp\left(\beta\left(E[\mathrm{Reward}_{j,t}] + c_t E[\mathrm{Info}_{j,t}]\right)\right)},$$
(76)
$$P\left(a_{i,t} = 1\right) = \frac{\exp\left(\beta_t Q_{i,t}\right)}{\sum_j\exp\left(\beta_t Q_{j,t}\right)},$$
(77)
which correspond to the ReCU and Q-learning models, respectively. Both equations contain time-dependent meta-parameters, that is, ct and βt. For both models, the goodness of fit (that is, the likelihood) for the actual behavioral data can be freely improved by tuning the time-dependent meta-parameters. Thus, discrimination between the ReCU and Q-learning models is essentially impossible.
Estimation of parameters in iFEP
The ReCU model has several parameters: \(\sigma_w^2\), α, β, Po and \(\epsilon\). In the estimation, we set \(\epsilon\) to 1, which was the optimal value for estimation on the artificial data (Supplementary Fig. 3). We assumed a unit intensity of reward, that is, \(\ln\left\{P_o/(1 - P_o)\right\} = 1\), because Po and β cannot be estimated separately: they appear only as the product of β and \(\ln\left\{P_o/(1 - P_o)\right\}\) in the expected net utility (equations (30) and (42)). This treatment is suitable for a relative comparison between the curiosity meta-parameter and the reward. In addition, we treated βct as a single latent variable, \(\hat{c}_t = \beta c_t\), because of the multiplication by β in the expected net utility (equations (30) and (42)). Thus, the estimate of ct can be obtained by dividing the estimated \(\hat{c}_t\) by the estimated β. Therefore, the hyperparameters to be estimated were \(\sigma_w^2\), α and β.
To estimate these parameters \(\theta = \left\{\sigma_w^2,\alpha,\beta\right\}\), we extended the observer-SSM to a self-organizing SSM44 in which θ was treated as a set of constant latent variables:
$$P\left(\mathbf{z}_t,\theta|\mathbf{x}_{1:t}\right) \propto P\left(\mathbf{x}_t|\mathbf{z}_t\right)\int P\left(\mathbf{z}_t|\mathbf{z}_{t-1},\theta\right)P\left(\mathbf{z}_{t-1},\theta|\mathbf{x}_{1:t-1}\right)\mathrm{d}\mathbf{z}_{t-1},$$
(78)
where \(P\left(\theta\right) = \mathrm{Uni}\left(\sigma^2|a_\sigma,b_\sigma\right)\mathrm{Uni}(\alpha|a_\alpha,b_\alpha)\mathcal{N}(\beta|m_\beta,v_\beta)\). To sequentially calculate the posterior \(P\left(\mathbf{z}_t,\theta|\mathbf{x}_{1:t}\right)\) using the particle filter, we used 100,000 particles and augmented the state vector of each particle with the parameter θ, whose values were randomly sampled from the prior at initialization and not updated thereafter.
The hyperparameter values used in this estimation were μ0 = 0, \(\sigma_\mu^2 = 0.01^2\), ag = 10, bg = 0.001, au = −15, bu = 15, aσ = 0.2, bσ = 0.7, aα = 0.04, bα = 0.06, aβ = 0 and bβ = 50, which were heuristically chosen as values for which the parameters were correctly estimated from the artificial data (Supplementary Fig. 2).
Statistical testing with Monte Carlo simulations
Supplementary Fig. 5 shows statistical testing of the negative curiosity estimated in Fig. 5. The null hypothesis is that an agent with no curiosity (that is, ct = 0) decides on a choice depending only on its recognition of the reward probability. Under the null hypothesis, model simulations were repeated 1,000 times under the same experimental conditions as in Fig. 5, and the curiosity was estimated for each simulation using iFEP. We adopted the temporal average of the estimated curiosity as the test statistic and plotted its null distribution. By comparing the estimated curiosity of the rat behavior with this null distribution, we computed the P value of a one-sided left-tailed test.
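The test itself reduces to comparing the observed statistic with the simulated null distribution, as in the Python sketch below; the numbers in the usage line are placeholders, not the values estimated from the rat behavior.

```python
import numpy as np

def left_tailed_p_value(observed_statistic, null_statistics):
    """P value of a one-sided left-tailed Monte Carlo test."""
    null_statistics = np.asarray(null_statistics)
    return (null_statistics <= observed_statistic).mean()

# Illustrative usage with placeholder numbers:
p = left_tailed_p_value(-0.5, np.random.default_rng(0).normal(0.0, 0.2, 1000))
```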
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.