Continuous-Time Mean--Variance Portfolio Selection via Reinforcement Learning
We consider continuous-time mean--variance (MV) portfolio selection with reinforcement learning (RL). The aim is to achieve the best tradeoff between exploration and exploitation, which we formulate as an entropy-regularized, relaxed stochastic control problem. We prove that the optimal feedback policy for this problem must be Gaussian, with time-decaying variance. We establish connections between the entropy-regularized MV problem and the classical MV problem, including solvability equivalence and convergence as the exploration weighting parameter decays to zero. We also prove a policy improvement theorem (PIT) for the RL continuous-time MV problem. Finally, based on the theory, we devise an implementable RL algorithm and find in simulations that it outperforms both an adaptive-control-based method and a deep neural-network-based algorithm by a large margin. This is joint work with Haoran Wang (Columbia).
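As a rough illustration of the Gaussian feedback policy with time-decaying variance, the Python sketch below samples portfolio allocations whose exploration variance shrinks toward the terminal time. The functional forms of the mean and variance and all parameter values (rho, sigma, lam, omega, T) are illustrative assumptions for this sketch, not the paper's exact formulas or algorithm.

```python
import numpy as np

# Illustrative sketch: a Gaussian feedback policy whose variance decays
# toward the horizon, mimicking the "exploration decays over time" feature
# of the entropy-regularized MV solution. All values below are placeholders.

rho, sigma, lam, T = 0.4, 0.2, 0.1, 1.0  # Sharpe ratio, volatility, exploration weight, horizon
omega = 1.2                              # target-related constant (illustrative)
rng = np.random.default_rng(0)

def policy_mean(x):
    # Schematic linear feedback in current wealth x
    return -(rho / sigma) * (x - omega)

def policy_var(t):
    # Variance shrinks as t approaches T, so exploration dies out near the horizon
    return (lam / (2.0 * sigma ** 2)) * np.exp(rho ** 2 * (T - t))

def sample_allocation(t, x):
    # Draw a randomized (exploratory) allocation from the Gaussian policy
    return rng.normal(policy_mean(x), np.sqrt(policy_var(t)))

# Example: policy variance decreases along the investment horizon
for t in (0.0, 0.5, 1.0):
    print(f"t={t:.1f}  variance={policy_var(t):.4f}  sample={sample_allocation(t, x=1.0):+.4f}")
```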