Review | Clinical Therapeutics, Volume 44, Issue 1, P139-154, January 2022

Reinforcement Learning Methods in Public Health

      Abstract

      Purpose

      Reinforcement learning (RL) is the subfield of machine learning focused on optimal sequential decision making under uncertainty. An optimal RL strategy maximizes cumulative utility by experimenting only if and when the information generated by experimentation is likely to outweigh associated short-term costs. RL represents a holistic approach to decision making that evaluates the impact of every action (ie, data collection, allocation of resources, and treatment assignment) in terms of short-term and long-term utility to stakeholders. Thus, RL is an ideal model for a number of complex decision problems that arise in public health, including resource allocation in a pandemic, monitoring or testing, and adaptive sampling for hidden populations. Nevertheless, although RL has been applied successfully in a wide range of domains, including precision medicine, it has not been widely adopted in public health. The purposes of this review are to introduce key ideas in RL and to identify challenges and opportunities associated with the application of RL in public health.

      Methods

      We provide a nontechnical review of the theoretical and methodologic underpinnings of RL. A running example of RL for the management of an infectious disease is used to illustrate ideas.

      Findings

      RL has the potential to make a transformative impact in a range of sequential decision problems in public health. By allocating resources if, when, and where they are most impactful, RL can improve health outcomes while reducing resource consumption.

      Implications

      Public health researchers and stakeholders should consider RL as a means of efficiently using data to inform optimal evidence-based decision making.
