Computing Return-to-go
======================

Let us load the previous OAR dataframe:

>>> import pandas as pd
>>> df = pd.read_pickle("docs/source/golf_oar.pkl")

Return-to-go (rtg) is the discounted sum of rewards from the current date
until the episode's end. It is an important signal, since it is what actions
should maximize. However, computing rtg for truncated episodes (as is the
case here) is a bit tricky, since truncated episodes never actually reach
their end. By default, our method of rtg computation assumes such episodes
are infinite and scales them properly, so that the rtg at the beginning of
the episode and the rtg near the truncation are comparable.

>>> from sara.oar import enrich_rtg
>>> df = enrich_rtg(df, discount=0.75)
>>> df.head()  # doctest: +NORMALIZE_WHITESPACE
signal                            obs0      act0      rew1                rtg0
key                           position      move  distance cumulative distance
episodes date
0        2000-01-01 10:00:00  0.176277  0.030472 -0.176277           -0.631686
         2000-01-01 10:01:00  0.206749 -0.103998 -0.206749           -0.605225
         2000-01-01 10:02:00  0.102750  0.075045 -0.102750           -0.523078
         2000-01-01 10:03:00  0.177795  0.094056 -0.177795           -0.566191
         2000-01-01 10:04:00  0.271852 -0.195104 -0.271852           -0.507397

As you can see, `enrich_rtg` simply adds a new ("rtg0", "cumulative distance")
column to the dataframe.

Because it is not easy to select a discount, we provide a simple helper
function that returns a discount based on a horizon, chosen so that the rtg
tail beyond this horizon is relatively negligible.

>>> from sara.oar import discount_from_horizon
>>> discount_from_horizon(10)
0.7411344491069477

Congrats, you now have a dataframe enriched with rtg. Let's save it for the
next guide:

>>> df.to_pickle("docs/source/golf_enriched.pkl")
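To build intuition for what return-to-go is, here is a minimal sketch of the
plain (non-truncation-aware) quantity: a reverse discounted cumulative sum of
rewards. The function name `return_to_go` and its signature are ours for
illustration only; this is not the sara API, and it omits the infinite-episode
scaling that `enrich_rtg` applies for truncated episodes.

```python
def return_to_go(rewards, discount):
    """Discounted sum of future rewards at each step: rtg[t] = r[t] + discount * rtg[t+1]."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the already-computed tail.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + discount * running
        rtg[t] = running
    return rtg

print(return_to_go([1.0, 1.0, 1.0], 0.5))  # [1.75, 1.5, 1.0]
```

Note how the rtg at the start of the episode aggregates all future rewards,
which is why actions are trained to maximize it.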
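As a hedged sketch of how such a helper could be derived (the actual sara
implementation may differ): choose the discount so that `discount**horizon`
equals a small tail weight. With a 5% tail, this reproduces the value shown
above for a horizon of 10. The `tail` parameter is our assumption, not a
documented argument.

```python
def discount_from_horizon_sketch(horizon, tail=0.05):
    """Discount such that rewards beyond `horizon` steps carry only `tail` weight."""
    # Solve discount**horizon == tail for discount.
    return tail ** (1.0 / horizon)

print(discount_from_horizon_sketch(10))  # 0.7411344491069477
```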