Enhancing human learning via spaced repetition optimization

Significance
Understanding human memory has been a long-standing problem in various scientific disciplines. Early works focused on characterizing human memory using small-scale controlled experiments and these empirical studies later motivated the design of spaced repetition algorithms for efficient memorization. However, current spaced repetition algorithms are rule-based heuristics with hard-coded parameters, which do not leverage the automated fine-grained monitoring and greater degree of control offered by modern online learning platforms. In this work, we develop a computational framework to derive optimal spaced repetition algorithms, specially designed to adapt to the learners’ performance. A large-scale natural experiment using data from a popular language-learning online platform provides empirical evidence that the spaced repetition algorithms derived using our framework are significantly superior to alternatives.


Proof of Proposition 1
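According to Eq. 1, the recall probability m(t) depends on the forgetting rate, n(t), and the time elapsed since the last review, D(t) := t − t_r. Moreover, we can readily write the differential of D(t) as dD(t) = dt − D(t)dN(t): D(t) grows at unit rate between reviews and is reset to zero whenever the counting process N(t) jumps, i.e., at each review.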
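The differential of D(t) above can be checked numerically. The following is a minimal sketch, not part of the original proof: it integrates dD(t) = dt − D(t)dN(t) on a regular grid and compares the result with t − t_r. The function names, the grid-based integration and the convention D(0) = 0 are assumptions made for illustration, and `recall_probability` assumes that Eq. 1 is the exponential forgetting curve m(t) = exp(−n(t)D(t)) referred to elsewhere in this document.

```python
import math


def elapsed_time(review_times, t_end, dt=0.01):
    """Integrate dD(t) = dt - D(t) dN(t) on a regular grid and return D(t_end).

    Assumes D(0) = 0. A review that falls inside a step is applied at the
    end of that step, so the result is exact when reviews lie on the grid.
    """
    pending = sorted(review_times)
    d, idx = 0.0, 0
    for k in range(1, round(t_end / dt) + 1):
        t = k * dt
        d += dt  # drift term: the clock advances by dt
        while idx < len(pending) and pending[idx] <= t + 1e-12:
            d -= d  # jump term: -D(t) dN(t) with dN(t) = 1 resets the clock
            idx += 1
    return d


def recall_probability(n, d):
    """Exponential forgetting curve m = exp(-n * D) (assumed form of Eq. 1)."""
    return math.exp(-n * d)


# Reviews at t = 1.0 and t = 2.5: at t = 4.0 the elapsed time is 4.0 - 2.5.
d_end = elapsed_time([1.0, 2.5], 4.0, dt=0.5)
m_end = recall_probability(1.0, d_end)
```

With the last review at t_r = 2.5, the integrated value at t = 4.0 agrees with t − t_r = 1.5.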

Lemma 2
Consider the following family of losses with parameter d > 0, [2] where c1, c2 ∈ R are arbitrary constants. Then, the cost-to-go J_d(m(t), n(t), t) that satisfies the HJB equation, defined by Eq. 7, is given by: [3] and the optimal intensity is given by:
Proof. Consider the family of losses defined by Eq. 2 and the functional form for the cost-to-go defined by Eq. 3. Then, for any parameter value d > 0, the optimal intensity u*_d(t) is given by and the HJB equation is satisfied: where for notational simplicity m = m(t), n = n(t) and u = u(t).
Fig. S1. Short-term recall probability corresponds to m(t + 5) and long-term recall probability to m(t + 15). In all cases, we use α = 0.5, β = 1, n(0) = 1 and t_f − t_0 = 20. Moreover, we set q = 3 · 10^−4 for Memorize, µ = 0.6 for the uniform reviewing schedule, t_lm = 5 and µ = 2.38 for the last minute reviewing schedule, and m_th = 0.7 and c = ζ = 5 for the threshold based reviewing schedule. Under these parameter values, the total number of reviewing events is equal for all algorithms (with a tolerance of 5%).

Proof of Theorem 3
Consider the family of losses defined by Eq. 2 in Lemma 2 whose optimal intensity is given by: Now, set the constants c1, c2 ∈ R to the following values: Since the HJB equation is satisfied for any value of d > 0, we can recover the quadratic loss ℓ(m, n, u) and derive its corresponding optimal intensity u*(t) using pointwise convergence: where we used L'Hôpital's rule to evaluate the limit d → 1. This concludes the proof.

Synthetic Experiments
In this section, our goal is to analyze the performance of Memorize under a controlled setting using metrics and baselines that we cannot compute in the real data we have access to. We evaluate the performance of Memorize using two quality metrics: the recall probability m(t + τ) at a given time in the future t + τ and the forgetting rate n(t). Here, by considering high (low) values of τ, we can assess long-term (short-term) retention. Moreover, we compare the performance of our algorithm with three baselines: (i) a uniform reviewing schedule, which sends item(s) for review at a constant rate µ; (ii) a threshold-based reviewing schedule, which increases the reviewing intensity of an item by c exp((t − s)/ζ) at time s, when its recall probability reaches a threshold m_th; and (iii) a last-minute reviewing schedule, which only sends item(s) for review during a period [t_lm, t_f], at a constant rate µ therein.* Unless otherwise stated, we set the parameters of the baselines and our algorithm such that the total number of reviewing events during (t_0, t_f] is equal across schedules.
First, we run 100 independent simulations and compute the above quality metrics over time. Figure S1 summarizes the results, which show that Memorize: (i) consistently outperforms all the baselines in terms of both quality metrics; (ii) is more robust across runs both in terms of quality metrics and reviewing schedule; and (iii) reduces the reviewing intensity as time goes by and the recall probability improves, as one could have expected.
Second, we experiment with different values for the parameter q, which controls the learning effort required by Memorize: the lower its value, the higher the number of reviewing events. Intuitively, one may also expect the learning effort to influence how quickly a learner memorizes a given item: the lower the value of q, the quicker a learner will memorize it. Figure S2a confirms this intuition by showing the average forgetting rate n(t) and number of reviewing events N(t) at several times t for different q values.
Finally, we experiment with different values for the parameters α and β, which capture the aptitude of a learner and the difficulty of the item to be learned: the higher (lower) the value of α (β), the quicker a learner will memorize the item. In Figure S2b, we quantitatively evaluate this effect by means of the average time the learner takes to reach a forgetting rate of n(t) = n(0)/2 using Memorize for different parameter values.
* The last-minute reviewing schedule is only introduced here and was not used in the empirical evaluations, since in Duolingo there is no terminal time t_f which users target. Additionally, in many (user, item) pairs, the first review takes place close to t = 0 and thus the last-minute baseline is equivalent to the uniform reviewing schedule.
Fig. S2. Learning effort, aptitude of the learner and item difficulty. Panel (a) shows the average forgetting rate n(t) and number of reviewing events N(t) for different values of the parameter q, which controls the learning effort. Panel (b) shows the average time the learner takes to reach a forgetting rate n(t) = n(0)/2 for different values of the parameters α and β, which capture the aptitude of the learner and the item difficulty. In Panel (a), we use α = 0.5, β = 1, n(0) = 1 and t_f − t_0 = 20. In Panel (b), we use n(0) = 20 and q = 0.02. In both panels, error bars are too small to be seen.
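The synthetic setup of this section can be sketched as a small simulation. The code below is a hedged illustration rather than the code used for Figures S1 and S2. It assumes the exponential forgetting curve m(t) = exp(−n(t)D(t)), recall outcomes drawn as Bernoulli(m(t)), forgetting-rate updates by the factors (1 − α) and (1 + β), and a Memorize intensity of the form u(t) = q^(−1/2)(1 − m(t)) as reported in the main paper. Events are sampled by thinning a Poisson process whose rate upper-bounds the intensity. The function name `simulate` and the convention that the item was last seen at t = 0 are assumptions.

```python
import math
import random


def simulate(intensity, horizon=20.0, n0=1.0, alpha=0.5, beta=1.0,
             max_rate=1.0, seed=0):
    """Simulate reviews of a single item by thinning a Poisson process.

    intensity(m) must return a reviewing rate in [0, max_rate] given the
    current recall probability m. Returns (review_times, final_forgetting_rate).
    """
    rng = random.Random(seed)
    t, last_review, n = 0.0, 0.0, n0
    reviews = []
    while True:
        t += rng.expovariate(max_rate)  # next candidate event
        if t > horizon:
            break
        m = math.exp(-n * (t - last_review))  # recall probability just before t
        if rng.random() < intensity(m) / max_rate:  # accept with prob u(t)/max_rate
            recalled = rng.random() < m  # Bernoulli recall outcome
            n *= (1.0 - alpha) if recalled else (1.0 + beta)
            last_review = t
            reviews.append(t)
    return reviews, n


# Parameter values taken from the caption of Fig. S1.
q = 3e-4
rate = q ** -0.5
memorize_reviews, n_memorize = simulate(lambda m: rate * (1.0 - m), max_rate=rate)
uniform_reviews, n_uniform = simulate(lambda m: 0.6, max_rate=0.6)
```

Averaging the final forgetting rate and the number of reviews over many seeds gives quantities comparable in spirit to those discussed above; the threshold-based and last-minute baselines are omitted from this sketch.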

Our Modeling Framework Using the Power-Law Forgetting Curve Model
Under the power-law forgetting curve model, the probability of recalling an item i at time t is given by (2): [4] where t_r is the time of the last review, n_i(t) ∈ R+ is the forgetting rate and ω is a time scale parameter. As in Proposition 1 for the exponential forgetting curve model, we can express the dynamics of the recall probability m_i(t) by means of an SDE with jumps: where D_i(t) := t − t_r and thus the differential of D_i(t) is readily given by dD_i(t) = dt − D_i(t)dN_i(t). Next, as in the case of the exponential forgetting curve model in the main paper, we consider a single item, dropping the item subscript (so that, e.g., r_i(t) = r(t)), and adapt Lemma 1 to the power-law forgetting curve model as follows:

Lemma 3. Let x(t), y(t) and k(t) be three jump-diffusion processes defined by the following jump SDEs:
where for notational simplicity we dropped the arguments of the functions f, g, h, p, q, s, v and the arguments of the state variables.
and J = F in the above Lemma, the differential of the optimal cost-to-go is readily given by Moreover, under the same loss function ℓ(m(t), n(t), u(t)) as in Eq. 8, it is easy to show that the optimal cost-to-go J needs to satisfy the following nonlinear partial differential equation: Then, we can adapt Lemma 2 to derive the optimal scheduling policy for a single item under the power-law forgetting curve model:
Lemma 4. Consider the following family of losses with parameter d > 0, [7] where c1, c2 ∈ R are arbitrary constants. Then, the cost-to-go J_d(m(t), n(t), t) that satisfies the HJB equation, defined by Eq. 6, is given by: which is independent of D(t), and the optimal intensity is given by:
Proof. Consider the family of losses defined by Eq. 7 and the functional form for the cost-to-go defined by Eq. 8. Then, for any parameter value d > 0, the optimal intensity u*_d(t) is given by and the HJB equation is satisfied: where for notational simplicity m = m(t), n = n(t), D = D(t) and u = u(t).
Finally, reusing Theorem 3, the optimal reviewing intensity for a single item under the power-law forgetting curve model is given by It is then straightforward to derive the optimal reviewing intensity for a set of items, which adopts the same form as in Theorem 4.
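For intuition, the two forgetting curves discussed in this section can be compared numerically. This is a minimal sketch under stated assumptions: the exponential curve is taken to be m = exp(−nD) and the power-law curve is taken to be m = (1 + ωD)^(−n), with D the time elapsed since the last review, n the forgetting rate and ω the time scale parameter. The function names are hypothetical.

```python
import math


def recall_exponential(n, elapsed):
    """Assumed exponential forgetting curve: m = exp(-n * D)."""
    return math.exp(-n * elapsed)


def recall_power_law(n, elapsed, omega=1.0):
    """Assumed power-law forgetting curve: m = (1 + omega * D) ** (-n)."""
    return (1.0 + omega * elapsed) ** (-n)


# Recall probability after 0, 1, 5 and 15 time units with n = 1 and omega = 1.
curve = [(d, recall_exponential(1.0, d), recall_power_law(1.0, d))
         for d in (0.0, 1.0, 5.0, 15.0)]
```

Both curves start at 1 immediately after a review. For ω = 1 and equal n, the power-law curve decays more slowly, since 1 + D ≤ exp(D).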

Our Modeling Framework Using the Multiscale Context Model
In this section, we will briefly describe the Multiscale Context Model (MCM) of memory (3). For modeling the probability of recall m_MCM(t), we can use a differentiable approximation to the min{1, s_M(t)} function. For example, we can use the hyperbolic tangent and approximate m_MCM(t) via m̃_MCM(t): [10] One can contrast Eq. 9 and Eq. 10 with Eq. 2 and Eq. 3 (or Eq. 5), respectively, to compare the derivations for the exponential forgetting curve model and the MCM. The extension of Lemma 2 to Eq. 10 is straightforward, and the nonlinear partial differential equation corresponding to Eq. 7 (or Eq. 6) can be solved to arrive at the optimal scheduling for the MCM. The resulting equation, however, does not readily admit an analytical solution, unlike the exponential and power-law forgetting curve models.

Predictive performance of the memory model
Before we evaluate the predictive performance of the exponential and power-law forgetting curve models, whose forgetting rates we estimated using a variant of Half-life regression (HLR) (4), we highlight the differences between the original HLR and the variant we used.
The original HLR and the variant we used differ in the way successful and unsuccessful recalls change the forgetting rate. In our work, the forgetting rate at time t depends on n✓(t) = ∫_0^t r(τ)dN(τ) and n✗(t) = ∫_0^t (1 − r(τ))dN(τ), i.e., the numbers of successful and unsuccessful recalls up to time t. In contrast, in the original HLR, the forgetting rate at time t depends on √(n✓(t) + 1) and √(n✗(t) + 1). The rationale behind our modeling choice is to be able to express the dynamics of the forgetting rate using a linear stochastic differential equation with jumps. Moreover, Settles et al. consider each session to contain multiple review events for each item. Hence, within a session, n✓(t) and n✗(t) may increase by more than one. In contrast, we consider each session to contain a single review event for each item because the reviews in each session take place in a very short time and it is likely that, after the first review, the user will recall the item correctly during that session. Hence, we only increase one of n✓(t) or n✗(t) by exactly 1 after each session and consider an item to have been successfully recalled during a session if all reviews were successful, i.e., p_recall = 1. Notably, ∼83% of the items were successfully recalled during a session.
Table S1 summarizes our results on the Duolingo dataset in terms of mean absolute error (MAE), area under curve (AUC) and correlation (COR_h), which show that the performance of both the exponential and power-law forgetting curve models with forgetting rates estimated using the variant of HLR is comparable to the performance of the exponential forgetting curve model with forgetting rates estimated using the original HLR. In the above results, note that we fitted a single set of parameters α and β for all items and a different initial forgetting rate n_i(0) per item i. One could think of estimating item-specific (or even user-specific) parameters α and β; however, we found that our dataset is not large enough to provide accurate estimates of such item-specific parameters.
More generally, there is always a trade-off between the complexity of the model and the size of the dataset used to fit the model parameters.
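The session-level bookkeeping described above can be illustrated with a short sketch. It assumes that the forgetting rate is multiplied by (1 − α) after a successful session and by (1 + β) after an unsuccessful one, so that n(t) = n(0)(1 − α)^n✓(t) (1 + β)^n✗(t). The function names and the (correct, seen) representation of a session are hypothetical and are not taken from the Duolingo dataset schema.

```python
def session_recalled(n_correct, n_seen):
    """A session counts as a successful recall only if every review was correct."""
    return n_correct == n_seen


def forgetting_rate(n0, alpha, beta, sessions):
    """Apply one multiplicative update per session.

    sessions is a list of (n_correct, n_seen) pairs for a single item;
    each session increments exactly one of the two counters.
    """
    n = n0
    for n_correct, n_seen in sessions:
        if session_recalled(n_correct, n_seen):
            n *= 1.0 - alpha  # successful recall lowers the forgetting rate
        else:
            n *= 1.0 + beta   # unsuccessful recall raises it
    return n


# Three sessions: all correct, one mistake out of three, all correct.
n_final = forgetting_rate(1.0, alpha=0.5, beta=1.0,
                          sessions=[(2, 2), (2, 3), (1, 1)])
```

Because the update is multiplicative, the result depends only on the two counters and not on the order of the sessions, which is what allows the dynamics to be written as a linear jump SDE.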

Constant vs time-varying α and β parameters
In previous studies (5,6), it has been shown that the retention rate follows an inverted U-shape, i.e., massed practice does not improve the forgetting rate, and thus one could argue that our framework should consider time-varying parameters α_i(t) and β_i(t). In this section, we show that, for the reviewing sequences in our Duolingo dataset, allowing for time-varying α_i(t) and β_i(t) in our modeling framework does not lead to more accurate recall predictions. This was one of the reasons, in addition to tractability, for considering constant parameters α_i and β_i.
Formulation. In Eq. 2, we have considered α and β as constants, i.e., they do not vary with the review interval t − t_r, where t_r is the time of the last review. We have dropped the subscript i denoting the item for ease of exposition. We can make a zeroth-order approximation to time-varying (α, β) by allowing them to be piecewise constant over K mutually exclusive and exhaustive review-time intervals B^(i), i ∈ [K]. We denote the value that α (β) takes in interval B^(i) as α^(i) (β^(i)) and modify the forgetting rate update equation to If we find that ∃ {i, j} ⊂ [K] such that α^(i) (β^(i)) is significantly different from α^(j) (β^(j)), then we would conclude that α (β) varies with the review time.
We obtain repeated estimates of α^(i) and β^(i), i ∈ [K], by fitting our model to datasets sampled with replacement from our Duolingo dataset, i.e., via bootstrapping. Welch's t-test is used to test whether the difference in the mean values of the parameters in different bins is significant.
Experimental setup. We set the bin boundaries by determining the K-quantiles of the review times in our dataset. Table S2 shows that the bin boundaries for different K are quite varied and adequately cover long time-scales as well as review intervals which are short enough to capture massed practice. This method of binning also ensures that we have sufficient samples (∼5.2 × 10^6/K) for accurate estimation of all parameters. Then we use the variant of HLR described in Appendix 8 to fit the parameters in 400 different datasets using bootstrapping. The regularization parameters are determined via grid search using a train/test dataset. We thus obtain 400 samples of α^(i) and β^(i) for K ∈ {3, 4, 5} and i ∈ [K]. Using Welch's t-test for distributions with unequal variances, we observe that the mean values of the distributions of α^(i) (β^(i)) and α^(j) (β^(j)) are not significantly different for any {i, j}. As an example, the p-values obtained for K = 5 are shown in Table S3.
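The testing procedure described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the analysis code: `bootstrap_estimates` resamples a dataset with replacement and applies a user-supplied estimator (a hypothetical stand-in for refitting the HLR variant), and `welch_t` computes Welch's t statistic together with the Welch-Satterthwaite degrees of freedom. Converting the statistic into a p-value requires the Student t distribution and is left out.

```python
import math
import random
import statistics


def bootstrap_estimates(data, estimator, n_resamples=400, seed=0):
    """Return estimator(resample) for n_resamples resamples drawn with replacement."""
    rng = random.Random(seed)
    return [estimator(rng.choices(data, k=len(data)))
            for _ in range(n_resamples)]


def welch_t(sample_a, sample_b):
    """Welch's t statistic and degrees of freedom for two samples
    with possibly unequal variances."""
    va = statistics.variance(sample_a) / len(sample_a)
    vb = statistics.variance(sample_b) / len(sample_b)
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(sample_a) - 1)
                           + vb ** 2 / (len(sample_b) - 1))
    return t, df


# Toy example: the sample mean stands in for the fitted parameter of a bin.
bin_a = bootstrap_estimates([0.4, 0.5, 0.6, 0.5, 0.45], statistics.mean, seed=1)
bin_b = bootstrap_estimates([0.5, 0.55, 0.45, 0.5, 0.6], statistics.mean, seed=2)
t_stat, dof = welch_t(bin_a, bin_b)
```

In the setting of this section, `bin_a` and `bin_b` would hold the 400 bootstrap estimates of a parameter in two different review-time bins.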

Likelihood distributions for different reviewing schedules
In this section, we compute the likelihood of each sequence of review events in our dataset under different reviewing schedules. Figure S3 summarizes the results by showing the empirical distribution of estimated likelihood values for the Memorize, threshold and uniform schedules. Since Duolingo uses a near-optimal hand-tuned reviewing schedule, the peak of the distribution for Memorize corresponds to the highest likelihood values, i.e., there are many (user, item) pairs that follow Memorize closely.
Fig. S3. Empirical distribution of log-likelihood values for all (user, item) pairs under different reviewing schedules. Since Duolingo uses a near-optimal hand-tuned reviewing schedule, the peak of the distribution for Memorize corresponds to the highest likelihood values, i.e., there are many (user, item) pairs that follow Memorize closely.
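As a point of reference, the log-likelihood of a sequence of review times under a reviewing intensity u(t) can be computed with the standard formula for temporal point processes, log L = Σ_i log u(t_i) − ∫_0^T u(t)dt. The sketch below is an assumption about how such a computation could look, since this section does not spell it out. The function name, the midpoint-rule integration and the example values are illustrative only.

```python
import math


def log_likelihood(review_times, intensity, horizon, steps=10000):
    """Point-process log-likelihood on (0, horizon].

    intensity(t) must return the reviewing rate u(t) > 0; for a
    history-dependent schedule it should return the value just before t.
    The integral of u(t) is approximated with the midpoint rule.
    """
    dt = horizon / steps
    integral = sum(intensity((k + 0.5) * dt) for k in range(steps)) * dt
    return sum(math.log(intensity(t)) for t in review_times) - integral


# Uniform schedule with constant rate mu: closed form is k*log(mu) - mu*T.
mu = 0.6
ll_uniform = log_likelihood([1.0, 2.0, 3.0], lambda t: mu, horizon=5.0)
```

For a constant rate the numerical value matches the closed form k log µ − µT, where k is the number of reviews.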

Leitner system: particular case of our modeling framework
In this section, we first describe the Leitner system and then show that it can be explicitly cast using our modeling framework with particular choices of α, β and ni(0).
Leitner system. More than 40 years ago, Sebastian Leitner introduced the Leitner System as a method used to memorize flash cards (7). Since then, several variants of the system have been introduced and some of them are still in active use. Next, for the sake of brevity, we describe one of these variants, which has been recently studied by Reddy et al. (8) and Settles et al. (4).
The learner maintains several decks of flashcards, labelled j ∈ Z, each of which is reviewed at exponentially decreasing frequency λ_j = λ_0 c^j, for some constants c and λ_0. Whenever a card i from deck j is reviewed, it is moved to deck j + 1 if it is recalled correctly (i.e., if r_i(t) = 1), or else (if r_i(t) = 0) it is moved to deck j − 1, as shown in Figure S4. The intuition behind the Leitner system is that cards which belong to a deck with a large index j have been learned (or were easy to learn), i.e., they have a low forgetting rate, and cards which are in lower decks have not been learned yet (or were difficult to learn), i.e., they have a high forgetting rate. Then, the learning strategy of the learner is to select flashcards at random from any deck as long as the reviewing rate for flashcards in each deck j remains λ_j, i.e., the expected number of flashcards selected for review from deck j in any time interval ∆t is λ_j × ∆t.
Modeling the Leitner system. For ease of exposition, we assume that the number of decks is unbounded both from above and below†, i.e., there are always decks with a higher (or lower) rate of review than the current deck. Under this assumption, we can faithfully represent the Leitner system under our modeling framework as follows.
First, we assign a fixed forgetting rate ñ_j to all flashcards in deck j, i.e., if at time t a flashcard i is in deck j then n_i(t) = ñ_j, and, at the beginning, all flashcards are placed in the first deck, i.e., n_i(0) = ñ_0 for all i. Then, every time a card i moves from deck j to j + 1, we change its forgetting rate n_i(t) by a factor of (1 − α) = ñ_{j+1}/ñ_j and, similarly, every time it moves from j to j − 1, we change it by a factor of (1 + β) = ñ_{j−1}/ñ_j. Finally, we set λ_j = ñ_j, i.e., the reviewing intensity of a card is proportional to the rate of forgetting associated with its deck, where the constant of proportionality has been absorbed into ñ_j. Now we can uncover α and β by solving the equations (1 − α) = ñ_{j+1}/ñ_j = c and (1 + β) = ñ_{j−1}/ñ_j = 1/c. It is easy to see that, with minor modifications, our framework can also be used to represent many other variants of the Leitner system with, e.g., a bounded number of decks.
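Solving the two equations above gives α = 1 − c and β = 1/c − 1. The following sketch makes the mapping concrete; the function names are hypothetical, and the restriction 0 < c < 1 is an assumption that encodes the exponentially decreasing review frequency.

```python
def leitner_to_alpha_beta(c):
    """Solve (1 - alpha) = c and (1 + beta) = 1 / c for alpha and beta."""
    if not 0.0 < c < 1.0:
        raise ValueError("c must lie in (0, 1) for a decreasing review frequency")
    return 1.0 - c, 1.0 / c - 1.0


def deck_rate(j, c, lambda0=1.0):
    """Review rate (and forgetting rate) of deck j: lambda_j = lambda_0 * c**j."""
    return lambda0 * c ** j


# Halving the review rate from one deck to the next.
alpha, beta = leitner_to_alpha_beta(0.5)
rates = [deck_rate(j, 0.5) for j in range(4)]
```

For c = 0.5 this yields α = 0.5 and β = 1. A promotion followed by a demotion leaves the forgetting rate unchanged because (1 − α)(1 + β) = 1.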

Additional details on the Duolingo dataset
Learners on Duolingo have a source language (which they already know) and a target language (which they wish to learn). Upon log-in, they are greeted with a screen to select the skill they wish to train/learn, shown in Figure S5a. As soon as they select a skill, a session begins (see Figure S5b). In each session, the learner is asked to translate ∼10 phrases from the source language to the target language or vice versa. A typical session may last for ∼2-5 minutes, but if it is interrupted in the middle for any reason (loss of connectivity, student logging out, etc.), the session (and its associated data) is discarded. Figures S5c and S5d show a correct translation and an incorrect translation, respectively, for one such phrase for the language pair (English, French). For each word that appears at least once in a session, our dataset contains the identity of the learner, a timestamp (in UTC) indicating when the session started, the total number of times the word appears, and how many times the learner correctly/incorrectly translated the phrases containing that word. We consider a session to be a single point in time (localized at the start of the session), when the student practiced all the words appearing in it. A learner may do several sessions in a single sitting: the sessions in our dataset are separated by a median of ∼7 minutes, including the time the learner spent in the session.
As discussed in Section 8, our experimental design differs from that of Settles et al. (4) in some respects: we consider a word i to be recalled correctly at time t, i.e., r_i(t) = 1, if the student answered all the questions containing that word correctly. Otherwise, we assume that the word was not recalled correctly.
Fig. S4. The Leitner system. The learner picks flashcards for review from several decks and flashcards are moved from one deck to another based on the recall outcome after a review. The higher the index of the deck, the lower the rate at which cards are picked for review from that deck, e.g., cards in deck 1 may be reviewed once per day, cards in deck 2 once every two days, and so on.

Random assignment assumption on the item difficulties
To rule out that the competitive advantage that Memorize offers with respect to the uniform and the threshold-based baselines is a consequence of selection bias due to the item difficulty, we compute the empirical distributions of item difficulties for the treatment (Memorize) and control (uniform and threshold) groups and check whether the allocation of items across the treatment and control groups resembles random assignment. Figure S6 summarizes the results for reviewing sequences with a training period T = 5 ± 0.5 days, which show a striking similarity between distributions across groups. In fact, the treatment group is indistinguishable from both control groups in terms of item difficulties because their values are within a standardized mean difference (SMD)‡ of 0.25 standard deviations (9). Similar results are obtained for sequences with a training period T = 3 ± 0.3 days and with T = 7 ± 0.7 days.
Finally, we would like to acknowledge that there may be other covariates that influence the performance of a learner, e.g., time of day, amount of stress, amount of sleep or degree of concentration. Unfortunately, we did not have access to measurements of them.
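The balance check on item difficulties can be sketched as follows. The exact SMD definition used in the analysis is not restated in this section, so the code assumes one common variant, the absolute difference in means divided by the pooled standard deviation sqrt((s_T² + s_C²)/2). The function name and the toy difficulty values are hypothetical.

```python
import math
import statistics


def standardized_mean_difference(treatment, control):
    """Absolute difference in means over the pooled standard deviation
    (assumed definition; other variants divide by the treatment-group SD)."""
    pooled_sd = math.sqrt((statistics.variance(treatment)
                           + statistics.variance(control)) / 2.0)
    return abs(statistics.mean(treatment) - statistics.mean(control)) / pooled_sd


# Toy item difficulties for a treatment group and a control group.
treated = [0.9, 1.1, 1.0, 1.2, 0.8]
controls = [1.0, 1.1, 0.9, 1.3, 0.8]
smd = standardized_mean_difference(treated, controls)
balanced = smd < 0.25  # threshold quoted in the text
```

Under this definition, two groups are treated as balanced on item difficulty when the value stays below the 0.25 threshold quoted above.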

Empirical forgetting rate without normalization
In Figure 2, for a fairer comparison across items, we normalized each empirical forgetting rate using the average empirical initial forgetting rate of the corresponding item at the beginning of the observation window, n_0. In this section, we demonstrate that the competitive advantage of our algorithm is not sensitive to this normalization step.
More specifically, we rerun our analysis using unnormalized values for the empirical forgetting rates. Figure S7 summarizes the results, which still show a competitive advantage of Memorize with respect to the uniform and threshold-based baselines. The primary difference between using normalization (Figure 2) and not using normalization (Figure S7) is just a scaling factor.