Improving Methods for Analyzing Antimalarial Drug Efficacy Trials: Molecular Correction Based on Length-Polymorphic Markers msp-1, msp-2, and glurp

Drug efficacy trials monitor the continued efficacy of front-line drugs against falciparum malaria. Overestimating efficacy results in a country retaining a failing drug as first-line treatment with associated increases in morbidity and mortality, while underestimating drug effectiveness leads to removal of an effective treatment with substantial practical and economic implications.

There is evidence of DHA-PPQ having high estimated failure rates in vivo, and a well-documented PD parameterization exists for this ACT as it fails (Saunders and colleagues (6) estimated that PPQ IC50 had increased to 23.9 ng/ml in recrudescent infections as resistance spread; this is equivalent to 0.024 mg/L, which we round to 0.02 mg/L in our calibration, see Table S1). Consequently, simulating failing DHA-PPQ using in vivo data to calibrate the model was possible and produced a 12% true failure rate with the MOI from Tanzania described in methods. There are field data allowing calibration of PK/PD parameters for non-failing AR-LF and AS-MQ (the in vivo parameters given in Table S1 produce 0.05% and 2% true failure rates, respectively). Ideally, we would use field PK/PD calibrations for each ACT obtained from locations where the drug was failing, but failing AR-LF and AS-MQ have not been observed in any known PK/PD studies. To avoid drawing conclusions based on analysis of a single failing drug (i.e. DHA-PPQ), we produced 'failing' calibrations of AR-LF and AS-MQ by artificially increasing the parasites' mean IC50 values (Table S1) until the simulated drug failure rates reached 9% and 10%, respectively. This reflected plausible future scenarios that may occur as resistance arises to these drugs. We inflated failure rates to around 10% because this is the critical point at which WHO recommend a drug be withdrawn from front-line usage (7), so it was important to evaluate the accuracy of the various methods around this critical point. LF and MQ have very different durations of [...] completeness, we generated parasite dynamics using a three-compartment model described in (8) with PK parameters based on the mean values reported in table 2 of (8) and in our Table S2.

A comparison of drug concentration over time for a single patient with this three-compartment calibration and the two-compartment calibration (i.e., mean parameters shown in Table S1 against mean parameters in table 2 of (8)) is shown in Figure S1.

Note that we did not incorporate the error model or covariate effects described in (8); we were not trying to re-create their patient population (which is a mix of pregnant and non-pregnant women). Rather, we were trying to create parasite dynamics for a general patient population under the assumption of a three-compartment PPQ model, and so used the mean values for PK parameters in (8) as a base. As with the parameterizations for two-compartment PPQ and the other drugs, we then used relatively large coefficients of variation across 5,000 patients (Table S2).

The principal difference between the parasite dynamics generated with these assumptions is that the three-compartment model is slightly more prophylactic and has a greater total area under the drug kill curve; consequently, the true failure rate is slightly lower, and a smaller number of reinfections become patent. However, failure rate estimates obtained using each algorithm are not significantly different between the two-compartment and three-compartment models, and we later show our results (the relative performance of molecular correction algorithms) are qualitatively the same with [...] number of PPQ PK compartments included, our conclusions regarding the accuracy of these molecular correction algorithms to estimate treatment failure rates are robust.

We did not have access to validated PK/PD models for other common partner drugs, i.e. amodiaquine (AQ), sulfadoxine/pyrimethamine (SP) and pyronaridine. Both the parent form and the metabolite of AQ have antimalarial activity, both are best described with multiple PK compartment models, and both are eliminated independently (e.g. (9)): we were unable to obtain robust PK/PD models (10). SP exhibits strong synergy between the sulfadoxine and pyrimethamine components, which again makes it difficult to get a robust PK/PD model (11) [...] (2)).

Mueller et al. (12) obtain estimates of between 3 and 9 reinfections emerging per year, with an average of 5.9, in Papua New Guinea. Additional work suggests the FOI in Ghana is highly seasonal, with estimates ranging from 44 in the high transmission season to 7 in the low transmission season (13); any yearly average (such as assumed in this manuscript) will fail to capture the nuances of seasonal transmission. Smith et al. (14) explicitly modelled the relationship between EIR and FOI. It is technical, but some illustrative data are summarised in their [...] week period but serves as general illustrations of the relationship between EIR and FOI.
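To make the yearly FOI values above more concrete, the following minimal sketch converts an FOI (reinfections emerging per person per year) into the expected number of reinfections over a follow-up period. It assumes a constant rate and Poisson-distributed reinfections, and it ignores both seasonality and the prophylactic effect of the partner drug; these are simplifying assumptions made for illustration, and the function names are hypothetical rather than taken from our simulation code.

```python
import math


def expected_reinfections(foi_per_year: float, follow_up_days: float) -> float:
    """Expected number of emerging reinfections during follow-up.

    Assumes a constant yearly FOI spread evenly over 365 days
    (an illustrative simplification; no seasonality, no prophylaxis).
    """
    return foi_per_year * follow_up_days / 365.0


def prob_at_least_one_reinfection(foi_per_year: float, follow_up_days: float) -> float:
    """Probability of one or more reinfections under a Poisson assumption."""
    return 1.0 - math.exp(-expected_reinfections(foi_per_year, follow_up_days))


if __name__ == "__main__":
    # Compare the yearly FOI values mentioned in the text over a 42-day follow-up.
    for foi in (5.9, 7, 8, 44):
        mean = expected_reinfections(foi, 42)
        prob = prob_at_least_one_reinfection(foi, 42)
        print(f"FOI={foi}: expected={mean:.2f}, P(>=1)={prob:.2f}")
```

Under these assumptions, the gap between the seasonal extremes reported for Ghana (44 versus 7) translates into very different reinfection pressure over the same follow-up window, which is why a single yearly average can mislead.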

Additional Results
Misclassification of recurrent infections for DHA-PPQ with varying FOI levels

Figure 3 (main text) shows the misclassification of recurrent infections (recrudescence classified as reinfection and vice versa) for an FOI of 8. Figure S2 shows the same plot for FOI values of 2, 8, and 16. It shows that the number of recrudescences misclassified as reinfections was stable as FOI increased for all algorithms. Furthermore, it shows that increased FOI had nearly no impact on the number of reinfections being misclassified for the "WHO/MMV" algorithm (which correctly classified all reinfections), and a very minor impact for the "no glurp" algorithm. For the "≥ 2/3 markers" and "allelic family switch" algorithms, this figure demonstrates that increased FOI led to greatly increased numbers of reinfections being misclassified as recrudescence. The proportion of reinfections misclassified was stable as FOI increased, but the greater total number of misclassifications produced the increased failure rates seen with these algorithms in Figure 4 (main text).

[...] FOI, were lower with a three-compartment model (likely due to its longer prophylactic period, see Figure S1). Thus, while the "≥ 2/3 markers" algorithm produced an accurate estimate at most FOI (though "no correction" is better with an FOI of 0) with a 42-day follow-up for a two-compartment model, assuming a three-compartment model of DHA-PPQ showed that "≥ 2/3 markers" produced accurate failure rate estimates but with a follow-up period of 63 days; an intuitive result given the longer prophylactic period. [...]

Failure rate estimates increased as follow-up length increased because a) more true recrudescences became patent and b) more reinfections became patent that may be misclassified as recrudescent (see discussion in main manuscript Figure 4). Consequently, our results (main text) suggested that use of the "≥ 2/3 markers" algorithm and a 42-day follow-up was the most appropriate option for DHA-PPQ trials.

Failure rate estimates for failing AR-LF for 21-day and 28-day follow-up lengths are presented in Figure S4. The true failure rate of AR-LF in these simulations was 0.918 (9%). The same pattern was observed as for DHA-PPQ: the non-PCR corrected algorithm over-estimated the failure rate at any FOI higher than 1, and severely over-estimated failure rates at high FOI; the "WHO/MMV" algorithm and the "no glurp" algorithm slightly under-estimated the failure rate across all levels of FOI. Use of a 21-day follow-up period led to both the "allelic family switch" algorithm and the "≥ 2/3 markers" algorithm under-estimating the failure rate; only at a high FOI of 13 did the "allelic family switch" algorithm accurately recover the true failure rate. Use of a 28-day follow-up period produced more accurate failure rate estimates: the "≥ 2/3 markers" algorithm accurately recovered the true failure rate between an FOI of 5-16, with both the "≥ 2/3 markers" algorithm and the "allelic family switch" [...] rate at all FOI settings; the "allelic family switch" and "≥ 2/3 markers" algorithms were close in value up to an FOI of 9-10. As with DHA-PPQ and AR-LF, the "WHO/MMV" and "no glurp" algorithms under-estimated the failure rate consistently, and using no PCR correction generated a large over-estimate of the true failure rate. We simulated a novel follow-up length of 49 days (Figure S6 (B)), under which the "≥ 2/3 markers" algorithm produced a more accurate failure rate estimate than a 42-day follow-up at all FOI levels. With a 63-day follow-up period (Figure S6 (C)), the "allelic family switch" algorithm over-estimated the true failure rate from an FOI of 4 and upwards. The "≥ 2/3 markers" algorithm over-estimated from an FOI of 8 and up, but only by a small amount. AS-MQ is more prophylactic than DHA-PPQ and AR-LF: given the same period of follow-up, fewer reinfections became patent, and recrudescences occurred later in the follow-up period (Figure S7). As such, it was unsurprising that a longer period of follow-up led to more accurate failure rate estimates. Using the "≥ 2/3 markers" [...] than the 42 and 49-day follow-up lengths, but the differences in estimates between 49 and 63 days were small, and the operational and logistical advantages of a 49-day trial over a 63-day trial are likely to be substantial. Furthermore, with an FOI of ≥8, a shorter follow-up (49 days) produced a more accurate failure rate estimate with the "≥ 2/3 markers" algorithm; a 63-day follow-up period over-estimated the true failure rate slightly with higher transmission intensity using this algorithm. [...]

Crucially, the under-estimate associated with the "≥ 2/3 markers" algorithm was so small in terms of absolute value that the use of the algorithm can be recommended without concern for over- [...]

The results presented in the main text all assumed that MOI at time of treatment is representative of high transmission, i.e. using Tanzanian data (see MOI in main text). We did this because high MOI makes detection of recrudescent alleles more difficult (due to the issues described in our methods section with detection of minority alleles) and so represents a "worst case" scenario. There is a likely mismatch for areas of low transmission, which have lower MOI at treatment, but we used high MOI across all transmission intensities (quantified by FOI) for the following reasons:

• Keeping the same MOI across all transmission intensities allowed a direct comparison of molecular correction algorithms (e.g. Figure 2, Figure 4).
• This assumption of high MOI at treatment is conservative (i.e. a "worst case" scenario) for low transmission areas because we show that there is little operational difference between the algorithms even if initial MOI is high; it is therefore a robust conclusion that algorithm choice is not important in these areas because, if MOI at treatment is lower, there will be even less difference between the algorithms (as illustrated by the Cambodian field data that showed negligible differences).
• High MOI at time of treatment can occur even in low transmission areas if people immigrate from areas of higher transmission or have acquired sufficient protective immunity that several clones may co-circulate asymptomatically before the patient falls ill. More plausibly, this scenario may arise in areas of seasonally intense transmission where MOI at time of treatment is high, but trials are conducted during the low-transmission season to reduce the impact of reinfections.

We checked the impact of reduced MOI. Analysis of simulated data for DHA-PPQ with a 42-day follow-up and a low MOI setting (the distribution obtained from PNG; see methods) is shown in Figure S10. First, note that the true failure rate was slightly lower than that obtained in a high MOI setting (Figure 4) because patients harboured fewer clones at time of treatment, which made their infection easier to clear. Reducing the MOI to reflect a low-transmission setting reduced the difference between algorithms. Overall, the results were consistent with those obtained from a high MOI setting, i.e. the "allelic family switch" algorithm produced an accurate failure rate estimate at an FOI of 4 and below, and the "≥ 2/3 markers" algorithm produced the most accurate failure rate estimate at all higher FOI.

The relative detectability of the longest allele to the shortest allele was altered from 0.001:1 to 0.1:1. The results are shown in Figure S11. Failure rate estimates obtained using this altered relative detectability are nearly identical to those obtained with the relative detectability of 0.001:1 used elsewhere in this manuscript (i.e. Figure 2 of main text).
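The allele-detection and marker-matching steps that the molecular correction algorithms rely on can be sketched as follows. This is an illustrative sketch only, not the simulation code used in this work. It assumes that a minority signal is discarded when it falls below a fraction (the threshold) of the strongest signal at that marker, and that a recurrence is called a recrudescence when the initial and recurrent samples share at least one allele at a required number of markers: all three for a "WHO/MMV"-type rule, at least two for the "≥ 2/3 markers" rule. The function names, data layout and example signal values are hypothetical.

```python
from typing import Dict, Set

MARKERS = ("msp-1", "msp-2", "glurp")


def detected_alleles(signals: Dict[str, float], threshold: float = 0.3) -> Set[str]:
    """Return alleles whose signal is at least `threshold` times the strongest signal.

    Assumption for illustration: the threshold is relative to the dominant peak.
    """
    if not signals:
        return set()
    peak = max(signals.values())
    return {allele for allele, s in signals.items() if s >= threshold * peak}


def count_matching_markers(initial: Dict[str, Set[str]],
                           recurrent: Dict[str, Set[str]]) -> int:
    """Count markers at which the two samples share at least one allele."""
    return sum(1 for m in MARKERS if initial[m] & recurrent[m])


def classify(initial: Dict[str, Set[str]],
             recurrent: Dict[str, Set[str]],
             min_matching: int) -> str:
    """Classify a recurrence; min_matching=3 mimics a WHO/MMV-type rule,
    min_matching=2 mimics the '>= 2/3 markers' rule."""
    matches = count_matching_markers(initial, recurrent)
    return "recrudescence" if matches >= min_matching else "reinfection"


if __name__ == "__main__":
    # Hypothetical genotypes: alleles shared at msp-1 and msp-2 but not at glurp.
    day0 = {"msp-1": {"A"}, "msp-2": {"B"}, "glurp": {"C"}}
    recur = {"msp-1": {"A"}, "msp-2": {"B"}, "glurp": {"D"}}
    print(classify(day0, recur, min_matching=3))  # WHO/MMV-type rule
    print(classify(day0, recur, min_matching=2))  # ">= 2/3 markers" rule
```

In this sketch the same pair of samples is a reinfection under the stricter rule and a recrudescence under the "≥ 2/3 markers" rule, which is the kind of disagreement that drives the differences in estimated failure rates between algorithms.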

The threshold at which minority genotyping signals are discounted as "noise" and disregarded was varied from 0.3 to 0.05. Analysis of simulated data for DHA-PPQ with a 42-day follow-up under these conditions is shown in Figure S12. The failure rate estimate produced by each algorithm increased as the threshold decreased. At the lower threshold of 0.05, the "no glurp" algorithm (rather than the "≥ 2/3 markers" algorithm) produced the most accurate failure rate estimate from an FOI of 6 and higher. A minority detection threshold of 0.05 is unrealistic because large amounts of experimental/laboratory noise would be included in the signal, so this threshold could not be used in practice. The threshold was changed to 0.2 (a more realistic value) in Figure S13. Under this assumption, the "≥ 2/3 markers" algorithm produced the most accurate failure rate estimate, robust across all FOI levels, the same as when the minority detection threshold is set to 0.3. [...]

Optimising their use is the current priority but, looking forward, there are alternative methodologies and markers that may be used and which may be superior. These markers and methods will be addressed in future studies but, for the record, the three main alternative markers are as follows.

• Amplicon sequencing of marker loci (16). Its main advantage over capillary electrophoresis of msp-1, msp-2 and glurp is that deep sequencing allows very sensitive detection of minor clones. Minority clones that had a frequency >1.0% of all reads were consistently detected (16). We anticipate that this sensitivity will favour a "WHO/MMV"-type algorithm (i.e. a recrudescence should share alleles at all amplicons when comparing initial and recurrent samples), as the use of amplicon sequencing should improve detection of minor clones in the initial sample (reducing the number of recrudescent clones being misclassified as reinfection) and will be better able to detect recrudescent clones in mixed infection recurrences.
• Microsatellite loci have already been used in antimalarial efficacy studies (17,18). Microsatellites are similar to the msp-1, msp-2 and glurp markers in that their sensitivity to detect minor clones is relatively weak (in particular, the presence of stutter-bands requires a stringent cut-off), but more loci are often genotyped (Plucinski et al. (19) used 8 microsatellites), which means there are a greater number of potential algorithms that may be constructed to distinguish recrudescences from reinfections. In addition, there is a Bayesian analysis method for these markers which may improve their role in molecular correction (17).
• Finally, SNP barcodes may be used as genetic markers.

The intention here is not to provide an exhaustive description of alternative markers but to emphasise that it is straightforward to assign such genotypes to our simulated patients in the same way that we assigned the msp-1, msp-2 and glurp genotypes, and to test various classification algorithms based on such loci. Finally, we note that existing algorithms simply classify recurrent infections as either reinfections or recrudescences and do not account for any degree of uncertainty in these classifications; for example, although we recommend the "≥ 2/3 markers" algorithm, we may be more confident that a recurrent infection is a drug failure if it shares identical alleles at all 3 loci than if it shares alleles at only 2 loci.
A natural way of incorporating such uncertainty is to use Bayesian methods, and a recent paper has identified such a technique (19); we will evaluate this method in our future work. In short, validated simulations of drug treatment and the consequent post-treatment parasite dynamics provide an ideal resource to investigate many issues surrounding the design, [...]

[...] for failing LF and MQ were arbitrarily increased by us to obtain ~10% drug failure rate. We only changed the IC50 of the partner drug, so to get high levels of failure we needed to overcome the artemisinin component (whose IC50 was not changed); thus these IC50 values will be higher than those expected for monotherapy resistance. Piperaquine (PPQ) follows a two-compartment model as described in Kay, Hodel & Hastings (21). Patient bodyweight (BW) in the simulations was drawn from a uniform distribution between 45-75 kg and is involved in the calculations for PPQ parameters (see (21,22) [...] (8) is 60.2 (we include a CV of 0.71 on this parameter)) so the value presented here is illustrative and represents a bodyweight of 42 kg.

Figure S1: Comparison of drug concentration over time profiles created for a single patient with the mean parameters described in Table S1 for a two-compartment DHA-PPQ model and the mean parameters described in table 2 of (8) for a three-compartment DHA-PPQ model, showing that the three-compartment model produces a more prophylactic drug concentration over time profile.

[...] Figure 3 (main text)), showing how misclassification by each algorithm alters as FOI changes. The X-axis shows the true status of patients on the day of recurrence (i.e. reinfection or a recrudescence) and the colour-coding shows how these patients were classified by each algorithm.

Figure S3: Analysis of simulated trial data for DHA-PPQ using a three-compartment model (see Table S2) with follow-up lengths of (A) 28 days, (B) 42 days and (C) 63 days. Estimated failure rates are shown for the different algorithms of molecular correction as a function of FOI and calculated using survival analysis.

Figure S4: Analysis of simulated trial data for failing AR-LF with follow-up lengths of 21 days (A) and 28 days (B). Estimated failure rates are shown for the different algorithms of molecular correction as a function of FOI and calculated using survival analysis.

Figure S5: The true status of recurrent infections on each day of follow-up for a simulated trial of AR-LF with a true simulated failure rate of 9% and an FOI of 8. The total height of the bars indicates the number of recurrent infections detected on that day of follow-up, and the color-coding shows the number of those recurrent infections that were truly recrudescent or reinfections.

Figure S7: The true status of recurrent infections on each day of follow-up for a simulated trial of AS-MQ with a true simulated failure rate of 10% and an FOI of 8. The total height of the bars indicates the number of recurrent infections detected on that day of follow-up, and the color-coding shows the number of those recurrent infections that were truly recrudescent or reinfections.

Figure S8: Analysis of simulated trial data for effective AR-LF with follow-up lengths of 21 days (A) and 28 days (B). Estimated failure rates are shown for the different algorithms of molecular correction as a function of FOI and calculated using survival analysis.
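The figure captions above state that failure rates were calculated using survival analysis. As a hedged illustration of what such a calculation can look like, the sketch below computes a Kaplan-Meier failure estimate in which recurrences classified as recrudescence count as failures, and patients classified as reinfected, or who complete follow-up without recurrence, are censored on that day. The choice of the Kaplan-Meier estimator and this censoring rule are assumptions made for the example; the function name and the trial data are hypothetical.

```python
from typing import List, Tuple


def km_failure_rate(records: List[Tuple[int, bool]]) -> float:
    """Kaplan-Meier estimate of the cumulative failure rate.

    Each record is (day, failed): failed=True is a recrudescence on that day,
    failed=False means the patient was censored on that day (assumed here to
    cover reinfections and patients completing follow-up).
    """
    survival = 1.0
    at_risk = len(records)
    for day in sorted({d for d, _ in records}):
        todays = [failed for d, failed in records if d == day]
        failures = sum(todays)
        if failures:
            # Patients censored today are still counted as at risk today.
            survival *= 1.0 - failures / at_risk
        at_risk -= len(todays)
    return 1.0 - survival


if __name__ == "__main__":
    # Hypothetical 10-patient trial with a 42-day follow-up.
    trial = [(14, True), (21, False), (28, True)] + [(42, False)] * 7
    print(f"Estimated failure rate: {km_failure_rate(trial):.4f}")
```

Because a misclassified reinfection moves a patient from the censored group to the failure group (and vice versa), the choice of molecular correction algorithm feeds directly into this estimate.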