**DOI:** 10.1128/AAC.00159-07

## ABSTRACT

Antimalarial clinical trials use genotyping techniques to distinguish new infection from recrudescence. In areas of high transmission, the accuracy of genotyping may be compromised due to the high number of infecting parasite strains. We compared the accuracies of genotyping methods, using up to six genotyping markers, to assign outcomes for two large antimalarial trials performed in areas of Africa with different transmission intensities. We then estimated the probability of genotyping misclassification and its effect on trial results. At a moderate-transmission site, three genotyping markers were sufficient to generate accurate estimates of treatment failure. At a high-transmission site, even with six markers, estimates of treatment failure were 20% for amodiaquine plus artesunate and 17% for artemether-lumefantrine, regimens expected to be highly efficacious. Of the observed treatment failures for these two regimens, we estimated that at least 45% and 35%, respectively, were new infections misclassified as recrudescences. Increasing the number of genotyping markers improved the ability to distinguish new infection from recrudescence at a moderate-transmission site, but using six markers appeared inadequate at a high-transmission site. Genotyping-adjusted estimates of treatment failure from high-transmission sites may represent substantial overestimates of the true risk of treatment failure.

The spread of drug-resistant *Plasmodium falciparum* has hampered disease control efforts and resulted in widespread changes in antimalarial treatment policy (22). Treatment policy decisions are largely based on the results of antimalarial clinical trials, and the World Health Organization (WHO) now recommends that a country's first-line therapy should be changed if the absolute risk of treatment failure exceeds 10% (21). New WHO guidelines also recommend that assessment of antimalarial-drug efficacy requires a minimum follow-up period of 28 days and that molecular genotyping be used to classify outcomes in patients with recurrent parasitemia following therapy (21). Comparing genotypes from before therapy with those at the time of recurrence is necessary to distinguish whether recurrent parasitemia is due to a recrudescence of parasites present in pretreatment samples (treatment failure) or to a new infection. In areas of Africa with very high transmission intensities, over 50% of patients may develop recurrent parasitemia within 28 days, even when treated with artemisinin-based combination therapies (ACTs), which are believed to be highly efficacious (4). In such trials, estimates of antimalarial-drug efficacy depend heavily on the accuracy of the genotyping techniques used.

Genotyping techniques may not perform accurately in areas of high transmission, where patients are often infected with multiple parasite strains. It is generally agreed that if any parasites in the pretreatment sample persist after therapy, the subject is considered to have a recrudescence. However, when multiple parasite strains are present, the probability increases that at least one strain in the pretreatment and recurrent-parasitemia samples will have the same genotype by chance, leading to misclassification of a new infection as a recrudescence. Others have noted the importance of taking into account misclassification of new infections as recrudescence (3, 11, 14), but this has often been overlooked when interpreting genotyping-adjusted results from antimalarial clinical trials.

Misclassification of a new infection as a recrudescence can be decreased by using multiple genotyping markers, as the probability of different parasite strains having the same genotypes at multiple markers is reduced (19). The majority of antimalarial clinical trials currently use one to three markers for genotyping (6); however, the number of markers needed to accurately classify outcomes in areas of different transmission intensities is unknown. In addition, the degree to which the transmission intensity may affect genotyping misclassification, and thus the accuracy of antimalarial trial results, has not been studied. To address these questions, we used six markers to genotype samples from two antimalarial clinical trials performed at sites in Africa with different transmission intensities. We then evaluated the accuracies of these trial results when one to six markers were used by estimating the probabilities of genotyping misclassification.

## MATERIALS AND METHODS

Study sites and trial design. We utilized samples from clinical trials done at two sites in Africa with very different malaria transmission intensities. Bobo-Dioulasso, Burkina Faso, is a periurban area where malaria transmission is seasonal, and the entomological inoculation rate was estimated to be less than five infective mosquito bites per person per year (12), though it may be much higher during the rainy season, when the clinical trial was conducted. Tororo District, Uganda, is a rural area where malaria transmission is perennial, and the entomological inoculation rate was estimated to be 591 infective mosquito bites per person per year (16).

The details of the clinical trials have been published (4, 23). Briefly, in Bobo-Dioulasso, 944 patients aged 6 months or greater with uncomplicated falciparum malaria were randomized to receive directly observed therapy with amodiaquine (AQ), sulfadoxine-pyrimethamine (SP), or AQ plus SP. In Tororo, 408 children aged 1 to 10 years with uncomplicated falciparum malaria were randomized to receive directly observed therapy with AQ plus artesunate (AS) or artemether-lumefantrine (AL). In both trials, the patients were followed for 28 days and their treatment outcomes were assessed according to WHO guidelines as adequate clinical and parasitological response, early treatment failure (ETF), late clinical failure (LCF), or late parasitological failure (LPF) (20). Subjects with LCF or LPF underwent genotyping of pretreatment and recurrent-parasitemia samples to determine whether recurrent parasitemia represented a new infection or recrudescence.

Genotyping of blood samples. We performed genotyping of blood samples with the polymorphic surface antigens merozoite surface protein 2 (encoded by *msp2*) and merozoite surface protein 1 (encoded by *msp1*) and with four microsatellite markers. Detailed methods for genotyping all six markers have been published elsewhere (10). Briefly, DNA was extracted from filter paper samples using Chelex (17). The *msp2* and *msp1* markers were amplified using nested PCR with second-round primers specific to allelic families: IC3D7 and FC27 for *msp2* and K1, MAD20, and RO33 for *msp1* (24). The PCR products were separated on a 2.5% agarose gel (UltraPure Agarose; Invitrogen, Carlsbad, CA) and stained with ethidium bromide. GelCompar II software (Applied Maths, Sint-Martens-Latem, Belgium) was used to select alleles and to estimate the sizes of PCR products using a standardized approach (5). Alleles from paired pretreatment and recurrent-parasitemia samples run on adjacent lanes were considered to be the same if the PCR products were measured to be within 10 base pairs.

Four microsatellite markers with trinucleotide repeat regions—TA40, TA60, TA81, and PfPK2 (1)—were each amplified using a single round of PCR with fluorescent primers, as previously described (10). Capillary electrophoresis was performed using an Applied Biosystems 3730xl DNA Analyzer, and alleles were sized with GeneMapper software (Applied Biosystems, Foster City, CA). Data from GeneMapper were processed with an automated computer algorithm to remove stutter peaks and noise (10). Microsatellite alleles were sized in bins corresponding to numbers of trinucleotide repeats. Alleles from paired pretreatment and recurrent-parasitemia samples were considered to be the same if they had the same bin size.
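The two allele-matching rules described above (a 10-base-pair tolerance for *msp2* and *msp1*, identical repeat bins for microsatellites) can be sketched in Python. This is an illustrative sketch only, not the authors' code: the function names are hypothetical, and treating "within 10 base pairs" as inclusive of exactly 10 is an assumption.

```python
def msp_alleles_match(size_a_bp, size_b_bp, tolerance_bp=10):
    """msp1/msp2 rule: product sizes within the tolerance count as the same allele.

    Assumption: a difference of exactly 10 bp is treated as a match.
    """
    return abs(size_a_bp - size_b_bp) <= tolerance_bp


def microsatellite_alleles_match(bin_a, bin_b):
    """Microsatellite rule: alleles match only if they fall in the same repeat bin."""
    return bin_a == bin_b


def share_any_allele(pre_alleles, recurrent_alleles, match):
    """True if at least one pretreatment allele matches at least one recurrent allele."""
    return any(match(a, b) for a in pre_alleles for b in recurrent_alleles)
```

For example, `share_any_allele([250, 310], [318, 400], msp_alleles_match)` is true because 310 and 318 differ by 8 base pairs; the sizes are made up for illustration.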

We estimated allele frequency distributions for each site by genotyping 200 randomly selected pretreatment samples in Bobo-Dioulasso and all 400 pretreatment samples in Tororo. Allele frequencies were calculated by dividing the number of times a particular allele was detected by the total number of alleles detected in all samples. For the frequency distribution, alleles of *msp2* and *msp1* were separated into 20-base pair bins, since an allele from a paired sample was considered a match if it was within 10 base pairs on either side of another allele. Alleles of microsatellites were separated into bins corresponding to trinucleotide repeats, since alleles in paired samples were considered matches only if they had the same bin size.
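A minimal Python sketch of this frequency calculation follows. It is not the authors' code: the function names are hypothetical, and the choice to start each 20-base-pair bin at a multiple of 20 is an assumption, since the text does not state where bin edges fall.

```python
from collections import Counter


def bin_msp_size(size_bp, bin_width_bp=20):
    """Assign a PCR product size to a fixed-width bin, labeled by its lower edge.

    Assumption: bins start at multiples of bin_width_bp.
    """
    return (size_bp // bin_width_bp) * bin_width_bp


def allele_frequencies(samples):
    """Frequency = detections of an allele / all allele detections in all samples.

    `samples` is a list of per-sample allele lists (already binned).
    Each allele is counted at most once per sample.
    """
    counts = Counter(allele for sample in samples for allele in set(sample))
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}
```

With four hypothetical samples `[["A", "B"], ["A"], ["B", "C"], ["A"]]`, six alleles are detected in total, so allele A (detected three times) has a frequency of 0.5.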

Statistical methods. Homozygosity, the probability that two strains chosen at random from a population will have the same allele, was estimated using the formula \(\sum_{i=1}^{n} p_{i}^{2}\), where *n* is the number of unique alleles and *p*_{i} is the frequency of the *i*th allele. The probability that at least one allele from a pair of clinical samples will match by chance is based on both the diversity of alleles in a population and the complexity of infection (the number of alleles detected). To estimate the probability of a match occurring by chance, we performed the following analysis on each sample pair for each genotyping marker (the Python programming code is available upon request). We first generated all possible combinations of *x* alleles for the recurrent-parasitemia sample, where *x* was the complexity of infection for that sample. The relative probability of each possible combination of alleles occurring was then estimated by multiplying together the frequency of each of the component alleles in the combination. Each possible combination of recurrent-parasitemia alleles was then compared against the actual alleles present in the pretreatment sample to determine if at least one allele matched. The estimated probability of a match occurring by chance, *P*_{match}, was calculated by taking the sum of the probabilities of combinations that matched the pretreatment sample and dividing by the sum of the probabilities of all combinations. Therefore, *P*_{match} specifically represents the probability that a random match will occur between at least one pretreatment and one recurrent-parasitemia allele at a given marker, given the actual pretreatment alleles present, the frequency distribution of the alleles, that *x* alleles are present in the recurrent-parasitemia sample, and that the sample pair represents a new infection. When multiple markers were used in combination, *P*_{match} for a sample was calculated as the product of *P*_{match} values for each individual marker.
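The homozygosity formula and the multi-marker product can be written as two short Python functions. This is a sketch with hypothetical function names and made-up inputs, not the authors' implementation.

```python
import math


def homozygosity(allele_freqs):
    """Sum of squared allele frequencies for one marker."""
    return sum(p ** 2 for p in allele_freqs.values())


def combined_p_match(per_marker_p_match):
    """P_match across several markers, taken as the product of per-marker values.

    This treats the markers as independent, as the calculation in the text does.
    """
    return math.prod(per_marker_p_match)
```

For a hypothetical marker with four alleles at frequencies 0.1, 0.2, 0.3, and 0.4, the homozygosity is 0.01 + 0.04 + 0.09 + 0.16 = 0.30.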

To illustrate the calculation of *P*_{match} with a simplified numerical example, let us assume that a genotyping marker has only four alleles with the following frequencies: A, 0.1; B, 0.2; C, 0.3; and D, 0.4. If a hypothetical pretreatment sample contained alleles A and B and the recurrent-parasitemia sample contained two alleles, calculation of *P*_{match} would be as follows. There are six possible combinations of recurrent-parasitemia alleles, each with the following relative probability (generated by multiplying the frequencies of the component alleles): (i) A and B = 0.02; (ii) A and C = 0.03; (iii) A and D = 0.04; (iv) B and C = 0.06; (v) B and D = 0.08; (vi) C and D = 0.12. These probabilities are relative and do not add up to one because we are limiting possible combinations to those with different alleles (e.g., not “A and A”), as the complexity of infection is determined by the number of different alleles detected. By taking the sum of the probabilities of those combinations that have at least one allele in common with the pretreatment sample (allele A or B, or both present) and dividing by the sum of the probabilities of all combinations, we arrive at a *P*_{match} of 0.23/0.35, or 0.66.
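The same calculation can be sketched in Python. This is our own minimal reconstruction from the description in the text, not the authors' code; the function name `p_match` and its arguments are hypothetical.

```python
from itertools import combinations
from math import prod


def p_match(allele_freqs, pretreatment_alleles, x):
    """Chance that a new infection carrying x distinct alleles shares at least
    one allele with the pretreatment sample at a single marker.

    allele_freqs: dict mapping allele -> population frequency.
    Requires 1 <= x <= number of alleles in allele_freqs.
    """
    pre = set(pretreatment_alleles)
    matched = 0.0
    total = 0.0
    # Enumerate every set of x distinct alleles the recurrent sample could hold.
    for combo in combinations(allele_freqs, x):
        weight = prod(allele_freqs[a] for a in combo)  # relative probability
        total += weight
        if pre.intersection(combo):
            matched += weight
    return matched / total


freqs = {"A": 0.1, "B": 0.2, "C": 0.3, "D": 0.4}
example = p_match(freqs, ["A", "B"], 2)  # 0.23 / 0.35, about 0.66
```

Enumerating all combinations is simple but grows quickly with the number of alleles and the complexity of infection, so a production version might need a more efficient formulation.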

We used the mean *P*_{match} for all recurrent-parasitemia samples (P̄_{match}) to estimate the number of true recrudescent infections for each treatment arm, adjusting for misclassification of new infections as recrudescences. By adjusting at the level of the treatment arm, we were able to take into account the pretest probability of a sample being a new infection or recrudescence when applying the adjustment. We estimated the number of true recrudescent infections by combining the following two equations.
\[n_{\mathrm{or}} = n_{\mathrm{recru}} + n_{\mathrm{new}} \cdot \bar{P}_{\mathrm{match}} \qquad (1)\]

where *n*_{or} is the number of observed recrudescent infections, *n*_{recru} is the estimated number of true recrudescent infections, and *n*_{new} is the estimated number of true new infections, and

\[n_{\mathrm{rp}} = n_{\mathrm{new}} + n_{\mathrm{recru}} \qquad (2)\]

where *n*_{rp} is the number of recurrent-parasitemia samples. By solving equation 2 for *n*_{new}, substituting this into equation 1, and solving for *n*_{recru}, we arrive at equation 3:

\[n_{\mathrm{recru}} = \frac{n_{\mathrm{or}} - \bar{P}_{\mathrm{match}} \cdot n_{\mathrm{rp}}}{1 - \bar{P}_{\mathrm{match}}} \qquad (3)\]

Student's *t* test was used to test whether the complexities of infection were different at the two study sites. The two-sample Wilcoxon rank sum test was used to test whether P̄_{match} was different between the two sites. A *P* value of <0.05 was considered statistically significant for all tests. Statistical analysis was done using STATA v.8.0 (STATACorp, College Station, TX).
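Equation 3 translates directly into a short Python function. The function name and the input numbers below are hypothetical and chosen only to make the arithmetic easy to follow; they are not trial data.

```python
def estimate_true_recrudescences(n_or, n_rp, mean_p_match):
    """Equation 3: n_recru = (n_or - P * n_rp) / (1 - P).

    n_or: observed recrudescences; n_rp: recurrent-parasitemia samples;
    mean_p_match: mean chance-match probability for the treatment arm.
    """
    if not 0 <= mean_p_match < 1:
        raise ValueError("mean_p_match must be in [0, 1)")
    return (n_or - mean_p_match * n_rp) / (1 - mean_p_match)


# Hypothetical arm: 100 recurrent parasitemias, 40 called recrudescent,
# mean P_match of 0.2.
n_recru = estimate_true_recrudescences(n_or=40, n_rp=100, mean_p_match=0.2)
```

In this made-up arm the estimate is 25 true recrudescences and therefore 75 true new infections. Substituting back into equation 1 gives 25 + 75 × 0.2 = 40, the observed count.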

## RESULTS

Performances of individual genotyping markers. Parasite strains are more easily distinguished with increasing ability to detect diversity of alleles in a population. We estimated the diversity of alleles detected by each of six markers at our two study sites from the frequency distribution of alleles (Fig. 1). The marker *msp2* had the highest diversity, with 40 unique alleles and an estimated homozygosity of less than 0.05 at both study sites (Table 1). The microsatellite TA81 had the lowest allelic diversity, with only 18 unique alleles and an estimated homozygosity of 0.15 in Tororo and 0.18 in Bobo-Dioulasso. Despite variation between markers, the allele frequencies and estimated homozygosities were remarkably similar between the two sites for any given marker. The similarity in allele frequencies between sites in East and West Africa with very different transmission intensities suggests that our measurements of diversity for these six markers may be generalizable to other sites in Africa over a wide range of transmission intensities.

When using genotyping to determine treatment outcomes, a recrudescence is generally defined by the detection of at least one identical allele at each marker gene tested in both the pretreatment sample and the sample collected at the time of recurrent parasitemia (19). A treatment outcome may be misclassified as recrudescence if a newly infecting strain has the same allele as a strain present before treatment, and a higher complexity of infection increases the probability of such misclassification occurring. As expected, the mean complexities of infection were significantly higher in the higher-transmission site (Tororo) than in the lower-transmission site (Bobo-Dioulasso) for all six markers in both pretreatment and recurrent-parasitemia samples (*P* < 0.005 for all comparisons) (Table 1). We estimated the probability of a match between pretreatment and recurrent-parasitemia samples occurring by chance, *P*_{match}, for each sample pair at each marker, taking into account both the marker diversity and the complexity of infection. Within each site, the mean *P*_{match} was lower for the markers with higher allelic diversity (Table 1). When each marker was compared between sites, *P*_{match}, and therefore the probability of misclassifying a new infection as a recrudescence, was significantly higher in Tororo than in Bobo-Dioulasso (1.8- to 2.1-fold higher; *P* < 0.0001 for all comparisons). As allelic diversities were similar at the sites, differences in *P*_{match} appeared to be mediated by the higher complexity of infection in Tororo.

Performances of genotyping markers used sequentially. To decrease the probability of a match occurring by chance, multiple genotyping markers can be used sequentially in a stepwise approach (15). In this approach, starting with the first genotyping marker, any subject with no alleles in common between pretreatment and recurrent-parasitemia samples is classified as having a new infection, and no further genotyping is necessary. If at least one allele is common between pretreatment and recurrent-parasitemia samples, further genotyping is done with additional markers to decrease the probability that an outcome may be misclassified as a recrudescence due to alleles matching by chance. We determined the optimal order of genotyping markers for this approach by evaluating at each step which marker, if added next, would detect the greatest number of new infections and thus leave the fewest subjects requiring further genotyping. Using our data on paired samples from Tororo, where the number of recurrent-parasitemia samples was highest, the following order resulted in the fewest samples requiring genotyping: *msp2, msp1*, TA40, TA60, TA81, and PfPK2 (Fig. 2). Using this algorithm, we then evaluated how the addition of each marker affected overall outcomes, as well as the cumulative probability of a match occurring by chance. The mean cumulative *P*_{match} for different points in the algorithm was calculated as the mean of the product of the *P*_{match}s for all markers used up to that point in the algorithm for each individual subject. In Bobo-Dioulasso, the mean *P*_{match} decreased from 0.28 with one marker to 0.05 with three markers and 0.02 with six markers. In Tororo, the cumulative mean *P*_{match}s were much higher: 0.49 with one marker, 0.21 with three markers, and 0.16 with six markers. Therefore, after being genotyped with six markers, new infections were eight times more likely to be misclassified as a recrudescence in Tororo than in Bobo-Dioulasso.
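The stepwise classification rule can be sketched in Python as follows. This is an illustrative sketch, not the authors' code. It assumes alleles have already been binned so that matching alleles compare equal, and it skips markers with no genotyping result for either sample, which is our assumption about how missing data would be handled. The function name and return values are hypothetical.

```python
MARKER_ORDER = ["msp2", "msp1", "TA40", "TA60", "TA81", "PfPK2"]


def classify_stepwise(pre, recurrent, marker_order=MARKER_ORDER):
    """Classify one sample pair using markers in sequence.

    pre, recurrent: dicts mapping marker name -> set of binned alleles.
    Returns (outcome, number_of_markers_compared).
    """
    compared = 0
    for marker in marker_order:
        if marker not in pre or marker not in recurrent:
            continue  # assumption: markers without data are skipped
        compared += 1
        if not (pre[marker] & recurrent[marker]):
            # No shared allele at this marker: stop, no further genotyping needed.
            return "new infection", compared
    if compared == 0:
        return "unclassified", 0
    # At least one shared allele at every marker compared.
    return "recrudescence", compared
```

A pair that shares an allele at *msp2* but none at *msp1* is classified as a new infection after two markers, and the four microsatellites are never run for that pair.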

We next assessed the effect of adding each additional marker on estimates of the risk of treatment failure. The risks of treatment failure at different points in the genotyping algorithm were defined as the proportion of patients in each treatment arm with either an ETF response or an LCF or LPF response due to recrudescence. Estimates of the risk of treatment failure decreased for all treatment arms in Bobo-Dioulasso after genotyping with three markers (AQ from 18% to 8%; SP from 8% to 4%; AQ plus SP from 4% to 2%), while the use of three additional markers had little effect on these estimates (Fig. 3). In Tororo, however, decreases in estimates of the risk of treatment failure were appreciable with the addition of each marker in the genotyping algorithm. However, even after all six markers were applied, estimates of treatment failure remained high, at 20% for AQ plus AS and 17% for AL.
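The definition of risk of treatment failure used here can be expressed as a small Python function. The data layout, outcome labels, and function name are hypothetical choices for this sketch; the example counts are made up.

```python
def risk_of_treatment_failure(outcomes):
    """Proportion of patients with ETF, or with LCF/LPF due to recrudescence.

    outcomes: list of (who_outcome, genotype_call) tuples, one per patient.
    genotype_call is only meaningful for LCF and LPF outcomes.
    """
    failures = 0
    for who_outcome, genotype_call in outcomes:
        if who_outcome == "ETF":
            failures += 1
        elif who_outcome in ("LCF", "LPF") and genotype_call == "recrudescence":
            failures += 1
    return failures / len(outcomes)


# Hypothetical arm of 10 patients.
arm = [("ACPR", None)] * 6 + [
    ("ETF", None),
    ("LCF", "recrudescence"),
    ("LPF", "new infection"),
    ("LPF", "recrudescence"),
]
risk = risk_of_treatment_failure(arm)  # 3 failures out of 10 patients
```

Because the genotype call changes as markers are added, recomputing this proportion at each step of the algorithm gives the sequence of risk estimates described above.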

To assess the accuracy of our risk estimates, we used equation 3 as described in Materials and Methods to estimate how many observed recrudescences were true recrudescences. We used this equation to estimate the risk of treatment failure adjusted by both genotyping and the estimated number of matches occurring by chance for each treatment arm at each point in the algorithm (Fig. 3). In Bobo-Dioulasso, treatment failure estimates adjusted by genotyping became similar to those adjusted by both genotyping and chance matches after using three markers, suggesting that genotyping with these three markers yields accurate risk estimates at sites with moderate transmission intensities. In Tororo, however, the risk estimates remained dissimilar even after genotyping with all six markers, with absolute risk differences of 9% (AQ plus AS) and 6% (AL) between results adjusted by genotyping and those adjusted by both genotyping and chance matches. This suggests that even genotyping with the six markers described in this report may substantially overestimate the true risk of treatment failure at very high transmission sites.
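One way to relate these absolute differences to the share of observed failures that were probably misclassified is to divide the difference by the genotyping-adjusted risk. The short Python sketch below does this with the Tororo figures quoted in the text. This is our reading of how the numbers fit together rather than a calculation spelled out by the authors, and the function name is hypothetical.

```python
def misclassified_share(genotyping_adjusted_risk, risk_difference):
    """Fraction of genotyping-adjusted failures attributable to chance matches."""
    return risk_difference / genotyping_adjusted_risk


share_aq_as = misclassified_share(0.20, 0.09)  # AQ plus AS
share_al = misclassified_share(0.17, 0.06)     # AL
```

These ratios come to about 45% and 35%, which is consistent with the proportions of misclassified treatment failures reported in the abstract.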

## DISCUSSION

In our study, new infection and recrudescent outcomes were accurately distinguished in an antimalarial clinical trial performed in an area of moderate transmission intensity by genotyping with three markers in a stepwise algorithm. In contrast, these outcomes could not be accurately distinguished in an area of very high transmission intensity. In this area, estimates of the risks of treatment failure for AQ plus AS and AL were 34% and 24%, respectively, when genotyped with *msp2* and *msp1* and remained high at 20% and 17% even when genotyped with six markers. Due to the high complexity of infection at this high-transmission site, a large proportion of treatment failures were likely new infections misclassified as recrudescences, resulting in overestimation of the true risk of treatment failure.

Outcomes of new infection and recrudescence are relatively easy to distinguish in antimalarial trials performed in areas of low transmission intensity, since the complexity of infection is low and the probability of pretreatment and recurrent-parasitemia genotypes matching by chance is therefore small. The problem of genotypes matching by chance was analyzed in detail in one of the first antimalarial trials to use genotyping (3). In this trial, which was performed in Southeast Asia and used three genotyping markers (*msp2, msp1*, and GLURP), the vast majority of samples contained a single parasite strain. Based on an analysis of the frequency distribution of alleles, the authors of this trial determined that 54 of 57 subjects with recurrent parasitemia had a probability of genotypes matching by chance of less than 0.05. The majority of antimalarial trials are now performed in Africa, where transmission is much higher, but with few exceptions (14), the same genotyping methods are used without a reassessment of the increased likelihood of genotypes matching by chance (6), and it is commonly stated that the probability of genotypes matching by chance is low (7, 15). In this study, we replaced GLURP, which has less diversity than *msp1* (5), with four microsatellite markers, one of which (TA40) has diversity similar to that of *msp1*. Despite using more markers and using at least one marker with higher diversity than GLURP, the probability of genotypes matching by chance was high at our high-transmission site.

Outcomes from antimalarial trials are more difficult to accurately distinguish in high-transmission areas for two reasons. First, more new infections occur during follow-up, increasing the number of outcomes that require genotyping and therefore may be misclassified as recrudescences. Second, genotyping loses accuracy when there is a higher complexity of infection, as seen in areas of higher transmission intensity (2). A higher complexity of infection makes it more likely that at least one parasite strain present in pretreatment and recurrent-parasitemia samples will match by chance, increasing the probability of misclassifying a new infection as a recrudescence. In our study, a new infection was eight times more likely to be misclassified as a recrudescence at the high-transmission site than at the moderate-transmission site. Given that allele frequencies were similar at both sites but complexity of infection was significantly higher in the higher-transmission site, the increase in misclassification was driven primarily by the higher complexity of infection. Genotyping misclassification at our high-transmission site led to estimates of the absolute risk of treatment failure that were much higher than would be expected for two newly introduced ACTs. Trials of other ACTs performed in areas of high transmission have also resulted in surprisingly high estimates of treatment failure. A recent trial of dihydroartemisinin-piperaquine in Rwanda reported a risk of treatment failure of 11.4% at one site but only 1.3% and 1.1% at two other sites (13). The site in Rwanda with the higher risk of treatment failure also had a much higher rate of new infections than the other two sites, indicating that higher transmission at this site may have resulted in increased genotyping misclassification as a possible reason for the discrepant results. 
Taken at face value, results from high-transmission sites in Uganda and Rwanda would imply that the failure risks for three newly introduced ACTs already exceed the 10% cutoff for acceptability now advocated by the WHO (21).

Our estimation of the degree of genotyping inaccuracy was a best-case scenario, in which we made the assumption that the probability of genotypes matching by chance was the product of the probabilities of genotypes matching at each marker. This assumption is unlikely to be correct for two reasons: (i) individual genotyping markers are unlikely to be completely independent of each other (i.e., to exhibit complete linkage equilibrium) even in high-transmission areas (18) and (ii) parasite genotypes may cluster in space and time, so that pretreatment and newly infecting parasites are more likely to have matching genotypes than two parasites randomly selected from the population as a whole (3). Therefore, we most likely underestimated the probability of genotypes matching by chance and, because of this, underestimated the extent of genotyping misclassification at our very high transmission site. In addition, the persistence of gametocytes in the blood after successful eradication of asexual forms may also lead to misclassification of a new infection as a recrudescence.

Genotyping using multiple markers in a stepwise approach has advantages and disadvantages. The main advantage of using multiple markers is that genotyping accuracy may be increased with each additional marker, and by using a stepwise approach, additional markers are used efficiently. More accurate distinction of new infection and recrudescence may be needed when identifying risk factors for treatment failure. For example, we were unable to find statistically significant associations between molecular markers of drug resistance and treatment outcomes in a moderate-transmission site when genotyping with *msp2* and *msp1* to determine outcomes (10). However, we found significant associations when genotyping with all six markers was performed (8). Another advantage of a stepwise approach is that outcomes of new infection or recrudescence may be assigned even if genotyping is unsuccessful at some markers, an important benefit, since missing data may affect overall results (11). By including genotyping results that were successful for at least one marker, we were able to assign outcomes in all but 1 of 307 recurrent-parasitemia pairs genotyped at our two study sites.

The main disadvantage in using multiple genotyping markers in a stepwise approach is a potential increase in the probability of misclassifying a recrudescence as a new infection. All PCR-based genotyping methods have a limit to their sensitivities for detecting alleles of strains present at low proportions (9). If genotyping fails to detect the alleles of all strains present in the pretreatment sample that are responsible for recrudescence, then the outcome will be misclassified as a new infection. By adding additional markers, the chance that at least one marker will fail to detect all recrudescent parasites increases. This type of error is difficult to estimate, because it is difficult to assess how frequently alleles are missed in field samples and how frequently the missed alleles in a sample would represent all of the recrudescent strains. However, we attempted to minimize this problem by using only genotyping markers that performed well when rigorously validated on polyclonal control samples (10). Another disadvantage to using multiple markers is that diminishing returns are observed with each additional marker. We observed fewer outcomes reclassified from recrudescence to new infection when using the last few markers, even at the high-transmission site, despite the fact that many of these remaining outcomes were likely to truly be new infections. There are two reasons for these diminishing returns: (i) the sample pairs remaining in the genotyping algorithm after the first few markers were used were enriched for those with a higher complexity of infection, and (ii) these pairs were genotyped using markers with lower diversity later in the algorithm.

In conclusion, we found that new infection and recrudescence could be accurately distinguished at a moderate-transmission site by genotyping with three markers but that these outcomes could not be accurately distinguished at a very high transmission site even by genotyping with the six markers assessed in this study. Thus, a genotyping system with higher discriminatory power appears to be needed to generate accurate estimates of treatment failure for antimalarial clinical trials in very high transmission areas. In the absence of improved genotyping methods, results from trials performed in high-transmission areas should be interpreted with the knowledge that genotyping misclassification may substantially overestimate the risk of treatment failure. In high-transmission areas, it may be better to rely on comparative results rather than the absolute risk of treatment failure in individual treatment arms. Alternatively, estimates of treatment failure from these areas may be adjusted by taking into account the probability of genotypes matching by chance, as we and others (14) have done, although these estimates may nonetheless overstate treatment failure rates for new highly efficacious drugs.

## ACKNOWLEDGMENTS

We thank Jon Woo and Elaine Carlson at the UCSF Genomics Core Facility for performing capillary electrophoresis and the clinical trial study teams in Bobo-Dioulasso and Tororo.

This work was supported by the Doris Duke Charitable Foundation. P.J.R. is a Doris Duke Charitable Foundation Distinguished Clinical Scientist.

## FOOTNOTES

- Received 2 February 2007.
- Returned for modification 3 April 2007.
- Accepted 16 June 2007.
- Published ahead of print on 25 June 2007.

- American Society for Microbiology