In Vitro Preclinical Testing of Nonoxynol-9 as Potential Anti-Human Immunodeficiency Virus Microbicide: a Retrospective Analysis of Results from Five Laboratories

ABSTRACT The first product to be clinically evaluated as a microbicide contained the nonionic surfactant nonoxynol-9 (nonylphenoxypolyethoxyethanol; N-9). Many laboratories have used N-9 as a control compound for microbicide assays. However, no published comparisons of the results among laboratories or attempts to establish standardized protocols for preclinical testing of microbicides have been performed. In this study, we compared results from 127 N-9 toxicity and 72 efficacy assays that were generated in five different laboratories over the last six years and were performed with 14 different cell lines or tissues. Intra-assay reproducibility was measured at two-, three-, and fivefold differences using standard deviations. Interassay reproducibility was assessed using general linear models, and interaction between variables was studied using step-wise regression. The intra-assay reproducibility within the same N-9 concentration, cell type, assay duration, and laboratory was consistent at the twofold level of standard deviations. For interassay reproducibility, cell line, duration of assay, and N-9 concentration were all significant sources of variability (P < 0.01). Half-maximal toxicity concentrations for N-9 were similar between laboratories for assays of similar exposure durations, but these similarities decreased with lower test concentrations of N-9. Results for both long (>24 h) and short (<2 h) exposures of cells to N-9 showed variability, while assays with 4 to 8 h of N-9 exposure gave results that were not significantly different. This is the first analysis to compare preclinical N-9 toxicity levels that were obtained by different laboratories using various protocols. This comparative work can be used to develop standardized microbicide testing protocols that will help advance potential microbicides to clinical trials.

transmission of HIV-1 and the other sexually transmitted disease pathogens (8,22,25,27,32,43). Results from human and animal studies indicated a narrow margin between N-9 effectiveness and safety as well as associations between frequent N-9 use and vaginal or rectal irritation, inflammation, tissue infiltration by host immune cells, and changes in the vaginal flora (5,15,39). These adverse effects may increase the risk for HIV-1 transmission during sexual intercourse. Consequently, the outcome of these clinical trials led to the recommendation by the Centers for Disease Control and Prevention that products containing N-9 should not be used for HIV-1 prevention, especially during rectal intercourse (6,29,31).
N-9 has been tested in many in vitro assay systems and was shown to be efficacious as well as toxic. The issue of selectivity for N-9 products was raised in 1994 when a study demonstrated that, while N-9 was active against HIV-1 at a concentration of 0.01%, it was also cytotoxic to lymphocytes at the same concentration (3). After additional preclinical testing, N-9 was moved forward into clinical trials (1,36,40), and the rationale for performing these trials has previously been described (14,38). Given the safety concerns raised in the N-9 clinical trials and the need for preclinical assays that accurately predict the potential in vivo toxicity of topically applied microbicides, historical data for N-9 in vitro toxicity provided by five laboratories were critically evaluated. In this retrospective study, the intra-assay, interassay, and interlaboratory reproducibilities were evaluated using statistical methods. The data presented here show that N-9 has a low selectivity index (SI) for HIV-1 and that the in vitro toxicity of N-9 was consistent with the adverse clinical outcomes associated with human trials of N-9-based microbicides.

MATERIALS AND METHODS
Participating laboratories. Five laboratories provided historical N-9 efficacy and toxicity data from studies that were conducted between 1998 and 2004. They were coded as "GH" (University of London, London, United Kingdom), "SR" (Southern Research Institute, Frederick, MD), "CD" (Centers for Disease Control and Prevention, Atlanta, GA), "DU" (Drexel University College of Medicine, Philadelphia, PA), and "CR" (CONRAD, Eastern Virginia Medical School, Norfolk, VA). One laboratory was with a not-for-profit organization, three were with universities, and one was with a governmental institution.
Statistical analysis. Toxicity assays were divided by duration of N-9 exposure to cells into four categories: (i) 5 to 10 min, (ii) 1 to 2 h, (iii) 4 to 8 h, and (iv) >24 h. Assays with repetitive N-9 exposures and/or recovery periods were not analyzed statistically. Concentrations of N-9 differed widely across and within laboratories (0 to 10,000 μg/ml). To improve the sensitivity of the statistical tests, the intra-assay variations were assessed for each N-9 concentration level separately, and the interassay and interlaboratory models included N-9 concentration as a factor.
Prior to statistical analysis, all experimental values used in this study were normalized. The media control was predetermined to represent 100% viable cells, and the N-9-exposed samples were expressed relative to this control. Therefore, it was possible to statistically compare assays with various toxicity endpoints.
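The normalization described above can be illustrated with a minimal sketch. The function name, variable names, and raw readouts below are hypothetical and are not taken from the participating laboratories; the sketch only assumes that each assay provides raw signals for media controls and for N-9-exposed samples.

```python
from statistics import mean


def percent_viability(treated_signals, media_control_signals):
    """Express raw assay readouts as a percentage of the mean media control."""
    control_mean = mean(media_control_signals)
    if control_mean <= 0:
        raise ValueError("media control mean must be positive")
    # The media control is defined as 100% viable cells.
    return [100.0 * signal / control_mean for signal in treated_signals]


# Hypothetical raw readouts (illustrative only, not data from this study).
media_controls = [1.9, 2.0, 2.1]
n9_exposed = [2.0, 1.0, 0.5]
viability = percent_viability(n9_exposed, media_controls)
```

Because every endpoint is rescaled to its own control, values from assays with different readouts end up on a common 0 to 100% scale.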
For the efficacy assays, there were insufficient data to identify factors that allowed for the grouping of results as was done for toxicity assay results. Therefore, the results from the efficacy assays were not analyzed statistically.
Intra-assay reproducibility. To test whether values for two replicates represented a real difference, intra-assay variance was compared at two-, three-, and fivefold differences; the latter was detectable at a 90% confidence level where an intra-assay standard deviation (SD) greater than 0.21 log10 was statistically significantly greater than an intra-assay SD of 0.15 log10 (4,16). An intra-assay SD of 0.15 log10 has been used by the NIH-funded Virology Quality Assurance program to ensure that a laboratory could maintain precision across assays where detection of HIV-1 RNA was reproducible over a 5-log range (mean standard deviation, 0.15 log) (42).
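As an illustration of the intra-assay criterion, the following sketch computes the SD of log10-transformed replicate values and compares it with the 0.21 log10 level quoted above. The function names and replicate values are hypothetical; the original analysis was performed in SAS, so this is an assumed equivalent rather than the procedure actually used.

```python
import math
from statistics import stdev

SD_REFERENCE = 0.15  # log10 SD cited in the text as the precision benchmark
SD_THRESHOLD = 0.21  # log10 SD cited in the text as significantly greater


def log10_sd(replicates):
    """Sample SD of log10-transformed replicate values (all values must be > 0)."""
    if len(replicates) < 2:
        raise ValueError("at least two replicates are required")
    if any(value <= 0 for value in replicates):
        raise ValueError("replicate values must be positive for log10")
    return stdev([math.log10(value) for value in replicates])


def exceeds_threshold(replicates, threshold=SD_THRESHOLD):
    """True if the intra-assay log10 SD is above the chosen threshold."""
    return log10_sd(replicates) > threshold


# Hypothetical percent-viability replicates at one N-9 concentration.
tight_replicates = [80.0, 90.0, 100.0]
wide_replicates = [10.0, 100.0]
```

In this sketch, a set of replicates is flagged only when its log10 SD exceeds the threshold; replicates with an SD at or below the threshold are treated as reproducible.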
Interassay reproducibility. Interassay reproducibility was measured between each series of values from the same cell line, laboratory, and duration of N-9 exposure. The criteria for interassay reproducibility were applied as follows.
(i) Effect of replicate. Interassay reproducibility was measured using the F value for assay repetition (PROC GLM; SAS). A significant main effect of assay repetition would indicate that experimental values for the same cell line, assay duration, and laboratory would differ significantly between replicates.
(ii) Effect of three, six, and more than seven replicates on reproducibility. A reduced replicate effect for six and more than seven assay replicates compared to that for three assay replicates indicated increased reproducibility and evidence for recommending the use of more than three replicates in a standardized protocol for toxicity assays. To compare the effect of replicate numbers on reproducibility, an analysis of covariance was performed using general linear model (GLM) modeling, where N-9 concentration was the covariate factor.
(iii) Effect of N-9 concentration on replicate reproducibility. A significant interaction between N-9 concentration and replicate number indicated that the number of replicates required for interassay reproducibility depended upon the range of N-9 concentrations used in the experiment. To test the significance of interassay variation, a pilot analysis on a subsample of data was conducted to compare the goodness of fit with and without the use of GLM for the regression (SAS, 2004). The GLM procedure uses the method of least squares to fit the data. GLM was found to provide a substantially improved fit for these data and is recommended for the analysis of unbalanced designs (42).
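The replicate effect in criterion (i) rests on an F ratio, i.e., the variability between assay repetitions divided by the variability within them. The sketch below is a simplified one-way analysis of variance written in plain Python; it is not the PROC GLM model used in this study, which also included N-9 concentration as a factor. The function name and example values are hypothetical.

```python
from statistics import mean


def replicate_f_ratio(runs):
    """One-way ANOVA F ratio; `runs` holds one list of values per assay repetition."""
    n_groups = len(runs)
    n_total = sum(len(run) for run in runs)
    if n_groups < 2 or n_total <= n_groups:
        raise ValueError("need at least two runs and more values than runs")
    grand_mean = mean([value for run in runs for value in run])
    # Variability between repetitions (group means around the grand mean).
    ss_between = sum(len(run) * (mean(run) - grand_mean) ** 2 for run in runs)
    # Variability within repetitions (values around their own run mean).
    ss_within = sum((value - mean(run)) ** 2 for run in runs for value in run)
    if ss_within == 0:
        raise ValueError("no within-run variability; F ratio is undefined")
    ms_between = ss_between / (n_groups - 1)
    ms_within = ss_within / (n_total - n_groups)
    return ms_between / ms_within


# Hypothetical percent-viability values from three repetitions of one assay.
example_runs = [[10.0, 20.0, 30.0], [20.0, 30.0, 40.0], [30.0, 40.0, 50.0]]
f_value = replicate_f_ratio(example_runs)
```

A larger F ratio means that repetitions differ from each other more than the values within one repetition do, which corresponds to lower interassay reproducibility.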
To test for the effect of N-9 assay concentrations on reproducibility, N-9 concentrations needed to be equivalent for assays using three, six, and more than seven replicates. This meant that assays using a smaller number of replicates (e.g., one to three) were assumed to represent a similar range of N-9 concentrations to allow comparisons to assays using a larger number of replicates (i.e., six or more).
An analysis of the interaction between assay N-9 concentration and interassay reproducibility revealed that N-9 concentration was correlated with replicate number, rendering this comparison invalid. However, grouping of the data in N-9 concentration ranges allowed this comparison to be made, avoiding the intercorrelation between N-9 concentration and the number of replicates. Thus, the assay data were divided into low (less than 2.5 μg/ml), medium (2.5 to 32 μg/ml), or high (more than 32 μg/ml) N-9 concentration levels. The number of data points falling into each category is shown in Table 5. A GLM regression model for each N-9 concentration group (low, medium, and high) tested main and interaction effects on their ability to predict the experimental measurements.
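The grouping into concentration ranges can be sketched as follows. The cut-offs are those stated above, with concentrations expressed in the same units as in the text; the function name and the example concentrations are hypothetical.

```python
from collections import Counter

LOW_CUTOFF = 2.5    # below this value: low group
HIGH_CUTOFF = 32.0  # above this value: high group


def concentration_group(concentration):
    """Assign an N-9 concentration to the low, medium, or high group."""
    if concentration < 0:
        raise ValueError("concentration cannot be negative")
    if concentration < LOW_CUTOFF:
        return "low"
    if concentration <= HIGH_CUTOFF:
        return "medium"
    return "high"


# Hypothetical test concentrations (illustrative only).
example_concentrations = [0.5, 2.5, 10.0, 32.0, 100.0, 1000.0]
group_counts = Counter(concentration_group(c) for c in example_concentrations)
```

Counting the data points per group, as in the last line, gives the kind of tally that Table 5 reports for the study data.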
Toxicity concentrations. TC50 values were expressed as the toxic concentration in μg/ml where 50% of the cells were still viable. TC50 values were calculated for each assay by four-parameter curve-fitting and point-to-point regression analysis using the computer program XLfit 4 (IDBS). Curve-fitting and point-to-point regression TC50 values were tested for their comparability by correlation analysis and paired sample t test (PROC MEANS; SAS, 2002). As the two (curve-fitting and point-to-point) values were calculated using the same assay data, a correlation between the two sets of values was expected.
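A point-to-point TC50 estimate can be sketched as a linear interpolation between the two adjacent data points that bracket 50% viability. This is an assumption about how such a calculation typically works, not a description of the XLfit 4 implementation; the function name and dose-response values are hypothetical.

```python
def tc50_point_to_point(concentrations, viabilities, target=50.0):
    """Interpolate the concentration at which viability crosses `target` percent.

    Returns None if the data never reach the target viability.
    """
    points = sorted(zip(concentrations, viabilities))
    for (c_low, v_low), (c_high, v_high) in zip(points, points[1:]):
        if v_low == target:
            return c_low
        # A sign change means the target lies between these two points.
        if (v_low - target) * (v_high - target) < 0:
            fraction = (v_low - target) / (v_low - v_high)
            return c_low + fraction * (c_high - c_low)
    if points and points[-1][1] == target:
        return points[-1][0]
    return None


# Hypothetical dose-response data (concentration, percent viability).
example_concentrations = [1.0, 10.0, 100.0, 1000.0]
example_viabilities = [100.0, 80.0, 20.0, 0.0]
tc50 = tc50_point_to_point(example_concentrations, example_viabilities)
```

Unlike a four-parameter curve fit, this estimate depends only on the two bracketing points, which is one reason the two methods can diverge when the dilution series is sparse around 50% viability.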
Mean TC50 values were compared for each cell line and laboratory within the duration of each assay by analysis of variance (PROC ANOVA; SAS). This was followed by a means test (Scheffe or least significant difference; P < 0.05) to identify outlier TC50 values. To increase sensitivity in the comparisons, mean TC50 values >500 μg/ml were excluded from this analysis.

RESULTS
Intra-assay reproducibility was within twofold differences of log10 standard deviations. We tested the intra-assay reproducibility by using standard deviations. Figure 1 shows that the SDs (log10) of all the assays tested were below the 0.21 SD (log10) level. The horizontal lines indicate two-, three-, and fivefold differences (SD log10) in intra-assay measurements, and "P < 0.05" indicates the SD (log10) at which there is 90% confidence to find a fivefold difference (4). There is less than a 10% probability that, using the assay methods described, a fivefold difference in experimental values would occur for a measurement made at the same concentration, cell line, assay duration, and laboratory. In fact, the actual SDs (log10) showed consistent intra-assay reproducibility at a twofold level.
The effect of replicate was nonsignificant in 75% of the assays. The main effects of cell type, duration of assay, and N-9 concentration were significant sources of variance (P < 0.01) in the experimental values. Since these main effects were expected to originate from the nature of the assays being performed and were not related to assay reproducibility, no further analyses were conducted. Thus, the remainder of the results are confined to the relationship between the experimental values and the main and interaction effects of assay replicates.
Main effects (P < 0.05) found for interassay reproducibility are presented in Fig. 2, where a high F value is an indicator of low interassay reproducibility. The horizontal line in Fig. 2 indicates the approximate F value that would result from a significant replicate effect when P was <0.05. This value differed slightly depending on the number of observations in each test. Seventy-five percent of the assays showed a nonsignificant effect of replicate determinations on experimental values (Fig. 2). In other words, experimental values for the same cell line, assay duration, and laboratory did not differ significantly between replicate values. Using the criteria described in Materials and Methods, the majority of the toxicity assays were found to have acceptable interassay reproducibility, with the exception of the following assays: Sup-T1, 10 min (DU); explant cervical tissue, 10 min (GH); explant cervical tissue, 1 h, 35 min (GH); ME180, 1 h (GH); HOS-CD4-X4/R5, 4 h (SR); ME180, 24 h (GH); P4-CCR5, 48 h (DU); and macrophages, 24 h (CD), which did not meet the criteria for interassay reproducibility.

No significant effect of replicate number on reproducibility.
As shown in Table 4, no significant effects of assay replicate number (three, six, or more than seven) on experimental values were detected (P > 0.05). This implied that the effect of replicate determinations on experimental values was unrelated to the number of replicates performed in the toxicity experiment. Therefore, for the assay types and laboratories represented in this study, no difference in reproducibility for assay methods using three, six, or more than seven replicates would be predicted.
No significant effect of N-9 assay concentration above 2.5 μg/ml on replicate reproducibility. The analysis of variance comparing concentrations of N-9 used in replicate numbers of one to three, four to six, and more than seven found N-9 concentrations to differ substantially between these groupings of replicate values. For example, the mean N-9 concentration used was 331.5 μg/ml when the number of replicates was one to three, 129.14 μg/ml when the number of replicates was four to six, and 18.81 μg/ml when the number of replicates was more than seven. N-9 concentration variations with replicate numbers were found to be a confounding factor, and therefore, the independent effect of N-9 concentration on interassay reproducibility could not be tested. However, after grouping the data into three different concentration ranges, the effect of N-9 assay concentration on replicate reproducibility could be examined.
There were no replicate effects when the concentration of N-9 used in the experiment was above 2.5 μg/ml (Table 5, columns 3 and 4). However, for N-9 levels below 2.5 μg/ml, significant main (P < 0.0001) and interaction (P < 0.0001 to 0.048) replicate effects were found (Table 5). The low N-9 group included assay data from the following cell lines: primary HVKs, explant cervical tissue, HOS-CD4-X4/R5 cells, and ME-180 cells. Interestingly, for this low N-9 group, the expected significant effects of cell type were not found. It is possible that low interassay reproducibility in the low N-9 group masked any effects of cell or tissue type on the experimental results. An analysis of variance (PROC ANOVA; SAS, 2002) followed by a means comparison (least significant difference) of experimental measures between the replicates showed that values for the first replicate were lower than those for replicates two to six (data not shown).
FIG. 1. Intra-assay reproducibility across assay durations for all cell lines. The intra-assay reproducibility was assessed using SD (log10). All replicates within one concentration were compared. The two-, three-, and fivefold levels of SD log10 are indicated from bottom to top, respectively, by horizontal lines. P < 0.05 indicates the SD (log10) at which there is 90% confidence to find a fivefold difference (4).
VOL. 50, 2006 NONOXYNOL-9 TOXICITY AND EFFICACY EVALUATION 717
Toxicity to cells increased with longer N-9 exposure times. As shown in Fig. 3, TC50 values showed a decreasing trend with longer assay durations. When high TC50 values (>500 μg/ml) were excluded, assay durations of 5 to 10 min, 1 to 2 h, and >24 h showed differences in TC50 values between assays. Specifically, TC50 values obtained in monocyte-derived macrophages (MDM) (DU, 5 to 10 min; mean TC50, 195.0 μg/ml) and explant cervical tissue (GH, 1 to 2 h and >24 h; mean TC50s, 431.5 μg/ml and 119.5 μg/ml, respectively) were significantly higher than those of the other assays in the same duration group. The TC50 values found for assays in the 4-to-8-h assay duration group were not significantly different.
When the mean TC50 values of the different exposure times were compared, the same decreasing trend was confirmed (Fig. 4). Statistically significant differences were found when the individual mean TC50s of the three shorter exposure times (5 to 10 min, 1 to 2 h, and 4 to 8 h) were compared to the mean TC50 of the longest exposure time (>24 h).
FIG. 2. Interassay reproducibility across assay duration and cell lines. Interassay reproducibility was assessed using a regression analysis with general linear models, taking into account all N-9 concentrations. Low reproducibility was indicated where the variability between replicates was higher than the variability within replicates and was significant (P < 0.05) when the F ratio was greater than approximately 3.0 (PROC GLM, SAS 2004). *, F ratio equals the variability in experimental measures between replicates divided by the variability in experimental measures within replicates. The horizontal line indicates the approximate F value that would result from a significant replicate effect when P was <0.05.
In addition to evaluating the variation of TC50 values between N-9 exposure times, two different methods of calculating the TC50 values were compared. For the data set analyzed, TC50 values were highly correlated (P < 0.01) between four-parameter curve-fitting and point-to-point regression analysis methods. The curve-fitting data presented a smoother curve relative to the point-to-point data (Fig. 5). There was no significant difference (t test, t = 0.17; P = 0.86) between the values obtained using the curve fit method (mean TC50, 459.14 μg/ml; standard error, 172.47) and the point-to-point estimations (mean TC50, 474.12 μg/ml; standard error, 173.27) (Fig. 5). However, concordance between curve-fitting and point-to-point TC50 values was greater for values below 70 μg/ml than for values equal to or above 70 μg/ml (data not shown).
N-9 efficacy against HIV-1 replication was indistinguishable from N-9 toxicity. A total of 61 virucidal assays and 11 attachment assays were evaluated from four different laboratories (Table 3). The number of drug dilutions ranged from four to six and the number of replicates from two to four in the efficacy assays (Table 6). The IC50 was defined as the concentration where virus replication was inhibited by 50% relative to the virus control. IC50 values were calculated using the same four-parameter curve-fitting model as mentioned above (XLfit4; IDBS) and ranged from >1 to >696 μg/ml. When the TC50 was divided by the IC50 (where applicable), an SI of just above 1 was calculated. The only exceptions were the cell-free infection assay (CFIA) and the cell-associated infection assay (CAIA), where the IC50s were 10-fold higher than the TC50s due to the design of the assays.
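The selectivity index used above is simply the ratio of the TC50 to the IC50. The following minimal sketch uses hypothetical values chosen to give an SI just above 1; it does not reproduce any measurement from this study, and the function name is an assumption made for illustration.

```python
def selectivity_index(tc50, ic50):
    """Selectivity index: toxic concentration (TC50) divided by inhibitory concentration (IC50)."""
    if tc50 <= 0 or ic50 <= 0:
        raise ValueError("TC50 and IC50 must be positive numbers")
    return tc50 / ic50


# Hypothetical values in the same concentration units (illustrative only).
example_tc50 = 12.0
example_ic50 = 10.0
example_si = selectivity_index(example_tc50, example_ic50)
```

An SI close to 1 means that the concentration needed to inhibit the virus is about the same as the concentration that kills half of the cells. Censored estimates reported only as bounds (for example, "greater than" values) cannot be entered into this ratio directly, which is consistent with the SI being calculated only where applicable.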

DISCUSSION
This report represents a multisite analysis to evaluate the toxicity of the frequently used spermicide N-9, which was initially proposed for use as a topical microbicide. The primary focus of this study was to evaluate the intra-assay, interassay, and interlaboratory reproducibilities of established assays that are used to assess the preclinical toxicity of N-9, which already had progressed through clinical development as a potential therapy to decrease the transmission of HIV. This endeavor was accomplished using historical data from five independent laboratories that encompassed analyses of toxicity and efficacy data from a wide range of assays performed under multiple conditions. Results from assays using transformed cell lines, primary cells, and cervical explant tissues that were infected with either laboratory-adapted or clinical isolates of HIV-1 and exposed to N-9 for various times ranging from 5 min to >24 h were normalized to control cultures which were used as no-toxicity controls. The normalized values were then assessed using various univariate and multivariate statistical methods. As expected, there was a high correspondence among the assays and laboratories for an outcome of increased N-9 toxicity in relation to exposure time and concentration. Additionally, high intra-assay reproducibility for toxicity could be demonstrated. While the study was initially designed to evaluate N-9 efficacy, this parameter could not be addressed due to a smaller data set and the high toxicity and low SI of N-9.
FIG. 3. TC50 (μg/ml) for N-9 toxicity assays for the various exposure times. Toxicity increased with longer compound exposure.
To evaluate the interassay variations of N-9 toxicity, the effect of replicates among all assays, including all assay concentrations, was examined. The historical data set that was used allowed assessments of the main effect of replicates, the effect of three, six, and more than seven replicates on reproducibility, and the effect of N-9 assay concentrations on replicate reproducibility. When the main effect of replicates was analyzed, 24 assays were within the acceptable F value range, whereas 8 assays did not meet the acceptable F value criteria for replicate effect. The cell lines used in the assays that did not meet the criteria spanned a wide range of primary and immortalized cell lines as well as cervical explant tissues. While primary cells and tissues are expected to have higher replicate variabilities, it is not clear why some immortalized cell lines have such a high variation among replicates. External factors, such as laboratory, plate setup, technician proficiency, and equipment used for the assay, could influence the interassay variability and will be addressed in a prospective study. Moreover, it was shown that there was no significant difference in reproducibility regardless of the number of replicates used. Therefore, in future experiments, factors other than number of replicates, such as plate setup, cell culture incubator, number of compound concentrations, and medium conditions, should be used to optimize the assays. When the effect of N-9 assay concentration on replicate reproducibility was investigated, it was found that the N-9 concentrations were correlated with the number of replicates used. Therefore, N-9 concentrations were grouped into low, medium, and high concentration ranges. There was evidence that interassay reproducibility for the assay types tested was reduced when low concentrations of N-9 were used, specifically those below 2.5 μg/ml. This effect was due to lower values for the first assay replicate compared to subsequent measures.
Therefore, the assay methods used in this experiment were less reproducible at lower N-9 concentrations (i.e., the higher the N-9 concentration, the more reproducible the assay method) and for the first replicate. Comparatively, the N-9 concentration used in the placebo-controlled, triple-blind phase II/III clinical trial (COL 1492) was 52.5 mg per treatment dose, which was about 100,000 times higher than the N-9 concentration applied to the cells in vitro (14,39). For the assessment of interlaboratory reproducibility, TC50 values were calculated using a four-parameter curve-fitting model. As expected, TC50 values showed a decreasing trend with longer assay durations. This implied that the toxicity of N-9 increased with longer N-9 exposure to the cells. Explant tissues were significantly more resistant to toxicity than cell lines, probably due to the three-dimensional architecture of the cervical explant tissue. When the explant tissues were excluded, assay durations of 5 min to 2 h and >24 h showed more variation in TC50 values than assays using 4 to 8 h of exposure time. The reasons for this are unknown and need to be addressed in a prospective study.
In addition to the four-parameter curve-fitting method, the TC50 was also calculated by point-to-point linear regression. When the two methods for calculating the TC50 values were compared, four-parameter curve-fitting appeared qualitatively superior to point-to-point regression for the data set analyzed, although there was no statistically significant difference between the two methods.
For evaluation of N-9 efficacy, not enough data could be collected to perform statistical analyses. However, in assays where efficacy and toxicity were performed in parallel (21), the calculated SI approximated 1, indicating that efficacy was indistinguishable from cellular toxicity. Thus, the in vitro studies presented here were consistent with the outcome of phase III clinical trials that demonstrated that N-9 efficacy could not be distinguished from its toxicity in vivo and that N-9 did not have a clinical benefit to prevent HIV infection (39).
The analysis of historical toxicity data has improved our overall understanding of the fate of N-9 as a topical microbicide. In the future, the reproducibility of toxicity and efficacy assays of promising microbicide candidates should also be evaluated in prospective studies and proficiency tests to avoid confounding factors and poor comparability between assays and laboratories. This will allow better statistical analyses of the variables, such as number of replicates, compound concentration and source, duration of compound exposure, TC50 calculation, cell line/tissue, and laboratory. Several microbicide products with different mechanisms of action will need to be included in the proficiency panels to filter the effects that are specific to toxic compounds such as N-9. Since the toxicity of N-9 in cell culture was somewhat predictive of the clinical outcome, it is important to evaluate the reproducibility of efficacy and toxicity data for compounds tested in different laboratories before they are advanced to phase I clinical trials. Therefore, the N-9 assay data comparison provides a basis for extending these analyses to other inhibitor data sets as well as for beginning the design of a microbicide assay proficiency testing system. Proficiency tests are intended to evaluate the competence of participating laboratories to perform certain laboratory procedures, e.g., microbicide assays. Proficiency panels that are composed of blinded reagents to carry out the microbicide assays are sent to the participating laboratory together with detailed instructions on how to perform the procedure. Upon completion of the assays, the participating laboratory will return the results to the proficiency test agency for evaluation. Proficiency tests are particularly important in the planning phase of clinical trials to ensure that all assay methodologies are working correctly.
In summary, preclinical toxicity assays from five different laboratories were compared for intra-assay, interassay, and interlaboratory variations. The intra-assay variation was very low, within a twofold difference. For the interassay variations, 75% of the assays had good reproducibility, while in 25% of the assays, the reproducibility was low. The latter included cervical explant tissues and monocyte-derived macrophages where a higher interassay variability was expected. Regarding the effect of N-9 concentration, the reproducibility was good for concentrations above 2.5 μg/ml, but significant main and interaction replicate effects were found for levels below 2.5 μg/ml. These data indicate that the higher (and more toxic) the N-9 concentration, the more reproducible the results. When the interlaboratory variability was compared using TC50 values, a strong association was identified between longer N-9 exposures and decreased TC50 values (i.e., increased toxicity) regardless of the laboratory. The statistical analyses of these preclinical data suggest that they were consistent with the effects of N-9 in vivo.