More complete details of our methods are described in the Research Methods Report for the Agency for Healthcare Research and Quality.
We included articles in five languages: Chinese, French, German, Japanese and Spanish. We searched MEDLINE with the term ‘randomized controlled trial’, restricted to each language. Working in reverse chronological order, we accepted the first 10 publications found in each language, regardless of topic, for which the full text was available as a machine-readable PDF or HTML file that we could translate with Google Translate. We also chose 10 English-language trials published across a distribution of years roughly matching that of the non-English articles. The list of included studies is presented in Additional file 1 and their characteristics in Additional file 2: Table S1. With this number of studies, the power to detect differences between extractions of translated and untranslated articles exceeded 80%.
We translated each article into English using Google Translate, choosing the simplest method available for each article: one-step translation of complete articles available as Web pages (HTML) or PDF files; copying and pasting blocks of text from PDF files into Google Translate; or copying text into word-processing software, reformatting it, and then copying it into Google Translate. We included the English translations, any English-language abstracts published with the original articles, and images of figures and tables that could not be translated because of formatting issues. Translations were performed primarily by a research assistant, who estimated the approximate time she required to translate each article.
Data extraction process
A description of the data extractors and a flowchart of the basic processes for extracting, reconciling and analyzing articles are provided in Additional files 3 and 4. Each original-language version of an article was independently extracted by two fluent readers. The extractors were informed of disagreements and asked to recheck discrepancies. The extractions were then reconciled, allowing multiple ‘correct’ answers when the extractors interpreted the data differently. This approach reduced the likelihood that disagreements between native and translated extractions would stem from differences in interpretation rather than from poor translation. The reconciled extractions from the fluent readers were considered the reference standard extractions. The Google-translated version of each article was extracted by two researchers, drawn from a pool of eight, who did not speak the article’s language. These eight researchers also extracted the 10 English-language articles. Extractions of the English-language articles were reconciled by consensus, either among five of the eight extractors or, failing that, by agreement between the two senior researchers (EMB, TAT), again allowing multiple ‘correct’ answers when data appeared to have been interpreted differently.
Data extraction form and comparison
We focused data extraction on study domains that are common and important in systematic reviews: study design and methods, interventions (and comparators), outcomes, and results. Extractors provided a rough estimate of how much extra time they spent with each article compared with the time they would have spent extracting a similar English-language article. They also reported their level of confidence in the accuracy of the extraction and in the translation of the article. Additional file 5 lists the data extraction items.
When possible, we selected one categorical and one continuous outcome from each trial. We limited the extraction of results to two interventions. Prior to data extraction, for each language we compiled a list of about a dozen outcomes that were reported in at least one article in that language. We aimed for a mix of primary and secondary outcomes, and clinical and intermediate or surrogate outcomes. Researchers were asked to check off all outcomes from the lists that were reported in the article.
For the comparisons of translated extractions, and of English extractions, with their reference standards, each data item was coded as ‘agree’ or ‘disagree’. ‘Disagree’ included erroneous data, incomplete data, and items incorrectly extracted as not reported.
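The agree/disagree coding against a reference standard that permits multiple acceptable answers can be sketched as follows. This is a minimal illustration only; the function name and data shapes are our own, not taken from the study:

```python
def code_item(extracted, reference_answers):
    """Code one extracted data item against the reference standard.

    reference_answers is the set of acceptable values from the reconciled
    fluent-reader extraction (multiple 'correct' answers are allowed).
    Any mismatch -- erroneous data, incomplete data, or an item wrongly
    extracted as not reported -- is coded 'disagree'.
    """
    return "agree" if extracted in reference_answers else "disagree"
```

For example, an item extracted as "12 weeks" agrees with a reference set of {"12 weeks", "3 months"}, while an item extracted as not reported (here, None) disagrees with any non-empty reference set.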
We used a generalized linear mixed-effects model to examine whether the probability of correctly extracting an item was related to the language of the original publication and to each extractor’s likelihood of correctly extracting English-language articles, accounting for the fact that extractions were grouped by paper. The model used the pattern of allocation of extractors to languages to control for reviewer effects. For each item, we report the model-predicted percent accuracy for an ‘average reviewer’ and the odds ratio for correct extraction from translated articles compared with correct extraction from English-language articles. We constructed the average reviewer by using the mean of the reviewer-specific coefficients to obtain model predictions. When the model did not converge, we ignored reviewer effects and calculated ‘crude’ percentages and odds ratios. When the odds of correctly extracting an item from the translated articles exceeded the odds of doing so from the English-language articles (odds ratio >1), we treated the result as equivalent to perfect agreement.
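On a fitted logistic mixed model, the ‘average reviewer’ prediction and the translated-versus-English odds ratio reduce to simple transformations of the coefficients. The sketch below illustrates these two calculations under that assumption; all coefficient values and function names are hypothetical, and the study’s actual model was fit with statistical software:

```python
import math

def average_reviewer_accuracy(intercept, language_effect, reviewer_coefs, translated):
    """Model-predicted percent accuracy for an 'average reviewer':
    plug the mean of the reviewer-specific coefficients into the
    logistic model, then invert the logit to get a probability."""
    mean_reviewer = sum(reviewer_coefs) / len(reviewer_coefs)
    logit = intercept + language_effect * translated + mean_reviewer
    return 100.0 / (1.0 + math.exp(-logit))

def translated_odds_ratio(language_effect):
    """Odds ratio for correct extraction from translated vs.
    English-language articles: the exponentiated language coefficient."""
    return math.exp(language_effect)
```

With a language coefficient of zero the odds ratio is 1 (no difference between translated and English articles); a negative coefficient gives an odds ratio below 1, i.e., lower odds of correct extraction from translated articles.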
Role of the funding source
This work was funded under contract from the US Agency for Healthcare Research and Quality, US Department of Health and Human Services. The funder did not participate in the conception, design, conduct, or analysis of the study, nor in the preparation, review, or approval of the manuscript, nor in the decision to submit it for publication.