Subject participation
The study protocol was approved by the ethics committees of Osaka University and related medical institutions as well as the Translational Health Science and Technology Institute (Faridabad). Japanese individuals (n = 343) for whom gut metagenome shotgun sequencing had been performed in previous studies were included in this study46,47,48. Among these individuals, 14 and 4 were used in the WGS analysis using the HiSeq X (Illumina) and NovaSeq 6000 (Illumina) systems, respectively. Among the individuals with WGS data, five were recruited for the ultra-deep metagenome shotgun sequencing analysis. Another 113 Japanese individuals for whom gut metagenome shotgun sequencing had been performed in previous studies46,49 were included in the validation dataset for the genetic sex prediction analysis. We newly recruited 37 individuals of various nationalities from Osaka University and the RIKEN Center for Integrative Medical Sciences, and included them in the multi-ancestry dataset for the validation of the ancestry prediction analysis. Samples from individuals in Delhi (India) were also included in the multi-ancestry dataset50. All study participants provided written informed consent before participation. The main, validation and multi-ancestry datasets included 343, 113 and 73 study participants, respectively, comprising 196, 65 and 25 females and 147, 48 and 48 males, respectively. The age ranges for these three datasets were 20–88 yr (mean, 42.2 yr; s.d., 17.3 yr), 20–81 yr (mean, 39.1 yr; s.d., 17.3 yr) and 20–61 yr (mean, 30.5 yr; s.d., 7.9 yr), respectively. Summaries of each dataset are provided in Supplementary Table 1. No statistical methods were used to pre-determine sample sizes, but our sample sizes are similar to those reported in previous publications46,47. The study participants did not receive any compensation.
Sample collection and DNA extraction
Phenol–chloroform DNA extraction was newly performed in this study for the ultra-deep and multi-ancestry gut metagenome datasets. For the other datasets, phenol–chloroform DNA extraction and subsequent metagenome shotgun sequencing had been performed in previous studies46,47,48,49. For dataset 2 as well as the ultra-deep gut metagenome dataset and validation dataset for the genetic sex prediction, faecal samples were collected in tubes containing RNAlater (Ambion). After the weights of the samples were measured, RNAlater was added to make tenfold dilutions of the homogenates. The faecal samples were stored at −80 °C within 24 h of collection. After washing with 1 ml PBS(−), 200-μl volumes of the homogenates were used for DNA extraction.
For datasets 1 and 3 as well as the multi-ancestry gut metagenome dataset, faecal samples were stored at −80 °C within 6 h of collection or frozen in an insulated container for storage at −20 °C immediately after collection and subsequently stored at −80 °C within 24 h. For these samples stored without RNAlater, RNAlater was added to make tenfold dilutions of homogenates before the DNA extraction. After washing with 1 ml PBS(−), 200-μl volumes of the homogenates were used for DNA extraction.
DNA was extracted according to a previously described method51. Briefly, 300 μl SDS–Tris solution, 0.3 g glass beads (diameter, 0.1 mm; BioSpec) and 500 μl EDTA–Tris-saturated phenol were added to the suspension, and the mixture was vortexed vigorously using a FastPrep-24 homogenizer (MP Biomedicals) at a power level of 5.0 for 30 s. After centrifugation at 20,000g for 5 min at 4 °C, 400 μl supernatant was collected. Subsequently, phenol–chloroform extraction was performed and 250 μl of the supernatant was subjected to isopropanol precipitation. Finally, the DNA was suspended in 100 μl EDTA–Tris buffer and stored at −20 °C.
For 88 samples in dataset 1 and 47 samples in the multi-ancestry dataset, we performed DNA extraction from frozen unhomogenized stool samples using the DNeasy PowerSoil Kit (QIAGEN) according to the manufacturer’s instructions.
Metagenome shotgun sequencing
The amounts of stool-derived DNA used for the library preparation are provided in Supplementary Table 14. Double-stranded DNA was quantified using a Qubit Fluorometer (Thermo Fisher Scientific). After sonication using an ME220 instrument (Covaris; target sonication length, 200 bp), a shotgun sequencing library was constructed using a KAPA Hyper Prep Kit (KAPA Biosystems) and LifeECO version 2.0 (BIOER Technology) following the manufacturers’ instructions. The library quality was evaluated using a LabChip GX Touch system (PerkinElmer). A Qubit Fluorometer and KAPA Library Quantification Kits (KAPA Biosystems) were used to quantify the library, and 150-bp paired-end reads were generated on a HiSeq 3000 or NovaSeq 6000 system. Detailed information on the number of lanes, total number of paired-end reads, sequencers, library preparation reagents and library preparation kits for the newly and previously obtained datasets is provided in Supplementary Table 14. The samples in each dataset were barcoded, pooled in equimolar ratios (targeted final amount of the pooled library, 4 nM) and sequenced simultaneously in a single run without library replication. A total of 1.8 nM (in 30 μl) of the pooled library was loaded onto the sequencing machine for each sequencing run. All sequencing was performed at the Department of Infection Metagenomics, Next-Generation Sequencing Core Facility, Research Institute for Microbial Diseases, Osaka University. Details of the reagents and kits used for metagenome shotgun sequencing are provided in Supplementary Data. The sequence reads were converted to FASTQ format using bcl2fastq (version 2.19).
Genotyping of samples based on the SNP array
We performed genotyping using an Infinium Asian screening array (Illumina). This genotyping array was built using an EAS reference panel including whole-genome sequences, which enabled effective genotyping in EAS populations.
We applied stringent quality control filters to the genotyping dataset using PLINK (version 1.90b4.4)52,53. We excluded individuals with a genotyping call rate of <0.98. All of the individuals were determined to be of EAS ancestry based on principal component analysis (PCA) with the samples of the 1KG dataset using EIGENSTRAT54. We further excluded SNPs with (1) a call rate of <0.99, (2) a minor allele count of <5 and (3) a Hardy–Weinberg equilibrium P value of <1.0 × 10−5.
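As an illustration of the SNP-level quality control step, the following minimal sketch runs PLINK with the thresholds listed above; the input and output fileset names are placeholders rather than the study's actual files.

```python
# Minimal sketch of the SNP-level quality control, assuming PLINK 1.9 binary filesets;
# "array_raw" and "array_qc" are placeholder fileset names.
import subprocess

subprocess.run([
    "plink",
    "--bfile", "array_raw",   # samples passing the individual-level call rate filter
    "--geno", "0.01",         # drop SNPs with call rate < 0.99
    "--mac", "5",             # drop SNPs with minor allele count < 5
    "--hwe", "1e-5",          # drop SNPs with Hardy-Weinberg equilibrium P < 1.0e-5
    "--make-bed",
    "--out", "array_qc",
], check=True)
```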
We performed genome-wide genotype imputation to computationally estimate untyped variants. We used the combined reference panel of the 1KG Project Phase 3 version 5 genotype (n = 2,504) and Japanese WGS data (n = 1,037)55,56 as a haplotype reference for genotype imputation. First, we excluded SNPs with a >7.5% allele frequency difference from the representative reference datasets of Japanese ancestry, namely the Japanese data in the aforementioned combined reference panel55,56 (n = 104 from the 1KG Project Phase 3 version 5 genotype and n = 1,037 from the Japanese WGS data) and the allele frequency panel of the Tohoku Medical Megabank Project57 (ToMMo 8.3KJPN Allele Frequency Panel, n = 8,380). Second, we conducted haplotype estimation to improve imputation performance using SHAPEIT (version 4.2.1)58 with the haplotype reference. After prephasing, we used Minimac4 (version 1.0.1)59 for genotype imputation. Variants imputed with Rsq (the estimated squared correlation between the true and imputed genotypes) > 0.7 were used for the downstream analysis.
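The pre-imputation allele-frequency check can be sketched as below, assuming two frequency tables keyed by a shared variant ID; the file and column names are illustrative assumptions, not the study's actual files.

```python
# Minimal sketch of the pre-imputation allele-frequency filter, assuming tab-separated
# tables with "variant_id" and "alt_af" columns (placeholder names).
import pandas as pd

study = pd.read_table("study_freq.tsv")      # allele frequencies in the SNP array data
ref = pd.read_table("reference_freq.tsv")    # Japanese reference panel frequencies

merged = study.merge(ref, on="variant_id", suffixes=("_study", "_ref"))
af_diff = (merged["alt_af_study"] - merged["alt_af_ref"]).abs()
keep = merged.loc[af_diff <= 0.075, "variant_id"]   # exclude SNPs with >7.5% difference
keep.to_csv("snps_to_keep.txt", index=False, header=False)
```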
Genotyping of samples based on WGS
DNA samples isolated from whole blood were sequenced at Macrogen Japan Corporation. The DNA was quantified using PicoGreen and DNA degradation was assessed by gel electrophoresis. All libraries were constructed using a TruSeq DNA PCR-Free library preparation kit according to the manufacturer’s protocol. The libraries were sequenced on a HiSeq X or NovaSeq 6000 system, producing 2 × 150-bp paired-end reads with a mean coverage of 15.5× and 16.4×, respectively. The reads produced by the HiSeq X and NovaSeq 6000 systems were processed as previously described60. Briefly, sequenced reads were aligned against the reference human genome with the decoy sequence (GRCh37, human_g1k_v37_decoy) using BWA-MEM (version 0.7.13). Duplicated reads were removed using Picard MarkDuplicates (version 2.10.10). After base-quality score recalibration implemented in GATK (version 3.8-0), we generated individual variant call results using HaplotypeCaller and performed multi-sample joint-calling of the variants using GenotypeGVCFs. We set genotypes satisfying any of the following criteria as missing: (1) depth of coverage (DP) < 5, (2) genotype quality (GQ) < 20 or (3) DP > 60 and GQ < 95. Variants with low genotyping call rates (<0.90) were then removed. We performed variant quality score recalibration for SNPs and short indels according to the GATK Best Practices recommendations and retained the variants that passed the quality control criteria. For the annotation of the pLoF variants, we used LOFTEE17 in VEP (version 102)61 with the GENCODE (release 19) annotation, and only high-confidence LoF variants were extracted. For the annotation of the ClinVar pathogenic variants, we used ANNOVAR (8 Jun 2020)62 with the clinvar_20210123 annotation on GRCh37.
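The per-genotype missingness criteria can be expressed as a small predicate; the sketch below assumes the per-sample DP and GQ values have already been parsed from the joint-called VCF.

```python
# Minimal sketch of the per-genotype missingness criteria described above.
def set_genotype_missing(dp: int, gq: int) -> bool:
    """Return True if the genotype should be set to missing."""
    if dp < 5:                  # criterion (1): low depth of coverage
        return True
    if gq < 20:                 # criterion (2): low genotype quality
        return True
    if dp > 60 and gq < 95:     # criterion (3): unusually high depth with imperfect quality
        return True
    return False
```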
Extraction of human reads from metagenome data
We followed a series of steps to extract human reads from the metagenome data. The main steps in the human read extraction were as follows: (1) trimming of low-quality bases, (2) identification of candidate human reads, (3) removal of duplicated reads and (4) removal of potential bacterial reads. We trimmed the raw reads to clip Illumina adaptors and cut off low-quality bases using Trimmomatic (version 0.39; parameters: ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10:8:true TRAILING:20 MINLEN:60)63. We discarded reads that were shorter than 60 bp after trimming. Next, we mapped the trimmed reads to the human reference genome (GRCh37, human_g1k_v37_decoy) using Bowtie 2 (version 2.3.5.1)64 with the ‘--no-discordant’ option and retained only the properly mapped reads. We then performed duplicate removal using Picard MarkDuplicates (version 2.22.8) with the ‘VALIDATION_STRINGENCY=LENIENT’ option. Finally, we mapped the duplicate-removed reads to the bacterial reference genome set constructed by Kishikawa and colleagues65. This reference was composed of 7,881 genomes, including those derived from Nishijima et al.66 and those identified in the cultivated human gut bacteria projects67,68,69. We kept only reads for which both paired ends failed to align. The resulting reads were defined as human reads and used in the subsequent analyses. The read counts obtained using bedtools (version 2.29.2) were normalized to the total number of reads to adjust for differences in sequencing depth between samples. The normalized read count was further normalized to the chromosome length to calculate the normalized depth.
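The depth normalization can be sketched as follows, assuming per-chromosome human read counts (for example, from bedtools) and chromosome lengths are already available; the variable names are placeholders.

```python
# Minimal sketch of the two-stage depth normalization described above; read_counts maps
# each chromosome to its human read count, chrom_lengths to its length in bp, and
# total_reads is the sample's total sequenced read count (all placeholders).
def normalized_depth(read_counts: dict, chrom_lengths: dict, total_reads: int) -> dict:
    depth = {}
    for chrom, count in read_counts.items():
        per_read = count / total_reads                    # adjust for sequencing depth
        depth[chrom] = per_read / chrom_lengths[chrom]    # adjust for chromosome length
    return depth
```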
Evaluation of the quality of the human reads in gut metagenome data
We utilized FastQC70 to assess the adaptor content and GC ratio of the reads. Based on the output of ‘bcftools mpileup’, the transition-to-transversion SNP ratio of the reads was estimated for the non-reference bases. We utilized ‘bcftools mpileup’ to evaluate the ratio of the non-major allele on the mitochondrial genome, which could be associated with contamination by non-host human DNA. We also used mixemt (version 0.2) to evaluate the ratio of the non-major haplogroup, which could also be associated with non-host human DNA contamination71. To evaluate the performance of the methods for estimating non-host human DNA contamination, we simulated mtDNA-derived human reads using the WGS data of five individuals. We made use of the WGS data of another five individuals as sources of contaminating mtDNA-derived reads. Next, we mixed the two sets of mtDNA reads derived from the WGS data—one from the original sample set and one from the contamination dataset—in specific ratios (original:contamination of 100:0, 90:10, 75:25 and 50:50) and downsampled them to 1, 2, 5, 10, 15 and 20× human mitochondrial genome coverage with five seeds using SAMtools72. We then applied the above two methods to the simulated data and estimated the levels of non-host human DNA contamination. After the simulation experiment, we estimated the levels of non-host human DNA contamination in the gut metagenome data of the 343 Japanese individuals.
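The non-major allele ratio on the mitochondrial genome can be sketched as below, assuming per-site base counts have already been parsed from the ‘bcftools mpileup’ output; the parsing step itself is omitted.

```python
# Minimal sketch of the non-major allele ratio used as a contamination indicator;
# allele_counts holds per-site base counts on the mitochondrial genome, e.g. {"A": 30, "G": 2}.
def non_major_allele_ratio(allele_counts: list[dict]) -> float:
    non_major = 0
    total = 0
    for counts in allele_counts:
        site_total = sum(counts.values())
        non_major += site_total - max(counts.values())   # bases not matching the major allele
        total += site_total
    return non_major / total if total else 0.0
```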
Genetic sex prediction based on the human reads in the gut metagenome data
Using scikit-learn (version 0.22.1), we trained a logistic regression model to predict genetic sex (male = 1 and female = 0) based on the Y-to-X chromosome read-depth ratio. Note that we used only the non-pseudo-autosomal regions for this analysis. We then validated this model with the validation dataset (n = 113). We also performed the same analysis after the human read extraction with the different human reference genomes, namely GRCh38 (GCA_000001405.15_GRCh38_no_alt_analysis_set) and T2T-CHM13v2.0 (chm13v2.0_maskedY_rCRS).
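A minimal sketch of this model, assuming the scikit-learn API and toy Y-to-X depth ratios that stand in for the real training data, is shown below.

```python
# Minimal sketch of the sex prediction model; the toy ratios are illustrative
# placeholders, not study data.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Y-to-X normalized read-depth ratio per training sample (non-pseudo-autosomal regions)
y_to_x_ratio = np.array([0.02, 0.01, 0.95, 1.05, 0.03, 0.98])
genetic_sex = np.array([0, 0, 1, 1, 0, 1])   # 1 = male, 0 = female

model = LogisticRegression().fit(y_to_x_ratio.reshape(-1, 1), genetic_sex)

# apply the trained model to validation samples
validation_ratio = np.array([0.02, 1.01])
print(model.predict(validation_ratio.reshape(-1, 1)))   # expected: [0, 1]
```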
Re-identification from a set of genotype data based on the human reads in the gut metagenome data
We calculated the likelihood that each sample in the genotype dataset produced the observed human reads in the faecal samples from two inputs: (1) the human reads in the gut metagenome data that were mapped to the human reference genome and (2) the genotype dataset (imputed SNP array data were used in this study) of multiple samples.
We extracted the SNP sites that were covered by at least one read and included in the reference panel using ‘bcftools mpileup’ with the ‘-T’ option. In this study we used a combined reference panel of the 1KG Project Phase 3 version 5 genotype (n = 104) and Japanese WGS data (n = 1,037) as the reference panel. To obtain independent SNP sites, we applied clumping to the list of SNPs that were covered by at least one read. We used the ‘--indep-pairwise 100 30 0.1’ option in PLINK for clumping at Rsq = 0.1. We then calculated the likelihood according to the model proposed by Li and colleagues72. Suppose an SNP site i was covered by \(n_\mathrm{i}\) reads in the gut metagenome data, of which \(k_\mathrm{i}\) reads were from the reference allele and \(n_\mathrm{i} - k_\mathrm{i}\) reads were from the alternative allele. In this study bcftools (version 1.15.1) was used to calculate the read coverage with the ‘-q 40 -Q 20’ option. The error probability of the read bases was \(\varepsilon\) and errors were assumed to be independent. In this study \(\varepsilon\) was set at 1 × 10−6 following a previously reported assumption72. At SNP site i, the number of alternative alleles of an individual j (\(g_{\mathrm{i,j}}\)) could be zero (Ref/Ref), one (Ref/Alt) or two (Alt/Alt), where Ref is the reference allele and Alt is the alternative allele. The likelihood (L) that the sample with \(g_{\mathrm{i,j}}\) alternative alleles at SNP site i produced the observed human reads in the gut metagenome data was expressed as
$$L_{\mathrm{i,j}}\left( {g_{\mathrm{i,j}},n_\mathrm{i},k_\mathrm{i}} \right) = \frac{1}{{2^{n_\mathrm{i}}}}\left[ {\left( {2 - g_{\mathrm{i,j}}} \right)\varepsilon + g_{\mathrm{i,j}}\left( {1 - \varepsilon } \right)} \right]^{n_\mathrm{i} - k_\mathrm{i}}\left[ {g_{\mathrm{i,j}}\varepsilon + \left( {2 - g_{\mathrm{i,j}}} \right)\left( {1 - \varepsilon } \right)} \right]^{k_\mathrm{i}}$$
When the clumping procedure retained N independent SNP sites, the log-transformed likelihood (likelihood score, LS) that a set of genotype data produced the observed human reads in the gut metagenome data was expressed as
$$LS_\mathrm{j} = \mathop {\sum }\limits_{i = 1}^N {{{\mathrm{log}}}}(L_{\mathrm{i,j}}(g_\mathrm{{i,j}},n_\mathrm{i},k_\mathrm{i}))$$
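A minimal sketch of the per-site likelihood and the resulting likelihood score is given below, assuming the per-site read counts (\(n_\mathrm{i}\), \(k_\mathrm{i}\)) and candidate genotypes (\(g_{\mathrm{i,j}}\)) have already been extracted from the pileup and the genotype dataset.

```python
# Minimal sketch of the per-site likelihood and the likelihood score LS_j.
import numpy as np

EPS = 1e-6   # assumed base error probability, as stated in the text

def site_log_likelihood(g: int, n: int, k: int, eps: float = EPS) -> float:
    """Log-likelihood that a genotype with g alternative alleles (0, 1 or 2)
    produced n reads of which k carry the reference allele."""
    p_alt = ((2 - g) * eps + g * (1 - eps)) / 2   # probability of an alternative-allele read
    p_ref = (g * eps + (2 - g) * (1 - eps)) / 2   # probability of a reference-allele read
    return (n - k) * np.log(p_alt) + k * np.log(p_ref)

def likelihood_score(genotypes, n_reads, k_ref) -> float:
    """LS_j: sum of per-site log-likelihoods over the N independent SNP sites."""
    return sum(site_log_likelihood(g, n, k)
               for g, n, k in zip(genotypes, n_reads, k_ref))
```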
Next, we drew the background distribution of the likelihood score from (1) the human reads in the gut metagenome data that were mapped to the human reference genome and (2) allele frequency data for the SNP sites used for calculating the likelihood score. Note that the ancestry of the allele frequency data should be matched to that of the samples. In this study the Japanese individuals in the combined reference panel of the 1KG Project Phase 3 version 5 genotype (n = 104) and Japanese WGS data (n = 1,037) were used to calculate the allele frequency. When the alternative allele frequency at SNP site i was \(p_\mathrm{i}\) and the number of alternative alleles was \(g_{\mathrm{i,pop}}\) (= 0, 1 or 2), the theoretical genotype frequencies at SNP site i were expressed as
$$P\left( {g_{\mathrm{i,pop}},p_\mathrm{i}} \right) = \left\{ {\begin{array}{*{20}{c}} {(1 - p_\mathrm{i})^2,} & {(g_{\mathrm{i,pop}} = 0)} \\ {2p_\mathrm{i}\left( {1 - p_\mathrm{i}} \right),} & {(g_{\mathrm{i,pop}} = 1)} \\ {p_\mathrm{i}^2,} & {(g_{\mathrm{i,pop}} = 2)} \end{array}} \right.$$
The expected log-transformed likelihood that genotype data randomly drawn from the specified population produced the observed human reads in the metagenome data was then expressed as
$$E\left( {LS_{\mathrm{pop}}} \right) = \mathop {\sum }\limits_{i = 1}^N E\left( {LS_{\mathrm{i,pop}}} \right) = \mathop {\sum }\limits_{i = 1}^N \mathop {\sum }\limits_{g_{\mathrm{i,pop}} = 0}^2 P\left( {g_{\mathrm{i,pop}},p_\mathrm{i}} \right){{{\mathrm{log}}}}(L_\mathrm{i}(g_{\mathrm{i,pop}},n_\mathrm{i},k_\mathrm{i}))$$
Given that the SNP sites were independent, the variance (V) of the likelihood score in a specific population was expressed as
$$\begin{array}{l}V\left( {LS_{\mathrm{pop}}} \right) = \mathop {\sum }\limits_{i = 1}^N V\left( {LS_{\mathrm{i,pop}}} \right) =\\ \mathop {\sum }\limits_{i = 1}^N \mathop {\sum }\limits_{g_{\mathrm{i,pop}} = 0}^2 P\left( {g_{\mathrm{i,pop}},p_\mathrm{i}} \right)[{{{\mathrm{log}}}}(L_\mathrm{i}\left( {g_{\mathrm{i,pop}},n_\mathrm{i},k_\mathrm{i}} \right)) - E\left( {LS_{\mathrm{i,pop}}} \right)]^2\end{array}$$
Using \(E(LS_{\mathrm{pop}})\) and \(V(LS_{\mathrm{pop}})\), we calculated the standardized likelihood score of individual j as \((LS_\mathrm{j} - E\left( {LS_{\mathrm{pop}}} \right))/\sqrt {V\left( {LS_{\mathrm{pop}}} \right)}\).
From the experiments with the simulated and real data, we found that the standardized likelihood score was almost normally distributed when the read and genotype data originated from different samples (Extended Data Figs. 3a and 5c). Therefore, we transformed the standardized likelihood scores to P values based on a normal distribution. Empirical P values were also calculated for \(LS_\mathrm{j}\) from 99,999 randomly simulated sets of genotype data at the SNP sites used for calculating the likelihood score. The genotypes at each site were independently drawn from the allele frequency data, and the empirical P value was defined as (1 + number of simulated genotype datasets with a likelihood score higher than \(LS_\mathrm{j}\)) / 100,000.
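The background distribution, the standardized score and both types of P value can be sketched as below; the sketch reuses site_log_likelihood() and likelihood_score() from the previous sketch, and the toy input arrays are illustrative placeholders.

```python
# Minimal sketch of the background distribution and P values for the likelihood score.
import numpy as np
from scipy.stats import norm

alt_freqs = np.array([0.1, 0.3, 0.45])   # population alternative-allele frequencies p_i
n_reads = np.array([1, 2, 1])            # reads covering each SNP site (n_i)
k_ref = np.array([1, 1, 0])              # reference-allele reads at each site (k_i)
ls_j = likelihood_score([0, 1, 2], n_reads, k_ref)   # score for one candidate genotype set

def genotype_freqs(p):
    """Hardy-Weinberg genotype frequencies for g = 0, 1 and 2 alternative alleles."""
    return np.array([(1 - p) ** 2, 2 * p * (1 - p), p ** 2])

# expected value and variance of the score under the population allele frequencies
e_total = v_total = 0.0
for p, n, k in zip(alt_freqs, n_reads, k_ref):
    logl = np.array([site_log_likelihood(g, n, k) for g in (0, 1, 2)])
    w = genotype_freqs(p)
    e_site = np.sum(w * logl)
    e_total += e_site
    v_total += np.sum(w * (logl - e_site) ** 2)

z = (ls_j - e_total) / np.sqrt(v_total)   # standardized likelihood score
p_normal = norm.sf(z)                     # P value from the normal approximation

# empirical P value from 99,999 genotype datasets drawn from the allele frequencies
rng = np.random.default_rng(0)
sim_scores = [likelihood_score([rng.choice(3, p=genotype_freqs(p)) for p in alt_freqs],
                               n_reads, k_ref)
              for _ in range(99_999)]
empirical_p = (1 + sum(s > ls_j for s in sim_scores)) / 100_000
```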
Re-identification from a set of genotype data based on simulated human read data
To test the performance of the likelihood score-based re-identification, we simulated human reads by downsampling five WGS datasets to human genome coverages of 1 × 10−5, 2 × 10−5, 5 × 10−5, 1 × 10−4, 2 × 10−4, 5 × 10−4 and 1 × 10−3× using SAMtools. We generated 25 datasets (5 samples × 5 seeds) for each coverage. We then calculated the likelihood scores using the downsampled read data and imputed SNP array data of 100 samples, which included the five samples used to make downsampled read data. We also simulated human reads using three WGS datasets that had familial relationships and repeated the same experimental procedures to evaluate how the familial relationship affected the likelihood score. As a reference panel for defining the SNP sites and calculating the background distribution of the likelihood score, we used Japanese individuals in the combined reference panel of the 1KG Project Phase 3 version 5 genotype and Japanese WGS data throughout all of the simulation analyses.
Using the same sample set, we performed another simulation experiment using only SNP sites whose MAF was within specific ranges, namely (0, 0.05], (0.05, 0.1], (0.1, 0.2], (0.2, 0.3], (0.3, 0.4] and (0.4, 0.5]. For the five unrelated samples used in the simulation experiment, we simulated human reads by downsampling the five WGS datasets to a human genome coverage of 1 × 10−2× with five seeds using SAMtools. We then calculated the likelihood score and its background distribution based on 100, 150, 200, 300, 400, 500, 750, 1,000, 1,500 and 2,000 randomly selected bases on the human reads that covered the SNPs within the specified MAF range.
We also evaluated how non-host human DNA contamination would affect the re-identification performance using the same sample set. We made use of another five WGS datasets as sources of the simulated contamination. Next, we mixed two sets of WGS data—one from the original sample set and the other from the contamination dataset—in specific ratios (original:contamination of 100:0, 90:10, 75:25, 50:50, 25:75 and 10:90) and downsampled them to human genome coverages of 1 × 10−5, 2 × 10−5, 5 × 10−5, 1 × 10−4, 2 × 10−4, 5 × 10−4 and 1 × 10−3× with five seeds using SAMtools. Using the simulated contaminated data and the imputed SNP array data of 99 samples, which included the original sample set but excluded the sample used as the contamination source, we conducted the re-identification analysis.
For the simulated human read data, we also performed the simulation analysis with a wide range of genotype data sizes rather than the imputed SNP array data of 100 samples. We randomly extracted 5,000 unrelated individuals (PIHAT < 0.185) from the BioBank Japan (BBJ) second cohort data, which were also genotyped using the Infinium Asian screening array. Quality control and genotype imputation of the BBJ second cohort data were performed simultaneously with the genotype data of the 343 in-house samples. Next, we combined the genotype data of 9, 19, 49, 99, 199, 499, 999, 1,999 and 4,999 individuals from the BBJ second cohort data with the genotype data of the individual for whom the simulated human read data were generated, to create a genotyping dataset for each set of simulated human read data. Finally, we performed the re-identification analysis with the combined genotype dataset and calculated the sensitivity and specificity.
Re-identification from a set of genotype data based on the human reads in the gut metagenome data of 343 individuals
In the experiment with real data, we used 343 samples that had both the gut metagenome data and imputed SNP array data. As a reference panel for defining the SNP sites and calculating the background distribution of the likelihood score, we used Japanese individuals in the combined reference panel of the 1KG Project Phase 3 version 5 genotype and Japanese WGS data throughout all of the real data analyses. We identified gut metagenome data–genotype data (imputed SNP array data was used in this study) pairs derived from the same individuals based on the following two different methods and evaluated the sensitivity and specificity: (1) genotype data with the highest likelihood score and (2) genotype data with P < 4.25 × 10−7 (0.05 / (343 × 343)).
We performed the analysis with a wide range of genotype data sizes, as was done for the simulated dataset. We randomly extracted the 5,000 unrelated individuals (PIHAT < 0.185) from the BBJ second cohort data, which were also genotyped using the Infinium Asian screening array. Next, we combined the genotype data of 9, 19, 49, 99, 199, 499, 999, 1,999 and 4,999 individuals from the BBJ second cohort data with the genotype data of the individual with the targeted metagenome data to create a genotyping dataset for each set of metagenome data. Finally, we performed the re-identification analysis with the combined genotype dataset and calculated the sensitivity and specificity.
We performed a re-identification analysis with the human reads in the 343 gut metagenome datasets after human read filtering. We performed human read filtering using several different software packages—that is, BMTagger, Kraken 2 (ref. 73), HISAT2 (ref. 74), BWA-MEM75, Bowtie 2 (ref. 64) and KneadData76. All software packages were run using default settings with the reference database constructed from the same human reference genome (GRCh37, human_g1k_v37_decoy). Given that all software packages, except KneadData, required read pre-processing such as adaptor trimming, we used Trimmomatic and Fastp77 before human read masking. Trimmomatic was used with the same settings as in the human read-extraction pipeline. Fastp was used with the ‘-3 -5 -w --adaptor TruSeq3-PE-2.fa’ options. We first extracted the human reads, which were detected by our pipeline, from the original FASTQ files. For the extracted unprocessed human reads, we evaluated whether they were removed by each human read filtering method. Given that all software packages, except KneadData, output only the mapping or classification results, for these packages we kept only reads for which both paired ends failed to align to the human reference genome. Using the human reads that were retained even after human read filtering, we performed a re-identification analysis in the same setting as described earlier.
Prediction of the ancestry of gut metagenome data
We calculated the likelihood that the observed human reads in the gut metagenome data were derived from each ancestry using two inputs: (1) the human reads that were mapped to the human reference genome and (2) allele frequency data for multiple ancestries. In this study we used a 1KG dataset processed with ‘bcftools view -m2 -M2’ to calculate the allele frequencies of the AMR, EUR, AFR, EAS and SAS populations. For the definition of the five ancestries, the following 2,504 samples in the 1KG dataset were used. AMR: Colombians in Medellin, Colombia (n = 94); people with Mexican ancestry in Los Angeles, CA, USA (n = 64); Peruvians in Lima, Peru (n = 85); and Puerto Ricans in Puerto Rico (n = 104). EUR: Utah residents (CEPH) with Northern and Western European ancestry (n = 99); British individuals in England and Scotland (n = 91); Finnish individuals in Finland (n = 99); Iberian populations in Spain (n = 107); and Toscani in Italy (n = 107). AFR: Esan individuals in Nigeria (n = 99); Gambian individuals in Western Division, Mandinka (n = 113); Luhya individuals in Webuye, Kenya (n = 99); Mende individuals in Sierra Leone (n = 85); Yoruba individuals in Ibadan, Nigeria (n = 108); African Caribbean individuals in Barbados (n = 96); and people with African ancestry in Southwest USA (n = 61). EAS: Chinese Dai individuals in Xishuangbanna, China (n = 93); Han Chinese individuals in Beijing, China (n = 103); Southern Han Chinese individuals (n = 105); Japanese individuals in Tokyo, Japan (n = 104); and Kinh individuals in Ho Chi Minh City, Vietnam (n = 99). SAS: Bengali individuals in Bangladesh (n = 86); Gujarati Indians in Houston, TX, USA (n = 103); Indian Telugu individuals in the UK (n = 102); Punjabi individuals in Lahore, Pakistan (n = 96); and Sri Lankan Tamil individuals in the UK (n = 102). We extracted the SNP sites covered by human reads in the gut metagenome data from the allele frequency data.
When the alternative allele frequency at SNP site i in population A is \(p_{\mathrm{i,A}}\) and the number of alternative alleles is \(g_\mathrm{i}\), the theoretical genotype frequencies at SNP site i in population A are expressed as
$$P\left( {g_\mathrm{i},p_{\mathrm{i,A}}} \right) = \left\{ {\begin{array}{*{20}{c}} {(1 - p_{\mathrm{i,A}})^2,} & {(g_\mathrm{i} = 0)} \\ {2p_{\mathrm{i,A}}\left( {1 - p_{\mathrm{i,A}}} \right),} & {(g_\mathrm{i} = 1)} \\ {p_{\mathrm{i,A}}^2,} & {(g_\mathrm{i} = 2)} \end{array}} \right.$$
In the same setting as for the re-identification from a set of genotype data, the expected value of the likelihood score that the human reads were derived from population A was expressed as
$$E(LS_\mathrm{A}) = \mathop {\sum }\limits_{i = 1}^N E\left( {LS_{\mathrm{i,A}}} \right) = \mathop {\sum }\limits_{i = 1}^N \mathop {\sum }\limits_{g_\mathrm{i} = 0}^2 P\left( {g_\mathrm{i},p_\mathrm{{i,A}}} \right){{{\mathrm{log}}}}(L_\mathrm{i}(g_\mathrm{i},n_\mathrm{i},k_\mathrm{i}))$$
In this study expected likelihood scores were calculated for all five superpopulations in the 1KG dataset, namely AMR, EUR, AFR, EAS and SAS. The superpopulation with the highest expected likelihood score was defined as the predicted ancestry.
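A minimal sketch of this prediction step is given below; it reuses site_log_likelihood() and genotype_freqs() from the earlier sketches, and pop_freqs is a placeholder mapping each superpopulation to its allele frequencies at the covered SNP sites.

```python
# Minimal sketch of the ancestry prediction: compute the expected likelihood score
# E(LS_A) for each 1KG superpopulation and report the population with the highest score.
import numpy as np

def expected_score(alt_freqs, n_reads, k_ref):
    total = 0.0
    for p, n, k in zip(alt_freqs, n_reads, k_ref):
        logl = np.array([site_log_likelihood(g, n, k) for g in (0, 1, 2)])
        total += np.sum(genotype_freqs(p) * logl)
    return total

# pop_freqs maps each superpopulation to its alternative-allele frequencies at the
# SNP sites covered by the human reads, e.g. {"EAS": [...], "EUR": [...], ...}
def predict_ancestry(pop_freqs, n_reads, k_ref):
    scores = {pop: expected_score(freqs, n_reads, k_ref)
              for pop, freqs in pop_freqs.items()}
    return max(scores, key=scores.get)
```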
For the simulation experiment, we downloaded the WGS data of 250 samples (50 individuals × 5 populations) from the Human Genome Diversity Project (Supplementary Table 6). We randomly extracted 300,000 paired-end reads for each sample, processed them with the human read-extraction pipeline and downsampled them to 1 × 10−5, 2 × 10−5, 5 × 10−5, 1 × 10−4, 2 × 10−4, 5 × 10−4 and 1 × 10−3× human genome coverage. The ancestries of the downsampled read data were then predicted based on the above model and the true prediction ratio was calculated. PCA of the SAS samples in the Human Genome Diversity Project dataset was performed with previously published genotype data78 (downloaded from ftp://ngs.sanger.ac.uk/production/hgdp/hgdp_wgs.20190516). The genotype data were merged with the 1KG dataset, and SNPs in the HapMap 3 dataset (downloaded from the GATK Resource Bundle) were extracted. Finally, the merged genotype data were subjected to pruning (PLINK --indep-pairwise 50 5 0.8) and PCA (PLINK --pca).
For the in-house Japanese (EAS) and multi-ancestry (EUR, EAS and SAS) datasets, we used 343 and 73 samples, respectively. In addition, we downloaded three publicly available gut metagenome datasets previously published by Karlsson et al. (EUR; accession number: ERP002469)21, Zhu et al. (EAS; accession number: ERP111403)22 and Dhakan et al. (SAS; accession number: SRP114847)23. The downloaded datasets were processed with the human read-extraction pipeline. Next, the ancestries of the gut metagenome data were predicted with the above model and the true prediction ratio was calculated. The in-house EAS and EUR samples were defined based on self-reported ancestries, and their ancestries were confirmed by the genotype PCA plots (Extended Data Figs. 5a and 8d). The ancestries of the in-house SAS samples (India) and the publicly available datasets (EUR, Sweden; EAS, China; SAS, India) were based on their countries of origin.
Two-step genotype imputation from ultra-deep metagenome data
We used a two-step imputation pipeline previously described by Homburger et al.25 with slight modifications. First, we calculated the genotype likelihoods of SNPs from the human reads in the ultra-deep metagenome data. The mpileup function (with the ‘-q 20 -Q 20’ options) followed by the call function in bcftools was used for this calculation. For the SNP sites covered by at least one read, the genotype likelihoods were updated based on the reference data using the genotype likelihood option implemented in BEAGLE 4.1 (ref. 79). Finally, a second round of imputation was performed using BEAGLE 5.0 (ref. 80) to generate genotypes at all of the remaining untyped sites. We used the combined reference panel of the 1KG Project Phase 3 version 5 genotype and Japanese WGS data, as in the genotype imputation of the SNP array data. The concordance ratio to the SNP array data (not imputed), stratified according to the genotypes in the SNP array and the MAF calculated from the Japanese individuals in the reference panel, was calculated. The concordance ratio to the WGS data was also calculated for previously reported disease-associated SNPs (inflammatory bowel diseases27 and type 2 diabetes28; PGWAS < 5 × 10−8). Independent disease-associated SNPs were selected using the --clump command of PLINK (--clump-r2 0.1 and --clump-kb 250). Finally, SNPs genotyped in the WGS analysis were extracted for the concordance check.
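A minimal sketch of the two-step pipeline as shell commands run from Python is shown below; the file and jar names are placeholders, and any options beyond those stated above (‘-q 20 -Q 20’, the genotype likelihood mode of BEAGLE 4.1 and the subsequent BEAGLE 5.0 run) are illustrative assumptions rather than the exact commands used in the study.

```python
# Minimal sketch of the two-step imputation pipeline; file and jar names are placeholders.
import subprocess

def run(cmd: str) -> None:
    subprocess.run(cmd, shell=True, check=True)

# 1) genotype likelihoods from the human reads mapped to the reference genome
run("bcftools mpileup -q 20 -Q 20 -f human_g1k_v37_decoy.fasta human_reads.bam "
    "| bcftools call -m -O z -o genotype_likelihoods.vcf.gz")

# 2) first step: update likelihoods at covered sites with BEAGLE 4.1 (gl mode)
run("java -jar beagle41.jar gl=genotype_likelihoods.vcf.gz "
    "ref=combined_reference_panel.vcf.gz out=step1")

# 3) second step: impute the remaining untyped sites with BEAGLE 5.0
run("java -jar beagle50.jar gt=step1.vcf.gz "
    "ref=combined_reference_panel.vcf.gz out=step2")
```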
Genotype calling from ultra-deep metagenome data without imputation
We performed variant calling without two-step imputation using GATK (version 4.1.7.0). The human reads in the gut metagenome data that were mapped to the human reference genome were subjected to analysis using HaplotypeCaller with default parameters. Hard filtering with the following parameters was applied: QD < 2.0, DP < 2.0, QUAL < 30.0, SOR > 3.0, FS > 60.0, MQ < 40.0, MQRankSum < −12.5 and ReadPosRankSum < −8.0. We applied the following additional filter for the per-allele depths based on the result of the concordance check with the WGS data (Supplementary Fig. 10a): alternative allele depth ≥ 2 and reference allele depth ≥ 2 for the Ref/Alt call, and alternative allele depth ≥ 6 and reference allele depth = 0 for the Alt/Alt call. Note that Ref/Ref sites were not called in this pipeline because joint-calling was not applicable.
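The additional per-allele depth filter can be expressed as a small predicate; the sketch below assumes the called genotype and the per-allele depths (the AD field) are already available for each variant.

```python
# Minimal sketch of the additional per-allele depth filter applied to the hard-filtered calls.
def passes_allele_depth_filter(genotype: str, ref_depth: int, alt_depth: int) -> bool:
    if genotype == "Ref/Alt":     # heterozygous call
        return alt_depth >= 2 and ref_depth >= 2
    if genotype == "Alt/Alt":     # homozygous alternative call
        return alt_depth >= 6 and ref_depth == 0
    return False                  # Ref/Ref sites are not called in this pipeline
```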
For the variants called in both the WGS data and the ultra-deep metagenome data, the concordance ratio was calculated with stratification based on the called genotypes in the metagenome data and MAF in the WGS dataset. For the visualization of the reads mapped to the variant sites, we used IGV (version 2.12.2)81.
Statistics and reproducibility
No statistical method was used to pre-determine sample size. No data were excluded from the analyses. We did not use any study design that required randomization or blinding.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.