RNA profiles reveal signatures of future well being and illness in being pregnant

The Mirvie RNA technology

cfRNA isolation

Plasma samples received on dry ice from our collaborators were stored at -80 °C until further processing. Total circulating nucleic acid was extracted from plasma ranging in volume from ~215 µl to 1 ml, using a column-based commercially available extraction kit, following the manufacturer’s instructions (Plasma/Serum Circulating and Exosomal RNA purification kit, Norgen, 42800).

Following extraction, cfDNA was digested using Baseline-ZERO DNase (Epicentre) and the remaining cfRNA was purified using an RNA Clean and Concentrator-5 kit (Zymo, R1016) or an RNeasy MinElute Cleanup kit (Qiagen, 74204).

RT-qPCR assay

We performed PCR with reverse transcription (RT–qPCR) analysis to assess the relative amount of cfRNA extracted from each sample. We measured and compared the threshold cycle (Ct) values ​​from each RNA sample using a three-colour multiplex qPCR assay from the TaqPath 1-Step Multiplex Master Mix kit (ThermoFisher Scientific, A28526) and a Quant Studio 5 system Ct values ​​for an endogenous housekeeping gene (ACTB; ThermoFisher Scientific, 4351368).

cfRNA library preparation

cfRNA libraries were prepared using the SMARTer Stranded Total RNAseq-Pico Input Mammalian kit (Takara, 634418) following the manufacturer’s instructions, except that we did not use ribo depletion. Library quality was assessed by RT–qPCR following the method described for assessing RNA measurements and fragment analysis on a Fragment Analyzer 5300 (Agilent Technologies).

Enrichment and sequencing

Libraries were normalized before pooling for target capture. We used a SureSelect Target Enrichment kit (Agilent Technologies, 5190-8645) and followed the manufacturer’s instructions for hybrid capture. Samples were quantified, and 50-bp, paired-end sequencing was performed on a Novaseq S2. Between 96 and 144 samples were pooled and sequenced per sequencing run.

Analysis for outliers

qPCR of ACTB as well as MultiQC sequencing metrics were monitored to eliminate sample outliers before performing gene expression analyses. Individual samples more than 3 sd from the mean were removed as outliers. A total of 193 of 2,732 samples (7.1%) were removed following this filtering.

Read processing

Reads were processed following a similar protocol to that reported in Ngo et al.37. Briefly, raw sequencing reads were trimmed using trimmomatic38 and then mapped to hg38 using the STAR aligner39. After removing duplicates using Picard tools, gene counts were generated with htseq40.

Cohort correction and feature normalization

For each gene, its relationship to total counts per sample was measured and corrected using linear model residuals. Extended Data Fig. 5a, b shows what this looks like for the gene ACTB.

We also sought to correct the genes such that each cohort had the same mean value for each gene. However, the cohorts came from different parts of the gestational age spectrum. Therefore, only cohort effects orthogonal to the gestational age effect were corrected. This is shown in Extended Data Fig. 5c, d for the gene CAPN6. Each cohort was given its own colour.

Cohort E (bright yellow) had unusually low counts for his gestational age range before correction, and this effect was removed by correction.

Using principal-component analysis (PCA) to compress the high-dimensional space of all genes, the correction could be seen to clarify the separation of samples by gestational age as indicated by the color gradient (Extended Data Fig. 5e, f).

Linear correction algorithm

1. In the training, correct for (remove the effect of) the variable(s) of interest (eg, gestational age) using linear model residuals.

2. Learn the required correction for the variables you wish to correct for in this corrected training data.

3. The residuals of that model (in the raw training and testing data) are your corrected data.

Note: the correction was learned entirely in the training data and the variable of interest in the testing data was never used, negating the possibility of a data leak.

Lasso linear model for gestational age prediction and ANOVA

The Lasso model used in the gestational age model had its parameters chosen via 10-fold cross-validation in the training set. The largest cross-validation score within one standard error of the best cross-validation score was chosen (Breiman strategy). We limited our feature space by excluding pseudogenes and non-coding genes, as well as genes with median expression greater than zero, leaving a total of 13,208 features to evaluate. A final lasso with this was then trained on the whole training set and evaluated in the test set. This was all done with the glmnet R package using the cv.glmnet() function.

The model uses 674 of the available gene features (Supplementary Data 1), although this includes a long tail of features with low contribution. We tested performance for the 50 most informative features from the model and obtained a mean absolute error of 15.4 days. The continued reduction in error as we reached our complete training set of n = 1,908 samples indicated that model learning was not exhausted and that additional samples would have increased performance (Extended Data Fig. 6). Notably, as seen in Extended Data Fig. 6, the similar performance in cross-validation and on the independent held-out test data indicated that the model was not overfit with the 674 gene features. To determine how far the model could be extrapolated, a final model was built using all data; this gave a mean absolute error of 13 days across the entire dataset.

Gestational age learning curve

The main gestational age modeling was done with an 80/20 train/test split. To assess model performance after decreasing amounts of training data, one can repeat analyzes with 70/30 splits, 60/40 splits and so on (doing so repeatedly with different random splits to quantify uncertainty). In this way, one builds a learning curve (Extended Data Fig. 6) with different training set sizes on the x axis and model performance on the y axis.

Gestational age model without cohort correction

For this approach, we selected all samples from healthy pregnancies and split the dataset into a training set (80% of data) and a test set (20% of data), in which samples were stratified by cohort. Samples that did not pass quality-control filtering based on basic sequencing metrics had been previously excluded from analysis. We trained a Lasso model to predict the gestational age at collection for each sample using the mean absolute error as an optimization metric and 10-fold cross-validation in the training set. We used all genes with mean log2(counts per million (CPM) + 1) > 1 (12,921 genes) plus a set of sequencing metrics as features for training. Modeling was performed in log2(CPM + 1) space, and all data were centered and scaled before modeling using the training set statistics. This led to a model with a mean absolute error of 15.9 days in the withheld test set using 487 transcriptomic features. We then selected the top 53 features of this model and retrained the Lasso using the same approach described above, achieving a mean absolute error of 16.6 days in the held-out test set.

Gene set enrichment analysis

Gene set enrichment analysis (GSEA)11,41 was done with the fast GSEA algorithm42 using Bioconductor’s fgsea package43. Gene sets were compiled from the Molecular Signatures Database (MSigDB)11,12 using the CRAN msigdbr v7.2 API and directly from c8.all.v7.3.symbols.gmt. We focused on two collections of gene sets: the Gene Ontology (GO) subcollection of the ontology gene sets, C5:GO, and the cell type signature gene sets, C8 v7.3. Genes were ranked on the basis of their shrunken log-transformed fold change values ​​and associated Wald test P values ​​obtained from analysis of differential expression using Bioconductor’s DESeq2 (ref. 44), represented as –log10(P value) × shrunkenLFC. GSEA was carried out on 372 samples from cohort H collected from 93 women with healthy pregnancies over four draw intervals during pregnancy, 11.4−14 weeks, 18−21 weeks, 22.8−27.8 weeks and 29.2−34.8 weeks. Shrunken log-transformed fold change values ​​and corresponding P values ​​were obtained from all six pairwise contrasts between the four draws. We used 102 fetal gene sets that were significantly enriched (Benjamini-Hochberg adjusted P < 0.01) in at least one pairwise comparison (Supplementary Table 2) in downstream analyses, including analysis of plasma transcriptome partitioning and set-specific longitudinal trends.

Using a GO collection of gene sets, we validated our approach and identified seven pregnancy-related sets that were significantly enriched in the comparison between early- and late-pregnancy samples (Extended Data Figure 1). Three gene sets in the gonadotropin and estrogen pathways exhibited significant changes consistent with known physiology45.

Evaluating changes in plasma transcriptome partitioning

The plasma transcriptome can be viewed phenomenologically as being partitioned into characteristic sets of genes. We assessed this partitioning in each cfRNA sample by converting raw gene counts to CPM and summing CPM over all genes in each of the sets. The resulting cumulative CPM score, which is a relative measure of the abundance of each gene set in the overall transcriptome, was used to directly compare gene sets across collection time points. Cumulative CPM scores for all gene sets significantly enriched between collections 1 and 4 were calculated for every cfRNA sample. The scores for each sample were regressed onto the recorded gestational age (in weeks) using a linear model. Gene sets with an adjusted P value for the gestational age coefficient <0.01 were considered as having a significant (positive or negative) trend in their relative abundance. The association of these trends with the time component in the data was further verified by scrambling the temporal structure and re-examining the trends along the original time variable. For each mother, we also evaluated the monotonicity of the cumulative CPM score function along the collection times. Because there are 24 possible permutations of order for the four collection times and only one of those permutations allows for a monotonic upward trend (with one for a downward trend), we were able to analytically assess the significance of the observed number of monotonic trends among 93 mothers using a chi-squared test.

Preeclampsia analysis and learning curve

CIs for AUCs and sensitivity, specificity and PPV were all found via bootstrapping. PPV was calculated as PPV = (sensitivity × prevalence)/((sensitivity × prevalence) + ((1 – specificity) × (1 – prevalence))).

To build the learning curve (Extended Data Fig. 7), we increased the size of the training set going from two- to ninefold cross-validation with a constant model: logistic regression with gene features chosen by Spearman correlation tests with an adjusted P- value threshold of 0.05. The point on the right connected to the learning curve via a dashed line is the leave-one-out cross-validation result shown in the main text.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.

Comments are closed.