Results
Classifying collagens and elastin according to expression profile in human cells
In human cells, the abundance and localization of a protein determine its function and fate. Although the functions of collagens and elastin have been summarized in previous studies, a comprehensive comparison of the relevant data is still lacking. To characterize the expression levels of collagens and elastin across human tissues and cells, we analyzed and plotted transcriptome, single-cell transcriptome, and immunohistochemical proteome data for various tissues from 8 human databases, generating a systematic expression profile of the collagen and elastin family.
Cluster analysis of the tissue-specific RNA-seq data showed that the RNA expression of the collagen family could be divided into three categories, namely clusters I, II and III. Cluster I comprises the highly expressed collagens present in various tissues, with the highest expression found mainly in sexual tissue, muscle and epidermis. This cluster contains three subclasses. Subclass 1, including collagen I (1A1, 1A2), III (3A1), and VI (6A1, 6A2), has a higher expression level and stronger tissue specificity than subclass 2, which contains collagen IV (4A1, 4A2), XVI (GA1), and XVIII (IA1), in female sexual tissue, endometrium, muscle, gallbladder, and adipose tissue. The expression pattern of elastin (ELN) is similar to that of subclass 2. Subclass 3, comprising collagen V (5A1, 5A2), VI (6A3), XII (CA1), XIV (EA1), XV (FA1), and XVII (HA1), has tissue specificity similar to subclass 1 and an expression level similar to subclass 2. Cluster II comprises the low-expression collagens and contains two subclasses. The first subclass includes collagen IV (4A3, 4A4, 4A6), VII (7A1), VIII (8A1, 8A2), XIII (DA1), XXI (LA1), and XXVIII (SA1), which usually exhibit weak tissue specificity, and the second subclass includes collagen IV (4A5), V (5A3), IX (9A2 and 9A3), and XXVII (RA1), which exhibit high tissue specificity in brain, eyes and female sex tissues. Cluster III comprises the ultralow-expression collagens that are expressed only in a limited number of tissues, including collagen II (2A1), VI (6A5, 6A6), IX (9A1), X (AA1), XI (BA1), XIX (JA1), XX (KA1), XXII (MA1), XXIII (NA1), XXIV (OA1), XXV (PA1), and XXVI (QA1). This classification was further validated by analyzing tissue RNA-seq data from other databases.
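The grouping of genes by expression profile described above can be illustrated with a minimal sketch. It assumes log2(TPM + 1) transformation, Euclidean distance and average-linkage agglomerative clustering; the text does not state which clustering method or distance was used, and the TPM values below are hypothetical toy numbers rather than data from the study.

```python
import math

# Hypothetical TPM values in three tissues (illustrative only, not study data).
toy_tpm = {
    "CO1A1": [900.0, 1200.0, 800.0],
    "CO3A1": [700.0, 1000.0, 600.0],
    "CO4A3": [8.0, 12.0, 10.0],
    "CO8A1": [10.0, 9.0, 14.0],
    "CO2A1": [0.1, 0.0, 0.2],
    "COAA1": [0.0, 0.1, 0.1],
}


def log_profile(values):
    """Compress the dynamic range so that high-TPM genes do not dominate."""
    return [math.log2(v + 1.0) for v in values]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage_clusters(profiles, k):
    """Merge the two closest clusters repeatedly until only k remain."""
    clusters = [[name] for name in profiles]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average pairwise distance between members of the two clusters.
                total = sum(
                    euclidean(profiles[a], profiles[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                dist = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


profiles = {gene: log_profile(v) for gene, v in toy_tpm.items()}
clusters = sorted(average_linkage_clusters(profiles, k=3))
print(clusters)
```

With these toy values the three groups correspond to a high, a low and an ultralow expression tier, mirroring the three-cluster structure described in the text.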
To obtain a more accurate profile of collagens and elastin, single-cell RNA sequencing (scRNA-seq) data from various human tissues were collected (Uhlen et al., 2010). Based on their expression-level profiles, the genes were divided into four clusters, a pattern similar to that of the tissue RNA-seq data.
When immunohistochemical data were further used to map the expression profile of collagen proteins (Thul and Lindskog, 2018; Uhlen et al., 2010), the protein expression profile was consistent with the transcriptomic data from both tissues and single cells, confirming the type differences and tissue specificity of the collagen family expression pattern. Because immunohistochemistry is limited by antibody performance and experimental difficulty, mass spectrometry data were also included to reflect the expression specificity of collagens in different tissues more reliably. The protein abundance of the collagen family correlated well with RNA abundance, indicating that the expression level of collagen may depend to a large extent on the level of gene transcription.
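One way to quantify the agreement between protein and RNA abundance is a rank correlation. The sketch below computes a Spearman coefficient in plain Python; the choice of Spearman correlation is an assumption (the text does not name the statistic used), and the function names are illustrative.

```python
import math


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; raises ValueError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0.0 or sd_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

For a given collagen, `x` could hold RNA abundance per tissue and `y` the matching protein abundance; a coefficient near 1 would indicate that tissues rank similarly at both levels.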
Previously, collagen proteins were mostly reported to be oxidatively degraded with age, eventually leading to the loss of collagen fibers (Avila Rodriguez et al., 2018; Shoulders and Raines, 2009). However, no studies have examined this effect at the RNA level. In the current work, we evaluated the RNA expression of the collagen family at different stages of human development and found that the expression of most collagen RNAs decreased significantly with age, probably due to reduced transcription and RNA stability. Moreover, correlation analysis between skin care function-related genes and the collagen family in epidermal fibroblasts showed that collagen family expression correlated significantly and positively with the expression of epidermal cell proliferation and growth factor genes, indicating that collagens play important roles in epidermal cell activity.
As a result, we classified the collagen family and elastin. A noteworthy cluster containing collagen I, II, III, IV, VI, XI, XV, XVI, XVII, XVIII and elastin (hereafter “primary collagens”) shares similar and relatively high expression characteristics, and its members were taken as targets for our downstream analysis.
Tertiary structure analysis of the primary collagens
Collagen I and III are the recombinant collagens most commonly used on the market because of their high expression in the human body (Sorushanova et al., 2019). Collagen I is a trimeric protein consisting of two A1 chains and one A2 chain, and its tertiary structure depends mainly on the disulfide bonds in the C-terminal domain and the hydrogen bonds and salt bridges in the triple-helical region. AlphaFold2 model analysis identified the triple-helical region of CO1A1 in collagen I as flexible and unable to fold into a fixed structure. A similar phenomenon was observed in other highly expressed collagens, such as III, IV, and VI. Intrinsically disordered regions (IDRs) of proteins often lead to protein aggregation and liquid-liquid phase separation, which can strengthen the assembly of collagen trimers (Brocca et al., 2020). We used IUPred3 to predict the IDRs in collagens (Erdos et al., 2021) and found that the triple-helical regions are the dominant IDRs and are highly conserved across species in all collagens. Although the expression pattern of elastin is similar to that of collagen, its structure is quite different (low proline content and short IDR sequence), which may be why elastin cannot assemble into trimeric fibers.
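Disorder predictors such as IUPred3 return one score per residue between 0 and 1, and residues scoring above 0.5 are conventionally treated as disordered. The sketch below shows how contiguous IDRs could be extracted from such a score list; the function name, the `min_len` filter and the toy scores are illustrative assumptions, not the settings used in this work.

```python
def disordered_regions(scores, threshold=0.5, min_len=1):
    """Return 1-based inclusive (start, end) spans where score > threshold."""
    regions = []
    start = None
    for pos, score in enumerate(scores, start=1):
        if score > threshold:
            if start is None:
                start = pos  # open a new disordered stretch
        elif start is not None:
            if pos - start >= min_len:
                regions.append((start, pos - 1))
            start = None
    # Close a stretch that runs to the end of the sequence.
    if start is not None and len(scores) - start + 1 >= min_len:
        regions.append((start, len(scores)))
    return regions


# Hypothetical per-residue scores for a 7-residue toy sequence.
toy_scores = [0.2, 0.6, 0.7, 0.8, 0.3, 0.9, 0.9]
print(disordered_regions(toy_scores, min_len=2))  # [(2, 4), (6, 7)]
```

Raising `min_len` discards short stretches, which is one simple way to separate long triple-helical IDRs from short disordered segments such as those described for elastin.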
Pathological and physiological function analysis of the primary collagens
As collagens are the most abundant proteins in the human body, their abnormal expression and mutation can lead to disease. Using the ClinVar and UniProt databases, we counted and examined the point mutations in collagen protein-coding regions and found that most disease-related missense mutations occurred in the triple-helical region. Of these, glycine point mutations account for the vast majority, and these pathogenic glycine mutations occur mainly near hydroxylated prolines and within the GXYGXY repeating sequence, which is the dominant structure of the collagen triple-helical region and the key feature of the IDR. Glycine mutations in the GXYGXY repeating sequence may severely alter the properties of collagen trimers and thereby cause pathological changes such as Ehlers-Danlos syndrome and osteogenesis imperfecta (Bristow et al., 2005). Compared with collagen I and III, collagen IV (CO4A1, A2 and A3) had a lower frequency of pathogenic mutations, suggesting a stronger functional tolerance for mutations. However, this conclusion was not supported by the data for collagen XV (COFA1) and XVI (COGA1), which indicated that their abnormalities are only weakly pathogenic. Notably, almost all disease-causing mutations in collagen XVIII (COIA1) are found at its C-terminus, which may serve specific functions that distinguish collagen XVIII from other collagen types.
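The positional counting described above can be sketched as follows: each mutated residue is assigned to the G, X or Y position of a GXY repeat run, or marked as lying outside any run. This is a minimal illustration, assuming a run is defined as at least two consecutive G-X-Y triplets; the sequence and variant positions are hypothetical and are not ClinVar or UniProt records.

```python
import re
from collections import Counter

# At least two consecutive Gly-X-Y triplets (the GXYGXY motif and longer runs).
GXY_RUN = re.compile(r"(?:G..){2,}")


def classify_position(sequence, position):
    """Classify a 1-based residue position as 'G', 'X', 'Y' or 'outside'."""
    index = position - 1
    for match in GXY_RUN.finditer(sequence):
        if match.start() <= index < match.end():
            return ("G", "X", "Y")[(index - match.start()) % 3]
    return "outside"


def count_by_class(sequence, positions):
    return Counter(classify_position(sequence, p) for p in positions)


# Hypothetical sequence: residues 4-12 form the run GPP-GER-GPQ.
toy_sequence = "MKTGPPGERGPQAAA"
toy_variant_positions = [4, 7, 9, 14]
print(count_by_class(toy_sequence, toy_variant_positions))
```

In this toy case two variants fall on the glycine of a triplet, one on a Y position and one outside the repeat, which is the kind of tally needed to see whether glycine substitutions dominate.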
We evaluated the relationship between collagen family expression levels and the survival of cancer patients using The Cancer Genome Atlas (TCGA) and the Gene Expression Profiling Interactive Analysis (GEPIA) database (Hoadley et al., 2018; Tang et al., 2017). Patients with high expression of collagen I and II showed poorer survival in most cancer types, whereas the effect of collagen III, VI, XV, XVI, XVII, XVIII and ELN depended closely on the tumor type. Patients with high expression of collagen IV and XI showed better survival, suggesting that collagen family members may act as either facilitators or inhibitors in the complex tumorigenic process (Necula et al., 2022; Shi et al., 2021; Sun et al., 2023). These results therefore provide an important parameter for selecting target collagen products for practical applications.
The disordered domain of collagen derives from its trimer structure, which also provides a platform for binding other proteins in support of biological functions. Assessment of the collagen interaction proteomic data in the STRING database (von Mering et al., 2003) showed that collagens tend to bind specifically to proteins within the family, indicating that collagen trimers can be either homogeneous or heterogeneous. Elastin was also observed to bind tightly to collagen III, indicating that they may function together as a complex, which is consistent with their expression patterns (Van Doren, 2015). In addition to intra-family interactions, collagens can interact with filamins and matrix metalloproteinases, consistent with their roles in cell support and adhesion. The gene ontology (GO) analysis was consistent with the previously reported biological functions of collagen (Avila Rodriguez et al., 2018; Holmes et al., 2018; Van Doren, 2015). For instance, collagen-interacting proteins were enriched in cytoskeleton, ligation and migration regulation pathways in the biological process (BP) category. In the cellular component (CC) category, these proteins were enriched in collagen trimer, integrin complex, focal adhesion and cell surface, whereas in the molecular function (MF) category they were enriched in collagen binding, extracellular matrix binding, integrin binding and virus receptor activity. However, the results for ELN differed slightly from those for collagen, especially in MF, where enrichment centered on oxidase activity and metal ion binding, further indicating its functional specificity.
A genome-wide CRISPR screen is a useful tool for accurately identifying function-related genes across the whole genome. Currently, 1,482 screening conditions and 27,635 screened genes related to human cells are included in the CRISPR screen summary (CSS) database for elucidating gene functions. The collagens and elastin ranked highly in screens for cell proliferation, protein accumulation and small-molecule anticancer drug treatment. In addition, the response to viral invasion emerged as one of the important functions involving collagens and elastin, consistent with the GO analysis results. This conclusion has not been demonstrated previously and may help explain recent work showing that the extracellular matrix acts as a physical barrier against infectious agents (Holmes et al., 2018; Pfisterer et al., 2021). In particular, collagen XVI ranked highly in screens under senescence conditions, providing an important index for collagen selection in the anti-aging field.
Moreover, because the GXYGXY motif is the typical feature of the collagen triple-helical region, GXY-form tripeptides are the main means of supplying collagen to the human body in most collagen-related nutrition and skin care products. Analysis of GXYGXY sequence-rich proteins in the human proteome showed that, in addition to collagen and elastin, proteins related to mRNA processing and the Golgi lumen were also enriched in GXYGXY sequences, suggesting that collagen tripeptides may be involved in these biological functions (Banushi et al., 2016; Zhang and Stefanovic, 2016).
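A proteome-wide search for GXYGXY-rich proteins can be approximated by ranking sequences by the fraction of residues that sit inside GXY repeat runs. The sketch below is one possible scoring scheme under that assumption; the coverage metric, the function names and the three toy sequences are illustrative and are not the criteria or data used in this study.

```python
import re

# A run of two or more consecutive Gly-X-Y triplets.
GXY_REPEAT = re.compile(r"(?:G..){2,}")


def gxy_coverage(sequence):
    """Fraction of residues that lie inside GXY repeat runs."""
    if not sequence:
        return 0.0
    covered = sum(m.end() - m.start() for m in GXY_REPEAT.finditer(sequence))
    return covered / len(sequence)


def rank_by_coverage(proteome):
    """Sort (name, coverage) pairs from most to least GXY-rich."""
    scored = [(name, gxy_coverage(seq)) for name, seq in proteome.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical toy sequences standing in for proteome entries.
toy_proteome = {
    "globular_like": "MKVLAAGTWQERS",
    "mixed": "MAGPPGERLLKK",
    "collagen_like": "GPPGAPGEKGPQ",
}
ranking = rank_by_coverage(toy_proteome)
print(ranking)
```

Proteins above a chosen coverage cutoff could then be passed to an enrichment analysis; the cutoff itself would need to be tuned against known collagens.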
Improve PANCE to evolve intein with high activity to collagen
As stated above, collagens have large worldwide markets in beauty products and food additives because of their important physiological and pathological functions. However, the expression of recombinant collagens faces great challenges in industrialization because of their disordered, GXY-repeating structure (Fang et al., 2023). Strategies based on phage-assisted non-continuous evolution (PANCE) have been used to improve the solubility and yield of target proteins (Liu et al., 2021; Miller et al., 2020) and have great potential for strengthening collagen expression. To couple collagen expression with M13 phage gIII, a key gene encoding the pIII protein required for M13 phage infectivity, we designed a biosensor mediated by the widely used Mxe GyrA intein. However, the wildtype Mxe GyrA intein had such weak cleavage activity toward collagen and pIII that progeny phages could not be packaged from infected host cells for the next round of evolution. This may be due to the low cleavage activity of the Mxe GyrA intein against the GPX repeats in collagens, which could be improved by PANCE.
A further problem was that the low mutagenesis efficiency of PANCE severely limited the speed of evolution, because PANCE uses a group of error-prone enzymes that mutate the host strain and M13 phage genomes indiscriminately. To enhance efficiency and shorten the time required, atmospheric and room temperature plasma (ARTP) technology, an efficient and safe physical mutagenesis technique (Zhang et al., 2014), was introduced and coupled with PANCE (named ARTP-PANCE) in place of enzymatic mutagenesis. Each round of ARTP-PANCE was divided into three steps. First, primary M13 phages carrying the gene of interest (GOI) infect a fresh host strain, and the infected cells are then mutagenized by ARTP. Second, the medium is replaced with fresh medium by centrifugation, and the phage titer is determined. Finally, mutated progeny phages are packaged and released from the cells into the culture supernatant, which is used for the next round of ARTP-PANCE. A recovery experiment with an inactivated kanamycin resistance gene showed that ARTP-PANCE could greatly shorten the time required for a round of evolution. However, determining the phage titer by the double-layer plate method takes about 6-8 h, which makes it difficult to track the state of evolution in real time. We therefore employed a loop-mediated isothermal amplification (DNA-LAMP) method to determine the phage titer in 10-30 min, allowing the evolution process to be monitored immediately.
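For the double-layer plate method mentioned above, the titer follows from the standard plaque-count relation: plaque-forming units per mL equal the plaque count divided by the product of the dilution factor and the plated volume. The sketch below applies that relation and reports round-to-round titer changes; the function names and all numbers are hypothetical, not measurements from this work.

```python
def titer_pfu_per_ml(plaque_count, dilution_factor, plated_volume_ml):
    """PFU/mL = plaques / (dilution factor x volume plated in mL).

    dilution_factor is the fraction of the original sample in the plated
    dilution, for example 1e-6 for a million-fold dilution.
    """
    if dilution_factor <= 0 or plated_volume_ml <= 0:
        raise ValueError("dilution factor and volume must be positive")
    return plaque_count / (dilution_factor * plated_volume_ml)


def fold_change_between_rounds(titers):
    """Ratio of each round's titer to that of the previous round."""
    return [later / earlier for earlier, later in zip(titers, titers[1:])]


# Hypothetical plate: 42 plaques from 0.1 mL of a 1e-6 dilution.
example_titer = titer_pfu_per_ml(42, 1e-6, 0.1)
print(f"{example_titer:.2e} PFU/mL")

# Hypothetical titers over three rounds of evolution.
print(fold_change_between_rounds([1e6, 5e6, 2e7]))
```

A fold change that stays above 1 across rounds is the signature of enrichment for more infectious phages, which is the trend tracked during evolution.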
To screen for GyrA intein mutants with high activity, we redesigned a biosensor coupling intein activity toward collagen with the infection ability of M13 phages. Based on the tertiary structures, we split pIII and the GyrA intein into N-terminal and C-terminal parts (Liu et al., 2021). In the M13 genome (SP), the protein-coding region of pIII was replaced by pIII-N connected to GyrA-N through a G-P-X collagen region. In the accessory plasmid (AP), the fused gene of GyrA-C and pIII-C was controlled by the phage shock protein operon (PSPO), which is transcribed only when the host strain is infected by M13 phage. After ARTP mutagenesis, GyrA-N mutants with higher activity generated more intact pIII through the trans-splicing reaction, so that more infectious progeny phages were packaged and released. Progeny phages whose genomes carried inactive GyrA-N mutants lost infection ability because of pIII deficiency. As the number of evolution rounds increased, the phage titer gradually increased. We ultimately obtained a GyrA intein mutant (named GyrA*) and mapped its mutations onto the tertiary structure. We used western blotting to verify the intein cleavage efficiency toward collagen and pIII in vivo. The results showed that GyrA* had over 10-fold stronger cleavage activity than wildtype. We further employed GyrA* to produce the collagen tripeptide glycine-proline-glutamine (GPQ). Based on the molecular weight (MW) difference between the intein and GPQ, we designed a chromatography-free purification method to isolate GPQ after intein cleavage. High performance liquid chromatography (HPLC) results showed that GyrA* cleavage produced a 1.7-fold higher yield of GPQ than wildtype.
ARTP-PANCE to evolve T7 RNAP for strengthening the transcription efficiency of collagen
Because GyrA* had the desired activity toward collagen and pIII, it could couple the expression level of collagen with the infection ability of M13 phage. We therefore redesigned ARTP-PANCE for T7 RNAP to strengthen collagen expression. We inserted a strong T7 terminator (Calvopina-Chavez et al., 2022) between the T7 promoter and the coding region to create a harsh elongation condition for T7 RNAP, followed by the CO3A1-GyrA*-pIII fusion gene. Progeny phages whose genomes contain T7 RNAP mutants with higher processivity toward collagen have stronger infection ability. As in the experiment above, the phage titer gradually increased during evolution (Fig. 5C). Finally, we obtained a T7 RNAP mutant (named T7 RNAP*), whose mutations were identified by Sanger sequencing.
To verify the activity of T7 RNAP* toward collagen, we used CRISPR-associated transposases to insert the T7 RNAP gene into the IS1 site of E. coli BL21. RT-qPCR data showed that the transcriptional levels of CO3A1 and gIII were higher with T7 RNAP* than with wildtype, showing that T7 RNAP* had stronger processivity than wildtype. We used western blotting to measure the protein level of CO3A1 in these two strains. The CO3A1 protein expression level mediated by T7 RNAP* was 2.3-fold higher than that mediated by wildtype T7 RNAP. These results indicate that enhancing the transcriptional level of the CO3A1 gene is an effective way to increase its protein abundance.
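Relative transcript levels from RT-qPCR are commonly computed with the 2^(-ΔΔCt) method. The sketch below shows that calculation; it is an assumption that this method applies here, since the text does not state how the RT-qPCR data were quantified, and the Ct values and the use of a single reference gene are hypothetical.

```python
def relative_expression(ct_target_test, ct_ref_test, ct_target_ctrl, ct_ref_ctrl):
    """Fold change of a target transcript by the 2^(-ddCt) method.

    Each sample is first normalized to its reference gene (dCt), then the
    test sample is compared with the control sample (ddCt). The method
    assumes roughly 100% amplification efficiency for both amplicons.
    """
    d_ct_test = ct_target_test - ct_ref_test
    d_ct_ctrl = ct_target_ctrl - ct_ref_ctrl
    dd_ct = d_ct_test - d_ct_ctrl
    return 2.0 ** (-dd_ct)


# Hypothetical Ct values: the target crosses threshold two cycles earlier
# in the test strain, with identical reference-gene Ct in both strains.
fold = relative_expression(
    ct_target_test=20.0, ct_ref_test=15.0,
    ct_target_ctrl=22.0, ct_ref_ctrl=15.0,
)
print(fold)  # 4.0
```

Here the test strain would correspond to the one carrying the evolved polymerase and the control to the wildtype strain; a lower Ct for the target after normalization translates into a fold change above 1.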
Develop TADR-FADS in evolving proline hydroxylase mutant for reinforcing collagen hydroxylation in vivo
As described above, proline hydroxylation is important for the function and stability of the collagen trimer (Fang et al., 2023; Myllyharju, 2003). E. coli lacks a post-translational modification (PTM) system for hydroxylating collagen in vivo, so an additional proline hydroxylase must be expressed. Mammalian proline hydroxylases are generally multi-subunit proteins, which are difficult to assemble correctly in prokaryotes. To solve this problem, we analyzed single-subunit proline hydroxylases from other species and eventually chose moumouvirus P4Hc (moumou_P4Hc) as the candidate because of its structural and sequence similarity to human P4H.
To ensure efficient hydroxylation of collagen in vivo, we designed a green fluorescent protein (GFP) biosensor that couples proline hydroxylase activity with GFP stability. A pdt-tag is fused to the C-terminus of GFP so that the tagged GFP can be specifically degraded by the mf-lon protease (Cameron and Collins, 2014). A proline hydroxylase with high activity hydroxylates the prolines in the pdt-tag and the mf-lon protease, which disrupts this specific degradation of GFP. To improve orthogonality and targeting during mutagenesis, we employed targeted artificial DNA replisome (TADR) technology for directed mutagenesis of moumou_P4Hc (Yi et al., 2021). Fluorescence-activated droplet sorting (FADS) based on high-throughput microfluidics was applied to sort microdroplets with a high green fluorescence signal, which may carry strains with highly active moumou_P4Hc mutants (Baret et al., 2009). Using TADR-FADS, we obtained a mutant (named P4Hc*). Its mutation pattern is consistent with that previously reported for TADR (Yi et al., 2021). We then used CRISPR-associated transposases to insert the T7 RNAP and P4Hc genes into the IS1 site of the BL21 strain genome and tested the proline hydroxylation level of total protein, CO3A1 and ELN. The results showed that moumou_P4Hc could hydroxylate CO3A1 and ELN, and that P4Hc* had 2.6-fold stronger activity toward CO3A1 and ELN than wildtype.
Finally, the yields of CO3A1 and ELN reached 0.013 and 0.047 g/L, and the proline hydroxylation levels reached 62.1% and 67.9% in vivo, respectively.
Improve CFPS System to strengthen the synthesis and modification of collagen in vitro
Because of its repetitive structure and extremely high proportion of glycine and proline, collagen expression places great stress on the cell. One factor is translation attenuation caused by a decrease in the local concentration of amino acids and aminoacyl-tRNAs, and the other is the poor stability of long RNAs. To solve these problems, we adopted and improved a cell-free protein synthesis (CFPS) system with T7 RNAP* and P4Hc* to synthesize collagen in vitro with artificial metabolic compensation. To detect collagen expression visually, we fused GFP to the end of collagen. Fluorescence intensity measurements showed that supplementation with proline, glycine, and their tRNA clusters slightly inhibited the expression of unfused GFP because of competitive effects, whereas expression of GFP fused with CO3A1 or ELN was significantly strengthened. Proline hydroxylation mediated by P4Hc* was slightly affected in the CFPS system. We further used this improved system to synthesize wildtype CO3A1 and ELN and found that their yields increased with the addition of translation substrates. These results indicate that repeated amino acid sequences inhibit translation efficiency when the concentration of translation substrates is insufficient.
RNA degradation is mainly caused by intracellular exonucleases and particularly affects long RNA molecules, which may decrease the translation level of collagen (Qu et al., 2022). We therefore used a self-circularizing ribozyme to circularize collagen mRNA, which protects the RNA from exonuclease attack (Qu et al., 2022). With guanylate added to promote circularization, fluorescence intensity measurements showed that circular mRNA had higher expression efficiency than linear mRNA, and the proline hydroxylation level was maintained at about 25-30% in CO3A1 and ELN. We further used the self-circularizing ribozyme to synthesize wildtype CO3A1 and ELN and found that circular mRNA increased the yields of both.
Finally, the yields of CO3A1 and ELN increased to 0.31 and 2.58 g/L in the CFPS system, respectively, and proline could be effectively hydroxylated in vitro.
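To put the CFPS yields in context, they can be compared with the in vivo yields of 0.013 and 0.047 g/L reported earlier in this section. The short calculation below uses only those four reported values; the variable names are illustrative, and the resulting ratios are derived here rather than stated in the text.

```python
# Yields in g/L as reported in this section.
in_vivo_yield = {"CO3A1": 0.013, "ELN": 0.047}
cfps_yield = {"CO3A1": 0.31, "ELN": 2.58}

# Fold increase of the CFPS yield over the in vivo yield for each protein.
fold_increase = {
    protein: cfps_yield[protein] / in_vivo_yield[protein]
    for protein in in_vivo_yield
}

for protein, fold in fold_increase.items():
    print(f"{protein}: {fold:.1f}-fold")
```

On these figures the CFPS system gives roughly a 24-fold higher yield for CO3A1 and roughly a 55-fold higher yield for ELN than in vivo expression.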