Method Article
The protocols described allow the construction, characterization and selection (against the target of choice) of a "domainome" library made from any DNA source. This is achieved by a research pipeline that combines different technologies: phage display, a folding reporter and next generation sequencing with a web tool for data analysis.
Folding reporters are proteins with easily identifiable phenotypes, such as antibiotic resistance, whose folding and function are compromised when fused to poorly folding proteins or random open reading frames. We have developed a strategy where, by using TEM-1 β-lactamase (the enzyme conferring ampicillin resistance) on a genomic scale, we can select collections of correctly folded protein domains from the coding portion of the DNA of any intronless genome. The protein fragments obtained by this approach, the so-called "domainome", will be well expressed and soluble, making them suitable for structural/functional studies.
By cloning and displaying the "domainome" directly in a phage display system, we have shown that it is possible to select specific protein domains with the desired binding properties (e.g., to other proteins or to antibodies), thus providing essential experimental information for gene annotation or antigen identification.
The identification of the most enriched clones in a selected polyclonal population can be achieved by using next-generation sequencing (NGS) technologies. For this reason, we introduce deep sequencing analysis of the library itself and of the selection outputs to provide complete information on the diversity, abundance and precise mapping of each of the selected fragments. The protocols presented here show the key steps for library construction, characterization, and validation.
Here, we describe a high-throughput method for the construction and selection of libraries of folded and soluble protein domains from any genic/genomic starting source. The approach combines three different technologies: phage display, the use of a folding reporter and next-generation sequencing (NGS) with a specific web tool for data analysis. The methods can be used in many different contexts of protein-based research: for the identification and annotation of new proteins/protein domains, the characterization of structural and functional properties of known proteins, and the definition of protein-interaction networks.
Many open questions are still present in protein-based research and the development of methods for optimal protein production is an important need for several fields of investigation. For example, despite the availability of thousands of prokaryotic and eukaryotic genomes1, a map of the corresponding proteomes with a direct annotation of the encoded proteins and peptides is still missing for the great majority of organisms. The catalogue of complete proteomes is emerging as a challenging goal requiring a huge effort in terms of time and resources. The gold standard for experimental annotation remains the cloning of all the Open Reading Frames (ORFs) of a genome, building the so-called "ORFeome". Usually gene function is assigned based on homology to related genes of known activity, but this approach is often inaccurate due to the presence of many incorrect annotations in the reference databases2,3,4,5. Moreover, even for proteins that have been identified and annotated, additional studies are required to characterize them in terms of abundance, expression patterns in different contexts, structural and functional properties, and interaction networks.
Furthermore, since proteins are composed of different domains, each of them showing specific features and differently contributing to protein functions, the study and the exact definition of these domains can allow a more comprehensive picture, both at the single gene and at the full genome level. All this necessary information makes protein-based research a wide and challenging field.
In this perspective, an important contribution could be given by unbiased and high-throughput methods for protein production. However, the success of such approaches, besides the considerable investment required, relies on the ability to produce soluble/stable protein constructs. This is a major limiting factor since it has been estimated that only about 30% of proteins can be successfully expressed and produced at sufficient levels to be experimentally useful6,7,8. An approach to overcome this limitation is based on the use of randomly fragmented DNA to produce different polypeptides, which together provide an overlapping fragment representation of individual genes. Only a small percentage of the randomly generated DNA fragments are functional ORFs, whilst the great majority are non-functional (due to the presence of stop codons inside their sequences) or encode unnatural polypeptides (ORFs in a frame other than the original) with no biological meaning.
To address all these issues, our group has developed a high-throughput protein expression and interaction analysis platform that can be used on a genomic scale9,10,11,12. This platform integrates the following techniques: 1) a method to select collections of correctly folded protein domains from the coding portion of DNA from any organism; 2) phage display technology for selecting interaction partners; 3) NGS to completely characterize the whole interactome under study and identify the clones of interest; and 4) a web tool for data analysis that lets users without any bioinformatics or programming skills perform Interactome-Seq analysis in an easy and user-friendly way.
The use of this platform offers important advantages over alternative strategies of investigation; above all, the method is completely unbiased, high-throughput, and modular for studies ranging from a single gene up to a whole genome. The first step of the pipeline is the creation of a library from the randomly fragmented DNA under study, which is then deeply characterized by NGS. This library is generated using an engineered vector where genes/fragments of interest are cloned between a signal sequence for protein secretion into the periplasmic space (i.e., a Sec leader) and the TEM-1 β-lactamase gene. The fusion protein will confer ampicillin resistance and the ability to survive under ampicillin pressure only if the cloned fragments are in-frame with both these elements and the resulting fusion protein is correctly folded10,13,14. All clones rescued after antibiotic selection, the so-called "filtered clones", are ORFs, and the great majority of them (more than 80%) are derived from real genes9. Moreover, the power of this strategy lies in the finding that all ORF-filtered clones encode correctly folded/soluble proteins/domains15. As many clones present in the library and mapping to the same region/domain have different starting and ending points, this allows unbiased, single-step identification of the minimum fragments that are likely to result in soluble products.
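The in-frame requirement described above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the vector reading frames are arranged so that an insert whose length is a multiple of three and that carries no in-frame stop codon keeps β-lactamase in frame; the real frame arithmetic depends on the cloning site of the vector in use. The function name `passes_frame_filter` is hypothetical, and the sketch says nothing about folding, which is only assessed experimentally by ampicillin selection.

```python
# Hypothetical illustration of the reading-frame condition only.
# Assumption: an insert keeps the downstream reporter in frame when its
# length is a multiple of three (this depends on the actual vector design).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def passes_frame_filter(fragment: str) -> bool:
    """Return True if the fragment is an uninterrupted ORF in frame 0."""
    seq = fragment.upper()
    if len(seq) == 0 or len(seq) % 3 != 0:
        # A frameshift would put the reporter out of frame.
        return False
    codons = (seq[i:i + 3] for i in range(0, len(seq), 3))
    # Any in-frame stop codon truncates the fusion before the reporter.
    return all(codon not in STOP_CODONS for codon in codons)


if __name__ == "__main__":
    for frag in ("ATGGCCAAA", "ATGTAAAAA", "ATGGCCAA", "ATGGTAAGC"):
        print(frag, passes_frame_filter(frag))
```

Note that the last example contains the letters TAA, but not in frame 0, so it is treated as an open frame under this simplified model.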
A further improvement in the technology is given by the use of NGS to characterize the library. The combination of this platform and of a specific web tool for data analysis gives important unbiased information on the exact nucleotide sequences and on the location of selected ORFs on the reference DNA under study without the need of further extensive analyses or experimental effort.
Domainome libraries can be transferred into a selection context and used as a universal instrument to perform functional studies. The high-throughput protein expression and interaction analysis platform that we integrated and called Interactome-Seq takes advantage of phage display technology by transferring the filtered ORFs into a phagemid vector and creating a phage-ORF library. Once re-cloned into a phage display context, protein domains are displayed on the surface of M13 particles; in this way domainome libraries can be directly selected for gene fragments encoding domains with specific enzymatic activities or binding properties, allowing interactome network profiling. This approach was initially described by Zacchi et al.16 and later used in several other contexts13,17,18.
Compared to other technologies used to study protein-protein interactions (including the yeast two-hybrid system and mass spectrometry19,20), one major advantage is the amplification of the binding partner that occurs during multiple rounds of phage display selection. This increases the selection sensitivity, thus allowing the identification of low-abundance binding protein domains present in the library. The efficiency of the selection performed with an ORF-filtered library is further increased by the absence of non-functional clones. Finally, the technology allows the selection to be performed against both protein and non-protein baits21,22,23,24,25.
Phage selections using the domainome-phage library can be performed using, as bait, antibodies from the sera of patients with different pathological conditions, e.g. autoimmune diseases13, cancer or infectious diseases. This approach is used to obtain the so-called "antibody signature" of the disease under study, allowing the antigens/epitopes specifically recognized by the patients' antibodies to be identified and characterized on a large scale at the same time. Compared to other methods, the use of phage display allows the identification of both linear and conformational antigenic epitopes. The identification of a specific signature could potentially have an important impact on understanding pathogenesis, new vaccine design, identification of new therapeutic targets and development of new and specific diagnostic and prognostic tools. Moreover, when the study is focused on infectious diseases, a major advantage is that the discovery of immunogenic proteins is independent of pathogen cultivation.
Our approach confirms that folding reporters can be used on a genomic scale to select the "domainome": a collection of correctly folded, well expressed, soluble protein domains from the coding portion of the DNA and/or cDNA from any organism. Once isolated, the protein fragments are useful for many purposes, providing essential experimental information for gene annotation as well as for structural studies, antibody epitope mapping, antigen identification, etc. The completeness of the high-throughput data provided by NGS enables the analysis of highly complex samples, such as phage display libraries, and holds the potential to circumvent the traditional laborious picking and testing of individual rescued phage clones.
At the same time, thanks to the features of the filtered library and to the extreme sensitivity and power of NGS analysis, it is possible to identify the protein domain responsible for each interaction directly in an initial screen, without the need to create additional libraries for each bound protein. NGS provides a comprehensive definition of the whole domainome of any genic/genomic starting source, and the data analysis web tool yields a highly specific characterization, both qualitative and quantitative, of the interactome protein domains.
1. Construction of the ORF Library (Figure 1)
2. Subcloning of Filtered ORFs in a Phagemid Vector (Figure 2)
3. Phage Library Preparation and Selection Procedure
4. Phage Library Deep Sequencing Platform (Figure 3)
5. Bioinformatic Data Analysis by Using the Interactome-Seq Web Tool
The filtering approach is schematized in Figure 1. Any kind of intronless DNA can be used. In Figure 1A the first part of the filtering approach is represented: after loading on an agarose gel or a bioanalyzer, a good fragmentation of the DNA of interest appears as a smear of fragments with a length distribution in the desired size range of 150-750 bp. A representative virtual gel image of the fragmented DNA obtained is given. Fragments loaded on the agarose gel are then recovered, end-repaired and phosphorylated, and then cloned into a previously blunted pFILTER vector to create a library of random DNA fragments. Performing each step of the cloning procedure under optimal conditions is required to obtain a good-quality library with total coverage of the DNA under study.
In Figure 1B the filtering approach is represented: the library is grown in the presence of chloramphenicol (pFILTER resistance) alone or chloramphenicol and ampicillin to select for ORF-containing colonies. Only colonies carrying a DNA fragment corresponding to an ORF produce a functional β-lactamase and survive when antibiotic selection is present. Figure 1C shows how increasing selective pressure allows selection of good-folder ORFs versus poor-folder ones. The expected result is a decrease in library size of about 20-fold. A higher number of surviving clones indicates insufficient selective pressure.
ORF fragments can be easily recovered from the filtered library for subsequent applications; for interaction studies our strategy takes advantage of phage display technology. In Figure 2, the principal steps of phage library construction are represented: an adequate library is prepared by cutting out filtered fragments from the pFILTER vector and re-cloning them into a phagemid plasmid in fusion with the sequence coding for the phage capsid protein g3p. Once the bacterial cells are infected with helper phage, the presence of the vector in the cells allows the production of phage particles displaying ORF-g3p fusion products on their surface, thus making the filtered library available for phage display selection and further analysis.
All the libraries are deeply analyzed by NGS, as well as the outputs of the phage selections, as shown in the second part of Figure 3. DNA fragments are rescued from growing colonies by PCR amplification with specific oligonucleotides annealing on the plasmid backbone and carrying specific adapters for the sequencing. NGS is performed and reads are then analyzed with the Interactome-Seq data analysis web tool.
Figure 4 shows a schematic representation of the selection procedure for an ORF-filtered phage display library. The selection in this example is performed by using antibodies present in the sera of patients affected by different pathologies (e.g. infectious pathologies, autoimmune pathologies, cancer). In this case the phage library directly interacts with the antibodies present in the patients' sera, and in this way putative specific antigens can be enriched because they are recognized by disease-specific antibodies. In this kind of experiment, the library is usually also selected using control sera from healthy individuals in order to have a background signal for subsequent comparison and normalization procedures.
Selections are performed using sera from the same type of patients, usually grouped together into different pools in order to reduce inter-individual variability in serum antibody titer. Each pool is independently used for two to three consecutive rounds of selection, to enrich the library for immune-reactive clones specific for the pathology under study. Test set antibodies are incubated with library phages, immune complexes are recovered by protein A-coated magnetic beads and bound phages are eluted by standard procedures. The selection cycles are performed with increasing washing and binding stringency.
The reads generated by NGS can be analyzed using the Interactome-Seq web tool, specifically developed to manage this kind of data. The Interactome-Seq data analysis workflow is composed of four sequential steps that, starting from raw sequencing reads, generate the list of putative domains with genomic annotations (Figure 5A). In the first step, INPUT (Figure 5A - red box), Interactome-Seq checks whether the input files (raw reads, reference genome sequence, annotation list) are properly formatted. In the second step, PREPROCESSING (Figure 5A - orange box), low-quality sequencing data are first trimmed using Cutadapt28 depending on quality scores, and reads shorter than 100 bases are discarded. In the subsequent READ ALIGNMENT step (Figure 5A - green box), the remaining reads are aligned with blastn29 to the genome sequence, allowing up to 5% mismatches. A SAM file is generated, and only reads with a quality score greater than 30 (Q>30) are processed using SAMtools30 and converted into a BAM file. After alignment, Interactome-Seq performs DOMAINS DETECTION (Figure 5A - blue box), invoking Bedtools31 to select reads that overlap transcripts for at least 80% of their length; the coverage, max depth and focus values are then calculated for each ORF portion covered by mapping reads. The coverage represents the total number of reads assigned to a gene; the max depth is the maximum number of reads covering a specific genic portion; the focus is an index obtained from the ratio between max depth and coverage, and it can range between 0 and 1. When the focus is higher than 0.8 and the coverage is higher than the average coverage observed for all mapping regions in the BAM file, the CDS portion is classified as a putative domain/epitope. The last step of the Interactome-Seq pipeline is the OUTPUT (Figure 5A - violet box): a list of putative domains is generated in tab-separated format.
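The coverage, max depth and focus definitions above can be made concrete with a short Python sketch. This is not the Interactome-Seq implementation: the `Read` class, the function names and the 0-based half-open coordinates are assumptions introduced for illustration, and the reads are assumed to have already passed the trimming, length and alignment-quality filters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    """Aligned read; hypothetical 0-based, half-open coordinates."""
    start: int
    end: int


def overlap_fraction(read: Read, cds_start: int, cds_end: int) -> float:
    """Fraction of the read length that lies inside the CDS."""
    overlap = max(0, min(read.end, cds_end) - max(read.start, cds_start))
    return overlap / (read.end - read.start)


def region_metrics(reads, cds_start, cds_end, min_overlap=0.8):
    """Coverage, max depth and focus for one CDS, as defined in the text."""
    kept = [r for r in reads
            if overlap_fraction(r, cds_start, cds_end) >= min_overlap]
    coverage = len(kept)  # total number of reads assigned to the gene
    if coverage == 0:
        return {"coverage": 0, "max_depth": 0, "focus": 0.0}
    depth = [0] * (cds_end - cds_start)
    for r in kept:
        for pos in range(max(r.start, cds_start), min(r.end, cds_end)):
            depth[pos - cds_start] += 1
    max_depth = max(depth)  # most reads stacked on a single position
    return {"coverage": coverage, "max_depth": max_depth,
            "focus": max_depth / coverage}


def is_putative_domain(metrics, mean_coverage, focus_threshold=0.8):
    """Apply the two criteria: focus > 0.8 and above-average coverage."""
    return (metrics["focus"] > focus_threshold
            and metrics["coverage"] > mean_coverage)


if __name__ == "__main__":
    reads = [Read(10, 160), Read(20, 170), Read(30, 180), Read(400, 550)]
    m = region_metrics(reads, cds_start=0, cds_end=600)
    print(m, is_putative_domain(m, mean_coverage=2))
```

In this toy case three of the four reads pile up on the same region, giving a focus of 0.75, which falls below the 0.8 threshold; a focus close to 1 means that nearly all reads assigned to the gene stack on one portion of it.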
The Interactome-Seq pipeline has been included in a web tool to enable users without any bioinformatics or programming skills to perform Interactome-Seq analysis through a graphical interface and to obtain their results in an easy and user-friendly format. As shown in Figure 5B, the output results of an analysis are displayed using JBrowse32 to enable visualization and exploration. Interactome-Seq generates tracks in the genome browser corresponding to the putative domains detected and also provides classical Venn diagrams to show intersections between common putative domains enriched, for example, in different selection experiments.
Figure 1: Schematic overview of the main steps for the construction of the ORF-filtering library
A) DNA from different sources is sonicated into random fragments of 150-750 bp in length. Fragments are recovered from the gel and blunt-end cloned into the pFILTER vector; B) filtering step using β-lactamase as a folding reporter. Vectors containing non-ORF fragments are negatively selected on ampicillin, while ORF fragments allow colonies to grow; C) application of an increasing selective pressure (ampicillin concentration in solid growth media from 0 to >100 μg/mL) allows selection of better-folded fragments.
Figure 2: Schematic overview of the main steps for the construction of the phage library
A) ORF-filtered fragments are cut out from the filtered vector using specific restriction enzymes. After recovery and purification, fragments are cloned into phagemid vector and transformed; B) phagemid bacterial library is infected with helper phage and, after overnight growth, phages are PEG-precipitated and collected.
Figure 3: ORF libraries sequencing
Sequencing is performed on both the original ORF-selected library and the phage display library; 1) in both cases, grown colonies are recovered and DNA is extracted; 2) DNA fragments are recovered by amplification using specific primers linked to adaptors for sequencing; 3-4) fragments are recovered and deep sequenced using NGS; 5) data are analyzed by using the Interactome-Seq pipeline.
Figure 4: Schematic overview of library selection using patients' antibodies
The phage library is used for selection against antibodies from patients' sera. Antibodies are immobilized on magnetic beads, the phage library capture/selection is performed, three cycles of washes are performed and the selected phages are then recovered and used to re-infect E. coli. Re-infected E. coli cells are plated under selective pressure (ampicillin 100 μg/mL). ORF fragments are recovered by amplification and amplicon pools are then sequenced by NGS.
Figure 5: Schematic overview of library analysis
A) Representation of the data analysis workflow, starting from raw FASTQ files to the final annotated domains lists; B) schematic representation of the inputs and outputs of the Interactome-Seq web tool.
The creation of a high-quality, highly diverse ORF-filtered library is the first critical step in the whole procedure since it will affect all the subsequent steps of the pipeline.
An important advantage of our method is that any source of (intronless) DNA (cDNA, genomic DNA, PCR-derived or synthetic DNA) is suitable for library construction. The first parameter that should be taken into account is that the length of the DNA fragments cloned into the pFILTER vector should provide a representation of the entire collection of the domains of a genome or a transcriptome, the so-called "domainome". We have demonstrated that protein domains can be successfully cloned, selected and finally identified starting from DNA fragments with a length distribution spanning from 150 to 750 bp33,34 (that is, 50 to 250 encoded amino acids at three bases per codon), and this is in line with what is reported in the literature showing that most protein domains are about 100 aa long (with a range from 50 to 200 aa)15.
DNA starting material must be fragmented into the size range of choice and later cloned into the filtering (pFILTER)12 vector. During these steps, potential bias can be avoided by maximizing the efficiency of all the cloning reactions included in the protocol, in particular fragment end-repair and phosphorylation. The vector preparation is challenging and should also be carried out under optimal conditions, to avoid both plasmid degradation and contamination by undigested vector.
Once the library has been created, it should be "filtered" in order to retain only fragments that are folded ORFs. A key parameter to modulate this step is the selective pressure applied, which can be modified according to the filtering stringency desired. Selection is performed using ampicillin: the higher the concentration used, the lower the number of transformed bacterial colonies able to survive. This reflects the ability of the filtering method to select for good- versus poor-folder ORFs34. This reduction in the number of clones is balanced by the increase in folding properties of the selected fragments. Usually, the ampicillin concentration should be sufficient to reduce the number of bacterial colonies to about 1/20 of those obtained by growing the library on chloramphenicol only.
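The roughly 20-fold reduction can be checked from plate counts with a few lines of Python. The sketch below is only an illustration: the function names, the example colony counts and the 25% tolerance on the target are assumptions, not values taken from the protocol.

```python
def fold_reduction(cfu_chloramphenicol: float, cfu_chlor_plus_amp: float) -> float:
    """Library size on chloramphenicol divided by size after ampicillin filtering."""
    if cfu_chloramphenicol <= 0 or cfu_chlor_plus_amp <= 0:
        raise ValueError("colony counts must be positive")
    return cfu_chloramphenicol / cfu_chlor_plus_amp


def pressure_seems_sufficient(fold: float, target: float = 20.0,
                              tolerance: float = 0.25) -> bool:
    """True if the reduction reaches the target within an arbitrary tolerance.

    A lower fold reduction means more surviving clones than expected,
    which the text interprets as insufficient selective pressure.
    """
    return fold >= target * (1.0 - tolerance)


if __name__ == "__main__":
    # Hypothetical counts: 2e6 colonies unfiltered, 1e5 after filtering.
    fold = fold_reduction(2_000_000, 100_000)
    print(fold, pressure_seems_sufficient(fold))
```

If the computed reduction falls well short of the target, the ampicillin concentration can be raised in a repeat filtering step, at the cost of a smaller library.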
Library validation is usually done by PCR amplification of randomly picked colonies followed by sequencing. PCR amplification of some colonies is suggested in order to have a quick estimate of the quality of the library: the length of the inserts should be in the expected range of 150-750 bp, and different colonies should present inserts of different sizes, indicating good library preparation in terms of variability. This conventional screening strategy, when applied as the only method for library validation, is neither comprehensive nor fast: it allows the analysis of only a limited number of colonies and carries a high chance of missing most of the important clones. Our approach is based on deep sequencing of the library, which provides complete information on library diversity and abundance and the precise mapping of each of the selected fragments.
Combining NGS technology with the filtering approach increases the depth of the analysis by several orders of magnitude. Recently, we have optimized the protocol for sequencing the ORF libraries by using the Illumina platform, and developed a specific web tool that makes the analysis of this kind of data accessible to every user, even without any bioinformatics or programming skills.
The library "per se" is a "universal instrument" and can be exploited in different contexts for protein expression and/or selection. Our methodological approach is based on transferring the produced ORFeome into a phage display context. Protein fragments are expressed on the phage surface and become suitable for subsequent selection.
This is done by rescuing the filtered ORFs from the pFILTER library by digestion with specific restriction enzymes and re-cloning them into a compatible phagemid vector, allowing their fusion with the phage protein g3p.
After the phagemid-ORF library is created, it can be used for selection against different targets, such as a putative binding protein10 or purified antibodies35,36 as described here. Since phage particles will display the filtered ORFs on their surface, this results in a much more effective selection procedure due to the absence of the non-displaying clones that would otherwise overtake the selection.
After the selection of the phage display ORF library, the output clones can be sequenced and analyzed with the same pipeline. NGS can provide a complete and statistically significant ranking of the most frequently selected ORFs, and this allows the identification of the proteins that interact most with the bait used. Given the presence of many different versions of each domain differing by a few amino acids, the overlap between different sequenced clones also identifies the minimum fragment/domain showing binding properties. Finally, thanks to the coupling of genotype and phenotype information in the phage library, once the domains of choice have been identified, the DNA sequence can be easily rescued from the library for further studies, in vitro and in vivo validation and characterization.
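The two ideas in this paragraph, ranking ORFs by how often they are selected and locating the minimum fragment shared by overlapping clones, can be sketched in Python as follows. The function names, gene identifiers and 0-based half-open coordinates are hypothetical, and a plain count is used for the ranking; no statistical test is implied.

```python
from collections import Counter


def rank_selected_orfs(gene_ids):
    """Rank genes by the number of selected clones (or reads) mapping to them."""
    return Counter(gene_ids).most_common()


def minimal_common_fragment(clones):
    """Region shared by all clones mapping to the same gene.

    `clones` is a list of (start, end) pairs on the same reference.
    Returns None when the clones have no region in common.
    """
    if not clones:
        return None
    start = max(s for s, _ in clones)  # latest starting point
    end = min(e for _, e in clones)    # earliest ending point
    return (start, end) if start < end else None


if __name__ == "__main__":
    hits = ["geneA", "geneB", "geneA", "geneC", "geneA", "geneB"]
    print(rank_selected_orfs(hits))
    print(minimal_common_fragment([(100, 400), (150, 450), (120, 380)]))
```

Clones with different start and end points that all retain binding narrow the shared region, which is why a diverse filtered library helps delimit the interacting domain in a single screen.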
The authors have nothing to disclose.
This work was supported by a grant from the Italian Ministry of Education and University (2010P3S8BR_002 to CP).
Name | Company | Catalog Number | Comments |
Sonopuls ultrasonic homogenizer | Bandelin | HD2070 | or equivalent |
GeneRuler 100 bp Plus DNA Ladder | Thermo Scientific | SM0321 | or equivalent |
GeneRuler 1 kb DNA Ladder | Thermo Fisher Scientific | SM0311 | or equivalent |
Molecular Biology Agarose | BioRad | 161-3102 | or equivalent |
Green Gel Plus | Fisher Molecular Biology | FS-GEL01 | or equivalent |
6x DNA Loading Dye | Thermo Fisher Scientific | R0611 | or equivalent |
QIAquick Gel Extraction Kit | Qiagen | 28704 | or equivalent |
Quick Blunting Kit | New England Biolabs | E1201S | |
NanoDrop 2000 UV-Vis Spectrophotometer | Thermo Fisher Scientific | ND-2000 | |
High-Capacity cDNA Reverse Transcription Kit | Thermo Fisher Scientific | 4368813 | |
Streptavidin Magnetic Beads | New England Biolabs | S1420S | or equivalent |
QIAquick PCR purification Kit | Qiagen | 28104 | or equivalent |
EcoRV | New England Biolabs | R0195L | |
Antarctic Phosphatase | New England Biolabs | M0289S | |
T4 DNA Ligase | New England Biolabs | M0202T | |
Sodium Acetate 3M pH5.2 | general lab supplier | ||
Ethanol for molecular biology | Sigma-Aldrich | E7023 | or equivalent |
DH5aF' bacteria cells | Thermo Fisher Scientific | ||
0.2 mL tubes | general lab supplier | ||
1.5 mL tubes | general lab supplier | ||
0.1 cm electroporation cuvettes | Biosigma | 4905020 | |
Electroporator 2510 | Eppendorf | ||
2x YT medium | Sigma-Aldrich | Y1003 | |
Ampicillin sodium salt | Sigma-Aldrich | A9518 | |
Chloramphenicol | Sigma-Aldrich | C0378 | |
DreamTaq DNA Polymerase | Thermo Fisher Scientific | EP0702 | |
Deoxynucleotide (dNTP) Solution Mix | New England Biolabs | N0447S | |
96-well thermal cycler (with heated lid) | general lab supplier | ||
150 mm plates | general lab supplier | ||
100 mm plates | general lab supplier | ||
Glycerol | Sigma-Aldrich | G5516 | |
BssHII | New England Biolabs | R0199L | |
NheI | New England Biolabs | R0131L | |
QIAprep Spin Miniprep Kit | Qiagen | 27104 | or equivalent |
M13KO7 Helper Phage | GE Healthcare Life Sciences | 27-1524-01 | |
Kanamycin sulfate from Streptomyces kanamyceticus | Sigma-Aldrich | K1377 | |
Polyethylene glycol (PEG) | Sigma-Aldrich | P5413 | |
Sodium Chloride (NaCl) | Sigma-Aldrich | S3014 | |
PBS | general lab supplier | ||
Dynabeads Protein G for Immunoprecipitation | Thermo Fisher Scientific | 10003D | or equivalent |
MagnaRack Magnetic Separation Rack | Thermo Fisher Scientific | CS15000 | or equivalent |
Tween 20 | Sigma-Aldrich | P1379 | |
Nonfat dried milk powder | EuroClone | EMR180500 | |
KAPA HiFi HotStart ReadyMix | Kapa Biosystems, Fisher Scientific | 7958935001 | |
AMPure XP beads | Agencourt, Beckman Coulter | A63881 | |
Nextera XT dual Index Primers | Illumina | FC-131-2001 or FC-131-2002 or FC-131-2003 or FC-131-2004 | |
MiSeq or HiSeq 2500 | Illumina | ||
Spectrophotometer | Nanodrop | ||
Agilent Bioanalyzer or TapeStation | Agilent | ||
Forward PCR primer | general lab supplier | 5’ TACCTATTGCCTACGGCAGCCGCTGGATTGTTATTACTC 3’ | |
Reverse PCR primer | general lab supplier | 5’ TGGTGATGGTGAGTACTATCCAGGCCCAGCAGTGGGTTTG 3’ | |
Forward primer for NGS | general lab supplier | 5’ TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGGCAGCAAGCGGCGCGCATGC 3’ | |
Reverse primer for NGS | general lab supplier | 5’ GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGGGATTGGTTTGCCGCTAGC 3’ | |