Metaproteomic analysis of Chesapeake Bay microbial communities

Background Natural microbial communities are extremely complex and dynamic systems in terms of their population structure and functions. However, little is known about the in situ functions of the microbial communities. Results This study describes the application of proteomic approaches (metaproteomics) to observe expressed protein profiles of natural microbial communities (metaproteomes). The technique was validated using a constructed community and subsequently used to analyze Chesapeake Bay microbial community (0.2 to 3.0 μm) metaproteomes. Chesapeake Bay metaproteomes contained proteins from pI 4–8 with apparent molecular masses between 10–80 kDa. Replicated middle Bay metaproteomes shared ~92% of all detected spots, but only shared 30% and 70% of common protein spots with upper and lower Bay metaproteomes. MALDI-TOF analysis of highly expressed proteins produced no significant matches to known proteins. Three Chesapeake Bay proteins were tentatively identified by LC-MS/MS sequencing coupled with MS-BLAST searching. The proteins identified were of marine microbial origin and correlated with abundant Chesapeake Bay microbial lineages, Bacteroides and α-proteobacteria. Conclusion Our results represent the first metaproteomic study of aquatic microbial assemblages and demonstrate the potential of metaproteomic approaches to link metagenomic data, taxonomic diversity, functional diversity and biological processes in natural environments.


Background
Bacterioplankton contribute significantly to both primary production and biomass in the ocean and coastal water [1,2]. With an average concentration of approximately 10 6 cells ml -1 , bacterioplankton is an important catalyst of biogeochemical processes including oceanic carbon and nitrogen cycles [3,4]. Studying bacterioplankton is challenging because most groups either have never been cultivated [5,6] or grow to very low density in the laboratory [7]. Culture-independent molecular approaches have indicated that environmental bacterial communities are more complex and diverse than previously thought [5,6,8]. Metagenomics is the direct cloning, sequencing, assembly and annotation of DNA from microbial communities and has been applied to waters, soils and extreme environments [9][10][11][12]. A recent metagenomic study of the Sargasso Sea revealed that substantial complex microbial diversity exists in the ocean: 148 novel bacterial phylotypes and more than a million of previously unknown genes were discovered and annotated [12].
As genomic data accumulates from pure cultures and environmental communities, it becomes critical to understand gene expression and protein function. While metagenome sequences provide valuable information on potential functions, accurately predicting ecological function from sequence is nearly impossible without information on what proteins are synthesized under specific conditions [13][14][15]. To address this question, postgenomic molecular approaches such as microarrays to monitor mRNA abundance [16] have been developed. In addition, as proteins/proteomes are the ultimate functional products of genes/genomes, proteomic studies of microbial communities (metaproteomics) are an obvious approach to advance our understanding of microbial community function.
Metaproteomics can provide a direct measurement of functional gene expression in terms of the presence, relative abundance and modification state of proteins [17,18]. Proteomics and metaproteomics rely on twodimensional gel electrophoresis (2D-PAGE) coupled with mass spectrometry (MS) based protein identification relying on mass based (MALDI-TOF MS) or sequence based (LC-ESI-MS/MS) methods. These techniques have only been applied in limited scope to environmental microbial communities. One-dimensional gel electrophoresis (1D-PAGE) coupled with radioactive labelling or enzymatic activity assay has been used to study proteins induced in response to environmental stresses [19,20]. However, little concrete information on the sequences or identities of induced proteins emerged from these studies. A metaproteomic approach was applied to a laboratory-scale activated sludge bioreactor resulting in the identification of three highly expressed proteins presumably originating from an uncultured Rhodocyclus-type polyphosphateaccumulating organism [18]. More recently, using genomic and mass spectrometry-based proteomic methods, metaproteomes from an acid mine drainage (AMD) microbial biofilm community have been identified and linked their in situ functions to the challenging environments [21]. However, all these studies are dealing with low-complexity microbial communities. So far, no studies have yet applied proteomic approaches to natural aquatic microbial communities.
Estuaries represent one of the most complex and productive ecosystems. The Chesapeake Bay is the largest estuary in United States (Fig. 1). It has received a great deal of attention because of its large geographic span and economic significance. With strong environmental gradients, it provides an ideal model system for integrated investigations on composition and function of microbial communities. In this study, we developed a metaproteomic approach to document microbial community protein profiles along a transect of the Chesapeake Bay. Significant differences were noted between proteomes collected at different sites and metaproteome patterns accurately predicted the relationship of sites as determined by 16S rRNA gene PCR-DGGE (denaturing gradient gel electrophoresis). Furthermore, proteins identified from Chesapeake Bay samples appeared to originate from marine bacterioplankton. This study demonstrates that metaproteomic approaches can be successfully applied to naturally occurring and complex microbial communities in their native habitats.

Microbial community collection
Epifluorescence microscopic counts showed that concentrated microbial communities mainly contained free-living bacteria (~ 95%). The recovery efficiency of bacterial Metaproteome sampling stations at the Chesapeake Bay cells using the tangential flow ultrafiltration system was 75 ± 5% (data not shown). With the average concentration of 2.5 × 10 6 cells ml -1 in the starting water samples, the density of microbial cells in the ultrafiltration retentate was about 2.5 × 10 8 cells ml -1 . Thus, about 3.75 × 10 10 cells were analyzed in each sample. Extracts typically contained between 140 and 192 µg of protein giving a value range of 3.7 × 10 -15 to 5.1 × 10 -15 g protein cell -1 . This value is significantly lower than that determined for cultured strains in this study and in general for marine bacteria (60-330 × 10 -15 g protein cell -1 , [22]). It remains to be determined whether this discrepancy indicates that the extraction protocol needs further optimization or is a fundamental property of microbial cells in environmental samples.

1D-PAGE analysis of proteins from isolated bacterial strains and environmental samples
Individual proteins from cultivated marine bacteria were well resolved by 1D-PAGE and produced distinct patterns when 8 Chesapeake Bay bacterial isolates were compared (Fig. 2). The observed molecular masses ranged from ~10 to 250 kDa (Fig. 2, lanes 1-8) whereas proteins from microbial community samples were < 80 kDa (Fig. 2, lanes 9 and 10). Overall resolution was much poorer in community samples as evidenced by less sharply defined bands in these samples. This blurring effect was also noted in a very simple mixed microbial community described below and was not dependent on sampling manipulations (data not shown).

Analysis of isolated strains and artificial mixed communities
Artificial community consisting of Chlorobium tepidum strain WT2321, Escherichia coli strain JM109 and an uncharacterized strain of Pseudomonas fluorescens was analyzed by 2D-PAGE. Preliminary experiments indicated that a 300 ml sample containing 1 × 10 7 cells per ml of the community could be successfully analyzed by 2D-PAGE. Analysis by 1D-PAGE afforded greater sensitivity, ~1 × 10 4 cells per ml, but resolution of individual bands was poor as noted above. Protein assays on samples of the community before dilution and recovery and after indicated that the metaproteomic sample preparation recovered ~ 30% of the total microbial protein present in the original community sample.
Typical results from a 2D-PAGE experiment are shown in Fig. 3. The overlays indicate that 2D-PAGE patterns from single strains of community members only match a fraction of protein spots present in the mock metaproteome sample (Fig. 3a-d). This is qualitatively observed as a large number of green or pink protein spots in the overlay views showing unmatched protein spots. Each individual strain is expected to contribute only one third of the pro-tein content of the community. In contrast, when a sample of the community prior to dilution and recovery is compared to a mock metaproteome that had been subjected to sample handling protocols, almost perfect matching of the samples is seen as evidenced by the large proportion of dark grey to black spots (Fig. 3d) when these images are overlain. Thus, no individual member of the community, which covers the range of cell sizes in the environmental samples, is selectively excluded by the sampling protocol.

Quantitative Comparison of Chesapeake Bay Metaproteome Samples
Metaproteome images from different Chesapeake Bay stations in the upper (station 858), middle (station 804, replicates a and b) and lower Bay (station 707) were compared ( Fig. 4a-d). A number of protein spots were shared by all samples. Some of these are proteins present in RNase, Dnase and protease inhibitor cocktail in the extraction buffer (data not shown), but a number of proteins appear to be common in all samples examined. These are black to dark grey spots in the image overlays ( Fig. 4a-d). A first level of quantitative comparison determined the specific numbers of protein spots shared between samples ( Table 1). The total number of spots compared for each sample is relatively low as the analysis was restricted to spots with sufficient quality and intensify to permit subsequent attempts at protein identification. As expected, replicate metaproteome images from the middle Bay are more similar to one another than the metaproteomes of other stations, sharing ~92 % of all detected spots. Furthermore, the lower and middle Bay metaproteomes are significantly more similar to one another than either is to the upper Bay metaproteomes with ~70 % of all detected spots in common. The upper Bay metaproteomes only shared about ~30 % of detected spots with either the middle or lower Bay metaproteomes.
1D-PAGE patterns of total proteins obtained from 8 different bacteria isolated from Baltimore Inner Harbor

Mr (kDa)
Relative spot intensity was extracted from comparisons of middle Bay to middle Bay and middle Bay with lower Bay metaproteome images. This was not possible with the upper Bay sample as manual matching was employed due to the low level of similarity between samples. Again, as expected, the number of differentially expressed proteins (≥ 3-fold change in matched spot intensity) was nearly twice as large when comparing middle Bay to a lower Bay metaproteomes as when comparing the replicated middle Bay samples (Table 1). These results indicate that both qualitative and highly quantitative comparisons between sites and between time series samples at the same site will be possible using the approaches developed in this study.

Identification of Proteins in Chesapeake Bay Metaproteomes
A total of 41 protein spots were excised from a number of 2-D gels reflecting various molecular weights, charges and relative abundance. Following MALDI-TOF MS, seven spots failed to yield interpretable MS profiles, while the remaining 34 proteins exhibited clear and distinct MS peaks. Database searches using the MASCOT search engine with varying parameter settings (peptide mass tol- The harvesting protocol for microbial communities does not bias against different types of bacteria Figure 3 The harvesting protocol for microbial communities does not bias against different types of bacteria. Proteomes of Chlorobium tepidum (a), Escherichia coli (b) and Pseudomonas fluorescens (c) and the metaproteomes of an artificially constructed community containing all three organisms (d) were overlain and compared to the metaproteomes of the artificial community after dilution and recovery using Compugen Z3 software. Green or pink colored protein spots are unmatched. Gray or black spots are matched. Total 100 µg proteins are loaded on each polyacrylamide gel and the gels are stained by SYPRO Ruby. pI, isoelectric point; Mr, molecular weight. erance from 0.5 to 3 Da, missed cleavages from 1 up to 5) produced no significant matches for these 34 proteins. Subsequent publications from other laboratories and our own simulations using known protein sequences [ [23,24], Hanson, unpublished data] suggest that greater than 97 % amino acid sequence identity is required to provide a positive match when searching with MALDI-TOF MS data.
Seven individual proteins (Fig. 5) isolated from middle Chesapeake Bay (station 804) metaproteome samples were further analyzed by both MALDI-TOF MS and LC-MS/MS sequencing coupled to MS-BLAST searching ( Table 2). MALDI-TOF MS failed to provide identification for any of these samples, similar to the samples described above. LC-MS/MS based searches provided tentative identities for three Chesapeake Bay metaproteome samples. These were identified as homologues of hypothetical proteins annotated in the recently reported Sargasso Sea Mr (kDa) metagenome [12]. Information on potential functions of these proteins was obtained by downloading the full length proteins from the Sargasso Sea database and searching them against known databases by BLASTP ( Table 3). The Sargasso Sea metagenome hypothetical protein corresponding to sample CB1 is not significantly similar to any known proteins in sequence databases. Sample CB3 may correspond to subunit 7 of the NADH:ubiquinone oxidoreductase (complex I) while sample CB6 is similar to a family of predicted aminopeptidases with unspecified functional significance. The tandem mass spectra of samples CB2, CB3 and CB5 had no match with any known proteins or hit keratin and bovine serum albumin that possibly came from background.

Discussion
In this study, we deliberately focused on exploring the proteome profiles from bacterioplankton communities between 0.2 and 3.0 microns in size by the choice of prefiltration and ultrafiltration cut-off sizes. Although the epifluorescence microscopy observation confirmed that the major components are bacterioplankton (~95%), small numbers of eukaryotic microbes were possibly included. These likely did not affect the overall protein profiles observed as analyses were restricted to abundant proteins, which would give the best chance for positive identification.
Metaproteomic approaches have thus far only been applied to laboratory scale bioreactors with a specialized community selected for phosphate removal [18] and a low-complexity natural microbial biofilm [21]. Extending this approach to complex environmental samples was not trivial. Initial studies comparing isolated strains, artificial communities and natural community samples by 1D-PAGE indicated that more resolving power was needed to deal with even simplified communities (data not shown). Thus, a metaproteomic approach utilizing 2D-PAGE and MS based protein identification was adopted. The experimental protocol outlined in this study was designed to avoid metaproteome changes arising from bias in the sample collection or handling. This was tested using artificial constructed bacterial assemblage containing 3 different species with varied cell sizes and we found no significant biases.
The protocol was also field tested by comparing replicated samples from the middle Chesapeake Bay to each other and comparing a range of samples from upper, middle and lower Chesapeake Bay stations. The replicated samples shared more than ~92 % of proteins indicating that the metaproteomic approach applied in this study was robust. Furthermore, significant differences were noted when the middle Bay metaproteomes was compared with lower Bay and upper Bay metaproteomes with only 70 % and 30 % of protein spots in common. This pattern can be likely and partially explained by the difference among the population structures of these samples. Genetic fingerprints indicated that upper Bay bacterioplankton community was different from the middle and lower Bay (Fig. 6). Clustering analysis based on presence/absence of Proteins selected for identification from middle Chesapeake Bay (station 804) Figure 5 Proteins selected for identification from middle Chesapeake Bay (station 804). Total 100 µg protein are loaded on polyacrylamide gel and the gel is stained by SYPRO Ruby. CBl-CB6 samples are common to Chesapeake Bay stations while NC1 is found on negative control gels containing DNase, RNase and protease inhibitors. Results of protein identification are reported in Table 2 and 3. pI, isoelectric point; Mr, molecular weight.  DGGE bands showed that the similarity between middle Bay to lower Bay was 64% while the upper Bay only shared 46% similarity to both of middle Bay and lower Bay. Finally, relative spot abundance was also much more tightly correlated when the replicated middle Bay samples were compared to each other than when they were compared to the lower Bay sample. These results demonstrate the approach outlined here is sufficiently sensitive to detect both coarse (shared spots) and fine (relative spot abundance) quantitative differences between samples, even when relatively low numbers of spots are included in the analysis. This is critical for any comparative approach.
This study, in addition to others, indicates that protein identification is the major challenge for metaproteomics [18,21,23,24]. Although distinct mass spectra from 34 protein spots were obtained by MALDI-TOF MS, no significant matches were found in sequence databases. MALDI-TOF generally requires at least 97 % amino acid sequence identity between query and target to find a significant match [ [25], Hanson unpublished]. It seems unlikely that many proteins in environmental samples will share this level of identity with proteins in sequence databases derived from cultured organisms. Post-translational modifications of proteins also account for the difficulty in the identifications. Thus, MALDI-TOF MS is unlikely to be useful for metaproteomic approaches.
In contrast, LC-MS/MS or N-terminal sequencing coupled to MS-BLAST searching is able to provide tentative identification for metaproteomes. However, the abundance of most proteins is too low to be identified through the venue of N-terminal sequencing. In the community proteomic analysis of a natural acid mine drainage microbial biofilm, the proteins could be identified by MS and assigned to five most abundant microbes because of the availability of metagenomic data. But the relative high likelihood of false-positive protein identification requires matching of two or more peptides per protein for confident detection [21]. Therefore, caution is required for interpretation of the data. In this study, three Chesapeake Bay metaproteome samples matched different hypothetical proteins annotated in the Sargasso Sea metagenome [12]. This result strongly supports a marine origin for these sequences as would be expected for a large number of proteins in the Chesapeake Bay, particularly in lower and middle Bay samples where there is significant salinity. Even with tentative identities, extending that identity to function must be done with some care. The Sargasso Sea metagenome hypothetical protein corresponding to sample CB1 is not significantly similar to any known proteins in sequence databases giving no clues to its function. Sample CB3 may correspond to subunit 7 of the NADH:ubiquinone oxidoreductase or complex I (Table  3). Complex I is a key component of most membrane bound electron transport chains that is responsible for the transfer of electrons from cytoplasmic NADH pools to the membrane bound quinone pool coupled to proton motive force generation. Subunit 7 is a peripheral membrane protein of the quinone reduction core of complex I [26]. The organism containing the closest match is Cytophaga hutchinsonii, a member of the Bacteroidetes assemblage of organisms, which is a substantial fraction of many marine communities [27]. A current study on population structure of Chesapeake Bay bacterioplankton showed that Bacteroidetes group accounts for ~10% of total community in summer time [e. g. Kan unpublished].
Sample CB6 is similar to a family of predicted aminopeptidases with unspecified functional significance. The closest matching protein is from Novosphingobium aromaticivorans. While N. aromaticivorans is normally considered terrestrial, other Novosphingobium and related Sphingobium and Sphingopyxis strains are widely distributed. As an important component of the α-proteobacteria, these groups can be detected in and isolated from marine and estuarine environments [[28,29], Kan unpublished]. This identification along with that of CB3 support an aquatic bacterial origin for these proteins that is consistent with their presence in the Chesapeake Bay.
Unanswered questions remain regarding the applicability of metaproteomics to natural communities. These include the following: Does a focused protein spot on a 2D SDS-PAGE gel from an environmental sample contain one protein or multiple proteins? What type of information is required to infer identity of spots between different samples? What is the sensitivity of metaproteomics to changes in community composition and the physiological status of community members? How can functional inferences provided by metaproteomics be further tested? Will the approach outlined here be applicable to other systems such as soils, sediments, and extreme environments? Clearly, much more work and complementary approaches need to be applied to these problems.

Conclusion
To our knowledge, this study represents the first application of a metaproteomic approach to a high-complexity aquatic microbial community. The main goals of this study were to develop a method capable of collecting planktonic microbial proteins in quantities suitable for analysis by 2D-PAGE. This was accomplished and attempts were made to identify a subset of these proteins. These attempts reinforced the notion that sequence based methods (LC-MS/MS) will be required to make any headway in protein identification in natural systems. Future studies will identify a much larger number of proteins from Chesapeake Bay microbial communities to address the questions raised above and provide insights into microbial community dynamics and function. . These bacteria were grown in 1/2 YTSS broth (4 g yeast extract, 2.5 g tryptone per liter dissolved in in situ water) and harvested at the exponential growth stage using centrifugation (10,000 × g, 5 min, 4°C).

Artificial Community Construction and Recovery
To determine if microbial community analysis by 2D SDS-PAGE is feasible and representative, a simple artificial mixed community was constructed using three bacterial strains of differing size: Chlorobium tepidum strain WT2321 (~0.

Microbial community sampling
Picoplankton communities were collected at three stations along the middle axis of the Chesapeake Bay on 7 June 2003 aboard the R/V Cape Henlopen (Fig. 1)

Protein extraction and purification
For 1D-PAGE, proteins from natural microbial communities and cultured bacteria were extracted using lysis buffer (50 mM Tris-HCl, 2% SDS, 10% v/v glycerol, 0.1 M DTT, 0.01% Bromophenol Blue, pH 6.8). Cells suspended in buffer were heated in a boiling water bath for 2 minutes followed by centrifugation (10,000 × g, 4°C for 3 min).
The supernatant was collected and 20 µg protein for each was loaded onto polyacrylamide gels. Silver staining was applied to 1D-PAGE gels.
For 2D-PAGE samples, cell suspensions were passed through a French Pressure cell (SLM Aminco) at 20,000 lb/in 2 twice and then incubated on ice for 20 minutes. During the ice incubation, samples were vortexed for 15 sec every 5 minutes. Large cellular debris was removed by centrifugation (10,000 × g, 4°C for 5 min). Proteins in the supernatant were precipitated with trichloracetic acid and resuspended in extraction buffer. Protein concentration of the sample was determined using the RC DC protein assay kit (Bio-Rad, Hercules, CA). Extracted proteins were stored at -80°C.

Isoelectric Focusing (IEF) and SDS-PAGE
The first dimension separation of proteins was carried out in the immobilized pH gradient (IPG) strips (11 cm, pH 3-10 or 4-7) on a Bio-Rad Protean IEF Cell system (Bio-Rad, Hercules, CA). Each 2D-PAGE was conducted using 100 µg of total protein. The IEF program was: 250V for 20 min followed with a linear ramp to 8000V for 2.5 hr, and 8000V for a total 40,000 V-hr with a rapid ramp. After the first dimension, the IEF strips were equilibrated in freshly made Buffer 1 (6 M urea, 2% SDS, 0.05 M Tris/HCl pH 8.8, 50% Glycerol) and Buffer 2 (6 M urea, 2% SDS, 0.375 M Tris/HCl pH 8.8, 20% Glycerol and 0.5 g iodoacetamide) (Bio-Rad, Hercules, Calif), respectively.
The second dimension of 2D-PAGE were performed using 8-16% gradient precast polyacrylamide gels (Bio-Rad, Hercules, CA) following the manufacturer's instructions. The gels were stained with SYPRO Ruby (Bio-Rad, Hercules, CA) after electrophoresis and scanned using a Typhoon 9410 fluorescent Imager (Amersham, NJ) with 488nm excitation and emission filter 610 BP30.

Metaproteome Image Analysis
Images were analyzed and quantitatively compared using the Z3 proteomics software package (Compugen, Israel). Gel images were compared in multiple gel mode using the total density in gel method for spot quantification. All gels were subjected to the same spot detection parameters followed by automated matching. Pairwise comparisons of gels were inspected and matches edited manually to eliminate poor quality or low intensity matches. When automatic matching failed, the number of matched and unmatched spots was estimated by manual examination of overlaid 2D SDS-PAGE images.

Protein Identification by Mass Spectrometry
Protein spots were manually excised from gels using Pasteur pipettes and digested as described by Mann et al. [32]. Tryptic peptides were analyzed both via MALDI-TOF and LC-MS/MS. MALDI spectra were acquired on a Bruker (Billerica, MA) Biflex III MALDI mass spectrometer operating in reflectron mode with delayed extraction. External calibration was performed using Calibration Mixture 2 from the Sequazyme Peptide Mass Standards Kit (Applied Biosystems, Foster City, CA). LC-MS/MS was performed on a Micromass (Beverly, MA) Q-TOF Ultima API-US coupled to a Micromass capLC. Tryptic digests were separated using both a C18 trapping column for washing and concentrating (LC Packings (Sunnyvale, CA) 300 µm × 5 mm C18) and a C18 analytical column for enhanced separation (LC Packings 180 µm × 15 cm C18). The solvent system consisted of 95% 0.1% formic Acid, 5% acetonitrile for the aqueous phase and 95% acetonitrile, 5% 0.1% formic Acid for the organic phase. A 60/60 gradient (to 60% organic in 60 mins) running at l µl/min was employed with most peptides eluting by ~30% organic. The LC eluent was electrosprayed directly into the Q-TOF using the nanosprayer source. Data dependent scanning was used with both MS and MS/MS spectra being acquired during an LC run. Spectra were processed and deconvoluted using programs found with the Micromass operating system, MassLynx v. 3.5.
MALDI-TOF peak lists were searched against protein sequence databases using the Matrix Science Mascot web interface http://www.matrixscience.com/ search_form_select.html. Deconvoluted MS/MS spectra were analyzed using a demonstration version of PeaksStudio 3.0 software (Bioinformatics Solutions Inc., Canada) for de novo sequence prediction. All sequences for each protein spot were used as queries in MS-BLAST searches as described by Shevchenko et al. [33] via the MS-BLAST web interface http://dove.embl-heidelberg.de/Blast2/ msblast.html.