PDB Newsletter
Number 5 -- Spring, 2000 -- RCSB
Published quarterly by the Research Collaboratory for Structural Bioinformatics

Weekly PDB news is available on the Web at http://www.rcsb.org/pdb/latest_news.html.
This newsletter will soon be available on the Web in HTML format at http://www.rcsb.org/pdb/newsletter/

SNAPSHOT -- April 11, 2000

12,110 released atomic coordinate entries

Molecule Type
10,735 proteins, peptides, and viruses
821 nucleic acids
536 protein/nucleic acid complexes
18 carbohydrates

Experimental Technique
9,924 diffraction and other
  3,621 structure factor files
1,920 NMR
  747 NMR restraint files
266 theoretical modeling

TABLE OF CONTENTS

Message from the PDB

Data Deposition and Processing
  RCSB PDB Celebrates its First Year of Processing Data
  PDB Deposition Statistics

Data Uniformity
  Release of Cleaned-up Data and Ongoing Uniformity Work

Data Query and Reporting
  New Query and Reporting Features

Access
  New PDB Mirror Site in Brazil
  PDB Web Site Statistics

PDB Outreach Activities
  PDB at the Biophysical Society Annual Meeting
  Issue 91 of PDB CD-ROM Released
  Proposal for CORBA Standard for Macromolecular Structure Data Submitted
  Molecules of the Quarter: Myoglobin, Bacteriophage phiX174, and DNA Polymerase
  New Publication Analyzing the PDB

Statement of Support

RCSB PDB Team

--------------------------------------------

MESSAGE FROM THE PDB

The first few months of the year 2000 have proven to be busy ones for the PDB Team. We have had anniversaries (our first year of data processing), rolled out new features (searching and reporting enhancements), and released the first groups of data from the uniformity project (citation, R factor, and resolution). An independent measure of this success was the inclusion of the PDB Web site in the 50 most influential and important biotechnology sites of 1999. The Top 50 list of biotechnology sites, published in the December issue of Genetic Engineering News (http://www.genengnews.com/), is compiled annually by Dr.
Kevin Ahern in the Department of Biochemistry and Biophysics at Oregon State University in Corvallis. The PDB's Molecule of the Month feature was also included as Science's (http://www.sciencemag.org/) "NetWatch Hot Pick" (vol 287, no 5460). The PDB site has also been quite active, consistently receiving more than one hit per second, and more than one query per minute.

During the next few months, the PDB will be preparing for exhibition booths at the American Crystallographic Association's Annual Meeting (St. Paul, MN, July 22-27) and the Fourteenth Symposium of the Protein Society (San Diego, CA, August 5-9). We hope to see many of you at these events. Details will be given, as with all other PDB news, on the PDB home page (http://www.rcsb.org/pdb/).

The PDB

DATA DEPOSITION AND PROCESSING

RCSB PDB CELEBRATES ITS FIRST YEAR OF PROCESSING DATA

Since January 27, 1999, the RCSB PDB Team has processed over 3,000 structures -- more than 25% of the current PDB archive. This number includes the structures deposited during the past year via ADIT and AutoDep, as well as older, unprocessed structures that were not completed before the end of the transition period in June 1999. The average time for complete processing and author correspondence for all structures, regardless of their release status, is less than two weeks. During this time, the EBI continued to accept data deposited at that site, and assumed processing responsibilities for the structures submitted there since June 1999.

Data deposition questions and updates should be sent to deposit@rcsb.rutgers.edu.

PDB DEPOSITION STATISTICS

In the first quarter of 2000, more than 500 structures were deposited to the PDB using the AutoDep Input Tool (ADIT; http://pdb.rutgers.edu/) -- 208 in January, 137 in February, and 185 in March. Depositions came from all around the globe, including Asia, Australia, Europe, and North America.
61% of these structures were deposited with a HOLD UNTIL PUBLICATION status, 20% were deposited with a HOLD status, and 19% were deposited with an immediate release status. While proteins were the predominant type of structure deposited, 38 nucleic acids and 26 protein-nucleic acid complexes were also submitted to the PDB. In addition to the more than 400 structures determined by X-ray crystallography, first quarter depositions included 83 NMR-determined structures, 10 theoretical models, and two EM structures.

DATA UNIFORMITY

RELEASE OF CLEANED-UP DATA AND ONGOING UNIFORMITY WORK

As part of the data clean-up project led by the NIST-PDB team, the citation data previously introduced on the PDB beta site are now available on the main production sites. This follows the introduction last fall of reliable access to R-factor and resolution data. Extensive work has also been completed on ligand and source data.

All primary citations for PDB entries, as of July 1999, have been validated and corrected, if necessary. This work involved verifying all primary citation data values (title, authors, journal, year, volume, pages) with the published literature using either electronic or hardcopy journal resources. The procedure also involved presenting the citation data values in a uniform format. Whenever possible, links have been added to PUBMED. At present, the primary citations are more than 95% complete.

Legacy PDB data files commonly stored R-factor and resolution data in free text format. This way of storing data, together with the changes in conventions and definitions that took place during the last several years, made it hard to establish fast and accurate queries over these data. The R-factor and resolution information for all legacy data have been examined and tabulated. In more than 5% of the files, this work required referring to the original publication. The tabulated data are now used to improve the reliability of user queries.
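To illustrate why free-text storage made such queries hard, here is a minimal Python sketch that pulls a resolution value out of REMARK 2 records. The sample records, the regular expression, and the function name parse_resolution are illustrative assumptions, not the procedure the NIST-PDB team used. Legacy files worded this information in many different ways, so a single pattern like this one would miss some entries.

```python
import re

# Hypothetical sample records; the wording in real legacy files varied widely.
SAMPLE_RECORDS = [
    "REMARK   2 RESOLUTION. 2.00 ANGSTROMS.",
    "REMARK   2 RESOLUTION. NOT APPLICABLE.",
    "REMARK   2 RESOLUTION. 1.8  ANGSTROMS",
]

# Matches a decimal number followed by the word ANGSTROM(S).
RESOLUTION_RE = re.compile(r"RESOLUTION\.\s+(\d+(?:\.\d+)?)\s+ANGSTROM")


def parse_resolution(record):
    """Return the resolution in angstroms as a float, or None if absent."""
    if not record.startswith("REMARK   2"):
        return None
    match = RESOLUTION_RE.search(record)
    return float(match.group(1)) if match else None


# Tabulate the values once so that later queries need no text parsing.
resolutions = [parse_resolution(record) for record in SAMPLE_RECORDS]
print(resolutions)
```

Records that yield None are the ones that would need manual review, which mirrors the share of files that required a look at the original publication.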
Legacy data hold ligand and hetero atom information in a non-uniform format. Rutgers' and NIST's PDB teams annotated the data and developed uniform names. To facilitate reliable searching, synonym lists were generated from the information in the PDB files and, in several cases, context-based commercial and popular names from publications and reviews were also added. Queries using this information are now available for beta testing (http://www.rcsb.org/pdb/beta.html).

The source information for all the legacy data is annotated to follow the conventions and standards used by MEDLINE. These data are currently being implemented for beta testing.

Updating and annotating data is an ongoing process, and the RCSB greatly appreciates input from the user community. Users may send corrections to info@rcsb.org.

DATA QUERY AND REPORTING

NEW QUERY AND REPORTING FEATURES

The RCSB continues to enhance and upgrade the capabilities of the PDB customizable search interface and reporting output. Several new query and reporting features, previously available on the PDB beta test site, have been moved to the PDB production sites. These features include:

Ligand Search Capability. The SearchFields interface now provides a "Ligands and Prosthetic groups" field to enable queries for entries that contain a specific ligand. Ligands may be specified using either their common names or the three-character identifiers found in the PDB HET group dictionary.

Expanded Information on Ligands and Prosthetic groups. The Structure Explorer interface now provides a table of ligands and prosthetic groups (referred to as "HET groups") within a particular macromolecule. While the previous interface only provided non-polymeric groups, the new implementation also shows non-canonical residues within protein or nucleic acid chains. A hyperlink provides convenient access to a graphical representation of these groups within Rasmol.

Direct Hyperlink to Primary Citation.
Where available, hyperlinks to MEDLINE on the Structure Explorer summary page now point directly to the abstracts of the primary citations. The underlying information was collected and is maintained by the NIST data curation team within the RCSB. Note that the link to "Other Sources" within Structure Explorer holds an extensive set of links to all related publications within MEDLINE.

Faster and Improved Access to Dynamic Links. The static links previously available for each PDB entry from the Structure Explorer/Other Sources pages have been replaced by a dynamically updated and far more extensive set of links created by the Molecular Information Agent (MIA), developed by the NIH-funded National Biomedical Computation Resource at SDSC. Additional information on MIA (http://www.rcsb.org/pdb/mia.html) is available from the PDB site. Access to the set of cross-links provided by MIA has been considerably improved after feedback generated from the initial deployment. All information is now broken down into categories, and the initial screen only provides access to resources with direct links possible via the PDB identifier. Hyperlinks to extended sets of links allow convenient access to all other links.

Experimental Data Availability. Users can now limit queries to entries for which the experimental data were deposited. The SearchFields interface developed by the PDB-SDSC team now contains an "Experimental Data Availability" checkbox in the customizable section of the form. Checking this box and clicking the New Form button creates a form containing options for restricting the search to only those entries that were deposited along with experimental structure factors or NMR restraint files.

Expanded VRML Generation. The VRML interface for generating molecular images has been expanded to include options for scene annotation, for marking residues, and for drawing sites and symmetry copies.
To access these features, select the option "VRML (custom options, full screen display)" from the View Structure page within Structure Explorer.

Cross-Linked Files for NMR-Determined Structures. The PDB now provides facile cross-linking of all files for NMR-determined structures. Coordinate sets for structures determined by NMR may be stored in several different PDB files with different PDB IDs, and the Structure Explorer page now provides cross-linking of these files. Specifically, if an averaged minimized structure is deposited to the PDB, this file is separate from the file or files containing the ensemble members comprising the multiple structure solutions.

In addition, new SearchFields queries for data collection information, number of chains, and source organism have now been implemented on the beta test site. See the beta news page for more information.

ACCESS

NEW PDB MIRROR SITE IN BRAZIL

February 22, 2000 -- The Universidade Federal de Minas Gerais/CENAPAD in Brazil has established a new RCSB PDB mirror site (http://vega.cenapad.ufmg.br/). Following the addition of this new mirror site, there are seven RCSB PDB sites available worldwide.

The RCSB's priority when establishing new mirror sites is to ensure that there is good global coverage to facilitate access to PDB data. The RCSB PDB mirrors (http://www.rcsb.org/pdb/mirrors.html) now span four continents with sites at SDSC, Rutgers University, NIST, the Cambridge Crystallographic Data Center in the United Kingdom, the National University of Singapore, Osaka University in Japan, and the Brazil site.

PDB WEB SITE STATISTICS

An analysis of the PDB Web logs shows that February 2000 was the busiest month for daily average Web page hits and files downloaded since April 1999. At these rates, February 2000 would have surpassed the current record month of October 1999, had February been given 31 days.
While the traffic to PDB mirror sites and the PDB beta test site continues to increase, most users visit the main PDB site at www.rcsb.org/pdb.

             Daily Average  /  Monthly Totals
Month        Hits    Files  /   Sites      Kbytes      Files       Hits
Mar 2000   88,167   67,388  /  31,051  33,175,208  1,482,542  1,939,674
Feb 2000  101,007   76,326  /  40,107  50,642,141  2,213,457  2,929,231
Jan 2000   79,422   61,460  /  39,278  45,228,980  1,905,289  2,462,109

PDB OUTREACH ACTIVITIES

PDB AT THE BIOPHYSICAL SOCIETY ANNUAL MEETING

February 12-16, 2000 -- A presentation, an award, and an exhibition booth were the focus of the PDB's attendance at the 44th Annual Biophysical Society Meeting in New Orleans, Louisiana.

Dr. Helen M. Berman, director of the PDB, presented a talk entitled "The Past, Present, and Future of the Protein Data Bank" at the Awards Symposium after receiving the Biophysical Society's Distinguished Service Award for her work with structural databases.

PDB members were able to meet with users one-on-one in the PDB's exhibition booth. Thanks to everyone who stopped by with their questions, input, and support. The next PDB exhibit booth will be at the American Crystallographic Association's Annual Meeting in St. Paul, Minnesota, in July.

ISSUE 91 OF PDB CD-ROM RELEASED

The PDB archive was released on this quarter's CD-ROM set. The CD-ROMs contain the full release of PDB structure files, structure factor files, NMR constraints, and some contributed software and resources. The directory structure is the same as that of earlier CD-ROMs.

The CD-ROM set includes the 11,363 structures as of the December 29, 1999, update. Five CD-ROMs are required to contain these structures in compressed (gzip) format.
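Because the files on the CD-ROM set are gzip-compressed, a reader may want to inspect one without unpacking it first. The following Python sketch uses only the standard library. The file name, the function name count_atom_records, and the small sample content (including its coordinate values) are hypothetical stand-ins written to a temporary directory; they are not files from the CD-ROM set.

```python
import gzip
import os
import tempfile


def count_atom_records(path):
    """Count ATOM and HETATM records in a gzip-compressed PDB-format file."""
    count = 0
    # Mode "rt" decompresses on the fly and yields text lines.
    with gzip.open(path, "rt", encoding="ascii") as handle:
        for line in handle:
            if line.startswith(("ATOM", "HETATM")):
                count += 1
    return count


# Build a small stand-in file so the example is self-contained.
# The records and coordinates below are placeholders, not real PDB data.
sample = (
    "HEADER    OXYGEN STORAGE\n"
    "ATOM      1  N   VAL A   1      -2.900  17.600  15.500  1.00  0.00\n"
    "ATOM      2  CA  VAL A   1      -3.600  16.400  15.300  1.00  0.00\n"
    "HETATM    3 FE   HEM A 154      14.100  28.000   4.900  1.00  0.00\n"
    "END\n"
)
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "example.ent.gz")
with gzip.open(sample_path, "wt", encoding="ascii") as handle:
    handle.write(sample)

atom_count = count_atom_records(sample_path)
print(atom_count)  # prints 3
```

Reading line by line keeps memory use small, which matters for the larger entries in the archive.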
PROPOSAL FOR CORBA STANDARD FOR MACROMOLECULAR STRUCTURE DATA SUBMITTED

February 11, 2000 -- In an activity of interest to the bioinformatics community, the RCSB submitted an initial technology proposal to the Object Management Group (OMG) that defines a CORBA interface for macromolecular structure data. The final CORBA specification, when accepted, will enable the PDB to publish a robust and efficient interface definition for use by programs and other databases accessing the PDB. The submission is based, in part, upon scientific nomenclature defined by the International Union of Crystallography.

The submission (http://www.omg.org/cgi-bin/doc?lifesci/00-02-02) is available in FrameMaker (as a gzipped tar file), PostScript, and PDF formats. Please note that these files are quite large and may take some time to download.

MOLECULES OF THE QUARTER: MYOGLOBIN, BACTERIOPHAGE phiX174, AND DNA POLYMERASE

April 4, 2000 -- The PDB has initiated a new feature for its Web site called "Molecule of the Month." Written and drawn by David S. Goodsell, an assistant professor of molecular biology at The Scripps Research Institute in La Jolla, California, these features provide an overview of significant milestones in the growth of the PDB's macromolecular structure data for a diverse audience.

Myoglobin: The First Protein Structure

January 2000 -- Any discussion of protein structure must necessarily begin with myoglobin (http://www.rcsb.org/pdb/molecules/mb1.html), because myoglobin is where the science of protein structure really began. After years of arduous work, John Kendrew and his coworkers determined the atomic structure of myoglobin, laying the foundation for an era of biological understanding. That first glimpse at protein structure is available at the PDB, under the accession code 1mbn (http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1mbn). Take a closer look at this molecule, or look directly at the PDB information for 1mbn.
You will be amazed, just as the world was in 1960, at the beautiful intricacy of this protein.

Myoglobin is a small, bright red protein. It is very common in muscle cells, and gives meat much of its red color. Its job is to store oxygen, for use when muscles are hard at work. If you look at John Kendrew's PDB file, you will notice that the myoglobin that he used was taken from sperm whale muscles. As you can imagine, marine whales and dolphins have a great need for myoglobin, so that they can store extra oxygen for use in their deep dives undersea.

Bacteriophage phiX174: A Milestone at the PDB

February 2000 -- The 10,000th entry in the Protein Data Bank, the bacteriophage phiX174 (http://www.rcsb.org/pdb/molecules/pdb2_1.html), is a perfect example of how the science of protein structure has progressed in four decades. In 1960, the world got its first look at the structure of a protein. That first structure was the small protein myoglobin, composed of one protein chain and one heme group -- about 1,260 atoms in all. By contrast, the 10,000th entry in the PDB contains 420 protein chains and over half a million atoms. Enormous structures like this are not uncommon in the Protein Data Bank. The stakes have risen dramatically since the structure of myoglobin was first revealed.

A bacteriophage is a virus that attacks bacteria. The phiX174 bacteriophage attacks the common human bacterium Escherichia coli, infecting the cell and forcing it to make new viruses. Do you think that viruses are living organisms? PhiX174 is composed of a single circle of DNA surrounded by a shell of proteins. That's all. It can inject its DNA into a bacterial cell, then force the cell to create many new viruses. These viruses then burst out of the cell, and go on to hijack more bacteria. By itself, it is like an inert rock. But given the proper bacterial host, it is a powerful reproducing machine. What do you think? Is it alive?
DNA Polymerase: The Secret of Life

March 2000 -- DNA polymerase (http://www.rcsb.org/pdb/molecules/pdb3_1.html) plays the central role in the processes of life. It carries the weighty responsibility of duplicating our genetic information. Each time a cell divides, DNA polymerase duplicates all of its DNA, and the cell passes one copy to each daughter cell. In this way, genetic information is passed from generation to generation. Our inheritance of DNA creates a living link from each of our own cells back through trillions of generations to the first primordial cells on Earth. The information contained in our DNA, modified and improved over millennia, is our most precious possession, given to us by our parents at birth and passed to our children.

DNA polymerase is the most accurate enzyme. It creates an exact copy of your DNA each time, making fewer than one mistake in a billion bases. This is far better than information in our own world: imagine reading a thousand novels, and finding only one mistake. The excellent match of cytosine to guanine and adenine to thymine, the language of DNA, provides much of the specificity needed for this high accuracy. But DNA polymerase adds an extra step. After it copies each base, it proofreads it and cuts it out if the base is wrong.

NEW PUBLICATION ANALYZING THE PDB

February 8, 2000 -- Helge Weissig and Phil Bourne of the SDSC-PDB team have published a paper on an analysis of the PDB for trends in data quality and consistency. They found that, averaged over the complete collection, the stereochemical quality of atomic models has, in the past few years, moved towards ideal values. At the same time, there are inconsistencies in how data are reported. Water content is not reported consistently, and the percent of data collected when reporting the high-resolution shell varies, detracting from the value of resolution as a yardstick for assessing the quality of a structure.
A more detailed analysis of these inconsistencies is hampered by the lack of machine-readable experimental data. To the user of macromolecular structure data, this suggests that structural details beyond the standard quality measures of resolution and R value should be considered when using coordinate sets for further derivation or in inferring biological function. To the curators of the PDB, this suggests the need to capture more of the experimental data associated with the experiment in a way that permits straightforward parsing.

Weissig, H. and P.E. Bourne. 1999. An analysis of the Protein Data Bank in search of temporal and global trends (http://www3.oup.co.uk/bioinformatics/hdb/Volume_15/Issue_10/150807.sgm.abs.html). Bioinformatics 15:807-831.

-----------------------------------------

STATEMENT OF SUPPORT

The PDB is supported by funds from the National Science Foundation, the Office of Biology and Environmental Research at the Department of Energy, and two units of the National Institutes of Health: The National Institute of General Medical Sciences and the National Library of Medicine, in addition to resources and staff made available by the host institutions.

-----------------------------------------

RCSB PDB TEAM

Dr. Helen M. Berman - Director
Department of Chemistry
Rutgers University
610 Taylor Road
Piscataway, NJ 08854-8087
732-445-4667
Fax: 732-445-4320
berman@rcsb.rutgers.edu

Dr. John Westbrook - PDB Project Team Leader
Department of Chemistry
Rutgers University
610 Taylor Road
Piscataway, NJ 08854-8087
732-445-4290
Fax: 732-445-4320
jwest@rcsb.rutgers.edu

Dr. Gary Gilliland - PDB Project Team Leader
National Institute of Standards and Technology
Mail Stop 8310
Gaithersburg, MD 20899-8310
301-975-2629
Fax: 301-330-3447
gary.gilliland@nist.gov

Dr. Peter Arzberger - PDB Project Team Leader
San Diego Supercomputer Center
University of California, San Diego
9500 Gilman Drive
La Jolla, CA 92093-0505
858-534-5079
Fax: 858-822-0948
parzberg@sdsc.edu

Dr. Philip Bourne - PDB Project Team Leader
San Diego Supercomputer Center
University of California, San Diego
9500 Gilman Drive
La Jolla, CA 92093-0505
858-534-8301
Fax: 858-822-0873
bourne@sdsc.edu

RCSB Members

Rutgers University
610 Taylor Road
Piscataway, NJ 08854-8087

Kyle Burkhardt - Annotation
  kburkhar@rcsb.rutgers.edu
Victoria Colflesh - Annotation
  victoria@rcsb.rutgers.edu
Dr. Zukang Feng - Annotation
  zfeng@rcsb.rutgers.edu
Michael Huang - Annotation
  mshuang@rcsb.rutgers.edu
Dr. Lisa Iype - Annotation
  lisa@rcsb.rutgers.edu
Dr. Shri Jain - Annotation
  sjain@rcsb.rutgers.edu
Dr. Rachel Kramer - Outreach
  kramer@rcsb.rutgers.edu
Dr. Bohdan Schneider - Annotation
  bohdan@rcsb.rutgers.edu
Dr. Kata Schneider - Annotation
  kata@rcsb.rutgers.edu
Olivera Tosic - Software Development
  olivera@rcsb.rutgers.edu
Christine Zardecki - Outreach
  zardecki@rcsb.rutgers.edu

SDSC
UC San Diego
9500 Gilman Drive
La Jolla, CA 92093

Dr. John Badger - Outreach
  badger@sdsc.edu
Bryan Bannister - System Integration
  bryan@sdsc.edu
Justin Caballero - Programming and Cross-linking
  justin@sdsc.edu
Dr. Douglas S. Greer - CORBA Development
  dsg@sdsc.edu
Dr. Michael Gribskov - Consultant
  gribskov@sdsc.edu
Dr. John Kowalski - Systems Administrator
  kowalski@sdsc.edu
Arcot Rajasekar - Mirror programming
  sekar@sdsc.edu
Shawn Strande - Project Manager
  strande@sdsc.edu
Dr. Lynn F. Ten Eyck - Consultant
  teneyckl@sdsc.edu
Dr. Helge Weissig - Technical Manager
  helgew@sdsc.edu
Dr. Kenneth Yoshimoto - Mirrors and Database Administrator
  kenneth@sdsc.edu

NIST
Biotechnology Division
National Institute of Standards and Technology
Gaithersburg, MD 20899-8310
Department of Chemistry

Dr. Talapady N. Bhat - Data Uniformity and NMR
  bhat@nist.gov
Phoebe Fagan - CD-ROM and Master Archive
  phoebe.fagan@nist.gov
Dr. Diane Hancock - Data Uniformity and NMR
  diane.hancock@nist.gov
Dr. Veerasamy Ravichandran - Data Uniformity
  veerasamy.ravichandran@nist.gov
Paul Reneke - Programming
  paul.reneke@nist.gov
Joan Sauerwein - CD-ROM distribution
  joan.sauerwein@nist.gov
Dr. Narmada Thanki - Data Uniformity
  narmada@dakota.carb.nist.gov
Dr. Michael Tung - Programming
  michael.tung@nist.gov