RCSB PDB Newsletter Number 37 -- April 2008

Published quarterly by the Research Collaboratory for Structural Bioinformatics Protein Data Bank

Weekly RCSB PDB news is published at www.pdb.org

To change your subscription options, please visit lists.sdsc.edu/mailman/listinfo.cgi/rcsb-news

-----------------------------------------
TABLE OF CONTENTS

Message from the RCSB PDB

Data Deposition and Processing
  sf-convert: A Format Conversion Tool for Structure Factor Files
  EmDep2: Deposit EM Maps at the MSD-EBI or RCSB PDB
  2008 Deposition Statistics
  Data Processing Versioning Procedures

Data Query, Reporting, and Access
  Website Statistics
  PDB Statistics
  Time-stamped Copies of PDB Archive Available via FTP

Outreach and Education
  RCSB PDB Celebrates Teaching, Learning, and More
  Protein Sculptures on Display at Rutgers
  Papers Published

Education Corner: Moving Pictures: Using Chimera to Make Molecular Multimedia for the Classroom by Dr. Jeramia Ory, Kings College

PDB Community Focus: Dr. Christine Orengo, University College London

Statement of Support, Partners, Leadership Team

Snapshot

--------------------------------------------
MESSAGE FROM THE RCSB PDB

April 1 marked the 100th edition of the Molecule of the Month, a series produced by David S. Goodsell and featured on the RCSB PDB website. Since January 2000, this series has explored the structure and function of proteins and nucleic acids found in the PDB archive such as transfer RNA, anthrax toxin, and multidrug resistance transporters. To commemorate this event, the RCSB PDB will be offering temporary tattoos of an adrenergic receptor at upcoming meetings. The feature is also available in a specially formatted PDF.

Written and illustrated by David S. Goodsell (The Scripps Research Institute), the Molecule of the Month provides an easy introduction to the RCSB PDB for teachers and students.
It is used in many classrooms to introduce structures to students, and is an integral part of the protein modeling event at the Science Olympiad. The text and images are related to the featured molecule; the RCSB PDB pages link to examples of the molecule. In response to requests, a view of the highlighted structure in Jmol is included in new features to provide an interactive view of the molecule.

New Molecule of the Month features are made available from the RCSB PDB home page with the first update of each month. Alphabetical and chronological listings of past issues are provided. wwPDB partner PDBj has recently started to translate the Molecule of the Month into Japanese.

Links to the series are also available from RCSB PDB's Structure Explorer pages. Selecting "Learn more: [M]" takes the reader to any Molecule of the Month feature related to that particular entry.

To create the series, Goodsell combines his artistic talent with his scientific expertise in his visual representations of molecular biology. He creates his images so as to capture his excitement about science and communicate it to others. “The combination of art and science gives me a way to access the wonder of nature. It makes me really look at results and think about them in a deeper way,” Goodsell says. “The thing that drives me continually is the beauty of these objects that I’m working on and being amazed at how unusual they are.”

--------------------------------------------
DATA DEPOSITION AND PROCESSING

sf-convert: A Format Conversion Tool for Structure Factor Files

The command-line program, sf-convert, can easily translate data in various formats to the mmCIF format for use with ADIT validation and deposition software. sf-convert can also translate structure factors already released in the PDB from mmCIF to different formats.
This tool can input files from the following programs and formats: mmCIF, CIF, MTZ, CNS, Xplor, HKL2000, Scalepack, Dtrek, TNT, SHELX, SAINT, EPMR, XSCALE, XPREP, XTALVIEW, X-GEN, XENGEN, MULTAN, MAIN, and OTHER (an ASCII file with H, K, L, F, and SigmaF separated by a space). sf-convert can then output the data formatted as mmCIF, MTZ, CNS, TNT, SHELX, EPMR, XTALVIEW, HKL2000, Dtrek, XSCALE, MULTAN, MAIN, or OTHER.

sf-convert is available for download from sw-tools.pdb.org.

EmDep2: Deposit EM Maps at the MSD-EBI or RCSB PDB

Electron microscopy map data can now be deposited to the Electron Microscopy Data Bank (EMDB) using the improved web-based tool EmDep2. EmDep2 is available from the existing deposition site at the MSD-EBI in Europe and also from a new deposition site at the RCSB PDB in the USA. The EMDB contains experimentally determined 3D maps and associated experimental data and files.

This improvement to EMDB services is the first product of a collaboration between the European Network of Excellence 3D-EM (www.3dem-noe.org) and the recently NIH-funded Partnership for a Unified Data Resource for CryoEM (emdatabank.org). This partnership comprises the European Bioinformatics Institute, the Research Collaboratory for Structural Bioinformatics at Rutgers, and the National Center for Macromolecular Imaging at Baylor College of Medicine.

EMDB: www.ebi.ac.uk/msd-srv/docs/emdb
EmDep2 (EBI): www.ebi.ac.uk/msd-srv/emdep
EmDep2 (RCSB PDB): emdb.rutgers.edu/emdep

2008 Deposition Statistics

In the first quarter of 2008, 1626 experimentally-determined structures were deposited to the PDB archive. The entries were processed by wwPDB teams at the RCSB PDB, MSD-EBI, and PDBj. Of the structures deposited in the first quarter of 2008, 75% were deposited with a release status of "hold until publication"; 22.5% were released as soon as annotation of the entry was complete; and 2.5% were held until a particular date.
90% of these entries were determined by X-ray crystallographic methods; 9% were determined by NMR methods. 97% of these depositions were deposited with experimental data. As of February 1, 2008, the deposition of experimental data is required.

During the same period of time, 1915 structures were released into the archive.

Data Processing Versioning Procedures

Data in the PDB archive currently follow either PDB File Format Version 3.0 or 3.1. This is indicated in REMARK 4 of the file.

Version 3.0 is the format used for files released as a result of the Remediation Project. Since August 1, 2007, all files processed and released into the archive have followed Version 3.1. When modifications have been made to files released prior to that date, they have been then re-released in Version 3.1.

Version 3.1 differs from Version 3.0 in descriptions of the biological unit (REMARK 300/350), geometry (REMARK 500), atom/residues modeled as zero occupancy (REMARK 475/480), non-polymer residues with missing atoms (REMARK 610), and metal coordination (REMARK 620). Documentation describing the differences between these versions is available at www.wwpdb.org/docs.html.

Since the beginning of March 2008, the REVDAT record indicates when a Version 3.0 file is re-released as Version 3.1 with the name "VERSN." For example, if the journal record has been updated in an entry that previously followed Version 3.0, the REVDAT would appear as:

REVDAT   1   04-MAR-08 1ABC    1       JRNL   VERSN
REVDAT   1   13-FEB-07 1ABC    0

There is no change to how depositors submit their files. Any required changes in nomenclature can be made automatically by the wwPDB during the annotation process.

Documentation about file formats and the Remediation Project is available at www.wwpdb.org.

--------------------------------------------
DATA QUERY, REPORTING, AND ACCESS

Website Statistics

Website access statistics for the first quarter of 2008 are given below.
Month       Unique Visitors   Number of Visits   Bandwidth
Jan 2008    128,781           319,459            426.87 GB
Feb 2008    139,444           338,946            567.18 GB
Mar 2008    152,264           361,999            642.98 GB

PDB Statistics

Which journal has published the most structures? What types of structures have been solved by more than one experimental method? Answers to these questions can be found by exploring the various statistics about the data in the PDB archive available by clicking the PDB Statistics link at the top of every page on the RCSB PDB website.

Charts, graphs, and tables related to content distribution include:

* Summary Table of Released Entries: Current PDB holdings grouped by experimental method and molecule type
* Status of Unreleased Entries: A pie chart that illustrates the status of unreleased entries
* Interactive histograms showing the archive by function, resolution, space group, source organism, journal, molecular weight, and enzyme classification
* Histogram showing the number of structures solved by structural genomics centers
* Table of proteins solved by multiple experimental methods
* Current statistics on redundancy in the archive

The growth in the number of structures released in the PDB archive can be seen per year, by experimental method, and by molecule type. Other graphs show the growth of unique protein classifications as defined by SCOP (scop.mrc-lmb.cam.ac.uk/scop/index.html) and CATH (cathwww.biochem.ucl.ac.uk).

Time-stamped Copies of PDB Archive Available via FTP

A time-stamped snapshot of the PDB archive (ftp.wwpdb.org) as of January 7, 2008 has been added to ftp://snapshots.rcsb.org/. Snapshots of the PDB have been archived annually since 2004. It is hoped that these snapshots will provide readily identifiable data sets for research on the PDB archive. The script at ftp://snapshots.rcsb.org/rsyncSnapshots.sh may be used to make a local copy of a snapshot or sections of the snapshot.
The directory 20080107 includes the 48,161 experimentally-determined coordinate files that were current as of January 7, 2008. Coordinate data are available in PDB, mmCIF, and XML formats. The date and time stamp of each file indicates the last time the file was modified.

--------------------------------------------
OUTREACH AND EDUCATION

RCSB PDB Celebrates Teaching, Learning, and More

Recent education and outreach activities have included:

* Annotators made models of virus structures with local middle school students as part of Princeton University's Science and Engineering Expo on March 19. The models included marshmallow and toothpick representations of the viral shell and paper models of the dengue fever virus.
* An exhibit booth was also held at the Teaching & Learning Celebration in New York, NY, March 7-8. Educators and policy makers from the Tri-State area came to the booth to learn about protein structures, the RCSB PDB, and to take home tRNA tattoos.
* The RCSB PDB exhibited at the Biophysical Society’s annual meeting (February 2-6; Long Beach, CA).

Protein Sculptures on Display at Rutgers

Sculptures and photographs by Julian Voss-Andreae were on display at the Rutgers Student Center in New Brunswick, New Jersey in February. Voss-Andreae's unique sculptures are designed to tell stories about hemoglobin, collagen, and other structures essential to life.

Julian Voss-Andreae is a German-born sculptor based in Portland. He graduated from the Pacific Northwest College of Art (PNCA) in 2004 with a BFA in sculpture. While still at PNCA, Voss-Andreae developed a novel kind of sculpture based on the structure of proteins, the building blocks of life. Voss-Andreae's work has been commissioned internationally and has been highlighted in journals such as Leonardo and Science.
Photographs of Voss-Andreae's sculptures are part of the RCSB PDB's Art of Science traveling exhibit, which also features images available from the RCSB PDB website and the Molecule of the Month. For more information on hosting this exhibit, please contact info@rcsb.org.

The next stop for the cycloviolacin sculpture is the Art and Mathematics: The Wonders of Numbers exhibit at The Heckscher Museum of Art, April 12 - June 22, 2008, in Huntington, NY.

Papers Published

The PDB archive, which began in 1971 as a handwritten petition signed by crystallographers, has developed into an online biological database and resource used by a diverse community of teachers, students, and researchers in academia and industry worldwide. This history is described in an article published in an issue of Acta Crystallographica that commemorates various milestones in the crystallographic community:

Helen M. Berman (2008) The Protein Data Bank: a historical perspective. Acta Cryst. A64: 88-95. doi: 10.1107/S0108767307035623

A paper describing the work done as part of the wwPDB Remediation Project, including the standardization of IUPAC nomenclature for chemical components, an update of sequence database references and taxonomies, and improvements in the representations of viruses, has been published in Nucleic Acids Research's 2008 Database Issue.

K. Henrick, Z. Feng, W. F. Bluhm, D. Dimitropoulos, J. F. Doreleijers, S. Dutta, J. L. Flippen-Anderson, J. Ionides, C. Kamada, E. Krissinel, C. L. Lawson, J. L. Markley, H. Nakamura, R. Newman, Y. Shimizu, J. Swaminathan, S. Velankar, J. Ory, E. L. Ulrich, W. Vranken, J. Westbrook, R. Yamashita, H. Yang, J. Young, M. Yousufuddin, H. M. Berman (2008) Remediation of the Protein Data Bank archive. Nucleic Acids Research 36: D426-D433. doi: 10.1093/nar/gkm937

The data from the Remediation Project are available through the FTP archive and wwPDB member sites. Detailed documentation about the Remediation Project is available at www.wwpdb.org.
--------------------------------------------
EDUCATION CORNER: Moving Pictures
by Dr. Jeramia Ory, Kings College

Getting students to grasp the link between 3D structure and biological function is a necessary and challenging part of many undergraduate courses. Structural information can help students “get it” in a way that should not be underestimated. As an example, numerous students have told me how much easier it is to understand stereochemistry when they can manipulate chemicals on a computer screen in 3D rather than trying to work out wedge/hash 2D conventions. As the number of structures in the PDB archive continues to grow, the challenge lies not in finding structural information related to the topic at hand (the Advanced Search on the RCSB PDB website is a great resource), but in incorporating the information into lecture materials and presentations without draining an instructor’s time or resources. Fortunately for instructors, a number of free programs that excel in molecular visualization and analysis are now available. Instead of reviewing the myriad of programs out there, I will focus on the one I use to create multimedia presentations for my students: Chimera(1).

Chimera is written and maintained by the Computer Graphics Lab at the University of California, San Francisco. It has a long history in molecular visualization, having started as a program designed in 1980 for high-end graphics workstations. What this means practically is that this research group has been thinking about the needs of the molecular visualization community for a long time. As modern desktop computing power has grown, the visualization community has expanded from its original base of X-ray crystallographers to educators and students as young as high school. While no program can be all things to all people, Chimera comes close. I have personally used it for hands-on molecular visualization workshops with groups ranging from high school students to undergraduates with good results.
Chimera has a few advantages when compared to other packages out there.

Cost. Chimera is free for academic use. This is becoming less of a unique feature with the rise of open source and free software, but it is still an important consideration. After using Chimera for an exercise, I direct students to the download page so they can use it on their home computers if they wish to continue exploring. Just as important, Chimera is available for every major operating system: Windows, Macintosh and Linux (and more). Of course, as critics of free software are fond of saying, “free software is only free if your time has no value.” Luckily, the program is forgiving to new users, and rewards time spent with it.

Learning Curve. The last thing educators have is time to waste. Chimera is a powerful analysis and visualization tool and is written with scientists in mind; however, it is quite easy to learn, and new users can generally find their way around the program in about an hour. I run a protein visualization exercise in my undergraduate Biochemistry class that walks the students through the basics of Chimera; they align myoglobin and hemoglobin and then color the aligned residues by conservation. Most students complete the exercise in 50 minutes and find it useful to be able to explore protein structure on their own. Should they not finish, the fact that it is free means they can finish up at the campus computer lab or at home. The program is well-documented online (www.cgl.ucsf.edu/chimera) and comes with tutorials for new users.

****For the rest of this article, please see the online newsletter at www.pdb.org****

(1) E.F. Pettersen, T.D. Goddard, C.C. Huang, G.S. Couch, D.M. Greenblatt, E.C. Meng, and T.E. Ferrin (2004) UCSF Chimera--a visualization system for exploratory research and analysis. J Comput Chem. 25(13): 1605-12.

--------------------------------------------
PDB Community Focus: Dr. Christine Orengo, University College London

Q. In 1997, you and your colleagues established CATH(1), a system that is used to classify protein domain structures. How are researchers using CATH today? What types of research and discoveries does it enable? Has its usage changed in the past ten years?

A. In the early 1990s, there were over three thousand structures deposited in the PDB and Janet Thornton realized that we could get some very useful insights into protein folding and evolution by grouping these into fold groups and evolutionary families. I was fortunate to join her group at that time and we set about doing this classification with the benefit of a very sensitive structure comparison algorithm developed by Willie Taylor and myself, at NIMR. We designed a hierarchical classification which grouped proteins according to their basic secondary structure composition (Class), 3D shape (Architecture), folding arrangement (Topology), and finally evolutionary ancestry (Homology). Although we largely use automated approaches, identifying domain boundaries in multi-domain proteins and recognizing homologues are difficult and very time consuming, as they need manual validation, which is why we only have ~80% of the PDB classified to date. We have just introduced some sophisticated new protocols that we think will help us to increase this percentage over the next year.

Despite this slight lag with the PDB, CATH is widely used and currently receives about a million web page hits per month from sites all over the world. We have put considerable effort into the design of the resource, trying to present the information in an intuitive and easily accessible form, and I believe this is reflected in its high usage. SCOP(2), a related resource, is also very widely used but because we exploit slightly different criteria to classify folds and provide additional information on superfamilies (e.g. multiple structure alignments), the two resources are somewhat complementary. I think CATH is particularly useful for teaching.
Perhaps the other distinctive feature of CATH is that we have developed our own structure comparison methods and provide a service (CATHEDRAL web server (3)) for scanning new structures against representative domains. This is very popular with structural biologists as it can be used to recognize novel folds or classify new structures into existing superfamilies. The CATH fold library is also exploited by computational biologists developing methods to predict whether a sequence is likely to adopt one of the known structures. We have now extended CATH to include all sequences in the genomes that can be predicted to belong to a CATH superfamily (CATH-Gene3D (4)) and this has allowed us to increase the functional annotations associated with each superfamily hugely. Biologists are increasingly using CATH and Gene3D to obtain structural and functional annotations for their proteins and this has been facilitated by further dissemination of the information through the DAS annotation systems set up by the Biosapiens network (www.biosapiens.info). Perhaps one of the most interesting phenomena revealed by classifying structures is the incredible bias in the populations of the fold groups and evolutionary superfamilies. In 1994, Janet Thornton and I reported the existence of the superfolds, a set of 10 folds which were highly over-represented in CATH (5). This trend still exists and the integration of sequence data through Gene3D has shown that it is not an artifact of sampling but a genuine reflection of the dominance of certain folds in nature. The bias is also apparent at the evolutionary superfamily level. For instance, the 100 largest superfamilies in CATH account for nearly half the domain sequences of predicted structures in completed genomes. 
As CATH has become more highly populated, it has been used to study and characterize the structural mechanisms involved in the evolution of proteins and their functions; in particular, the extent to which structural embellishments to the domain core can modify the geometry of active sites or influence surface features mediating different protein-protein interactions. The integration of genome sequences in CATH-Gene3D has illuminated functional diversity across superfamilies, and recent changes in the usage of CATH reflect biologists’ interests in performing comparative genome analyses with this extensive functional data. For example, a comparison of CATH superfamilies, universal to bacteria, revealed that the expansion of metabolic and regulatory superfamilies with genome size is balanced, allowing maximum enrichment of the metabolic repertoire within the constraints of maintaining a small genome for fast replication (6).

****For the rest of this article, please see the online newsletter at www.pdb.org****

(1) C.A. Orengo, A.D. Michie, S. Jones, D.T. Jones, M.B. Swindells, and J.M. Thornton (1997) CATH--a hierarchic classification of protein domain structures. Structure. 5: 1093-1108.
(2) L. Conte, A. Bart, T. Hubbard, S. Brenner, A. Murzin, and C. Chothia (2000) SCOP: a structural classification of proteins database. Nucleic Acids Res. 28(1): 257-259.
(3) O.C. Redfern, A. Harrison, T. Dallman, F.M. Pearl, and C.A. Orengo (2007) CATHEDRAL: a fast and effective algorithm to predict folds and domain boundaries from multidomain protein structures. PLoS Comput Biol. 3(11): e232.
(4) C. Yeats, J. Lees, A. Reid, P. Kellam, N. Martin, X. Liu, and C. Orengo (2008) Gene3D: comprehensive structural and functional annotation of genomes. Nucleic Acids Res. 36(Database issue): D414-8.
(5) C.A. Orengo, D.T. Jones, and J.M. Thornton (1994) Protein superfamilies and domain superfolds. Nature. 372(6507): 631-4.
(6) J. Ranea, D. Buchan, J. Thornton, and C. Orengo (2005) Microeconomic principles explain an optimal genome size in bacteria. Trends Genet. 21: 21-25.

----------------------------------------
STATEMENT OF SUPPORT

The RCSB PDB is supported by funds from the National Science Foundation, the National Institute of General Medical Sciences, the Office of Science, Department of Energy, the National Library of Medicine, the National Cancer Institute, the National Center for Research Resources, the National Institute of Biomedical Imaging and Bioengineering, the National Institute of Neurological Disorders and Stroke, and the National Institute of Diabetes & Digestive & Kidney Diseases.

The RCSB PDB is managed by two partner sites of the Research Collaboratory for Structural Bioinformatics:

RUTGERS
Rutgers, The State University of New Jersey
Department of Chemistry and Chemical Biology
610 Taylor Road
Piscataway, NJ 08854-8087

SDSC/Skaggs/UCSD
San Diego Supercomputer Center and the Skaggs School of Pharmacy and Pharmaceutical Sciences
University of California, San Diego
9500 Gilman Drive
La Jolla, CA 92093-0537

RCSB PDB LEADERSHIP TEAM

Dr. Helen M. Berman - Director
Rutgers University
berman@rcsb.rutgers.edu

Dr. Philip E. Bourne - Co-Director
SDSC/Skaggs/UCSD
bourne@sdsc.edu

Dr. Martha Quesada - Deputy Director
Rutgers University
mquesada@rcsb.rutgers.edu

A list of current RCSB PDB Team Members is available from the website.

The RCSB PDB is a member of the Worldwide PDB (www.wwpdb.org)

--------------------------------------------
SNAPSHOT
April 1, 2008

49760 released atomic coordinate entries

* Molecule Type
45906 proteins, peptides, and viruses
1839 nucleic acids
1982 protein/nucleic acid complexes
33 other

* Experimental Technique
42342 X-ray
7150 NMR
170 electron microscopy
98 other

31499 structure factor files
3931 NMR restraint files
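
For readers who want to work with the snapshot figures, the following is a minimal Python sketch, not part of the original newsletter. It copies the counts listed in the Snapshot, checks that each breakdown adds up to the 49760 released entries, and prints each category's share of the archive. The variable names and the shares() helper are illustrative assumptions only.

```python
# Counts copied from the SNAPSHOT section (April 1, 2008).
TOTAL_ENTRIES = 49760

by_molecule_type = {
    "proteins, peptides, and viruses": 45906,
    "nucleic acids": 1839,
    "protein/nucleic acid complexes": 1982,
    "other": 33,
}

by_technique = {
    "X-ray": 42342,
    "NMR": 7150,
    "electron microscopy": 170,
    "other": 98,
}


def shares(counts):
    """Return each category's percentage of the summed category counts."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


if __name__ == "__main__":
    for label, counts in (("Molecule Type", by_molecule_type),
                          ("Experimental Technique", by_technique)):
        # Each breakdown should account for every released entry.
        assert sum(counts.values()) == TOTAL_ENTRIES
        print(label)
        for name, pct in shares(counts).items():
            print(f"  {name:<32} {pct:6.2f}%")
```

Both breakdowns sum to the stated total, and X-ray structures account for roughly 85% of the released entries.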