PDB Newsletter
Number 6 -- Summer 2000 -- RCSB

Published quarterly by the Research Collaboratory for Structural Bioinformatics

Weekly PDB news is available on the Web at http://www.rcsb.org/pdb/latest_news.html. This newsletter will soon be available on the Web in HTML format at http://www.rcsb.org/pdb/newsletter/

SNAPSHOT -- June 27, 2000

12,592 released atomic coordinate entries

Molecule Type
11,171 proteins, peptides, and viruses
   853 nucleic acids
   550 protein/nucleic acid complexes
    18 carbohydrates

Experimental Technique
10,316 diffraction and other
    3,826 structure factor files
 1,995 NMR
      794 NMR restraint files
   281 theoretical modeling

TABLE OF CONTENTS

Message from the PDB

Data Deposition and Processing
   PDB Deposition Statistics
   PDB File Format FAQ Available
   PDB Launches a Deposition Mirror Site at Osaka University

Data Uniformity and Master Archive
   Plans for Releasing Data from the Uniformity Project
   Automated Filing System for the PDB Master Archive

Data Query and Distribution
   PDB Website Statistics
   Alternative URLs

PDB Outreach Activities
   RCSB Hosts CCPN Meeting
   PDB Featured at BIO2000
   PDB CD-ROM Set 92 Released
   Molecules of the Quarter: Collagen, Cytochrome c Oxidase, and HIV-1 Protease

PDB Job Listings

Statement of Support

RCSB PDB Team

--------------------------------------------

MESSAGE FROM THE PDB

Since July 1, 1999, the Research Collaboratory for Structural Bioinformatics (RCSB) has been completely responsible for the management of the Protein Data Bank. Since completing the transition period a full three months ahead of schedule, the RCSB has continued to improve our tools for data deposition, processing, query, and reporting. We look forward to further developing the PDB to provide a rich resource for our user community.

The PDB will be attending a few conferences this summer. We hope that many of you will stop by and visit our exhibit booths at the American Crystallographic Association's Annual Meeting (St. Paul, MN, July 22-27), the Fourteenth Symposium of the Protein Society (San Diego, CA, August 5-9), and the Eighth International Conference on Intelligent Systems for Molecular Biology (San Diego, CA, August 19-23). There will also be a PDB Users Meeting at the ACA Meeting on Tuesday, July 24, 2000, where we look forward to meeting you and answering your questions.

The PDB

DATA DEPOSITION AND PROCESSING

PDB DEPOSITION STATISTICS

In the second quarter of 2000, 618 structures were deposited to the PDB using ADIT -- 189 in April, 220 in May, and 209 in June. Depositions came from all around the globe, including Asia, Australia, Europe, and North America. These depositions are processed to completion by the RCSB-Rutgers team and returned to the author for final approval, generally within twelve days.

Approximately 17% of these entries were deposited with a HOLD release status; 58% with a hold until publication status; and 25% with an immediate release status.

PDB FILE FORMAT FAQ AVAILABLE

The PDB has released a document that addresses questions about the PDB file format frequently posed by depositors and users over the past year. The File Format FAQ at http://pdb.rutgers.edu/format-faq-v1.html compiles information from the PDB Contents Guide document originally created at Brookhaven National Laboratory, a careful study of existing files, an RCSB Workshop held in October 1998, and discussion with many users of the data. The guidelines presented there are those used by the annotation staff at the RCSB PDB. Questions not addressed in this document should be sent to deposit@rcsb.rutgers.edu.

PDB LAUNCHES A DEPOSITION MIRROR SITE AT OSAKA UNIVERSITY

The PDB has established a deposition mirror site at the Institute for Protein Research at Osaka University in Osaka, Japan. The AutoDep Input Tool (ADIT) is now open to accept PDB depositions at http://pdbdep.protein.osaka-u.jp/adit/ as well as at http://pdb.rutgers.edu/adit/.
The Osaka ADIT mirror has all of the features available from the main ADIT server, including automatic ID assignment. ADIT was developed by the RCSB to provide a simple method for depositing structure data. Entries deposited at this site will be forwarded to the PDB for processing and inclusion in the database.

DATA UNIFORMITY AND MASTER ARCHIVE

PLANS FOR RELEASING DATA FROM THE UNIFORMITY PROJECT

The goal of the PDB Data Uniformity project is to maintain the greatest possible consistency within the entire archive. Uniformity is a key prerequisite for any meaningful query or systematic analysis of the archive. Two complementary methods have been used to update and unify the data in the PDB archive.

The Data Uniformity project began by examining individual entries within groups of chemically related structures. During its long history, the PDB format has undergone a number of changes. In the file-by-file uniformity processing, each entry is brought up to the current PDB format standard. This includes adding records that were not present in early entries where possible, correcting outstanding reported problems, and providing standard nomenclature. Each file is rechecked using our current validation software. Approximately 3,000 entries containing nucleic acids, globins, retroviral proteases, and aspartic proteases have been processed in this way.

In addition to file-by-file uniformity procedures, the Uniformity Project has also targeted key records within each PDB entry for archive-wide uniformity processing. This archive-wide approach has been used to update citation, R-factor, and resolution records. These results have been loaded into the PDB database where they can be accessed from one of the PDB query interfaces or viewed in the PDB Structure Explorer reports. Other records targeted for archive-wide uniformity processing include ligand descriptions, protein classification, sequence, and source data.
Some of these records are now available on the PDB beta test website. This work complements the data clean-up project undertaken by the MSD group at the European Bioinformatics Institute. In the future, all of the records resulting from the archive-wide uniformity processing will be updated in the PDB entries as part of the file-by-file uniformity procedure using the plan described below.

One of the important issues for the PDB's ongoing data processing, including its Data Uniformity project, is the management of multiple nomenclatures. Providing alternative nomenclatures within the PDB format is a well-recognized problem. In assuming stewardship of the PDB archive, the RCSB was charged with the responsibility for maintaining the greatest possible consistency within the entire archive. Unfortunately, uniformity considerations are often at odds with the preferences of depositors who provide additional insight into the description of an entry that is outside traditional PDB practice. Recent discussions on the PDB list server regarding the assignment of chain identifiers to ligands and solvent provide important illustrations of this ongoing problem.

In planning for the release of the entries from the various uniformity projects, the PDB has sought a release scheme that would: (1) provide the flexibility to permit users of the archive to access alternative nomenclatures within the limitations of the existing PDB format, (2) integrate the results of archive-wide and file-by-file uniformity processing, and (3) preserve the integrity of the archival PDB format files available from the PDB ftp site.

In consultation with the PDB Database Committee and the PDB Advisory Committee, we arrived at the following plan:

* Data will continue to be distributed in the current PDB data format from the RCSB PDB FTP site. The nomenclature, including chain ID assignment, will continue to follow the rules previously described.
* Data will also be distributed in mmCIF format from a new ftp area. mmCIF provides a detailed and fully parsable description of macromolecular structure and experiment. mmCIF is also equipped to deal with alternative nomenclatures. This mmCIF ftp area will be used to distribute the remediated data from the Data Uniformity project, and to distribute newly processed entries in mmCIF format with support for multiple nomenclatures.
* Software tools to create PDB format files from the new mmCIF files will be provided. These software tools will permit users to select the particular nomenclature to be written to a PDB format file. For instance, it will be possible to create a PDB file using either PDB hydrogen or IUPAC hydrogen atom name conventions, or using either author or PDB chain ID conventions. An important benefit of this approach is that all of this flexibility can be provided using a single archival mmCIF file.

The new ftp area for mmCIF data will be implemented in the fall. More details about this site will be provided in the near future. Questions and comments should be sent to info@rcsb.org.

AUTOMATED FILING SYSTEM FOR THE MASTER ARCHIVE

A computer-driven filing system for the PDB Master Archive has been installed at NIST. The filing system consists of two carousel-type automated units. Each unit is 10 feet high and has 18 carriers; one unit has eight drawers per carrier and the other has nine, for a combined storage capacity of over 4,500 linear filing feet. The physical systems balance ease of operation and improved access to files with complete security for the contents. As the files are transferred to the new filing system they will be bar-coded for future access.

The PDB files contain author correspondence in addition to a hard copy of the deposited data. While the data may be needed for historical reference or for checking the current files, the privacy of the author's correspondence will be maintained.
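
As a rough illustration of the nomenclature selection described in the release plan above (writing a PDB format file with either author or PDB chain ID conventions from a single mmCIF source), the following Python sketch may help. It is not the RCSB conversion software. The function name to_pdb_lines, the list-of-dictionaries record layout, and the sample atoms are assumptions made for this example; the only names taken from mmCIF are the atom_site items label_asym_id and auth_asym_id, which are assumed here to carry the PDB-assigned and author-assigned chain identifiers. The column layout is a simplified form of the PDB ATOM record.

```python
# Hypothetical sketch: choose a chain ID convention when writing
# PDB-style ATOM records from mmCIF-like atom data.
# This is an illustration only, not the RCSB conversion tool.

# Each dict mimics a few atom_site items; all values are made up.
ATOMS = [
    {"id": 1, "atom": "N", "comp": "GLY", "seq": 1,
     "label_asym_id": "A", "auth_asym_id": "H",
     "x": 11.104, "y": 6.134, "z": -6.504},
    {"id": 2, "atom": "CA", "comp": "GLY", "seq": 1,
     "label_asym_id": "A", "auth_asym_id": "H",
     "x": 11.639, "y": 6.071, "z": -5.147},
]

# Map a user-facing convention name to the mmCIF item that holds it.
CHAIN_ITEM = {"pdb": "label_asym_id", "author": "auth_asym_id"}


def to_pdb_lines(atoms, chain_convention="pdb"):
    """Return simplified ATOM lines using the requested chain convention."""
    chain_item = CHAIN_ITEM[chain_convention]
    lines = []
    for a in atoms:
        name = a["atom"]
        # Common convention: names shorter than 4 characters start in
        # column 14, so pad with one leading space.
        name_field = name if len(name) == 4 else " " + name.ljust(3)
        lines.append(
            "ATOM  %5d %s %3s %s%4d    %8.3f%8.3f%8.3f"
            % (a["id"], name_field, a["comp"], a[chain_item], a["seq"],
               a["x"], a["y"], a["z"])
        )
    return lines


if __name__ == "__main__":
    for convention in ("pdb", "author"):
        print("chain convention:", convention)
        for line in to_pdb_lines(ATOMS, convention):
            print(line)
```

The point of the sketch is that both outputs come from the same input records; only the item consulted for the chain column changes, which mirrors the idea of keeping a single archival mmCIF file and selecting the nomenclature at conversion time.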
DATA QUERY AND DISTRIBUTION

PDB WEBSITE STATISTICS

Access statistics for the primary PDB website show that the number of hits it receives has remained consistently high over the past few months, with monthly totals of over 2,800,000 hits and over 2,100,000 files downloaded since February 2000. A new record of 3,117,732 website hits was achieved during the month of March 2000, which is 144,513 more hits than the previous record month of October 1999. The number of files downloaded in March also reached a new record high of 2,379,710 files, 138,953 more files than in October 1999. We hope that the final figures for the month of June 2000 will show that this pattern has continued. While the www.rcsb.org address continues to receive the most traffic, use of the mirror sites and beta test site continues to increase.

           Daily Average              Monthly Totals
Month       Hits   Files   Sites     KBytes     Files      Hits
Jun 00*    83665   64913   30372   43326408   1687740   2175309
May 00     92593   71973   38561   52538201   2231175   2870411
Apr 00     94585   72290   39089   47959718   2168708   2837551
Mar 00    100572   76764   42440   52121187   2379710   3117732

*Statistics as of 6/27/00

ALTERNATIVE URLs

The URL http://www.pdb.org now directs the Web browser to the home page for the primary PDB site at http://www.rcsb.org/pdb/. This alternative URL has been established to help new PDB users find the RCSB PDB website.

In addition to the partner sites available at http://rutgers.rcsb.org and http://nist.rcsb.org, the RCSB mirror sites are also available as alternatives to the main PDB website.
Here is a listing of mirror site addresses:

CCDC, United Kingdom
   http://pdb.ccdc.cam.ac.uk/
   ftp://pdb.ccdc.cam.ac.uk/rcsb/

National University of Singapore, Singapore
   http://pdb.bic.nus.edu.sg/
   ftp://pdb.bic.nus.edu.sg/pub/pdb/

Osaka University, Japan
   http://pdb.protein.osaka-u.ac.jp/
   ftp://ftp.protein.osaka-u.ac.jp/pub/pdb/

Universidade Federal de Minas Gerais, Brazil
   http://www.pdb.ufmg.br/
   ftp://vega.cenapad.ufmg.br/pub/pdb/

PDB OUTREACH ACTIVITIES

RCSB HOSTS CCPN MEETING

The RCSB was pleased to host the second workshop for the Collaborative Computational Project (CCP) for NMR on May 22, 2000 at NIST. The first workshop was held February 7-8, 2000 in Hinxton, UK. The CCP NMR (CCPN) project is funded by the Biotechnology and Biological Sciences Research Council (BBSRC) of the United Kingdom. The CCPN project aims to develop data exchange standards and software packages for the NMR community, analogous to what the CCP4 project has developed for the crystallographic community. Further information about CCPN and workshop summaries are available at http://www.bio.cam.ac.uk/nmr/ccp/. The next CCPN meeting will be held in Florence, Italy in conjunction with the XIX International Conference on Magnetic Resonance in Biological Systems, August 20-25, 2000.

PDB FEATURED AT BIO2000

The PDB saw strong and broad interest in biotechnology at the BIO2000 meeting held in Boston, Massachusetts, March 26-30. Attendance, anticipated at 5,000, soared to over 10,000, with a focus on industry and business. The Protein Data Bank (PDB) was featured at an exhibit booth. Many visitors to the booth saw the PDB website for the first time. A variety of users, including students, professors, pharmaceutical company representatives, and software developers, were impressed with the amount of free information and the simplicity of access to the data.

PDB CD-ROM SET 92 RELEASED

Issue 92 of the PDB CD-ROM set, containing 12,009 structure files on five disks, is now available.
Files containing the experimental data (structure factors and NMR constraints) are also now included in this distribution.

A new file included with this release is holdings.doc. This file lists the numbers of structures determined by various experimental techniques (X-ray, NMR, Theoretical) and by molecule type (Proteins, Peptides, and Viruses; Protein Nucleic Acid Complexes; Nucleic Acids; and Carbohydrates). This file also lists the number of experimental data files (X-ray and NMR) available. The holdings.doc file is found on CD-ROM disk 1 in the directory pub.

The names of two directories have been changed with this release in order to provide a more obvious connection between directory names and directory contents. Entries is the new name for the old Distr directory that contains the main structure entries (coordinate files). Strucfac is the new name for the old Nonst directory that contains the structure factor data files.

The CD-ROM set is provided by the PDB to assist researchers who do not have easy internet access to the primary PDB website or to any of its mirrors. The PDB CD-ROM set is released quarterly at no charge. The CD-ROM set may be ordered on-line, by email, fax, or mail as follows:

Online orders: http://www.nist.gov/srd/nist80.htm
Email orders:  srdata@nist.gov
Fax orders:    (+1) 301 926 0416
Mail orders:   RCSB/NIST, Mail Stop 2310, Gaithersburg, MD 20899-2310

MOLECULES OF THE QUARTER: COLLAGEN, CYTOCHROME c OXIDASE, AND HIV-1 PROTEASE

The PDB has continued its popular "Molecule of the Month" feature. Written and drawn by David S. Goodsell, an assistant professor of molecular biology at The Scripps Research Institute in La Jolla, California, these features provide an overview of significant milestones in the growth of the PDB's macromolecular structure data for a diverse audience.
Here is a sample of the information that is presented in these articles:

Collagen: Your Most Plentiful Protein

April, 2000 -- About one quarter of all of the protein in your body is collagen. Collagen is a major structural protein, forming molecular cables that strengthen the tendons, and vast, resilient sheets that support the skin and internal organs. Bones and teeth are made by adding mineral crystals to collagen. Collagen provides structure to our bodies, protecting and supporting the softer tissues and connecting them with the skeleton. But in spite of its critical function in the body, collagen is a relatively simple protein.

Collagen is composed of three chains, wound together in a tight triple helix. Each chain is over 1,400 amino acids long. A repeated sequence of three amino acids forms this sturdy chain structure. Every third amino acid is glycine, a small amino acid that fits perfectly inside the helix. Many of the remaining positions in the chain are filled by two unexpected amino acids: proline and a modified version of proline, hydroxyproline. We wouldn't expect proline to be this common, because it forms a kink in the polypeptide chain that is difficult to accommodate in typical globular proteins. But, it seems to be just the right shape for this structural protein.

Hydroxyproline, which is critical for collagen stability, is created by modifying normal proline amino acids after the collagen chain is built. The reaction requires vitamin C to assist in the addition of oxygen. Unfortunately, we cannot make vitamin C within our bodies, and if we don't get enough in our diet, the results can be disastrous. Vitamin C deficiency slows the production of hydroxyproline and stops the construction of new collagen, ultimately causing scurvy. The symptoms of scurvy--loss of teeth and easy bruising--are caused by the lack of collagen to repair the wear-and-tear caused by everyday activities.

Collagen from livestock animals is a familiar ingredient for cooking.
Like most proteins, when collagen is heated, it loses all of its structure. The triple helix unwinds and the chains separate. Then, when this denatured mass of tangled chains cools down, it soaks up all of the surrounding water like a sponge, forming gelatin.

Cytochrome c Oxidase: Oxygen and Life

May, 2000 -- Oxygen is an unstable molecule. If given a chance, it will break apart and combine with other molecules. This is the process of oxidation, seen in our familiar world as the rusting of iron in cars and nails. But, surprisingly, the unusual electronic properties of oxygen molecules make this reaction very slow. So, paper doesn't spontaneously burn up--flames must be kindled.

All animals and plants, and many microorganisms, use the instability of oxygen to power the processes of life. The molecules in food are oxidized and the energy is used to build new molecules, to swim or crawl, and to reproduce. Food is not oxidized in a fiery flame, however. It is oxidized in many slow steps, each carefully controlled and designed to capture as much usable energy as possible.

Cytochrome c oxidase controls the last step of food oxidation. At this point, the atoms themselves have all been removed and all that is left are a few of the electrons from the food molecules. Cytochrome c oxidase takes these electrons and attaches them to an oxygen molecule. Then, a few hydrogen ions are added as well, forming two water molecules. The reaction of oxygen and hydrogen to form water is a favorable process, releasing a good deal of energy. In our familiar world, hydrogen and oxygen combine explosively, which is the reason that dirigibles are filled with helium instead of hydrogen. In our cells, however, the energy is carefully harnessed by cytochrome c oxidase to charge a battery, or perhaps more correctly, to charge a capacitor.

Cytochrome c oxidase is a membrane protein. Most of the surface atoms are carbon and sulfur. In the cell, these atoms are buried inside a membrane.
The regions at the top and bottom are covered with charged oxygen and nitrogen atoms. These regions, which prefer a watery environment, stick out on opposite faces of the membrane. This arrangement is perfect for the job performed by cytochrome c oxidase, which uses the reaction of oxygen to water to power a molecular pump. As oxygen is consumed, the energy is stored by pumping hydrogen ions from one side of the membrane to the other. Later, the energy can be used to build ATP or power a motor by letting the hydrogen ions seep back across the membrane.

HIV-1 Protease: A Target for AIDS Therapy

June, 2000 -- Drugs that attack HIV-1 protease are one of the triumphs of modern medicine. The AIDS epidemic started a few short decades ago--before that, HIV was unknown. These drugs demonstrate the powerful tools that medical science has to combat a new disease. Already, researchers have discovered a panel of effective drugs that slow the growth of the virus to a standstill. Important problems still remain, however. In particular, an effective vaccine against HIV is not available. But today, HIV-infected individuals have potent options for treatment.

HIV-1 protease performs an essential step in the life cycle of HIV. Like many viruses, HIV makes many of its proteins in one long piece, with several proteins strung together. HIV-1 protease has the job of cutting this long 'polyprotein' into the proper protein-sized pieces. The timing of this step is critical. The intact polyprotein is necessary early in the life cycle, when it assembles the immature form of the virus. Then, the polyprotein must be cut into the proper pieces to form the mature virus, which can then infect a new cell. The cleavage reactions must be timed perfectly, allowing the immature virus to assemble properly before the polyprotein is broken.

Because of its sensitive and essential function, HIV-1 protease is an excellent target for drug therapy.
Drugs bind tightly to the protease, blocking its action, and the virus perishes because it is unable to mature into its infectious form.

Knowing the atomic structure of HIV-1 protease has made much of this work possible. The first structures were reported in 1989. A decade later, over one hundred structures are available in the PDB, including several genetic strains of the enzyme, complexes of the enzyme with many different drugs and inhibitors, and dozens of mutant enzymes. Hundreds more are stored in the proprietary databases of pharmaceutical companies, where they are used to test and refine new drug candidates. Overall, HIV-1 protease is now one of the best-studied enzymes known to medicine. It is an enigmatic enzyme, however, that still hides many of its secrets.

PDB JOB LISTINGS

The PDB is pleased to announce that career opportunities at the PDB will be posted at http://www.rcsb.org/pdb/jobs. The currently available openings are:

The Biotechnology Division of the National Institute of Standards and Technology (NIST), Department of Commerce is actively seeking a qualified individual for a two-year term appointment as a Research Chemist/Post Doc. Applicants should have a background in biochemistry, biotechnology, and/or structural biology that is appropriate for annotating data derived from X-ray crystallographic and NMR structure determinations of biological macromolecules for the purpose of creating uniform data for distribution through web-based tools. Programming skills in database environments, UNIX, and common languages are desirable. Send resume to: bhat@nist.gov or gary.gilliland@nist.gov

The Biotechnology Division of the National Institute of Standards and Technology (NIST), Department of Commerce is actively seeking a qualified individual for a two-year term appointment as a Computer Specialist/Post Doc. Applicants should have a background in advanced programming (C++, JAVA, etc.) and database development (SYBASE, ORACLE, etc.) for developing uniform structural data for biological macromolecules. Experience in working in a UNIX environment and a background in working with biological information are desirable. Send resume to: bhat@nist.gov or gary.gilliland@nist.gov

The San Diego Supercomputer Center (SDSC) at the University of California, San Diego (UCSD) is actively seeking two qualified individuals for career appointments. Work primarily involves developing applications for the Protein Data Bank (PDB) database query and analysis functions. Examples include: writing scientific application code, working on collaborative scientific and programming activities across several partner sites, and contributing to the research educational mission of the project. Other key activities include: contributing to and implementing new technologies, particularly in the areas of enabling technologies, meta-computing, and visualization. Qualifications sought include: a basic understanding of a scientific research area such as biology or chemistry and the rudiments of computation science for that discipline. Must have broad experience in support and application of relational database programs such as Sybase and Oracle for delivery of scientific data via the Internet. Advanced knowledge of specified, key programming languages and operating systems such as C++, Perl, CGI, Java, and application development environments relevant to supercomputing programming. Application development experience under Sun Solaris and/or SGI IRIX operating systems. Experience with a wide range of computer platforms/operating systems including Sun/Solaris, SGI/IRIX, and PC/NT. EEO/AA Employer.
Send resume to: bourne@sdsc.edu or helgew@sdsc.edu

-----------------------------------------

STATEMENT OF SUPPORT

The PDB is supported by funds from the National Science Foundation, the Office of Biological and Environmental Research at the Department of Energy, and two units of the National Institutes of Health: the National Institute of General Medical Sciences and the National Library of Medicine, in addition to resources and staff made available by the host institutions.

-----------------------------------------

RCSB PDB TEAM

Dr. Helen M. Berman - Director
Department of Chemistry
Rutgers University
610 Taylor Road
Piscataway, NJ 08854-8087
732-445-4667
Fax: 732-445-4320
berman@rcsb.rutgers.edu

Dr. John Westbrook - PDB Project Team Leader
Department of Chemistry
Rutgers University
610 Taylor Road
Piscataway, NJ 08854-8087
732-445-4290
Fax: 732-445-4320
jwest@rcsb.rutgers.edu

Dr. Gary Gilliland - PDB Project Team Leader
National Institute of Standards and Technology
Mail Stop 8310
Gaithersburg, MD 20899-8310
301-975-2629
Fax: 301-330-3447
gary.gilliland@nist.gov

Dr. Peter Arzberger - PDB Project Team Leader
San Diego Supercomputer Center
University of California, San Diego
9500 Gilman Drive
La Jolla, CA 92093-0505
858-534-5079
Fax: 858-822-0948
parzberg@sdsc.edu

Dr. Philip Bourne - PDB Project Team Leader
San Diego Supercomputer Center
University of California, San Diego
9500 Gilman Drive
La Jolla, CA 92093-0505
858-534-8301
Fax: 858-822-0873
bourne@sdsc.edu

RCSB Members

Rutgers University
610 Taylor Road
Piscataway, NJ 08854-8087

Kyle Burkhardt              kburkhar@rcsb.rutgers.edu
Victoria Colflesh           victoria@rcsb.rutgers.edu
Dr. Zukang Feng             zfeng@rcsb.rutgers.edu
Michael Huang               mshuang@rcsb.rutgers.edu
Dr. Lisa Iype               lisa@rcsb.rutgers.edu
Dr. Shri Jain               sjain@rcsb.rutgers.edu
Dr. Rachel Kramer           kramer@rcsb.rutgers.edu
Dr. Bohdan Schneider        bohdan@rcsb.rutgers.edu
Dr. Kata Schneider          kata@rcsb.rutgers.edu
Olivera Tosic               olivera@rcsb.rutgers.edu
Christine Zardecki          zardecki@rcsb.rutgers.edu

SDSC
UC San Diego
9500 Gilman Drive
La Jolla, CA 92093

Bryan Bannister             bryan@sdsc.edu
Tammy Battistuz             tammyb@sdsc.edu
Justin Caballero            justin@sdsc.edu
Dr. Douglas S. Greer        dsg@sdsc.edu
Dr. Michael Gribskov        gribskov@sdsc.edu
Dorothy Kegler              dkegler@sdsc.edu
Dr. John Kowalski           kowalski@sdsc.edu
Dr. Lynn F. Ten Eyck        teneyckl@sdsc.edu
Dr. Helge Weissig           helgew@sdsc.edu
Dr. Kenneth Yoshimoto       kenneth@sdsc.edu

NIST
Biotechnology Division
National Institute of Standards and Technology
Gaithersburg, MD 20899-8310
Department of Chemistry

Dr. Talapady N. Bhat        bhat@nist.gov
Phoebe Fagan                phoebe.fagan@nist.gov
Dr. Diane Hancock           diane.hancock@nist.gov
Dr. Veerasamy Ravichandran  veerasamy.ravichandran@nist.gov
Paul Reneke                 paul.reneke@nist.gov
Joan Sauerwein              joan.sauerwein@nist.gov
Dr. Narmada Thanki          narmada@dakota.carb.nist.gov
Dr. Michael Tung            michael.tung@nist.gov