Performance Based Management
Self-Assessment Report
October 2004

Computer Information Resource Management

Introduction/Background

Point of Contact:  Bob Cowles
Telephone No.:  (650) 926-4965
E-mail:  rdc@slac.stanford.edu

Date of last assessment: October 2003

Departmental Overview

Laboratory Mission

The Stanford Linear Accelerator Center is the lead Department of Energy (DOE) laboratory for electron-based high energy physics. It is dedicated to research in elementary particle physics, accelerator physics and in allied fields that can make use of its synchrotron radiation facilities—including biology, chemistry, geology, materials science and environmental engineering. Operated on behalf of the DOE by Stanford University, SLAC is a national user facility serving universities, industry and other research institutions throughout the world. Its mission can be summarized as follows:

Organizational Mission

The Computer Information Resource Management functional area is responsible for coordinating Information Management activities within the Laboratory. This coordination includes encouraging information standards that ensure broad availability of information resources, and ensuring that computer and systems procurements have Laboratory support and are part of Laboratory-wide information planning practices.

The Computer Information Resource Management functional area self-assessment is based on, and measured against, the performance objectives and standards reflected in the SLAC contract, which were defined by SLAC managers and DOE points of contact to address customer satisfaction, cost efficiency, and contract compliance.

Identification of Self-Assessment Report Staff

Names, titles, affiliations of participants

Robert Cowles, Computer Security Officer

Richard Mount, Director of SLAC Computing Services (SCS)

Scope of Self-Assessment

Items of Interest in 2004

The BaBar program has been extremely productive. By virtue of the continued luminosity increases in PEP-II, BaBar has recorded 1.7 petabytes of data. To accommodate this volume of data, SLAC Computing Services (SCS) and BaBar physicists have been expanding computing resources as rapidly as possible. Three areas deserve special mention:

 

* Compute farm expansion:

  Architecture     Operating System   Speed     Processors   Number of Systems
  UltraSparc II    Solaris            440 MHz   1            900
  Pentium III      Linux              866 MHz   2            512
  Pentium III      Linux              1.4 GHz   2            512
  Pentium IV       Linux              2.6 GHz   2            384

 

* The STK silos continue to be the major source of mass storage for the High Energy Physics program. The tape drives were upgraded from 60 GB, 10 MB/s models (9940) to 200 GB, 30 MB/s models (9940B), and their number was increased from 30 to 40 early in FY04 (a rough estimate of the resulting aggregate bandwidth is sketched below).

* The HPSS system provided a year of stable and productive use. There are two systems: one for the primary BaBar data store, used only for production activities, and another, user-accessible system that provides essentially unlimited storage for individual researchers.
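
As a rough illustration of the scale of the tape upgrade, the aggregate streaming rate before and after can be estimated as below. This is a sketch only; it assumes every drive streams concurrently at its rated speed, which real workloads rarely achieve.

    # Rough estimate of aggregate tape throughput before and after the FY04 upgrade.
    # Assumes every drive streams concurrently at its rated speed (an idealization).
    old_drives, old_rate_mb_s = 30, 10     # STK 9940:  60 GB cartridges, 10 MB/s
    new_drives, new_rate_mb_s = 40, 30     # STK 9940B: 200 GB cartridges, 30 MB/s

    old_aggregate = old_drives * old_rate_mb_s   # 300 MB/s
    new_aggregate = new_drives * new_rate_mb_s   # 1200 MB/s
    print(f"Aggregate streaming rate: {old_aggregate} MB/s -> {new_aggregate} MB/s "
          f"({new_aggregate / old_aggregate:.0f}x)")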

Intel Systems Summary

Windows 2000/2003 Infrastructure and Windows XP Client Support

Active Directory has provided increased security through its use of Kerberos, along with additional administrative functionality. Through the new Active Directory infrastructure, software upgrades and security updates are rolled out to Windows XP clients through a combination of the Windows Update service, Group Policy Objects, and local Software Update Services servers. Commonly used software and updates (e.g., MS Office) are automatically installed on Windows XP clients using Group Policy Objects. By implementing this Active-Directory-based Windows infrastructure and migrating to Windows XP clients, SLAC has been able to greatly reduce the cost of keeping up with critical security patches. In the last few months, while many institutions have had large numbers of computers infected, only a handful of SLAC’s managed Windows systems were compromised. The Windows print queue was centralized. Testing has begun on how to implement Windows XP Service Pack 2.

NetIQ software was procured for monitoring Windows servers, including Exchange and web servers, and is currently being used to monitor machine and service uptime. Additional functionality will be added over the coming year.

Quest Reporter software, which allows scanning of Windows machine logs and tracking of software inventories, was purchased as well.

A web-based system, from which users download and install supported software for the Windows environment, is being implemented. The system tracks the usage of licensed software in a database, and confirms the user’s agreement to the licensing terms before the software is downloaded.
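
A minimal sketch of the license-confirmation step is shown below, assuming a simple database table; the function, table, and URL are illustrative placeholders, not the actual implementation.

    # Illustrative sketch (Python): record the user's acceptance of the license
    # terms before handing back a download link. All names are hypothetical.
    import sqlite3, datetime

    def request_download(db_path, username, package):
        conn = sqlite3.connect(db_path)
        conn.execute("""CREATE TABLE IF NOT EXISTS license_acceptance
                        (username TEXT, package TEXT, accepted_at TEXT)""")
        # In the real system the user clicks through the license text in a web
        # page; here we simply record the acceptance before returning the link.
        conn.execute("INSERT INTO license_acceptance VALUES (?, ?, ?)",
                     (username, package, datetime.datetime.utcnow().isoformat()))
        conn.commit()
        conn.close()
        return "https://downloads.example.org/" + package   # placeholder URL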

Additional Firewalled Systems

A firewall was implemented for the Physical Electronics group (PEL).  Additional protection was needed for hardware reliant on software drivers that could only be used with an unsupported operating system.

A firewall was also implemented for the GLAST Clean Room Monitoring System. Additional firewalls are in the process of being built for other GLAST systems (e.g., Online Detector System).

To accommodate the new PeopleSoft infrastructure, a firewall protecting the new network containing this infrastructure (EPN2) is being built for the Business Services Division.

Exchange 2003

The conversion from Exchange 5.5 to Exchange 2003 was completed in August 2004. One Exchange 5.5 server remains up for synchronization between the Exchange accounts database and the SLAC Institutional and SCS Account Resources databases. After a method is implemented to synchronize directly to Windows Active Directory, the last Exchange 5.5 server will be shut down and testing/implementation of RPC over HTTP will begin. Due to performance issues discovered in pre-production testing, the Exchange 2003 databases were placed on four Sun StorEdge 6120 storage arrays. The 6120s are large enough to provide capacity for expected growth through the end of FY06, but keeping the growth in check will be a challenge.

Windows Storage Systems

The amount of Windows storage was doubled in FY04 and currently stands at 8 TB. To manage the growth effectively, quota software has been procured and implemented. Quotas are initially set at 500 MB for new users and 10 GB for new groups. Users can ask for increases up to 2 GB. Groups will be given additional space equally as it is procured. If a large amount of space is needed and is not available, bulk storage can be purchased through SCS. Software was purchased to improve the management and backup of centrally provided, network-accessible storage for Windows systems. The goal is for data to be retrievable from backup within 4 hours.
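
The default quota policy described above can be summarized as in the sketch below; the constants mirror the stated defaults, and the function is illustrative rather than the deployed quota software.

    # Sketch (Python) of the stated quota defaults; not the deployed quota software.
    DEFAULT_USER_QUOTA_GB = 0.5    # 500 MB for new users
    MAX_USER_QUOTA_GB = 2.0        # users may request increases up to 2 GB
    DEFAULT_GROUP_QUOTA_GB = 10.0  # 10 GB for new groups

    def approve_user_quota(requested_gb):
        """Grant a user request up to the 2 GB ceiling; larger needs go to bulk storage."""
        if requested_gb <= MAX_USER_QUOTA_GB:
            return requested_gb
        raise ValueError("Requests above 2 GB should be met with bulk storage via SCS")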

Security Scanning

During FY04, we increased the use of Internet Security Systems Internet Scanner 7. We currently scan approximately six days per week for the latest Windows patches. Twice a week, the scans generate emailed reports to Windows system administrators whose systems are missing patches. This appears to have increased the level of compliance dramatically. One Security group member attended a class on ISS Site Protector and will be implementing this product during FY05 to enable scheduled scans across the network and easier report generation and distribution.
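
A minimal sketch of the twice-weekly notification step is shown below; the finding format, addresses, and mail host are assumptions, since the real data comes from ISS Internet Scanner reports.

    # Sketch (Python): group missing-patch findings by system administrator and
    # send one mail per admin. Field names and addresses are assumptions.
    import smtplib
    from collections import defaultdict
    from email.message import EmailMessage

    def mail_missing_patches(findings, smtp_host="smtp.example.org"):
        by_admin = defaultdict(list)
        for f in findings:                    # each f: {"host", "patch", "admin_email"}
            by_admin[f["admin_email"]].append(f"{f['host']}: missing {f['patch']}")
        with smtplib.SMTP(smtp_host) as smtp:
            for admin, lines in by_admin.items():
                msg = EmailMessage()
                msg["To"], msg["From"] = admin, "security@example.org"
                msg["Subject"] = "Systems missing Windows patches"
                msg.set_content("\n".join(lines))
                smtp.send_message(msg)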

System administrator training

System administrators have received the following training:

Networking

The SLAC Local Area Network (LAN) is based on a core of interconnected, redundant Cisco 6509 router/switches. Building switches, the farm core, and the border router are connected to the core routers by 1 Gbps links. The number of nodes connected to the network is around 10,000. Rolling upgrades of the building switches are in progress to replace obsolete equipment, enable security upgrades, and add capacity and speed for end nodes. Desktops are being upgraded, as requested, from 10 Mbps to 100 Mbps links.

The compute and data server farm machines are connected to farm switches at 100 Mbps, which in turn are connected to the farm core switch at 1 Gbps. The demands of increased throughput and the size of the compute and data farms will require that we upgrade the connections between the farm switches and the farm core switch to 10 Gbps during FY05. We are also determining how to upgrade the core and border routers to 10 Gbps to accommodate the planned upgrade of the offsite ESnet link to 10 Gbps in FY05.
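
For illustration, the pressure on a farm switch uplink can be estimated as below; the 48-port switch size is a hypothetical figure rather than an inventory count.

    # Back-of-envelope uplink oversubscription for a farm switch (Python).
    # The 48-port figure is hypothetical; farm nodes connect at 100 Mbps.
    ports, node_rate_mbps = 48, 100
    uplink_now_mbps, uplink_fy05_mbps = 1_000, 10_000

    demand_mbps = ports * node_rate_mbps                   # 4800 Mbps worst case
    print(f"1 Gbps uplink:  {demand_mbps / uplink_now_mbps:.1f}:1 oversubscribed")
    print(f"10 Gbps uplink: {demand_mbps / uplink_fy05_mbps:.1f}:1 oversubscribed")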

External Networking

External networking is vital to SLAC’s scientific mission. Traffic flows over a 622 Mbit/s link into ESnet, plus a 1 Gbps link to Internet2 via Stanford University. In FY04, all newly acquired raw data from the BaBar experiment was moved in close to real time to Padova, Italy, for initial reconstruction, allowing INFN to meet its agreed commitment to BaBar computing. This traffic formed the largest single flow over ESnet.
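
As a rough feasibility check on moving the data in near real time, the average rate implied by a year's data growth can be compared with the link capacity. The 0.5 petabyte figure is the FY04 HPSS growth quoted later in this report and overstates the raw-data fraction actually shipped, so the estimate is an upper bound.

    # Average rate needed to ship ~0.5 PB over a year (Python); an upper bound,
    # since only the raw data was sent to Padova.
    volume_bytes = 0.5e15
    seconds_per_year = 365 * 24 * 3600
    avg_mbit_s = volume_bytes * 8 / seconds_per_year / 1e6
    print(f"Average rate: {avg_mbit_s:.0f} Mbit/s of the 622 Mbit/s ESnet link")
    # -> roughly 130 Mbit/s, comfortably below the link capacity on average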

Seismic Stability Upgrades

Areas of the computer center’s raised flooring are being systematically replaced, both to improve the seismic stability and reliability of the computer center and to remove under-floor obstructions to cooling airflow. As computers get smaller and denser, the increased load results in higher weights per square foot. In FY03, a complete redesign study was undertaken to reconsider the raised floor design in light of these increased floor loads.

One more area of the computer center’s machine room on the 2nd floor was completed in FY04.  The project has now been redirected to the area currently used for tape storage on the 1st floor. Upgrading this area will allow heavier loads to be moved to the 1st floor, thereby slowing the increase in load currently being placed on the 2nd floor.

Electrical Power Improvements for Central Servers

In FY03, work began on upgrading an electrical substation behind the computer center (Substation 7) from 1 MW to 2.5 MW, with future expansion capability to 4 MW. This upgrade is necessary to provide enough power to computing equipment for BaBar and the SLAC scientific program. The substation upgrade was completed by the end of FY04. In FY05, it will be necessary to install equipment and wiring to distribute this electrical power to both the 1st and 2nd floors of the computer center.

Controlled measurements plus weather-related power outages provided the opportunity to perform load studies to determine the length of time the UPS systems can provide total power for the computer center and to establish needs for additional battery packs as more hardware is installed.  The UPS systems are intended to provide power to maintain SLAC’s computing and networking infrastructure while (planned) backup generators are being brought online.

Given the load studies and the projections for growth of BaBar computing, three additional 450 kVA UPS systems will be needed in the next two fiscal years. These UPS systems will need to be placed outside of the computer center. Because some of our three existing UPS systems must be moved from the 4th floor of the computer center due to live-load limits, the pad area will need to be large enough to accommodate those systems in addition to the anticipated growth. Conceptual and architectural designs were begun in FY04 for a location and building to house the UPS systems and switchgear for power distribution from Substation 7.

To handle the increased cooling required by the growing BaBar and SLAC computing loads, an additional Stulz air handling unit was installed in FY04.

Computing Research and Development

SLAC computing receives funding for a number of computing research and development activities from the DOE and other funding agencies.  The performance of such activities is peer-reviewed and assessed at appropriate intervals by the program offices concerned.  Current research projects include the Particle Physics Data Grid, Internet Performance Monitoring, and the development of a huge-memory architecture for scientific data analysis.

Performance Side-Bar Indicators 

HPSS and Disk-based databases

One indicator of system growth is the HPSS mass storage system, which acts as the primary repository of the data collected by the BaBar experiment. The amount of data stored grew from 1.2 petabytes at the start of FY04 to over 1.7 petabytes at the end of the year, as shown in the figure below.

 

 

The growth of disk space and its allocation to scientific data storage is shown below.

Delivery of Data to Remote Scientists

The figure below shows the external network traffic from SLAC over ESnet, dominated by scientific data flows. SLAC is identified by ESnet, along with Fermilab, as the leading distributor of data to remote scientists.

Compute Farms

As in FY03, the SLAC compute farms continued their rapid growth. The table below shows the growth in capacity available to the scientific program. The capacity units are total gigahertz across the farms; while not all physics programs scaled as expected from Pentium III to Pentium IV, this remains a useful rough guide to total capacity.
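
As a rough illustration of the capacity metric, total gigahertz can be estimated from the farm configuration tabulated earlier, assuming capacity is the product of system count, processors per system, and clock speed.

    # Estimate (Python) of total farm capacity in GHz from the configuration table
    # above, assuming capacity = systems x processors per system x clock speed.
    farm = [  # (systems, processors, GHz)
        (900, 1, 0.44),   # UltraSparc II / Solaris
        (512, 2, 0.866),  # Pentium III / Linux
        (512, 2, 1.4),    # Pentium III / Linux
        (384, 2, 2.6),    # Pentium IV / Linux
    ]
    total_ghz = sum(n * p * ghz for n, p, ghz in farm)
    print(f"Approximate total capacity: {total_ghz:.0f} GHz")   # ~4700 GHz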

 

An additional Linux cluster is in operation in the form of a 64 node dual Pentium IV Myrinet cluster.  The former 32 node cluster has been relegated to a development platform, and this new facility is available for lab-wide use in the development of parallel algorithms.

Status of Goals during FY04

  1. Complete the migration to Windows 2000 infrastructure and Windows XP clients.  Evaluate and upgrade to Windows Server 2003.

The migration to Windows 2000 infrastructure and Windows XP clients has been completed.  The Windows Server 2003 upgrade is in progress.

  2. Develop Performance Measures based on the tools available to the Laboratory.

The new metric of scientific data distribution was introduced.

  3. Complete implementation of a monitoring solution for the Windows infrastructure.

Monitoring of system and service uptime and automated service restart has been implemented. Implementation of additional functionality is ongoing.

  4. Complete implementation of the 2nd tier storage for the Windows environment.

Implementation is in progress.

  5. Continue to provide resources to support planned increases in the BaBar requirements for computing resources.

BaBar computing needs are planned on a forward-looking basis by the BaBar Computing Steering Committee and approved by the BaBar International Finance Committee, which includes DOE representatives. The agreed SLAC commitment to the expansion of BaBar computing was met in full.

  6. Continue replacement of raised flooring on the 2nd floor of the computer center.

A second area of raised flooring on the 2nd floor of the computer center was replaced. There was a major redesign and seismic review due to the ever-increasing density, and resulting weight, achieved in today’s computer racks. As a result of the review, we are installing custom pedestals of oversized dimensions and a solid-bar welded stringer system. In the summer of FY04, it was determined that the next phase of the replacement project should be moved to the 1st floor, allowing us to transfer the heaviest and hottest loads to an area with a lower seismic shear factor.

  7. Begin replacement of the raised flooring on the 1st floor of the computer center.

A new design for the raised flooring on the 1st floor is underway that will increase the height to 18 inches in order to accommodate the increasing heat loads produced by today’s higher-density computers. The current compute clusters in support of the BaBar project are consuming approximately 16 kW per rack, or nearly 500 watts per square foot. At the same time, we intend to adopt a single design for all future raised-floor replacements that meets seismic requirements through the use of Teflon isolation plates beneath the racks.

  8. Distribute power for upgraded Substation 7 to 1st and 2nd floors of the computer center.

With Substation 7 having been upgraded at the end of FY04, we are now moving into the implementation phase of distributing power to the 1st and 2nd floors. A new bus-bar delivery system is being considered for the distribution of power so that we will have greater flexibility and redundancy. This will allow us to switch racks between different sources of power quickly and easily, so that work can be done on one source of power without affecting critical computing services.

  9. Install three additional 27 ton Stulz air cooler units.

One new Stulz air handler was installed and put into operation on the 2nd floor in conjunction with the FY04 phase of the raised floor replacement project. The two additional Stulz units will be installed on the 1st floor at the same time the raised flooring there is replaced.

  10. Install a power monitoring and trending system.

A power monitoring system was installed in FY04 on 11 major distribution panels that are sourced from substation 8. These meters provide instantaneous usage loads as well as historical trending, fault monitoring and analysis.

Improvement Action Plan/Goals

Goals for FY05

  1. Migrate to Windows Server 2003 from Windows Server 2000.
  2. Test and deploy Windows XP Service Pack 2.
  3. Complete migration to the new Windows storage and backup architectures.
  4. Implement the new EPN firewall for Business Services, and the additional firewalls for GLAST.
  5. Continue to develop Performance Measures based on the tools available to the Laboratory.
  6. Complete implementation of a monitoring solution for the Windows infrastructure.
  7. Continue to provide resources to support planned increases in the BaBar requirements for computing resources.
  8. Begin replacement of the raised flooring on the 1st floor of the computer center after relocating and disposing of the old Tape Archives that have been copied to new media and/or are no longer required.
  9. Begin the installation of power distribution for upgraded Substation 7 to 1st and 2nd floors of computer center.
  10. Install two additional 27 ton Stulz air cooler units on the 1st floor.
  11. Add power monitoring units as the power from substation 7 is distributed to the 1st and 2nd floors.
  12. Install Internet Security Systems Site Protector and use for scheduled site-wide scanning and report generation.
  13. Install Microsoft Office Live Communications Server to be used as our supported instant messaging service.
  14. Migrate all centrally maintained IIS web servers from version 5 to version 6.
  15. Continue upgrade of the core network for the computing batch farm to 10 Gbps links.
  16. Develop plan for upgrading core and border network to 10 Gbps.
  17. Continue to upgrade obsolete building switches.

 
