Cluster Computing Policy

 

Computer Use

Computers, software, and communications systems provided by the ARM Data Center (ADC) are to be used for work associated with and within the scope of the ARM Program. The use of ADC resources for personal or non-work-related activities is prohibited. All computers, networks, e-mail, and storage systems are property of the United States Government. Any misuse or unauthorized access is prohibited, and is subject to criminal and civil penalties.

The ADC clusters are provided to users without any warranty. The ADC will not be held liable in the event of any system failure, data loss, or corruption for any reason, including but not limited to: negligence, malicious action, accidental loss, software errors, hardware failures, network losses, or inadequate configuration of any computing resource or ancillary system.

Data Use

 

Prohibited Data

The ADC Cluster is a computational resource for ARM-specific operational and scientific research use. It contains only data related to scientific research and does not contain personally identifiable information (data that falls under the Privacy Act of 1974, 5 U.S.C. 552a). Use of ADC resources to store, manipulate, or remotely access any national security information is strictly prohibited. This includes, but is not limited to: classified information, unclassified controlled nuclear information (UCNI), naval nuclear propulsion information (NNPI), and information on the design or development of nuclear, biological, or chemical weapons or any other weapons of mass destruction.

Principal investigators, users, or project delegates who use ARM cluster resources, or who are responsible for overseeing projects that use ARM cluster resources, are strictly responsible for knowing whether their project generates any of these prohibited data types or information that falls under Export Control. For questions, contact cluster support.

Confidentiality, Integrity, and Availability

The ARM systems provide protections to maintain the confidentiality, integrity, and availability of user data. Measures include the availability of file permissions, archival systems with access control lists, and parity and CRC checks on data paths and files. It is the user’s responsibility to set access controls appropriately for the data. In the event of system failure or malicious actions, ARM makes no guarantee against loss of data or that a user’s data cannot be accessed, changed, or deleted by another individual. It is the user’s responsibility to ensure the appropriate level of backup and integrity checks on critical data and programs.
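
Because no such guarantee is made, users with critical data may want to keep their own integrity records. Below is a minimal Python sketch (the file paths and manifest name are hypothetical) of recording SHA-256 checksums for critical files and re-verifying them later:

    import hashlib
    from pathlib import Path

    def sha256sum(path, chunk_size=1 << 20):
        """Return the SHA-256 digest of a file, read in chunks."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def record_checksums(files, manifest="checksums.txt"):
        """Write 'digest  filename' lines so files can be re-verified later."""
        with open(manifest, "w") as out:
            for path in files:
                out.write(f"{sha256sum(path)}  {path}\n")

    def verify_checksums(manifest="checksums.txt"):
        """Re-hash every file listed in the manifest and report mismatches."""
        ok = True
        for line in Path(manifest).read_text().splitlines():
            expected, _, name = line.partition("  ")
            if sha256sum(name) != expected:
                print(f"Integrity check FAILED for {name}")
                ok = False
        return ok

    if __name__ == "__main__":
        # Hypothetical critical files; substitute your own paths.
        record_checksums(["results/run_001.nc", "code/analysis.py"])
        print("All files intact:", verify_checksums())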

Data Modification/Destruction

Users are prohibited from taking unauthorized actions to intentionally modify or delete information or programs that do not pertain to their jobs/code.

Data Retention

ARM reserves the right to remove any data at any time and/or transfer data to other users working on the same or similar project once a user account is deleted or a person no longer has a business association with ARM. After a project has ended or has been terminated, all data related to the project will be purged from all ARM cluster related resources within 30 days.

Software Use

All software used on ARM computers must be appropriately acquired and used according to the appropriate software license agreement. Possession, use, or transmission of illegally obtained software is prohibited. Likewise, users shall not copy, store, or transfer copyrighted software, except as permitted by the owner of the copyright. Only export-controlled codes approved by the Export Control Office may be run by parties with sensitive data agreements.

Malicious Software

Users must not intentionally introduce or use malicious software such as computer viruses, Trojan horses, or worms.

Reconstruction of Information or Software

Users are not allowed to reconstruct information or software for which they are not authorized. This includes but is not limited to any reverse engineering of copyrighted software or firmware present on ARM computing resources.

User Accountability

 

Users are accountable for their actions and may be subject to applicable administrative or legal sanctions.

No Expectation of Privacy

Users are advised that there is no expectation of privacy for their activities on any system that is owned, leased, or operated by UT-Battelle on behalf of the U.S. Department of Energy (DOE). The Company retains the right to monitor all activities on these systems, to access any computer files or electronic mail messages, and to disclose all or part of the information gained to authorized individuals or investigative agencies, all without prior notice to, or consent from, any user, sender, or addressee. This access to information or a system by an authorized individual or investigative agency is in effect during the period of a user's access to information on a DOE computer and for a period of three years thereafter.

ARM personnel and users are required to address, safeguard against, and report misuse, abuse, and criminal activities. Misuse of ARM resources can lead to temporary or permanent disabling of accounts, loss of DOE allocations, and administrative or legal actions. Accounts of users who have not accessed an ARM computing resource in the past 6 months will be disabled; those users will need to reapply to regain access. All users must reapply annually.

Authentication and Authorization

On either ARM cluster, each user is permitted only a single account, usually corresponding to the user's XCAMS/UCAMS ID and password. Security is a major concern on the HPC clusters. Access to a cluster requires use of ssh to a cluster login node using single-factor (password) authentication.
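
As an illustration only, the following Python sketch connects to a login node over ssh with single-factor password authentication using the third-party paramiko library; the host name is taken from the stratus example later in this document, and the user ID is a hypothetical placeholder:

    import getpass
    import paramiko  # third-party ssh library; illustration only

    client = paramiko.SSHClient()
    client.load_system_host_keys()  # trust hosts already in ~/.ssh/known_hosts
    client.connect(
        "stratus.ornl.gov",                      # login node from the example below
        username="my_ucams_id",                  # hypothetical XCAMS/UCAMS ID
        password=getpass.getpass("Password: "),  # single-factor password prompt
    )
    _, stdout, _ = client.exec_command("hostname")
    print(stdout.read().decode().strip())
    client.close()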

Accounts on the ARM machines are for the exclusive use of the individual user named in the account application. Users should not share accounts with anyone. If evidence is found that more than one person is using an account, that account will be disabled immediately. Users are not to attempt to receive unintended messages or access information by unauthorized means, such as imitating another system, impersonating another user or other person, misusing legitimate user credentials (usernames, passwords, etc.), or causing some system component to function incorrectly.

Users are prohibited from changing or circumventing access controls to allow themselves or others to perform actions outside their authorized privileges. Users must notify cluster support immediately when they become aware that any of the accounts used to access ARM have been compromised.

Users should inform ARM promptly of any changes in their contact information (e-mail, phone, affiliation, etc.). Updates should be sent to cluster support.

Maintenance Periods

The ORNL HPC strives to operate its clusters on a 24/7 basis, except for regularly scheduled maintenance periods (approximately 14 days per year). The ORNL HPC publishes a regular maintenance schedule and notifies users well in advance of scheduled maintenance periods. Users are responsible for managing their workloads accordingly (see the section on jobs and scheduled outages for things to consider). In the rare event of an emergency maintenance period, users may receive little or no notice.

Login Nodes

The login nodes on the clusters are shared by all users and are intended only for lightweight activities such as editing, reviewing program input and output files, file transfers, and job submission and monitoring. Users are not permitted to run programs (including interactive programs like Matlab®, R, or IDL®) on the login nodes, and the system administrators reserve the right to terminate without prior notice any user session found to be consuming excessive resources on a login node. For security reasons, the system software on the login nodes may be updated frequently and become inconsistent with the system software installed on the compute nodes, so it is strongly recommended that all compilations and other program building activities be performed on a compute node. (Special queues are available to provide rapid access to compute nodes for development activities.)
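
For example, a build can be routed to a compute node by submitting it through the batch system rather than running it on a login node. The Python sketch below is hypothetical: it assumes a Slurm-style scheduler (this policy does not name the batch system) and a made-up project directory, and the appropriate development queue would still need to be specified at submission time:

    import subprocess

    # Hypothetical build job: assumes a Slurm-style scheduler ("sbatch"), which
    # this policy does not actually name, and a made-up project directory.
    job_script = """#!/bin/bash
    #SBATCH --job-name=build
    #SBATCH --nodes=1
    #SBATCH --time=00:30:00
    cd $HOME/myproject
    make -j 8
    """

    with open("build_job.sh", "w") as f:
        f.write(job_script)

    # Submit the build so compilation happens on a compute node, not a login node.
    result = subprocess.run(["sbatch", "build_job.sh"],
                            capture_output=True, text=True, check=True)
    print(result.stdout.strip())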

Foreign National Access

Applicants who appear on the restricted foreign country listing in 15 CFR 740.7 (License Exceptions for Computers) are denied access based on U.S. foreign policy. The countries cited are Cuba, Iran, North Korea, Sudan, and Syria. Additionally, no work may be performed on ARM computers on behalf of foreign nationals from these countries.

Denial of Service

Users may not deliberately interfere with other users accessing system resources.

Policies

 

The clusters are operated by the ORNL HPC, following policies approved by ARM management and ORNL Cyber Security, for research and operational use. Computational and storage usage by approved users is monitored and limited appropriately to ensure that all individuals have equitable access. All policies regarding HPC usage are reviewed regularly and adjusted as necessary by ARM management, ADC management, NCCS staff, and ORNL Cyber Security. Users should consult the website or ENG for more complete, detailed, and up-to-date information about the clusters, policies for their use, and how to access and use them.

Cluster Interactive Node Policy

  1. The interactive nodes are the front-end interface systems for access to the HPC clusters. For example, if you ssh into "user@stratus.ornl.gov", you will actually be connected to the interactive login node. This interactive node is restricted to the groups/users that have an allocation to run on this cluster.
  2. Interactive nodes are your interface with the computational nodes and are where you interact with the batch/queuing system. Please see the Quick Start for details. Processes run directly on these nodes should be limited to tasks such as editing, data transfer and management, data analysis, compiling codes, and debugging, as long as they are not resource intensive (memory, CPU, network, and/or I/O). Any resource-intensive work must be run on the compute nodes through the batch system.
  3. The cluster utilizes a fair-use algorithm implemented by a job scheduler, which manages the job queue and dispatches jobs to compute nodes according to availability and user-defined criteria, such as the CPU and memory requirements specified during job submission. The job queues have time and resource limits to ensure fair use. In addition, job priorities are adjusted based on each user's recent cluster usage (over the last two days), giving lighter users higher priority (see the sketch after this list). There is also an operational queue strictly dedicated to ARM operations, with dedicated times for operational runs free from interference or interruption by other queues.
  4. Owners of a killed job are notified via an email message when the job is killed by an administrator.
  5. The Lustre scratch spaces that are visible on the compute nodes of the clusters are also mounted on the interactive nodes. (See the Quick Start for more details.)
  6. Use the interactive nodes to migrate your data to one of the scratch spaces. Run your I/O-intensive batch work from the scratch space, NOT from an NFS mount. Note: an I/O-intensive process could be one that generates either excessive MB/second or excessive I/O operations/second. For further details, please see the storage policy.
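
The following toy Python sketch illustrates the fair-share idea from item 3 above. The actual scheduler formula and weights are not specified in this policy, so the weight and numbers below are purely hypothetical:

    # Toy illustration of fair-share priority: a job's priority drops in
    # proportion to its owner's usage over the last two days, so light users
    # move toward the front of the queue. The weight and numbers are made up.
    def effective_priority(base_priority, recent_core_hours, usage_weight=0.01):
        return base_priority - usage_weight * recent_core_hours

    recent_usage = {"heavy_user": 5000.0, "light_user": 50.0}  # core-hours, last 48 h
    jobs = [("heavy_user", 100.0), ("light_user", 100.0)]      # (owner, base priority)

    for owner, base in sorted(
        jobs, key=lambda j: effective_priority(j[1], recent_usage[j[0]]), reverse=True
    ):
        print(owner, effective_priority(base, recent_usage[owner]))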

General Queuing Guidelines

  • Jobs belonging to users who have exhausted their allocation will be eligible to run only when free cycles are available on the system. Preemption rules are applied to some select queues on the Stratus cluster; see the cluster's queuing guidelines.
  • On a case-by-case basis, special access can be granted for long jobs that need to exceed the MAX walltime limit. Please send any request for such access to cluster support.

Reservations

Users may request to reserve nodes for special circumstances. The maximum duration for reservations is ( days/weeks). The user should send a request to cluster support with the following information:

  • Number of Nodes/Cores
  • Starting date and time (please ask ahead by at least the MAX walltime for that cluster)
  • Duration
  • User on the reservation
  • Any special requirements (longer MAX walltime for example)

Disclaimers

Every effort will be made to provide uninterrupted completion of all jobs that start to run. However, in rare instances, unforeseen circumstances may cause job failures or require suspending or even killing jobs without warning. Users are strongly urged to make use of checkpointing or other standard programming practices to allow for this possibility. Wall-clock completion times are considered when downtimes are scheduled: if a job is not expected to complete before a scheduled outage begins, it will not be started.
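
A minimal Python sketch of application-level checkpointing is shown below; the checkpoint file name and the "work" being done are hypothetical, but the pattern (periodically writing restartable state and resuming from it when present) applies generally:

    import json
    import os

    CHECKPOINT = "state.json"  # hypothetical checkpoint file

    def load_state():
        """Resume from the last checkpoint if one exists, otherwise start fresh."""
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT) as f:
                return json.load(f)
        return {"step": 0, "total": 0.0}

    def save_state(state):
        """Write the checkpoint atomically so a killed job cannot corrupt it."""
        tmp = CHECKPOINT + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, CHECKPOINT)

    state = load_state()
    for step in range(state["step"], 1_000_000):
        state["total"] += step          # stand-in for the real computation
        state["step"] = step + 1
        if step % 100_000 == 0:         # checkpoint periodically
            save_state(state)
    save_state(state)
    print("done:", state["total"])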

Every effort will be made to ensure that data stored on the clusters are not lost or damaged; however, home directories will not be backed up. It is each user's responsibility to ensure that important files created in home, project, or temporary scratch storage are preserved by copying them to alternative storage facilities intended for that purpose.