Summit Condo Announcement

date: Mon, 01 Jan 2018 -0700

RMACC Summit Condo Announcement

FY18 solicitation for condo participation and contributions to the RMACC Summit supercomputer

University of Colorado Boulder and Colorado State University are pleased to announce a second opportunity to contribute compute nodes to the RMACC Summit HPC infrastructure. Contributed nodes become part of the Summit interconnect, have access to the Summit storage and scratch file system, and can be used in aggregate with other contributions and all existing Summit compute resources.

 

Contributed nodes are owned by those purchasing the resources, but are integrated into the overall Summit infrastructure and are operated by the University of Colorado Research Computing team. Such nodes are not reserved for use by the contributor, but entitle the contributor (and their authorized collaborators) to a special Slurm account that grants proportional priority access to Summit compute resources, based on the number of core-hours/day the node is theoretically capable of producing. More information can be found in the frequently asked questions below. Please note that allocations are divided into monthly units.

 

CU Boulder and CSU contributors receive a substantial subsidy covering approximately $1,700/node of necessary infrastructure. This subsidy is in addition to existing discounts pre-negotiated with the vendor.

Available node types

For this opportunity we are accepting orders for Skylake compute nodes and Pascal GPU-accelerator nodes, both of which represent generational upgrades over analogous existing Summit nodes.

“Skylake” compute node

  • 2x Intel Xeon Gold 6126 processors (24 cores total, 2.6 GHz base clock speed)
  • 192 GiB RAM
  • 1x 240 GB SSD
  • Cost: $12,868/node ($11,172/node with infrastructure subsidy)

“Pascal” GPU node

  • 2x Intel Xeon Gold 6126 processors (24 cores total, 2.6 GHz base clock speed)
  • 192 GiB RAM
  • 1x 240 GB SSD
  • 2x NVIDIA Tesla P100 GPUs (16 GiB onboard each)
  • Cost: $22,392/node ($20,696/node with infrastructure subsidy)
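As a rough illustration of the pricing above, the following Python sketch totals an order with and without the infrastructure subsidy. The prices are taken from this announcement; the function names and the sample order are hypothetical and exist only for this example.

```python
# Per-node prices in US dollars, copied from the node descriptions above.
PRICES = {
    "skylake": {"list": 12_868, "subsidized": 11_172},
    "pascal": {"list": 22_392, "subsidized": 20_696},
}


def order_cost(counts, subsidized=True):
    """Total cost of an order such as {"skylake": 4, "pascal": 1}."""
    key = "subsidized" if subsidized else "list"
    return sum(PRICES[kind][key] * n for kind, n in counts.items())


def subsidy_per_node(kind):
    """Difference between the list price and the subsidized price."""
    return PRICES[kind]["list"] - PRICES[kind]["subsidized"]


if __name__ == "__main__":
    order = {"skylake": 4, "pascal": 1}  # hypothetical order
    print("With subsidy:   ", order_cost(order))
    print("Without subsidy:", order_cost(order, subsidized=False))
    print("Subsidy/node:   ", subsidy_per_node("skylake"))
```

For both node types the difference works out to $1,696/node, which matches the "approximately $1,700/node" subsidy quoted earlier.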

Required infrastructure and other costs

Summit compute nodes require additional infrastructure, beyond the node itself. This infrastructure includes:

 

  • Chassis (for Summit compute nodes)
  • Network switches and cables (both Ethernet and Omni-Path)
  • Rackspace, power, cooling, and power cabling
  • GPFS client licenses
  • Additional Summit storage LUNs (to maintain proportional performance)
  • Additional Summit storage metadata LUNs (to maintain proportional metadata performance)
  • Installation and integration

 

The cost of a Skylake compute node chassis is prorated within the cost of a Skylake compute node at a rate of 1/4 chassis per node. The cost of a Summit storage LUN is prorated within the cost of each node at a rate of 1/12 LUN per node. The remaining infrastructure is subsidized for University of Colorado Boulder contributors by Research Computing, and for CSU contributors by central IT and the Office of the Vice President for Research.

Timeline

Orders for condo nodes will be accepted through 19 January 2018, with installation scheduled for February (dependent on supplier order and delivery timelines that are affected by the size of the order).

 

Contributed nodes will be fully supported for the remainder of Summit’s production lifetime, which is expected to extend until August 2021.

Frequently asked questions

Q: Can Summit condo nodes receive property tags to comply with funding agency requirements?

A: Yes, Summit condo nodes will be individually tagged upon request.

 

Q: Why have you switched to 192 GiB memory/node?

A: The Skylake CPU architecture uses 6 memory channels per socket, or 12 per two-socket node. Each node needs to be populated with 12 memory modules to provide access to the full memory bandwidth of the Skylake architecture. Furthermore, the smallest dual-ranked module available is 16 GiB, leading to 12 x 16 GiB = 192 GiB.
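The arithmetic behind this answer can be written out as a short Python sketch. The Skylake figures come from the answer above; the 4-channels-per-socket comparison is an assumption about earlier Xeon generations that this announcement does not state, and is included only to show how the channel count drives the total.

```python
SOCKETS_PER_NODE = 2      # dual-socket nodes, per the node descriptions
CHANNELS_PER_SOCKET = 6   # Skylake, per the answer above
MODULE_GIB = 16           # smallest dual-ranked module, per the answer above


def node_memory_gib(sockets=SOCKETS_PER_NODE,
                    channels_per_socket=CHANNELS_PER_SOCKET,
                    module_gib=MODULE_GIB):
    """Memory per node when every channel holds exactly one module."""
    return sockets * channels_per_socket * module_gib


if __name__ == "__main__":
    print(node_memory_gib())                       # Skylake layout
    # Assumption (not from the announcement): 4 channels per socket.
    print(node_memory_gib(channels_per_socket=4))
```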

 

Q: Why are these nodes more expensive than last year’s offering?

A: A number of technical and market conditions have converged to produce an admittedly higher price per node compared with the previous Summit condo offering:

 

  • Intel Skylake CPUs are selling for a notably higher price compared with similar SKUs in the Haswell CPU product line.
  • Additional memory channels in the Skylake architecture require more memory modules (and consequently more memory/node) compared with a Haswell or Broadwell build.
  • Market fluctuations have led to increased memory prices overall.

 

However, the proposed Skylake compute nodes offer a cost saving: the 6126F processor variant provides integrated Omni-Path connectivity, so Skylake compute nodes do not require a dedicated “host-fabric interface” (HFI) card. (Pascal GPU nodes still require a dedicated HFI.)

 

Q: Are my compute jobs guaranteed to run on my nodes?

A: No, contributed Summit condo nodes are fully integrated into the rest of Summit, and are not reserved for the exclusive use of the contributor. Instead, your jobs will have access to a priority share of the cluster equivalent to the processing capacity of the node(s) you have contributed. One benefit of this arrangement is that jobs are not restricted in size to what would fit on the purchased nodes. For example, a condo user who has purchased one node could instead run two-node jobs for 6 months of a given year.
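The one-node versus two-node example above can be checked with a little arithmetic. The sketch below assumes 24 cores per node (from the node descriptions) and a 365-day year; the function name is hypothetical, and the result describes theoretical capacity only, not a guarantee of scheduled time.

```python
CORES_PER_NODE = 24   # from the node descriptions above
HOURS_PER_DAY = 24


def core_hours(nodes, days, cores_per_node=CORES_PER_NODE):
    """Theoretical core-hours produced by `nodes` nodes running for `days` days."""
    return nodes * cores_per_node * HOURS_PER_DAY * days


if __name__ == "__main__":
    one_node_full_year = core_hours(nodes=1, days=365)
    two_nodes_half_year = core_hours(nodes=2, days=365 / 2)
    print(one_node_full_year, two_nodes_half_year)
    print(one_node_full_year == two_nodes_half_year)
```

Both cases come to the same number of core-hours, which is why a share sized for one node can be spent on wider jobs over a shorter period.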

 

Compared with the previous Haswell- and Kepler-based contribution cycle, we acknowledge that Skylake and Pascal contributors may desire access specifically to the architecture that they have contributed, rather than simple access to Summit compute resources overall. We intend to support this use case, and will document and announce policies that support it as they are developed.

 

Q: Can I put my own operating system and software on my nodes?

A: Contributed nodes are operated as part of Summit, and will be provisioned to match the default software configuration for all Summit nodes. However, a custom OS may be run in a container using Singularity. You can also install your own software packages in RC Core Storage (e.g., in your /projects directory), which is available to all Summit compute nodes.

 

Q: Can I customize the hardware on my nodes?

A: To minimize operational and support complexity, and to maximize bulk discounts, all contributed nodes must match one of the hardware configurations defined above.

 

Q: If I later decide I don’t want my nodes to be part of Summit, can I physically retrieve them and run them myself?

A: Yes, but keep in mind that “general compute” nodes will only function when installed in a Dell C6400 chassis. You would need to obtain your own chassis in order to run them yourself. GPU nodes are discrete nodes that may be run independently. However, any node removed from Summit would lose access to RC Core Storage, the Omni-Path interconnect, Summit storage and scratch, and any other RC shared infrastructure.

 

Q: Do I need to provide an allocation proposal each year in order to receive my allocation share?

A: No, but the RC group will annually request a brief report outlining research and education accomplishments that were enabled by your nodes. This information helps us justify the investments made by our funding and oversight organizations, and is not a condition of access.

 

Q: How does my allocation share work?

A: The Slurm “fairshare” setting will be configured with a target number of service units (roughly analogous to core-hours) for your condo project, equivalent to the processing power of the nodes you have contributed. If your usage over time (roughly, in a given month) is below this target, your jobs will receive higher queue priority in an attempt to reach the target. If you have used more compute resources than you have contributed, your jobs will receive lower priority in order to allow other projects to reach their allocated targets.
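The behavior described above can be pictured with a deliberately simplified Python sketch. It is not Slurm's actual fairshare calculation, which is more involved than a single comparison; it only shows the direction of the priority adjustment. The 30-day month, the one-service-unit-per-core-hour equivalence, the "neutral" case, and all function names are assumptions made for this example.

```python
def monthly_target_su(nodes, cores_per_node=24, days_in_month=30):
    """Approximate monthly target, assuming 1 service unit = 1 core-hour."""
    return nodes * cores_per_node * 24 * days_in_month


def priority_direction(used_su, target_su):
    """Toy model: which way priority moves relative to the target."""
    if used_su < target_su:
        return "higher"   # under target: jobs are boosted
    if used_su > target_su:
        return "lower"    # over target: jobs yield to other projects
    return "neutral"      # assumption: no adjustment exactly at target


if __name__ == "__main__":
    target = monthly_target_su(nodes=2)
    print(target)
    print(priority_direction(used_su=10_000, target_su=target))
    print(priority_direction(used_su=50_000, target_su=target))
```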

 

Q: What does it mean for my jobs to get “priority access”?

A: Condo project accounts receive an additional priority boost compared to jobs run in standard (non-condo) allocations.

 

Q: Am I guaranteed to be able to run as many core-hours per year as my condo share corresponds to?

A: No. For example, if you don’t submit any jobs for 11 months, it may not be possible to fit your full share into the remaining month. However, if you submit jobs regularly throughout the year and demand from other allocations is smaller than expected, you may be able to run more core-hours per year than your share.

 

Q: If I buy a general compute node, can I run some jobs on a GPU node?

A: Yes, your share can be used on any part of Summit.

 

Q: If I have a condo share, can I also request an additional allocation?

A: Yes, your request for additional core-hours would go through the usual proposal process. Any time awarded under this separate proposal would be tracked separately from your condo share.

 

Q: If I buy two condo nodes, can I run jobs in my condo share that span more than two nodes?

A: Yes, but keep in mind that wider jobs will cause you to reach your fairshare target more quickly.