Wednesday, July 20, 2011

Job Schedulers – Overview

Simply put, a scheduler is a software utility responsible for assigning jobs and tasks to resources according to predetermined policies and resource availability. A job comprises one or more tasks along with information on the resources it requires (number of nodes, GPUs, network bandwidth, application licenses, etc.). Jobs are submitted to a queue for batch processing and for optimizing resource utilization (for example, by placing a job on nodes that are close together on the network rather than far apart). There may be one or more queues, each with its own policies for priorities, permissions, etc. Job schedulers are available in both commercial (closed source) and open source forms.
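To make the queue-and-resources idea concrete, here is a minimal sketch in Python of a toy scheduler with a single priority queue and a pool of identical nodes. The names (Job, ToyScheduler, submit, dispatch, finish) are made up for illustration and do not correspond to the API of any scheduler listed below. The policy is also an assumption: higher priority goes first, and a job that does not fit in the free nodes is skipped and stays queued. Real schedulers apply far richer policies.

```python
import heapq
from dataclasses import dataclass
from itertools import count


@dataclass
class Job:
    name: str
    nodes: int          # number of nodes the job requests
    priority: int = 0   # larger value is scheduled earlier


class ToyScheduler:
    """Hypothetical single-queue scheduler over a pool of identical nodes."""

    def __init__(self, total_nodes):
        self.free_nodes = total_nodes
        self._queue = []        # heap of (-priority, submit order, job)
        self._order = count()   # tie-breaker: earlier submissions go first

    def submit(self, job):
        heapq.heappush(self._queue, (-job.priority, next(self._order), job))

    def dispatch(self):
        """Start queued jobs in priority order while they fit; return them."""
        started, deferred = [], []
        while self._queue:
            entry = heapq.heappop(self._queue)
            job = entry[2]
            if job.nodes <= self.free_nodes:
                self.free_nodes -= job.nodes
                started.append(job)
            else:
                deferred.append(entry)   # not enough free nodes right now
        for entry in deferred:
            heapq.heappush(self._queue, entry)
        return started

    def finish(self, job):
        """Return a completed job's nodes to the pool."""
        self.free_nodes += job.nodes


sched = ToyScheduler(total_nodes=4)
sched.submit(Job("render", nodes=3, priority=1))
sched.submit(Job("simulate", nodes=2, priority=5))
sched.submit(Job("report", nodes=1, priority=1))

first = sched.dispatch()       # "simulate" and "report" fit; "render" waits
sched.finish(first[0])         # "simulate" completes and frees its 2 nodes
second = sched.dispatch()      # now "render" fits
print([j.name for j in first], [j.name for j in second])
```

Letting a small job start ahead of a larger one that is still waiting keeps nodes busy, but it can starve large jobs. Production schedulers deal with this through reservations and more careful policies.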

Commercial Schedulers:
Moab - a cluster workload management package from Adaptive Computing that integrates the scheduling, managing, monitoring, and reporting of cluster workloads. Moab Workload Manager is part of the Moab Cluster Suite. Moab’s development was based on the Open Source Maui job scheduling package.

Platform LSF - a workload management product that manages and accelerates batch workload processing for mission-critical compute- or data-intensive application workloads.

Open Source Schedulers:
SGE (Sun Grid Engine) - a distributed resource management software system. It is almost identical to the commercial version, Sun N1 Grid Engine, offered by Sun Microsystems (now Oracle).

TORQUE - an open source distributed resource management system providing control over batch jobs and distributed computing resources. It is an advanced open-source product based on the original PBS project and incorporates community and professional development. TORQUE may be freely used, modified, and distributed and is designed to work with the Maui Scheduler.

Maui Scheduler - an open source job scheduler for clusters, capable of supporting an array of scheduling policies, fair share capabilities, dynamically determined priorities, and exclusive reservations. It also includes system diagnostics, extensive resource utilization tracking, statistics, and a reporting engine, as well as a built-in simulator for analyzing workload, resource, and policy changes.

Platform LAVA - an open source scheduler solution based on the workload management product LSF and designed to meet a range of workload scheduling needs for clusters with up to 512 nodes.

Condor - an open source workload manager, developed at the University of Wisconsin – Madison. Condor performs the traditional batch job queuing and scheduling roles. Red Hat has based its MRG Grid product (part of Red Hat Enterprise MRG) on Condor.
