Friday, December 28, 2007

How to do job scheduling or batch control?

Any IT system administration has a need for some automation, batch control, job scheduling or whatever you want to call it. This can be set up with cron jobs, at-jobs or scheduled tasks, most likely on the server where the job/application must run.
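For reference, the basic per-server tools look like this (the paths, times and server name are made-up examples):

```shell
# crontab entry: run a cleanup script every night at 02:30
30 2 * * * /usr/local/scripts/cleanup.sh >> /var/log/cleanup.log 2>&1

# at-job: run a script once, tomorrow at 18:00
echo "/usr/local/scripts/stop_service.sh" | at 18:00 tomorrow

# Windows scheduled task; /s lets you create the task on a remote server
schtasks /create /s SERVER01 /tn "NightlyCleanup" /tr "C:\scripts\cleanup.cmd" /sc daily /st 02:30
```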

MSSQL 2005 maintenance plans have the option of running from one server but executing on another. A similar option should be present for schtasks on Windows Server, but as with MSSQL I have not tried it; I have always executed everything on the local machine where the schedule is set up.

Some issues with this standard job scheduling setup will come up as you go along. Say you want to know any of the following:
  • How the execution went?
  • What is executing right now?
  • What was the standard output of a previous run?
  • How long did the previous jobs take?
  • Did a job finish before a certain time or within a certain length of runtime?
  • A job should only run if certain other jobs have finished properly.
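A poor man's answer to several of these questions is to run every job through a small wrapper that records start time, end time, return code and standard output in one central place. A minimal sketch, assuming a shared log directory (the location and job names are made up):

```shell
#!/bin/sh
# runjob: wrap a job so its timing, return code and output are logged centrally.
# Usage: runjob <jobname> <command> [args...]
LOGDIR=${LOGDIR:-/var/log/batch}        # hypothetical central log location

runjob() {
    job=$1; shift
    mkdir -p "$LOGDIR"
    start=$(date +%s)
    "$@" > "$LOGDIR/$job.out" 2>&1      # the job's standard output, kept centrally
    rc=$?
    end=$(date +%s)
    # one line of history per run: name, return code, timestamps, duration
    echo "$job rc=$rc start=$start end=$end elapsed=$((end-start))s" \
        >> "$LOGDIR/history.log"
    return $rc
}
```

With every job going through `runjob`, questions like "how did the last run go" or "how long did it take" become a grep in `history.log`, and the previous run's output sits in `<jobname>.out`.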

All this and more are valid concerns in any normal IT administration. Some of the things I quite often want to do are:

  • Add new one-time jobs that need to run just once, e.g. execute a script that creates a user, deletes a user, or stops/starts a service, etc.
  • Add new permanent jobs, keeping a history of changes to start times etc.
  • Handle schedules of database servers, such as MSSQL, MySQL and Oracle.

More of what I would like in a batch control/job schedule system:

Must have:

  • Set up applications with several job steps consisting of command lines
  • Jobs must be startable at certain times
  • Keep track of the history of executions, including times, return codes etc.
  • Jobs and applications must run based on dependencies on other applications
  • Timeouts and alternative actions, e.g. alarms (email etc.)
  • GUI for monitoring batch progress
  • Job output (standard output) must be centrally available
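The job-step and dependency requirements above can be sketched in plain shell: run the steps in order, let a step run only if all previous steps returned zero, and fire an alarm on failure. The `alarm` function here is a made-up placeholder for a real alert such as an email:

```shell
#!/bin/sh
# run_steps: execute job steps in order; a step only runs if every
# previous step succeeded (returned 0). On failure, alarm and stop.
alarm() {
    # placeholder: in real use this would send mail, page someone, etc.
    echo "ALARM: step failed: $*" >&2
}

run_steps() {
    for step in "$@"; do
        $step
        rc=$?
        if [ $rc -ne 0 ]; then
            alarm "$step (rc=$rc)"
            return $rc        # propagate the failing step's return code
        fi
    done
    return 0
}
```

Usage would be `run_steps backup_db copy_offsite cleanup`: each argument is a command, and later steps never run once an earlier one fails. A real system of course adds timeouts, retries and cross-server dependencies on top of this.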

Nice to have:

  • Load evaluation/weight of nodes, deciding where to send jobs.
  • Failover execution if worknodes fail certain checks (node health check support)
  • Eliminate the need for a central control server. Nicest would be for all nodes to be aware of every other node, allowing failover and picking nodes up again when they come back alive.

When looking into which systems can do this, I get caught up in a mix of grid computing, load balancing and job/batch scheduling. A survey of what is out there turns up the following:


  • Job scheduler
  • Batch queue
  • BatchMan
  • BatchPipes
  • Batchman
  • CONTROL-M
  • Command queue
  • Condor High-Throughput Computing System
  • Cronacle
  • Grid MP
  • IBM Houston Automated Spooling Program
  • IBM 2780/3780
  • IBM Tivoli Workload Scheduler
  • IBM Tivoli Workload Scheduler LoadLeveler
  • Job Control Language
  • Job Entry Subsystem 2/3
  • Load Sharing Facility
  • Maui Cluster Scheduler
  • Moab Cluster Suite
  • Open Source Job Scheduler
  • PTC Scheduler
  • Portable Batch System
  • RTDA Network Computer
  • Remote Job Entry
  • Retriever Communications
  • S-graph
  • SAP Central Process Scheduling
  • SHARCNET
  • Sun Grid Engine
  • Unicenter Autosys Job Management
  • Xgrid

Currently we are using TWS, and a homemade system which can do all we need and is extendable. TWS is something we are forced to use; the other system would be just fine on its own.

Since I am limited to open source or free systems, I came up with these few systems I would like to try out:

  • TORQUE is open source with commercial support available, which is nice for the enterprise. TORQUE is available for FreeBSD via ports, and is very actively maintained there.
  • Sun Grid Engine is a batch queueing system implementing a superset of the
    functionality of the POSIX batch queueing framework. Also in FreeBSD ports.
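To give an idea of what a job looks like in such a system, here is a minimal TORQUE/PBS job script; the name, walltime and script path are hypothetical. It would be submitted with `qsub nightly.pbs`:

```shell
#!/bin/sh
#PBS -N nightly-backup          # job name as seen in the queue
#PBS -l walltime=00:30:00       # kill the job if it runs longer than 30 minutes
#PBS -m ae                      # mail on abort and on end
#PBS -j oe                      # merge stderr into stdout

/usr/local/scripts/backup.sh    # hypothetical job step
```

The scheduler takes care of where the job runs, enforces the walltime limit, and keeps the job's output file, which already covers several of the must-haves above.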

On a side note, I stumbled upon a cluster administration article for Unix using SSH, where cssh is suggested, with much more in the comments, on http://www.debian-administration.org/.
