Managing Jobs
Once a job is submitted, it has its own lifecycle on the cluster that is independent of the submitter. To manage that lifecycle, you can use either SQL or CLI commands to list, cancel, suspend, resume, and restart jobs.
Listing Jobs
Use the list-jobs command to get a list of all jobs running in the cluster.
You can also list completed jobs by adding the -a parameter.
For more details about the SQL statement for listing jobs, see the SQL reference documentation.
Canceling Jobs
Streaming jobs run indefinitely until canceled. To stop a job, you must cancel it.
When a job is canceled, the snapshot for the job is lost and the job can’t be resumed. Canceled jobs have a failed status.
To save a snapshot of the job when you cancel it in SQL, use the WITH SNAPSHOT clause.
For more details about this statement, see the SQL reference documentation.
Suspending and Resuming Jobs
Suspending and resuming jobs can be useful, for example, when you need to perform maintenance on a data source or sink without disrupting a running job.
In SQL, you can also update the configuration of a suspended job and resume it.
When a job is suspended, all the metadata about the job is kept in the cluster. A snapshot of the job’s computational state is taken during the suspend operation. When the job is resumed, it restarts gracefully from that snapshot.
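The difference between suspending and canceling can be illustrated with a small state model. This is a toy sketch in plain Java, not the Hazelcast API: the class JobLifecycleSketch, its Status values, and the counter standing in for the job's computational state are all hypothetical names chosen for illustration.

```java
/** Toy model of the lifecycle rules described in this section; not the Hazelcast API. */
public class JobLifecycleSketch {

    enum Status { RUNNING, SUSPENDED, FAILED }

    static final class Job {
        private Status status = Status.RUNNING;
        private long processed = 0;   // stands in for the job's computational state
        private Long snapshot = null; // null means no snapshot is available

        void process(long items) {
            require(Status.RUNNING);
            processed += items;
        }

        /** Suspending takes a snapshot and keeps it. */
        void suspend() {
            require(Status.RUNNING);
            snapshot = processed;
            status = Status.SUSPENDED;
        }

        /** Resuming restores the state from the snapshot taken on suspend. */
        void resume() {
            require(Status.SUSPENDED);
            processed = snapshot;
            status = Status.RUNNING;
        }

        /** Canceling discards the snapshot and leaves the job in a failed status. */
        void cancel() {
            if (status == Status.FAILED) {
                throw new IllegalStateException("Job is already canceled");
            }
            snapshot = null;
            status = Status.FAILED;
        }

        Status status() { return status; }

        long processed() { return processed; }

        private void require(Status expected) {
            if (status != expected) {
                throw new IllegalStateException("Expected " + expected + " but job is " + status);
            }
        }
    }

    public static void main(String[] args) {
        Job job = new Job();
        job.process(100);
        job.suspend();
        job.resume();
        System.out.println(job.status() + " with " + job.processed() + " items processed");

        job.cancel();
        try {
            job.resume(); // a canceled job cannot be resumed
        } catch (IllegalStateException e) {
            System.out.println("Resume rejected: " + e.getMessage());
        }
    }
}
```

In this model, a suspended job can always return to the running state with its state intact, whereas a canceled job has no path back.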
To suspend and resume a job, it must be configured with a processing guarantee. To learn more about setting a processing guarantee, see Configuring Jobs.
Use the suspend <job_name_or_id> and resume <job_name_or_id> commands to suspend and resume jobs.
For fault tolerance, streaming jobs are automatically suspended on failure. For more details, see Processing Guarantees.
With the ALTER JOB statement, you can suspend and resume a job that is running on a cluster.
Currently, Jet processors implement basic memory management by limiting the number of objects individual processors store. When this number is exceeded, the job fails. To recover the failed job, try updating the job configuration to increase the processor limit, and resume the job.
You might also consider increasing the number of records that each processor can accumulate, if SQL operations such as grouping, sorting, or joining end in errors.
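The limit-based memory management described above can be pictured as a bounded accumulator. The following is a minimal sketch in plain Java under stated assumptions: BoundedAccumulatorSketch, Accumulator, and LimitExceededException are hypothetical names, and the sketch does not reproduce how Jet processors are actually implemented or configured.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a processor that fails once it stores more than a fixed number of records. */
public class BoundedAccumulatorSketch {

    /** Hypothetical exception raised when the limit is exceeded. */
    static final class LimitExceededException extends RuntimeException {
        LimitExceededException(long limit) {
            super("Accumulated more than " + limit + " records");
        }
    }

    static final class Accumulator<T> {
        private final long maxRecords;
        private final List<T> records = new ArrayList<>();

        Accumulator(long maxRecords) {
            this.maxRecords = maxRecords;
        }

        /** Stores a record, or fails if the configured limit is already reached. */
        void add(T record) {
            if (records.size() >= maxRecords) {
                throw new LimitExceededException(maxRecords);
            }
            records.add(record);
        }

        int size() { return records.size(); }
    }

    /** Returns true if all inputs fit under the given limit. */
    static boolean runWithLimit(long limit, int inputCount) {
        Accumulator<Integer> acc = new Accumulator<>(limit);
        try {
            for (int i = 0; i < inputCount; i++) {
                acc.add(i);
            }
            return true;
        } catch (LimitExceededException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The same workload fails with a low limit and succeeds with a higher one.
        System.out.println("limit 3, 5 records: " + runWithLimit(3, 5));
        System.out.println("limit 10, 5 records: " + runWithLimit(10, 5));
    }
}
```

The sketch mirrors the recovery path in the text: the workload is unchanged, and only the limit is raised before the job is run again.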
For more details about this statement, see the SQL reference documentation.
Restarting Jobs
Restarting a job allows you to suspend and resume it in one step. This can be useful when you want to control when the job is scaled. For example, if a job’s auto-scaling option is disabled and you add three nodes to a cluster, you can manually restart the job at the desired point to make sure that all the new nodes can run it.
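The scaling scenario above can be sketched as a toy model in which a job only picks up new cluster members when it is restarted. This is plain Java with hypothetical names (RestartScalingSketch, Cluster, Job); it assumes auto-scaling is disabled and does not use or describe the Hazelcast API.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model: with auto-scaling disabled, a job keeps its member list until restarted. */
public class RestartScalingSketch {

    static final class Cluster {
        private final List<String> members = new ArrayList<>();

        void addMember(String name) { members.add(name); }

        /** Returns an immutable copy of the current member list. */
        List<String> members() { return List.copyOf(members); }
    }

    static final class Job {
        private final Cluster cluster;
        private List<String> runningOn;

        Job(Cluster cluster) {
            this.cluster = cluster;
            this.runningOn = cluster.members(); // members captured when the job starts
        }

        /** Suspend and resume in one step; the job is redeployed on the current members. */
        void restart() {
            runningOn = cluster.members();
        }

        List<String> runningOn() { return runningOn; }
    }

    public static void main(String[] args) {
        Cluster cluster = new Cluster();
        cluster.addMember("node-1");
        cluster.addMember("node-2");

        Job job = new Job(cluster);

        // Add three nodes; the job does not use them until it is restarted.
        cluster.addMember("node-3");
        cluster.addMember("node-4");
        cluster.addMember("node-5");
        System.out.println("Before restart: " + job.runningOn().size() + " members");

        job.restart();
        System.out.println("After restart: " + job.runningOn().size() + " members");
    }
}
```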
For more details about this statement, see the SQL reference documentation.