.. Licensed to the Apache Software Foundation (ASF) under one
   or more contributor license agreements. See the NOTICE file
   distributed with this work for additional information
   regarding copyright ownership. The ASF licenses this file
   to you under the Apache License, Version 2.0 (the
   "License"); you may not use this file except in compliance
   with the License. You may obtain a copy of the License at

.. http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
   software distributed under the License is distributed on an
   "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
   KIND, either express or implied. See the License for the
   specific language governing permissions and limitations
   under the License.

Concepts
========

In this section, you will learn the core concepts of *PyDolphinScheduler*.

Process Definition
------------------

Process definition describes the whole workflow except `tasks`_ and `tasks dependence`_, including
name, schedule interval, and schedule start time and end time.

Process definition could be initialized in a normal assignment statement or in a context manager.

.. code-block:: python

    from pydolphinscheduler.core.process_definition import ProcessDefinition

    # Initialization with assignment statement
    pd = ProcessDefinition(name="my first process definition")

    # Or with context manager
    with ProcessDefinition(name="my first process definition") as pd:
        pd.submit()

Process definition is the main object for communication between *PyDolphinScheduler* and the DolphinScheduler daemon.
After the process definition and its tasks are declared, you could use `submit` and `run` to notify the server of your definition.

If you just want to submit your definition and create the workflow without running it, you should use the method `submit`.
But if you want to run the workflow right after you submit it, you could use the method `run`.

.. code-block:: python

    # Just submit the definition, without running it
    pd.submit()

    # Both submit and run the definition
    pd.run()

Schedule
~~~~~~~~

We use the parameter `schedule` to determine the schedule interval of a workflow. *PyDolphinScheduler* supports seven-field
cron expressions, and the meaning of each position is as below:

.. code-block:: text

    * * * * * * *
    ┬ ┬ ┬ ┬ ┬ ┬ ┬
    │ │ │ │ │ │ │
    │ │ │ │ │ │ └─── year
    │ │ │ │ │ └───── day of week (0 - 7) (0 to 6 are Sunday to Saturday, or use names; 7 is Sunday, the same as 0)
    │ │ │ │ └─────── month (1 - 12)
    │ │ │ └───────── day of month (1 - 31)
    │ │ └─────────── hour (0 - 23)
    │ └───────────── min (0 - 59)
    └─────────────── second (0 - 59)

Here are some example crontabs:

- `0 0 0 * * ? *`: Workflow executes every day at 00:00:00.
- `10 2 * * * ? *`: Workflow executes every hour, at two minutes and ten seconds past the hour.
- `10,11 20 0 1,2 * ? *`: Workflow executes on the first and second day of each month, at 00:20:10 and 00:20:11.

Tenant
~~~~~~

Tenant is the user who runs the task command on the machine or in the virtual machine. It could be assigned by a simple string.

.. code-block:: python

    # Assign an existing tenant to the process definition
    pd = ProcessDefinition(name="process definition tenant", tenant="tenant_exists")

.. note::
    Make sure the tenant exists on the target machine, otherwise it will raise an error when you try to run the command.

Tasks
-----

Task is the minimum unit that runs an actual job, and it is a node of the DAG, aka directed acyclic graph. You could define
whatever you want in the task. A task has some required parameters that make it unique and well defined.

Here we use :py:meth:`pydolphinscheduler.tasks.Shell` as an example. The parameters `name` and `command` are required and must be provided. Parameter
`name` sets the name of the task, and parameter `command` declares the command you wish to run in this task.

.. code-block:: python

    from pydolphinscheduler.tasks.shell import Shell

    # We name this task "shell", and it just runs the command `echo shell task`
    shell_task = Shell(name="shell", command="echo shell task")

If you want to see all types of tasks, see :doc:`tasks/index`.

Tasks Dependence
~~~~~~~~~~~~~~~~

You could define many tasks in one single `Process Definition`_. If all those tasks run in parallel,
then you could leave them alone without adding any additional information. But if some tasks should
not run unless their upstream tasks in the workflow are done, you should set task dependence on them. There are
two main ways to set tasks dependence, and both of them are easy: you could use the bitwise operators `>>` and `<<`, or the task methods
`set_downstream` and `set_upstream`.

.. code-block:: python

    # Set task1 as task2's upstream
    task1 >> task2
    # You could use the method `set_downstream` too, which is the same as `task1 >> task2`
    task1.set_downstream(task2)

    # Set task1 as task2's downstream
    task1 << task2
    # It is the same as the method `set_upstream`
    task1.set_upstream(task2)

    # Besides, we could set dependence between a task and a sequence of tasks;
    # here we set `task1` as upstream to both `task2` and `task3`. It is useful
    # when some tasks share the same dependence.
    task1 >> [task2, task3]

Task With Process Definition
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In most data orchestration cases, you should assign the attribute `process_definition` to a task instance to
decide which workflow the task belongs to. You could set `process_definition` in both normal assignment mode and context manager mode.

.. code-block:: python

    # Normal assignment: you have to explicitly declare the `ProcessDefinition` instance and pass it to the task
    pd = ProcessDefinition(name="my first process definition")
    shell_task = Shell(name="shell", command="echo shell task", process_definition=pd)

    # Context manager: the `ProcessDefinition` instance pd would be implicitly assigned to the task
    with ProcessDefinition(name="my first process definition") as pd:
        shell_task = Shell(name="shell", command="echo shell task")

With `Process Definition`_, `Tasks`_, and `Tasks Dependence`_, we could build a workflow with multiple tasks.
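
As a recap, here is a minimal end-to-end sketch combining the pieces above. Names, commands, and the
start date are only illustrations, and the tenant must exist on the target machine, as noted in `Tenant`_.

.. code-block:: python

    from pydolphinscheduler.core.process_definition import ProcessDefinition
    from pydolphinscheduler.tasks.shell import Shell

    # Declare the workflow and its tasks inside a context manager,
    # then submit the whole definition to the server
    with ProcessDefinition(
        name="concepts recap",
        schedule="0 0 0 * * ? *",
        start_time="2021-01-01",
        tenant="tenant_exists",
    ) as pd:
        extract = Shell(name="extract", command="echo extract")
        load = Shell(name="load", command="echo load")

        # `load` runs only after `extract` is done
        extract >> load

        pd.submit()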