
.. Licensed to the Apache Software Foundation (ASF) under one
   or more contributor license agreements. See the NOTICE file
   distributed with this work for additional information
   regarding copyright ownership. The ASF licenses this file
   to you under the Apache License, Version 2.0 (the
   "License"); you may not use this file except in compliance
   with the License. You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
   software distributed under the License is distributed on an
   "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
   KIND, either express or implied. See the License for the
   specific language governing permissions and limitations
   under the License.

Concepts
========

In this section, you will learn the core concepts of *PyDolphinScheduler*.

Process Definition
------------------

Process definition describes everything about a workflow except its `tasks`_ and `tasks dependence`_, including
the name, schedule interval, and schedule start and end time. The schedule interval is covered in `Schedule`_ below.

Process definition can be initialized with a normal assignment statement or with a context manager.

.. code-block:: python

    # Initialization with an assignment statement
    pd = ProcessDefinition(name="my first process definition")

    # Or with a context manager
    with ProcessDefinition(name="my first process definition") as pd:
        pd.submit()

Process definition is the main object for communication between *PyDolphinScheduler* and the DolphinScheduler daemon.
After the process definition and its tasks have been declared, you can use `submit` or `run` to notify the server of your definition.
If you just want to submit your definition and create the workflow without running it, use the method `submit`.
If you want to run the workflow right after submitting it, use the method `run`.

.. code-block:: python

    # Just submit the definition, without running it
    pd.submit()

    # Both submit and run the definition
    pd.run()

Schedule
~~~~~~~~

We use the parameter `schedule` to determine the schedule interval of the workflow. *PyDolphinScheduler* supports a seven-field
crontab expression, and the meaning of each position is as below:

.. code-block:: text

    * * * * * * *
    ┬ ┬ ┬ ┬ ┬ ┬ ┬
    │ │ │ │ │ │ │
    │ │ │ │ │ │ └─── year
    │ │ │ │ │ └───── day of week (0 - 7) (0 to 6 are Sunday to Saturday, or use names; 7 is Sunday, the same as 0)
    │ │ │ │ └─────── month (1 - 12)
    │ │ │ └───────── day of month (1 - 31)
    │ │ └─────────── hour (0 - 23)
    │ └───────────── min (0 - 59)
    └─────────────── second (0 - 59)

Here are some example crontab expressions:

- `0 0 0 * * ? *`: The workflow executes every day at 00:00:00.
- `10 2 * * * ? *`: The workflow executes every hour at two minutes and ten seconds past the hour (HH:02:10).
- `10,11 20 0 1,2 * ? *`: The workflow executes on the first and second day of each month at 00:20:10 and 00:20:11.

Tenant
~~~~~~

Tenant is the user who runs the task command on the machine or in the virtual machine. It can be assigned with a simple string.

.. code-block:: python

    # Assign an existing tenant to the process definition
    pd = ProcessDefinition(name="process definition tenant", tenant="tenant_exists")

.. note::

    Make sure the tenant exists on the target machine, otherwise it will raise an error when you try to run the command.

Tasks
-----

Task is the minimum unit that runs an actual job, and it is a node of the DAG, aka directed acyclic graph. You can define
what you want to do in the task. Each task has some required parameters that make it unique and well defined.

Here we use :py:class:`pydolphinscheduler.tasks.Shell` as an example. The parameters `name` and `command` are required and must be provided.
The parameter `name` sets the name of the task, and the parameter `command` declares the command you wish to run in this task.

.. code-block:: python

    # We name this task "shell", and it just runs the command `echo shell task`
    shell_task = Shell(name="shell", command="echo shell task")

If you want to see all types of tasks, see :doc:`tasks/index`.

Tasks Dependence
~~~~~~~~~~~~~~~~

You can define many tasks in one single `Process Definition`_. If all those tasks run in parallel,
you can leave them alone without adding any additional information. But if some tasks should
not run until their upstream tasks in the workflow have finished, you should set task dependence on them. There are
two main ways to set task dependence, and both are easy: the bitwise operators `>>` and `<<`, or the task methods
`set_downstream` and `set_upstream`.

.. code-block:: python

    # Set task1 as the upstream of task2
    task1 >> task2

    # You can use the method `set_downstream` too, which is the same as `task1 >> task2`
    task1.set_downstream(task2)

    # Set task1 as the downstream of task2
    task1 << task2

    # It is the same as the method `set_upstream`
    task1.set_upstream(task2)

    # Besides, we can set dependence between a task and a sequence of tasks;
    # here we set `task1` as the upstream of both `task2` and `task3`. It is useful
    # when several tasks share the same dependence.
    task1 >> [task2, task3]

Task With Process Definition
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In most data orchestration cases, you should assign the attribute `process_definition` to a task instance to
decide which workflow the task belongs to. You can set `process_definition` with a normal assignment or in context manager mode.

.. code-block:: python

    # Normal assignment: you have to explicitly declare and pass the `ProcessDefinition` instance to the task
    pd = ProcessDefinition(name="my first process definition")
    shell_task = Shell(name="shell", command="echo shell task", process_definition=pd)

    # Context manager: the `ProcessDefinition` instance pd is implicitly attached to the task
    with ProcessDefinition(name="my first process definition") as pd:
        shell_task = Shell(name="shell", command="echo shell task")

With `Process Definition`_, `Tasks`_ and `Tasks Dependence`_, we can build a workflow with multiple tasks.
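
Putting these pieces together, here is a minimal end-to-end sketch. The import paths follow the tutorial layout and may differ
between *PyDolphinScheduler* versions; the workflow name, tenant, task names, and commands are placeholders for illustration.

.. code-block:: python

    # Assumed import paths; adjust them to the version of PyDolphinScheduler you use.
    from pydolphinscheduler.core.process_definition import ProcessDefinition
    from pydolphinscheduler.tasks.shell import Shell

    with ProcessDefinition(
        name="my_first_workflow",
        tenant="tenant_exists",
        schedule="0 0 0 * * ? *",
    ) as pd:
        # Tasks declared inside the context manager are implicitly attached to `pd`.
        extract = Shell(name="extract", command="echo extract data")
        transform = Shell(name="transform", command="echo transform data")
        load = Shell(name="load", command="echo load data")

        # Run `extract` first, then `transform` and `load` in parallel.
        extract >> [transform, load]

        # Submit the definition to the DolphinScheduler daemon without running it.
        pd.submit()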