@@ -1,63 +1,63 @@
-# DataX
-
-## Overview
-
-The DataX task type is used to execute DataX programs. For DataX nodes, the worker executes `${DATAX_HOME}/bin/datax.py` to parse the input JSON file.
-
-## Create Task
-
-- Click `Project Management -> Project Name -> Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
-- Drag the <img src="/img/tasks/icons/datax.png" width="15"/> task node from the toolbar onto the canvas.
-
-## Task Parameter
-
-- **Node name**: The node name in a workflow definition is unique.
-- **Run flag**: Identifies whether this node is scheduled normally; if it does not need to be executed, select `prohibition execution`.
-- **Descriptive information**: Describe the function of the node.
-- **Task priority**: When the number of worker threads is insufficient, tasks execute in order of priority from high to low, and tasks with the same priority execute in first-in, first-out order.
-- **Worker grouping**: Assign tasks to the machines of the worker group to execute. If `Default` is selected, a worker machine is selected at random for execution.
-- **Environment Name**: Configure the environment name in which to run the script.
-- **Times of failed retry attempts**: The number of times the task is resubmitted after a failure.
-- **Failed retry interval**: The time interval (unit: minute) for resubmitting the task after a failure.
-- **Delayed execution time**: The time (unit: minute) by which task execution is delayed.
-- **Timeout alarm**: Check the timeout alarm and timeout failure. When the task runtime exceeds the "timeout", an alarm email is sent and the task execution fails.
-- **Custom template**: Customize the content of the DataX node's JSON configuration file when the default DataSource provided does not meet your requirements.
-- **JSON**: JSON configuration file for DataX synchronization.
-- **Custom parameters**: The custom parameter types and data types are set in the same way as for the stored procedure task type; the difference is that the custom parameters of the SQL task type replace the `${variable}` placeholders in the SQL statement, whereas the stored procedure task sets values for the method by parameter order.
-- **Data source**: Select the data source from which to extract data.
-- **SQL statement**: The SQL statement used to extract data from the data source. The SQL query column names are parsed automatically when the node executes and mapped to the column names of the target table. When the column names of the source table and the target table are inconsistent, they can be converted by a column alias (`as`).
-- **Target library**: Select the target library for data synchronization.
-- **Pre-SQL**: Pre-SQL executes before the SQL statement (executed by the target database).
-- **Post-SQL**: Post-SQL executes after the SQL statement (executed by the target database).
-- **Stream limit (number of bytes)**: Limit the number of bytes for a query.
-- **Limit flow (number of records)**: Limit the number of records for a query.
-- **Running memory**: Set the minimum and maximum memory required according to the actual production environment.
-- **Predecessor task**: Selecting a predecessor task for the current task will set the selected predecessor task as upstream of the current task.
-
-## Task Example
-
-This example demonstrates how to import data from Hive into MySQL.
-
-### Configure the DataX environment in DolphinScheduler
-
-If you are using the DataX task type in a production environment, it is necessary to configure the required environment first. The following is the configuration file: `bin/env/dolphinscheduler_env.sh`.
-
-
-
-After finishing the environment configuration, restart DolphinScheduler.
-
-### Configure DataX Task Node
-
-As the default DataSource does not support reading data from Hive, a custom JSON file is required; refer to: [HDFS Writer](https://github.com/alibaba/DataX/blob/master/hdfswriter/doc/hdfswriter.md). Note: partition directories exist on the HDFS path; when importing data in real-world situations, it is recommended to pass the partition as a custom parameter.
-
-After finishing the required JSON file, you can configure the node by following the steps in the diagram below:
-
-
-
-### View Execution Result
-
-
-
-### Notice
-
-If the default DataSource provided does not meet your needs, you can configure the writer and reader of DataX in the custom template option according to the actual usage environment; the available readers and writers are documented at [DataX](https://github.com/alibaba/DataX).
+# DataX
+
+## Overview
+
+The DataX task type is used to execute DataX programs. For DataX nodes, the worker executes `${DATAX_HOME}/bin/datax.py` to parse the input JSON file.
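+For reference, a manual run of the same entry point looks roughly like the sketch below; the job path and JVM options are illustrative assumptions rather than values taken from this page.
+
+```shell
+# Hypothetical manual invocation of a DataX job (paths and flags are placeholders)
+export DATAX_HOME=/opt/soft/datax
+python "${DATAX_HOME}/bin/datax.py" --jvm="-Xms1G -Xmx1G" /tmp/jobs/hive2mysql.json
+```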
+
+## Create Task
+
+- Click Project Management -> Project Name -> Workflow Definition, and click the "Create Workflow" button to enter the DAG editing page.
+- Drag the <img src="/img/tasks/icons/datax.png" width="15"/> task node from the toolbar onto the canvas.
+
+## Task Parameter
+
+- **Node name**: The node name in a workflow definition is unique.
+- **Run flag**: Identifies whether this node can be scheduled normally; if it does not need to be executed, you can turn on the prohibition switch.
+- **Descriptive information**: Describe the function of the node.
+- **Task priority**: When the number of worker threads is insufficient, tasks are executed in order of priority from high to low; tasks with the same priority are executed on a first-in, first-out basis.
+- **Worker grouping**: Tasks are assigned to the machines of the worker group for execution. If Default is selected, a worker machine is randomly selected for execution.
+- **Environment Name**: Configure the environment name in which to run the script.
+- **Number of failed retry attempts**: The number of times the task is resubmitted after a failure.
+- **Failed retry interval**: The interval, in minutes, before a failed task is resubmitted.
+- **Delayed execution time**: The time, in minutes, by which task execution is delayed.
+- **Timeout alarm**: Check the timeout alarm and timeout failure. When the task runtime exceeds the "timeout period", an alarm email is sent and the task execution fails.
+- **Custom template**: Customize the content of the DataX node's JSON configuration file when the default data source provided does not meet your requirements.
+- **JSON**: The JSON configuration file for DataX synchronization.
+- **Custom parameters**: The custom parameter types and data types are set in the same way as for the stored procedure task type; the difference is that the custom parameters of the SQL task type replace the `${variable}` placeholders in the SQL statement, whereas the stored procedure task sets values for the method by parameter order. A toy illustration of this substitution follows the list.
+- **Data source**: Select the data source from which the data will be extracted.
+- **SQL statement**: The SQL statement used to extract data from the data source. The query column names are parsed automatically when the node is executed and mapped to the column names of the target table. When the source table and target table column names are inconsistent, they can be converted by a column alias.
+- **Target library**: Select the target library for data synchronization.
+- **Pre-SQL**: Pre-SQL is executed before the SQL statement (executed by the target library).
+- **Post-SQL**: Post-SQL is executed after the SQL statement (executed by the target library).
+- **Stream limit (number of bytes)**: Limits the number of bytes in the query.
+- **Limit flow (number of records)**: Limits the number of records in the query.
+- **Running memory**: The minimum and maximum memory required can be configured to suit the actual production environment.
+- **Predecessor task**: Selecting a predecessor task for the current task will set the selected predecessor task as upstream of the current task.
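+The toy shell sketch below only illustrates the `${variable}` substitution mentioned under **Custom parameters**; the parameter name `dt`, its value, and the query are invented for this illustration and this is not DolphinScheduler's internal code.
+
+```shell
+# Toy illustration: a custom parameter dt = 2022-01-01 replacing ${dt} in the SQL statement field
+dt="2022-01-01"
+sql="select id, name from ods_user where ds = '\${dt}'"
+echo "${sql//\$\{dt\}/$dt}"   # prints: select id, name from ods_user where ds = '2022-01-01'
+```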
+
+## Task Example
+
+This example demonstrates importing data from Hive into MySQL.
+
+### Configuring the DataX environment in DolphinScheduler
+
+If you are using the DataX task type in a production environment, it is necessary to configure the required environment first. The configuration file is as follows: `/dolphinscheduler/conf/env/dolphinscheduler_env.sh`.
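+A minimal sketch of the relevant entries in `dolphinscheduler_env.sh` is shown below; the installation paths are assumptions for illustration. `DATAX_HOME` is the variable referenced by this task type, and `PYTHON_HOME` is included only because `datax.py` needs a Python interpreter.
+
+```shell
+# Hypothetical entries for the DataX task type; adjust the paths to your installation
+export DATAX_HOME=/opt/soft/datax
+export PYTHON_HOME=/opt/soft/python
+export PATH=$DATAX_HOME/bin:$PYTHON_HOME/bin:$PATH
+```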
+
+
+
+After the environment has been configured, DolphinScheduler needs to be restarted.
+
+### Configuring DataX Task Node
+
+As the default data source does not support reading data from Hive, a custom JSON configuration is required; refer to: [HDFS Writer](https://github.com/alibaba/DataX/blob/master/hdfswriter/doc/hdfswriter.md). Note: partition directories exist on the HDFS path; when importing data in real-world situations, it is recommended to pass the partition as a custom parameter.
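+The sketch below shows what such a custom JSON job could look like, written as a shell heredoc so it can be smoke-tested from the command line before being pasted into the custom template field. Every address, path, credential, table, and column name is a made-up placeholder; the reader and writer parameter names follow the DataX `hdfsreader` and `mysqlwriter` documentation, and `${dt}` stands for the partition passed in as a custom parameter.
+
+```shell
+# Hypothetical Hive-to-MySQL job file (all values are placeholders; adjust to your cluster)
+cat > /tmp/hive2mysql.json <<'EOF'
+{
+  "job": {
+    "setting": { "speed": { "channel": 1 } },
+    "content": [
+      {
+        "reader": {
+          "name": "hdfsreader",
+          "parameter": {
+            "defaultFS": "hdfs://namenode:8020",
+            "path": "/user/hive/warehouse/demo.db/ods_user/dt=${dt}/*",
+            "fileType": "text",
+            "fieldDelimiter": "\t",
+            "column": [
+              { "index": 0, "type": "long" },
+              { "index": 1, "type": "string" }
+            ]
+          }
+        },
+        "writer": {
+          "name": "mysqlwriter",
+          "parameter": {
+            "writeMode": "insert",
+            "username": "root",
+            "password": "******",
+            "column": ["id", "name"],
+            "connection": [
+              {
+                "jdbcUrl": "jdbc:mysql://mysql-host:3306/demo",
+                "table": ["ods_user"]
+              }
+            ]
+          }
+        }
+      }
+    ]
+  }
+}
+EOF
+
+# Optional smoke test outside DolphinScheduler (replace the ${dt} placeholder in the path first)
+python "${DATAX_HOME}/bin/datax.py" /tmp/hive2mysql.json
+```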
+
+After writing the required JSON file, you can configure the node content by following the steps in the diagram below.
+
+
+
+### View run results
+
+
+
+### Notice
+
+If the default data source provided does not meet your needs, you can configure the writer and reader of DataX in the custom template option according to the actual usage environment; the available readers and writers are documented at https://github.com/alibaba/DataX.