
Dataset visualization issue #2915

Open

salamandra2508 opened this issue Oct 2, 2024 · 4 comments

Comments

@salamandra2508

After installing Marquez and setting it up in my environment (AWS EKS + Helm + Argo CD), I can see the Airflow DAGs in the Marquez UI but do not see datasets.
[Screenshots, Oct 2, 2024: Marquez UI showing jobs from the Airflow DAGs, but an empty Datasets page]

How can I achieve this? Or can someone provide an example DAG to check it with? I'm pretty new to this stuff.


boring-cyborg bot commented Oct 2, 2024

Thanks for opening your first issue in the Marquez project! Please be sure to follow the issue template!

@phixMe (Member) commented Oct 8, 2024

It looks like your configuration is good, since you've got jobs in the system.

Are you using the correct Airflow operators to extract lineage metadata (i.e., not the Python or Kubernetes operators)? Have you tried turning on the debug logs in Airflow so that you can see that the inputs and outputs are indeed populated with schemas? Are there namespaces created for your datasets?
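As a quick sanity check of the dataset path, here is a minimal sketch of a smoke-test DAG, modeled on the Marquez Airflow example, using an operator the lineage integration can extract from; the `example_db` connection id and the table are placeholders. Debug logging can be enabled with the `AIRFLOW__LOGGING__LOGGING_LEVEL=DEBUG` environment variable.

```python
# Minimal smoke-test DAG (sketch): PostgresOperator is one of the operators
# the OpenLineage Airflow integration extracts lineage from, so its SQL
# inputs/outputs should show up as datasets in Marquez.
# The connection id "example_db" and the table name are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="lineage_smoke_test",
    start_date=datetime(2024, 10, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PostgresOperator(
        task_id="create_counts_table",
        postgres_conn_id="example_db",
        sql="CREATE TABLE IF NOT EXISTS counts (value INTEGER)",
    )
```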

@salamandra2508 (Author)

> It looks like your configuration is good, since you've got jobs in the system.
>
> Are you using the correct Airflow operators to extract lineage metadata (i.e., not the Python or Kubernetes operators)? Have you tried turning on the debug logs in Airflow so that you can see that the inputs and outputs are indeed populated with schemas? Are there namespaces created for your datasets?

I tried to create a DAG based on the example (btw, the example from the repo, https://github.com/MarquezProject/marquez/blob/main/examples/airflow/airflow.md, works fine with the Postgres operator), but when I created the same DAG via the SparkSubmitOperator, I can see jobs but no datasets.

My Spark script:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
import sys


def run_stage(stage):
    spark = SparkSession.builder.appName(f"CombinedDataJob_{stage}").getOrCreate()

    s3_output_table1 = "s3a://test-busket/marquez/table1.csv"
    s3_output_table2 = "s3a://test-busket/marquez/table2.csv"
    s3_output_merged = "s3a://test-busket/marquez/merged_table.csv"
    s3_output_final = "s3a://test-busket/marquez/final_dataset.csv"

    if stage == 'generate':
        # Stage 1: Generate 2 tables with some data and save them
        table1_data = [(1, 'A'), (2, 'B')]
        table2_data = [(1, 'First'), (2, 'Second')]

        df_table1 = spark.createDataFrame(table1_data, ['id', 'value'])
        df_table1.write.mode('overwrite').csv(s3_output_table1, header=True)

        df_table2 = spark.createDataFrame(table2_data, ['id', 'description'])
        df_table2.write.mode('overwrite').csv(s3_output_table2, header=True)

    elif stage == 'merge':
        # Stage 2: Merge these tables into a third table
        df_table1 = spark.read.csv(s3_output_table1, header=True, inferSchema=True)
        df_table2 = spark.read.csv(s3_output_table2, header=True, inferSchema=True)
        df_merged = df_table1.join(df_table2, 'id')
        df_merged.write.mode('overwrite').csv(s3_output_merged, header=True)

    elif stage == 'process':
        # Stage 3: Create a dataset for the output of the merged table
        df_merged = spark.read.csv(s3_output_merged, header=True, inferSchema=True)
        df_final = df_merged.withColumn('processed_flag', lit(True))
        df_final.write.mode('overwrite').csv(s3_output_final, header=True)

    spark.stop()


if __name__ == "__main__":
    stage = sys.argv[1]  # The stage to run is passed as the first argument
    run_stage(stage)
```
My DAG:

```python
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from airflow.utils.dates import days_ago
from airflow.datasets import Dataset  # Importing the Dataset class
from datetime import timedelta

# Import necessary methods and variables
from draif_common.common_env import get_default_spark_conf, LIB_FOLDER

default_args = {
    'owner': 'datascience',
    'depends_on_past': False,
    'start_date': days_ago(1),
    'email_on_failure': False,
    'email_on_retry': False,
    'email': ['[email protected]'],
    'execution_timeout': timedelta(minutes=300),
}

dag = DAG(
    dag_id='complex_data_pipeline_dag',
    schedule_interval=None,
    catchup=False,
    default_args=default_args,
    description='A complex data pipeline DAG'
)

conf = get_default_spark_conf()

generate_tables = SparkSubmitOperator(
    task_id='generate_tables',
    application=f'{LIB_FOLDER}/combined_data_job.py',
    conn_id='SparkConnection',
    total_executor_cores=2,
    executor_cores=1,
    executor_memory='1g',
    driver_memory='1g',
    conf=conf,
    application_args=['generate'],
    dag=dag
)

merge_tables = SparkSubmitOperator(
    task_id='merge_tables',
    application=f'{LIB_FOLDER}/combined_data_job.py',
    conn_id='SparkConnection',
    total_executor_cores=2,
    executor_cores=1,
    executor_memory='1g',
    driver_memory='1g',
    conf=conf,
    application_args=['merge'],
    dag=dag
)

process_final_dataset = SparkSubmitOperator(
    task_id='process_final_dataset',
    application=f'{LIB_FOLDER}/combined_data_job.py',
    conn_id='SparkConnection',
    total_executor_cores=2,
    executor_cores=1,
    executor_memory='1g',
    driver_memory='1g',
    conf=conf,
    application_args=['process'],
    outlets=[
        # Path must match the final output the script writes
        Dataset('s3a://test-busket/marquez/final_dataset.csv')  # Track final dataset
    ],
    dag=dag
)

generate_tables >> merge_tables >> process_final_dataset
```

@pawel-big-lebowski (Collaborator)

@salamandra2508 I see you're trying to manually define outlets in the operator. How about turning on the Spark OpenLineage listener and getting the lineage automatically from the Spark job? See https://openlineage.io/docs/integrations/spark/.
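A sketch of what that could look like with the existing SparkSubmitOperator tasks; the Marquez URL, the namespace, and the openlineage-spark version below are assumptions to adjust for your setup:

```python
# Sketch: enable the OpenLineage Spark listener by extending the conf dict
# that is already passed to every SparkSubmitOperator task.
conf = get_default_spark_conf()
conf.update({
    # Pull the OpenLineage agent jar (pick the version/Scala suffix for your cluster).
    "spark.jars.packages": "io.openlineage:openlineage-spark_2.12:1.22.0",
    # Register the listener that emits lineage events for Spark actions.
    "spark.extraListeners": "io.openlineage.spark.agent.OpenLineageSparkListener",
    # Send events to the Marquez API over HTTP (placeholder URL).
    "spark.openlineage.transport.type": "http",
    "spark.openlineage.transport.url": "http://marquez:5000",
    # Namespace under which the Spark jobs/datasets appear in Marquez.
    "spark.openlineage.namespace": "spark-jobs",
})
```

With the listener in place, the reads and writes to the `s3a://` paths in the script should be reported as input/output datasets automatically, without declaring `outlets` by hand.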
