
Improve DX with just #165

Closed
wants to merge 6 commits into from

Conversation

@gwenwindflower (Contributor) commented Mar 8, 2024

resolves #144

This is a:

  • documentation update
  • bug fix with no breaking changes
  • new functionality
  • a breaking change

All pull requests from community contributors should target the main branch (default).

Description & motivation

codegen is awesome, but the developer experience is not the best. Because it aims to use dbt itself to template the code, and doesn't compromise with any sketchy hacks to write files, you're left with command line output that you have to copy and paste into files by hand. Many people over the years (myself included) have wrapped codegen with shell scripts, Python scripts, etc. to solve this, but those efforts have largely remained isolated and haven't made their way back into this repo.
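For anyone who hasn't felt this pain: the typical workaround is redirecting the macro's logged output into a file by hand (the source and table names below are just examples), then trimming dbt's log lines off the top:

dbt run-operation codegen.generate_base_model \
  --args '{"source_name": "jaffle_shop", "table_name": "customers"}' \
  > models/staging/stg_customers.sql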

Thanks to some excellent work and thinking from @wjhrdy, we're aiming to change that, starting with adding a robust set of just commands to the repo.

Using just, we can wrap and chain the macros with all the power of shell scripts (and even other scripting languages!).
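As a rough sketch of the flavor (the recipe and file names here are illustrative rather than the exact recipes in this branch), a recipe can run the macro and write the result straight to a file:

dbt := "dbt"

# Generate a staging model for a source table and write it straight to a file.
dbt-generate-base-model source table:
    @{{dbt}} run-operation codegen.generate_base_model \
        --args '{"source_name": "{{source}}", "table_name": "{{table}}"}' \
        > generated/stg_{{table}}.sql
    @echo "Model {{table}} generated in generated/stg_{{table}}.sql"

In practice the recipes in the branch also pipe the output through awk/sed to strip dbt's log preamble, as you can see in the diff excerpts below.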

Checklist

  • This code is associated with an issue which has been triaged and accepted for development.
  • I have verified that these changes work locally
  • I have updated the README.md (if applicable)
  • I have added tests & descriptions to my models (and macros if applicable)
  • I have added an entry to CHANGELOG.md

@gwenwindflower (Contributor, Author) commented Mar 8, 2024

@wjhrdy here's our branch to collaborate on. I've left the core recipes as they are, but I've expanded the file with heavy commenting/documentation and added the ability to choose your package manager and the related config. We don't offer any official documentation from dbt Labs on using poetry with dbt Core, so I don't feel comfortable relying on that as the only path for these recipes to work, hence the added logic. I'm totally good with keeping poetry as an option, though, and I made sure to highlight in the comments why it's a good choice for this purpose. I haven't fully tested that this system works (it's just a draft for now), but it should let people use pip by default while still optionally going with poetry or uv (my personal preference).

Let me know if this works for you, and feel free to push commits from here! Thanks again for the amazing work on this already, looking forward to seeing where we can go with this!

As with all things, I'm very open to discussing and changing my mind about how to approach this 😸. We're trying to push people towards per-project installs of dbt in a virtual environment, so this is in keeping with that.
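The package-manager switch is roughly in this spirit (the variable, recipe, and file names here are illustrative, not necessarily what ends up in the branch):

# Pick pip (default), poetry, or uv via an environment variable.
pkg_manager := env_var_or_default("PKG_MANAGER", "pip")

# Install dbt and the dev requirements with whichever tool was chosen.
install:
    @if [ "{{pkg_manager}}" = "poetry" ]; then \
        poetry install; \
    elif [ "{{pkg_manager}}" = "uv" ]; then \
        uv pip install -r dev-requirements.txt; \
    else \
        python -m pip install -r dev-requirements.txt; \
    fi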

@gwenwindflower (Contributor, Author) commented:

Also @dbeatty10 if you want to get in on this, since you've also got the just bug!


alias gmy := dbt-generate-model-yaml

# Generate YAML config for any unconfigured SQL models.

This should be a one-line comment and should move down to just above dbt-generate-missing-yaml; otherwise it doesn't show up when you list recipes.
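For context: just uses the comment line directly above a recipe as that recipe's doc string, so once the comment is moved, the listing picks it up. Roughly (output format approximate):

$ just --list
Available recipes:
    dbt-generate-missing-yaml folder=default_folder # Generate YAML config for any unconfigured SQL models.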

# Optionally accepts a parameter for folder to search in.
default_folder := 'models'
dbt-generate-missing-yaml folder=default_folder:
@for sql_file in $(find {{folder}} -type f -name '*.sql'); do \

Suggested change
@for sql_file in $(find {{folder}} -type f -name '*.sql'); do \
@for sql_file in $(find {{folder}} -type f -name '*.sql' ! -name '*_v[0-9]*.sql' ! -name '*_V[0-9]*.sql'); do \

This should be changed to accommodate the pattern of having versioned SQL files with a single yml definition.
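A hypothetical layout to make the intent concrete:

models/staging/stg_orders_v1.sql    # skipped by ! -name '*_v[0-9]*.sql'
models/staging/stg_orders_v2.sql    # skipped as well
models/staging/stg_orders.yml       # the single yml definition covering both versions
models/staging/stg_payments.sql     # unversioned model, still picked up by the find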

@{{dbt}} run-operation codegen.generate_model_import_ctes --args '{"model_name": "{{model_name}}"}' > /tmp/{{model_name}}.tmpsql
@awk '/with .* as \($/{p=1} p' /tmp/{{model_name}}.tmpsql | sed 's/^.*with/with/' > /tmp/temp{{model_name}} && mv /tmp/temp{{model_name}} {{generated_folder}}/{{model_name}}.sql
@echo "SQL {{model_name}} updated in {{generated_folder}}/{{model_name}}.sql"

@wjhrdy commented Mar 22, 2024

This is more likely a feature request for codegen itself, but it's a feature I added to our justfile:

Suggested change
# Generate docs blocks markdown for all unique columns in a folder. For making DRY documentation. Inspired by https://docs.getdbt.com/blog/generating-dynamic-docs-dbt
dbt-generate-column-docs-md folder=default_folder:
    #!/usr/bin/env bash
    models=$({{dbt}} ls -m {{folder}} -q --output name | xargs -I{} echo -n ' "{}",' | sed 's/,#$//')
    models="[$models]"
    model_name="_$(echo {{folder}} | cut -d'/' -f2- | tr '/' '_')"
    {{dbt}} run-operation generate_model_yaml --args "{ \"include_data_types\": false, \"model_names\": $models}" > /tmp/${model_name}.tmpyml
    awk '/models:/{p=1} p' /tmp/${model_name}.tmpyml > /tmp/temp${model_name} && mv /tmp/temp${model_name} {{folder}}/${model_name}.yml.tmp
    grep ' \- name:' {{folder}}/${model_name}.yml.tmp | cut -c 15- | sort -u > {{folder}}/${model_name}.md.tmp
    if [ ! -d "generated" ]; then
        mkdir -p generated
    fi
    if [ -f {{folder}}/${model_name}.md ]; then
        merge=true
        output_folder="generated"
    else
        output_folder="{{folder}}"
    fi
    while read line; do
        echo "{% docs column_${model_name}__${line} %}" >> ${output_folder}/${model_name}.md
        echo "" >> ${output_folder}/${model_name}.md
        echo "{% enddocs %}" >> ${output_folder}/${model_name}.md
        echo "" >> ${output_folder}/${model_name}.md
    done < {{folder}}/${model_name}.md.tmp
    rm {{folder}}/${model_name}.md.tmp
    rm {{folder}}/${model_name}.yml.tmp
    echo "Docs blocks ${model_name} generated in ${output_folder}/${model_name}.md"
    if [ "$merge" = true ]; then
        file1="${output_folder}/${model_name}.md"
        file2="{{folder}}/${model_name}.md"
        output="{{folder}}/${model_name}.md"
        if [[ "$OSTYPE" == "darwin"* ]]; then
            echo "Warning: If you are running this script on a Mac, please make sure you have gawk installed. You can install it using 'brew install gawk'"
            awk_cmd="gawk"
        else
            awk_cmd="awk"
        fi
        $awk_cmd '
            /^{% docs / {tag=$0; next}
            /^{% enddocs %}/ {a[tag]=doc; doc=""; next}
            {doc=(doc==""?"":doc "\n") $0}
            END {n = asorti(a, b); for (i=1; i<=n; i++) print b[i] "\n" a[b[i]] "\n{% enddocs %}\n"}
        ' "$file1" "$file2" > /tmp/${model_name}.md.tmp && mv /tmp/${model_name}.md.tmp $output
        echo "Docs blocks ${model_name} merged in ${output}"
    fi
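A hypothetical run, to show the shape of what the recipe emits (folder and column names are illustrative):

$ just dbt-generate-column-docs-md models/staging
Docs blocks _staging generated in models/staging/_staging.md

$ cat models/staging/_staging.md
{% docs column__staging__customer_id %}

{% enddocs %}

{% docs column__staging__order_id %}

{% enddocs %}

The blank line between each pair of tags is where the actual description goes; the merge branch at the end folds newly generated blocks into an existing .md rather than clobbering it.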

@echo "Model {{table}} generated in generated/stg_{{table}}.sql"

alias gbase := dbt-generate-base-model


This is probably more useful for established projects, but I use this all the time:

Suggested change
dbt-merge-model-yaml new_yaml old_yaml output:
    #!/usr/bin/env python3
    import yaml
    from collections import OrderedDict
    import sys
    import re

    def selective_quote_presenter(dumper, data):
        # Quote strings that contain characters that are hard for YAML to parse
        if re.search(r"[\'\{\}\[\],:]", data) or data == '':
            return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='"')
        return dumper.represent_scalar('tag:yaml.org,2002:str', data, style=None)
    yaml.add_representer(str, selective_quote_presenter)

    # Add a constructor for OrderedDict to the yaml module
    def dict_constructor(loader, node):
        return OrderedDict(loader.construct_pairs(node))
    yaml.add_constructor(yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG, dict_constructor)

    # Add a representer for OrderedDict to the Dumper class
    def dict_representer(dumper, data):
        return dumper.represent_mapping(yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG, data.items())
    yaml.Dumper.add_representer(OrderedDict, dict_representer)

    # Get the file names from the recipe's parameters (quoted so the substituted paths are Python strings)
    generated = "{{new_yaml}}"
    base = "{{old_yaml}}"
    output_file = "{{old_yaml}}"

    # Load the contents of the first file
    with open(generated, 'r') as file:
        data1 = yaml.load(file, Loader=yaml.Loader)
    # Load the contents of the second file
    with open(base, 'r') as file:
        data2 = yaml.load(file, Loader=yaml.Loader)

    # Merge the two dictionaries
    for model2 in data2['models']:
        model1 = next((item for item in data1['models'] if item['name'] == model2['name']), None)
        if model1:
            merged_model = {**model1, **{k: model2[k] for k in model2.keys() - {'columns'}}}
            model1.update(merged_model)
            for column2 in model2.get('columns', []):
                column1 = next((item for item in model1.get('columns', []) if item['name'] == column2['name']), None)
                if column1:
                    merged_column = {**column1, **column2}
                    column1.update(merged_column)
                else:
                    model1.get('columns', []).append(column2)
        else:
            data1['models'].append(model2)

    model_keys_order = ['name', 'tags', 'description', 'docs', 'latest_version', 'deprecation_date', 'access', 'config', 'constraints', 'tests', 'columns', 'versions']
    column_keys_order = ['name', 'data_type', 'description', 'meta', 'quote', 'constraints', 'tests', 'tags']
    # After merging, sort the keys
    for model in data1['models']:
        model1 = OrderedDict(sorted(model.items(), key=lambda i: model_keys_order.index(i[0])))
        for column in model1.get('columns', []):
            column1 = OrderedDict(sorted(column.items(), key=lambda i: column_keys_order.index(i[0])))
            column.clear()
            column.update(column1)
        model.clear()
        model.update(model1)

    # Dump the merged data, then add a blank line before each '- name:' entry
    # that doesn't immediately follow a 'columns:' line
    output = yaml.dump(data1, Dumper=yaml.Dumper)
    output = re.sub(r"(?<!columns:\n)( - name:.*\n)", r"\n\1", output)

    # Write the merged data back to the output file
    with open(output_file, 'w') as file:
        file.write(output)

dbt-update-column-yaml folder=default_folder target=target_default:
    #!/usr/bin/env bash
    for yml_file in $(find {{folder}} -type f -name '*.yml' ! -name '_*'); do
        model_name=${yml_file##*/}
        model_name=${model_name%.yml}
        just dbt-generate-model-yaml $model_name generated {{target}}
        just dbt-merge-model-yaml generated/$model_name.yml $yml_file $yml_file
    done
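A sketch of how the pair would be driven day to day (paths illustrative): dbt-update-column-yaml walks a folder, regenerates YAML for each model into generated/, then merges it back into the existing yml so hand-written descriptions and tests survive while newly added columns get picked up.

$ just dbt-update-column-yaml models/marts
# which, per model, amounts to roughly:
#   just dbt-generate-model-yaml orders generated <target>
#   just dbt-merge-model-yaml generated/orders.yml models/marts/orders.yml models/marts/orders.yml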

@wjhrdy left a comment:

I added a few features and fixes. If you want me to do this in a different way, please let me know.

@jasnonaz commented Apr 4, 2024

Hi Willy,

I’m Jason - manager of the DX team here at dbt Labs. I want to thank you for all of the time and effort you’ve put into thinking through this PR and how it could improve dbt-codegen.

I was running through the upcoming release with Winnie and seeing this PR made me realize two things:

  1. This is an extremely interesting angle to investigate
  2. This is a relatively large change to how dbt-codegen is used, one that most likely falls outside the scope of the upcoming minor release, which is mainly aimed at fixes and improvements that fit squarely within existing functionality

For that reason, we’re going to close this as won’t do for now. It’s possible that in the future we might be looking to more holistically revamp the codegen package, at which time we’d be happy to pick this up with you.

I want to apologize for having this happen at this stage in the process. Typically the best time for us to mark something as "won't do" is when the initial issue is created, not after you've put in a bunch of work. That's on me for not setting the bounds correctly on this codegen release. I hope you know that we really appreciate all the care and thought you've put into this.

If I might make a suggestion, you’ve done a bunch of cool investigation here and this might make for an interesting #show-and-tell post on the dbt Community Forum - I think the folks there would really appreciate seeing the thinking you've done.

Thank you again,

Jason

Successfully merging this pull request may close these issues.

Improving interface within a dbt repo using just