Refactoring of ETL/DbEntity into ETL/DbModel #138

Merged: smgallo merged 16 commits into ubccr:xdmod6.7 from etl/db-model on May 15, 2017

@smgallo smgallo commented May 9, 2017

This PR refactors the existing ETL/DbEntity code into a new namespace (ETL/DbModel), removes cruft that was not needed, and makes the code easier to extend. This work is in preparation for supporting improvements such as better subquery support, programmatic changes to queries and tables during ETL, and renaming columns as opposed to dropping and re-adding them.

See also PR ubccr/xdmod-xsede#32 for additional required changes.

Note that commit 533b2f3 clearly shows several files as renamed, and 0969e8f shows many of the changes to those files, but the code shown at the end of the PR shows them as deleted and recreated. I'm not sure why this happens.

Description

The ETL\DbEntity code had been largely untouched since its inception 1.5 years ago and needed to be updated to support desirable features such as subqueries, change column, and improved programmatic access to and modification of tables and queries during ETL (for batch aggregation speed improvements, for example).

The newly refactored ETL/DbModel code has been updated to continue to support a lightweight representation of database queries and tables.

The following changes have been made:

  • The namespace has been changed to a more appropriate ETL/DbModel
  • There are no longer any public properties. All properties are stored in a protected properties array in DbEntity.php and accessed via __get(), __set(), and __isset(). This means that classes no longer need to implement getters in most cases and only need to extend __set() for complex data types, such as arrays of objects, where a particular type of object must be created.
  • Each object is capable of exporting itself as a stdClass object that can then be fed back into itself as the configuration object. This allows for programmatic modification of an object on the fly during ETL.
  • Improved use of interfaces to enforce the class contract.
  • Improved error checking.
  • The Table::discover() method is no longer static, as it did not need to be.
  • Removed much duplicated code, as well as code that is no longer needed because it is now implemented in the base or parent class.
  • Much of the functionality is implemented in DbEntity and classes that extend this class now only implement the definition, verification, and setting of the properties that they define.
  • Added tests to cover the following cases:
    • Creating a table from a JSON file
    • Generating CREATE TABLE statements
    • Verification of table JSON
    • Programmatic creation of table elements from stdClass objects
    • Programmatic altering of table structure and generating an ALTER TABLE statement
    • Resetting table properties to their default values
    • Generating a query from a JSON file and generating a SELECT statement
    • Generating CREATE TABLE and SELECT statements for an aggregation table
    • Generating a table and query object from a JSON file, extracting a stdClass representation, and feeding this back to ensure the same statement is generated
  • Minor bugfixes discovered during testing
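
For readers unfamiliar with the pattern described in the list above, here is a minimal sketch of a protected properties array accessed through __get(), __set(), and __isset(), plus a stdClass export that can be fed back in as configuration. The Sketch* class names, the property names, and the toStdClass() method name are hypothetical and chosen for illustration only; this is not the actual ETL\DbModel code.

```php
<?php
// Illustrative only: the Sketch* names are hypothetical, not the ETL\DbModel classes.

abstract class SketchEntity
{
    // Subclasses declare their property names and default values here.
    protected $properties = array();

    public function __construct(stdClass $config)
    {
        foreach (get_object_vars($config) as $name => $value) {
            $this->__set($name, $value);
        }
    }

    public function __get($name)
    {
        $this->assertDefined($name);
        return $this->properties[$name];
    }

    public function __set($name, $value)
    {
        $this->assertDefined($name);
        $this->properties[$name] = $value;
    }

    public function __isset($name)
    {
        return isset($this->properties[$name]);
    }

    // Export the current state in the same shape that the constructor accepts.
    public function toStdClass()
    {
        $data = new stdClass();
        foreach ($this->properties as $name => $value) {
            $data->$name = $value;
        }
        return $data;
    }

    private function assertDefined($name)
    {
        if (!array_key_exists($name, $this->properties)) {
            throw new InvalidArgumentException("Undefined property '$name'");
        }
    }
}

// Simple properties only: no getters or setters need to be written.
class SketchColumn extends SketchEntity
{
    protected $properties = array('name' => null, 'type' => null, 'nullable' => true);
}

class SketchTable extends SketchEntity
{
    protected $properties = array('name' => null, 'columns' => array());

    public function __set($name, $value)
    {
        if ('columns' === $name) {
            // Complex type: build column objects from their stdClass definitions.
            $value = array_map(function ($definition) {
                return new SketchColumn($definition);
            }, $value);
        }
        parent::__set($name, $value);
    }

    public function toStdClass()
    {
        $data = parent::toStdClass();
        // Export nested objects too so the result can be used as a configuration.
        $data->columns = array_map(function ($column) {
            return $column->toStdClass();
        }, $this->properties['columns']);
        return $data;
    }
}

// Round trip: configure, export, modify the export, and configure a new object.
$config = (object) array(
    'name'    => 'jobs',
    'columns' => array(
        (object) array('name' => 'id', 'type' => 'int(11)', 'nullable' => false),
    ),
);
$table = new SketchTable($config);
$exported = $table->toStdClass();
$exported->name = 'jobs_staging';
$copy = new SketchTable($exported);
echo $copy->name, ' ', $copy->columns[0]->type, PHP_EOL;
```

In this sketch only SketchTable overrides __set(), because it is the only class that must turn plain definitions into objects of a particular type; SketchColumn inherits everything from the base class.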

Motivation and Context

General code cleanup and preparation for new features. Also added tests.

Tests performed

Component tests were run (PHPUnit), and the resource-allocations and xdcdb-jobs pipelines were run with their data compared to a baseline generated from v6.6. The baseline commands are shown here for reference; the 6.7 branch was run with the same options.

Test resource-allocations and xdcdb-jobs pipelines

~/xdmod-6.6-baseline/share/tools/etl$ php etl_overseer.php -c ../../../etc/etl/etl.json -n 1 -p resource-allocations -o "truncate_destination=true" -v notice

2017-05-08 14:47:39 [notice] dw_extract_transform_load start (process_start_time: 2017-05-08 14:47:39)
2017-05-08 14:47:39 [warning] Duplicate Data Endpoint name 'Cloud DB'
2017-05-08 14:47:39 [notice] Start processing section 'resource-allocations'
2017-05-08 14:47:41 [notice] (action: ResourceAllocationsIngestor (ETL\Ingestor\DatabaseIngestor), start_time: 1494269259.7344, end_time: 1494269261.4183, elapsed_time: 1.6839, records_examined: 468, records_loaded: 468)
2017-05-08 14:47:41 [notice] (action: XrasHistoricalUpdate (ETL\Ingestor\UpdateIngestor), start_time: 1494269261.4222, end_time: 1494269261.8926, elapsed_time: 0.47046, records_loaded: 226, records_updated: 223)
2017-05-08 14:47:42 [notice] aggregate end (unit: quarter, periods: 34, start_date: none, end_date: none, start_time: 1494269262.1477, end_time: 1494269262.3801, elapsed_time: 0.23241)
2017-05-08 14:47:42 [notice] Duplicate column after substitution: ("${AGGREGATION_UNIT}: ${:PERIOD_VALUE}") '${AGGREGATION_UNIT}' -> 'year'
2017-05-08 14:47:42 [notice] aggregate end (unit: year, periods: 9, start_date: none, end_date: none, start_time: 1494269262.4004, end_time: 1494269262.4574, elapsed_time: 0.05709)
2017-05-08 14:47:42 [notice] end (action: ResourceAllocationsAggregator (ETL\Aggregator\SimpleAggregator), start_time: 1494269261.8954, end_time: 1494269262.4616, elapsed_time: 0.009s)
2017-05-08 14:47:42 [notice] Finished processing section 'resource-allocations'
2017-05-08 14:47:42 [notice] dw_extract_transform_load end (process_end_time: 2017-05-08 14:47:42)

~/xdmod-6.6-baseline/share/tools/etl$ php etl_overseer.php -c ../../../etc/etl/etl.json -s 2016-12-01 -e "2016-12-31 23:59:59" -p xdcdb-jobs -o "truncate_destination=true" -v notice

2017-05-08 15:01:55 [notice] dw_extract_transform_load start (process_start_time: 2017-05-08 15:01:55)
2017-05-08 15:01:55 [warning] Duplicate Data Endpoint name 'Cloud DB'
2017-05-08 15:01:55 [notice] Start processing section 'xdcdb-jobs'
2017-05-08 15:06:42 [notice] (action: XdcdbJobRecordIngestor (ETL\Ingestor\DatabaseIngestor), start_time: 1494270115.3477, end_time: 1494270402.7072, elapsed_time: 287.35946, records_examined: 923746, records_loaded: 923746)
2017-05-08 15:07:07 [notice] (action: XdcdbPostIngestJobUpdates (ETL\Maintenance\ExecuteSql), start_time: 1494270402.7114, end_time: 1494270427.1868, elapsed_time: 24.47538)
2017-05-08 15:08:02 [notice] aggregate end (unit: day, periods: 38, start_date: 2016-12-01 00:00:00, end_date: 2016-12-31 23:59:59, start_time: 1494270427.2551, end_time: 1494270482.9501, elapsed_time: 55.69507)
2017-05-08 15:08:29 [notice] aggregate end (unit: month, periods: 2, start_date: 2016-12-01 00:00:00, end_date: 2016-12-31 23:59:59, start_time: 1494270483.1032, end_time: 1494270509.1344, elapsed_time: 26.03122)
2017-05-08 15:08:55 [notice] aggregate end (unit: quarter, periods: 1, start_date: 2016-12-01 00:00:00, end_date: 2016-12-31 23:59:59, start_time: 1494270509.2079, end_time: 1494270535.3413, elapsed_time: 26.13339)
2017-05-08 15:08:57 [notice] Duplicate column after substitution: ("${AGGREGATION_UNIT}: ${:PERIOD_VALUE}") '${AGGREGATION_UNIT}' -> 'year'
2017-05-08 15:09:20 [notice] aggregate end (unit: year, periods: 1, start_date: 2016-12-01 00:00:00, end_date: 2016-12-31 23:59:59, start_time: 1494270535.3905, end_time: 1494270560.3657, elapsed_time: 24.97521)
2017-05-08 15:09:20 [notice] end (action: XdcdbJobRecordAggregator (ETL\Aggregator\SimpleAggregator), start_time: 1494270427.1898, end_time: 1494270560.3831, elapsed_time: 2.22s)
2017-05-08 15:09:20 [notice] Finished processing section 'xdcdb-jobs'
2017-05-08 15:09:20 [notice] dw_extract_transform_load end (process_end_time: 2017-05-08 15:09:20)

Verify data in generated tables. Note that the default values for submission_venue_id, job_record_type_id, and job_task_type_id have changed between 6.6 and 6.7 so these columns are ignored in the jobs data (we do not currently populate them other than the default).

./verify_table_data.php -c datawarehouse -s modw_baseline -d modw_etltest \
-t resource_allocations -t resourceallocationfact_by_quarter -t resourceallocationfact_by_year \
-n 2 -v info

2017-05-08 13:57:08 [notice] Compare tables src=modw_baseline.resource_allocations, dest=modw_etltest.resource_allocations
2017-05-08 13:57:08 [info] 10 columns
2017-05-08 13:57:08 [info] Row counts: modw_baseline.resource_allocations = 468; modw_etltest.resource_allocations = 468
2017-05-08 13:57:08 [notice] Identical
2017-05-08 13:57:08 [notice] Compare tables src=modw_baseline.resourceallocationfact_by_quarter, dest=modw_etltest.resourceallocationfact_by_quarter
2017-05-08 13:57:08 [info] 11 columns
2017-05-08 13:57:08 [info] Row counts: modw_baseline.resourceallocationfact_by_quarter = 466; modw_etltest.resourceallocationfact_by_quarter = 466
2017-05-08 13:57:08 [notice] Identical
2017-05-08 13:57:08 [notice] Compare tables src=modw_baseline.resourceallocationfact_by_year, dest=modw_etltest.resourceallocationfact_by_year
2017-05-08 13:57:08 [info] 10 columns
2017-05-08 13:57:08 [info] Row counts: modw_baseline.resourceallocationfact_by_year = 466; modw_etltest.resourceallocationfact_by_year = 466
2017-05-08 13:57:08 [notice] Identical

./verify_table_data.php -c datawarehouse -s modw_baseline -d modw_etltest \
-t job_records -t job_tasks \
-x submission_venue_id -x job_record_type_id -x job_task_type_id -x last_modified \
-n 2 -v info

2017-05-08 14:45:27 [notice] Compare tables src=modw_baseline.job_records, dest=modw_etltest.job_records
2017-05-08 14:45:27 [info] Exclude columns: submission_venue_id, job_record_type_id, job_task_type_id, last_modified
2017-05-08 14:45:27 [info] 32 columns
2017-05-08 14:45:27 [info] Row counts: modw_baseline.job_records = 923,746; modw_etltest.job_records = 923,746
2017-05-08 14:45:31 [notice] Identical
2017-05-08 14:45:31 [notice] Compare tables src=modw_baseline.job_tasks, dest=modw_etltest.job_tasks
2017-05-08 14:45:31 [info] Exclude columns: submission_venue_id, job_record_type_id, job_task_type_id, last_modified
2017-05-08 14:45:31 [info] 29 columns
2017-05-08 14:45:31 [info] Row counts: modw_baseline.job_tasks = 923,746; modw_etltest.job_tasks = 923,746
2017-05-08 14:45:37 [notice] Identical

./verify_table_data.php -c datawarehouse -s modw_baseline -d modw_etltest --autodetect-column-comparison \
-t jobfact_by_day -t jobfact_by_month -t jobfact_by_quarter -t jobfact_by_year \
-x submission_venue_id -x job_record_type_id -x job_task_type_id \
--ignore-column-type -n 2 -v info

2017-05-08 14:46:53 [notice] Compare tables src=modw_baseline.jobfact_by_day, dest=modw_etltest.jobfact_by_day
2017-05-08 14:46:53 [info] Exclude columns: submission_venue_id, job_record_type_id, job_task_type_id
2017-05-08 14:46:53 [info] 44 columns
2017-05-08 14:46:53 [info] Row counts: modw_baseline.jobfact_by_day = 49,342; modw_etltest.jobfact_by_day = 49,342
2017-05-08 14:46:53 [notice] Identical
2017-05-08 14:46:53 [notice] Compare tables src=modw_baseline.jobfact_by_month, dest=modw_etltest.jobfact_by_month
2017-05-08 14:46:53 [info] Exclude columns: submission_venue_id, job_record_type_id, job_task_type_id
2017-05-08 14:46:53 [info] 44 columns
2017-05-08 14:46:53 [info] Row counts: modw_baseline.jobfact_by_month = 15,403; modw_etltest.jobfact_by_month = 15,403
2017-05-08 14:46:53 [notice] Identical
2017-05-08 14:46:53 [notice] Compare tables src=modw_baseline.jobfact_by_quarter, dest=modw_etltest.jobfact_by_quarter
2017-05-08 14:46:53 [info] Exclude columns: submission_venue_id, job_record_type_id, job_task_type_id
2017-05-08 14:46:53 [info] 44 columns
2017-05-08 14:46:53 [info] Row counts: modw_baseline.jobfact_by_quarter = 14,928; modw_etltest.jobfact_by_quarter = 14,928
2017-05-08 14:46:55 [notice] Identical
2017-05-08 14:46:55 [notice] Compare tables src=modw_baseline.jobfact_by_year, dest=modw_etltest.jobfact_by_year
2017-05-08 14:46:55 [info] Exclude columns: submission_venue_id, job_record_type_id, job_task_type_id
2017-05-08 14:46:55 [info] 43 columns
2017-05-08 14:46:55 [info] Row counts: modw_baseline.jobfact_by_year = 14,928; modw_etltest.jobfact_by_year = 14,928
2017-05-08 14:46:55 [notice] Identical

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project as found in the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@smgallo smgallo added the enhancement (Enhancement of the functionality of an existing feature) and Category:ETL (Extract Transform Load) labels May 9, 2017
@smgallo smgallo added this to the v6.7.0 milestone May 9, 2017
@@ -1,3 +1,5 @@

Seems like some extra lines that you don't need ended up here.

@plessbd plessbd left a comment

LGTM after removing leading new lines in tools/etl/etl_table_manager.php

@smgallo smgallo merged commit 58b2bb8 into ubccr:xdmod6.7 May 15, 2017
@smgallo smgallo deleted the etl/db-model branch May 15, 2017 15:32
@jsperhac

This enhancement breaks existing tests that are found in open_xdmod/modules/xdmod/tests/lib/ETL. These tests should be replaced or fixed.
etl-test-output.txt

@smgallo

smgallo commented May 17, 2017

@jsperhac Did you run composer install to get the latest test artifacts repo additions?

@jsperhac

Ah, sounds like my repo scripts are not accounting for the test artifacts. Let me try...

@tyearke tyearke modified the milestones: v7.0.0, v6.7.0 Jun 6, 2017
chakrabortyr pushed a commit to chakrabortyr/xdmod that referenced this pull request Oct 17, 2017
* Added tests for DbEntity\Table
* Moved namespace ETL\DbEntity to ETL\DbModel
* Refactoring of ETL/DbModel to remove old code and support upcoming features
* Updated commit for xdmod-test-artifacts
* Fix record formula verification and saving of overseer restriction value
* Add @plessbd blacklist filters
* Ignore case where temp directory already exists for tests