diff --git a/.github/workflows/asf-site.ci.yml b/.github/workflows/asf-site.ci.yml index a6e7cc537f23..2ba16343fa02 100644 --- a/.github/workflows/asf-site.ci.yml +++ b/.github/workflows/asf-site.ci.yml @@ -26,7 +26,7 @@ jobs: git pull --rebase hudi asf-site - uses: actions/setup-node@v2 with: - node-version: '16' + node-version: '18' - name: Build website run: | pushd ${{ env.DOCS_ROOT }} diff --git a/website/blog/2020-11-11-hudi-indexing-mechanisms.md b/website/blog/2020-11-11-hudi-indexing-mechanisms.md index 4ffc01ed63e3..a25c81756c61 100644 --- a/website/blog/2020-11-11-hudi-indexing-mechanisms.md +++ b/website/blog/2020-11-11-hudi-indexing-mechanisms.md @@ -123,7 +123,7 @@ Some interesting work underway in this area: - Record level index implementation, as a secondary index using another Hudi table. Going forward, this will remain an area of active investment for the project. we are always looking for contributors who can drive these roadmap items forward. -Please [engage](/contribute/get-involved) with our community if you want to get involved. +Please [engage](/community/get-involved) with our community if you want to get involved. diff --git a/website/blog/2021-02-13-hudi-key-generators.md b/website/blog/2021-02-13-hudi-key-generators.md index 87b54514e5f5..18781cf74c04 100644 --- a/website/blog/2021-02-13-hudi-key-generators.md +++ b/website/blog/2021-02-13-hudi-key-generators.md @@ -37,7 +37,7 @@ key generators. | ```hoodie.datasource.write.partitionpath.field``` | Refers to partition path field. This is a mandatory field. | | ```hoodie.datasource.write.keygenerator.class``` | Refers to Key generator class(including full path). Could refer to any of the available ones or user defined one. This is a mandatory field. | | ```hoodie.datasource.write.partitionpath.urlencode```| When set to true, partition path will be url encoded. Default value is false. | -| ```hoodie.datasource.write.hive_style_partitioning```| When set to true, uses hive style partitioning. Partition field name will be prefixed to the value. Format: “=”. Default value is false.| +| ```hoodie.datasource.write.hive_style_partitioning```| When set to true, uses hive style partitioning. Partition field name will be prefixed to the value. Format: “\=\”. Default value is false.| NOTE: Please use `hoodie.datasource.write.keygenerator.class` instead of `hoodie.datasource.write.keygenerator.type`. The second config was introduced more recently. diff --git a/website/blog/2022-07-11-build-open-lakehouse-using-apache-hudi-and-dbt.md b/website/blog/2022-07-11-build-open-lakehouse-using-apache-hudi-and-dbt.md index 39895619004b..60b13bba2ae2 100644 --- a/website/blog/2022-07-11-build-open-lakehouse-using-apache-hudi-and-dbt.md +++ b/website/blog/2022-07-11-build-open-lakehouse-using-apache-hudi-and-dbt.md @@ -107,7 +107,7 @@ To use incremental models, you need to perform these two activities: dbt provides you a macro `is_incremental()` which is very useful to define the filters exclusively for incremental materializations. -Often, you'll want to filter for "new" rows, as in, rows that have been created since the last time dbt ran this model. The best way to find the timestamp of the most recent run of this model is by checking the most recent timestamp in your target table. dbt makes it easy to query your target table by using the "[{{ this }}](https://docs.getdbt.com/reference/dbt-jinja-functions/this)" variable. 
+Often, you'll want to filter for "new" rows, as in, rows that have been created since the last time dbt ran this model. The best way to find the timestamp of the most recent run of this model is by checking the most recent timestamp in your target table. dbt makes it easy to query your target table by using the "[\{{ this }}](https://docs.getdbt.com/reference/dbt-jinja-functions/this)" variable. ```sql title="models/my_model.sql" {{ diff --git a/website/contribute/developer-setup.md b/website/contribute/developer-setup.md index dc19bb222c3f..1dbea05c5a13 100644 --- a/website/contribute/developer-setup.md +++ b/website/contribute/developer-setup.md @@ -335,7 +335,7 @@ Use `alt use` to use v1 version of docker-compose while running integration test ## Communication All communication is expected to align with the [Code of Conduct](https://www.apache.org/foundation/policies/conduct). -Discussion about contributing code to Hudi happens on the [dev@ mailing list](/contribute/get-involved). Introduce yourself! +Discussion about contributing code to Hudi happens on the [dev@ mailing list](/community/get-involved). Introduce yourself! ## Code & Project Structure diff --git a/website/contribute/rfc-process.md b/website/contribute/rfc-process.md index 17afb98314a3..b3f699e091ab 100644 --- a/website/contribute/rfc-process.md +++ b/website/contribute/rfc-process.md @@ -40,7 +40,7 @@ Use this discussion thread to get an agreement from people on the mailing list t 1. Create a folder `rfc-` under `rfc` folder, where `` is replaced by the actual RFC number used. 2. Copy the rfc template file `rfc/template.md` to `rfc/rfc-/rfc-.md` and proceed to draft your design document. 3. [Optional] Place any images used by the same directory using the `![alt text](./image.png)` markdown syntax. -4. Add at least 2 PMC members as approvers (you can find their github usernames [here](/contribute/team)). You are free to add any number of dev members to your reviewers list. +4. Add at least 2 PMC members as approvers (you can find their github usernames [here](/community/team)). You are free to add any number of dev members to your reviewers list. 5. Raise a PR against the master branch with `[RFC-]` in the title and work through feedback, until the RFC approved (by approving the Github PR itself) 6. Before landing the PR, please change the status to "IN PROGRESS" under `rfc/README.md` and keep it maintained as you go about implementing, completing or even abandoning. diff --git a/website/docs/cli.md b/website/docs/cli.md index cd0b8b8c1247..1c30b9b6fa6e 100644 --- a/website/docs/cli.md +++ b/website/docs/cli.md @@ -452,7 +452,7 @@ To manually schedule or run a compaction, use the below command. This command us operations. **NOTE:** Make sure no other application is scheduling compaction for this table concurrently -{: .notice--info} +\{: .notice--info} ```java hudi:trips->help compaction schedule @@ -538,7 +538,7 @@ hudi:stock_ticks_mor->compaction validate --instant 20181005222601 ``` **NOTE:** The following commands must be executed without any other writer/ingestion application running. -{: .notice--warning} +\{: .notice--warning} Sometimes, it becomes necessary to remove a fileId from a compaction-plan inorder to speed-up or unblock compaction operation. 
Any new log-files that happened on this file after the compaction got scheduled will be safely renamed @@ -753,4 +753,4 @@ table change-table-type COW ╟────────────────────────────────────────────────┼──────────────────────────────────────┼──────────────────────────────────────╢ ║ hoodie.timeline.layout.version │ 1 │ 1 ║ ╚════════════════════════════════════════════════╧══════════════════════════════════════╧══════════════════════════════════════╝ -``` \ No newline at end of file +``` diff --git a/website/docs/clustering.md b/website/docs/clustering.md index 3052e171b6eb..80a8717e1774 100644 --- a/website/docs/clustering.md +++ b/website/docs/clustering.md @@ -159,8 +159,7 @@ The available strategies are as follows: ### Update Strategy Currently, clustering can only be scheduled for tables/partitions not receiving any concurrent updates. By default, -the config for update strategy - [`hoodie.clustering.updates.strategy`](/docs/configurations/#hoodieclusteringupdatesstrategy) is set to *** -SparkRejectUpdateStrategy***. If some file group has updates during clustering then it will reject updates and throw an +the config for update strategy - [`hoodie.clustering.updates.strategy`](/docs/configurations/#hoodieclusteringupdatesstrategy) is set to ***SparkRejectUpdateStrategy***. If some file group has updates during clustering then it will reject updates and throw an exception. However, in some use-cases updates are very sparse and do not touch most file groups. The default strategy to simply reject updates does not seem fair. In such use-cases, users can set the config to ***SparkAllowUpdateStrategy***. @@ -270,6 +269,7 @@ whose location can be pased as `—props` when starting the Hudi Streamer (just A sample spark-submit command to setup HoodieStreamer is as below: + ```bash spark-submit \ --class org.apache.hudi.utilities.streamer.HoodieStreamer \ @@ -341,4 +341,4 @@ out-of-the-box. Note that as of now only linear sort is supported in Java execut ## Related Resources

Videos

-* [Understanding Clustering in Apache Hudi and the Benefits of Asynchronous Clustering](https://www.youtube.com/watch?v=R_sm4wlGXuE) \ No newline at end of file +* [Understanding Clustering in Apache Hudi and the Benefits of Asynchronous Clustering](https://www.youtube.com/watch?v=R_sm4wlGXuE) diff --git a/website/docs/compaction.md b/website/docs/compaction.md index c3504236da73..de5bd20a0e1f 100644 --- a/website/docs/compaction.md +++ b/website/docs/compaction.md @@ -53,13 +53,8 @@ Hudi provides various options for both these strategies as discussed below. | Config Name | Default | Description | |----------------------------------------------------|-------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------| -| hoodie.compact.inline.trigger.strategy | NUM_COMMITS (Optional) | org.apache.hudi.table.action.compact.CompactionTriggerStrategy: Controls when compaction is scheduled.
`Config Param: INLINE_COMPACT_TRIGGER_STRATEGY` | -Possible values:
  • `NUM_COMMITS`: triggers compaction when there are at least N delta commits after last -completed compaction.
  • `NUM_COMMITS_AFTER_LAST_REQUEST`: triggers compaction when there are at least N delta commits -after last completed or requested compaction.
  • `TIME_ELAPSED`: triggers compaction after N seconds since last -compaction.
  • `NUM_AND_TIME`: triggers compaction when both there are at least N delta commits and N seconds -elapsed (both must be satisfied) after last completed compaction.
  • `NUM_OR_TIME`: triggers compaction when both -there are at least N delta commits or N seconds elapsed (either condition is satisfied) after last completed compaction.
+| hoodie.compact.inline.trigger.strategy | NUM_COMMITS (Optional) | org.apache.hudi.table.action.compact.CompactionTriggerStrategy: Controls when compaction is scheduled.
`Config Param: INLINE_COMPACT_TRIGGER_STRATEGY`
+
  • `NUM_COMMITS`: triggers compaction when there are at least N delta commits after last completed compaction.
  • `NUM_COMMITS_AFTER_LAST_REQUEST`: triggers compaction when there are at least N delta commits after last completed or requested compaction.
  • `TIME_ELAPSED`: triggers compaction after N seconds since last compaction.
  • `NUM_AND_TIME`: triggers compaction when there are at least N delta commits and N seconds have elapsed (both must be satisfied) after last completed compaction.
  • `NUM_OR_TIME`: triggers compaction when there are at least N delta commits or N seconds have elapsed (either condition is satisfied) after last completed compaction.
| #### Compaction Strategies | Config Name | Default | Description | @@ -81,7 +76,7 @@ order of creation of Hive Partitions. It helps to compact data in latest partiti Total_IO allowed.
  • `UnBoundedCompactionStrategy`: UnBoundedCompactionStrategy will not change ordering or filter any compaction. It is a pass-through and will compact all the base files which has a log file. This usually means no-intelligence on compaction.
  • `UnBoundedPartitionAwareCompactionStrategy`:UnBoundedPartitionAwareCompactionStrategy is a custom UnBounded Strategy. This will filter all the partitions that -are eligible to be compacted by a {@link BoundedPartitionAwareCompactionStrategy} and return the result. This is done +are eligible to be compacted by a \{@link BoundedPartitionAwareCompactionStrategy} and return the result. This is done so that a long running UnBoundedPartitionAwareCompactionStrategy does not step over partitions in a shorter running BoundedPartitionAwareCompactionStrategy. Essentially, this is an inverse of the partitions chosen in BoundedPartitionAwareCompactionStrategy
  • diff --git a/website/docs/configurations.md b/website/docs/configurations.md index 8364bb025060..aac045eac1c2 100644 --- a/website/docs/configurations.md +++ b/website/docs/configurations.md @@ -249,9 +249,9 @@ The following set of configurations help validate new data before commits. | Config Name | Default | Description | | ------------------------------------------------------------------------------------------------------- | ------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | [hoodie.precommit.validators](#hoodieprecommitvalidators) | | Comma separated list of class names that can be invoked to validate commit
    `Config Param: VALIDATOR_CLASS_NAMES` | -| [hoodie.precommit.validators.equality.sql.queries](#hoodieprecommitvalidatorsequalitysqlqueries) | | Spark SQL queries to run on table before committing new data to validate state before and after commit. Multiple queries separated by ';' delimiter are supported. Example: "select count(*) from \<TABLE_NAME\> Note \<TABLE_NAME\> is replaced by table state before and after commit.
    `Config Param: EQUALITY_SQL_QUERIES` | -| [hoodie.precommit.validators.inequality.sql.queries](#hoodieprecommitvalidatorsinequalitysqlqueries) | | Spark SQL queries to run on table before committing new data to validate state before and after commit.Multiple queries separated by ';' delimiter are supported.Example query: 'select count(*) from \<TABLE_NAME\> where col=null'Note \<TABLE_NAME\> variable is expected to be present in query.
    `Config Param: INEQUALITY_SQL_QUERIES` | -| [hoodie.precommit.validators.single.value.sql.queries](#hoodieprecommitvalidatorssinglevaluesqlqueries) | | Spark SQL queries to run on table before committing new data to validate state after commit.Multiple queries separated by ';' delimiter are supported.Expected result is included as part of query separated by '#'. Example query: 'query1#result1:query2#result2'Note \<TABLE_NAME\> variable is expected to be present in query.
    `Config Param: SINGLE_VALUE_SQL_QUERIES` | +| [hoodie.precommit.validators.equality.sql.queries](#hoodieprecommitvalidatorsequalitysqlqueries) | | Spark SQL queries to run on table before committing new data to validate state before and after commit. Multiple queries separated by ';' delimiter are supported. Example: "select count(*) from <TABLE_NAME> Note <TABLE_NAME> is replaced by table state before and after commit.
    `Config Param: EQUALITY_SQL_QUERIES` | +| [hoodie.precommit.validators.inequality.sql.queries](#hoodieprecommitvalidatorsinequalitysqlqueries) | | Spark SQL queries to run on table before committing new data to validate state before and after commit. Multiple queries separated by ';' delimiter are supported. Example query: 'select count(*) from <TABLE_NAME> where col=null'. Note <TABLE_NAME> variable is expected to be present in the query.
    `Config Param: INEQUALITY_SQL_QUERIES` | +| [hoodie.precommit.validators.single.value.sql.queries](#hoodieprecommitvalidatorssinglevaluesqlqueries) | | Spark SQL queries to run on table before committing new data to validate state after commit. Multiple queries separated by ';' delimiter are supported. Expected result is included as part of query separated by '#'. Example query: 'query1#result1:query2#result2'. Note <TABLE_NAME> variable is expected to be present in the query.
    `Config Param: SINGLE_VALUE_SQL_QUERIES` | --- ## Flink Sql Configs {#FLINK_SQL} @@ -1818,22 +1818,22 @@ Configs that are common during ingestion across different cloud stores [**Advanced Configs**](#Cloud-Source-Configs-advanced-configs) -| Config Name | Default | Description | -| ------------------------------------------------------------------------------------------------------------------------------------ | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| [hoodie.streamer.source.cloud.data.datasource.options](#hoodiestreamersourceclouddatadatasourceoptions) | (N/A) | A JSON string passed to the Spark DataFrameReader while loading the dataset. Example: hoodie.streamer.gcp.spark.datasource.options={"header":"true","encoding":"UTF-8"}
    `Config Param: SPARK_DATASOURCE_OPTIONS` | -| [hoodie.streamer.source.cloud.data.ignore.relpath.prefix](#hoodiestreamersourceclouddataignorerelpathprefix) | (N/A) | Ignore objects in the bucket whose relative path starts this prefix
    `Config Param: IGNORE_RELATIVE_PATH_PREFIX` | -| [hoodie.streamer.source.cloud.data.ignore.relpath.substring](#hoodiestreamersourceclouddataignorerelpathsubstring) | (N/A) | Ignore objects in the bucket whose relative path contains this substring
    `Config Param: IGNORE_RELATIVE_PATH_SUBSTR` | -| [hoodie.streamer.source.cloud.data.partition.fields.from.path](#hoodiestreamersourceclouddatapartitionfieldsfrompath) | (N/A) | A comma delimited list of path-based partition fields in the source file structure.
    `Config Param: PATH_BASED_PARTITION_FIELDS`
    `Since Version: 0.14.0` | -| [hoodie.streamer.source.cloud.data.partition.max.size](#hoodiestreamersourceclouddatapartitionmaxsize) | (N/A) | specify this value in bytes, to coalesce partitions of source dataset not greater than specified limit
    `Config Param: SOURCE_MAX_BYTES_PER_PARTITION`
    `Since Version: 0.14.1` | -| [hoodie.streamer.source.cloud.data.select.file.extension](#hoodiestreamersourceclouddataselectfileextension) | (N/A) | Only match files with this extension. By default, this is the same as hoodie.streamer.source.hoodieincr.file.format
    `Config Param: CLOUD_DATAFILE_EXTENSION` | -| [hoodie.streamer.source.cloud.data.select.relpath.prefix](#hoodiestreamersourceclouddataselectrelpathprefix) | (N/A) | Only selects objects in the bucket whose relative path starts with this prefix
    `Config Param: SELECT_RELATIVE_PATH_PREFIX` | -| [hoodie.streamer.source.cloud.data.check.file.exists](#hoodiestreamersourceclouddatacheckfileexists) | false | If true, checks whether file exists before attempting to pull it
    `Config Param: ENABLE_EXISTS_CHECK` | -| [hoodie.streamer.source.cloud.data.datafile.format](#hoodiestreamersourceclouddatadatafileformat) | parquet | Format of the data file. By default, this will be the same as hoodie.streamer.source.hoodieincr.file.format
    `Config Param: DATAFILE_FORMAT` | -| [hoodie.streamer.source.cloud.data.reader.comma.separated.path.format](#hoodiestreamersourceclouddatareadercommaseparatedpathformat) | false | Boolean value for specifying path format in load args of spark.read.format("..").load("a.xml,b.xml,c.xml"), * set true if path format needs to be comma separated string value, if false it's passed as array of strings like * spark.read.format("..").load(new String[]{a.xml,b.xml,c.xml})
    `Config Param: SPARK_DATASOURCE_READER_COMMA_SEPARATED_PATH_FORMAT`
    `Since Version: 0.14.1` | -| [hoodie.streamer.source.cloud.meta.ack](#hoodiestreamersourcecloudmetaack) | true | Whether to acknowledge Metadata messages during Cloud Ingestion or not. This is useful during dev and testing. In Prod this should always be true. In case of Cloud Pubsub, not acknowledging means Pubsub will keep redelivering the same messages.
    `Config Param: ACK_MESSAGES` | -| [hoodie.streamer.source.cloud.meta.batch.size](#hoodiestreamersourcecloudmetabatchsize) | 10 | Number of metadata messages to pull in one API call to the cloud events queue. Multiple API calls with this batch size are sent to cloud events queue, until we consume hoodie.streamer.source.cloud.meta.max.num.messages.per.syncfrom the queue or hoodie.streamer.source.cloud.meta.max.fetch.time.per.sync.ms amount of time has passed or queue is empty.
    `Config Param: BATCH_SIZE_CONF` | -| [hoodie.streamer.source.cloud.meta.max.fetch.time.per.sync.secs](#hoodiestreamersourcecloudmetamaxfetchtimepersyncsecs) | 60 | Max time in secs to consume hoodie.streamer.source.cloud.meta.max.num.messages.per.sync messages from cloud queue. Cloud event queues like SQS, PubSub can return empty responses even when messages are available the queue, this config ensures we don't wait forever to consume MAX_MESSAGES_CONF messages, but time out and move on further.
    `Config Param: MAX_FETCH_TIME_PER_SYNC_SECS`
    `Since Version: 0.14.1` | -| [hoodie.streamer.source.cloud.meta.max.num.messages.per.sync](#hoodiestreamersourcecloudmetamaxnummessagespersync) | 1000 | Maximum number of messages to consume per sync round. Multiple rounds of hoodie.streamer.source.cloud.meta.batch.size could be invoked to reach max messages as configured by this config
    `Config Param: MAX_NUM_MESSAGES_PER_SYNC`
    `Since Version: 0.14.1` | +| Config Name | Default | Description | +| ------------------------------------------------------------------------------------------------------------------------------------ | -------- |-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| [hoodie.streamer.source.cloud.data.datasource.options](#hoodiestreamersourceclouddatadatasourceoptions) | (N/A) | A JSON string passed to the Spark DataFrameReader while loading the dataset. Example: `hoodie.streamer.gcp.spark.datasource.options={"header":"true","encoding":"UTF-8"}`
    `Config Param: SPARK_DATASOURCE_OPTIONS` | +| [hoodie.streamer.source.cloud.data.ignore.relpath.prefix](#hoodiestreamersourceclouddataignorerelpathprefix) | (N/A) | Ignore objects in the bucket whose relative path starts with this prefix
    `Config Param: IGNORE_RELATIVE_PATH_PREFIX` | +| [hoodie.streamer.source.cloud.data.ignore.relpath.substring](#hoodiestreamersourceclouddataignorerelpathsubstring) | (N/A) | Ignore objects in the bucket whose relative path contains this substring
    `Config Param: IGNORE_RELATIVE_PATH_SUBSTR` | +| [hoodie.streamer.source.cloud.data.partition.fields.from.path](#hoodiestreamersourceclouddatapartitionfieldsfrompath) | (N/A) | A comma delimited list of path-based partition fields in the source file structure.
    `Config Param: PATH_BASED_PARTITION_FIELDS`
    `Since Version: 0.14.0` | +| [hoodie.streamer.source.cloud.data.partition.max.size](#hoodiestreamersourceclouddatapartitionmaxsize) | (N/A) | specify this value in bytes, to coalesce partitions of source dataset not greater than specified limit
    `Config Param: SOURCE_MAX_BYTES_PER_PARTITION`
    `Since Version: 0.14.1` | +| [hoodie.streamer.source.cloud.data.select.file.extension](#hoodiestreamersourceclouddataselectfileextension) | (N/A) | Only match files with this extension. By default, this is the same as hoodie.streamer.source.hoodieincr.file.format
    `Config Param: CLOUD_DATAFILE_EXTENSION` | +| [hoodie.streamer.source.cloud.data.select.relpath.prefix](#hoodiestreamersourceclouddataselectrelpathprefix) | (N/A) | Only selects objects in the bucket whose relative path starts with this prefix
    `Config Param: SELECT_RELATIVE_PATH_PREFIX` | +| [hoodie.streamer.source.cloud.data.check.file.exists](#hoodiestreamersourceclouddatacheckfileexists) | false | If true, checks whether file exists before attempting to pull it
    `Config Param: ENABLE_EXISTS_CHECK` | +| [hoodie.streamer.source.cloud.data.datafile.format](#hoodiestreamersourceclouddatadatafileformat) | parquet | Format of the data file. By default, this will be the same as hoodie.streamer.source.hoodieincr.file.format
    `Config Param: DATAFILE_FORMAT` | +| [hoodie.streamer.source.cloud.data.reader.comma.separated.path.format](#hoodiestreamersourceclouddatareadercommaseparatedpathformat) | false | Boolean value for specifying the path format in load args of spark.read.format("..").load("a.xml,b.xml,c.xml"). Set to true if the path format needs to be a comma separated string value; if false, it's passed as an array of strings like spark.read.format("..").load(new String[]\{a.xml,b.xml,c.xml})
    `Config Param: SPARK_DATASOURCE_READER_COMMA_SEPARATED_PATH_FORMAT`
    `Since Version: 0.14.1` | +| [hoodie.streamer.source.cloud.meta.ack](#hoodiestreamersourcecloudmetaack) | true | Whether to acknowledge Metadata messages during Cloud Ingestion or not. This is useful during dev and testing. In Prod this should always be true. In case of Cloud Pubsub, not acknowledging means Pubsub will keep redelivering the same messages.
    `Config Param: ACK_MESSAGES` | +| [hoodie.streamer.source.cloud.meta.batch.size](#hoodiestreamersourcecloudmetabatchsize) | 10 | Number of metadata messages to pull in one API call to the cloud events queue. Multiple API calls with this batch size are sent to cloud events queue, until we consume `hoodie.streamer.source.cloud.meta.max.num.messages.per.sync` from the queue or `hoodie.streamer.source.cloud.meta.max.fetch.time.per.sync.ms` amount of time has passed or queue is empty.
    `Config Param: BATCH_SIZE_CONF` | +| [hoodie.streamer.source.cloud.meta.max.fetch.time.per.sync.secs](#hoodiestreamersourcecloudmetamaxfetchtimepersyncsecs) | 60 | Max time in secs to consume `hoodie.streamer.source.cloud.meta.max.num.messages.per.sync` messages from the cloud queue. Cloud event queues like SQS and PubSub can return empty responses even when messages are available in the queue; this config ensures we don't wait forever to consume MAX_MESSAGES_CONF messages, but time out and move on.
    `Config Param: MAX_FETCH_TIME_PER_SYNC_SECS`
    `Since Version: 0.14.1` | +| [hoodie.streamer.source.cloud.meta.max.num.messages.per.sync](#hoodiestreamersourcecloudmetamaxnummessagespersync) | 1000 | Maximum number of messages to consume per sync round. Multiple rounds of `hoodie.streamer.source.cloud.meta.batch.size` could be invoked to reach max messages as configured by this config
    `Config Param: MAX_NUM_MESSAGES_PER_SYNC`
    `Since Version: 0.14.1` | --- @@ -2051,14 +2051,14 @@ Configurations controlling the behavior of incremental pulling from S3 events me [**Advanced Configs**](#S3-Event-based-Hudi-Incremental-Source-Configs-advanced-configs) -| Config Name | Default | Description | -| ----------------------------------------------------------------------------------------------------------- | ------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| [hoodie.streamer.source.s3incr.ignore.key.prefix](#hoodiestreamersources3incrignorekeyprefix) | (N/A) | Control whether to ignore the s3 objects starting with this prefix
    `Config Param: S3_IGNORE_KEY_PREFIX` | -| [hoodie.streamer.source.s3incr.ignore.key.substring](#hoodiestreamersources3incrignorekeysubstring) | (N/A) | Control whether to ignore the s3 objects with this substring
    `Config Param: S3_IGNORE_KEY_SUBSTRING` | -| [hoodie.streamer.source.s3incr.key.prefix](#hoodiestreamersources3incrkeyprefix) | (N/A) | Control whether to filter the s3 objects starting with this prefix
    `Config Param: S3_KEY_PREFIX` | -| [hoodie.streamer.source.s3incr.spark.datasource.options](#hoodiestreamersources3incrsparkdatasourceoptions) | (N/A) | Json string, passed to the reader while loading dataset. Example Hudi Streamer conf --hoodie-conf hoodie.streamer.source.s3incr.spark.datasource.options={"header":"true","encoding":"UTF-8"}
    `Config Param: SPARK_DATASOURCE_OPTIONS` | -| [hoodie.streamer.source.s3incr.check.file.exists](#hoodiestreamersources3incrcheckfileexists) | false | Control whether we do existence check for files before consuming them
    `Config Param: S3_INCR_ENABLE_EXISTS_CHECK` | -| [hoodie.streamer.source.s3incr.fs.prefix](#hoodiestreamersources3incrfsprefix) | s3 | The file system prefix.
    `Config Param: S3_FS_PREFIX` | +| Config Name | Default | Description | +| ----------------------------------------------------------------------------------------------------------- | ------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| [hoodie.streamer.source.s3incr.ignore.key.prefix](#hoodiestreamersources3incrignorekeyprefix) | (N/A) | Control whether to ignore the s3 objects starting with this prefix
    `Config Param: S3_IGNORE_KEY_PREFIX` | +| [hoodie.streamer.source.s3incr.ignore.key.substring](#hoodiestreamersources3incrignorekeysubstring) | (N/A) | Control whether to ignore the s3 objects with this substring
    `Config Param: S3_IGNORE_KEY_SUBSTRING` | +| [hoodie.streamer.source.s3incr.key.prefix](#hoodiestreamersources3incrkeyprefix) | (N/A) | Control whether to filter the s3 objects starting with this prefix
    `Config Param: S3_KEY_PREFIX` | +| [hoodie.streamer.source.s3incr.spark.datasource.options](#hoodiestreamersources3incrsparkdatasourceoptions) | (N/A) | JSON string, passed to the reader while loading dataset. Example Hudi Streamer conf --hoodie-conf `hoodie.streamer.source.s3incr.spark.datasource.options={"header":"true","encoding":"UTF-8"}`
    `Config Param: SPARK_DATASOURCE_OPTIONS` | +| [hoodie.streamer.source.s3incr.check.file.exists](#hoodiestreamersources3incrcheckfileexists) | false | Control whether we do existence check for files before consuming them
    `Config Param: S3_INCR_ENABLE_EXISTS_CHECK` | +| [hoodie.streamer.source.s3incr.fs.prefix](#hoodiestreamersources3incrfsprefix) | s3 | The file system prefix.
    `Config Param: S3_FS_PREFIX` | --- diff --git a/website/docs/encryption.md b/website/docs/encryption.md index f6483420aaf1..9bce5d646a48 100644 --- a/website/docs/encryption.md +++ b/website/docs/encryption.md @@ -43,7 +43,7 @@ QuickstartUtils.DataGenerator dataGen = new QuickstartUtils.DataGenerator(); List inserts = convertToStringList(dataGen.generateInserts(3)); Dataset inputDF1 = spark.read().json(jsc.parallelize(inserts, 1)); inputDF1.write().format("org.apache.hudi") - .option("hoodie.table.name", "encryption_table") + .option("hoodie.table.name", "encryption_table") .option("hoodie.upsert.shuffle.parallelism","2") .option("hoodie.insert.shuffle.parallelism","2") .option("hoodie.delete.shuffle.parallelism","2") @@ -70,4 +70,4 @@ Read more from [Spark docs](https://spark.apache.org/docs/latest/sql-data-source ### Note -This feature is currently only available for COW tables due to only Parquet base files present there. \ No newline at end of file +This feature is currently only available for COW tables due to only Parquet base files present there. diff --git a/website/docs/faq_general.md b/website/docs/faq_general.md index 2682d17e9506..61b6c12a4b5d 100644 --- a/website/docs/faq_general.md +++ b/website/docs/faq_general.md @@ -24,7 +24,7 @@ While we can merely refer to this as stream processing, we call it _incremental One of the core use-cases for Apache Hudi is enabling seamless, efficient database ingestion to your lake, and change data capture is a direct application of that. Hudi’s core design primitives support fast upserts and deletes of data that are suitable for CDC and streaming use cases. Here is a glimpse of some of the challenges accompanying streaming and cdc workloads that Hudi handles efficiently out of the box. -* **_Processing of deletes:_** Deletes are treated no differently than updates and are logged with the same filegroups where the corresponding keys exist. This helps process deletes faster same like regular inserts and updates and Hudi processes deletes at file group level using compaction in MOR tables. This can be very expensive in other open source systems that store deletes as separate files than data files and incur N(Data files)\*N(Delete files) merge cost to process deletes every time, soon lending into a complex graph problem to solve whose planning itself is expensive. This gets worse with volume, especially when dealing with CDC style workloads that streams changes to records frequently. +* **_Processing of deletes:_** Deletes are treated no differently than updates and are logged with the same filegroups where the corresponding keys exist. This helps process deletes faster same like regular inserts and updates and Hudi processes deletes at file group level using compaction in MOR tables. This can be very expensive in other open source systems that store deletes as separate files than data files and incur N(Data files)*N(Delete files) merge cost to process deletes every time, soon lending into a complex graph problem to solve whose planning itself is expensive. This gets worse with volume, especially when dealing with CDC style workloads that streams changes to records frequently. * **_Operational overhead of merging deletes at scale:_** When deletes are stored as separate files without any notion of data locality, the merging of data and deletes can become a run away job that cannot complete in time due to various reasons (Spark retries, executor failure, OOM, etc.). 
As more data files and delete files are added, the merge becomes even more expensive and complex later on, making it hard to manage in practice causing operation overhead. Hudi removes this complexity from users by treating deletes similarly to any other write operation. * **_File sizing with updates:_** Other open source systems, process updates by generating new data files for inserting the new records after deletion, where both data files and delete files get introduced for every batch of updates. This yields to small file problem and requires file sizing. Whereas, Hudi embraces mutations to the data, and manages the table automatically by keeping file sizes in check without passing the burden of file sizing to users as manual maintenance. * **_Support for partial updates and payload ordering:_** Hudi support partial updates where already existing record can be updated for specific fields that are non null from newer records (with newer timestamps). Similarly, Hudi supports payload ordering with timestamp through specific payload implementation where late-arriving data with older timestamps will be ignored or dropped. Users can even implement custom logic and plug in to handle what they want. @@ -75,7 +75,7 @@ streams, Hudi supports key based de-duplication before inserting records. For e- systems like Kafka MirrorMaker that can introduce duplicates during failures. Even for plain old batch pipelines, keys help eliminate duplication that could be caused by backfill pipelines, where commonly it's unclear what set of records need to be re-written. We are actively working on making keys easier by only requiring them for Upsert and/or automatically -generate the key internally (much like RDBMS row\_ids) +generate the key internally (much like RDBMS row_ids) ### How does Hudi actually store data inside a table? @@ -85,7 +85,7 @@ At a high level, Hudi is based on MVCC design that writes data to versioned parq Hudi recommends keeping coarse grained top level partition paths e.g date(ts) and within each such partition do clustering in a flexible way to z-order, sort data based on interested columns. This provides excellent performance by : minimzing the number of files in each partition, while still packing data that will be queried together physically closer (what partitioning aims to achieve). -Let's take an example of a table, where we store log\_events with two fields `ts` (time at which event was produced) and `cust_id` (user for which event was produced) and a common option is to partition by both date(ts) and cust\_id. +Let's take an example of a table, where we store log_events with two fields `ts` (time at which event was produced) and `cust_id` (user for which event was produced) and a common option is to partition by both date(ts) and cust_id. Some users may want to start granular with hour(ts) and then later evolve to new partitioning scheme say date(ts). But this means, the number of partitions in the table could be very high - 365 days x 1K customers = at-least 365K potentially small parquet files, that can significantly slow down queries, facing throttling issues on the actual S3/DFS reads. For the afore mentioned reasons, we don't recommend mixing different partitioning schemes within the same table, since it adds operational complexity, and unpredictable performance. 
diff --git a/website/docs/faq_storage.md b/website/docs/faq_storage.md index d74e65dfc3c0..fcce76aa46e1 100644 --- a/website/docs/faq_storage.md +++ b/website/docs/faq_storage.md @@ -27,10 +27,10 @@ All you need to do is to edit the table type property in hoodie.properties(locat But manually changing it will result in checksum errors. So, we have to go via hudi-cli. 1. Copy existing hoodie.properties to a new location. -2. Edit table type to MERGE\_ON\_READ +2. Edit table type to MERGE_ON_READ 3. launch hudi-cli - 1. connect --path hudi\_table\_path - 2. repair overwrite-hoodie-props --new-props-file new\_hoodie.properties + 1. connect --path hudi_table_path + 2. repair overwrite-hoodie-props --new-props-file new_hoodie.properties ### How can I find the average record size in a commit? @@ -101,9 +101,9 @@ By generating a commit time ahead of time, Hudi is able to stamp each record wit Hudi supports customizable partition values which could be a derived value of another field. Also, storing the partition value only as part of the field results in losing type information when queried by various query engines. -### How do I configure Bloom filter (when Bloom/Global\_Bloom index is used)? +### How do I configure Bloom filter (when Bloom/Global_Bloom index is used)? -Bloom filters are used in bloom indexes to look up the location of record keys in write path. Bloom filters are used only when the index type is chosen as “BLOOM” or “GLOBAL\_BLOOM”. Hudi has few config knobs that users can use to tune their bloom filters. +Bloom filters are used in bloom indexes to look up the location of record keys in write path. Bloom filters are used only when the index type is chosen as “BLOOM” or “GLOBAL_BLOOM”. Hudi has few config knobs that users can use to tune their bloom filters. On a high level, hudi has two types of blooms: Simple and Dynamic. @@ -113,19 +113,19 @@ Simple, as the name suggests, is simple. Size is statically allocated based on f `hoodie.index.bloom.num_entries` refers to the total number of entries per bloom filter, which refers to one file slice. Default value is 60000. -`hoodie.index.bloom.fpp` refers to the false positive probability with the bloom filter. Default value: 1\*10^-9. +`hoodie.index.bloom.fpp` refers to the false positive probability with the bloom filter. Default value: 1*10^-9. -Size of the bloom filter depends on these two values. This is statically allocated and here is the formula that determines the size of bloom. Until the total number of entries added to the bloom is within the configured `hoodie.index.bloom.num_entries` value, the fpp will be honored. i.e. with default values of 60k and 1\*10^-9, bloom filter serialized size = 430kb. But if more entries are added, then the false positive probability will not be honored. Chances that more false positives could be returned if you add more number of entries than the configured value. So, users are expected to set the right values for both num\_entries and fpp. +Size of the bloom filter depends on these two values. This is statically allocated and here is the formula that determines the size of bloom. Until the total number of entries added to the bloom is within the configured `hoodie.index.bloom.num_entries` value, the fpp will be honored. i.e. with default values of 60k and 1*10^-9, bloom filter serialized size = 430kb. But if more entries are added, then the false positive probability will not be honored. 
Chances that more false positives could be returned if you add more number of entries than the configured value. So, users are expected to set the right values for both num_entries and fpp. Hudi suggests to have roughly 100 to 120 mb sized files for better query performance. So, based on the record size, one could determine how many records could fit into one data file. -Lets say your data file max size is 128Mb and default avg record size is 1024 bytes. Hence, roughly this translates to 130k entries per data file. For this config, you should set num\_entries to ~130k. +Lets say your data file max size is 128Mb and default avg record size is 1024 bytes. Hence, roughly this translates to 130k entries per data file. For this config, you should set num_entries to ~130k. Dynamic bloom filter: `hoodie.bloom.index.filter.type` : DYNAMIC -This is an advanced version of the bloom filter which grows dynamically as the number of entries grows. So, users are expected to set two values wrt num\_entries. `hoodie.index.bloom.num_entries` will determine the starting size of the bloom. `hoodie.bloom.index.filter.dynamic.max.entries` will determine the max size to which the bloom can grow upto. And fpp needs to be set similar to “Simple” bloom filter. Bloom size will be allotted based on the first config `hoodie.index.bloom.num_entries`. Once the number of entries reaches this value, bloom will dynamically grow its size to 2X. This will go on until the size reaches a max of `hoodie.bloom.index.filter.dynamic.max.entries` value. Until the size reaches this max value, fpp will be honored. If the entries added exceeds the max value, then the fpp may not be honored. +This is an advanced version of the bloom filter which grows dynamically as the number of entries grows. So, users are expected to set two values wrt num_entries. `hoodie.index.bloom.num_entries` will determine the starting size of the bloom. `hoodie.bloom.index.filter.dynamic.max.entries` will determine the max size to which the bloom can grow upto. And fpp needs to be set similar to “Simple” bloom filter. Bloom size will be allotted based on the first config `hoodie.index.bloom.num_entries`. Once the number of entries reaches this value, bloom will dynamically grow its size to 2X. This will go on until the size reaches a max of `hoodie.bloom.index.filter.dynamic.max.entries` value. Until the size reaches this max value, fpp will be honored. If the entries added exceeds the max value, then the fpp may not be honored. ### How do I verify datasource schema reconciliation in Hudi? @@ -167,13 +167,13 @@ spark.sql("select * from hudi.test_recon1;").show() After first write: -| \_hoodie\_commit\_time | \_hoodie\_commit\_seqno | \_hoodie\_record\_key | \_hoodie\_partition\_path | \_hoodie\_file\_name | Url | ts | uuid | +| _hoodie_commit_time | _hoodie_commit_seqno | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name | Url | ts | uuid | | ---| ---| ---| ---| ---| ---| ---| --- | | 20220622204044318 | 20220622204044318... | 1 | | 890aafc0-d897-44d... | [hudi.apache.com](http://hudi.apache.com) | 1 | 1 | After the second write: -| \_hoodie\_commit\_time | \_hoodie\_commit\_seqno | \_hoodie\_record\_key | \_hoodie\_partition\_path | \_hoodie\_file\_name | Url | ts | uuid | +| _hoodie_commit_time | _hoodie_commit_seqno | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name | Url | ts | uuid | | ---| ---| ---| ---| ---| ---| ---| --- | | 20220622204044318 | 20220622204044318... | 1 | | 890aafc0-d897-44d... 
| [hudi.apache.com](http://hudi.apache.com) | 1 | 1 | | 20220622204208997 | 20220622204208997... | 2 | | 890aafc0-d897-44d... | null | 1 | 2 | diff --git a/website/docs/faq_writing_tables.md b/website/docs/faq_writing_tables.md index 90874efbf4f8..bed07a16e57a 100644 --- a/website/docs/faq_writing_tables.md +++ b/website/docs/faq_writing_tables.md @@ -78,11 +78,11 @@ No. Hudi removes all the copies of a record key when deletes are issued. Here is When issuing an `upsert` operation on a table and the batch of records provided contains multiple entries for a given key, then all of them are reduced into a single final value by repeatedly calling payload class's [preCombine()](https://github.com/apache/hudi/blob/d3edac4612bde2fa9deca9536801dbc48961fb95/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecordPayload.java#L40) method . By default, we pick the record with the greatest value (determined by calling .compareTo()) giving latest-write-wins style semantics. [This FAQ entry](faq_writing_tables#can-i-implement-my-own-logic-for-how-input-records-are-merged-with-record-on-storage) shows the interface for HoodieRecordPayload if you are interested. -For an insert or bulk\_insert operation, no such pre-combining is performed. Thus, if your input contains duplicates, the table would also contain duplicates. If you don't want duplicate records either issue an **upsert** or consider specifying option to de-duplicate input in either datasource using [`hoodie.datasource.write.insert.drop.duplicates`](/docs/configurations#hoodiedatasourcewriteinsertdropduplicates) & [`hoodie.combine.before.insert`](/docs/configurations/#hoodiecombinebeforeinsert) or in deltastreamer using [`--filter-dupes`](https://github.com/apache/hudi/blob/d3edac4612bde2fa9deca9536801dbc48961fb95/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java#L229). +For an insert or bulk_insert operation, no such pre-combining is performed. Thus, if your input contains duplicates, the table would also contain duplicates. If you don't want duplicate records either issue an **upsert** or consider specifying option to de-duplicate input in either datasource using [`hoodie.datasource.write.insert.drop.duplicates`](/docs/configurations#hoodiedatasourcewriteinsertdropduplicates) & [`hoodie.combine.before.insert`](/docs/configurations/#hoodiecombinebeforeinsert) or in deltastreamer using [`--filter-dupes`](https://github.com/apache/hudi/blob/d3edac4612bde2fa9deca9536801dbc48961fb95/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java#L229). ### How can I pass hudi configurations to my spark writer job? -Hudi configuration options covering the datasource and low level Hudi write client (which both deltastreamer & datasource internally call) are [here](/docs/configurations/). Invoking _\--help_ on any tool such as DeltaStreamer would print all the usage options. A lot of the options that control upsert, file sizing behavior are defined at the write client level and below is how we pass them to different options available for writing data. +Hudi configuration options covering the datasource and low level Hudi write client (which both deltastreamer & datasource internally call) are [here](/docs/configurations/). Invoking _--help_ on any tool such as DeltaStreamer would print all the usage options. 
A lot of the options that control upsert, file sizing behavior are defined at the write client level and below is how we pass them to different options available for writing data. * For Spark DataSource, you can use the "options" API of DataFrameWriter to pass in these configs. @@ -94,7 +94,7 @@ inputDF.write().format("org.apache.hudi") ``` * When using `HoodieWriteClient` directly, you can simply construct HoodieWriteConfig object with the configs in the link you mentioned. -* When using HoodieDeltaStreamer tool to ingest, you can set the configs in properties file and pass the file as the cmdline argument "_\--props_" +* When using HoodieDeltaStreamer tool to ingest, you can set the configs in properties file and pass the file as the cmdline argument "_--props_" ### How to create Hive style partition folder structure? @@ -122,12 +122,12 @@ The speed at which you can write into Hudi depends on the [write operation](/doc | Storage Type | Type of workload | Performance | Tips | | ---| ---| ---| --- | -| copy on write | bulk\_insert | Should match vanilla spark writing + an additional sort to properly size files | properly size [bulk insert parallelism](/docs/configurations#hoodiebulkinsertshuffleparallelism) to get right number of files. use insert if you want this auto tuned . Configure [hoodie.bulkinsert.sort.mode](/docs/configurations#hoodiebulkinsertsortmode) for better file sizes at the cost of memory. The default value NONE offers the fastest performance and matches `spark.write.parquet()` in terms of number of files, overheads. | +| copy on write | bulk_insert | Should match vanilla spark writing + an additional sort to properly size files | properly size [bulk insert parallelism](/docs/configurations#hoodiebulkinsertshuffleparallelism) to get right number of files. Use insert if you want this auto tuned. Configure [hoodie.bulkinsert.sort.mode](/docs/configurations#hoodiebulkinsertsortmode) for better file sizes at the cost of memory. The default value `NONE` offers the fastest performance and matches `spark.write.parquet()` in terms of number of files, overheads. | | copy on write | insert | Similar to bulk insert, except the file sizes are auto tuned requiring input to be cached into memory and custom partitioned. | Performance would be bound by how parallel you can write the ingested data. Tune [this limit](/docs/configurations#hoodieinsertshuffleparallelism) up, if you see that writes are happening from only a few executors. | -| copy on write | upsert/ de-duplicate & insert | Both of these would involve index lookup. Compared to naively using Spark (or similar framework)'s JOIN to identify the affected records, Hudi indexing is often 7-10x faster as long as you have ordered keys (discussed below) or <50% updates. Compared to naively overwriting entire partitions, Hudi write can be several magnitudes faster depending on how many files in a given partition is actually updated. For e.g, if a partition has 1000 files out of which only 100 is dirtied every ingestion run, then Hudi would only read/merge a total of 100 files and thus 10x faster than naively rewriting entire partition. | Ultimately performance would be bound by how quickly we can read and write a parquet file and that depends on the size of the parquet file, configured [here](/docs/configurations#hoodieparquetmaxfilesize). Also be sure to properly tune your [bloom filters](/docs/configurations#INDEX). [HUDI-56](https://issues.apache.org/jira/browse/HUDI-56) will auto-tune this. 
| -| merge on read | bulk insert | Currently new data only goes to parquet files and thus performance here should be similar to copy\_on\_write bulk insert. This has the nice side-effect of getting data into parquet directly for query performance. [HUDI-86](https://issues.apache.org/jira/browse/HUDI-86) will add support for logging inserts directly and this up drastically. | | +| copy on write | upsert/ de-duplicate & insert | Both of these would involve index lookup. Compared to naively using Spark (or similar framework)'s JOIN to identify the affected records, Hudi indexing is often 7-10x faster as long as you have ordered keys (discussed below) or less than 50% updates. Compared to naively overwriting entire partitions, Hudi write can be several magnitudes faster depending on how many files in a given partition is actually updated. For example, if a partition has 1000 files out of which only 100 is dirtied every ingestion run, then Hudi would only read/merge a total of 100 files and thus 10x faster than naively rewriting entire partition. | Ultimately performance would be bound by how quickly we can read and write a parquet file and that depends on the size of the parquet file, configured [here](/docs/configurations#hoodieparquetmaxfilesize). Also be sure to properly tune your [bloom filters](/docs/configurations#INDEX). [HUDI-56](https://issues.apache.org/jira/browse/HUDI-56) will auto-tune this. | +| merge on read | bulk insert | Currently new data only goes to parquet files and thus performance here should be similar to copy on write bulk insert. This has the nice side-effect of getting data into parquet directly for query performance. [HUDI-86](https://issues.apache.org/jira/browse/HUDI-86) will add support for logging inserts directly and this up drastically. | | | merge on read | insert | Similar to above | | -| merge on read | upsert/ de-duplicate & insert | Indexing performance would remain the same as copy-on-write, while ingest latency for updates (costliest I/O operation in copy\_on\_write) are sent to log files and thus with asynchronous compaction provides very very good ingest performance with low write amplification. | | +| merge on read | upsert/ de-duplicate & insert | Indexing performance would remain the same as copy-on-write, while ingest latency for updates (costliest I/O operation in copy on write) are sent to log files and thus with asynchronous compaction provides very good ingest performance with low write amplification. | | Like with many typical system that manage time-series data, Hudi performs much better if your keys have a timestamp prefix or monotonically increasing/decreasing. You can almost always achieve this. Even if you have UUID keys, you can follow tricks like [this](https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/) to get keys that are ordered. See also [Tuning Guide](/docs/tuning-guide) for more tips on JVM and other configurations. @@ -145,14 +145,14 @@ There are 2 ways to avoid creating tons of small files in Hudi and both of them a) **Auto Size small files during ingestion**: This solution trades ingest/writing time to keep queries always efficient. Common approaches to writing very small files and then later stitching them together only solve for system scalability issues posed by small files and also let queries slow down by exposing small files to them anyway. -Hudi has the ability to maintain a configured target file size, when performing **upsert/insert** operations. 
(Note: **bulk\_insert** operation does not provide this functionality and is designed as a simpler replacement for normal `spark.write.parquet` ) +Hudi has the ability to maintain a configured target file size, when performing **upsert/insert** operations. (Note: **bulk_insert** operation does not provide this functionality and is designed as a simpler replacement for normal `spark.write.parquet` ) For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all files < 100MB and try to get them upto 120MB. For **merge-on-read**, there are few more configs to set. MergeOnRead works differently for different INDEX choices. * Indexes with **canIndexLogFiles = true** : Inserts of new data go directly to log files. In this case, you can configure the [maximum log size](/docs/configurations#hoodielogfilemaxsize) and a [factor](/docs/configurations#hoodielogfiletoparquetcompressionratio) that denotes reduction in size when data moves from avro to parquet files. -* Indexes with **canIndexLogFiles = false** : Inserts of new data go only to parquet files. In this case, the same configurations as above for the COPY\_ON\_WRITE case applies. +* Indexes with **canIndexLogFiles = false** : Inserts of new data go only to parquet files. In this case, the same configurations as above for the COPY_ON_WRITE case applies. NOTE : In either case, small files will be auto sized only if there is no PENDING compaction or associated log file for that particular file slice. For example, for case 1: If you had a log file and a compaction C1 was scheduled to convert that log file to parquet, no more inserts can go into that log file. For case 2: If you had a parquet file and an update ended up creating an associated delta log file, no more inserts can go into that parquet file. Only after the compaction has been performed and there are NO log files associated with the base parquet file, can new inserts be sent to auto size that parquet file. @@ -173,7 +173,7 @@ hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPa With each commit, Hudi creates a new table version in the metastore. This can be reduced by setting the option -[hoodie.datasource.meta\_sync.condition.sync](/docs/configurations#hoodiedatasourcemeta_syncconditionsync) to true. +[hoodie.datasource.meta_sync.condition.sync](/docs/configurations#hoodiedatasourcemeta_syncconditionsync) to true. This will ensure that hive sync is triggered on schema or partitions changes. @@ -187,7 +187,7 @@ Hudi employs [optimistic concurrency control](/docs/concurrency_control#supporte ### Can single-writer inserts have duplicates? -By default, Hudi turns off key based de-duplication for INSERT/BULK\_INSERT operations and thus the table could contain duplicates. 
If users believe, they have duplicates in inserts, they can either issue UPSERT or consider specifying the option to de-duplicate input in either datasource using [`hoodie.datasource.write.insert.drop.duplicates`](/docs/configurations#hoodiedatasourcewriteinsertdropduplicates) & [`hoodie.combine.before.insert`](/docs/configurations/#hoodiecombinebeforeinsert) or in deltastreamer using [`--filter-dupes`](https://github.com/apache/hudi/blob/d3edac4612bde2fa9deca9536801dbc48961fb95/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java#L229). +By default, Hudi turns off key based de-duplication for INSERT/BULK_INSERT operations and thus the table could contain duplicates. If users believe, they have duplicates in inserts, they can either issue UPSERT or consider specifying the option to de-duplicate input in either datasource using [`hoodie.datasource.write.insert.drop.duplicates`](/docs/configurations#hoodiedatasourcewriteinsertdropduplicates) & [`hoodie.combine.before.insert`](/docs/configurations/#hoodiecombinebeforeinsert) or in deltastreamer using [`--filter-dupes`](https://github.com/apache/hudi/blob/d3edac4612bde2fa9deca9536801dbc48961fb95/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java#L229). ### Can concurrent inserts cause duplicates? diff --git a/website/docs/file_layouts.md b/website/docs/file_layouts.md index 39a5b15b7c13..3cfb8a7d8374 100644 --- a/website/docs/file_layouts.md +++ b/website/docs/file_layouts.md @@ -3,7 +3,7 @@ title: File Layouts toc: false --- -The following describes the general file layout structure for Apache Hudi. Please refer the ** [tech spec](https://hudi.apache.org/tech-specs#file-layout-hierarchy) ** for a more detailed description of the file layouts. +The following describes the general file layout structure for Apache Hudi. Please refer the **[tech spec](https://hudi.apache.org/tech-specs#file-layout-hierarchy)** for a more detailed description of the file layouts. * Hudi organizes data tables into a directory structure under a base path on a distributed file system * Tables are broken up into partitions * Within each partition, files are organized into file groups, uniquely identified by a file ID diff --git a/website/docs/ibm_cos_hoodie.md b/website/docs/ibm_cos_hoodie.md index 5ac743394ff1..d4e897153576 100644 --- a/website/docs/ibm_cos_hoodie.md +++ b/website/docs/ibm_cos_hoodie.md @@ -32,38 +32,38 @@ For example, using HMAC keys and service name `myCOS`: - fs.stocator.scheme.list - cos + fs.stocator.scheme.list + cos - fs.cos.impl - com.ibm.stocator.fs.ObjectStoreFileSystem + fs.cos.impl + com.ibm.stocator.fs.ObjectStoreFileSystem - fs.stocator.cos.impl - com.ibm.stocator.fs.cos.COSAPIClient + fs.stocator.cos.impl + com.ibm.stocator.fs.cos.COSAPIClient - fs.stocator.cos.scheme - cos + fs.stocator.cos.scheme + cos - fs.cos.myCos.access.key - ACCESS KEY + fs.cos.myCos.access.key + ACCESS KEY - fs.cos.myCos.endpoint - http://s3-api.us-geo.objectstorage.softlayer.net + fs.cos.myCos.endpoint + http://s3-api.us-geo.objectstorage.softlayer.net - fs.cos.myCos.secret.key - SECRET KEY + fs.cos.myCos.secret.key + SECRET KEY ``` diff --git a/website/docs/intro.md b/website/docs/intro.md new file mode 100644 index 000000000000..45e8604c8bf8 --- /dev/null +++ b/website/docs/intro.md @@ -0,0 +1,47 @@ +--- +sidebar_position: 1 +--- + +# Tutorial Intro + +Let's discover **Docusaurus in less than 5 minutes**. + +## Getting Started + +Get started by **creating a new site**. 
+ +Or **try Docusaurus immediately** with **[docusaurus.new](https://docusaurus.new)**. + +### What you'll need + +- [Node.js](https://nodejs.org/en/download/) version 18.0 or above: + - When installing Node.js, you are recommended to check all checkboxes related to dependencies. + +## Generate a new site + +Generate a new Docusaurus site using the **classic template**. + +The classic template will automatically be added to your project after you run the command: + +```bash +npm init docusaurus@latest my-website classic +``` + +You can type this command into Command Prompt, Powershell, Terminal, or any other integrated terminal of your code editor. + +The command also installs all necessary dependencies you need to run Docusaurus. + +## Start your site + +Run the development server: + +```bash +cd my-website +npm run start +``` + +The `cd` command changes the directory you're working with. In order to work with your newly created Docusaurus site, you'll need to navigate the terminal there. + +The `npm run start` command builds your website locally and serves it through a development server, ready for you to view at http://localhost:3000/. + +Open `docs/intro.md` (this page) and edit some lines: the site **reloads automatically** and displays your changes. diff --git a/website/docs/procedures.md b/website/docs/procedures.md index f57b8713bd5f..1dbeb899b14f 100644 --- a/website/docs/procedures.md +++ b/website/docs/procedures.md @@ -48,7 +48,7 @@ call help(cmd => 'show_commits'); | result | |-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| parameters:
    param type_name default_value required
    table string None true
    limit integer 10 false
    outputType:
    name type_name nullable metadata
    commit_time string true {}
    action string true {}
    total_bytes_written long true {}
    total_files_added long true {}
    total_files_updated long true {}
    total_partitions_written long true {}
    total_records_written long true {}
    total_update_records_written long true {}
    total_errors long true {} | +| parameters:
    param type_name default_value required
    table string None true
    limit integer 10 false
    outputType:
    name type_name nullable metadata
    commit_time string true \{}
    action string true \{}
    total_bytes_written long true \{}
    total_files_added long true \{}
    total_files_updated long true \{}
    total_partitions_written long true \{}
    total_records_written long true \{}
    total_update_records_written long true \{}
    total_errors long true \{} | ## Commit management @@ -166,8 +166,8 @@ call show_commit_extra_metadata(table => 'test_hudi_table'); | instant_time | action | metadata_key | metadata_value | |-------------------|-------------|---------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| 20230206174349556 | deltacommit | schema | {"type":"record","name":"hudi_mor_tbl","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},{"name":"id","type":"int"},{"name":"ts","type":"long"}]} | -| 20230206174349556 | deltacommit | latest_schema | {"max_column_id":8,"version_id":20230206174349556,"type":"record","fields":[{"id":0,"name":"_hoodie_commit_time","optional":true,"type":"string","doc":""},{"id":1,"name":"_hoodie_commit_seqno","optional":true,"type":"string","doc":""},{"id":2,"name":"_hoodie_record_key","optional":true,"type":"string","doc":""},{"id":3,"name":"_hoodie_partition_path","optional":true,"type":"string","doc":""},{"id":4,"name":"_hoodie_file_name","optional":true,"type":"string","doc":""},{"id":5,"name":"id","optional":false,"type":"int"},{"id":8,"name":"ts","optional":false,"type":"long"}]} | +| 20230206174349556 | deltacommit | schema | \{"type":"record","name":"hudi_mor_tbl","fields":[\{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},\{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},\{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},\{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},\{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},\{"name":"id","type":"int"},\{"name":"ts","type":"long"}]} | +| 20230206174349556 | deltacommit | latest_schema | \{"max_column_id":8,"version_id":20230206174349556,"type":"record","fields":[\{"id":0,"name":"_hoodie_commit_time","optional":true,"type":"string","doc":""},\{"id":1,"name":"_hoodie_commit_seqno","optional":true,"type":"string","doc":""},\{"id":2,"name":"_hoodie_record_key","optional":true,"type":"string","doc":""},\{"id":3,"name":"_hoodie_partition_path","optional":true,"type":"string","doc":""},\{"id":4,"name":"_hoodie_file_name","optional":true,"type":"string","doc":""},\{"id":5,"name":"id","optional":false,"type":"int"},\{"id":8,"name":"ts","optional":false,"type":"long"}]} | ### show_archived_commits @@ -1090,7 +1090,7 @@ call show_logfile_records(table => 'test_hudi_table', log_file_path_pattern => ' | records | 
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| {"_hoodie_commit_time": "20230205133427059", "_hoodie_commit_seqno": "20230205133427059_0_10", "_hoodie_record_key": "1", "_hoodie_partition_path": "", "_hoodie_file_name": "3438e233-7b50-4eff-adbb-70b1cd76f518-0", "id": 1, "name": "a1", "price": 40.0, "ts": 1111} | +| \{"_hoodie_commit_time": "20230205133427059", "_hoodie_commit_seqno": "20230205133427059_0_10", "_hoodie_record_key": "1", "_hoodie_partition_path": "", "_hoodie_file_name": "3438e233-7b50-4eff-adbb-70b1cd76f518-0", "id": 1, "name": "a1", "price": 40.0, "ts": 1111} | ### show_logfile_metadata @@ -1123,7 +1123,7 @@ call show_logfile_metadata(table => 'hudi_mor_tbl', log_file_path_pattern => 'hd | instant_time | record_count | block_type | header_metadata | footer_metadata | |-------------------|--------------|-----------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------| -| 20230205133427059 | 1 | AVRO_DATA_BLOCK | {"INSTANT_TIME":"20230205133427059","SCHEMA":"{\"type\":\"record\",\"name\":\"hudi_mor_tbl_record\",\"namespace\":\"hoodie.hudi_mor_tbl\",\"fields\":[{\"name\":\"_hoodie_commit_time\",\"type\":[\"null\",\"string\"],\"doc\":\"\",\"default\":null},{\"name\":\"_hoodie_commit_seqno\",\"type\":[\"null\",\"string\"],\"doc\":\"\",\"default\":null},{\"name\":\"_hoodie_record_key\",\"type\":[\"null\",\"string\"],\"doc\":\"\",\"default\":null},{\"name\":\"_hoodie_partition_path\",\"type\":[\"null\",\"string\"],\"doc\":\"\",\"default\":null},{\"name\":\"_hoodie_file_name\",\"type\":[\"null\",\"string\"],\"doc\":\"\",\"default\":null},{\"name\":\"id\",\"type\":\"int\"},{\"name\":\"name\",\"type\":\"string\"},{\"name\":\"price\",\"type\":\"double\"},{\"name\":\"ts\",\"type\":\"long\"}]}"} | {} | +| 20230205133427059 | 1 | AVRO_DATA_BLOCK | \{"INSTANT_TIME":"20230205133427059","SCHEMA":"\{"type":"record","name":"hudi_mor_tbl_record","namespace":"hoodie.hudi_mor_tbl","fields":[\{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},\{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},\{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},\{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},\{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},\{"name":"id","type":"int"},\{"name":"name","type":"string"},\{"name":"price","type":"double"},\{"name":"ts","type":"long"}]}"} | {} | ### show_invalid_parquet @@ -1998,4 +1998,4 @@ call downgrade_table(table => 'test_hudi_table', to_version => 
'FOUR'); | result | |--------| -| true | \ No newline at end of file +| true | diff --git a/website/docs/schema_evolution.md b/website/docs/schema_evolution.md index 31dd73662fcd..1638d6ad1c6f 100755 --- a/website/docs/schema_evolution.md +++ b/website/docs/schema_evolution.md @@ -84,7 +84,7 @@ Column specification consists of five field, next to each other. | Parameter | Description | |:-------------|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| col_name | name of the new column. To add sub-column col1 to a nested map type column member map>, set this field to member.value.col1 | +| col_name | name of the new column. To add sub-column col1 to a nested map type column member map\>, set this field to member.value.col1 | | col_type | type of the new column. | | nullable | whether or not the new column allows null values. (optional) | | comment | comment of the new column. (optional) | @@ -316,4 +316,4 @@ scala> spark.sql("select rowId, partitionId, preComb, name, versionId, intToLong

    Videos

* [Learn Schema Evolution in Apache Hudi Transaction Datalake with hands on labs](https://youtu.be/s1_-zl3sfLE)
-* [How do I identify Schema Changes in Hudi Tables and Send Email Alert when New Column added/removed](https://www.youtube.com/watch?v=_i5G4ojpwlk)
\ No newline at end of file
+* [How do I identify Schema Changes in Hudi Tables and Send Email Alert when New Column added/removed](https://www.youtube.com/watch?v=_i5G4ojpwlk)
diff --git a/website/docs/syncing_metastore.md b/website/docs/syncing_metastore.md
index 2aada772a6ae..0d2887b1058c 100644
--- a/website/docs/syncing_metastore.md
+++ b/website/docs/syncing_metastore.md
@@ -86,8 +86,8 @@ hoodie.datasource.hive_sync.password=
 ### Query using HiveQL
 ```
-beeline -u jdbc:hive2://hiveserver:10000/my_db \
-  --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat \
+beeline -u jdbc:hive2://hiveserver:10000/my_db \
+  --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat \
   --hiveconf hive.stats.autogather=false

 Beeline version 1.2.1.spark2 by Apache Hive

@@ -130,7 +130,7 @@ once you have built the hudi-hive module. Following is how we sync the above Dat
 ```java
 cd hudi-hive
-./run_sync_tool.sh --jdbc-url jdbc:hive2:\/\/hiveserver:10000 --user hive --pass hive --partitioned-by partition --base-path <basePath> --database default --table <tableName>
+./run_sync_tool.sh --jdbc-url jdbc:hive2://hiveserver:10000 --user hive --pass hive --partitioned-by partition --base-path <basePath> --database default --table <tableName>
 ```

 Starting with Hudi 0.5.1 version read optimized version of merge-on-read tables are suffixed '_ro' by default. For backwards compatibility with older Hudi versions, an optional HiveSyncConfig - `--skip-ro-suffix`, has been provided to turn off '_ro' suffixing if desired. Explore other hive sync options using the following command:
diff --git a/website/docs/timeline.md b/website/docs/timeline.md
index 4ac64ef3a950..3d44f8d7426c 100644
--- a/website/docs/timeline.md
+++ b/website/docs/timeline.md
@@ -35,10 +35,9 @@ in one of the following states
 * `COMPLETED` - Denotes completion of an action on the timeline

 All the actions in requested/inflight states are stored in the active timeline as files named *
-*_.._**. Completed actions are stored along with a time that
+*_\.\.\_**. Completed actions are stored along with a time that
 denotes when the action was completed, in a file named *
-*_\_.._**
-
+*_\\_\.\.**
    hudi_timeline.png
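The timeline naming convention in the hunk above is easiest to verify against a real table. The following is a small illustrative sketch (an editorial addition, not part of this patch) that scans a table's `.hoodie` directory and buckets timeline files by their state suffix; the base path is a placeholder, and the assumption that completed instants carry no `.requested`/`.inflight` suffix follows the description above.

```python
# Illustrative sketch: classify Hudi timeline files by instant state,
# following the naming convention described in timeline.md above.
import os
from collections import defaultdict

base_path = "/data/hudi_trips"                      # placeholder table base path
timeline_dir = os.path.join(base_path, ".hoodie")   # active timeline lives here

instants = defaultdict(list)
for name in sorted(os.listdir(timeline_dir)):
    full = os.path.join(timeline_dir, name)
    if os.path.isdir(full) or name == "hoodie.properties":
        continue  # skip metadata/, archived/ and the table config file
    if name.endswith(".requested"):
        instants["REQUESTED"].append(name)
    elif name.endswith(".inflight"):
        instants["INFLIGHT"].append(name)
    else:
        instants["COMPLETED"].append(name)  # assumed: completed files have no state suffix

for state, files in instants.items():
    print(f"{state}: {len(files)} instant(s), e.g. {files[:3]}")
```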
    diff --git a/website/docs/troubleshooting.md b/website/docs/troubleshooting.md index db93a76d187b..4696694d41d8 100644 --- a/website/docs/troubleshooting.md +++ b/website/docs/troubleshooting.md @@ -31,7 +31,7 @@ This can possibly occur if your schema has some non-nullable field whose value i #### INT96, INT64 and timestamp compatibility -[https://hudi.apache.org/docs/configurations#hoodiedatasourcehive\_syncsupport\_timestamp](https://hudi.apache.org/docs/configurations#hoodiedatasourcehive_syncsupport_timestamp) +[https://hudi.apache.org/docs/configurations#hoodiedatasourcehive_syncsupport_timestamp](https://hudi.apache.org/docs/configurations#hoodiedatasourcehive_syncsupport_timestamp) #### I am seeing lot of archive files. How do I control the number of archive commit files generated? @@ -115,11 +115,11 @@ if Hive Sync is enabled in the [deltastreamer](https://github.com/apache/hudi/bl Section below generally aids in debugging Hudi failures. Off the bat, the following metadata is added to every record to help triage issues easily using standard Hadoop SQL engines (Hive/PrestoDB/Spark) -* **\_hoodie\_record\_key** - Treated as a primary key within each DFS partition, basis of all updates/inserts -* **\_hoodie\_commit\_time** - Last commit that touched this record -* **\_hoodie_commit_seqno** - This field contains a unique sequence number for each record within each transaction. -* **\_hoodie\_file\_name** - Actual file name containing the record (super useful to triage duplicates) -* **\_hoodie\_partition\_path** - Path from basePath that identifies the partition containing this record +* **_hoodie_record_key** - Treated as a primary key within each DFS partition, basis of all updates/inserts +* **_hoodie_commit_time** - Last commit that touched this record +* **_hoodie_commit_seqno** - This field contains a unique sequence number for each record within each transaction. +* **_hoodie_file_name** - Actual file name containing the record (super useful to triage duplicates) +* **_hoodie_partition_path** - Path from basePath that identifies the partition containing this record #### Missing records @@ -192,11 +192,11 @@ set hive.metastore.disallow.incompatible.col.type.changes=false; This occurs because HiveSyncTool currently supports only few compatible data type conversions. Doing any other incompatible change will throw this exception. Please check the data type evolution for the concerned field and verify if it indeed can be considered as a valid data type conversion as per Hudi code base. -#### org.apache.hadoop.hive.ql.parse.SemanticException: Database does not exist: test\_db +#### org.apache.hadoop.hive.ql.parse.SemanticException: Database does not exist: test_db -This generally occurs if you are trying to do Hive sync for your Hudi dataset and the configured hive\_sync database does not exist. Please create the corresponding database on your Hive cluster and try again. +This generally occurs if you are trying to do Hive sync for your Hudi dataset and the configured hive_sync database does not exist. Please create the corresponding database on your Hive cluster and try again. -#### org.apache.thrift.TApplicationException: Invalid method name: 'get\_table\_req' +#### org.apache.thrift.TApplicationException: Invalid method name: 'get_table_req' This issue is caused by hive version conflicts, hudi built with hive-2.3.x version, so if still want hudi work with older hive version @@ -219,4 +219,4 @@ to lowercase. 
While we allow capitalization on Hudi tables, if you would like to use all lowercase letters. More details on how this issue presents can be found [here](https://github.com/apache/hudi/issues/6832).

-####
\ No newline at end of file
+####
diff --git a/website/docs/writing_tables_streaming_writes.md b/website/docs/writing_tables_streaming_writes.md
index 77ff044ca63d..1dbf6b6dc6d6 100644
--- a/website/docs/writing_tables_streaming_writes.md
+++ b/website/docs/writing_tables_streaming_writes.md
@@ -4,6 +4,9 @@ keywords: [hudi, spark, flink, streaming, processing]
 last_modified_at: 2024-03-13T15:59:57-04:00
 ---

+import Tabs from '@theme/Tabs';
+import TabItem from '@theme/TabItem';
+
 ## Spark Streaming

 You can write Hudi tables using spark's structured streaming.

@@ -70,17 +73,17 @@ hudi_streaming_options = {
 }

 # create streaming df
-df = spark.readStream \
-  .format("hudi") \
+df = spark.readStream \
+  .format("hudi") \
   .load(basePath)

 # write stream to new hudi table
-df.writeStream.format("hudi") \
-  .options(**hudi_streaming_options) \
-  .outputMode("append") \
-  .option("path", baseStreamingPath) \
-  .option("checkpointLocation", checkpointLocation) \
-  .trigger(once=True) \
+df.writeStream.format("hudi") \
+  .options(**hudi_streaming_options) \
+  .outputMode("append") \
+  .option("path", baseStreamingPath) \
+  .option("checkpointLocation", checkpointLocation) \
+  .trigger(once=True) \
   .start()
 ```
diff --git a/website/docusaurus.config.js b/website/docusaurus.config.js
index 730e900beba7..3d970b3b6bfc 100644
--- a/website/docusaurus.config.js
+++ b/website/docusaurus.config.js
@@ -1,4 +1,4 @@
-const darkCodeTheme = require('prism-react-renderer/themes/dracula');
+const { themes } = require('prism-react-renderer');
 const versions = require('./versions.json');
 const VersionsArchived = require('./versionsArchived.json');
 const allDocHomesPaths = [
@@ -134,6 +134,7 @@ module.exports = {
       apiKey: 'e300f1558b703c001c515c0e7f8e0908',
       indexName: 'apache_hudi',
       contextualSearch: true,
+      appId: 'BH4D9OD16A',
     },
     navbar: {
       logo: {
@@ -444,7 +445,7 @@ module.exports = {
         'Copyright © 2021 The Apache Software Foundation, Licensed under the Apache License, Version 2.0.
    Hudi, Apache and the Apache feather logo are trademarks of The Apache Software Foundation.', }, prism: { - theme: darkCodeTheme, + theme: themes.dracula, additionalLanguages: ['java', 'scala'], prismPath: require.resolve('./src/theme/prism-include-languages.js'), }, @@ -457,20 +458,6 @@ module.exports = { defaultMode: 'light', disableSwitch: true, }, - blog: { - path: 'blog', // Path to the existing blog folder - routeBasePath: 'blog', // Route for the existing blog - include: ['*.md', '*.mdx'], // File types to include for the existing blog - - // Add the new blog for videos - videoBlog: { - path: 'video-blog', // Path to the video blog folder - routeBasePath: 'videos', // Route for the video blog - include: ['*.md', '*.mdx'], // File types to include for the video blog - videoBlogRoute: '/videos' - // Add any other specific settings for the video blog - }, - }, }, presets: [ [ diff --git a/website/i18n/cn/docusaurus-plugin-content-docs/current/ibm_cos_hoodie.md b/website/i18n/cn/docusaurus-plugin-content-docs/current/ibm_cos_hoodie.md index b93841e60d2f..7cd03ab91420 100644 --- a/website/i18n/cn/docusaurus-plugin-content-docs/current/ibm_cos_hoodie.md +++ b/website/i18n/cn/docusaurus-plugin-content-docs/current/ibm_cos_hoodie.md @@ -33,38 +33,38 @@ Hudi 适配 IBM Cloud Object Storage 需要两项配置: - fs.stocator.scheme.list - cos + fs.stocator.scheme.list + cos - fs.cos.impl - com.ibm.stocator.fs.ObjectStoreFileSystem + fs.cos.impl + com.ibm.stocator.fs.ObjectStoreFileSystem - fs.stocator.cos.impl - com.ibm.stocator.fs.cos.COSAPIClient + fs.stocator.cos.impl + com.ibm.stocator.fs.cos.COSAPIClient - fs.stocator.cos.scheme - cos + fs.stocator.cos.scheme + cos - fs.cos.myCos.access.key - ACCESS KEY + fs.cos.myCos.access.key + ACCESS KEY - fs.cos.myCos.endpoint - http://s3-api.us-geo.objectstorage.softlayer.net + fs.cos.myCos.endpoint + http://s3-api.us-geo.objectstorage.softlayer.net - fs.cos.myCos.secret.key - SECRET KEY + fs.cos.myCos.secret.key + SECRET KEY ``` diff --git a/website/i18n/cn/docusaurus-plugin-content-docs/current/spark_quick-start-guide.md b/website/i18n/cn/docusaurus-plugin-content-docs/current/spark_quick-start-guide.md index ee87543f7832..7ced36bf08cb 100644 --- a/website/i18n/cn/docusaurus-plugin-content-docs/current/spark_quick-start-guide.md +++ b/website/i18n/cn/docusaurus-plugin-content-docs/current/spark_quick-start-guide.md @@ -247,7 +247,7 @@ dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator() [数据生成器](https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L50) 可以基于[行程样本模式](https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L57) 生成插入和更新的样本。 -{: .notice--info} +\{: .notice--info} ## 插入数据 {#inserts} @@ -284,7 +284,7 @@ df.write.format("hudi"). \ 有关将数据提取到Hudi中的方法的信息,请参阅[写入Hudi数据集](/cn/docs/writing_data)。 这里我们使用默认的写操作:`插入更新`。 如果您的工作负载没有`更新`,也可以使用更快的`插入`或`批量插入`操作。 想了解更多信息,请参阅[写操作](/cn/docs/writing_data#write-operations) -{: .notice--info} +\{: .notice--info} ## 查询数据 {#query} @@ -307,7 +307,7 @@ spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_pat 该查询提供已提取数据的读取优化视图。由于我们的分区路径(`region/country/city`)是嵌套的3个级别 从基本路径开始,我们使用了`load(basePath + "/*/*/*/*")`。 有关支持的所有存储类型和视图的更多信息,请参考[存储类型和视图](/cn/docs/concepts#storage-types--views)。 -{: .notice--info} +\{: .notice--info} ## 更新数据 {#updates} @@ -330,7 +330,7 @@ denoted by the timestamp. 
Look for changes in `_hoodie_commit_time`, `rider`, `d 注意,保存模式现在为`追加`。通常,除非您是第一次尝试创建数据集,否则请始终使用追加模式。 [查询](#query)现在再次查询数据将显示更新的行程。每个写操作都会生成一个新的由时间戳表示的[commit](/cn/docs/concepts) 。在之前提交的相同的`_hoodie_record_key`中寻找`_hoodie_commit_time`, `rider`, `driver`字段变更。 -{: .notice--info} +\{: .notice--info} ## 增量查询 @@ -365,7 +365,7 @@ spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hu ``` 这将提供在开始时间提交之后发生的所有更改,其中包含票价大于20.0的过滤器。关于此功能的独特之处在于,它现在使您可以在批量数据上创作流式管道。 -{: .notice--info} +\{: .notice--info} ## 特定时间点查询 diff --git a/website/i18n/cn/docusaurus-plugin-content-docs/version-0.5.3/quick-start-guide.md b/website/i18n/cn/docusaurus-plugin-content-docs/version-0.5.3/quick-start-guide.md index 5a3b4fc4de96..5623082c23c5 100644 --- a/website/i18n/cn/docusaurus-plugin-content-docs/version-0.5.3/quick-start-guide.md +++ b/website/i18n/cn/docusaurus-plugin-content-docs/version-0.5.3/quick-start-guide.md @@ -224,7 +224,7 @@ dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator() [数据生成器](https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L50) 可以基于[行程样本模式](https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L57) 生成插入和更新的样本。 -{: .notice--info} +\{: .notice--info} ## 插入数据 {#inserts} @@ -260,7 +260,7 @@ df.write.format("hudi"). \ 有关将数据提取到Hudi中的方法的信息,请参阅[写入Hudi数据集](/cn/docs/writing_data)。 这里我们使用默认的写操作:`插入更新`。 如果您的工作负载没有`更新`,也可以使用更快的`插入`或`批量插入`操作。 想了解更多信息,请参阅[写操作](/cn/docs/writing_data#write-operations) -{: .notice--info} +\{: .notice--info} ## 查询数据 {#query} @@ -283,7 +283,7 @@ spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_pat 该查询提供已提取数据的读取优化视图。由于我们的分区路径(`region/country/city`)是嵌套的3个级别 从基本路径开始,我们使用了`load(basePath + "/*/*/*/*")`。 有关支持的所有存储类型和视图的更多信息,请参考[存储类型和视图](/cn/docs/concepts#storage-types--views)。 -{: .notice--info} +\{: .notice--info} ## 更新数据 {#updates} @@ -302,7 +302,7 @@ df.write.format("hudi"). 
\ 注意,保存模式现在为`追加`。通常,除非您是第一次尝试创建数据集,否则请始终使用追加模式。 [查询](#query)现在再次查询数据将显示更新的行程。每个写操作都会生成一个新的由时间戳表示的[commit](/cn/docs/concepts) 。在之前提交的相同的`_hoodie_record_key`中寻找`_hoodie_commit_time`, `rider`, `driver`字段变更。 -{: .notice--info} +\{: .notice--info} ## 增量查询 @@ -337,7 +337,7 @@ spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hu ``` 这将提供在开始时间提交之后发生的所有更改,其中包含票价大于20.0的过滤器。关于此功能的独特之处在于,它现在使您可以在批量数据上创作流式管道。 -{: .notice--info} +\{: .notice--info} ## 特定时间点查询 diff --git a/website/i18n/cn/docusaurus-plugin-content-docs/version-0.7.0/ibm_cos_hoodie.md b/website/i18n/cn/docusaurus-plugin-content-docs/version-0.7.0/ibm_cos_hoodie.md index 4692f9f615a3..0be3d1deb505 100644 --- a/website/i18n/cn/docusaurus-plugin-content-docs/version-0.7.0/ibm_cos_hoodie.md +++ b/website/i18n/cn/docusaurus-plugin-content-docs/version-0.7.0/ibm_cos_hoodie.md @@ -34,38 +34,38 @@ For example, using HMAC keys and service name `myCOS`: - fs.stocator.scheme.list - cos + fs.stocator.scheme.list + cos - fs.cos.impl - com.ibm.stocator.fs.ObjectStoreFileSystem + fs.cos.impl + com.ibm.stocator.fs.ObjectStoreFileSystem - fs.stocator.cos.impl - com.ibm.stocator.fs.cos.COSAPIClient + fs.stocator.cos.impl + com.ibm.stocator.fs.cos.COSAPIClient - fs.stocator.cos.scheme - cos + fs.stocator.cos.scheme + cos - fs.cos.myCos.access.key - ACCESS KEY + fs.cos.myCos.access.key + ACCESS KEY - fs.cos.myCos.endpoint - http://s3-api.us-geo.objectstorage.softlayer.net + fs.cos.myCos.endpoint + http://s3-api.us-geo.objectstorage.softlayer.net - fs.cos.myCos.secret.key - SECRET KEY + fs.cos.myCos.secret.key + SECRET KEY ``` diff --git a/website/i18n/cn/docusaurus-plugin-content-docs/version-0.7.0/quick-start-guide.md b/website/i18n/cn/docusaurus-plugin-content-docs/version-0.7.0/quick-start-guide.md index 651dc57ba2d8..a2419508443b 100644 --- a/website/i18n/cn/docusaurus-plugin-content-docs/version-0.7.0/quick-start-guide.md +++ b/website/i18n/cn/docusaurus-plugin-content-docs/version-0.7.0/quick-start-guide.md @@ -227,7 +227,7 @@ dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator() [数据生成器](https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L50) 可以基于[行程样本模式](https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L57) 生成插入和更新的样本。 -{: .notice--info} +\{: .notice--info} ## 插入数据 {#inserts} @@ -264,7 +264,7 @@ df.write.format("hudi"). \ 有关将数据提取到Hudi中的方法的信息,请参阅[写入Hudi数据集](/cn/docs/writing_data)。 这里我们使用默认的写操作:`插入更新`。 如果您的工作负载没有`更新`,也可以使用更快的`插入`或`批量插入`操作。 想了解更多信息,请参阅[写操作](/cn/docs/writing_data#write-operations) -{: .notice--info} +\{: .notice--info} ## 查询数据 {#query} @@ -287,7 +287,7 @@ spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_pat 该查询提供已提取数据的读取优化视图。由于我们的分区路径(`region/country/city`)是嵌套的3个级别 从基本路径开始,我们使用了`load(basePath + "/*/*/*/*")`。 有关支持的所有存储类型和视图的更多信息,请参考[存储类型和视图](/cn/docs/concepts#storage-types--views)。 -{: .notice--info} +\{: .notice--info} ## 更新数据 {#updates} @@ -310,7 +310,7 @@ denoted by the timestamp. 
Look for changes in `_hoodie_commit_time`, `rider`, `d 注意,保存模式现在为`追加`。通常,除非您是第一次尝试创建数据集,否则请始终使用追加模式。 [查询](#query)现在再次查询数据将显示更新的行程。每个写操作都会生成一个新的由时间戳表示的[commit](/cn/docs/concepts) 。在之前提交的相同的`_hoodie_record_key`中寻找`_hoodie_commit_time`, `rider`, `driver`字段变更。 -{: .notice--info} +\{: .notice--info} ## 增量查询 @@ -345,7 +345,7 @@ spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hu ``` 这将提供在开始时间提交之后发生的所有更改,其中包含票价大于20.0的过滤器。关于此功能的独特之处在于,它现在使您可以在批量数据上创作流式管道。 -{: .notice--info} +\{: .notice--info} ## 特定时间点查询 diff --git a/website/i18n/cn/docusaurus-plugin-content-docs/version-0.8.0/ibm_cos_hoodie.md b/website/i18n/cn/docusaurus-plugin-content-docs/version-0.8.0/ibm_cos_hoodie.md index b502e1267939..1260299c57af 100644 --- a/website/i18n/cn/docusaurus-plugin-content-docs/version-0.8.0/ibm_cos_hoodie.md +++ b/website/i18n/cn/docusaurus-plugin-content-docs/version-0.8.0/ibm_cos_hoodie.md @@ -34,38 +34,38 @@ For example, using HMAC keys and service name `myCOS`: - fs.stocator.scheme.list - cos + fs.stocator.scheme.list + cos - fs.cos.impl - com.ibm.stocator.fs.ObjectStoreFileSystem + fs.cos.impl + com.ibm.stocator.fs.ObjectStoreFileSystem - fs.stocator.cos.impl - com.ibm.stocator.fs.cos.COSAPIClient + fs.stocator.cos.impl + com.ibm.stocator.fs.cos.COSAPIClient - fs.stocator.cos.scheme - cos + fs.stocator.cos.scheme + cos - fs.cos.myCos.access.key - ACCESS KEY + fs.cos.myCos.access.key + ACCESS KEY - fs.cos.myCos.endpoint - http://s3-api.us-geo.objectstorage.softlayer.net + fs.cos.myCos.endpoint + http://s3-api.us-geo.objectstorage.softlayer.net - fs.cos.myCos.secret.key - SECRET KEY + fs.cos.myCos.secret.key + SECRET KEY ``` diff --git a/website/i18n/cn/docusaurus-plugin-content-docs/version-0.8.0/spark_quick-start-guide.md b/website/i18n/cn/docusaurus-plugin-content-docs/version-0.8.0/spark_quick-start-guide.md index c8dbe1e22bc9..a87a7caa0eb8 100644 --- a/website/i18n/cn/docusaurus-plugin-content-docs/version-0.8.0/spark_quick-start-guide.md +++ b/website/i18n/cn/docusaurus-plugin-content-docs/version-0.8.0/spark_quick-start-guide.md @@ -247,7 +247,7 @@ dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator() [数据生成器](https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L50) 可以基于[行程样本模式](https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L57) 生成插入和更新的样本。 -{: .notice--info} +\{: .notice--info} ## 插入数据 {#inserts} @@ -284,7 +284,7 @@ df.write.format("hudi"). \ 有关将数据提取到Hudi中的方法的信息,请参阅[写入Hudi数据集](/cn/docs/writing_data)。 这里我们使用默认的写操作:`插入更新`。 如果您的工作负载没有`更新`,也可以使用更快的`插入`或`批量插入`操作。 想了解更多信息,请参阅[写操作](/cn/docs/writing_data#write-operations) -{: .notice--info} +\{: .notice--info} ## 查询数据 {#query} @@ -307,7 +307,7 @@ spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_pat 该查询提供已提取数据的读取优化视图。由于我们的分区路径(`region/country/city`)是嵌套的3个级别 从基本路径开始,我们使用了`load(basePath + "/*/*/*/*")`。 有关支持的所有存储类型和视图的更多信息,请参考[存储类型和视图](/cn/docs/concepts#storage-types--views)。 -{: .notice--info} +\{: .notice--info} ## 更新数据 {#updates} @@ -330,7 +330,7 @@ denoted by the timestamp. 
Look for changes in `_hoodie_commit_time`, `rider`, `d 注意,保存模式现在为`追加`。通常,除非您是第一次尝试创建数据集,否则请始终使用追加模式。 [查询](#query)现在再次查询数据将显示更新的行程。每个写操作都会生成一个新的由时间戳表示的[commit](/cn/docs/concepts) 。在之前提交的相同的`_hoodie_record_key`中寻找`_hoodie_commit_time`, `rider`, `driver`字段变更。 -{: .notice--info} +\{: .notice--info} ## 增量查询 @@ -365,7 +365,7 @@ spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hu ``` 这将提供在开始时间提交之后发生的所有更改,其中包含票价大于20.0的过滤器。关于此功能的独特之处在于,它现在使您可以在批量数据上创作流式管道。 -{: .notice--info} +\{: .notice--info} ## 特定时间点查询 diff --git a/website/i18n/cn/docusaurus-plugin-content-docs/version-0.9.0/ibm_cos_hoodie.md b/website/i18n/cn/docusaurus-plugin-content-docs/version-0.9.0/ibm_cos_hoodie.md index d7749e6917cf..c2b80fe35188 100644 --- a/website/i18n/cn/docusaurus-plugin-content-docs/version-0.9.0/ibm_cos_hoodie.md +++ b/website/i18n/cn/docusaurus-plugin-content-docs/version-0.9.0/ibm_cos_hoodie.md @@ -33,38 +33,38 @@ For example, using HMAC keys and service name `myCOS`: - fs.stocator.scheme.list - cos + fs.stocator.scheme.list + cos - fs.cos.impl - com.ibm.stocator.fs.ObjectStoreFileSystem + fs.cos.impl + com.ibm.stocator.fs.ObjectStoreFileSystem - fs.stocator.cos.impl - com.ibm.stocator.fs.cos.COSAPIClient + fs.stocator.cos.impl + com.ibm.stocator.fs.cos.COSAPIClient - fs.stocator.cos.scheme - cos + fs.stocator.cos.scheme + cos - fs.cos.myCos.access.key - ACCESS KEY + fs.cos.myCos.access.key + ACCESS KEY - fs.cos.myCos.endpoint - http://s3-api.us-geo.objectstorage.softlayer.net + fs.cos.myCos.endpoint + http://s3-api.us-geo.objectstorage.softlayer.net - fs.cos.myCos.secret.key - SECRET KEY + fs.cos.myCos.secret.key + SECRET KEY ``` diff --git a/website/i18n/cn/docusaurus-plugin-content-docs/version-0.9.0/spark_quick-start-guide.md b/website/i18n/cn/docusaurus-plugin-content-docs/version-0.9.0/spark_quick-start-guide.md index ee87543f7832..7ced36bf08cb 100644 --- a/website/i18n/cn/docusaurus-plugin-content-docs/version-0.9.0/spark_quick-start-guide.md +++ b/website/i18n/cn/docusaurus-plugin-content-docs/version-0.9.0/spark_quick-start-guide.md @@ -247,7 +247,7 @@ dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator() [数据生成器](https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L50) 可以基于[行程样本模式](https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L57) 生成插入和更新的样本。 -{: .notice--info} +\{: .notice--info} ## 插入数据 {#inserts} @@ -284,7 +284,7 @@ df.write.format("hudi"). \ 有关将数据提取到Hudi中的方法的信息,请参阅[写入Hudi数据集](/cn/docs/writing_data)。 这里我们使用默认的写操作:`插入更新`。 如果您的工作负载没有`更新`,也可以使用更快的`插入`或`批量插入`操作。 想了解更多信息,请参阅[写操作](/cn/docs/writing_data#write-operations) -{: .notice--info} +\{: .notice--info} ## 查询数据 {#query} @@ -307,7 +307,7 @@ spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_pat 该查询提供已提取数据的读取优化视图。由于我们的分区路径(`region/country/city`)是嵌套的3个级别 从基本路径开始,我们使用了`load(basePath + "/*/*/*/*")`。 有关支持的所有存储类型和视图的更多信息,请参考[存储类型和视图](/cn/docs/concepts#storage-types--views)。 -{: .notice--info} +\{: .notice--info} ## 更新数据 {#updates} @@ -330,7 +330,7 @@ denoted by the timestamp. 
Look for changes in `_hoodie_commit_time`, `rider`, `d 注意,保存模式现在为`追加`。通常,除非您是第一次尝试创建数据集,否则请始终使用追加模式。 [查询](#query)现在再次查询数据将显示更新的行程。每个写操作都会生成一个新的由时间戳表示的[commit](/cn/docs/concepts) 。在之前提交的相同的`_hoodie_record_key`中寻找`_hoodie_commit_time`, `rider`, `driver`字段变更。 -{: .notice--info} +\{: .notice--info} ## 增量查询 @@ -365,7 +365,7 @@ spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hu ``` 这将提供在开始时间提交之后发生的所有更改,其中包含票价大于20.0的过滤器。关于此功能的独特之处在于,它现在使您可以在批量数据上创作流式管道。 -{: .notice--info} +\{: .notice--info} ## 特定时间点查询 diff --git a/website/i18n/cn/docusaurus-plugin-content-pages/developer-setup.md b/website/i18n/cn/docusaurus-plugin-content-pages/developer-setup.md index d35b18674bfb..47df38ee8e84 100644 --- a/website/i18n/cn/docusaurus-plugin-content-pages/developer-setup.md +++ b/website/i18n/cn/docusaurus-plugin-content-pages/developer-setup.md @@ -18,7 +18,7 @@ To contribute code, you need Agreement](https://www.apache.org/licenses/icla.pdf) (ICLA) to the Apache Software Foundation (ASF). - (Recommended) Create an account on [JIRA](https://issues.apache.org/jira/projects/HUDI/summary) to open issues/find similar issues. - - (Recommended) Join our dev mailing list & slack channel, listed on [community](/contribute/get-involved) page. + - (Recommended) Join our dev mailing list & slack channel, listed on [community](/community/get-involved.md) page. ## IDE Setup @@ -236,7 +236,7 @@ For technical suggestions, you can also leverage [our RFC Process](https://cwiki ## Communication All communication is expected to align with the [Code of Conduct](https://www.apache.org/foundation/policies/conduct). -Discussion about contributing code to Hudi happens on the [dev@ mailing list](/contribute/get-involved). Introduce yourself! +Discussion about contributing code to Hudi happens on the [dev@ mailing list](/community/get-involved.md). Introduce yourself! 
## Code & Project Structure diff --git a/website/package.json b/website/package.json index fc7b4eeba2be..9fce186fc46f 100644 --- a/website/package.json +++ b/website/package.json @@ -14,25 +14,27 @@ "write-heading-ids": "docusaurus write-heading-ids" }, "dependencies": { - "@docusaurus/core": "2.0.0-beta.14", - "@docusaurus/plugin-client-redirects": "2.0.0-beta.14", - "@docusaurus/plugin-sitemap": "2.0.0-beta.14", - "@docusaurus/preset-classic": "2.0.0-beta.14", - "@docusaurus/theme-search-algolia": "2.0.0-beta.14", - "@fontsource/comfortaa": "^4.5.0", - "@mdx-js/react": "^1.6.21", - "@svgr/webpack": "^5.5.0", - "classnames": "^2.3.1", - "clsx": "^1.1.1", - "embla-carousel-react": "^6.2.0", + "@docusaurus/core": "^3.5.1", + "@docusaurus/plugin-client-redirects": "^3.5.1", + "@docusaurus/plugin-content-docs": "^3.5.1", + "@docusaurus/plugin-sitemap": "^3.5.1", + "@docusaurus/preset-classic": "^3.5.1", + "@docusaurus/theme-classic": "^3.5.1", + "@docusaurus/theme-search-algolia": "^3.5.1", + "@fontsource/comfortaa": "^5.0.20", + "@mdx-js/react": "^3.0.0", + "@svgr/webpack": "^8.1.0", + "classnames": "^2.5.1", + "clsx": "^2.1.1", + "embla-carousel-react": "^8.1.8", "file-loader": "^6.2.0", - "prism-react-renderer": "^1.2.1", - "react": "^17.0.1", - "react-dom": "^17.0.1", - "react-type-animation": "^2.1.2", - "std-env": "3.0.1", + "prism-react-renderer": "^2.3.1", + "react": "^18.2.0", + "react-dom": "^18.2.0", + "react-type-animation": "^3.2.0", + "std-env": "^3.7.0", "url-loader": "^4.1.1", - "yarn": "^1.22.11" + "yarn": "^1.22.22" }, "browserslist": { "production": [ @@ -47,8 +49,13 @@ ] }, "devDependencies": { - "@babel/core": "^7.15.0", - "@babel/preset-env": "^7.15.0", - "babel-loader": "^8.2.2" + "@babel/core": "^7.25.2", + "@babel/preset-env": "^7.25.3", + "@docusaurus/module-type-aliases": "3.5.1", + "@docusaurus/types": "3.5.1", + "babel-loader": "^9.1.3" + }, + "engines": { + "node": ">=18.0" } } diff --git a/website/releases/download.md b/website/releases/download.md index 148be08abc28..6259ffaa587b 100644 --- a/website/releases/download.md +++ b/website/releases/download.md @@ -56,4 +56,4 @@ or ``` % gpg --import KEYS % gpg --verify hudi-X.Y.Z.src.tgz.asc hudi-X.Y.Z.src.tgz -``` \ No newline at end of file +``` diff --git a/website/releases/older-releases.md b/website/releases/older-releases.md index ea044b31efd4..b70d068575e8 100644 --- a/website/releases/older-releases.md +++ b/website/releases/older-releases.md @@ -90,7 +90,7 @@ The raw release notes are available [here](https://issues.apache.org/jira/secure - Upgrade from Kafka 0.8.2.1 to Kafka 2.0.0 as a result of updating spark-streaming-kafka artifact from 0.8_2.11/2.12 to 0.10_2.11/2.12. * **IMPORTANT** This version requires your runtime spark version to be upgraded to 2.4+. * Hudi now supports both Scala 2.11 and Scala 2.12, please refer to [Build with Scala 2.12](https://github.com/apache/hudi#build-with-scala-212) to build with Scala 2.12. - Also, the packages hudi-spark, hudi-utilities, hudi-spark-bundle and hudi-utilities-bundle are changed correspondingly to hudi-spark_{scala_version}, hudi-spark_{scala_version}, hudi-utilities_{scala_version}, hudi-spark-bundle_{scala_version} and hudi-utilities-bundle_{scala_version}. + Also, the packages hudi-spark, hudi-utilities, hudi-spark-bundle and hudi-utilities-bundle are changed correspondingly to hudi-spark_\{scala_version\}, hudi-spark_\{scala_version\}, hudi-utilities_\{scala_version\}, hudi-spark-bundle_\{scala_version\} and hudi-utilities-bundle_\{scala_version\}. 
Note that scala_version here is one of (2.11, 2.12). * With 0.5.1, we added functionality to stop using renames for Hudi timeline metadata operations. This feature is automatically enabled for newly created Hudi tables. For existing tables, this feature is turned off by default. Please read this [section](https://hudi.apache.org/docs/deployment#upgrading), before enabling this feature for existing hudi tables. To enable the new hudi timeline layout which avoids renames, use the write config "hoodie.timeline.layout.version=1". Alternatively, you can use "repair overwrite-hoodie-props" to append the line "hoodie.timeline.layout.version=1" to hoodie.properties. Note that in any case, upgrade hudi readers (query engines) first with 0.5.1-incubating release before upgrading writer. diff --git a/website/releases/release-0.12.0.md b/website/releases/release-0.12.0.md index 93be2c17e55a..630fad410b08 100644 --- a/website/releases/release-0.12.0.md +++ b/website/releases/release-0.12.0.md @@ -27,7 +27,9 @@ note of the following updates before upgrading to Hudi 0.12.0. In this release, the default value for a few configurations have been changed. They are as follows: - `hoodie.bulkinsert.sort.mode`: This config is used to determine mode for sorting records for bulk insert. Its default value has been changed from `GLOBAL_SORT` to `NONE`, which means no sorting is done and it matches `spark.write.parquet()` in terms of overhead. + - `hoodie.datasource.hive_sync.partition_extractor_class`: This config is used to extract and transform partition value during Hive sync. Its default value has been changed from `SlashEncodedDayPartitionValueExtractor` to `MultiPartKeysValueExtractor`. If you relied on the previous default value (i.e., have not set it explicitly), you are required to set the config to `org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor`. From this release, if this config is not set and Hive sync is enabled, then partition value extractor class will be **automatically inferred** on the basis of number of partition fields and whether or not hive style partitioning is enabled. + - The following configs will be inferred, if not set manually, from other configs' values: - `META_SYNC_BASE_FILE_FORMAT`: infer from `org.apache.hudi.common.table.HoodieTableConfig.BASE_FILE_FORMAT` diff --git a/website/releases/release-0.12.1.md b/website/releases/release-0.12.1.md index 8d1f002a79b0..07cb43fbab3c 100644 --- a/website/releases/release-0.12.1.md +++ b/website/releases/release-0.12.1.md @@ -40,10 +40,10 @@ If all of the following is applicable to you: 1. Using Spark as an execution engine 2. Using Bulk Insert (using row-writing - , + https://hudi.apache.org/docs/next/configurations#hoodiedatasourcewriterowwriterenable, enabled *by default*) 3. Using Bloom Index (with range-pruning - + https://hudi.apache.org/docs/next/basic_configurations/#hoodiebloomindexprunebyranges enabled, enabled *by default*) for "UPSERT" operations Recommended to upgrading to 0.12.1 to avoid getting duplicate records in your pipeline. diff --git a/website/releases/release-0.13.0.md b/website/releases/release-0.13.0.md index e27050ceacec..59b33416a4aa 100644 --- a/website/releases/release-0.13.0.md +++ b/website/releases/release-0.13.0.md @@ -428,7 +428,7 @@ Consistent Hashing Index is still an evolving feature and currently there are so stop your write pipeline and enable clustering. You should take extreme care to not run both concurrently because it might result in conflicts and a failed pipeline. 
Once clustering is complete, you can resume your regular write pipeline, which will have compaction enabled. - ::: + We are working towards automating these and making it easier for users to leverage the Consistent Hashing Index. You can follow the ongoing work on the Consistent Hashing Index [here](https://issues.apache.org/jira/browse/HUDI-3000). @@ -445,7 +445,8 @@ To improve the concurrency control, the 0.13.0 release introduces a new feature, detect the conflict during the data writing phase and abort the writing early on once a conflict is detected, using Hudi's marker mechanism. Hudi can now stop a conflicting writer much earlier because of the early conflict detection and release computing resources necessary to cluster, improving resource utilization. - + ::: + :::caution The early conflict detection in OCC is ***EXPERIMENTAL*** in 0.13.0 release. ::: diff --git a/website/releases/release-1.0.0-beta1.md b/website/releases/release-1.0.0-beta1.md index 5d05582c60c4..fa8c371b1e3c 100644 --- a/website/releases/release-1.0.0-beta1.md +++ b/website/releases/release-1.0.0-beta1.md @@ -38,9 +38,9 @@ changes in this release: - Now all commit metadata is serialized to avro. This allows us to add new fields in the future without breaking compatibility and also maintain uniformity in metadata across all actions. - All completed commit metadata file name will also have completion time. All the actions in requested/inflight states - are stored in the active timeline as files named ... Completed + are stored in the active timeline as files named \.\.\. Completed actions are stored along with a time that denotes when the action was completed, in a file named < - begin_instant_time>_.. This allows us to implement file slicing for non-blocking + begin_instant_time>_\.\. This allows us to implement file slicing for non-blocking concurrecy control. - Completed actions, their plans and completion metadata are stored in a more scalable [LSM tree](https://en.wikipedia.org/wiki/Log-structured_merge-tree) based timeline organized in an * diff --git a/website/src/components/BlogsSlider/BlogCard.js b/website/src/components/BlogsSlider/BlogCard.js index 7daa32f7d9a0..b918a087508e 100644 --- a/website/src/components/BlogsSlider/BlogCard.js +++ b/website/src/components/BlogsSlider/BlogCard.js @@ -9,9 +9,14 @@ import styles from "./styles.module.css"; const BlogCard = ({ blog }) => { const { withBaseUrl } = useBaseUrlUtils(); const { frontMatter, assets, metadata } = blog; - const { formattedDate, title, authors, permalink } = metadata; + const { date, title, authors, permalink } = metadata; const image = assets.image ?? frontMatter.image ?? "/assets/images/hudi.png"; + const dateObj = new Date(date); + const options = { year: 'numeric', month: 'long', day: 'numeric' }; + const formattedDate = dateObj.toLocaleDateString('en-US', options); + + return (
    diff --git a/website/src/components/BlogsSlider/index.js b/website/src/components/BlogsSlider/index.js index 88a168034642..dbd36ef41ee9 100644 --- a/website/src/components/BlogsSlider/index.js +++ b/website/src/components/BlogsSlider/index.js @@ -36,7 +36,7 @@ const BlogsSlider = () => { const [emblaRef, emblaApi] = useEmblaCarousel({ loop: true, slidesToScroll: 1, - align: 0 + align: 'start' }); const [activeIndex, setActiveIndex] = useState(0); diff --git a/website/src/components/HomepageHeader/Icons/index.js b/website/src/components/HomepageHeader/Icons/index.js new file mode 100644 index 000000000000..d97050019d04 --- /dev/null +++ b/website/src/components/HomepageHeader/Icons/index.js @@ -0,0 +1,846 @@ +import React from "react"; + +const MutabilitySupport = () => { + return ( + + + + + + + + + + + + + + + + + ); +}; + +const IncrementalProcessing = () => { + return ( + + + + + + + + + + + + + ); +}; + +const ACIDTransactions = () => { + return ( + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + ); +}; + +const HistoricalTimeTravel = () => { + return ( + + + + + + + + + + + + ); +}; + +const Interoperable = () => { + return ( + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + ); +}; + +const TableServices = () => { + return ( + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + ); +}; + +const RichPlatform = () => { + return ( + + + + + + + + + + + + + ); +}; + +const MultiModalIndexes = () => { + return ( + + + + + + + + + + + + + + + + + + + + + + + + + + ); +}; + +const SchemaEvolution = () => { + return ( + + + + + + + + + + + + + + + + + + + + + + + ); +}; + +export { + MutabilitySupport, + IncrementalProcessing, + ACIDTransactions, + HistoricalTimeTravel, + Interoperable, + TableServices, + RichPlatform, + MultiModalIndexes, + SchemaEvolution, +}; diff --git a/website/src/components/Redirect.js b/website/src/components/Redirect.js index a895fe365f24..35cce35edf1d 100644 --- a/website/src/components/Redirect.js +++ b/website/src/components/Redirect.js @@ -1,15 +1,15 @@ import React from 'react'; -import useIsBrowser from '@docusaurus/useIsBrowser'; export default function Redirect({children, url}) { - const isBrowser = useIsBrowser(); - if (isBrowser) { + + if (global?.window?.location?.href) { global.window.location.href = url; } + return ( {children} or click here ); -} \ No newline at end of file +} diff --git a/website/src/components/Title/index.js b/website/src/components/Title/index.js index cf48089c7d44..83e6ff71b5f3 100644 --- a/website/src/components/Title/index.js +++ b/website/src/components/Title/index.js @@ -1,13 +1,12 @@ import React from "react"; import styles from "@site/src/components/Title/styles.module.css"; import Heading from "@theme/Heading"; -const AnchoredH2 = Heading("h2"); const Title = ({ primaryText, secondaryText, id }) => { return ( - + {primaryText}  {secondaryText} - + ); }; diff --git a/website/src/css/custom.css b/website/src/css/custom.css index efcebf3fe822..8c2932f5bdfa 100644 --- a/website/src/css/custom.css +++ b/website/src/css/custom.css @@ -7,6 +7,7 @@ /* You can override the default Infima variables here. 
*/ :root { + --ifm-menu-link-padding-horizontal: 1rem; --ifm-color-primary-text: rgb(13, 177, 249); --ifm-color-secondary-text: rgb(41, 85, 122); --ifm-color-black: rgba(28, 30, 33, 1); @@ -54,7 +55,7 @@ html[data-theme='dark'] .docusaurus-highlight-code-line { display: flex; height: 30px; width: 30px; - + } @media (max-width: 767px) { @@ -71,10 +72,10 @@ html[data-theme='dark'] .docusaurus-highlight-code-line { } @media only screen and (max-width: 1460px){ - .navbar__item { + .navbar__item, .navbar__link { font-size: 0.65em !important; } - + } @media only screen and (max-width: 1250px){ @@ -82,8 +83,9 @@ html[data-theme='dark'] .docusaurus-highlight-code-line { font-size: 0.6em !important; padding: 5px; } - - + .navbar__link { + font-size: 0.6em !important; + } } @media only screen and (max-width: 1055px){ @@ -91,6 +93,44 @@ html[data-theme='dark'] .docusaurus-highlight-code-line { font-size: 0.5em !important; padding: 5px; } + .navbar__link { + font-size: 0.5em !important; + } +} +.navbar__link { + width: max-content; +} + +@media(max-width:1820px){ + .navbar__item, .navbar__link { + font-size:90% !important; + } +} +@media(max-width:1602px){ + .navbar__item, .navbar__link { + font-size:80% !important; + } +} +@media(max-width:1440px){ + .navbar__item, .navbar__link { + font-size:70% !important; + } +} +@media(max-width:1440px){ + .navbar__item, .navbar__link { + font-size:70% !important; + } +} + +@media(max-width:1350px){ + a.menu__link.navbarFontSize_src-theme-Navbar-MobileSidebar-PrimaryMenu-styles-module { + font-size: .9rem !important; + } +} + +.navbar-sidebar__item .menu__link{ + padding-left: 1rem; + padding-right: 1rem; } .navbar__item { @@ -209,11 +249,12 @@ table.features i.feather { footer .container { margin-block-start: 1.83em; + max-width: var(--ifm-container-width-xl) !important; } .blog-wrapper .container { max-width: 100%; - + } .who-uses { @@ -296,7 +337,7 @@ footer .container { display:inline; overflow: hidden; vertical-align: text-top; - + } h1.blogPostTitle_src-theme-BlogPostItem-styles-module{ @@ -376,6 +417,25 @@ h1.blogPostTitle_src-theme-BlogPostItem-styles-module{ text-align: left; } +.footer__logo { + max-width: 10rem; +} + +.dropdown > .navbar__link:after { + border-color: currentColor transparent; + border-style: solid; + content: ''; + position: relative; + top: -2px; + border: solid black; + border-width: 0 1.9px 1.9px 0; + display: inline-block; + padding: 2.5px; + transform: rotate(45deg); + margin-top: -13px; + margin-left: 8px; +} + @media(max-width:1524px) { .navbar__item { padding: var(--ifm-navbar-item-padding-vertical) 7px; @@ -391,11 +451,68 @@ h1.blogPostTitle_src-theme-BlogPostItem-styles-module{ } } +.theme-doc-sidebar-item-link .menu__link{ + padding-left: 1rem !important; + padding-right: 1rem !important; +} + +.tagsListInline { + margin-top: 10px !important; +} +.tagsListInline b { + display: none; +} + +.tagsListInline ul { + margin-left: 10px !important; + font-size: .875rem !important; + margin-top: 10px !important; +} + +.tagsListInline ul li { + margin: -5px 20px 0 -10px; +} + +.tagsListInline ul li a{ + padding: 0.3rem 5px 0.3rem; +} + +/* Docusaurus-specific utility class */ +.docusaurus-mt-lg { + margin-top: 3rem; +} div[class^="announcementBar"][role="banner"] { color: white; background-color: #29557A; padding: 5px 0; height: auto; -} \ No newline at end of file +} + +.theme-doc-sidebar-item-link .menu__link{ + padding-left: 1rem !important; + padding-right: 1rem !important; +} + +.tagsListInline { + margin-top: 10px !important; +} 
+ +.tagsListInline b { + display: none; +} + +.tagsListInline ul { + margin-left: 10px !important; + font-size: .875rem !important; + margin-top: 10px !important; +} + +.tagsListInline ul li { + margin: -5px 20px 0 -10px; +} + +.tagsListInline ul li a{ + padding: 0.3rem 5px 0.3rem; +} diff --git a/website/src/pages/blog/streaming-data-lake-platform.md b/website/src/pages/blog/streaming-data-lake-platform.md index 0b4a36cccc02..48b4fac0b27e 100644 --- a/website/src/pages/blog/streaming-data-lake-platform.md +++ b/website/src/pages/blog/streaming-data-lake-platform.md @@ -4,12 +4,8 @@ title: quickstart path: /blog/streaming-data-lake-platform --- -import {Route} from '@docusaurus/router'; +import {Redirect} from '@docusaurus/router'; - { -global.window && (global.window.location.href = '/blog/2021/07/21/streaming-data-lake-platform'); -return null; -}} + diff --git a/website/src/pages/index.js b/website/src/pages/index.js index faa5eefbcd49..bb4401029ccf 100644 --- a/website/src/pages/index.js +++ b/website/src/pages/index.js @@ -11,13 +11,13 @@ import BlogsSlider from "@site/src/components/BlogsSlider"; import styles from './styles.module.css'; function NewReleaseMessage() { - return ( -
    -
    -
    -
    -
    - ); + return ( +
    +
    +
    +
    +
    + ); } export default function Home() { diff --git a/website/src/pages/index.module.css b/website/src/pages/index.module.css new file mode 100644 index 000000000000..9f71a5da775b --- /dev/null +++ b/website/src/pages/index.module.css @@ -0,0 +1,23 @@ +/** + * CSS files with the .module.css suffix will be treated as CSS modules + * and scoped locally. + */ + +.heroBanner { + padding: 4rem 0; + text-align: center; + position: relative; + overflow: hidden; +} + +@media screen and (max-width: 996px) { + .heroBanner { + padding: 2rem; + } +} + +.buttons { + display: flex; + align-items: center; + justify-content: center; +} diff --git a/website/src/pages/markdown-page.md b/website/src/pages/markdown-page.md new file mode 100644 index 000000000000..9756c5b6685a --- /dev/null +++ b/website/src/pages/markdown-page.md @@ -0,0 +1,7 @@ +--- +title: Markdown page example +--- + +# Markdown page example + +You don't need React to write simple standalone pages. diff --git a/website/src/pages/quickstart.md b/website/src/pages/quickstart.md index a5ccbbe7712e..ae6f391c6383 100644 --- a/website/src/pages/quickstart.md +++ b/website/src/pages/quickstart.md @@ -3,12 +3,8 @@ id: quickstart title: quickstart --- -import {Route} from '@docusaurus/router'; +import {Redirect} from '@docusaurus/router'; - { -global.window && (global.window.location.href = '/docs/quick-start-guide'); -return null; -}} + diff --git a/website/src/pages/tech-specs-1point0.md b/website/src/pages/tech-specs-1point0.md index 082de711219f..b8664129034e 100644 --- a/website/src/pages/tech-specs-1point0.md +++ b/website/src/pages/tech-specs-1point0.md @@ -39,15 +39,15 @@ Hudi organizes a table as a collection of files (objects in cloud storage) that Metadata about the table is stored at a location on storage, referred to as **_basepath_**, which contains a special reserved _.hoodie_ directory under the base path is used to store transaction logs, metadata and indexes. A special file [`hoodie.properties`](http://hoodie.properties/) under basepath persists table level configurations, shared by writers and readers of the table. These configurations are explained [here](https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java), and any config without a default value needs to be specified during table creation. ```plain -/data/hudi_trips/ <-- Base Path -├── .hoodie/ <-- Meta Path +/data/hudi_trips/ <-- Base Path +├── .hoodie/ <-- Meta Path | └── hoodie.properties <-- Table Configs │ └── metadata/ <-- Metadata | └── files/ <-- Files that make up the table | └── col_stats/ <-- Statistics on files and columns ├── americas/ <-- Data stored as folder tree │ ├── brazil/ -│ │ └── sao_paulo/ <-- Partition Path +│ │ └── sao_paulo <-- Partition Path │ │ ├── [data_files] │ └── united_states/ │ └── san_francisco/ @@ -107,7 +107,7 @@ Completed actions, their plans and completion metadata are stored in a more scal ```bash /.hoodie/archived/ -├── _version_ <-- stores the manifest version that is current +├── _version_ <-- stores the manifest version that is current ├── manifest_1 <-- manifests store list of files in timeline ├── manifest_2 <-- compactions, cleaning, writes produce new manifest files ├── ... 
@@ -418,7 +418,7 @@ The record index is stored in Hudi metadata table under the partition `record_in | Fields | Description | |----------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | partitionName | A string that refers to the partition name the record belongs to. | -| fileIdHighBits | A long that refers to high 64 bits if the fileId is based on UUID format. A UUID based fileId is stored as 3 pieces in RLI (fileIdHighBits, fileIdLowBits and fileIndex). FileID format is {UUID}-{fileIndex}. | +| fileIdHighBits | A long that refers to high 64 bits if the fileId is based on UUID format. A UUID based fileId is stored as 3 pieces in RLI (fileIdHighBits, fileIdLowBits and fileIndex). FileID format is \{UUID}-\{fileIndex}. | | fileIdLowBits | A long that refers to low 64 bits if the fileId is based on UUID format. | | fileIndex | An integer that refers to index representing file index which is used to reconstruct UUID based fileID. Applicable when the fileId is based on UUID format. | | fileIdEncoding | An integer that represents fileId encoding. Possible values are 0 and 1. O represents UUID based fileID, and 1 represents raw string format of the fileId. When the encoding is 0, reader can deduce fileID from fileIdLowBits, fileIdLowBits and fileIndex. | diff --git a/website/src/pages/tech-specs.md b/website/src/pages/tech-specs.md index a56c6032ea1b..9c09832f0a0c 100644 --- a/website/src/pages/tech-specs.md +++ b/website/src/pages/tech-specs.md @@ -33,23 +33,23 @@ At a high level, Hudi organizes data into a directory structure under the base p Note that, unlike Hive style partitioning, partition columns are not removed from data files and partitioning is a mere organization of data files. A special reserved *.hoodie* directory under the base path is used to store transaction logs and metadata. A special file `hoodie.properties` persists table level configurations, shared by writers and readers of the table. These configurations are explained [here](https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java), and any config without a default value needs to be specified during table creation. - - /data/hudi_trips/ <== Base Path - ├── .hoodie/ <== Meta Path - | └── hoodie.properties <== Table Configs - │ └── metadata/ <== Table Metadata - ├── americas/ - │ ├── brazil/ - │ │ └── sao_paulo/ <== Partition Path - │ │ ├── - │ └── united_states/ - │ └── san_francisco/ - │ ├── - └── asia/ - └── india/ - └── chennai/ - ├── - +```plain +/data/hudi_trips/ <== Base Path +├── .hoodie/ <== Meta Path +| └── hoodie.properties <== Table Configs +│ └── metadata/ <== Table Metadata +├── americas/ +│ ├── brazil/ +│ │ └── sao_paulo/ <== Partition Path +│ │ ├── +│ └── united_states/ +│ └── san_francisco/ +│ ├── +└── asia/ + └── india/ + └── chennai/ + ├── +``` ### Table Types Hudi storage format supports two table types offering different trade-offs between ingest and query performance and the data files are stored differently based on the chosen table type. @@ -102,8 +102,9 @@ Data consistency in Hudi is provided using Multi-version Concurrency Control (MV All actions and the state transitions are registered with the timeline using an atomic write of a special meta-file inside the *.hoodie* directory. 
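To make the atomic-write requirement concrete, here is a minimal, hypothetical Java sketch of how a writer could publish such a meta-file on a filesystem that supports atomic rename. The base path, instant value and `commit.requested` name are illustrative only, and the stage-then-rename mechanism is an assumption made for this example rather than a description of Hudi's actual code; the file name follows the `[Action timestamp].[Action type].[Action state]` convention defined in the next paragraph.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch: publish a timeline meta-file via an atomic rename.
// The base path, instant time and "commit.requested" name are example values;
// this is not Hudi's implementation, only an illustration of the atomic-put idea.
public class TimelineMetaFileSketch {
    public static void main(String[] args) throws Exception {
        Path hoodieDir = Paths.get("/tmp/hudi_trips/.hoodie");
        Files.createDirectories(hoodieDir);

        // [Action timestamp].[Action type].[Action state]
        String metaFileName = System.currentTimeMillis() + ".commit.requested";

        // Stage the content first, then make it visible in a single atomic step,
        // so concurrent readers either see the complete meta-file or nothing.
        Path staged = Files.createTempFile(hoodieDir, metaFileName, ".tmp");
        Files.write(staged, "{}".getBytes(StandardCharsets.UTF_8));
        Files.move(staged, hoodieDir.resolve(metaFileName), StandardCopyOption.ATOMIC_MOVE);
    }
}
```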
The requirement from the underlying storage system is to support an atomic-put and read-after-write consistency. The meta file naming structure is as follows - - [Action timestamp].[Action type].[Action state] +```$xslt +[Action timestamp].[Action type].[Action state] +``` **Action timestamp:** Monotonically increasing value to denote strict ordering of actions in the timeline. This could be provided by an external token provider or rely on the system epoch time at millisecond granularity. @@ -210,9 +211,9 @@ file pruning for filters and join conditions in the query. The payload is an ins | HoodieRecordIndexInfo | `partitionName` | string | partition name to which the record belongs | | | `fileIdEncoding` | int | determines the fields used to deduce file id. When the encoding is 0, file Id can be deduced from fileIdLowBits, fileIdHighBits and fileIndex. When encoding is 1, file Id is available in raw string format in fileId field | | | `fileId` | string | file id in raw string format is available when encoding is set to 1 | -| | `fileIdHighBits` | long | file Id can be deduced as {UUID}-{fileIndex} when encoding is set to 0. fileIdHighBits and fileIdLowBits form the UUID | -| | `fileIdLowBits` | long | file Id can be deduced as {UUID}-{fileIndex} when encoding is set to 0. fileIdHighBits and fileIdLowBits form the UUID | -| | `fileIndex` | int | file Id can be deduced as {UUID}-{fileIndex} when encoding is set to 0. fileIdHighBits and fileIdLowBits form the UUID | +| | `fileIdHighBits` | long | file Id can be deduced as \{UUID}-\{fileIndex} when encoding is set to 0. fileIdHighBits and fileIdLowBits form the UUID | +| | `fileIdLowBits` | long | file Id can be deduced as \{UUID}-\{fileIndex} when encoding is set to 0. fileIdHighBits and fileIdLowBits form the UUID | +| | `fileIndex` | int | file Id can be deduced as \{UUID}-\{fileIndex} when encoding is set to 0. fileIdHighBits and fileIdLowBits form the UUID | | | `instantTime` | long | Epoch time in millisecond representing the commit time at which record was added | @@ -228,7 +229,9 @@ As mentioned in the data model, data is partitioned coarsely through a directory The base file name format is: - [File Id]_[File Write Token]_[Transaction timestamp].[File Extension] +```$xslt +[File Id]_[File Write Token]_[Transaction timestamp].[File Extension] +``` - **File Id** - Uniquely identify a base file within the table. Multiple versions of the base file share the same file id. - **File Write Token** - Monotonically increasing token for every attempt to write the base file. This should help uniquely identifying the base file when there are failures and retries. 
Cleaning can remove partial/uncommitted base files if the write token is not the latest in the file group @@ -240,8 +243,9 @@ The base file name format is: ### Log File Format The log file name format is: - - [File Id]_[Base Transaction timestamp].[Log File Extension].[Log File Version]_[File Write Token] +```$xslt +[File Id]_[Base Transaction timestamp].[Log File Extension].[Log File Version]_[File Write Token] +``` - **File Id** - File Id of the base file in the slice - **Base Transaction timestamp** - Commit timestamp on the base file for which the log file is updating the deletes/updates for diff --git a/website/src/theme/BlogLayout/index.js b/website/src/theme/BlogLayout/index.js index 2fdabdeca0a0..641c566de919 100644 --- a/website/src/theme/BlogLayout/index.js +++ b/website/src/theme/BlogLayout/index.js @@ -10,36 +10,36 @@ import Layout from '@theme/Layout'; import BlogSidebar from '@theme/BlogSidebar'; function BlogLayout(props) { - const {sidebar, toc, children, ...layoutProps} = props; - const hasSidebar = sidebar && sidebar.items.length > 0; - const isBlogListPage = props.pageClassName === "blog-list-page"; - const isTagsPostList = props.pageClassName === "blog-tags-post-list-page"; + const {sidebar, toc, children, ...layoutProps} = props; + const hasSidebar = sidebar && sidebar.items.length > 0; + const isBlogListPage = props.pageClassName === "blog-list-page"; + const isTagsPostList = props.pageClassName === "blog-tags-post-list-page"; - return ( - -
    -
    - {hasSidebar && ( - - )} -
    - {children} -
    - {toc &&
    {toc}
    } -
    -
    -
    - ); + return ( + +
    +
    + {hasSidebar && ( + + )} +
    + {children} +
    + {toc &&
    {toc}
    } +
    +
    +
    + ); } export default BlogLayout; diff --git a/website/src/theme/BlogPostItem/BlogPostBox.js b/website/src/theme/BlogPostItem/BlogPostBox.js new file mode 100644 index 000000000000..359a7e3cf52e --- /dev/null +++ b/website/src/theme/BlogPostItem/BlogPostBox.js @@ -0,0 +1,124 @@ +import React from 'react'; +import clsx from 'clsx'; +import Link from '@docusaurus/Link'; +import styles from './blogPostBoxStyles.module.css'; +import AuthorName from "@site/src/components/AuthorName"; +import { useBaseUrlUtils } from "@docusaurus/core/lib/client/exports/useBaseUrl"; +import Tag from "@theme/Tag"; +import {useLocation} from '@docusaurus/router'; +export default function BlogPostBox({metadata = {}, assets, frontMatter}) { + const { withBaseUrl } = useBaseUrlUtils(); + const { + date, + permalink, + tags, + title, + authors, + } = metadata; + const location = useLocation(); + + const image = assets.image ?? frontMatter.image ?? '/assets/images/hudi-logo-medium.png'; + + const manageVideoOpen = (videoLink) => { + if(videoLink) { + window.open(videoLink, '_blank', 'noopener noreferrer'); + } + } + + const tagsList = () => { + return ( +
      + {tags.map(({label, permalink: tagPermalink}) => ( +
    • + +
    • + ))} +
    + ); + } + const AuthorsList = () => { + const dateObj = new Date(date); + const options = { year: 'numeric', month: 'long', day: 'numeric' }; + const formattedDate = dateObj.toLocaleDateString('en-US', options); + + const authorsCount = authors.length; + if (authorsCount === 0) { + return ( +
    + {formattedDate} +
    + ) + } + + return ( +
    + {formattedDate} by + +
    + ); + } + + const renderPostHeader = () => { + const TitleHeading = 'h2'; + return ( +
    +
    + {image && ( +
    + { + location.pathname.startsWith('/blog') ? + + : + manageVideoOpen(frontMatter?.navigate)} + src={withBaseUrl(image, { + absolute: true, + })} + className={clsx(styles.videoImage, 'blog-image')} + /> + } + +
    + )} + + {location.pathname.startsWith('/blog') ? + + + {title} + + + : + manageVideoOpen(frontMatter?.navigate)} + className={styles.blogPostTitle} itemProp="headline"> + {title} + + } + +
    + {AuthorsList()} +
    +
    + {!!tags.length && ( + tagsList() + )} +
    + ); + }; + + + + return( +
    + {renderPostHeader()} +
    + ) +} diff --git a/website/src/theme/BlogPostItem/Container/index.js b/website/src/theme/BlogPostItem/Container/index.js new file mode 100644 index 000000000000..22ffcc50322f --- /dev/null +++ b/website/src/theme/BlogPostItem/Container/index.js @@ -0,0 +1,4 @@ +import React from 'react'; +export default function BlogPostItemContainer({children, className}) { + return
    {children}
    ; +} diff --git a/website/src/theme/BlogPostItem/Content/index.js b/website/src/theme/BlogPostItem/Content/index.js new file mode 100644 index 000000000000..9620429a23af --- /dev/null +++ b/website/src/theme/BlogPostItem/Content/index.js @@ -0,0 +1,16 @@ +import React from 'react'; +import clsx from 'clsx'; +import {blogPostContainerID} from '@docusaurus/utils-common'; +import {useBlogPost} from '@docusaurus/plugin-content-blog/client'; +import MDXContent from '@theme/MDXContent'; +export default function BlogPostItemContent({children, className}) { + const {isBlogPostPage} = useBlogPost(); + return ( +
    + {children} +
    + ); +} diff --git a/website/src/theme/BlogPostItem/Footer/ReadMoreLink/index.js b/website/src/theme/BlogPostItem/Footer/ReadMoreLink/index.js new file mode 100644 index 000000000000..9b93d2549a2e --- /dev/null +++ b/website/src/theme/BlogPostItem/Footer/ReadMoreLink/index.js @@ -0,0 +1,32 @@ +import React from 'react'; +import Translate, {translate} from '@docusaurus/Translate'; +import Link from '@docusaurus/Link'; +function ReadMoreLabel() { + return ( + + + Read More + + + ); +} +export default function BlogPostItemFooterReadMoreLink(props) { + const {blogPostTitle, ...linkProps} = props; + return ( + + + + ); +} diff --git a/website/src/theme/BlogPostItem/Footer/index.js b/website/src/theme/BlogPostItem/Footer/index.js new file mode 100644 index 000000000000..80cf3b09427b --- /dev/null +++ b/website/src/theme/BlogPostItem/Footer/index.js @@ -0,0 +1,64 @@ +import React from 'react'; +import clsx from 'clsx'; +import {useBlogPost} from '@docusaurus/plugin-content-blog/client'; +import {ThemeClassNames} from '@docusaurus/theme-common'; +import EditMetaRow from '@theme/EditMetaRow'; +import TagsListInline from '@theme/TagsListInline'; +import ReadMoreLink from '@theme/BlogPostItem/Footer/ReadMoreLink'; +export default function BlogPostItemFooter() { + const {metadata, isBlogPostPage} = useBlogPost(); + const { + tags, + title, + editUrl, + hasTruncateMarker, + lastUpdatedBy, + lastUpdatedAt, + } = metadata; + // A post is truncated if it's in the "list view" and it has a truncate marker + const truncatedPost = !isBlogPostPage && hasTruncateMarker; + const tagsExists = tags.length > 0; + const renderFooter = tagsExists || truncatedPost || editUrl; + if (!renderFooter) { + return null; + } + // BlogPost footer - details view + if (isBlogPostPage) { + const canDisplayEditMetaRow = !!(editUrl || lastUpdatedAt || lastUpdatedBy); + return ( +
    + {canDisplayEditMetaRow && ( + + )} +
    + ); + } + // BlogPost footer - list view + else { + return ( +
    + {tagsExists && ( +
    + +
    + )} + {truncatedPost && ( +
    + +
    + )} +
    + ); + } +} diff --git a/website/src/theme/BlogPostItem/Header/Authors/index.js b/website/src/theme/BlogPostItem/Header/Authors/index.js new file mode 100644 index 000000000000..dd243caffe3c --- /dev/null +++ b/website/src/theme/BlogPostItem/Header/Authors/index.js @@ -0,0 +1,37 @@ +import React from 'react'; +import clsx from 'clsx'; +import {useBlogPost} from '@docusaurus/plugin-content-blog/client'; +import BlogAuthor from '@theme/Blog/Components/Author'; +import styles from './styles.module.css'; +// Component responsible for the authors layout +export default function BlogPostItemHeaderAuthors({className}) { + const { + metadata: {authors}, + assets, + } = useBlogPost(); + const authorsCount = authors.length; + if (authorsCount === 0) { + return null; + } + const imageOnly = authors.every(({name}) => !name); + const singleAuthor = authors.length === 1; + return ( +
    + {authors.map((author, idx) => ( +
    + {author.name} +
    + ))} +
    + ); +} diff --git a/website/src/theme/BlogPostItem/Header/Authors/styles.module.css b/website/src/theme/BlogPostItem/Header/Authors/styles.module.css new file mode 100644 index 000000000000..01a73306960b --- /dev/null +++ b/website/src/theme/BlogPostItem/Header/Authors/styles.module.css @@ -0,0 +1,22 @@ +.authorCol { + max-width: inherit !important; +} + +.imageOnlyAuthorRow { + display: flex; + flex-flow: row wrap; +} + +.imageOnlyAuthorCol { + margin-left: 0.3rem; + margin-right: 0.3rem; +} + +.authorWrapper { + margin-left: 10px; + .avatar__name { + span { + font-weight: 600 !important; + } + } +} diff --git a/website/src/theme/BlogPostItem/Header/Info/index.js b/website/src/theme/BlogPostItem/Header/Info/index.js new file mode 100644 index 000000000000..42553fb79248 --- /dev/null +++ b/website/src/theme/BlogPostItem/Header/Info/index.js @@ -0,0 +1,59 @@ +import React from 'react'; +import clsx from 'clsx'; +import {translate} from '@docusaurus/Translate'; +import {usePluralForm} from '@docusaurus/theme-common'; +import {useDateTimeFormat} from '@docusaurus/theme-common/internal'; +import {useBlogPost} from '@docusaurus/plugin-content-blog/client'; +import BlogPostItemHeaderAuthors from '@theme/BlogPostItem/Header/Authors'; +import styles from './styles.module.css'; +// Very simple pluralization: probably good enough for now +function useReadingTimePlural() { + const {selectMessage} = usePluralForm(); + return (readingTimeFloat) => { + const readingTime = Math.ceil(readingTimeFloat); + return selectMessage( + readingTime, + translate( + { + id: 'theme.blog.post.readingTime.plurals', + description: + 'Pluralized label for "{readingTime} min read". Use as much plural forms (separated by "|") as your language support (see https://www.unicode.org/cldr/cldr-aux/charts/34/supplemental/language_plural_rules.html)', + message: 'One min read|{readingTime} min read', + }, + {readingTime}, + ), + ); + }; +} +function ReadingTime({readingTime}) { + const readingTimePlural = useReadingTimePlural(); + return {readingTimePlural(readingTime)}; +} +function DateTime({date, formattedDate}) { + return ; +} +function Spacer() { + return {' · '}; +} +export default function BlogPostItemHeaderInfo({className}) { + const {metadata} = useBlogPost(); + const {date, readingTime} = metadata; + const dateTimeFormat = useDateTimeFormat({ + day: 'numeric', + month: 'long', + year: 'numeric', + timeZone: 'UTC', + }); + const formatDate = (blogDate) => dateTimeFormat.format(new Date(blogDate)); + return ( +
    + + + {typeof readingTime !== 'undefined' && ( + <> + + + )} +
    + ); +} diff --git a/website/src/theme/BlogPostItem/Header/Info/styles.module.css b/website/src/theme/BlogPostItem/Header/Info/styles.module.css new file mode 100644 index 000000000000..46e2febfcba2 --- /dev/null +++ b/website/src/theme/BlogPostItem/Header/Info/styles.module.css @@ -0,0 +1,17 @@ +.container { + color: #1c1e21; + display: flex; + flex-direction: row; + font-size: 1.1rem; + margin-left: 2px; +} + +.spacer { + font-size: 35px; + line-height: 28px; +} + +.marker { + margin-left: 30px; + display: list-item; +} diff --git a/website/src/theme/BlogPostItem/Header/Title/index.js b/website/src/theme/BlogPostItem/Header/Title/index.js new file mode 100644 index 000000000000..d711475381b3 --- /dev/null +++ b/website/src/theme/BlogPostItem/Header/Title/index.js @@ -0,0 +1,15 @@ +import React from 'react'; +import clsx from 'clsx'; +import Link from '@docusaurus/Link'; +import {useBlogPost} from '@docusaurus/plugin-content-blog/client'; +import styles from './styles.module.css'; +export default function BlogPostItemHeaderTitle({className}) { + const {metadata, isBlogPostPage} = useBlogPost(); + const {permalink, title} = metadata; + const TitleHeading = isBlogPostPage ? 'h1' : 'h2'; + return ( + + {isBlogPostPage ? title : {title}} + + ); +} diff --git a/website/src/theme/BlogPostItem/Header/Title/styles.module.css b/website/src/theme/BlogPostItem/Header/Title/styles.module.css new file mode 100644 index 000000000000..d5cd1c97a39b --- /dev/null +++ b/website/src/theme/BlogPostItem/Header/Title/styles.module.css @@ -0,0 +1,13 @@ +.title { + font-size: 2rem; + color: #12B1FF; +} + +/** + Blog post title should be smaller on smaller devices +**/ +@media (max-width: 576px) { + .title { + font-size: 2rem; + } +} diff --git a/website/src/theme/BlogPostItem/Header/index.js b/website/src/theme/BlogPostItem/Header/index.js new file mode 100644 index 000000000000..3b565a8e519a --- /dev/null +++ b/website/src/theme/BlogPostItem/Header/index.js @@ -0,0 +1,36 @@ +import React from 'react'; +import BlogPostItemHeaderTitle from '@theme/BlogPostItem/Header/Title'; +import BlogPostItemHeaderInfo from '@theme/BlogPostItem/Header/Info'; +import clsx from "clsx"; +import TagsListInline from "@theme/TagsListInline"; +import {useBlogPost} from "@docusaurus/plugin-content-blog/client"; +import {ThemeClassNames} from "@docusaurus/theme-common"; + +export default function BlogPostItemHeader() { + const {metadata, isBlogPostPage} = useBlogPost(); + const { + tags, + hasTruncateMarker, + } = metadata; + // A post is truncated if it's in the "list view" and it has a truncate marker + const truncatedPost = !isBlogPostPage && hasTruncateMarker; + const tagsExists = tags.length > 0; + return ( +
    + + + {tagsExists && ( +
    +
    + +
    +
    + )} +
    + ); +} diff --git a/website/src/theme/BlogPostItem/styles.module.css b/website/src/theme/BlogPostItem/blogPostBoxStyles.module.css similarity index 100% rename from website/src/theme/BlogPostItem/styles.module.css rename to website/src/theme/BlogPostItem/blogPostBoxStyles.module.css diff --git a/website/src/theme/BlogPostItem/index.js b/website/src/theme/BlogPostItem/index.js index 5415f03737d1..4593a89b7018 100644 --- a/website/src/theme/BlogPostItem/index.js +++ b/website/src/theme/BlogPostItem/index.js @@ -1,218 +1,28 @@ -/** - * Copyright (c) Facebook, Inc. and its affiliates. - * - * This source code is licensed under the MIT license found in the - * LICENSE file in the root directory of this source tree. - */ import React from 'react'; import clsx from 'clsx'; -import {MDXProvider} from '@mdx-js/react'; -import {translate} from '@docusaurus/Translate'; -import Link from '@docusaurus/Link'; -import {useBaseUrlUtils} from '@docusaurus/useBaseUrl'; -import {usePluralForm} from '@docusaurus/theme-common'; -import MDXComponents from '@theme/MDXComponents'; -import EditThisPage from '@theme/EditThisPage'; -import styles from './styles.module.css'; -import Tag from '@theme/Tag'; -import AuthorName from "@site/src/components/AuthorName"; -import { useLocation } from 'react-router-dom'; -import classNames from "classnames"; - -function useReadingTimePlural() { - const {selectMessage} = usePluralForm(); - return (readingTimeFloat) => { - const readingTime = Math.ceil(readingTimeFloat); - return selectMessage( - readingTime, - translate( - { - id: 'theme.blog.post.readingTime.plurals', - description: - 'Pluralized label for "{readingTime} min read". Use as much plural forms (separated by "|") as your language support (see https://www.unicode.org/cldr/cldr-aux/charts/34/supplemental/language_plural_rules.html)', - message: 'One min read|{readingTime} min read', - }, - { - readingTime, - }, - ), - ); - }; +import {useBlogPost} from '@docusaurus/plugin-content-blog/client'; +import BlogPostItemContainer from '@theme/BlogPostItem/Container'; +import BlogPostItemHeader from '@theme/BlogPostItem/Header'; +import BlogPostItemContent from '@theme/BlogPostItem/Content'; +import BlogPostItemFooter from '@theme/BlogPostItem/Footer'; +import BlogPostBox from "./BlogPostBox"; +// apply a bottom margin in list view +function useContainerClassName() { + const {isBlogPostPage} = useBlogPost(); + return !isBlogPostPage ? 'margin-bottom--xl' : undefined; } - -function BlogPostItem(props) { - const readingTimePlural = useReadingTimePlural(); - const location = useLocation(); - const { withBaseUrl } = useBaseUrlUtils(); - - const { - children, - frontMatter, - assets, - metadata, - truncated, - isBlogPostPage = false, - } = props; - - const { - date, - formattedDate, - permalink, - tags, - readingTime, - title, - editUrl, - authors, - } = metadata; - const image = assets.image ?? frontMatter.image ?? '/assets/images/hudi-logo-medium.png'; - const tagsExists = tags.length > 0; - - const manageVideoOpen = (videoLink) => { - if(videoLink) { - window.open(videoLink, '_blank', 'noopener noreferrer'); - } - } - - const tagsList = () => { - return ( - <> -
      - - {tags.map(({label, permalink: tagPermalink}) => ( -
    • - -
    • - ))} -
    - - ); - } - const AuthorsList = () => { - - const authorsCount = authors.length; - if (authorsCount === 0) { - return ( -
    - -
    - ) - - } - - return ( - <> - {isBlogPostPage ?
    - - - -
    :
    - - -
    } - - - ); - } - - const renderPostHeader = () => { - const TitleHeading = isBlogPostPage ? 'h1' : 'h2'; - return ( -
    -
    - {!isBlogPostPage && image && ( -
    - { - location.pathname.startsWith('/blog') ? - - : - manageVideoOpen(frontMatter?.navigate)} - src={withBaseUrl(image, { - absolute: true, - })} - className={classNames(styles.videoImage, 'blog-image')} - /> - } - -
    - )} - - {isBlogPostPage ? ( - - {title} - - ) : ( - location.pathname.startsWith('/blog') ? - - - {title} - - - : - manageVideoOpen(frontMatter?.navigate)} - className={styles.blogPostTitle} itemProp="headline"> - {title} - - )} - -
    - {AuthorsList()} - {isBlogPostPage && readingTime &&
    - <> - {typeof readingTime !== 'undefined' && ( - <> - {readingTimePlural(readingTime)} - - )} - -
    - } -
    -
    - {!!tags.length && ( - tagsList() - )} -
    - ); - }; - - return ( -
    - {renderPostHeader()} - - {isBlogPostPage && ( -
    - {children} -
    - )} - - {(tagsExists || truncated) && isBlogPostPage && editUrl && ( - -
    - -
    -
    - )} -
    - ); +export default function BlogPostItem({children, className}) { + const containerClassName = useContainerClassName(); + const {isBlogPostPage, metadata,assets, frontMatter, ...rest} = useBlogPost(); + + if (!isBlogPostPage) { + return + } + return ( + + + {children} + + + ); } - -export default BlogPostItem; diff --git a/website/src/theme/DocPage/index.js b/website/src/theme/DocPage/index.js deleted file mode 100644 index a8b5bf2ea360..000000000000 --- a/website/src/theme/DocPage/index.js +++ /dev/null @@ -1,196 +0,0 @@ -/** - * Copyright (c) Facebook, Inc. and its affiliates. - * - * This source code is licensed under the MIT license found in the - * LICENSE file in the root directory of this source tree. - */ -import React, {useState, useCallback, useEffect} from 'react'; -import {MDXProvider} from '@mdx-js/react'; -import renderRoutes from '@docusaurus/renderRoutes'; -import Layout from '@theme/Layout'; -import DocSidebar from '@theme/DocSidebar'; -import MDXComponents from '@theme/MDXComponents'; -import NotFound from '@theme/NotFound'; -import IconArrow from '@theme/IconArrow'; -import BackToTopButton from '@theme/BackToTopButton'; -import {matchPath} from '@docusaurus/router'; -import {translate} from '@docusaurus/Translate'; -import clsx from 'clsx'; -import styles from './styles.module.css'; -import { - ThemeClassNames, - docVersionSearchTag, - DocsSidebarProvider, - useDocsSidebar, - DocsVersionProvider, -} from '@docusaurus/theme-common'; -import Head from '@docusaurus/Head'; - -function DocPageContent({ - currentDocRoute, - versionMetadata, - children, - sidebarName, -}) { - const sidebar = useDocsSidebar(); - - const {pluginId, version} = versionMetadata; - const [hiddenSidebarContainer, setHiddenSidebarContainer] = useState(false); - const [hiddenSidebar, setHiddenSidebar] = useState(false); - const toggleSidebar = useCallback(() => { - if (hiddenSidebar) { - setHiddenSidebar(false); - } - - setHiddenSidebarContainer((value) => !value); - }, [hiddenSidebar]); - if(typeof window !== 'undefined') { - useEffect(() => { - const timeout = setTimeout(() => { - const [_, hashValue] = window.location.href.split('#'); - - const element = document.querySelectorAll(`[href="#${hashValue}"]`)?.[0]; - if(element) { - const headerOffset = 90; - const elementPosition = element.getBoundingClientRect().top; - const offsetPosition = elementPosition + window.pageYOffset - headerOffset; - window.scrollTo({ - top: offsetPosition - }); - } - }, 100); - - return () => { - clearTimeout(timeout); - } - }, [window.location.href]); - } - return ( - -
    - - - {sidebar && ( - - )} -
    -
    - {children} -
    -
    -
    -
    - ); -} - -const arrayOfPages = (matchPath) => [`${matchPath}/configurations`, `${matchPath}/basic_configurations`, `${matchPath}/timeline`, `${matchPath}/table_types`, `${matchPath}/migration_guide`, `${matchPath}/compaction`, `${matchPath}/clustering`, `${matchPath}/indexing`, `${matchPath}/metadata`, `${matchPath}/metadata_indexing`, `${matchPath}/record_payload`, `${matchPath}/file_sizing`, `${matchPath}/hoodie_cleaner`, `${matchPath}/concurrency_control`, , `${matchPath}/write_operations`, `${matchPath}/key_generation`]; -const showCustomStylesForDocs = (matchPath, pathname) => arrayOfPages(matchPath).includes(pathname); -function DocPage(props) { - const { - route: {routes: docRoutes}, - versionMetadata, - location, - } = props; - const currentDocRoute = docRoutes.find((docRoute) => - matchPath(location.pathname, docRoute), - ); - - if (!currentDocRoute) { - return ; - } // For now, the sidebarName is added as route config: not ideal! - - const sidebarName = currentDocRoute.sidebar; - const sidebar = sidebarName - ? versionMetadata.docsSidebars[sidebarName] - : null; - - const addCustomClass = showCustomStylesForDocs(props.match.path, props.location.pathname) - return ( - <> - - {/* TODO we should add a core addRoute({htmlClassName}) generic plugin option */} - - - - - - {renderRoutes(docRoutes, { - versionMetadata, - })} - - - - - ); -} - -export default DocPage; diff --git a/website/src/theme/DocPage/styles.module.css b/website/src/theme/DocPage/styles.module.css deleted file mode 100644 index 778a2bbbe1b0..000000000000 --- a/website/src/theme/DocPage/styles.module.css +++ /dev/null @@ -1,85 +0,0 @@ -/** - * Copyright (c) Facebook, Inc. and its affiliates. - * - * This source code is licensed under the MIT license found in the - * LICENSE file in the root directory of this source tree. 
- */ - -:root { - --doc-sidebar-width: 300px; - --doc-sidebar-hidden-width: 30px; -} - -:global(.docs-wrapper) { - display: flex; -} - -.docPage, -.docMainContainer { - display: flex; - width: 100%; -} - -.docSidebarContainer { - display: none; -} - -@media (min-width: 997px) { - .docMainContainer { - flex-grow: 1; - max-width: calc(100% - var(--doc-sidebar-width)); - } - - .docMainContainerEnhanced { - max-width: calc(100% - var(--doc-sidebar-hidden-width)); - } - - .docSidebarContainer { - display: block; - width: var(--doc-sidebar-width); - margin-top: calc(-1 * var(--ifm-navbar-height)); - border-right: 1px solid var(--ifm-toc-border-color); - will-change: width; - transition: width var(--ifm-transition-fast) ease; - clip-path: inset(0); - } - - .docSidebarContainerHidden { - width: var(--doc-sidebar-hidden-width); - cursor: pointer; - } - - .collapsedDocSidebar { - position: sticky; - top: 0; - height: 100%; - max-height: 100vh; - display: flex; - align-items: center; - justify-content: center; - transition: background-color var(--ifm-transition-fast) ease; - } - - .collapsedDocSidebar:hover, - .collapsedDocSidebar:focus { - background-color: var(--ifm-color-emphasis-200); - } - - .expandSidebarButtonIcon { - transform: rotate(0); - } - html[dir='rtl'] .expandSidebarButtonIcon { - transform: rotate(180deg); - } - - html[data-theme='dark'] .collapsedDocSidebar:hover, - html[data-theme='dark'] .collapsedDocSidebar:focus { - background-color: var(--collapse-button-bg-color-dark); - } - - .docItemWrapperEnhanced { - max-width: calc( - var(--ifm-container-width) + var(--doc-sidebar-width) - ) !important; - } -} diff --git a/website/src/theme/Layout/index.js b/website/src/theme/Layout/index.js deleted file mode 100644 index b7e0ed4b39ed..000000000000 --- a/website/src/theme/Layout/index.js +++ /dev/null @@ -1,48 +0,0 @@ -/** - * Copyright (c) Facebook, Inc. and its affiliates. - * - * This source code is licensed under the MIT license found in the - * LICENSE file in the root directory of this source tree. - */ -import React from 'react'; -import clsx from 'clsx'; -import ErrorBoundary from '@docusaurus/ErrorBoundary'; -import SkipToContent from '@theme/SkipToContent'; -import AnnouncementBar from '@theme/AnnouncementBar'; -import Navbar from '@theme/Navbar'; -import Footer from '@theme/Footer'; -import LayoutProviders from '@theme/LayoutProviders'; -import LayoutHead from '@theme/LayoutHead'; -import useKeyboardNavigation from '@theme/hooks/useKeyboardNavigation'; -import {ThemeClassNames} from '@docusaurus/theme-common'; -import ErrorPageContent from '@theme/ErrorPageContent'; -import './styles.css'; - -function Layout(props) { - const {children, noFooter, wrapperClassName, pageClassName} = props; - useKeyboardNavigation(); - return ( - - - - - - - - - -
    - {children} -
    - - {!noFooter &&