
Losing data when inserting into a bitmap table via clickhouse-jdbc #641

Closed
victortony opened this issue Apr 26, 2021 · 8 comments · Fixed by #1248
victortony commented Apr 26, 2021

Hi, I have two tables in my system: dm_user_set_mapping_di and dm_user_set_mapping_di_tmp.

CREATE TABLE profile.dm_user_set_mapping_di (
    date UInt32,
    userset_key String,
    uv AggregateFunction(groupBitmap, UInt64)
) ENGINE = Distributed('perftest_3shards_1replicas', 'profile', 'dm_user_set_mapping_di_local', cityHash64(userset_key))

CREATE TABLE profile.dm_user_set_mapping_di_tmp (
    date UInt32,
    userset_key String,
    u_id UInt64
) ENGINE = Distributed('perftest_3shards_1replicas', 'profile', 'dm_user_set_mapping_di_tmp_local', intHash64(u_id))
First, my system loads data into "dm_user_set_mapping_di_tmp" via clickhouse-jdbc, and then executes the following SQL to load data into "dm_user_set_mapping_di". The problem is that "dm_user_set_mapping_di_tmp" is accurate, but "dm_user_set_mapping_di" loses data when loaded through clickhouse-jdbc. When I execute the same "insert" SQL in the clickhouse client, without clickhouse-jdbc, the data in "dm_user_set_mapping_di" is accurate.

insert into profile.dws_mifi_loan_user_profile_df (date, label_name, label_value, uv) select date, label_name, label_value, groupBitmapState(toUInt64(u_id)) as uv from profile.dws_mifi_loan_user_profile_df_tmp where date=20210423 group by date, label_name, label_value

I think maybe the data in the first table hasn't been fully written by the time the second table is loaded. Should I flush or refresh the first table before loading the second one, even if I create a new JDBC connection?
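[Editor's note: both tables above use the Distributed engine, which by default forwards insert blocks to the shards asynchronously, so a query issued immediately after a load can miss rows still queued on the initiator node. A minimal sketch of forcing the queued blocks out before running the aggregating insert - table names taken from the DDL above, and insert_distributed_sync is the alternative setting that makes inserts wait for all shards:]

```sql
-- Push any queued insert blocks from the Distributed table to the shards
-- before aggregating from it.
SYSTEM FLUSH DISTRIBUTED profile.dm_user_set_mapping_di_tmp;

-- Alternatively, make the original inserts wait for all shards up front:
SET insert_distributed_sync = 1;
```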

ClickHouse client version 19.16.3.6.
Clickhouse-jdbc version 0.2.6.


zhicwu commented Apr 26, 2021

@victortony, the SQL you provided is for populating DWS table dws_mifi_loan_user_profile_df, which is different from the dimension table dm_user_set_mapping_di.

If my understanding is correct, this is not related to the JDBC driver: assuming you're certain there's no data loss in the temp table, the insertion happens entirely on the server without involving the JDBC driver.

ClickHouse 19.16 and 20.3 are no longer supported, so maybe you should try 21.3 or at least 20.8? Moreover, if you upgrade the JDBC driver to 0.3.0, you no longer need the temp table, as you can insert a RoaringBitmap directly into the AggregateFunction column - please refer to the example here.


victortony commented Apr 28, 2021


Sorry, I wrote the wrong table names in the SQL. I've confirmed that when inserting via the clickhouse client the bitmap data is complete.
I upgraded to 21.3.8 as you suggested. My cluster has two nodes; after the upgrade one of them became extremely slow - even show tables takes half a minute - and inserting bitmaps still loses whole rows of data. I don't know what happened.
The logs keep reporting errors; the error message is as follows:
2021.04.28 10:54:35.954359 [ 364594 ] {} ServerErrorHandler: Code: 101, e.displayText() = DB::NetException: Unexpected packet from client, Stack trace (when copying this message, always include the lines below):

  0. DB::TCPHandler::receiveHello() @ 0xf657442 in /usr/bin/clickhouse
  1. DB::TCPHandler::runImpl() @ 0xf6503f9 in /usr/bin/clickhouse
  2. DB::TCPHandler::run() @ 0xf6633e9 in /usr/bin/clickhouse
  3. Poco::Net::TCPServerConnection::start() @ 0x11d1732f in /usr/bin/clickhouse
  4. Poco::Net::TCPServerDispatcher::run() @ 0x11d18d41 in /usr/bin/clickhouse
  5. Poco::PooledThread::run() @ 0x11e4f469 in /usr/bin/clickhouse
  6. Poco::ThreadImpl::runnableEntry(void*) @ 0x11e4b2ca in /usr/bin/clickhouse
  7. start_thread @ 0x7dc5 in /usr/lib64/libpthread-2.17.so
  8. clone @ 0xf773d in /usr/lib64/libc-2.17.so
    (version 21.3.8.76 (official build))


zhicwu commented Apr 28, 2021

I don't think it's related to the JDBC driver but to something else in your cluster. The script below worked for me on a standalone 21.3 server. Perhaps you can consult on Telegram/Slack, or create an issue here with detailed information such as steps to reproduce?

drop database if exists testbm;
create database testbm;

CREATE TABLE testbm.dm_user_set_mapping_di (
	date UInt32,upsert_key String, uv AggregateFunction(groupBitmap, UInt64)
) ENGINE = MergeTree() partition by date order by (date, upsert_key);
create table testbm.dm_user_set_mapping_di_tmp (
	date UInt32, upsert_key String, u_id UInt64
) ENGINE = MergeTree()
partition by date order by (date, upsert_key, intHash64(u_id))
sample by intHash64(u_id);

insert into testbm.dm_user_set_mapping_di_tmp
select 20210423, concat('key_', toString(rand64() % 2333)), number
from system.numbers limit 10000000;

insert into testbm.dm_user_set_mapping_di(date, upsert_key, uv)
select date, upsert_key, groupBitmapState(u_id)
from testbm.dm_user_set_mapping_di_tmp
where date=20210423
group by date, upsert_key;

select a.date, a.upsert_key,
	bitmapCardinality(a.uv) - ifnull(b.cnt, 0) as cnt_diff,
	b.bm is null ? 0 : bitmapHasAll(a.uv, b.bm) as has_all
from testbm.dm_user_set_mapping_di a
left join (
	select date, upsert_key, uniqExact(u_id) as cnt, groupBitmapState(u_id) as bm
	from testbm.dm_user_set_mapping_di_tmp
	group by date, upsert_key
	order by upsert_key
) b on a.date = b.date and a.upsert_key = b.upsert_key
where b.date is null or cnt_diff != 0 or has_all != 1
union all
select date, upsert_key, toInt64(bitmapCardinality(uv)) as cnt_diff, toUInt8(0) as has_all
from testbm.dm_user_set_mapping_di
where (date, upsert_key) not in (select distinct date, upsert_key from testbm.dm_user_set_mapping_di_tmp)

@victortony

Thank you very much.


victortony commented May 13, 2021


Hi, after upgrading to 21.3.8 the problem is still there. The row count in my tmp table is correct, but when the insert that generates the bitmaps is executed through JDBC the resulting counts are always far off; when I run it in the clickhouse client, not a single row is missing.
insert into profile.dws_mifi_loan_user_profile_df (date, label_name, label_value, uv) select date, label_name, label_value, groupBitmapState(toUInt64(u_id)) as uv from profile.dws_mifi_loan_user_profile_df_tmp where date=20210423 group by date, label_name, label_value

Code for importing the data:

/**
 * Create a connection.
 * @return
 */
def getConnection(): ClickHouseConnection = {
  val dataSource = new ClickHouseDataSource(CH_URL)
  dataSource.getConnection(CH_USERNAME, CH_PASS)
}

/**
 * Import data.
 * @param iterator
 * @param sqlTemplate
 * @param date
 */
def batchImportData(iterator: Iterator[Row],
                    sqlTemplate: String,
                    date: String) = {
  var index = 1
  val conn = getConnection()
  val psmt = conn.prepareStatement(sqlTemplate)
  while (iterator.hasNext) {
    val row: Row = iterator.next()
    var fieldIndex = 1
    row.schema.fields.foreach(field => {
      field.dataType match {
        case StringType =>
          psmt.setString(fieldIndex, row.getAs[String](field.name))
        case LongType =>
          psmt.setLong(fieldIndex, row.getAs[Long](field.name))
        case IntegerType =>
          psmt.setInt(fieldIndex, row.getAs[Int](field.name))
        case _ => println(s" other type: ${field.dataType}")
      }
      fieldIndex += 1
    })
    psmt.addBatch()
    if (index % BATCH_COUNT == 0) {
      psmt.executeBatch()
      psmt.clearBatch()
    }
    index = index + 1
  }
  // flush the last partial batch
  psmt.executeBatch()
  // psmt.clearBatch()
  // close the connection
  psmt.close()
  conn.close()
}

Code for generating the bitmap:

def buildBitmap(date: String): Unit = {
  val conn = getConnection()
  val stmt = conn.createStatement()
  val sql = BITMAP_GENERATION_SQL_TEMPLATE.format(date)
  stmt.executeQuery(sql)
  stmt.close()
  conn.close()
}

Main method:

val insertSql = buildPrepareSqlTemplate(df)
df.repartition(PARTITION_NUM).foreachPartition(iterator => {
  batchImportData(iterator, insertSql, date)
})
logger.info("begin to generate bitmap...")
buildBitmap(date)
logger.info("after generate bitmap...")


zhicwu commented May 13, 2021

Not sure if it's related to #655. Does the latest code on develop branch work for you?

@victortony

hi, is it merged into any release version?


zhicwu commented May 18, 2021

hi, is it merged into any release version?

Not yet. You'll need to check out the code from the develop branch and build a snapshot package by issuing mvn clean package.

@zhicwu zhicwu linked a pull request Feb 19, 2023 that will close this issue