
ChunkStore: Iterator always returns an empty dataframe #384

Closed
davidwneary opened this issue Jun 29, 2017 · 10 comments

davidwneary commented Jun 29, 2017

Arctic Version

1.42.0

Arctic Store

ChunkStore

Platform and version

Python 3.5

Description of problem and/or code sample that reproduces the issue

I am using the DateChunker with chunk_size='D' and the data stored has a frequency of about 10 seconds.

for chunk in self.__library.iterator('key'):
    no_of_rows = len(chunk)

no_of_rows in the above code is always 0 and chunk is always an empty dataframe.

From debugging, this looks like it could be because get_chunk_ranges creates a list of date tuples like:

...
(b'2016-10-18', b'2016-10-18'),
(b'2016-10-19', b'2016-10-19')
...

so using c.to_range(chunk[0], chunk[1]) produces:

DateRange(start=datetime.datetime(2016, 10, 18, 0, 0), end=datetime.datetime(2016, 10, 18, 0, 0))

therefore, since the date range has zero width, no rows are returned.
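
To illustrate the effect (a minimal pandas-only sketch, not arctic's internal filtering code):

import pandas as pd

# 10-second data, like the stored series described above
idx = pd.date_range('2016-10-18 00:00:10', periods=5, freq='10S')
df = pd.DataFrame({'value': range(5)}, index=idx)

# what to_range(b'2016-10-18', b'2016-10-18') effectively produces
start = end = pd.Timestamp('2016-10-18')

# only a row stamped exactly at midnight could survive this filter
print(df[(df.index >= start) & (df.index <= end)])  # Empty DataFrame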

Is this a bug or am I doing something wrong?

Cheers
Dave

bmoscon commented Jun 29, 2017

It's pretty hard to tell what you're doing from the code sample you provided.

'D' chunking results in chunks that comprise all data for the given day. So the chunk range is correct: the start of the chunk is Oct 18, 2016, and that's also the end of the chunk, since it's a daily chunk.

I'm not sure what __library is. I assume it's a ChunkStore instance?

If you call the iterator method in ChunkStore, it returns a generator that should yield all the data for a given symbol, date ordered. Are you actually calling it with 'key'? The key is the symbol you used to write the data; I'm guessing you're just putting that in as a placeholder for the real symbol, but just checking.

Does get_chunk_ranges produce any meaningful output? From what you have, it looks like there is a ton of data for the symbol - it should return a tuple for every chunk actually in the database. What about when you do a read on the symbol - do you get all the data back?

Also, due to the nature of your last issue, I want to make sure this isn't on CosmosDB.

bmoscon commented Jun 29, 2017

It will be 100x easier for me to help you if you can produce a very small code / data sample that reproduces the issue as well.

Something like: create a small dataframe, write it to Arctic, then the other calls that reproduce the error.

bmoscon commented Jun 30, 2017

In case you haven't already seen it:

https://github.com/manahl/arctic/wiki/Chunkstore

@davidwneary
Copy link
Author

davidwneary commented Jun 30, 2017

Hi @bmoscon,

Sorry for the lack of clarity. I've written a small script that replicates the issue, running locally against MongoDB (not CosmosDB):

from datetime import datetime

import pandas
from arctic import Arctic, CHUNK_STORE
from pandas import DataFrame, DatetimeIndex

# Create store and library
store = Arctic('localhost', 'test')
store.initialize_library('lib', lib_type=CHUNK_STORE)
lib = store['lib']

# Create dataframe of time measurements taken every 6 hours
date_range = pandas.date_range(start=datetime(2017, 5, 1, 1), periods=8, freq='6H')

df = DataFrame(data={'something': [100, 200, 300, 400, 500, 600, 700, 800]},
               index=DatetimeIndex(date_range, name='date'))

# Write to database
lib.write('testkey', df, chunk_size='D')

# Iterate
for chunk in lib.iterator('testkey'):
    no_of_rows = len(chunk)
    print(no_of_rows)
    print(chunk)

# Read
print(lib.read('testkey'))

The output of the above script is:

0
Empty DataFrame
Columns: [something]
Index: []

0
Empty DataFrame
Columns: [something]
Index: []

date                    something
2017-05-01 01:00:00        100
2017-05-01 07:00:00        200
2017-05-01 13:00:00        300
2017-05-01 19:00:00        400
2017-05-02 01:00:00        500
2017-05-02 07:00:00        600
2017-05-02 13:00:00        700
2017-05-02 19:00:00        800

I would expect the output to be:

4
date                    something
2017-05-01 01:00:00        100
2017-05-01 07:00:00        200
2017-05-01 13:00:00        300
2017-05-01 19:00:00        400

4
date                    something
2017-05-02 01:00:00        500
2017-05-02 07:00:00        600
2017-05-02 13:00:00        700
2017-05-02 19:00:00        800


date                    something
2017-05-01 01:00:00        100
2017-05-01 07:00:00        200
2017-05-01 13:00:00        300
2017-05-01 19:00:00        400
2017-05-02 01:00:00        500
2017-05-02 07:00:00        600
2017-05-02 13:00:00        700
2017-05-02 19:00:00        800

The chunks variable inside the iterator() method is:

[
    (b'2017-05-01', b'2017-05-01'),
    (b'2017-05-02', b'2017-05-02')
]

This looks sensible; as you say, we're chunking by day. But then this produces ranges of:

[
    DateRange(start=datetime.datetime(2017, 5, 1, 0, 0), end=datetime.datetime(2017, 5, 1, 0, 0)), 
    DateRange(start=datetime.datetime(2017, 5, 2, 0, 0), end=datetime.datetime(2017, 5, 2, 0, 0))
]

The only timestamps that could possibly be in these date ranges are 2017-05-01T00:00:00 and 2017-05-02T00:00:00, respectively.

From looking closer, it looks like this could be fixed by simply passing filter_data=False into the following line:

yield self.read(symbol, chunk_range=c.to_range(chunk[0], chunk[1]))
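
i.e., something like this (a sketch of the proposed change, assuming read()'s existing filter_data flag skips the post-read date filtering so the chunk is returned whole):

# proposed: the chunk already contains exactly the rows for its date range,
# so skip the in-memory filtering that drops intraday timestamps
yield self.read(symbol, chunk_range=c.to_range(chunk[0], chunk[1]),
                filter_data=False)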

What do you think?

bmoscon commented Jun 30, 2017

Alright, let me take a look this evening. Thanks for sending the code that replicates it. I can use that to generate a test case as well for the unit tests.

bmoscon commented Jul 1, 2017

Oh OK, I see now that the issue is related to the times being part of the datetime index. It's nothing we ever tried or intended to work, but let me see if I can get it working while preserving the base case (no times).
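
(The "base case" here is a date-only index; a minimal sketch, reusing the imports from the repro script above: with midnight timestamps, every point lands exactly on its chunk's start == end boundary, so the existing filter keeps it.)

# one midnight-stamped point per day: survives a zero-width daily range
daily = pandas.date_range(start='2017-05-01', periods=2, freq='D')
df_daily = DataFrame({'something': [100, 500]},
                     index=DatetimeIndex(daily, name='date'))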

bmoscon added a commit that referenced this issue Jul 1, 2017
bmoscon added a commit that referenced this issue Jul 2, 2017
bmoscon commented Jul 2, 2017

Fixed and merged.

bmoscon closed this as completed Jul 2, 2017
davidwneary commented:

Thanks a lot!

Quick question just for my own understanding:

the issue is related to the times being part of the datetime index. Nothing we ever tried or intended to work

How were you using ChunkStore with no times stored in the datetime index? Is that just because you only had one data point per day? But in that case, what's the use case for ChunkStore with a 'day' chunk size, as it results in one document (or chunk) per data point?

Sorry for the questions, I just want to make sure I'm not misusing the library.

bmoscon commented Jul 4, 2017

I wouldn't really use daily, but there is a use case: imagine you have gigabytes of data per day. Ideally you want enough data in each chunk to get good compression out of it, but not so much that reads become expensive.
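
For example (a hypothetical sketch; 'ticks' and tick_df are placeholder names, and monthly is just one reasonable choice for high-frequency data):

# hypothetical: tick_df holds many intraday rows per day
lib.write('ticks', tick_df, chunk_size='M')  # monthly chunks compress well

# a targeted read only touches the chunks the requested range overlaps
df = lib.read('ticks', chunk_range=pandas.date_range('2017-05-01', '2017-05-31'))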

davidwneary commented:

I understand. Thanks a lot for your quick response and help!
