
ChunkStore: Iterator always returns an empty dataframe #384

Closed
davidwneary opened this issue Jun 29, 2017 · 10 comments

davidwneary commented Jun 29, 2017

Arctic Version

1.42.0

Arctic Store

ChunkStore

Platform and version

Python 3.5

Description of problem and/or code sample that reproduces the issue

I am using the DateChunker with chunk_size='D' and the data stored has a frequency of about 10 seconds.

for chunk in self.__library.iterator('key'):
    no_of_rows = len(chunk)

no_of_rows in the above code is always 0 and chunk is always an empty dataframe.

From debugging, this looks like it could be because get_chunk_ranges creates a list of date tuples like:

...
(b'2016-10-18', b'2016-10-18'),
(b'2016-10-19', b'2016-10-19')
...

so using c.to_range(chunk[0], chunk[1]) produces:

DateRange(start=datetime.datetime(2016, 10, 18, 0, 0), end=datetime.datetime(2016, 10, 18, 0, 0))

therefore, since the date range has zero width, no rows are returned.
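
To illustrate the effect (a minimal pandas-only sketch, not arctic's internal filtering code):

import pandas as pd

# 10-second data, like the stored series described above
idx = pd.date_range('2016-10-18 00:00:10', periods=5, freq='10S')
df = pd.DataFrame({'value': range(5)}, index=idx)

# what to_range(b'2016-10-18', b'2016-10-18') effectively produces
start = end = pd.Timestamp('2016-10-18')

# only a row stamped exactly at midnight could survive this filter
print(df[(df.index >= start) & (df.index <= end)])  # Empty DataFrame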

Is this a bug or am I doing something wrong?

Cheers
Dave

bmoscon commented Jun 29, 2017

It's pretty hard to tell what you're doing from the code sample you provided.

'D' chunking results in chunks that comprise all data for the given day. So the chunk range is correct: the start of the chunk is Oct 18, 2016, and that's also the end of the chunk, since it's a daily chunk.

I'm not sure what __library is. I assume it's a ChunkStore instance?

If you call the iterator method in ChunkStore, it returns a generator that should yield all the data for a given symbol, date ordered. Are you actually calling it with 'key'? The key is the symbol you used to write the data; I'm guessing you're just putting that in as a placeholder for the real symbol, but just checking.

Does get_chunk_ranges produce any meaningful output? From what you have, it looks like there is a ton of data for the symbol - it should return a tuple for every chunk actually in the database. What about when you do a read on the symbol - do you get all the data back?

Also, due to the nature of your last issue, I want to make sure this isn't on CosmosDB.

bmoscon commented Jun 29, 2017

It will be 100x easier for me to help you if you can produce a very small code / data sample that reproduces the issue as well.

Something like: create a small dataframe, write it to Arctic, then the other calls that reproduce the error.

bmoscon commented Jun 30, 2017

In case you haven't already seen it:

https://github.com/manahl/arctic/wiki/Chunkstore

@davidwneary
Copy link
Author

davidwneary commented Jun 30, 2017

Hi @bmoscon,

Sorry for the lack of clarity. I've written a small script that replicates the issue, running locally against MongoDB (not CosmosDB):

from datetime import datetime

import pandas
from arctic import Arctic, CHUNK_STORE
from pandas import DataFrame, DatetimeIndex

# Create store and library
store = Arctic('localhost', 'test')
store.initialize_library('lib', lib_type=CHUNK_STORE)
lib = store['lib']

# Create dataframe of time measurements taken every 6 hours
date_range = pandas.date_range(start=datetime(2017, 5, 1, 1), periods=8, freq='6H')

df = DataFrame(data={'something': [100, 200, 300, 400, 500, 600, 700, 800]},
               index=DatetimeIndex(date_range, name='date'))

# Write to database
lib.write('testkey', df, chunk_size='D')

# Iterate
for chunk in lib.iterator('testkey'):
    no_of_rows = len(chunk)
    print(no_of_rows)
    print(chunk)

# Read
print(lib.read('testkey'))

The output of the above script is:

0
Empty DataFrame
Columns: [something]
Index: []

0
Empty DataFrame
Columns: [something]
Index: []

date                    something
2017-05-01 01:00:00        100
2017-05-01 07:00:00        200
2017-05-01 13:00:00        300
2017-05-01 19:00:00        400
2017-05-02 01:00:00        500
2017-05-02 07:00:00        600
2017-05-02 13:00:00        700
2017-05-02 19:00:00        800

I would expect the output to be:

4
date                    something
2017-05-01 01:00:00        100
2017-05-01 07:00:00        200
2017-05-01 13:00:00        300
2017-05-01 19:00:00        400

4
date                    something
2017-05-02 01:00:00        500
2017-05-02 07:00:00        600
2017-05-02 13:00:00        700
2017-05-02 19:00:00        800


date                    something
2017-05-01 01:00:00        100
2017-05-01 07:00:00        200
2017-05-01 13:00:00        300
2017-05-01 19:00:00        400
2017-05-02 01:00:00        500
2017-05-02 07:00:00        600
2017-05-02 13:00:00        700
2017-05-02 19:00:00        800

The chunks variable inside the iterator() method is:

[
    (b'2017-05-01', b'2017-05-01'),
    (b'2017-05-02', b'2017-05-02')
]

This looks sensible; as you say, we're chunking by day. But then this produces ranges of:

[
    DateRange(start=datetime.datetime(2017, 5, 1, 0, 0), end=datetime.datetime(2017, 5, 1, 0, 0)), 
    DateRange(start=datetime.datetime(2017, 5, 2, 0, 0), end=datetime.datetime(2017, 5, 2, 0, 0))
]

The only timestamps that could possibly be in these date ranges are 2017-05-01T00:00:00 and 2017-05-02T00:00:00, respectively.

From looking closer, it looks like this could be fixed by simply passing filter_data=False into the following line:

yield self.read(symbol, chunk_range=c.to_range(chunk[0], chunk[1]))
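
i.e., something like this (a sketch of the proposed change, assuming read()'s existing filter_data flag skips the post-read date filtering so the chunk is returned whole):

# proposed: the chunk already contains exactly the rows for its date range,
# so skip the in-memory filtering that drops intraday timestamps
yield self.read(symbol, chunk_range=c.to_range(chunk[0], chunk[1]),
                filter_data=False)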

What do you think?

bmoscon commented Jun 30, 2017

Alright, let me take a look this evening. Thanks for sending the code that replicates it. I can use that to generate a test case as well for the unit tests.

bmoscon commented Jul 1, 2017

Oh OK, I see now that the issue is related to the times being part of the datetime index. It's nothing we ever tried or intended to work, but let me see if I can get it working while preserving the base case (no times).
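
(The "base case" here is a date-only index; a minimal sketch, reusing the imports from the repro script above: with midnight timestamps, every point lands exactly on its chunk's start == end boundary, so the existing filter keeps it.)

# one midnight-stamped point per day: survives a zero-width daily range
daily = pandas.date_range(start='2017-05-01', periods=2, freq='D')
df_daily = DataFrame({'something': [100, 500]},
                     index=DatetimeIndex(daily, name='date'))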

bmoscon added a commit that referenced this issue Jul 1, 2017
bmoscon added a commit that referenced this issue Jul 2, 2017
bmoscon commented Jul 2, 2017

Fixed and merged.

bmoscon closed this as completed Jul 2, 2017
davidwneary commented:

Thanks a lot!

Quick question just for my own understanding:

the issue is related to the times being part of the datetime index. Nothing we ever tried or intended to work

How were you using ChunkStore with no times stored in the datetime index? Is that just because you only had one data point per day? But in that case, what's the use case for ChunkStore with a 'day' chunk size, as it results in one document (or chunk) per data point?

Sorry for the questions, I just want to make sure I'm not misusing the library.

bmoscon commented Jul 4, 2017

I wouldn't really use daily, but there is a use case: imagine you have gigabytes of data per day. Ideally you want enough data in each chunk to get good compression out of it, but not so much that reads become expensive.
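
For example (a hypothetical sketch; 'ticks' and tick_df are placeholder names, and monthly is just one reasonable choice for high-frequency data):

# hypothetical: tick_df holds many intraday rows per day
lib.write('ticks', tick_df, chunk_size='M')  # monthly chunks compress well

# a targeted read only touches the chunks the requested range overlaps
df = lib.read('ticks', chunk_range=pandas.date_range('2017-05-01', '2017-05-31'))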

davidwneary commented:

I understand. Thanks a lot for your quick response and help!
