
Dataset train/eva/test partitions #2

Open
michaeltrs opened this issue Mar 2, 2020 · 1 comment

michaeltrs commented Mar 2, 2020

Hi,

For the provided dataset, I noticed there are more data saved on disk than the total across the partitions listed in the tileids folder. For example, for the 48x48-pixel data there are 28515 .tfrecord.gz files in total, while eval.tileids, train_fold*.tileids, and test_fold*.tileids collectively contain 10494 samples per year. That leaves 28515 - 2*10494 = 7527 samples that are not assigned to train/eval/test for 2016 and 2017.
Is anything wrong in the above description? If not, how should we treat the unassigned data?
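For reference, the counting described above can be reproduced with a short script. The directory layout in the glob patterns is my assumption and should be adjusted to the actual dataset layout:

```python
# Sketch: count tfrecord files on disk vs. ids listed in the tileids files.
# The paths below are assumptions about the dataset layout.
from glob import glob


def count_partition_ids(tileid_files):
    """Count the ids listed across a set of *.tileids files."""
    total = 0
    for path in tileid_files:
        with open(path) as f:
            total += sum(1 for line in f if line.strip())
    return total


n_records = len(glob("data/48x48/**/*.tfrecord.gz", recursive=True))
n_ids = count_partition_ids(glob("data/48x48/tileids/*.tileids"))

# Two years of data, so each tileid corresponds to two tfrecord files;
# with the numbers above this would be 28515 - 2*10494 = 7527.
unassigned = n_records - 2 * n_ids
```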

Many thanks,
Michael

MarcCoru (Owner) commented Mar 23, 2020

Hi Michael,

Thanks for your issue and your patience.

The tileids files are used for the results in the paper. All results are obtained from the tiles of tileids/eval.tileids.

The number of tfrecord files can differ from the tileids in the data splits due to two effects: 1) data preprocessing failed for some tiles (the tileid is then listed in failedtiles201*.txt), and 2) some tiles lie in the margin region between the train/valid/eval blocks, as shown in Figure 4 of the paper.

Overall the preprocessing chain looked like this:

a) for each tile within the AOI: crop the images and store them to a tfrecord; on error, add the id to failedtiles201*.txt

b) separate the area of interest into train/valid/eval blocks with a margin, and store the ids of the tiles that lie within the respective blocks into the tileids folder.

Since b) defines the split, not all tiles that were processed in a) will be used by the training script, so the number of tfrecord files and tileids can differ.
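The two effects above can be illustrated with a toy example (the tile ids and counts are invented for illustration, not taken from the dataset):

```python
# Toy illustration of why steps a) and b) yield different counts.
aoi_tiles = {f"tile{i:03d}" for i in range(100)}  # all tiles in the AOI

# a) preprocessing: some tiles fail and would be logged to failedtiles201*.txt
failed = {"tile003", "tile042"}
on_disk = aoi_tiles - failed          # tfrecord files actually written

# b) block split: tiles in the margin between blocks are assigned to no split
margin = {"tile010", "tile011", "tile012"}
assigned = on_disk - margin           # ids that end up in the tileids folder

assert len(on_disk) == 98
assert len(assigned) == 95            # fewer ids than tfrecord files on disk
```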

We decided to separate the tileids from the actual data samples to allow for different folds and for experiments with different data splits, similar to what we did in the CVPR paper (east vs. west, block sizes). In the end, we did not include these experiments in the IJGI paper.

I hope this clarifies things.
We quantitatively evaluated the models on the eval.tileids of the 24px by 24px tiles. These are the data tiles on which you could compare your method directly with the results of the paper.
