Problem with preprocessing of UCI datasets, especially MiniBooNE #8

VincentStimper · 2021-11-04T15:43:15Z

When doing density estimation on the UCI datasets HEPMASS and MiniBooNE, I saw in appendix D.2 of the article that several dimensions of the raw data were removed because certain real values recur too frequently. This makes sense to me: such densities would contain Dirac delta components, which are problematic to estimate with continuous densities. However, when I checked the code I stumbled upon the following lines:

maf/datasets/hepmass.py

Line 91 in ea057bf

max_count = np.array([v for k, v in sorted(c.iteritems())])[0]

maf/datasets/miniboone.py

Line 52 in ea057bf

# max_count = np.array([v for k, v in sorted(c.iteritems())])[0]

These lines appear intended to compute the maximum count over the distinct real values of a feature, but when I implemented this myself I found that they do not. The sorted function orders the (value, count) pairs by their first entry, which is the real value and not its count, so taking element [0] afterwards returns the count of the smallest value. I demonstrate this problem in the following notebook:
https://gist.github.com/VincentStimper/bed1aa10ac187dc51eefa85e683a7df4
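As a minimal sketch of the same point (a toy example, not taken from the notebook), the snippet below applies both computations to a small hand-made feature. The array `feature` and the counter `c` are assumptions for illustration, and `c.items()` stands in for the Python 2 `c.iteritems()` of the quoted line.

```python
from collections import Counter

import numpy as np

# Toy feature (hypothetical data): the smallest value -999.0 occurs once,
# while 0.5 occurs three times and 1.2 twice.
feature = np.array([-999.0, 0.5, 0.5, 0.5, 1.2, 1.2])
c = Counter(feature.tolist())

# Python 3 spelling of the quoted line. sorted() orders the (value, count)
# pairs by value, so [0] picks the count of the smallest value.
count_of_smallest = np.array([v for k, v in sorted(c.items())])[0]

# Maximum count over all distinct values.
max_count = np.max(np.unique(feature, return_counts=True)[1])

print(count_of_smallest)  # 1
print(max_count)          # 3
```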
It also showcases the consequences. For the HEPMASS dataset there is coincidentally no difference between the features that get dropped and the features that would be dropped when max_count is computed correctly, i.e. by using

max_count = np.max(np.unique(feature, return_counts=True)[1])

For MiniBooNE, on the other hand, some dimensions are dropped although their max_count is only moderately high, e.g. 6, while dimensions with values recurring 3434 times are kept.
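The effect on feature selection can be sketched on synthetic data. Everything here is an assumption for illustration: the helper names, the two hand-built columns, and the threshold of 5 are hypothetical and are not taken from the repository or the MiniBooNE data.

```python
import numpy as np


def count_of_smallest_value(column):
    # Mimics the quoted line: np.unique returns values in ascending order,
    # so counts[0] is the count of the smallest value.
    _, counts = np.unique(column, return_counts=True)
    return counts[0]


def true_max_count(column):
    # Count of the most frequent value in the column.
    return np.max(np.unique(column, return_counts=True)[1])


def features_to_drop(data, count_fn, threshold):
    # Indices of columns whose count statistic exceeds the threshold.
    return [i for i in range(data.shape[1]) if count_fn(data[:, i]) > threshold]


n = 1000

# Column 0: the smallest value (-1.0) occurs 6 times, all others are unique.
col0 = np.arange(n, dtype=float)
col0[:6] = -1.0

# Column 1: the smallest value (0.0) is unique, but 2000.0 occurs 500 times.
col1 = np.arange(n, dtype=float)
col1[1:501] = 2000.0

data = np.stack([col0, col1], axis=1)
threshold = 5  # hypothetical cut-off

print(features_to_drop(data, count_of_smallest_value, threshold))  # [0]
print(features_to_drop(data, true_max_count, threshold))           # [0, 1]
```

With the count of the smallest value as the statistic, the heavily repeated column 1 survives because its repeated value is not the smallest one.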

This might be a minor issue, but since the version of the MiniBooNE dataset you made publicly available has been used numerous times by others as a benchmark for density estimation, I think it requires our attention.

gpapamak · 2021-11-18T18:30:39Z

Great detective work and an excellent bug report, thanks Vincent!

Looks like you're right: max_count is the count of the smallest value, rather than the maximum count.

As you said, the datasets have now been used many times, and I'd like this code to be the source of truth for them, so at this stage I'm going to call this a feature and keep it as is, even if it means that MINIBOONE is more discrete than we previously thought. I agree it's probably a minor issue, but it's good to be aware of it. What I'm going to do is pin this issue so that it's clearly visible from now on.

Many thanks again for the great work!

gpapamak pinned this issue Nov 18, 2021

gpapamak added bug wontfix labels Nov 18, 2021

gpapamak closed this as completed Nov 18, 2021
