Problem with preprocessing of UCI datasets, especially MiniBooNE #8

Closed
VincentStimper opened this issue Nov 4, 2021 · 1 comment

@VincentStimper

While doing density estimation on the UCI datasets HEPMASS and MiniBooNE, I saw in Appendix D.2 of the article that several dimensions of the raw data were removed because certain real values recur too frequently. This makes sense to me: such densities would contain Dirac delta components, which are problematic to estimate with continuous densities. However, when I checked the code, I stumbled upon the following lines:

max_count = np.array([v for k, v in sorted(c.iteritems())])[0]

# max_count = np.array([v for k, v in sorted(c.iteritems())])[0]

They appear intended to compute the maximum over the counts of each real value, but when I implemented this myself, that turned out not to be the case. The sorted function orders the (value, count) pairs by their first entry, which is the real value and not its count, so taking element [0] returns the count of the smallest value. I demonstrate this problem in the following notebook:
https://gist.github.com/VincentStimper/bed1aa10ac187dc51eefa85e683a7df4
It also showcases the consequences. For the HEPMASS dataset, there is coincidentally no difference between the features that get dropped and those that would be dropped if max_count were computed correctly, i.e. by using

max_count = np.max(np.unique(feature, return_counts=True)[1])

For MiniBooNE, on the other hand, some dimensions are dropped although their max_count is only moderately high, e.g. 6, while dimensions with values recurring 3434 times are kept.
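To make the difference concrete, here is a minimal sketch on toy data (not the real datasets). It assumes Python 3, so `items()` replaces `iteritems()`; the helper names, the toy array, and the threshold of 2 are hypothetical and chosen only for illustration, not taken from the repository.

```python
from collections import Counter

import numpy as np


def max_count_original(feature):
    """Mirror the quoted line, with Python 3's items() instead of iteritems()."""
    c = Counter(feature.tolist())
    # sorted() orders the (value, count) pairs by value, so index 0 is the
    # count of the smallest value, not the largest count.
    return np.array([v for k, v in sorted(c.items())])[0]


def max_count_fixed(feature):
    """Largest number of times any single value occurs in the feature."""
    return np.max(np.unique(feature, return_counts=True)[1])


def features_to_drop(data, count_fn, threshold):
    """Indices of columns whose count statistic exceeds the threshold."""
    return [i for i, col in enumerate(data.T) if count_fn(col) > threshold]


# Toy data (hypothetical):
#   column 0: the smallest value 0.0 occurs 3 times, all other values once
#   column 1: the smallest value -1.0 occurs once, but 5.0 occurs 6 times
data = np.array([
    [0.0, -1.0],
    [0.0, 5.0],
    [0.0, 5.0],
    [1.5, 5.0],
    [2.5, 5.0],
    [3.5, 5.0],
    [4.5, 5.0],
])

threshold = 2  # hypothetical cut-off, for illustration only
dropped_original = features_to_drop(data, max_count_original, threshold)
dropped_fixed = features_to_drop(data, max_count_fixed, threshold)
print(dropped_original, dropped_fixed)
```

With the original rule, column 1 is kept even though one of its values occurs 6 times, while column 0 is dropped with a maximum count of only 3. This is the same kind of inconsistency as described above for MiniBooNE.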

This might be a minor issue, but since the version of the MiniBooNE dataset you made publicly available has been used numerous times by others as a benchmark for density estimation, I think it requires our attention.

@gpapamak gpapamak pinned this issue Nov 18, 2021
@gpapamak
Owner

Great detective work and an excellent bug report, thanks Vincent!

Looks like you're right: max_count is the count of the smallest value, rather than the maximum count.
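For anyone who wants to check this quickly, a small sketch on synthetic data (a hypothetical random feature, not the actual dataset; `items()` stands in for Python 2's `iteritems()`):

```python
from collections import Counter

import numpy as np

# Synthetic feature with many repeated values (illustrative only).
rng = np.random.default_rng(0)
feature = rng.integers(0, 10, size=1000).astype(float)

c = Counter(feature.tolist())

# The quoted expression: the count attached to the first pair after
# sorting the (value, count) pairs by value.
quoted = np.array([v for k, v in sorted(c.items())])[0]

# Count of the smallest value, computed directly.
count_of_min = int(np.sum(feature == feature.min()))

# True maximum count over all distinct values.
true_max = int(np.unique(feature, return_counts=True)[1].max())

assert quoted == count_of_min
assert true_max >= quoted
```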

As you said, the datasets have now been used many times, and I'd like this code to be the source of truth for them, so at this stage I'm going to call this a feature and keep it as is, even if it means that MINIBOONE is more discrete than we previously thought. I agree it's probably a minor issue, but it's good to be aware of it. What I'm going to do is pin this issue so that it's clearly visible from now on.

Many thanks again for the great work!
