Problem with preprocessing of UCI datasets, especially MiniBooNE #8
Great detective work and an excellent bug report, thanks Vincent! Looks like you're right: As you said, the datasets have now been used many times, and I'd like this code to be the source of truth for them, so at this stage I'm going to call this a feature and keep it as is, even if it means that MINIBOONE is more discrete than we previously thought. I agree it's probably a minor issue, but it's good to be aware of it. What I'm going to do is pin this issue so that it's clearly visible from now on. Many thanks again for the great work!
When doing density estimation on the UCI datasets HEPMASS and MiniBooNE, I saw in Appendix D.2 of the article that several dimensions of the raw data were removed because certain real values recur too frequently. This makes sense to me, since such densities would involve Dirac delta distributions, which are problematic to estimate with continuous densities. However, when I checked the code, I stumbled upon the following lines:
maf/datasets/hepmass.py, line 91 in ea057bf
maf/datasets/miniboone.py, line 52 in ea057bf
They seem to compute the maximum over the counts of each real value, but when I implemented it myself I found that this is not the case. The `sorted` function sorts the array by its first entry, which is the real value corresponding to the count and not the count itself. I demonstrate this problem in the following notebook: https://gist.github.com/VincentStimper/bed1aa10ac187dc51eefa85e683a7df4
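To make the sorting behaviour concrete, here is a minimal sketch on a toy feature. It does not reproduce the repository lines referenced above; the variable names and the data are invented for illustration, and it only assumes that the counts are held as (value, count) pairs, for example from `collections.Counter`.

```python
from collections import Counter

# Toy feature (invented): the value 0.5 repeats four times, the rest are unique.
feature = [0.1, 0.5, 0.5, 0.5, 0.5, 0.9]
counts = Counter(feature)

# Tuples compare element by element, so this orders the pairs by real value.
by_value = sorted(counts.items())

# Indexing into the value-sorted list returns the count attached to the
# smallest or largest value, not the largest count.
count_of_smallest_value = by_value[0][1]
count_of_largest_value = by_value[-1][1]

# The largest count has to be taken over the counts themselves.
max_count = max(counts.values())

print(by_value)  # [(0.1, 1), (0.5, 4), (0.9, 1)]
print(count_of_smallest_value, count_of_largest_value, max_count)  # 1 1 4
```

Neither end of the value-sorted list sees the four repetitions of 0.5, while `max(counts.values())` does.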
The notebook also showcases the consequences. For the HEPMASS dataset there is coincidentally no difference between the features that get dropped and the features that would be dropped when `max_count` is computed correctly, i.e. by taking the maximum over the counts themselves. For MiniBooNE, on the other hand, some dimensions are dropped although `max_count` is only moderately high, e.g. 6, while dimensions with values recurring 3434 times are kept.

This might be a minor issue, but since the version of the MiniBooNE dataset you made publicly available has been used numerous times by others as a benchmark for density estimation, I think it is an issue which requires our attention.
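The effect on feature selection can be sketched as follows. Everything here is hypothetical: the helper names, the synthetic columns, and the threshold of 5 are assumptions chosen for illustration, and `max_count_value_sorted` is only a stand-in for the kind of mistake described above, not the repository's code.

```python
from collections import Counter


def max_count_correct(column):
    """Largest number of times any single value occurs in the column."""
    return max(Counter(column).values())


def max_count_value_sorted(column):
    """Count attached to the smallest value (stand-in for the mistake)."""
    return sorted(Counter(column).items())[0][1]


def features_to_drop(columns, count_fn, threshold):
    """Indices of columns whose count_fn result exceeds the threshold."""
    return [i for i, col in enumerate(columns) if count_fn(col) > threshold]


# Synthetic columns, invented for illustration only.
columns = [
    [0.0] * 6 + [1.0, 2.0, 3.0],       # smallest value repeats 6 times
    [-1.0] + [0.5] * 50 + [2.0, 3.0],  # a mid-range value repeats 50 times
    [0.1, 0.2, 0.3, 0.4, 0.5],         # all values distinct
]

threshold = 5  # assumed cut-off, not taken from the repository

dropped_value_sorted = features_to_drop(columns, max_count_value_sorted, threshold)
dropped_correct = features_to_drop(columns, max_count_correct, threshold)

print(dropped_value_sorted)  # [0]
print(dropped_correct)       # [0, 1]
```

In this toy setup the value-sorted rule keeps the column whose mid-range value repeats 50 times, which mirrors the pattern reported for MiniBooNE: a heavily repeated value goes unnoticed whenever it is not the one the rule happens to look at.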