TabularPandas data transform reproducibility #2826

Closed
Isaac-Flath opened this issue Sep 22, 2020 · 6 comments

Comments

@Isaac-Flath
Contributor

Isaac-Flath commented Sep 22, 2020

Feature requests should first be proposed on the forum.

Link to forum discussion.

Zach said it's useful enough to put in a PR without waiting for a forum discussion. He Ok'd me calling him out in this way in the PR :)

I did place this in the fastaiv2 tabular thread here: https://forums.fast.ai/t/fastai-v2-tabular/53530/235

Is your feature request related to a problem? Please describe.

I have trained a model using XGBoost, but I did the data processing for the training and validation sets using TabularPandas (similar to the approach in the fastai book). I did not use dataloaders or a learner object. Now I need to use the model for monthly inference, but the only way I can get the data to process properly is to have both a training and a validation set. I just want to apply the transforms to the validation set the same way each month. For example, I believe whether a _na column is created depends on whether the data given to it contains any null values. The order in which categories show up in the training set also matters for Categorify.
For inference, I just want to process the "validation set" and make my predictions.
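
To make the data dependence described above concrete, here is a minimal plain-Python sketch. It is not fastai code: `fit_state` is a hypothetical helper that mirrors the behaviour described in this issue (category codes assigned in order of appearance, a `_na` column only when nulls are present), which is not necessarily what fastai does internally.

```python
def fit_state(rows, cat_col, cont_col):
    """Hypothetical helper: derive transform state from whatever rows it is given."""
    # Category codes follow first-appearance order in the data (0 is kept free for unknowns).
    vocab = {}
    for r in rows:
        vocab.setdefault(r[cat_col], len(vocab) + 1)
    # A "<col>_na" indicator is only planned if the fit data contains a missing value.
    has_na = any(r[cont_col] is None for r in rows)
    return {"vocab": vocab, "add_na_col": has_na}

train = [{"color": "red", "size": 1.0}, {"color": "blue", "size": None}]
new_month = [{"color": "blue", "size": 2.0}, {"color": "red", "size": 3.0}]

state_train = fit_state(train, "color", "size")
state_new = fit_state(new_month, "color", "size")

# Fitting on the new month alone yields different codes and a different column layout.
print(state_train)
print(state_new)
```

Because the two states differ, a model trained on data encoded with `state_train` would receive mismatched inputs if the new month were fit on its own.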

Describe the solution you'd like

I would like a way to export the transform logic of a TabularPandas object, then import it whenever I want to process a dataframe into a TabularPandas object that can be used for inference.
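
As a rough sketch of what is being asked for (again plain Python, not the fastai API; `fit` and `transform` are hypothetical names and the in-memory buffer stands in for a file on disk), the fitted state could be saved once after training and then applied to new data without any training rows:

```python
import io
import pickle
from statistics import median

def fit(rows, cat_col, cont_col):
    """Learn category codes and a fill value from the training rows only."""
    vocab = {c: i + 1 for i, c in enumerate(sorted({r[cat_col] for r in rows}))}
    present = [r[cont_col] for r in rows if r[cont_col] is not None]
    return {"cat_col": cat_col, "cont_col": cont_col, "vocab": vocab,
            "fill": median(present), "add_na_col": len(present) < len(rows)}

def transform(state, rows):
    """Apply previously fitted state to new rows without refitting."""
    out = []
    for r in rows:
        v = r[state["cont_col"]]
        new = {state["cat_col"]: state["vocab"].get(r[state["cat_col"]], 0),  # 0 = unseen category
               state["cont_col"]: state["fill"] if v is None else v}
        if state["add_na_col"]:
            new[state["cont_col"] + "_na"] = v is None
        out.append(new)
    return out

train = [{"color": "red", "size": 1.0}, {"color": "blue", "size": None},
         {"color": "red", "size": 3.0}]
state = fit(train, "color", "size")

# "Export" once after training.
buf = io.BytesIO()
pickle.dump(state, buf)

# Later, at inference time: load the state and process only the new data.
buf.seek(0)
loaded = pickle.load(buf)
processed = transform(loaded, [{"color": "green", "size": 2.0},
                               {"color": "blue", "size": None}])
print(processed)
```

The point of the design is that only the small fitted state travels to production, so the column layout and category codes stay fixed regardless of what the new month's data contains.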

Describe alternatives you've considered

I have not been able to get my alternative to work without a pretty large training dataset being processed each time I want to do inference. Go-live for my project is coming up pretty quickly, so I am moving my project off of fastai due to this.

The workaround I was attempting was to have a static dummy training set that gets processed with any new data so that I have a 'training' and validation set. I was attempting to doctor the training set to ensure that the right columns have null values, the categories appear in the right order in the data, etc. Then I create the TabularPandas object and do inference on the validation set (new data). I have spent a good chunk of time trying to get this to work without a massive training set being reprocessed repeatedly, but I have to abandon those efforts for this project as I am on a time constraint.

Additional context

The example in the fastai tabular book where a RandomForest is being trained is a great example. Now, if you need to load the model up a month later to do inference only using the random forest - how would you process the new data?

Ideally, I would like to avoid dataloaders (as they don't give me anything for this problem). I would also like to avoid processing extra 'training' data, as I only want to do inference and it really shouldn't be necessary.

@muellerzr
Contributor

muellerzr commented Sep 22, 2020

@jph00 this'll be something I want to pick up and implement, as I could easily see its value when folks want to use fastai tabular for preprocessing, then hand off to other libraries and move into production. In my head I view it as something like to.export() that follows the same protocol as learn.export, just isolated to the TabularPandas level.

@marii-moe
Contributor

Think this one was completed in #2857

@muellerzr
Contributor

muellerzr commented Jun 24, 2021 via email

@jph00 jph00 closed this as completed Jun 25, 2021
@HenryDashwood

I think this feature may have been deprecated or never merged. The export method isn't there in the master branch. Is this not a good way to reapply the TabularPandas processes on new data in e.g. production?

@muellerzr
Contributor

@HenryDashwood I've done so on Walk with fastai, see here: https://walkwithfastai.com/tab.export (sadly the PR got lost over time, so it never quite got merged 😢 )

@HenryDashwood

Yeah that's what I've gone with. Seems to work!

5 participants