Q: How to add a missing column to DataView? #5967

Closed
torronen opened this issue Oct 11, 2021 · 8 comments
Comments

@torronen
Contributor

torronen commented Oct 11, 2021

I am loading data into an IDataView using TextLoader with column inference. My dataset is missing some columns (Boolean, in my case) that the ML.NET model expects. I would like to fill such a column with all 0's or false's and then run the prediction.

I can only find how to fill in missing values based on another column. In this case, I don't have any other column to copy and extend with the missing values.

I can find a DropColumns method, but no AddColumn method. Is there some way to add constant-valued columns to an IDataView, with transformers or otherwise?

@torronen
Contributor Author

I might be misunderstanding something, but I find it difficult to locate documentation on how to add a column.

I think adding columns with custom or static content is important, because there might be cases where a) we don't have all the data in a new dataset but still want to reuse and evaluate an existing model, or b) we want to evaluate the impact of some column. I believe this could be done with a custom mapping, but as I understand it, that requires explicitly declaring classes first. That makes working with CSV files a bit slower.

I am trying to dynamically add columns of 0's whenever a model requires a column that is missing from the CSV file.

@torronen
Contributor Author

Fix: use DataFrame from the Microsoft.Data.Analysis NuGet package. It could perhaps be covered more in the docs, as it took a bit of effort to find.

Below is my first hack; suggestions for improvement are appreciated.

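    // Requires: using System.Linq; using Microsoft.ML; using Microsoft.Data.Analysis;
    // Convert once, outside the loop. Passing -1 copies every row;
    // ToDataFrame() without arguments keeps only the first 100 rows.
    DataFrame df = combinedDataView.ToDataFrame(-1);
    foreach (var c in modelSchema)
    {
        // Add a Boolean column for each model column missing from the loaded data.
        if (!combinedDataView.Schema.Any(x => x.Name == c.Name))
        {
            DataFrameColumn dfCol = new BooleanDataFrameColumn(c.Name, df.Rows.Count);
            df.Columns.Add(dfCol);
        }
    }
    // DataFrame implements IDataView, so it can be assigned back directly.
    combinedDataView = df;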

@torronen
Contributor Author

torronen commented Oct 13, 2021

ToDataFrame() can be confusing because its default row limit is 100. It might be better to rename it to ToDataFramePreview, or to set the default value to -1.


I reopened this issue for review. I should be able to continue with this tomorrow, but I think it is harder than it should be and could be made easier in a future version (or I might be missing something). .ToDataFrame(-1) is also slow, which is probably why the default is 100. Could it be that DataFrame is also slower than the default implementation of IDataView? My full simulations became roughly 5x to 10x slower. It might be a bigger problem with large datasets that do not fit in RAM.

Transforms for appending columns would be better, if there is a way to do it for dynamic feature names. All alternatives are appreciated.

This issue can be closed after review.

@torronen torronen reopened this Oct 13, 2021
@torronen torronen changed the title from "Q: How to add a missing column to DataFrame?" to "Q: How to add a missing column to DataView?" Oct 13, 2021
@LittleLittleCloud
Contributor

LittleLittleCloud commented Oct 13, 2021

@pgovind for notification

You should be able to add a column much like you add a new key to a dictionary:

df["newColumn"] = DataFrameColumn.Create("test", Enumerable.Range(0, (int)df.Columns.First().Length)); // the length must match the other columns

Or you can add it to df.Columns:

df.Columns.Add(DataFrameColumn.Create("test", Enumerable.Range(0, (int)df.Columns.First().Length)));

@torronen
Contributor Author

torronen commented Oct 13, 2021

Thanks. If TextLoader returned a DataFrame, that would make it simpler. Currently it returns an IDataView, which does not seem to expose methods or properties for adding columns.

Long-term, a transform to append a column with static values would be nice (similar to the one for dropping columns).

var textLoader = mlContext.Data.CreateTextLoader(columnInference.TextLoaderOptions);
// Load returns an IDataView
IDataView combinedDataView = textLoader.Load(new MultiFileSource(predictionDatasetPath));

@eerhardt
Member

One option might be to use a CustomMapping which just always adds the false column.

See https://docs.microsoft.com/dotnet/api/microsoft.ml.custommappingcatalog.custommapping and https://www.youtube.com/watch?v=TEnQp5qtopo for how to create one.
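
For illustration, the CustomMapping idea above could look roughly like the sketch below. It is not taken from the linked docs or video. The types `NoInput` and `ConstantFlag` and the helper `AppendFalseColumn` are hypothetical names, and the sketch assumes that an empty input class is accepted when the mapping reads no existing columns. The output member is renamed through a `SchemaDefinition`, so the column name can be chosen at run time instead of being fixed in the class declaration.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical input type: empty, because the mapping reads no existing columns.
// (Assumption: CustomMapping accepts an input class without members.)
public class NoInput { }

// Hypothetical output type: a single Boolean member, renamed at run time below.
public class ConstantFlag
{
    public bool Value { get; set; }
}

public static class MissingColumnSketch
{
    // Returns an estimator that appends a Boolean column called columnName
    // whose value is false for every row.
    public static IEstimator<ITransformer> AppendFalseColumn(MLContext mlContext, string columnName)
    {
        // Map the output member "Value" to the dynamic column name.
        var outputSchema = SchemaDefinition.Create(typeof(ConstantFlag));
        outputSchema["Value"].ColumnName = columnName;

        return mlContext.Transforms.CustomMapping<NoInput, ConstantFlag>(
            (input, output) => output.Value = false,
            contractName: null, // null is fine only if the pipeline is never saved
            outputSchemaDefinition: outputSchema);
    }
}
```

A possible use would be `AppendFalseColumn(mlContext, "SomeMissingColumn").Fit(data).Transform(data)`, once per missing column. Because the contract name is null, a pipeline containing this transform cannot be saved and reloaded as-is.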

@michaelgsharp
Member

In the Microsoft.Data.Analysis package you can use DataFrame.LoadCsv to load a file directly to a DataFrame instead of an IDataView.
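
A minimal sketch of that route, under some assumptions: the helper name `LoadWithMissingColumns` and its parameters are made up for illustration, the file is assumed to have a header row and comma separators (the LoadCsv defaults), and the missing columns are assumed to be Boolean as in the original question. The column types that LoadCsv infers may differ from those the model expects, so they are worth checking before prediction.

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.Data.Analysis;
using Microsoft.ML;

public static class LoadCsvSketch
{
    // Loads a CSV straight into a DataFrame and appends an all-false Boolean
    // column for every expected name that the file does not contain.
    public static IDataView LoadWithMissingColumns(
        string csvPath, IEnumerable<string> expectedBooleanColumns)
    {
        DataFrame df = DataFrame.LoadCsv(csvPath);

        foreach (string name in expectedBooleanColumns)
        {
            if (df.Columns.Any(column => column.Name == name))
            {
                continue; // already present in the file
            }

            // Explicit values, so every row holds false.
            // The cast assumes the row count fits in an int.
            IEnumerable<bool> values = Enumerable.Repeat(false, (int)df.Rows.Count);
            df.Columns.Add(new BooleanDataFrameColumn(name, values));
        }

        // DataFrame implements IDataView, so it can be passed to a model's Transform.
        return df;
    }
}
```

This avoids the IDataView-to-DataFrame conversion entirely, at the cost of holding the whole file in memory.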

@michaelgsharp
Member

@torronen I'm going to close this for now, as it seems you have the answer you need. If you have more questions, feel free to reopen as needed.

@ghost ghost locked as resolved and limited conversation to collaborators Mar 17, 2022