This tools aims to provide simple solution to save memory when using pandas' data frame. It is highly inspired by this kaggle post.
Important
The very basic principle : for each column, this tool reduces int and float precision as much as possible so that
- Approximate method
method='approx'
: no duplicated values appear and the minimum and maximum can be re-encoded - Exact method
method='exact'
: preserve absolute information by testing every value.
For object data type, the function is trying to create category.
pip install dataframe-memory
from data_memory import reduce_memory
import numpy as np
import pandas as pd
df = pd.DataFrame(
np.array(
[[1, 2, "aaa"],
[4, 5, "bbb"],
[7, 8, "ccc"]] * 10000),
columns=['a', 'b', 'c'])
reduce_memory(df, method="exact", verbose=True)
Yields the following decrease of memory
Memory usage input: 5.04 MB
Memory usage output: 0.09 MB
Decreased by: 98.28 %
df.info()
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 a 30000 non-null category
1 b 30000 non-null category
2 c 30000 non-null category
Warning
In method='approx'
,
- This tool destroys information and should not be applied automatically to any dataframe but big ones
- It preserves relative but not absolute information