UPDATE: here is a memory-saving solution that converts all your strings to numerical categories:
In [13]: df
Out[13]:
c1 c2
0 stringa stringb
1 stringb stringc
2 stringd stringa
3 stringa stringb
4 stringb stringc
5 stringd stringa
6 stringa stringb
7 stringb stringc
8 stringd stringa
In [14]: x = (df.stack()
....: .astype('category')
....: .cat.rename_categories(np.arange(len(df.stack().unique())))
....: .unstack())
In [15]: x
Out[15]:
c1 c2
0 0 1
1 1 2
2 3 0
3 0 1
4 1 2
5 3 0
6 0 1
7 1 2
8 3 0
In [16]: x.dtypes
Out[16]:
c1 category
c2 category
dtype: object
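If you want plain integer columns rather than categories renamed to numbers, one possible alternative is pd.factorize. The following is only a sketch under assumptions: the names codes, uniques and encoded are made up for illustration, and the int8 downcast assumes fewer than 128 distinct strings across the whole frame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "c1": ["stringa", "stringb", "stringd"] * 3,
    "c2": ["stringb", "stringc", "stringa"] * 3,
})

# Factorize every cell at once so both columns share a single mapping.
# sort=True numbers the strings in sorted order (stringa -> 0, stringb -> 1, ...).
codes, uniques = pd.factorize(df.to_numpy().ravel(), sort=True)

# Assumption: fewer than 128 distinct values, so int8 is wide enough.
encoded = pd.DataFrame(
    codes.astype(np.int8).reshape(df.shape),
    index=df.index,
    columns=df.columns,
)

# The original strings can be recovered by indexing into `uniques`.
decoded = uniques[encoded.to_numpy()]
```

Here uniques acts as the lookup table, so it has to be kept around if you ever need the strings back.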
OLD answer:
I think you can categorize your columns:
In [63]: big.head(15)
Out[63]:
c1 c2
0 stringa stringb
1 stringb stringc
2 stringd stringa
3 stringa stringb
4 stringb stringc
5 stringd stringa
6 stringa stringb
7 stringb stringc
8 stringd stringa
9 stringa stringb
10 stringb stringc
11 stringd stringa
12 stringa stringb
13 stringb stringc
14 stringd stringa
In [64]: big.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000000 entries, 0 to 29999999
Data columns (total 2 columns):
c1 object
c2 object
dtypes: object(2)
memory usage: 457.8+ MB
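So the big DF has 30M rows and its size is approx. 460 MiB (the "+" in the info() output means the string data itself is not fully counted, so this is a lower bound).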
Let's categorize it:
In [65]: cat = big.apply(lambda x: x.astype('category'))
In [66]: cat.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000000 entries, 0 to 29999999
Data columns (total 2 columns):
c1 category
c2 category
dtypes: category(2)
memory usage: 57.2 MB
It now takes only 57 MiB and looks exactly the same:
In [69]: cat.head(15)
Out[69]:
c1 c2
0 stringa stringb
1 stringb stringc
2 stringd stringa
3 stringa stringb
4 stringb stringc
5 stringd stringa
6 stringa stringb
7 stringb stringc
8 stringd stringa
9 stringa stringb
10 stringb stringc
11 stringd stringa
12 stringa stringb
13 stringb stringc
14 stringd stringa
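The categorized frame prints the same because a categorical column keeps each distinct string once and stores a small integer code per row. A minimal sketch on a tiny Series (the name s is just for illustration):

```python
import pandas as pd

s = pd.Series(["stringa", "stringb", "stringd", "stringa"]).astype("category")

# The distinct strings are stored once, as the categories...
categories = list(s.cat.categories)

# ...and each row only holds an integer code pointing into them.
# With this few categories pandas picks int8 for the codes.
codes = s.cat.codes
```

With one byte per row and only a few distinct strings, 30M rows x 2 columns comes to about 57 MiB, which agrees with the info() output above.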
Let's compare its size with a similar numeric DF:
In [67]: df = pd.DataFrame(np.random.randint(0,5,(30000000,2)), columns=list('ab'))
In [68]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000000 entries, 0 to 29999999
Data columns (total 2 columns):
a int32
b int32
dtypes: int32(2)
memory usage: 228.9 MB
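To try the comparison yourself without allocating 30M rows, something along these lines should work. It is a sketch, not the exact session above: n is deliberately much smaller, the helper mib is hypothetical, and memory_usage(deep=True) is used so that the string data of the object columns is counted as well. The integer dtype of the random frame depends on your platform and NumPy version.

```python
import numpy as np
import pandas as pd

n = 300_000  # far fewer rows than the 30M above, to keep the sketch quick

base = pd.DataFrame({
    "c1": ["stringa", "stringb", "stringd"],
    "c2": ["stringb", "stringc", "stringa"],
})
big = pd.concat([base] * (n // 3), ignore_index=True)

# Each column is converted separately and gets its own set of categories.
cat = big.astype("category")

# A numeric frame of the same shape for comparison.
nums = pd.DataFrame(np.random.randint(0, 5, (n, 2)), columns=list("ab"))

def mib(frame):
    """Total memory of a frame in MiB, including the string payload."""
    return frame.memory_usage(deep=True).sum() / 2**20

sizes = {"strings": mib(big), "category": mib(cat), "numeric": mib(nums)}
```

Printing sizes should show the categorical frame as the smallest of the three, since its codes take one byte per value.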