1

I want to make a pandas DataFrame with the following columns:

my_cols = ['chrom', 'len_of_PIs']

and the following values in those columns:

chrom = pd.Series(['chr1', 'chr2', 'chr3'])
len_of_PIs = pd.Series([[np.random.randint(15, 59, 86)],
                    [np.random.randint(18, 55, 92)],
                    [np.random.randint(25, 61, 98)]])

I am expecting the output simply like:

chrom    len_PIs
chr1     49, 32, 30, 27, 52, 52,.....
chr2     27, 20, 40, 41, 44, 50,.....
chr3     35, 45, 56, 42, 58, 50,.....

where len_PIs can be a list or a str, so that I can easily do downstream analyses. But I am not getting the data as expected when I do:

new_df = pd.DataFrame()
new_df['chrom'] = chrom

# this code is giving me an output like
new_df['len_PIs'] = len_of_PIs.astype(str)

      chrom                                            len_PIs
0  chr1  [array([49, 32, 30, 27, 52, 52, 33, 51, 36, 47, 34, ...
1  chr2  [array([27, 20, 40, 41, 44, 50, 40, 34, 36, 33, 23, ...
2  chr3  [array([35, 45, 56, 42, 58, 50, 42, 27, 53, 57, 40, ...

# and each of the lines below gives me output like
new_df['len_PIs'] = len_of_PIs.as_matrix()
new_df.insert(loc=1, value=len_of_PIs.astype(list) , column='len_PIs')
new_df['len_PIs'] = pd.DataFrame(len_of_PIs, columns=['len_PIs'], index=len_of_PIs.index)

      chrom                                            len_PIs
0  chr1  [[49, 32, 30, 27, 52, 52, 33, 51, 36, 47, 34, ...
1  chr2  [[27, 20, 40, 41, 44, 50, 40, 34, 36, 33, 23, ...
2  chr3  [[35, 45, 56, 42, 58, 50, 42, 27, 53, 57, 40, ...

How can I fix this approach? If there is an alternative, more comprehensive method that starts from the column and data preparation, that would be nice too.

everestial007

3 Answers

2

I don't believe you need the inner lists in your len_of_PIs series. You may also find it convenient to instantiate your pd.DataFrame from a dictionary. The code below produces your desired output.

It's generally not good practice to convert numeric data to strings, unless you absolutely must, so I have kept your array data as numeric.

import pandas as pd, numpy as np

my_cols = ['chrom', 'len_of_PIs']

chrom = pd.Series(['chr1', 'chr2', 'chr3'])
len_of_PIs = pd.Series([np.random.randint(15, 59, 86),
                        np.random.randint(18, 55, 92),
                        np.random.randint(25, 61, 98)])

df = pd.DataFrame({'chrom': chrom,
                   'len_of_PIs': len_of_PIs},
                  columns=my_cols)

#   chrom                                         len_of_PIs
# 0  chr1  [17, 52, 48, 22, 27, 49, 26, 18, 46, 16, 22, 1...
# 1  chr2  [39, 52, 53, 29, 38, 51, 30, 44, 47, 49, 28, 4...
# 2  chr3  [46, 37, 46, 29, 49, 39, 56, 48, 29, 46, 28, 2...
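
To check that each cell really holds a flat numeric array and to see what downstream analysis might look like, here is a minimal sketch. The fixed seed and the column names n_PIs and mean_len are assumptions added for illustration; they are not part of the question.

```python
import numpy as np
import pandas as pd

# Fixed seed so the sketch is reproducible (an assumption, not in the original).
rng = np.random.RandomState(0)

chrom = pd.Series(['chr1', 'chr2', 'chr3'])
len_of_PIs = pd.Series([rng.randint(15, 59, 86),
                        rng.randint(18, 55, 92),
                        rng.randint(25, 61, 98)])

df = pd.DataFrame({'chrom': chrom, 'len_of_PIs': len_of_PIs})

# Each cell is a flat 1-D NumPy array, not a list wrapped in a list.
first = df.loc[0, 'len_of_PIs']
print(type(first).__name__, first.ndim)

# Per-row summaries work directly on the numeric arrays.
# 'n_PIs' and 'mean_len' are hypothetical column names.
df['n_PIs'] = df['len_of_PIs'].apply(len)
df['mean_len'] = df['len_of_PIs'].apply(lambda arr: float(arr.mean()))
print(df[['chrom', 'n_PIs', 'mean_len']])
```

Because the values stay numeric, no parsing step is needed before computing on them.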
jpp
  • 2
    This is so simple. Why am I overthinking and wasted an hour - Grr. Thank you – everestial007 Mar 11 '18 at 19:13
  • Without the inner list is fine. But your code is giving the same output I didn't want, like `[[58, 51, ... `. Would that matter? Is it a different pandas version? – everestial007 Mar 11 '18 at 19:18
  • 1
    What exactly are you worried about? The square brackets just represent that you have an array. They are not "real" square brackets. If you don't have good reason to, don't convert your numbers to strings. – jpp Mar 11 '18 at 19:21
  • Oh ok. I was concerned that I was getting list within list. Alright. worries over. – everestial007 Mar 11 '18 at 19:48
1

If you want strings, use a list comprehension that extracts the inner array, casts it to strings, and finally joins them:

chrom = pd.Series(['chr1', 'chr2', 'chr3'])

len_of_PIs = pd.Series([[np.random.randint(15, 59, 86)],
                    [np.random.randint(18, 55, 92)],
                    [np.random.randint(25, 61, 98)]])

a = [', '.join(x[0].astype(str)) for x in len_of_PIs]
df1 = pd.DataFrame({'len_PIs':a, 'chrom':chrom})
print (df1)
  chrom                                            len_PIs
0  chr1  57, 32, 44, 29, 38, 40, 19, 34, 24, 38, 42, 46...
1  chr2  19, 32, 36, 21, 44, 33, 53, 36, 21, 18, 43, 30...
2  chr3  27, 58, 60, 39, 54, 53, 32, 43, 33, 36, 60, 39...

And to get lists instead of nested lists, use a list comprehension or str[0]:

df1 = pd.DataFrame({'len_PIs':[x[0] for x in len_of_PIs], 'chrom':chrom})
#alternative solution
#df1 = pd.DataFrame({'len_PIs':len_of_PIs.str[0], 'chrom':chrom})
print (df1)
  chrom                                            len_PIs
0  chr1  [18, 42, 34, 31, 57, 49, 56, 28, 56, 40, 19, 5...
1  chr2  [48, 29, 23, 21, 54, 28, 23, 27, 44, 51, 18, 3...
2  chr3  [47, 53, 57, 26, 49, 39, 37, 41, 29, 36, 36, 5...
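
To see what the unwrapping changes, one can inspect a cell before and after. This is a minimal sketch; the names nested and flat are hypothetical and used only for illustration.

```python
import numpy as np
import pandas as pd

# Same construction as in the question: every array sits inside an extra list.
nested = pd.Series([[np.random.randint(15, 59, 86)],
                    [np.random.randint(18, 55, 92)],
                    [np.random.randint(25, 61, 98)]])

# Each cell is a one-element list whose only item is the array.
print(type(nested[0]).__name__, len(nested[0]))

# Taking element 0 of every cell drops the outer list and keeps the integers.
flat = pd.Series([cell[0] for cell in nested])
print(type(flat[0]).__name__, len(flat[0]))
```

The values remain integers throughout; only the extra level of nesting is removed.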
jezrael
  • Is it possible to have `57, 32, 44, 29,... ` as integer but not string. I know `[57, 32, 44, 29, ... ]` can be a `[list of integer]`. I am trying not to switch between str and integers. – everestial007 Mar 11 '18 at 19:26
  • @everestial007 - Unfortunately not. You can have either strings separated by `,` or lists; with lists it is possible to get ints. – jezrael Mar 11 '18 at 19:28
1

Note that 49, 32, 30 on its own is not how any Python type is displayed. If it is a list or tuple, it should have brackets or parentheses, like [49, 32, 30]; if it is a string, it should have quotes, like "49, 32, 30". A string, however, is printed without quotes, which gives exactly the look you want, but it would be very hard to work with later on. The following modification of jpp's code gives a result that looks exactly like your desired output; given that you will keep working with this DataFrame, though, you should stick with his answer.

import pandas as pd, numpy as np

my_cols = ['chrom', 'len_of_PIs']

chrom = pd.Series(['chr1', 'chr2', 'chr3'])
len_of_PIs = pd.Series([", ".join(np.random.randint(15, 59, 86).astype(str)),
                        ", ".join(np.random.randint(18, 55, 92).astype(str)),
                        ", ".join(np.random.randint(25, 61, 98).astype(str))])

df = pd.DataFrame({'chrom': chrom,
                   'len_of_PIs': len_of_PIs},
                  columns=my_cols)

print(df) prints:
  chrom                                         len_of_PIs
0  chr1  17, 37, 38, 25, 51, 39, 26, 24, 38, 44, 51, 21...
1  chr2  23, 33, 20, 48, 22, 45, 51, 45, 20, 39, 29, 25...
2  chr3  49, 42, 35, 46, 25, 52, 57, 39, 26, 29, 58, 26...

The difficulty of working with this result is as follows. Take the first row of the len_of_PIs column as an example. It has to be processed before it can be used as a collection of numbers:

[float(e) for e in df.len_of_PIs[0].split(", ")]

which is a pain. So, yeah, there you go.
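
If the string form is kept anyway, the same conversion has to be applied to every row before any numeric work. A minimal sketch, assuming the strings use ", " as the separator as above; the helper parse_ints, the column name len_of_PIs_parsed, and the shortened sample values are hypothetical and only for illustration:

```python
import pandas as pd

# Shortened, made-up sample rows in the same "n, n, n" string format.
df = pd.DataFrame({'chrom': ['chr1', 'chr2'],
                   'len_of_PIs': ['17, 37, 38', '23, 33, 20, 48']})

def parse_ints(text):
    """Turn a string such as '17, 37, 38' into a list of ints."""
    return [int(part) for part in text.split(', ')]

# Apply the parser row by row; each cell becomes a list of integers.
df['len_of_PIs_parsed'] = df['len_of_PIs'].apply(parse_ints)
print(df['len_of_PIs_parsed'].tolist())
```

This extra pass over the column is exactly the overhead that keeping the data numeric avoids.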

FatihAkici