2

I would like to save my dataframe in a format that matches an existing txt file. I have a model trained on this txt file and now want to predict on new data, which needs to match the same format.

The target txt file looks like this (first 3 rows):

2 qid:0 0:0.4967141530112327 1:-0.1382643011711847 2:0.6476885381006925 3:1.523029856408025 4:-0.234153374723336 
1 qid:2 0:1.465648768921554 1:-0.2257763004865357 2:0.06752820468792384 3:-1.424748186213457 4:-0.5443827245251827 
2 qid:0 0:0.7384665799954104 1:0.1713682811899705 2:-0.1156482823882405 3:-0.3011036955892888 4:-1.478521990367427 

The first column is just an arbitrary integer (here the 2 and the 1). The second column is always qid joined by a colon to an integer. Each of the remaining columns is an integer index joined by a colon to a float.

My dataframe looks like this:

import pandas as pd

data = {'label': [2, 3, 2],
        'qid': ['qid:0', 'qid:1', 'qid:0'],
        '0': [0.4967, 0.4967, 0.4967],
        '1': [0.4967, 0.4967, 0.4967],
        '2': [0.4967, 0.4967, 0.4967],
        '3': [0.4967, 0.4967, 0.4967],
        '4': [0.4967, 0.4967, 0.4967]}

df = pd.DataFrame(data)
Tartaglia
  • I'm a little confused about the qid: in the text file you have two qid:0 and one qid:2, but in the data below you put qid:1 where the file has qid:2. Is that a typo? Can you be explicit about how this txt is generated, and whether you use the txt file to configure the model or to store model results? If it holds results, you could generate them before writing the txt and save them along with this data. – edd1 Apr 09 '23 at 16:59
  • Can you change the format of your `data=`, or does it have to stay like this and be processed later to match your txt file? Please also give some hints about which values are columns and which are rows; providing different sample data would help us tell them apart. – Somen Das Apr 09 '23 at 18:45
  • The qid:1 and any other integer is not a typo. These appear like that in the original txt file that I am trying to emulate, so qid:0 and qid:1, qid:2 etc in all sorts of orders. – Tartaglia Apr 09 '23 at 20:01
  • Regarding the txt. The txt is required to be in this format for the model as an input. I am merely trying to match this format, so I can provide my own data as input. – Tartaglia Apr 09 '23 at 20:03
  • The data format can be changed. This is the format it is in right now, but if necessary, it could be changed. – Tartaglia Apr 09 '23 at 20:03

2 Answers

2

Try this and let us know if it works for your case:

data = pd.read_csv('output_list.txt', sep=" ", header=None)

data.columns = ["a", "b", "c", "etc."]


Updated code: this is very messy, but if it solves your problem it can be reworked to handle large amounts of data using NumPy array methods.

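# Prefix every feature value with its column name, e.g. 0.4967 -> "0:0.4967".
# The label and qid columns are left unchanged.
for col in list(data.keys()):
    if col not in ("label", "qid"):
        data[col] = [f"{col}:{value}" for value in data[col]]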
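The loop above only builds the "index:value" strings; it does not write anything yet. A minimal sketch of the remaining step, assuming the dataframe from the question and that a single space between fields is what the model expects (the output filename in the last comment is hypothetical):

```python
import pandas as pd

# Dataframe as given in the question.
df = pd.DataFrame({
    "label": [2, 3, 2],
    "qid": ["qid:0", "qid:1", "qid:0"],
    "0": [0.4967, 0.4967, 0.4967],
    "1": [0.4967, 0.4967, 0.4967],
    "2": [0.4967, 0.4967, 0.4967],
    "3": [0.4967, 0.4967, 0.4967],
    "4": [0.4967, 0.4967, 0.4967],
})

# Work on a copy so the numeric columns of df stay intact.
out = df.copy()
for col in out.columns:
    if col not in ("label", "qid"):
        out[col] = [f"{col}:{value}" for value in out[col]]

# Without a path, to_csv returns the text instead of writing a file.
text = out.to_csv(sep=" ", header=False, index=False)
print(text)

# To write a file instead (hypothetical filename):
# out.to_csv("new_data.txt", sep=" ", header=False, index=False)
```

Note that this writes each float with its default string representation, so the number of digits follows your data rather than the 16 or so digits seen in the original file.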


Somen Das
0

Since your data appears to be structured, you can process it manually:

data = []
with open('file.txt') as fp:
    for row in fp:
        arg0, *args = row.strip().split()
        d = {'rand': arg0}
        d.update(dict([arg.split(':') for arg in args]))
        data.append(d)

# You can use .apply(pd.to_numeric) if all of your columns are numeric
df = pd.DataFrame(data).apply(pd.to_numeric)

Output:

>>> df
   rand  qid         0         1         2         3         4
0     2    0  0.496714 -0.138264  0.647689  1.523030 -0.234153
1     1    2  1.465649 -0.225776  0.067528 -1.424748 -0.544383
2     2    0  0.738467  0.171368 -0.115648 -0.301104 -1.478522
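Since the question is about writing this format rather than reading it, the parse can also be reversed. The following is only a sketch under some assumptions: the two sample rows are made up for illustration, `format_row` is a hypothetical helper, and the column names `rand` and `qid` are the ones produced by the parsing code above.

```python
import io
import pandas as pd

# Made-up sample in the same layout as the target file.
sample = "2 qid:0 0:0.5 1:-0.25\n1 qid:2 0:1.5 1:0.75\n"

# Parse it the same way as above, reading from a string instead of a file.
rows = []
for line in io.StringIO(sample):
    first, *pairs = line.split()
    record = {"rand": first}
    record.update(dict([pair.split(":") for pair in pairs]))
    rows.append(record)
parsed = pd.DataFrame(rows).apply(pd.to_numeric)

def format_row(row):
    """Turn one parsed row back into a 'label qid:N idx:value ...' line."""
    features = " ".join(
        f"{name}:{row[name]}" for name in row.index if name not in ("rand", "qid")
    )
    # iterrows() upcasts the integer columns to float, so cast them back.
    return f"{int(row['rand'])} qid:{int(row['qid'])} {features}"

lines = [format_row(row) for _, row in parsed.iterrows()]
print("\n".join(lines))
```

Comparing `lines` against the original text is a cheap way to check that the writer matches the file the model was trained on before applying it to new data.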
Corralien