
I am trying to read, with pandas, a csv file that doesn't contain a header line and has an indefinite number of columns.

I have searched for ways to work around this, but all the answers I have found require me to already know (by opening the file and checking) the maximum number of columns, and to pass a names= argument to read_csv, like this:

names = ["a", "b", "c", "d"]
table = pandas.read_csv('freqs.tsv', header=None, sep='\t+', names=names)
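
The closest I have come to generalizing this myself is a rough sketch like the one below (assuming the separators really are runs of tabs, as in the sample further down): make a first pass over the file to count the widest row, then build the names list from that. It seems to work on my data, but it reads the file twice and still feels like a workaround:

import re
import pandas

# Rough sketch, assuming runs of tabs as separators: find the widest row first,
# then generate enough column names automatically instead of hard-coding them.
with open('freqs.tsv') as f:
    max_cols = max(len(re.split(r'\t+', line.strip())) for line in f if line.strip())

names = [str(i) for i in range(max_cols)]
table = pandas.read_csv('freqs.tsv', header=None, sep='\t+', names=names,
                        engine='python')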

My question is: is there any way to do this without having to work out the maximum number of columns first? For future reusability of the script, I want to generalize it if possible.

Here is a sample text file I was using to run some tests:

mathematics         1.548438245 1.4661764369999999      1.429891562 
english                     1.237816576 1.043399455
physics         2.415563662 11.165497484000001  5.954598265 7.853732762999999   7.929835858
drama           2.0439384830000003  9.81210385  5.068332477 8.579349377 5.962282599999999
health                      1.557941553 1.222267933
science                         1.550193476
gym             1.240610831 1.149375944 1.899408195 1.3713249980000002

Thank you

Cristian
    That isn't a CSV file. There are no commas. But let's step back a moment. What are you going to DO with the data once it's in pandas? Pandas doesn't like variable length lines because you can't do generalized processing with them. Pandas is exactly like a database; each column is a field. This data isn't like that. It's ENTIRELY possible that pandas is not the right tool. – Tim Roberts Feb 27 '21 at 02:12
  • I would only add to the above comment: is there significance to the spacing in this file? That is, is the first column in English *supposed* to be empty? Or can we assume that the spacing is arbitrary? – Kraigolas Feb 27 '21 at 02:41
  • My mistake, I meant to say that it is reading a text file, and later on dumping the information into a csv file. The original intention was to later iterate through specific columns and do some calculations with each column (calculating the mean, to be specific). The spacing is arbitrary. The given text file has random amounts of `\t` separators. – Cristian Feb 27 '21 at 02:54

1 Answer


I get the following output:

              0        1        2        3        4
mathematics   1.54844  1.46618  1.42989  nan      nan
english       1.23782  1.0434   nan      nan      nan
physics       2.41556  11.1655  5.9546   7.85373  7.92984
drama         2.04394  9.8121   5.06833  8.57935  5.96228
health        1.55794  1.22227  nan      nan      nan
science       1.55019  nan      nan      nan      nan
gym           1.24061  1.14938  1.89941  1.37132  nan

By writing:

import pandas as pd

# Assume your data is in test.txt in the current working directory
with open("test.txt", "r") as f:
    # This assumes your spacing is arbitrary: split on any run of whitespace
    rows = [line.split() for line in f if line.strip()]

# The first entry of each row is the subject; the rest are parsed as floats
data = {row[0]: [float(item) for item in row[1:]] for row in rows}

# orient="index" allows us to handle entries of differing lengths
df = pd.DataFrame.from_dict(data, orient="index")

# This just provides the above table for printing on Stack Overflow
print(df.to_markdown())

Note that I've assumed the spacing in your file is arbitrary, so we don't need to track which columns are empty; we can simply split on whitespace and keep the values.

Also note that nan means "not a number" and is what you should expect to see in your dataframe if you have rows of differing lengths.

Finally, if you want the subjects as the columns, use df = df.transpose().
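
Since the comments mention that the end goal is to calculate means, here is a small sketch of how that might look with this dataframe (assuming one mean per subject is what you want; skipna=True is the default, so the nan padding is ignored):

# Rough sketch, assuming one mean per subject (i.e. per row) is what you want.
# skipna=True is the default, so the nan padding does not affect the result.
row_means = df.mean(axis=1, skipna=True)
print(row_means["physics"])          # mean of the five physics values

# After transposing so the subjects become columns, take the mean per column instead:
col_means = df.transpose().mean(axis=0)
print(col_means["english"])          # mean of the two english values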

Kraigolas