I have a huge txt file (over 90 000 lines) that I want to read as a single column in a pandas df. A specific symbol marks the end of each line/row, i.e. ‡.

So far, I have tried : df = pd.read_csv(fic, sep='\t', lineterminator='‡', header = None, encoding="utf-8")

The output is indeed a df, but its first row runs all the way to line 3 932 of the file, as if the first ‡ occurred there. That is not the case: there are many (> 2 000) ‡ before that line.

The desired output would be something like :

Index Text_initial
1 Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Ut enim ad minim veniam. ‡
2 Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Sunt in culpa qui officia deserunt mollit anim id est laborum. ‡

etc...
and not something like :

Index Text_initial
1 Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Ut enim ad minim veniam.
Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Sunt in culpa qui officia deserunt mollit anim id est laborum.
Lorem donec massa sapien faucibus et molestie ac feugiat. Volutpat sed cras ornare arcu dui vivamus arcu. Cras sed felis eget velit aliquet sagittis id consectetur.
Viverra vitae congue eu consequat ac felis. Consectetur adipiscing elit duis tristique sollicitudin. Sem et tortor consequat id.
Sed blandit libero volutpat sed cras ornare arcu dui vivamus. Fermentum odio eu feugiat pretium nibh ipsum consequat. Consequat mauris nunc congue nisi vitae suscipit tellus mauris.
Morbi non arcu risus quis varius quam quisque id. Velit egestas dui id ornare.
......
2 Pharetra magna ac placerat vestibulum lectus. ‡
Nec feugiat nisl pretium fusce id velit ut. ‡
Amet justo donec enim diam vulputate ut pharetra. ‡
Nibh venenatis cras sed felis eget velit aliquet sagittis id. ‡

etc...

The ‡ symbol exists only to mark the end of the line/row and the file is encoded in utf-8.

Any suggestions on how to read the txt file into a df correctly, and on why the current output is not "correct"?

Example from the real file :

PART I Item 1.
Business General 3D Systems Corporation (3D Systems or the Company or we or us) is a holding company incorporated in Delaware in 1993 that markets our products and services through subsidiaries in North America and South America (collectively referred to as Americas) , Europe and the Middle East (collectively referred to as EMEA) and the Asia Pacific region (APAC) .
We provide comprehensive 3D printing solutions, including 3Dprinters, materials, software, on demand manufacturing services and digital design tools.
Our solutions support advanced applications in a wide range of industries and key verticals including healthcare, aerospace, automotive and durable goods.
Our precision healthcare capabilities include simulation, Virtual Surgical Planning (VSP) , and printing of medical and dental devices, anatomical models, and surgical guides and instruments. 
We have over 30 years of experience and expertise which have proven vital to our development of end-to-end solutions that enable customers to optimize product designs, transform workflows, bring innovative products to market and drive new business models.
Customers can use our 3D solutions to design and manufacture complex and unique parts, eliminate expensivetooling, produce parts locally or in small batches and reduce lead times and time to market. ‡ 
A growing number of customers are shifting from prototyping applications to also using 3D printing for production.
We believe this shift will be further driven by our continued advancement and innovation of 3D printing solutions that improve durability, repeatability, productivity and total cost of operations. ‡
Marrluxia
  • Can you show us the first lines of that file as an example? Does each line really end (in the file!) with `‡`? Is there a `‡\n` or `‡\n\r` at the end of each line in that file? Or are there lines in that file that can have multiple `‡` in them? – buhtz Jul 22 '22 at 07:09
  • @buhtz Every cell contains multiline "values" which are separated by a `\n`, and each row is terminated by a single `‡`. There are never multiple `‡` in the same line/row. I have updated my question with an excerpt from the "real" file. – Marrluxia Jul 22 '22 at 07:17
  • Please describe your excerpt of the real file. What do we see there? Btw: I would say this is not a CSV file. You have to read it as a raw text file and reformat it yourself. It might be better for the question if you constructed an example with less text in it. Just show us the possible combinations of multiline text, newlines and your separator. – buhtz Jul 22 '22 at 07:25
  • @buhtz I described it at the beginning of my question: "I have a huge txt file (over 90 000 lines) that I want to read as a single column in a pandas df". By multiline I mean that each cell in the desired output will also contain \n, as shown in the table example. – Marrluxia Jul 22 '22 at 07:28

1 Answer

Since lineterminator='‡' didn't seem to work for me, I am posting a "typical" workaround.

Reading the txt file,

# Pass the encoding explicitly so that ‡ is decoded correctly on every platform
with open('corpus.txt', encoding='utf-8') as f:
    corpus = f.read()

splitting it on my "separator" and then loading the resulting pieces into a df worked:

corpus_segm = corpus.split("‡")
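
For completeness, here is a minimal sketch of how the split pieces might be turned into the single-column df described in the question. The helper name `segments_to_frame` and the short `sample` string are made up for illustration; the column name `Text_initial` is taken from the desired output. It assumes that the ‡ itself does not need to stay in the cell and that empty pieces (such as the one after the last ‡) should be dropped.

```python
import pandas as pd


def segments_to_frame(text, sep="‡"):
    """Split text on sep and return a one-column DataFrame (hypothetical helper)."""
    # Strip the whitespace/newlines around each piece; inner "\n" are kept
    pieces = [piece.strip() for piece in text.split(sep)]
    # Drop empty pieces, e.g. whatever follows the final separator
    rows = [piece for piece in pieces if piece]
    return pd.DataFrame({"Text_initial": rows})


# Tiny stand-in for the real file; in practice `corpus` would be passed instead
sample = "Lorem ipsum.\nUt enim. ‡\nSed do eiusmod.\nSunt in culpa. ‡\n"
df = segments_to_frame(sample)
print(df.shape)
```

Each row keeps its inner line breaks, so a cell can still span several lines of the original file.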

P.S. I am still curious whether lineterminator could have worked. If someone has a suggestion, feel free to comment.

Marrluxia