I'm using Windows 7 and Python 3.4.

I have several multi-line text files (all in Persian) and I want to merge them into one under one condition: each line of the output file must contain the whole text of each input file. It means if there are nine text files, the output text file must have only nine lines, each line containing the text of a single file. I wrote this:

import os
os.chdir ('C:\Dir')
with open ('test.txt', 'w', encoding = 'UTF8') as OutFile:
    with open ('news01.txt', 'r', encoding = 'UTF8') as InFile:
        while True:
            _Line = InFile.readline()
            if len (_Line) == 0:
                break
            else:
                _LineString = str (_Line)
                OutFile.write (_LineString)

It worked for that one file, but the text still takes up more than one line in the output file, and the output also contains stray character sequences such as &amp and &nbsp. The source files don't contain any of them. I also have some other texts: news02.txt, news03.txt, news04.txt ... news09.txt.

Considering all these:

  1. How can I correct my code so that it reads all the files one after another and puts each one on a single line?
  2. How can I clean out these strange characters, or prevent them from appearing in my final text?
the Tin Man
Vynylyn
  • The strange characters are almost certainly a unicode handling problem. Are you sure you're in UTF8? – aruisdante Feb 19 '15 at 19:10
  • How do you want to delimit each line read? If you read `Line 1\nLine2` how do you want that in the output file? – dawg Feb 19 '15 at 19:11
  • @Aruisdante: My files are all in Persian and I just GUESS they must be UTF8. I'm not sure just guess it by my experience. – Vynylyn Feb 19 '15 at 19:15
  • @Dawg: If it is like: Line1\nLine2\nLine3 I want it to be exactly Line1Line2Line3 – Vynylyn Feb 19 '15 at 19:17

2 Answers

Here is an example that will do the merging portion of your question:

def merge_file(infile, outfile, separator = ""):
    print(separator.join(line.strip("\n") for line in infile), file = outfile)


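def merge_files(paths, outpath, separator = ""):
    # Pass the encoding explicitly; the Windows default is usually not UTF-8
    with open(outpath, 'w', encoding = 'UTF8') as outfile:
        for path in paths:
            with open(path, encoding = 'UTF8') as infile:
                merge_file(infile, outfile, separator)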

Example use:

# Raw strings keep "\f" from being read as a form-feed escape
merge_files([r"C:\file1.txt", r"C:\file2.txt"], r"C:\output.txt")

Note that this makes the rather large assumption that the contents of 'infile' can fit into memory. That is reasonable for most text files, but possibly quite unreasonable otherwise. If your text files will be very large, you can use this alternative merge_file implementation:

def merge_file(infile, outfile, separator = ""):
    for line in infile:
        outfile.write(line.strip("\n")+separator)
    outfile.write("\n")

It's slower, but shouldn't run into memory problems.
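
For the nine files in the question, the pieces above could be combined roughly as follows. This is only a sketch: the helper name merge_news_files, the news*.txt pattern and the assumption that every input really is UTF-8 are illustrative and not confirmed by the question.

```python
import glob
import os


def merge_news_files(directory, outpath, pattern="news*.txt", separator=""):
    """Write each matching file in `directory` as one line of `outpath`.

    Hypothetical helper: assumes every input file is UTF-8 text.
    Returns the input paths in the order they were written.
    """
    # Sorting by name gives news01.txt ... news09.txt in numeric order
    # because the numbers are zero-padded.
    paths = sorted(glob.glob(os.path.join(directory, pattern)))
    with open(outpath, "w", encoding="utf-8") as outfile:
        for path in paths:
            with open(path, encoding="utf-8") as infile:
                # Strip the newline from every line, then glue the lines together
                text = separator.join(line.rstrip("\n") for line in infile)
            outfile.write(text + "\n")
    return paths
```

Keep the output file name outside the pattern (for example test.txt), otherwise a second run would merge the previous output as well.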

aruisdante
  • @ADante: Yay! It did the job successfully. I didn't go for removing those silly characters but merging action went well. Thank you. – Vynylyn Mar 09 '15 at 18:56

Answering question 1:

You were right about the UTF-8 part.
You probably want a function that takes the input files as a tuple (or as *args) whose items are either open file objects or path strings. It reads every input file and replaces each "\n" (newline) with a delimiter (default ""). Because all inputs are read before the output is opened, out_file can be one of in_files, but this assumes that the contents of all the files fit into memory. Both out_file and the items of in_files can be file objects instead of paths.

def write_from_files(out_file, in_files, delimiter="", dir=r"C:\Dir"):
    import io
    import os
    import html  # See part 2 of answer
    os.chdir(dir)
    output = []
    for file in in_files:
        file_ = file
        if not isinstance(file_, io.TextIOBase):
            file_ = open(file_, "r", -1, "UTF-8")  # If it isn't a file, make it a file
        file_.seek(0, 0)
        # Replace all newlines with delimiter
        output.append(file_.read().replace("\n", delimiter))
        file_.close()  # Close file to prevent IO errors
    if not isinstance(out_file, io.TextIOBase):
        out_file = open(out_file, "w", -1, "UTF-8")
    # html.unescape (Python 3.4+) replaces the deprecated HTMLParser().unescape
    merged = html.unescape("\n".join(output))
    out_file.write(merged)
    out_file.close()
    return merged  # Do not have to return

Answering question 2:

I think you may have copied the text from a webpage; this does not happen to me. The &amp and &nbsp are HTML entities for an ampersand (&) and a non-breaking space. You may need to replace them with their corresponding characters. I would use html.unescape (Python 3.4+; html.parser.HTMLParser().unescape does the same job but is deprecated). It turns HTML escape sequences into the Unicode characters they represent. E.g.:

>>> import html
>>> html.unescape("Alpha &lt β")
'Alpha < β'

This will not work in Python 2.x, where the function lives in the HTMLParser module (renamed to html.parser in 3.x). Instead, replace the unescape lines in the function with:

import HTMLParser
HTMLParser.HTMLParser().unescape("\n".join(output))
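
A minimal sketch of cleaning one merged line with html.unescape (available since Python 3.4). The helper name clean_line and the choice to turn non-breaking spaces into ordinary spaces are illustrative assumptions, not something the question asks for explicitly.

```python
import html


def clean_line(text):
    """Hypothetical helper: decode HTML entities, then normalise spaces."""
    decoded = html.unescape(text)          # "&amp;" -> "&", "&nbsp;" -> U+00A0
    return decoded.replace("\u00a0", " ")  # non-breaking space -> plain space
```

It could be applied to each merged line just before that line is written to the output file.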
  • I can't rate up your answer due to my low rank but I thank you here. Although it's kind of complicated for a beginner like me, it gave me some idea and point to work on. – Vynylyn Feb 19 '15 at 20:09