
I have a really large file with data that I'm trying to import into a database. Some of the data contains newline characters where there should be none, which breaks the import.

I'm using a Ruby script to do a lot of manipulation and cleaning to get this file, and a bunch of others, ready for import.

I was able to solve this problem with sed using this:

sed -i '.original' -e ':a' -e 'N' -e '$!ba' -e 's/Oversight Bd\n/Oversight Bd/g' -e 's/Sciences\n/Sciences/g' combined_old_individual.txt

However, I can't call that command from inside a Ruby script, because Ruby interprets the newline characters itself and the command won't run. sed needs the non-escaped newline character, but a system command called from Ruby is passed as a string, in which the newline character needs to be escaped.

I also tried doing this with Ruby's File class, but it's not working either:

File.open("combined_old_individual.txt", "r") do |f|
  File.open("combined_old_individual_new.txt","w") do |new_file|
    to_combine = nil
    f.each_line do |line|
      if(/Oversight Bd$/ =~ line || /Sciences$/ =~ line)
        to_combine = line
      else
        if to_combine.nil?
          new_file.puts line
        else
          combined_line = to_combine + line
          new_file.puts combined_line
          to_combine = nil
        end
      end
    end
  end
end

Any ideas on how to join lines where the first line ends with "Bd" or "Sciences", from within a Ruby script, would be very helpful.

Here's an example of what might go in a testfile.txt:

random line
Oversight Bd
should be on the same line as the above, but isn't
last line

and the result should be

random line
Oversight Bdshould be on the same line as the above, but isn't
last line
Solomon
  • You give us no samples of the input data or your desired output? Don't ask us to cobble up our own samples, otherwise the output probably won't match what you want. "This question appears to be off-topic because it lacks sufficient information to diagnose the problem. Describe your problem in more detail or include [a minimal example](http://stackoverflow.com/help/mcve) in the question itself." – the Tin Man Jan 20 '14 at 17:40
  • the Tin Man is right: an example input/output would be great. – Carlos Agarie Jan 20 '14 at 17:46
  • Hey Tin Man and agarie, I added an example with some input and output. – Solomon Jan 20 '14 at 17:48
  • I don't know `ruby`, but it looks like you can print a line without a newline using the `print` method instead of `puts`: http://stackoverflow.com/questions/8723120/how-to-print-something-without-a-new-line-in-ruby, http://stackoverflow.com/questions/5080644/how-can-i-use-puts-to-the-console-without-a-line-break-in-ruby-on-rails – Digital Trauma Jan 20 '14 at 19:27
  • sed does not need a non-escaped newline character, you *do* give an escaped newline character to sed. The problem is that sed works line by line and you cannot match for `some_pattern\n` directly but have to use the `N` command after matching `some_pattern` to get the newline and the next line in the buffer. – wich Jan 22 '14 at 08:44

3 Answers


With Ruby (my first attempt at a Ruby answer):

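File.open("combined_old_individual.txt", "r") do |f|
  File.open("combined_old_individual_new.txt", "w") do |new_file|
    f.each_line do |line|
      if(/(Oversight Bd|Sciences)$/ =~ line)
        # Broken line: write it without its trailing newline so the next
        # line lands on the same output line. chomp (unlike strip) leaves
        # any leading whitespace in the data untouched.
        new_file.print line.chomp
      else
        new_file.puts line
      end
    end
  end
end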
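
An alternative sketch, assuming the file fits comfortably in memory: read it whole and delete the newline that follows either marker with a single `gsub`. The helper name `join_broken_lines` is hypothetical and not from the question; for a really large file the line-by-line version above is probably the safer choice.

```ruby
# Hypothetical helper: removes the newline that directly follows
# "Oversight Bd" or "Sciences", so the next line is appended to it.
def join_broken_lines(text)
  # \r? also tolerates Windows line endings (an assumption about the data).
  text.gsub(/(Oversight Bd|Sciences)\r?\n/, '\1')
end

sample = "random line\nOversight Bd\nshould be on the same line as the above, but isn't\nlast line\n"
puts join_broken_lines(sample)
```

To apply it to the real file, something like `File.write("combined_old_individual_new.txt", join_broken_lines(File.read("combined_old_individual.txt")))` should do.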
Digital Trauma
  • Did you mean `print` instead of `printf`? Also, the comparison can be shortened a bit: `if line =~ /(Sciences|Oversight Bd)$/` – Wayne Conrad Jan 21 '14 at 13:49
  • @WayneConrad - yes, these are good suggestions. `printf` (and `write`) both work in this case, but `print` is probably more accurate. – Digital Trauma Jan 22 '14 at 03:28

You have to realize that sed normally works line by line, so you cannot match \n in your initial pattern. You can, however, match the pattern on the first line, pull in the next line with the N command, and then run the substitute command on the buffer to remove the newline, like so:

sed -i -e '/Oversight Bd/ {;N;s/\n//;}' /your/file

Run from Ruby (without -i so that the output goes to stdout):

> cat test_text
aaa
bbb
ccc
aaa
bbb
ccc
> cat test.rb
cmd="sed -e '/aaa/ {;N;s/\\n//;}' test_text"
system(cmd)
> ruby test.rb
aaabbb
ccc
aaabbb
ccc
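
The escaping rule can be checked in Ruby without running sed at all. The sketch below is an illustration rather than part of the original answer: a single-quoted Ruby literal passes `\n` through as a backslash followed by `n`, which is what sed expects, while a double-quoted literal needs `\\n` to produce the same text. The helper `sed_join_command` is a hypothetical name; passing its result to `system` as separate arguments skips the shell, so no shell quoting is involved.

```ruby
# In single quotes Ruby keeps \n as two characters: a backslash and an "n".
single = '/aaa/ {;N;s/\n//;}'
# In double quotes the backslash must be doubled to get the same text.
double = "/aaa/ {;N;s/\\n//;}"
puts single == double # prints true: both hold the same sed script

# Hypothetical helper: builds an argument list for Kernel#system.
def sed_join_command(pattern, file)
  ["sed", "-e", "/#{pattern}/ {;N;s/\\n//;}", file]
end

cmd = sed_join_command("Oversight Bd", "combined_old_individual.txt")
p cmd
# system(*cmd) # uncomment to run sed; the multi-argument form bypasses the shell
```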
wich
  • Hi wich, this would work straight from the command line using `sed -i '.original' -e ':a' -e 'N' -e '$!ba' -e 's/Oversight Bd\n/Oversight Bd/g' -e 's/Sciences\n/Sciences/g' combined_old_individual.txt`, however, I couldn't run it from inside a ruby script because ruby escaped the newline character. – Solomon Jan 20 '14 at 17:56
  • Then just escape the backslash again – wich Jan 20 '14 at 17:58
  • @Solomon there was no problem with ruby messing up your newlines if you escape them correctly. The problem was that your original sed command was not correct, you cannot match for newline directly you have to use the `N` command as in my answer, then it will work fine. – wich Jan 22 '14 at 08:40

Since you are asking in bash, here is a pure-bash solution:

$ r="(Oversight Bd|Sciences)$"
$ while read -r; do printf "%s" "$REPLY"; [[ $REPLY =~ $r ]] || echo; done < combined_old_individual.txt 
random line
Oversight Bdshould be on the same line as the above, but isn't
last line
$ 
Digital Trauma