1

I have a SAM file with an RX: field containing 12 bases separated in the middle by a - i.e. RX:Z:CTGTGC-TCGTAA

I want to remove the hyphen from this field, but I can't simply remove all hyphens from the whole file as the read names contain them, like 1713704_EP0004-T

Have mostly been trying tr, but this is just removing all hyphens from the file.:

tr -d '"-' < sample.fq.unaln.umi.sam > sample.fq.unaln.umi.re.sam

input is a large SAM file of >10,000,000 lines like this:

1902336-103-016_C1D1_1E-T:34    99  chr1    131341  36  146M    =   131376  182 GGACAGGGAGTGTTGACCCTGGGCGGCCCCCTGGAGCCACCTGCCCTGAAAGCCCAGGGCCCGCAACCCCACACACTTTGGGGCTGGTGGAACCTGGTAAAAGCTCACCTCCCACCATGGAGGAGGAGCCCTGGGCCCCTCAGGGG  NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN  MC:Z:147M   MD:Z:83T62cD:i:4    cE:f:0  PG:Z:bwa    RG:Z:A  MI:Z:34 NM:i:1  cM:i:3  MQ:i:36 UQ:i:45 AS:i:141    XS:i:136    RX:Z:CTGTGC-TCGTAA

Desired output (i.e. last field)

1902336-103-016_C1D1_1E-T:34    99  chr1    131341  36  146M    =   131376  182 GGACAGGGAGTGTTGACCCTGGGCGGCCCCCTGGAGCCACCTGCCCTGAAAGCCCAGGGCCCGCAACCCCACACACTTTGGGGCTGGTGGAACCTGGTAAAAGCTCACCTCCCACCATGGAGGAGGAGCCCTGGGCCCCTCAGGGG  NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN  MC:Z:147M   MD:Z:83T62cD:i:4    cE:f:0  PG:Z:bwa    RG:Z:A  MI:Z:34 NM:i:1  cM:i:3  MQ:i:36 UQ:i:45 AS:i:141    XS:i:136    RX:Z:CTGTGCTCGTAA

How do I solve this problem?

lgallagher
  • 51
  • 6
  • 1
    Posting some sample lines would be useful. – BlackPearl May 01 '19 at 14:48
  • 1
    Specifically, it's not clear what a "read name" is and where it occurs in an input line. – Benjamin W. May 01 '19 at 14:54
  • 1
    Please post a few lines from the file and post a sample output – sjsam May 01 '19 at 14:54
  • 1
    @sjsam Done, hope that's clear – lgallagher May 01 '19 at 15:00
  • It is not yet clear. Please post lines that need **NOT** be substituted. Do the last field follow a pattern? – sjsam May 01 '19 at 15:01
  • 1
    @sjsam The last field is what I want changed `RX:Z:CTGTGC-TCGTAA`. My issue is i can't just delete all '-' from the file because there are hyphens in the read name `1902336-103-016_C1D1_1E-T:34` i.e. the first field – lgallagher May 01 '19 at 15:04
  • "header"? Can you clarify? – Paul Hodges May 01 '19 at 15:12
  • 1
    Including mine three answers have been given to your question, try them and give us feedbacks please. – oguz ismail May 01 '19 at 15:36
  • I regret I didn't pay much attention to `10,000,000`. That is indeed a big file which is more than 5190MB considering your record length. Hope you have an SSD and some decent CPU. Do let me know the results. – sjsam May 01 '19 at 15:55
  • You may accept the answer if it helped – sjsam May 01 '19 at 16:25
  • 1
    Why do you have a SAM file in the first place? You should **never** have SAM files, only BAM files. Once you have the BAM file, a correct solution is relatively easy using e.g. htslib and a few lines of C code. The answer you’ve currently accepted, by contrast, is wrong. – Konrad Rudolph May 09 '19 at 10:10
  • @KonradRudolph I had a sam file because I needed to make these edits. I admit I'm not experienced with pysam or any bam parser so I converted my bam to sam to help solve my problem. These in-place edits do work so I wouldn't call them 'wrong', just not optimal. I've ended up applying a regex to the problem which is a much safer way of doing it. I'm trying to learn pysam now, as my knowledge of C or perl is 0. – lgallagher May 09 '19 at 15:27
  • @lgallagher I called the accepted answer wrong because it will replace `-` *everywhere*, not just in your selected tag. That’s very wrong. – Konrad Rudolph May 09 '19 at 16:03
  • @KonradRudolph I've updated my question with a pysam solution. Happy to take feedback on it – lgallagher May 10 '19 at 11:49
  • @lgallagher please post this as an answer instead (self-answers are encouraged!) – Konrad Rudolph May 10 '19 at 12:52
  • @KonradRudolph Ah sorry - getting to grips with stackoverflow still – lgallagher May 10 '19 at 12:56

4 Answers4

4

awk

awk '{sub(/-/,"",$NF)}1' file

is what you need.

Explanation

  • From this it is clear that you're concerned only about the last field.
  • NF is the total number of fields that a record contains, hence $NF is the last field.
  • sub(/-/,"",$NF) replaces the - in the last field with an empty string, making the change persistent.

GNU sed

For this same reason,

sed -Ei 's/^(.*)-/\1/' file

will work. It has an added advantage that it can perform an inplace edit.

Explanation

  • The -E option enables the extended regular expression engine.
  • The (.*) is a greedy search that will match any character(.) any number of times(*). For the fact that is greedy it will match anything up to the last hyphen.
  • The () makes sed remember what was matched.
  • In the substitution part, we put just the matched part \1 (1 because we having only one pair of parenthesis, note that you can have as many as you like) without the hyphen, thus effectively removing it from the last field where it should occur.

Note : The GNU awk support -i inplace, but I'm not sure from which version on.

sjsam
  • 21,411
  • 5
  • 55
  • 102
  • Is that `1` a typo? – Paul Hodges May 01 '19 at 15:10
  • 1
    @PaulHodges : No, a `1` simply prints a record. Being idiomatic :-) – sjsam May 01 '19 at 15:12
  • Interesting. *Why* does it print a record? Still learning `awk`. I'd've just explicitly added a `print`. :) – Paul Hodges May 01 '19 at 15:15
  • @Paul it's a shorthand for `1{print}`, when a condition evaluates to true/1, it's action is executed. If a condition doesn't have an action, `{print}` is the default. – oguz ismail May 01 '19 at 15:17
  • @PaulHodges : `1` is the most basic command that awk understands. It just prints a record. Try `awk '1' file` :) – sjsam May 01 '19 at 15:18
  • 1
    Just surprised me to see it at the end of the code block, but then I guess technically it's at the beginning of the implied `{print}`, eh? – Paul Hodges May 01 '19 at 15:19
  • 2
    You don't want to do this on all lines of a 10M file – Inian May 01 '19 at 15:22
  • @Inian. In that case, he can try my `sed` solution. :-) I am not claiming it will be very fast anyways. But it may match up to some talarmade `python` solution. May it not? – sjsam May 01 '19 at 15:25
  • Your sed solution is pretty much exactly what I need. The size of my files make it quite hard to find a 'quick' solution. I can run this on a hpc node, so that should help. Thank you for all your help – lgallagher May 01 '19 at 16:12
  • 1
    wrt `[sed] has an added advantage that it can perform an inplace edit.` - you're using GNU sed to get `-i` and you can use GNU awk to get the equivalent `-i inplace`. That particular awk command would change the white space in the rest of the output which may not be desirable though. – Ed Morton May 01 '19 at 18:55
  • 1
    @EdMorton : Thank you Updating the answer. I have gawk 4.02 which doesn't have the `-i` option in the manual. Not sure when it was introduced. – sjsam May 01 '19 at 19:12
  • `-i` does not mean the same thing in gawk as in sed. In sed `-i` means `inplace` but in gawk `-i` means `include` and `inplace` is the library to include so the equivalent of `sed -i` is `awk -i inplace`. `-i` arrived in gawk 4.1, I'm not sure if the `inplace` library arrived in the same release or later. We're at gawk 5.0 now so it's been a while either way! – Ed Morton May 01 '19 at 19:16
  • @EdMorton I am a bit surprised my CentOS 7 still runs on an older `awk`. It doesn't have the `-i` option either. – sjsam May 01 '19 at 19:25
  • 1
    That's too bad. If you'd care to see the history of when/why/how `-i inplace` came to be, it all started in a mailing list long, long ago on a server far, far away... https://lists.gnu.org/archive/html/bug-gawk/2012-12/msg00048.html – Ed Morton May 01 '19 at 19:29
  • Careful with this answer, it’s wrong: It replaces *all* hyphens in the last column. But in actual SAM files that column contains multiple fields of structured data, and the replacement should only happen in one of those fields, not any other. – Konrad Rudolph May 09 '19 at 10:09
  • @KonradRudolph From [this](https://stackoverflow.com/questions/55938014/remove-character-from-the-middle-of-a-string/55938360?noredirect=1#comment98754727_55938014) I understand that the op has not given a proper sample for us to work on. The `awk` would replace only the first hyphen. But as I humbly acknowledge,these are not the right tools for the job. – sjsam May 09 '19 at 12:40
  • @sjsam Yes, the OP left many assumptions unstated which you only know about if you’ve got expertise in bioinformatics. – Konrad Rudolph May 09 '19 at 12:52
2

I've solved this problem using pysam which is faster, safer and requires less disk space as a sam file is not required. It's not perfect, I'm still learning python and have used pysam for half a day.

import pysam
import sys
from re import sub

# Provide a bam file
if len(sys.argv) == 2:
    assert sys.argv[1].endswith('.bam')

# Makes output filehandle
inbamfn = sys.argv[1]
outbamfn = sub('.bam$', '.fixRX.bam', inbamfn)

inbam = pysam.Samfile(inbamfn, 'rb')
outbam = pysam.Samfile(outbamfn, 'wb', template=inbam)

# Counters for reads processed and written
n = 0
w = 0

# .get_tag() retrieves RX tag from each read
for read in inbam.fetch(until_eof=True):
    n += 1
    umi = read.get_tag('RX')
    assert umi is not None
    umifix = umi[:6] + umi[7:]
    read.set_tag('RX', umifix, value_type='Z')
    if '-' in umifix:
        print('Hyphen found in UMI:', umifix, read)
        break
    else:
        w += 1
        outbam.write(read)

inbam.close()
outbam.close()

print ('Processed', n, 'reads:\n',
       w, 'UMIs written.\n',
       str(int((w / n) * 100)) + '% of UMIs fixed')

lgallagher
  • 51
  • 6
1

The best solution is to work with BAM rather than SAM files, and to use a proper BAM parser/writer library, such as htslib.

Lacking that, you can cobble something together by searching for the regular expression ^RX:Z: in the optional tags (columns 12 and up).

Working with columns, while possible, is hard with sed. Instead, here’s how to do this in awk:

awk -F '[[:space:]]*' '{
    for (i = 12; i <= NF; i++) {
        if ($i ~ /^RX:Z:/) gsub("-", "", $i)
    }
}
1' file.sam

And here’s a roughly equivalent solution as a Perl “one-liner”:

perl -ape '
    for (@F[11..(scalar @F)]) {
        s/-//g if /^RX:Z:/;
    }
    $_ = join("\t", @F);
' file.sam

To perform the replacement in the original file, you can pass the option -i.bak to perl (this will create a backup file.sam.bak; if you don’t want the backup, omit the extension).

Konrad Rudolph
  • 530,221
  • 131
  • 937
  • 1,214
0

This pattern is on many records that you want to edit, and is always at the end of the line? If so -

sed -E 's/^(.*)(\s..:.:......)-(......\s*)$/\1\2\3/' < sample.fq.unaln.umi.sam > sample.fq.unaln.umi.re.sam
Paul Hodges
  • 13,382
  • 1
  • 17
  • 36