38

I don’t do this stuff for a living, so forgive me if it’s a simple question (or more complicated than I think). I’ve been digging through the archives and found a lot of tips that are close, but being a novice I’m not sure how to tweak them for my needs, or they are way beyond my understanding.

I have some large data files that I can parse out to generate a list of coordinates that are mostly sequential:

5
6
7
8
15
16
17
25
26
27

What I want is a list of the gaps

1-4
9-14
18-24

I don’t know perl, SQL or anything fancy but thought I might be able to do something that would subtract one number from the next. I could then at least grep the output where the difference was not 1 or -1 and work with that to get the gaps.

agc
Shaun
  • What do you mean *mostly sequential*? That could make a difference in some people's answers. – squiguy Apr 07 '13 at 20:45
  • Your approach should work - just use the first number to calculate the gap: n2 - n1 - 1 is the gap size, n1 + 1 is the first number in the gap and n1 + gap size is the second number. – mzedeler Apr 07 '13 at 20:45

7 Answers

81

With awk:

awk '$1!=p+1{print p+1"-"$1-1}{p=$1}' file.txt

explanations

  • $1 is the first column from current input line
  • p is the value of the first column from the previous line (it is unset, and so treated as 0, on the first line)
  • so ($1!=p+1) is a condition: if $1 differs from the previous value + 1, then:
  • this part is executed: {print p+1 "-" $1-1}: print the previous value + 1, the - character and the first column - 1
  • {p=$1} is executed for every line: p is assigned the current first column
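To see the rule in action, here is a minimal sketch that feeds the sample numbers from the question to the same one-liner on standard input instead of reading file.txt. The `find_gaps` function name is only an illustration, not part of the original answer.

```shell
# Hypothetical wrapper around the one-liner above; reads one number per line on stdin.
find_gaps() {
    # p starts out unset (treated as 0), so the first gap is counted from 1.
    awk '$1 != p + 1 { print p + 1 "-" $1 - 1 } { p = $1 }'
}

# Sample data from the question.
printf '%s\n' 5 6 7 8 15 16 17 25 26 27 | find_gaps
# prints:
# 1-4
# 9-14
# 18-24
```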
Gilles Quénot
  • Consider me an idiot when it comes to this type of stuff. I know a little bit but I'm really just a biologist trying to figure out how to crunch data. So to really help I need basic instructions, but I am trying to learn and I'll try to figure out the coding from anything suggested. I've used some awk before so I think I see what you're getting at, but I really don't understand all of the syntax. I'll play around with what you suggest and see what it does. I don't need an elegant solution. I just need something that works better than hours with Excel ;-) Shaun – Shaun Apr 07 '13 at 22:12
  • 1
    Wow that actually works ;-) Could hardly sleep last night waiting to get to work and give it a try. And thanks for the explanation. I'm almost starting to understand this stuff. – Shaun Apr 08 '13 at 13:47
  • 3
    Note that this beautiful one-liner will not work correctly if the source file begins with a header row (as is common for `csv` files) or a `0` value. This can be solved by skipping the first `X` rows of the file like so: `tail -n +X+1 'unique_items.csv' | awk '($1 != p + 1) {print p + 1 " - " $1 - 1} {p = $1}'`. For example, to skip both the header row AND a `0` value, one would use: `tail -n +3 'unique_items.csv' | awk '($1 != p + 1) {print p + 1 " - " $1 - 1} {p = $1}'`. – Priidu Neemre Dec 05 '14 at 10:55
  • Simple and functional. Great answer and explanation! – onalbi Feb 27 '15 at 08:02
  • 1
    Very handy +1! May I ask you (basic question coming) how does it know what `p` is equal to at the first iteration? It feels to me that at the first iteration the program checks `$1!=p+1` before defining p for the first time (`p=$1`). – Remi.b Jul 26 '16 at 19:59
  • If `p` was not defined, its value is `0` – MevatlaveKraspek Jul 27 '16 at 19:57
4

Interesting question.

sputnick's awk one-liner is nice. I cannot write a simpler one than his. I'll just add another way, using diff:

 seq $(tail -1 file)|diff - file|grep -Po '.*(?=d)'

the output with your example would be:

1,4
9,14
18,24

I know the output has a comma in it instead of a -. You could replace the grep with sed to get the -, since grep cannot change the input text, but the idea is the same.

Hope it helps.
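As a rough sketch of the sed variant mentioned above: the function below keeps the seq and diff steps and swaps the grep for sed, so the comma becomes a hyphen. The `gaps_via_diff` name is made up for illustration, and the sketch assumes diff's default ("normal") output format, where a deletion is reported as a line such as `1,4d0`, and a sorted file with one number per line.

```shell
# Hypothetical helper: print the gaps in the sorted number file "$1" as "first-last".
# Assumes diff's default output format, in which deletions look like "1,4d0".
gaps_via_diff() {
    seq "$(tail -n 1 "$1")" | diff - "$1" |
        # Keep only the "N,Md..." header lines, drop everything from the "d",
        # then turn the comma into a hyphen.
        sed -n '/^[0-9,]*d/{s/d.*//;s/,/-/;p;}'
}
```

Called as `gaps_via_diff file` on the example data, this should print 1-4, 9-14 and 18-24 on separate lines. A gap of a single number is reported by diff without a comma, so it would come out as just that number.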

Kent
3

A Ruby Answer

Perhaps someone else can give you the Bash or Awk solution you asked for. However, I think any shell-based answer is likely to be extremely localized for your data set, and not very extendable. Solving the problem in Ruby is fairly simple, and provides you with flexible formatting and more options for manipulating the data set in other ways down the road. YMMV.

#!/usr/bin/env ruby

# You could read from a file if you prefer,
# but this is your provided corpus. 
nums = [5, 6, 7, 8, 15, 16, 17, 25, 26, 27]

# Prepend zero so the gap before the first number is found too.
nums.unshift 0

# Create array of arrays containing missing numbers.
missing_nums = nums.each_cons(2).map do |array|
                 (array.first.succ...array.last).to_a unless
                  array.first.succ == array.last
               end.compact
# => [[1, 2, 3, 4], [9, 10, 11, 12, 13, 14], [18, 19, 20, 21, 22, 23, 24]]

# Format the results any way you want.
puts missing_nums.map { |ary| "#{ary.first}-#{ary.last}" }

Given your current corpus, this yields the following on standard output:

1-4
9-14
18-24

Todd A. Jacobs
2

Just remember the previous number and verify that the current one is the previous plus one:

#! /bin/bash
previous=0
while read n ; do
    if (( n != previous + 1 )) ; then
        echo $(( previous + 1 ))-$(( n - 1 ))
    fi
    previous=$n
done

You might need to add some checking to prevent lines like 28-28 for single number gaps.
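One possible way to add that check is sketched below. It is a variation on the loop above, not the original script: the `print_gaps` function name and the `first`/`last` variables are introduced here for illustration, and the input is assumed to be one integer per line, sorted in ascending order.

```shell
# Hypothetical helper: read one number per line on stdin and print the gaps,
# using a single number (rather than e.g. "28-28") when only one value is missing.
print_gaps() {
    previous=0
    while read -r n; do
        first=$(( previous + 1 ))   # first missing number, if any
        last=$(( n - 1 ))           # last missing number, if any
        if [ "$first" -eq "$last" ]; then
            echo "$first"           # exactly one number is missing
        elif [ "$first" -lt "$last" ]; then
            echo "$first-$last"     # a longer gap
        fi                          # otherwise n follows previous: no gap
        previous=$n
    done
}

printf '%s\n' 1 2 4 5 9 | print_gaps
# prints:
# 3
# 6-8
```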

choroba
  • @CharlesDuffy: What do you mean by "ongoing at the end"? The OP wants to detect gaps, not ranges, and if the sequence isn't infinite, there must be gaps at both ends... – choroba Sep 19 '17 at 15:45
0

A Perl solution, similar to the awk solution from StardustOne:

perl -ane 'if ($F[0] != $p+1) {printf "%d-%d\n",$p+1,$F[0]-1}; $p=$F[0]' file.txt

These command-line options are used:

  • -n loop around every line of the input file, do not automatically print every line

  • -a autosplit mode – split input lines into the @F array. Defaults to splitting on whitespace. Fields are indexed starting with 0.

  • -e execute the perl code

Chris Koknat
0

Given an input file, use the numinterval util and paste its output beside the file, then munge it with tr, xargs, sed and printf:

gaps() { paste  <(echo; numinterval "$1" | tr 1 '-' | tr -d '[02-9]') "$1" | 
         tr -d '[:blank:]' | xargs echo | 
         sed 's/ -/-/g;s/-[^ ]*-/-/g' | xargs printf "%s\n" ; }

Output of gaps file:

5-8
15-17
25-27

How it works. The output of paste <(echo; numinterval file) file looks like:

    5
1   6
1   7
1   8
7   15
1   16
1   17
8   25
1   26
1   27

From there we mainly replace things in column #1, and tweak the spacing. The 1s are replaced with -s, and the higher numbers are blanked. Remove some blanks with tr. Replace runs of hyphens like "5-6-7-8" with a single hyphen "5-8", and that's the output.
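If numinterval is not installed, the same start-end output for each consecutive run can be sketched with awk alone. This is an alternative technique rather than the pipeline above; the `runs` function name is invented for the example, and the input is assumed to be sorted with one number per line.

```shell
# Hypothetical helper: print each run of consecutive numbers on stdin as "start-end".
runs() {
    awk 'NR == 1        { start = prev = $1; next }           # first number opens a run
         $1 != prev + 1 { print start "-" prev; start = $1 }  # break: close the old run
                        { prev = $1 }                         # remember the last number seen
         END            { if (NR) print start "-" prev }'     # close the final run
}

printf '%s\n' 5 6 7 8 15 16 17 25 26 27 | runs
# prints:
# 5-8
# 15-17
# 25-27
```

A run consisting of a single number is printed as, for example, 30-30.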

user2070305
agc
0

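This one lists the points where the sequence in a list breaks: for each number that does not directly follow the previous one, it prints the number just before it (the last missing value).

The idea is taken from @choroba, but done with a for loop.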

#! /bin/bash
previous=0
n=$( cat listaNums.txt )
for number in $n
do
    numListed=$(($number - 1))
    if [ $numListed != $previous ] && [ $number != 2147483647 ]; then
        echo $numListed
    fi
    previous=$number
done
AdrianEunoia