
I need to create a tool to tokenize an input list. I will provide a text file containing a list of subdomains in the following format:

abc.xyc.kkk.com
hjk.pol.lll.kkk.com
...

And the output needs to be:

abc
xyz
kkk
hjk
...

The delimiter is '.'

I have tried out the following code, which does not seem to be working:

#!/bin/bash

STRING= cat $1
IFS='.' read -ra VALUES <<< "$STRING"

## To print all values
for i in "${VALUES[@]}"; do
    echo $i
done
Socowi
Arjun S
  • Can you precisely specify the rules? Why is `.com` omitted – is it because `com` is in blacklist or simply because `com` is the last token in the current whitespace-free block? Also, I assume `….xyc.…` → `… xyz …` is a typo – you may want to fix that. Please edit your question using the grey edit button below your question. – Socowi Nov 28 '19 at 16:36
  • Check your code at https://shellcheck.net . Good luck. – shellter Nov 28 '19 at 16:36
  • `STRING= cat $1` is a duplicate of [Why does a space in a variable assignment give an error in bash?](https://stackoverflow.com/questions/41748466/why-does-a-space-in-a-variable-assignment-give-an-error-in-bash), and [How do I set a variable to the output of a command in bash?](https://stackoverflow.com/questions/4651437/how-do-i-set-a-variable-to-the-output-of-a-command-in-bash). – Charles Duffy Nov 28 '19 at 17:57
  • ...afaict, the rest of what you're referring to as "tokenization" is just splitting on a delimiter, which we also have duplicates already covering. – Charles Duffy Nov 28 '19 at 18:01
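
The two points raised in the comments (no space around `=` plus command substitution to capture output, and splitting on a delimiter) can be sketched together as below. This is only a minimal sketch: the function name `tokenize` is hypothetical, it assumes one host name per line on standard input, and it keeps every token, including the last one, because the rule for dropping `com` was never specified.

```bash
#!/bin/bash
# Hypothetical helper (not from the original post): print every
# dot-separated token of each input line on its own line.
tokenize() {
    local line token
    local -a tokens
    # Read line by line; `|| [[ -n $line ]]` also handles a final
    # line that has no trailing newline.
    while IFS= read -r line || [[ -n $line ]]; do
        # IFS='.' applies only to this one read, so the rest of the
        # script keeps the default word splitting.
        IFS='.' read -ra tokens <<< "$line"
        for token in "${tokens[@]}"; do
            printf '%s\n' "$token"
        done
    done
}

# Demo with the sample data from the question.
printf '%s\n' abc.xyc.kkk.com hjk.pol.lll.kkk.com | tokenize
```

To run it on a file instead, something like `tokenize < "$1"` should work inside a script that receives the file name as its first argument.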

2 Answers


Something like this in awk can do the job:

awk 'BEGIN {FS="."; RS=" "} {$NF=""} 1'

Here is the test:

$ echo abc.xyc.kkk.com hjk.pol.lll.kkk.com |awk 'BEGIN {FS="."; RS=" "} {$NF=""} 1'
abc xyc kkk
hjk pol lll kkk
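
A variant of the same awk idea, sketched for the layout shown in the question (one host name per line, one token per output line). It assumes that the last label, such as `com`, should always be dropped, which the question does not state explicitly; the function name `drop_last_label` is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: read host names from standard input, one per
# line, and print each dot-separated token except the last one.
drop_last_label() {
    # -F. makes '.' the field separator; the loop stops before field
    # NF, so the final label of every line is skipped.
    awk -F. '{ for (i = 1; i < NF; i++) print $i }'
}

# Demo with the sample data from the question.
printf '%s\n' abc.xyc.kkk.com hjk.pol.lll.kkk.com | drop_last_label
```

With a file as input, the call would look like `drop_last_label < subdomains.txt`, where `subdomains.txt` is a placeholder name.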
Romeo Ninov

Your script had some mistakes, which I corrected. Now it works:

#!/bin/bash

STRING='
abc.xyc.kkk.com hjk.pol.lll.kkk.com
'

IFS=' ' read -d '' -a VALUES <<< "$STRING"

for i in ${VALUES[@]}; do
    echo "$i" | sed 's/\./ /g' 
done

Output

abc xyc kkk com
hjk pol lll kkk com

By the way, if you want each token as a separate array element, instead of each entire host name, you can do this:

#!/bin/bash

STRING='
abc.xyc.kkk.com hjk.pol.lll.kkk.com
'

IFS='.' read -d '' -a VALUES <<< "$STRING"

for i in ${VALUES[@]}; do
    echo "$i" | sed 's/\./ /g' 
done

Output

abc
xyc
kkk
com
hjk
pol
lll
kkk
com

Let me know if it works!

Matias Barrios
  • `"${VALUES[@]}"`, or else it has all the bugs of `${VALUES[*]}`. – Charles Duffy Nov 28 '19 at 17:58
  • And it's much more efficient to use `${i//./ }` than `sed`. – Charles Duffy Nov 28 '19 at 17:58
  • More generally, though, a question that doesn't specify *one single error* with the shortest code that reproduces it should be closed as too-broad, not answered (or closed as duplicate, with a separate duplicate for each specific/narrower question hidden within the larger/broader question). – Charles Duffy Nov 28 '19 at 17:59
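
The two suggestions in the comments above (quoting `"${VALUES[@]}"` and using `${i//./ }` instead of `sed`) might be combined roughly like this. It is a sketch that reuses the sample string from the answer on a single line; the lowercase variable names are a choice made here, not something from the original posts.

```bash
#!/bin/bash
# Sample input from the answer, kept on one line for simplicity.
string='abc.xyc.kkk.com hjk.pol.lll.kkk.com'

# Split on whitespace (the default IFS) into an array.
read -ra values <<< "$string"

# Quoting "${values[@]}" keeps each element intact: no re-splitting
# and no glob expansion of the elements.
for i in "${values[@]}"; do
    # ${i//./ } replaces every '.' with a space inside the shell,
    # so no sed process is started per element.
    echo "${i//./ }"
done
```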