20

I am trying to write a simple bash script that finds the longest word in a text file, along with its length. I know this is simple and straightforward with awk, but I want to try this method instead. Say a=wmememememe: to get its length I can use echo ${#a}, and to get the word itself I would use echo ${a}. But I want to apply that in the loop below:

for i in `cat so.txt`; do

where so.txt contains words. I hope that makes sense.
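
For reference, the two expansions mentioned in the question can be checked directly at the prompt. This is only a minimal sketch using the example value from the question:

```shell
a=wmememememe
echo "${#a}"   # length of the value stored in a
echo "${a}"    # the value itself
```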

Mildred Shimz
  • 607
  • 4
  • 11
  • 20

8 Answers

31

Bash one-liner:

sed 's/ /\n/g' YOUR_FILENAME | sort | uniq | awk '{print length, $0}' | sort -nr | head -n 1
  1. Read the file and split it into words (via sed).
  2. Remove duplicates (via sort | uniq).
  3. Prefix each word with its length (awk).
  4. Sort the list by word length.
  5. Print the single word with the greatest length.

Yes, this will be slower than some of the other solutions, but it also doesn't require remembering the semantics of bash for loops.
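
The `\n` in the sed replacement is understood by GNU sed but not by every sed implementation. A possible variant of the same five steps, sketched here with `tr` doing the splitting, might look like the following. The function name `longest_word` is hypothetical and only there to make the pipeline reusable:

```shell
# longest_word FILE: print "<length> <word>" for the longest word in FILE.
# (longest_word is an illustrative name, not part of the original answer.)
longest_word() {
    tr -s '[:space:]' '\n' < "$1" |        # one word per line, squeezing runs of whitespace
        sort -u |                          # remove duplicates
        awk '{ print length($0), $0 }' |   # prefix each word with its length
        sort -nr |                         # longest first
        head -n 1                          # keep only the top entry
}
```

Usage would be `longest_word YOUR_FILENAME`. If several words share the greatest length, which one is printed depends on how sort orders the tied lines.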

rackpit
  • 13
  • 4
BlessedKey
  • 1,615
  • 1
  • 10
  • 16
14

Normally, you'd want to use a while read loop instead of for i in $(cat), but since you want all the words to be split, in this case it would work out OK.

#!/bin/bash
longest=0
for word in $(<so.txt)
do
    len=${#word}
    if (( len > longest ))
    then
        longest=$len
        longword=$word
    fi
done
printf 'The longest word is %s and its length is %d.\n' "$longword" "$longest"
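
For comparison, here is a rough sketch of what the while read form could look like for this task. It assumes bash (for `read -a` and arrays), and the function name `longest_word_read` is made up for illustration. One difference from the `$(<so.txt)` loop is that `read` does not perform pathname expansion, so a word such as `*` is taken literally.

```shell
#!/bin/bash
# longest_word_read FILE: print the longest word in FILE and its length.
# (Hypothetical helper; a sketch, not the answer's original code.)
longest_word_read() {
    local word longest=0 longword=
    local -a words
    # read -a splits each line into the array "words";
    # the || clause also handles a final line with no trailing newline.
    while read -r -a words || (( ${#words[@]} )); do
        for word in "${words[@]}"; do
            if (( ${#word} > longest )); then
                longest=${#word}
                longword=$word
            fi
        done
    done < "$1"
    printf '%s %d\n' "$longword" "$longest"
}
```

It would be called as `longest_word_read so.txt`.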
Dennis Williamson
  • 346,391
  • 90
  • 374
  • 439
8

Another solution:

for item in  $(cat "$infile"); do
  length[${#item}]=$item          # use word length as index
done
maxword=${length[@]: -1}          # select last array element

printf "longest word '%s', length %d\n" "${maxword}" "${#maxword}"
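
This works because bash indexed arrays are sparse and expand in ascending index order, so the element at the highest index is the longest word seen. A small sketch with made-up sample words (not taken from so.txt) shows the effect:

```shell
#!/bin/bash
length=()
for item in pear fig banana kiwi; do
    length[${#item}]=$item     # kiwi overwrites pear: both have length 4
done
echo "${!length[@]}"           # the indices that are actually set, in ascending order
echo "${length[@]: -1}"        # the element at the highest index
```

Note that when two words have the same length, the later one replaces the earlier one, so the result is the last word of maximal length rather than the first.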
Fritz G. Mehner
  • 16,550
  • 2
  • 34
  • 41
5
longest=""
for word in $(cat so.txt); do
    if [ ${#word} -gt ${#longest} ]; then
        longest=$word
    fi
done

echo "$longest"
Rob Wouters
  • 15,797
  • 3
  • 42
  • 36
3

awk script:

#!/usr/bin/awk -f

# Initialize two variables
BEGIN {
  maxlength=0;
  maxword=""
} 

# Loop through each word on the line
{
  for(i=1;i<=NF;i++) 

  # Assign the maxlength variable if length of word found is greater. Also, assign
  # the word to maxword variable.
  if (length($i)>maxlength) 
  {
    maxlength=length($i); 
    maxword=$i;
  }
}

# Print out the maxword and the maxlength  
END {
  print maxword,maxlength;
}

Textfile:

[jaypal:~/Temp] cat textfile 
AWK utility is a data_extraction and reporting tool that uses a data-driven scripting language 
consisting of a set of actions to be taken against textual data (either in files or data streams) 
for the purpose of producing formatted reports. 
The language used by awk extensively uses the string datatype, 
associative arrays (that is, arrays indexed by key strings), and regular expressions.

Test:

[jaypal:~/Temp] ./script.awk textfile 
data_extraction 15
davemyron
  • 2,483
  • 3
  • 24
  • 33
jaypal singh
  • 74,723
  • 23
  • 102
  • 147
1
  1. Relatively speedy bash function using no external utils:

    # Usage: longcount <  textfile
    longcount () 
    { 
        declare -a c;
        while read -r x; do
            c[${#x}]="$x";
        done;
        x=${c[@]: -1};    # element at the highest index, i.e. the longest line
        echo ${#x} "$x"
    }
    

    Example:

    longcount < /usr/share/dict/words
    

    Output:

    23 electroencephalograph's
    
  2. Modified POSIX shell version of jimis' xargs-based answer; still very slow, takes two or three minutes:

    tr "'" '_'  < /usr/share/dict/words |
    xargs -P$(nproc) -n1 -i sh -c 'set -- {} ; echo ${#1} "$1"' | 
    sort -n | tail | tr '_' "'"
    

    Note the leading and trailing tr bit to get around GNU xargs difficulty with single quotes.

agc
  • 7,973
  • 2
  • 29
  • 50
0
for i in $(cat so.txt); do echo ${#i}; done | paste - so.txt | sort -n | tail -1
ThiefMaster
  • 310,957
  • 84
  • 592
  • 636
jbleners
  • 1,023
  • 1
  • 8
  • 14
  • The number of *words* in a text file is often greater than the number of *lines* in that file. So `paste - so.txt` won't work reliably unless *so.txt* has only one column. – agc Jul 06 '19 at 23:20
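
One way around the problem raised in the comment, sketched here under the assumption that the goal is still a plain for loop, is to print each word next to its own length instead of pasting the lengths against the file. The function name `longest_by_length` is hypothetical:

```shell
# longest_by_length FILE: print "<length> <word>" for the longest word in FILE.
# (Illustrative helper; works regardless of how many words each line holds.)
longest_by_length() {
    for w in $(cat "$1"); do
        printf '%s %s\n' "${#w}" "$w"   # length and word on the same line
    done | sort -n | tail -n 1          # the greatest length ends up last
}
```

It would be called as `longest_by_length so.txt`. As with the original loop, words containing glob characters such as `*` may be expanded by the shell.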
-1

Slow because of the gazillion forks, but it is pure shell and does not require awk or special bash features:

$ cat /usr/share/dict/words | \
    xargs -n1 -I '{}' -d '\n'   sh -c 'echo `echo -n "{}" | wc -c` "{}"' | \
    sort -n | tail
23 Pseudolamellibranchiata
23 pseudolamellibranchiate
23 scientificogeographical
23 thymolsulphonephthalein
23 transubstantiationalist
24 formaldehydesulphoxylate
24 pathologicopsychological
24 scientificophilosophical
24 tetraiodophenolphthalein
24 thyroparathyroidectomize

You can easily parallelize, e.g. to 4 CPUs by providing -P4 to xargs.

EDIT: modified to work with the single quotes that some dictionaries have. Now it requires GNU xargs because of -d argument.

EDIT2: for the fun of it, here is another version that handles all kinds of special characters, but requires the -0 option to xargs. I also added -P4 to compute on 4 cores:

cat /usr/share/dict/words | tr '\n' '\0' | \
    xargs -0 -I {} -n1 -P4  sh -c  'echo ${#1} "$1"'  wordcount {} | \
    sort -n | tail
jimis
  • 794
  • 1
  • 9
  • 14
  • 1
    Tried this... cant get mine to work. I think the fact that there are single quotes in the 'words' file makes not work for me. How did you get around that? – Chai Ang Oct 01 '18 at 01:34
  • @chai No single quotes in my system, just one word per line. What is your OS? Which package provides the `words` file? – jimis Jan 02 '20 at 12:05
  • Ubuntu 7.4.0-1ubuntu1~18.04.1. Cant remember where it came from. apt-file says "wamerican", which I dont recall installing. Must have been because of some dependency from some other package. The first few lines of /usr/share/dict/words look like A A's AMD AMD's AOL AOL's Aachen Aachen's – Chai Ang Jan 15 '20 at 03:28
  • Looks like it is ispell. apt-cache rdepends wamerican wamerican Reverse Depends: |bsdmainutils |libpam-cracklib |xvkbd |sugarplum |ispell iamerican forensics-extra |bsdgames |libpam-cracklib |cracklib-runtime – Chai Ang Jan 15 '20 at 03:30
  • Updated to handle single quotes. Still not perfect (will not work with other quotes) but this method was just for fun anyway. – jimis Jan 19 '20 at 23:50