I have a list, "list A," containing tens of thousands of entries (four example entries are shown below). From "list A" I'd like to create another list, "list B." Each entry of "list B" should contain only the first 4 (of 5) characters of the first "word" that follows the ">" character at the start of each entry in "list A."
So, for example, I'd like a "list B" that looks like this: 2JUG 3JU9 1JU8 3JUE
I am brand new to script writing and would appreciate any help you can offer, so if possible, please give me the "for dummies" version of your explanations. The closest I got to solving my problem was printing the first column, but that gave me all 5 characters of my first "word," plus the long string of letters on the next line. Thank you!
Example entries from "list A" are below:
2JUGA 78 NMR NA NA NA no TubC protein [ANGIOCOCCUS DISCIFORMIS] || 2JUGB GPLGSSAGALLAHAASLGVRLWVEGERLRFQAPPGVMTPELQSRLGGARH ELIALLRQLQPSSQGGSLLAPVARNGRL
3JU9A 237 XRAY 2.10 0.207 0.253 no Concanavalin-Br [CANAVALIA BRASILIENSIS] || 1AZDA 1AZDB 1AZDC 1AZDD 4H55A ADTIVAVELDTYPNTDIGDPSYPHIGIDIKSVRSKKTAKWNMQNGKVGTA HIIYNSVGKRLSAVVSYPNGDSATVSYDVDLDNVLPEWVRVGLSASTGLY KETNTILSWSFTSKLKSNSTHETNALHFMFNQFSKDQKDLILQGDATTGT EGNLRLTRVSSNGSPQGSSVGRALFYAPVHIWESSAVVASFEATFTFLIK SPDSHPADGIAFFISNIDSSIPSGSTGRLLGLFPDAN
1JU8A 37 NMR NA NA NA no Leginsulin [NA] ADCNGACSPFEVPPCRSRDCRCVPIGLFVGFCIHPTG
3JUEA 368 XRAY 2.30 0.203 0.219 no ARFGAP with coiled-coil, ANK repeat and PH domain-containing protein 1 [HOMO SAPIENS] || 3JUEB GPLGSGSGHLAIGSAATLGSGGMARGREPGGVGHVVAQVQSVDGNAQCCD CREPAPEWASINLGVTLCIQCSGIHRSLGVHFSKVRSLTLDSWEPELVKL MCELGNVIINQIYEARVEAMAVKKPGPSCSRQEKEAWIHAKYVEKKFLTK LPEIRGRRGGRGRPRGQPPVPPKPSIRPRPGSLRSKPEPPSEDLGSLHPG ALLFRASGHPPSLPTMADALAHGADVNWVNGGQDNATPLIQATAANSLLA CEFLLQNGANVNQADSAGRGPLHHATILGHTGLACLFLKRGADLGARDSE GRDPLTIAMETANADIVTLLRLAKMREAEAAQGQAGDETYLDIFRDFSLM ASDDPEKLSRRSHDLHTL
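To make the goal concrete, here is a rough sketch in Python of the transformation I am after. It is only an illustration under assumptions: that each entry's header line starts with ">" (as described above), that the long string of letters sits on separate lines after the header, and that the input arrives as a list of text lines. The function name `extract_ids` and the `sample` layout are hypothetical, made up for this example.

```python
def extract_ids(lines):
    """Collect the first 4 characters of the first word after '>' on each header line."""
    ids = []
    for line in lines:
        # Assumption: only header lines begin with '>'; everything else is sequence text.
        if not line.startswith(">"):
            continue
        words = line[1:].split()  # drop the '>' and split on whitespace
        if words:
            ids.append(words[0][:4])  # keep the first 4 characters, e.g. "2JUGA" becomes "2JUG"
    return ids


# Two of the entries above, laid out the way the description implies (assumed layout).
sample = [
    ">2JUGA 78 NMR NA NA NA no TubC protein [ANGIOCOCCUS DISCIFORMIS] || 2JUGB",
    "GPLGSSAGALLAHAASLGVRLWVEGERLRFQAPPGVMTPELQSRLGGARH",
    ">1JU8A 37 NMR NA NA NA no Leginsulin [NA]",
    "ADCNGACSPFEVPPCRSRDCRCVPIGLFVGFCIHPTG",
]

print(" ".join(extract_ids(sample)))  # prints: 2JUG 1JU8
```

For a real file, the same function could be given an open file object instead of `sample`, since iterating over a file also yields one line at a time.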