Extracting parts of words from a file

Question

I have a list, "list A," containing tens of thousands of entries (4 example entries shown below). I'd like to create from "list A" another list, "list B." I need each entry of "list B" to contain only the 1st 4 (out of 5) characters of the 1st "word" following the ">" character that is at the start of each entry in "list A."

So, for example, I'd like a "list B" that looks like this: 2JUG 3JU9 1JU8 3JUE

I am new to script writing, and would appreciate any help you can offer. The closest I got to solving my problem was printing the 1st column, but that gave me all 5 characters of my 1st "word," plus the long string of letters in the next line. I am brand new to script writing, so if possible, please try to give me the "for dummies" version of your explanations. Thank you!

Example entries from "list A" below:

2JUGA 78 NMR NA NA NA no TubC protein [ANGIOCOCCUS DISCIFORMIS] || 2JUGB GPLGSSAGALLAHAASLGVRLWVEGERLRFQAPPGVMTPELQSRLGGARH ELIALLRQLQPSSQGGSLLAPVARNGRL

3JU9A 237 XRAY 2.10 0.207 0.253 no Concanavalin-Br [CANAVALIA BRASILIENSIS] || 1AZDA 1AZDB 1AZDC 1AZDD 4H55A ADTIVAVELDTYPNTDIGDPSYPHIGIDIKSVRSKKTAKWNMQNGKVGTA HIIYNSVGKRLSAVVSYPNGDSATVSYDVDLDNVLPEWVRVGLSASTGLY KETNTILSWSFTSKLKSNSTHETNALHFMFNQFSKDQKDLILQGDATTGT EGNLRLTRVSSNGSPQGSSVGRALFYAPVHIWESSAVVASFEATFTFLIK SPDSHPADGIAFFISNIDSSIPSGSTGRLLGLFPDAN

1JU8A 37 NMR NA NA NA no Leginsulin [NA] ADCNGACSPFEVPPCRSRDCRCVPIGLFVGFCIHPTG

3JUEA 368 XRAY 2.30 0.203 0.219 no ARFGAP with coiled-coil, ANK repeat and PH domain-containing protein 1 [HOMO SAPIENS] || 3JUEB GPLGSGSGHLAIGSAATLGSGGMARGREPGGVGHVVAQVQSVDGNAQCCD CREPAPEWASINLGVTLCIQCSGIHRSLGVHFSKVRSLTLDSWEPELVKL MCELGNVIINQIYEARVEAMAVKKPGPSCSRQEKEAWIHAKYVEKKFLTK LPEIRGRRGGRGRPRGQPPVPPKPSIRPRPGSLRSKPEPPSEDLGSLHPG ALLFRASGHPPSLPTMADALAHGADVNWVNGGQDNATPLIQATAANSLLA CEFLLQNGANVNQADSAGRGPLHHATILGHTGLACLFLKRGADLGARDSE GRDPLTIAMETANADIVTLLRLAKMREAEAAQGQAGDETYLDIFRDFSLM ASDDPEKLSRRSHDLHTL

Could you please provide a sample file - question formatting broke the input, and that's important for the possible solution. — Peter L., Feb 06 '13 at 20:31
Hi, Peter L. Wow, thanks for your fast reply. I'd be happy to provide the requested information. Do you have an e-mail address to which I could send the sample file? — user2048166, Feb 07 '13 at 00:18
P.S. The formatting is that there are 2 lines per entry on the list. The 1st line begins with ">abcde," and "abcd" is the information that I want. The 2nd line is a long series of letters. Then there is an empty line, followed by the next entry on the list. — user2048166, Feb 07 '13 at 00:21
P.P.S. I don't know if this is helpful, but another thought I had was to use the ">" for my script, but the ">" of ">abcde" that begins each entry is not unique to that starting position. Some of the entries also contain this symbol at later points in the 1st line (though none of these ">" symbols are showing up in the original sample I posted). — user2048166, Feb 07 '13 at 00:26
Upload a file to dropbox or any similar filesharing service and drop a link here. If you don't have dropbox - install from here: http://db.tt/JNGZy87d — Peter L., Feb 07 '13 at 05:33
https://www.dropbox.com/sh/w8q8avk139a9wwo/Z02X0EFWXH/listA.rtf — user2048166, Feb 07 '13 at 20:42
https://www.dropbox.com/sh/w8q8avk139a9wwo/tpmyRX_VOZ/listB.rtf — user2048166, Feb 07 '13 at 20:43
Hi Peter. Thank you! Can you teach me how this works? For example, what do the "^p>", "@@@>", and "^p" mean? — user2048166, Feb 11 '13 at 23:04
`^p` is Word's find pattern for a paragraph mark. The three replacement passes in the code remove all line breaks except those that are immediately followed by `>`, i.e. the ones that start an entry. `@@@>` is a temporary placeholder: the trick is that we may use ANY string which is NOT in the original document and then replace it back with a paragraph mark. The rest of the code loops through every paragraph and cuts off everything except its first 5 characters. — Peter L., Feb 12 '13 at 04:39
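The placeholder trick from the comment above can also be tried on an ordinary string, without Word's find-and-replace. This is only a minimal sketch under assumptions: `FirstFiveOfEachEntry` and `MARKER` are made-up names, paragraph marks are represented by `vbCr`, and instead of replacing the placeholder back with a paragraph mark (as the macro does) it splits the text on the placeholder.

```vba
' Sketch of the placeholder trick on a plain string (no Word objects involved).
' MARKER is a hypothetical token; any text that never occurs in the data works.
Function FirstFiveOfEachEntry(ByVal source As String) As String
    Const MARKER As String = "@@@"
    Dim entries() As String
    Dim result As String
    Dim i As Long

    ' 1) Protect the line breaks that start an entry (a break followed by ">").
    source = Replace(source, vbCr & ">", MARKER & ">")
    ' 2) Remove every remaining line break, joining each entry into one line.
    source = Replace(source, vbCr, "")
    ' 3) Split on the marker so that each element is one complete entry.
    entries = Split(source, MARKER)

    ' Keep the first 5 characters of each entry: ">" plus four characters.
    For i = LBound(entries) To UBound(entries)
        If Len(entries(i)) > 0 Then
            result = result & Left(entries(i), 5) & vbCr
        End If
    Next i

    FirstFiveOfEachEntry = result
End Function
```

For example, calling it from the Immediate window with `">2JUGA 78 NMR" & vbCr & "GPLGSS" & vbCr & vbCr & ">3JU9A 237 XRAY" & vbCr & "ADTIVA"` should give `>2JUG` and `>3JU9`, each followed by a paragraph mark.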

score 1 · Answer 1 · answered Feb 07 '13 at 21:37

Here is a solution. It is pretty straightforward, but it will do the job for input formatted like the provided sample:

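Sub NMRData()
    ' Pass 1: protect the paragraph marks that start an entry (those followed by ">")
    ' by swapping them for a placeholder that does not occur in the document.
    With Selection.Find
        .Text = "^p>"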
        .Replacement.Text = "@@@>"
        .Forward = True
        .Wrap = wdFindContinue
        .Format = False
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With
    Selection.Find.Execute Replace:=wdReplaceAll
    ' Pass 2: delete every remaining paragraph mark so each entry becomes one line.
    With Selection.Find
        .Text = "^p"
        .Replacement.Text = ""
        .Forward = True
        .Wrap = wdFindContinue
        .Format = False
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With
    Selection.Find.Execute Replace:=wdReplaceAll
    ' Pass 3: turn the placeholders back into paragraph marks.
    With Selection.Find
        .Text = "@@@>"
        .Replacement.Text = "^p>"
        .Forward = True
        .Wrap = wdFindContinue
        .Format = False
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With
    Selection.Find.Execute Replace:=wdReplaceAll

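    ' Walk the paragraphs from last to first and keep only the first 5 characters
    ' of each one (for these entries: the leading ">" plus four characters).
    For k = ThisDocument.Paragraphs.Count To 1 Step -1
        Set oPara = ThisDocument.Paragraphs(k)
        oPara.Range.Text = Left(oPara.Range.Text, 5) & vbNewLine
    Next k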

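    ' Build a full path so the file lands next to the macro document;
    ' a bare file name would be resolved against Word's current folder instead.
    ThisDocument.SaveAs FileName:=ThisDocument.Path & Application.PathSeparator & "listB.docx", FileFormat:=wdFormatXMLDocument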

End Sub

Desired output will be saved in the same folder as a new DOCX file.
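As written, the loop keeps five characters, which for entries that start with ">" means the ">" plus four characters. If the ">" should be dropped, as in the sample output in the question, one possible adjustment is sketched below. `IdWithoutMarker` is a hypothetical helper name, not part of the original macro, and it assumes the entry text is passed in as a plain string.

```vba
' Hypothetical helper: returns the four characters that follow a leading ">".
' If the text does not start with ">", it simply returns the first four characters.
Function IdWithoutMarker(ByVal entryText As String) As String
    If Left(entryText, 1) = ">" Then
        ' Mid is 1-based: start at position 2 (just after ">") and take 4 characters.
        IdWithoutMarker = Mid(entryText, 2, 4)
    Else
        IdWithoutMarker = Left(entryText, 4)
    End If
End Function
```

Inside the loop, `Left(oPara.Range.Text, 5)` could then be swapped for `IdWithoutMarker(oPara.Range.Text)`, so that an entry starting with `>2JUGA` would be reduced to `2JUG`.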

To run the code, either press ALT+F11 to open the VBA editor and then press F5, or press ALT+F8 to select and run the macro by name.

Sample DOCM with ready-to-go code: https://www.dropbox.com/s/6zt4nfn7rt8eqc7/NMRDataListA.docm

P.S. This is my very first Word VBA experience :)
