2

I have a lady at work who sends me phone numbers. They are sent in a messy manner, EVERY TIME, so I want to copy her entire message from Skype and have a batch file parse the saved .txt file, searching only for 10 consecutive digits.

E.g. she sends me:

Hello more numbers for settings please,
WYK-0123456789 
CAMP-0123456789 
0123456789
Include 0123456789
This is an urgent number: 0123456789 
TIDO: 0123456789
Send to> 0123456789

It's quite a mess and the only constant is the 10 digits. So I would like the .bat file to somehow scan this monstrosity and leave me with something like below:

E.g. what I want:

0123456789 
0123456789 
0123456789
0123456789
0123456789 
0123456789
0123456789

I tried this below

@echo off
setlocal enableDelayedExpansion
(
  for /f %%A in (
    'findstr "^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]" yourFile.txt'
  ) do (
    set "ln=%%A"
    echo !ln:~0,10!
  )
)>newFile.txt

Unfortunately it only works if the beginning of each line starts with 10 digits and doesn't help me in the case where the 10 digits are in the middle or end of a line.

  • 1
    Does it need to be a Windows batch file? I would have thought using a batch file to call a more capable language would be easier. Can you install Python or PHP? I am not familiar with it, but perhaps PowerShell would be worth a look too. – halfer May 23 '17 at 12:24
  • removing the circumflex **`^`** will remove the beginning of line stipulation. Also I would suggest you use 'not in range' either side of your 10 digit string to prevent sequences of more than 10 being matched. – Compo May 23 '17 at 12:37
  • It may just be coincidental, only you can know, but all of your lines end with the number you need!!! Do you want to take each line ending with a ten digit string and output only that string? – Compo May 23 '17 at 12:44
  • @compo I wish it wasn't but it is. Thanks for the help though. – Johan Betsman May 23 '17 at 13:25
  • @halfer Thanks, well I know it's possible to use regular expressions in python. I just don't want to install anything extra. The thing I like about batch files is that you can edit, add or take away code effortlessly. – Johan Betsman May 23 '17 at 13:28
  • I see the point about not having to install anything, but surely in all the other options I mentioned, you can edit, add or take away code effortlessly? None of them are compiled languages. – halfer May 23 '17 at 14:09
  • PowerShell is "baked in" starting with Windows 7; no installation of anything needed. It is a far superior choice to cmd.exe shell script (batch) in every conceivable way. – Bill_Stewart May 23 '17 at 16:20
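
One way to approximate the suggestion in the comments above (drop the `^` anchor and reject digit runs longer than 10) is to chain two `findstr` calls. This is only a sketch: it filters whole lines rather than extracting the number, it discards any line that contains an 11-digit run even if the line also holds a valid 10-digit number, and the helper variables `d` and `ten` are made up for this illustration.

```bat
@echo off
setlocal
rem Sketch only; yourFile.txt is the file name used in the question.
rem d and ten are hypothetical helper variables.
set "d=[0-9]"
set "ten=%d%%d%%d%%d%%d%%d%%d%%d%%d%%d%"
rem Pass 1 keeps lines with a run of at least 10 digits anywhere in the line.
rem Pass 2 uses /v to drop lines that contain a run of 11 or more digits.
findstr /r "%ten%" "yourFile.txt" | findstr /v /r "%ten%[0-9]"
```

The surviving lines still carry their surrounding text, so a second step is needed to cut the number out of each line.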

4 Answers

2

Unfortunately, it is very difficult to solve this problem in a general way. The Batch file below correctly gets the numbers from your example file, but if your real data includes a number in a different format, the program will fail... Of course, in such a case you would just need to add the new format to the program! :)

@echo off
setlocal EnableDelayedExpansion

set "digits=0123456789"

(
   rem Find lines with 10 consecutive digits (or more)
   for /f "delims=" %%A in (
      'findstr "[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]" yourFile.txt'
   ) do (
      set "ln=%%A"

      rem Separate line in "words" delimited by space or hyphen
      set "ln=!ln: =" "!"
      set "ln=!ln:-=" "!"
      for %%B in ("!ln!") do (
         set "word=%%~B"

         rem If a word has exactly 10 chars...
         if "!word:~9,1!" neq "" if "!word:~10!" equ "" (
            rem and the first one is a digit
            for /F %%D in ("!word:~0,1!") do (
               if "!digits:%%D=!" neq "%digits%" echo !word!
            )
         )

      )
   )
) > newFile.txt

For example, this program will fail if a 10-character "word" that is not a telephone number starts with a digit.
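
One possible way to tighten the check, so that every character of the word must be a digit and no external program is started per word, is to delete all digits from a copy of the word and see whether anything is left. The following is a minimal sketch, not part of the program above; the test words and the variables `ok` and `rest` are hypothetical.

```bat
@echo off
setlocal EnableDelayedExpansion
rem Hypothetical test words: only the first one should pass
for %%W in (0123456789 1a23456789 012345678 01234567890) do (
    set "word=%%W"
    set "ok=1"
    rem Exactly 10 characters: a 10th one exists, an 11th does not
    if "!word:~9,1!"=="" set "ok="
    if not "!word:~10!"=="" set "ok="
    rem Delete every digit from a copy; anything left over is a non-digit
    set "rest=!word!"
    for %%d in (0 1 2 3 4 5 6 7 8 9) do if defined rest set "rest=!rest:%%d=!"
    if defined rest set "ok="
    if defined ok echo !word!
)
```

With these four test words, only `0123456789` is echoed. As with any code that runs under delayed expansion, a word containing `!` would need extra care.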

Aacini
  • check if the word is exactly 10 numbers: `echo(%%B|findstr "^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]$" && echo %%B` – Stephan May 23 '17 at 13:08
  • 1
    @Stephan: Well, yes; your method is shorter/simpler. **`:)`** However, it requires executing _two copies_ of `cmd.exe` (one for each side of the pipe) plus `findstr.exe`, that is, loading and executing three large disk files _for each word_! **`:(`** If the data file is large, the difference in time may be noticeable... – Aacini May 23 '17 at 13:39
  • Thanks for your answer.)) – Johan Betsman May 24 '17 at 06:30
2

Given that the 10-digit number is the first numeric part in every line of the file (let us call it numbers.txt) before any other numbers, you could use the following:

@echo off
setlocal EnableExtensions EnableDelayedExpansion

rem // Define constants here:
set "_FILE=.\numbers.txt"
set /A "_DIG=10"

rem // The first delimiter is TAB, the last one is SPACE:
for /F "usebackq tokens=1 delims=   ^!#$%%&'()*+,-./:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^^_`abcdefghijklmnopqrstuvwxyz{|}~ " %%L in ("!_FILE!") do (
    set "NUM=%%L#"
    if "!NUM:~%_DIG%!"=="#" echo(%%L
)

endlocal
exit /B

This makes use of for /F and its delims option string, which includes most ASCII characters except numerals. You may extend the delims option string to hold also extended characters (those with a code greater than 0x7F); make sure the SPACE is the last character specified.

This approach can extract the 10-digit number from a line like this:

garbage text>0123456789_more text0123-end

But it fails if the first number in a line is not the 10-digit one, as here:

garbage text: 0123 tel. 0123456789; end

Here is a comprehensive solution based on the above approach. The character list for the delims option of for /F is created automatically here. This may take a few seconds, but it is done only once at the very beginning, so for large files you will probably not notice this overhead:

@echo off
setlocal EnableExtensions DisableDelayedExpansion

rem // Define constants here:
set "_FILE=.\numbers.txt"
set /A "_DIG=10"

rem // Define global variables here:
set "$CHARS="

rem // Capture current code page and set Windows default one:
for /F "tokens=2 delims=:" %%P in ('chcp') do set /A "CP=%%P"
> nul chcp 437

rem /* Generate list of escaped characters other than numerals (escaped means every character
rem    is preceded by `^`); there are some characters excluded:
rem    - NUL (this cannot be stored in an environment variable and should not occur anyway),
rem    - CR + LF, (they build up line-breaks, so they cannot occur within a line obviously),
rem    - SPACE, (because this must be placed as the last character of the `delims`option),
rem    - `"`, (because this impairs the quotation within the following code portion),
rem    - `!` + `^` (they may lead to unexpected results when delayed expansion is enabled): */
setlocal EnableDelayedExpansion
for /L %%I in (0x01,1,0xFF) do (
    rem // Exclude codes of aforementioned characters:
    if %%I GEQ 0x30 if %%I LSS 0x3A (set "SKIP=#") else (set "SKIP=")
    if not defined SKIP if %%I NEQ 0x00 if %%I NEQ 0x0A if %%I NEQ 0x0D (
        if %%I NEQ 0x20 if %%I NEQ 0x21 if %%I NEQ 0x22 if %%I NEQ 0x5E (
            rem // Convert code to character and append to list separated by `^`:
            cmd /C exit %%I
            for /F delims^=^ eol^= %%J in ('
                forfiles /P "%~dp0." /M "%~nx0" /C "cmd /C echo 0x220x!=ExitCode:~-2!0x22"
            ') do (
                set "$CHARS=!$CHARS!^^%%~J"
            )
        )
    )
)
endlocal & set "$CHARS=%$CHARS%"

rem /* Apply escaped list of characters as delimiters and apply some of the characters
rem    excluded before, namely SPACE, `"`, `!` and `^`;
rem    read file using `type` in order to convert from Unicode, if applicable: */
for /F tokens^=1*^ eol^=^ ^ delims^=^!^"^^%$CHARS%^  %%K in ('type "%_FILE%"') do (
    set "NUM=%%K#" & set "REST=%%L"
    rem // Test whether extracted numeric string holds the given number of digits:
    setlocal EnableDelayedExpansion
    if "!NUM:~%_DIG%!"=="#" echo(%%K
    endlocal
    rem /* Current line holds more than a single numeric portion, so process them in a
    rem    sub-routine; this is not called if the line contains a single number only: */
    if defined REST call :SUB REST
)

rem // Restore previous code page:
> nul chcp %CP%

endlocal
exit /B


:SUB  ref_string
    setlocal DisableDelayedExpansion
    setlocal EnableDelayedExpansion
    set "STR=!%~1!"
    rem // Parse line string using the same approach as in the main routine:
    :LOOP
    if defined STR (
        for /F tokens^=1*^ eol^=^ ^ delims^=^^^!^"^^^^%$CHARS%^  %%E in ("!STR!") do (
            endlocal
            set "NUM=%%E#" & set "STR=%%F"
            setlocal EnableDelayedExpansion
            rem // Test whether extracted numeric string holds the given number of digits:
            if "!NUM:~%_DIG%!"=="#" echo(%%E
        )
        rem // Loop back if there are still more numeric parts encountered:
        goto :LOOP
    )
    endlocal
    endlocal
    exit /B

This approach detects 10-digit numbers everywhere in the file, even if there are multiple ones within a single line.

aschipfl
2
@ECHO OFF
SETLOCAL
SET "sourcedir=U:\sourcedir"
SET "destdir=U:\destdir"
SET "filename1=%sourcedir%\q44134518.txt"
SET "outfile=%destdir%\outfile.txt"
ECHO %time%
(
FOR /f "usebackqdelims=" %%a IN ("%filename1%") DO SET "line=%%a"&CALL :process
)>"%outfile%"
ECHO %time%

GOTO :EOF

:lopchar
SET "line=%line:~1%"
:process
IF "%line:~9,1%"=="" GOTO :eof
SET "candidate=%line:~0,10%"
SET /a count=0
:testlp
SET "char=%candidate:~0,1%"
IF "%char%" gtr "9" GOTO lopchar
IF "%char%" lss "0" GOTO lopchar
SET /a count+=1
IF %count% lss 10 SET "candidate=%candidate:~1%"&GOTO testlp
ECHO %line:~0,10%
GOTO :eof

You would need to change the settings of sourcedir and destdir to suit your circumstances. I used a file named q44134518.txt containing your data plus some extra for my testing.

The batch produces the file defined as %outfile%.

Each line of data is read into %%a and then copied to line.

Each line is processed starting at :process. If the line is shorter than 10 characters, the subroutine terminates.

Otherwise the first 10 characters are copied to candidate and count is reset to 0.

The first character of candidate is assigned to char and tested against '9' and '0'. If it is greater than '9' or less than '0', the first character of line is lopped off and the process starts again (until either a digit is found or line has 9 or fewer characters).

Each successive digit is counted. If fewer than 10 have been counted, the first character is dropped from candidate and the check repeats.

When 10 successive digits have been counted, the first 10 characters of line, all of which are digits and are the data sought, are echoed.

Magoo
1

Just another option

@echo off
    setlocal enableextensions disabledelayedexpansion

    rem Configure
    set "file=input.txt"

    rem Initialization
    set "counter=0" & set "number="

    rem Convert file to a character per line and add ending line
    (for /f "delims=" %%a in ('
        ^( cmd /q /u /c type "%file%" ^& echo( ^)^| find /v ""
    ') do (
        rem See if current character is a number
        (for /f "delims=0123456789" %%b in ("%%a") do (
            rem Not a number, see if we have retrieved 10 consecutive numbers 
            set /a "1/((counter+1)%%11)" || (
                rem We probably have 10 numbers, check and output data
                setlocal enabledelayedexpansion
                if !counter!==10 echo !number!
                endlocal
            )
            rem As current character is not a number, initialize
            set "counter=0" & set "number="
        )) || ( 
            rem Digit read, increase counter and concatenate
            set /a "counter+=1"
            setlocal enabledelayedexpansion
            for %%b in ("!number!") do endlocal & set "number=%%~b%%a"
        )
    )) 2>nul 

The basic idea is to start a cmd instance with Unicode output, type the file from this instance, and filter the two-byte output with find, which expands each input line into one character per output line.

Once each character is on a separate line, and this output is processed inside a for /f command, we only need to concatenate successive digits until a non-digit character is found. At that point we check whether a run of 10 digits was read and output it if so.
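
The accumulate-and-check step described above can also be sketched on a single string, without the Unicode pipe. This is an illustrative variant rather than the code above: the sample line and the names `run`, `count`, `ch` and `isdigit` are hypothetical, and the sample line contains no `!`, which delayed expansion would swallow.

```bat
@echo off
setlocal EnableDelayedExpansion
rem Hypothetical sample line with two valid numbers and one short run
set "line=CAMP-0123456789 tel 12345 x9876543210"
set "run=" & set "count=0"
:next
if not defined line goto :done
rem Take the first character and shorten the remaining line
set "ch=!line:~0,1!"
set "line=!line:~1!"
set "isdigit="
for %%d in (0 1 2 3 4 5 6 7 8 9) do if "!ch!"=="%%d" set "isdigit=1"
if defined isdigit (
    rem Digit: extend the current run
    set "run=!run!!ch!"
    set /a count+=1
) else (
    rem Non-digit: emit the run only if it holds exactly 10 digits
    if !count! equ 10 echo !run!
    set "run=" & set "count=0"
)
goto :next
:done
rem A run that reaches the end of the line has not been checked yet
if !count! equ 10 echo !run!
```

For the sample line this prints `0123456789` and `9876543210`; the 5-digit run `12345` is skipped. Looping with `goto` over every character is slow on large files, so this only illustrates the logic.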

MC ND