3

Having such a text file (example): https://drive.google.com/open?id=0B1vq9WjkqkvzTEVEUnlXMGVFa00

Original file has 65k rows. I need to upload it to R and make processable. I used the following functions:

  1. read.table - didn't work (R never returned any result)
  2. fread from data.table package - required a lot of manual preprocessing of the file and didn't work as needed as quotes broke the line and the file wasn't in the appropriate form)
  3. scan got a vector, transformation into matrix didn't bring needed results.

Desired form of the file is a regular data frame:

mydata <- structure(list(fieldName = structure(c(3L, 3L), .Label = c("description", 
    "scraped_manufacturer", "title"), class = "factor"), foreign_id = c(13389, 
    13389), is_single_product = structure(1:2, .Label = c("FALSE", 
    "TRUE"), class = "factor"), matched_manufacturers = c("Foden /manId: 76775", 
    "Caterpillar /manId: 74, Skogsjan-Caterpillar /manId: 10329"), 
        matched_products = c("", "C12 /modelId: 32774 /manId: 74"
        ), raw_string = c("CAT FODEN C-12 ENGINE", "CATERPILLAR C-12 ENGINE"
        ), pagesource = structure(c(84L, 84L), .Label = c("", "585e362f6b010083d6962041", 
        "585f270a300000c614b819ed", "585f84be6b0100c6ee962ab1", "585f84dc66010074efac42ca", 
        "585f875a6b0100c7ee963000", "585f878c66010074efac483e", "585f87ad66010075efac4880", 
        "585f88e06b0100b6ee96331c", "585f8b4566010074efac4fcb", "agriaffaires", 
        "apex-auctions", "arlington-plastics-machinery", "auctelia", 
        "auctions-international", "autogilles", "baestlein", "baupool", 
        "bavaria-swiss-ag", "big-iron", "big-machinery", "blackforxx", 
        "blue-group", "bpi-associates", "buk-baumaschinen", "cegema", 
        "christophbusch", "cjm-asset", "classified", "cnc-auction", 
        "cottrill-and-co", "daan", "de-vries", "dechow", "dimex-import-export", 
        "e-farm", "ebay", "ebay-de", "eberle-hald-gmbh", "eggers-landmaschinen", 
        "euro-auctions", "fabricating-machinery-corp", "fastline", 
        "ferwood", "fh-machinery", "first-machinery-auctions-limited", 
        "forklift-international", "ga-tec-gabelstaplertechnik", "gambtec", 
        "geiger", "german-graphics", "goindustry-dovebid", "graf", 
        "gruma-nutzfahrzeuge-gmbh", "hanselmann", "heinrich-kuper-gmbh", 
        "hooray-machinery", "imz-maschinen", "industrial-discount", 
        "ipr-petmachinery", "ironplanet", "ironplanet-com", "karl-guenter-wirths-gmbh", 
        "karner-dechow", "kurt-steiger", "kvd-auctions", "lagermaschinen", 
        "leinweber-landtechnik", "mach4metal", "machinefinder", "machinery-park", 
        "machineryzone", "maschinenbau-rehnen-gmbh", "mideast-equipment", 
        "mmtequipment", "oskar-broziat-maschinen", "perfection-global", 
        "perlick", "perry-videx", "pfeifer-machinery", "plustech-as", 
        "polboto-agri-sp-z-oo", "pressenhaas", "rc-tuxford-exports", 
        "resale", "restlos", "richter-friedewald-gmbh", "ritchie-bros", 
        "rock-and-dirt", "rogiers", "rs-auktionen", "stig-bindner", 
        "surplex", "technikboerse", "themar-trucks", "traktorpool", 
        "unilift", "vebim", "vertimac", "zeppelin-caterpillar", "zoll-auktion", 
        "zuern-gmbh"), class = "factor")), .Names = c("fieldName", 
    "foreign_id", "is_single_product", "matched_manufacturers", "matched_products", 
    "raw_string", "pagesource"), row.names = 1:2, class = "data.frame")

Any ideas of how to make it possible to work with the file?

hamsy
  • 369
  • 6
  • 14
  • This is rtf (Rich Text Format). See [Wikipedia page on RTF](https://en.wikipedia.org/wiki/Rich_Text_Format#Code_syntax). There is an R package called rtf, but it seems to be only for creating rtf files, not reading them. There is some advice at [This SO Post](http://stackoverflow.com/questions/23634298/parsing-rtf-files-into-r) – G5W Jan 08 '17 at 11:47

1 Answers1

2

Consider opening the text file in a software that can read RTF types. On Windows machines, Microsoft Word and the built-in Wordpad can read .rtf documents. And in doing so, a valid json shows in document (without markup content).

JSON text

Fortunately, R on Windows can connect to the MS Word Object Library using the RDCOMClient library where you extract text using the Document.Content property. Once you read in the json text, use the jsonlite library to migrate content to a dataframe:

library(RDCOMClient)
library(jsonlite)

# OPEN WORD APP
wrdApp = COMCreate("Word.Application")
wrdDoc = wrdApp$Documents()$Open("C:\Path\To\Data.txt")    
wrdtext = wrdDoc[['Content']]

# EXTRACT TEXT TO R VARIABLE
doc = wrdtext$Text()

# CLOSE APP
wrdDoc$Close(FALSE)
wrdApp$Quit()

# RELEASE RESOURCES
wrdtext <- wrdDoc <- wrdApp <- NULL
rm(wrdtext, wrdDoc, wrdApp)
gc()

# RAW DF: NAME / COLUMNS / VALUES LIST TYPES
rawdf <- fromJSON(doc)[[1]][[1]][[1]]

# FINAL DF: NORMALIZING VALUES WITH COL NAMES
finaldf <- setNames(data.frame(rawdf$values, stringsAsFactors = FALSE),
                               rawdf$columns[[1]])

Output

Final Dataframe


Alternative

Should you not have MS Word installed. Launch the CMD prompt and open Wordpad (built-in Windows app) with the command line and copy all content to a .json file (or right click text file and open with Wordpad). Do the counterpart of special app and terminal call if on another OS (Linux/Mac):

write "D:\Path\To\Data.txt"

After json file is saved, then in R run:

rawdf <- do.call(rbind, lapply(paste(reaadLines("C:\Path\To\Data.json", warn=FALSE),
                                      collapse=""), 
                                jsonlite::fromJSON))[[1]][[1]][[1]]

finaldf <- setNames(data.frame(rawdf$values, stringsAsFactors = FALSE),
                    rawdf$columns[[1]])
Parfait
  • 104,375
  • 17
  • 94
  • 125
  • I use mac so I used the alternative way. Executed the original file in txt format, worked perfectly! Thank you a lot! – hamsy Jan 08 '17 at 19:15
  • Awesome! Glad to help. No doubt, the Mac MS Word also reads `.rtf` formats. – Parfait Jan 08 '17 at 19:21