-1

I have been given an oddly structured dataset that I need to prepare for visualisation in GIS. The data is from historic newspapers from different locations in China, published between 1921 and 1937. The excel table is structured as follows:

There is a sheet for each location, 2. each sheet has a column for every year and the variables for each newspaper is organised in blocks of 7 rows and separated by a blank row. Here's a sample from one of the sheets:

,1921年,1922年,1923年
,,,
Title of Newspaper 1,遠東報(Yuan Dong Bao),,
(Language),漢文,,
(Ideology),東支鉄道機関紙,,
(Owner),(総経理)史秉臣,,
(Editior),張福臣,,
(Publication Frequency),日刊,,
(Circulation),"1,000",,
(Others),1908年創刊、東支鉄道に支那か勢力を扶殖せし以来著しく排日記事を記載す,,
,,,
Title of Newspaper 2,ノーウォスチ・ジーズニ(Nōuosuchi Jīzuni),ノォウォスチ・ジーズニ,ノウウスチジーズニ
(Language),露文,露文,露文
(Ideology),政治、社会、文学、商工新聞,社会民主,社会民主
(Owner),タワリシエスウオ・ペチャヤチャ,ぺチヤチ合名会社,ぺチヤチ合名会社
(Editior),イ・エフ・ブロクミユレル,イ・エフ・ブロクミユレル(本名クリオリン)、記者(社員)チエルニヤエフスキー,イ・エフ・ブロクミユレル(本名クリオリン)、記者(社員)チエルニヤエフスキー
(Publication Frequency),日刊,日刊,日刊
(Circulation),"約3,000","約3,000","3,000"
(Others),1909年創刊、猶太人会より補助を受く、「エス・エル」党の過激派に接近せる主張をなす、哈爾賓諸新聞中最も紙面整頓せるものにして記事多く比較的正確なり、日本お対露干渉排斥排日記事を掲載す,1909年創刊、エス・エル党の過激派に接近せる主張をなす、哈爾賓諸新聞中最も紙面整頓し記事多く比較的正確にして最も有力なるものなり、日本お対露干渉排斥及一般排日記事を掲載し「チタ」政府を擁護す,1909年創刊、過激派に益々接近し長春会議以後は「ダリタ」通信と相待って過激派系の両翼たりし感あり、紙面整頓し記事比較的正確且つ金力に於て猶太人系の後援を有し最も有力なる新聞たり、一般排日記事を掲載し支那側に媚を呈す、「チタ」政権の擁護をなし当地に於ける機関紙たりと自任す
,,,
Title of Newspaper 3,北満洲(Kita Manshū),哈爾賓日々新聞(Harbin Nichi-Nichi Shimbun),哈爾賓日々新聞(Harbin Nichi-Nichi Shimbun)
(Language),邦文,邦文,邦文
(Ideology),,,
(Owner),合資組織,社長 児玉右二,株式組織 (社長)児玉右二
(Editior),木下猛、遠藤規矩郎,編集長代理 阿武信一,(副社長)磯部検三 (主筆)阿武信一
(Publication Frequency),日刊,日刊,日刊
(Circulation),"1,000内外",,
(Others),大正3年7月創刊,大正11年1月創刊の予定、西比利亜新聞、北満洲及哈爾賓新聞の合同せるものなり,大正11年1月創刊

Yes, it's also in numerous non-latin languages, which makes it a little bit more challenging. I want to create a new matrix for every year, then rotate the table to turn the 7 rows for each newspaper into columns so that I end up with each row corresponding to one newspaper. Finally, I need to generate a new column that gives me the location of the newspaper. I would also like to add a unique identifier for each newspaper and add another column that states the year, just in case I decide to merge the entire dataset into a single matrix. I did the transformation manually in Excel but the entire dataset contains data several from thousand newspapers, so I need to automate the process. Here is what I want to achieve (sans unique identifier and year column):

Title of Newspaper,Language,Ideology,Owner,Editor,Publication Frequency,Circulation,Others,Location
直隷公報(Zhi Li Gong Bao),漢文,直隷省公署の公布機関,直隷省,,日刊,2500,光緒22年創刊、官報の改称,Tientsin
大公報(Da Gong Bao),漢文,稍親日的,合資組織,樊敏鋆,日刊,,光緒28年創刊、倪嗣仲の機関にて現に王祝山其の全権を握り居れり、9年夏該派の没落と共に打撃を受け少しく幹部を変更して再発行せり、但し資金は依然王より供給し居れり,Tientsin
天津日々新聞(Tianjin Ri Ri Xin Wen),漢文,日支親善,方若,郭心培,日刊,2000,光緒27年創刊、親日主義を以て一貫す、國聞報の後身なり民国9年安直戦争中直隷派の圧迫を受けたるも遂に屈せさりし,Tientsin
時聞報(Shi Wen Bao),漢文,中立,李大義,王石甫,,1000,光緒30年創刊、紙面相当価値あり,Tientsin

Is there a way of doing this in R? How will I go about it?

IRTFM
  • 258,963
  • 21
  • 364
  • 487
  • Can you post the data in reproducible format? – Chris Ruehlemann Nov 06 '21 at 20:47
  • Sorry, I thought that copy and p[aste from a CSV would preserve the format but obviously did not. – Thorsten Kahlert Nov 06 '21 at 22:00
  • I copied it into the text again, hopefully you should be able to copy it into notepad/text edit and save as csv. – Thorsten Kahlert Nov 06 '21 at 22:14
  • I think the basic strategy would be to read in from text with readLines, drop lines 1-2, identify groupings with `cumsum(grepl(",,,", inp)` as the second arg to `split`, and then use `read.csv` and build dataframes within groups with the first column values being new column names and the second column values becoming the row values. Finally use `rbindlist`. – IRTFM Nov 07 '21 at 02:16
  • Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. – Community Nov 08 '21 at 14:29
  • Sorry, but it's taking more time to figure out how to do this in RStudio than making the changes manually in Excel. Thank you for taking the time to help me with my problem. – Thorsten Kahlert Nov 08 '21 at 20:38

1 Answers1

0

I've outlined a plan in a comment above. This is untested code that makes it more concrete. I'll keep testing till it works

inps <- readLines( ~/Documents/R_code/Tientsin unformatted.txt")
inp2 <- inps[ -(1:2) ]
  # 'identify groupings with cumsum(grepl(",,,", inp2) as the second arg to split'

inp.df <- lapply( split(inp2, cumsum(grepl(",,,", inp2) , read.csv)
library(data.table) # only needed if you use rbindlist not needed for do.call(rbind , ..

# make a list of one-line dataframes as below
# finally run rbindlist or do.call(rbind, ...)

in.dt <- do.call( rbind, (inp.df)) # rbind  checks for ordering of columns

This is the step that makes a one line dataframe from a set of text lines:

 txt <- 'Title of Newspaper 1,遠東報(Yuan Dong Bao),,
 (Language),漢文,,
 (Ideology),東支鉄道機関紙,,
 (Owner),(総経理)史秉臣,,
 (Editior),張福臣,,
 (Publication Frequency),日刊,,
 (Circulation),"1,000",,
 (Others),1908年創刊、東支鉄道に支那か勢力を扶殖せし以来著しく排日記事を記載す,,'

                                                                  
 temp=read.table(text=txt, sep="," , colClasses=c(rep("character", 2), NA, NA))
 in1 <- setNames( data.frame(as.list(temp$V2)), temp$V1)
 in1
#-------------------------
 Title of Newspaper 1 (Language)     (Ideology)          (Owner) (Editior) (Publication Frequency) (Circulation)
1 遠東報(Yuan Dong Bao)       漢文 東支鉄道機関紙 (総経理)史秉臣    張福臣                    日刊         1,000
                                                      

    (Others)
1 1908年創刊、東支鉄道に支那か勢力を扶殖せし以来著しく排日記事を記載す

So it looks like the column names of the individually constructed items would need to further processing to make them capable of being successfully "bindable" by plyr::rbindlist or data.table::rbindlist

IRTFM
  • 258,963
  • 21
  • 364
  • 487
  • I'm not sure. I just copied that text from the SO version you offered. It might be that the text on your harddrive file is not in the same encoding as expected by your installation of R. It's a separate issue from the original question and one that I expect has been asked and answered on SO. You need either to solve it by doing some research on your part or by making a sample of the data available at a web-accessible location so we can work on it in its original form. – IRTFM Nov 07 '21 at 17:27
  • Thanks for this. I could replicate the results using the data you provided but when I replaced the above table with a path to my csv file I got this: > txt <- read.csv("~/Dropbox/Japan Project/Tientsin unformatted.txt", encoding="UTF-8-MAC") > View(txt) > temp=read.table(text=txt, sep="," , colClasses=c(rep("character", 2), NA, NA)) Error in textConnection(text, encoding = "UTF-8") : invalid 'text' argument Here is a dropbox link to the original csv: https://www.dropbox.com/s/881bc39ugbeb35y/Tientsin%20unformatted.txt?dl=0 – Thorsten Kahlert Nov 07 '21 at 17:27
  • My apologies for being a bit messy here but I'm completely new to posting on stack overflow and I am not very versed in RStudio either. I'm from a GIS background and try to use RStudio for a very specific problem. – Thorsten Kahlert Nov 07 '21 at 17:29
  • My comment advised that it should be read using `readLInes`. (Because you need to break it up into digestible chunks.) When I do that on Linux box after "direct download" from the dropbox, I get no encoding error. – IRTFM Nov 07 '21 at 17:33
  • Ah OK. I needed to get my head around your first comment but it makes more sense now. I have to admit that I have no idea how to do this part though. I will search for instructions on here. – Thorsten Kahlert Nov 07 '21 at 17:52