I have been given an oddly structured dataset that I need to prepare for visualisation in GIS. The data is from historic newspapers from different locations in China, published between 1921 and 1937. The excel table is structured as follows:
There is a sheet for each location, 2. each sheet has a column for every year and the variables for each newspaper is organised in blocks of 7 rows and separated by a blank row. Here's a sample from one of the sheets:
,1921年,1922年,1923年
,,,
Title of Newspaper 1,遠東報(Yuan Dong Bao),,
(Language),漢文,,
(Ideology),東支鉄道機関紙,,
(Owner),(総経理)史秉臣,,
(Editior),張福臣,,
(Publication Frequency),日刊,,
(Circulation),"1,000",,
(Others),1908年創刊、東支鉄道に支那か勢力を扶殖せし以来著しく排日記事を記載す,,
,,,
Title of Newspaper 2,ノーウォスチ・ジーズニ(Nōuosuchi Jīzuni),ノォウォスチ・ジーズニ,ノウウスチジーズニ
(Language),露文,露文,露文
(Ideology),政治、社会、文学、商工新聞,社会民主,社会民主
(Owner),タワリシエスウオ・ペチャヤチャ,ぺチヤチ合名会社,ぺチヤチ合名会社
(Editior),イ・エフ・ブロクミユレル,イ・エフ・ブロクミユレル(本名クリオリン)、記者(社員)チエルニヤエフスキー,イ・エフ・ブロクミユレル(本名クリオリン)、記者(社員)チエルニヤエフスキー
(Publication Frequency),日刊,日刊,日刊
(Circulation),"約3,000","約3,000","3,000"
(Others),1909年創刊、猶太人会より補助を受く、「エス・エル」党の過激派に接近せる主張をなす、哈爾賓諸新聞中最も紙面整頓せるものにして記事多く比較的正確なり、日本お対露干渉排斥排日記事を掲載す,1909年創刊、エス・エル党の過激派に接近せる主張をなす、哈爾賓諸新聞中最も紙面整頓し記事多く比較的正確にして最も有力なるものなり、日本お対露干渉排斥及一般排日記事を掲載し「チタ」政府を擁護す,1909年創刊、過激派に益々接近し長春会議以後は「ダリタ」通信と相待って過激派系の両翼たりし感あり、紙面整頓し記事比較的正確且つ金力に於て猶太人系の後援を有し最も有力なる新聞たり、一般排日記事を掲載し支那側に媚を呈す、「チタ」政権の擁護をなし当地に於ける機関紙たりと自任す
,,,
Title of Newspaper 3,北満洲(Kita Manshū),哈爾賓日々新聞(Harbin Nichi-Nichi Shimbun),哈爾賓日々新聞(Harbin Nichi-Nichi Shimbun)
(Language),邦文,邦文,邦文
(Ideology),,,
(Owner),合資組織,社長 児玉右二,株式組織 (社長)児玉右二
(Editior),木下猛、遠藤規矩郎,編集長代理 阿武信一,(副社長)磯部検三 (主筆)阿武信一
(Publication Frequency),日刊,日刊,日刊
(Circulation),"1,000内外",,
(Others),大正3年7月創刊,大正11年1月創刊の予定、西比利亜新聞、北満洲及哈爾賓新聞の合同せるものなり,大正11年1月創刊
Yes, it's also in numerous non-latin languages, which makes it a little bit more challenging. I want to create a new matrix for every year, then rotate the table to turn the 7 rows for each newspaper into columns so that I end up with each row corresponding to one newspaper. Finally, I need to generate a new column that gives me the location of the newspaper. I would also like to add a unique identifier for each newspaper and add another column that states the year, just in case I decide to merge the entire dataset into a single matrix. I did the transformation manually in Excel but the entire dataset contains data several from thousand newspapers, so I need to automate the process. Here is what I want to achieve (sans unique identifier and year column):
Title of Newspaper,Language,Ideology,Owner,Editor,Publication Frequency,Circulation,Others,Location
直隷公報(Zhi Li Gong Bao),漢文,直隷省公署の公布機関,直隷省,,日刊,2500,光緒22年創刊、官報の改称,Tientsin
大公報(Da Gong Bao),漢文,稍親日的,合資組織,樊敏鋆,日刊,,光緒28年創刊、倪嗣仲の機関にて現に王祝山其の全権を握り居れり、9年夏該派の没落と共に打撃を受け少しく幹部を変更して再発行せり、但し資金は依然王より供給し居れり,Tientsin
天津日々新聞(Tianjin Ri Ri Xin Wen),漢文,日支親善,方若,郭心培,日刊,2000,光緒27年創刊、親日主義を以て一貫す、國聞報の後身なり民国9年安直戦争中直隷派の圧迫を受けたるも遂に屈せさりし,Tientsin
時聞報(Shi Wen Bao),漢文,中立,李大義,王石甫,,1000,光緒30年創刊、紙面相当価値あり,Tientsin
Is there a way of doing this in R? How will I go about it?