0

I have some xml files in the format

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD with OASIS Tables v1.0 20120330//EN" "JATS-journalpublishing-oasis-article1.dtd">
<article article-type="proceedings" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:oasis="http://www.niso.org/standards/z39-96/ns/oasis-exchange/table">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id"/>
<journal-title-group>
<journal-title>Eleventh International Conference on Correlation Optics</journal-title>
</journal-title-group>
<issn pub-type="epub">0277-786X</issn>
<publisher>
<publisher-name>SPIE</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">@@</article-id>
<title-group>
<article-title>Comparison of absorption spectra of adenocarcinoma and squamous cell carcinoma cervical tissue</article-title>
</title-group>
<pub-date>
<year>2013</year>
</pub-date>
<volume>9066</volume>
<issue>90661T</issue>
<isbn content-type="conf-proc">978-1-5106-0374-5</isbn>
<fpage>90661T-1</fpage>
<lpage>90661T-7</lpage>
</article-meta>
</front>

and I have xlsx file like below

Filename        Pages       DOI             Publisher   Year    Price
01022018_1020   14    10.1111/12.2033558    Springer    2017    $15.13 
01022018_1021   7     10.1111/12.2033559    Elsevier    2018    $15.25 
01022018_1130   30    10.1113/20.3023658    Elsevier    2017    $19.25 
01022018_1301   10    10.1113/20.3023777    Wiley       2015    $11.50 

I want to get the respective values of the fields DOI and Publisher for each files(searching by file name) and update them to the xml files i.e. if the file name is 01022018_1021.xml then get its respective DOI and Publisher i.e. 10.1111/12.2033559 and Elsevier and put them inside nodes like <publisher-name>Elsevier</publisher-name> and <article-id pub-id-type="doi">10.1111/12.2033559</article-id>

I've never used exceldatareader before but I tried like this

var xmlFiles=Directory.EnumerateFiles(@"D:\test\testobject","*.xml",SearchOption.AllDirectories);
Parallel.ForEach(xmlFiles,xmlFile=>{
    var name=Path.GetFileName(xmlFile);

    FileStream stream = File.Open(@"C:\Don\Downloads\Download-Sample-File-xlsx.xlsx", FileMode.Open, FileAccess.Read);
    IExcelDataReader excelReader = ExcelReaderFactory.CreateOpenXmlReader(stream);

    DataSet result = excelReader.AsDataSet(new ExcelDataSetConfiguration() {
    ConfigureDataTable = (_) => new ExcelDataTableConfiguration() {
    UseHeaderRow = true
    }
    });

    DataTable dt = result.Tables[0];

    foreach (DataRow row in dt.Rows)
        {
        var x = (string)row["Publisher"];
        var y = (string)row["DOI"];
        /*how do I get the required values
        var pub_name="...";
        var doi="...";
        */
        XDocument doc=XDocument.Load(xmlFile);
        doc.Descendants("pub-name").First().Value=pub_name;
        doc.Descendants("article-id").First().Value=doi;
        doc.Save(xmlFile);

        }
    excelReader.Close();
});

Console.WriteLine("Done");
Console.ReadLine();

Can anyone help?

Don_B
  • 243
  • 2
  • 15

1 Answers1

2

It looks like you already have the values.

doc.Descendants("pub-name").First().Value=x;
doc.Descendants("article-id").First().Value=y;

Edit: I didn't catch the bit about needing to find the right row.

DataTable dt = result.Tables[0];

var filename = Path.GetFileNameWithoutExtension(xmlFile);

// Get the row for the current file
var rowForFile = dt.Rows.Cast<DataRow>()
    .SingleOrDefault(r => (string)r["Filename"] == filename);

if (rowForFile != null)
{
    XDocument doc = XDocument.Load(xmlFile);
    doc.Descendants("publisher-name").First().Value = (string)rowForFile["Publisher"];
    doc.Descendants("article-id").First().Value = (string)rowForFile["DOI"];
    doc.Save(xmlFile);
}
asherber
  • 2,508
  • 1
  • 15
  • 12
  • but the `var x = (string)row["Publisher"];` and `var y = (string)row["DOI"];` gets all the values of that row one by one, but I only want the values **"Publisher"** and **"DOI"** whose **Filename** value is say **01022018_1130**, how do I do that? – Don_B Feb 25 '18 at 02:03
  • Sorry, I missed that part of the issue. I'll amend my answer. – asherber Feb 25 '18 at 04:04
  • thanks, btw do you mind explaining the `var rowForFile` part, like how it works? – Don_B Feb 25 '18 at 04:38
  • `dt.Rows` is like a non-generic collection of objects, so `Cast()` says 'treat each item like a `DataRow`'. Then `SingleOrDefault()` says 'give me the one row that matches this criterion, or null if there isn't one'. (It will throw an exception if there's more than one matching row.) And the criterion is a lambda that says 'I want a row where the the Filename field matches the value of `filename`'. – asherber Feb 25 '18 at 13:36
  • You could do the same thing by using the `Select()` method on `DataTable`, which returns a `DataRow[]`, along with a little LINQ. – asherber Feb 25 '18 at 13:40