file_get_contents from html explode, write to cell of spreadsheet

Question

I'm trying to pull specific content out of a URL's source with file_get_contents(), explode() on the markup surrounding that content so that only the HTML-formatted content is returned, and then write each piece to its own cell of a spreadsheet or CSV. Easy, I thought.

This is what I've got:

<?php
//My .html
$url = 'http://spiderlearning.com/demo/ALG_SA_U1_L1.html';

//Get content
$content = file_get_contents($url);

//Get content sections
$lesson_name = explode( '<section id="nameField" class="editable" contenteditable="false">' , $content);
$section_title1 = explode( '<a onclick="goToByScroll(\'obj0\')" href="#">' , $content);
$challenge_q = explode( '<section id="redactor_content" class="editable" contenteditable="false">' , $content);

//Write content
$write1 = explode("</section>" , $lesson_name[1]);
$write2 = explode("</a>" , $section_title1[1]);
$write3 = explode("</section>" , $challenge_q[1]);

//Into arrays
$line1 = array($write1[0],$write2[0],$write3[0]);
$list = array($line1);

//Open .csv
$file = fopen("data/data.csv", "w");

//Write as line, delimitate with ";"
foreach ($list as $line) fputcsv($file, $line, ';');

//Close
fclose($file);
?>

Which returns:

[Image: CSV]

[Image: Excel]

What I'm looking for is:

CSV:

Unit 1 Lesson 1; 1. Challenge Questions; <p><img src="https://s3-eu-west-1.amazonaws.com/teacher-uploads.fishtree.com/SpiderLearning/1428953716a42b06b9-1ce1-4594-badd-4ab8c9b65ac0.jpeg" alt="" rel="float: left; width: 171px; height: 113.697826086957px; margin: 0px 10px 10px 0px;" style="float: left; width: 171px; height: 113.697826086957px; margin: 0px 10px 10px 0px;"></p><p>Before you begin this lesson, let's see what you already know about the topic. Take a moment to complete the three Challenge Questions that follow.</p>

The problem seems to me to be the carriage returns in the formatted content. It's also picking up parentheses around the returned content, but I'm not sure from where. Is there any way to escape these? I've put together similar functions in the past with no problem whatsoever, but this is my first file_get_contents() into CSV, and a couple of weeks in I've finally hit a wall with it.

Your approach is understandable, but what if the HTML page changes? Then your explode() approach will no longer work. You should use a library that extracts data from a web page by class name, HTML element type, and similar, for example https://github.com/paquettg/php-html-parser. — pregmatch, Aug 28 '16 at 14:57
Apologies, for context I would be running this against local files (about 4,000 of them) stored on my machine. I've only hosted this one so you could see what I'm working with. Every file is set up in the exact same way, and there will be no changes or updates to them. I just need this specific info spreadsheeted in a clean way so that I can quickly update media links and such via find/replace, copyedit, and then fire the content into a CMS via an API from the CSV I've created, if this makes sense. — SJGaliardi, Aug 28 '16 at 15:03

score 0 · Accepted Answer · answered Aug 28 '16 at 15:19

First, to get rid of the line breaks, do this:

foreach ($list as $line) fputcsv($file, preg_replace( "/\r|\n/", "", $line), ';');
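A minimal, self-contained sketch of the same idea, in case it helps to try it in isolation. The sample row is made up, the helper name stripLineBreaks is hypothetical, and an in-memory stream (php://memory) stands in for data/data.csv so nothing is written to disk:

```php
<?php
// Hypothetical helper: remove CR and LF characters from every field of a row.
function stripLineBreaks(array $row): array
{
    // preg_replace() accepts an array subject and returns an array.
    return preg_replace('/\r|\n/', '', $row);
}

// Made-up sample row; the third field contains line breaks like the scraped HTML.
$row = [
    "Unit 1 Lesson 1",
    "1. Challenge Questions",
    "<p>First line\r\nsecond line</p>\n",
];

// Write to memory instead of a real file, then read the raw CSV text back.
$file = fopen('php://memory', 'w+');
fputcsv($file, stripLineBreaks($row), ';', '"', '\\');
rewind($file);
$csv = stream_get_contents($file);
fclose($file);

echo $csv;
```

The only line break left in $csv should be the one fputcsv writes at the end of the record, so each record stays on one physical line.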

It is better to keep the double quotes that fputcsv puts around the fields. The reason is that any semicolon inside one of the fields (such as those in the style attribute) would otherwise break the CSV you describe above. The CSV you want then looks like:

"Unit 1 Lesson 1";"1. Challenge Questions";"<p><img src=""https://s3-eu-west-1.amazonaws.com/teacher-uploads.fishtree.com/SpiderLearning/1428953716a42b06b9-1ce1-4594-badd-4ab8c9b65ac0.jpeg"" alt="""" rel=""float: left; width: 171px; height: 113.697826086957px; margin: 0px 10px 10px 0px;"" style=""float: left; width: 171px; height: 113.697826086957px; margin: 0px 10px 10px 0px;""></p><p>Before you begin this lesson, let's see what you already know about the topic. Take a moment to complete the three Challenge Questions that follow.</p>"
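The doubled quotes ("") are the usual CSV way of escaping a quote inside a quoted field, so a CSV-aware reader should restore the original HTML. A short round-trip sketch, assuming the same ';' delimiter and using a made-up, shortened row and an in-memory stream:

```php
<?php
// Made-up, shortened row; the third field contains double quotes and semicolons.
$row = [
    'Unit 1 Lesson 1',
    '1. Challenge Questions',
    '<p><img src="a.jpeg" style="float: left; width: 171px;"></p>',
];

$file = fopen('php://memory', 'w+');

// Write the row; quotes inside a field are written as "".
fputcsv($file, $row, ';', '"', '\\');

// Capture the raw CSV line as it would appear in the file.
rewind($file);
$written = stream_get_contents($file);

// Parse the same line again with the matching delimiter and enclosure.
rewind($file);
$readBack = fgetcsv($file, 0, ';', '"', '\\');
fclose($file);

echo $written;
var_dump($readBack === $row);
```

If $readBack equals $row, the quoting and the embedded semicolons survive the trip unchanged, which is what matters when the CSV is later fed to another program rather than read by eye.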

But in most cases you cannot open this file directly in Excel (there is a global setting for this somewhere). You need to import the data and then set the following things:

Norman, this worked perfectly, thank you! A simple solution to a simple problem. — SJGaliardi, Aug 28 '16 at 15:30

score 0 · Answer 2 · answered Aug 28 '16 at 16:26

Here is an alternative solution based on PHP's DOMDocument class:

$url = 'http://spiderlearning.com/demo/ALG_SA_U1_L1.html';
// Load HTML via DOMDocument class
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTMLFile($url);
// Extract the elements of interest
$xpath = new DOMXPath($doc);
$list = [
    [
        "lesson" => $doc->getElementById('nameField')->textContent,
        "section" => $xpath->query("//div[@class='activitySelect']//a")[0]->textContent,
        "challenge" => innerHTML($doc->getElementById('redactor_content'))
    ]
];
// Write CSV (unchanged code)
$file = fopen("php://output", "w");
foreach ($list as $line) fputcsv($file, $line, ';');
fclose($file);

// Utility function
function innerHTML($node) {
    return implode(array_map([$node->ownerDocument,"saveHTML"], 
                             iterator_to_array($node->childNodes)));
}
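Since the question mentions running this against local files, here is a hedged sketch of how the DOM-based extraction might be wrapped in a function that takes an HTML string. The function name extractRow and the sample markup are hypothetical; only the ids nameField and redactor_content and the activitySelect class come from the code above, and the real pages may be structured differently:

```php
<?php
// Hypothetical helper: build one CSV row from an HTML string.
function extractRow(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings about HTML5 tags
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $name = $xpath->query("//*[@id='nameField']")->item(0);
    $link = $xpath->query("//div[@class='activitySelect']//a")->item(0);
    $body = $xpath->query("//*[@id='redactor_content']")->item(0);

    // Inner HTML: serialize each child node of the content section.
    $inner = '';
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $inner .= $doc->saveHTML($child);
        }
    }

    return [
        $name !== null ? trim($name->textContent) : '',
        $link !== null ? trim($link->textContent) : '',
        preg_replace('/\r|\n/', '', $inner), // keep each record on one line
    ];
}

// Made-up sample page that mimics the structure assumed above.
$sample = '<html><body>'
    . '<section id="nameField" class="editable" contenteditable="false">Unit 1 Lesson 1</section>'
    . '<div class="activitySelect"><a onclick="goToByScroll(\'obj0\')" href="#">1. Challenge Questions</a></div>'
    . '<section id="redactor_content" class="editable" contenteditable="false">' . "\n"
    . '<p>Before you begin this lesson.</p>' . "\n"
    . '</section></body></html>';

$row = extractRow($sample);
print_r($row);
```

For a directory of local files, the same function could be called in a loop, for example over glob('lessons/*.html') (a hypothetical path), passing file_get_contents($path) and writing each returned row with fputcsv.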
