
I'm parsing XML that contains URLs. I want to iterate over the XML, collect all the URLs, and make a request to each one, but the strings contain the newline character \n. How can I avoid these newlines in the URLs?

Go version is go1.12.7 darwin/amd64. I have a workaround for this problem: I just remove these characters from the string.

package main

import (
    "encoding/xml"
    "fmt"
    "io/ioutil"
    "log"
    "net/http"
    "strings"
)



type SitemapIndex struct {
    Locations []string `xml:"sitemap>loc"`
}

type NewsMap struct {
    Keyword  string
    Location string
}

type News struct {
    Titles    []string `xml:"url>news>title"`
    Keywords  []string `xml:"url>news>keywords"`
    Locations []string `xml:"url>loc"`
}


func main() {
    var s SitemapIndex
    var n News
    newsMap := make(map[string]NewsMap)
    resp, _ := http.Get("https://washingtonpost.com/news-sitemaps/index.xml")
    bytes, _ := ioutil.ReadAll(resp.Body)

    xml.Unmarshal(bytes, &s)

    for _, Location := range s.Locations {
        tempURL := strings.Replace(Location, "n", "", -1) // how to avoid new lines character in url?
        resp, err := http.Get(tempURL)
        // do some stuff...
    }
}

Without this Replace call on Location, I get the error: parse https://www.washingtonpost.com/news-sitemaps/politics.xml : net/url: invalid control character in URL exit status 1

Here is an example XML file: https://www.washingtonpost.com/news-sitemaps/politics.xml

Jonathan Hall
  • My question is how to avoid the newline character coming from the XML file, not how to remove it. I wonder if it can be avoided; maybe I did something wrong. – Damian Wysocki Jul 27 '19 at 14:34
  • The code is missing a backslash. Use this: `strings.Replace(Location, "\n", "", -1)` – Charlie Tumahai Jul 27 '19 at 14:48
  • @CeriseLimón I already use this in my code and it works, but why do I need to do this? – Damian Wysocki Jul 27 '19 at 14:50
  • @DamianWysocki looking at the raw XML from the washingtonpost URL you give, it contains things like "\n\nhttps://…\n\n\n". In other words, it seems to have a ton of extraneous newlines, and in particular each `loc` element is "\nDATA\n". I'd probably use `strings.Trim(Location, "\n")` (or `strings.TrimSpace`) rather than replacing within the entire string. – Dave C Jul 27 '19 at 14:58
  • @DaveC I missed that. What a shame. I thought that my code was doing something wrong. Thanks for the explanation. – Damian Wysocki Jul 27 '19 at 15:15
  • The question is regarding this tutorial: https://pythonprogramming.net/go/parsing-xml-go-language-programming-tutorial/ – Charlie Tumahai Jul 28 '19 at 16:22

1 Answer


The XML text contains newlines, as mentioned by Dave C in a comment. Because newlines are not allowed in URLs, you must remove them.

Fix this by replacing "\n" (instead of "n") with "". Note the backslash.

tempURL := strings.Replace(Location, "\n", "", -1) 

A better fix is to use strings.TrimSpace (also mentioned by Dave C). This will handle all extraneous whitespace that might be present in the file:

tempURL := strings.TrimSpace(Location) 
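
To see the whole flow in one place, here is a minimal sketch. It assumes a made-up inline XML document (`sampleXML`, with example.com URLs) in place of the real sitemap index, and a hypothetical helper `cleanLocations` that is not part of the question's code. It only illustrates that `encoding/xml` keeps the newlines inside each `loc` element and that trimming them gives strings `url.Parse` accepts.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"net/url"
	"strings"
)

// sampleXML is a made-up stand-in for the real sitemap index. Each <loc>
// value is wrapped in newlines, like the "\nDATA\n" pattern described above.
const sampleXML = `<sitemapindex>
  <sitemap>
    <loc>
https://www.example.com/news-sitemaps/politics.xml
</loc>
  </sitemap>
  <sitemap>
    <loc>
https://www.example.com/news-sitemaps/opinions.xml
</loc>
  </sitemap>
</sitemapindex>`

type SitemapIndex struct {
	Locations []string `xml:"sitemap>loc"`
}

// cleanLocations is a hypothetical helper: it unmarshals the sitemap index
// and strips leading and trailing whitespace from every location.
func cleanLocations(data []byte) ([]string, error) {
	var s SitemapIndex
	if err := xml.Unmarshal(data, &s); err != nil {
		return nil, err
	}
	cleaned := make([]string, 0, len(s.Locations))
	for _, loc := range s.Locations {
		cleaned = append(cleaned, strings.TrimSpace(loc))
	}
	return cleaned, nil
}

func main() {
	locs, err := cleanLocations([]byte(sampleXML))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, loc := range locs {
		// url.Parse rejects control characters such as "\n",
		// so the untrimmed values would fail here.
		if _, err := url.Parse(loc); err != nil {
			fmt.Println("invalid URL:", err)
			continue
		}
		fmt.Printf("%q\n", loc)
	}
}
```

In the question's loop, the same idea amounts to passing `strings.TrimSpace(Location)` to `http.Get` and checking the returned error.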
nehoory
  • Just to clarify, `TrimSpace` only removes leading and trailing whitespace. It won't remove whitespace in the middle of a string. – Jessie Jul 27 '19 at 19:15
  • @user2896976, correct, but since it only needs to look at the start/end of the string it'll be faster. It already appears the source data is "broken" and it's a matter of how broken you expect it to ever be... is it just adding extraneous \n at the beginning and end (probable) or adding random whitespace at any old point within the URL (unlikely). Also note, the Trim functions can always just return a sub-string whereas Replace may need to allocate a new string (and may do so even in cases where it could just trim). – Dave C Jul 28 '19 at 12:46
  • Yes, I wanted to clarify because "This will handle all extraneous whitespace that might be present in the file" can easily be misinterpreted to mean, "This will remove all whitespace present in the file" – Jessie Jul 29 '19 at 15:08