
I'm working with StreamSets on a Cloudera Distribution, trying to ingest some data from this website: http://files.data.gouv.fr/sirene/

I've encountered some issues choosing the parameters of both the HTTP Client and the Hadoop FS Destination.

https://image.noelshack.com/fichiers/2017/44/2/1509457504-streamsets-f.jpg

I get this error: HTTP_00 - Cannot parse record: java.io.IOException: org.apache.commons.compress.archivers.ArchiveException: No Archiver found for the stream signature

I'll show you my configuration.

HTTP Client :

General

Name : HTTP Client INSEE

Description : Client HTTP SIRENE

On Record Error : Send to Error

HTTP

Resource URL : http://files.data.gouv.fr/sirene/

Headers : sirene_ : sirene_

Mode : Streaming

Per-Status Actions

HTTP Status Code : 500 | Action for Status : Retry with exponential backoff

Base Backoff Interval (ms) : 1000 | Max Retries : 10

HTTP Method : GET

Body Time Zone : UTC (UTC)

Request Transfer Encoding : BUFFERED

HTTP Compression : None

Connect Timeout : 0

Read Timeout : 0

Authentication Type : None

Use OAuth 2

Use Proxy

Max Batch Size (records) : 1000

Batch Wait Time (ms) : 2000

Pagination

Pagination Mode : None

TLS

UseTLS

Timeout Handling

Action for timeout : Retry immediately

Max Retries : 10

Data Format

Data Format : Delimited

Compression Format : Archive

File Name Pattern within Compressed Directory : *.csv

Delimiter Format Type : Custom

Header Line : With Header Line

Max Record Length (chars) : 1024

Allow Extra Columns

Delimiter Character : Semicolon

Escape Character : Other \

Quote Character : Other "

Root Field Type : List-Map

Lines to Skip : 0

Parse NULLs

Charset : UTF-8

Ignore Control Characters

Hadoop FS Destination :

General

Name : Hadoop FS 1

Description : Writing into HDFS

Stage Library : CDH 5.7.6

Produce Events

Required Fields

Preconditions

On Record Error : Send to Error

Output Files

File Type : Whole File

Files Prefix

Directory in Header

Directory Template : /user/pap/StreamSets/sirene/

Data Time Zone : UTC (UTC)

Time Basis : ${time:now()}

Use Roll Attribute

Validate HDFS Permissions : ON

Skip file recovery : ON

Late Records

Late Record Time Limit (secs) : ${1 * HOURS}

Late Record Handling : Send to error

Data Format

Data Format : Whole File

File Name Expression : ${record:value('/fileInfo/filename')}

Permissions Expression : 777

File Exists : Overwrite

Include Checksum in Events

... so what am I doing wrong? :(

  • It might be because the first file from http://files.data.gouv.fr/sirene/ is README.txt and not a CSV file within a zip. How do I ignore this file, since there is no File Name Pattern option like in the SFTP/FTP Client? – VincentM000 Oct 31 '17 at 14:28

2 Answers


It looks like http://files.data.gouv.fr/sirene/ is returning a file listing, rather than a compressed archive. This is a tricky one, since there isn't a standard way to iterate through such a listing. You might be able to read http://files.data.gouv.fr/sirene/ as text, then use the Jython evaluator to parse out the zip file URLs, retrieve, decompress and parse them, adding the parsed records to the batch. I think you'd have problems with this method, though, as all the records would end up in the same batch, blowing out memory.
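
A minimal sketch of the URL-parsing step of that approach, assuming the HTTP Client origin uses the Text data format so each line of the listing arrives as a record with a /text field (the records/output/error bindings are the standard Jython Evaluator ones; treat this as a sketch rather than a finished script):

import re

for record in records:
    try:
        line = record.value['text']
        for name in re.findall(r'(sirene\w+\.zip)', line):
            # Turn the matched file name into a full download URL
            record.value['text'] = 'http://files.data.gouv.fr/sirene/' + name
            output.write(record)
    except Exception as e:
        error.write(record, str(e))

Retrieving, decompressing and parsing each zip inside the same script is where the memory problem I mentioned would bite.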

Another idea might be to use two pipelines - the first would use the HTTP Client origin and a script evaluator to download the zipped files and write them to a local directory, as sketched below. The second pipeline would then read in the zipped CSVs via the Directory origin as normal.
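
A sketch of the first pipeline's script evaluator, under the same Text data format assumption; /data/sirene is a hypothetical local directory that the second pipeline's Directory origin would watch:

import re
import shutil
import urllib2

for record in records:
    try:
        line = record.value['text']
        for name in re.findall(r'(sirene\w+\.zip)', line):
            url = 'http://files.data.gouv.fr/sirene/' + name
            response = urllib2.urlopen(url)
            # Stream each zip to disk in chunks instead of holding it in memory
            target = open('/data/sirene/' + name, 'wb')
            shutil.copyfileobj(response, target)
            target.close()
            output.write(record)
    except Exception as e:
        error.write(record, str(e))

The Directory origin in the second pipeline could then use the Delimited data format with Compression Format set to Archive, much like the HTTP Client configuration above.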

If you do decide to have a go, please engage with the StreamSets community via one of our channels - see https://streamsets.com/community

metadaddy

I'm writing the Jython Evaluator script. I'm not familiar with the available constants/objects/records as presented in its comments. I tried to adapt this Python script to the Jython Evaluator:

import re
import itertools
import urllib2

# Collect the sirene zip file names from each line of the saved listing
data = [re.findall(r'(sirene\w+\.zip)', line) for line in open('/home/user/Desktop/filesdatatest.txt')]
data_list = filter(None, data)                     # drop lines without a match
data_brackets = list(itertools.chain(*data_list))  # flatten the per-line lists
data_clean = ["http://files.data.gouv.fr/sirene/" + url for url in data_brackets]
for url in data_clean:
    urllib2.urlopen(url)  # opens the connection; the response still has to be read and written to disk

The line

records = [re.findall(r'(sirene\w+.zip)', record) for record in records]

gave me this error message:

SCRIPTING_05 - Script error while processing record: javax.script.ScriptException: TypeError: expected string or buffer, but got in at line number 50

filesdatatest.txt contains things like:

Listing of /v1/AUTH_6032cb4c2159474684c8df1da2e2b642/storage/sirene/  
Name    Size    Date  
../            
README.txt  2Ki     2017-10-11 03:31:57  
sirene_201612_L_M.zip   1Gi     2017-01-05 00:12:08  
sirene_2017002_E_Q.zip  444Ki   2017-01-05 00:44:58  
sirene_2017003_E_Q.zip  6Mi     2017-01-05 00:45:01  
sirene_2017004_E_Q.zip  2Mi     2017-01-05 03:37:42  
sirene_2017005_E_Q.zip  2Mi     2017-01-06 03:40:47  
sirene_2017006_E_Q.zip  2Mi     2017-01-07 05:04:04  

so I know how to parse records.
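
Part of the trouble is that records in the Jython Evaluator holds Record objects, not plain strings, which is presumably what TypeError: expected string or buffer is complaining about. Assuming the origin uses the Text data format, each line is in the record's /text field, so the adaptation would look like:

# record is a Record object; the raw line is in its '/text' field
matches = [re.findall(r'(sirene\w+\.zip)', record.value['text']) for record in records]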