I'm working with StreamSets Data Collector on a Cloudera distribution, trying to ingest data from this site: http://files.data.gouv.fr/sirene/
I'm running into trouble choosing the parameters for both the HTTP Client origin and the Hadoop FS destination.
https://image.noelshack.com/fichiers/2017/44/2/1509457504-streamsets-f.jpg
I get this error: HTTP_00 - Cannot parse record: java.io.IOException: org.apache.commons.compress.archivers.ArchiveException: No Archiver found for the stream signature
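To see what the origin actually receives from that URL, I fetched it myself and looked at the first bytes (a minimal sketch using only the Python standard library; the ZIP magic-byte check is my assumption about what Commons Compress is matching against):

```python
# Quick check of what the HTTP Client origin actually receives from the URL.
# Minimal sketch using only the Python standard library.
import urllib.request

URL = "http://files.data.gouv.fr/sirene/"

with urllib.request.urlopen(URL) as resp:
    head = resp.read(8)

# A ZIP archive starts with the signature PK\x03\x04. If the first bytes
# look like HTML instead (e.g. the directory index page), Commons Compress
# finds no archive signature to match, which would explain "No Archiver found".
print(head)
print("ZIP signature" if head.startswith(b"PK\x03\x04") else "not an archive signature")
```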
Here's my configuration:
HTTP Client :
General
Name : HTTP Client INSEE
Description : Client HTTP SIRENE
On Record Error : Send to Error
HTTP
Resource URL : http://files.data.gouv.fr/sirene/
Headers : sirene_ : sirene_
Mode : Streaming
Per-Status Actions
HTTP Status Code : 500 | Action for Status : Retry with exponential backoff
Base Backoff Interval (ms) : 1000 | Max Retries : 10
HTTP Method : GET
Body Time Zone : UTC (UTC)
Request Transfer Encoding : BUFFERED
HTTP Compression : None
Connect Timeout : 0
Read Timeout : 0
Authentication Type : None
Use OAuth 2
Use Proxy
Max Batch Size (records) : 1000
Batch Wait Time (ms) : 2000
Pagination
Pagination Mode : None
TLS
Use TLS
Timeout Handling
Action for timeout : Retry immediately
Max Retries : 10
Data Format
Data Format : Delimited
Compression Format : Archive
File Name Pattern within Compressed Directory : *.csv
Delimiter Format Type : Custom
Header Line : With Header Line
Max Record Length (chars) : 1024
Allow Extra Columns
Delimiter Character : Semicolon
Escape Character : Other \
Quote Character : Other "
Root Field Type : List-Map
Lines to Skip : 0
Parse NULLs
Charset : UTF-8
Ignore Control Characters
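For reference, here is roughly what I expect those Delimited settings to do once the archive is unpacked, written as a plain-Python equivalent (the archive name is a hypothetical placeholder, not a real file from the site):

```python
# Plain-Python equivalent of the Data Format settings above, to sanity-check
# the delimiter/quote/escape choices. The zip name below is hypothetical.
import csv
import io
import zipfile

ARCHIVE = "sirene_sample.zip"  # hypothetical stand-in for one of the site's zips

with zipfile.ZipFile(ARCHIVE) as zf:
    # File Name Pattern within Compressed Directory : *.csv
    csv_names = [n for n in zf.namelist() if n.endswith(".csv")]
    with zf.open(csv_names[0]) as raw:
        text = io.TextIOWrapper(raw, encoding="utf-8")  # Charset : UTF-8
        # Delimiter ';', quote '"', escape '\', matching the stage config.
        reader = csv.DictReader(text, delimiter=";", quotechar='"', escapechar="\\")
        for row in reader:  # Root Field Type : List-Map -> one dict per line
            print(row)
            break
```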
Hadoop FS Destination :
General
Name : Hadoop FS 1
Description : Writing into HDFS
Stage Library : CDH 5.7.6
Produce Events
Required Fields
Preconditions
On Record Error : Send to Error
Output Files
File Type : Whole File
Files Prefix
Directory in Header
Directory Template : /user/pap/StreamSets/sirene/
Data Time Zone : UTC (UTC)
Time Basis : ${time:now()}
Use Roll Attribute
Validate HDFS Permissions : ON
Skip File Recovery : ON
Late Records
Late Record Time Limit (secs) : ${1 * HOURS}
Late Record Handling : Send to error
Data Format
Data Format : Whole File
File Name Expression : ${record:value('/fileInfo/filename')}
Permissions Expression : 777
File Exists : Overwrite
Include Checksum in Events
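As I understand it, the whole-file output path should be assembled from the Directory Template plus the File Name Expression, i.e. the equivalent of this (the file name is a hypothetical value for record:value('/fileInfo/filename')):

```python
# How I understand the output path gets assembled from the settings above:
# Directory Template + File Name Expression.
import posixpath

directory_template = "/user/pap/StreamSets/sirene/"
file_name = "sirene_sample.zip"  # hypothetical value of /fileInfo/filename

target = posixpath.join(directory_template, file_name)
print(target)  # /user/pap/StreamSets/sirene/sirene_sample.zip
```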
... so what am I doing wrong? :(