Questions tagged [sequencefile]

A SequenceFile is a Hadoop binary file containing key/value pairs.

A SequenceFile is a file format used by Hadoop for the efficient storage and retrieval of key/value pairs. It is also possible to use compression techniques for more efficient storage.

For more information, view the API documentation or the wiki page.
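The description above can be made concrete with a minimal sketch of writing and reading a block-compressed SequenceFile through the Hadoop API. The path and key/value types are illustrative, and hadoop-common must be on the classpath:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("demo.seq"); // hypothetical local path

        // Write key/value pairs, block-compressed for more efficient storage.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class),
                SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK))) {
            writer.append(new IntWritable(1), new Text("first record"));
            writer.append(new IntWritable(2), new Text("second record"));
        }

        // Read the pairs back in order.
        try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(path))) {
            IntWritable key = new IntWritable();
            Text value = new Text();
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value);
            }
        }
    }
}
```

The compression type can also be `RECORD` (compress values only) or `NONE`; block compression generally yields the best ratio for many small records.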

157 questions
2
votes
1 answer

Migrating a huge Bigtable database in GCP from one account to another using DataFlow

I have a huge database stored in Bigtable in GCP. I am migrating the Bigtable data from one GCP account to another using Dataflow. But when I created a job to create sequence files from the Bigtable, it created 3000 sequence files on…
2
votes
1 answer

Java MapReduce: use SequenceFile as reducer output

I have a working Java MapReduce program with 2 jobs. The output of the first reducer is written to a file and read by the second mapper. I would like to change the first reducer's output to be a SequenceFile. How can I do this? This is the main of my…
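A hedged sketch of the usual approach to the question above: set `SequenceFileOutputFormat` on the first job and `SequenceFileInputFormat` on the second, so the intermediate pairs stay typed between jobs. Job names, the intermediate path, and the key/value classes are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class ChainedJobs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path intermediate = new Path("/tmp/intermediate"); // hypothetical path

        // Job 1: the reducer output becomes a SequenceFile of (Text, IntWritable).
        Job first = Job.getInstance(conf, "first-job");
        first.setOutputKeyClass(Text.class);
        first.setOutputValueClass(IntWritable.class);
        first.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setOutputPath(first, intermediate);
        // ... set mapper/reducer classes and the input path here ...
        first.waitForCompletion(true);

        // Job 2: the mapper receives the same (Text, IntWritable) pairs directly,
        // with no text parsing step.
        Job second = Job.getInstance(conf, "second-job");
        second.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(second, intermediate);
        // ... set mapper/reducer classes and the final output path here ...
        second.waitForCompletion(true);
    }
}
```

Note that the second job's mapper must declare input key/value types matching the first job's output key/value types.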
2
votes
0 answers

How can we read multiple sequence files in parallel in Apache Flink as a batch job

I have a use case of reading sequence files as a batch job in a Flink DataSet. The files are stored in an S3 bucket which I have to consume in a Flink DataSet. I am not able to read the files by providing comma-separated file paths to read in the…
2
votes
0 answers

SequenceFile.Writer leads to NullPointerException

Hi, I am trying to create a simple SequenceFile with Mahout libraries using the below code. While running the code I am getting a NullPointerException after it creates an empty file. public class SequenceFileWriter { public static void main(String[]…
2
votes
0 answers

One field in Protocol Buffers is always missing when reading from SequenceFile

Something mysterious is happening to me. What I wanted to do: 1. Save a Protocol Buffers object in SequenceFile format. 2. Read this SequenceFile back and extract the field that I need. The mystery part is: one field that I wanted to retrieve is…
2
votes
1 answer

Convert data from gzip to SequenceFile format using Hive on Spark

I'm trying to read a large gzip file into Hive through the Spark runtime to convert it into SequenceFile format, and I want to do this efficiently. As far as I know, Spark supports only one mapper per gzip file, same as it does for text files. Is there…
Marcel Mars
2
votes
1 answer

Sequence file reading issue using Spark Java

I am trying to read the sequence file generated by Hive using Spark. When I try to access the file, I am facing org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: I have…
vishal raj
2
votes
1 answer

How to create splits from a sequence file in Hadoop?

In Hadoop, I have a sequence file of 3GB size. I want to process it in parallel. Therefore, I am going to create 8 map tasks and hence 8 FileSplits. The FileSplit class has constructors that require: the path of the file, the start position, and the length. For…
Mosab Shaheen
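One way to build such splits by byte range can be sketched as below; the class and method names are mine, not from the question. Because SequenceFile record readers skip forward to the next sync marker, the split boundaries need not be record-aligned:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SequenceFileSplits {
    // Carve one file into n byte-range splits. SequenceFile readers resync at
    // the next sync marker past each start offset, so arbitrary byte
    // boundaries are safe.
    static List<FileSplit> makeSplits(Path file, int n, Configuration conf)
            throws Exception {
        long len = file.getFileSystem(conf).getFileStatus(file).getLen();
        long chunk = (len + n - 1) / n; // ceiling division
        List<FileSplit> splits = new ArrayList<>();
        for (long start = 0; start < len; start += chunk) {
            long length = Math.min(chunk, len - start);
            splits.add(new FileSplit(file, start, length, new String[0]));
        }
        return splits;
    }
}
```

In practice, letting `SequenceFileInputFormat` compute the splits (by tuning the maximum split size) achieves the same effect without hand-built FileSplits.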
2
votes
2 answers

Reading Sequence File in PySpark 2.0

I have a sequence file whose values look like (string_value, json_value). I don't care about the string value. In Scala I can read the file with val reader = sc.sequenceFile[String, String]("/path...") val data = reader.map{case (x, y) =>…
Max
2
votes
1 answer

Can I create a sequence file using Spark DataFrames?

I have a requirement in which I need to create a sequence file. Right now we have written a custom API on top of the Hadoop API, but since we are moving to Spark we have to achieve the same using Spark. Can this be achieved using Spark DataFrames?
mahan07
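DataFrames have no direct SequenceFile writer, so the common route is converting to a pair RDD of Writables first. A hedged sketch under that assumption; the column indices, input source, and paths are placeholders:

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

public class DfToSequenceFile {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("df-to-seq").getOrCreate();
        Dataset<Row> df = spark.read().parquet("/data/input"); // hypothetical source

        // Map each Row to a (Text, Text) pair; here columns 0 and 1 are assumed
        // to be strings.
        JavaPairRDD<Text, Text> pairs = df.javaRDD()
            .mapToPair(row -> new Tuple2<>(new Text(row.getString(0)),
                                           new Text(row.getString(1))));

        // Write the pairs as a SequenceFile via the Hadoop output format.
        pairs.saveAsHadoopFile("/data/output", Text.class, Text.class,
                               SequenceFileOutputFormat.class);
        spark.stop();
    }
}
```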
2
votes
1 answer

Get HDFS file path in PySpark for files in sequence file format

My data on HDFS is in SequenceFile format. I am using PySpark (Spark 1.6) and trying to achieve 2 things: the data path contains a timestamp in yyyy/mm/dd/hh format that I would like to bring into the data itself. I tried SparkContext.wholeTextFiles…
Arnkrishn
2
votes
1 answer

Hadoop SequenceFile vs splittable LZO

We're choosing the file format to store our raw logs, major requirements are compressed and splittable. Block-compressed (whichever codec) SequenceFiles and Hadoop-LZO look the most suitable so far. Which one would be more efficient to be processed…
k0_
2
votes
0 answers

Flume - how to create a custom key for a HDFS SequenceFile?

I'm using Flume's HDFS SequenceFile sink for writing data to HDFS. I'm looking for a possibility to create "custom keys". By default, Flume uses the timestamp as the key within a SequenceFile. However, in my use case I would like to use a customized…
Thomas Beer
2
votes
1 answer

Spark: how to read CompactBuffer from an objectFile?

I am reading the following structure from an object file: (String, CompactBuffer(person1, person2, person3 ...) ) If I try to read it like this: val input = sc.objectFile[(String, ListBuffer[Person])]("inputFile.txt") val myData = input.map { t => …
Edamame
2
votes
1 answer

hsync() not working for SequenceFile Writer

I have a small program that writes 10 records to a block-compressed SequenceFile on HDFS every second, and then runs sync() every 5 minutes to ensure that everything older than 5 minutes is available for processing. As my code is quite a few lines,…
agnsaft
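A likely explanation for the behavior described above, offered as an assumption: with block compression, appended records sit in the writer's in-memory block until it fills, so hsync() alone may persist nothing new. Forcing the block out with sync() before hsync() is a common workaround, at some cost to compression ratio. A minimal sketch, with hypothetical helper and variable names:

```java
import java.io.IOException;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DurableWrites {
    // Append a record and force it to be durable, assuming `writer` is an
    // open, block-compressed SequenceFile.Writer on HDFS.
    static void appendDurably(SequenceFile.Writer writer, Text key, Text value)
            throws IOException {
        writer.append(key, value);
        writer.sync();   // flush the current compressed block, write a sync marker
        writer.hsync();  // then ask HDFS to persist the bytes on the datanodes
    }
}
```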