Questions tagged [bigdata]

Big data is a concept that deals with data sets of extreme volumes. Questions typically concern infrastructure, algorithms, statistics, and data structures.

Big data is not only data of huge volume; it is also characterized by velocity, veracity, and variety.

Several characteristics set big data apart as a distinct concept:

Data

  • Data is so large it cannot be processed on a single computer.
  • Relationships between data elements are extremely complex.

Algorithms

  • Single-machine algorithms with worse than O(N) running time would take impractically long, possibly years, to finish.
  • Fast distributed algorithms are used instead.

Storage

  • The underlying storage must be fault-tolerant, keeping data in a consistent state regardless of device failures.
  • No single storage device can hold the entire data set.

Ecosystem

  • "Big data" also refers to the set of tools used to process huge amounts of data, known as the big data ecosystem. Popular tools include HDFS, Spark, and MapReduce; a minimal Spark sketch follows.
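
To make the ecosystem point concrete, here is a minimal sketch of the classic distributed word count in PySpark. It assumes a local Spark installation; "input.txt" is a placeholder path that in practice would usually point at HDFS.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
    sc = spark.sparkContext

    counts = (
        sc.textFile("input.txt")                # distributed read, one partition per block
          .flatMap(lambda line: line.split())   # runs in parallel on each partition
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b)      # shuffled and combined across the cluster
    )
    print(counts.take(10))
    spark.stop()

Each transformation is applied per partition in parallel; only the reduceByKey step moves data between machines, which is what makes this pattern scale past a single node.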
7919 questions
21
votes
2 answers

Strategies for reading in CSV files in pieces?

I have a moderate-sized file (a 4 GB CSV) on a computer that doesn't have sufficient RAM to read it in (8 GB on 64-bit Windows). In the past I would just have loaded it up on a cluster node and read it in, but my new cluster seems to arbitrarily limit…
Ari B. Friedman
  • 71,271
  • 35
  • 175
  • 235
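
The question is about R, but the chunked-reading strategy it asks for looks like this in Python/pandas as a rough sketch; "big.csv" and the column name "value" are placeholders.

    import pandas as pd

    # Read and aggregate the file in bounded-size chunks so peak memory
    # stays far below the 4 GB file size.
    total = 0
    for chunk in pd.read_csv("big.csv", chunksize=100_000):
        total += chunk["value"].sum()   # hypothetical per-chunk aggregation
    print(total)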
21
votes
10 answers

Python Shared Memory Dictionary for Mapping Big Data

I've been having a hard time using a large dictionary (~86 GB, 1.75 billion keys) to process a big dataset (2 TB) using multiprocessing in Python. Context: a dictionary mapping strings to strings is loaded from pickled files into memory. Once loaded,…
Jon Deaton
  • 3,943
  • 6
  • 28
  • 41
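
One common approach for this situation (a sketch, not necessarily the accepted answer): on Linux, workers forked after the dictionary is built inherit it copy-on-write, so it is never re-pickled per process. Caveat: CPython reference counting gradually dirties pages, so memory use can still creep upward over time.

    import multiprocessing as mp

    big_map = {}  # stand-in for the ~86 GB mapping loaded from pickled files

    def lookup(key):
        # Forked workers see big_map without copying or re-pickling it up front.
        return big_map.get(key)

    if __name__ == "__main__":
        big_map.update({"a": "1", "b": "2"})          # placeholder load
        with mp.get_context("fork").Pool(4) as pool:  # fork = copy-on-write (Linux)
            print(pool.map(lookup, ["a", "b", "c"]))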
21
votes
2 answers

Spark RDDs - how do they work?

I have a small Scala program that runs fine on a single node. However, I am scaling it out so it runs on multiple nodes. This is my first such attempt. I am just trying to understand how RDDs work in Spark, so this question is based around theory…
monster
  • 1,762
  • 3
  • 20
  • 38
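
A small sketch of the core RDD mechanics the question asks about: data is split into partitions, transformations are lazy and run per partition, and an action triggers the distributed computation.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-sketch").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize(range(10), numSlices=4)  # explicitly request 4 partitions
    print(rdd.getNumPartitions())                 # 4
    print(rdd.glom().collect())                   # elements grouped by partition
    print(rdd.map(lambda x: x * x)                # lazy transformation
             .reduce(lambda a, b: a + b))         # action: executes across partitions
    spark.stop()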
21
votes
3 answers

Reading big data with fixed width

How can I read big data formatted with fixed width? I read this question and tried some tips, but all the answers are for delimited data (such as .csv), which is not my case. The data is 558 MB, and I don't know how many lines it has. I'm using: dados <-…
Rcoster
  • 3,170
  • 2
  • 16
  • 35
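
This question is about R, but as an illustration of the same idea, pandas can stream fixed-width data in chunks. The column specs, names, and filter below are hypothetical.

    import pandas as pd

    colspecs = [(0, 10), (10, 18), (18, 30)]   # placeholder (start, end) offsets
    names = ["id", "date", "amount"]

    parts = []
    for chunk in pd.read_fwf("fixed_width.dat", colspecs=colspecs,
                             names=names, chunksize=50_000):
        parts.append(chunk[chunk["amount"] > 0])    # hypothetical per-chunk filter
    result = pd.concat(parts, ignore_index=True)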
20
votes
6 answers

PySpark DataFrames - way to enumerate without converting to Pandas?

I have a very big pyspark.sql.dataframe.DataFrame named df. I need some way of enumerating records, so that I can access a record by index (or select a group of records from an index range). In pandas, I could make just…
Maria Koroliuk
  • 297
  • 1
  • 2
  • 8
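
Two common patterns for this, as a sketch with the trade-offs noted in comments: zipWithIndex on the underlying RDD for dense 0-based indexes, or row_number for SQL-style numbering.

    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.appName("index-sketch").getOrCreate()
    df = spark.createDataFrame([("a",), ("b",), ("c",)], ["col"])

    # Option 1: dense 0-based indexes without collecting to the driver.
    indexed = (df.rdd.zipWithIndex()
                 .map(lambda t: t[0] + (t[1],))
                 .toDF(df.columns + ["idx"]))

    # Option 2: row_number needs an ordering; a single global window like this
    # pulls everything through one partition, so use it with care.
    indexed2 = df.withColumn("idx", F.row_number().over(Window.orderBy("col")) - 1)

    indexed.filter("idx BETWEEN 1 AND 2").show()   # select a range of records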
20
votes
3 answers

Fastest way to compare row and previous row in pandas dataframe with millions of rows

I'm looking for solutions to speed up a function I have written to loop through a pandas dataframe and compare column values between the current row and the previous row. As an example, this is a simplified version of my problem: User Time …
AdO
  • 445
  • 1
  • 6
  • 17
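
The usual vectorized answer to this kind of problem is shift()/diff(), which compares each row with its predecessor without a Python-level loop; the frame below is a toy stand-in for the real data.

    import pandas as pd

    df = pd.DataFrame({"User": ["u1", "u1", "u2", "u2"],
                       "Time": [1, 5, 2, 9]})

    # shift(1) aligns each row with the previous one, so the comparison is
    # a single vectorized operation even over millions of rows.
    df["time_diff"] = df["Time"] - df["Time"].shift(1)

    # Per-user variant: diff() within each group avoids comparing across users.
    df["time_diff_user"] = df.groupby("User")["Time"].diff()
    print(df)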
20
votes
3 answers

How do I determine the size of my HBase tables? Is there any command to do so?

I have multiple tables in the HBase shell that I would like to copy onto my file system. Some tables exceed 100 GB. However, I only have 55 GB of free space left on my local file system. Therefore, I would like to know the size of my HBase tables so that…
gautham
  • 313
  • 1
  • 2
  • 6
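
One way to answer this, since HBase stores each table's files on HDFS, is to ask HDFS for the directory size. The sketch below shells out to the hdfs CLI; the path layout (/hbase/data/<namespace>/<table>) matches recent HBase releases, and "my_table" is a placeholder name.

    import subprocess

    # -du -s -h: summarized, human-readable size of the table directory.
    result = subprocess.run(
        ["hdfs", "dfs", "-du", "-s", "-h", "/hbase/data/default/my_table"],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout)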
20
votes
1 answer

Hive padding leading zeroes

I need the output of a string column in my table to be 13 characters long; irrespective of its original length, I need to pad the remaining characters with 0... I tried to use the following code in my Hive query, but failed to get the desired…
Muthu Palaniappan
  • 221
  • 1
  • 2
  • 5
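
Hive ships an lpad(str, len, pad) built-in for exactly this. Here is a sketch of the equivalent call through PySpark (to keep the example in Python) on toy data.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("lpad-sketch").getOrCreate()
    df = spark.createDataFrame([("42",), ("9876543",)], ["code"])

    # Left-pad with zeroes up to 13 characters; note that lpad truncates
    # values already longer than the target length.
    df.withColumn("code13", F.lpad("code", 13, "0")).show(truncate=False)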
20
votes
2 answers

How to quickly export data from R to SQL Server

The standard RODBC package's sqlSave function, even as a single INSERT statement (parameter fast = TRUE), is terribly slow for large amounts of data due to non-minimal logging. How would I write data to my SQL Server with minimal logging so it writes…
jpd527
  • 1,543
  • 1
  • 14
  • 30
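
The minimally logged route into SQL Server is a bulk load (bcp or BULK INSERT) rather than row-by-row INSERTs. The question is about R, but as a point of comparison, this is roughly what the batched approach looks like from Python with pyodbc; the DSN, table, and data are placeholders.

    import pyodbc

    conn = pyodbc.connect("DSN=mssql;UID=user;PWD=secret")  # placeholder DSN
    cur = conn.cursor()
    cur.fast_executemany = True   # send parameter arrays instead of row-by-row

    rows = [(i, f"name{i}") for i in range(100_000)]
    cur.executemany("INSERT INTO dbo.big_table (id, name) VALUES (?, ?)", rows)
    conn.commit()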
19
votes
2 answers

BigQuery replaced most of my Spark jobs - am I missing something?

I've been developing Spark jobs for some years using on-premise clusters and our team recently moved to the Google Cloud Platform allowing us to leverage the power of BigQuery and such. The thing is, I now often find myself writing processing steps…
19
votes
3 answers

How to efficiently save a Pandas Dataframe into one/more TFRecord file?

First I want to quickly give some background. What I want to achieve eventually is to train a fully connected neural network for a multi-class classification problem under the TensorFlow framework. The challenge of the problem is that the size of…
Ling Gu
  • 249
  • 2
  • 6
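
A minimal sketch of the standard recipe: serialize each row as a tf.train.Example and stream it through tf.io.TFRecordWriter. The column names and dtypes below are toy stand-ins for the real frame.

    import pandas as pd
    import tensorflow as tf

    df = pd.DataFrame({"feature": [0.1, 0.2], "label": [0, 1]})  # toy data

    def to_example(feature, label):
        # One Example per row; FloatList/Int64List must match the column dtypes.
        return tf.train.Example(features=tf.train.Features(feature={
            "feature": tf.train.Feature(float_list=tf.train.FloatList(value=[feature])),
            "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
        }))

    with tf.io.TFRecordWriter("data.tfrecord") as writer:
        for row in df.itertuples(index=False):
            writer.write(to_example(row.feature, row.label).SerializeToString())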
19
votes
1 answer

Why does Spark's OneHotEncoder drop the last category by default?

I would like to understand the rationale behind Spark's OneHotEncoder dropping the last category by default. For example: >>> fd = spark.createDataFrame( [(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x","c"]) >>> ss =…
Corey
  • 1,845
  • 1
  • 12
  • 23
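
The short answer is the dummy-variable trap: with every category encoded, the one-hot columns always sum to 1, making them linearly dependent with the intercept in linear models; dropping one category removes that redundancy. The behavior is controlled by the dropLast parameter, sketched below with the Spark 3.x estimator API.

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import StringIndexer, OneHotEncoder

    spark = SparkSession.builder.appName("ohe-sketch").getOrCreate()
    df = spark.createDataFrame(
        [(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x", "c"])

    indexed = StringIndexer(inputCol="c", outputCol="c_idx").fit(df).transform(df)

    # dropLast=False keeps every category; the default (True) encodes the last
    # one implicitly as the all-zeros vector.
    enc = OneHotEncoder(inputCol="c_idx", outputCol="c_vec", dropLast=False)
    enc.fit(indexed).transform(indexed).show()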
19
votes
1 answer

Memory limits in data.table: negative length vectors are not allowed

I have a data table with several social media users and their followers. The original data table has the following format: X.USERID FOLLOWERS 1081 4053807021,2476584389,4713715543, ... So each row contains a user together with their ID and…
19
votes
2 answers

What is the basic difference between JobConf and Job?

Hi, I wanted to know the basic difference between JobConf and Job objects. Currently I am submitting my job like this: JobClient.runJob(jobconf); I saw another way of submitting jobs like this: Configuration conf = getConf(); Job job = new Job(conf,…
user1585111
  • 1,019
  • 6
  • 19
  • 35
19
votes
2 answers

Haskell: Can I perform several folds over the same lazy list without keeping list in memory?

My context is bioinformatics, next-generation sequencing in particular, but the problem is generic; so I will use a log file as an example. The file is very large (Gigabytes large, compressed, so it will not fit in memory), but is easy to parse…
luispedro
  • 6,934
  • 4
  • 35
  • 45
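
In Haskell the usual trick is to combine the folds into a single pass (e.g. with the foldl package's Applicative interface) so the list is consumed once and never retained. The same one-pass idea, written as a plain Python sketch with a hypothetical record format:

    import gzip

    count, total, maximum = 0, 0.0, float("-inf")

    # Stream the compressed log once, updating all accumulators per line,
    # so nothing beyond the current line is held in memory.
    with gzip.open("big.log.gz", "rt") as stream:     # placeholder file name
        for line in stream:
            value = float(line.split()[-1])           # hypothetical record format
            count += 1
            total += value
            maximum = max(maximum, value)

    print(count, total / max(count, 1), maximum)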