Data partitioning is the division of a collection of data into smaller collections for the purpose of faster processing, easier statistics gathering, and a smaller memory or persistence footprint.
Questions tagged [data-partitioning]
337 questions
0 votes
0 answers
What causes the tasks to not evenly partition in Spark?
In my experience, sometimes when I apply transformation() to large data, it seems that the tasks are not evenly partitioned and are skewed to one side so that only a few tasks are working. As a result, it was confirmed that the efficiency of the…

S.Kang
- 581
- 2
- 10
- 28
0 votes
1 answer
Moving historical data from google cloud storage to date-partitioned bigquery table using python
I need to organize a large amount of historical data into date-partitions in google bigquery. It will partition the load date for you (current date only) but that doesn't really help with historical data. The only solutions I have seen so far are…

Ralph
- 65
- 1
- 6
0 votes
2 answers
Why Google BigQuery doesn't use partition date correctly when using views
I have a date-partitioned table (call it sample_table) with 2 columns, one to save dateTime in UTC and the other to save the timezone offset. I have a view on top of this table (call it sample_view). The view takes _partitiontime from the table and exposes…

opensourcegeek
- 5,552
- 7
- 43
- 64
0 votes
1 answer
Ignored duplicate property derby.module.dataDictionary in Hive
I have an EMPLOYEES table which is partitioned on the basis of COUNTRY and STATE. Below are the partitions.
hive (human_resources)> show partitions employees ;
OK
country=IN/state=PU
country=US/state=CA
country=US/state=IL
Time taken: 0.119 seconds,…

Ruhani Chawlia
- 1
- 1
0 votes
1 answer
Vertical Partitioning in Scalding
I have a TypedPipe[(String, String, Long)] where the first String can assume only a limited (~10) number of values. I'd like to partition my output so that a folder is created for each type (i.e. 10 folders with the name of the first String). This…

Marsellus Wallace
- 17,991
- 25
- 90
- 154
0 votes
1 answer
Partition Pruning using DATE and RANGE COLUMNS
I'm trying to partition a database using a DATE column to take advantage of partition pruning in MySQL 5.7. For internal reasons, I need to partition by RANGE COLUMNS because it is easy and fast to add/drop partitions.
While the MySQL website…

csn
- 39
- 4
0 votes
1 answer
Can a database table partition name be used as a part of WHERE clause for IBM DB2 9.7 SELECT statement?
I am trying to select all data out of the same specific table partition for 100+ tables using the DB2 EXPORT utility. The partition name is constant across all of my partitioned tables, which makes this method more advantageous than using some…

J. Williams
- 3
- 5
0 votes
2 answers
Case Statement in Where Clause in SQL Server
Good day!
I have a SQL query that gives a result set of sales per tenant. Now, I want to get a final result set that shows the top 5 and bottom 5 in terms of sales (the number may be flexible; 5 is just an example).
I used rank function to get the…

rickyProgrammer
- 1,177
- 4
- 27
- 63
0 votes
0 answers
awk Splitting huge file creates error "too many open files"
I have a bash script for the purpose of splitting up a huge input file -- at the moment it's 400MB, later the script should split a 4GB file.
The core of this processing is the following awk script:
INPUTFILE="FA.txt"
awk -F $'\t' 'BEGIN{
count…

Friedrich
- 29
- 2
0 votes
2 answers
Partition on Existing Table with Millions of Records
I need your suggestions on creating a partition on a table that has millions of records.
table definitions
CompanyId
Type_Of_Data
Emp_id
Destination
Destination_id
Now here, for a single company, the type of data and emp_id can be different
COMPANY_ID …

Stay Curious
- 101
- 10
0 votes
3 answers
How to algorithmically partition a keyspace?
This is related to consistent hashing and while I conceptually understand what I need to do, I'm having a hard time translating this into code.
I'm trying to divide a given keyspace (say, 128 bits) into equal sized partitions. I want the upper bound…

Patrick Hogan
- 2,098
- 4
- 20
- 28
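One way to approach the keyspace question above is plain integer arithmetic, which Python handles at 128 bits without overflow. The following is a minimal sketch, not the asker's or any library's method: the function names `partition_bounds` and `partition_of` are hypothetical, and it assumes partition boundaries are placed at the ceiling of `i * 2^bits / n` so that the bounds and the lookup agree even when the partition count does not divide the keyspace evenly.

```python
def partition_bounds(bits: int, n: int) -> list[tuple[int, int]]:
    """Return inclusive (lower, upper) bounds for n near-equal partitions
    of the keyspace [0, 2**bits - 1]. Hypothetical helper for illustration."""
    size = 1 << bits
    if n <= 0 or n > size:
        raise ValueError("n must be between 1 and 2**bits")

    def start(i: int) -> int:
        # Ceiling of i * size / n, computed with integers only.
        return (i * size + n - 1) // n

    # Partition i runs from its own start to one below the next start.
    return [(start(i), start(i + 1) - 1) for i in range(n)]


def partition_of(key: int, bits: int, n: int) -> int:
    """Return the index of the partition that contains key."""
    size = 1 << bits
    if not 0 <= key < size:
        raise ValueError("key is outside the keyspace")
    # Consistent with the ceiling-based starts used in partition_bounds.
    return (key * n) // size


if __name__ == "__main__":
    for lower, upper in partition_bounds(128, 4):
        print(f"{lower:032x} .. {upper:032x}")
```

When `n` does not divide `2**bits`, the partition sizes differ by at most one key, and the last upper bound is always `2**bits - 1`.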
0 votes
0 answers
Divide dataset into training and testing dataset
I have two datasets of images, with subjects 1-200 and c (e.g. c=8) images per subject. Now I want to divide these two datasets into training and testing sets for my algorithm. I typically want to do it for the following cases:
CASES…

roni
- 1,443
- 3
- 28
- 49
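The cases in the question above are cut off, so the following is only a sketch of one common setup, assumed here rather than taken from the question: splitting by subject, so that all images of a subject land on the same side. The function name `split_subjects`, the 25% test fraction, and the seed are hypothetical choices.

```python
import random


def split_subjects(subject_ids, test_fraction=0.25, seed=0):
    """Split subject IDs into disjoint train and test sets.

    Hypothetical helper: shuffles with a fixed seed so the split is
    reproducible, then takes the first test_fraction of IDs for testing.
    """
    ids = sorted(subject_ids)
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    return set(ids[n_test:]), set(ids[:n_test])


if __name__ == "__main__":
    train, test = split_subjects(range(1, 201))
    c = 8  # images per subject, as in the question's example
    # Each image is identified by a (subject, image index) pair.
    train_images = [(s, i) for s in sorted(train) for i in range(c)]
    test_images = [(s, i) for s in sorted(test) for i in range(c)]
    print(len(train_images), len(test_images))
```

Splitting on subject IDs rather than on individual images keeps a subject's images from appearing in both sets.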
0 votes
1 answer
Updating Kafka Event Log
I am using Kafka as a pipeline to store analytics data before it gets flushed to S3 and ultimately to Redshift. I am thinking about the best architecture to store data in Kafka, so that it can easily be flushed to a data warehouse.
The issue is…

Scott Switzer
- 1,064
- 1
- 15
- 25
0 votes
1 answer
Finding next first record for UserID using first registered row
I'm getting a bit tied down with this and hope to find a solution. Say I have a data set like this:
PersonID RowID Reg_date Reg_Time Process_first_Date Process_first_time Process_Last_Date Process_Last_time End_date …

Andrew
- 1,728
- 8
- 28
- 39
0 votes
1 answer
Which approach is better for increasing performance of a stored procedure
I have a stored procedure (sp) which has to select data from 8 tables; each select query has a lot of WHERE clauses and each table has thousands of rows of data.
Now, the requirement is to increase the performance of this sp.
The approaches mentioned below are suggested…

Onki
- 1,879
- 6
- 38
- 58