The process of taking raw data and parsing, filtering, extracting, organizing, combining, cleaning, or otherwise converting it into a consistent, usable form for further processing or as input to an algorithm or system.
Questions tagged [data-munging]
236 questions
-1
votes
3 answers
Python pandas how to duplicate certain columns
I have a dataframe row:
key1 key2 key3 val1 val2 val3 .. valn
a b c 1 2 3 14
I want to duplicate the value columns:
key1 key2 key3 val1_0 val2_0 val3_0 .. valn_0 val1_1 val2_1 val3_1 .. valn_1
a b c 1 2 3 …

Cranjis
- 1,590
- 8
- 31
- 64
-1
votes
1 answer
Pandas dataframe how to add column of distance from previous row?
I have a dataframe of locations:
df = X Y
1 1
2 1
2 1
2 2
3 3
5 5
5.5 5.5
I want to add a column with the distance to the previous point:
So it will be:
df = X Y Distance
…
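For readers skimming this excerpt, here is a minimal plain-Python sketch of the computation being asked about. It assumes "distance" means Euclidean distance, which the excerpt does not state, and the function name `distances_to_previous` is made up for illustration. In pandas the same idea is usually built on `DataFrame.diff()`.

```python
import math

# The X, Y points from the question, in row order.
points = [(1, 1), (2, 1), (2, 1), (2, 2), (3, 3), (5, 5), (5.5, 5.5)]

def distances_to_previous(pts):
    """Return, for each point, its distance to the previous point.

    Assumption: Euclidean distance. The first point has no predecessor,
    so its entry is None.
    """
    if not pts:
        return []
    result = [None]
    # Pair every point with the one that follows it.
    for prev, curr in zip(pts, pts[1:]):
        result.append(math.dist(prev, curr))
    return result

for point, dist in zip(points, distances_to_previous(points)):
    print(point, dist)
```

A different metric (for example Manhattan distance) would only change the line that calls `math.dist`.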

Cranjis
- 1,590
- 8
- 31
- 64
-1
votes
4 answers
Complex extract of all row entries based on string pattern using awk, sed or R
I have a 7 column file like this:
ID ANNOTATION OR PVAL VAR_INFO INFO_TAGS_USED_TO_ANNOTATE INFO_TAGS_USED_TO_ANNOTATE
1 ANN1 1.66 0.0028 1:154837796(1.12e-06,0) 1:154834092(1.49e-05,0)|1:154834911(1.2e-05,1)|…

Darren
- 277
- 4
- 17
-1
votes
1 answer
Melting Data by Date Range
I'm running into an RStudio data issue regarding properly melting data. It currently is in the following form:
Campaign, ID, Start Date, End Date, Total Number of Days, Total Spend, Total Impressions, Total Conversions
I would like my data to look…

Marc
- 9
- 1
-1
votes
1 answer
Is there an R function that loops through the rows of a data frame and returns the highest 3 column values for each row
I would like to go through every row of the dataframe and figure out which three column names have the top three maximum values for that row.
I do have code that does it with a for loop, but it is too slow. Does anyone have a faster way to do the…

Kyle Peters
- 41
- 6
-1
votes
1 answer
Best way to net values in successive rows of CSV
Looking for advice on the best way to perform the following operation, preferably in Python, JavaScript, or Excel. Data is in CSV (although I removed the commas below). I'm a noob; I should be able to do it, but I'm thinking there's an elegant…

Matt
- 9
- 1
-2
votes
2 answers
Pandas how to divide columns of rows within groupby based on condition
I have the dataframe
C1 c10 val val_type
1 3 5 target
1 3 8 end
1 3 9 other
2 8 1 end
2 8 2 target
2 8 9 other
The values of C1 and c10 create groups of 3.
Within these groups I want to create…

Cranjis
- 1,590
- 8
- 31
- 64
-2
votes
3 answers
Python dataframe get borders of zeros segments through the column
I have a pandas series:
s = [3,7,8,0,0,0,6,12,0,0,0,0,0,8,5,0,2]
I want to find all the indices in which there is a start or an end of a zeros segment, where the number of zeros is more than 3
so here I want to get:
[8,12]
What is the best way to…
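One possible approach, sketched in plain Python rather than pandas: group consecutive equal-to-zero values into runs and keep the first and last index of each run that is long enough. The function name `zero_segment_borders` and the `min_len` parameter are made up for illustration; `min_len=4` encodes "more than 3 zeros".

```python
from itertools import groupby

def zero_segment_borders(values, min_len=4):
    """Return the start and end indices of every run of zeros
    that contains at least `min_len` entries, as one flat list."""
    borders = []
    index = 0
    # groupby yields consecutive runs of zero / non-zero values.
    for is_zero, run in groupby(values, key=lambda v: v == 0):
        length = sum(1 for _ in run)
        if is_zero and length >= min_len:
            borders.extend([index, index + length - 1])
        index += length
    return borders

s = [3, 7, 8, 0, 0, 0, 6, 12, 0, 0, 0, 0, 0, 8, 5, 0, 2]
print(zero_segment_borders(s))  # [8, 12]
```

The run of three zeros at indices 3 to 5 is skipped because it is not longer than 3, which matches the expected result in the question.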

Cranjis
- 1,590
- 8
- 31
- 64
-2
votes
2 answers
R - data munging and scalable code
Hi,
In the last few days I ran into a small/big problem.
I have a transaction dataset with 1 million rows and two columns (client ID and product ID), and I want to transform it into a binary matrix.
I used reshape and spread function, but in both cases I used…

Kardu
- 865
- 3
- 13
- 24
-4
votes
1 answer
Sorting data in a dataframe in R
After data munging and using spread, I arrived at the following table:
Complaint types and Boroughs
I would like to identify the top 4 issues in each Borough. Sort does not help since there are 4 Boroughs. Any thoughts on how to get this?

Jyo Nookula
- 111
- 1
- 1
- 6
-5
votes
1 answer
How do I split or create a new column for a list of data in a dataframe?
Please have a look at the preview of the data in the image. I would like to create 3 new columns, i.e. Start, End, and Density, and create a new row for each record in these 3 columns.

Srikar Sud
- 6
- 2