Questions tagged [data-manipulation]

Data manipulation is the process of altering data from a less useful state to a more useful state.

Data manipulation is the process of taking data from a source or format that is hard to read or search and converting it into a format or data store that can be read and searched quickly. For example, a log's output could be split into rows of a database so that only the entries relevant to a given situation can be pulled out, or simply reordered so that entries are easier to locate by the sorted field. Data manipulation can also make data mining easier.

Data manipulation also covers taking raw data and parsing, filtering, extracting, organizing, combining, cleaning, or otherwise converting it into a consistent, usable form for further processing or as input to an algorithm or system.
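The log example in the description can be sketched as follows. This is only an illustration under stated assumptions: the log lines, their "timestamp level message" layout, and the helper names `load_log` and `entries_for` are all hypothetical, and an in-memory SQLite database stands in for whatever data store would actually be used.

```python
import sqlite3

# Hypothetical log lines in an assumed "timestamp level message" layout.
RAW_LOG = """\
2024-01-01T10:00:02 ERROR disk full
2024-01-01T10:00:00 INFO service started
2024-01-01T10:00:01 WARN cache miss rate high
2024-01-01T10:00:03 ERROR write failed
"""


def load_log(conn, text):
    """Split each log line into three fields and store one row per line."""
    conn.execute("CREATE TABLE log (ts TEXT, level TEXT, message TEXT)")
    rows = [line.split(" ", 2) for line in text.splitlines() if line.strip()]
    conn.executemany("INSERT INTO log VALUES (?, ?, ?)", rows)


def entries_for(conn, level):
    """Return (ts, message) pairs for one level, ordered by timestamp."""
    cur = conn.execute(
        "SELECT ts, message FROM log WHERE level = ? ORDER BY ts", (level,)
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
load_log(conn, RAW_LOG)
errors = entries_for(conn, "ERROR")
print(errors)
```

Once the lines are rows, pulling out only the relevant entries is a `WHERE` clause, and locating entries by a field is an `ORDER BY`, which is the kind of gain the description refers to.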

3845 questions
8 votes, 1 answer

Pivot categorical values into boolean columns SQL

I'm looking to 'flatten' my dataset in order to facilitate data mining. Each categorical column should be changed to multiple Boolean columns. I have a column with categorical values, e.g.: ID col1 1 A 2 B 3 A I'm looking for…
Omri374
7 votes, 8 answers

awk command: if a line doesn't start with a given character, remove the newline at the end of the previous line

I'm trying to use awk to implement this rule: if a line doesn't start with "O|", "A|", or "S|", I want to remove the newline at the end of the previous line. I have this file as input…
Luca L
7 votes, 5 answers

Conditionally Concatenating Strings in R

I have this dataset in R: id = 1:5 col1 = c("12 ABC", "123", "AB", "123344567", "1345677.") col2 = c("gggw", "12", "567", "abc 123", "p") col3 = c("abw", "abi", "klo", "poy", "17df") col4 = c("13 AB", "344", "Huh8", "98", "b") my_data =…
stats_noob
7 votes, 6 answers

Making Combinations of Items

Suppose I have the following lists of factors: factor_1 = c("A1", "A2", "A3") factor_2 = c("B1", "B2") factor_3 = c("C1", "C2", "C3", "C4") factor_4 = c("D1", "D2", "D3") I made the following data frame that contains all (3 * 2 * 4 * 3 = ) 72…
stats_noob
7 votes, 3 answers

Solving Logic Puzzles Using R

I came across the following logic problem: In this problem, you are required to match the real names of basketball players to their nicknames, and sort the basketball players by their heights. Normally, this problem would require you to manually…
stats_noob
7 votes, 3 answers

Create a variable capturing the most frequent occurrence by group

Define: df1 <-data.frame( id=c(rep(1,3),rep(2,3)), v1=as.character(c("a","b","b",rep("c",3))) ) s.t. > df1 id v1 1 1 a 2 1 b 3 1 b 4 2 c 5 2 c 6 2 c I want to create a third variable freq that contains the most frequent observation…
Fred
7 votes, 5 answers

Windows command for cutting columns from a text

The following content is stored in a file: chrome.exe 512 Console 0 73,780 K chrome.exe 800 Console 0 11,052 K chrome.exe 1488 Console 0 …
Vineel Kumar Reddy
7 votes, 1 answer

pandas merge on date column issue

I am trying to merge two dataframes on a date column (tried both as type object and as datetime.date), but it fails to give the desired merge output: import pandas as pd df1 = pd.DataFrame({'amt': {0: 1549367.9496070854, 1: 2175801.78219801, 2:…
muon
7 votes, 3 answers

Cumulative Sum of a division with varying denominators R

Ok, here is the problem that I would love to solve using an efficient, elegant solution such as data.table or dplyr. Define: DT = data.table(group=c(rep("A",3),rep("B",5)),value=c(2,9,2,3,4,1,0,3)) time group value 1: 1 A 2 2: …
EdM
7 votes, 2 answers

dplyr's filter function: how to return every value (or «cancel» the effect of filter)?

This may seem like a weird question, but is there a way to pass a value to filter() that basically does nothing? data(cars) library(dplyr) cars %>% filter(speed==`magic_value_that_returns_cars?`) And you'd get the whole data frame cars back. I'm…
brodrigues
7 votes, 3 answers

Clean an R data frame so that no value in a column is more than twice the next row's value

I have a data frame exemplified by the following dist <- c(1.1,1.0,10.0,5.0,2.1,12.2,3.3,3.4) id <- rep("A",length(dist)) df<-cbind.data.frame(id,dist) df id dist 1 A 1.1 2 A 1.0 3 A 10.0 4 A 5.0 5 A 2.1 6 A 12.2 7 A 3.3 8 A 3.4 I…
Kristian
7 votes, 1 answer

Fast way to split string and convert to long format in data.table

I do the following library(data.table) library(stringr) dt <- data.table(string_column = paste(sample(c(letters, " "), 500000, replace = TRUE) , sample(c(letters, " "), 500000, replace = TRUE) …
RInatM
7 votes, 3 answers

Generating a moving sum variable in R

I suspect this is a somewhat simple question with multiple solutions, but I'm still a bit of a novice in R and an exhaustive search didn't yield answers that spoke well to what I want to do. I'm trying to create, for lack of a better term,…
steve
7 votes, 3 answers

Can I access an object in C++ other than using an expression?

According to C++03 3.10/1 every expression is either an lvalue or an rvalue. When I use = to assign a new value to a variable the variable name on the left of the assignment is an lvalue expression. And it looks like whatever I try to do with a…
sharptooth
7 votes, 5 answers

Efficiently center a large matrix in R

I have a large matrix that I would like to center: X <- matrix(sample(1:10, 5e+08, replace=TRUE), ncol=10000) Finding the means is quick and efficient with colMeans: means <- colMeans(X) But what's a good (fast and memory efficient) way to…
Zach