Questions tagged [data-manipulation]

Data manipulation is the process of altering data from a less useful state to a more useful state.

Data manipulation takes data from a source or format that is hard to read or search and converts it into a format or storage solution that can be read and searched quickly. For example, a log's output could be split into rows of a database so that only the entries relevant to a given situation are pulled out, or simply reordered so that entries are easier to locate by the ordered field. Data manipulation can also make data mining easier.
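The log example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the log lines, the "timestamp level message" layout, and the helper names `load_log` and `entries_for` are all hypothetical and do not come from any particular question.

```python
import sqlite3

# Hypothetical log lines; the "timestamp level message" layout is an assumption.
LOG_LINES = [
    "2018-11-02T14:01:15 ERROR disk full",
    "2018-11-02T13:00:00 INFO job started",
    "2018-11-02T14:00:00 ERROR timeout",
]


def load_log(lines):
    """Split each line into (timestamp, level, message) and store the rows in SQLite."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE log (ts TEXT, level TEXT, message TEXT)")
    # Split on the first two spaces only, so the message keeps its own spaces.
    rows = [tuple(line.split(" ", 2)) for line in lines]
    conn.executemany("INSERT INTO log VALUES (?, ?, ?)", rows)
    return conn


def entries_for(conn, level):
    """Return (timestamp, message) pairs for one level, ordered by timestamp."""
    cur = conn.execute(
        "SELECT ts, message FROM log WHERE level = ? ORDER BY ts", (level,)
    )
    return cur.fetchall()


conn = load_log(LOG_LINES)
errors = entries_for(conn, "ERROR")
print(errors)
```

Once the lines are rows, pulling out just the relevant entries is a `WHERE` clause and reordering is an `ORDER BY`, rather than a scan over raw text.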

It covers taking raw data and parsing, filtering, extracting, organizing, combining, cleaning, or otherwise converting it into a consistent, usable form for further processing or as input to an algorithm or system.

3845 questions
1 vote, 1 answer

How can I add a column with mutate() to each of the multiple data sets I read?

I am a beginner in R and am currently learning how to do data wrangling on multiple data sets. Right now I read 55 CSV data sets with 300 rows using the following code: Rawdata <- list.files(pattern = "*.csv") for(i in 1:length(Rawdata)){ …
1 vote, 2 answers

Create column with dplyr based on value and also frequency of another column, in R

I will edit the post name shortly as I think up a better title, but for the time being, a short example below highlights what I am struggling with: dput(mydf) structure(list(gameID = c("34", "34", "34", "34", "34", "25", "25", "25")), class =…
Canovice
1 vote, 1 answer

Extract part of URL in R

dput(mydf) structure(list(urls = c("/players/a/abdulma02.html", "/players/a/abdulta01.html", "/players/a/abdursh01.html", "/players/a/alexaco01.html", "/players/a/alexaco02.html" ), names = c("Mahmoud Abdul-Rauf", "Tariq Abdul-Wahad", "Shareef…
Canovice
1 vote, 2 answers

Group by two columns and get sum?

x1 = [{'id1': 'Africa', 'id2': 'Europe', 'v': 1}, {'id1': 'Europe', 'id2': 'North America', 'v': 5}, {'id1': 'North America', 'id2': 'Asia', 'v': 2,}, {'id1': 'North America', 'id2': 'Asia', 'v': 3}] df = pd.DataFrame(x1) How…
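The excerpt is cut off before the actual question, so the following is only a sketch of what the title suggests, taken literally: group the excerpt's `df` by `id1` and `id2` and sum `v`. The variable name `totals` is an assumption, and the full question may ask for something different.

```python
import pandas as pd

# Data copied from the excerpt above.
x1 = [
    {'id1': 'Africa', 'id2': 'Europe', 'v': 1},
    {'id1': 'Europe', 'id2': 'North America', 'v': 5},
    {'id1': 'North America', 'id2': 'Asia', 'v': 2},
    {'id1': 'North America', 'id2': 'Asia', 'v': 3},
]
df = pd.DataFrame(x1)

# Group on both key columns and sum "v"; as_index=False keeps id1 and id2
# as ordinary columns instead of turning them into a MultiIndex.
totals = df.groupby(['id1', 'id2'], as_index=False)['v'].sum()
print(totals)
```

The two rows with `North America` and `Asia` collapse into one row whose `v` is their sum.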
1 vote, 2 answers

SAS set value less than mean to missing

Let's say I have data that look like this: DATA temp; INPUT id a1 b2 d1 f8; DATALINES; 1 2.3 2.1 4.2 1.2 2 5.3 2.3 1.5 3.2 3 1.2 5.4 6.6 6.6 ; run; What I want to do is use the data and set statements to say that if the values in a1 and f8 are…
1 vote, 1 answer

Applying a function only works for one column instead of multiple?

x = [{'list1':'[1,6]', 'list2':'[1,1]'}, {'list1':'[1,7]', 'list2':'[1,2]'}] df = pd.DataFrame(x) Now I'm going to transform it from string to list type: df[['list1','list2']].apply(lambda x: ast.literal_eval(x.strip())) >> ("'Series' object…
Chipmunkafy
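For the question above about applying a function to several columns, one possible approach is sketched here. It rests on the fact that `DataFrame.apply` passes a whole column (a `Series`) to the function, while `Series.apply` passes one element at a time. The `cols` list and the loop are illustrative assumptions, not taken from the question's answer.

```python
import ast

import pandas as pd

# Data copied from the excerpt above: each cell holds a list written as a string.
x = [{'list1': '[1,6]', 'list2': '[1,1]'}, {'list1': '[1,7]', 'list2': '[1,2]'}]
df = pd.DataFrame(x)

cols = ['list1', 'list2']
for col in cols:
    # Series.apply calls the lambda once per string, so .strip() and
    # ast.literal_eval each receive a str rather than a Series.
    df[col] = df[col].apply(lambda s: ast.literal_eval(s.strip()))

print(df)
```

After the loop, every cell in both columns holds a real Python list instead of its string form.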
1 vote, 1 answer

Is there an efficient way to display time chart in dc.js with date range data?

I am trying to create a timechart to show the number of rooms occupied in different scenarios using dc.js. To reduce data transmission, my room occupancy data is represented by discrete start and end times. [{"room": "1", "start":"10/13/2018…
Jernigan
1 vote, 2 answers

Assign date to all lines below until the next date

df index col1 ------------------------ 0 2017-01-01 1 a 2 b 3 c 4 2017-01-02 5 d 6 e 7 f 8 2017-01-03 9 g 10 h 11 i expected df index …
Chipmunkafy
1 vote, 1 answer

Calculate how many reports are running at a certain time

I am trying to calculate how many reports are running at a certain time. The data is like: ReportID StartTime Duration 1 2018-11-02 13:00:00 240 seconds 2 2018-11-02 14:00:00 300 seconds 3 2018-11-02 14:01:15 300 seconds …
1 vote, 2 answers

Get sum of values from last nth row by group id

I just want to know how to get the sum of the last 5 values by id for every row. df: id values ----------------- a 5 a 10 a 10 b 2 c 2 d 2 a 5 a 10 a 20 a 10 a …
Mike
1 vote, 2 answers

R: Counting occurrences in each column and replacing that column's value with the count (SQL?)

Here is an example of the original data: ID Test1 Test2 Test3 Test4 1 0 0 NA 1.2 1 0 NA NA 3.0 1 NA NA NA 0 2 0 …
aspratle
1 vote, 3 answers

How to group this dataframe in python?

I have this problem: import pandas as pd stripline = "----------------------------" rawData = { 'order number': ['11xa', '11xa', '11xa', '21xb', '31xc'], 'working area': ['LLA', 'LLE', 'LLS', 'MLA', 'MLE'], 'time': [1, 6, 13, 35,…
1 vote, 1 answer

Create columns based on bins

I have this data: # dt Column1 1 2 3 4 5 6 7 8 9 I want to create a new column from the average of each bin's min and max. # dt Column1 Column2 1 2 2 2 3 …
Peter Chen
1 vote, 1 answer

Replace values to row above based on condition

I want to move values into the row above based on a condition, as follows: if pc_no = DELL, assign the values of pc_no and cust_id to event_rep and loc_id in the row above. After that, I want to delete the row that has "DELL". id pc_no cust_id event_id…
kimi
1 vote, 2 answers

proportion data frame for each factor level based on another column

I would like to summarize a data frame by month where each column is the proportion of each factor level based on the Records column in the data frame below. I have been attempting to use dplyr but haven't quite figured it…
alleyway