Questions tagged [data-science]

Implementation questions about data science. Data science concerns extracting knowledge or insights from data, in whatever shape or form. It can include predictive analytics and usually involves a lot of data wrangling. General questions about data science should be posted to their respective communities.

Data science is an interdisciplinary field that uses scientific methods, processes, and systems to extract knowledge and insights from data in various forms, both structured and unstructured.

Wikipedia

NOTE: If you want to use this tag for a question not directly concerning implementation, then consider posting on Cross Validated, Data Science, or Artificial Intelligence instead. Otherwise you're probably off-topic.

9099 questions
2
votes
2 answers

Adding percent column to data frame

I have a pandas df like the following: User Purchase_Count Location_Count 1 2 3 2 10 5 3 5 1 4 20 4 5 2 3 6 2 3 7…
Hashtag
  • 23
  • 3
2
votes
1 answer

Pandas: Error while searching asterisk in dataframe. Eg: busiest_hosts['host'].str.contains('***.botol.dk')

Below is what my dataframe looks like. As you can see, one of my dataframe columns is URL and the other is timestamp count. When I run this code: busiest_hosts[busiest_hosts['host'].str.contains('***.novo.dk')==True] I get an error: error:…
jubins
  • 317
  • 2
  • 7
  • 18
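
A minimal sketch of one likely cause, assuming the error comes from the pattern being interpreted as a regular expression (the excerpt above is truncated, so this is not confirmed by the question). A leading `*` has nothing to repeat, so the pattern fails to compile; escaping it treats every character literally. The host values other than the one from the question are illustrative.

```python
import re

# Illustrative host values; only '***.novo.dk' comes from the question.
hosts = ["***.novo.dk", "piweba3y.prodigy.com", "***.botol.dk"]
pattern = "***.novo.dk"

# Interpreted as a regular expression, the leading '*' has nothing to repeat.
try:
    re.compile(pattern)
    compiled_ok = True
except re.error:
    compiled_ok = False

# re.escape turns every regex metacharacter into a literal character.
literal = re.compile(re.escape(pattern))
matches = [h for h in hosts if literal.search(h)]
print(compiled_ok, matches)
```

In pandas, the same idea would presumably be `busiest_hosts['host'].str.contains('***.novo.dk', regex=False)`, since `str.contains` treats its pattern as a regular expression by default.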
2
votes
1 answer

Pandas: How can I convert 'timestamp' values in my dataframe column from object/str to timestamp?

My timestamp looks like the values below in my dataframe column, but its dtype is 'object'. I want to convert it to 'timestamp'. How can I convert all such values in my dataframe column to timestamps? 0 01/Jul/1995:00:00:01 1 …
jubins
  • 317
  • 2
  • 7
  • 18
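
A minimal sketch of parsing this timestamp layout, assuming every value follows the `01/Jul/1995:00:00:01` pattern shown in the excerpt and that month abbreviations are in English. The format string is inferred from that single sample, and the second value is illustrative.

```python
from datetime import datetime

# Format inferred from the sample value: day/abbreviated month/year:HH:MM:SS
LOG_FORMAT = "%d/%b/%Y:%H:%M:%S"

# The first value is from the question; the second is illustrative.
raw = ["01/Jul/1995:00:00:01", "01/Jul/1995:00:00:06"]

# Parse each string into a datetime object.
parsed = [datetime.strptime(value, LOG_FORMAT) for value in raw]
print(parsed[0].isoformat())
```

On a pandas column, the same format string can be passed to `pd.to_datetime(df['timestamp'], format=LOG_FORMAT)`, where the column name `timestamp` is an assumption.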
2
votes
2 answers

How to pass variable as a column name pandas

I'm using Python 2.7. I'm trying to create a new column based on a variable from a list: tickers=['BAC','JPM','WFC','C','MS'] returns=pd.DataFrame for tick in tickers: returns[tick]=bank_stocks[tick][]1'Close'].pct_change()** But I get this…
David Lerech
  • 55
  • 2
  • 7
2
votes
3 answers

Interactions between dummy variables in Python

I'm trying to understand how I can address columns after using get_dummies. For example, let's say I have three categorical variables. The first variable has 2 levels. The second variable has 5 levels. The third variable has 2…
Adi Milrad
  • 135
  • 3
  • 9
2
votes
1 answer

What's the best way to subset a spark dataframe (in sparklyr) based on the column data type

I'm converting a bunch of columns into dummy variables. I want to remove the original categorical variable from the dataframe. I'm struggling to figure out how to do it in sparklyr. It's straightforward in dplyr, but the dplyr functionality isn't…
schristel
  • 245
  • 1
  • 13
2
votes
1 answer

stars() function in R

I am struggling a bit trying to do star plots in R. I am currently generating the following star plot from this data frame: COMPACTNESS ELONGATEDNESS RADIUS_RATIO SCALED_VARIANCE 16 96 201 32 227 20 …
Pythoner
  • 69
  • 4
2
votes
2 answers

Getting started with data visualization. What is a good 'hello world' type of project?

I have been gaining interest in data visualization lately. I especially enjoy articles with narrative driven data-viz like the ones in http://polygraph.cool/ for example. What would be a great 'hello world' project to learn about conveying…
P-Man
  • 45
  • 3
2
votes
1 answer

Scikit-Learn with Dask-Distributed using nested parallelism?

For example suppose I have the code: vectorizer = CountVectorizer(input=u'filename', decode_error=u'replace') classifier = OneVsRestClassifier(LinearSVC()) pipeline = Pipeline([ ('vect', vectorizer), ('clf', classifier)]) with…
2
votes
2 answers

Combining Columns after Transposing Columns in Pandas DataFrames

Suppose I have a set of data frames df1 is ID C1 0 0 0.000000 1 1 0.538516 2 2 0.509902 3 3 0.648074 4 4 0.141421 df2 is ID C1 0 0 0.538516 1 1 0.000000 2 2 0.300000 3 3 0.331662 4 4 0.608276 and df3 is …
Hormigas
  • 1,429
  • 5
  • 24
  • 45
2
votes
2 answers

How to decide threshold in classification model?

Suppose I build a classification model and then, to improve, let's say, precision, I just increase my threshold probability for the higher class. Does this make sense? I am not changing the model but just changing the threshold probability to get better…
Neo
  • 4,200
  • 5
  • 21
  • 27
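
A minimal sketch of the trade-off the question describes, using made-up scores and labels (not data from the question). Raising the threshold leaves the model unchanged but shifts which predictions count as positive, which here raises precision and lowers recall. The function name `precision_recall` is hypothetical.

```python
def precision_recall(scores, labels, threshold):
    """Return (precision, recall) when scores >= threshold are called positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Illustrative predicted probabilities and true labels.
scores = [0.95, 0.85, 0.70, 0.60, 0.55, 0.40, 0.30, 0.10]
labels = [1, 1, 1, 0, 1, 0, 1, 0]

low = precision_recall(scores, labels, 0.5)
high = precision_recall(scores, labels, 0.65)
print(low, high)
```

With this toy data, the higher threshold removes the one false positive but also drops a true positive, so the choice depends on which error is more costly for the application.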
2
votes
1 answer

Error when trying to parse XML in R

I keep getting an error while trying to parse an XML file in R. Here is what I am trying to do: library(XML) fileUrl <- "http://www.w3schools.com/xml/simple.xml" doc <- xmlTreeParse(fileUrl, useInternal=TRUE) I get the errors below: " Opening and…
addicted
  • 2,901
  • 3
  • 28
  • 49
2
votes
1 answer

Interpreting robots.txt vs. terms of use

I'm interested in scraping craigslist, solely for the purpose of data analysis for a blog post (i.e., no commercial or financial gain, no posting/emailing, no personal data collection, no sharing of data scraped). Their robots.txt file is the…
Dodgie
  • 643
  • 1
  • 10
  • 17
2
votes
0 answers

Input/output error while copying from hadoop file system to local

hadoop fs -copyToLocal /paulp /abcd (I want to copy the folder paulp in the Hadoop file system to the abcd folder on the local machine.) But the output of that command looks like this: (copyToLocal: mkdir `/abcd': Input/output error) I use Ubuntu 14.04 and Hadoop…
2
votes
1 answer

R boruta package - (list) object cannot be coerced to type 'double'

I am trying to run a boruta feature selection on my data set. The code is below: df<-read.csv('F:/DataAnalyticsClub/DACaseComp/DatasetDist/Datasets/BestFile.csv',stringsAsFactors=FALSE ) install.packages("Boruta") library(Boruta) df[is.na(df)] <-…
Maksim Khaitovich
  • 4,742
  • 7
  • 39
  • 70