Questions tagged [data-science]

For implementation questions about data science. Data science concerns extracting knowledge or insights from data in any shape or form. It can involve predictive analytics and usually requires a lot of data wrangling. General questions about data science should be posted to their respective communities.

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms, both structured and unstructured, similar to data mining.

Wikipedia

NOTE: If you want to use this tag for a question not directly concerning implementation, then consider posting on Cross Validated, Data Science, or Artificial Intelligence instead. Otherwise you're probably off-topic.

9099 questions

2 answers

Adding percent column to data frame

I have a pandas df like the following:

User  Purchase_Count  Location_Count
1     2               3
2     10              5
3     5               1
4     20              4
5     2               3
6     2               3
7…

python pandas data-science

asked Apr 06 '17 at 21:50

Hashtag
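
The excerpt above is cut off before it says which percentage is wanted, so the following is only a sketch under one assumption: each user's share of the total Purchase_Count. The frame holds the six complete rows visible in the excerpt, and the column name Purchase_Pct is made up for illustration.

```python
import pandas as pd

# The six complete rows visible in the excerpt above.
df = pd.DataFrame({
    "User": [1, 2, 3, 4, 5, 6],
    "Purchase_Count": [2, 10, 5, 20, 2, 2],
    "Location_Count": [3, 5, 1, 4, 3, 3],
})

# Assumption: "percent" means each row's share of the column total.
# Purchase_Pct is a hypothetical column name.
df["Purchase_Pct"] = df["Purchase_Count"] / df["Purchase_Count"].sum() * 100
```

If the intended percentage is something else (for example purchases per location), only the right-hand side of the assignment would change.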

1 answer

Pandas: Error while searching asterisk in dataframe. Eg: busiest_hosts['host'].str.contains('***.botol.dk')

Below is what my dataframe looks like. As you can see, one of the columns is the URL and the other is the timestamp count. When I run this code: busiest_hosts[busiest_hosts['host'].str.contains('***.novo.dk')==True] I get an error: error:…

python r pandas data-science text-analysis

asked Apr 03 '17 at 01:08

jubins
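
For the asterisk search above: `str.contains` treats its pattern as a regular expression by default, and a pattern that starts with `*` has nothing to repeat, so the regex compiler rejects it. A minimal sketch of two ways around that follows. The data is hypothetical and only stands in for the asker's busiest_hosts frame.

```python
import re

import pandas as pd

# Hypothetical data standing in for the asker's busiest_hosts frame.
busiest_hosts = pd.DataFrame({
    "host": ["***.novo.dk", "piweba3y.prodigy.com", "abc.novo.dk"],
    "count": [16, 12, 3],
})

# Option 1: treat the pattern as a plain substring instead of a regex.
literal_mask = busiest_hosts["host"].str.contains("***.novo.dk", regex=False)

# Option 2: keep regex matching but escape the special characters first.
escaped_mask = busiest_hosts["host"].str.contains(re.escape("***.novo.dk"))

# Boolean masks can index the frame directly; "== True" is not needed.
matches = busiest_hosts[literal_mask]
```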

1 answer

Pandas: How can I convert 'timestamp' values in my dataframe column from object/str to timestamp?

My timestamp column looks like the following, but its dtype is 'object'. I want to convert it to 'timestamp'. How can I convert all such values in my dataframe column to timestamps? 0 01/Jul/1995:00:00:01 1 …

python pandas data-science data-cleaning data-scrubbing

asked Apr 01 '17 at 21:20

jubins
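
For the timestamp question above, one possible approach is `pd.to_datetime` with an explicit format string. The sketch assumes every value follows the pattern of the one visible sample, 01/Jul/1995:00:00:01. The second value and the frame name `logs` are made up for illustration.

```python
import pandas as pd

# First value is the sample from the excerpt; the second is hypothetical.
logs = pd.DataFrame({
    "timestamp": ["01/Jul/1995:00:00:01", "01/Jul/1995:00:00:06"],
})

# day/abbreviated month/year, then hour:minute:second after a colon.
logs["timestamp"] = pd.to_datetime(
    logs["timestamp"], format="%d/%b/%Y:%H:%M:%S"
)
```

Passing the format explicitly avoids relying on format inference, which may not recognize the colon between the date and the time.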

2 answers

How to pass a variable as a column name in pandas

I'm using Python 2.7. I'm trying to create a new column based on a variable from a list: tickers=['BAC','JPM','WFC','C','MS'] returns=pd.DataFrame for tick in tickers: returns[tick]=bank_stocks[tick][]1'Close'].pct_change()** But I get this…

python-2.7 pandas data-science

asked Mar 23 '17 at 08:25

David Lerech
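
The error message in the excerpt above is cut off, so the cause can only be guessed. One possible culprit, if the snippet is as shown, is that `returns=pd.DataFrame` binds the class itself rather than an instance, and a class does not support column assignment. A minimal sketch with the parentheses added follows. The structure of `bank_stocks` here (a dict with one small price frame per ticker) is a hypothetical stand-in, not the asker's data.

```python
import pandas as pd

tickers = ["BAC", "JPM"]

# Hypothetical stand-in for bank_stocks: one small price frame per ticker.
bank_stocks = {
    "BAC": pd.DataFrame({"Close": [10.0, 11.0, 12.1]}),
    "JPM": pd.DataFrame({"Close": [50.0, 55.0, 49.5]}),
}

returns = pd.DataFrame()  # note the parentheses: an instance, not the class
for tick in tickers:
    # The loop variable works directly as the column name.
    returns[tick] = bank_stocks[tick]["Close"].pct_change()
```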

3 answers

Interactions between dummy variables in Python

I'm trying to understand how I can address columns after using get_dummies. For example, let's say I have three categorical variables. The first variable has 2 levels. The second variable has 5 levels. The third variable has 2…

python pandas data-science

asked Mar 23 '17 at 07:19

Adi Milrad
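
As a rough illustration of the question above: `pd.get_dummies` names each new column as the original column name, an underscore, and the level, so the dummy columns can be addressed by those names and multiplied to form an interaction term. The variables `sex` and `region` and their levels are hypothetical, and the sketch uses two variables rather than three to stay short.

```python
import pandas as pd

# Hypothetical categorical data; names and levels are for illustration only.
df = pd.DataFrame({
    "sex": ["m", "f", "f", "m"],
    "region": ["n", "s", "n", "s"],
})

# dtype=int gives 0/1 columns, which makes the product unambiguous.
dummies = pd.get_dummies(df, columns=["sex", "region"], dtype=int)

# Dummy columns are addressed as "<column>_<level>".
dummies["sex_f:region_n"] = dummies["sex_f"] * dummies["region_n"]
```

The interaction column is 1 only for rows where both dummies are 1.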

1 answer

What's the best way to subset a spark dataframe (in sparklyr) based on the column data type

I'm converting a bunch of columns into dummy variables. I want to remove the original categorical variable from the dataframe. I'm struggling to figure out how to do it in sparklyr. It's straightforward in dplyr, but the dplyr functionality isn't…

r apache-spark machine-learning data-science sparklyr

asked Mar 10 '17 at 14:34

schristel

1 answer

stars() function in R

I am struggling a bit trying to do star plots in R. I am currently generating the following star plot from this data frame: COMPACTNESS ELONGATEDNESS RADIUS_RATIO SCALED_VARIANCE 16 96 201 32 227 20 …

r statistics visualization data-science

asked Feb 19 '17 at 03:06

Pythoner

2 answers

Getting started with data visualization. What is a good 'hello world' type of project?

I have been gaining interest in data visualization lately. I especially enjoy articles with narrative-driven data viz, like the ones at http://polygraph.cool/. What would be a great 'hello world' project to learn about conveying…

data-visualization data-science

asked Feb 16 '17 at 16:37

P-Man

1 answer

Scikit-Learn with Dask-Distributed using nested parallelism?

For example suppose I have the code: vectorizer = CountVectorizer(input=u'filename', decode_error=u'replace') classifier = OneVsRestClassifier(LinearSVC()) pipeline = Pipeline([ ('vect', vectorizer), ('clf', classifier)]) with…

parallel-processing scikit-learn data-science dask joblib

asked Feb 13 '17 at 03:00

gman9732

2 answers

Combining Columns after Transposing Columns in Pandas DataFrames

Suppose I have a set of data frames.

df1 is

   ID        C1
0   0  0.000000
1   1  0.538516
2   2  0.509902
3   3  0.648074
4   4  0.141421

df2 is

   ID        C1
0   0  0.538516
1   1  0.000000
2   2  0.300000
3   3  0.331662
4   4  0.608276

and df3 is …

python pandas numpy data-science

asked Jan 29 '17 at 20:51

Hormigas

2 answers

How to decide threshold in classification model?

Suppose I build a classification model and then, to improve, let's say, precision, I just increase the threshold probability of the higher class. Does this make sense? I am not changing the model but just changing the threshold probability to get better…

machine-learning classification data-science threshold

asked Jan 16 '17 at 04:48

Neo
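
To make the threshold question above concrete, here is a small sketch that scores the same fixed predictions at two thresholds. The scores and labels are invented for illustration, and `precision_recall` is a hypothetical helper, not a library function.

```python
def precision_recall(scores, labels, threshold):
    """Return (precision, recall) when scores >= threshold count as positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Invented model outputs: predicted probability and true label per sample.
scores = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25]
labels = [1, 1, 1, 0, 1, 0, 1, 0]

default = precision_recall(scores, labels, 0.5)   # (0.8, 0.8)
stricter = precision_recall(scores, labels, 0.7)  # (1.0, 0.6)
```

On this toy data the stricter threshold raises precision and lowers recall while the model itself stays unchanged, which is the trade-off the question is asking about.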

1 answer

Error when trying to parse XML in R

I keep getting an error while trying to parse an XML file in R. Here is what I am trying to do: library(XML) fileUrl <- "http://www.w3schools.com/xml/simple.xml" doc <- xmlTreeParse(fileUrl, useInternal=TRUE) I get the errors below: " Opening and…

r xml data-science

asked Jan 04 '17 at 09:50

addicted

1 answer

Interpreting robots.txt vs. terms of use

I'm interested in scraping craigslist, solely for the purpose of data analysis for a blog post (i.e., no commercial or financial gain, no posting/emailing, no personal data collection, no sharing of data scraped). Their robots.txt file is the…

web-scraping web-crawler robots.txt data-science craigslist

asked Dec 21 '16 at 19:27

Dodgie
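
For the robots.txt half of the question above, Python's standard library can evaluate rules mechanically. The sketch below parses a few hypothetical rules (not craigslist's actual file, which is elided in the excerpt) and checks two made-up URLs. It only covers robots.txt; it says nothing about what a site's terms of use permit.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; not any real site's robots.txt.
rules = [
    "User-agent: *",
    "Disallow: /reply",
    "Disallow: /fb/",
]

parser = RobotFileParser()
parser.parse(rules)  # parse lines directly instead of fetching a URL

# "my-bot" and the example.org URLs are made up.
search_allowed = parser.can_fetch("my-bot", "https://example.org/search/apa")
reply_allowed = parser.can_fetch("my-bot", "https://example.org/reply/123")
```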

0 answers

Input/output error while copying from hadoop file system to local

hadoop fs -copyToLocal /paulp /abcd (I want to copy the folder paulp in the Hadoop file system to the abcd folder on the local machine.) But the output of that command looks like this: (copyToLocal: mkdir `/abcd': Input/output error) I use Ubuntu 14.04 and hadoop…

linux hadoop data-science bigdata

asked Dec 17 '16 at 10:44

paul vineeth

1 answer

R boruta package - (list) object cannot be coerced to type 'double'

I am trying to run a boruta feature selection on my data set. The code is below: df<-read.csv('F:/DataAnalyticsClub/DACaseComp/DatasetDist/Datasets/BestFile.csv',stringsAsFactors=FALSE ) install.packages("Boruta") library(Boruta) df[is.na(df)] <-…

r machine-learning feature-selection data-science

asked Nov 26 '16 at 22:55

Maksim Khaitovich
