Questions tagged [data-science]

Implementation questions about data science. Data science concerns extracting knowledge or insights from data, in whatever shape or form. It can include predictive analytics and usually involves a lot of data wrangling. General questions about data science should be posted to their respective communities.

Data science is an interdisciplinary field that uses scientific methods, processes, and systems to extract knowledge and insights from data in various forms, both structured and unstructured.

Wikipedia

NOTE: If you want to use this tag for a question not directly concerning implementation, then consider posting on Cross Validated, Data Science, or Artificial Intelligence instead. Otherwise you're probably off-topic.

9099 questions
2 votes, 1 answer

How to create histograms in pandas (Python) using specific rows and columns in a DataFrame

I have the data frame shown in the picture. I want to plot a histogram to show the distribution of all countries in the world for any given year (e.g. 2010). The following table is generated by the following code…
Shan Khan
2 votes, 2 answers

Skewed class and Imbalanced class in machine learning

Is there any difference between a skewed class and an imbalanced class in machine learning, or are they the same thing under different terminology?
Sriharsha
2 votes, 1 answer

Basic questions about linear regression example from NVIDIA DIGITS

I have a lot of values for every day of an entire year. I want to verify whether they show some kind of similarity for each month (that is, verify whether these daily values correspond to the correct month, and/or predict the same months in a future year).…
2 votes, 3 answers

managing data in big data

I am reading the book Big Data For Dummies. Welcome to Big Data For Dummies. Big data is becoming one of the most important technology trends that has the potential for dramatically changing the way organizations use information to enhance the…
venkysmarty
2 votes, 2 answers

How do I choose training data set for job recommendation using linear regression model?

I have two kinds of profiles in the database: one is a candidate profile, the other is a job profile posted by a recruiter. Both profiles have three common fields, say location, skill, and experience. I know the algorithm, but I am having a problem creating…
Anshuman Singh
2 votes, 0 answers

R programming - Split a column based on the column values

I have a dataset whose format is shown below:

a1 a2 a3 | class
0  0  0  | c1
0  0  1  | c2
0  1  1  | c3

I want to split the column 'class' based on the values of the column. I want the output to look like this: a1 …
Arat254
2 votes, 3 answers

Get distinct items from rows of comma separated strings in Spark 2.0

I am using Spark 2.0 to analyze a data set. One column contains string data like this (one value per row): "A,C", "A,B", "A", "B", "B,C". I want to get a JavaRDD with all distinct items that appear in the column, something like this: A, B, C. How can this be done efficiently in…
user622194
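
The question asks about a JavaRDD, but the shape of the operation can be sketched in plain Python: split every row on commas, flatten the pieces, and keep the distinct values. In Spark this would roughly correspond to a `flatMap` followed by `distinct`. The function name `distinct_items` and the sample rows are illustrative assumptions, not code from the question.

```python
def distinct_items(rows):
    """Split each comma-separated row and collect the distinct items."""
    seen = set()
    for row in rows:
        for item in row.split(","):
            item = item.strip()
            if item:  # skip empty pieces such as a trailing comma
                seen.add(item)
    # Sorting is only for a stable, readable result.
    return sorted(seen)


# Sample rows mirroring the column described in the question.
rows = ["A,C", "A,B", "A", "B", "B,C"]
print(distinct_items(rows))
```

A set keeps the work to a single pass over the data; on a cluster the deduplication step is where the shuffle would happen.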
2 votes, 1 answer

How to find outliers in data with discrete variables in R

I'm beginning to learn R and data science in general. I have a data frame and most of my variables and the class I want to predict are discrete. What I need to do is find outliers in this data so I can deal with them by imputation or whatever. Some…
Renato Borges
2 votes, 2 answers

converting an object to float in pandas along with replacing a $ sign

I am fairly new to pandas and I am working on a project where I have a column that looks like the following: AverageTotalPayments: $7064.38, $7455.75, $6921.90, etc. I am trying to get the cost factor out of it where the cost could be…
ravenUSMC
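
A minimal sketch of the cleaning step, shown on a plain Python list rather than a DataFrame: strip the dollar sign (and any thousands separators, which is an assumption about the data) and convert the rest to a float. On a pandas column the same idea is usually written as `Series.str.replace("$", "", regex=False)` followed by `astype(float)`. The helper name `parse_payment` is hypothetical.

```python
def parse_payment(text):
    """Convert a string such as '$7064.38' to a float."""
    # Remove the currency symbol and thousands separators, if present.
    cleaned = text.strip().replace("$", "").replace(",", "")
    return float(cleaned)


# Sample values taken from the column shown in the question.
values = ["$7064.38", "$7455.75", "$6921.90"]
payments = [parse_payment(v) for v in values]
print(payments)
```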
2 votes, 1 answer

How are feature importance and forest structure related in scikit-learn's RandomForestClassifier?

Here is a simple example of my problem, using the Iris dataset. I am puzzled when trying to understand how feature importances are computed and how this is visible when visualizing the forest of estimators using export_graphviz. Here is my…
2 votes, 1 answer

Samples with no label assignment using multilabel random forest in scikit-learn

I am using Scikit-Learn's RandomForestClassifier to predict multiple labels of documents. Each document has 50 features, no document has any missing features, and each document has at least one label associated with it. clf =…
2 votes, 3 answers

Python data error: ValueError: invalid literal for int() with base 10: '42152129.0'

I am working on a simple data science project with Python. However, I am getting an error which is the following: ValueError: could not convert string to float: Here is what my code looks like: import matplotlib.pyplot as plt import csv from…
ravenUSMC
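
The error in the title arises because `int()` does not accept a string containing a decimal point, while `float()` does. A minimal sketch of one way around it, assuming the values are whole numbers that merely carry a trailing `.0`; the helper name `to_int` is hypothetical and not part of the question's code.

```python
def to_int(text):
    """Parse an integer that may be written with a decimal point."""
    try:
        return int(text)
    except ValueError:
        # int() rejects strings like '42152129.0', so go through float first.
        value = float(text)
        if not value.is_integer():
            raise ValueError(f"not a whole number: {text!r}")
        return int(value)


print(to_int("42152129.0"))
```

Going through `float` can lose precision for very large numbers, so this is only reasonable for values well within the range a float represents exactly.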
2 votes, 1 answer

How should zero standard deviation in one of the features be handled in a multivariate Gaussian distribution?

I am using a multivariate Gaussian distribution to analyze abnormality. This is what the training set looks like: 19-04-16 05:30:31 1 0 0 377816 305172 5567044 0 0 0 14 62 75 0 0 100 0 0
2 votes, 2 answers

Merging multiple rows by distinct row value into a single column

Using R, I am trying to merge a raw matrix into a result matrix, grouping rows by the value in the first column. Example, from:

1  2
a1 10
a1 20
a1 40
a2 45
a2 50
a3 40
a4 45
a4 60

to:

10 20 40
45 50
40
45 60
shiva.n404
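
The question is about R, but the grouping logic itself can be sketched in Python: collect the second-column values under each distinct first-column key, preserving the order in which keys first appear. The function name `group_values` and the tuple representation of the matrix are assumptions made for illustration.

```python
from collections import defaultdict


def group_values(pairs):
    """Group the second element of each pair under its first element."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Convert to a plain dict; insertion order of keys is preserved.
    return dict(grouped)


# Sample rows mirroring the matrix in the question.
pairs = [
    ("a1", 10), ("a1", 20), ("a1", 40),
    ("a2", 45), ("a2", 50),
    ("a3", 40),
    ("a4", 45), ("a4", 60),
]
for key, values in group_values(pairs).items():
    print(key, *values)
```

The groups have different lengths, so the result is a ragged structure rather than a rectangular matrix; padding would be needed to force it back into one.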
2 votes, 0 answers

How to implement conditional pattern matching on multiple elements in Scala/Spark?

I'm writing a method that classifies elements in a time series based on when I see a unique label in a range. For example, I have an orange, an apple, and a pear, and I see the oranges, apples, and pears at different time intervals throughout the…
bitData