Questions tagged [data-analysis]

Data Analysis involves extracting meaning and insights from raw data. It involves methods and algorithms that examine, clean, transform and model the data to obtain conclusions.

Typically, data analysis involves a series of steps: measuring some parameters of interest, collecting the data, cleaning it, storing it in meaningful ways, then summarizing and examining it, and testing various hypotheses about the data.
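
As a rough illustration of the steps described above, here is a minimal sketch using only the Python standard library. The measurements, the group names, and the choice of Welch's t statistic for the hypothesis-testing step are all assumptions made for the example, not something the tag description prescribes.

```python
import math
import statistics

# Hypothetical raw measurements for two groups; None marks a missing reading.
raw = {
    "control": [5.1, 4.9, None, 5.3, 5.0, 5.2],
    "treated": [5.8, 6.1, 5.9, None, 6.0, 6.2],
}

# Cleaning: drop the missing readings.
clean = {name: [v for v in values if v is not None] for name, values in raw.items()}

# Summarizing: sample size, mean and sample standard deviation per group.
summary = {
    name: {"n": len(v), "mean": statistics.mean(v), "stdev": statistics.stdev(v)}
    for name, v in clean.items()
}


def welch_t(a, b):
    """Welch's t statistic for the difference in means of two samples."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        var_a / len(a) + var_b / len(b)
    )


# Hypothesis testing: a large |t| suggests the group means differ.
t_stat = welch_t(clean["treated"], clean["control"])

print(summary)
print(f"Welch t statistic: {t_stat:.2f}")
```

A real analysis would also convert the statistic to a p-value and check the test's assumptions; this sketch stops at the statistic itself.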

More information can be found on Wikipedia's Data Analysis page.

4642 questions
18
votes
4 answers

What to do with missing values when plotting with seaborn?

I replaced the missing values with NaN using the following lambda function: data = data.applymap(lambda x: np.nan if isinstance(x, basestring) and x.isspace() else x) where data is the dataframe I am working on. Using seaborn afterwards, I tried to…
datavinci
  • 795
  • 2
  • 7
  • 27
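
The excerpt's snippet relies on `basestring`, which exists only in Python 2. A minimal Python 3 sketch of the same idea, assuming a made-up frame with `label` and `value` columns, might use `DataFrame.replace` with a regular expression and then drop the incomplete rows before plotting. Whether dropping or imputing is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the second entry of "label" is whitespace only.
data = pd.DataFrame({"label": ["a", "   ", "b"], "value": [1.0, 2.0, 3.0]})

# Replace strings made up entirely of whitespace with NaN.
data = data.replace(r"^\s+$", np.nan, regex=True)

# Drop the incomplete rows before handing the frame to a plotting call.
plot_ready = data.dropna()
print(plot_ready)
```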
17
votes
3 answers

Machine learning project: split training/test sets before or after exploratory data analysis?

Is it best to split your data into training and test sets before doing any exploratory data analysis, or do all exploration based solely on training data? I'm working on my first full machine learning project (a recommendation system for a course…
Amy Gill
  • 178
  • 1
  • 8
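
One option the question raises, exploring only the training portion, can be sketched as follows. The `ratings` table and the 80/20 split are assumptions for illustration; this is not presented as the accepted answer.

```python
import pandas as pd

# Hypothetical ratings table standing in for the project's data.
ratings = pd.DataFrame(
    {"user": range(10), "score": [3, 4, 5, 2, 4, 5, 1, 3, 4, 5]}
)

# Hold out 20% of the rows first, with a fixed seed for reproducibility.
test = ratings.sample(frac=0.2, random_state=0)
train = ratings.drop(test.index)

# Exploratory summaries then look at the training rows only.
print(train["score"].describe())
```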
17
votes
3 answers

Speed up Matplotlib?

I've read here that matplotlib is good at handling large data sets. I'm writing a data processing application and have embedded matplotlib plots into wx and have found matplotlib to be TERRIBLE at handling large amounts of data, both in terms of…
David Morton
  • 1,744
  • 2
  • 15
  • 20
17
votes
2 answers

Python : How to use Multinomial Logistic Regression using SKlearn

I have a test dataset and a train dataset as below. I have provided sample data with minimal records, but my data has thousands of records. Here E is my target variable which I need to predict using an algorithm. It has only four categories like…
15
votes
3 answers

Pandas - equivalent of str.contains() in pandas query

Creating a dataframe by subsetting with the conditions below: subset_df = df_eq.loc[(df_eq['place'].str.contains('Chile')) & (df_eq['mag'] > 7.5),['time','latitude','longitude','mag','place']] I want to replicate the above subset using query() in…
raul
  • 631
  • 2
  • 10
  • 23
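
A minimal sketch of one way to express that filter with `query()`, assuming a small made-up `df_eq` with the column names from the excerpt. String accessor methods such as `.str.contains` are evaluated here with `engine="python"`.

```python
import pandas as pd

# Hypothetical earthquake records using the excerpt's column names.
df_eq = pd.DataFrame(
    {
        "time": ["t1", "t2", "t3"],
        "latitude": [-36.1, 38.3, -31.6],
        "longitude": [-72.9, 142.4, -71.7],
        "mag": [8.8, 9.1, 7.0],
        "place": [
            "offshore Bio-Bio, Chile",
            "near the east coast of Honshu, Japan",
            "Coquimbo, Chile",
        ],
    }
)

cols = ["time", "latitude", "longitude", "mag", "place"]

# Same two conditions as the .loc version, written as a query string.
subset_df = df_eq.query(
    "place.str.contains('Chile') and mag > 7.5", engine="python"
)[cols]
print(subset_df)
```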
14
votes
5 answers

General techniques to work with huge amounts of data on a non-super computer

I'm taking some AI classes and have learned about some basic algorithms that I want to experiment with. I have gotten access to several data sets containing lots of great real-world data through Kaggle, which hosts data analysis competitions. I have…
Rishi
  • 3,538
  • 5
  • 29
  • 40
14
votes
1 answer

Data analysis with JavaScript?

Today my data analysis routine would be something like the following: do the heavy work with either R, Julia or Python and then display it in the web with JavaScript (for example, using D3.js). My initial focus with JS was mainly data…
Carlos Cinelli
  • 11,354
  • 9
  • 43
  • 66
14
votes
1 answer

Matplotlib: Formatting dates on the x-axis in a 3D Bar graph

Given this 3D bar graph sample code, how would you convert the numerical data in the x-axis to formatted date/time strings? I've attempted using the ax.xaxis_date() function without success. I also tried using plot_date(), which doesn't appear to…
pokstad
  • 3,411
  • 3
  • 30
  • 39
13
votes
4 answers

Getting error when adding a new row to my existing dataframe in pandas

I have the below data frame. df3=pd.DataFrame(columns=["Devices","months"]) I am getting row value from a loop row, print(data) Devices months 1 Powerbank Feb month When I am adding this data row to my df3 I am getting an error. …
pyco
  • 191
  • 1
  • 2
  • 10
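
The excerpt cuts off before the error message, so the cause is unknown. As a general sketch, one way to add a row to an empty frame with those two columns is label-based assignment with `.loc` (note that `DataFrame.append` was removed in pandas 2.0). The row values are taken from the printed output in the excerpt.

```python
import pandas as pd

df3 = pd.DataFrame(columns=["Devices", "months"])

# Assign the new row at the next integer label; values follow column order.
df3.loc[len(df3)] = ["Powerbank", "Feb month"]
print(df3)
```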
13
votes
3 answers

groupby multiple values, and plotting results

I'm using some data on fungicide usage which has the Year, Fungicide, Amount used, along with some irrelevant columns in a pandas DataFrame. It looks somewhat like: Year, State, Fungicide, Value 2011, California, A, 12879 2011,…
A. Chatfield
  • 145
  • 1
  • 1
  • 7
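
A minimal sketch of grouping on two columns and reshaping the result for plotting. Only the first row comes from the excerpt; the other rows are made-up values, and summing `Value` is an assumption about what the asker wants.

```python
import pandas as pd

# First row is from the excerpt; the rest are invented for illustration.
df = pd.DataFrame(
    {
        "Year": [2011, 2011, 2011, 2012, 2012],
        "State": ["California", "Oregon", "California", "California", "Oregon"],
        "Fungicide": ["A", "A", "B", "A", "B"],
        "Value": [12879, 1000, 500, 2000, 300],
    }
)

# Total amount per (Year, Fungicide), then one column per fungicide.
totals = df.groupby(["Year", "Fungicide"])["Value"].sum().unstack(fill_value=0)
print(totals)
```

With matplotlib installed, `totals.plot()` would then draw one line per fungicide against the year.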
12
votes
5 answers

Data analysis using R/python and SSDs

Does anyone have any experience using R/Python with data stored on solid-state drives? If you are doing mostly reads, in theory this should significantly improve the load times of large datasets. I want to find out if this is true and if it is worth…
signalseeker
  • 4,100
  • 7
  • 30
  • 36
12
votes
1 answer

What does selecting the largest eigenvalues and eigenvectors in the covariance matrix mean in data analysis?

Suppose there is a matrix B of size 500*1000 double (here, 500 represents the number of observations and 1000 represents the number of features). sigma is the covariance matrix of B, and D is a diagonal matrix whose diagonal elements are…
Shawn
  • 333
  • 1
  • 6
  • 15
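
The setup in the question matches principal component analysis. A minimal NumPy sketch, assuming random data and a smaller matrix than the 500*1000 one in the excerpt so it runs quickly: the eigenvectors of the covariance matrix with the largest eigenvalues are the directions along which the data varies most, and each eigenvalue is the variance along its direction. The names `k`, `W`, `projected` and `explained` are chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(50, 20))  # rows are observations, columns are features

# Covariance matrix of the features (20 x 20).
sigma = np.cov(B, rowvar=False)

# eigh suits symmetric matrices and returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(sigma)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep the k directions with the largest eigenvalues.
k = 3
W = eigvecs[:, :k]

# Project the centered data onto those directions.
projected = (B - B.mean(axis=0)) @ W

# Share of total variance retained by the k components.
explained = eigvals[:k].sum() / eigvals.sum()
print(projected.shape, round(float(explained), 3))
```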
12
votes
2 answers

Google Analytics trackevent in single-page web app

What is the best (most practical) way to use Google Analytics trackevent for tracking "pageviews" in a single-page web app? trackevent takes four arguments: CATEGORY, ACTION, LABEL, VALUE. The last two are optional. Which field should I use for the…
11
votes
2 answers

Add a circle to ggmap

Let's assume I generate a map of London using ggmap package: library(ggmap) library(mapproj) map <- get_map(location = "London", zoom = 11, maptype = "satellite") p <- ggmap(map)+ theme(legend.position = "none") print(p) Now I would like…
Michał
  • 273
  • 1
  • 3
  • 13
11
votes
3 answers

R: Cross validation on a dataset with factors

Often, I want to run a cross validation on a dataset which contains some factor variables and after running for a while, the cross validation routine fails with the error: factor x has new levels Y. For example, using package boot: library(boot) d…
musically_ut
  • 34,028
  • 8
  • 94
  • 106