Questions tagged [data-science]

Implementation questions about data science. Data science concerns extracting knowledge or insights from data, in whatever shape or form. It can involve predictive analytics and usually requires a lot of data wrangling. General questions about data science should be posted to their respective communities.

Data science is an interdisciplinary field that uses scientific methods, processes, and systems to extract knowledge and insights from data in various forms, both structured and unstructured, similar to data mining.

Wikipedia

NOTE: If you want to use this tag for a question not directly concerning implementation, then consider posting on Cross Validated, Data Science, or Artificial Intelligence instead. Otherwise you're probably off-topic.

9099 questions
37
votes
3 answers

GridSearchCV - XGBoost - Early Stopping

I am trying to do a hyperparameter search using scikit-learn's GridSearchCV on XGBoost. During the grid search I'd like it to stop early, since that reduces search time drastically and (I expect) gives better results on my prediction/regression task.…
ayyayyekokojambo
  • 1,165
  • 3
  • 13
  • 33
37
votes
2 answers

pandas reset_index after groupby.value_counts()

I am trying to groupby a column and compute value counts on another column. import pandas as pd dftest = pd.DataFrame({'A':[1,1,1,1,1,1,1,1,1,2,2,2,2,2], 'Amt':[20,20,20,30,30,30,30,40, 40,10, 10, 40,40,40]}) dftest looks like A…
muon
  • 12,821
  • 11
  • 69
  • 88
34
votes
4 answers

Difference between Standard scaler and MinMaxScaler

What is the difference between MinMaxScaler() and StandardScaler()? mms = MinMaxScaler(feature_range = (0, 1)) (Used in a machine learning model) sc = StandardScaler() (In another machine learning model they used standard-scaler and not…
Chakra
  • 647
  • 1
  • 8
  • 16
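
For readers comparing the two scalers named in this question, the following is a minimal plain-Python sketch of the arithmetic each one is documented to perform on a single feature column. The helper names `min_max_scale` and `standard_scale` and the sample data are illustrative assumptions, not part of scikit-learn or of the original question. Min-max scaling maps values linearly into a target range, while standardization subtracts the mean and divides by the population standard deviation.

```python
from statistics import fmean, pstdev


def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Linearly rescale values into feature_range (hypothetical helper)."""
    lo, hi = min(values), max(values)
    a, b = feature_range
    if hi == lo:
        # A constant column has no spread; map everything to the lower bound.
        return [a for _ in values]
    return [a + (v - lo) * (b - a) / (hi - lo) for v in values]


def standard_scale(values):
    """Center to mean 0 and scale to unit population std (hypothetical helper)."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation (ddof = 0)
    if std == 0:
        # A constant column is only centered, not divided.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Made-up sample column, for illustration only.
data = [1.0, 2.0, 3.0, 4.0, 5.0]
mm = min_max_scale(data)
st = standard_scale(data)
print(mm)
print(st)
```

The min-max result is bounded by the chosen range, so a single outlier compresses all other values; the standardized result is unbounded but has mean 0 and unit variance.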
34
votes
3 answers

How to use advanced activation layers in Keras?

This is my code that works if I use other activation layers like tanh: model = Sequential() act = keras.layers.advanced_activations.PReLU(init='zero', weights=None) model.add(Dense(64, input_dim=14,…
pr338
  • 8,730
  • 19
  • 52
  • 71
28
votes
3 answers

How to plot multiple pandas columns

I have a dataframe total_year, which contains three columns (year, action, comedy). How can I plot two columns (action and comedy) on the y-axis? My code plots only one: total_year[-15:].plot(x='year', y='action', figsize=(10,5), grid=True)
Bilal Butt
  • 1,202
  • 3
  • 12
  • 15
27
votes
2 answers

Adjust size of ConfusionMatrixDisplay (ScikitLearn)

How to set the size of the figure plotted by ScikitLearn's Confusion Matrix? import numpy as np from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix cm = confusion_matrix(np.arange(25), np.arange(25)) cmp = ConfusionMatrixDisplay(cm,…
Raphael
  • 1,518
  • 2
  • 14
  • 27
27
votes
3 answers

Removing non-English words from text using Python

I am doing a data cleaning exercise in Python, and the text that I am cleaning contains Italian words which I would like to remove. I have been searching online to see whether I can do this in Python using a toolkit like nltk. For example…
Andre Croucher
  • 395
  • 1
  • 3
  • 9
25
votes
2 answers

ValueError: continuous format is not supported

I have written a simple function where I am using the average_precision_score from scikit-learn to compute average precision. My Code: def compute_average_precision(predictions, gold): gold_predictions = np.zeros(predictions.size, dtype=np.int) …
Wasi Ahmad
  • 35,739
  • 32
  • 114
  • 161
25
votes
6 answers

What is the difference between Big Data and Data Mining?

As Wikipedia states: "The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use." How is this related to Big Data? Is it correct if I say that Hadoop…
DesirePRG
  • 6,122
  • 15
  • 69
  • 114
24
votes
6 answers

import langchain => Error : TypeError: issubclass() arg 1 must be a class

I want to use langchain for my project, so I installed it using the following command: pip install langchain. But while importing "langchain" I am facing the following error: File /usr/lib/python3.8/typing.py:774, in _GenericAlias.__subclasscheck__(self,…
M. D. P
  • 604
  • 2
  • 6
  • 18
23
votes
1 answer

Macbook m1 and python libraries

Is the new MacBook M1 suitable for data science? Do data science Python libraries such as pandas, numpy, sklearn, etc. work on the MacBook M1 (Apple Silicon) chip, and how fast are they compared to the previous-generation Intel-based MacBooks?
wizarpy_vm
  • 396
  • 1
  • 2
  • 10
23
votes
1 answer

How to structure Machine Learning projects using Object Oriented programming in Python?

I have observed that statisticians and machine learning scientists generally don't follow OOP for ML/data science projects when using Python (or other languages). This is probably mostly due to a lack of understanding of best software engineering practices…
GeorgeOfTheRF
  • 8,244
  • 23
  • 57
  • 80
23
votes
2 answers

pandas.to_numeric - find out which string it was unable to parse

Applying pandas.to_numeric to a dataframe column which contains strings that represent numbers (and possibly other unparsable strings) results in an error message like…
clstaudt
  • 21,436
  • 45
  • 156
  • 239
23
votes
1 answer

Plotting decision boundary for High Dimension Data

I am building a model for a binary classification problem where each of my data points has 300 dimensions (I am using 300 features). I am using a PassiveAggressiveClassifier from sklearn. The model is performing really well. I wish to plot the…
Anuj Gupta
  • 6,328
  • 7
  • 36
  • 55
22
votes
2 answers

find the "elbow point" on an optimization curve with Python

I have a list of points which are the inertia values of a k-means algorithm. To determine the optimum number of clusters I need to find the point where this curve starts to flatten. Data example: here is how my list of values is created and…
ItFreak
  • 2,299
  • 5
  • 21
  • 45
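
One common heuristic for the "elbow" described in this question is to pick the point on the curve that lies farthest from the straight line joining its first and last points. The sketch below illustrates that heuristic in plain Python; it is not necessarily the method used in the question's answers. The function name `elbow_index` and the inertia values are made up for illustration, and the points are assumed to be evenly spaced in k.

```python
import math


def elbow_index(values):
    """Return the index of the point farthest from the first-to-last chord.

    Hypothetical helper: treats values[i] as the point (i, values[i]).
    """
    n = len(values)
    if n < 3:
        raise ValueError("need at least three points to locate an elbow")
    x1, y1 = 0, values[0]
    x2, y2 = n - 1, values[-1]
    norm = math.hypot(x2 - x1, y2 - y1)
    best_i, best_d = 0, -1.0
    for i, y in enumerate(values):
        # Perpendicular distance from (i, y) to the line through the endpoints.
        d = abs((y2 - y1) * i - (x2 - x1) * y + x2 * y1 - y2 * x1) / norm
        if d > best_d:
            best_i, best_d = i, d
    return best_i


# Made-up inertia values for k = 1..7, for illustration only.
inertias = [1000.0, 500.0, 250.0, 200.0, 180.0, 170.0, 165.0]
k_values = list(range(1, len(inertias) + 1))
best_k = k_values[elbow_index(inertias)]
print(best_k)
```

The heuristic assumes a monotone, roughly convex curve; on noisy or nearly linear inertia curves the selected point may not be meaningful and is worth checking against a plot.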