Questions tagged [data-science]

Implementation questions about data science. Data science concerns extracting knowledge or insights from data, in whatever shape or form. It can involve predictive analytics and usually requires a lot of data wrangling. General questions about data science should be posted to their respective communities.

Data science is an interdisciplinary field that uses scientific methods, processes, and systems to extract knowledge and insights from data in various forms, both structured and unstructured, similar to data mining.

Wikipedia

NOTE: If you want to use this tag for a question not directly concerning implementation, then consider posting on Cross Validated, Data Science, or Artificial Intelligence instead. Otherwise you're probably off-topic.

9099 questions
1
vote
3 answers

How to pass only necessary features to pipeline after SelectKBest

I have a regular tabular dataset with 100 features added from the database. I want to push it into a regular sklearn.pipeline in which there will be preprocessing, encoding, some custom transformers, etc. Penultimate estimator would be…
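One possible reading of this question is that, after fitting, only the columns kept by SelectKBest need to be fetched from the database. A minimal sketch under that assumption follows; the synthetic data, the feature names, the step names "select" and "model", and the choice of LogisticRegression are all hypothetical and not taken from the question.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical stand-in for the real table: 20 numeric features.
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=5)),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)

# get_support() returns a boolean mask over the selector's input columns.
mask = pipe.named_steps["select"].get_support()
selected_names = [name for name, keep in zip(feature_names, mask) if keep]
print(selected_names)
```

If preprocessing steps that add or rename columns (such as one-hot encoding) sit before the selector, the mask refers to the transformed columns, so mapping it back to the raw database columns would need extra bookkeeping.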
1
vote
1 answer

How to get average/mean with mapping in pandas dataframe?

I have a dataframe that looks something like this:

Birthyear  Weight
1992       2
1993       2.2
1992       3
1993       2.5
1994       2.4
1993       1.8
1994       2.1

Note: This is an example; I have 100k+ rows and years. I want to get a new DataFrame in which I…
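A minimal sketch of one common approach, assuming the goal is the mean Weight per Birthyear. The rows are the sample rows from the excerpt; the names mean_by_year and MeanWeight are made up for illustration.

```python
import pandas as pd

# Sample rows copied from the question excerpt.
df = pd.DataFrame({
    "Birthyear": [1992, 1993, 1992, 1993, 1994, 1993, 1994],
    "Weight": [2, 2.2, 3, 2.5, 2.4, 1.8, 2.1],
})

# New DataFrame with one row per birth year and the mean weight of that year.
mean_by_year = df.groupby("Birthyear", as_index=False)["Weight"].mean()

# Alternatively, map the per-year mean back onto every original row.
df["MeanWeight"] = df.groupby("Birthyear")["Weight"].transform("mean")

print(mean_by_year)
```

The first form answers "one value per year"; the transform form keeps the original row count, which may be closer to what "with mapping" means in the title.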
1
vote
0 answers

All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough'

I was studying the "In Depth: k-Means Clustering" section of Jake VanderPlas's Python Data Science Handbook and I came across the following code block: from sklearn.datasets import load_digits from sklearn.manifold import TSNE from…
1
vote
0 answers

I'm getting an import error with ydata-profiling-4.4.0: `BaseSettings` has been moved to the `pydantic-settings` package

I know that Pydantic V2 introduced new things which make it incompatible with V1, so I switched from pandas_profiling to ydata_profiling. Because of that, I had to switch versions of the dependencies, but now I'm getting a complex error which makes…
1
vote
0 answers

Optimal threshold from Youden index and ROC curve gives much lower accuracy and F1 score than most other thresholds?

Youden's J statistic:

J = Sensitivity + Specificity - 1
J = Sensitivity + (1 - FalsePositiveRate) - 1
J = TruePositiveRate - FalsePositiveRate

The goal is to get the maximum TPR and the minimum FPR. fpr, tpr, thresholds =…
Sauron
  • 551
  • 2
  • 11
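For the Youden index question above, a small pure-Python sketch may help show why the J-optimal threshold need not maximise accuracy. The toy labels and scores are hypothetical and deliberately imbalanced, and the helper names youden_threshold and accuracy_at are made up for illustration; this is not the asker's code.

```python
def youden_threshold(y_true, scores):
    """Return (threshold, J) maximising J = TPR - FPR over the observed scores."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best_threshold, best_j = None, float("-inf")
    for threshold in sorted(set(scores)):
        # Predict positive when the score is at or above the threshold.
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for p, y in zip(predicted, y_true) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, y_true) if p and y == 0)
        j = tp / positives - fp / negatives
        if j > best_j:
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


def accuracy_at(y_true, scores, threshold):
    """Fraction of samples classified correctly at the given threshold."""
    hits = sum(1 for s, y in zip(scores, y_true) if (s >= threshold) == (y == 1))
    return hits / len(y_true)


# Hypothetical toy data: 2 positives, 8 negatives.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.45, 0.3, 0.2, 0.2, 0.1, 0.1, 0.05]

threshold, j = youden_threshold(y_true, scores)
print(threshold, j)
print(accuracy_at(y_true, scores, threshold), accuracy_at(y_true, scores, 0.9))
```

J weights sensitivity and specificity equally regardless of class balance, while accuracy is dominated by the majority class, so on imbalanced data the two criteria can pick different thresholds.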
1
vote
1 answer

How to download XLSX file from DOI link?

I want to download two files automatically with Python for a reproducible statistical analysis. The links are https://doi.org/10.1371/journal.pone.0282068.s001 and https://doi.org/10.1371/journal.pone.0282068.s002. I tried: import requests url =…
Galen
  • 1,128
  • 1
  • 14
  • 31
1
vote
0 answers

SeisBench can't download dataset because of a firewall

I use an Anaconda Jupyter notebook for some data science work. When I try to download the data using data = sbd.Iquique(), I get this log: 2023-08-01 20:50:49,209 | seisbench | WARNING | Check available storage and memory before downloading and…
Lyfora
  • 11
  • 1
1
vote
1 answer

Folium popup not working when rendering HTML

I want to apply HTML formatting to a folium map popup. When I try to render HTML using the format_popup_content(row) function, the map does not display. How do I format the popup? This is what I have tried so far: def format_popup_content(row): …
Ocean Vue
  • 19
  • 1
1
vote
0 answers

Clustering Algorithms with Periodic Boundary Conditions

I've been working on a project that involves the clustering of data with periodic boundary conditions. So, I am looking for clustering algorithms that can effectively handle datasets where periodicity plays a significant role. My data is 3D and I am…
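One way to approach this, sketched under assumptions: compute pairwise distances with the minimum image convention and feed the resulting matrix to a clustering algorithm that accepts precomputed distances. The box size, the sample points, and the helper name periodic_distance are hypothetical.

```python
import math


def periodic_distance(a, b, box):
    """Euclidean distance between a and b in a periodic box (minimum image convention)."""
    total = 0.0
    for x, y, length in zip(a, b, box):
        # The modulo also handles points not yet wrapped into the box.
        delta = abs(x - y) % length
        # Take the shorter way around the periodic axis.
        delta = min(delta, length - delta)
        total += delta * delta
    return math.sqrt(total)


box = (10.0, 10.0, 10.0)  # hypothetical box lengths along x, y, z
points = [(0.5, 5.0, 5.0), (9.5, 5.0, 5.0), (5.0, 5.0, 5.0)]

# Pairwise distance matrix that respects the periodic boundaries.
distance_matrix = [[periodic_distance(p, q, box) for q in points] for p in points]
print(distance_matrix[0][1])
```

A matrix like this can be passed to density-based or hierarchical methods, for example scikit-learn's DBSCAN with metric="precomputed". Centroid-based methods such as k-means are a poorer fit, because a plain coordinate mean does not wrap around the boundary.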
1
vote
2 answers

How to calculate time differences without a date and only with times?

import pandas as pd stoptimes_df = pd.DataFrame({ 'trip_id': ['1', '1', '1', '2', '2', '2'], 'arrival_time': ["12:10:00", "12:20:00", "12:30:00", "27:32:00", "27:39:00", "27:45:00"], 'departure_time': ["12:10:00", "12:20:00",…
leolumpy
  • 63
  • 6
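For the time-difference question above, one option is to treat the strings as durations since the start of the service day rather than as clock times, which also copes with values past 24:00:00 such as "27:32:00". A minimal standard-library sketch follows; the helper name parse_clock_time is made up, and the data is the arrival_time column from the excerpt.

```python
from datetime import timedelta


def parse_clock_time(value):
    """Parse 'HH:MM:SS' into a timedelta; hours may exceed 23."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


trip_ids = ["1", "1", "1", "2", "2", "2"]
arrival_times = ["12:10:00", "12:20:00", "12:30:00",
                 "27:32:00", "27:39:00", "27:45:00"]

# Difference to the previous stop of the same trip (None for the first stop).
previous = {}
diffs = []
for trip_id, text in zip(trip_ids, arrival_times):
    current = parse_clock_time(text)
    last = previous.get(trip_id)
    diffs.append(None if last is None else current - last)
    previous[trip_id] = current

print(diffs)
```

In pandas, pd.to_timedelta on the column followed by a grouped diff() should give the same result, since timedeltas have no 24-hour limit.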
1
vote
1 answer

Fill NaN values in Polars using a custom-defined function for a specific column

I have this code in pandas: df[col] = ( df[col] .fillna(method="ffill", limit=1) .apply(lambda x: my_function(x)) ) I want to re-write this in Polars. I have tried this: df = df.with_columns( …
Honio
  • 21
  • 6
1
vote
3 answers

Using linear optimisation, how do I minimize the Total Cost in a dataframe

I have a Pandas dataframe with 3 columns (Product, Weight, Total Cost) as follows (expanded to make it clearer): df = { 'Product': ['Product 1', 'Product 2', 'Product 3', 'Product 4', 'Product 1', 'Product 2', 'Product 3',…
t24opb
  • 21
  • 2
1
vote
0 answers

8bit Quantization: Prediction outputs uncorrelated to underlying model

I quantized a basic TFLite regression model to int8 but the prediction output seems to be highly uncorrelated with the actual underlying model prior to quantizing it. All the code and steps taken to train and quantize the model are seen below to…
Bemz
  • 129
  • 1
  • 16
1
vote
1 answer

How to convert Pandas Dataframe to the shape of a correlation matrix

I have a pandas dataframe which looks vaguely like this:

Out[130]:
   xvar          yvar    meanRsquared
0  filled_water  precip  0.119730
1  filled_water  snow    0.113214
2  filled_water  …
yeet_man
  • 48
  • 3
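For the reshaping question above, a minimal sketch using DataFrame.pivot, assuming each (xvar, yvar) pair occurs only once. The first two rows come from the excerpt; the "other_var" rows and their values are made up so that the result has more than one row.

```python
import pandas as pd

df = pd.DataFrame({
    "xvar": ["filled_water", "filled_water", "other_var", "other_var"],
    "yvar": ["precip", "snow", "precip", "snow"],
    # The last two values are hypothetical placeholders.
    "meanRsquared": [0.119730, 0.113214, 0.5, 0.25],
})

# Rows become xvar, columns become yvar, cells hold meanRsquared.
matrix = df.pivot(index="xvar", columns="yvar", values="meanRsquared")
print(matrix)
```

If a pair can appear more than once, pivot raises an error; pivot_table with an aggregation function such as "mean" is the usual alternative.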
1
vote
1 answer

How to avoid NaN values when I use frame['Column'].map(dict)

I have the following dataset, frame1:

Color   Item
Red     Shirt
White   Shoes
Yellow  Shirt
Green   Shoes

I want to set the color of every Shoes item to "Blue", so I use map: x = {"Shoes": "Blue"} fr1["Color"] = fr1["Item"].map(x) I expected…
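Series.map returns NaN for every key missing from the dictionary, which is the likely source of the NaN values here. A minimal sketch of one common fix, assuming the intent is to keep the existing color wherever the dictionary has no entry:

```python
import pandas as pd

# Sample rows copied from the question excerpt.
fr1 = pd.DataFrame({
    "Color": ["Red", "White", "Yellow", "Green"],
    "Item": ["Shirt", "Shoes", "Shirt", "Shoes"],
})

x = {"Shoes": "Blue"}

# Items missing from x map to NaN; fillna falls back to the original color.
fr1["Color"] = fr1["Item"].map(x).fillna(fr1["Color"])
print(fr1)
```

An equivalent for a single item would be a boolean mask, for example fr1.loc[fr1["Item"] == "Shoes", "Color"] = "Blue".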