I have a column in a DataFrame and I am trying to extract the first 8 digits from each string. How can I do it?

Input

Shipment ID
20180504-S-20000
20180514-S-20537
20180514-S-20541
20180514-S-20644
20180514-S-20644
20180516-S-20009
20180516-S-20009
20180516-S-20009
20180516-S-20009

Expected Output

Order_Date
20180504
20180514
20180514
20180514
20180514
20180516
20180516
20180516
20180516

I tried the code below and it didn't work.

data['Order_Date'] = data['Shipment ID'][:8]
qwr
Rahul rajan

4 Answers

You are close. You need to index through the .str accessor, which applies the slice to each value of the Series:

data['Order_Date'] = data['Shipment ID'].str[:8]

For better performance, if there are no NaN values, use a list comprehension:

data['Order_Date'] = [x[:8] for x in data['Shipment ID']]

print (data)
        Shipment ID Order_Date
0  20180504-S-20000   20180504
1  20180514-S-20537   20180514
2  20180514-S-20541   20180514
3  20180514-S-20644   20180514
4  20180514-S-20644   20180514
5  20180516-S-20009   20180516
6  20180516-S-20009   20180516
7  20180516-S-20009   20180516
8  20180516-S-20009   20180516

If you omit .str, the code slices the column by position and returns the first N rows, for example:

print (data['Shipment ID'][:2])
0    20180504-S-20000
1    20180514-S-20537
Name: Shipment ID, dtype: object
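
The difference between the two kinds of slicing, and the NaN caveat for the list comprehension, can be sketched as follows. The three-row frame with a missing value is made up for illustration and is not part of the question's data:

```python
import pandas as pd

# Hypothetical sample: two IDs from the question plus one missing value.
data = pd.DataFrame({"Shipment ID": ["20180504-S-20000", "20180514-S-20537", None]})

# Plain slicing acts on the Series itself: it returns the first 2 rows.
first_rows = data["Shipment ID"][:2]

# .str slicing acts on each string and passes missing values through.
order_date = data["Shipment ID"].str[:8]

# The list comprehension would raise TypeError on the missing row,
# so here it is applied only to the non-missing values.
clean = data["Shipment ID"].dropna()
fast = [x[:8] for x in clean]

print(first_rows)
print(order_date)
print(fast)
```

If the column can contain missing values, .str[:8] is the safer choice; the list comprehension only pays off on clean string data.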
jezrael

You can also use str.extract:

Example:

import pandas as pd

df = pd.DataFrame({'Shipment ID': ['20180504-S-20000', '20180514-S-20537', '20180514-S-20541', '20180514-S-20644', '20180514-S-20644', '20180516-S-20009', '20180516-S-20009', '20180516-S-20009', '20180516-S-20009']})
df["Order_Date"] = df["Shipment ID"].str.extract(r"(\d{8})")
print(df)

Output:

       Shipment ID Order_Date
0  20180504-S-20000   20180504
1  20180514-S-20537   20180514
2  20180514-S-20541   20180514
3  20180514-S-20644   20180514
4  20180514-S-20644   20180514
5  20180516-S-20009   20180516
6  20180516-S-20009   20180516
7  20180516-S-20009   20180516
8  20180516-S-20009   20180516
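
One thing worth knowing is that the unanchored pattern matches the first run of 8 digits anywhere in the string. A small sketch, using made-up IDs that do not appear in the question, shows how anchoring with ^ restricts the match to the start, and how expand=False returns a Series rather than a one-column DataFrame:

```python
import pandas as pd

# Hypothetical IDs: only the first one starts with 8 digits.
ids = pd.Series(["20180504-S-20000", "S-20180514", "2018-S-1"], name="Shipment ID")

# expand=False returns a Series for a single capture group.
unanchored = ids.str.extract(r"(\d{8})", expand=False)

# ^ requires the 8 digits to be at the start of the string.
anchored = ids.str.extract(r"^(\d{8})", expand=False)

print(unanchored)
print(anchored)
```

Rows that do not match come back as NaN, so malformed IDs are easy to spot afterwards with isna().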
Rakesh

You can also delete everything from the first hyphen (-S) to the end:

df["Order_Date"]=df['Shipment ID'].replace(regex=r"\-.*",value="")
df
        Shipment ID Order_Date
0  20180504-S-20000   20180504
1  20180514-S-20537   20180514
2  20180514-S-20541   20180514
3  20180514-S-20644   20180514
4  20180514-S-20644   20180514
5  20180516-S-20009   20180516
6  20180516-S-20009   20180516
7  20180516-S-20009   20180516
8  20180516-S-20009   20180516

You can also capture the first 8 digits, match the rest of the string, and replace the whole match with a backreference to the captured group:

df['Shipment ID'].replace(regex=r"(\d{8}).*",value="\\1")
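
As a quick check, the two replace variants can be run side by side on a shortened copy of the data. The third line uses Series.str.replace with regex=True, which is not part of this answer but should give the same result:

```python
import pandas as pd

# Two rows taken from the question's data.
df = pd.DataFrame({"Shipment ID": ["20180504-S-20000", "20180516-S-20009"]})

# Delete everything from the first hyphen onwards.
by_delete = df["Shipment ID"].replace(regex=r"-.*", value="")

# Keep only the captured 8 digits via a backreference.
by_backref = df["Shipment ID"].replace(regex=r"(\d{8}).*", value=r"\1")

# Alternative (not from the answer): the .str accessor's replace.
by_str_replace = df["Shipment ID"].str.replace(r"-.*", "", regex=True)

df["Order_Date"] = by_backref
print(df)
```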
Onyambu

I decided to use the .str accessor with slicing, converting the column to strings first:

data['p_dt'] = data['timestamp'].astype(str).str[:8]
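
The astype(str) step matters when the column is not already a string column, because the .str accessor only works on string values. A minimal sketch, assuming a hypothetical integer timestamp column (the actual dtype is not stated here):

```python
import pandas as pd

# Hypothetical integer timestamps in YYYYMMDDhhmmss form.
data = pd.DataFrame({"timestamp": [20180504123000, 20180514080000]})

# Convert to strings first, then take the first 8 characters of each value.
data["p_dt"] = data["timestamp"].astype(str).str[:8]

print(data)
```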
johnnyheineken