
I have around 6 months of time series data with the columns 'entity' (free text) and 'cost'. My goal is to predict 'cost' given the date and entity. I have used word embeddings and word counts to incorporate the text feature into an LSTM. It is also possible that:

  • There are no records on a given date
  • Not every entity appears on every date

I am confused about how to represent the data. How do I feed multiple rows for a single date into an LSTM, and how will the LSTM handle new entities?

Below is a sample of the data.

[image: sample of the data]

I have converted my input to vectors:

[image: input converted to vectors]

Kindly assist me.

user1584253
  • Are the "entities" just "A", "B", and "C"? And are they different products? – Peyman Oct 20 '19 at 09:59
  • No, there are other entities too; it's just free text, like aa, developer, and so on – user1584253 Oct 20 '19 at 10:58
  • If there really are many "texts" that are not independent of the cost (so you have to do NLP), I don't think an LSTM would work after seeing just 6 months of data, because you have lots of features and very little data. NLP needs a good amount of data. But if you want to try, I think this link will help you understand how to deal with multiple columns: https://blog.usejournal.com/stock-market-prediction-by-recurrent-neural-network-on-lstm-model-56de700bff68?gi=476dc479dbf6. – Peyman Oct 20 '19 at 11:24
  • The link you gave deals with a dataset that has one record per date; I have more than one record per date. Also, I have created vectors to deal with the text, as you suggested – user1584253 Oct 20 '19 at 11:32
  • My answer is updated. I hope I understood your problem this time! – Peyman Oct 20 '19 at 12:22

1 Answer


I may not fully understand your problem, but as far as I can see in the picture, you have more than one record per day, and that is your main problem. Solving this is not very hard: note that you can pass tensors as your X and bind all of your multiple records into a matrix.

For example, a simple tensor of your data would look like this:

date :                        [Entity vector, Weight vector, …, Cost vector]
----------------------------------------------------------------------------
24/05/2019:          [[A, B, C, D], [18.1, 22, 36, 46], …, [25, 24, 23, 50]]
25/05/2019:                         [[A, B, C], [43, 44, 35], …, [24, 0, 0]]
27/05/2019: [[A, B, C, D, F], [34, 46, 31, 27, 60], …, [27, 24, 23, 50, 35]]

NOTE: all of your vectors may need to be the same length (for matrix multiplication). In that case you can use "padding", which just means putting 0 or -1 in place of missing entities. There are two possible scenarios for you:

1) If you have a finite set of entities, such as just A to F, you just add -1 for values that are not present. There is no need for the first vector, because the positions are fixed; for example, index 1 always represents A. After padding, the final tensors will look like this:

date : "indexes are [A, B, C, D, E, F]" [Weight vector, …, Cost vector]
-----------------------------------------------------------------------
24/05/2019:   [[18.1, 22, 36, 46, -1, -1], …, [25, 24, 23, 50, -1, -1]]
25/05/2019:       [[43, 44, 35, -1, -1, -1], …, [24, 0, 0, -1, -1, -1]]
27/05/2019:     [[34, 46, 31, 27, -1, 60], …, [27, 24, 23, 50, -1, 35]]
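
A minimal sketch of this first scenario, assuming the entity set really is fixed. The names `ENTITIES`, `PAD`, and `pad_by_entity` are made up for illustration, and the costs are copied from the example table. Each entity keeps its own position, so an entity that is absent on a date (such as E on 27/05/2019) gets the pad value in its own slot rather than at the end of the row:

```python
# Hypothetical fixed entity vocabulary; position i always means ENTITIES[i].
ENTITIES = ["A", "B", "C", "D", "E", "F"]
PAD = -1.0  # value used for entities that are absent on a date


def pad_by_entity(values_by_entity, entities=ENTITIES, pad=PAD):
    """Return one fixed-length row ordered by `entities`, padded where missing."""
    return [float(values_by_entity.get(e, pad)) for e in entities]


# Costs per date, copied from the example table in this answer.
costs = {
    "24/05/2019": {"A": 25, "B": 24, "C": 23, "D": 50},
    "25/05/2019": {"A": 24, "B": 0, "C": 0},
    "27/05/2019": {"A": 27, "B": 24, "C": 23, "D": 50, "F": 35},
}

padded = {date: pad_by_entity(row) for date, row in costs.items()}
for date, row in padded.items():
    print(date, row)
```

Since 0 already appears as a real cost in the sample (25/05/2019), -1 is probably the safer pad value here. The nested lists can be turned into an array afterwards, for instance with `np.asarray(rows, dtype="float32")`.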

2) If you have an unbounded set of entities, meaning your entities could be anything, then you have to keep the first vector and pad all vectors to the length of the longest one. After padding, the final tensors will look like this (supposing 27/05/2019 has the maximum length):

date :                            [Entity vector, Weight vector, …, Cost vector]
--------------------------------------------------------------------------------
24/05/2019:  [[A, B, C, D, -1], [18.1, 22, 36, 46, -1], …, [25, 24, 23, 50, -1]]
25/05/2019:     [[A, B, C, -1, -1], [43, 44, 35, -1, -1], …, [24, 0, 0, -1, -1]]
27/05/2019:     [[A, B, C, D, F], [34, 46, 31, 27, 60], …, [27, 24, 23, 50, 35]]
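
The second scenario could be sketched roughly as follows. `pad_to_max` is a hypothetical helper, and the integer ids standing in for A to F are an assumption made only so that the entity vector is numeric:

```python
PAD = -1  # pad value; must not collide with a real entity id or cost


def pad_to_max(rows, pad=PAD):
    """Right-pad every row with `pad` up to the length of the longest row."""
    width = max(len(row) for row in rows)
    return [list(row) + [pad] * (width - len(row)) for row in rows]


# Hypothetical integer ids for the entities (A=0, B=1, C=2, D=3, F=5).
entity_ids = [[0, 1, 2, 3], [0, 1, 2], [0, 1, 2, 3, 5]]
# Costs for 24/05, 25/05 and 27/05, copied from the example table.
costs = [[25, 24, 23, 50], [24, 0, 0], [27, 24, 23, 50, 35]]

padded_ids = pad_to_max(entity_ids)
padded_costs = pad_to_max(costs)
print(padded_ids)
print(padded_costs)
```

The entity row and the cost row of a date are padded to the same width, so position i in one still lines up with position i in the other.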

TIP: if your entities are more than one word, you can use a hash to map each one to a single number. (I don't recommend using a series of word embeddings for this! That is too heavy for an LSTM model trained on 6 months of data, and you won't get a good result out of it.)
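
One way the hashing tip might look in practice is sketched below. `entity_to_id` and `num_buckets` are illustrative names, not anything from the question. It uses `hashlib` rather than Python's built-in `hash()`, because `hash()` on strings is randomized per process by default and so would not give the same number between training and prediction runs:

```python
import hashlib


def entity_to_id(text, num_buckets=10_000):
    """Map a free-text entity to a stable integer in [0, num_buckets)."""
    # Normalize case and whitespace so trivial variants share one id.
    normalized = " ".join(text.lower().split())
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


for name in ["aa", "developer", "Senior  Developer", "senior developer"]:
    print(name, "->", entity_to_id(name))
```

Two different entities can land in the same bucket, so `num_buckets` should be well above the number of distinct entities you expect. A new, unseen entity still gets a valid id, although the model has learned nothing about it.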


Now you feed these vectors into your LSTM. In the picture below, X0, X1, … are these tensors (and you may expect the next day's price from the h outputs).

[image: LSTM diagram with inputs X0, X1, … and outputs h]
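
To make the "feed a sequence, predict the next day" idea concrete, here is a rough sketch that slices padded daily rows into training pairs. `make_windows` and `timesteps` are hypothetical names, and the fourth day is invented purely so that there are enough rows to form two windows:

```python
def make_windows(daily_rows, targets, timesteps):
    """Pair each run of `timesteps` consecutive days with the next day's target."""
    X, y = [], []
    for start in range(len(daily_rows) - timesteps):
        X.append(daily_rows[start:start + timesteps])
        y.append(targets[start + timesteps])
    return X, y


# One padded cost vector per day, indexed by entity [A, B, C, D, E, F].
daily_costs = [
    [25, 24, 23, 50, -1, -1],
    [24, 0, 0, -1, -1, -1],
    [27, 24, 23, 50, -1, 35],
    [26, 25, 22, 49, -1, 30],  # hypothetical extra day, not from the question
]

X, y = make_windows(daily_costs, daily_costs, timesteps=2)
print(len(X), len(X[0]), len(X[0][0]))  # samples, timesteps, features
```

A Keras LSTM layer expects 3-D input of shape (samples, timesteps, features), which is what `X` has once it is converted to a float array. A date with no records at all could be represented as a row made up entirely of the pad value, but that is only one possible choice.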

Peyman
  • What about the target variable (cost)? How do I represent it? Also, I have additional columns that I will use as features, like word_count and so on. How will I represent them? – user1584253 Oct 20 '19 at 08:55
  • @user1584253 the "cost" is your target now! It is a time series: you give `cost x0`, `cost x1`, ..., `cost x{t}` to your RNN, with `cost x{t+1}` as the target (that is how RNNs work). If you have additional columns, please describe your problem in detail so I can understand it. – Peyman Oct 20 '19 at 09:04
  • Sorry about the short question. I have elaborated on my question. – user1584253 Oct 20 '19 at 09:32
  • I have molded my data with the solution you provided and updated my question. I am now training my model on it, then I'll update – user1584253 Oct 22 '19 at 05:39
  • When I try to convert it to a tensor, it gives the error "ValueError: Failed to convert numpy ndarray to a Tensor (Unable to get element as bytes.)." – user1584253 Oct 22 '19 at 09:23
  • Here are the shapes of my data: x_train: (249, 7) y_train: (249,) x_val: (2, 7) y_val: (2,) – user1584253 Oct 22 '19 at 09:24