
I want to pre-train BERT from scratch on my own domain data. Here is some basic information about the data:

  1. millions of texts and a vocabulary of around 60,000 tokens;
  2. each text is a list of numbers, like "2233 3454 3679" (see the tokenization sketch after this list);
  3. the longest text is under 15 tokens; most are 3 or 4;
  4. each text is an AS-Path from a BGP update or announcement.
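
Since each token is already an atomic AS number, I plan to use a simple word-level vocabulary instead of WordPiece. Here is a minimal sketch, assuming the HuggingFace tokenizers library and a hypothetical one-path-per-line corpus file as_paths.txt:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, processors, trainers

# Word-level model: every whitespace-separated AS number is one token.
tokenizer = Tokenizer(models.WordLevel(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Build the vocabulary from the corpus, reserving BERT's special tokens.
trainer = trainers.WordLevelTrainer(
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
)
tokenizer.train(["as_paths.txt"], trainer)  # hypothetical corpus file

# Frame every encoded path as [CLS] ... [SEP], as BERT expects.
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS] $A [SEP]",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)

print(tokenizer.encode("2233 3454 3679").tokens)
# -> ['[CLS]', '2233', '3454', '3679', '[SEP]']
```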

So I'm wondering: are there any tricks for pre-training BERT on this type of data to get good performance?
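
For context, this is roughly the setup I have in mind: a minimal sketch assuming HuggingFace transformers and datasets, reusing the tokenizer from the sketch above. The file name and hyperparameters are placeholders, not tuned values.

```python
from datasets import load_dataset
from transformers import (
    BertConfig,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    PreTrainedTokenizerFast,
    Trainer,
    TrainingArguments,
)

# Wrap the word-level tokenizer so transformers can use it.
hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,  # from the sketch above
    pad_token="[PAD]", unk_token="[UNK]",
    cls_token="[CLS]", sep_token="[SEP]", mask_token="[MASK]",
)

# A small BERT: short sequences need few position embeddings,
# and the ~60k-entry vocabulary dominates the parameter count anyway.
config = BertConfig(
    vocab_size=hf_tokenizer.vocab_size,
    max_position_embeddings=32,  # paths are under 15 tokens plus [CLS]/[SEP]
    hidden_size=256,
    num_hidden_layers=4,
    num_attention_heads=4,
    intermediate_size=1024,
)
model = BertForMaskedLM(config)

# One AS-path per line; tokenize the whole corpus up front.
dataset = load_dataset("text", data_files="as_paths.txt")["train"]
dataset = dataset.map(
    lambda batch: hf_tokenizer(batch["text"], truncation=True, max_length=32),
    batched=True,
    remove_columns=["text"],
)

# Standard masked-language-modeling objective (15% masking).
collator = DataCollatorForLanguageModeling(
    tokenizer=hf_tokenizer, mlm=True, mlm_probability=0.15
)

Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="bert-aspath",
        per_device_train_batch_size=256,
        num_train_epochs=3,
    ),
    train_dataset=dataset,
    data_collator=collator,
).train()
```

Since short AS-paths don't come in natural sentence pairs, I plan to skip next-sentence prediction and pre-train with masked language modeling only (hence BertForMaskedLM above).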

  • Hi @Jacjeu Shi. Accepting answers and upvoting motivates everybody. Actually, the WordPiece algorithm, which is designed for subword tokenization of words, might not be the optimal choice here (I think). Some considerations for you: create a set of tokens, assign IDs, add special tokens, subword tokenization, etc. (a plain-Python sketch of these steps appears after these comments). – Ilya Aug 15 '23 at 05:26
  • Welcome to Stack Overflow! Asking for recommendations might not be appropriate on Stack Overflow (https://stackoverflow.com/help/how-to-ask), but it might be possible to ask the question on https://softwarerecs.stackexchange.com. Also consider logging it on https://stackoverflow.com/collectives/nlp/beta/discussions/76949597 – alvas Aug 25 '23 at 16:33
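
As a plain-Python illustration of the steps in Ilya's comment (hypothetical names; the subword step is skipped because each AS number is already an atomic token):

```python
# Steps from the comment above: 1) create a set of tokens,
# 2) assign IDs, 3) reserve special tokens. Names are hypothetical.
corpus = ["2233 3454 3679", "2233 3679"]  # placeholder AS-paths

specials = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
token_set = {tok for text in corpus for tok in text.split()}            # create a set
vocab = {tok: i for i, tok in enumerate(specials + sorted(token_set))}  # assign IDs

def encode(text: str) -> list[int]:
    """Map a space-separated AS-path to IDs with [CLS]/[SEP] framing."""
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in text.split()]
    return [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]

print(encode("2233 9999"))  # an unseen AS number maps to the [UNK] id
```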

0 Answers