Lital Kuchy

From One-Hot Encoding to BERT

Technion

Bio

I am currently pursuing my master’s degree in Data Science at the Technion. My research thesis is in the areas of Information Retrieval and Machine Learning.

 

For the last three years, I have been working as a data scientist. As part of my job, I develop ML models in the fields of NLP and anomaly detection.

 

I am also a lecturer in the Division of Continuing Education at the Technion, teaching an eight-month course designed to provide the knowledge and tools needed to develop ML projects and deploy them in production systems.

Abstract

Text representation is one of the fundamental problems in text mining and information retrieval. It is used extensively in classification (e.g., sentiment analysis), text similarity (e.g., between a query and potential documents), clustering, question answering, translation and more.

 

Many methods for representing text have been developed over the years, from bag of words and tf-idf, through word2vec, to the latest state-of-the-art transformer models (including BERT, GPT, and XLNet). Each method has its pros and cons. In this round table discussion, we will review a few of these methods. We will then turn to the application side and see how text embeddings are used in different tasks.
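
As a minimal sketch of this progression, assuming scikit-learn, PyTorch, and the Hugging Face transformers library are available (the toy sentences and the bert-base-uncased checkpoint are illustrative choices, not material from the talk):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

# Bag of words: each document becomes a vector of raw term counts,
# one dimension per vocabulary word (one-hot vectors, summed).
bow = CountVectorizer().fit(docs)
print(bow.get_feature_names_out())
print(bow.transform(docs).toarray())

# tf-idf: the same counts, re-weighted to down-weight terms that
# appear in many documents and carry little discriminative signal.
tfidf = TfidfVectorizer().fit(docs)
print(tfidf.transform(docs).toarray().round(2))

# Contextual embeddings from a pretrained BERT: one 768-dimensional
# vector per token, dependent on the surrounding words.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
inputs = tokenizer(docs[0], return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state
print(hidden.shape)  # (1, number_of_tokens, 768)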

 

Discussion of real-life applications will be encouraged, so as to learn about the challenges of moving from the research domain to practical usage. We will follow up with a discussion of different methods for computing text similarity, e.g., Jaccard similarity, cosine similarity, and Word Mover’s Distance.
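
The first two of these measures are simple enough to state directly in code; the sketch below, with made-up sentences and vectors, is purely illustrative. Word Mover’s Distance additionally requires pretrained word embeddings (gensim, for instance, exposes it as wmdistance on a set of word vectors).

import math

def jaccard_similarity(a: str, b: str) -> float:
    # Jaccard: size of the token-set intersection over the union.
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def cosine_similarity(u, v) -> float:
    # Cosine: dot product normalized by the two vector lengths,
    # i.e., the angle between the two embedding vectors.
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm

print(jaccard_similarity("the cat sat on the mat",
                         "the dog sat on the log"))  # 3 shared / 7 total ≈ 0.43
print(cosine_similarity([1.0, 2.0, 0.0], [1.0, 1.0, 1.0]))  # ≈ 0.77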
