Named Entity Recognition in Tweets

Lakshya Rajoria

Authors

Lakshya Rajoria, Jaipur, India

Keywords:

Broad Twitter Corpus, Indian tweets, Named entity recognition, Natural Language Processing, Semi-Supervised Learning, Twitter

Abstract

Named Entity Recognition (NER) is a Natural Language Processing (NLP) task and a form of information extraction that locates named entities in unstructured text and classifies them into categories such as locations, people, and organizations. Conventional NLP tools perform well on formal text such as news articles, but their performance degrades severely on tweets, which are noisy, informal messages of at most 280 characters. NER on tweets is made harder still by the limited context in a single tweet, by named entities that are out-of-vocabulary (OOV), and by the lack of training data. Several approaches have recently been proposed to address these problems, including part-of-speech (POS) tagging, which labels words and phrases as nouns, verbs, and so on, Conditional Random Fields (CRFs), normalization, and various forms of distant supervision or unsupervised learning. In this paper, we propose a domain adaptation approach in which the Broad Twitter Corpus, or BTC (Derczynski et al. 2016), is preferred over the Ritter et al. 2011 dataset as the source of training, development, and test data. The BTC is not only significantly larger than the Ritter dataset but is also sampled across different regions, time periods, and types of Twitter users. To examine the treatment of named entities further, we apply domain transfer: we modify the Ritter et al. 2011 corpus to match the three named entity types specified in the BTC (Person, Location, Organization) and use the algorithms put forward by Ritter et al. to evaluate the BTC data. In addition to the BTC data, we evaluate the results on our own baseline dataset of Indian tweets. With these new datasets, we aim to test state-of-the-art NLP and machine learning algorithms.
We show that evaluating the Ritter algorithms on the BTC and on the Indian tweets increases the FB1 score by 34.69 (BTC development set) and 6.65, respectively, compared with the tests run using the Ritter training data.
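The label alignment and the FB1 measure mentioned in the abstract can be illustrated with a short sketch. This is not the paper's implementation: the Ritter type names used here, the choice of which types fold into Organization, and the helper names (RITTER_TO_BTC, remap_sentence, spans, fb1) are all assumptions made for illustration. The sketch assumes BIO-tagged tokens and computes FB1 as the harmonic mean of entity-level precision and recall.

```python
# Hypothetical mapping from Ritter-style entity types to the three BTC types.
# Which types map to ORG, and which are dropped, is an assumption for this sketch.
RITTER_TO_BTC = {
    "person": "PER",
    "geo-loc": "LOC",
    "company": "ORG",
    "sportsteam": "ORG",
    "band": "ORG",
}


def remap_tag(tag):
    """Map one BIO tag (e.g. 'B-geo-loc') to the BTC scheme; unmapped types become 'O'."""
    if tag == "O":
        return "O"
    prefix, _, etype = tag.partition("-")  # split only at the first hyphen
    target = RITTER_TO_BTC.get(etype)
    return f"{prefix}-{target}" if target else "O"


def remap_sentence(tags):
    """Remap every tag in one sentence."""
    return [remap_tag(t) for t in tags]


def spans(tags):
    """Return the set of (start, end, type) entity spans in a BIO sequence.

    Simplification: an I- tag with no preceding B- tag of the same type is ignored.
    """
    found, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        inside = tag.startswith("I-") and tag[2:] == etype
        if start is not None and not inside:
            found.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return found


def fb1(gold, pred):
    """Entity-level F-score with beta = 1: a span counts only if boundaries and type match."""
    g, p = spans(gold), spans(pred)
    tp = len(g & p)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    ritter_tags = ["B-person", "I-person", "O", "B-geo-loc", "B-movie"]
    print(remap_sentence(ritter_tags))
    print(fb1(["B-PER", "I-PER", "O", "B-LOC"], ["B-PER", "I-PER", "O", "O"]))
```

Dropping unmapped types to "O" keeps the tag set identical to the BTC's, at the cost of discarding annotations that have no counterpart there. A different alignment would change the scores, so the mapping should be reported alongside any result.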

