Text-ferramenta dataset

The dataset consists of text descriptions of adverts, written in Italian, from online and physical retailers. Each description contains features and other technical information about the item on sale. The resulting texts are short and often grammatically ill-formed, which makes the dataset more compelling. We created the ground truth using query-based software that clusters commercial offers by text matching. The dataset contains 88,010 text descriptions, randomly split into 66,141 for the training set and 21,869 for the test set, belonging to 52 classes in the hardware category (e.g. paint brush, hinge, tape, safe, chain, ladder, cart). The descriptions contain 22,045 distinct words in the training set and 20,083 in the test set.

[download training set]
[download validation set]
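
The statistics quoted above (split sizes and number of distinct words) can be checked with a few lines of Python. This is only a minimal sketch: the file format of the downloads is not described here, so the example works on in-memory lists, the two sample descriptions are invented for illustration, and lowercased whitespace tokenization is an assumption that may not reproduce the exact word counts reported above.

```python
from collections import Counter

# Split sizes as stated in the dataset description.
TRAIN_SIZE = 66_141
TEST_SIZE = 21_869


def vocabulary_size(texts):
    """Count distinct tokens, assuming lowercased whitespace tokenization.

    The tokenization behind the published word counts is not stated,
    so this is an assumption, not the authors' procedure.
    """
    vocab = set()
    for text in texts:
        vocab.update(text.lower().split())
    return len(vocab)


def class_distribution(labels):
    """Return the number of instances per class label."""
    return Counter(labels)


total = TRAIN_SIZE + TEST_SIZE
train_fraction = TRAIN_SIZE / total

# Hypothetical sample descriptions and labels, invented for illustration.
sample_texts = ["Pennello piatto 40 mm", "pennello tondo 20 mm"]
sample_labels = ["paint brush", "paint brush"]

print(f"total instances: {total}")
print(f"training fraction: {train_fraction:.2f}")
print(f"sample vocabulary size: {vocabulary_size(sample_texts)}")
print(f"sample class counts: {dict(class_distribution(sample_labels))}")
```

Once the files are loaded into lists of descriptions and labels, `vocabulary_size` and `class_distribution` can be applied to each split separately to compare against the figures given in the description.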

Please cite the paper "Semantic Text Encoding for Text Classification using Convolutional Neural Networks" if you use this dataset.
Authors: Ignazio Gallo, Shah Nawaz and Alessandro Calefati

Applied Recognition Technology Laboratory

Department of Theoretical and Applied Science
