We propose efficient and effective algorithms to perform approximate string joins with abbreviations in database systems.

String joins have wide applications in data integration and cleaning. The inconsistency of data caused by data errors,
term variations and missing values has led to the need for
approximate string joins (ASJ). In this work, we study
ASJ with abbreviations, which are a frequent type of term
variation. Our method is an end-to-end workflow with three main components: (1) a new string similarity measure that takes abbreviations into
account, (2) an efficient join algorithm following the filter-verification
framework, and (3) an automatic approach for
learning a dictionary of abbreviation rules from input strings.
We evaluate our workflow on four real-world datasets and
show that our workflow outputs accurate join results, scales
well as input size grows and greatly outperforms state-of-the-art
alternatives in both accuracy and efficiency.
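
To make the filter-verification idea concrete, the following is a minimal sketch and not the algorithm proposed in this work. It assumes a small hand-written rule dictionary (`RULES`), token-level Jaccard similarity as a stand-in for the similarity measure, and a simple shared-token plus length filter; the names `normalize`, `jaccard`, and `abbreviation_join` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical abbreviation rules (abbreviation -> full form); illustrative only.
RULES = {"dept": "department", "univ": "university", "cs": "computer science"}


def normalize(s):
    """Lowercase, split on whitespace, and expand tokens found in RULES."""
    tokens = []
    for tok in s.lower().split():
        tokens.extend(RULES.get(tok, tok).split())
    return frozenset(tokens)


def jaccard(a, b):
    """Jaccard similarity of two token sets (stand-in similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def abbreviation_join(left, right, threshold=0.8):
    """Return pairs (l, r) whose expanded token sets have Jaccard >= threshold."""
    right_sets = [normalize(s) for s in right]

    # Build an inverted index from token to positions in `right`.
    index = defaultdict(set)
    for j, toks in enumerate(right_sets):
        for t in toks:
            index[t].add(j)

    results = []
    for s in left:
        a = normalize(s)

        # Filter step 1: only strings sharing at least one token are candidates.
        candidates = set()
        for t in a:
            candidates |= index.get(t, set())

        for j in sorted(candidates):
            b = right_sets[j]
            # Filter step 2: Jaccard >= threshold implies
            # min(|a|, |b|) >= threshold * max(|a|, |b|).
            if min(len(a), len(b)) < threshold * max(len(a), len(b)):
                continue
            # Verification step: compute the actual similarity.
            if jaccard(a, b) >= threshold:
                results.append((s, right[j]))
    return results


left = ["Dept of CS", "Univ of Oxford"]
right = [
    "department of computer science",
    "university of cambridge",
    "university of oxford",
]
pairs = abbreviation_join(left, right)
```

In this toy input, `pairs` contains the two matching pairs, while "university of cambridge" is rejected in the verification step. A learned rule dictionary and a similarity measure designed for abbreviations would replace `RULES` and `jaccard` in a real workflow.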
