Data Collection

How to tackle text data collection challenges and improve text annotation

Language played a pivotal role in the ascendancy of humans over (other) animals. It enabled thoughts and ideas to be transferred from one generation to the next, thus allowing successive generations to improve on them and progress. Writing gave this a boost and the internet supercharged it.

And now, thanks to the abundance of text data and computational prowess (and human ingenuity), it has become possible to create intelligence, or at least a semblance of it, artificially. Text data are the bedrock of chatbots, sentiment analysis and text classification systems, and machine translation.

Harnessing the power of data requires meticulous preparation involving multiple steps. Two crucial ones are data collection and annotation, both of which are unglamorous and come with their own set of challenges. Understanding these is essential.

Challenges in text data collection

Text data collection is the first—and arguably the most significant—step in any AI/ML/data analysis project’s pipeline. It impacts everything that follows.

Collecting text data may at first seem straightforward. It is anything but. The process starts with defining and articulating what you want, followed by finding the relevant sources, choosing the right tools and methods, determining how much data to collect, and storing the data for subsequent processes. Each of these steps is beset by challenges that, if not addressed, can cause issues both in the process itself and downstream.

Lack of availability of relevant data sources

The quality of a data set depends on its sources, and that quality in turn influences the accuracy of any project. Finding reliable, relevant, and diverse sources that are representative of the real world, or of the context in which an AI model is expected to operate, is a hurdle, however. It is thus easy to fall for what is known as “convenience sampling”: collecting data merely because they are easy to access and not, primarily, for their relevance or accuracy. Internal data may scarcely suffice and external data sources may be inaccurate, incomplete, outdated, or poisoned.

The lack of available data is a considerable challenge. A 2018 McKinsey survey showed a lack of collected data as the most significant barrier to the adoption of AI for 24 percent of organizations; the challenge persists even today. That nearly nine-tenths of companies do not have a clear strategy for sourcing data does not help.

Volume and diversity of data

The sheer volume, variety, and velocity of data can make the process of collecting text data daunting—and the veracity of the data dubious. Massive amounts of data can make it difficult to ensure that the sampled data are balanced and representative, making it more likely for biases to creep in. The variety of data sources and types adds complexity to the text data collection process.

Another challenge is keeping up with the flux in data. Data acquisition is seldom a one-time process. Unless the training data are updated, the tool, project, or model will soon lose relevance.

Privacy and legal compliance

Data privacy regulations such as the GDPR and the California Consumer Privacy Act, along with copyright laws, impose significant constraints on data collection, though they should not be viewed as a challenge in themselves. Gathering data in contravention of these can incur heavy penalties.

Besides the legal concerns, ethical ones also need to be considered. This entails respecting data privacy, avoiding biases, and ensuring fair use of the collected data. Companies may also have policies against data scraping that need to be respected.
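One machine-readable place where a site may state its scraping policy is its robots.txt file. The sketch below, using Python's standard urllib.robotparser module, checks whether a URL may be fetched before a collector requests it. The robots.txt content, the user agent name, and the example URLs are hypothetical, and passing this check does not by itself establish legal or contractual permission.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real collector would download
# this file from the target site before crawling it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/articles/1",
                "https://example.com/private/data"):
        print(url, is_allowed(ROBOTS_TXT, "my-collector", url))
```

A site's terms of service may be stricter than its robots.txt, so a check like this is best treated as a minimum courtesy rather than a complete compliance test.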

Data extraction and scraping

The web provides a rich source of text data, but scraping it is not always smooth, and there are pitfalls that one can fall into. Different websites may follow different web standards and structures. Websites may also have anti-scraping measures that hamper data collection.

There may also be duplication of content across different pages or websites. This can lead to over-representation of some content. Websites may also present information in different layouts, such as tables, lists, or free-form text, and the data may be available in different file formats (for example, PDF and plain text), all of which can make the extraction process more tedious.
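Duplicated content can be filtered out before it skews the data set. The following is a minimal sketch, assuming that two documents count as duplicates when they are identical after lowercasing and whitespace normalization; the function names are illustrative. It will not catch near-duplicates such as lightly edited copies, which call for fuzzier techniques.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(documents: list[str]) -> list[str]:
    """Keep the first occurrence of each document, comparing normalized content."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in documents:
        # Hashing keeps memory use low when documents are long.
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


if __name__ == "__main__":
    pages = ["Hello  world", "hello world ", "Something else"]
    print(deduplicate(pages))
```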

Data storage and processing

Storage and processing, though strictly not part of the text data collection process, are critical aspects of any AI/ML project. Managing and processing vast quantities of data present a number of challenges encompassing efficiency, scalability, security, and compliance.

The choice of a data repository largely depends on the amount and types of data collected. Ensuring that the storage is scalable, handles both structured and unstructured data efficiently, allows effective indexing and fast retrieval, and has robust security mechanisms among other things is essential.

Data collection is but one aspect of data preparation. Before the data can be used to train a supervised machine learning algorithm, they need to be annotated so that computer systems can learn from and interpret them. Labeling text provides the context, meaning, and intent behind a word or phrase.

Best text annotation practices

Annotating text not only makes it easier for computers to learn but also makes it easier to index, find, and analyze the data. The efficacy with which a computer does these depends greatly on the quality of the labeling. Incorporating certain best practices of text annotation can greatly enhance the accuracy and reliability of the data. Below are some key ones.

Choose the right sample to annotate

The ideal would be to annotate all the raw data, which is rarely feasible. The next best thing is therefore to select a diverse and representative sample to annotate. How the sample is chosen should reflect how specific or generic the intended application is. This clarity helps reduce resource wastage and improves efficiency.

A diverse and balanced sample helps mitigate bias and prevent overfitting. A sample containing a variety of linguistic patterns, regional and ethnic peculiarities, topics, and contexts will help in making the model more generalizable.
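One common way to keep a sample balanced is stratified sampling: group the raw records by some attribute and draw from each group separately. The sketch below is an illustration under assumptions, not a prescribed method; the "source" field, the group sizes, and the function name are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, per_group, seed=0):
    """Draw up to per_group records from each group defined by key(record)."""
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    sample = []
    for name in sorted(groups):
        members = groups[name]
        # Small groups contribute everything they have.
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample


if __name__ == "__main__":
    records = (
        [{"source": "news", "text": f"news item {i}"} for i in range(10)]
        + [{"source": "forum", "text": f"forum post {i}"} for i in range(2)]
    )
    for record in stratified_sample(records, key=lambda r: r["source"], per_group=3):
        print(record)
```

Drawing the same number of records per group trades proportional representation for balance; sampling in proportion to group size is the alternative when the real-world distribution matters more.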

Establish clear guidelines

A clear and comprehensive guideline is crucial. It helps reduce human bias and subjectivity and ensures that annotations are consistent.

The guideline should not be a mere rubric. Provide tips and tricks, ways to handle ambiguities, and examples. The instructions should be such that everyone, amateur or professional, can follow them.

Have a diverse set of annotators

Having multiple annotators from diverse backgrounds introduces diversity of perspectives and reduces subjectivity and biases invariably associated with humans. This also helps in identifying and rectifying discrepancies in the annotations and makes the annotations more thorough.

To preclude replication of bias and provide maximum room for individual expression, let the annotators label the data independently. Then have them review each other’s annotations and assess the level of agreement. High agreement is a good indication of reliability.
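Agreement between two annotators is often quantified with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is one possible way to compute it for two annotators who labeled the same items; the label values are made up, and returning 1.0 when both annotators use a single identical label throughout is a convention chosen here.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item list.")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: based on how often each annotator uses each label.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # assumption: treat the degenerate single-label case as full agreement
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    annotator_1 = ["pos", "pos", "neg", "neg"]
    annotator_2 = ["pos", "neg", "neg", "neg"]
    print(cohen_kappa(annotator_1, annotator_2))
```

Values near 1 indicate strong agreement and values near 0 indicate agreement no better than chance. With more than two annotators, a multi-rater statistic such as Fleiss' kappa would be the usual substitute.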

Randomly sample annotations for quality check

Periodic, random review of a subset of annotations helps assess their quality. It surfaces potential inconsistencies and inaccuracies so that they can be addressed early on. It also highlights areas that are prone to errors, which enables proactive action to reduce them. A timely intervention prevents the propagation of errors throughout the dataset.
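A random quality check can be as simple as drawing a fraction of the annotations for a second reviewer and tracking how many are rejected. This is a minimal sketch; the 10 percent default, the identifiers, and the function names are assumptions made for illustration.

```python
import random


def sample_for_review(annotation_ids, fraction=0.1, seed=None):
    """Pick a random subset of annotation IDs to send for a second review."""
    ids = list(annotation_ids)
    if not ids:
        return []
    rng = random.Random(seed)
    # Review at least one item, and never more than exist.
    k = min(len(ids), max(1, round(len(ids) * fraction)))
    return rng.sample(ids, k)


def observed_error_rate(review_results):
    """Share of reviewed annotations the reviewer rejected (False = rejected)."""
    if not review_results:
        raise ValueError("No review results to summarize.")
    return review_results.count(False) / len(review_results)


if __name__ == "__main__":
    batch = sample_for_review(range(200), fraction=0.05, seed=7)
    print(len(batch), "annotations selected for review")
    print(observed_error_rate([True, True, False, True]))
```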

Provide regular feedback

Regular feedback and open communication are crucial. Continuous feedback helps ensure that annotators adhere to the guidelines, increasing consistency and reducing subjectivity. It also fosters communication and collaboration between annotators at different levels.

An effective feedback mechanism should make it easy not just to receive feedback but also to raise questions, resolve issues, and share learnings. This can help in refining guidelines and optimizing the annotation approach.

The result of this is continuous improvement, greater consistency and accuracy, and shared learning.

The optimal way to collect and annotate text data

Data collection and annotation are two essential pillars of any AI system. Text data collection forms the basis of data preparation and the subsequent training processes. The accuracy of any system or analysis will only be as good as the data collected—and the annotations made thereon.

To be any good, data acquisition and annotation have to be representative and accurate, cost-effective, and easily and rapidly scalable. This isn’t easy. But there are solutions.

Automation has made both data collection and annotation much less tedious. It is, however, a partial answer at best. Data collection service providers are another option. They sit between two less ideal approaches, full automation and in-house data preparation: the former is prone to error and the latter is expensive. Companies providing text data collection services can help collect and annotate data at scale, at reasonable cost, and with accuracy.