Data is often called the backbone of modern industry, but data and metadata are not the same thing. Data is the content itself; metadata is information about that content:

Metadata, often referred to as “data about data”, is the detailed information generated about an organization's data assets. This information can give actionable insights about the data it describes.

It is also generated whenever communication in an organization happens through technology. It can include timestamps of calls and conversations, locations and names of the recipients of transactions, keyword descriptions of the content, file type, date of creation, dates of modification, tracked changes, sources of the data and its uploads, and version history.
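As a small illustration of a few of the fields listed above, the following sketch reads file-system metadata for a single file using only the Python standard library. The function name `basic_file_metadata` and the returned field names are assumptions made for this example; they are not part of any particular metadata tool.

```python
import mimetypes
import os
from datetime import datetime, timezone


def basic_file_metadata(path):
    """Collect a few file-system metadata fields for the file at `path`."""
    info = os.stat(path)
    # guess_type() infers the type from the file extension only.
    guessed_type, _ = mimetypes.guess_type(path)
    modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
    return {
        "name": os.path.basename(path),
        "file_type": guessed_type or "unknown",
        "size_bytes": info.st_size,
        # Creation time is left out: it is not reported consistently
        # across operating systems.
        "modified": modified.isoformat(),
    }
```

Richer fields such as tracked changes or version history live inside the document format itself and would need a format-specific reader.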

Metadata can be extracted from both structured and unstructured data using metadata extraction and management tools. Structured data includes data entered through surveys, questionnaires, or card swipes. Unstructured data is data available across the internet in the form of tweets, user reviews, or engagement on social media. In both cases, the metadata is only useful if the user has provided the data attentively and willingly.

Apart from textual documents, metadata can also be recorded for images. This includes the photographer’s details, the date and time the picture was taken, administrative rights, keywords, the type of compression, the camera and lenses used, lens aperture, focal length, shutter speed, ISO sensitivity, whether flash was used, and the resolution of the image.

Types of metadata:

1. Structural metadata:

Structural metadata describes how a resource is organized. For example, in a book it might be the table of contents, with its chapters, sections, and page numbers. It captures hierarchy by storing object-to-object relationships. This is useful for deriving the relationship between two data assets and for identifying versions more easily.

2. Technical metadata:

Technical metadata covers the technical information about a resource: usage rights and intellectual property, file type, dates of creation and modification, and all the information needed to decode and render the files. For photos, metadata such as the resolution, lens aperture, and camera used falls under this category.

3. Descriptive metadata:

Metadata that helps identify assets falls under this category. Descriptive metadata includes the name of the author, the title of the document, database information, attributes, and relevant keywords, all of which make documents easier to discover and the document database easier to navigate.
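The three categories above can be pictured as groups of fields on a single record. The following is a minimal sketch, assuming a hypothetical `MetadataRecord` class and made-up sample documents; real metadata systems define their own schemas.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MetadataRecord:
    # Descriptive: helps people find and identify the asset.
    title: str
    author: str
    keywords: List[str] = field(default_factory=list)
    # Technical: needed to decode and render the file.
    file_type: str = "unknown"
    # Structural: object-to-object relationships and versions.
    parent_id: Optional[str] = None
    version: int = 1


def find_by_keyword(records, keyword):
    """Return the records whose descriptive keywords include `keyword`."""
    wanted = keyword.lower()
    return [r for r in records if wanted in (k.lower() for k in r.keywords)]


# Hypothetical sample data, for illustration only.
records = [
    MetadataRecord("Q3 Sales Report", "A. Author", ["sales", "quarterly"],
                   "application/pdf"),
    MetadataRecord("Q3 Sales Report", "A. Author", ["sales", "quarterly"],
                   "application/pdf", parent_id="doc-1", version=2),
    MetadataRecord("Onboarding Guide", "B. Writer", ["hr"], "text/plain"),
]
```

Here `find_by_keyword(records, "sales")` relies only on descriptive fields, while `parent_id` and `version` carry the structural link between the two versions of the report.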

What part can Artificial Intelligence play in metadata extraction?

Automatic Metadata Extraction (AME) tools use AI techniques such as Machine Learning (ML), Deep Learning, string pattern search, and Natural Language Processing (NLP) to identify potential keyword tags from the content. The main aim of such tools is to automate the bulk processing of documents that have a large number of attributes, so that any remaining fields can be filled in manually with the relevant information later. The following are some automatic metadata extraction techniques:

1. Form recognition:

Forms have a definite structure and are therefore easy for computers to read. Once a form has been read automatically, the system identifies its type from among the form types it has been given and can then recognize the individual data fields.

2. Links with external system:

Information related to one document isn’t always found in one place. Cross-departmental information can be combined in the form of metadata, to provide a holistic view of the document.

3. Use in case systems:

Metadata used for archiving needs to cover every version of a document and every editor who worked on it. This information is derived automatically from the company's system logs.

4. Text analysis:

Computers can already read text from scanned documents using techniques like Optical Character Recognition (OCR). A rule-based metadata search model can be created by programming the ML algorithms. These algorithms extract the embedded metadata, which can then be used to classify emails, for example into complaints, recommendations, and instructions.
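To make the email example concrete, here is a minimal sketch of the rule-based part only. It uses hand-written keyword patterns rather than a trained ML model, and it assumes the text has already been extracted (for instance by OCR). The labels, patterns, and the `classify_email` function are assumptions chosen for illustration.

```python
import re

# Hypothetical keyword rules; a real system would tune or learn these.
RULES = {
    "complaint": re.compile(
        r"\b(?:complain\w*|refund|unacceptable|disappointed)\b", re.I),
    "recommendation": re.compile(r"\b(?:recommend\w*|suggest\w*)\b", re.I),
    "instruction": re.compile(r"\b(?:please follow|steps?|how to)\b", re.I),
}


def classify_email(body):
    """Return the label whose rule matches most often, or 'unclassified'."""
    scores = {label: len(rule.findall(body)) for label, rule in RULES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unclassified"
```

Counting matches per label is the simplest possible scoring; on a tie, the label listed first in `RULES` wins, which is a design shortcut rather than a recommendation.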

With AME, more structured data can be formed at a lower cost and with better precision.

Identification of Red Flags:

Now that the metadata, which is a fairly accurate description of almost all the data present in an organization, is in place, what if someone with malicious intentions gets their hands on it?

For instance, suppose an employee has made a presentation using some references, downloaded images, and a little help from peers. Metadata records could easily reveal the versions of the document, the subordinates who reviewed and contributed to the presentation, the sources of the images used, ink annotations, the server location, and comments. While all this data might seem insignificant at the moment, someone with nefarious intentions can easily misuse it.

Another probable red flag is the exposure of classified data. Exposure of confidential files can lead to a leak of sensitive personal and financial data, the company’s work strategies, and hidden information. Staff working in risk and compliance roles have to identify measures to restrict access to such resources.

Apart from the proprietary data in an organization, metadata can also give access to emails, messages, and call records. These can reveal a lot about the unique digital trail that employees leave behind. This trail could include employee routines, organizational perspective, personal relations, decision-driven changes, and the software used. If this data is leaked and used for identification purposes, it could be considered a personal data breach under the General Data Protection Regulation (GDPR).


The end goal of any organization is to reach a point where it can take a holistic approach to data extraction and usage. Metadata certainly helps in the process, but the security of the metadata should be emphasized too. Organizations should routinely check attachments before sending them out, and password-protect them if needed. Documents should be sanitised to clear out any information that could jeopardize the security of the organization or the employee, and properly encrypted. With only intended information being shared, the chances of a data breach are reduced.
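One way to picture "only intended information sharing" is an allow-list applied to a document's metadata before it leaves the organization. The sketch below assumes the metadata is already available as a plain dictionary; the field names and the `sanitize_metadata` function are hypothetical, and the sketch does not cover encryption or rewriting the file itself.

```python
# Hypothetical allow-list: only these fields may leave the organization.
SHAREABLE_FIELDS = frozenset({"title", "file_type", "created"})


def sanitize_metadata(metadata, allowed=SHAREABLE_FIELDS):
    """Return a copy of `metadata` containing only the allowed fields."""
    return {key: value for key, value in metadata.items() if key in allowed}


# Made-up example record, for illustration only.
document_metadata = {
    "title": "Quarterly Review",
    "file_type": "application/pdf",
    "created": "2021-03-01",
    "reviewers": ["subordinate-a", "subordinate-b"],
    "server_location": "internal-share-01",
    "comments": ["draft, do not circulate"],
}

safe_to_share = sanitize_metadata(document_metadata)
```

An allow-list is used instead of a block-list so that any new, unexpected field is dropped by default.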