Know how to Extract Text Using Pattern


A regular expression lets you find a pattern in text, but it works on raw characters, so what you search for is essentially a fixed character sequence. Rule-based matching in spaCy instead matches on tokens and their attributes, phrases, entities, etc., so a rule can describe what kind of token you want rather than exactly how it is spelled. To achieve this you use the spaCy matchers.

There are three kinds of matching methods available as follows.

  1. Token Matcher
  2. Phrase Matcher
  3. Entity Ruler

In this tutorial, you will learn how to perform rule-based matching using the Token Matcher.

What is Token Matcher

spaCy provides a rule-based matching engine called Matcher. It operates on the tokens of a processed document, and each rule is a list of token descriptions. The Matcher also lets you pass in a custom callback that acts on each match. All matching is driven by the patterns you add to the Matcher.
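As a rough illustration of the callback mentioned above, the sketch below registers a rule with an on_match function. The rule name GREETING, the sample sentence, and the collect_match helper are assumptions made only for this example. It uses spacy.blank("en") so that it runs without a downloaded model.

```python
import spacy
from spacy.matcher import Matcher

# A blank English pipeline only tokenizes, which is enough for LOWER patterns.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

found = []  # filled by the callback below


def collect_match(matcher, doc, i, matches):
    """Hypothetical callback: store the text of the i-th match."""
    match_id, start, end = matches[i]
    found.append(doc[start:end].text)


# One rule made of two token descriptions: "hello" followed by "world",
# compared case-insensitively through the LOWER attribute.
pattern = [{"LOWER": "hello"}, {"LOWER": "world"}]
matcher.add("GREETING", [pattern], on_match=collect_match)

doc = nlp("Hello world! She said hello World again.")
matches = matcher(doc)  # the callback runs once per match
print(found)
```

The callback receives the matcher, the document, the index of the current match, and the full list of matches, so it can look at neighbouring matches if it needs to.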

Steps to implement Token Matcher

In this section, you will learn how to extract information by matching text against defined patterns. Before going to the demonstration, make sure you have installed spaCy on your system (for example with pip install spacy). Follow the steps in order for better understanding.

Step 1: Import required package

The first step is to import the packages required for the spaCy Matcher. Here I only need the spacy package and its Matcher class. Use the lines of code below to import them.

import spacy
from spacy.matcher import Matcher

Step 2: Load the Language model

spaCy offers models for many languages. In my example, I am using the small English model, so let's load it using the spacy.load() method. But make sure you have downloaded the model to your system first.

To download the model, use the following command in your terminal.

python -m spacy download en_core_web_sm

Then load the model in Python.

nlp = spacy.load("en_core_web_sm")
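If the model might be missing on the machine where the script runs, one option is to catch the error and fall back to a blank pipeline. This is a minimal sketch and not part of the original steps: the MODEL_NAME constant and the fallback itself are my assumptions. A blank pipeline has no tagger, so attributes such as LIKE_EMAIL still work but POS patterns will not match.

```python
import spacy

MODEL_NAME = "en_core_web_sm"  # the model used in this tutorial

try:
    nlp = spacy.load(MODEL_NAME)
except OSError:
    # spacy.load() raises OSError when the model package cannot be found.
    print(f"Model {MODEL_NAME!r} not found; falling back to a blank pipeline.")
    nlp = spacy.blank("en")

# Show the language code and the components that are actually available.
print(nlp.lang, nlp.pipe_names)
```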

Step 3: Create the spaCy Matcher

The third step is to create the matcher by passing the vocabulary of the loaded pipeline (nlp.vocab) to the Matcher() constructor.

matcher = Matcher(nlp.vocab)

Step 4: Define the Pattern

Let’s create a pattern that will be used to search the document and find the text that fits it. A pattern is a list of dictionaries, one per token. For example, to find an email address, I define the pattern as below.

pattern = [{"LIKE_EMAIL": True}]

You can find more token attributes to use in patterns in the spaCy documentation.

After that, you have to add the pattern to the Matcher that will be used for finding the text. Add the below line to add the pattern.

matcher.add("EMAIL",[pattern])

You can use any name for the pattern you want. In my case, I am defining the pattern name “EMAIL”.
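Patterns can also span several tokens. The sketch below is a hypothetical rule named VERSION that matches the word "version", an optional punctuation token, and a number-like token. The rule name and the sample sentence are assumptions for illustration; the attributes LOWER, IS_PUNCT, LIKE_NUM and the OP quantifier are standard pattern keys. It runs on a blank pipeline because none of these attributes need a trained model.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Three token descriptions in sequence:
#   1. the word "version" in any casing,
#   2. an optional punctuation token ("?" means zero or one),
#   3. a token that looks like a number.
pattern = [
    {"LOWER": "version"},
    {"IS_PUNCT": True, "OP": "?"},
    {"LIKE_NUM": True},
]
matcher.add("VERSION", [pattern])

doc = nlp("We moved from version 2 to Version: 3 last year")
found = [doc[start:end].text for _, start, end in matcher(doc)]
print(found)
```

Both "version 2" and "Version: 3" fit the same rule, which is hard to express as a single fixed string.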

Step 5: Apply the pattern

After defining the pattern, you have to apply it to a document. For the sake of simplicity, I am creating a sample document, but you can use your own. Below is the document I have created.

text = "You can contact Data Science Learner through email address [email protected]"
doc = nlp(text)

After that, pass the document to the matcher. It returns a list of (match_id, start, end) tuples.

matches = matcher(doc)
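Assuming spaCy v3 or later, the matcher can also return Span objects directly instead of tuples. The sketch below shows this with the as_spans argument. The sample sentence and the address info@example.com are made up for the example.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # LIKE_EMAIL needs no trained model
matcher = Matcher(nlp.vocab)
matcher.add("EMAIL", [[{"LIKE_EMAIL": True}]])

# info@example.com is a placeholder address used only in this sketch.
doc = nlp("Write to info@example.com for details")

# as_spans=True returns Span objects whose label_ is the rule name.
spans = matcher(doc, as_spans=True)
pairs = [(span.text, span.label_) for span in spans]
print(pairs)
```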

Step 6: Display the matched Text

The last step is to print the matched text from the document. In my case, it is an email address. There can be more than one match in the document, so I loop over the matches and slice the document with the start and end token indices. Add the following lines of code.

for match_id,start,end in matches:
    print(doc[start:end])

Here is the complete code. When you run it, you will get all the email addresses in the document.

Complete Code

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

pattern =[{"LIKE_EMAIL":True}]
matcher.add("EMAIL",[pattern])

text = "You can contact Data Science Learner through email address [email protected]"
doc = nlp(text)
matches = matcher(doc)

for match_id,start,end in matches:
    print(doc[start:end])

Output

[Image: Extracted email address from the document]
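The start and end values in each match are token positions, not character positions. If you need character offsets into the original string (for example to highlight the match), the span exposes them. This is a small sketch with a made-up sentence and address.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("EMAIL", [[{"LIKE_EMAIL": True}]])

# info@example.com is a placeholder address used only in this sketch.
text = "Write to info@example.com for details"
doc = nlp(text)

rows = []
for match_id, start, end in matcher(doc):
    span = doc[start:end]  # covers tokens start .. end - 1
    # start_char / end_char index into the original string.
    rows.append((span.text, start, end, span.start_char, span.end_char))

for row in rows:
    print(row)
```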

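You can also define more than one pattern and find the text in your document. For example, if I also want to find all the names (proper nouns) in the text, I add the pattern [{"POS": "PROPN"}]. Unlike LIKE_EMAIL, the POS attribute relies on the part-of-speech tags assigned by the loaded model.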

Run the complete code given below. It finds all the proper nouns along with the email address in the document.

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

pattern = [[{"LIKE_EMAIL": True}], [{"POS": "PROPN"}]]
matcher.add("My_Pattern",pattern)

text = "You can contact Data Science Learner through email address [email protected]"
doc = nlp(text)
matches = matcher(doc)

for match_id,start,end in matches:
    print(doc[start:end])

Output

[Image: Extracting email and nouns from the document]
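When several patterns share one name, you cannot tell which of them produced a match. One possible variation is to register each pattern under its own name and look the name up from match_id. The sketch below assumes two hypothetical rules, EMAIL and NUMBER. I use LIKE_NUM instead of POS here only so that the example runs on a blank pipeline without a trained model, and the sentence and address are made up.

```python
from collections import defaultdict

import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Each rule gets its own name so matches can be told apart later.
matcher.add("EMAIL", [[{"LIKE_EMAIL": True}]])
matcher.add("NUMBER", [[{"LIKE_NUM": True}]])

doc = nlp("Call 5551234 or write to info@example.com")

by_rule = defaultdict(list)
for match_id, start, end in matcher(doc):
    # match_id is a hash; the vocabulary's string store maps it back to the name.
    rule_name = nlp.vocab.strings[match_id]
    by_rule[rule_name].append(doc[start:end].text)

print(dict(by_rule))
```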

Conclusion

The spaCy Matcher is very useful for finding text in a document using rule-based matching. It has many applications. For example, you can use it to extract an email address, name, address, etc. from the text of an invoice in PDF format. These are the steps for implementing a spaCy Matcher. I hope you liked this tutorial. If you have any queries, you can contact us for more help.
