Using Python and RegEx to Parse ISBNs from Text

Find the repo here

Python offers a wide range of libraries and tools for data manipulation and extraction. One such task is extracting International Standard Book Numbers (ISBNs) from a text document, which can be done in a few lines with Python and regular expressions (RegEx). Let's walk through the process, starting with the generation of a text document filled with random data, including ISBNs.

Generating a Random Text Document

To begin, we use the Faker library, which generates fake data for purposes such as testing and filling databases with dummy data. In our case, we use Faker to create random words, ISBNs, zip codes, and dates. Here's a snippet of the code that accomplishes this:

from faker import Faker

# Initialize Faker
fake = Faker()

# Generate ten random values of each kind
words = [fake.word() for _ in range(10)]
isbns = [fake.isbn13() for _ in range(10)]
zip_codes = [fake.zipcode() for _ in range(10)]
dates = [fake.date() for _ in range(10)]

# Combine the lists into a single string, one value per line;
# the leading "\n" puts a blank line before each later heading
random_text = "\n".join([
    "Random Words:",
    *words,
    "\nRandom ISBNs:",
    *isbns,
    "\nRandom Zip Codes:",
    *zip_codes,
    "\nRandom Dates:",
    *dates,
])

# Write the text to a file
with open("random_data.txt", "w") as file:
    file.write(random_text)

print("File created: random_data.txt")

This code generates a text document named random_data.txt containing random words, ISBNs, zip codes, and dates. Each group sits under its own heading, with one value per line.
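
To sanity-check the layout of a file like random_data.txt, it can be read back and grouped by heading. The sketch below is an illustration rather than part of the original code: read_sections is a hypothetical helper, and the sample values are hand-written stand-ins for Faker output so that the example runs without Faker installed.

```python
import tempfile
from pathlib import Path


def read_sections(path):
    """Group the lines of a random_data.txt-style file by their heading."""
    sections = {}
    current = None
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue  # blank separator line between groups
        if line.endswith(":"):
            # Assumption: a heading is any line ending with a colon
            current = line[:-1]
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections


# Hand-written stand-in for Faker output (made-up values),
# joined the same way as random_text
sample = "\n".join([
    "Random Words:", "alpha", "beta",
    "\nRandom ISBNs:", "978-0-306-40615-7",
    "\nRandom Zip Codes:", "90210",
    "\nRandom Dates:", "2001-02-03",
])

with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "random_data.txt"
    sample_path.write_text(sample, encoding="utf-8")
    sections = read_sections(sample_path)

print(sections)
```

Calling read_sections("random_data.txt") on the generated file should give four groups of ten values each, which is a quick way to confirm the file was written as intended.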

Parsing ISBNs Using RegEx

With our random text document in place, the next step is to extract the ISBNs from the text. This is where regular expressions (RegEx) come into play. RegEx is a powerful tool for string manipulation, allowing us to define a pattern that matches the structure of an ISBN. Here's how we can use RegEx to extract ISBNs from the text:

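import re

def extract_isbns(text):
    # Match either a separated ISBN (hyphens, en dashes, or spaces)
    # or an unseparated run of 10 or 13 characters. The optional prefix
    # accepts 979 as well as 978, since an ISBN-13 can start with either.
    isbn_pattern = (
        r"((?:97[89][-– ])?[0-9][0-9– -]{10}[-– ][0-9xX])"
        r"|((?:97[89])?[0-9]{9}[0-9Xx])"
    )

    # findall returns one (separated, unseparated) tuple per match;
    # the alternative that did not match is an empty string
    isbns = re.findall(isbn_pattern, text)

    # Keep whichever alternative actually matched
    isbns_cleaned = [separated or unseparated for separated, unseparated in isbns]

    return isbns_cleaned

# Extract ISBNs from the random text
isbns_found = extract_isbns(random_text)
print("Cleaned ISBNs:", isbns_found)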


In this code, we define a function extract_isbns that takes a string of text as input and returns a list of extracted ISBNs. The isbn_pattern is a regular expression with two alternatives: one for ISBNs written with dashes or spaces between the parts, and one for unseparated 10-digit and 13-digit ISBNs. The re.findall function searches the text for all occurrences of this pattern, and the list comprehension pulls the matched ISBN out of each result.
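
The pattern checks only the shape of a candidate, so any digit sequence of the right length will match. For the data validation use case, a check-digit test can be applied to each match afterwards. The sketch below is one way to do that with the standard ISBN-10 and ISBN-13 check-digit rules; is_valid_isbn is a hypothetical helper name, not part of the original code.

```python
import re


def is_valid_isbn(candidate):
    """Return True if candidate passes the ISBN-10 or ISBN-13 check-digit test."""
    # Drop hyphens, en dashes, and spaces, and normalise a trailing x to X
    chars = re.sub(r"[-– ]", "", candidate).upper()

    if re.fullmatch(r"[0-9]{13}", chars):
        # ISBN-13: weights alternate 1, 3, 1, 3, ...; the sum must be divisible by 10
        total = sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(chars))
        return total % 10 == 0

    if re.fullmatch(r"[0-9]{9}[0-9X]", chars):
        # ISBN-10: weights run 10 down to 1, X stands for 10;
        # the sum must be divisible by 11
        values = [10 if c == "X" else int(c) for c in chars]
        total = sum(v * w for v, w in zip(values, range(10, 0, -1)))
        return total % 11 == 0

    return False


candidates = [
    "978-0-306-40615-7",  # ISBN-13 with a correct check digit
    "978-0-306-40615-8",  # same number with the check digit altered
    "0-306-40615-2",      # ISBN-10 with a correct check digit
    "080442957X",         # ISBN-10 whose check digit is X
]
results = {c: is_valid_isbn(c) for c in candidates}
print(results)
```

Filtering the output of extract_isbns through a check like this, for example [i for i in isbns_found if is_valid_isbn(i)], would discard strings that merely look like ISBNs.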

Conclusion

Through this demonstration, we've seen how Python, coupled with the Faker library and regular expressions, can be a powerful tool for generating random data and extracting specific patterns from text. This approach can be particularly useful in various applications, such as cataloging books, validating data inputs, or automating the extraction of information from large text files.

Author:
Salome Grasland
Powered by The Information Lab
© 2024 The Information Lab