Wednesday, November 30, 2022

Text Scraping in Python | Developer.com


In this second part of our series on Python text processing, developers will continue learning how to scrape text, building upon the information in our previous article, Python: Extracting Text from Unfriendly File Formats. If you have yet to do so, we encourage you to take a moment to read part one, as it serves as a building block for this Python programming tutorial. In this part, we will examine the code we can use to begin extracting text from files.

Parsing Text from Files with Python

Most programming languages process text input files with an iterative (looping), line-by-line approach. In the example here, all the data related to a single record can be found within the 4 lines that typically comprise each record, with the change from one SSN to another delimiting an individual record. This means we need a loop that captures all the data described in the table in our previous article before the loop finds a new SSN. We also need to keep track of the previous record's values so that we know when we have found a new record. The easiest way to start is to write a basic Python script that can recognize when a new SSN starts. The Python code example below shows how to parse out the SSN for every record:

Extractor.py

# Extractor.py

# For command-line arguments
import sys

def main(argv):
    try:
        # Is there one command line param and is it a file?
        if len(sys.argv) < 2:
            raise IndexError("There must be a filename specified.")
        with open(sys.argv[1]) as input_file:
            # Create variables to hold each output record's information,
            # along with corresponding values to hold the previous record's
            # information.
            currentSSN = ""
            previousSSN = ""
            # Treat the file as an enumerable object, split by newlines
            for x, line in enumerate(input_file):
                # Strip trailing whitespace, including the newline
                currentLine = line.rstrip()

                # For this example, a single record consists of 4 lines.
                # We need to make sure we get each piece of information
                # before we move on to the next record.

                # Python strings are 0-indexed, so the 13th character is
                # at position 12, and we must add the length of 11 to 12
                # to get position 23 to end the substring function.
                currentSSN = currentLine[12:23]
                #print("Current SSN is ["+currentSSN+"]")

                if (previousSSN != currentSSN):
                    # We are at a new record, and hopefully the completed
                    # record's information is stored in the "previous"
                    # versions of all these variables.  Note that on the
                    # first iteration of this loop, the previous versions
                    # of these variables will all be blank.
                    if ("" != previousSSN):
                        print ("Found record with SSN ["+previousSSN+"]")

                    # Reset for the next record.
                    previousSSN = currentSSN
            # Note that at the end of the loop, the last record's information
            # will be in the previous versions of the variables.  We need to
            # manually run this logic to get them.
            if ("" != previousSSN):
                print ("Last record with SSN ["+previousSSN+"]")
            #print(str(x+1)+" lines read.")
        return 0
    except IndexError as ex:
        print(str(ex))
    except FileNotFoundError as ex:
        print("The file ["+ sys.argv[1] + "] cannot be read.")
    return 1

# Call the "main" program.
if __name__ == "__main__":
    main(sys.argv[1:])


The sample data file is in the listing below:

42594       001-00-0837 Z000019    UZ3  5H2K 000000006518G    2022              
            001-00-0837      HPZ000000000000000000000000082725                  
2022     87 001-00-0837      NMR SMITH,ADAM           
            001-00-0837      VBPYT8923FZ00000000000000000                       
42594       020-01-0000 Z000019    UZ3  5H2K 000000011025Q    2022              
            020-01-0000      HPZ000000000000000000000000091442                  
2022     87 020-01-0000      NMR WILLIAMS,JAMES           
            020-01-0000      VBPYT8923FZ00000000000000000                       
            020-33-0000      HPZ000000000000000000000000000000                  
            020-33-0000      SW        A00000000000000000                       
42594       200-00-0111 Z000019    UZ3  5H2K 000000003717H    2022              
            200-00-0111      HPZ000000000000000000000000061551                  
2022     87 200-00-0111      NMR MARLEY,RICHARD           
            200-00-0111      VBPYT8923FZ00000000000000000                       
42594       817-22-0000 Z000019    UZ3  5H2K 000000004235G    2022              
            817-22-0000      HPZ000000000000000000000000033258                  
2022     33 817-22-0000      NMR DOUGH,JOHN           
            817-22-0000      VBPYT8923FZ00000000000000000                       
42594       300-00-0001 Z000019    UZ3  5H2K 000000003096H    2022              
            300-00-0001      HPZ000000000000000000000000066889                  
2022     87 300-00-0001      NMR WILLIST,DOUGLAS           
            300-00-0001      VBPYT8923FZ00000000000000000                       


The above is the sample data file, saved as Sample Data.txt

Note: Python uses 0-based indexing for strings, and the substring (slice) notation takes an ending position, not the length of the substring. Because of this, the start of the SSN is one less than the column number at which it appears in a text editor.

For the ending position, the string length of 11 must be added to the 0-based index of the string start. As 12 is the 0-based index of the string start, 12 + 11 = 23 is the end of the slice. The end index is exclusive, so the last character of the SSN sits at index 22.

If you run this code in Windows, you will get the following output:
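The index arithmetic above can be checked in isolation. The following is a minimal sketch, not part of the Extractor.py listing; the constant names are illustrative, and the line is rebuilt with `ljust` so that the SSN is guaranteed to start at index 12, as it does in the sample data.

```python
# Rebuild the start of the first sample line: "42594" padded to 12 columns,
# followed by the SSN.  ljust() guarantees the SSN begins at index 12.
line = "42594".ljust(12) + "001-00-0837 Z000019"

SSN_START = 12                      # 0-based index of the first SSN character
SSN_LENGTH = 11                     # "NNN-NN-NNNN" is 11 characters long
SSN_END = SSN_START + SSN_LENGTH    # 23; the slice stops *before* this index

ssn = line[SSN_START:SSN_END]
print(ssn)        # 001-00-0837
print(len(ssn))   # 11
```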

[Screenshot: Text Extraction in Python]

Note also that, in this example, the full path to the Python interpreter is specified. Depending on your setup, you may not need to be so explicit. However, on some systems, both Python 2 and Python 3 may be installed, and the "default" Python interpreter that runs when a path is not specified may not be the correct one. It is also assumed that the Sample Data.txt file is in the same directory as the Extractor.py file. Because this file has a space in its name, it must be enclosed in quotation marks to be recognized as a single parameter. This applies to both Windows and Linux systems.

Before going any further, make sure that all the SSNs from the sample file are displayed in the output. A common mistake in these implementations is to forget the manual processing of the last record.
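To see why the quotation marks matter, the sketch below uses the standard library's `shlex.split` to break a command line into arguments. This is only an approximation: `shlex` follows POSIX shell rules, and the Windows command prompt has its own parser, but for a simple quoted file name like this one the effect is the same. The command strings are illustrative.

```python
import shlex

# With quotation marks, the file name arrives as one argument.
args_quoted = shlex.split('python Extractor.py "Sample Data.txt"')

# Without them, the space splits the file name into two arguments,
# and the script would try to open a file called "Sample".
args_unquoted = shlex.split('python Extractor.py Sample Data.txt')

print(args_quoted)    # ['python', 'Extractor.py', 'Sample Data.txt']
print(args_unquoted)  # ['python', 'Extractor.py', 'Sample', 'Data.txt']
```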
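One way to avoid the last-record mistake altogether is to let the standard library do the grouping. The sketch below is an alternative to the manual previous/current bookkeeping, not the approach used in Extractor.py: `itertools.groupby` collects consecutive lines that share the same SSN, so the final group is emitted like any other. The helper names `ssn_of` and `list_ssns` are hypothetical, and the sample lines are shortened versions of the data file.

```python
from itertools import groupby

def ssn_of(line):
    """Return the SSN found at columns 12-22 (0-based) of a line."""
    return line.rstrip()[12:23]

def list_ssns(lines):
    """Return one SSN per run of consecutive lines that share it."""
    return [ssn for ssn, _group in groupby(lines, key=ssn_of)]

# Shortened sample lines; ljust(12) places each SSN at index 12.
sample_lines = [
    "42594".ljust(12) + "001-00-0837 Z000019",
    "".ljust(12) + "001-00-0837      HPZ",
    "42594".ljust(12) + "020-01-0000 Z000019",
    "".ljust(12) + "020-01-0000      HPZ",
    "".ljust(12) + "020-33-0000      HPZ",
]

print(list_ssns(sample_lines))
# ['001-00-0837', '020-01-0000', '020-33-0000']
```

The last SSN appears in the result without any code after the loop, which is the step that is easy to forget in the hand-written version.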

Extracting Text from Files with Python

Now that the SSN is properly parsed out, the remaining items can be extracted by adding suitable logic:

Full-Extractor.py

# Full-Extractor.py

# For command-line arguments
import sys

def main(argv):
    try:
        # Is there one command line param and is it a file?
        if len(sys.argv) < 2:
            raise IndexError("There must be a filename specified.")
        with open(sys.argv[1]) as input_file:
            # Create variables to hold each output record's information,
            # along with corresponding values to hold the previous record's
            # information.
            currentSSN = ""
            previousSSN = ""

            currentName = ""

            currentMonthlyAmount = ""

            currentYearlyAmount = ""

            # This time, we need to know if we are processing the first record.  If we do not keep track of this, the
            # first record will process incorrectly and each subsequent record will be wrong.
            firstRecord = True

            # Treat the file as an enumerable object, split by newlines
            for x, line in enumerate(input_file):
                # Strip trailing whitespace, including the newline
                currentLine = line.rstrip()

                # For this example, a single record consists of 4 lines.
                # We need to make sure we get each piece of information
                # before we move on to the next record.

                # Python strings are 0-indexed, so the 13th character is
                # at position 12, and we must add the length of 11 to 12
                # to get position 23 to end the substring function.
                currentSSN = currentLine[12:23]
                if (True == firstRecord):
                    previousSSN = currentSSN
                    firstRecord = False

                # For the first record, previousSSN would be blank and currentSSN would have a value, and this condition would be true.
                # We do not want this, so we need the logic above to set the values to be the same for the first record.
                if (previousSSN != currentSSN):
                    # We are at a new record, and the completed record's
                    # information is stored in the "current" name and amount
                    # variables, which have not yet been reset.

                    # Also note the "disconnect" between the previous and current notation.
                    if ("" != previousSSN):
                        print ("Found record with SSN ["+previousSSN+"], name ["+currentName+"], monthly amount [" + currentMonthlyAmount+
                               "] yearly amount [" + currentYearlyAmount + "]")

                    # Reset for the next record.  This logic needs to come before the remaining data extractions, or you will have
                    # "off by one" errors.
                    previousSSN = currentSSN

                    # Blank out the "current" versions of the variables (except the SSN!) so the conditions below will be true again.
                    currentName = ""
                    currentMonthlyAmount = ""
                    currentYearlyAmount = ""

                # Get the name if we do not already have it.  This condition prevents us from overwriting the name.  Note that if the
                # data was structured in a way that there was more than one piece of information at this position in the file, you would
                # need additional logic to determine what it is you are parsing out.  In this example, the simplistic logic of checking if
                # the first character is present and that a comma is in the substring is the "test".

                if ("" == currentName) and (False == (currentLine[33:].startswith(' '))) and (',' in currentLine[33:]):
                    # Also note that the name can go to the end of the line, so
                    # no ending position is included here.
                    currentName = currentLine[33:]

                # Follow the same logic for extracting the other information.  In this case, make sure that the string contains only
                # numeric values.  In the case of the monthly amount, we only want to process lines that end in "2022" as these are
                # the only lines which contain this information.
                if ("" == currentMonthlyAmount) and (currentLine.endswith("2022")):
                    currentMonthlyAmount = currentLine[51:57]

                if ("" == currentYearlyAmount) and currentLine[57:62].isdigit():
                    currentYearlyAmount = currentLine[57:62]

            # Note that at the end of the loop, the last record's information
            # has not been printed yet.  We need to manually run this logic
            # to output it.
            if ("" != previousSSN):
                print ("Last record with SSN ["+previousSSN+"], name [" + currentName +"], monthly amount [" + currentMonthlyAmount+
                       "] yearly amount [" + currentYearlyAmount + "]")
            #print(str(x+1)+" lines read.")
        return 0
    except IndexError as ex:
        print(str(ex))
    except FileNotFoundError as ex:
        print("The file ["+ sys.argv[1] + "] cannot be read.")
    return 1

# Call the "main" program.
if __name__ == "__main__":
    main(sys.argv[1:])

                 


The placement of the logic that determines the record boundary, in this case going from one SSN to another, is important, because the other data will be "off by one" if that logic remains at the bottom of the loop. It is equally important that, for the first record, the previous SSN "matches" the current one, specifically so that this logic will not be executed on the first iteration.

Running this code gives the following output:

[Screenshot: Python Text Scrape]

Note the highlighted record and how it has no name or amounts associated with it. This "error," due to missing data in the original text file, is rendering correctly. The extraction process should not add to or remove from the data. Instead, it should represent the data exactly as-is from the original source, or at least indicate that there is some kind of error with the data. This way, a user can look back at the original data source in the ERP to figure out why this information is missing.

Exporting Data into a CSV File with Python

Now that we can extract the data programmatically, it is time to write it out to a friendly format. In this case, it will be a simple CSV file. First, we need a CSV file to write to, and, in this case, it will have the same name as the input file, with the extension changed to ".csv". The full code with the output to .CSV is below:
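One way to "indicate that there is some kind of error" without altering the data is to report which fields came back empty. The sketch below is an illustration only: the `missing_fields` helper, the dictionary layout, and the record values are all hypothetical and are not part of the listings in this article.

```python
def missing_fields(record):
    """Return the names of the fields that are blank in one extracted record."""
    return [name for name, value in record.items() if value.strip() == ""]

# Illustrative record: the name and monthly amount were never found.
record = {"ssn": "000-00-0000", "name": "", "monthly": "", "yearly": "12345"}

problems = missing_fields(record)
if problems:
    print("Record [" + record["ssn"] + "] is missing: " + ", ".join(problems))
# Record [000-00-0000] is missing: name, monthly
```

The extracted values themselves are left untouched; the check only adds a message that points the reviewer back to the source system.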

Full-Extractor-Export.py

 
# Full-Extractor-Export.py

# For command-line arguments
import sys

def main(argv):
    try:
        # Is there one command line param and is it a file?
        if len(sys.argv) < 2:
            raise IndexError("There must be a filename specified.")

        # Build the output name by swapping the last extension for "csv".
        fileNameParts = sys.argv[1].split(".")
        fileNameParts[-1] = "csv"
        outputFileName = ".".join(fileNameParts)
        outputLines = ""
        with open(sys.argv[1]) as input_file:
            # Create variables to hold each output record's information,
            # along with corresponding values to hold the previous record's
            # information.
            currentSSN = ""
            previousSSN = ""

            currentName = ""

            currentMonthlyAmount = ""

            currentYearlyAmount = ""

            # This time, we need to know if we are processing the first record.  If we do not keep track of this, the
            # first record will process incorrectly and each subsequent record will be wrong.
            firstRecord = True

            # Treat the file as an enumerable object, split by newlines
            for x, line in enumerate(input_file):
                # Strip trailing whitespace, including the newline
                currentLine = line.rstrip()

                # For this example, a single record consists of 4 lines.
                # We need to make sure we get each piece of information
                # before we move on to the next record.

                # Python strings are 0-indexed, so the 13th character is
                # at position 12, and we must add the length of 11 to 12
                # to get position 23 to end the substring function.
                currentSSN = currentLine[12:23]
                if (True == firstRecord):
                    previousSSN = currentSSN
                    firstRecord = False

                # For the first record, previousSSN would be blank and currentSSN would have a value, and this condition would be true.
                # We do not want this, so we need the logic above to set the values to be the same for the first record.
                if (previousSSN != currentSSN):
                    # We are at a new record, and the completed record's
                    # information is stored in the "current" name and amount
                    # variables, which have not yet been reset.

                    # Also note the "disconnect" between the previous and current notation.
                    if ("" != previousSSN):
                        print ("Found record with SSN ["+previousSSN+"], name ["+currentName+"], monthly amount [" + currentMonthlyAmount+
                               "] yearly amount [" + currentYearlyAmount + "]")
                        # Because CSV is a trivial format to write to, string processing can be used.  Note that we also need to split
                        # the name into the first and last names.
                        nameParts = currentName.split(",")
                        # This is trivial error checking.  Ideally a more robust error response system should be here.
                        firstName = "Error"
                        lastName = "Error"
                        if (2 == len(nameParts)):
                            firstName = nameParts[1]
                            lastName = nameParts[0]
                        # Should there be any quotation marks in these strings, they must be escaped in the CSV file by using
                        # double quotation marks.  Strings in CSV files should always be delimited with quotation marks.
                        outputLines += ("\"" + previousSSN.replace("\"", "\"\"") + "\",\"" + lastName.replace("\"", "\"\"") + "\",\"" +
                            firstName.replace("\"", "\"\"") + "\",\"" + currentMonthlyAmount.replace("\"", "\"\"") + "\",\"" +
                            currentYearlyAmount.replace("\"", "\"\"") +"\"\r\n")
                    # Reset for the next record.  This logic needs to come before the remaining data extractions, or you will have
                    # "off by one" errors.
                    previousSSN = currentSSN

                    # Blank out the "current" versions of the variables (except the SSN!) so the conditions below will be true again.
                    currentName = ""
                    currentMonthlyAmount = ""
                    currentYearlyAmount = ""

                # Get the name if we do not already have it.  This condition prevents us from overwriting the name.  Note that if the
                # data was structured in a way that there was more than one piece of information at this position in the file, you would
                # need additional logic to determine what it is you are parsing out.  In this example, the simplistic logic of checking if
                # the first character is present and that a comma is in the substring is the "test".

                if ("" == currentName) and (False == (currentLine[33:].startswith(' '))) and (',' in currentLine[33:]):
                    # Also note that the name can go to the end of the line, so
                    # no ending position is included here.
                    currentName = currentLine[33:]

                # Follow the same logic for extracting the other information.  In this case, make sure that the string contains only
                # numeric values.  In the case of the monthly amount, we only want to process lines that end in "2022" as these are
                # the only lines which contain this information.
                if ("" == currentMonthlyAmount) and (currentLine.endswith("2022")):
                    currentMonthlyAmount = currentLine[51:57]

                if ("" == currentYearlyAmount) and currentLine[57:62].isdigit():
                    currentYearlyAmount = currentLine[57:62]

            # Note that at the end of the loop, the last record's information
            # has not been written yet.  We need to manually run this logic
            # to output it.
            if ("" != previousSSN):
                print ("Last record with SSN ["+previousSSN+"], name [" + currentName +"], monthly amount [" + currentMonthlyAmount+
                       "] yearly amount [" + currentYearlyAmount + "]")
                nameParts = currentName.split(",")
                firstName = "Error"
                lastName = "Error"
                if (2 == len(nameParts)):
                    firstName = nameParts[1]
                    lastName = nameParts[0]
                # Write the row even if the name could not be split, as in the loop above.
                outputLines += ("\"" + previousSSN.replace("\"", "\"\"") + "\",\"" + lastName.replace("\"", "\"\"") + "\",\"" +
                    firstName.replace("\"", "\"\"") + "\",\"" + currentMonthlyAmount.replace("\"", "\"\"") + "\",\"" +
                    currentYearlyAmount.replace("\"", "\"\"") +"\"\r\n")
            # The string already contains "\r\n" line endings, so open the file with
            # newline="" to keep Python from translating them a second time.
            with open(outputFileName, "w", newline="") as outputFile:
                outputFile.write(outputLines)
            print ("Wrote to [" + outputFileName + "]")
            #print(str(x+1)+" lines read.")
        return 0
    except IndexError as ex:
        print(str(ex))
    except FileNotFoundError as ex:
        print("The file ["+ sys.argv[1] + "] cannot be read.")
    return 1

# Call the "main" program.
if __name__ == "__main__":
    main(sys.argv[1:])


For maximum compatibility, all data elements in a CSV file should be enclosed in quotation marks, with any quotation marks within those strings being escaped by doubling them. The above code reflects this.

Running the code gives this output:
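The same quoting rules are available in Python's standard `csv` module, which may be preferable to hand-built strings once the data gets messier. The sketch below is an alternative, not the method used in the listing above: `csv.QUOTE_ALL` quotes every field and embedded quotation marks are doubled automatically. It also shows `pathlib.Path.with_suffix` as another way to derive the output name. The helper names and the second sample row are hypothetical.

```python
import csv
import io
from pathlib import Path

def csv_name_for(input_name):
    """Return the input file name with its extension replaced by .csv."""
    return str(Path(input_name).with_suffix(".csv"))

def rows_to_csv_text(rows):
    """Render rows as CSV text with every field quoted and CRLF line endings."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\r\n")
    writer.writerows(rows)
    return buffer.getvalue()

rows = [
    ("001-00-0837", "SMITH", "ADAM", "006518", "82725"),
    ("000-00-0000", 'O"NEIL', "PAT", "", ""),   # made-up row with a quote in it
]

print(csv_name_for("Sample Data.txt"))   # Sample Data.csv
print(rows_to_csv_text(rows), end="")
# "001-00-0837","SMITH","ADAM","006518","82725"
# "000-00-0000","O""NEIL","PAT","",""
```

Unlike splitting the name on periods, `with_suffix` also copes with a file name that has no extension at all, in which case ".csv" is simply appended.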

[Screenshot: Python text processing tutorial]

The above command uses the caret (^) convention to split the command across lines for better readability. The command can be a single line if you choose.

The output as displayed in Notepad++:

[Screenshot: Python text parsing]

Of course, the whole point of this exercise is to see the output in Excel, so now let's open this file there:

[Screenshot: How to text scrape with Python]

Now that the data is in Excel, all the tools that Excel brings to the table can be applied to this data. Any errors that may have been present in the ERP-generated data can now be properly troubleshot and corrected, without having to worry about sending bad data out before it can be verified.

Note that the "Error" values in the cells above are intentional, as they are the result of the original data being blank. From a high level, this can help to show a reviewer very quickly that there is a problem.

Another way this could be done is to write out a more informative error message into a neighboring cell in the same row.
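A minimal sketch of that idea follows, assuming an extra "note" column is appended to each CSV row. The `split_name` helper and its messages are hypothetical and are not part of the listing above.

```python
def split_name(raw_name):
    """Return (last_name, first_name, note) for a "LAST,FIRST" string.

    The note is empty when the name parses cleanly; otherwise it explains
    why the "Error" placeholders were used.
    """
    parts = raw_name.split(",")
    if len(parts) == 2 and all(part.strip() for part in parts):
        return parts[0].strip(), parts[1].strip(), ""
    if raw_name.strip() == "":
        return "Error", "Error", "Name is blank in the source data"
    return "Error", "Error", "Name is not in LAST,FIRST form: [" + raw_name + "]"

print(split_name("SMITH,ADAM"))  # ('SMITH', 'ADAM', '')
print(split_name(""))            # ('Error', 'Error', 'Name is blank in the source data')
```

The third value would be written as one more quoted field at the end of the row, so a reviewer sees the reason next to the "Error" cells.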

Final Thoughts on Text Scraping in Python

The beauty of this solution is that, while the code looks complex, especially compared to an even more complex regular expression, it can be more readily reused and adapted for similarly structured data. This kind of solution can become an indispensable tool in the arsenal of someone who is tasked with verifying this kind of information from a "black box" data source like an ERP, or beyond.
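One way to make that reuse easier is to move the column positions out of the logic and into a small table. The sketch below is a possible refactoring, not the code from this article: the `HPZ_LINE_FIELDS` table and the `slice_fields` helper are hypothetical names, and the positions are the ones used in the listings for lines that carry the yearly amount.

```python
# Field name -> (start, end) slice positions, 0-based with an exclusive end.
HPZ_LINE_FIELDS = {
    "ssn": (12, 23),
    "yearly": (57, 62),
}

def slice_fields(line, fields):
    """Cut each named field out of a fixed-width line."""
    text = line.rstrip()
    return {name: text[start:end] for name, (start, end) in fields.items()}

# Rebuild a yearly-amount line: 12 spaces, SSN, 6 spaces, "HPZ", 30 digits.
hpz_line = " " * 12 + "001-00-0837" + " " * 6 + "HPZ" + "0" * 25 + "82725"

print(slice_fields(hpz_line, HPZ_LINE_FIELDS))
# {'ssn': '001-00-0837', 'yearly': '82725'}
```

Adapting the extractor to a differently laid out report then means editing the table rather than the loop, although deciding which line type carries which field still needs checks like those in the listings.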
