How to Automate Document Archiving With Python – A Case Study (Part 2)

In the previous part (How to Automate Document Archiving With Python – A Case Study (Part 1)).of this blog post I have explained how we can identify the latest document version for all the documents in our project. In this part I will show you how we can actually archive all old document versions and generate an excel report with Python. As we already know which are the latest files, only thing we need to do is to traverse through the files again and check if a file is latest or not. If now then we will simply move that file to the ‘Archive’ folder.

Step 3: Move All Older Document Versions to the Archive Folder

The advantages of using Python is that you will get a huge set of already available Python modules that will help you achieve everything you may ever need. We will use shutil module to move old document versions to ‘Archive’ folder. Shutil is a very popular module for file and directory operations. This module provides useful high level functions which helps us achieving our goal with minimal coding (please refer to the import section of the previous post).

I have written a function archive_old_files() to archive the old files. Please check the code snippet below:

 

def archive_old_files(folder,file_mod_time_dict):
   os.chdir(folder)
   latest_files = []

   for base_file in file_mod_time_dict.keys():
       latest_files.append(file_mod_time_dict[base_file].full_name)

   print(latest_files)

   for file in os.listdir():
       if os.path.isfile(file):
           # Add the file in the dictionary
           file_obj = populate_file_object(file)
           if file_obj.base_name:
               if file in latest_files:
                   print(file_obj.full_name, ' is the latest file. Keeping this version....')
               else:
                   print(file_obj.full_name, ' is an old version. so, moving to Archive....')
                   shutil.move(file_obj.full_name,'Archive/'+file_obj.full_name)

       else:
           if file != 'Archive':
               os.chdir(file)
               archive_old_files(str(os.getcwd()), file_mod_time_dict)
               os.chdir('../')

Now let me explain how the code works.

  • In the first step I have created a blank list latest_files[] and populated the latest document versions in that list from the dictionary that holds all document names as keys and the latest file object as values (please refer to Part 1 to check how we populated this dictionary). I have iterated over the keys of the dictionary. For each key, I have retrieved the value corresponding to the key which holds the latest file version. I have appended the full name of the latest file in the list.
  • Now that I have the list of files I want to keep, I have iterated through the list of files in our project root directory.
  • Now as a folder may contain both file and directory, I have used os.path.isfile() method to check whether the file is actually a file or a directory.
  • If it is a file then we have invoked populate_file_object() method (please refer to part 1) to get the FileInfo object for the file so that we can use its attributes for comparison.
  • Now I have checked if the file in the current loop iteration is present in the list of latest files. If it is present then I have not done anything as we want to keep the latest file version. If the file is not present in the list of latest files then we have to archive the file. So, I have used shutil.move(file_obj.full_name,’Archive/’+file_obj.full_name) method to move the file to the ‘Archive’ folder.
  • This is simple enough, isn’t it? But what should we do if we encounter a sub-directory? We should perform the same set of steps that we have performed for the root directory. So, I have invoked the archive_old_files() method from within the same method to do this. However, this time I have passed current directory as input to this method instead of the root directory. This programming concept of invoking a method from within the same method is called recursive method call. So, as per the conditions, the recursive call will continue until there is no subdirectory within a directory.

So, in this step we have archived all older document versions. In the next section, we will create an excel report with list of files along with current version number, last modified date etc.

Step 4: Generate report specifying each document with latest version number, last modified date and user who modified the file

In this step we will generate a report of the latest file versions. We can use this report as document master for our project. Before I explain how we can do this, please have a look at the code snippet below:

 

def generate_excel_report(file_mod_time_dict):
   report_root_dir = '/Users/macbookair/Documents/Developement/Python Projects/Blog Article 4/My Project/Reports'

   #Open an excel workbokk to write the report
   os.chdir(report_root_dir)
   workbook = openpyxl.load_workbook('Document List v1.0.xlsx')
   document_list_sheet = workbook.get_sheet_by_name('Sheet1')

   #Set the column headers
   document_list_sheet['A1'] = 'Document Name'
   document_list_sheet['B1'] = 'Latest Document Version'
   document_list_sheet['C1'] = 'Last Modified On'
   document_list_sheet['D1'] = 'Last Modified By'

   #Write latest document version info in the excel sheet
   count = 2
   for file_item in file_mod_time_dict.values():

       file_base_name_cell = 'A'+str(count)
       file_current_version_cell = 'B'+str(count)
       file_last_modified_on_cell = 'C'+str(count)
       file_last_updated_by_cell = 'D'+str(count)

       document_list_sheet[file_base_name_cell] = file_item.base_name
       document_list_sheet[file_current_version_cell] = file_item.full_name
       document_list_sheet[file_last_modified_on_cell] = strftime("%d %b %Y", time.gmtime(file_item.last_mod_time))
       document_list_sheet[file_last_updated_by_cell] = file_item.user_id

       workbook.save('Document List v1.0.xlsx')

       count = count + 

Now, let us check step by step how this code works. I have used openpyxl module for updating the excel file.

  • I have created a directory named ‘Reports’ within the project root directory. The reports generated will be saved in this directory. I have used os.chdir() method to set the current directory to the ‘Reports’ folder.
  • Then I have used openpyxl.load_workbook(‘Document List v1.0.xlsx’) method to get the reference of the workbook object. Once I have the workbook instance, I have invoked workbook.get_sheet_by_name(‘Sheet1’) method to get the sheet named ‘Sheet1’. We will save the report in this worksheet.
  • Once we have the sheet, we can set values for any specific cell by using sheet_name[‘cell_ref’] = cell_value notation. Here sheet_name is the reference of the worksheet object that we have got using get_sheet_by_name() method, ‘cell_ref’ is the reference of the cell expressed in column character and row number combination e.g. A3, D5 etc.
  • In the first row (row number 1) I have set the column headers.
  • Then I have iterated through the values of the file_mod_time_dict dictionary and added the values for each file attribute as separate rows.
  • I have used workbook.save() method to save the updates. I have saved the workbook for each row update. However, saving the workbook after adding all rows will also be fine.

That’s it! We have successfully created the document master report. Now, let’s look at the excel report created.

Now let us see the actual program body that is actually invoking all the functions we have seen so far.
This is self explanatory, so I will not go to details.

 

 

#Following code is the actual process flow executed
# Define the root directory for your project

project_root_dir = '/Users/macbookair/Documents/Developement/Python Projects/Blog Article 4/My Project'

# Define an empty dictionary to store each distinct file with associated latest FileInfo object

file_mod_time_dict = {}
old_files = []

# For each Directory and Sub Directory, identify distinct files (excluding version number)
find_latest_files(project_root_dir,file_mod_time_dict)

for file_base_name in file_mod_time_dict.keys():
   print('File: ', file_base_name,'\n', 'Latest Version Info: ',file_mod_time_dict[file_base_name],'\n')

#Archive Old File Versions
archive_old_files(project_root_dir,file_mod_time_dict)

#Generate report of latest files
generate_excel_report(file_mod_time_dict)


We can add formatting to this report through the capabilities of openpyxl module. However, that is out of scope for this article. I will publish a separate article on most commonly used excel handling methods and their usage in near future. So, stay tuned for similar interesting articles to boost your career.

Leave a Reply

Your email address will not be published. Required fields are marked *

Positive SSL