Automated PDF creation with Python and Latex

Automated Document Generation with Python & LaTeX

Did you ever needed to do automated PDF creation using a table of information?. 

Creating a Word document with certain formatting, styles, and tables would be the easy choice for a few pages of data. But what if you have data that might fill 100+ pages?. Would you do it by hand? or can we automate it?

Some documents require information to be filled at specific locations in a certain formatting. Creating documents of hundreds of pages, filled with data from an excel file or database with specific formatting and styling will be tiring and time-consuming for humans. In this article, I am sharing code and strategy I used to tackle such a problem.

So here was the problem, my institute required to keep a list of approved scientific journals with details such as Journal name, publisher, ISSN, year of start etc. Every year new journals are added and the list goes long. Luckily the data was available as an excel file which is easy to update with new additions every year. The first few rows and columns of the data would look like the following.

annualReport sample data
Meta information of journals

If someone tried to compile a document with any text processing software table by table, it would take a lot of time and effort. Most importantly the formatting may become inconsistent. Here I applied my LaTeX and Python programming skills to come up with a solution. If you are interested in LaTeX, I have created a comprehensive tutorial on latex with codes and examples.

The typesetting programming language LaTex is well known for creating large documents with consistent styles and formatting. LaTex along with Python would be a killer combination for automated document generation. The Python Pandas is used for data filtering and sorting. The ‘jinja2’ library is used to create LaTex code from a basic template which is iterated many times. The basic template used is shown below.

Code
\documentclass[11pt]{article}
\usepackage[a4paper, width=170mm, top=30.5mm, bottom=30.5mm	]{geometry}
\usepackage[utf8x]{inputenc}
\usepackage{datetime}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{float}
\usepackage[T1]{fontenc}
\usepackage{longtable}
\usepackage[compact]{titlesec}
\titleformat{\section}{\normalfont\bfseries}{\thesection}{1em}{}
\titlespacing*{\section}{0pt}{*0}{0pt}
\usepackage{hyperref}
\hypersetup{
    colorlinks=true,
    linkcolor=blue,
    urlcolor=blue,
    citecolor=blue}


\begin{document}


% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %
\begin{center}
	\large{\textbf{Appendix 6 A}} \\
	{\normalsize \textbf{Revised list of Approved Standard Journals in
		Computational Biology \& Bioinformatics}}\\
	{\normalsize (Approved in meeting dated 05/12/2018)}
\end{center}
% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %


\BLOCK{for index, row in data.iterrows()}
\section{\normalsize {\VAR{row['JournalName']}}}

\begin{longtable}[H]{|c|l|p{8cm}|}
	\hline
	\multicolumn{1}{|c|}{\textbf{Sl. No.}} & \multicolumn{2}{c|}{\textbf{Journal Details - \VAR{index+1}}}                                    \\ \hline
	1.  & Name of the Journal               & \textbf{\VAR{row['JournalName']}}               \\ \hline
	2.  & Publisher \& place of publication & \VAR{row['Publisher_Place']}                    \\ \hline
	3.  & Year of start                     & \VAR{row['YearofStart']}                        \\ \hline
	4.  & Hard copies published             & \VAR{row['HardCopies']}                         \\ \hline
	5.  & e-publication                     & \VAR{row['Epublication']}                       \\ \hline
	6.  & ISSN Number                       & \VAR{row['ISSN']}                               \\ \hline
	7.  & Peer/Referee Reviewed             & \VAR{row['PeerReviewed']}                       \\ \hline
	8.  & Indexing status                   & \VAR{row['IndexingStatus']}                     \\ \hline
	9.  & Impact factor / Rating            & \VAR{row['ImpactFactor']} (Thomson and Reuters) \\ \hline
	10. & UGC Journal Number                & \VAR{row['UGCNo']}                              \\ \hline
\end{longtable}


\BLOCK{endfor}

\newpage
\tableofcontents

The data to be filled is read, filtered and sorted with Python Pandas library and is loaded into a dataframe. Once we have the basic template and data we can populate the placeholders with dataframe content in a loop. The Python code for the Tex file generation is shown below.

import jinja2
import os
from jinja2
import Template
import pandas as pd
from pandas
import ExcelWriter
import numpy as np

latex_jinja_env = jinja2.Environment(
  block_start_string = '\BLOCK{',
  block_end_string = '}',
  variable_start_string = '\VAR{',
  variable_end_string = '}',
  comment_start_string = '\#{',
  comment_end_string = '}',
  line_statement_prefix = '%%',
  line_comment_prefix = '%#',
  trim_blocks = True,
  autoescape = False,
  loader = jinja2.FileSystemLoader(os.path.abspath('.')))

year = 2018
version = 4.0
bosdate = '05/12/2018'
outfile = 'Standard Journal List_Bioinformatics_' + str(year) + '_V' + str(version) + '.tex'

data = pd.read_excel(open('Edited_Consolidated_Journal_List_2018.xlsx', 'rb'), index = False)
submittedjournals = data['JournalName'].values
uniqueJournals = list(set(submittedjournals))
data​ = data.drop_duplicates(['ISSN'])

data​ = data.iloc[: ,: ]
data['UGCNo'] = data['UGCNo'].fillna(-1)
data['UGCNo'] = data['UGCNo'].astype(int)
data['UGCNo'] = data['UGCNo'].replace(-1, 'Not available')
data​ = data.sort_values(by = 'JournalName', ascending = 1)
data​ = data.reset_index(drop = True)
print data.iloc[253,: ]
print 'No of repeats: ' + str(len(submittedjournals) - len(uniqueJournals))

template = latex_jinja_env.get_template('BoSTemplate.tex')
render = template.render(data = data, bosdate = bosdate)

f = open(outfile, 'w')
f.write(render.encode('utf8'))
f.close()

Once the Tex file is created, we can use TeXstudio or any other LaTex software to generate the PDF as shown below. Only the fist page of the file is shown for illustration purpose.

Annual Report Rendered
Report Rendered

As you can see, the alignment, spacing, page numbering etc are consistent. With this,, we can sort/remove duplicates./shortlist journals based on any attribute it has (e.g. impact factor, year of start, journal name etc.) and create an equivalent PDF file in seconds.

Hope this would be useful to someone and save a lot of your time.

Leave a Comment

Your email address will not be published. Required fields are marked *