Friday, February 24, 2017

Models and interconnected data

With a basic server set up, the next part is getting the data in. Since this is a database of publications, the most important table is going to be that of papers. Since the idea is to build a linked database, it would be best to create separate tables for the most searched-for parts of every paper. In a paper, the important fields are title, authors, journal, volume, number, pages, keywords and abstract. Of these, authors and journal are separate data that can be linked back to the paper. For example, if I want a list of papers published by author X, or the list of papers published jointly by authors X and Y, it would be much faster if there were a separate table of authors linked to the table of papers.

With this basic idea of a database in mind, I have created the following hierarchy. The one essential field of a paper is the title, without which it can't be created. The other fields (volume, number, pages, month, year) are optional as sometimes they can't be found. A paper can have multiple authors and an author can have multiple papers. So authors will be a ManyToManyField, which is a relationship to another model. As for journal, a paper will belong to one journal but a journal will have several papers. Therefore, the journal field in the paper model will be a many-to-one relationship, and that means it will be a ForeignKey.
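At the table level, this hierarchy can be sketched with Python's built-in sqlite3 module. This is only an illustration of the idea and not the Django code of the project: the table names, column names and sample rows are all made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE journal (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE author  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE paper (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,                                -- essential field
    year INTEGER,                                       -- optional field
    journal_id INTEGER NOT NULL REFERENCES journal(id)  -- many-to-one
);
-- many-to-many: one row per (paper, author) pair
CREATE TABLE paper_authors (
    paper_id INTEGER NOT NULL REFERENCES paper(id),
    author_id INTEGER NOT NULL REFERENCES author(id),
    PRIMARY KEY (paper_id, author_id)
);
-- made-up sample rows
INSERT INTO journal VALUES (1, 'Journal A');
INSERT INTO author VALUES (1, 'X'), (2, 'Y'), (3, 'Z');
INSERT INTO paper VALUES (1, 'Paper one', 2015, 1), (2, 'Paper two', NULL, 1);
INSERT INTO paper_authors VALUES (1, 1), (1, 2), (2, 1), (2, 3);
""")


def papers_by_all(conn, names):
    """Return titles of papers whose authors include every name in names."""
    marks = ",".join("?" for _ in names)
    rows = conn.execute(
        "SELECT p.title FROM paper p "
        "JOIN paper_authors pa ON pa.paper_id = p.id "
        "JOIN author a ON a.id = pa.author_id "
        f"WHERE a.name IN ({marks}) "
        "GROUP BY p.id HAVING COUNT(DISTINCT a.id) = ? "
        "ORDER BY p.id",
        (*names, len(names)),
    ).fetchall()
    return [title for (title,) in rows]


print(papers_by_all(conn, ["X"]))       # papers with author X
print(papers_by_all(conn, ["X", "Y"]))  # papers with both X and Y
```

Because authors live in their own table, the "papers by X and Y" question becomes a join over the link table instead of a text search through every paper.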

Therefore, the basic structure of models in models.py will be:



With this database structure, the idea is to populate the database automatically from the data extracted from the BibTex file. For that, the following view function is written:



The above view function is fairly self-explanatory in that it checks if a field exists in the BibTex entry for a paper. If the field is found, it is added to the model. So, if a field is missing, for example there is no abstract, it isn't a problem. The only catch was how to deal with the ForeignKey and ManyToManyField. A ForeignKey field means that the paper refers to an item (in this case a journal) that is stored in a separate table. A paper can belong to only one Journal but a Journal can have many papers. So, to be able to save even a basic Paper, it was essential to relate it to a Journal.
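The "add a field only if it exists" check can be sketched in plain Python as follows. This is a minimal sketch and not the actual view function: only paper_title is a field name that appears in the project code, while the other names in FIELD_MAP and the helper paper_fields are hypothetical.

```python
# Hypothetical mapping from BibTeX keys to model field names.
# Only "paper_title" is taken from the view code; the rest are assumed.
FIELD_MAP = {
    "title": "paper_title",
    "volume": "paper_volume",
    "number": "paper_number",
    "pages": "paper_pages",
    "abstract": "paper_abstract",
}


def paper_fields(entry):
    """Return {model_field: value} for the keys present in a BibTeX entry."""
    if "title" not in entry:
        # The title is the one essential field of a paper.
        raise ValueError("a paper cannot be created without a title")
    return {
        field: entry[key]
        for key, field in FIELD_MAP.items()
        if key in entry  # missing optional fields are simply skipped
    }
```

In a view, the resulting dictionary could then be copied on to the model with setattr(new_paper_entry, field, value) for each item, before the journal and the authors are handled.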

For example, what is found in the code is:

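        new_paper_entry = Paper()
        new_paper_entry.paper_title = paper_item["title"]
        # The journal must be assigned before the first save()
        new_paper_entry.paper_journal = new_journal_entry
        # save() creates the database row that the author links refer to
        new_paper_entry.save()
        for author_in_paper in list_of_authors_in_paper:
            # add() writes the link immediately, so no further save() is needed
            new_paper_entry.paper_authors.add(author_in_paper)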

To be able to perform
new_paper_entry.save()

It was essential to define:
new_paper_entry.paper_journal = new_journal_entry

If I did:
new_paper_entry = Paper()
new_paper_entry.save()

It would have given an error saying that the paper has no journal. In Django this typically surfaces as an IntegrityError from the database, because the journal column is not allowed to be empty. Each paper has a single journal and, therefore, to be able to save a valid row in the paper table, a journal assignment was necessary.

At the same time, the save command was essential after the journal assignment:
new_paper_entry.save()

This is because the paper has the ManyToManyField for authors. A many-to-many relationship cannot be assigned to a paper unless the paper already exists in the database, and this happens only after the save() function is called.

It took me a day to figure this out, but it was a good learning experience. When a ForeignKey field exists, it must be defined to be able to save the model, unless the field is made optional (for example with null=True). Also, to be able to define a many-to-many field, the model must first be saved in the database.
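The reason for this ordering becomes clearer at the database level. The following is a rough sketch using Python's built-in sqlite3 module rather than Django, with made-up table and column names, assuming the journal column is mandatory and the authors are linked through a separate table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE journal (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE paper (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    journal_id INTEGER NOT NULL REFERENCES journal(id)
);
CREATE TABLE paper_authors (
    paper_id INTEGER NOT NULL REFERENCES paper(id),
    author_id INTEGER NOT NULL REFERENCES author(id)
);
""")

# 1. A paper row without a journal is rejected by the NOT NULL constraint.
try:
    conn.execute("INSERT INTO paper (title) VALUES (?)", ("No journal yet",))
    saved_without_journal = True
except sqlite3.IntegrityError:
    saved_without_journal = False

# 2. With a journal assigned, the insert succeeds and the paper gets an id.
journal_id = conn.execute(
    "INSERT INTO journal (name) VALUES (?)", ("Journal A",)
).lastrowid
paper_id = conn.execute(
    "INSERT INTO paper (title, journal_id) VALUES (?, ?)",
    ("A paper", journal_id),
).lastrowid

# 3. Each author link needs paper_id, which exists only after step 2.
for name in ("X", "Y"):
    author_id = conn.execute(
        "INSERT INTO author (name) VALUES (?)", (name,)
    ).lastrowid
    conn.execute("INSERT INTO paper_authors VALUES (?, ?)", (paper_id, author_id))

link_count = conn.execute(
    "SELECT COUNT(*) FROM paper_authors WHERE paper_id = ?", (paper_id,)
).fetchone()[0]
```

Step 1 corresponds to saving a Paper with no journal, step 2 to the save() after the journal assignment, and step 3 to adding the authors once the paper exists.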
 

Sunday, February 19, 2017

Templates

So far I have not done much template rendering except listing all the papers in the BibTex file. But before moving on to more complex stuff, I am trying to read as much as possible.

To begin with, Django uses the concept of loose coupling between URLs, views and data. With the URLconf list in urls.py, Django specifies which URL will call which function in views.py. Therefore, the functionality of a URL can be changed by changing the view function without affecting any other part of the code. The view function, on the other hand, can access the database and render a template while passing the necessary data extracted from the database or from the URL. The template to be rendered can be changed in the view function without changing any other part of the code. Finally, models.py specifies the structure of the database, which can be changed independently of the views or the URLs. Of course, functions in views.py have to be designed flexibly enough to be able to adapt to changes in the database and the URLs.
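The idea of a table that maps URLs to view functions can be illustrated with a toy example in plain Python. This is not how Django's URL resolver is implemented; the view names, URL patterns and the resolve helper are hypothetical and only show why a view can be swapped without touching anything else.

```python
import re


def list_papers(request):
    # A view only decides what to return for a request.
    return "list of papers"


def paper_detail(request, paper_id):
    return "paper " + paper_id


# The URL table is the only place where URLs and views meet.
URLPATTERNS = [
    (r"^papers/$", list_papers),
    (r"^papers/(?P<paper_id>\d+)/$", paper_detail),
]


def resolve(path, request=None):
    """Call the first view whose pattern matches the path."""
    for pattern, view in URLPATTERNS:
        match = re.match(pattern, path)
        if match:
            # Named groups in the pattern become keyword arguments.
            return view(request, **match.groupdict())
    raise LookupError("no view for " + path)
```

Replacing list_papers with another function in URLPATTERNS changes what the URL does, while the other views and the patterns themselves stay untouched.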

Within the view functions, I have been reading about templates and contexts. The simplest way to generate a display on a webpage is using the HttpResponse() function. As an example:

return HttpResponse("Hello world")

will display "Hello world" in a webpage corresponding to the URL that points to the function with the above return statement. But to do more complex stuff, you would need a separate HTML file. This again is in line with the concept of loose coupling. The contents of the webpage should be separate from the view function, which acts as the buffer between the URL and the database.

Suppose a separate HTML file exists in the templates folder inside the application folder paperarchive/papercollection. This is the directory that Django searches for templates when 'APP_DIRS': True is set in the TEMPLATES variable in settings.py. The other option is to specify a list of directories in DIRS in the same variable. The conventional way to load this HTML file is with the get_template function in django.template.loader. So, suppose:

from django.template.loader import get_template
t = get_template("my_html.html")

is present in a view function. The template object t will then be created with the contents of the HTML file. This HTML file could be a simple "Hello world" display as before, or could be more complicated, with template variables and a bit of programming in the form of template tags that work with these variables.

Since variables are present in the template, they need data. The data is in the form of a dictionary with the keys being the variable names in the HTML file. This dictionary of variables is the context. So,

from django.template import Context
c = Context({"name": "Django"})

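will create a context object with the variable "name" being "Django". (From Django 1.10 onwards, a template loaded with get_template expects a plain dictionary here rather than a Context object.) To pass this data to the HTML file, the template object that was created from the HTML file is rendered with this context by: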

t.render(c)

When the view function returns the rendered template wrapped in an HttpResponse,

return HttpResponse(t.render(c))

the webpage is displayed with the data we specified. The wrapper is needed because render() returns a string while a view must return an HttpResponse object. This concept is fairly convenient as the HTML file can be a regular HTML file with some amount of programming in the form of template tags. The view function can change the variables that are needed by the HTML file by extracting them from the database or from user-entered data in forms using the HTTP request object "request".

To simplify the above process, there are two functions in django.shortcuts: render and render_to_response. They are similar, but render_to_response is discouraged as it may be discontinued later. The above process of creating a template object and rendering it with a context can be performed in one line as:

render(request, "my_html.html" , {"name": "django"})

or

render_to_response("my_html.html" , {"name": "django"})

The only difference is that render needs the request object as the first argument while render_to_response doesn't.

Additionally, these two functions also allow context processors to be used. Instead of just the template and the context, a RequestContext object can be passed as a context_instance. So,

render(request, "my_html.html", {"name": "django"},
       context_instance=RequestContext(request [, context dictionary]
                                       [, processors=<custom_processors>])
)

I took some time reading back and forth about this. RequestContext takes the request object as its first argument and generates a context object that also contains the global variables that Django provides by default, to save you the trouble of writing code for them; for example, context data about the logged-in user. Check out the "context_processors" list in the TEMPLATES variable in settings.py. This list contains the default global context processors. A context processor is a function that takes the request object as its only argument and returns a dictionary that is added to the context. So the default global processors in settings.py are functions that are called automatically whenever a RequestContext is used in a view function, and they provide context data that is convenient for purposes such as user authentication. Additionally, custom processors designed by the user can be passed to RequestContext. The only requirement, as before, is that these custom processors take the request object as an argument and return a dictionary as context data. The only catch in using RequestContext seems to be that it calls all the context processors listed in settings.py, so a number of context variables will be provided to the template that may not be needed.

So, Context() holds only the data that is passed to it, while RequestContext() additionally collects data from the context processors and allows you to add custom processors that contribute their own data. Choosing between Context() and RequestContext() therefore seems to depend only on whether the view needs the global data that Django generates automatically, or needs to call user-defined context processors that return application-specific context data.
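The way context processors contribute data can be sketched in plain Python. This is a toy model of the idea and not Django's implementation: the request is represented by a plain dictionary, and the names user_processor, site_processor, DEFAULT_PROCESSORS and build_context are all hypothetical.

```python
def user_processor(request):
    # A context processor takes the request and returns a dictionary.
    return {"user": request.get("user", "anonymous")}


def site_processor(request):
    return {"site_name": "paperarchive"}


# Stand-in for the "context_processors" list in settings.py.
DEFAULT_PROCESSORS = [user_processor]


def build_context(request, data=None, processors=()):
    """Merge the output of all processors with the explicitly passed data."""
    context = {}
    for processor in list(DEFAULT_PROCESSORS) + list(processors):
        context.update(processor(request))
    # In this sketch, explicit data wins if a key clashes. Which side wins
    # in Django depends on how the context is built, so check the docs.
    context.update(data or {})
    return context


context = build_context({"user": "shiv"}, {"name": "Django"}, [site_processor])
```

The template would then see "user" and "site_name" in addition to "name", even though the view passed only "name" explicitly.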

Saturday, February 18, 2017

Setting up the development server

To get started with building the database, I will run the default development server with SQLite, which is the default database in a new Django project. I created a repository on GitHub:

https://github.com/shivkiyer/publications_db

With the simplest of Django commands, this means creating a project with:

django-admin startproject paperarchive

So paperarchive is the parent folder, which can also be found in the GitHub repository. This folder contains another folder called paperarchive, which has settings.py, urls.py and wsgi.py. The file settings.py will be changed a couple of times and urls.py will be changed repeatedly.

My app will be called papercollection. So, inside the parent folder paperarchive, using the manage.py script:

python manage.py startapp papercollection

This creates another folder called papercollection inside the parent paperarchive folder that contains models.py, views.py, admin.py (which I don't intend to use) and tests.py (which I won't use right now). For now, there is no need to worry about models.py as I only need to run the server.

But first, let's get going with the Python code to extract a BibTex file. A sample BibTex with around 30 publications has been copied from the IEEE Xplore website on to the file input_data_file.txt. A few sample BibTex entries are as follows:



The sample file shows three publication entries in the BibTex format. However, it is quickly noticeable that BibTex entries can differ even though they are fairly similar. For example, almost all of them have the fields title, author, year, month, volume, number, abstract and keywords. However, for publications in journals, the name of the journal is specified as "journal" while for a conference, the name of the conference is specified as "booktitle". There may be other variations that I have not encountered so far, and in that case the code will be modified later. Also, in some cases the values of the fields are enclosed in quotes and in other cases in curly brackets {}. Latex accepts both, so our code will have to as well.
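Handling these two differences can be sketched with a pair of small helpers. They are hypothetical and are not taken from backup_data.py; clean_value assumes that a single outer pair of quotes or braces encloses the whole value.

```python
def clean_value(raw):
    """Strip a trailing comma and one enclosing pair of quotes or braces."""
    value = raw.strip().rstrip(",").strip()
    pairs = (('"', '"'), ("{", "}"))
    if len(value) >= 2 and (value[0], value[-1]) in pairs:
        value = value[1:-1]
    return value.strip()


def venue_of(entry):
    """Return the journal name, or the conference name if there is no journal."""
    return entry.get("journal") or entry.get("booktitle")
```

With these, a journal article and a conference paper can be treated alike: both end up with one cleaned value for the place of publication.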

The code can be found in backup_data.py:



The file contains a function that reads the text file containing BibTex references and scrubs them to remove special characters used by BibTex/Latex. It then splits every line with the "=" sign as a separator, as the BibTex fields are key = value entries. The key is compared with a known list, and the fields that are needed are added to a dictionary object. The dictionary object is finally added to a list. This list now contains all the publication information in plain text form that can be displayed using an HTML file.
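The steps above can be sketched roughly as follows. This is not the code in backup_data.py but a minimal sketch under a few assumptions: every entry starts on a line beginning with "@", every field sits on its own line, and the list of wanted keys (WANTED) and the function name parse_bibtex are made up for the example.

```python
import re

# Hypothetical list of the BibTeX keys that are kept.
WANTED = {
    "title", "author", "journal", "booktitle", "year", "month",
    "volume", "number", "pages", "abstract", "keywords",
}


def parse_bibtex(text):
    """Return a list of dictionaries, one per BibTeX entry in text."""
    papers = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("@"):
            # A new entry begins, for example "@ARTICLE{...,"
            current = {}
            papers.append(current)
        elif current is not None and "=" in line:
            # Split on the first "=" only, as the value may contain "=" too.
            key, value = line.split("=", 1)
            key = key.strip().lower()
            if key in WANTED:
                # Scrub braces and quotes, then trailing commas and spaces.
                current[key] = re.sub(r'[{}"]', "", value).strip(" ,")
    return papers
```

A field that continues over several lines would be cut short by this sketch, so it only suits files where each field fits on one line.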

To display an HTML file, I configured urls.py in the paperarchive/paperarchive folder:




To get started, I defined two urls - /start-db/ and /display-db/.



The url /start-db/ points to the function db_populate in views.py. This function calls the function that reads the BibTex file. To display these publications, we use the render_to_response shortcut that will load a template which in this case is list_paper.html with the context being the list of publications extracted. At this stage, the context_instance is probably not needed as the request object received is not being used. But we pass it as an argument with render_to_response anyway as I'll be doing more advanced stuff soon.

The list_papers.html file can be found in the templates folder in paperarchive/papercollection/. A very basic HTML file is as follows:



This uses template tags to extract the dictionary items in each publication and list them. The result can be viewed by checking out the link:

http://127.0.0.1:8000/start-db/

So with this, I got a very basic Django server going and very basic extraction script that displays the papers in the Bibtex file on a webpage. The next step is to insert this into a database and create links and forms for the user to be able to edit them.

The beginning

I have been programming with Python for almost five years while building my circuit simulator Python Power Electronics. I tried out Django a couple of times for small web-related projects like creating my own blog, but mainly out of curiosity, since Django is one of the most popular frameworks built using Python. And it is only natural that Django preserves the fundamental elegance of Python in allowing a web developer to build web apps efficiently with beautiful code.

So, with an interesting idea for a web app, I am now going to dive into Django the way I did with Python. As someone who wrote a Masters and a PhD thesis, I am well aware of the mess that cross-referencing research articles causes. To describe how research progressed on the topic you are writing on requires citing publications in a number of different ways: chronologically, how they are linked together and how they differ. There are a number of software tools for this, so a researcher is not without options. But as with my circuit simulator, I would like to build an application specifically to my tastes.

So this is the plan. Many researchers in engineering and science use LaTex for documentation. Latex uses BibTex for generating the bibliography. BibTex collects references in a separate BibTex file (.bib) in a special format. When I wrote my thesis, this format had to be manually prepared, which was a bit of a pain. But the advantage was that you could add any article as an item to the .bib file, and only those articles that have been cited in the publication or thesis you are writing will appear. And BibTex takes care of the order in which they appear based on the order in which they are cited. So the chances are pretty slim that you would be referring to an article that doesn't appear on your list of references or that the list of references contains articles that have never been cited.

Now, most journals provide the citation information for publications in BibTex format. Researchers don't have to manually prepare them. You just export them from the publication link. This makes it fairly convenient to generate a long list of references, as all you need to do is click on the "Export citation" button and copy the BibTex entry that appears in a new window. As an example, I could generate a list of 30 BibTex references in less than an hour, while this took me days when I was writing my thesis several years back. The only drawback is that the final pdf file that is generated by compiling this list of references will not provide much insight that is useful while cross-referencing or performing a literature survey.

So, my plan is to take these BibTex files and insert them in a database. For this I will use Django. So the database will have a number of fields for title, name of journal/conference, authors, year etc. Eventually, the idea is to link these publications together using several categories - chronologically, according to authors, who cited whom, who collaborated with whom etc. The results of these search strings will produce a networked list of articles that will be much more useful for writing a literature survey or while cross-referencing.

The work has already started though I am continuously learning about Django at the same time. The work will be hosted on GitHub and code with description will be posted here. So stay tuned if linked databases are something that interests you. I hope learning Django and blogging about it will be as much fun as it was with my circuit simulator.