Tuesday, July 25, 2017

Restructuring the database

So far I had been trying to build an interconnected database so that it is possible not only to have a list of publications but also to have them linked together on different aspects - Paper, Author, Journal, Affiliation etc. - which would make the database easier to search. The problem appears while writing to the database, particularly in bulk. The final aim is to scrape a significant number of papers from the web using some search criterion and save those papers in the database. When the database is as interconnected as the one I have chosen, with ForeignKey and ManyToMany relationships, a single bulk write using the built-in functions that Django provides is not possible. This means the database write is slow.

I spent some time reading about this and all the documents say the same thing - minimize the interaction with the database. Every query on the database and every write causes an overhead. If the database is being written one paper at a time, the total process will be very slow. If there are ForeignKey and ManyToMany relationships, each paper will have to be written one at a time, along with the authors in a separate table and the journal in another table. So the database structure needed to be redesigned.

There are two purposes to the web app that I am designing. The first, to be able to get a number of papers from the web and generate a list of references that can be written to a BibTex file making it easier for researchers to write papers and maybe even a thesis. The second is for researchers to generate over a period of time, a relational database of papers, authors, journals and universities that they could use for performing searches on a continuous basis. The first purpose needs the web app to be fast at extracting papers from the web and writing them to a BibTex/Latex file. The second purpose is additional and could be built over and around the first.

With this in mind, the database has been simplified. The fundamental unit of the database is now the model Paper. This is completely independent - no ForeignKey or ManyToMany relationships. Previously, authors were stored in the model Author, which was linked to Paper through a ManyToMany relationship. Now authors is simply a field in the model Paper. The Author model still exists and the user can use the web app to edit author information and add papers to the author. The same goes for journals: the user can edit the journal information and add papers to it.

By having Paper as a fundamental and independent model, a bulk write to the database table associated with the Paper model is now possible with the bulk_create command. The result is that several hundred papers can now be written to the database in a few seconds, which is a speed-up of maybe 50x. The user can still build the relational database manually.
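To illustrate the difference between writing one paper at a time and a bulk write, here is a minimal sketch. It does not use Django; it uses the sqlite3 module from the Python standard library as a stand-in, and the table name, column names and paper data are made up for illustration. The per-row loop plays the role of calling save() on each Paper, while executemany plays the role of bulk_create.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a flat, independent paper table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE paper ("
        "id INTEGER PRIMARY KEY, title TEXT NOT NULL, authors TEXT, journal TEXT)"
    )
    return conn


# Hypothetical scraped data: authors and journal are plain text fields.
papers = [
    ("Paper %d" % i, "A. Author and B. Author", "Some Journal")
    for i in range(500)
]
insert_sql = "INSERT INTO paper (title, authors, journal) VALUES (?, ?, ?)"

# One statement and one commit per paper (similar to calling save() in a loop).
slow = make_db()
for row in papers:
    slow.execute(insert_sql, row)
    slow.commit()

# All papers handed to the database in one call (similar to bulk_create).
fast = make_db()
fast.executemany(insert_sql, papers)
fast.commit()

slow_count = slow.execute("SELECT COUNT(*) FROM paper").fetchone()[0]
fast_count = fast.execute("SELECT COUNT(*) FROM paper").fetchone()[0]
```

Both approaches end with the same rows in the table; the difference is only in how many times the database is contacted. In Django the bulk version would be along the lines of Paper.objects.bulk_create(list_of_paper_objects).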

To scrape papers from the web I am using the IEEE API. I will post soon about how I am creating a search page to perform IEEE searches.

Tuesday, July 11, 2017

Starting off with class based views

I started migrating a few of the view functions to class based views. So far, either my views are very different, which implies there is not much scope for class inheritance, or I need to spend some time thinking of a class inheritance hierarchy. The class based views I have written so far are structured as described below.



So I created a BaseView that inherits the built-in View class from Django. This BaseView has a get and a post function, which is what I need now. Maybe I can add more later. Each function calls the get_context_data member function, which creates the member data context. This context is passed to a template called template_name using the render function. Neither template_name nor context is defined in the BaseView. The next class view is the PapersDisplay view that inherits the BaseView. This therefore does not have the get or post methods but merely defines the template_name and contains the get_context_data method that generates the context dictionary.
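The structure described above can be sketched roughly as follows. This is not my actual code and it does not import Django: the View class and the render function here are simple stand-ins for django.views.generic.View and django.shortcuts.render so that the sketch runs on its own. The template name and the list of papers are hypothetical.

```python
def render(request, template_name, context):
    """Stand-in for django.shortcuts.render.

    Returns the template name and context instead of an HttpResponse.
    """
    return (template_name, context)


class View(object):
    """Stand-in for django.views.generic.View."""


class BaseView(View):
    # Neither template_name nor context is defined here;
    # subclasses provide them.

    def get(self, request, *args, **kwargs):
        self.get_context_data(request, *args, **kwargs)
        return render(request, self.template_name, self.context)

    def post(self, request, *args, **kwargs):
        self.get_context_data(request, *args, **kwargs)
        return render(request, self.template_name, self.context)


class PapersDisplay(BaseView):
    template_name = "list_papers.html"  # hypothetical template name

    def get_context_data(self, request, *args, **kwargs):
        # A real view would query the database here,
        # for example with Paper.objects.all().
        self.context = {"papers": ["Paper A", "Paper B"]}


response = PapersDisplay().get(request=None)
```

PapersDisplay has no get or post of its own; it only says which template to use and how to build the context.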

In a way, this results in cleaner and shorter code because every view does not need the render function, which was repeated every time in every view function. However, the code structure is fairly simple. Another option is to create another class which extracts all the papers and inherit that class to add the authors and other data to it. The reason is that the class which extracts all papers could be reused again and again as it seems like a fairly commonly performed task. To add even more abstraction, it might also be possible to create another class which inherits the papers class and provides, for each paper, the authors of the paper.

The documentation suggests being wary of adding too many layers of abstraction, as modifying the code may then become difficult. So I guess I will have to make a tradeoff between efficiency in code and ease of later development.

Monday, July 10, 2017

Class based views

Having got the user interface for my circuit simulator up and going, I am now getting back to this literature database project. To begin with, using Django for my circuit simulator accelerated the learning process with respect to Django and, most importantly, highlighted what I need to learn further. The first task on the agenda is to implement class based views instead of the function based views which I am currently using. The aim is to make code more reusable and more efficient through inheritance and other features of object oriented programming in general.

I read the Django documentation on class based views:
https://docs.djangoproject.com/en/1.11/topics/class-based-views/
https://docs.djangoproject.com/en/1.11/ref/class-based-views/base/#view

The documentation is fairly vast and Django comes with a number of built-in modules for performing a wide range of tasks. However, I will use the base class View in django.views.generic.base and develop my class based views from there on. As before, the best way to learn is through coding and so it is best to write as much code as possible on my own rather than use built-in modules in the beginning. Migrating to built-in modules at a later stage will be much easier. For that matter, it is hard to know what industry standard code will use anyway.

The first task will be to convert all function based views to class based views and then continue adding functionalities to the web interface.

Sunday, June 4, 2017

User interface for Python Power Electronics

Though this post is about my circuit simulator Python Power Electronics, it deals with how Django and a web interface are used, which makes it relevant to this blog. I released another version of my circuit simulator but this time the user interface is a web interface rather than the usual command line. The objective is to make the simulator more interactive and easier to use.

To begin with, I will describe the basic philosophy of using Django as a user interface. This is a concept which I don't think will be accepted by the mainstream Django community, as Django was never meant to be used as a user interface but as a web framework for building websites driven by a database. However, many of the features of Django make it very suitable for a user interface.

To begin with, any software GUI will have a few basic menu options on the start-up window which I have designed with Django's urls.py file. Every webpage has the header links that are similar to a standard GUI - browsing the simulation library, creating a new simulation, documentation (or help) and a contact page. This has been built into a base framework template file which has been extended by every other HTML template file. Clicking on a link will send you to a URL which in turn appears in the urls.py file and directs to a function which in turn renders another HTML page. This way a user can move around the software just like a GUI.

Typically, a simulation software GUI will allow you to load a simulation case which is a file stored on the user's computer. In the web app, every simulation case is a database entry stored in the database simulation_collection. The uppermost table is called SimulationCase. This table contains the title, description and parameters of every simulation created by a user. When the user clicks on the "Simulation library" link, all the entries in the table SimulationCase are listed out for the user to load any one.

Each SimulationCase entry is linked to several other tables. The first level of relationship is below:

SimulationCase
-----> CircuitSchematics
-----> CircuitComponents
-----> MeterComponents
-----> ControllableComponents
-----> PlotLines
-----> CircuitPlot
-----> ControlFile

These tables are linked to a SimulationCase as ForeignKey relationships as a single simulation case can have a large number of them. As an example, a simulation could have 10 circuit schematic spreadsheets, 100 circuit components altogether in all schematics, 20 meters, 15 controllable components, 45 elements that are to be written to the output data file and made available for user plotting, 30 circuit plots, and 5 control files. Such a hierarchy makes it convenient to segregate data and relate them in a logical manner which is useful particularly in creating forms with the models from ModelForm.

While creating a new simulation, it starts with a single new database entry for a SimulationCase with parameters. The user then adds CircuitSchematics. That results in a new database entry with the circuit file that is linked to the SimulationCase entry. From all the components in the circuit schematics, database entries are made for CircuitComponents, MeterComponents, ControllableComponents and these are also linked to the SimulationCase.

The next level of hierarchy is as follows:

CircuitSchematics
--------> Resistor
--------> VariableResistor
--------> Inductor
--------> VariableInductor
--------> Capacitor
--------> Voltage_Source
--------> Controlled_Voltage_Source
--------> Ammeter
--------> Voltmeter
--------> Diode
--------> Switch

The simulator will look for components in the circuit schematic spreadsheets and on finding a component will create a database entry in the table corresponding to the type of object. The objective behind separating the components into their respective types and having separate tables for each type was to use the ModelForm to create forms for each type rather than create a single component type. This results in customized forms, error checking and feedback messages.

CircuitPlot
--------> CircuitWaveforms

When the user creates a new circuit plot, a new database entry in the table CircuitPlot is created. Each CircuitPlot can have numerous waveforms. When a waveform is added to a CircuitPlot, a new entry is created in the table CircuitWaveforms and linked to the entry in CircuitPlot. There is another layer:

CircuitWaveforms
--------> PlotLines

Every simulation case will have a number of data items that will be written to the output data file. These are called PlotLines. They may be meter outputs or VariableStorage elements in control files. The user can choose which PlotLine will appear in a CircuitWaveform. A CircuitWaveform can have several PlotLines and conversely a PlotLine can appear in several CircuitWaveforms. This results in a ManyToMany relationship.

ControlFile
--------> ControlInputs
--------> ControlOutputs
--------> ControlStaticVariable
--------> ControlTimeEvent

SimulationCase
--------> ControlVariableStorage

These are the input/output ports and the special variables of a control file. When a user adds a control file to a simulation case, a new entry is created in the table ControlFile. This control file entry can be configured. The user can add inputs, outputs, static variables, time events to a control file that can be used in the control code. Variable storage elements are global variables and therefore are related to the simulation case rather than a control file.

Many of these database entries are dynamic - the user can create and delete them. In some cases, the simulator creates the entries, in which case they are created and deleted when the simulation is run and all the circuit files are processed. When a simulation is loaded, all data items related to the SimulationCase are also loaded. This is similar to GUI based circuit simulators, which present the circuit in its latest state when a file is opened.

Thursday, March 16, 2017

Sequence in ManyToMany fields

Up till now I had designed the database with Paper being a class that had a ManyToMany field connected to the Author class. So essentially a paper will have multiple authors and an author will have several papers. The concept works except for one problem. Defining a ManyToMany relationship in the Paper class in the following manner:

paper_authors = models.ManyToManyField(Author)

This allows you to add multiple Author objects to an object of the class Paper. However, adding authors in a particular sequence does not guarantee that the sequence will be maintained. If a query is performed:

paper.paper_authors.all()

The order in which the Author objects of the object paper are returned from the database is not guaranteed. The only solution I could find came from the answer to a question I posted on Stack Overflow:

http://stackoverflow.com/questions/42741591/order-of-manytomany-field-in-model-changed-when-one-object-is-replaced

What was suggested is that I define a membership class and use the "through" attribute.



Now the ManyToMany field has the following definition:

paper_authors = models.ManyToManyField(Author, through = 'Contributor')

It uses a membership class, named with the "through" attribute, to define additional details about how Author is related to Paper. The Contributor class links a paper and an author and also defines a "position" which designates the author's position in the paper. Also, before, authors could be added by using the add function as:

paper_Y.paper_authors.add(X)

However, now the membership object has to be defined as:

xy = Contributor(paper=paper_Y, author=X, position=1)
xy.save()

With this change made, the functions in views.py have been changed and now a paper can be edited to change the authors without losing the sequence.
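To show what the membership table achieves, here is a small sketch. It uses the sqlite3 module from the standard library rather than Django, and the author names and table layout are made up for illustration. The contributor table plays the role of the Contributor class: one row per (paper, author) pair, plus a position column that is used to sort the authors.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE paper (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE contributor (
        paper_id INTEGER NOT NULL REFERENCES paper(id),
        author_id INTEGER NOT NULL REFERENCES author(id),
        position INTEGER NOT NULL
    );
    """
)

conn.executemany(
    "INSERT INTO author (id, name) VALUES (?, ?)",
    [(1, "Zhang"), (2, "Adams"), (3, "Miller")],
)
conn.execute("INSERT INTO paper (id, title) VALUES (1, 'Paper Y')")

# Author order on the paper: Miller first, then Zhang, then Adams.
conn.executemany(
    "INSERT INTO contributor (paper_id, author_id, position) VALUES (1, ?, ?)",
    [(3, 1), (1, 2), (2, 3)],
)

# Without ORDER BY the row order is up to the database;
# with the position column the sequence is explicit.
rows = conn.execute(
    "SELECT author.name FROM contributor "
    "JOIN author ON author.id = contributor.author_id "
    "WHERE contributor.paper_id = ? "
    "ORDER BY contributor.position",
    (1,),
).fetchall()
ordered_names = [row[0] for row in rows]
```

In Django, assuming the Contributor class has the fields paper, author and position, a query along the lines of Contributor.objects.filter(paper=paper_Y).order_by("position") should give the contributors in sequence.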

Now that a basic database has been created, I'll tinker around a little to make sure I haven't missed anything and then I'll create a better set of web pages to make it easier to navigate this application.

Friday, February 24, 2017

Models and interconnected data

With a basic server setup, the next part is getting the data in. Since this is a database of publications, the most important table is going to be that of papers. Since the idea is to build a linked database, it would be best if I can create tables for the most searched for parts of every paper. For example, in a paper the important fields are - title, authors, journal, volume, number, pages, keywords, abstract. Out of these, authors and journal are separate data that can be linked back to the paper. For example, if I want to get a list of papers published by Author X or the list of papers published by authors X and Y, it would be much faster if there was a separate table of authors that could be linked to the table of papers.

With this basic idea of a database in mind, I have created the following hierarchy. A paper will have the essential field title, without which it can't be created. The other fields of volume, number, pages, month, year are optional as sometimes they can't be found. A paper can have multiple authors and authors have multiple papers. So authors will be a ManyToManyField, which is an external relationship to another model. As for journal, a paper will belong to one journal but a journal will have several papers. Therefore, the journal field in the Paper model will be a many-to-one relationship and that means it will be a ForeignKey.

Therefore, the basic structure of models in models.py will be:



With this database structure, the idea is to populate the database automatically from the data extracted from the BibTex file. For that the following view function is written:



The above view function is fairly self explanatory in that it checks if a field exists in the BibTex entry for a paper. If the field is found, it is added to the model. So, if a field is missing, for example, there is no abstract, it wouldn't be a problem. The only catch was how to deal with the ForeignKey and ManyToManyFields. A ForeignKey field meant that there was an item (in this case journal) that was stored outside the Paper table. A paper could belong to only one Journal but a Journal could have many papers. So, to be able to save even a basic definition of Paper, it was essential to relate a Journal.

For example, what is found in the code is:

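        new_paper_entry = Paper()
        new_paper_entry.paper_title = paper_item["title"]
        # The ForeignKey must be set before the first save
        new_paper_entry.paper_journal = new_journal_entry
        # Save so that the paper exists in the database
        new_paper_entry.save()
        for author_in_paper in list_of_authors_in_paper:
            # add() writes the relation immediately, so no further save() is needed
            new_paper_entry.paper_authors.add(author_in_paper)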

To be able to perform
new_paper_entry.save()

It was essential to define:
new_paper_entry.paper_journal = new_journal_entry

If I did:
new_paper_entry = Paper()
new_paper_entry.save()

It would have given an error that the paper doesn't have a journal. This is because each paper has a single journal and therefore, to be able to save a valid Paper entry, a journal assignment was necessary.

At the same time, the save command was essential after the journal assignment:
new_paper_entry.save()

This is because the paper has the ManyToManyField for authors. A many-to-many assignment cannot be made to a paper unless the paper exists in the database, and this happens only after the save() function is called.

It took me a day to figure this out, but it was a good learning experience. When a ForeignKey field exists, it must be defined to be able to save the model, unless this field is made optional. Also, to be able to define a many-to-many field, the model must first be saved in the database.
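The two rules above can be seen at the database level with a small sketch. It uses the sqlite3 module from the standard library instead of Django, and the table names, column names and sample values are made up for illustration. A paper row cannot be inserted without a journal because the column is NOT NULL, and the rows linking a paper to its authors need the id of a paper that has already been inserted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE journal (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE paper (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        journal_id INTEGER NOT NULL REFERENCES journal(id)
    );
    CREATE TABLE paper_authors (
        paper_id INTEGER NOT NULL REFERENCES paper(id),
        author_id INTEGER NOT NULL REFERENCES author(id)
    );
    """
)

# Rule 1: saving a paper without a journal fails.
try:
    conn.execute("INSERT INTO paper (title) VALUES ('No journal yet')")
    saved_without_journal = True
except sqlite3.IntegrityError:
    saved_without_journal = False

# With a journal, the paper can be saved and gets an id.
journal_id = conn.execute(
    "INSERT INTO journal (name) VALUES ('Some Journal')"
).lastrowid
paper_id = conn.execute(
    "INSERT INTO paper (title, journal_id) VALUES (?, ?)",
    ("A title", journal_id),
).lastrowid

# Rule 2: the author links need the id of the saved paper.
author_ids = [
    conn.execute("INSERT INTO author (name) VALUES (?)", (name,)).lastrowid
    for name in ("Author X", "Author Y")
]
conn.executemany(
    "INSERT INTO paper_authors (paper_id, author_id) VALUES (?, ?)",
    [(paper_id, author_id) for author_id in author_ids],
)
link_count = conn.execute(
    "SELECT COUNT(*) FROM paper_authors WHERE paper_id = ?", (paper_id,)
).fetchone()[0]
```

In Django, the ForeignKey can be made optional by passing null=True and blank=True to the field, in which case the paper can be saved before a journal is assigned.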
 

Sunday, February 19, 2017

Templates

So far I have not done much template rendering except listing all the papers in the BibTex file. But before moving on to more complex stuff, I am trying to read as much as possible.

To begin with, Django uses the concept of loose coupling - URLs, views and data. With the URLconf list in urls.py, Django specifies which URL will call which function in views.py. Therefore, it is possible that the functionality of a URL can be changed by changing the view function without affecting any other part of the code. The view function on the other hand can access the database and render a template while passing the necessary data extracted from the database or from the URL. The template which should be rendered can be changed in the view function without changing any other part of the code. Finally, models.py specifies the structure of the database, which can be changed independently of the views or the URLs. Of course, functions in views.py have to be designed flexibly enough to be able to adapt to changes in the database and the URLs.

Within the view functions, I have been reading about templates and contexts. The simplest way to generate a display on a webpage is using the HttpResponse() function. As an example:

return HttpResponse("Hello world")

will display Hello world in a webpage corresponding to the URL that points to the function with the above return statement. But to do more complex stuff, you would need a separate HTML file. This again is in alignment with the concept of loose coupling. The contents of the webpage should be separate from the view function that acts as the buffer between the URL and the database.

Suppose a separate HTML file was to exist in the templates folder in the application folder paperarchive/papercollection. This is the default directory that Django will search for templates when 'APP_DIRS': True is set in the TEMPLATES variable in settings.py. The other option is to specify a list of directories in DIRS in the same variable. The conventional way to load this HTML file is with the get_template function in django.template.loader. So, suppose:

from django.template.loader import get_template
t = get_template("my_html.html")

is present in a view function, the template object t will be created with the contents of the HTML file. This HTML file could be a simple "Hello world" display as before or could be more complicated, with variables and template tags that add a bit of programming to deal with these variables.

Since variables are present in the template, they need data. The data is in the form of a dictionary with the keys being the variable names in the HTML file. This dictionary of variables is the context. So,

c = {"name": "Django"}

will create a context with the variable "name" being "Django". Older tutorials wrap this dictionary in a Context object from django.template, but in Django 1.11 a template loaded with get_template expects a plain dictionary. To pass this data to the HTML file, the template object that was created with the HTML file is rendered with this context by:

t.render(c)

This returns the rendered page as a string. When the view function returns this string wrapped in an HttpResponse,

return HttpResponse(t.render(c))

The webpage is displayed with the data we specified. This concept is fairly convenient as the HTML file can be a regular HTML file with some amount of programming in the form of template tags. The view function can change the variables that are needed by the HTML file by extracting from the database or from user entered data in forms using the HTTP request object "request".
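The idea of a template being filled in by a context can be shown with a toy example. This is not Django's template engine, only a few lines of plain Python that replace {{ variable }} placeholders with values from a dictionary. The function name render_template and the template text are made up for illustration.

```python
import re


def render_template(template_text, context):
    """Replace every {{ name }} placeholder with the value from context.

    Missing variables are replaced by an empty string in this toy version.
    """
    def substitute(match):
        return str(context.get(match.group(1), ""))

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template_text)


template_text = "<h1>Hello {{ name }}</h1>"
page = render_template(template_text, {"name": "Django"})
other_page = render_template(template_text, {"name": "world"})
```

The same template text produces different pages depending on the context passed to it, which is what keeps the HTML file separate from the view function.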

To simplify the above process, Django provides two functions in django.shortcuts - render and render_to_response. They are similar but render_to_response is being discouraged as it may be discontinued later. The above process of creating a template object and rendering it with a context can be performed in one line as:

render(request, "my_html.html" , {"name": "django"})

or

render_to_response("my_html.html" , {"name": "django"})

The only difference is that render needs the request object to be the first argument while render_to_response doesn't.

Additionally, these two functions are tied to context processors. In Django versions before 1.10, instead of just the template and the context, a RequestContext object could be passed as a context_instance. So,

render(request, "my_html.html", {"name": "django"},
    context_instance=RequestContext(request[, context dictionary]
        [, processors=<custom_processors>])
)

The context_instance argument was removed in Django 1.10. Since render receives the request object, it applies the context processors on its own.

I took some time reading back and forth about this. RequestContext takes the request object as the first argument and generates a context object that contains global variables that Django provides by default to save you the trouble of writing code, for example, context data about the logged-in user. Check out the "context_processors" list in the TEMPLATES variable in settings.py. This list contains the default global context processors. A context processor is a function that takes the request object as its only argument and returns a dictionary which is merged into the context. So the default global processors in the settings.py file are functions that are called automatically whenever a RequestContext is used in a view function, and they provide context data that can conveniently be used for purposes like user authentication.

Additionally, custom processors which the user has specifically designed can be passed to RequestContext. The only requirement, as before, is that these custom processors should take the request object as an argument and return a dictionary as context data. The only catch in using RequestContext seems to be that a number of context variables will be provided to the template that may not be needed, as it calls all the context processors listed in the settings.py file.

So, Context() only holds the data it is given, while RequestContext() also takes the request object, adds the data from the global context processors and allows you to add custom processors which add their data. So choosing between Context() and RequestContext() seems to depend only on whether the user needs the global data that Django automatically generates or wishes to call other user-defined context processors that return code specific context data.
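The way context processors contribute data can be sketched in plain Python. This is only an illustration of the idea and not how Django implements it: FakeRequest, build_context and the two processors are hypothetical names. In this sketch the data from the view is applied last, which is simply a choice made here and not a statement about Django's precedence rules.

```python
class FakeRequest(object):
    """Hypothetical stand-in for Django's HTTP request object."""

    def __init__(self, user):
        self.user = user


def user_processor(request):
    # A context processor takes the request and returns a dictionary.
    return {"user": request.user}


def site_processor(request):
    return {"site_name": "Paper archive"}


def build_context(request, context=None, processors=()):
    """Merge the output of every processor with the view's own data."""
    combined = {}
    for processor in processors:
        combined.update(processor(request))
    combined.update(context or {})
    return combined


request = FakeRequest(user="alice")

# Plain context: only the data the view specifies.
plain = build_context(request, {"name": "Django"})

# Request-style context: view data plus whatever the processors return.
full = build_context(
    request, {"name": "Django"}, [user_processor, site_processor]
)
```

The template that receives full gets user and site_name without the view function having to supply them, which is the convenience and also the reason why some unneeded variables may end up in the template.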