Friday, February 24, 2017

Models and interconnected data

With a basic server setup, the next part is getting the data in. Since, this is a database of publications, the most important table is going to be that of papers. Since the idea is to build a linked database, it would be best if I can create tables for the most searched for parts of every paper. For example, in a paper the important fields are - title, authors, journal, volume, number, pages, keywords, abstract. Out of these, authors and journal, are separate data that can be linked back to the paper. For example, if I want to get a list of paper published by Author X or the list of papers published by authors X and Y, it would be much faster if there was a separate table of authors that could be linked to the table of papers.

With this basic idea of a database in mind,I have created the following hierarchy. A paper will have the essential field as title without which it can't be created. The other fields of volume, number, pages, month, year are optional as sometimes they can't be found. A paper can have multiple authors and authors have multiple papers. So authors will be a ManyToManyField which is an external relationship to another model. As for journal, a paper will belong to one journal but a journal will have several papers. Therefore, the journal field in the paper database will be a many-to-one relationship and that means it will be a ForeignKey.

Therefore, the basic structure of models in models.py will be:



With this database structure, the idea is to populate the database automatically from the data extracted from the BibTex file. For that the following view function is written (click on view raw at the bottom right to view in separate window):



The above view function is fairly self explanatory in that it checks if a field exits in the BibTex entry for a paper. If the entry is found, it is added to the model. So, if an entry is missing, for example, there is no abstract, it wouldn't be a problem. The only catch was how to deal with the ForeignKey and ManyToManyFields. A ForeignKey field meant that there was an item (in this case journal) that was external to the database. A paper could belong to only one Journal but a Journal could have many papers. So, to be able to save even a basic definition of Paper, it was essential to relate a Journal.

For example, what is found in the code is (click on view raw at the bottom right to view in separate window):

        new_paper_entry = Paper()
        new_paper_entry.paper_title = paper_item["title"]
        new_paper_entry.paper_journal = new_journal_entry
        new_paper_entry.save()
        for author_in_paper in list_of_authors_in_paper:
            new_paper_entry.paper_authors.add(author_in_paper)
            new_paper_entry.save()

To be able to perform
new_paper_entry.save()

It was essential to define:
new_paper_entry.paper_journal = new_journal_entry

If I did:
new_paper_entry = Paper()
new_paper_entry.save()

It would have given an error that the Paper database doesn't have a Journal entry. This is because each paper has a single journal and therefore to be able to save a valid iteration of the database, a journal assignment was necessary.

At the same time, the save command was essential after the journal assignment:
new_paper_entry.save()

Because the paper has the ManyToManyField - Authors. The paper cannot assign a many to many field assignment unless it exists in the database. And this happens only after the save() function.

It took me a day to figure this out. But was a good learning. When a ForeignKey field exists, it must be defined to be able to save the model. Unless this fields can be made optional. Also, to be able to define a many-to-many field, the model must be saved in the database.
 

No comments:

Post a Comment