Lphys’18 Schedule

2018/07/05

Introduction

This July I plan to attend the Lphys’18 workshop in Nottingham. I attended the previous two workshops in the series and really enjoyed them for the chance to learn lots of new things. The workshop has 9 seminars that run in parallel. I find two or three of them quite relevant to what I do, and the officially published timetable does not really help with navigating between the interesting talks. To simplify this, I decided to parse the timetables and convert them to something more visual, namely ical files, which can then be explored side by side. As part of the challenge, I decided to do this with Python.

Note: The code for this project is available in the repo on GitHub. The schedules are there as well, both as lists and as icals. The timetables contain the names and affiliations of the speakers. I hope this does not break any laws, since this information is freely available on the Lphys web page.

Getting the data

First we have to grab the actual timetables from the workshop site and strip the unnecessary HTML. The easiest way to do so is to use pandoc to convert HTML to markdown. By the way, pandoc is great, and you should definitely try it if you have not yet.

This can be done with

for i in {1..9};
do wget -qO- --no-check-certificate \
	https://www.lasphys.com/workshops/lasphys18/program-seminar-$i \
	| pandoc -f html -t markdown --wrap=none \
	| sed -e '1,/Schedule/d' -e '/^$/d' \
	> p_sem_$i.md ;
done

This first downloads the schedule of the corresponding seminar and pipes it to pandoc to get simple markdown instead of all the bloated HTML lists. Then sed deletes everything from the first line up to the one that contains ‘Schedule’ and removes the empty lines. The sed part is optional, but it helps to keep the markdown files cleaner. The markdown consists mostly of lists of talks, one per line. The lines are still quite ugly, like this

1.  [[11:15 – 11:40 ]{ ..attributes.. }
[First Author]{}, [Second Author]{}[, 
[Presenting Author]{style="text-decoration: underline"}
(Affiliation)]{}]{ ..attributes.. }
*The title* (invited talk) [[Abstract] ..something more .. 

There is, however, some structure that helps us extract the information. First, the start and end times of the talk are at the beginning of the line in square brackets. Second, the presenting author is underlined, so the name is always followed by the underline attribute in curly braces. Finally, the title is emphasised, so in markdown it is surrounded by asterisks. In vim, the line can easily be converted to a sane format by substituting with the regex

:%s@.*\[\(\d\d:\d\d – \d\d:\d\d\) \].*\[\(.\{-}\)]{style="text-decoration: underline"}.*\*\(.*\)\*.*@\1 \2 "\3"@

which transforms the monster above to

11:15 – 11:40 Presenting Author "The title"

So perhaps with a little more regex and macro magic the task of joining the schedules could be solved in vim, but at this point I really wanted to play with Python.
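For comparison, the same substitution can be sketched in Python. This is only a rough port of the vim regex above, not the code from the repo; the names `TALK_RE` and `simplify` are made up for illustration, and the sample line is the schematic one shown earlier, joined into a single line.

```python
import re

# Rough Python port of the vim substitution: time range, underlined
# (presenting) author, and the title between asterisks.
TALK_RE = re.compile(
    r'\[(\d\d:\d\d – \d\d:\d\d) \]'                       # "[11:15 – 11:40 ]"
    r'.*\[(.*?)\]\{style="text-decoration: underline"\}'  # presenting author
    r'.*\*(.*)\*'                                         # title
)

def simplify(line):
    """Return 'time speaker "title"' for a talk line, or None if no match."""
    m = TALK_RE.search(line)
    if m is None:
        return None
    time_range, speaker, title = m.groups()
    return f'{time_range} {speaker} "{title}"'

# Schematic sample line from the text, written as one line.
sample = (
    '1.  [[11:15 – 11:40 ]{ ..attributes.. }'
    '[First Author]{}, [Second Author]{}[, '
    '[Presenting Author]{style="text-decoration: underline"} '
    '(Affiliation)]{}]{ ..attributes.. } '
    '*The title* (invited talk) [[Abstract]'
)
print(simplify(sample))  # → 11:15 – 11:40 Presenting Author "The title"
```

As in vim, the greedy `.*` before the author group makes the pattern pick the bracket that directly precedes the underline attribute.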

The ical format

The ical format is not very pleasant to read, but luckily there is the icalendar library for Python. Its documentation, by the way, gives a very reasonable description of the format. Importantly for our needs, a calendar can contain events, and each event has start and end timestamps, a description and an annotation (useless here).

We will use events for both sessions and talks. Each of those requires start and end timestamps (year, month, day, hour, minute and second, plus the timezone). The session description should include the chair's name, and the talk description, along with the speaker's name and the talk title, should ideally have the session label as a prefix.
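As a minimal sketch of the timestamp part, the day number and a time range such as "11:15 – 11:40" can be combined into timezone-aware datetimes using only the standard library. The helper name `talk_interval` is hypothetical, and the fixed UTC+1 offset is an assumption that relies on Nottingham being on British Summer Time in July.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: July 2018 in Nottingham is British Summer Time (UTC+1).
# A fixed offset avoids depending on a timezone database.
BST = timezone(timedelta(hours=1), 'BST')

def talk_interval(day, time_range, year=2018, month=7, tz=BST):
    """Turn a day of the month and 'HH:MM – HH:MM' into (start, end)."""
    h1, m1, h2, m2 = map(int, re.findall(r'\d+', time_range))
    start = datetime(year, month, day, h1, m1, 0, tzinfo=tz)
    end = datetime(year, month, day, h2, m2, 0, tzinfo=tz)
    return start, end

start, end = talk_interval(16, '11:15 – 11:40')
print(start.isoformat())  # 2018-07-16T11:15:00+01:00
print(end - start)        # 0:25:00
```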

Parsing the lines with python

The workhorse of the code will be the function that parses the individual lines of the markdown file. Those schedule files consist of the lines which announce either

- the day
- the session
- the talk

In the file it looks like this:

1.  ##### Monday, 16 July, 2018 {..attributes..}
2.  ***S1.1* (11:15 – 12:30)**[**Chair**: FirstName LastName]{..attributes..}
    1.  [[11:15 – 11:40 ]{..attributes..}[ The author line ]

We will parse the lines one by one. If we encounter a new date (16 July above), we keep the day number in a global variable for the timestamps of the events. If we see the first line of a session (like S1.1), we keep the session label in a global variable as well and add the session as an event to the calendar. From a talk line we extract the timestamps, the speaker and the talk title, and immediately create an event.

The lines have a similar structure across the individual files and can be parsed with regular expressions [1].

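# `day` is kept as a global variable and reused for the event timestamps
day_line = re.search(r'.*#####.*?day, (\d\d) July.*', line)
if day_line:
	day = int(day_line.group(1))
	return  # nothing else to do for a day line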

There is not much more to do with the day line, so when we see it, we immediately finish processing the current line and return nothing. There are similar regexes for the chair line and the talk line. We compare the current line with each of those regexes, and if none matches, we go to the next line without any further action.
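The whole dispatch can be sketched as follows. This is an illustration rather than the code from the repo: the session and talk patterns are guesses based on the sample lines shown above, the talk pattern only extracts the time range, and the state is kept in local variables of a generator instead of globals.

```python
import re

DAY_RE = re.compile(r'#####.*?day, (\d\d) July')
# Hypothetical pattern for a session line such as
# ***S1.1* (11:15 – 12:30)**[**Chair**: FirstName LastName]{...}
SESSION_RE = re.compile(
    r'\*\*\*(S[\d.]+)\* \((\d\d:\d\d – \d\d:\d\d)\)\*\*'
    r'\[\*\*Chair\*\*: (.*?)\]'
)
# Simplified talk pattern: only the time range at the start of the entry.
TALK_RE = re.compile(r'\[\[(\d\d:\d\d – \d\d:\d\d) \]')

def parse_lines(lines):
    """Yield (kind, day, session, payload); lines matching nothing are skipped."""
    day = session = None
    for line in lines:
        m = DAY_RE.search(line)
        if m:
            day = int(m.group(1))  # remember the day for later events
            continue
        m = SESSION_RE.search(line)
        if m:
            session, time_range, chair = m.groups()
            yield ('session', day, session, (time_range, chair))
            continue
        m = TALK_RE.search(line)
        if m:
            yield ('talk', day, session, m.group(1))

lines = [
    '1.  ##### Monday, 16 July, 2018 {..attributes..}',
    '2.  ***S1.1* (11:15 – 12:30)**[**Chair**: FirstName LastName]{..attributes..}',
    '    1.  [[11:15 – 11:40 ]{..attributes..}[ The author line ]',
    'something unrelated',
]
for record in parse_lines(lines):
    print(record)
```

Each record carries the current day and session label, which is the information an event needs.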

Processing the files

Processing a file means transforming the schedule file at the input into an ical calendar at the output. This is done by the function parse_file, which first creates an empty calendar with the two required properties

from icalendar import Calendar

cal = Calendar()
cal.add('prodid', 'andrey.rakhubovsky@gmail.com')
cal.add('version', '2.0')

and then iterates over the lines of the schedule file. When all the lines are processed, we write the resulting calendar to a file

with open(os.path.join(directory, calname), 'wb') as fe:
	fe.write(cal.to_ical())

Applying this to all the schedule files in the directory, we obtain an ical for each of them, as required. The calendars can be imported into the software of choice. I used calcurse for checks and then Google Calendar, where it looks like this:

[Image: Calendar View]

This is not much better than before, but now we can view all the schedules in one place and pick the most relevant talks from different seminars.


  1. Somehow I was able to construct the vim regex on maybe the third attempt, but with Python I spent about half an hour trying to get what I wanted. The regexes are still far from perfect, and I would appreciate any positive feedback.