If you are a publisher who wants to make your website "paper.li friendly", this article is for you. If you simply want to know why we choose a particular image, title or article from your favorite content source, read on.
Our challenge
Paper.li was designed to allow publishers to curate either articles, videos or photos and present the result in a paper. That's what we do best!
When we find a webpage that includes an article, our paper.li extractor needs to detect the correct title, the correct image (if there is one), and the correct article content.
Although all webpages use HTML or a variant of it to create the page, it's important to keep in mind that there's an endless variety to how a webpage can be built. Since there are few hard-and-fast rules in the wild world of web design, we "accept the chaos" and do our best to account for it.
Considering the number of articles we correctly process and link back to each day and the infinite number of ways a webpage can be constructed, our extractor does a pretty good job detecting and extracting the correct title, article and image from a webpage.
Extracting elements from a webpage
Our paper.li extractor uses a complex algorithm to identify and present the correct title, article image, and content. Below are some rules which explain how we extract the different elements (title, article image, and content) from a page to present in a paper.
The full content of an article is not stored on our servers. We always extract and present only part of the article content and link back to the original source.
Title extraction
The extractor looks for specific tags, in the order they appear in the Example site section below, to extract a title.
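As a rough illustration of checking tags in a fixed order, the Python sketch below collects the four title candidates that appear in the Example site and returns the first one it finds. The names `TitleCandidates`, `TITLE_ORDER` and `pick_title` are hypothetical, and the first-match-wins rule is an assumption; the actual extractor may behave differently.

```python
from html.parser import HTMLParser

# Order in which the title tags appear in the Example site.
# Treating this as a strict priority order is an assumption.
TITLE_ORDER = ["title", "meta:title", "og:title", "twitter:title"]


class TitleCandidates(HTMLParser):
    """Collects title candidates from <title> and title-related <meta> tags."""

    def __init__(self):
        super().__init__()
        self.found = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("content"):
            if attrs.get("name") == "title":
                self.found.setdefault("meta:title", attrs["content"])
            elif attrs.get("property") == "og:title":
                self.found.setdefault("og:title", attrs["content"])
            elif attrs.get("name") == "twitter:title":
                self.found.setdefault("twitter:title", attrs["content"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text inside <title>...</title> counts as a candidate.
        if self._in_title and data.strip():
            self.found.setdefault("title", data.strip())


def pick_title(html):
    """Return the first available candidate in TITLE_ORDER, or None."""
    parser = TitleCandidates()
    parser.feed(html)
    for key in TITLE_ORDER:
        if key in parser.found:
            return parser.found[key]
    return None
```

Keeping the order in a single list makes it easy to see, and to change, which tag is preferred when several are present.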
URL extraction
The extractor looks for specific tags, in the order they appear in the Example site section below. The fallback is the URL the content was retrieved from.
If you notice that the URL for an article pulled into your paper is not correct, check the article's source code for the following meta tags.
<link rel="canonical" href="http://example.com/article1"/>
<meta property="og:url" content="http://example.com/article2"/>
Some tips:
- If the URL begins with the prefix http://, copy and paste that URL into your browser and check for typos.
- If the URL found in these tags does not begin with the http:// prefix, the URL is interpreted as relative. For relative URLs, we prepend the base URL to the relative URL.
A relative URL should begin with "/".
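The relative-URL rule can be illustrated with Python's standard `urllib.parse.urljoin`. The helper name `resolve_article_url` is hypothetical, and both the use of `urljoin` and the acceptance of https:// are assumptions for illustration; this is a sketch, not the extractor's actual logic.

```python
from urllib.parse import urljoin


def resolve_article_url(tag_url, page_url):
    """Return an absolute URL for a value found in a canonical or og:url tag.

    tag_url  -- value of the href/content attribute (may be empty)
    page_url -- URL the page was retrieved from (used as base and as fallback)
    """
    if not tag_url:
        # No tag value: fall back to the URL the content was retrieved from.
        return page_url
    # Accepting https:// as well as http:// is an assumption of this sketch.
    if tag_url.startswith("http://") or tag_url.startswith("https://"):
        return tag_url
    # Relative value: combine it with the base of the page URL.
    return urljoin(page_url, tag_url)
```

For example, `resolve_article_url("/article1", "http://example.com/news/today")` yields `http://example.com/article1`, because a value that begins with "/" replaces the whole path of the base URL.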
Image extraction
The extractor first looks for a meta tag inside the <head> element of the page, namely:
<meta property="og:image" content="http://example.com/yourimagehere.jpg"/>
These types of tags, called Open Graph tags, are widely used by social media services to identify an image, title and other metadata for a webpage.
You can find out more about how to implement them here: http://ogp.me/
If there is no meta tag, the platform attempts to make a "best guess" as to which image on the page to pull for the article. The images found on the page are given scores.
The scoring algorithm is complex, but following these tips can help:
- An img tag with larger height and width values is best.
- The existence of a title attribute is a plus.
- The existence of an alt attribute is a plus.
- “ad” appearing at least two times in the source URL is a blocker.
- “avatar” in the source URL is a blocker.
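To make the tips above concrete, here is a minimal Python sketch of an image scorer. The function names `score_image` and `best_image` and all numeric weights are invented for illustration; only the direction of each rule (plus or blocker) comes from the tips, and the real scoring algorithm is more complex.

```python
def score_image(src, width=0, height=0, title=None, alt=None):
    """Return a score for an <img> candidate, or None if it is blocked.

    The weights are made up for illustration; only the direction of each
    rule (plus or blocker) comes from the tips above.
    """
    name = src.lower()
    # Blockers: "avatar" anywhere, or "ad" at least twice, in the source URL.
    if "avatar" in name or name.count("ad") >= 2:
        return None
    score = width * height      # larger declared dimensions score higher
    if title:
        score += 1000           # hypothetical bonus for a title attribute
    if alt:
        score += 1000           # hypothetical bonus for an alt attribute
    return score


def best_image(candidates):
    """Pick the highest-scoring, non-blocked candidate.

    candidates -- list of dicts holding the img attributes
                  (src, and optionally width, height, title, alt)
    """
    scored = [(score_image(**c), c["src"]) for c in candidates]
    scored = [(s, src) for s, src in scored if s is not None]
    return max(scored)[1] if scored else None
```

With this sketch, an image such as the one in the Example site (500 by 300, with title and alt attributes) would outrank a larger banner whose URL contains "ad" twice, because the banner is blocked outright.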
Content extraction
Content is selected by a complex scoring algorithm that assigns points to each tag and its children based on the following:
- The more text content a tag has, the more weight it gets.
- These are pluses in id, tag and class names: body|content|entry|main|page|post|text|blog|story|haupt|article|artikel|instapaper_body
- Minuses: community|discuss|foot|header|menu|remark|reply|rss|share|sponsor|ad|all|archive|attachment|pager|pagination|popup|print|login|sidebar
- Big minuses: nav|navigation|user|com|comment|contact|foot|masthead|media|outbrain|promo|related|scroll|shoutbox|sidebar|sponsor|tags|tool|widget|player|disclaimer|toc|infobox|vcard|avatar|follow-me-twitter
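The rules above can be sketched in Python as follows. The word lists are copied from this article, but the function name `score_block`, the numeric weights, and the choice of case-insensitive substring matching are assumptions made for illustration only.

```python
import re

PLUS = re.compile(
    r"body|content|entry|main|page|post|text|blog|story|haupt|article|artikel"
    r"|instapaper_body", re.I)
MINUS = re.compile(
    r"community|discuss|foot|header|menu|remark|reply|rss|share|sponsor|ad|all"
    r"|archive|attachment|pager|pagination|popup|print|login|sidebar", re.I)
BIG_MINUS = re.compile(
    r"nav|navigation|user|com|comment|contact|foot|masthead|media|outbrain"
    r"|promo|related|scroll|shoutbox|sidebar|sponsor|tags|tool|widget|player"
    r"|disclaimer|toc|infobox|vcard|avatar|follow-me-twitter", re.I)


def score_block(tag, elem_id, css_class, text):
    """Score one element from its names and the amount of text it holds."""
    names = " ".join([tag, elem_id, css_class])
    score = len(text) / 100.0   # more text, more weight (hypothetical scale)
    if PLUS.search(names):
        score += 25             # hypothetical bonus
    if MINUS.search(names):
        score -= 25             # hypothetical penalty
    if BIG_MINUS.search(names):
        score -= 50             # hypothetical larger penalty
    return score
```

Under these assumptions, a container like `<div id="content" class="article">` scores well above a `<div id="sidebar">` holding the same amount of text, which is why descriptive id and class names help the extractor find the article body.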
Optimizing your site to work with our extractor
If you own a site and want it to be paper.li friendly, you can immediately test your changes and see how our extractor views your page by passing the URL of your article to the endpoint below. Replace www.example.com/article with the URL you wish to test.
http://paper.li/~api/paper_items/analyze_directly?url=http://www.example.com/article
The result will look similar to the example below. It lists the elements that we will display: the title, a link to the image we found, the URL used to link back to the original source, and the content.
Example result from extractor URL
{
  "status":200,
  "data":
  {
    "type":"link",
    "domain":"www.theverge.com",
    "title":"The greatest myth about phones is that you are in control",
    "img":"http://cdn0.vox-cdn.com/uploads/chorus_image/image/41279606/DSC_6708-hero.0_cinema_1200.0.jpg",
    "url":"http://www.theverge.com/2014/10/8/6947027/the-greatest-myth-about-phones-is-that-you-are-in-control",
    "id":"25887c4eaac573cfba35a6eb52eb75ae56f84ee2",
    "favicon_url":"http://cdn0.vox-cdn.com/images/verge/2.0/verge-icon-196x196.v503bbf1.png",
    "content":"In the human-machine relationship, you, fellow Homo sapiens, are commonly designated as the owner, operator, or otherwise dictator of the subservient device in your possession. Consider, however, t..."
  }
}
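If you want to script this check, the Python sketch below builds the test URL and pulls out the four displayed fields from a result shaped like the example above. The helper names `build_analyze_url` and `summarize_result` are hypothetical, and percent-encoding characters such as "?" and "&" in the article URL is an assumption rather than a documented requirement of the endpoint.

```python
import json
from urllib.parse import quote

ANALYZE_ENDPOINT = "http://paper.li/~api/paper_items/analyze_directly?url="


def build_analyze_url(article_url):
    """Build the test URL for an article.

    Encoding everything except ':' and '/' is an assumption; a plain URL
    such as http://www.example.com/article is left unchanged.
    """
    return ANALYZE_ENDPOINT + quote(article_url, safe=":/")


def summarize_result(raw_json):
    """Pull out the fields shown in the example result (None if missing)."""
    result = json.loads(raw_json)
    data = result.get("data", {})
    return {key: data.get(key) for key in ("title", "img", "url", "content")}
```

Comparing the `title`, `img`, `url` and `content` values before and after a change to your page is a quick way to confirm that the extractor now sees what you intended.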
Example site
<!DOCTYPE html>
<html lang="en">
<head>
<title>Title1</title>
<meta name="title" content="Title2" />
<meta property="og:title" content="Title3" />
<meta name="twitter:title" content="Title4"/>
<link rel="canonical" href="http://example.com/article1"/>
<meta property="og:url" content="http://example.com/article2"/>
<meta name="twitter:url" content="http://example.com/article3"/>
<meta property="og:image" content="http://example.com/image1"/>
<meta name="twitter:image" content="http://example.com/image2"/>
<link rel="image_src" href="http://example.com/image3"/>
<meta name="thumbnail" content="http://example.com/image4"/>
</head>
<body>
<img width="500" height="300" src="images/my_image1.jpg" title="is good" alt="is good too"/>
<div id="content" class="article">
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>
</body>
</html>