Abstract

The optimal web data mining analysis of web page structure
is a key factor in the educational domain, providing a
systematic way to implement novel techniques on real-time
data with different levels of implication. Our experimental
setup initially focuses on retrieving the web structure, with
web pages as nodes and hyperlinks as edges, in order to
identify a web page as a popular web page or a similar
web page. This paper performs a detailed study of web structure
retrieval schemes and the varying effect of periodic web pages
in the educational domain, which can be carried out
with expected optimal output strategies. We will implement
our experimental web structure restoration techniques with
a real-time implementation of object representation for
educational domains, such as a college web page
required for an open data analysis system. We will also
apply algorithmic procedural strategies for the successful
implementation of our proposed research technique in several
sampling domains with a maximum level of improvement. In
the near future, we will implement web usage techniques for
efficient data analysis.
Keywords: Web Mining, Hyperlink, Web Structure Mining,
Pattern, Classification
When comparing web mining with traditional data mining,
there are three main differences to consider [1]:
1. In traditional data mining, processing 1 million records
from a database would be a large job. In web mining, even
10 million pages would not be a big number.
2. When mining corporate information, the data is private and
often requires access rights to read. For web mining, the data
is public and rarely requires access rights. However, web
mining has additional constraints because of the implicit
agreement with webmasters regarding automated (non-user)
access to this data. Under this agreement, a webmaster allows
crawlers access to useful data on the website, and in return
the crawler (a) promises not to overload the site and (b) has
the potential to drive more traffic to the website once the
search index is published. With web mining, there often is no
such index, which means the crawler has to be extra careful
and polite during the crawling process to avoid causing any
problems for the webmaster.
3. A traditional data mining task gets information from a
database, which provides some level of explicit structure. A
typical web mining task processes unstructured or
semi-structured data from web pages. Even when the underlying
information for web pages comes from a database, this often
is obscured by HTML markup.
A strategic analysis department can mine its client
archives with data mining software to determine which offers
to send to which clients for maximum conversion
rates. For example, suppose a company is thinking about
launching cotton shirts as its new product [2]. From its client
database, it can determine how many clients
have placed orders for cotton shirts over the last year and how
much revenue those orders have brought to the company. With
this analysis in hand, the company can decide
which offers to send both to the clients who
placed orders for cotton shirts and to those who did not.
This ensures that the organization heads in the right
direction in its marketing and does not go through a
trial-and-error phase, learning hard facts by spending money
needlessly [3]. These analytical facts also shed light on
the percentage of customers who may move from the
company to a competitor.
Data mining also enables companies to keep a record of
fraudulent payments, which can then be researched and
studied [4]. This information can help develop
more advanced protective methods
to prevent such events from happening. Buying trends revealed
by web data mining can also help in forecasting
inventories [5]. This direct analysis
enables the organization to stock up
appropriately for each month, based on the predictions
derived from the analysis of buying trends [6].
Data mining technology is evolving rapidly,
and new and better techniques for gathering the required
information become available all the time. Web data mining
technology not only opens avenues for gathering data but
also raises many concerns related to data security. A great
deal of personal information is available on the internet, and
web data mining has helped keep the need to
secure that information at the forefront [7].
International Journal of Applied Engineering Research ISSN 0973-4562 Volume 11, Number 4 (2016) pp 2552-2556
© Research India Publications. http://www.ripublication.com
Proposed Methodology
The proposed methodology describes the structure of a typical
web graph, which consists of web pages as nodes and hyperlinks
as edges connecting related pages. Web structure mining is the
process of discovering structure information from the web.
It can be further divided into two kinds, based on the kind
of structure information used.
Hyperlinks:
A hyperlink is a structural unit that connects a location in a
web page to a different location, either within the same web
page or on a different web page. A hyperlink that connects to
a different part of the same page is called an intra-document
hyperlink, and a hyperlink that connects two different pages is
called an inter-document hyperlink. There has been a
significant body of work on hyperlink analysis; [8] provides an
up-to-date survey.
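The distinction between intra-document and inter-document hyperlinks can be illustrated with a minimal Java sketch. The class name HyperlinkKind, the method classify, and the example URLs are hypothetical and do not come from the paper; the sketch assumes that two URLs refer to the same page when they are identical after the fragment (the part after "#") is removed.

```java
import java.net.URI;

// Hypothetical helper (not from the paper): labels a hyperlink relative to
// the page on which it appears.
class HyperlinkKind {

    // Returns "intra-document" when href resolves to the same page as pageUrl
    // (ignoring the #fragment), and "inter-document" otherwise.
    static String classify(String pageUrl, String href) {
        URI page = URI.create(pageUrl);
        URI target = page.resolve(href); // resolves relative links against the page
        return stripFragment(page).equals(stripFragment(target))
                ? "intra-document"
                : "inter-document";
    }

    // Drops the fragment so that page.html#top and page.html compare equal.
    private static String stripFragment(URI uri) {
        String s = uri.toString();
        int hash = s.indexOf('#');
        return hash >= 0 ? s.substring(0, hash) : s;
    }

    public static void main(String[] args) {
        String page = "http://example.edu/courses/index.html"; // assumed example page
        System.out.println(classify(page, "#syllabus"));
        System.out.println(classify(page, "faculty.html"));
        System.out.println(classify(page, "http://example.org/"));
    }
}
```

A production crawler would probably also normalize host case, default ports, and trailing slashes before comparing; this sketch leaves that out.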
Document Structure:
In addition, the content within a web page can also be
organized in a tree-structured format, based on the various
HTML and XML tags within the page. Mining efforts here
have focused on automatically extracting document object
model (DOM) structures out of documents [9].
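As a rough illustration of the tree structure described above, the following Java sketch parses a small piece of markup with the standard JAXP DOM parser and lists each element with its depth in the tree. The class name DomOutline and the sample markup are hypothetical. The sketch assumes the input is well-formed XML (for example XHTML); real-world HTML is often not well-formed and would likely need a more tolerant parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper (not from the paper): lists the element tree of a document.
class DomOutline {

    // Parses well-formed markup and returns one "depth:tag" entry per element,
    // in document order.
    static List<String> outline(String markup) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            List<String> out = new ArrayList<>();
            walk(doc.getDocumentElement(), 0, out);
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("markup is not well-formed XML", e);
        }
    }

    // Depth-first traversal that records element nodes and skips text nodes.
    private static void walk(Node node, int depth, List<String> out) {
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return;
        }
        out.add(depth + ":" + node.getNodeName());
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), depth + 1, out);
        }
    }

    public static void main(String[] args) {
        String markup = "<html><head><title>College</title></head>"
                + "<body><h1>Courses</h1><p>See <a href=\"cs.html\">CS</a></p></body></html>";
        for (String line : outline(markup)) {
            System.out.println(line);
        }
    }
}
```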
Figure 1: Proposed methodology for web data mining in
online sales domain
Google PageRank:
Websites link to interesting websites, so they "vote" for them.
The more websites vote for a website, the more interesting it
is; the votes received by the recommending websites are also
taken into account. Every website has a starting score, and the
scores are calculated incrementally [10].
If there are few links, a specific one will be chosen with high
probability. If there are many links, a specific one will be
chosen with low probability.
• Many in-links: Authority
• Many out-links: Hub
The PageRank can be calculated as follows:
$$PR(p_i) = \frac{1-d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)} \tag{1}$$
• $PR$: PageRank
• $p_i$: page $i$
• $d$: damping factor
• $N$: number of pages
• $L(p_j)$: number of out-links of page $p_j$
• $M(p_i)$: set of pages that link to $p_i$ (its in-links)
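To make Eq. (1) concrete, here is one update step worked by hand. The three-page graph (A links to B and C, B links to C, C links to A), the damping factor d = 0.85, and the uniform starting score of 1/3 are assumptions chosen for illustration; they are not taken from the paper.

```latex
% Assumed graph: A -> B, A -> C, B -> C, C -> A, with N = 3 and d = 0.85.
% In-link sets: M(A) = {C}, M(B) = {A}, M(C) = {A, B}.
% Out-link counts: L(A) = 2, L(B) = 1, L(C) = 1.
\begin{align*}
PR(A) &= \frac{1 - 0.85}{3} + 0.85 \cdot \frac{1/3}{1}
       = 0.05 + 0.2833 \approx 0.3333 \\
PR(B) &= \frac{1 - 0.85}{3} + 0.85 \cdot \frac{1/3}{2}
       = 0.05 + 0.1417 \approx 0.1917 \\
PR(C) &= \frac{1 - 0.85}{3} + 0.85 \left( \frac{1/3}{2} + \frac{1/3}{1} \right)
       = 0.05 + 0.4250 = 0.4750
\end{align*}
```

The three updated scores still sum to 1, and page C, which has the most in-links, receives the highest score after this step.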
Web structure mining is implemented with the following basic
procedure:
1. Extracting the PageRank manually or automatically.
2. Extracting the hyperlinks in a web page.
3. Internet domain classification.
4. Major domain influence computation.
5. Identifying the URL characteristics.
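Steps 3 and 5 of the procedure above can be sketched in Java as follows. The paper does not list its domain categories or the URL characteristics it uses, so the mapping from top-level domain to category, the path-depth feature, and the names UrlFeatures, topLevelDomain, classify, and pathDepth are all assumptions made for illustration.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper (not from the paper): simple URL-based features.
class UrlFeatures {

    // Returns the last label of the host name, e.g. "edu" for www.example.edu.
    static String topLevelDomain(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return "";
        }
        host = host.toLowerCase(Locale.ROOT);
        return host.substring(host.lastIndexOf('.') + 1);
    }

    // Illustrative mapping only; the categories are assumed, not the paper's.
    static String classify(String url) {
        switch (topLevelDomain(url)) {
            case "edu": return "educational";
            case "com": return "commercial";
            case "org": return "organization";
            case "gov": return "government";
            default:    return "other";
        }
    }

    // One possible URL characteristic: the number of non-empty path segments.
    static int pathDepth(String url) {
        String path = URI.create(url).getPath();
        if (path == null) {
            return 0;
        }
        int depth = 0;
        for (String segment : path.split("/")) {
            if (!segment.isEmpty()) {
                depth++;
            }
        }
        return depth;
    }

    public static void main(String[] args) {
        String url = "http://www.example.edu/dept/cs/index.html"; // assumed example
        System.out.println(classify(url) + ", depth " + pathDepth(url));
    }
}
```

Country-code domains such as .ac.in would fall into "other" under this mapping, so a fuller classifier would need a richer rule set.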
The web content extraction can be implemented in Java as
follows.
1. The pseudocode algorithm for calculating the rank of web
pages uses the following notation: e is the vector with all
elements equal to 1, ε is the accuracy threshold, and ‖·‖₁ is
the 1-norm of a vector, calculated by summing the absolute
values of its elements.
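The notation above matches the usual power-iteration scheme for PageRank, so a minimal Java sketch of that scheme is given here. It is not the paper's own pseudocode: it starts from the uniform vector e/N, applies Eq. (1) repeatedly, and stops once the 1-norm of the change between two iterations falls below ε. The class name PageRankSketch, the adjacency-array representation, and the three-page example graph are assumptions, and the sketch assumes every page has at least one out-link.

```java
import java.util.Arrays;

// Hypothetical sketch (not the paper's pseudocode): PageRank by power iteration.
class PageRankSketch {

    // links[j] lists the indices of the pages that page j links to (its out-links).
    // Assumes every page has at least one out-link.
    static double[] pageRank(int[][] links, double d, double epsilon) {
        int n = links.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);               // starting score: e / N
        double change;
        do {
            double[] next = new double[n];
            Arrays.fill(next, (1.0 - d) / n);     // (1 - d) / N term of Eq. (1)
            for (int j = 0; j < n; j++) {
                for (int i : links[j]) {
                    // page j passes d * PR(p_j) / L(p_j) to each page it links to
                    next[i] += d * rank[j] / links[j].length;
                }
            }
            change = 0.0;
            for (int i = 0; i < n; i++) {
                change += Math.abs(next[i] - rank[i]); // 1-norm of the difference
            }
            rank = next;
        } while (change >= epsilon);
        return rank;
    }

    public static void main(String[] args) {
        // Assumed example graph: page 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
        int[][] links = { {1, 2}, {2}, {0} };
        System.out.println(Arrays.toString(pageRank(links, 0.85, 1e-8)));
    }
}
```

Pages without out-links would need extra handling, for example by spreading their score evenly over all pages; this sketch leaves that out.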
2. Extracting the links in a webpage
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class ExtractLinks {
    public static void main(String[] args) throws Exception {
        // The rest of this URL is cut off in the source listing.
        String sUrl_yahoo = "http://www.mamma.com/result.";
        String nextLine;
        String webPage;
        StringBuffer wPage;
        String sSql;
        // Open a connection to the page and read it line by line.
        URL siteURL = new URL(sUrl_yahoo);
        URLConnection siteConn = siteURL.openConnection();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(siteConn.getInputStream()));
        wPage = new StringBuffer(30 * 1024);
        while ((nextLine = in.readLine()) != null) {
            wPage.append(nextLine);
        }
        in.close();
        // wPage now holds the full page text, ready for hyperlink extraction.
    }
}
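The listing above only reads the page into a buffer. One possible way to pull the hyperlinks out of that text is a regular expression over the anchor tags, sketched below. The class name HrefExtractor and the pattern are assumptions rather than the authors' code; the pattern only handles quoted href attributes, and a real HTML parser would be more robust.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not from the paper): extracts href values from anchor tags.
class HrefExtractor {

    // Matches <a ... href="value"> or <a ... href='value'>, case-insensitively,
    // and captures the value in group 1.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Returns the href values in the order they appear in the page text.
    static List<String> extract(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"cs.html\">CS</a> "
                + "<A class='x' HREF='http://example.org/'>Org</A></p>";
        for (String link : extract(html)) {
            System.out.println(link);
        }
    }
}
```

With the reading loop above, the call would look like `HrefExtractor.extract(wPage.toString())`.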