A 24-hour population distribution dataset based on mobile phone data from Helsinki Metropolitan Area, Finland – Scientific Data
Study area: The Helsinki Metropolitan Area
The dataset covers the Helsinki Metropolitan Area (HMA) in Finland, which consists of four municipalities: Helsinki, Vantaa, Espoo and Kauniainen (Fig.1). The study area has a population of over 1.1 million inhabitants (1,154,967 on 31.12.2017), which represents roughly one-fifth (21%) of the total Finnish population27. The average population density in the study area based on residential data is approximately 1,500 people/km2, being the highest in the inner city of Helsinki, which is located on the peninsula in the southern part of the study area.
Mobile phones are used extensively in the study area. At the end of 2017, the mobile phone penetration rate (mobile subscriptions=SIM cards/100 inhabitants) of Finnish households was 126% with approximately 6,960,000 mobile subscriptions28, which is above the global and the European average rates 103.6% and 120.4%, accordingly29. It is estimated that 89% of 1689-year-olds own a smartphone in the Finnish capital region30. The results of the survey suggest that there is no significant difference between women and men in terms of the phone ownership or use. A survey done in 2017 from the study area shows that 69% of 7-year-old children already have their personal mobile phone31. At the end of 2018, Elisa Oyj has the largest market share of mobile subscriptions (38%) in Finland followed by Telia Finland Oyj (33%) and DNA Oyj (28%)32.
Data processing steps flowchart
Producing the data required various processing steps. First, we pre-processed the raw data by cleaning, reclassifying and aggregating the data into polygons representing the approximated coverage areas of the operator base stations. Secondly, we used the pre-processed data as input to estimate hourly weekday population distribution in the study area by applying a dedicated dasymetric interpolation method to enhance the spatial accuracy of the mobile phone data. We calculated the hourly weekday (Monday-Thursday), Saturday and Sunday population distribution using a network-driven mobile phone dataset defined as High-Speed Packet Access (HSPA) calls (see details below). Friday was left out since time use patterns of people on Friday typically deviate from the other non-weekend days. We validated the data against the official population register data representing the residential population and workplace data. Finally, we packaged and visualized the data to provide an understanding of the dynamic population. The steps in the empirical study were conducted primarily using Python for analysis and QGIS for visualizing the results. The workflow of the study is illustrated in Fig.2.
Mobile phone data
Network-driven mobile phone data from a two-and-a-half-month study period from late October 2017 till early January 2018 provided by the Elisa Oyj mobile network operator (MNO) was used to map the dynamic population distribution in the study area. More specifically, we use HSPA (High-Speed Packet Access) call data which are automatically collected and pre-calculated key performance indicator (KPI) for data transmission by users in the mobile network based on the standard principles introduced by 3GPP33. Since HSPA data are calculated based on radio network counters there are no identifiers or links to any mobile device nor personal information. As the HSPA data are inherently anonymous, there is no opt-in or opt-out possibility. Thus, all the mobile devices connected to the network are in scope, including foreign mobile devices using roaming services.
The mobile phone data used was passively (automatically) collected and processed by the MNO prior to providing us with the data. First, the MNO aggregated the set of raw counters used to calculate HSPA calls from antenna (cell) level to base station (site) level before calculating the actual KPI for each base station according to the principles defined by 3GPP33. The raw counters as well as the data at base station (BS) level had the temporal accuracy of one hour. Second, the BS coordinates of the base stations equipped with multiple directional antennae, were approximated using the coordinates of the antenna with the maximum X and Y coordinate value by the MNO. This only has an impact on the spatial accuracy of the BS coordinates when the antennae were not attached to a mast-like cell tower.
Finally, for reasons of business confidentiality, a randomized error of up to 100 metres was set to BS coordinates in the inner city of Helsinki by the MNO before providing us the data. That is, each coordinate pair is randomly relocated within the range between 100 metres and +100 metres from the original location. Outside the inner city, the error was set up to 200 metres, accordingly. In general, the spatial accuracy of the data is dependent on the density of the base station network (highest in the city centre and other densely populated areas, where use rates are highest)1,34. The median theoretical coverage area based on Voronoi polygon modelling in our study area is 0.24 km2.
Content of the raw data
The original dataset contained approximately 3.8 million rows of data and covered all base stations by the given operator in the Uusimaa region in Southern Finland. The original dataset received from the MNO contained six attributes: the hourly count of HSPA calls, the identifier of a base station and data record, geographical (X, Y) coordinates (in ETRS-TM35FIN coordinate system) and timestamp with an hourly precision (YYYY-MM-DD hh) (Table1).
To contextualize the HSPA call data, it is a collection of downlink (HSDPA) and uplink (HSUPA) protocols, which enables faster data transmission in a Universal Mobile Telecommunications System (UMTS) cellular network35. In general, radio access bearers (RAB) are responsible for transmitting voice or data in 3G telecommunication networks, but if HSPA is supported by the network, data transfer can be replaced by HSPA bearers when prompted by HSPA call requests35. Thus, the HSPA calls in the dataset encompass the majority of 3G mobile data transmissions. Data transfer from 4G networks was, however, not available for the study.
Temporal distribution of the raw data
The HSPA call data show clear temporal patterns both at weekly and daily levels. Regarding the whole study period, a recurring weekly rhythm can be distinguished (Fig.3). The amount of network activity is relatively similar between the weekdays from Monday to Friday, which decreases during the weekend, with the lowest rates on Sundays. The weekly pattern is disrupted during the holiday season with lower mobile phone usage compared to the day of the week average. Examples include Finlands Independence Day (6.12.), New Years Day and Christmas Day. Days with abnormally high values are system biases inherent in the raw dataset.
There is a distinct pattern in the temporal distribution of network activities, even at the diurnal level. On a regular workday (MondayThursday), mobile phone data follow a similar pattern as shown in the activities of people from the Time Use Survey with lowest values during the night, from 00:00 to 05:00 and more evenly distributed over the course of the day (Fig.4).
Pre-processing of mobile phone data
The mobile phone data were prepared for constructing the dynamic population by filtering, cleaning, manipulating and aggregating the original data (see Fig.5). We excluded days (n=3) with abnormal data (Fig.3) and any hourly values (incorrect or missing data from a base station) that might distort the results. We further cropped the data to the extent of the study area, removed a handful of base stations with no activity during the whole study period (or if two base stations had identical ID in different locations), and mergeda few base stations with identical coordinates. We also filtered out duplicate hour entries caused by the transition to winter time.
After cleaning and editing the data, we filtered the data to present regular workdays (Monday-Thursday), and separately both weekend days Saturday and Sunday due to distinctive temporal activity pattern. After the filtering, 62 days were left for further analysis, out of which 42 between Monday and Thursday and 10 on both Saturdays and Sundays. Finally, we aggregated the data to get the median number of HSPA calls during the study period for every base station (BS) for every hour of the day.
Constructing the dynamic population from mobile phone data
To distribute the mobile phone data from the base stations to the statistical grid squares, we used the multi-temporal function-based dasymetric (MFD) interpolation method1, see Figs.5 and 6. The MFD method is a dasymetric interpolation method belonging to the same family of areal interpolation methods as areal weighting. However, dasymetric interpolation differs from areal weighting because it uses ancillary data to improve the interpolation of data from existing spatial units (i.e. source zones) to desired spatial units (i.e. target zones). This approach has been regarded as one of the most feasible methods for refining the spatial resolution of population and has been widely applied in different application fields17,18.
The datasets used for preparing the dynamic population distribution using a dasymetric interpolation method are listed in Table2.
Creation of the physical surface layer
In the first stage of the MFD method, land cover and building data were pre-processed and combined to create the physical surface layer which is a spatial layer representing land use information including a vertical dimension (building volumes). It is used as an input data for calculating the likelihood of human presence at the later stages of the MFD1,36. Each feature in the physical surface layer was assigned an activity function type, which enabled us to further link the data with the time use survey data (Table3).
Regarding the land cover data, we used a country-specific CORINE Land Cover raster dataset (the most recent version of it at the time) with a spatial accuracy of 20m20m to determine the land cover classes of the study area37. The spatial accuracy of the more broadly available Pan-European CORINE Land Cover vector dataset was too coarse (25ha) for the study purposes. Similarly, the more recent openly available land cover data provided by the National Land Survey of Finland and the Helsinki Region Environmental Services Authority HSY were not applicable due to too low spatial accuracy. The refined land cover classification enabled us to link land use classes to activity types in the time use data.
To prepare the land cover data for the MFD method, the dataset was transformed into vector format, reclassified and cropped to the extent of the study area. Like Jrv et al.1, the land cover data were reclassified from the original classes (n=48) to five classes based on their activity function types: (1) residential, (2) work, (3) transport, (4) restricted and (5) other (see Bergroth p.5838; Fig.7). To improve the classification, both the international Helsinki-Vantaa airport area (mid-north; Fig.7) and the Vuosaari cargo harbour area (east) were reclassified from transport class to the work class as an important site for workforce due to their work-driven functions.
In terms of the building data, building polygons were extracted from the National Topographic Database39. In total, 160,490 buildings were located in the study area. The building data were cleaned by calculating the area of each building footprint and filtering out buildings with an area below 20 m2 (n=6,860) leaving 153,357 buildings left for further analysis. Similarly with Jrv et al.1, the buildings were first classified into three types according to their primary activity function type residential, work and other buildings (see Bergroth p. 5838). Here, non-classified buildings were assumed to have work as the main activity function (i.e. work buildings), given that the dataset has accurate classification for buildings that have primary activity functions associated with residential and other activity, but not for work activity function. To further enrich the data and refine the classification, we retrieved additional building information from OpenStreetMap (n=72,574)40. Using the OpenStreetMap data, the building classification was expanded to cover also retail and service and transport activity function types, which could not be extracted from the original building data (see Bergroth p. 14038; Fig.8).
Only one activity function type was assigned to each building. We recognize the crudeness of the selected approach as buildings may have multiple use types either simultaneously or at different times. However, the current level of accuracy is expected to be feasible for the purpose of this study. The final classification of buildings per activity function type is presented in Fig.8 and Table4.
The physical surface layer also takes into account the vertical dimension in the likelihood of human presence. To retrieve the vertical dimension, we used information about building footprints, floor area (m2) and floor counts based on national building registers (not available for the city of Kauniainen). The municipal data were further cleaned, combined and joined to the original building dataset. Finally, a geometric union was performed to combine the reclassified building and land cover layers.
Spatial disaggregation by the source and target zones
After creating the physical surface layer, a geometric union was performed between the physical surface layer, source zone and target zone layers to create the disaggregated physical surface layer a layer where physical surface layer units are divided into subunits so that each subunit (referred as s in Formulas, below) is designated both to one unique source zone (j) and one unique target zone (z), see Fig.6. In general, any spatial division can be used regarding the source zones and target zones. Voronoi polygons were used to estimate the theoretical coverage areas of base stations (source zones), and 250m250m statistical grid cells were used as the target zones41. As a result, the study area was divided into 345,917 subunits, each with a designated activity function type and spatial unit type (building or land) as well as floor area. The area of each subunit was recalculated after the overlay operation.
Next, the relative floor area of each subunit was calculated to include the vertical dimension in the interpolation. First, the absolute floor area was assigned to the subunits based on their spatial unit type and activity function type. For subunits with the spatial unit type land, the geometric area of the subunit was set as the floor area. For subunits with the spatial unit type building, the floor area was based on openly available building data from the municipalities of Espoo42, Helsinki43 and Vantaa44 containing the building register-based floor areas and floor counts. The use of actual floor areas provides a more accurate estimate than the LiDAR-based approach applied in Jrv et al.1, in which the floor area was estimated from the building height extracted from the digital surface model (DSM).
In case the building register data were not openly available (e.g. in Kauniainen), the floor area was estimated based on the actual or mean floor count and a specific floor area coefficient. The mean floor count was 2 for residential, service and retail buildings, and 1 for others. The floor area coefficient was 0.95 for residential buildings, 0.91 for service and retail buildings, and 0.98 for other buildings. The floor area coefficient was calculated as the median ratio between the actual floor area and the product of the building footprint area and the floor count. Both the mean floor count and the floor area coefficient were calculated separately for buildings of each activity function type. Finally, the relative floor area (RFA) was calculated for each subunit within a source zone, based on the Formula 1:
$$RFA_s^j=fracFA_s^jsum FA_s^jin jforall sin j$$
(1)
where
RFA=relative floor area
FA=floor area
s=spatial subunit
j=source zone
As a result, the sum of the relative floor area of all subunits within one source zone (Voronoi polygon) equals to 1. The higher the relative floor area of the subunit, the higher the likelihood that activity is allocated to that subunit.
Integration of the temporal human activity data
In the third phase of the MFD method, time use data were used to integrate the physical surface layer to create a probability matrix for allocating the mobile phone data to target zones within each source zone. As a result, each spatial subunit got an hourly likelihood rate of human presence based on its activity function type.
The estimated human presence (EHP) in each subunit was calculated using human activity data based on the latest Finnish time use survey45 carried out in 2009, according to the guidelines for Harmonised European Time Use Surveys (HETUS) issued by Eurostat. The time use survey allows for the calculation of the human activity data for each hour based on the activity location of over 10-year-olds in the HMA (Fig.9).
To calculate the estimated human presence, we first aggregated the human activity to the hourly level. Second, we reclassified human activity from the survey to the following classes based on the location, where the activity was undertaken to join it with the physical surface layer: 1) residential, 2) work (incl. education), 3) transport, 4) retail and service, 5) unknown and 6) other (such as recreational areas) (see Bergroth p. 13938).
An hourly probability coefficient (H) was assigned to every hour of the day based on the time use data. In addition, a seasonal probability coefficient (M) was assigned to account for the impact of the season on the distribution of people indoors and outdoors. According to a study conducted by Hussein et al.46, people were found to spend approximately 90% of the day indoors in Helsinki during the winter and spring. Similarly, as in Jrv et al.1, the results are assumed to be suitable for the dasymetric interpolation, since the mobile phone data used for estimating the population distribution were also collected during winter. The seasonal factor was applied for three of the activity function types (residential, work and education, other). Thus, a subunit of the work activity function type would receive a coefficient of 0.9 if the spatial unit type was building and a coefficient of 0.1. if the spatial unit type was land. Subunits with the other activity function types were assigned a factor of 1, except restricted areas, which were assigned a factor of 0. This way, the MFD method prevents population being allocated to a subunit of a restricted type. Overall, the estimated human presence per every spatial subunit at a given time unit (hour) was calculated using Formula 2:
$$EHP_s^j,t=left[H_a,u^ttimes M_a,uright]times RFA_s^j$$
(2)
where
EHP=estimated human presence
t=time unit
H=hourly factor
M=seasonal factor
RFA=relative floor area
a=activity function type
u=spatial unit type (building or land)
s=spatial subunit
j=source zone
Integration of the mobile phone data
In the fourth phase of the dasymetric interpolation, the mobile phone data, were integrated to the physical surface layer enriched with hourly and seasonal human activity data. The mobile phone data containing the hourly median number of the different network activities were linked to the physical surface layer based on the BS identifier. First, the mobile phone activity per spatial subunit was normalized by dividing it by the sum of the corresponding value of all spatial subunits in the study area. Hence, the sum of the relative proportion of mobile phone data of all subunits in the study area is 1. The relative proportion of mobile phone data per spatial subunit of study area total at given hour was calculated using Formula 3:
$$RMP_s^j,t=fracMP_s^j,tsum MP_s^j,tin S$$
(3)
where
RMP=relative proportion of mobile phone data
MP=mobile phone data
s=spatial subunit
t=time unit
j=source zone
S=study area
The formula was calculated separately for each of the three weekdays regular workday (Monday Thursday), Saturday and Sunday. Secondly, the hourly normalized mobile phone data for each weekday were multiplied by the hourly estimated human presence to allocate the population to the subunits based on the physical surface layer and time use statistics. The relative observed population was calculated using Formula 4:
$$ROP_s^j,t=EHP_s^j,ttimes RMP_s^j,t$$
(4)
where
ROP=relative observed population
EHP=estimated human presence
RMP=relative proportion of mobile phone data
s=spatial subunit
t=time unit
j=source zone
Spatial aggregation to target zones
In the fifth and final phase of the MFD method, the spatial subunits were aggregated to the statistical 250m250m grid cells (n=13,231). The aggregation was performed by dissolving the subunits based on the target zone ID. As a result, each target zone was assigned the sum of the relative observed population of all spatial subunits within the given target zone. The aggregation to the target zones can be summarized as follows (Formula 5):
$$ZROP^z,t=sum _sin zROP_s^z,t$$
(5)
where
ZROP=spatially aggregated relative observed population per target zone
ROP=relative observed population
t=time unit
s=spatial subunit
z=target zone
As a final result of the MFD method, three normalized population data layers for each hour of the day for regular workday (Monday Thursday), Saturday and Sunday were created. After normalization, the sum of all values for each one-hour period equals to 100 (i.e. 100% of total population). The script used to run the MFD method is based on Jrv et al.47 and openly shared via GitHub: https://github.com/DigitalGeographyLab/mfd-helsinki.